GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery by Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yifan Zhang, Long Lan, Xue Yang, Hongda Sun, Yulin Wang, Di Wang, Jun Song, Jing Zhang, Bo Du (2026)

This note was initially drafted with LLM assistance. Generated notes are periodically reviewed and revised by the author.
tags
Vision Language Models, Geospatial AI, Tool calling, GRPO, Reinforcement learning
source
(Wang et al. 2026)

Summary

GeoEyes addresses visual question answering (VQA) on ultra-high-resolution (UHR) remote sensing imagery — scenes where task-relevant cues occupy only tiny fractions of the full image. The authors target the “thinking-with-images” paradigm, in which Vision Language Models interleave textual reasoning with active zoom_in Tool calling to acquire high-resolution evidence on demand. The paper diagnoses a systematic failure of existing zoom-enabled MLLMs (e.g., DeepEyes) on UHR benchmarks: Tool Usage Homogenization, where the policy collapses to a near-constant “always zoom once” behavior that is blind to task difficulty. The authors attribute this to two UHR-specific properties: task heterogeneity (global tasks need no zoom, fine-grained tasks need multiple rounds) and low effective evidence density (sparse outcome rewards fail to guide multi-step search).

The proposed remedy is a two-stage recipe. First, a cold-start supervised fine-tuning stage on UHR Chain-of-Zoom (UHR-CoZ), a dataset of 25,467 interleaved image–text reasoning trajectories spanning no-zoom, single-zoom, and multi-round progressive zoom regimes. UHR-CoZ is built via an automated GLM-4.5V pipeline with answer-cleaning and trajectory-cleaning quality control. Second, a Reinforcement learning stage, AdaZoom-GRPO, built on top of GRPO with a reshaped reward that combines accuracy, format compliance, and three bespoke components: an Adaptive Efficiency reward with category-specific step allowances and instance-level difficulty modulation, a Chain-of-Focus reward that uses directional IoU containment to encourage “coarse-to-fine” geometric zoom progressions (while tolerating backtracking), and a Necessity-Aware Process Verification reward that penalizes confidently answering fine-grained queries without any zoom.
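To make the “interleaved image–text trajectory” idea concrete, here is a minimal sketch of what one multi-round thinking-with-images episode might look like. The field names and schema are hypothetical (the note does not reproduce the paper’s actual data format); only the pattern — text reasoning, a `zoom_in` tool call with a bounding box, a returned image patch, repeated to arbitrary depth — comes from the summary above.

```python
# Hypothetical shape of one multi-round UHR-CoZ-style trajectory.
# Schema and field names are illustrative, not the paper's format.
trajectory = [
    {"role": "assistant", "type": "text",
     "content": "The question asks about small vehicles; I need more "
                "detail in the parking area at the top-left."},
    {"role": "assistant", "type": "tool_call",
     "name": "zoom_in", "args": {"bbox": [120, 80, 2120, 2080]}},
    {"role": "tool", "type": "image", "content": "<cropped high-res patch>"},
    # Second, nested zoom: the new bbox sits inside the previous one,
    # the coarse-to-fine progression the Chain-of-Focus reward encourages.
    {"role": "assistant", "type": "tool_call",
     "name": "zoom_in", "args": {"bbox": [400, 300, 1400, 1300]}},
    {"role": "tool", "type": "image", "content": "<cropped high-res patch>"},
    {"role": "assistant", "type": "text",
     "content": "Final answer: 14 vehicles."},
]

# A no-zoom trajectory for a global task (e.g. land-cover classification)
# would contain only text steps; UHR-CoZ mixes all three regimes.
zoom_calls = sum(1 for step in trajectory if step.get("name") == "zoom_in")
```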

On XLRS-Bench, GeoEyes (7B) reaches 54.23% average accuracy, surpassing DeepEyes (50.0%), the domain-specialized GeoLLaVA-8K (51.5%), and much larger general-purpose MLLMs including Qwen3-VL-235B (51.1%) and Qwen2.5-VL-72B (50.2%). Ablations show that dropping the CoF reward hurts most on tool-intensive subtasks, dropping the tool-efficiency reward causes a ~2% macro-accuracy drop, and the necessity-aware judge outperforms a pure logic-consistency judge.

Key Ideas

  • Tool Usage Homogenization: a diagnosed failure mode in which zoom-enabled MLLMs converge to a task-agnostic single-call policy. Motivating evidence (Fig. 1): DeepEyes invokes the tool on 100% of samples with near-constant depth, while GeoEyes triggers it on only 68.44% of samples.
  • UHR Chain-of-Zoom (UHR-CoZ): a 25k-sample interleaved image–text chain-of-thought dataset with explicit depth distribution (no-zoom 6.4%, one-zoom 86.7%, ≥3 zooms 6.9%) and agent-orchestrated annotation.
  • AdaZoom-GRPO, with a multi-component reward:
    • Adaptive Efficiency reward R_tool = P_α · exp(−γ · ΔN), with a category-specific free quota N_base and an instance-difficulty scalar P_α = 1 − p(y|x)_base, which only penalizes excess steps for easy samples while tolerating exploration on hard ones.
    • Chain-of-Focus reward R_cof: directional IoU giving +β_zoom when the next bbox is contained in the current one and shrinks, 0 on backtracking (safe harbor), and −β_drift on disjoint drift.
    • Process Verification reward R_proc: an LLM-judge “Necessity-Aware” signal that punishes confident answers to detail-heavy questions without any zoom action.
  • Directional IoU as a geometry-aware alternative to standard IoU, which is ill-suited to progressive zooming because large scale shifts produce low IoU even when the zoom is correct.
  • Staged SFT → RL pipeline: cold-start SFT initializes the policy with visual planning capabilities and task-difficulty awareness; RL then refines the evidence-gain policy. Removing the SFT cold start drops accuracy to 47.53%, validating the staging.
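The two geometric reward terms above can be sketched directly from their stated forms. This is a minimal sketch, not the paper’s implementation: the `[x1, y1, x2, y2]` box convention, the helper names, and the default coefficients (`gamma`, `beta_zoom`, `beta_drift`) are all assumptions.

```python
import math

def adaptive_efficiency_reward(n_steps, n_base, p_base, gamma=0.5):
    """R_tool = P_alpha * exp(-gamma * delta_N), where delta_N counts
    tool calls beyond the category-specific free quota n_base, and
    P_alpha = 1 - p(y|x)_base is the instance-difficulty scalar derived
    from the base model's answer confidence p_base."""
    delta_n = max(0, n_steps - n_base)   # steps within the quota are free
    p_alpha = 1.0 - p_base
    return p_alpha * math.exp(-gamma * delta_n)

def _area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def _contains(outer, inner):
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def _overlaps(a, b):
    return (min(a[2], b[2]) > max(a[0], b[0])
            and min(a[3], b[3]) > max(a[1], b[1]))

def chain_of_focus_reward(prev_box, next_box, beta_zoom=1.0, beta_drift=0.5):
    """Directional-IoU step reward: +beta_zoom for a strictly nested,
    shrinking next bbox (coarse-to-fine zoom); 0 for backtracking or
    partial overlap (the safe harbor); -beta_drift for a disjoint jump."""
    if _contains(prev_box, next_box) and _area(next_box) < _area(prev_box):
        return beta_zoom       # progressive zoom-in
    if not _overlaps(prev_box, next_box):
        return -beta_drift     # drifted to an unrelated region
    return 0.0                 # backtracking / partial overlap: no penalty
```

Note why containment replaces standard IoU here: a correct 10× zoom-in has IoU ≈ 0.01 with its parent crop, so a symmetric-IoU reward would punish exactly the behavior the paper wants to encourage, whereas the containment test is scale-invariant.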

Comments

This paper is a nice case study of how generic agentic-RL recipes break down under domain-specific distribution shifts. The “tool-usage homogenization” diagnosis — that a binary tool-use reward collapses to “always zoom once” when tasks are heterogeneous — is plausibly a general phenomenon for tool-augmented MLLMs, not just for remote sensing; one would expect to see similar failure modes in document VQA or scientific figure understanding where a subset of queries needs no visual grounding at all. In that sense the contribution is less about remote sensing and more about how to shape RL rewards for tool-use policies that should decide whether, where, and how many times to act.

The three reward components each map cleanly to a distinct failure mode: R_tool for over-triggering, R_cof for aimless drift, R_proc for ungrounded hallucination. The geometric R_cof reward with a “backtracking safe harbor” is a pleasant touch — it preserves the ability to re-expand the context without being penalized as drift, which is a common pitfall of pure-IoU trajectory rewards.

A few open questions:

  • Several reward terms rely on instance-level difficulty P_α computed from the base model’s confidence. This creates a chicken-and-egg dependency on the reference policy — drift during training could make the difficulty estimate stale.
  • The Necessity-Aware process judge is itself an LLM call; the paper does not report cost, and it bakes another model’s priors into the reward signal. Ablating this term is fine; deploying it at scale is another matter.
  • The paper frames the work around a single benchmark family (XLRS-Bench, SuperRS-VQA, HighRS-VQA). It would be interesting to see whether the same recipe transfers to non-satellite UHR settings (histopathology, microscopy) where the inductive biases about “global vs. fine-grained” sub-tasks are different.

The GRPO variant here is a relatively light modification — the novelty is really in the reward design, not the optimization algorithm. This is consistent with a growing pattern in agentic RLVR: the optimizer is commoditized, reward engineering is where the work lives.

Connections

  • Related to Geospatial AI because the benchmark, model design, and data sources are all centered on remote sensing VQA.
  • Related to Vision Language Models because GeoEyes is a VLM that interleaves visual perception steps with textual reasoning.
  • Related to Tool calling because the entire paper is about how to train an MLLM to invoke a zoom_in tool task-adaptively rather than collapsing to a homogeneous tool-use policy.
  • Related to GRPO because AdaZoom-GRPO builds on GRPO as its RL optimizer and reshapes only the reward.
  • Related to Reinforcement learning with verifiable rewards because the training signal is an answer-correctness reward augmented with verifiable process-level rewards (directional IoU, step-count quotas, LLM-judge necessity).
  • Related to Spatial reasoning because zoom trajectories are spatial focus-of-attention policies over a high-resolution scene.
  • Related to Grounding because the framework explicitly optimizes for evidence-grounded answers, punishing confidently ungrounded responses.

Bibliography

  1. Wang, Fengxiang, Mingshuo Chen, Yueying Li, Yajie Yang, Yifan Zhang, Long Lan, Xue Yang, Hongda Sun, Yulin Wang, Di Wang, Jun Song, Jing Zhang, and Bo Du. 2026. "GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery". https://arxiv.org/abs/2602.14201.