- tags: Vision Language Models, Geospatial AI, Tool calling, GRPO, Reinforcement learning
- source: (Wang et al. 2026)
Summary
GeoEyes addresses visual question answering (VQA) on ultra-high-resolution (UHR) remote sensing imagery — scenes where task-relevant cues occupy only tiny fractions of the full image. The authors target the “thinking-with-images” paradigm, in which Vision Language Models interleave textual reasoning with active zoom_in Tool calling to acquire high-resolution evidence on demand. The paper diagnoses a systematic failure of existing zoom-enabled MLLMs (e.g., DeepEyes) on UHR benchmarks: Tool Usage Homogenization, where the policy collapses to a near-constant “always zoom once” behavior that is blind to task difficulty. The authors attribute this to two UHR-specific properties: task heterogeneity (global tasks need no zoom, fine-grained tasks need multiple rounds) and low effective evidence density (sparse outcome rewards fail to guide multi-step search).
The proposed remedy is a two-stage recipe. First, a cold-start supervised fine-tuning dataset, UHR Chain-of-Zoom (UHR-CoZ), of 25,467 interleaved image–text reasoning trajectories spanning no-zoom, single-zoom, and multi-round progressive zoom regimes. UHR-CoZ is built via an automated GLM-4.5V pipeline with answer-cleaning and trajectory-cleaning quality control. Second, a reinforcement learning stage, AdaZoom-GRPO, built on top of GRPO with a reshaped reward that combines accuracy, format compliance, and three bespoke components: an Adaptive Efficiency reward with category-specific step allowances and instance-level difficulty modulation, a Chain-of-Focus reward implementing directional IoU containment to encourage “coarse-to-fine” geometric zoom progressions (and tolerate backtracking), and a Necessity-Aware Process Verification reward that penalizes confidently answering fine-grained queries without any zoom.
On XLRS-Bench, GeoEyes (7B) reaches 54.23% average accuracy, surpassing DeepEyes (50.0%), the domain-specialized GeoLLaVA-8K (51.5%), and much larger general-purpose MLLMs including Qwen3-VL-235B (51.1%) and Qwen2.5-VL-72B (50.2%). Ablations show that dropping the CoF reward hurts most on tool-intensive subtasks, dropping the tool-efficiency reward causes a ~2% macro-accuracy drop, and the necessity-aware judge outperforms a pure logic-consistency judge.
Key Ideas
- Tool Usage Homogenization: a diagnosed failure mode in which zoom-enabled MLLMs converge to a task-agnostic single-call policy. Motivating evidence (Fig. 1): DeepEyes invokes the tool on 100% of samples with near-constant depth, while GeoEyes triggers it on only 68.44% of samples.
- UHR Chain-of-Zoom (UHR-CoZ): a 25k-sample interleaved image–text chain-of-thought dataset with explicit depth distribution (no-zoom 6.4%, one-zoom 86.7%, ≥3 zooms 6.9%) and agent-orchestrated annotation.
- AdaZoom-GRPO, with a multi-component reward:
  - Adaptive Efficiency reward R_tool = P_α · exp(−γ · ΔN), with a category-specific free quota N_base and an instance-difficulty scalar P_α = 1 − p(y|x)_base, which only penalizes excess steps for easy samples while tolerating exploration on hard ones.
  - Chain-of-Focus reward R_cof: directional IoU giving +β_zoom when the next bbox is contained in the current one and shrinks, 0 on backtracking (safe harbor), and −β_drift on disjoint drift.
  - Process Verification reward R_proc: an LLM-judge “Necessity-Aware” signal that punishes confident answers to detail-heavy questions without any zoom action.
- Directional IoU as a geometry-aware alternative to standard IoU, which is ill-suited to progressive zooming because large scale shifts produce low IoU even when the zoom is correct.
- Staged SFT → RL pipeline: cold-start SFT initializes the policy with visual planning capabilities and task-difficulty awareness; RL then refines the evidence-gain policy. Removing the SFT cold start drops accuracy to 47.53%, validating the staging.
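The two geometric/efficiency reward terms above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the (x1, y1, x2, y2) box convention, the default coefficients, and treating partial overlap the same as backtracking (reward 0) are my assumptions.

```python
import math

def adaptive_efficiency_reward(n_calls, n_base, p_base, gamma=0.5):
    """Sketch of R_tool = P_alpha * exp(-gamma * dN).
    n_base is the category-specific free quota; p_base is the base
    model's answer confidence p(y|x)_base, so P_alpha = 1 - p_base
    shrinks the reward's magnitude on easy (high-confidence) samples."""
    p_alpha = 1.0 - p_base
    delta_n = max(0, n_calls - n_base)  # only excess calls are penalized
    return p_alpha * math.exp(-gamma * delta_n)

def _area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def _contains(outer, inner):
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def _disjoint(a, b):
    return (a[2] <= b[0] or b[2] <= a[0]
            or a[3] <= b[1] or b[3] <= a[1])

def chain_of_focus_reward(prev_box, next_box, beta_zoom=1.0, beta_drift=1.0):
    """Sketch of R_cof: +beta_zoom when the next bbox is contained in the
    current one and shrinks (coarse-to-fine), -beta_drift on disjoint
    drift, and 0 otherwise (the backtracking safe harbor)."""
    if _contains(prev_box, next_box) and _area(next_box) < _area(prev_box):
        return beta_zoom
    if _disjoint(prev_box, next_box):
        return -beta_drift
    return 0.0  # backtracking / re-expansion: neither rewarded nor punished
```

The containment test is the "directional" part: unlike symmetric IoU, it stays maximal for a correct deep zoom where ordinary IoU would approach zero as the crop shrinks.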
Comments
This paper is a nice case study of how generic agentic-RL recipes break down under domain-specific distribution shifts. The “tool-usage homogenization” diagnosis — that a binary tool-use reward collapses to “always zoom once” when tasks are heterogeneous — is plausibly a general phenomenon for tool-augmented MLLMs, not just for remote sensing; one would expect to see similar failure modes in document VQA or scientific figure understanding where a subset of queries needs no visual grounding at all. In that sense the contribution is less about remote sensing and more about how to shape RL rewards for tool-use policies that should decide whether, where, and how many times to act.
The three reward components each map cleanly to a distinct failure mode: R_tool for over-triggering, R_cof for aimless drift, R_proc for ungrounded hallucination. The geometric R_cof reward with a “backtracking safe harbor” is a pleasant touch — it preserves the ability to re-expand the context without being penalized as drift, which is a common pitfall of pure-IoU trajectory rewards.
A few open questions:
- Several reward terms rely on instance-level difficulty P_α computed from the base model’s confidence. This creates a chicken-and-egg dependency on the reference policy — drift during training could make the difficulty estimate stale.
- The Necessity-Aware process judge is itself an LLM call; the paper does not report cost, and it bakes another model’s priors into the reward signal. Ablating this term is fine; deploying it at scale is another matter.
- The paper frames the work around a single benchmark family (XLRS-Bench, SuperRS-VQA, HighRS-VQA). It would be interesting to see whether the same recipe transfers to non-satellite UHR settings (histopathology, microscopy) where the inductive biases about “global vs. fine-grained” sub-tasks are different.
The GRPO variant here is a relatively light modification — the novelty is really in the reward design, not the optimization algorithm. This is consistent with a growing pattern in agentic RLVR: the optimizer is commoditized, reward engineering is where the work lives.
Connections
- Related to Geospatial AI because the benchmark, model design, and data sources are all centered on remote sensing VQA.
- Related to Vision Language Models because GeoEyes is a VLM that interleaves visual perception steps with textual reasoning.
- Related to Tool calling because the entire paper is about how to train an MLLM to invoke a zoom_in tool task-adaptively rather than collapsing to a homogeneous tool-use policy.
- Related to GRPO because AdaZoom-GRPO builds on GRPO as its RL optimizer and reshapes only the reward.
- Related to Reinforcement learning with verifiable rewards because the training signal is an answer-correctness reward augmented with verifiable process-level rewards (directional IoU, step-count quotas, LLM-judge necessity).
- Related to Spatial reasoning because zoom trajectories are spatial focus-of-attention policies over a high-resolution scene.
- Related to Grounding because the framework explicitly optimizes for evidence-grounded answers, punishing confidently ungrounded responses.
Bibliography
- Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yifan Zhang, Long Lan, Xue Yang, et al. 2026. "GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery". https://arxiv.org/abs/2602.14201.