- tags
- Vision Language Models, Reinforcement learning, GRPO, Tool calling, Grounding, Reinforcement learning with verifiable rewards
- source
- (Zheng et al. 2025)
Summary
DeepEyes is a Vision-Language Model that learns to “think with images” — interleaving textual chain-of-thought with self-initiated image zoom-ins during reasoning — and is trained purely with end-to-end Reinforcement learning, without any cold-start supervised fine-tuning on synthetic reasoning traces. Built on Qwen2.5-VL-7B and optimized with GRPO, the model emits grounding coordinates whose cropped patches are fed back into the context as new observation tokens, forming what the authors call an interleaved Multimodal Chain-of-Thought (iMCoT). Unlike prior workflow-based systems (SEAL, DyFo, ZoomEye) that depend on hand-designed pipelines and large SFT corpora, DeepEyes treats visual grounding as an intrinsic Tool calling native ability — the model itself decides when and where to zoom — and is trained end-to-end on question/answer pairs only.
The contribution sits at the intersection of active perception and Reinforcement learning with verifiable rewards. The reward has three components: an accuracy reward (correct final answer), a format reward (well-structured output), and a conditional tool-use bonus that only fires when both the answer is correct and at least one active perception step was triggered. This conditional coupling is crucial: with no tool reward the model quickly stops zooming; with an unconditional tool reward the model zooms constantly but stays at low accuracy; only the conditional bonus yields growing, selective zoom behavior. To bootstrap sample efficiency without SFT, the training data is curated with a three-stage filter (difficulty curation via Qwen2.5-VL-7B accuracy, format standardization, and a perception-utility filter that keeps only samples solvable via active perception with ground-truth regions) over V*, ArxivQA, and ThinkLite-VL.
On high-resolution benchmarks DeepEyes-7B reaches 90.1% on V* (+18.9 over Qwen2.5-VL 7B) and improves HR-Bench-4K/8K by 6.3% and 7.3%. It also improves general perception (MME-RealWorld-Lite +10.9%), grounding (refCOCO family), hallucination (POPE), and math-reasoning benchmarks (MathVista, MathVerse, LogicVista). Training dynamics on V* reveal three distinct emergent stages: (1) initial exploration with high but uncoordinated zoom rates and low grounding IoU; (2) high-frequency engagement where zoom count and response length peak; and (3) efficient utilization where the model becomes selective, reducing zoom frequency while maintaining high accuracy and IoU — a pattern the authors describe as co-evolution between perception and reasoning. A zero-shot experiment introduces a rotate tool (not seen in training) and shows that DeepEyes can generalize its tool-use scaffolding to the new tool without retraining.
Key Ideas
- iMCoT (interleaved Multimodal Chain-of-Thought): a Markov Decision Process formulation where actions are either language tokens or image observation tokens from tool returns; the policy gradient is applied to the entire trajectory, with a loss mask on observation tokens not produced by the model.
- Grounding as native internal tool: rather than a separate grounding model or a SAM-style segmenter, the same VLM emits bounding boxes; the crop is fed back as image tokens. This sidesteps the error compounding of modular pipelines and makes the tool “implicit” in the model’s own weights.
- Conditional tool reward: \(R(\tau) = R_{acc} + R_{format} + \mathbb{1}_{R_{acc}>0}\, R_{tool}\). The indicator on correctness is what prevents reward hacking via spammy zooms.
- End-to-end RL without SFT cold start: made viable by a perception-utility data selection filter that keeps only samples where the zoomed ground-truth region is actually necessary to answer correctly.
- Emergent three-stage training dynamic: exploration → high-frequency engagement → selective exploitation, observable in zoom count, response length, and grounding IoU curves.
- Four thinking patterns identified post-hoc: visual search, visual comparison, visual confirmation, and hallucination mitigation. In the hallucination case, active perception demonstrably overrides a language prior (e.g., the model initially hallucinates “rocks” on a beach scene but, after zooming, corrects the answer to “clock”).
- Zero-shot tool generalization: adding a rotate tool at inference only (prompt-level, no retraining) gives +3.5% on a rotated OCR benchmark while keeping V* performance stable.
- Scaling behavior: moving from 7B to 32B widens the gap over the Qwen2.5-VL baseline and yields longer reasoning chains and higher grounding IoU — capacity amplifies the iMCoT payoff.
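The conditional tool reward above is simple enough to state in a few lines. A minimal sketch, assuming binary accuracy/format checks and illustrative weights (the relative magnitudes here are my assumption, not the paper's exact values):

```python
def trajectory_reward(answer_correct: bool, format_ok: bool, used_tool: bool,
                      r_acc: float = 1.0, r_format: float = 0.5,
                      r_tool: float = 0.5) -> float:
    """R(tau) = R_acc + R_format + 1[R_acc > 0] * R_tool.

    The tool bonus fires only when the final answer is correct AND at least
    one active-perception step was taken; weights are illustrative.
    """
    reward = (r_acc if answer_correct else 0.0) + (r_format if format_ok else 0.0)
    if answer_correct and used_tool:  # the indicator 1[R_acc > 0]
        reward += r_tool
    return reward
```

The gating is the whole point: a wrong answer earns no tool bonus no matter how many zooms were issued, which removes the incentive to spam zooms.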
Triage (Vision-Language Model)
Architecture
- Base is Qwen2.5-VL-7B (also reported at 32B): native Qwen2.5-VL vision encoder + projection MLP feeding the Qwen2.5 LLM backbone. Nothing new on the encoder / connector side — the paper reuses the stock Qwen2.5-VL stack and intervenes only on training.
- Resolution strategy is inherited from Qwen2.5-VL (dynamic-resolution ViT with native aspect ratios); high-res inputs are handled by the model’s own crop tool rather than by a tiling scheme. The LLM emits [bbox_2d] coordinates and the crop is re-encoded by the same vision encoder and appended as new image observation tokens.
- The full model (vision encoder + projection + LLM) is optimized end to end during RL; there is no freeze schedule because there is no multi-stage pretraining — only the single RL post-training stage.
Training phases
- Single-stage end-to-end RL, explicitly no SFT cold start. The entire stack (vision encoder, projector, LLM) is updated with GRPO.
- 80 GRPO iterations on H100s. Each batch: 256 prompts × 16 rollouts, up to 6 active-perception steps per rollout, max response length 20480 tokens, KL coefficient 0.0, token-wise loss mask on observation tokens (image crops not produced by the policy).
- Reward: \(R(\tau) = R_{\text{acc}} + R_{\text{format}} + \mathbb{1}_{R_{\text{acc}}>0} R_{\text{tool}}\). The conditional gating on correctness is what keeps the zoom behavior useful instead of spammy.
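The token-wise loss mask on observation tokens can be sketched in plain Python. This is a simplification (it omits GRPO's group-normalized advantages and clipping) meant only to show where the mask enters; function and argument names are mine:

```python
def masked_policy_loss(token_logprobs, token_advantages, is_observation):
    """Per-token policy-gradient loss, averaged over policy-produced tokens only.

    Observation tokens (image crops returned by the zoom tool) were emitted by
    the environment, not sampled from the policy, so they carry no gradient.
    """
    total, n = 0.0, 0
    for lp, adv, obs in zip(token_logprobs, token_advantages, is_observation):
        if obs:
            continue  # skip crop/observation tokens entirely
        total += -lp * adv
        n += 1
    return total / max(n, 1)
```

Without this mask the policy would be penalized (or rewarded) for tokens it never generated, corrupting the gradient estimate.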
Data
- Three complementary sources, all image+text QA: V\* training set (fine-grained perception on natural images), ArxivQA (chart/figure QA for diversity), ThinkLite-VL (challenging reasoning samples).
- No interleaved corpora, no text-only SFT, no synthetic reasoning traces. Everything is Q→A pairs; the reasoning chain is produced by the policy at rollout time and rewarded only by final outcome.
- Three-stage curation: (1) difficulty curation — drop items Qwen2.5-VL-7B solves at 100% or 0%; (2) format standardization + label verification; (3) perception-utility filter — keep only samples where the ground-truth bbox is actually required to answer (applied to V\* only, not to ArxivQA / ThinkLite-VL). The filter is the lever that lets outcome rewards bootstrap without SFT.
Token budget
- Per-image token count is inherited from Qwen2.5-VL’s dynamic-resolution ViT (variable, depends on input). No new compression / resampling is introduced.
- Active perception adds tokens dynamically: each crop is re-encoded and appended as a new image observation segment, growing the context per zoom step. Hard cap of 6 zooms per rollout and 20480 total response tokens.
- Long documents / video are not addressed — the whole framework is image-only, with high-resolution static images treated via recursive cropping rather than frame sampling.
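The per-rollout budget (at most 6 zooms, 20480 response tokens) can be made concrete with a rollout-loop sketch. `policy_step` and `encode_crop` are hypothetical interfaces, not the paper's actual code:

```python
MAX_ZOOMS = 6       # hard cap on active-perception steps per rollout
MAX_TOKENS = 20480  # hard cap on total response length

def rollout(policy_step, encode_crop, prompt_tokens):
    """iMCoT rollout: the policy alternates text generation and zoom actions.

    Each zoom's crop is re-encoded and appended to the context as observation
    tokens, so the context grows with every active-perception step.
    """
    context = list(prompt_tokens)
    zooms = 0
    while len(context) < MAX_TOKENS:
        action = policy_step(context)  # hypothetical: text tokens or a bbox call
        if action["type"] == "answer":
            context += action["tokens"]
            break
        if zooms >= MAX_ZOOMS:        # zoom budget exhausted
            break
        context += encode_crop(action["bbox"])  # crop -> observation tokens
        zooms += 1
    return context, zooms
```

The caps bound both compute and memory per trajectory; within them, the RL signal alone decides how many of the 6 allowed zooms are actually spent.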
Evals
- Reported: high-res (V\*, HR-Bench-4K, HR-Bench-8K); general perception (MME-RealWorld-Lite, OCR/RS/DT/MO/AD splits); grounding (refCOCO/+/g, ReasonSeg); hallucination (POPE adversarial / popular / random); math-ish reasoning (MathVista, MathVerse, MathVision, WeMath, DynaMath, LogicVista).
- Omitted: no video benchmarks (Video-MME, EgoSchema, etc.), no generic short-VQA (VQAv2, OK-VQA, GQA), no doc/chart understanding beyond the ArxivQA used in training (no DocVQA, ChartQA, InfoVQA), no OCR benchmarks (TextVQA, OCRBench). The selection skews toward benchmarks where fine-grained localization is the bottleneck — which is exactly where iMCoT is expected to shine; the silences are informative.
- Baselines include workflow-based systems (SEAL, DyFo, ZoomEye) flagged as E2E=✗, putting DeepEyes’ 7B E2E numbers against 7B pipelined systems and 32B dense baselines on the same tables.
Comments
The core methodological claim — that “thinking with images” can emerge natively from end-to-end RL without an SFT cold start — is interesting and non-trivial, because the companion GeoEyes paper explicitly argues the opposite for the ultra-high-resolution remote sensing regime: GeoEyes observes that DeepEyes-style training collapses into “Tool Usage Homogenization” (always-one-zoom) on UHR imagery and requires a UHR-Chain-of-Zoom SFT corpus plus a multi-component reward to avoid this failure mode. The two papers together make a nice dialectic on when RL alone is sufficient: DeepEyes succeeds on V*/HR-Bench with 7B and conditional rewards, while GeoEyes claims the UHR regime has such sparse evidence density and such heterogeneous task difficulty that outcome rewards alone collapse. Reading both sides clarifies that the conditional tool reward in DeepEyes is the sharp technical lever — and its limits.
The conditional tool reward is the engineering insight most worth remembering. The ablation (Table 5 / Figure 4) cleanly separates the three regimes: no tool reward, unconditional tool reward, and correctness-conditional tool reward. This is a template applicable beyond vision: any agent training with outcome rewards that wants to incentivize a specific behavior (tool use, self-consistency checks, retrieval) should consider gating the behavior bonus on the final outcome rather than rewarding the behavior unconditionally. The failure mode of the unconditional variant (zooming without the zooms helping accuracy) is a classic reward-hacking pattern, solved here by a single indicator in the reward.
The hallucination-mitigation case study (Figure 5) is compelling because it shows active perception overriding a linguistic prior, with relevancy maps that contrast a hallucinated token’s attribution (bag-of-scene-features) versus a grounded token’s attribution (localized on the actual object). It suggests that “thinking with images” is not merely a test-time compute trick: it actually reshapes the model’s causal attribution toward visual evidence. This hooks naturally into the broader observation that Grounding is a defense against language-prior hallucination in VLMs.
The zero-shot tool generalization result (rotate tool) is provocative but narrow: it adds one operation on top of an existing scaffolding the model already knows how to use. It would be interesting to see whether DeepEyes can bootstrap to a qualitatively different tool (e.g., object segmentation or OCR) with only a prompt-level description, or whether the training-time tool (crop) is doing more latent work than the authors acknowledge.
Connections
- Related to Notes on: GeoEyes because GeoEyes directly builds on and critiques DeepEyes: it identifies the Tool-Usage Homogenization failure mode when DeepEyes-style end-to-end RL is transferred to UHR imagery, and proposes a UHR-CoZ SFT cold start plus a multi-component AdaZoom-GRPO reward in response. Reading the two as a pair cleanly maps the sufficiency boundary of outcome-only RL for active perception.
- Related to Reinforcement learning with verifiable rewards because the DeepEyes reward is a textbook RLVR design (binary accuracy + format + conditional tool bonus) applied to the multimodal setting, and the conditional gating on correctness is a transferable insight.
- Related to GRPO because the optimizer is GRPO with KL coefficient 0 and a token-wise loss mask for observation tokens; an instructive reference implementation for multimodal GRPO with external observations.
- Related to Tool calling because DeepEyes frames grounding as a native internal tool rather than an external API, arguing that implicit (weight-encoded) tool use is preferable to modular pipelines when the tool can be expressed as a VLM capability.
- Related to Grounding because the paper shows grounding doubling as a hallucination-mitigation mechanism — the relevancy-map case study is a concrete demonstration that forcing visual re-engagement breaks language-prior-driven hallucinations.
- Related to Vision Language Models because DeepEyes is a VLM recipe that demonstrates a specific way to close the vision–language modality gap: rather than fusing modalities harder at the encoder, the model iteratively re-enters the image during reasoning.
- Related to Test-time compute because iMCoT is a new axis for scaling test-time compute: instead of (or in addition to) longer chains of thought, the model spends compute by issuing more perception actions, with training dynamics showing this compute becomes more efficient over RL iterations.
Bibliography
- Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, Xing Yu. 2025. “DeepEyes: Incentivizing ‘Thinking with Images’ via Reinforcement Learning”. https://arxiv.org/abs/2505.14362.