- tags
- Vision Language Models, Grounding, Synthetic training data, Spatial Reasoning, Foundation models
- source
- (Clark et al. 2026)
Summary
Molmo2 is a family of fully open Vision Language Models (a 4B and an 8B built on Qwen3, and a 7B built on OLMo) trained without distillation from proprietary systems. The work closes the open-source gap for video-capable VLMs with a core focus on Grounding — producing pixel-level pointing and tracking outputs across single images, multi-image sets, and videos — a capability that even proprietary models largely lack.
Its two principal contributions are (i) a collection of 9 new human-and-synthetic datasets (7 video, 2 multi-image) totalling several million instances across dense video captioning, long-form video QA, video pointing, and video object tracking, all collected without using closed VLMs; and (ii) a training recipe featuring efficient sequence packing with message-tree encoding (~15x training throughput), bi-directional attention over visual tokens, and a per-task token-weighting scheme that prevents long-output captioning tasks from dominating the loss.
Evaluated across 12 video benchmarks, 11 image benchmarks, grounding benchmarks (BURST, Ref-YT-VOS, ReasonVOS, and newly-introduced Molmo2-VC/VP/Track), and a 105k-rating human preference Elo, the 8B variant is state-of-the-art among open-weight/open-data models on short-video understanding, counting, captioning, and point-based grounding — notably beating Qwen3-VL-8B by ~6 points on video counting and surpassing Gemini 3 Pro on pointing F1 (38.4 vs 20.0) and tracking J&F (56.2 vs 41.1). The model lags proprietary systems on long-video (≥10 min) and multimodal reasoning benchmarks.
Key Ideas
- Fully open VLM with explicit no-distillation principle: all training data is human- or open-source-derived, no outputs from closed VLMs used as supervision.
- Three-stage training: image-only pre-training (captioning + pointing + NLP) → joint video/image SFT → a brief long-context SFT stage (seq 36,864, F=384) for long-video handling.
- Sequence packing + message-tree encoding: 3.8 examples per 16,384-token sequence on average, yielding ~15x training efficiency while masking cross-example attention.
- Bi-directional attention between visual tokens (rather than causal) boosts both QA and captioning.
- Token-weighting scheme: fixed 0.1 weight for video captions, 0.2 for image captions, 4/√n for other tasks — prevents long-output captioning from swamping the loss.
- Nine new datasets: Molmo2-Cap (dense video captions, 924 words/video on average), -AskModelAnything, -CapQA, -SubtitleQA, -VideoPoint (650k queries, 280k videos), -VideoTrack (15k complex referring queries), -MultiImageQA, -SynMultiImageQA, AcademicVideoPoint/Track.
- Extends 2D pointing to the temporal domain: video-pointing (when and where) and video-tracking (continuous position) with plain-text normalized-coordinate format plus timestamp/image-index.
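The token-weighting scheme above is simple enough to sketch directly. This is a minimal illustration of the stated rule (0.1 for video captions, 0.2 for image captions, 4/√n otherwise); the function name, task labels, and signature are illustrative, not from the paper.

```python
import math

def token_weight(task: str, n_target_tokens: int) -> float:
    """Per-token loss weight for one training example.
    Fixed weights for caption tasks, 4/sqrt(n) for everything else,
    so long caption outputs cannot swamp the loss.
    Task labels here are shorthand, not the paper's identifiers."""
    if task == "video_caption":
        return 0.1
    if task == "image_caption":
        return 0.2
    return 4.0 / math.sqrt(n_target_tokens)

# A 100-token QA answer is weighted 4/sqrt(100) = 0.4 per token,
# while a 900-word dense video caption stays pinned at 0.1 per token.
```

Under this rule the per-token weight of short, information-dense answers is several times that of caption tokens, which matches the stated goal of keeping dense captioning from dominating the gradient.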
Triage (Vision-Language Model)
Architecture
- Vision encoder: ViT (identity not stated in main body — details in appendix); features pulled from the 3rd-from-last and 9th-from-last layers (Molmo pattern). Image resolution handled by K=8 training / K=24 inference overlapping crops; videos sampled at 2 fps up to F=128 frames (F=384 for long-context training).
- Connector: multi-head attention pooling — 2×2 patch windows for images, 3×3 for video frames — followed by a shared MLP projection. Mean of patches serves as query.
- LLM backbone: Qwen3 (Molmo2-4B, Molmo2-8B) or OLMo-7B (Molmo2-O-7B).
- Trainability: no freezing — all parameters (ViT, connector, LLM) are fine-tuned at every stage, with separate learning rates per module following the Molmo recipe.
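The connector's attention pooling can be sketched as follows. This is a single-head, NumPy-only simplification for one 2×2 patch window, assuming the mean of the window's features forms the query as described above; the real connector is multi-head, followed by a shared MLP projection, and the weight names here are invented for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(window, wq, wk, wv):
    """Pool a (n_patches, d) window of ViT features into one token.
    Query = mean of the window's patches (Molmo connector pattern);
    single-head for clarity, weight matrices are illustrative."""
    q = window.mean(axis=0) @ wq            # (d,) pooled query
    k = window @ wk                         # (n, d) keys
    v = window @ wv                         # (n, d) values
    scores = k @ q / np.sqrt(q.shape[0])    # (n,) scaled dot products
    return softmax(scores) @ v              # (d,) one token per window

rng = np.random.default_rng(0)
d = 8
win = rng.normal(size=(4, d))               # a flattened 2x2 patch window
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
pooled = attention_pool(win, wq, wk, wv)
```

For video frames the same pooling runs over 3×3 windows, so each frame contributes roughly 4.5× fewer tokens per window than the 2×2 image setting.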
Training phases
- Stage 1 — image-only pre-training: PixMo-Cap captioning + image-pointing (PixMo-Points, PixMo-Count, CoSyn-Point) + Tulu NLP. Mix: 60% captioning, 30% pointing, 10% NLP. 32k steps, batch 128, ≈4 epochs on PixMo-Cap. All parameters trainable.
- Stage 2 — joint video+image SFT: PixMo + Molmo2 datasets + open-source video/image/NLP data, per-category sampling rates manually set (Table 1). 30k steps, batch 128, max seq 16,384.
- Stage 3 — long-context SFT: same mixture, seq 36,864, F=384, 2k steps, context parallelism (Ulysses attention) across 8 GPUs per example. Short final stage because of training overhead.
Data
- Core principle: no distillation from proprietary VLMs at any stage (generation, captioning, QA synthesis, or filtering).
- SFT category mix (sampling rates): image QA 22.7%, video QA 18.2%, video pointing 13.6%, video tracking 13.6%, captions/long-QA 13.6%, image pointing 9.1%, NLP text-only 9.1%.
- Multi-image, interleaved, and video inputs all supported. Dense video captions average 924 words (vs 75 in VLN, 547 in LLaVA-Video-178K).
- Mid-training / annealing: effectively yes — the long-context SFT at stage 3 acts as an annealing phase for long-video handling on the same mixture.
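The fixed category mix above amounts to weighted sampling over task buckets. A minimal sketch, assuming the rates are used as unnormalized weights for drawing each SFT example's category (the bucket names are shorthand for the paper's categories):

```python
import random

# SFT category sampling rates from the note (percentages, sum ~= 100).
MIX = {
    "image_qa": 22.7, "video_qa": 18.2, "video_pointing": 13.6,
    "video_tracking": 13.6, "captions_long_qa": 13.6,
    "image_pointing": 9.1, "nlp_text": 9.1,
}

def sample_category(rng: random.Random) -> str:
    """Draw one example's category according to the fixed mix."""
    return rng.choices(list(MIX), weights=list(MIX.values()), k=1)[0]

rng = random.Random(0)
counts = {c: 0 for c in MIX}
for _ in range(10_000):
    counts[sample_category(rng)] += 1
# Empirical frequencies track the target rates (~23% image QA, etc.).
```

Note these rates are per-example sampling probabilities, not per-token shares; combined with the token-weighting scheme, the effective loss contribution per category differs from the raw mix.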
Token budget
- Images: K=8 crops during training, K=24 during inference; 2×2 patch pooling in the connector reduces per-crop token count.
- Videos: up to 128 frames at 2 fps during SFT, 384 frames during long-context SFT; 3×3 pooling reduces video-frame token count more aggressively than for images.
- Dynamic scaling: the visible token budget grows with input resolution and video length, but a custom on-the-fly packing algorithm merges multiple short examples into 16,384-token sequences (≈3.8 examples per pack), making training throughput ≈15x higher.
- Long-document / video handling: message-tree encoding (videos/images as a first message, each annotation as a branch) + sequence packing + long-context SFT up to 36,864 tokens.
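The packing idea is the familiar bin-packing of variable-length examples into fixed-size sequences. A first-fit-decreasing sketch under the 16,384-token budget — a stand-in for the paper's on-the-fly packer, which additionally masks cross-example attention so packed examples cannot see each other:

```python
def pack_sequences(lengths, budget=16_384):
    """First-fit-decreasing packing of example lengths into
    fixed-budget sequences. Returns one list of lengths per pack."""
    packs, free = [], []              # packs and their remaining room
    for n in sorted(lengths, reverse=True):
        for i, room in enumerate(free):
            if n <= room:             # first pack with enough room
                packs[i].append(n)
                free[i] -= n
                break
        else:                         # no pack fits: open a new one
            packs.append([n])
            free.append(budget - n)
    return packs

lengths = [9000, 5000, 4000, 3000, 2500, 1200, 800, 400]
packs = pack_sequences(lengths)       # 8 examples fit into 2 packs
```

Padding one example per 16,384-token sequence wastes most of the budget on short examples; filling each sequence to near capacity (≈3.8 examples per pack in the paper) is exactly where the reported ~15x throughput gain comes from.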
Evals
- Reports: 12 video QA benchmarks (NextQA, PerceptionTest, MVBench, Tomato, MotionBench, TempCompass, Video-MCQ, Video-MME (incl. Sub), LongVideoBench, MLVU, LVBench, VideoEvalPro, EgoSchema); video captioning (Molmo2-Caption); video counting (Molmo2-Count); video grounding (BURST-VC, Molmo2-VC/VP); video tracking (MeVIS, Ref-YT-VOS, Ref-Davis, ReasonVOS, Molmo2-Track); 11 image benchmarks (AI2D, ChartQA, DocVQA, InfoQA, TextVQA, VQA v2.0, RWQA, MMMU, MathVista, CountBench, PixMoCount); multi-image (MultiBench, MMIU, Blink); image pointing (Point-Bench); and a Bradley-Terry Elo ranking from 105k human pairwise preferences.
- Weaknesses acknowledged rather than omitted: lags open-weight competitors on long videos (>10 min), on OCR-heavy benchmarks (DocVQA, InfoQA), and on multimodal-reasoning benchmarks (MathVista, MMMU).
- Not reported: MMVet, SEED-Bench, ScreenSpot/WebUI-grounding, 3D-aware benchmarks.
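The human-preference ranking uses a Bradley-Terry model fit on pairwise comparisons. A toy version of that fit, using the classic minorization-maximization update (the data here is invented; the paper's ranking comes from 105k real ratings):

```python
def bradley_terry(wins, n_models, iters=200):
    """Fit Bradley-Terry strengths from pairwise preference counts.
    wins[i][j] = number of times model i was preferred over model j.
    Uses the standard MM update: p_i <- W_i / sum_j n_ij/(p_i+p_j)."""
    p = [1.0] * n_models
    for _ in range(iters):
        new = []
        for i in range(n_models):
            w_i = sum(wins[i])                   # total wins of model i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_models) if j != i)
            new.append(w_i / denom if denom else p[i])
        s = sum(new)
        p = [x * n_models / s for x in new]      # renormalize strengths
    return p

# Toy data: model 0 is preferred over model 1 in 70 of 100 ratings.
wins = [[0, 70], [30, 0]]
strengths = bradley_terry(wins, 2)
```

Elo-style scores are then a monotone transform of the fitted strengths (e.g. 400·log10(p)), so the ordering is the same either way.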
Comments
Molmo2 is the reference open model for video grounding in 2026: most open VLMs don’t do temporal pointing or tracking at all, and even Gemini 3 Pro is weaker on F1 pointing. The combination of fully-open data + model + code + training recipe makes it the natural successor to Perception Encoder as an ingredient for reproducible VLM work, though they solve different layers — PE is a vision encoder, Molmo2 is a full VLM stack.
The aggressive token-budget design (packing + message trees + per-task token weighting) is worth copying for anyone training video-VLMs — the reported 15x throughput gain isolates a real bottleneck (padding waste from heterogeneous sequence lengths) that most open recipes ignore.
The no-distillation principle is more than ideology: the authors argue (and the Elo results support) that distilled-from-proprietary open models inherit the base model’s biases silently. This is a useful framing for evaluating other “open” VLMs.
Connections
- Related to Perception Encoder (Bolya et al., 2025) because both are fully-open building blocks for VLMs; PE provides a vision-only encoder, Molmo2 is a complete VLM that could in principle use PE as its backbone.
- Related to GeoEyes (Wang et al., 2026) because both operate in the intersection of VLMs and explicit visual grounding/pointing, with GeoEyes using RL to focus on evidence regions and Molmo2 training grounding directly into the base model.
- Related to DeepEyes (Zheng et al., 2025) because both push VLMs toward visual reasoning with grounding primitives; DeepEyes uses RL-based tool-calling, Molmo2 builds the grounding capability into the pretraining/SFT data directly.
- Related to SAM 3 (Carion et al., 2025) because both target open-vocabulary segmentation/tracking; SAM 3 is concept-driven, Molmo2 is a generalist VLM that emits points rather than masks but is evaluated against SAM-2-based tracking pipelines.
- Related to Synthetic training data because a major methodological contribution is a no-distillation synthetic-data pipeline (Molmo2-CapQA, -SynMultiImageQA) that uses Molmo2’s own captioner rather than a proprietary VLM.
- Related to Spatial Reasoning because video pointing and tracking directly encode spatial-temporal reasoning at the token level (normalized x,y + timestamps) rather than through language.
Bibliography
- Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, et al. 2026. "Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding". https://arxiv.org/abs/2601.10611.