- tags
- Vision Language Models, Grounding, Synthetic training data, Spatial Reasoning, Foundation models
- source
- (Clark et al. 2026)
Summary
Molmo2 is a family of fully open Vision Language Models (a 4B and an 8B built on Qwen3, and a 7B built on OLMo) trained without distillation from proprietary systems. The work closes the open-source gap for video-capable VLMs with a core focus on Grounding — producing pixel-level pointing and tracking outputs across single images, multi-image sets, and videos — a capability that even proprietary models largely lack.
Its two principal contributions are (i) a collection of 9 new human-and-synthetic datasets (7 video, 2 multi-image) totalling several million instances across dense video captioning, long-form video QA, video pointing, and video object tracking, all collected without using closed VLMs; and (ii) a training recipe featuring efficient sequence packing with message-tree encoding (~15x training throughput), bi-directional attention over visual tokens, and a per-task token-weighting scheme that prevents long-output captioning tasks from dominating the loss.
Evaluated across 12 video benchmarks, 11 image benchmarks, grounding benchmarks (BURST, Ref-YT-VOS, ReasonVOS, and newly-introduced Molmo2-VC/VP/Track), and a 105k-rating human preference Elo, the 8B variant is state-of-the-art among open-weight/open-data models on short-video understanding, counting, captioning, and point-based grounding — notably beating Qwen3-VL-8B by ~6 points on video counting and surpassing Gemini 3 Pro on pointing F1 (38.4 vs 20.0) and tracking J&F (56.2 vs 41.1). The model lags proprietary systems on long-video (≥10 min) and multimodal reasoning benchmarks.
Key Ideas
- Fully open VLM with explicit no-distillation principle: all training data is human- or open-source-derived, no outputs from closed VLMs used as supervision.
- Three-stage training: image-only pre-training (captioning + pointing + NLP) → joint video/image SFT → a brief long-context SFT stage (seq 36,864, F=384) for long-video handling.
- Sequence packing + message-tree encoding: 3.8 examples per 16,384-token sequence on average, yielding ~15x training efficiency while masking cross-example attention.
- Bi-directional attention between visual tokens (rather than causal) boosts both QA and captioning.
- Token-weighting scheme: fixed 0.1 weight for video captions, 0.2 for image captions, 4/√n for other tasks — prevents long-output captioning from swamping the loss.
- Nine new datasets: Molmo2-Cap (dense video captions, 924 words/video on average), -AskModelAnything, -CapQA, -SubtitleQA, -VideoPoint (650k queries, 280k videos), -VideoTrack (15k complex referring queries), -MultiImageQA, -SynMultiImageQA, AcademicVideoPoint/Track.
- Extends 2D pointing to the temporal domain: video-pointing (when and where) and video-tracking (continuous position) with plain-text normalized-coordinate format plus timestamp/image-index.
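The token-weighting scheme above is simple enough to sketch directly. This is a minimal illustration of the stated rule (0.1 for video captions, 0.2 for image captions, 4/√n otherwise); the function name, task labels, and signature are illustrative, not from the paper.

```python
import math

def token_weight(task: str, n_target_tokens: int) -> float:
    """Per-token loss weight for one training example.
    Fixed weights for caption tasks, 4/sqrt(n) for everything else,
    so long caption outputs cannot swamp the loss.
    Task labels here are shorthand, not the paper's identifiers."""
    if task == "video_caption":
        return 0.1
    if task == "image_caption":
        return 0.2
    return 4.0 / math.sqrt(n_target_tokens)

# A 100-token QA answer is weighted 4/sqrt(100) = 0.4 per token,
# while a 900-word dense video caption stays pinned at 0.1 per token.
```

Under this rule the per-token weight of short, information-dense answers is several times that of caption tokens, which matches the stated goal of keeping dense captioning from dominating the gradient.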
Triage (Vision-Language Model)
Architecture
- Vision encoder: ViT (identity not stated in main body — details in appendix); features pulled from the 3rd-from-last and 9th-from-last layers (Molmo pattern). Image resolution handled by K=8 training / K=24 inference overlapping crops; videos sampled at 2 fps up to F=128 frames (F=384 for long-context training).
- Connector: multi-head attention pooling — 2×2 patch windows for images, 3×3 for video frames — followed by a shared MLP projection. Mean of patches serves as query.
- LLM backbone: Qwen3 (Molmo2-4B, Molmo2-8B) or OLMo-7B (Molmo2-O-7B).
- Trainability: no freezing — all parameters (ViT, connector, LLM) are fine-tuned at every stage, with separate learning rates per module following the Molmo recipe.
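The connector's attention pooling can be sketched as follows. This is a single-head, NumPy-only simplification for one 2×2 patch window, assuming the mean of the window's features forms the query as described above; the real connector is multi-head, followed by a shared MLP projection, and the weight names here are invented for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(window, wq, wk, wv):
    """Pool a (n_patches, d) window of ViT features into one token.
    Query = mean of the window's patches (Molmo connector pattern);
    single-head for clarity, weight matrices are illustrative."""
    q = window.mean(axis=0) @ wq            # (d,) pooled query
    k = window @ wk                         # (n, d) keys
    v = window @ wv                         # (n, d) values
    scores = k @ q / np.sqrt(q.shape[0])    # (n,) scaled dot products
    return softmax(scores) @ v              # (d,) one token per window

rng = np.random.default_rng(0)
d = 8
win = rng.normal(size=(4, d))               # a flattened 2x2 patch window
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
pooled = attention_pool(win, wq, wk, wv)
```

For video frames the same pooling runs over 3×3 windows, so each frame contributes roughly 4.5× fewer tokens per window than the 2×2 image setting.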
Training phases
- Stage 1 — image-only pre-training: PixMo-Cap captioning + image-pointing (PixMo-Points, PixMo-Count, CoSyn-Point) + Tulu NLP. Mix: 60% captioning, 30% pointing, 10% NLP. 32k steps, batch 128, ≈4 epochs on PixMo-Cap. All parameters trainable.
- Stage 2 — joint video+image SFT: PixMo + Molmo2 datasets + open-source video/image/NLP data, per-category sampling rates manually set (Table 1). 30k steps, batch 128, max seq 16,384.
- Stage 3 — long-context SFT: same mixture, seq 36,864, F=384, 2k steps, context parallelism (Ulysses attention) across 8 GPUs per example. Short final stage because of training overhead.
Data
- Core principle: no distillation from proprietary VLMs at any stage (generation, captioning, QA synthesis, or filtering).
- SFT category mix (sampling rates): image QA 22.7%, video QA 18.2%, video pointing 13.6%, video tracking 13.6%, captions/long-QA 13.6%, image pointing 9.1%, NLP text-only 9.1%.
- Multi-image, interleaved, and video inputs all supported. Dense video captions average 924 words (vs 75 in VLN, 547 in LLaVA-Video-178K).
- Mid-training / annealing: effectively yes — the long-context SFT at stage 3 acts as an annealing phase for long-video handling on the same mixture.
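The fixed category mix above amounts to weighted sampling over task buckets. A minimal sketch, assuming the rates are used as unnormalized weights for drawing each SFT example's category (the bucket names are shorthand for the paper's categories):

```python
import random

# SFT category sampling rates from the note (percentages, sum ~= 100).
MIX = {
    "image_qa": 22.7, "video_qa": 18.2, "video_pointing": 13.6,
    "video_tracking": 13.6, "captions_long_qa": 13.6,
    "image_pointing": 9.1, "nlp_text": 9.1,
}

def sample_category(rng: random.Random) -> str:
    """Draw one example's category according to the fixed mix."""
    return rng.choices(list(MIX), weights=list(MIX.values()), k=1)[0]

rng = random.Random(0)
counts = {c: 0 for c in MIX}
for _ in range(10_000):
    counts[sample_category(rng)] += 1
# Empirical frequencies track the target rates (~23% image QA, etc.).
```

Note these rates are per-example sampling probabilities, not per-token shares; combined with the token-weighting scheme, the effective loss contribution per category differs from the raw mix.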
Token budget
- Images: K=8 crops during training, K=24 during inference; 2×2 patch pooling in the connector reduces per-crop token count.
- Videos: up to 128 frames at 2 fps during SFT, 384 frames during long-context SFT; 3×3 pooling reduces video-frame token count more aggressively than for images.
- Dynamic scaling: the visible token budget grows with input resolution and video length, but a custom on-the-fly packing algorithm merges multiple short examples into 16,384-token sequences (≈3.8 examples per pack), making training throughput ≈15x higher.
- Long-document / video handling: message-tree encoding (videos/images as a first message, each annotation as a branch) + sequence packing + long-context SFT up to 36,864 tokens.
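The packing idea is the familiar bin-packing of variable-length examples into fixed-size sequences. A first-fit-decreasing sketch under the 16,384-token budget — a stand-in for the paper's on-the-fly packer, which additionally masks cross-example attention so packed examples cannot see each other:

```python
def pack_sequences(lengths, budget=16_384):
    """First-fit-decreasing packing of example lengths into
    fixed-budget sequences. Returns one list of lengths per pack."""
    packs, free = [], []              # packs and their remaining room
    for n in sorted(lengths, reverse=True):
        for i, room in enumerate(free):
            if n <= room:             # first pack with enough room
                packs[i].append(n)
                free[i] -= n
                break
        else:                         # no pack fits: open a new one
            packs.append([n])
            free.append(budget - n)
    return packs

lengths = [9000, 5000, 4000, 3000, 2500, 1200, 800, 400]
packs = pack_sequences(lengths)       # 8 examples fit into 2 packs
```

Padding one example per 16,384-token sequence wastes most of the budget on short examples; filling each sequence to near capacity (≈3.8 examples per pack in the paper) is exactly where the reported ~15x throughput gain comes from.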
Evals
- Reports: 12 video QA benchmarks (NextQA, PerceptionTest, MVBench, Tomato, MotionBench, TempCompass, Video-MCQ, Video-MME (incl. Sub), LongVideoBench, MLVU, LVBench, VideoEvalPro, EgoSchema); video captioning (Molmo2-Caption); video counting (Molmo2-Count); video grounding (BURST-VC, Molmo2-VC/VP); video tracking (MeVIS, Ref-YT-VOS, Ref-Davis, ReasonVOS, Molmo2-Track); 11 image benchmarks (AI2D, ChartQA, DocVQA, InfoQA, TextVQA, VQA v2.0, RWQA, MMMU, MathVista, CountBench, PixMoCount); multi-image (MultiBench, MMIU, Blink); image pointing (Point-Bench); and a Bradley-Terry Elo ranking from 105k human pairwise preferences.
- Weaknesses acknowledged rather than omitted: lags open-weight competitors on long videos (>10 min), on OCR-heavy benchmarks (DocVQA, InfoQA), and on multimodal-reasoning benchmarks (MathVista, MMMU).
- Not reported: MMVet, SEED-Bench, ScreenSpot/WebUI-grounding, 3D-aware benchmarks.
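The human-preference ranking uses a Bradley-Terry model fit on pairwise comparisons. A toy version of that fit, using the classic minorization-maximization update (the data here is invented; the paper's ranking comes from 105k real ratings):

```python
def bradley_terry(wins, n_models, iters=200):
    """Fit Bradley-Terry strengths from pairwise preference counts.
    wins[i][j] = number of times model i was preferred over model j.
    Uses the standard MM update: p_i <- W_i / sum_j n_ij/(p_i+p_j)."""
    p = [1.0] * n_models
    for _ in range(iters):
        new = []
        for i in range(n_models):
            w_i = sum(wins[i])                   # total wins of model i
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_models) if j != i)
            new.append(w_i / denom if denom else p[i])
        s = sum(new)
        p = [x * n_models / s for x in new]      # renormalize strengths
    return p

# Toy data: model 0 is preferred over model 1 in 70 of 100 ratings.
wins = [[0, 70], [30, 0]]
strengths = bradley_terry(wins, 2)
```

Elo-style scores are then a monotone transform of the fitted strengths (e.g. 400·log10(p)), so the ordering is the same either way.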
Comments
Molmo2 is the reference open model for video grounding in 2026: most open VLMs don’t do temporal pointing or tracking at all, and even Gemini 3 Pro is weaker on F1 pointing. The combination of fully-open data + model + code + training recipe makes it the natural successor to Perception Encoder as an ingredient for reproducible VLM work, though they solve different layers — PE is a vision encoder, Molmo2 is a full VLM stack.
The aggressive token-budget design (packing + message trees + per-task token weighting) is worth copying for anyone training video-VLMs — the reported 15x throughput gain isolates a real bottleneck (padding waste from heterogeneous sequence lengths) that most open recipes ignore.
The no-distillation principle is more than ideology: the authors argue (and the Elo results support) that distilled-from-proprietary open models inherit the base model’s biases silently. This is a useful framing for evaluating other “open” VLMs.
Connections
- Related to Perception Encoder (Bolya et al., 2025) because both are fully-open building blocks for VLMs; PE provides a vision-only encoder, Molmo2 is a complete VLM that could in principle use PE as its backbone.
- Related to GeoEyes (Wang et al., 2026) because both operate in the intersection of VLMs and explicit visual grounding/pointing, with GeoEyes using RL to focus on evidence regions and Molmo2 training grounding directly into the base model.
- Related to DeepEyes (Zheng et al., 2025) because both push VLMs toward visual reasoning with grounding primitives; DeepEyes uses RL-based tool-calling, Molmo2 builds the grounding capability into the pretraining/SFT data directly.
- Related to SAM 3 (Carion et al., 2025) because both target open-vocabulary segmentation/tracking; SAM 3 is concept-driven, Molmo2 is a generalist VLM that emits points rather than masks but is evaluated against SAM-2-based tracking pipelines.
- Related to Synthetic training data because a major methodological contribution is a no-distillation synthetic-data pipeline (Molmo2-CapQA, -SynMultiImageQA) that uses Molmo2’s own captioner rather than a proprietary VLM.
- Related to Spatial Reasoning because video pointing and tracking directly encode spatial-temporal reasoning at the token level (normalized x,y + timestamps) rather than through language.
Bibliography
- Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, et al. 2026. "Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding". https://arxiv.org/abs/2601.10611.