- tags
- Vision Language Models, Computer vision, CLIP, Contrastive learning, Vision transformer, Foundation models
- source
- (Bolya et al. 2025)
Summary
This paper introduces Perception Encoder (PE), a family of vision encoders from Meta FAIR trained with a purely global CLIP-style contrastive vision-language objective that nonetheless produces state-of-the-art features for tasks as diverse as zero-shot classification, MLLM-based Q&A, grounding, object detection, tracking, and depth estimation. The central empirical finding is that general-purpose features already exist inside a well-trained CLIP model — but they live in intermediate layers, not at the output. The paper spends its first half building the best possible contrastive encoder (PE_core, up to 2B params, trained on 5.4B images + 22M recaptioned videos) and its second half developing two alignment tuning methods to lift those hidden features to the end of the network: language alignment (producing PE_lang for VLM-style MLLM tasks) and spatial alignment (producing PE_spatial for dense prediction).
PE_core is built by a careful ablation pipeline over OpenCLIP ViT-L/14: a stack of changes (progressive resolution, batch-size doubling, the LAMB optimizer, a high 336px final resolution, 2D RoPE, attention pooling with a class token, tuned data augmentation, and MaskFeat-style mask regularization) improves zero-shot ImageNet robustness from 75.3 to 80.9 while keeping FLOPs fixed. Crucially, the same recipe adds ~+10 mAP on frozen-feature COCO detection and vastly improves scaling behavior — vanilla CLIP plateaus at L scale (~300M params) while the new recipe scales to G (2B) and beyond. A novel video data engine generates 22M aligned video-text pairs by combining per-frame captions, a Perception Language Model video captioner, and Llama-3.3-70B summarization; finetuning on this data improves both image and video performance and yields the 1M-video PVD dataset.
Language alignment (§4) plugs PE_core into Llama-3.2-3B via a 2-layer MLP projector and performs midtraining (70M samples) to align intermediate features (layer 47 of 50 works best) to the network’s output, producing PE_lang. PE_lang-G + Llama-3.1-8B (= PLM-8B) matches or beats InternVL 3 and Qwen2-VL on DocVQA (94.6), InfographicVQA (80.9), PerceptionTest (82.7) and most video benchmarks. Spatial alignment (§5) self-distills PE_core’s intermediate layer 41 features to its own output plus SAM 2.1 mask-logit features for locality, producing PE_spatial; this sets a new COCO absolute state-of-the-art of 66.0 box mAP with a simple DETR-style decoder, without using any detection data for alignment.
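The paper's central move is reading features from an intermediate layer rather than the network output. A toy numpy illustration of that layer-tapping idea (the residual blocks, widths, and weights below are stand-ins, not the real PE architecture; only the depth of 50 and the layer-47 tap match the paper's PE_core-G setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a ViT: a stack of residual blocks. PE_core-G has 50 blocks;
# the paper reads features from layer 47 for language alignment and layer 41
# for spatial alignment, not from the final output.
DEPTH, DIM, TOKENS = 50, 64, 16
weights = [rng.standard_normal((DIM, DIM)) * 0.02 for _ in range(DEPTH)]

def forward_collect(x):
    """Run all blocks, keeping every intermediate hidden state."""
    states = []
    for w in weights:
        x = x + np.tanh(x @ w)   # toy residual block
        states.append(x)
    return states

x = rng.standard_normal((TOKENS, DIM))
states = forward_collect(x)

# "Tap" an intermediate layer instead of the output (1-indexed layer 47).
features_for_llm = states[47 - 1]
print(features_for_llm.shape)   # (16, 64)
```

In a real PyTorch ViT the same effect is usually achieved with a forward hook or by truncating the block list; the point is only that the tap is a read, not an architectural change.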
Key Ideas
- The alignment problem in CLIP. Vanilla contrastive pretraining buries general features in the middle of the network because fitting the global contrastive objective requires a learned “decoder” of the last few layers (global tokens appear at layer ~33 in PE_core-G). Intermediate layers therefore outperform the last layer on OCR Q&A, visual Q&A, grounding, detection, depth, and tracking.
- A single robust contrastive recipe that scales. The combined ablations matter most for difficult benchmarks (ObjectNet, ImageNet-A) and for downstream tasks (+10 mAP on COCO). Progressive resolution and attention pooling in particular move the argmax-best layer deeper into the network.
- Video data engine for contrastive video training. Three stages: (1) PLM baseline captioner; (2) refine with 265K human-annotated captions (+6.2 CIDEr); (3) Llama-3.3-70B synthesizes the final caption from frame captions, video captions, and metadata. Simple average-pooled 8-frame embeddings outperform attention-based video pooling.
- PE Video Dataset (PVD): 1M motion-centric videos with 120K human-refined captions (two forms: short CLIP-style captions ~57 words; long fine-grained captions ~112 words). 15K used as a new text↔video retrieval benchmark.
- Language alignment = MLLM midtraining on intermediate features. The authors systematically show that unfreezing the LLM, using a 2-layer MLP projector, tapping layer 47 (not the last layer), and regularizing the encoder with LayerScale + DropPath all matter. Final average boost: +2.1 over the 20M ablation setting, lifting performance to 82.2.
- Spatial alignment via SAM mask-logit distillation. SAM’s own feature space has global tokens too, so they don’t use it directly. Instead they query SAM 2.1-L on a 32×32 grid and concatenate the per-point mask logits into an H×W×1024 tensor, which is locally smooth. They then distill pairwise cosine similarity between student and teacher tokens via MSE loss. Combining SAM distillation with self-distillation to layer 41 gives the best overall encoder.
- Generality matters. The same PE_lang-G vision encoder transfers across LLM backbones — aligned to Llama-3.2 3B, it still beats all baselines when paired with QwenLM 2.5 7B, and even improves on some OCR/video metrics.
- New state-of-the-art without proprietary data. PE_core is the first open contrastive model to outperform proprietary JFT-3B / WebLI-trained models on general zero-shot classification, while also achieving SOTA on retrieval and fine-grained classification simultaneously.
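The spatial-alignment objective above (distilling pairwise cosine similarity between tokens, with SAM 2.1 mask logits as the teacher channels) can be sketched as follows. This is a minimal numpy reading of the loss with toy shapes, not the authors' implementation:

```python
import numpy as np

def pairwise_cos_sim(feats):
    """feats: (N, D) token features -> (N, N) cosine-similarity matrix."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return normed @ normed.T

def similarity_distill_loss(student, teacher):
    """MSE between student and teacher pairwise cosine-similarity matrices."""
    diff = pairwise_cos_sim(student) - pairwise_cos_sim(teacher)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
# Teacher channels stand in for per-point SAM mask logits: querying a 32x32
# point grid gives 1024 mask-logit channels per spatial location (H x W x 1024).
H = W = 8                                  # toy spatial grid
teacher = rng.standard_normal((H * W, 1024))
student = rng.standard_normal((H * W, 1024))

print(similarity_distill_loss(student, teacher))
```

Matching similarity *structure* rather than raw features is what lets the student inherit SAM's locality without inheriting SAM's global tokens.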
Triage (Vision-Language Model)
Architecture
- Vision encoder: PE family of ViTs — B (0.09B, 12L), L (0.32B, 24L), G (1.88B, 50L). Patch size 14, 2D RoPE, attention-pooling with class token, CLIP embedding dim 1024 (B/L) or 1280 (G). Native resolution 448px for G/L, 336 at intermediate stages. Dynamic tiling used at eval (36 tiles + 32 video frames for PLM-8B).
- Connector: 2-layer MLP projector from layer 47 of PE_core-G (last 3 layers discarded) — empirically beats a linear projector and last-layer feature taps. ~13.5M → 27M projector params.
- LLM backbone: primarily Llama-3.2-3B (for alignment training) and Llama-3.1-8B (for system-level PLM-8B benchmarks). Transferability demonstrated to QwenLM-2.5-7B.
- Training state: during language alignment, the projector is warmed up (LLM frozen), then the full LLM is unfrozen for the main midtraining stage — freezing the LLM costs 1.6 points on average.
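A minimal sketch of the 2-layer MLP connector. The dimensions here are toy placeholders (the paper states the CLIP embedding dims, not the internal token width; in the real system the output width would match the LLM, e.g. Llama-3.2-3B's 3072-dim hidden size):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dims; real inputs would be PE_core-G layer-47 tokens.
VIT_DIM, LLM_DIM, N_TOKENS = 256, 512, 1024

def gelu(h):
    # tanh approximation of GELU
    return 0.5 * h * (1.0 + np.tanh(0.7978845608 * (h + 0.044715 * h ** 3)))

class MLPProjector:
    """2-layer MLP connector: vision tokens -> LLM embedding space."""
    def __init__(self):
        self.w1 = rng.standard_normal((VIT_DIM, LLM_DIM)) * 0.02
        self.b1 = np.zeros(LLM_DIM)
        self.w2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02
        self.b2 = np.zeros(LLM_DIM)

    def __call__(self, tokens):
        return gelu(tokens @ self.w1 + self.b1) @ self.w2 + self.b2

proj = MLPProjector()
vision_tokens = rng.standard_normal((N_TOKENS, VIT_DIM))  # one 448px image -> 1024 tokens
llm_tokens = proj(vision_tokens)
print(llm_tokens.shape)   # (1024, 512)
```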
Training phases
- Robust image pretraining — 5.4B MetaCLIP-curated image-text pairs, progressively 98→154→224→336px, 86B samples seen for G (58B for B/L). Optimizer: LAMB, LR 2×10⁻³, global batch 131K. End-to-end contrastive CLIP loss.
- Image+video finetuning — 50M image cooldown + 22M recaptioned videos at max resolution; video clips encoded as 8 uniformly-sampled frames, averaged into one embedding and contrasted with the video caption text embedding.
- Smaller model distillation — PE_core-G distills into B and L via temperature-scaled soft-target CLIP distillation on ~4B samples (~8% of pretraining), no weight decay.
- Language alignment (PE_lang) — warmup on 1M image-text with projector only unfrozen, then full-model next-token-prediction training on 70M mixed samples (natural images + docs/charts + videos); LayerScale + DropPath regularization on the encoder.
- Spatial alignment (PE_spatial) — self-distillation from layer 41 of the frozen PE_core-G + distillation of pairwise-similarity to SAM 2.1 mask logits at 448px; MaskFeat 75% masking + DropPath + LayerScale; no extra parameters beyond LayerScale.
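The image+video finetuning step (8 frames averaged into one embedding, then a symmetric CLIP loss against the caption embedding) can be sketched as below, with random toy embeddings standing in for real encoder outputs:

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_loss(a_emb, b_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings (diagonal = positives)."""
    logits = l2norm(a_emb) @ l2norm(b_emb).T / tau
    idx = np.arange(len(logits))
    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
B, F, D = 4, 8, 512                       # batch, frames per clip, embed dim
frame_embs = rng.standard_normal((B, F, D))
video_emb = frame_embs.mean(axis=1)       # simple mean over 8 frames, as in the paper
text_emb = rng.standard_normal((B, D))

loss = clip_loss(video_emb, text_emb)
print(float(loss))
```

The temperature `tau` and batch size are toy values; the paper's finding is only that the mean over frames beats learned attention pooling here.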
Data
- Pretraining: 2.3B (ablations) then 5.4B (final) image-text pairs curated with MetaCLIP. No JFT-3B or WebLI — the paper is notably the first open contrastive model to match proprietary-data models.
- Video finetuning: 22M recaptioned videos from the paper’s video data engine. Ablation shows each engine component (title, description, frame captions, video captions) contributes; human-refined captions on 265K clips lift ablation CIDEr from 51.9→71.1 on AuroraCap.
- PE Video Dataset (PVD): 1M human-tagged videos released; 120K with human-refined synthetic captions (57.1-word short + 111.7-word long).
- Language alignment midtraining: 70M samples mixing natural images, documents/charts/diagrams, and videos (the PLM midtrain mix from the companion work, (Carion et al. 2025)). 1M image-text warmup for the projector.
- No detection data used for spatial alignment — a key differentiator from MAE/BEiT-style dense encoders.
Token budget
- Per-image tokens at native resolution: 256 (= (224/14)²) at B scale; 376 at L; 1024 (= (448/14)²) at G. With tiling, PLM-8B uses 36 tiles per image (36 × 1024 ≈ 37K tokens per image) at the system level.
- Videos: 8 uniformly sampled frames for contrastive video training; 32 frames × 1024 tokens/frame for PLM-8B video benchmarks. Frame embeddings are average-pooled — simple mean beats attention-based video pooling, consistent with prior work.
- Tiling behavior: both PE_core-L and PE_core-G beat InternViT 2.5 and SigLIP2-so when tiling with 4 tiles + 1 thumbnail, including on grounding, which InternViT 2.5 was specifically trained for.
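The token-budget arithmetic above follows directly from the patch size of 14; a quick stdlib check, treating 448px, 36 tiles, and 32 frames as the operative system-level settings:

```python
def tokens(resolution, patch=14):
    """Number of ViT tokens for a square image at the given resolution."""
    side = resolution // patch
    return side * side

assert tokens(224) == 256      # B-scale native resolution
assert tokens(448) == 1024     # G-scale native resolution

# System-level PLM-8B budget: 36 tiles per image, 32 frames per video.
image_tokens = 36 * tokens(448)
video_tokens = 32 * tokens(448)
print(image_tokens, video_tokens)   # 36864 32768
```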
Evals (and omissions)
- Zero-shot classification: ImageNet val/v2/A/R/Sketch/ObjectNet (full and reduced); Food, Flowers, Cars, Aircrafts, Countries, Scenes, Satellite.
- Zero-shot retrieval: MS-COCO, Flickr30k.
- Zero-shot video: Kinetics-400/600/700, UCF, HMDB classification; MSR-VTT, MSVD, ActivityNet retrieval; the new PVD Benchmark.
- OCR-heavy: DocVQA (94.6), InfographicVQA (80.9), ChartQA, AI2D, OCR-VQA.
- General VQA: TextVQA, OK-VQA, POPE, VQAv2.
- Captioning: Flickr, COCO, NoCaps.
- Video understanding: VideoMME, STAR, TGIF-QA, EgoSchema, MVBench, PerceptionTest.
- Grounding: RefCOCO/+/g.
- Dense prediction: COCO detection (66.0 box mAP SOTA), LVIS, ADE20k semantic segmentation, DAVIS tracking (J&F), NYU depth (RMSE).
- Frozen-feature probing: k-NN, linear, and attention probing on ImageNet-1k.
- Notable omissions: no results on compositional/reasoning VLM benchmarks like BLINK, MMBench, MME, SEED, or on video-specific reasoning benchmarks beyond PerceptionTest/MVBench; no hallucination evaluation beyond POPE; no multilingual OCR/VQA; no chart generation or UI grounding specifically.
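The frozen-feature probing protocol is simple enough to sketch. Below is a generic cosine k-NN probe on toy clustered features, not the paper's exact evaluation code (the real probe runs over ImageNet-1k embeddings):

```python
import numpy as np

def knn_probe(train_feats, train_labels, test_feats, k=5):
    """Cosine k-NN classification on frozen features (majority vote over k neighbors)."""
    tr = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    te = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = te @ tr.T                          # cosine similarities
    nn = np.argsort(-sims, axis=1)[:, :k]     # indices of k nearest neighbors
    votes = train_labels[nn]                  # their labels
    return np.array([np.bincount(v).argmax() for v in votes])

# Toy "frozen features": two well-separated clusters.
rng = np.random.default_rng(0)
train = np.vstack([rng.normal(5, 0.1, (20, 8)), rng.normal(-5, 0.1, (20, 8))])
labels = np.array([0] * 20 + [1] * 20)
test = np.vstack([rng.normal(5, 0.1, (5, 8)), rng.normal(-5, 0.1, (5, 8))])

preds = knn_probe(train, labels, test)
accuracy = float((preds == np.array([0] * 5 + [1] * 5)).mean())
print(accuracy)   # 1.0
```

Because the encoder stays frozen, this family of probes measures feature quality directly, which is exactly what the paper's layerwise analysis needs.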
Comments
The paper’s core thesis — that a pure global contrastive loss can learn state-of-the-art features for spatial and language tasks, but fails to expose them at the output — is simple and striking. It cleanly separates two historically conflated questions: “what objective learns good features?” and “which layer are the good features at?” The finding that CLIP’s last-layer degradation is caused by a learned decoder of global tokens (layers 33–50 in PE_core-G) aligns with CLIPer and REPA’s earlier observations but is argued much more systematically across three task families.
The distillation-flavored alignment methods are pragmatic: rather than changing the pretraining objective or architecture, they finetune with a teacher that is either the model itself at a better layer (for language and spatial alignment) or an external model with a known locality bias (SAM 2.1). The grounding result is especially telling — PE_lang's training data contains no grounding data, yet RefCOCO performance lifts substantially, showing the intermediate grounding features were already there and just needed to be exposed.
Limitations and open questions:
- The alignment pipeline trades generality for specialization: one encoder for language (PE_lang), one for spatial (PE_spatial). A unified alignment is left as future work and would be the natural next step.
- System-level PLM-8B comparisons beat InternVL 3 on some benchmarks but lose on others (e.g., VQAv2, POPE). The paper does not quantify the compute cost of alignment tuning relative to training a full MLLM from scratch.
- The spatial alignment depends on SAM 2.1 as a teacher — inheriting any biases or failure modes of SAM into PE_spatial. The mask-logit trick (using SAM’s predictions rather than its features) is clever and probably generalizes to other dense-prediction teachers.
- The video data engine is pragmatic but heavyweight; a 70B LLM summarizer in the loop is not cheap. It is an open question whether a lighter recaptioning pipeline would suffice.
Connection to the broader literature: this work sits at the intersection of contrastive scaling (following SigLIP/SigLIP2, MetaCLIP, EVA-CLIP), representation-probing work (AIMv2, DINOv2, REPA), and MLLM encoder design (LLaVA-OneVision, InternVL, Qwen2-VL). The paper argues persuasively that the pretraining objective matters less than the community assumed, and that where you read features out matters more.
Connections
- Related to SAM 3: Segment Anything with Concepts — PE’s spatial alignment distills from SAM 2.1 mask logits; several Meta FAIR authors overlap (Peize Sun, Tengyu Ma, Piotr Dollár, Nikhila Ravi, Christoph Feichtenhofer), and both papers share the PLM MLLM as a captioning/labeling tool.
- Related to DeepEyes: Incentivizing “Thinking with Images” — both are VLM works, but PE contributes the encoder side (where to read features from), while DeepEyes contributes the post-training side (RL-driven multi-step visual reasoning). PE_lang-L would be a natural encoder for DeepEyes-style RL.
- Related to GeoEyes — GeoEyes uses VLM tool-use on ultra-high-resolution remote sensing imagery; PE’s tiling behavior (36 tiles + thumbnail) is directly relevant, and PE_core-G could serve as a stronger drop-in encoder.
- Related to CLIP — PE is built on CLIP’s contrastive loss but systematically refines pretraining, scaling, and data to reach a state where features are strong throughout the network, not just at the output.
- Related to Vision transformer — all PE backbones are ViTs; the paper is the first to study layerwise feature quality as a function of depth across contrastive (PE_core), captioning-trained (AIMv2), and self-supervised (DINOv2) ViTs.
- Related to Contrastive learning — the paper is a systematic study of how far a pure contrastive objective can be pushed, challenging the community assumption that captioning or self-supervised objectives are required for downstream transfer.
- Related to Foundation models — PE is positioned as a general vision foundation encoder, and the paper’s model family (B/L/G + Lang + Spatial variants) is released open-weight.
Bibliography
- Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, et al. 2025. "Perception Encoder: The Best Visual Embeddings Are Not at the Output of the Network". https://arxiv.org/abs/2504.13181.
- Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, et al. 2025. "SAM 3: Segment Anything with Concepts". https://arxiv.org/abs/2511.16719. See notes.