Perception Encoder: The best visual embeddings are not at the output of the network by Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, Christoph Feichtenhofer (2025)

This note was initially drafted with LLM assistance. Generated notes are periodically reviewed and revised by the author.
tags
Vision Language Models, Computer vision, CLIP, Contrastive learning, Vision transformer, Foundation models
source
(Bolya et al. 2025)

Summary

This paper introduces Perception Encoder (PE), a family of vision encoders from Meta FAIR trained with a purely global CLIP-style contrastive vision-language objective that nonetheless produces state-of-the-art features for tasks as diverse as zero-shot classification, MLLM-based Q&A, grounding, object detection, tracking, and depth estimation. The central empirical finding is that general-purpose features already exist inside a well-trained CLIP model — but they live in intermediate layers, not at the output. The paper spends its first half building the best possible contrastive encoder (PE_core, up to 2B params, trained on 5.4B images + 22M recaptioned videos) and its second half developing two alignment tuning methods to lift those hidden features to the end of the network: language alignment (producing PE_lang for VLM-style MLLM tasks) and spatial alignment (producing PE_spatial for dense prediction).

PE_core is built by a careful ablation pipeline over OpenCLIP ViT-L/14: a sequence of changes (progressive resolution, batch-size doubling, the LAMB optimizer, a high 336px final resolution, 2D RoPE, attention pooling with a class token, tuned data augmentation, and MaskFeat-style mask regularization) improves zero-shot ImageNet robustness from 75.3 to 80.9 while keeping FLOPs fixed. Crucially, the same recipe adds ~+10 mAP on frozen-feature COCO detection and vastly improves scaling behavior — vanilla CLIP plateaus at L scale (~300M params) while the new recipe scales to G (2B) and beyond. A novel video data engine generates 22M aligned video-text pairs by combining per-frame captions, a Perception Language Model video captioner, and Llama-3.3-70B summarization; finetuning on this data improves both image and video performance and yields the 1M-video PVD dataset.
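The purely global objective underlying all of this is the standard symmetric CLIP contrastive (InfoNCE) loss. A minimal numpy sketch with toy shapes, leaving out all of the paper's engineering (LAMB, progressive resolution, etc.):

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    img_emb, txt_emb: (B, D) arrays; matching rows are positive pairs.
    Toy sketch of the global contrastive objective, not the paper's
    implementation.
    """
    # L2-normalise so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (B, B) similarity matrix
    labels = np.arange(len(logits))      # positives sit on the diagonal

    def ce(l):                           # row-wise cross-entropy
        l = l - l.max(axis=1, keepdims=True)   # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # average of the image->text and text->image directions
    return 0.5 * (ce(logits) + ce(logits.T))
```

Perfectly matched pairs drive the loss toward zero; shuffled pairs push it toward log(B).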

Language alignment (§4) plugs PE_core into Llama-3.2-3B via a 2-layer MLP projector and performs midtraining (70M samples) to align intermediate features (layer 47 of 50 works best) to the network’s output, producing PE_lang. PE_lang-G + Llama-3.1-8B (= PLM-8B) matches or beats InternVL 3 and Qwen2-VL on DocVQA (94.6), InfographicVQA (80.9), PerceptionTest (82.7) and most video benchmarks. Spatial alignment (§5) self-distills PE_core’s intermediate layer 41 features to its own output plus SAM 2.1 mask-logit features for locality, producing PE_spatial; this sets a new COCO absolute state-of-the-art of 66.0 box mAP with a simple DETR-style decoder, without using any detection data for alignment.

Key Ideas

  • The alignment problem in CLIP. Vanilla contrastive pretraining buries general features in the middle of the network: fitting the global contrastive objective turns the last few layers into a learned “decoder” of global tokens (these appear around layer 33 in PE_core-G). Intermediate layers therefore outperform the last layer on OCR Q&A, visual Q&A, grounding, detection, depth, and tracking.
  • A single robust contrastive recipe that scales. The combined ablations matter most for difficult benchmarks (ObjectNet, ImageNet-A) and for downstream tasks (+10 mAP on COCO). Progressive resolution and attention pooling in particular move the argmax-best layer deeper into the network.
  • Video data engine for contrastive video training. Three stages: (1) PLM baseline captioner; (2) refine with 265K human-annotated captions (+6.2 CIDEr); (3) Llama-3.3-70B synthesizes the final caption from frame captions, video captions, and metadata. Simple average-pooled 8-frame embeddings outperform attention-based video pooling.
  • PE Video Dataset (PVD): 1M motion-centric videos with 120K human-refined captions (two forms: short CLIP-style captions ~57 words; long fine-grained captions ~112 words). 15K used as a new text↔video retrieval benchmark.
  • Language alignment = MLLM midtraining on intermediate features. The authors systematically show that unfreezing the LLM, using a 2-layer MLP projector, tapping layer 47 (not the last layer), and regularizing the encoder with LayerScale + DropPath all matter. Final average boost: +2.1 over the 20M ablation setting, lifting performance to 82.2.
  • Spatial alignment via SAM mask-logit distillation. SAM’s own feature space has global tokens too, so they don’t use it directly. Instead they query SAM 2.1-L on a 32×32 grid and concatenate the per-point mask logits into an H×W×1024 tensor, which is locally smooth. They then distill pairwise cosine similarity between student and teacher tokens via MSE loss. Combining SAM distillation with self-distillation to layer 41 gives the best overall encoder.
  • Generality matters. The same PE_lang-G vision encoder transfers across LLM backbones — aligned to Llama-3.2 3B, it still beats all baselines when paired with QwenLM 2.5 7B, and even improves on some OCR/video metrics.
  • New state-of-the-art without proprietary data. PE_core is the first open contrastive model to outperform proprietary JFT-3B / WebLI-trained models on general zero-shot classification, while also achieving SOTA on retrieval and fine-grained classification simultaneously.
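The “read features from an intermediate layer” idea amounts to keeping the activation after every block instead of only the last. A toy sketch, where `forward_with_taps` and the random linear blocks are hypothetical stand-ins for hooking a real ViT's residual blocks:

```python
import numpy as np

def forward_with_taps(x, blocks):
    """Run x through a stack of residual blocks, keeping the feature map
    after every block so any intermediate layer can be probed or tapped.

    Toy stand-in for hooking a ViT's blocks; PE taps e.g. layer 47 of 50
    for language alignment and layer 41 for spatial alignment.
    """
    feats = []
    for block in blocks:
        x = x + block(x)      # schematic pre-norm residual update
        feats.append(x.copy())
    return feats

# Hypothetical toy blocks: small random linear maps standing in for the
# attention + MLP sublayers of a real transformer block.
rng = np.random.default_rng(0)
dim, depth = 16, 6
weights = [0.1 * rng.normal(size=(dim, dim)) for _ in range(depth)]
blocks = [lambda x, W=W: x @ W for W in weights]
taps = forward_with_taps(rng.normal(size=(4, dim)), blocks)
```

The paper's layerwise evaluations are exactly this, with a frozen probe (k-NN, linear, or a task head) attached to each element of `taps`.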

Triage (Vision-Language Model)

Architecture

  • Vision encoder: PE family of ViTs — B (0.09B, 12L, patch 16, 224px native), L (0.32B, 24L, patch 14, 336px native), G (1.88B, 50L, patch 14, 448px native). 2D RoPE, attention pooling with a class token, CLIP embedding dim 1024 (B/L) or 1280 (G); lower resolutions are used at intermediate pretraining stages. Dynamic tiling used at eval (up to 36 tiles per image plus 32 video frames for PLM-8B).
  • Connector: 2-layer MLP projector from layer 47 of PE_core-G (last 3 layers discarded) — empirically beats a linear projector and last-layer feature taps. ~13.5M → 27M projector params.
  • LLM backbone: primarily Llama-3.2-3B (for alignment training) and Llama-3.1-8B (for system-level PLM-8B benchmarks). Transferability demonstrated to QwenLM-2.5-7B.
  • Training state: during language alignment, the projector is warmed up (LLM frozen), then the full LLM is unfrozen for the main midtraining stage — freezing the LLM costs 1.6 points on average.
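The connector above is just a 2-layer MLP with a GELU in between, mapping vision tokens into the LLM embedding space. A minimal sketch; the dimensions and initialization here are illustrative assumptions, not the paper's exact projector:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_projector(tokens, W1, b1, W2, b2):
    """2-layer MLP connector: map vision tokens (N, d_vis) into the LLM
    embedding space (N, d_llm)."""
    return gelu(tokens @ W1 + b1) @ W2 + b2

# Assumed sizes: 1280-dim PE_core-G tokens -> a 3072-dim LLM space with a
# 3072-wide hidden layer (the paper's exact projector dims may differ).
rng = np.random.default_rng(0)
d_vis, d_hid, d_llm = 1280, 3072, 3072
params = (0.02 * rng.normal(size=(d_vis, d_hid)), np.zeros(d_hid),
          0.02 * rng.normal(size=(d_hid, d_llm)), np.zeros(d_llm))
out = mlp_projector(rng.normal(size=(1024, d_vis)), *params)  # 1024 patch tokens
```

With these assumed sizes the projector holds roughly 13M weights, the same order of magnitude as the parameter counts quoted above.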

Training phases

  1. Robust image pretraining — 5.4B MetaCLIP-curated image-text pairs, progressively 98→154→224→336px, 86B samples seen for G (58B for B/L). Optimizer: LAMB, LR 2×10⁻³, global batch 131K. End-to-end contrastive CLIP loss.
  2. Image+video finetuning — 50M image cooldown + 22M recaptioned videos at max resolution; video clips encoded as 8 uniformly-sampled frames, averaged into one embedding and contrasted with the video caption text embedding.
  3. Smaller model distillation — PE_core-G distills into B and L via temperature-scaled soft-target CLIP distillation on ~4B samples (~8% of pretraining), no weight decay.
  4. Language alignment (PE_lang) — warmup on 1M image-text with projector only unfrozen, then full-model next-token-prediction training on 70M mixed samples (natural images + docs/charts + videos); LayerScale + DropPath regularization on the encoder.
  5. Spatial alignment (PE_spatial) — self-distillation from layer 41 of the frozen PE_core-G + distillation of pairwise-similarity to SAM 2.1 mask logits at 448px; MaskFeat 75% masking + DropPath + LayerScale; no extra parameters beyond LayerScale.
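Phase 5's pairwise-similarity distillation can be sketched in a few lines: compare the student's and teacher's token-token cosine-similarity matrices under an MSE loss. `pairwise_sim_distill_loss` is an assumed name and a simplification of the paper's loss:

```python
import numpy as np

def pairwise_sim_distill_loss(student_tokens, teacher_tokens):
    """MSE between student and teacher token-token cosine-similarity
    matrices. The teacher's channel width (e.g. SAM 2.1 mask logits
    stacked to H*W x 1024) need not match the student's: only the
    (N, N) similarity structure is distilled."""
    def cos_sim(t):
        t = t / np.linalg.norm(t, axis=1, keepdims=True)
        return t @ t.T   # (N, N) pairwise cosine similarities
    diff = cos_sim(student_tokens) - cos_sim(teacher_tokens)
    return float(np.mean(diff ** 2))
```

Because only the similarity structure is matched, the same loss serves both teachers: the SAM 2.1 mask-logit tensor and the model's own layer-41 features.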

Data

  • Pretraining: 2.3B (ablations) then 5.4B (final) image-text pairs curated with MetaCLIP. No JFT-3B or WebLI — the paper is notably the first open contrastive model to match proprietary-data models.
  • Video finetuning: 22M recaptioned videos from the paper’s video data engine. Ablation shows each engine component (title, description, frame captions, video captions) contributes; human-refined captions on 265K clips lift ablation CIDEr from 51.9→71.1 on AuroraCap.
  • PE Video Dataset (PVD): 1M human-tagged videos released; 120K with human-refined synthetic captions (57.1-word short + 111.7-word long).
  • Language alignment midtraining: 70M samples mixing natural images, documents/charts/diagrams, and videos (the midtrain mix from the companion PLM work). 1M image-text warmup for the projector.
  • No detection data used for spatial alignment — a key differentiator from MAE/BEiT-style dense encoders.

Token budget

  • Per-image patch tokens at native resolution: 196 (= (224/16)²) at B; 576 (= (336/14)²) at L; 1024 (= (448/14)²) at G. With tiling, PLM-8B uses 36 tiles per image (36 × 1024 ≈ 37K tokens per image) at the system level.
  • Videos: 8 uniformly sampled frames for contrastive video training; 32 frames × 1024 tokens/frame for PLM-8B video benchmarks. Frame embeddings are average-pooled — simple mean beats attention-based video pooling, consistent with prior work.
  • Tiling behavior: both PE_core-L and PE_core-G beat InternViT 2.5 and SigLIP2-so when tiling with 4 tiles + 1 thumbnail, including on grounding, which InternViT 2.5 was specifically trained for.
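The token arithmetic above reduces to a one-line helper. A back-of-the-envelope sketch; real tiling schemes also handle aspect ratios and thumbnails differently:

```python
def image_tokens(resolution, patch=14, tiles=1, thumbnail=False):
    """Patch-token count for a square image: (resolution // patch)**2 per
    tile, times the tile count (plus an optional thumbnail tile)."""
    per_tile = (resolution // patch) ** 2
    return per_tile * (tiles + (1 if thumbnail else 0))
```

For example, `image_tokens(448)` gives the 1024 tokens of a single G-scale image, and `image_tokens(448, tiles=36)` gives the ≈37K-token system-level budget quoted above.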

Evals (and omissions)

  • Zero-shot classification: ImageNet val/v2/A/R/Sketch/ObjectNet (full and reduced); Food, Flowers, Cars, Aircrafts, Countries, Scenes, Satellite.
  • Zero-shot retrieval: MS-COCO, Flickr30k.
  • Zero-shot video: Kinetics-400/600/700, UCF, HMDB classification; MSR-VTT, MSVD, ActivityNet retrieval; the new PVD Benchmark.
  • OCR-heavy: DocVQA (94.6), InfographicVQA (80.9), ChartQA, AI2D, OCR VQA.
  • General VQA: TextVQA, OK-VQA, POPE, VQAv2.
  • Captioning: Flickr, COCO, NoCaps.
  • Video understanding: VideoMME, STAR, TGIF-QA, EgoSchema, MVBench, PerceptionTest.
  • Grounding: RefCOCO/+/g.
  • Dense prediction: COCO detection (66.0 box mAP SOTA), LVIS, ADE20k semantic segmentation, DAVIS tracking (J&F), NYU depth (RMSE).
  • Frozen-feature probing: k-NN, linear, and attention probing on ImageNet-1k.
  • Notable omissions: no results on compositional/reasoning VLM benchmarks like BLINK, MMBench, MME, SEED, or on video-specific reasoning benchmarks beyond PerceptionTest/MVBench; no hallucination evaluation beyond POPE; no multilingual OCR/VQA; no chart generation or UI grounding specifically.

Comments

The paper’s core thesis — that a pure global contrastive loss can learn state-of-the-art features for spatial and language tasks, but fails to expose them at the output — is simple and striking. It cleanly separates two historically conflated questions: “what objective learns good features?” and “which layer are the good features at?” The finding that CLIP’s last-layer degradation is caused by a learned decoder of global tokens (layers 33–50 in PE_core-G) aligns with CLIPer and REPA’s earlier observations but is argued much more systematically across three task families.

The distillation-flavored alignment methods are pragmatic: rather than changing the pretraining objective or architecture, they finetune with a teacher that is the model itself at a better layer (for language and spatial alignment) or an external model with a known locality bias (SAM 2.1). The grounding result is especially telling — PE_lang's training data contains no grounding data, yet RefCOCO performance lifts substantially, showing the intermediate grounding features were already there and just needed to be exposed.

Limitations and open questions:

  • The alignment pipeline trades generality for specialization: one encoder for language (PE_lang), one for spatial (PE_spatial). A unified alignment is left as future work and would be the natural next step.
  • System-level PLM-8B comparisons beat InternVL 3 on some benchmarks but lose on others (e.g., VQAv2, POPE). The paper does not quantify what it costs in alignment tuning compute vs. a full from-scratch MLLM.
  • The spatial alignment depends on SAM 2.1 as a teacher — inheriting any biases or failure modes of SAM into PE_spatial. The mask-logit trick (using SAM’s predictions rather than its features) is clever and probably generalizes to other dense-prediction teachers.
  • The video data engine is pragmatic but heavyweight; a 70B LLM summarizer in the loop is not cheap. It is an open question whether a lighter recaptioning pipeline would suffice.

Connection to the broader literature: this work sits at the intersection of contrastive scaling (following SigLIP/SigLIP2, MetaCLIP, EVA-CLIP), representation-probing work (AIMv2, DINOv2, REPA), and MLLM encoder design (LLaVA-OneVision, InternVL, Qwen2-VL). The paper argues persuasively that the pretraining objective matters less than the community assumed, and that where you read features out matters more.

Connections

  • Related to SAM 3: Segment Anything with Concepts — PE’s spatial alignment distills from SAM 2.1 mask logits; several Meta FAIR authors overlap (Peize Sun, Tengyu Ma, Piotr Dollár, Nikhila Ravi, Christoph Feichtenhofer), and both papers share the PLM MLLM as a captioning/labeling tool.
  • Related to DeepEyes: Incentivizing “Thinking with Images” — both are VLM works, but PE contributes the encoder side (where to read features from), while DeepEyes contributes the post-training side (RL-driven multi-step visual reasoning). PE_lang-L would be a natural encoder for DeepEyes-style RL.
  • Related to GeoEyes — GeoEyes uses VLM tool-use on ultra-high-resolution remote sensing imagery; PE’s tiling behavior (36 tiles + thumbnail) is directly relevant, and PE_core-G could serve as a stronger drop-in encoder.
  • Related to CLIP — PE is built on CLIP’s contrastive loss but systematically refines pretraining, scaling, and data to reach a state where features are strong throughout the network, not just at the output.
  • Related to Vision transformer — all PE backbones are ViTs; the paper studies layerwise feature quality as a function of depth across contrastive (PE_core), captioning (AIMv2), and self-supervised (DINOv2) ViTs.
  • Related to Contrastive learning — the paper is a systematic study of how far a pure contrastive objective can be pushed, challenging the community assumption that captioning or self-supervised objectives are required for downstream transfer.
  • Related to Foundation models — PE is positioned as a general vision foundation encoder, and the paper’s model family (B/L/G + Lang + Spatial variants) is released open-weight.

Bibliography

  1. Bolya, Daniel, et al. 2025. "Perception Encoder: The Best Visual Embeddings Are Not at the Output of the Network". https://arxiv.org/abs/2504.13181.
  2. "SAM 3: Segment Anything with Concepts". 2025. https://arxiv.org/abs/2511.16719. See notes