Vision Language Models

tags
LLM, Vision transformer

Vision language models (VLMs) are generative AI models trained on both text and images. They can be effective tools for Image classification.

CLIP is an early and influential example of contrastive vision-language pre-training. VLMs are a class of Foundation models.
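As a rough illustration of the zero-shot classification use case, the sketch below scores an image against a handful of text labels with a CLIP-style model. It assumes the Hugging Face transformers and Pillow packages; the checkpoint name, file name, and candidate labels are placeholders, not from this note.

```python
# Minimal zero-shot image classification sketch with a CLIP-style model.
# Checkpoint, image path, and labels are illustrative assumptions.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("scene.jpg")  # any RGB image
labels = ["a photo of a forest", "a photo of a city", "a photo of a harbour"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores, softmaxed over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```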

Remote Sensing VLMs

VLMs for remote sensing are maturing rapidly (as of early 2026), approaching the point where analysts can query satellite archives in natural language and receive spatially grounded, visually referenced answers.

  • GeoChat (CVPR 2024, Mohamed bin Zayed University of AI): first grounded VLM for remote sensing. Fine-tuned on 318K RS instruction pairs. Capabilities: image captioning, VQA, scene classification, referring object detection. (TheCVF, GitHub)
  • SkyEyeGPT: trained on 968K instruction samples; reports strong results across 8 remote-sensing datasets. (ScienceDirect)
  • RemoteCLIP and successor RSCLIP (ISPRS Geospatial Week 2025): contrastive learning for satellite-text retrieval (Liu et al. 2023); a retrieval sketch follows this list.
  • Google Remote Sensing Foundation Models: RS-OWL-ViT-v2 for object detection, 16%+ improvement on text-based search, 2× baseline for zero-shot detection. (Google Research)
  • Meta Segment Anything Model (SAM → SAM 2 → SAM 3, 2025): “promptable concept segmentation” enables open-vocabulary detection, where a text prompt segments every matching instance. The open-source segment-geospatial (SamGeo) package integrates SAM into QGIS and ArcGIS workflows.
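For the retrieval setting that RemoteCLIP targets, a minimal sketch of satellite image–text retrieval with a generic CLIP-style encoder might look like the following. It uses the open_clip package with a stock OpenAI checkpoint; the tile paths and query are assumptions, and a RemoteCLIP fine-tuned state dict could be loaded in place of the base weights.

```python
# Minimal satellite image-text retrieval sketch with a CLIP-style encoder.
# Assumes the `open_clip_torch` and `Pillow` packages; checkpoint, tile paths,
# and the text query are placeholders.
import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Encode a small archive of satellite image tiles.
paths = ["tile_001.png", "tile_002.png", "tile_003.png"]
images = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])

# Encode a natural-language query describing what to find.
query = tokenizer(["an airport with several parked aircraft"])

with torch.no_grad():
    img_feats = model.encode_image(images)
    txt_feats = model.encode_text(query)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    # Cosine similarity between the query and every tile; higher = better match.
    scores = (txt_feats @ img_feats.T).squeeze(0)

best = scores.argmax().item()
print(f"Best match: {paths[best]} (score {scores[best]:.3f})")
```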

Bibliography

  1. Liu et al. 2023. "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing". https://arxiv.org/abs/2306.11029.