CLIP

Architecture

It is an encoder-only model which combines ViT and ResNet to encode images and a transformer for the text encoding.

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al.. February 26, 2021. "Learning Transferable Visual Models from Natural Language Supervision". arXiv. DOI.