- Transformers, NLP, Computer vision
- (Radford et al. 2021)
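It is an encoder-only model with two separate encoders: an image encoder, which is either a ViT or a ResNet, and a transformer text encoder. Both map their input into a shared embedding space, so an image and a text can be compared directly.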
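
To make the two-encoder layout concrete, here is a minimal sketch in plain Python. It is not the model from the paper: the "encoders" are hypothetical stand-ins (fixed random linear projections), and the names `encode_image`, `encode_text`, and `similarity_matrix` as well as all dimensions are assumptions made up for illustration. It only shows how two separate encoders can project into one shared space where image and text embeddings are compared by cosine similarity.

```python
import math
import random

random.seed(0)

IMAGE_DIM = 12   # size of a (fake) image feature vector, arbitrary
TEXT_DIM = 6     # size of a (fake) text feature vector, arbitrary
EMBED_DIM = 4    # size of the shared embedding space, arbitrary


def random_matrix(rows, cols):
    """Random projection matrix standing in for a trained encoder."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def project(vec, matrix):
    """Multiply a row vector (len = rows) by a rows x cols matrix."""
    cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(cols)]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Hypothetical stand-ins for the image encoder (ViT or ResNet) and text encoder.
W_IMAGE = random_matrix(IMAGE_DIM, EMBED_DIM)
W_TEXT = random_matrix(TEXT_DIM, EMBED_DIM)


def encode_image(features):
    return l2_normalize(project(features, W_IMAGE))


def encode_text(features):
    return l2_normalize(project(features, W_TEXT))


def similarity_matrix(images, texts):
    """Cosine similarity of every image with every text: n_images x n_texts."""
    image_embs = [encode_image(im) for im in images]
    text_embs = [encode_text(tx) for tx in texts]
    return [[sum(a * b for a, b in zip(ie, te)) for te in text_embs]
            for ie in image_embs]


# Fake inputs: 3 "images" and 4 "captions" as random feature vectors.
images = [[random.gauss(0, 1) for _ in range(IMAGE_DIM)] for _ in range(3)]
texts = [[random.gauss(0, 1) for _ in range(TEXT_DIM)] for _ in range(4)]

sims = similarity_matrix(images, texts)
# For each image, the index of the most similar caption.
best_caption = [row.index(max(row)) for row in sims]
```

With trained encoders in place of the random projections, the same similarity matrix is what lets a set of candidate captions be ranked against an image.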
- Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. 2021. "Learning Transferable Visual Models from Natural Language Supervision". arXiv. DOI.