CLIP

tags
Transformers, NLP, Computer vision
paper
(Radford et al. 2021)

Architecture

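It is an encoder-only model with two separate encoders: an image encoder, which is either a ResNet or a ViT depending on the variant, and a Transformer text encoder. Both encoders map their inputs into a shared embedding space, and the model is trained contrastively so that matching image-text pairs end up with high cosine similarity.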
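As a minimal sketch of the two-encoder layout, the snippet below projects image and text features into a shared space and compares them with cosine similarity. The real encoders are replaced by random linear projections, and every name and dimension here (`random_matrix`, `project`, `similarity_matrix`, the 8/6/4 sizes) is an assumption made for illustration, not the paper's implementation.

```python
import math
import random

random.seed(0)


def random_matrix(rows, cols):
    """Hypothetical stand-in for learned weights or encoder outputs."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(features, weights):
    """Multiply a feature vector (length d_in) by a d_in x d_out matrix."""
    return [sum(f * w for f, w in zip(features, col)) for col in zip(*weights)]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosines."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def similarity_matrix(image_feats, text_feats, w_image, w_text):
    """Cosine similarity between every image and every text in a batch."""
    img = [l2_normalize(project(f, w_image)) for f in image_feats]
    txt = [l2_normalize(project(f, w_text)) for f in text_feats]
    return [[sum(a * b for a, b in zip(i, t)) for t in txt] for i in img]


# Toy batch: 3 images with 8-dim features, 3 captions with 6-dim features.
image_feats = random_matrix(3, 8)
text_feats = random_matrix(3, 6)
w_image = random_matrix(8, 4)  # image features -> shared 4-dim space
w_text = random_matrix(6, 4)   # text features  -> shared 4-dim space

sims = similarity_matrix(image_feats, text_feats, w_image, w_text)
for row in sims:
    print(["%.3f" % s for s in row])
```

Entry `sims[i][j]` scores image `i` against caption `j`. In contrastive training the matched pairs sit on the diagonal, and the loss pushes those scores up relative to the off-diagonal ones.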

Bibliography

  1. Radford, Alec, et al. 2021. "Learning Transferable Visual Models from Natural Language Supervision". arXiv.
