The Vision Transformer (ViT) adapts the Transformer encoder architecture popularized by BERT to images: an image is split into fixed-size patches, each patch is linearly embedded, and the resulting sequence of patch embeddings is processed like a sequence of tokens.
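The patch-to-token step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name `patchify` and the use of NumPy are assumptions for the example.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an image of shape (H, W, C) into flattened non-overlapping
    patches, yielding a sequence of (H*W / P^2) token vectors.

    Hypothetical helper illustrating ViT's input pipeline; the learned
    linear projection and position embeddings are omitted.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Carve the image into a grid of P x P patches.
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    # Group the two grid axes together, then flatten each patch.
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch_size * patch_size * c)

# A 224x224 RGB image becomes 196 tokens of dimension 768 (= 16*16*3),
# matching the "16x16 words" of the paper title.
img = np.zeros((224, 224, 3))
tokens = patchify(img)
print(tokens.shape)  # (196, 768)
```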
86M to 632M parameters (ViT-Base to ViT-Huge)
- Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv, 2020.