- tags
- Transformers, Diffusion models, Computer vision, NLP, T5, CLIP
- paper
- (Saharia et al. 2022)
Architecture
The model is based on the U-Net diffusion architecture with a few extensions. Text conditioning comes from a frozen, pretrained text encoder; the paper compares T5, CLIP, and BERT for this role.
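To make the "frozen text encoder" idea concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the embedding table, the three-parameter linear denoiser standing in for the U-Net, the fixed noise level `alpha_bar`, and all function names are hypothetical and chosen only for illustration. What it shows is that the training step updates the denoiser's parameters while the text encoder's weights stay untouched.

```python
import math
import random

# Hypothetical toy "pretrained" embeddings standing in for a frozen text
# encoder such as T5. These values are never updated during training.
FROZEN_EMBEDDINGS = {"a": 0.1, "red": 0.9, "blue": -0.9, "ball": 0.3}


def encode_text(prompt):
    """Frozen text encoder: mean of fixed per-token embeddings."""
    tokens = prompt.split()
    return sum(FROZEN_EMBEDDINGS[t] for t in tokens) / len(tokens)


def denoise(params, x_t, cond):
    """Toy stand-in for the U-Net: predicts noise from x_t and the text condition."""
    return params["w"] * x_t + params["v"] * cond + params["b"]


def train_step(params, x0, prompt, rng, lr=0.05, alpha_bar=0.5):
    """One noise-prediction step; only the denoiser parameters are updated."""
    cond = encode_text(prompt)  # treated as a constant: no update reaches the encoder
    eps = rng.gauss(0.0, 1.0)   # noise the denoiser should recover
    x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps
    err = denoise(params, x_t, cond) - eps
    # Gradient of the squared error with respect to each denoiser parameter.
    params["w"] -= lr * 2.0 * err * x_t
    params["v"] -= lr * 2.0 * err * cond
    params["b"] -= lr * 2.0 * err
    return err ** 2


def run_demo(steps=200, seed=0):
    """Train the toy denoiser and report whether the encoder stayed frozen."""
    rng = random.Random(seed)
    params = {"w": 0.0, "v": 0.0, "b": 0.0}
    before = dict(FROZEN_EMBEDDINGS)
    data = [(1.0, "a red ball"), (-1.0, "a blue ball")]
    for _ in range(steps):
        x0, prompt = rng.choice(data)
        train_step(params, x0, prompt, rng)
    return params, before == FROZEN_EMBEDDINGS
```

Calling `run_demo()` returns the trained denoiser parameters and a flag confirming the embedding table is unchanged. In a real system the same effect is usually achieved by excluding the encoder's weights from the optimizer, which also lets the text embeddings be computed once and cached.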
Parameter count
2B (base 64×64 text-to-image model)
Bibliography
- Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, et al. 2022. "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". arXiv. DOI.