- tags
- Transformers, Diffusion models, CLIP
- paper
- (Ramesh et al. 2022)
Architecture
This model is the successor to DALL-E. It is an encoder/decoder model that combines CLIP and diffusion models to generate images from text. The diffusion decoder is similar to GLIDE.
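The data flow described above can be sketched roughly as follows. This is a toy illustration, not the model's implementation: every function name, dimension, and constant is hypothetical, the "prior" stage (mapping a text embedding to a CLIP image embedding) is taken from the cited paper rather than from the description above, and the decoder loop is only a placeholder for iterative denoising, not a real diffusion sampler.

```python
import math
import random

EMBED_DIM = 8      # hypothetical; real CLIP embeddings are far larger
IMAGE_PIXELS = 16  # hypothetical flattened toy "image"


def normalize(v):
    """Scale a vector to unit length (CLIP embeddings are typically normalized)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def toy_text_encoder(caption):
    """Stand-in for the CLIP text encoder: caption -> text embedding."""
    rng = random.Random(caption)  # deterministic pseudo-embedding per caption
    return normalize([rng.gauss(0, 1) for _ in range(EMBED_DIM)])


def toy_prior(text_emb, rng):
    """Stand-in for the prior: text embedding -> CLIP image embedding."""
    return normalize([t + 0.1 * rng.gauss(0, 1) for t in text_emb])


def toy_decoder(image_emb, rng, steps=10):
    """Stand-in for the diffusion decoder: start from noise and refine it
    step by step, conditioned on the image embedding."""
    # Illustrative conditioning target only; a real decoder predicts noise
    # with a neural network at every step.
    target = [image_emb[i % EMBED_DIM] for i in range(IMAGE_PIXELS)]
    x = [rng.gauss(0, 1) for _ in range(IMAGE_PIXELS)]
    for _ in range(steps):
        x = [xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)]
    return x


def generate(caption, seed=0):
    """Text -> text embedding -> image embedding -> image."""
    rng = random.Random(seed)
    text_emb = toy_text_encoder(caption)
    image_emb = toy_prior(text_emb, rng)
    return toy_decoder(image_emb, rng)


image = generate("a corgi playing a trumpet")
print(len(image))
```

The point of the sketch is the staging: the text is first mapped into CLIP embedding space, and only then does the decoder turn an embedding into pixels.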
Parameter count
3.5B
Bibliography
- Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. 2022. "Hierarchical Text-conditional Image Generation with CLIP Latents". arXiv. DOI.