Imagen

tags
Transformers, Diffusion models, Computer vision, NLP, T5, CLIP
paper
(Saharia et al. 2022)

Architecture

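Imagen is based on the U-Net diffusion architecture with a few extensions: a 64×64 text-conditioned base model is followed by two super-resolution diffusion models that upsample to 256×256 and then 1024×1024. A frozen T5-XXL model is used as the text encoder; the paper also compares against BERT and CLIP text encoders.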
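To make the "frozen text encoder" idea concrete, here is a minimal sketch under loud assumptions: a fixed random embedding table stands in for the pretrained text encoder, and a tiny linear model stands in for the U-Net. The names `encode_text`, `ToyDenoiser`, and `training_step` are hypothetical, and the noise-prediction loss is the common diffusion training objective, not necessarily the exact loss used in the paper. The only point illustrated is that gradient updates touch the denoiser weights while the text encoder stays untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; a real system would use the encoder's tokenizer.
VOCAB = {"a": 0, "photo": 1, "of": 2, "corgi": 3}
EMBED_DIM = 8

# Stand-in for a pretrained text encoder: a fixed table that is never trained.
FROZEN_TABLE = rng.normal(size=(len(VOCAB), EMBED_DIM))
FROZEN_TABLE.setflags(write=False)  # writing to it now raises an error


def encode_text(prompt):
    """Map a prompt to a (seq_len, EMBED_DIM) array of frozen embeddings."""
    ids = [VOCAB[word] for word in prompt.split()]
    return FROZEN_TABLE[ids]


def add_noise(x0, alpha_bar, eps):
    """Forward diffusion: mix the clean sample with Gaussian noise."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps


class ToyDenoiser:
    """Linear stand-in for the U-Net: predicts noise from x_t and text."""

    def __init__(self, image_dim, embed_dim):
        self.w_img = rng.normal(scale=0.1, size=(image_dim, image_dim))
        self.w_txt = rng.normal(scale=0.1, size=(embed_dim, image_dim))

    def __call__(self, x_t, text_emb):
        pooled = text_emb.mean(axis=0)  # average over tokens
        return x_t @ self.w_img + pooled @ self.w_txt


def training_step(model, x0, prompt, alpha_bar, lr=0.01):
    """One gradient step on the noise-prediction MSE; only the denoiser moves."""
    eps = rng.normal(size=x0.shape)
    x_t = add_noise(x0, alpha_bar, eps)
    text_emb = encode_text(prompt)
    pooled = text_emb.mean(axis=0)

    err = model(x_t, text_emb) - eps
    loss = float(np.mean(err ** 2))

    # Analytic gradients of the mean squared error for the two weight matrices.
    scale = 2.0 / err.size
    model.w_img -= lr * scale * np.outer(x_t, err)
    model.w_txt -= lr * scale * np.outer(pooled, err)
    return loss


model = ToyDenoiser(image_dim=16, embed_dim=EMBED_DIM)
x0 = np.ones(16)  # a flattened toy "image"
loss = training_step(model, x0, "a photo of a corgi", alpha_bar=0.5)
```

Keeping the encoder frozen means its embeddings can be computed once per caption and cached, which is one practical reason to separate it from the trainable diffusion model.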

Parameter count

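2B (the base 64×64 text-to-image diffusion model; the frozen text encoder and the super-resolution models are counted separately)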

Bibliography

  1. Saharia et al. 2022. "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". arXiv.