- tags
- Transformers, GPT
- paper
- (Rae et al. 2022)
Architecture
This model is very similar to GPT-2 but uses RMSNorm instead of LayerNorm and relative positional encodings rather than absolute positional encodings (see the sketch below).
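For reference, RMSNorm (Zhang & Sennrich 2019) drops LayerNorm's mean subtraction and bias term, rescaling activations only by their root mean square. A minimal PyTorch sketch of the idea; the class name, `eps` default, and interface here are illustrative assumptions, not Gopher's actual implementation:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square layer normalization (illustrative sketch).

    Unlike LayerNorm, there is no mean-centering and no bias:
    activations are divided by their RMS and rescaled by a
    learned per-feature gain.
    """

    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(dim))  # learned per-feature scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS over the feature dimension; no mean subtraction, no bias
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.gain
```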
Parameter count
280B
Bibliography
- Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, et al. 2022. "Scaling Language Models: Methods, Analysis & Insights from Training Gopher". arXiv.