- tags: Transformers, NLP
- paper: (Chowdhery et al. 2022)
Architecture
PaLM uses a standard decoder-only Transformer architecture with the following modifications:
- SwiGLU activation functions
- Parallel layers: the attention and feed-forward blocks read the same layer-norm output and their results are summed, instead of being applied in sequence (see the first sketch after this list)
- Multi-query attention: keys and values are shared across all attention heads (see the second sketch after this list)
- RoPE embeddings
- Shared input-output embeddings
- No biases in any of the dense kernels or layer norms
- A 256k SentencePiece vocabulary generated from the training data
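A minimal PyTorch sketch of one parallel decoder block with SwiGLU and no bias terms. The standard serial block computes y = x + MLP(LN(x + Attn(LN(x)))); the parallel variant computes y = x + Attn(LN(x)) + MLP(LN(x)), which the paper reports is roughly 15% faster to train at large scale with negligible quality loss. All names and dimensions here are illustrative, and nn.MultiheadAttention is only a stand-in (it implements neither multi-query attention nor RoPE); causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelBlock(nn.Module):
    """Parallel-layers decoder block: y = x + Attn(LN(x)) + MLP(LN(x))."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        # bias=False on LayerNorm requires PyTorch >= 2.1; the paper drops
        # biases from all dense kernels and layer norms.
        self.ln = nn.LayerNorm(d_model, bias=False)
        # Stand-in attention; PaLM itself uses multi-query attention and RoPE.
        self.attn = nn.MultiheadAttention(d_model, n_heads, bias=False,
                                          batch_first=True)
        # SwiGLU feed-forward: silu(x @ W_gate) * (x @ W_up), projected back.
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln(x)  # one shared LayerNorm feeds both branches
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # causal mask omitted
        mlp_out = self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))
        return x + attn_out + mlp_out  # branches summed, not chained


# Usage on toy dimensions.
block = ParallelBlock(d_model=512, n_heads=8, d_ff=2048)
y = block(torch.randn(2, 16, 512))  # (batch, sequence, d_model)
```

The payoff of the parallel formulation is that the attention and MLP input matmuls share an operand and can be fused, which matters at the 540B scale.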
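Multi-query attention changes only the key/value projections: every head keeps its own queries, but all heads read from a single shared key head and value head, so the key/value cache shrinks by a factor of n_heads and autoregressive decoding becomes markedly cheaper. A sketch under the same illustrative conventions (names and dimensions are hypothetical; RoPE is noted but not implemented):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    """Per-head queries, one shared key head and one shared value head."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.wq = nn.Linear(d_model, d_model, bias=False)      # per-head queries
        self.wk = nn.Linear(d_model, self.d_head, bias=False)  # one shared key head
        self.wv = nn.Linear(d_model, self.d_head, bias=False)  # one shared value head
        self.wo = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)  # (b, h, t, d)
        k = self.wk(x).unsqueeze(1)  # (b, 1, t, d): broadcast across all heads
        v = self.wv(x).unsqueeze(1)
        # Causal self-attention; RoPE would rotate q and k here.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))
```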
Parameter count
540B
Bibliography
- Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, et al. 2022. "PaLM: Scaling Language Modeling with Pathways". arXiv. http://arxiv.org/abs/2204.02311.