PaLM

tags
Transformers, NLP
paper
(Chowdhery et al. 2022)

Architecture

PaLM is a standard decoder-only Transformer architecture with a few specific modifications (a rough sketch of the resulting block follows the list):

  • SwiGLU activation functions
  • Parallel layers
  • Multi-query attention
  • RoPE embeddings
  • Shared input-output embeddings
  • No biases
  • A 256k SentencePiece vocabulary generated from the training data
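
Below is a minimal NumPy sketch, purely for illustration, of how the parallel-layer formulation, SwiGLU feed-forward, and multi-query attention fit together in one decoder block. The weight names and shapes are assumptions, the normalization is a simple RMSNorm stand-in, and RoPE on the queries/keys is omitted; the real model stacks many such blocks and has no bias terms anywhere.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Simple normalization stand-in (no learned scale, no bias).
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def swiglu(x, W, V, W_out):
    # SwiGLU feed-forward: Swish(xW) gated elementwise by xV, then projected back.
    a = x @ W
    swish = a / (1.0 + np.exp(-a))                 # Swish / SiLU: a * sigmoid(a)
    return (swish * (x @ V)) @ W_out

def multi_query_attention(x, Wq, Wk, Wv, Wo, n_heads):
    # Multi-query attention: n_heads query heads, but a single shared key/value head.
    T, d = x.shape
    d_head = Wk.shape[1]                           # width of the shared K/V head
    q = (x @ Wq).reshape(T, n_heads, d_head)       # [T, heads, d_head]
    k = x @ Wk                                     # [T, d_head], shared across heads
    v = x @ Wv                                     # [T, d_head], shared across heads
    scores = np.einsum('thd,sd->hts', q, k) / np.sqrt(d_head)
    scores = scores + np.triu(np.full((T, T), -np.inf), k=1)   # causal mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = np.einsum('hts,sd->thd', weights, v).reshape(T, n_heads * d_head)
    return out @ Wo

def parallel_block(x, attn_params, mlp_params, n_heads):
    # "Parallel layers": y = x + Attn(norm(x)) + MLP(norm(x)),
    # instead of the usual serial y = x + MLP(norm(x + Attn(norm(x)))).
    h = rms_norm(x)
    return (x
            + multi_query_attention(h, *attn_params, n_heads)
            + swiglu(h, *mlp_params))
```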

Parameter count

540B
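
A back-of-envelope check of that figure, assuming the 540B configuration reported in the paper (118 layers, d_model = 18432, 48 heads of size 256 with multi-query attention, d_ff = 4 × d_model, a 256k shared input/output embedding, and no biases):

```python
# Rough parameter count under the assumed 540B configuration.
n_layers, d_model, n_heads, d_head, vocab = 118, 18_432, 48, 256, 256_000
d_ff = 4 * d_model

attn = (d_model * n_heads * d_head     # Q projection (48 query heads)
        + 2 * d_model * d_head         # shared K and V (multi-query)
        + n_heads * d_head * d_model)  # output projection
mlp = 3 * d_model * d_ff               # SwiGLU: two input matrices + one output
embed = vocab * d_model                # counted once (input/output shared)

total = n_layers * (attn + mlp) + embed
print(f"{total / 1e9:.0f}B parameters")  # ~540B
```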

Bibliography

  1. . . "Palm: Scaling Language Modeling with Pathways". arXiv. http://arxiv.org/abs/2204.02311.
