GLaM

tags
Transformers, NLP
paper
(Du et al. 2021)

Architecture

The model is a decoder-only transformer with mixture-of-experts (MoE) layers of 64 experts each. Two experts are activated per token, so only a small share of the weights is used for any given token, making the model relatively efficient for its number of parameters.
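
As a rough illustration of activating two of 64 experts per token, here is a minimal sketch of top-2 routing in plain Python. The function names (`softmax`, `top2_route`, `moe_layer`), the toy experts, and the renormalization of the two gate weights are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(gate_logits):
    """Pick the two highest-scoring experts for one token.

    Returns [(expert_index, weight), (expert_index, weight)].
    Assumption: the two weights are renormalized to sum to 1.
    """
    probs = softmax(gate_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = probs[top2[0]] + probs[top2[1]]
    return [(i, probs[i] / norm) for i in top2]


def moe_layer(token, gate_logits, experts):
    """Weighted sum of the outputs of the two selected experts."""
    out = [0.0] * len(token)
    for idx, weight in top2_route(gate_logits):
        expert_out = experts[idx](token)  # only 2 of the experts ever run
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out


NUM_EXPERTS = 64
# Hypothetical toy experts: expert i scales the token by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

rng = random.Random(0)
gate_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
token = [1.0, 2.0, 3.0]

routes = top2_route(gate_logits)
output = moe_layer(token, gate_logits, experts)
```

In a real model the gate logits would come from a learned projection of the token representation, and each expert would be a feed-forward network rather than a scaling function. The point of the sketch is only that 62 of the 64 experts are never evaluated for a given token.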

Parameter count

1.2T parameters in total, of which about 96B (roughly 8%) are active per token.
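
The relationship between these figures can be checked with a few lines of arithmetic. This is a small sketch using only the numbers quoted in this note; the variable names are illustrative.

```python
TOTAL_PARAMS = 1.2e12    # total parameters, as quoted in this note
ACTIVE_PARAMS = 96e9     # parameters active per token, as quoted in this note
NUM_EXPERTS = 64         # experts per MoE layer
ACTIVE_EXPERTS = 2       # experts selected per token

# Share of all parameters that participate in processing one token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Share of experts selected in each MoE layer.
expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS

print(f"{active_fraction:.1%} of parameters active per token")
print(f"{expert_fraction:.1%} of experts active per MoE layer")
```

The active parameter share is larger than the active expert share, which is what one would expect if some parameters (for example attention and embedding weights) sit outside the experts and are used for every token.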

Bibliography

  1. Du et al. 2021. "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts". arXiv. http://arxiv.org/abs/2112.06905.
