- tags
- Transformers, NLP
- paper
- (Du et al. 2021)
Architecture
The model is a decoder-only transformer with mixture-of-experts (MoE) layers of 64 experts each. Two experts are activated per token, which makes the model relatively efficient for its total parameter count. A minimal routing sketch follows.
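A minimal NumPy sketch of the top-2 routing described above, under stated assumptions: the layer sizes, the ReLU activation, and the weight names (`w_router`, `w_in`, `w_out`) are illustrative placeholders, not GLaM's actual configuration; only the generic recipe (softmax router, keep the top-2 experts, renormalize their gate weights, mix the expert outputs) is taken from the description.

```python
# Sketch of top-2 mixture-of-experts routing; sizes are illustrative, not GLaM's.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 512, 2048, 64, 2

# Router and per-expert feed-forward weights (hypothetical names/shapes).
w_router = rng.normal(size=(d_model, n_experts)) * 0.02
w_in = rng.normal(size=(n_experts, d_model, d_ff)) * 0.02
w_out = rng.normal(size=(n_experts, d_ff, d_model)) * 0.02


def moe_layer(x):
    """Route each token to its top-2 experts and mix their outputs."""
    logits = x @ w_router                       # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)       # softmax over experts

    out = np.zeros_like(x)
    for t, p in enumerate(probs):
        top = np.argsort(p)[-top_k:]            # indices of the 2 chosen experts
        weights = p[top] / p[top].sum()         # renormalize the gate weights
        for e, w in zip(top, weights):
            h = np.maximum(x[t] @ w_in[e], 0)   # expert FFN (ReLU is illustrative)
            out[t] += w * (h @ w_out[e])
    return out


tokens = rng.normal(size=(4, d_model))          # a batch of 4 token vectors
print(moe_layer(tokens).shape)                  # (4, 512)
```

Because only 2 of the 64 expert feed-forward blocks run for any given token, the per-token compute is far smaller than the full parameter count suggests.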
Parameter count
1.2T total, 96B active per token.
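Read as a hedged back-of-the-envelope: only the two selected experts' weights contribute to the per-token footprint, while the shared attention, embedding, and router weights are always active. With $P_{\text{shared}}$ and $P_{\text{expert}}$ as placeholders the note does not break out,

$$
P_{\text{total}} \approx P_{\text{shared}} + 64\,P_{\text{expert}},
\qquad
P_{\text{active}} \approx P_{\text{shared}} + 2\,P_{\text{expert}},
$$

which is how 1.2T total parameters can shrink to roughly 96B activated per token.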
Bibliography
- Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, et al. 2021. "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts". arXiv. http://arxiv.org/abs/2112.06905.