The model is a mixture-of-experts (MoE) decoder-only transformer with 64 experts. Only two experts are activated per token, which makes the model relatively efficient for its parameter count:
1.2T parameters in total, of which about 96B are active per token.
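
The routing described above can be illustrated with a small sketch. This is not the model's actual implementation: the toy scalar experts, the random gate logits, the helper names (`route_token`, `moe_output`), and the choice to renormalize the two selected gate weights so they sum to 1 are all assumptions made for illustration. Only the expert count (64), the number of active experts (2), and the parameter figures come from the text.

```python
import math
import random

NUM_EXPERTS = 64  # from the text
TOP_K = 2         # experts activated per token, from the text


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(gate_logits, k=TOP_K):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Assumption: the selected weights are renormalized to sum to 1.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_output(token, gate_logits, experts):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    return sum(w * experts[i](token) for i, w in route_token(gate_logits))


# Hypothetical stand-ins for expert feed-forward blocks: expert i scales by i + 1.
experts = [lambda x, s=i: x * (s + 1) for i in range(NUM_EXPERTS)]

rng = random.Random(0)
gate_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route_token(gate_logits)
y = moe_output(1.0, gate_logits, experts)

# Share of parameters used per token, using the figures from the text:
# 96B active out of 1.2T total, i.e. about 8%.
active_fraction = 96e9 / 1.2e12
```

Only the two selected experts are evaluated for each token, which is why the per-token compute follows the active parameter count rather than the total.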
- Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, et al. "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts". arXiv. http://arxiv.org/abs/2112.06905.