- Transformers, GPT, BERT, T5
- (Shoeybi et al. 2020)
The principle of Megatron is to scale up existing Transformer architectures (GPT, BERT, T5) through model parallelism: the weight matrices of each layer are partitioned across GPUs, so that models too large to fit on a single device can still be trained. Its number of parameters depends on the base model used.
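The core idea of Megatron's intra-layer parallelism can be illustrated on a Transformer MLP block: the first weight matrix is split column-wise and the second row-wise, so each device computes a partial output independently and a single all-reduce recovers the full result. Below is a minimal NumPy sketch of this split; the sizes, shard count, and variable names are illustrative, not taken from the paper's configurations.

```python
import numpy as np

# Illustrative sizes (hypothetical, not from the paper).
batch, d_model, d_ff, n_gpus = 4, 8, 16, 2

rng = np.random.default_rng(0)
X = rng.standard_normal((batch, d_model))
A = rng.standard_normal((d_model, d_ff))   # first MLP weight
B = rng.standard_normal((d_ff, d_model))   # second MLP weight

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Reference: the unpartitioned MLP forward pass.
ref = gelu(X @ A) @ B

# Megatron-style split: A column-wise, B row-wise; each "GPU" holds one shard.
A_shards = np.split(A, n_gpus, axis=1)
B_shards = np.split(B, n_gpus, axis=0)

# Each device computes its partial output independently. The GeLU stays
# local because the column split keeps each hidden unit whole on one device.
partials = [gelu(X @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]

# A single all-reduce (here simulated by a sum) reassembles the output.
out = sum(partials)

assert np.allclose(out, ref)
```

The column/row pairing is what keeps communication to one all-reduce per block: splitting A the other way would force a synchronization before the nonlinearity.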
- Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro. 2020. "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism". arXiv:1909.08053.