It is an encoder-only architecture. It extends BERT with cross-layer parameter sharing and a factorized embedding parameterization, reaching accuracy comparable to BERT with far fewer parameters (see the sketch after the list below). Parameter counts:
- Base = 12M
- Large = 18M
- XLarge = 60M
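
The sharing idea can be shown in a few lines. Below is a minimal PyTorch sketch (PyTorch and the layer sizes are assumptions for illustration, not the paper's actual configuration): one `nn.TransformerEncoderLayer` is applied at every depth, so the shared stack holds one layer's worth of parameters while an unshared baseline of the same depth holds twelve.

```python
# A minimal sketch of ALBERT-style cross-layer parameter sharing.
# Sizes are illustrative defaults, not ALBERT's actual configuration.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """`num_layers` passes through ONE transformer layer."""

    def __init__(self, d_model=768, nhead=12, num_layers=12):
        super().__init__()
        # A single layer object: its weights are reused at every depth.
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        # Reusing the same layer adds depth (compute) but no new parameters.
        for _ in range(self.num_layers):
            x = self.layer(x)
        return x

def count(m):
    return sum(p.numel() for p in m.parameters())

# Unshared baseline: nn.TransformerEncoder deep-copies the layer 12 times.
unshared = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(768, 12, batch_first=True), num_layers=12
)
shared = SharedEncoder()

x = torch.randn(2, 16, 768)              # (batch, sequence, hidden)
print(shared(x).shape)                   # torch.Size([2, 16, 768])
print(f"shared:   {count(shared):,}")    # one layer's worth of weights
print(f"unshared: {count(unshared):,}")  # roughly 12x as many
```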
- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut. "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations". arXiv:1909.11942, 2019. DOI: 10.48550/arXiv.1909.11942.