Mixture of Experts

tags
Transformers, LLM, Machine learning, Scaling laws

A sparse neural network architecture in which a learned router sends each input to a small subset of expert subnetworks, so total parameter count can be scaled up without a proportional increase in per-token compute.
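
Below is a minimal sketch of a top-k routed MoE feed-forward layer in PyTorch. The class and parameter names (`TopKMoE`, `num_experts`, `top_k`) are illustrative assumptions, not from any specific library; real systems add load-balancing auxiliary losses, capacity limits, and expert parallelism on top of this basic routing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of a top-k mixture-of-experts feed-forward layer."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: one logit per expert for each token.
        self.router = nn.Linear(d_model, num_experts)
        # Experts: independent feed-forward subnetworks.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten to a list of tokens.
        tokens = x.reshape(-1, x.shape[-1])
        logits = self.router(tokens)                        # (tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # keep only top-k experts per token
        weights = F.softmax(weights, dim=-1)                # renormalize over the chosen experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Which tokens routed to expert e, and in which of their k slots.
            token_idx, slot_idx = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            # Each selected token runs through only this expert: sparse compute.
            out[token_idx] += weights[token_idx, slot_idx].unsqueeze(-1) * expert(tokens[token_idx])
        return out.reshape_as(x)

# Usage: only top_k of the num_experts expert FFNs run per token,
# so per-token compute scales with top_k while total parameters scale with num_experts.
layer = TopKMoE(d_model=512, d_hidden=2048, num_experts=8, top_k=2)
y = layer(torch.randn(2, 16, 512))
```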
