Linear Attention
Attention variants that replace the softmax with linear kernels, reducing complexity from quadratic to linear in sequence length.
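A minimal sketch of the idea, following the Katharopoulos et al. (2020) paper linked below: replace softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV), where φ is a positive feature map (here elu(x) + 1, as in that paper). Because φ(K)ᵀV is a d×d matrix computed once, the cost is linear in sequence length n. This is an illustrative NumPy implementation, not the code from any of the cited works:

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1: a positive feature map (Katharopoulos et al., 2020)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Q, K: (n, d); V: (n, d_v).
    # Cost is O(n * d * d_v) instead of the O(n^2 * d) of softmax attention.
    Qp = elu_feature_map(Q)
    Kp = elu_feature_map(K)
    KV = Kp.T @ V                 # (d, d_v), computed once for the whole sequence
    Z = Qp @ Kp.sum(axis=0)       # (n,) normalizer, replaces the softmax denominator
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (6, 4)
```

By associativity, this gives exactly the same result as forming the full n×n kernel matrix φ(Q)φ(K)ᵀ, normalizing its rows, and multiplying by V, but without ever materializing it; a causal (autoregressive) variant instead accumulates the φ(k)vᵀ outer products as a running state, which is what makes these models equivalent to RNNs.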
Links to this note
- Notes on: Attention Residuals by Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su et al. (2026)
- Notes on: MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention by MiniMax (2025)
- Notes on: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention by Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020)
Authored by Hugo Cisneros