Linear Attention
Attention variants that replace the softmax with linear kernels, reducing complexity from quadratic to linear in sequence length.
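A minimal sketch of the idea, following the Katharopoulos et al. (2020) paper linked below: replace softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV), where φ is a positive feature map (here elu(x) + 1, as in that paper). Because φ(K)ᵀV is a d×d matrix computed once, the cost is linear in sequence length n. This is an illustrative NumPy implementation, not the code from any of the cited works:

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1: a positive feature map (Katharopoulos et al., 2020)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Q, K: (n, d); V: (n, d_v).
    # Cost is O(n * d * d_v) instead of the O(n^2 * d) of softmax attention.
    Qp = elu_feature_map(Q)
    Kp = elu_feature_map(K)
    KV = Kp.T @ V                 # (d, d_v), computed once for the whole sequence
    Z = Qp @ Kp.sum(axis=0)       # (n,) normalizer, replaces the softmax denominator
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
print(out.shape)  # (6, 4)
```

By associativity, this gives exactly the same result as forming the full n×n kernel matrix φ(Q)φ(K)ᵀ, normalizing its rows, and multiplying by V, but without ever materializing it; a causal (autoregressive) variant instead accumulates the φ(k)vᵀ outer products as a running state, which is what makes these models equivalent to RNNs.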
Links to this note
- Notes on: Attention Residuals by Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su et al. (2026)
- Notes on: MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention by MiniMax (2025)
- Notes on: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention by Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020)
Authored by Hugo Cisneros