- tags
- Transformers
- source
- (Xiong et al. 2021)
Summary
This paper describes a way of applying the Nyström method for approximating matrix multiplication to transformers. More precisely, the approximation is used in the self-attention mechanism’s softmax calculation.
This approximation adresses one of the biggest downside of attention: its computational complexity. The authors claim that their method reduces it from \(O(n^2)\) to \(O(n)\).
The goal of the method is to efficiently approximate the matrix
\[\begin{align*} S = \text{softmax}\left(\dfrac{QK^T}{\sqrt{d_q}}\right) \end{align*}\]
Why the standard Nyström method won’t work
To apply matrix approximation to this problem, one would start by writing \(S\) as
\[ S = \begin{bmatrix} A_S & B_S \newline F_S & C_S \end{bmatrix}\]
The full matrix \(S\) can be approximated using only \(m\) columns and \(m\) rows
\[ \hat{S} = \begin{bmatrix} A_S \newline F_S \end{bmatrix} A^+ \begin{bmatrix} A_S & B_S \end{bmatrix} \]
Linearized self-attention via the Nyström method
Comments
Bibliography
- Xiong, Yunyang, Zhanpeng Zeng, Rudrasis Chakraborty, Mingxing Tan, Glenn Fung, Yin Li, and Vikas Singh. February 7, 2021. "Nystr\\“Omformer: A Nystr\\“Om-Based Algorithm for Approximating Self-Attention". arXiv:2102.03902 [Cs]. http://arxiv.org/abs/2102.03902.