# Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention by Xiong, Y., Zeng, Z., Chakraborty, R., Tan, M., Fung, G., Li, Y., & Singh, V. (2021)

tags
Transformers
source
(Xiong et al. 2021)

## Summary

This paper describes a way of applying the Nyström method for approximating matrix multiplication to transformers. More precisely, the approximation is used in the self-attention mechanism’s softmax calculation.

This approximation adresses one of the biggest downside of attention: its computational complexity. The authors claim that their method reduces it from $$O(n^2)$$ to $$O(n)$$.

The goal of the method is to efficiently approximate the matrix

\begin{align*} S = \text{softmax}\left(\dfrac{QK^T}{\sqrt{d_q}}\right) \end{align*}

### Why the standard Nyström method won’t work

To apply matrix approximation to this problem, one would start by writing $$S$$ as

$S = \begin{bmatrix} A_S & B_S \newline F_S & C_S \end{bmatrix}$

The full matrix $$S$$ can be approximated using only $$m$$ columns and $$m$$ rows

$\hat{S} = \begin{bmatrix} A_S \newline F_S \end{bmatrix} A^+ \begin{bmatrix} A_S & B_S \end{bmatrix}$