- tags: Neural networks

## Implementation

Self-attention computes a weighted average of all input elements of a sequence, with weights given by a softmax over similarity scores between learned representations. The input \(x \in \mathbb{R}^{L \times F}\) is projected by matrices \(W_Q \in \mathbb{R}^{F \times D}\), \(W_K \in \mathbb{R}^{F\times D}\) and \(W_V \in \mathbb{R}^{F\times M}\) to representations \(Q\) (*queries*), \(K\) (*keys*) and \(V\) (*values*):

\[ Q = xW_Q, \qquad K = xW_K, \qquad V = xW_V. \]

The output for all positions in the sequence \(x\) is written

\[A(x) = V' = \text{softmax}\left( \dfrac{QK^{T}}{\sqrt{D}} \right) V.\]

The softmax is applied row-wise in the equation above.
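
To make the shapes concrete, here is a minimal NumPy sketch of these equations (a single head, no masking or batching; the function names are my own choices, not from any particular library):

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_Q, W_K, W_V):
    """Compute A(x) = softmax(Q K^T / sqrt(D)) V for one sequence.

    x:   (L, F) input sequence
    W_Q: (F, D), W_K: (F, D), W_V: (F, M) projection matrices
    Returns V' of shape (L, M).
    """
    Q = x @ W_Q                          # (L, D) queries
    K = x @ W_K                          # (L, D) keys
    V = x @ W_V                          # (L, M) values
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)        # (L, L) similarity scores
    weights = softmax(scores, axis=-1)   # row-wise softmax
    return weights @ V                   # (L, M) weighted average of values
```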

## Possible interpretation

Keys and queries have a relatively simple interpretation. The keys are embeddings of tokens that expose some useful information about them: the key \(K_3\) associated with `cat` should probably encode the fact that it is a noun, that it refers to a living entity, an animal, and so on. The key \(K_2\), on the other hand, encodes the fact that `pretty` is an adjective, used to say something positive about the subject's appearance. That key is probably close to the keys for `beautiful` and `nice`.

The query encodes another type of information: what kinds of keys would be useful for that particular token. For the query \(Q_3\), it is probably useful to attend to any adjective-like key that could reveal something interesting about the current word. The corresponding entries of \(\text{softmax}\left( \dfrac{QK^{T}}{\sqrt{D}} \right)\) will therefore be larger, and those positions will contribute more heavily to the resulting vector \(V'\). In a graph of attention weights, this shows up as heavier edges between `cat` and the adjective-like tokens.
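
As a toy illustration (the sentence and all numbers below are invented for the example), a query whose score is highest against the key of an adjective receives most of the softmax mass in its row:

```python
import numpy as np

# Hypothetical scores QK^T / sqrt(D) for a three-token sequence such as
# `a pretty cat` (indices 1, 2, 3 in the text; 0, 1, 2 here).
scores = np.array([[2.0, 0.1, 0.1],
                   [0.1, 2.0, 0.1],
                   [0.1, 3.0, 1.0]])   # row 2: the query for `cat`

# Row-wise softmax turns scores into attention weights.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

print(weights[2].round(2))  # [0.05 0.84 0.11]: `cat` attends mostly to `pretty`
```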

## Is it necessary?

Maybe not (Zhai et al. 2021).

## Bibliography

- Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, and Josh Susskind. 2021. "An Attention Free Transformer". *arXiv:2105.14103 [cs]*. http://arxiv.org/abs/2105.14103.