- tags
- Neural networks

Self-attention computes, for each position in a sequence, a weighted
average of all input elements, with weights proportional to a
similarity score between their representations. The input \(x \in \mathbb{R}^{L \times
F}\) is projected by matrices \(W_Q \in \mathbb{R}^{F \times D}\),
\(W_K \in \mathbb{R}^{F\times D}\) and \(W_V \in
\mathbb{R}^{F\times M}\) to representations \(Q\)
(*queries*), \(K\) (*keys*) and \(V\)
(*values*).

\[ Q = xW_Q, \qquad K = xW_K, \qquad V = xW_V. \]

The output for all positions in the sequence \(x\) is written

\[A(x) = V' = \text{softmax}\left( \dfrac{QK^{T}}{\sqrt{D}} \right) V.\]

The softmax is applied row-wise in the equation above.
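
A minimal NumPy sketch of the computation above; the projection matrices and the dimensions \(L\), \(F\), \(D\), \(M\) used in the usage example are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def self_attention(x, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over a sequence x of shape (L, F).

    W_Q, W_K have shape (F, D); W_V has shape (F, M).
    Returns the attended values, shape (L, M).
    """
    Q = x @ W_Q   # (L, D) queries
    K = x @ W_K   # (L, D) keys
    V = x @ W_V   # (L, M) values

    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)   # (L, L) similarity scores
    # Row-wise softmax: each query position gets a distribution over all keys.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V              # weighted average of the values

# Hypothetical shapes for illustration: L=5 positions, F=8 features, D=M=4.
rng = np.random.default_rng(0)
L, F, D, M = 5, 8, 4, 4
x = rng.normal(size=(L, F))
out = self_attention(x,
                     rng.normal(size=(F, D)),
                     rng.normal(size=(F, D)),
                     rng.normal(size=(F, M)))
print(out.shape)  # (5, 4)
```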

## Backlinks

- Transformers
- Notes on: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention by Katharopoulos, A., Vyas, A., Pappas, N., & Fleuret, F. (2020)
- Notes on: Hopfield Networks is All You Need by Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Gruber, L., Holzleitner, M., … (2020)