
## Implementation

Self-attention computes a weighted average of all input elements of
a sequence, with weights proportional to a similarity score
between representations. The input \(x \in \mathbb{R}^{L \times
F}\) is projected by matrices \(W_Q \in \mathbb{R}^{F \times D}\),
\(W_K \in \mathbb{R}^{F\times D}\) and \(W_V \in
\mathbb{R}^{F\times M}\) to representations \(Q\)
(*queries*), \(K\) (*keys*) and \(V\)
(*values*).

\[ Q = xW_Q\] \[ K = xW_K\] \[ V = xW_V\]
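These projections are plain matrix products. A minimal NumPy sketch (the sizes and variable names are illustrative, not from the original):

```python
import numpy as np

L, F, D, M = 10, 16, 8, 8          # illustrative sequence length and feature sizes
rng = np.random.default_rng(0)

x = rng.normal(size=(L, F))        # input sequence, one row per position
W_Q = rng.normal(size=(F, D))      # query projection
W_K = rng.normal(size=(F, D))      # key projection
W_V = rng.normal(size=(F, M))      # value projection

Q, K, V = x @ W_Q, x @ W_K, x @ W_V
print(Q.shape, K.shape, V.shape)   # (10, 8) (10, 8) (10, 8)
```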

The output for all positions in the sequence \(x\) is written

\[A(x) = V' = \text{softmax}\left( \dfrac{QK^{T}}{\sqrt{D}} \right) V.\]

The softmax is applied row-wise in the equation above.
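The whole equation can be sketched directly in NumPy. This is an illustrative implementation of the formula above, including the row-wise softmax (the max subtraction is a standard numerical-stability trick, not part of the equation):

```python
import numpy as np

def attention(x, W_Q, W_K, W_V):
    """Softmax self-attention: A(x) = softmax(Q K^T / sqrt(D)) V."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)                # (L, L) scaled similarity scores
    scores -= scores.max(axis=1, keepdims=True)  # stabilize the exponentials
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)            # row-wise softmax
    return A @ V                                 # weighted average of the values

# Illustrative sizes, matching the shapes in the text.
L, F, D, M = 10, 16, 8, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(L, F))
out = attention(x,
                rng.normal(size=(F, D)),
                rng.normal(size=(F, D)),
                rng.normal(size=(F, M)))
print(out.shape)  # (10, 8): one M-dimensional output per position
```

Each output row is a convex combination of the rows of \(V\), with the combination weights given by the corresponding softmax row.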