# Attention

Self-attention computes, for each position in a sequence, a weighted average over all input elements, where each weight is derived from a similarity score between representations. The input $$x \in \mathbb{R}^{L \times F}$$, a sequence of $$L$$ elements with $$F$$ features each, is projected by matrices $$W_Q \in \mathbb{R}^{F \times D}$$, $$W_K \in \mathbb{R}^{F\times D}$$ and $$W_V \in \mathbb{R}^{F\times M}$$ to representations $$Q$$ (queries), $$K$$ (keys) and $$V$$ (values).

$Q = xW_Q, \quad K = xW_K, \quad V = xW_V.$

The output for all positions in a sequence $$x$$ is written

$A(x) = V' = \text{softmax}\left( \dfrac{QK^{T}}{\sqrt{D}} \right) V.$

The softmax is applied row-wise in the equation above.
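The computation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it uses plain nested lists to stay dependency-free (in practice an array library would be used), and the helper names `matmul`, `softmax_rows` and `self_attention`, as well as the toy matrices, are assumptions made for this example.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(scores):
    """Apply the softmax independently to each row."""
    out = []
    for row in scores:
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(x, w_q, w_k, w_v):
    """Return (output, weights) for an input x of shape L x F."""
    q = matmul(x, w_q)  # queries, L x D
    k = matmul(x, w_k)  # keys,    L x D
    v = matmul(x, w_v)  # values,  L x M
    d = len(w_q[0])     # D, the query/key dimension
    k_t = [list(col) for col in zip(*k)]  # K^T, D x L
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]  # L x L
    weights = softmax_rows(scores)  # each row sums to 1
    return matmul(weights, v), weights  # output is L x M


# Toy example (hypothetical values): L = 3, F = 2, D = 2, M = 2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[1.0, 0.0], [0.0, 1.0]]
w_v = [[1.0, 2.0], [3.0, 4.0]]

out, weights = self_attention(x, w_q, w_k, w_v)
```

Row $$i$$ of `weights` holds the attention weights that position $$i$$ assigns to every position in the sequence, and row $$i$$ of `out` is the corresponding weighted average of the rows of $$V$$.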

