Thinking Like Transformers by Weiss, G., Goldberg, Y., & Yahav, E. (2021)

tags
NLP, Computer science
source
(Weiss et al. 2021)

Summary

This paper introduces the Restricted Access Sequence Processing Language (RASP), a programming language inspired by the way transformers process input data.

Data is represented as sequences, the structure that transformers manipulate (they were designed for NLP applications). The language has two kinds of internal data representation:

  • Sequence operators (s-ops) are functions that translate sequences into sequences. For example, the tokens operator is the identity, and the length operator maps every position to the length of the sequence, e.g. length(["h", "i"]) = [2, 2].
  • Selectors are functions that take two sequences and a boolean predicate applied to all pairs of input values. They return a matrix whose entries are the results of the predicate on those pairs: \[ \text{select}([1, 2, 3], [0, 3, 4], <) = \begin{bmatrix} \text{F} & \text{T} & \text{T}\newline \text{F} & \text{T} & \text{T}\newline \text{F} & \text{F} & \text{T} \end{bmatrix} \] Selectors are paired with aggregators, which convert a selector matrix and a sequence into a sequence: each output position averages the input values selected in the corresponding row (see the Python sketch just after this list): \[ \text{aggregate}(\begin{bmatrix} \text{F} & \text{T} & \text{T} \newline \text{F} & \text{T} & \text{T}\newline \text{F} & \text{F} & \text{T} \end{bmatrix} , [0, 12, 3]) = [7.5, 7.5, 3] \]
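A minimal Python sketch of these two primitives, reproducing the examples above. The function names mirror RASP, but the exact argument order and the behavior of aggregate on rows that select nothing (filled with 0 here) are my assumptions, not the paper's definition:

```python
import operator

def select(s1, s2, predicate):
    # Selection matrix: entry (i, j) = predicate(s1[i], s2[j]),
    # matching the example above, where the first sequence indexes the rows.
    return [[predicate(a, b) for b in s2] for a in s1]

def aggregate(selection, values):
    # For each row, average the values at the selected (True) positions.
    out = []
    for row in selection:
        picked = [v for keep, v in zip(row, values) if keep]
        out.append(sum(picked) / len(picked) if picked else 0)  # 0 for empty rows (assumption)
    return out

sel = select([1, 2, 3], [0, 3, 4], operator.lt)
# [[False, True, True], [False, True, True], [False, False, True]]
print(aggregate(sel, [0, 12, 3]))  # [7.5, 7.5, 3.0]
```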

Correspondence between RASP ops and transformer components:

  • Built-in s-ops like indices and tokens are similar to input embeddings of a transformer.
  • Element-wise s-ops correspond to the feed-forward components of transformers, which are regular neural networks applied position-wise.
  • Selection and aggregation are akin to attention in transformers.
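As a toy illustration of the selection/aggregation-as-attention analogy, the sketch below (my own example, reusing the select helper sketched earlier) reverses a sequence by letting each position attend to its mirror position:

```python
# Each position i "attends to" position n - 1 - i, which reverses the sequence.
# When a row selects exactly one position, aggregation simply copies that value.
tokens = ["h", "e", "y"]
n = len(tokens)
indices = list(range(n))

sel = select(indices, indices, lambda i, j: j == n - 1 - i)
reverse = [tokens[row.index(True)] for row in sel]
print(reverse)  # ['y', 'e', 'h']
```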

To evaluate how “good” this new language is for understanding and designing transformer architectures, the authors study three main questions:

  1. its ability to upper-bound the number of heads and layers required to solve a task,
  2. the tightness of that bound,
  3. the feasibility of realizing RASP solutions in actual transformers.

Bibliography

  1. . . "Thinking Like Transformers". Arxiv:2106.06981 [cs]. http://arxiv.org/abs/2106.06981.