The process of byte-pair encoding can be summarized as follows:
- Start with each character as its own token
- Find the pair of adjacent tokens that occurs most often
- Create a new token that encodes that pair
- Repeat the process until the target vocabulary size is reached
The output of this process is both a vocabulary and an ordered list of merge rules, which can then be used to tokenize new data.
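
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production tokenizer: the names `train_bpe`, `merge_pair`, and `target_vocab_size` are hypothetical, the input is a plain list of words, and ties between equally frequent pairs are broken by whichever pair was counted first.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return tuple(merged)


def train_bpe(words, target_vocab_size):
    """Learn a vocabulary and an ordered list of merge rules (toy sketch)."""
    # Step 1: each word starts as a sequence of single-character tokens.
    corpus = Counter(tuple(word) for word in words)
    vocab = {ch for tokens in corpus for ch in tokens}
    merges = []

    while len(vocab) < target_vocab_size:
        # Step 2: count adjacent token pairs, weighted by word frequency.
        pair_counts = Counter()
        for tokens, freq in corpus.items():
            for pair in zip(tokens, tokens[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is a single token; nothing left to merge

        # Step 3: the most frequent pair becomes a new token.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab.add(best[0] + best[1])

        # Rewrite the corpus with the new token before repeating (step 4).
        new_corpus = Counter()
        for tokens, freq in corpus.items():
            new_corpus[merge_pair(tokens, best)] += freq
        corpus = new_corpus

    return vocab, merges


vocab, merges = train_bpe(["hug", "hug", "pug", "hugs"], target_vocab_size=7)
print(merges)  # [('u', 'g'), ('h', 'ug')]
```

Real implementations typically operate on bytes rather than characters and pre-split text on whitespace or punctuation; those details are omitted here to keep the loop readable.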
This technique has many advantages:
- It is computationally inexpensive
- It can handle previously unseen words by splitting them into known subword tokens, and a model can make reasonable predictions about such words when those subwords carry semantic information about them.
For these reasons, byte-pair encoding is currently the dominant tokenization method for transformer architectures.
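
To illustrate how merge rules handle an unseen word, the sketch below applies a small, hand-written rule list to new input. The rules in `merges` and the helper names `apply_merge` and `encode` are assumptions made for this example; a real tokenizer would load rules learned from a large corpus.

```python
def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def encode(word, merges):
    """Tokenize `word` by applying merge rules in the order they were learned."""
    tokens = list(word)  # start from single characters
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Hypothetical rules, as if learned from the words "hug", "pug", and "hugs".
merges = [("u", "g"), ("h", "ug")]

print(encode("hugs", merges))  # ['hug', 's']
print(encode("pugs", merges))  # ['p', 'ug', 's']
```

Here `pugs` is never seen as a whole word during training, yet it is still split into tokens that are all in the vocabulary, including the shared subword `ug`.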