Byte-pair encoding

tags
NLP

The process of byte-pair encoding can be summarized as follow:

  • Each character is a token
  • Find pairs that occur most often
  • Create a new token that encoded those common pairs
  • Repeat the process until target vocabulary size is reached

The output of this process is both a vocabulary and a set of merging rules for tokens to be used to process more data.

This technique has several advantages:

  • It is inexpensive
  • It can deal with previously unseen words and make reasonable predictions about them if the token matches semantic information about the word.

For these reasons, this encoding method is currently the dominant one for transformers architecture for example.

Minimal example in Python

Here is a very simple Python snippet to apply a BPE pass on a sequence of tokens.

from collections import counter
from typing import Sequence, Tuple

# An encoding step will reduce the most common pair to a single token and returns
# the modified "dataset" along with the vocabulary.
def bp_encode_step(words: Sequence[str]) -> Tuple[list[str], set[str]]:
    """This iteratively creates a new list of tokens where the most common pair
    has been merged into a single token."""
    pair_count = counter(zip(words[:-1], words[1:]))
    most_common_pair = pair_count.most_common(1)[0][0]
    output = []
    n = 0
    while n < len(words):
        w = words[n]

        if (
            n < len(words) - 1
            and w == most_common_pair[0]
            and words[n + 1] == most_common_pair[1]
        ):
            output.append(w + words[n + 1])
            n += 1
        else:
            output.append(w)
        n += 1
    return output, set(output)

We apply the encoding function to a fake dataset for several steps:

# The "dataset" we will be using
words = "This is the first time I experienced this kind of thing. \
Let's try this thing again."

for iteration in range(24):
    words, vocab = bp_encode_step(words)
    if iteration % 4 == 0:
        print(f"Step {iteration} vocab size is {len(vocab)},",
              "tokenized sentence:", "_".join(words), "\n")

Results

The various tokens are separated by the symbol _ in the results:

Step 0 vocab size is 25, tokenized sentence: T_h_i_s_ _i_s_ t_h_e_ _f_i_r_s_t_ t_i_m_e_ _I_ _e_x_p_e_r_i_e_n_c_e_d_ t_h_i_s_ _k_i_n_d_ _o_f_ t_h_i_n_g_._ _L_e_t_'_s_ t_r_y_ t_h_i_s_ t_h_i_n_g_ _a_g_a_i_n_.
Step 4 vocab size is 29, tokenized sentence: T_hi_s _i_s t_h_e_ _f_i_r_s_t_ t_i_m_e_ _I_ _e_x_p_e_r_i_e_n_c_e_d_ thi_s _k_i_n_d_ _o_f_ thi_n_g_._ _L_e_t_'_s t_r_y_ thi_s_ thi_n_g_ _a_g_a_i_n_.
Step 8 vocab size is 32, tokenized sentence: T_hi_s _i_s t_h_e _f_i_r_s_t_ t_i_m_e _I_ _e_x_p_e_r_i_e_n_c_e_d_ thi_s _k_in_d_ _o_f_ thing_._ _L_e_t_'_s t_r_y_ thi_s_ thing_ _a_g_a_in_.
Step 12 vocab size is 31, tokenized sentence: This is t_h_e _f_i_r_s_t_ t_i_m_e _I_ _e_x_p_e_r_i_e_n_c_e_d_ thi_s _k_in_d_ _o_f_ thing_._ _L_e_t_'_s t_r_y_ thi_s_ thing_ _a_g_a_in_.
Step 16 vocab size is 30, tokenized sentence: This is the fi_r_s_t_ t_i_m_e _I_ _e_x_p_e_r_i_e_n_c_e_d_ thi_s _k_in_d_ _o_f_ thing_._ _L_e_t_'_s t_r_y_ thi_s_ thing_ _a_g_a_in_.
Step 20 vocab size is 29, tokenized sentence: This is the first t_i_m_e _I_ _e_x_p_e_r_i_e_n_c_e_d_ thi_s _k_in_d_ _o_f_ thing_._ _L_e_t_'_s t_r_y_ thi_s_ thing_ _a_g_a_in_.

On such a small “dataset”, the benefits of BPE are not apparent. The vocabulary quickly ends up with pairs of tokens appearing only once for which there isn’t anything more to do.

Notice how from step 8, the very common word this is encoded as a single token.

Links to this note

Comments


← Back to Notes