BPE tokenizer


Disclaimer: This post has been translated into English using a machine translation model. Please let me know if you find any mistakes.

The BPE (Byte Pair Encoding) algorithm is a data compression technique used to create a subword vocabulary from a text corpus, based on how frequently pairs of bytes appear in the text. It gained popularity because it is used as the tokenizer by LLMs such as GPT, GPT-2, RoBERTa, BART, and DeBERTa.

Training Algorithm

Suppose we have a text corpus that contains only the words hug, pug, pun, bun, and hugs. The first step is to create a vocabulary with all the characters present in the corpus; in our case these are b, g, h, n, p, s, u.

	
corpus_words = ["hug", "pug", "pun", "bun", "hugs"]

# Concatenate all the words in the corpus
initial_corpus_tokens = ""
for word in corpus_words:
    initial_corpus_tokens += word

# Convert the concatenated string to a set of characters to get the unique tokens
initial_corpus_tokens = set(initial_corpus_tokens)

print(f"Corpus words: {corpus_words}")
print(f"Initial corpus tokens: {initial_corpus_tokens}")
print(f"Number of initial corpus tokens: {len(initial_corpus_tokens)}")

Corpus words: ['hug', 'pug', 'pun', 'bun', 'hugs']
Initial corpus tokens: {'p', 'n', 'u', 's', 'h', 'g', 'b'}
Number of initial corpus tokens: 7

Now let's assume this is our corpus of sentences; it's a made-up corpus, so it doesn't make much sense.

	
corpus = [
    "hug hug hug pun pun bun hugs",
    "hug hug pug pug pun pun hugs",
    "hug hug pug pug pun pun pun pun hugs",
    "pug pun pun pun bun hugs",
    "hug hug hug pun bun bun hugs",
]

Let's count how many times each word appears in the corpus, to check that what we set up above is correct.

	
num_hug = 0
num_pug = 0
num_pun = 0
num_bun = 0
num_hugs = 0
for sentence in corpus:
    words = sentence.split(" ")
    for word in words:
        if word == "hug":
            num_hug += 1
        elif word == "pug":
            num_pug += 1
        elif word == "pun":
            num_pun += 1
        elif word == "bun":
            num_bun += 1
        elif word == "hugs":
            num_hugs += 1

print(f"Number of hug: {num_hug}")
print(f"Number of pug: {num_pug}")
print(f"Number of pun: {num_pun}")
print(f"Number of bun: {num_bun}")
print(f"Number of hugs: {num_hugs}")

Number of hug: 10
Number of pug: 5
Number of pun: 12
Number of bun: 4
Number of hugs: 5

Everything we set up above checks out, so we can continue.

Let's create a dictionary with the tokens of each word and the number of times it appears in the corpus.

	
dict_tokens_by_word_appearance = {
    "hug": {
        "count": num_hug,
        "tokens": [character for character in "hug"],
    },
    "pug": {
        "count": num_pug,
        "tokens": [character for character in "pug"],
    },
    "pun": {
        "count": num_pun,
        "tokens": [character for character in "pun"],
    },
    "bun": {
        "count": num_bun,
        "tokens": [character for character in "bun"],
    },
    "hugs": {
        "count": num_hugs,
        "tokens": [character for character in "hugs"],
    },
}
dict_tokens_by_word_appearance

{'hug': {'count': 10, 'tokens': ['h', 'u', 'g']},
'pug': {'count': 5, 'tokens': ['p', 'u', 'g']},
'pun': {'count': 12, 'tokens': ['p', 'u', 'n']},
'bun': {'count': 4, 'tokens': ['b', 'u', 'n']},
'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's']}}

Now we are going to find the pair of consecutive tokens that appears most frequently in the dictionary.

	
dict_keys = dict_tokens_by_word_appearance.keys()
list_consecutive_tokens = []
for i, key in enumerate(dict_keys):
    # Get the number of tokens of the word
    number_of_tokens_of_word = len(dict_tokens_by_word_appearance[key]["tokens"])
    # Get consecutive tokens
    for j in range(number_of_tokens_of_word-1):
        # Get consecutive tokens
        consecutive_tokens = dict_tokens_by_word_appearance[key]["tokens"][j] + dict_tokens_by_word_appearance[key]["tokens"][j+1]
        # Append the consecutive tokens to the list as many times as the word appears
        for _ in range(dict_tokens_by_word_appearance[key]["count"]):
            list_consecutive_tokens.append(consecutive_tokens)

# Print the list of consecutive tokens
print(f"List of consecutive tokens: {list_consecutive_tokens}")

# Count how many times each pair of consecutive tokens appears
dict_consecutive_tokens = {}
for token in list_consecutive_tokens:
    # Check if the token is already in the dictionary
    if token in dict_consecutive_tokens:
        # Increment the count of the token
        dict_consecutive_tokens[token] += 1
    # If the token is not in the dictionary
    else:
        # Add the token to the dictionary
        dict_consecutive_tokens[token] = 1

# Print the dictionary of consecutive tokens
print(f"Dictionary of consecutive tokens: {dict_consecutive_tokens}")

# Get the consecutive token with maximum frequency
max_consecutive_token = None
while True:
    # Get the token with maximum frequency
    consecutive_token = max(dict_consecutive_tokens, key=dict_consecutive_tokens.get)
    # Check if the token is already in the vocabulary of corpus tokens
    if consecutive_token in initial_corpus_tokens:
        # Remove the token from the dictionary
        dict_consecutive_tokens.pop(consecutive_token)
    # If the token is not in the vocabulary of corpus tokens
    else:
        # Assign the token to max_consecutive_token
        max_consecutive_token = consecutive_token
        break

# Print the consecutive token with maximum frequency
print(f"Consecutive token with maximum frequency: {max_consecutive_token}")

List of consecutive tokens: ['hu', 'hu', 'hu', 'hu', 'hu', 'hu', 'hu', 'hu', 'hu', 'hu', 'ug', 'ug', 'ug', 'ug', 'ug', 'ug', 'ug', 'ug', 'ug', 'ug', 'pu', 'pu', 'pu', 'pu', 'pu', 'ug', 'ug', 'ug', 'ug', 'ug', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'pu', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'un', 'bu', 'bu', 'bu', 'bu', 'un', 'un', 'un', 'un', 'hu', 'hu', 'hu', 'hu', 'hu', 'ug', 'ug', 'ug', 'ug', 'ug', 'gs', 'gs', 'gs', 'gs', 'gs']
Dictionary of consecutive tokens: {'hu': 15, 'ug': 20, 'pu': 17, 'un': 16, 'bu': 4, 'gs': 5}
Consecutive token with maximum frequency: ug

We have obtained the pair of tokens that appears most frequently. Let's encapsulate this in a function, because we are going to use it several more times.

	
def get_consecutive_tokens_with_max_frequency(dict_tokens_by_word_appearance, list_corpus_tokens):
    dict_keys = dict_tokens_by_word_appearance.keys()
    list_consecutive_tokens = []
    for i, key in enumerate(dict_keys):
        # Get the number of tokens of the word
        number_of_tokens_of_word = len(dict_tokens_by_word_appearance[key]["tokens"])
        # Get consecutive tokens
        for j in range(number_of_tokens_of_word-1):
            # Get consecutive tokens
            consecutive_tokens = dict_tokens_by_word_appearance[key]["tokens"][j] + dict_tokens_by_word_appearance[key]["tokens"][j+1]
            # Append the consecutive tokens to the list as many times as the word appears
            for _ in range(dict_tokens_by_word_appearance[key]["count"]):
                list_consecutive_tokens.append(consecutive_tokens)

    # Count how many times each pair of consecutive tokens appears
    dict_consecutive_tokens = {}
    for token in list_consecutive_tokens:
        # Check if the token is already in the dictionary
        if token in dict_consecutive_tokens:
            # Increment the count of the token
            dict_consecutive_tokens[token] += 1
        # If the token is not in the dictionary
        else:
            # Add the token to the dictionary
            dict_consecutive_tokens[token] = 1

    # Get the consecutive token with maximum frequency
    # (returns None if every candidate pair is already in the vocabulary)
    max_consecutive_token = None
    while len(dict_consecutive_tokens) > 0:
        # Get the token with maximum frequency
        consecutive_token = max(dict_consecutive_tokens, key=dict_consecutive_tokens.get)
        # Check if the token is already in the vocabulary of corpus tokens
        if consecutive_token in list_corpus_tokens:
            # Remove the token from the dictionary
            dict_consecutive_tokens.pop(consecutive_token)
        # If the token is not in the vocabulary of corpus tokens
        else:
            # Assign the token to max_consecutive_token
            max_consecutive_token = consecutive_token
            break
    return max_consecutive_token

We check that we get the same result as before.

	
max_consecutive_token = get_consecutive_tokens_with_max_frequency(dict_tokens_by_word_appearance, initial_corpus_tokens)
print(f"Consecutive token with maximum frequency: {max_consecutive_token}")

Consecutive token with maximum frequency: ug

We see that we do.

Now our token corpus can be updated by adding the ug token.

	
# new_corpus_tokens = initial_corpus_tokens + max_consecutive_token
new_corpus_tokens = initial_corpus_tokens.copy()
new_corpus_tokens.add(max_consecutive_token)
print(f"Initial corpus tokens: {initial_corpus_tokens}")
print(f"New corpus tokens: {new_corpus_tokens}")

Initial corpus tokens: {'p', 'n', 'u', 's', 'h', 'g', 'b'}
New corpus tokens: {'p', 'n', 'ug', 'g', 'b', 'u', 's', 'h'}

We also put this in a function.

	
def get_new_corpus_tokens(max_consecutive_token, initial_corpus_tokens):
    new_corpus_tokens = initial_corpus_tokens.copy()
    new_corpus_tokens.add(max_consecutive_token)
    return new_corpus_tokens

We check again that we get the same result as before.

	
new_corpus_tokens = get_new_corpus_tokens(max_consecutive_token, initial_corpus_tokens)
print(f"Initial corpus tokens: {initial_corpus_tokens}")
print(f"New corpus tokens: {new_corpus_tokens}")

Initial corpus tokens: {'p', 'n', 'u', 's', 'h', 'g', 'b'}
New corpus tokens: {'p', 'n', 'ug', 'g', 'b', 'u', 's', 'h'}

We see that we do.

Now we are going to update the dictionary that lists the words, their tokens, and the number of times they appear, adding the new token.

	
import copy

dict_keys = dict_tokens_by_word_appearance.keys()
dict_tokens_by_word_appearance_tmp = copy.deepcopy(dict_tokens_by_word_appearance)
for key in dict_keys:
    # Check if the new token is in the word
    if max_consecutive_token in key:
        print(f"Token {max_consecutive_token} is in the word {key}")
        # Add the new token to the word tokens
        dict_tokens_by_word_appearance_tmp[key]["tokens"].append(max_consecutive_token)
        print(f"New tokens of the word {key}: {dict_tokens_by_word_appearance_tmp[key]['tokens']}")

print(f"Initial tokens by word appearance: {dict_tokens_by_word_appearance}")
print(f"New tokens by word appearance: ")
dict_tokens_by_word_appearance_tmp

Token ug is in the word hug
New tokens of the word hug: ['h', 'u', 'g', 'ug']
Token ug is in the word pug
New tokens of the word pug: ['p', 'u', 'g', 'ug']
Token ug is in the word hugs
New tokens of the word hugs: ['h', 'u', 'g', 's', 'ug']
Initial tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's']}}
New tokens by word appearance:
	
{'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug']},
'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug']},
'pun': {'count': 12, 'tokens': ['p', 'u', 'n']},
'bun': {'count': 4, 'tokens': ['b', 'u', 'n']},
'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug']}}

We put this into a function.

	
def update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token):
    dict_tokens_by_word_appearance_tmp = copy.deepcopy(dict_tokens_by_word_appearance)
    dict_keys = dict_tokens_by_word_appearance_tmp.keys()
    for key in dict_keys:
        # Check if the new token is in the word
        if max_consecutive_token in key:
            # Add the new token to the word tokens
            dict_tokens_by_word_appearance_tmp[key]["tokens"].append(max_consecutive_token)
    return dict_tokens_by_word_appearance_tmp

We check that it is correct.

	
dict_tokens_by_word_appearance = update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token)
print(f"New tokens by word appearance: ")
dict_tokens_by_word_appearance

New tokens by word appearance:
	
{'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug']},
'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug']},
'pun': {'count': 12, 'tokens': ['p', 'u', 'n']},
'bun': {'count': 4, 'tokens': ['b', 'u', 'n']},
'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug']}}

In summary, in the first iteration we have gone from the token corpus s, g, h, u, n, p, b to the new token corpus h, u, n, p, s, g, b, ug.

We now perform a second iteration, obtaining the pair of consecutive tokens that appears most frequently in the dictionary.

	
max_consecutive_token = get_consecutive_tokens_with_max_frequency(dict_tokens_by_word_appearance, new_corpus_tokens)
print(f"Consecutive token with maximum frequency: {max_consecutive_token}")

Consecutive token with maximum frequency: pu

We obtain the new token corpus.

	
corpus_tokens = get_new_corpus_tokens(max_consecutive_token, new_corpus_tokens)
print(f"Initial corpus tokens: {new_corpus_tokens}")
print(f"New corpus tokens: {corpus_tokens}")

Initial corpus tokens: {'p', 'n', 'ug', 'g', 'b', 'u', 's', 'h'}
New corpus tokens: {'p', 'n', 'pu', 'u', 's', 'h', 'ug', 'g', 'b'}

And we get the new dictionary in which the words, the tokens, and the number of times they appear are listed.

	
dict_tokens_by_word_appearance = update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token)
print(f"New tokens by word appearance: ")
dict_tokens_by_word_appearance

New tokens by word appearance:
	
{'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug']},
'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']},
'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu']},
'bun': {'count': 4, 'tokens': ['b', 'u', 'n']},
'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug']}}

Now we can continue until we have a token corpus of the size we want. Let's create a corpus of 15 tokens.

	
len_corpus_tokens = 15
while len(corpus_tokens) < len_corpus_tokens:
    max_consecutive_token = get_consecutive_tokens_with_max_frequency(dict_tokens_by_word_appearance, corpus_tokens)
    print(f"Consecutive token with maximum frequency: {max_consecutive_token}")
    # If there are no more consecutive tokens, break the loop
    if max_consecutive_token is None:
        break
    corpus_tokens = get_new_corpus_tokens(max_consecutive_token, corpus_tokens)
    print(f"New corpus tokens: {corpus_tokens}")
    dict_tokens_by_word_appearance = update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token)
    print(f"New tokens by word appearance: {dict_tokens_by_word_appearance}")

Consecutive token with maximum frequency: un
New corpus tokens: {'p', 'n', 'pu', 'un', 'u', 's', 'h', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug']}}
Consecutive token with maximum frequency: hu
New corpus tokens: {'p', 'hu', 'n', 'pu', 'un', 'u', 's', 'h', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug', 'hu']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug', 'hu']}}
Consecutive token with maximum frequency: gug
New corpus tokens: {'p', 'hu', 'n', 'pu', 'un', 'gug', 'u', 's', 'h', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug', 'hu']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug', 'hu']}}
Consecutive token with maximum frequency: ughu
New corpus tokens: {'p', 'hu', 'n', 'pu', 'un', 'gug', 'u', 's', 'h', 'ughu', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug', 'hu']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug', 'hu']}}
Consecutive token with maximum frequency: npu
New corpus tokens: {'p', 'hu', 'n', 'npu', 'pu', 'un', 'gug', 'u', 's', 'h', 'ughu', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug', 'hu']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug', 'hu']}}
Consecutive token with maximum frequency: puun
New corpus tokens: {'p', 'hu', 'n', 'npu', 'pu', 'un', 'gug', 'puun', 'u', 's', 'h', 'ughu', 'ug', 'g', 'b'}
New tokens by word appearance: {'hug': {'count': 10, 'tokens': ['h', 'u', 'g', 'ug', 'hu']}, 'pug': {'count': 5, 'tokens': ['p', 'u', 'g', 'ug', 'pu']}, 'pun': {'count': 12, 'tokens': ['p', 'u', 'n', 'pu', 'un']}, 'bun': {'count': 4, 'tokens': ['b', 'u', 'n', 'un']}, 'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's', 'ug', 'hu']}}

Now that we have seen how the BPE tokenizer is trained, let's train it from scratch to reinforce our knowledge.

	
corpus_words = ["hug", "pug", "pun", "bun", "hugs"]

# Concatenate all the words in the corpus
initial_corpus_tokens = ""
for word in corpus_words:
    initial_corpus_tokens += word

# Convert the concatenated string to a set of characters to get the unique tokens
corpus_tokens = set(initial_corpus_tokens)

print(f"Corpus words: {corpus_words}")
print(f"Initial corpus tokens: {corpus_tokens}")
print(f"Number of initial corpus tokens: {len(corpus_tokens)}")

Corpus words: ['hug', 'pug', 'pun', 'bun', 'hugs']
Initial corpus tokens: {'p', 'n', 'u', 's', 'h', 'g', 'b'}
Number of initial corpus tokens: 7
	
corpus = [
    "hug hug hug pun pun bun hugs",
    "hug hug pug pug pun pun hugs",
    "hug hug pug pug pun pun pun pun hugs",
    "pug pun pun pun bun hugs",
    "hug hug hug pun bun bun hugs",
]

num_hug = 0
num_pug = 0
num_pun = 0
num_bun = 0
num_hugs = 0
for sentence in corpus:
    words = sentence.split(" ")
    for word in words:
        if word == "hug":
            num_hug += 1
        elif word == "pug":
            num_pug += 1
        elif word == "pun":
            num_pun += 1
        elif word == "bun":
            num_bun += 1
        elif word == "hugs":
            num_hugs += 1

dict_tokens_by_word_appearance = {
    "hug": {
        "count": num_hug,
        "tokens": [character for character in "hug"],
    },
    "pug": {
        "count": num_pug,
        "tokens": [character for character in "pug"],
    },
    "pun": {
        "count": num_pun,
        "tokens": [character for character in "pun"],
    },
    "bun": {
        "count": num_bun,
        "tokens": [character for character in "bun"],
    },
    "hugs": {
        "count": num_hugs,
        "tokens": [character for character in "hugs"],
    },
}
dict_tokens_by_word_appearance

{'hug': {'count': 10, 'tokens': ['h', 'u', 'g']},
'pug': {'count': 5, 'tokens': ['p', 'u', 'g']},
'pun': {'count': 12, 'tokens': ['p', 'u', 'n']},
'bun': {'count': 4, 'tokens': ['b', 'u', 'n']},
'hugs': {'count': 5, 'tokens': ['h', 'u', 'g', 's']}}

We train it from scratch until we obtain a corpus of 15 tokens.

	
len_corpus_tokens = 15
print(f"Initial corpus tokens: ({len(corpus_tokens)}) {corpus_tokens}")
while len(corpus_tokens) < len_corpus_tokens:
    max_consecutive_token = get_consecutive_tokens_with_max_frequency(dict_tokens_by_word_appearance, corpus_tokens)
    # If there are no more consecutive tokens, break the loop
    if max_consecutive_token is None:
        break
    corpus_tokens = get_new_corpus_tokens(max_consecutive_token, corpus_tokens)
    dict_tokens_by_word_appearance = update_tokens_by_word_appearance(dict_tokens_by_word_appearance, max_consecutive_token)
print(f"New corpus tokens: ({len(corpus_tokens)}) {corpus_tokens}")

Initial corpus tokens: (7) {'p', 'n', 'u', 's', 'h', 'g', 'b'}
New corpus tokens: (15) {'p', 'hu', 'n', 'npu', 'pu', 'un', 'gug', 'puun', 'u', 's', 'h', 'ughu', 'ug', 'g', 'b'}
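
As a side note, for real corpora you wouldn't normally implement this training loop by hand. Below is a minimal sketch of training a BPE tokenizer on the same toy corpus with the Hugging Face tokenizers library (this is an assumption on my part that you have it installed, and the merges it learns may differ from our manual implementation):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

corpus = [
    "hug hug hug pun pun bun hugs",
    "hug hug pug pug pun pun hugs",
    "hug hug pug pug pun pun pun pun hugs",
    "pug pun pun pun bun hugs",
    "hug hug hug pun bun bun hugs",
]

# BPE model with an unknown token for characters never seen during training
tokenizer = Tokenizer(BPE(unk_token="<UNK>"))
# Split the sentences into words on whitespace before learning the merges
tokenizer.pre_tokenizer = Whitespace()
# Stop when the vocabulary reaches the desired size
trainer = BpeTrainer(vocab_size=15, special_tokens=["<UNK>"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

print(tokenizer.get_vocab())
print(tokenizer.encode("bug").tokens)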

Tokenization

If we now wanted to tokenize, we would first have to create a vocabulary, that is, assign an ID to each token.

	
vocab = {}
for i, token in enumerate(corpus_tokens):
    vocab[token] = i
print(f"Vocabulary: ")
vocab

Vocabulary:
	
{'p': 0,
'hu': 1,
'sug': 2,
'npu': 3,
'ugpu': 4,
'gug': 5,
'u': 6,
'ug': 7,
'ughu': 8,
'n': 9,
'pu': 10,
'un': 11,
'puun': 12,
's': 13,
'h': 14,
'gs': 15,
'g': 16,
'b': 17}

We put it in a function.

	
def get_vocabulary(corpus_tokens):
    vocab = {}
    for i, token in enumerate(corpus_tokens):
        vocab[token] = i
    return vocab

We check that it is correct.

	
vocab = get_vocabulary(corpus_tokens)
print(f"Vocabulary: ")
vocab

Vocabulary:
	
{'p': 0,
'hu': 1,
'sug': 2,
'npu': 3,
'ugpu': 4,
'gug': 5,
'u': 6,
'ug': 7,
'ughu': 8,
'n': 9,
'pu': 10,
'un': 11,
'puun': 12,
's': 13,
'h': 14,
'gs': 15,
'g': 16,
'b': 17}

If we now want to tokenize the word bug, we can do the following:

	
word = 'bug'
# Get the maximum length of the tokens in the vocabulary
max_len = max(len(token) for token in vocab)
print(f"Maximum length of tokens: {max_len}")
# Create an empty list of tokens
tokens = []
while len(word) > 0:
    # Flag to check if a token is found
    found = False
    # Iterate over the possible prefix lengths, from max_len down to 1
    for i in range(max_len, 0, -1):
        # Get the prefix of the word
        prefix = word[:i]
        print(f"Prefix: {prefix}")
        # Check if the prefix is in the vocabulary
        if prefix in vocab:
            print(f"prefix {prefix} is in the vocabulary")
            tokens.append(prefix)
            word = word[i:]
            found = True
            break
    # if not found:
    #     tokens.append('<UNK>')
    #     word = word[1:]
print(f"Tokens: {tokens}")

Maximum length of tokens: 4
Prefix: bug
Prefix: bug
Prefix: bu
Prefix: b
prefix b is in the vocabulary
Prefix: ug
prefix ug is in the vocabulary
Tokens: ['b', 'ug']

But if we now try to tokenize the word mug, we run into a problem: the character m is not in the vocabulary, so we tokenize it with the <UNK> token.

	
word = 'mug'
# Get the maximum length of the tokens in the vocabulary
max_len = max(len(token) for token in vocab)
print(f"Maximum length of tokens: {max_len}")
# Create an empty list of tokens
tokens = []
while len(word) > 0:
    # Flag to check if a token is found
    found = False
    # Iterate over the possible prefix lengths, from max_len down to 1
    for i in range(max_len, 0, -1):
        # Get the prefix of the word
        prefix = word[:i]
        print(f"Prefix: {prefix}")
        # Check if the prefix is in the vocabulary
        if prefix in vocab:
            print(f"prefix {prefix} is in the vocabulary")
            tokens.append(prefix)
            word = word[i:]
            found = True
            break
    if not found:
        tokens.append('<UNK>')
        word = word[1:]
print(f"Tokens: {tokens}")

Maximum length of tokens: 4
Prefix: mug
Prefix: mug
Prefix: mu
Prefix: m
Prefix: ug
prefix ug is in the vocabulary
Tokens: ['<UNK>', 'ug']

We put it in a function.

	
def tokenize_word(word, vocab):
    # Get the maximum length of the tokens in the vocabulary
    max_len = max(len(token) for token in vocab)
    # Create an empty list of tokens
    tokens = []
    while len(word) > 0:
        # Flag to check if a token is found
        found = False
        # Iterate over the possible prefix lengths, from max_len down to 1
        for i in range(max_len, 0, -1):
            # Get the prefix of the word
            prefix = word[:i]
            # Check if the prefix is in the vocabulary
            if prefix in vocab:
                tokens.append(prefix)
                word = word[i:]
                found = True
                break
        if not found:
            tokens.append('<UNK>')
            word = word[1:]
    return tokens

We check that it is correct.

	
print(f"Tokenization of the word 'bug': {tokenize_word('bug', vocab)}")
print(f"Tokenization of the word 'mug': {tokenize_word('mug', vocab)}")

Tokenization of the word 'bug': ['b', 'ug']
Tokenization of the word 'mug': ['<UNK>', 'ug']
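
If we wanted token IDs instead of token strings, we could look each token up in the vocabulary we built earlier. Here is a small sketch (the encode_word helper and the -1 placeholder for unknown tokens are my own additions, not part of the code above):

def encode_word(word, vocab):
    # Tokenize the word and map each token to its ID in the vocabulary;
    # tokens that are not in the vocabulary (such as '<UNK>') get -1 as a placeholder ID
    return [vocab.get(token, -1) for token in tokenize_word(word, vocab)]

print(f"Encoding of the word 'bug': {encode_word('bug', vocab)}")
print(f"Encoding of the word 'mug': {encode_word('mug', vocab)}")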

Token Viewer

Now that we know how a BPE tokenizer works, let's use the the-tokenizer-playground visualizer to see what the tokens of any sentence look like.
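
We can also inspect this programmatically. For example, here is a quick sketch (assuming the transformers library is installed) that prints the BPE tokens and IDs that GPT-2 produces for a sentence:

from transformers import AutoTokenizer

# Load GPT-2's pretrained BPE tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

sentence = "hug pug pun bun hugs"
# Subword tokens and their IDs for the sentence
print(tokenizer.tokenize(sentence))
print(tokenizer(sentence)["input_ids"])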
