Tokens


Disclaimer: This post has been translated to English using a machine translation model. Please let me know if you find any mistakes.

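Now that LLMs are on the rise, we keep hearing about the number of tokens each model supports, but what are tokens? They are the smallest units of text that a model works with: a token can be a whole word, a piece of a word, a punctuation mark, or even a fragment of a single character.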

To explain what tokens are, let's first look at a practical example using the OpenAI tokenizer, called tiktoken.

So, first we install the package:

pip install tiktoken

Once installed, we create a tokenizer using the cl100k_base encoding, which, according to the example notebook How to count tokens with tiktoken, is used by the models gpt-4, gpt-3.5-turbo and text-embedding-ada-002.

	
import tiktoken
encoder = tiktoken.get_encoding("cl100k_base")

Now we create an example word to tokenize:

	
example_word = "breakdown"

And we tokenize it

	
tokens = encoder.encode(example_word)
tokens
	
[9137, 2996]

The word has been split into 2 tokens, with ids 9137 and 2996. Let's see which pieces of text they correspond to.

	
word1 = encoder.decode([tokens[0]])
word2 = encoder.decode([tokens[1]])
word1, word2
	
('break', 'down')

The OpenAI tokenizer has split the word breakdown into the pieces break and down. That is, it has divided the word into 2 simpler ones.

This matters because when an LLM is said to support x tokens, that does not mean it supports x words. It means it supports x of these minimal units, and a single word can take up several of them.
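As a small illustration of the difference between a word budget and a token budget, the sketch below counts both for a piece of text. The names `context_usage` and `stand_in_encode` are hypothetical, the limit of 1 token is an arbitrary value chosen for the demo, and the stand-in encoder just returns the ids reported above for breakdown so that the example runs without any extra package. With tiktoken installed, `encoder.encode` could be passed in its place.

```python
def context_usage(text, encode, max_tokens):
    """Count words and tokens in text and check the token budget."""
    n_words = len(text.split())
    n_tokens = len(encode(text))
    return {"words": n_words, "tokens": n_tokens, "fits": n_tokens <= max_tokens}


def stand_in_encode(text):
    """Hypothetical encoder: only knows the ids shown above for 'breakdown'."""
    known = {"breakdown": [9137, 2996]}
    return known[text]


usage = context_usage("breakdown", stand_in_encode, max_tokens=1)
print(usage)  # {'words': 1, 'tokens': 2, 'fits': False}
```

One word would fit a budget of one word, but it does not fit a budget of one token, because it is encoded as two.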

If you have a text and want to see the number of tokens it has for the OpenAI tokenizer, you can check it on the Tokenizer page, which displays each token in a different color.


We have seen OpenAI's tokenizer, but each LLM may use a different one.

As we have said, tokens are the minimal units a text is split into, so let's see how many distinct tokens the cl100k_base encoding has.

	
n_vocab = encoder.n_vocab
print(f"Vocab size: {n_vocab}")
	
Vocab size: 100277
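To make the idea of a vocabulary concrete, here is a minimal sketch of a tokenizer with a made-up vocabulary of eleven entries. It is only a toy: `TOY_VOCAB`, `toy_tokenize` and `toy_detokenize` are hypothetical names, the ids are invented, and the greedy longest-match rule is a simplification, not the byte pair encoding that tiktoken uses.

```python
# Toy vocabulary: the entries and their ids are made up for illustration.
TOY_VOCAB = {"break": 0, "down": 1, "b": 2, "r": 3, "e": 4, "a": 5,
             "k": 6, "d": 7, "o": 8, "w": 9, "n": 10}


def toy_tokenize(text, vocab):
    """Split text greedily into the longest pieces found in vocab."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, shrinking until one is in vocab.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
    return ids


def toy_detokenize(ids, vocab):
    """Map ids back to their text pieces and join them."""
    id_to_piece = {i: piece for piece, i in vocab.items()}
    return "".join(id_to_piece[i] for i in ids)


ids = toy_tokenize("breakdown", TOY_VOCAB)
print(ids)                              # [0, 1]
print(toy_detokenize(ids, TOY_VOCAB))   # breakdown
```

A word that is not in the vocabulary as a whole, such as bad, falls back to smaller pieces (here, single letters), which is why less common words tend to cost more tokens.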

Let's see how it tokenizes other kinds of words

	
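def encode_decode(word):
    # Encode the text into token ids
    tokens = encoder.encode(word)
    # Decode each id on its own to see the piece of text it stands for
    decode_tokens = []
    for token in tokens:
        decode_tokens.append(encoder.decode([token]))
    return tokens, decode_tokens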
	
word = "dog"
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")
word = "tomorrow..."
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")
word = "artificial intelligence"
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")
word = "Python"
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")
word = "12/25/2023"
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")
word = "😊"
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")
	
Word: dog ==> tokens: [18964], decode_tokens: ['dog']
Word: tomorrow... ==> tokens: [38501, 7924, 1131], decode_tokens: ['tom', 'orrow', '...']
Word: artificial intelligence ==> tokens: [472, 16895, 11478], decode_tokens: ['art', 'ificial', ' intelligence']
Word: Python ==> tokens: [31380], decode_tokens: ['Python']
Word: 12/25/2023 ==> tokens: [717, 14, 914, 14, 2366, 18], decode_tokens: ['12', '/', '25', '/', '202', '3']
Word: 😊 ==> tokens: [76460, 232], decode_tokens: ['�', '�']
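The two '�' characters in the last line are not a printing error. A plausible explanation, assuming the tokenizer works on UTF-8 bytes, is that the four bytes of the emoji end up split across two tokens, so neither token is a complete character when decoded on its own. The sketch below reproduces that effect with plain Python strings; the 3 + 1 split point is an assumption, since the output above does not show where the real boundary falls.

```python
emoji = "😊"
raw = emoji.encode("utf-8")
print(len(raw))  # 4

# Assumed split point (3 bytes + 1 byte); the real token boundary may differ.
first, second = raw[:3], raw[3:]

# Neither fragment is valid UTF-8 alone, so decoding with errors="replace"
# substitutes the replacement character U+FFFD, which is displayed as '�'.
print(ascii(first.decode("utf-8", errors="replace")))
print(ascii(second.decode("utf-8", errors="replace")))

# Joining the bytes before decoding recovers the original emoji.
print((first + second).decode("utf-8") == emoji)  # True
```

In other words, the tokens have to be decoded together, as `encoder.decode(tokens)` would do, to get the emoji back.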

Finally, let's try it with words in another language, Spanish

	
word = "perro"
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")
word = "perra"
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")
word = "mañana..."
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")
word = "inteligencia artificial"
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")
word = "Python"
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")
word = "12/25/2023"
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")
word = "😊"
tokens, decode_tokens = encode_decode(word)
print(f"Word: {word} ==> tokens: {tokens}, decode_tokens: {decode_tokens}")
	
Word: perro ==> tokens: [716, 299], decode_tokens: ['per', 'ro']
Word: perra ==> tokens: [79, 14210], decode_tokens: ['p', 'erra']
Word: mañana... ==> tokens: [1764, 88184, 1131], decode_tokens: ['ma', 'ñana', '...']
Word: inteligencia artificial ==> tokens: [396, 39567, 8968, 21075], decode_tokens: ['int', 'elig', 'encia', ' artificial']
Word: Python ==> tokens: [31380], decode_tokens: ['Python']
Word: 12/25/2023 ==> tokens: [717, 14, 914, 14, 2366, 18], decode_tokens: ['12', '/', '25', '/', '202', '3']
Word: 😊 ==> tokens: [76460, 232], decode_tokens: ['�', '�']

We can see that, for equivalent words, Spanish produces more tokens than English in these examples. So for the same text, with a similar number of words, the token count will tend to be higher in Spanish than in English.
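As a rough check, the token counts printed above can simply be added up for the three word pairs that differ between the two languages. The counts below are copied from those outputs; three pairs are far too few to generalize, so the ratio is only an illustration.

```python
# Token counts copied from the outputs shown above (cl100k_base).
english = {"dog": 1, "tomorrow...": 3, "artificial intelligence": 3}
spanish = {"perro": 2, "mañana...": 3, "inteligencia artificial": 4}

total_en = sum(english.values())
total_es = sum(spanish.values())

print(total_en, total_es)              # 7 9
print(round(total_es / total_en, 2))   # 1.29
```

For these examples, the Spanish versions need about 29% more tokens than the English ones.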
