Disclaimer: This post has been translated to English using a machine translation model. Please, let me know if you find any mistakes.
In the post LLMs quantization we explain the importance of quantizing LLMs to save memory. We also explain that there is a type of quantization called zero-point quantization, which consists of transforming the values of the parameters linearly, but it has the problem that language models degrade once they exceed 2.7B parameters.

Vector Quantization
Since quantizing all the parameters of large language models introduces errors, the llm.int8() paper proposes vector-wise quantization, which means splitting the weight matrices into vectors so that some of these vectors can be quantized to 8 bits while others cannot. The vectors that can be quantized to 8 bits are quantized and their matrix multiplications are performed in INT8 format, while the vectors that cannot be quantized are kept in FP16 format and their multiplications are performed in FP16 format.
Let's see it with an example
Suppose we have the matrix

and we want to multiply it by the matrix

We set a threshold value, and every column of the first matrix that contains a value greater than this threshold is kept in FP16 format. The rows of the second matrix that correspond to those columns of the first matrix are also kept in FP16 format.
To make it clearer: since the second and fourth columns of the first matrix (the yellow columns) have values greater than the threshold, the second and fourth rows of the second matrix (the yellow rows) are kept in FP16 format.
The same is done if the values above the threshold appear in the second matrix: for example, if a row of the second matrix had a value greater than the threshold, that row would be kept in FP16 format, and the corresponding column of the first matrix would be kept in FP16 format as well.
The rest of the rows and columns that are not kept in FP16 format are quantized to 8 bits, and the multiplications are performed in INT8 format.
So we separate the first matrix into the two submatrices

And the second matrix into two submatrices

We multiply the matrices in INT8 on one side

And, on the other, those that are in FP16 format

As can be seen, multiplying the matrices in INT8 format gives us a 3x2 matrix, and multiplying the matrices in FP16 format also gives us a 3x2 matrix, so if we add them together

Interestingly, it gives us the same result as if we had multiplied the original matrices

To understand why this happens, let's expand the product of the two original matrices

We see that the separation we have made does not cause any problems
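Setting aside the concrete numbers of the matrices above, the reason can be stated in general terms: a matrix product can be written as a sum of column-by-row products over the inner dimension, and that sum can be split into any two groups of indices:

$$A B = \sum_{i} A_{:,i}\,B_{i,:} = \underbrace{\sum_{i \in O} A_{:,i}\,B_{i,:}}_{\text{FP16}} + \underbrace{\sum_{i \notin O} A_{:,i}\,B_{i,:}}_{\text{INT8}}$$

where $A_{:,i}$ is the $i$-th column of the first matrix, $B_{i,:}$ is the $i$-th row of the second matrix, and $O$ is the set of indices that contain an outlier. Each partial sum is a matrix with the same shape as $AB$, which is why adding the INT8 result and the FP16 result gives the full product.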
Therefore, we can conclude that we can separate rows and columns of matrices and still perform the matrix multiplication. The separation is done when some element of the row or column is greater than a threshold value: rows or columns without any value above this threshold are encoded in INT8, occupying only one byte per element, while rows or columns with an element above the threshold are kept in FP16, occupying two bytes per element. This way we avoid rounding issues, since the calculations performed in INT8 are done with values that ensure the multiplications do not exceed the range of 8 bits.
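To fix ideas, here is a minimal sketch of this mixed-precision multiplication in NumPy. The matrices, the threshold and the absmax_quantize and mixed_precision_matmul functions are made up for illustration; the real llm.int8() implementation in bitsandbytes uses per-row and per-column scaling factors and optimized CUDA kernels, but the decomposition is the same idea:

import numpy as np

def absmax_quantize(x):
    # Symmetric absmax quantization: map the largest absolute value to 127
    scale = 127.0 / np.max(np.abs(x))
    return np.round(x * scale).astype(np.int8), scale

def mixed_precision_matmul(A, B, threshold=6.0):
    # Columns of A (and the matching rows of B) that contain an outlier stay in FP16
    outlier = np.abs(A).max(axis=0) > threshold
    regular = ~outlier

    # FP16 part: outlier columns of A times the corresponding rows of B
    fp16_part = A[:, outlier].astype(np.float16) @ B[outlier, :].astype(np.float16)

    # INT8 part: quantize the remaining submatrices, multiply as integers, then dequantize
    A_q, sa = absmax_quantize(A[:, regular])
    B_q, sb = absmax_quantize(B[regular, :])
    int8_part = (A_q.astype(np.int32) @ B_q.astype(np.int32)) / (sa * sb)

    # Adding both partial products recovers (approximately) the full product A @ B
    return fp16_part + int8_part

# Made-up matrices: the second and fourth columns of A contain outliers (> 6)
A = np.array([[ 1.0,  8.0,  2.0, -9.0],
              [ 0.5,  7.0,  1.0, 10.0],
              [ 2.0, -8.5,  0.5,  9.5]])
B = np.array([[ 1.0, -2.0],
              [ 0.5,  1.5],
              [-1.0,  2.0],
              [ 2.0, -0.5]])

print(mixed_precision_matmul(A, B))
print(A @ B)   # both results are (almost) identical, up to a small quantization error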
Threshold value α
As we have said, we split off the rows and columns that have some element greater than a threshold value, but what threshold value should we choose? The authors of the paper ran experiments with several values and determined that this threshold should be α=6; above this value they started to observe degradation in the language models.
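Looking ahead to the next section (and assuming a recent version of the transformers library, so check the documentation of your version): this threshold is exposed as the llm_int8_threshold argument of BitsAndBytesConfig, and its default value is precisely 6.0.

from transformers import BitsAndBytesConfig

# llm_int8_threshold is the outlier threshold α; 6.0 is the default value
quantization_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=6.0)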
Use of llm.int8()
Let's see how to quantize a model with llm.int8() using the transformers library. For this, you need to have bitsandbytes installed.
pip install bitsandbytes
We load a 1B parameter model twice, once in the usual way and the second time quantizing it with llm.int8()
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
model_8bit = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", load_in_8bit=True)
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
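This warning appears because load_in_8bit is deprecated as a direct argument of from_pretrained. A sketch of the equivalent, non-deprecated way (assuming a recent transformers version) is to pass a BitsAndBytesConfig object through the quantization_config argument:

from transformers import BitsAndBytesConfig

# Equivalent to load_in_8bit=True, but through quantization_config
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto", quantization_config=bnb_config)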
We see how much memory each of the models occupies
model.get_memory_footprint()/(1024**3), model_8bit.get_memory_footprint()/(1024**3)
(4.098002195358276, 1.1466586589813232)
As can be seen, the quantized model takes up much less memory
Let's now do a text generation test with the two models
input_tokens = tokenizer("Hello my name is Maximo and I am a Machine Learning Engineer", return_tensors="pt").to(device)
input_tokens.input_ids
tensor([[    1, 15043,   590,  1024,   338,  5918,  4200,   322,   306,   626,
           263,  6189, 29257, 10863,   261]], device='cuda:0')
We see the output with the normal model
import time

t0 = time.time()
max_new_tokens = 50
outputs = model.generate(
    input_ids=input_tokens.input_ids,
    attention_mask=input_tokens.attention_mask,
    max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(time.time() - t0)
Hello my name is Maximo and I am a Machine Learning Engineer. I am currently working at [Company Name] as a Machine Learning Engineer. I have a Bachelor's degree in Computer Science from [University Name] and a Master's degree in Computer Science from [University Name]. I
1.7616662979125977
And now with the quantized model
t0 = time.time()
max_new_tokens = 50
outputs = model_8bit.generate(
    input_ids=input_tokens.input_ids,
    attention_mask=input_tokens.attention_mask,
    max_length=input_tokens.input_ids.shape[1] + max_new_tokens,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(time.time() - t0)
Hello my name is Maximo and I am a Machine Learning Engineer. I am currently working at [Company Name] as a Machine Learning Engineer. I have a Bachelor's degree in Computer Science from [University Name] and a Master's degree in Computer Science from [University Name]. I
9.100712776184082
We see two things: on the one hand, we get the same output text, so with a much smaller model we can obtain the same result; on the other hand, the quantized model takes much longer to run, so if it needs to be used in real time, it would not be advisable.
This may seem contradictory, because we might think that a smaller model would run faster, but we have to consider that both models, the normal one and the quantized one, actually perform the same operations; the difference is that one performs them all in FP32 while the other performs them in INT8 and FP16. However, the quantized model also has to find the rows and columns with values greater than the threshold, separate them, perform the operations in INT8 and FP16, and then recombine the results, which is why it takes longer to run.