Hugging Face Accelerate: Multi-GPU & TPU Training Guide

Q: How to run code on only one process (process_index) with Accelerate

Decorate a function with @accelerator.on_process(process_index=0) to run it only on that process, or use @accelerator.on_main_process for the main one. At runtime, accelerator.process_index gives the current process index and accelerator.num_processes the total number of processes. This is useful for logging, saving checkpoints, or printing just once in multi-GPU/TPU training.

Q: What is num_processes in the Accelerate config?

num_processes is the total number of parallel processes Accelerate launches (typically one per GPU/TPU core). You set it during accelerate config (saved in default_config.yaml) or with accelerate launch --num_processes N. accelerator.num_processes exposes it at runtime.

May 16, 2024

Disclaimer: This post has been translated to English using a machine translation model. Please, let me know if you find any mistakes.

**Hugging Face Accelerate Series**

1. 👉 Installation, configuration, and training (you are here)
2. Saving, mixed precision, and inference

Accelerate is a Hugging Face library that allows you to run the same PyTorch code in any distributed setup by adding only four lines of code.

Installation

To install accelerate with pip, simply run:

pip install accelerate

And with conda:

conda install -c conda-forge accelerate

Configuration

In each environment where accelerate is installed, the first thing that must be done is to configure it; to do this, we run in a terminal:

accelerate config

!accelerate config

>_ Output

			
				--------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
--------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
no
accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml

In my case, the answers were:

In which compute environment are you running?
[x] "This machine"
[_] "AWS (Amazon SageMaker)"

I want to set it up on my computer

Which type of machine are you using?
[_] multi-CPU
[_] multi-XPU
[x] multi-GPU
[_] multi-NPU
[_] TPU

Since I have 2 GPUs and want to run distributed code on them, I choose multi-GPU

How many different machines will you use (use more than 1 for multi-node training)? [1]:
1

I choose 1 because I’m only going to run it on my computer

Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]:
no

With this option, accelerate checks distributed operations for errors during execution, which makes it slower. I choose no, and if errors appear I will change it to yes

Do you wish to optimize your script with torch dynamo?[yes/NO]:
no

Do you want to use FullyShardedDataParallel? [yes/NO]:
no

Do you want to use Megatron-LM? [yes/NO]:
no

How many GPU(s) should be used for distributed training? [1]:
2

I choose 2 because I have 2 GPUs

What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:
0,1

I choose 0,1 because I want to use both GPUs

Do you wish to use FP16 or BF16 (mixed precision)?
[x] no
[_] fp16
[_] bf16
[_] fp8

For now I choose no: to keep the code simple in the parts where I am not using accelerate, we are going to train in fp32, although ideally we would use fp16

The configuration will be saved in ~/.cache/huggingface/accelerate/default_config.yaml and can be modified at any time. Let's see what's inside

!cat ~/.cache/huggingface/accelerate/default_config.yaml

>_ Output

			
				compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Another way to view the configuration we have is by running in a terminal:

accelerate env

!accelerate env

>_ Output

			
				Copy-and-paste the text below in your GitHub issue
- `Accelerate` version: 0.28.0
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.31
- Python version: 3.11.8
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 31.24 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- gpu_ids: 0,1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []

Once we have configured accelerate, we can test whether we did it correctly by running in a terminal:

accelerate test

!accelerate test

>_ Output

			
				Running:  accelerate-launch ~/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: DistributedType.MULTI_GPU  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout:
stdout: Mixed precision type: fp16
stdout:
stdout: Distributed environment: DistributedType.MULTI_GPU  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout:
stdout: Mixed precision type: fp16
stdout:
stdout:
...
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Keep fp32 wrapper check.
stdout: Keep fp32 wrapper check.
stdout: BF16 training check.
stdout: BF16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout:
stdout: **Breakpoint trigger test**
Test is a success! You are ready for your distributed training!

We see that it ends by saying Test is a success! You are ready for your distributed training!, so everything is correct.

Training

Training optimization

Base code

Let's first create a base training code, and then we'll optimize it to see how it's done and how it improves.

First, let's look for a dataset. In my case, I'm going to use the tweet_eval dataset, which is a tweet classification dataset. Specifically, I'm going to download the emoji subset, which classifies tweets into emojis.

from datasets import load_dataset

dataset = load_dataset("tweet_eval", "emoji")
dataset

>_ Output

			
				DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 45000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})

dataset["train"].info

>_ Output

			
				DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜'], id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='parquet', dataset_name='tweet_eval', config_name='emoji', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=3808792, num_examples=45000, shard_lengths=None, dataset_name='tweet_eval'), 'test': SplitInfo(name='test', num_bytes=4262151, num_examples=50000, shard_lengths=None, dataset_name='tweet_eval'), 'validation': SplitInfo(name='validation', num_bytes=396704, num_examples=5000, shard_lengths=None, dataset_name='tweet_eval')}, download_checksums={'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/train-00000-of-00001.parquet': {'num_bytes': 2609973, 'checksum': None}, 'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/test-00000-of-00001.parquet': {'num_bytes': 3047341, 'checksum': None}, 'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/validation-00000-of-00001.parquet': {'num_bytes': 281994, 'checksum': None}}, download_size=5939308, post_processing_size=None, dataset_size=8467647, size_in_bytes=14406955)

Let's see the classes

print(dataset["train"].info.features["label"].names)

>_ Output

			
				['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜']

And the number of classes

num_classes = len(dataset["train"].info.features["label"].names)
num_classes

>_ Output

20

We see that the dataset has 20 classes

Let's see the maximum text length, in characters, of each split

# Longest text (in characters) of each split
max_len_train = max(len(text) for text in dataset["train"]["text"])
max_len_val = max(len(text) for text in dataset["validation"]["text"])
max_len_test = max(len(text) for text in dataset["test"]["text"])

max_len_train, max_len_val, max_len_test

>_ Output

			
				(142, 139, 167)

These are character counts, and a tweet usually yields fewer tokens than it has characters, so we set the maximum sequence length for tokenization to 130 tokens; anything longer is truncated

max_len = 130
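One way to sanity-check a value like max_len is to measure how many sequences fit without truncation. The sketch below is only an illustration: `truncation_coverage` is a hypothetical helper and `toy_lengths` are made-up token counts, not measurements from tweet_eval. In practice the lengths would come from the tokenizer output.

```python
def truncation_coverage(lengths, max_len):
    """Fraction of sequences that fit in max_len without being truncated."""
    return sum(length <= max_len for length in lengths) / len(lengths)

# Hypothetical token counts, for illustration only
toy_lengths = [12, 25, 31, 40, 58, 77, 96, 120, 131, 150]

coverage = truncation_coverage(toy_lengths, max_len=130)
print(f"{coverage:.0%} of the sequences fit in 130 tokens")
```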

We are interested in the tokenized dataset, not the raw sequences, so we create a tokenizer

from transformers import AutoTokenizer

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

We create a tokenization function

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

And now we tokenize the dataset

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}

The tokenized dataset now contains the tokens (input_ids) and the attention masks (attention_mask). Let's see what type of data they hold

type(tokenized_dataset["train"][0]["input_ids"]), type(tokenized_dataset["train"][0]["attention_mask"]), type(tokenized_dataset["train"][0]["label"])

>_ Output

			
(list, list, int)

They are Python lists and integers, but the model expects tensors, so we set the dataset format to torch

tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
type(tokenized_dataset["train"][0]["label"]), type(tokenized_dataset["train"][0]["input_ids"]), type(tokenized_dataset["train"][0]["attention_mask"])

>_ Output

			
				(torch.Tensor, torch.Tensor, torch.Tensor)

We create a DataLoader

import torch
from torch.utils.data import DataLoader
BS = 64

dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

We load the model

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)

Let's see what the model is like

model

>_ Output

			
				RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
...
        )
      )
    )
  )
  (classifier): RobertaClassificationHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (out_proj): Linear(in_features=768, out_features=2, bias=True)
  )
)

Let's look at its last layer

model.classifier.out_proj

>_ Output

			
Linear(in_features=768, out_features=2, bias=True)

model.classifier.out_proj.in_features, model.classifier.out_proj.out_features

>_ Output

			
				(768, 2)

We have seen that our dataset has 20 classes, but this model is trained for 2 classes, so we need to modify the last layer

model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
model.classifier.out_proj

>_ Output

			
				Linear(in_features=768, out_features=20, bias=True)
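The same head swap can be tried on a toy module to see what it changes and what it leaves alone. This is only a sketch: `TinyClassifier` is a hypothetical stand-in for the pretrained model, with made-up layer sizes. The new output layer is freshly initialized, while the rest of the weights are untouched.

```python
import torch

class TinyClassifier(torch.nn.Module):
    """Hypothetical stand-in for a pretrained model with a 2-class head."""

    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Linear(8, 16)
        self.out_proj = torch.nn.Linear(16, 2)

    def forward(self, x):
        return self.out_proj(torch.relu(self.encoder(x)))

toy_model = TinyClassifier()
encoder_before = toy_model.encoder.weight.clone()

# Swap the 2-class head for a freshly initialized 20-class layer
toy_model.out_proj = torch.nn.Linear(toy_model.out_proj.in_features, 20)

logits = toy_model(torch.randn(4, 8))
print(logits.shape)
print(torch.equal(encoder_before, toy_model.encoder.weight))
```

With transformers, a possible alternative is to pass num_labels=20 together with ignore_mismatched_sizes=True to from_pretrained, which also leaves the new head randomly initialized.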

Now the last layer matches the 20 classes of the dataset

Now we create a loss function

loss_function = torch.nn.CrossEntropyLoss()

An optimizer

from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=5e-4)

And lastly, a metric

import evaluate

metric = evaluate.load("accuracy")

Let's check that everything is fine with a sample

sample = next(iter(dataloader["train"]))

sample["input_ids"].shape, sample["attention_mask"].shape

>_ Output

			
				(torch.Size([64, 130]), torch.Size([64, 130]))

Now we feed that sample into the model

model.to("cuda")
ouputs = model(input_ids=sample["input_ids"].to("cuda"), attention_mask=sample["attention_mask"].to("cuda"))
ouputs.logits.shape

>_ Output

			
				torch.Size([64, 20])

We see that the output has 64 rows, one per sample in the batch, which is correct because we set BS = 64, and that each row has 20 values, which is also correct because we changed the last layer of the model to output 20 values

We take the index of the highest value as the predicted class

predictions = torch.argmax(ouputs.logits, axis=-1)
predictions.shape

>_ Output

			
				torch.Size([64])

We obtain the loss

loss = loss_function(ouputs.logits, sample["label"].to("cuda"))
loss.item()

>_ Output

			
				2.9990389347076416

And the accuracy

accuracy = metric.compute(predictions=predictions, references=sample["label"])["accuracy"]
accuracy

>_ Output

			
				0.015625
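These two numbers are what we would expect from a classification head that has not been trained yet. As a quick check (a sketch that only uses the values printed above), a model that spreads its probability evenly over 20 classes has a cross-entropy loss of ln(20), and an accuracy of 0.015625 on a batch of 64 means exactly one correct prediction.

```python
import math

num_classes = 20
batch_size = 64

# Cross-entropy of a uniform guess over all classes
chance_loss = math.log(num_classes)

# Accuracy reported above, converted back to a number of correct samples
observed_accuracy = 0.015625
correct_in_batch = observed_accuracy * batch_size

print(f"loss of a uniform guess: {chance_loss:.4f}")
print(f"correct predictions in the batch: {correct_in_batch:.0f}")
```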

We can now create a small training loop

from fastprogress.fastprogress import master_bar, progress_bar
 
epochs = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
 
master_progress_bar = master_bar(range(epochs))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
 
        loss.backward()
        optimizer.step()
 
    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
 
        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"
	

Script with the base code

Most of the accelerate documentation explains how to use it with scripts, so for now we will do it that way, and at the end we will explain how to do it from a notebook

First, let's create a folder where we are going to save the scripts.

!mkdir accelerate_scripts

Now we write the base code in a script

%%writefile accelerate_scripts/01_code_base.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 64
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
 
master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
 
        loss.backward()
        optimizer.step()
 
    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
 
        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"
print(f"Accuracy = {accuracy['accuracy']}")
	

>_ Output

			
				Overwriting accelerate_scripts/01_code_base.py

And now we run it

%%time

!python accelerate_scripts/01_code_base.py

>_ Output

			
				Accuracy = 0.2112
CPU times: user 2.12 s, sys: 391 ms, total: 2.51 s
Wall time: 3min 36s

We see that on my computer it took about 3 and a half minutes

Code with accelerate

Now we replace some things

First, we import Accelerator and initialize it

from accelerate import Accelerator
accelerator = Accelerator()

We no longer do the typical

``` python

torch.device("cuda" if torch.cuda.is_available() else "cpu")

```

Instead, we let accelerate choose the device with

device = accelerator.device

We pass the relevant elements for training through the prepare method and no longer use model.to(device)

model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

We no longer send the data and the model to the GPU with .to(device) since accelerate has taken care of that with the prepare method
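To get an intuition for what a prepared DataLoader does in a multi-process run, here is a simplified pure-Python sketch. It is not accelerate's actual implementation: `batches_for_process` is a hypothetical helper that hands out batches round-robin, ignores shuffling, and uses the training set size and BS from this post with 2 processes.

```python
def batches_for_process(num_samples, batch_size, num_processes, process_index):
    """Index batches that one process would see (simplified, no shuffling)."""
    all_batches = [
        list(range(start, min(start + batch_size, num_samples)))
        for start in range(0, num_samples, batch_size)
    ]
    # Round-robin: process p takes batches p, p + N, p + 2N, ...
    return all_batches[process_index::num_processes]

train_size, BS, num_processes = 45000, 64, 2
per_process = [
    batches_for_process(train_size, BS, num_processes, rank)
    for rank in range(num_processes)
]
for rank, batches in enumerate(per_process):
    print(f"process {rank}: {len(batches)} batches, first sample index {batches[0][0]}")
```

Each process still works with batches of BS samples, so the effective batch size per optimizer step is BS multiplied by the number of processes.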

Instead of performing backpropagation with loss.backward(), we let accelerate do it with

accelerator.backward(loss)

When calculating the metric in the validation loop, we need to collect the predictions and labels from all the processes, in case we are doing distributed training; to do this, we do

predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)

Putting it all together in a script

%%writefile accelerate_scripts/02_accelerate_base_code.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar
 
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 64
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
    print(f"End of training epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")
 
    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions and labels from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    print(f"End of validation epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")
    
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"
 
print(f"Accuracy = {accuracy['accuracy']}")
	

>_ Output

			
				Overwriting accelerate_scripts/02_accelerate_base_code.py

If you look closely, I have added two lines: print(f"End of training epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}") and print(f"End of validation epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}"). I added them on purpose because they are going to reveal something very important

Now we run it. To execute accelerate scripts, use the accelerate launch command

accelerate launch script.py

%%time

!accelerate launch accelerate_scripts/02_accelerate_base_code.py

>_ Output

			
				End of training epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([64])
End of training epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([64])
End of validation epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([8])
Accuracy = 0.206
End of validation epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([8])
Accuracy = 0.206
CPU times: user 1.6 s, sys: 272 ms, total: 1.88 s
Wall time: 2min 37s
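The torch.Size([8]) in the validation lines is consistent with gather_for_metrics dropping the duplicated samples of the last batch. The sketch below is a simplified arithmetic model of that last step, not accelerate's code: `last_step_summary` is a hypothetical helper, and it assumes the default behaviour in which every process receives a full batch on every step.

```python
import math

def last_step_summary(num_samples, batch_size, num_processes):
    """Simplified model of the final evaluation step of a distributed run."""
    global_batch = batch_size * num_processes
    steps = math.ceil(num_samples / global_batch)
    remainder = num_samples % global_batch
    # Every process contributes a full batch; the extra samples are repeats
    gathered = global_batch
    # Only the samples that were not seen before are kept for the metric
    kept = remainder if remainder else gathered
    return steps, gathered, kept

steps, gathered, kept = last_step_summary(num_samples=5000, batch_size=64, num_processes=2)
total_scored = (steps - 1) * gathered + kept
print(f"steps: {steps}, gathered in last step: {gathered}, kept: {kept}, total scored: {total_scored}")
```

With the 5000 validation samples, the last step keeps 8 labels, and the metric is computed over exactly 5000 samples.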

We see that before it took about 3 and a half minutes, and now it takes about 2 and a half minutes. Quite an improvement. Also, if we look at the prints, we can see that they have been printed twice.

How can this be? Well, because accelerate has parallelized the training across the two GPUs I have, which made it much faster.
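To put a number on the improvement, we can compute the speedup from the two wall times above. This is just arithmetic on the measured values; note that the wall time covers the whole script, including loading the dataset, tokenizing, and loading the model, which each process repeats, so it is presumably a lower bound on how much faster the training loop itself became.

```python
baseline_seconds = 3 * 60 + 36    # 01_code_base.py, one GPU
accelerate_seconds = 2 * 60 + 37  # 02_accelerate_base_code.py, two GPUs
num_gpus = 2

speedup = baseline_seconds / accelerate_seconds
efficiency = speedup / num_gpus

print(f"speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
```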

Also, when I ran the first script (the one without accelerate), the GPU was almost full, whereas when I ran the second one (the one with accelerate), both GPUs were heavily underutilized. That means we can increase the batch size to try to fill both of them. Let's do it!

%%writefile accelerate_scripts/03_accelerate_base_code_more_bs.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar
 
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"
 
print(f"Accuracy = {accuracy['accuracy']}")
	

>_ Output

			
				Overwriting accelerate_scripts/03_accelerate_base_code_more_bs.py

I removed the extra prints, because we have already seen that the code is running on both GPUs, and I increased the batch size from 64 to 128. Let's run it and see.

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		%%time
 
!accelerate launch accelerate_scripts/03_accelerate_base_code_more_bs.py
	

>_ Output

			
				Accuracy = 0.1052
Accuracy = 0.1052
CPU times: user 1.41 s, sys: 180 ms, total: 1.59 s
Wall time: 2min 22s

Increasing the batch size has reduced the execution time by about 15 seconds (from 2 min 37 s to 2 min 22 s).

Process execution

Code execution in a single process

We saw earlier that the prints were printed twice; this is because accelerate creates as many processes as there are devices on which the code runs; in my case, it creates two processes because I have two GPUs.

However, not all code should run in every process. For example, prints slow the code down considerably, so there is no point in executing them several times; and if checkpoints are saved, they would be saved twice.

To run part of the code in a single process, wrap it in a function and decorate it with accelerator.on_local_main_process. For example, in the script below I have created the following function:

@accelerator.on_local_main_process
def print_something(something):
  print(something)

Another option is to put the code inside an if accelerator.is_local_main_process block, as in the following code:

if accelerator.is_local_main_process:
  print("Something")
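To make the mechanics more concrete, here is a minimal pure-Python sketch of how a gating decorator of this kind can work. It is not Accelerate's implementation: `ToyState`, its `local_process_index` attribute and the `build_reporter` helper are hypothetical stand-ins so the snippet runs without any GPU. It also shows one practical consequence: in this sketch, the call returns None on the processes where the function is skipped. As far as I know the real decorator behaves similarly, but check it before relying on the return value of a gated function.

```python
import functools


class ToyState:
    """Hypothetical stand-in for the per-process state that Accelerate tracks."""

    def __init__(self, local_process_index):
        self.local_process_index = local_process_index

    @property
    def is_local_main_process(self):
        # By convention, local index 0 is the "local main" process of a machine.
        return self.local_process_index == 0

    def on_local_main_process(self, function):
        """Run `function` only on the local main process; do nothing elsewhere."""

        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            if self.is_local_main_process:
                return function(*args, **kwargs)
            return None  # skipped on every other process

        return wrapper


def build_reporter(state):
    """Create a print function gated by the given (toy) process state."""

    @state.on_local_main_process
    def report(message):
        print(message)
        return "printed"

    return report


# Simulate the two processes of a two-GPU machine, one after the other.
results = [build_reporter(ToyState(idx))("Accuracy = 0.21") for idx in range(2)]
print(results)
```

The message is printed once, by the simulated process with local index 0, and the second call is a no-op.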

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		%%writefile accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar
 
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
@accelerator.on_local_main_process
def print_something(something):
    print(something)
 
master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"
 
# print(f"Accuracy = {accuracy['accuracy']}")
print_something(f"Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
	

>_ Output

			
				Overwriting accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py

Let's run it and see.

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		%%time
 
!accelerate launch accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py
	

>_ Output

			
				Accuracy = 0.2098
End of script with 0.2098 accuracy
CPU times: user 1.38 s, sys: 197 ms, total: 1.58 s
Wall time: 2min 22s

Now the print has been executed only once.

However, although it is not very visible, the progress bars still run in every process.

I haven't found a way to avoid this with fastprogress progress bars, but I have with tqdm. So I'm going to replace the fastprogress bars with tqdm ones and, to show them in a single process only, add the argument disable=not accelerator.is_local_main_process.

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		%%writefile accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
 
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
@accelerator.on_local_main_process
def print_something(something):
    print(something)
 
for i in range(EPOCHS):
    model.train()
    # progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # master_progress_bar.child.comment = f'loss: {loss}'
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    # progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
# print(f"Accuracy = {accuracy['accuracy']}")
print_something(f"Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
	

>_ Output

			
				Overwriting accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		%%time
 
!accelerate launch accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py
	

>_ Output

			
				100%|█████████████████████████████████████████| 176/176 [02:01<00:00,  1.45it/s]
100%|███████████████████████████████████████████| 20/20 [00:06<00:00,  3.30it/s]
Accuracy = 0.2166
End of script with 0.2166 accuracy
CPU times: user 1.33 s, sys: 195 ms, total: 1.52 s
Wall time: 2min 22s
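
As a side note, the totals shown by the two progress bars (176 and 20) can be checked with a bit of arithmetic. The sketch below assumes that the batch_size passed to each DataLoader is per process and that the batches are dealt out evenly to the processes, which is my understanding of accelerate's default behavior. The dataset sizes (45,000 training and 5,000 validation examples) are the ones reported by the Map progress lines in the script outputs, and the helper name `steps_per_process` is made up for the illustration.

```python
import math


def steps_per_process(num_examples, batch_size, num_processes):
    """Number of batches each process iterates over in one epoch.

    Assumes the dataset is first cut into batches of `batch_size` and those
    batches are then dealt out evenly to the processes.
    """
    total_batches = math.ceil(num_examples / batch_size)
    return math.ceil(total_batches / num_processes)


BS = 128            # per-process batch size used in the scripts
NUM_PROCESSES = 2   # one process per GPU

train_steps = steps_per_process(45_000, BS, NUM_PROCESSES)
validation_steps = steps_per_process(5_000, BS, NUM_PROCESSES)

print(f"Effective batch size: {BS * NUM_PROCESSES}")
print(f"Training steps per process: {train_steps}")
print(f"Validation steps per process: {validation_steps}")
```

Under these assumptions, each optimizer step sees 256 examples in total, not 128, which is worth keeping in mind when comparing runs with and without accelerate.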

We have shown how to print in a single process as an example of running code in a single process. But if all you want is to print in a single process, you can use accelerate's own print method, accelerator.print. Let's look at the same example as before with this method.

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		%%writefile accelerate_scripts/06_accelerate_base_code_print_one_process.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
 
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
for i in range(EPOCHS):
    model.train()
    # progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # master_progress_bar.child.comment = f'loss: {loss}'
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    # progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
# print(f"Accuracy = {accuracy['accuracy']}")
accelerator.print(f"Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
	

>_ Output

			
				Writing accelerate_scripts/06_accelerate_base_code_print_one_process.py

Let's run it.

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		%%time
 
!accelerate launch accelerate_scripts/06_accelerate_base_code_print_one_process.py
	

>_ Output

			
				Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15433.52 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 11406.61 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15036.87 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14932.76 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14956.60 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:00<00:00,  1.46it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00,  3.33it/s]
Accuracy = 0.2134
End of script with 0.2134 accuracy
CPU times: user 1.4 s, sys: 189 ms, total: 1.59 s
Wall time: 2min 27s

Code execution once across all processes

However, some code must run only once across all processes, on all machines, for example uploading the checkpoints to the Hub. accelerator.on_local_main_process and accelerator.is_local_main_process run the code once per machine, so for this case we again have two options. The first is to wrap the code in a function and decorate it with accelerator.on_main_process

@accelerator.on_main_process
def do_my_thing():
    "Something done once across all servers"
    do_thing_once()

or put the code inside an if accelerator.is_main_process

if accelerator.is_main_process:
    repo.push_to_hub()

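Since we are training only to showcase the accelerate library and the model we are training is not good, it does not make sense to upload the checkpoints to the Hub right now, so I'm going to do an example with prints. Note that on a single machine like mine the main process and the local main process are the same process, so each message is printed exactly once.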

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		%%writefile accelerate_scripts/07_accelerate_base_code_some_code_in_all_process.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
 
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)
 
@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)
 
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
print_in_one_process(f"Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
 
print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")
	

>_ Output

			
				Overwriting accelerate_scripts/06_accelerate_base_code_some_code_in_all_process.py

Let's run it and see.

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		%%time
 
!accelerate launch accelerate_scripts/07_accelerate_base_code_some_code_in_all_process.py
	

>_ Output

			
				Map: 100%|██████████████████████| 45000/45000 [00:03<00:00, 14518.44 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:03<00:00, 14368.77 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 16466.33 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14806.14 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14253.33 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14337.07 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:00<00:00,  1.46it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00,  3.34it/s]
Accuracy = 0.2092
End of script with 0.2092 accuracy
All process: Accuracy = 0.2092
All process: End of script with 0.2092 accuracy
CPU times: user 1.42 s, sys: 216 ms, total: 1.64 s
Wall time: 2min 27s

Code execution in process X

Finally, we can specify in which process we want to execute code; to do this, we need to create a function and decorate it with @accelerator.on_process(process_index=0)

@accelerator.on_process(process_index=0)
def do_my_thing():
    "Something done on process index 0"
    do_thing_on_index_zero()

or decorate it with @accelerator.on_local_process(local_process_index=0)

@accelerator.on_local_process(local_process_index=0)
def do_my_thing():
    "Something done on process index 0 on each server"
    do_thing_on_index_zero_on_each_server()

Here I have put process 0, but any number can be used
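
To see how the four selectors differ when more than one machine is involved, here is a small simulation. It assumes a hypothetical setup of 2 machines with 2 processes each, and that the global index is numbered machine by machine (machine * processes per machine + local index), which is the usual convention for distributed launchers as far as I know. Nothing here calls accelerate itself; `Proc` and `layout` are names I made up for the sketch.

```python
from collections import namedtuple

Proc = namedtuple("Proc", ["process_index", "local_process_index", "machine"])


def layout(num_machines, procs_per_machine):
    """Enumerate every process of the (hypothetical) cluster."""
    return [
        Proc(machine * procs_per_machine + local, local, machine)
        for machine in range(num_machines)
        for local in range(procs_per_machine)
    ]


procs = layout(num_machines=2, procs_per_machine=2)

# Condition that each selector checks before running the decorated function.
selectors = {
    "on_main_process": lambda p: p.process_index == 0,
    "on_local_main_process": lambda p: p.local_process_index == 0,
    "on_process(process_index=1)": lambda p: p.process_index == 1,
    "on_local_process(local_process_index=1)": lambda p: p.local_process_index == 1,
}

# Global indices of the processes that would actually run the function.
selected = {
    name: [p.process_index for p in procs if rule(p)]
    for name, rule in selectors.items()
}

for name, ranks in selected.items():
    print(f"{name:<42} -> global processes {ranks}")
```

In this simulated layout the global selectors pick exactly one process, while the local ones pick one process per machine. On a single machine, as in this post, global and local indices coincide, so both families behave the same.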

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		%%writefile accelerate_scripts/08_accelerate_base_code_some_code_in_some_process.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
 
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)
 
@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)
 
@accelerator.on_process(process_index=0)
def print_in_process_0(something):
    print("Process 0: " + something)
 
@accelerator.on_local_process(local_process_index=1)
def print_in_process_1(something):
    print("Process 1: " + something)
 
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
print_in_one_process(f"Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
 
print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")
 
print_in_process_0("End of process 0")
print_in_process_1("End of process 1")
	

>_ Output

			
				Overwriting accelerate_scripts/07_accelerate_base_code_some_code_in_some_process.py

Let's run it.

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		%%time
 
!accelerate launch accelerate_scripts/08_accelerate_base_code_some_code_in_some_process.py
	

>_ Output

			
				Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 15735.58 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14906.20 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:02<00:00,  1.44it/s]
100%|███████████████████████████████████████████| 20/20 [00:06<00:00,  3.27it/s]
Process 1: End of process 1
Accuracy = 0.2128
End of script with 0.2128 accuracy
All process: Accuracy = 0.2128
All process: End of script with 0.2128 accuracy
Process 0: End of process 0
CPU times: user 1.42 s, sys: 295 ms, total: 1.71 s
Wall time: 2min 37s

Synchronize processes

When some code must finish in every process before the script moves on to another task, we can synchronize the processes with accelerator.wait_for_everyone(), which blocks each process until all of them have reached that call.

To see it in action, let’s add a delay to one of the print functions, so that only one process sleeps.

I’ve also added a break in the training loop so the script doesn’t spend long training, which isn’t what we’re interested in right now.

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		%%writefile accelerate_scripts/09_accelerate_base_code_sync_all_process.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
import time
 
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)
 
# Note: despite the function name, on_main_process runs this only on the global main process
@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)
 
@accelerator.on_process(process_index=0)
def print_in_process_0(something):
    time.sleep(2)
    print("Process 0: " + something)
 
@accelerator.on_local_process(local_process_index=1)
def print_in_process_1(something):
    print("Process 1: " + something)
 
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
        break
 
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        # add_batch only accumulates the batch; the score comes from compute()
        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
print_in_one_process(f"Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
 
print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")
 
print_in_one_process("Printing with delay in process 0")
print_in_process_0("End of process 0")
print_in_process_1("End of process 1")
accelerator.wait_for_everyone()
 
print_in_one_process("End of script")
	

>_ Output

			
Overwriting accelerate_scripts/09_accelerate_base_code_sync_all_process.py

We run it

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		!accelerate launch accelerate_scripts/09_accelerate_base_code_sync_all_process.py
	

>_ Output

			
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14218.23 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14666.25 examples/s]
  0%|                                                   | 0/176 [00:00<?, ?it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00,  3.58it/s]
Process 1: End of process 1
Accuracy = 0.212
End of script with 0.212 accuracy
All process: Accuracy = 0.212
All process: End of script with 0.212 accuracy
Printing with delay in process 0
Process 0: End of process 0
End of script

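As can be seen, Process 1: End of process 1 is printed first and the rest follows. All the other prints come from process 0, which has to get through the 2-second delay we added before it reaches the end of the script. Process 1 has no delay, but accelerator.wait_for_everyone() holds it at that call until process 0 arrives, so no process runs the code after the call until every process is ready.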
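The blocking behavior can be pictured with a plain-Python analogy. The sketch below is not how Accelerate implements synchronization; it uses threads and threading.Barrier from the standard library as a stand-in for processes and accelerator.wait_for_everyone(). The names NUM_WORKERS, worker and log are hypothetical, chosen only for this illustration.

```python
import threading
import time

NUM_WORKERS = 2  # stand-in for the number of processes (assumption)
barrier = threading.Barrier(NUM_WORKERS)
events = []  # ordered record of what each worker did
events_lock = threading.Lock()


def log(message):
    # Append under a lock so the record keeps a well-defined order
    with events_lock:
        events.append(message)


def worker(index):
    if index == 0:
        time.sleep(0.2)  # worker 0 is slow, like the delayed print in process 0
    log(f"worker {index} reached the barrier")
    barrier.wait()  # plays the role of accelerator.wait_for_everyone()
    log(f"worker {index} passed the barrier")


threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for line in events:
    print(line)
```

Whatever the thread scheduling, no "passed the barrier" line can appear before both "reached the barrier" lines, which is the same guarantee wait_for_everyone() gives for the code that follows it.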

---

➡️ **Continue in the second part:** Saving, mixed precision, and inference, where we will see how to save and load models, train with mixed precision, and run inference with the Hugging Face ecosystem.

Process control (FAQ)

How to run code on only one process (process_index) with Accelerate

Decorate a function with @accelerator.on_process(process_index=0) to run it only on that process, or use @accelerator.on_main_process for the main one. At runtime, accelerator.process_index gives the current process index and accelerator.num_processes the total number of processes. This is useful for logging, saving checkpoints, or printing just once in multi-GPU/TPU training.
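Conceptually, a decorator of this kind compares the index of the current process with a target index and skips the call everywhere else. The following is a minimal plain-Python sketch of that idea, not Accelerate's actual implementation: on_rank and log_once are hypothetical names, and it assumes the launcher exports a RANK environment variable, as torchrun-style launchers do.

```python
import functools
import os


def on_rank(target_rank):
    """Run the decorated function only when RANK equals target_rank."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Assumption: the launcher sets RANK; default to 0 for a single process
            current_rank = int(os.environ.get("RANK", "0"))
            if current_rank == target_rank:
                return func(*args, **kwargs)
            return None  # skipped on every other process
        return wrapper
    return decorator


@on_rank(0)
def log_once(message):
    print(f"[rank 0] {message}")
    return message


log_once("checkpoint saved")
```

In a real training script the Accelerate decorators are preferable, since they take the process index from the Accelerator state rather than from the environment.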

What is num_processes in the Accelerate config?

num_processes is the total number of parallel processes Accelerate launches (typically one per GPU/TPU core). You set it during accelerate config (saved in default_config.yaml with fields like compute_environment and num_processes) or with accelerate launch --num_processes N. accelerator.num_processes exposes it at runtime.
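One practical consequence of num_processes is the number of steps per epoch: with Accelerate's default behavior, the batch_size given to a prepared DataLoader applies to each process, so the effective global batch is batch_size times num_processes. A small sketch of the arithmetic, assuming 2 processes and a training split of 45,000 examples (both assumptions, chosen because they match the 176-step and 20-step progress bars in the training output of this post); steps_per_epoch is a hypothetical helper:

```python
import math


def steps_per_epoch(num_examples, per_process_batch_size, num_processes):
    # Each process consumes its own batch at every step
    global_batch_size = per_process_batch_size * num_processes
    return math.ceil(num_examples / global_batch_size)


# BS = 128 as in the scripts; the split sizes and 2 processes are assumptions
print(steps_per_epoch(45_000, 128, 2))  # training steps
print(steps_per_epoch(5_000, 128, 2))   # validation steps
```

Under those assumptions the two calls give 176 and 20, the lengths of the progress bars shown earlier.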

DDP errors (FAQ)

Error: "found two or more forward outputs with same shape"

This error comes from PyTorch's DistributedDataParallel (DDP) reducer, which Accelerate uses under the hood for multi-GPU training. It typically appears when find_unused_parameters=True and the model returns several output tensors with the same shape, so DDP cannot unambiguously match gradients to forward outputs.

In Accelerate you control DDP behavior with DistributedDataParallelKwargs:

from accelerate import Accelerator, DistributedDataParallelKwargs
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=False)  # default
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

Common fixes: if you enabled find_unused_parameters=True and don't need it, set it back to False (the default) — this is the most common fix. If you do need it (parameters used conditionally), restructure the model so it doesn't return tensors with duplicate shapes, or make sure every parameter participates in the forward pass.
