HuggingFace Accelerate: Complete Guide to Train Models on GPU/TPU

16 of may of 2024

Disclaimer: This post has been translated to English using a machine translation model. Please, let me know if you find any mistakes.

Accelerate is a Hugging Face library that allows running the same PyTorch code in any distributed setup by adding only four lines of code.

Installation

To install accelerate with pip, simply run:

pip install accelerate

And with conda:

conda install -c conda-forge accelerate

Configuration

In each environment where accelerate is installed, the first thing to do is to configure it. To do this, run in a terminal:

`accelerate config`

	
		!accelerate config

	
		--------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
--------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
no
accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml

In my case, the answers have been

In which compute environment are you running?
[x] "This machine"
[_] "AWS (Amazon SageMaker)"

I want to set it up on my computer

What type of machine are you using?
[_] multi-CPU
[_] multi-XPU
[x] multi-GPU
[_] multi-NPU
[_] TPU

Since I have 2 GPUs and want to run distributed codes on them, I choose multi-GPU

How many different machines will you use (use more than 1 for multi-node training)? [1]:
1

I choose 1 because I'm only going to run it on my computer

Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]:
no

With this option, you can choose to have accelerate check for errors during execution, but it would make it slower, so I choose no, and if there are any errors I change it to yes.

Do you wish to optimize your script with torch dynamo? [yes/NO]:
no

Do you want to use FullyShardedDataParallel? [yes/NO]:
no

Do you want to use Megatron-LM? [yes/NO]:
no

How many GPU(s) should be used for distributed training? [1]:
2

I choose 2 because I have 2 GPUs

What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]:
0.1

I choose 0,1 because I want to use both GPUs

Do you wish to use FP16 or BF16 (mixed precision)?
[x] no
[_] fp16
[_] bf16
[_] fp8

For now I choose no, because to simplify the code when not using accelerate we are going to train in fp32, but ideally we would use fp16

The configuration will be saved in ~/.cache/huggingface/accelerate/default_config.yaml and can be modified at any time. Let's see what's inside.

	
		!cat ~/.cache/huggingface/accelerate/default_config.yaml

	
		compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Another way to view the configuration we have is by running in a terminal:

accelerate env

	
		!accelerate env

	
		Copy-and-paste the text below in your GitHub issue
- `Accelerate` version: 0.28.0
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.31
- Python version: 3.11.8
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 31.24 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- gpu_ids: 0,1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []

Once we have configured accelerate we can test if we have done it correctly by running in a terminal:

accelerate test

	
		!accelerate test

	
		Running:  accelerate-launch ~/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: DistributedType.MULTI_GPU  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout:
stdout: Mixed precision type: fp16
stdout:
stdout: Distributed environment: DistributedType.MULTI_GPU  Backend: nccl
stdout: Num processes: 2
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout:
stdout: Mixed precision type: fp16
stdout:
stdout:
stdout: **Test process execution**
stdout:
stdout: **Test split between processes as a list**
stdout:
stdout: **Test split between processes as a dict**
stdout:
stdout: **Test split between processes as a tensor**
stdout:
stdout: **Test random number generator synchronization**
stdout: All rng are properly synched.
stdout:
stdout: **DataLoader integration test**
stdout: 0 1 tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout:         18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout:         36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout:         54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:1') &lt;class 'accelerate.data_loader.DataLoaderShard'&gt;
stdout: tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout:         18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout:         36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout:         54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:0') &lt;class 'accelerate.data_loader.DataLoaderShard'&gt;
stdout: Non-shuffled dataloader passing.
stdout: Shuffled dataloader passing.
stdout: Non-shuffled central dataloader passing.
stdout: Shuffled central dataloader passing.
stdout:
stdout: **Training integration test**
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributed setup with no batch split.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributes setup with batch split.
stdout: FP16 training check.
stdout: FP16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Keep fp32 wrapper check.
stdout: Keep fp32 wrapper check.
stdout: BF16 training check.
stdout: BF16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout:
stdout: Training yielded the same results on one CPU or distributed setup with no batch split.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: FP16 training check.
stdout: Training yielded the same results on one CPU or distributes setup with batch split.
stdout: FP16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Keep fp32 wrapper check.
stdout: Keep fp32 wrapper check.
stdout: BF16 training check.
stdout: BF16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout:
stdout: **Breakpoint trigger test**
Test is a success! You are ready for your distributed training!

We see that it ends by saying Test is a success! You are ready for your distributed training! so everything is correct.

Training

Optimization of Training

Base code

Let's start by creating a basic training code and then we'll optimize it to see how it's done and how it improves.

First, let's find a dataset. In my case, I will use the tweet_eval dataset, which is a tweet classification dataset. Specifically, I will download the emoji subset that classifies tweets with emojis.

	
		from datasets import load_dataset
 
dataset = load_dataset("tweet_eval", "emoji")
dataset

	
		DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 45000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})

	
		dataset["train"].info

	
		DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜'], id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='parquet', dataset_name='tweet_eval', config_name='emoji', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=3808792, num_examples=45000, shard_lengths=None, dataset_name='tweet_eval'), 'test': SplitInfo(name='test', num_bytes=4262151, num_examples=50000, shard_lengths=None, dataset_name='tweet_eval'), 'validation': SplitInfo(name='validation', num_bytes=396704, num_examples=5000, shard_lengths=None, dataset_name='tweet_eval')}, download_checksums={'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/train-00000-of-00001.parquet': {'num_bytes': 2609973, 'checksum': None}, 'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/test-00000-of-00001.parquet': {'num_bytes': 3047341, 'checksum': None}, 'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/validation-00000-of-00001.parquet': {'num_bytes': 281994, 'checksum': None}}, download_size=5939308, post_processing_size=None, dataset_size=8467647, size_in_bytes=14406955)

Let's see the classes

	
		print(dataset["train"].info.features["label"].names)

	
		['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜']

And the number of classes

	
		num_classes = len(dataset["train"].info.features["label"].names)
num_classes

We see that the dataset has 20 classes

Let's look at the maximum sequence of each split

	
		max_len_train = 0
max_len_val = 0
max_len_test = 0
 
split = "train"
for i in range(len(dataset[split])):
    len_i = len(dataset[split][i]["text"])
    if len_i &gt; max_len_train:
        max_len_train = len_i
split = "validation"
for i in range(len(dataset[split])):
    len_i = len(dataset[split][i]["text"])
    if len_i &gt; max_len_val:
        max_len_val = len_i
split = "test"
for i in range(len(dataset[split])):
    len_i = len(dataset[split][i]["text"])
    if len_i &gt; max_len_test:
        max_len_test = len_i
 
max_len_train, max_len_val, max_len_test

	
		(142, 139, 167)

So we define the maximum sequence in general as 130 for tokenization

	
		max_len = 130

We are interested in the tokenized dataset, not the raw sequences, so we create a tokenizer

	
		from transformers import AutoTokenizer
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

We create a tokenization function

	
		def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

And now we tokenize the dataset

	
		tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}

	
		Map:   0%|          | 0/45000 [00:00&lt;?, ? examples/s]

	
		Map:   0%|          | 0/5000 [00:00&lt;?, ? examples/s]

	
		Map:   0%|          | 0/50000 [00:00&lt;?, ? examples/s]

As we can see, now we have the tokens (input_ids) and the attention masks (attention_mask), but let's take a look at what kind of data we have.

	
		type(tokenized_dataset["train"][0]["input_ids"]), type(tokenized_dataset["train"][0]["attention_mask"]), type(tokenized_dataset["train"][0]["label"])

	
		(list, list, int)

	
		tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
type(tokenized_dataset["train"][0]["label"]), type(tokenized_dataset["train"][0]["input_ids"]), type(tokenized_dataset["train"][0]["attention_mask"])

	
		(torch.Tensor, torch.Tensor, torch.Tensor)

We create a DataLoader

	
		import torch
from torch.utils.data import DataLoader
BS = 64
 
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

We load the model

	
		from transformers import AutoModelForSequenceClassification
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)

Let's see how the model is.

	
		model

	
		RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (classifier): RobertaClassificationHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (out_proj): Linear(in_features=768, out_features=2, bias=True)
  )
)

Let's take a look at its last layer

	
		model.classifier.out_proj

	
		Linear(in_features=768, out_features=2, bias=True)

	
		model.classifier.out_proj.in_features, model.classifier.out_proj.out_features

	
		(768, 2)

We have seen that our dataset has 20 classes, but this model is trained for 2 classes, so we need to modify the last layer.

	
		model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
model.classifier.out_proj

	
		Linear(in_features=768, out_features=20, bias=True)

Now yes

Now we create a loss function

	
		loss_function = torch.nn.CrossEntropyLoss()

An optimizer

	
		from torch.optim import Adam
 
optimizer = Adam(model.parameters(), lr=5e-4)

And lastly, a metric

	
		import evaluate
 
metric = evaluate.load("accuracy")

Let's check that everything is fine with a sample.

	
		sample = next(iter(dataloader["train"]))

	
		sample["input_ids"].shape, sample["attention_mask"].shape

	
		(torch.Size([64, 130]), torch.Size([64, 130]))

Now we feed that sample to the model

	
		model.to("cuda")
ouputs = model(input_ids=sample["input_ids"].to("cuda"), attention_mask=sample["attention_mask"].to("cuda"))
ouputs.logits.shape

	
		torch.Size([64, 20])

We see that the model outputs 64 batches, which is fine because we set BS = 20 and each one with 20 outputs, which is good because we modified the model to have an output of 20 values.

We obtain the one with the highest value

	
		predictions = torch.argmax(ouputs.logits, axis=-1)
predictions.shape

	
		torch.Size([64])

We obtain the loss

	
		loss = loss_function(ouputs.logits, sample["label"].to("cuda"))
loss.item()

	
		2.9990389347076416

And the accuracy

	
		accuracy = metric.compute(predictions=predictions, references=sample["label"])["accuracy"]
accuracy

	
		0.015625

We can now create a small training loop

	
		from fastprogress.fastprogress import master_bar, progress_bar
 
epochs = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
 
master_progress_bar = master_bar(range(epochs))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
 
        loss.backward()
        optimizer.step()
 
    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}
"

	
		&lt;IPython.core.display.HTML object&gt;

	
		&lt;IPython.core.display.HTML object&gt;

Script with the base code

In most of the accelerate documentation, it is explained how to use accelerate with scripts, so for now we will do it this way and at the end we will explain how to do it with a notebook.

First, we are going to create a folder where we will save the scripts

	
		!mkdir accelerate_scripts

Now we write the base code in a script

	
		%%writefile accelerate_scripts/01_code_base.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 64
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
 
master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
 
        loss.backward()
        optimizer.step()
 
    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}
"
print(f"Accuracy = {accuracy['accuracy']}")

	
		Overwriting accelerate_scripts/01_code_base.py

And now we run it

	
		%%time
 
!python accelerate_scripts/01_code_base.py

	
		Accuracy = 0.2112
CPU times: user 2.12 s, sys: 391 ms, total: 2.51 s
Wall time: 3min 36s

We see that on my computer it took about 3 and a half minutes

Code with accelerate

Now we replace some things

First, we import Accelerator and initialize it

from accelerate import Accelerator
accelerator = Accelerator()

We don't do the typical anymore

``` python

torch.device("cuda" if torch.cuda.is_available() else "cpu")

```

But we let accelerate choose the device.

device = accelerator.device

We pass the relevant elements for training through the prepare method and no longer do model.to(device)

model, optimizer, dataloader["train"], dataloader["validation"] = prepare(model, optimizer, dataloader["train"], dataloader["validation"])

We no longer send the data and model to the GPU with .to(device) since accelerate has taken care of it with the prepare method.

Instead of performing backpropagation with loss.backward(), we let accelerate handle it with

accelerator.backward(loss)

When calculating the metric in the validation loop, we need to gather the values from all points, especially if we are doing distributed training, for this we do

predictions = accelerator.gather_for_metrics(predictions)

	
		%%writefile accelerate_scripts/02_accelerate_base_code.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar
 
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 64
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
    print(f"End of training epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")
 
    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Recopilamos las predicciones de todos los dispositivos
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    print(f"End of validation epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")
    
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}
"
 
print(f"Accuracy = {accuracy['accuracy']}")

	
		Overwriting accelerate_scripts/02_accelerate_base_code.py

If you notice, I've added these two lines print(f"End of training epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}") and the line print(f"End of validation epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}"), I added them on purpose because they will reveal something very important.

Now let's run it. To execute the accelerate scripts, use the command accelerate launch

accelerate launch script.py

	
		%%time
 
!accelerate launch accelerate_scripts/02_accelerate_base_code.py

	
		End of training epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([64])
End of training epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([64])
End of validation epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([8])
Accuracy = 0.206
End of validation epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([8])
Accuracy = 0.206
CPU times: user 1.6 s, sys: 272 ms, total: 1.88 s
Wall time: 2min 37s

We see that before it took about 3 and a half minutes, and now it takes around 2 and a half minutes. Quite an improvement. Additionally, if we look at the prints, we can see that they have been printed twice.

And how can this be? Well, because accelerate has parallelized the training across the two GPUs I have, so it was much faster.

Moreover, when I ran the first script, that is, when I didn't use accelerate, the GPU was almost full, while when I ran the second one, that is, the one that uses accelerate, both GPUs were barely utilized, so we can increase the batch size to try to fill both. Let's do it!

	
		%%writefile accelerate_scripts/03_accelerate_base_code_more_bs.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar
 
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Recopilamos las predicciones de todos los dispositivos
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}
"
 
print(f"Accuracy = {accuracy['accuracy']}")

	
		Overwriting accelerate_scripts/03_accelerate_base_code_more_bs.py

I have removed the extra prints, as we have already seen that the code is running on both GPUs, and I increased the batch size from 64 to 128. Let's run it and see.

	
		%%time
 
!accelerate launch accelerate_scripts/03_accelerate_base_code_more_bs.py

	
		Accuracy = 0.1052
Accuracy = 0.1052
CPU times: user 1.41 s, sys: 180 ms, total: 1.59 s
Wall time: 2min 22s

Increasing the batch size has reduced the execution time by a few seconds.

Execution of processes

Execution of code in a single process

We have previously seen that the prints were printed twice, this is because accelerate creates as many processes as there are devices where the code is executed, in my case it creates two processes due to having two GPUs.

However, not all code should be executed in all processes, for example, the prints slow down the code a lot, especially if they are executed multiple times, if checkpoints are saved, they would be saved twice, etc.

To be able to execute part of the code in a single process, it has to be encapsulated in a function and decorated with accelerator.on_local_main_process. For example, in the following code you will see that I have created the following function

@accelerator.on_local_main_process
def print_something(something):
print(something)

Another option is to include the code within an if accelerator.is_local_main_process as in the following code

if accelerator.is_local_main_process:
print("Something")

	
		%%writefile accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar
 
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
@accelerator.on_local_main_process
def print_something(something):
    print(something)
 
master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Recopilamos las predicciones de todos los dispositivos
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}
"
 
# print(f"Accuracy = {accuracy['accuracy']}")
print_something(f"Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")

	
		Overwriting accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py

Let's run it and see

	
		%%time
 
!accelerate launch accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py

	
		Accuracy = 0.2098
End of script with 0.2098 accuracy
CPU times: user 1.38 s, sys: 197 ms, total: 1.58 s
Wall time: 2min 22s

Now the print has only been executed once

However, although not visible much, the progress bars run in each process.

I haven't found a way to avoid this with the progress bars from fastprogress, but I have with those from tqdm, so I'm going to replace the fastprogress progress bars with tqdm ones, and to make them run in a single process, you need to add the argument disable=not accelerator.is_local_main_process

	
		%%writefile accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
 
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
@accelerator.on_local_main_process
def print_something(something):
    print(something)
 
for i in range(EPOCHS):
    model.train()
    # progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # master_progress_bar.child.comment = f'loss: {loss}'
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    # progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Recopilamos las predicciones de todos los dispositivos
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
# print(f"Accuracy = {accuracy['accuracy']}")
print_something(f"Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")

	
		Overwriting accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py

	
		%%time
 
!accelerate launch accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py

	
		100%|█████████████████████████████████████████| 176/176 [02:01&lt;00:00,  1.45it/s]
100%|███████████████████████████████████████████| 20/20 [00:06&lt;00:00,  3.30it/s]
Accuracy = 0.2166
End of script with 0.2166 accuracy
CPU times: user 1.33 s, sys: 195 ms, total: 1.52 s
Wall time: 2min 22s

We have shown an example of how to print in a single process, and this has been a way to execute processes in a single process. But if what you want is just to print in a single process, the print method from accelerate can be used. Let's see the same example as before with this method.

	
		%%writefile accelerate_scripts/06_accelerate_base_code_print_one_process.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
 
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
for i in range(EPOCHS):
    model.train()
    # progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # master_progress_bar.child.comment = f'loss: {loss}'
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    # progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Recopilamos las predicciones de todos los dispositivos
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
# print(f"Accuracy = {accuracy['accuracy']}")
accelerator.print(f"Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")

	
		Writing accelerate_scripts/06_accelerate_base_code_print_one_process.py

We run it

	
		%%time
 
!accelerate launch accelerate_scripts/06_accelerate_base_code_print_one_process.py

	
		Map: 100%|██████████████████████| 45000/45000 [00:02&lt;00:00, 15433.52 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 11406.61 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:02&lt;00:00, 15036.87 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 14932.76 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 14956.60 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:00&lt;00:00,  1.46it/s]
100%|███████████████████████████████████████████| 20/20 [00:05&lt;00:00,  3.33it/s]
Accuracy = 0.2134
End of script with 0.2134 accuracy
CPU times: user 1.4 s, sys: 189 ms, total: 1.59 s
Wall time: 2min 27s

Code Execution in All Processes

However, there is code that must run in all processes, for example, if we upload the checkpoints to the hub, so here we have two options: wrap the code in a function and decorate it with accelerator.on_main_process

@accelerator.on_main_process
def do_my_thing():
"Something done once per server"
do_thing_once()

or put the code inside an if accelerator.is_main_process

if accelerator.is_main_process:
repo.push_to_hub()

Since we are only doing training to showcase the accelerate library and the model we are training is not good, it doesn't make sense to upload the checkpoints to the hub right now, so I will do an example with prints.

	
		%%writefile accelerate_scripts/07_accelerate_base_code_some_code_in_all_process.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
 
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)
 
@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)
 
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Recopilamos las predicciones de todos los dispositivos
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
print_in_one_process(f"Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
 
print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")

	
		Overwriting accelerate_scripts/06_accelerate_base_code_some_code_in_all_process.py

Let's run it to see

	
		%%time
 
!accelerate launch accelerate_scripts/07_accelerate_base_code_some_code_in_all_process.py

	
		Map: 100%|██████████████████████| 45000/45000 [00:03&lt;00:00, 14518.44 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:03&lt;00:00, 14368.77 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 16466.33 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 14806.14 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 14253.33 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 14337.07 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:00&lt;00:00,  1.46it/s]
100%|███████████████████████████████████████████| 20/20 [00:05&lt;00:00,  3.34it/s]
Accuracy = 0.2092
End of script with 0.2092 accuracy
All process: Accuracy = 0.2092
All process: End of script with 0.2092 accuracy
CPU times: user 1.42 s, sys: 216 ms, total: 1.64 s
Wall time: 2min 27s

Code Execution in Process X

Finally, we can specify in which process we want to execute code. For this, we need to create a function and decorate it with @accelerator.on_process(process_index=0)

@accelerator.on_process(process_index=0)
def do_my_thing():
"Something done on process index 0"
do_thing_on_index_zero()

or decorate it with @accelerator.on_local_process(local_process_idx=0)

@accelerator.on_local_process(local_process_index=0)def do_my_thing():
"Something done on process index 0 on each server"
do_thing_on_index_zero_on_each_server()

Here I have put process 0, but any number can be put

	
		%%writefile accelerate_scripts/08_accelerate_base_code_some_code_in_some_process.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
 
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)
 
@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)
 
@accelerator.on_process(process_index=0)
def print_in_process_0(something):
    print("Process 0: " + something)
 
@accelerator.on_local_process(local_process_index=1)
def print_in_process_1(something):
    print("Process 1: " + something)
 
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Recopilamos las predicciones de todos los dispositivos
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
print_in_one_process(f"Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
 
print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")
 
print_in_process_0("End of process 0")
print_in_process_1("End of process 1")

	
		Overwriting accelerate_scripts/07_accelerate_base_code_some_code_in_some_process.py

We run it

	
		%%time
 
!accelerate launch accelerate_scripts/08_accelerate_base_code_some_code_in_some_process.py

	
		Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 15735.58 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 14906.20 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:02&lt;00:00,  1.44it/s]
100%|███████████████████████████████████████████| 20/20 [00:06&lt;00:00,  3.27it/s]
Process 1: End of process 1
Accuracy = 0.2128
End of script with 0.2128 accuracy
All process: Accuracy = 0.2128
All process: End of script with 0.2128 accuracy
Process 0: End of process 0
CPU times: user 1.42 s, sys: 295 ms, total: 1.71 s
Wall time: 2min 37s

Synchronizing Processes

If we have code that needs to run on all processes, it's interesting to wait for it to finish on all processes before doing another task, so for this we use accelerator.wait_for_everyone()

To see it, we are going to introduce a delay in one of the print functions in a process.

I've also added a break in the training loop so that it doesn't train for too long, which isn't what we're interested in right now.

	
		%%writefile accelerate_scripts/09_accelerate_base_code_sync_all_process.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
import time
 
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)
 
@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)
 
@accelerator.on_process(process_index=0)
def print_in_process_0(something):
    time.sleep(2)
    print("Process 0: " + something)
 
@accelerator.on_local_process(local_process_index=1)
def print_in_process_1(something):
    print("Process 1: " + something)
 
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
        break
 
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Recopilamos las predicciones de todos los dispositivos
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
print_in_one_process(f"Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
 
print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")
 
if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")
 
print_in_one_process("Printing with delay in process 0")
print_in_process_0("End of process 0")
print_in_process_1("End of process 1")
accelerator.wait_for_everyone()
 
print_in_one_process("End of script")

	
		Overwriting accelerate_scripts/08_accelerate_base_code_sync_all_process.py

We run it

	
		!accelerate launch accelerate_scripts/09_accelerate_base_code_sync_all_process.py

	
		Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 14218.23 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 14666.25 examples/s]
  0%|                                                   | 0/176 [00:00&lt;?, ?it/s]
100%|███████████████████████████████████████████| 20/20 [00:05&lt;00:00,  3.58it/s]
Process 1: End of process 1
Accuracy = 0.212
End of script with 0.212 accuracy
All process: Accuracy = 0.212
All process: End of script with 0.212 accuracy
Printing with delay in process 0
Process 0: End of process 0
End of script

As can be seen, first Process 1: End of process 1 is printed and then the rest. This is because the remaining prints are either done in process 0 or in all processes, so until the 2-second delay we set is over, the rest of the code does not execute.

Save and load the state dict

When we train, sometimes we save the state so we can continue at another time.

To save the state, we will have to use the methods save_state() and load_state()

	
		%%writefile accelerate_scripts/10_accelerate_save_and_load_checkpoints.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
 
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
@accelerator.on_local_main_process
def print_something(something):
    print(something)
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Recopilamos las predicciones de todos los dispositivos
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
 
    # Guardamos los pesos
    accelerator.save_state("accelerate_scripts/checkpoints")
 
print_something(f"Accuracy = {accuracy['accuracy']}")
 
# Cargamos los pesos
accelerator.load_state("accelerate_scripts/checkpoints")

	
		Overwriting accelerate_scripts/09_accelerate_save_and_load_checkpoints.py

We run it

	
		!accelerate launch accelerate_scripts/10_accelerate_save_and_load_checkpoints.py

	
		100%|█████████████████████████████████████████| 176/176 [01:58&lt;00:00,  1.48it/s]
100%|███████████████████████████████████████████| 20/20 [00:05&lt;00:00,  3.40it/s]
Accuracy = 0.2142

Save the model

When the prepare method was used, the model was wrapped to be able to save it to the necessary devices. Therefore, when saving it, we need to use the save_model method, which first unwraps it and then saves it. Additionally, if we use the parameter safe_serialization=True, the model will be saved as a safe tensor.

	
		%%writefile accelerate_scripts/11_accelerate_save_model.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
 
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
@accelerator.on_local_main_process
def print_something(something):
    print(something)
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Recopilamos las predicciones de todos los dispositivos
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
 
    # Guardamos el modelo
    accelerator.wait_for_everyone()
    accelerator.save_model(model, "accelerate_scripts/model", safe_serialization=True)
 
print_something(f"Accuracy = {accuracy['accuracy']}")

	
		Writing accelerate_scripts/11_accelerate_save_model.py

We run it

	
		!accelerate launch accelerate_scripts/11_accelerate_save_model.py

	
		100%|█████████████████████████████████████████| 176/176 [01:58&lt;00:00,  1.48it/s]
100%|███████████████████████████████████████████| 20/20 [00:05&lt;00:00,  3.35it/s]
Accuracy = 0.214

Save the `pretrained` model

In models that use the transformers library, we must save the model using the save_pretrained method to be able to load it with the from_pretrained method. Before saving it, you need to unwrap it using the unwrap_model method.

	
		%%writefile accelerate_scripts/12_accelerate_save_pretrained.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
 
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
@accelerator.on_local_main_process
def print_something(something):
    print(something)
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Recopilamos las predicciones de todos los dispositivos
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
 
    # Guardamos el modelo pretrained
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        "accelerate_scripts/model_pretrained",
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
    )
 
print_something(f"Accuracy = {accuracy['accuracy']}")

	
		Writing accelerate_scripts/11_accelerate_save_pretrained.py

We run it

	
		!accelerate launch accelerate_scripts/12_accelerate_save_pretrained.py

	
		Map: 100%|██████████████████████| 45000/45000 [00:02&lt;00:00, 15152.47 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:02&lt;00:00, 15119.13 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 12724.70 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 12397.49 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 15247.21 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 15138.03 examples/s]
100%|█████████████████████████████████████████| 176/176 [01:59&lt;00:00,  1.48it/s]
100%|███████████████████████████████████████████| 20/20 [00:05&lt;00:00,  3.37it/s]
Accuracy = 0.21

Now we could load it

	
		from transformers import AutoModel
 
checkpoints = "accelerate_scripts/model_pretrained"
tokenizer = AutoModel.from_pretrained(checkpoints)

	
		Some weights of RobertaModel were not initialized from the model checkpoint at accelerate_scripts/model_pretrained and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training in notebooks

So far we have seen how to run scripts, but if you want to run the code in a notebook, we can write the same code as before, but encapsulated in a function

First we import the libraries

	
		import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
import time
# from accelerate import Accelerator

Now we create the function

	
		def train_code(batch_size: int = 64):
    from accelerate import Accelerator
    accelerator = Accelerator()
 
    dataset = load_dataset("tweet_eval", "emoji")
    num_classes = len(dataset["train"].info.features["label"].names)
    max_len = 130
 
    checkpoints = "cardiffnlp/twitter-roberta-base-irony"
    tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
    def tokenize_function(dataset):
        return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
    tokenized_dataset = {
        "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
        "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
        "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
    }
    tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
    tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
    tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
    BS = batch_size
    dataloader = {
        "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
        "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
        "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
    }
 
    model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
    model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
    loss_function = torch.nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=5e-4)
    metric = evaluate.load("accuracy")
 
    EPOCHS = 1
    # device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    device = accelerator.device
 
    # model.to(device)
    model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
    for i in range(EPOCHS):
        model.train()
        progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
        for batch in progress_bar_train:
            optimizer.zero_grad()
 
            input_ids = batch["input_ids"]#.to(device)
            attention_mask = batch["attention_mask"]#.to(device)
            labels = batch["label"]#.to(device)
 
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = loss_function(outputs['logits'], labels)
 
            # loss.backward()
            accelerator.backward(loss)
            optimizer.step()
 
        model.eval()
        progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
        for batch in progress_bar_validation:
            input_ids = batch["input_ids"]#.to(device)
            attention_mask = batch["attention_mask"]#.to(device)
            labels = batch["label"]#.to(device)
 
            with torch.no_grad():
                outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            predictions = torch.argmax(outputs['logits'], axis=-1)
            # Recopilamos las predicciones de todos los dispositivos
            predictions = accelerator.gather_for_metrics(predictions)
            labels = accelerator.gather_for_metrics(labels)
 
            accuracy = metric.add_batch(predictions=predictions, references=labels)
        accuracy = metric.compute()
        
    accelerator.print(f"Accuracy = {accuracy['accuracy']}")

To run the training in the notebook, we use the notebook_launcher function, to which we pass the function we want to execute, the arguments of that function, and the number of GPUs on which we will train using the num_processes variable.

	
		from accelerate import notebook_launcher
 
args = (128,)
notebook_launcher(train_code, args, num_processes=2)

	
		Launching training on 2 GPUs.

	
		100%|██████████| 176/176 [02:01&lt;00:00,  1.45it/s]
100%|██████████| 20/20 [00:06&lt;00:00,  3.31it/s]

	
		Accuracy = 0.2112

Training in FP16

When we first set up accelerate, it asked us Do you wish to use FP16 or BF16 (mixed precision)? and we said no, so now we are going to tell it yes, that we want FP16.

So far we have trained in FP32, which means that each weight of the model is a 32-bit floating-point number, and now we are going to use a 16-bit floating-point number, meaning the model will take up less space. As a result, two things will happen: we will be able to use a larger batch size, and it will also be faster.

First we run accelerate config again and tell it that we want FP16

	
		!accelerate config

	
		--------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
--------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
fp16
accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml

Now we create a script to train, with the same batch size as before, to see if it takes less time to train

	
		%%writefile accelerate_scripts/13_accelerate_base_code_fp16_bs128.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
 
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Recopilamos las predicciones de todos los dispositivos
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
accelerator.print(f"Accuracy = {accuracy['accuracy']}")

	
		Overwriting accelerate_scripts/12_accelerate_base_code_fp16_bs128.py

Let's run it to see how long it takes

	
		%%time
 
!accelerate launch accelerate_scripts/13_accelerate_base_code_fp16_bs128.py

	
		Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 14983.76 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 14315.47 examples/s]
100%|█████████████████████████████████████████| 176/176 [01:01&lt;00:00,  2.88it/s]
100%|███████████████████████████████████████████| 20/20 [00:02&lt;00:00,  6.84it/s]
Accuracy = 0.2094
CPU times: user 812 ms, sys: 163 ms, total: 976 ms
Wall time: 1min 27s

When we ran this training in FP32 it took about 2 and a half minutes, and now roughly 1 and a half minutes. Let's see if instead of training with a batch size of 128, we do it with one of 256.

	
		%%writefile accelerate_scripts/14_accelerate_base_code_fp16_bs256.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
 
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 256
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Recopilamos las predicciones de todos los dispositivos
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
accelerator.print(f"Accuracy = {accuracy['accuracy']}")

	
		Overwriting accelerate_scripts/13_accelerate_base_code_fp16_bs256.py

We run it

	
		%%time
 
!accelerate launch accelerate_scripts/14_accelerate_base_code_fp16_bs256.py

	
		Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 15390.30 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 14990.92 examples/s]
100%|███████████████████████████████████████████| 88/88 [00:54&lt;00:00,  1.62it/s]
100%|███████████████████████████████████████████| 10/10 [00:02&lt;00:00,  3.45it/s]
Accuracy = 0.2236
CPU times: user 670 ms, sys: 91.6 ms, total: 761 ms
Wall time: 1min 12s

It has only dropped by about 15 seconds

Training in BF16

We have previously trained in FP16 and now we are going to do it in BF16, what is the difference?

As we can see in the image, while FP16 compared to FP32 has fewer bits in the mantissa and the exponent, which makes its range much smaller, BF16 compared to FP32 has the same number of bits for the exponent but fewer in the mantissa, which means that BF16 has the same range of numbers as FP32, but is less precise.

This is beneficial because in FP16 some calculations could result in very high numbers, which cannot be represented in the FP16 format. Additionally, there are certain HW devices that are optimized for this format.

Just like before, we run accelerate config and indicate that we want BF16

	
		!accelerate config

	
		--------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
--------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16
accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml

Now we run the last script we created, that is, with a batch size of 256

	
		%%time
 
!accelerate launch accelerate_scripts/14_accelerate_base_code_fp16_bs256.py

	
		Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 14814.95 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 14506.83 examples/s]
100%|███████████████████████████████████████████| 88/88 [00:51&lt;00:00,  1.70it/s]
100%|███████████████████████████████████████████| 10/10 [00:03&lt;00:00,  3.21it/s]
Accuracy = 0.2112
CPU times: user 688 ms, sys: 144 ms, total: 832 ms
Wall time: 1min 17s

It took a similar amount of time as before, which is normal since we trained a model with 16-bit weights, just like before.

Training in FP8

Now we are going to train in FP8 format, which, as the name suggests, is a floating-point format where each weight has 8 bits, so we run accelerate config to tell it that we want FP8

	
		!accelerate config

	
		--------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
--------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
fp8
accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml

Now we run the last script, the one with a batch size of 256

	
		%%time
 
!accelerate launch accelerate_scripts/14_accelerate_base_code_fp16_bs256.py

	
		Traceback (most recent call last):
  File "/home/wallabot/Documentos/web/portafolio/posts/accelerate_scripts/13_accelerate_base_code_fp16_bs256.py", line 12, in &lt;module&gt;
    accelerator = Accelerator()
                  ^^^^^^^^^^^^^
  File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/accelerator.py", line 371, in __init__
    self.state = AcceleratorState(
                 ^^^^^^^^^^^^^^^^^
  File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/state.py", line 790, in __init__
    raise ValueError(
ValueError: Using `fp8` precision requires `transformer_engine` or `MS-AMP` to be installed.
Traceback (most recent call last):
  File "/home/wallabot/Documentos/web/portafolio/posts/accelerate_scripts/13_accelerate_base_code_fp16_bs256.py", line 12, in &lt;module&gt;
    accelerator = Accelerator()
                  ^^^^^^^^^^^^^
  File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/accelerator.py", line 371, in __init__
    self.state = AcceleratorState(
                 ^^^^^^^^^^^^^^^^^
  File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/state.py", line 790, in __init__
    raise ValueError(
ValueError: Using `fp8` precision requires `transformer_engine` or `MS-AMP` to be installed.
[2024-05-13 21:40:56,455] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 501480) of binary: /home/wallabot/miniconda3/envs/nlp/bin/python
Traceback (most recent call last):
  File "/home/wallabot/miniconda3/envs/nlp/bin/accelerate", line 8, in &lt;module&gt;
    sys.exit(main())
             ^^^^^^
  File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1048, in launch_command
    multi_gpu_launcher(args)
  File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/commands/launch.py", line 702, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
accelerate_scripts/13_accelerate_base_code_fp16_bs256.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-05-13_21:40:56
  host      : wallabot
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 501481)
  error_file: &lt;N/A&gt;
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-13_21:40:56
  host      : wallabot
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 501480)
  error_file: &lt;N/A&gt;
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
CPU times: user 65.1 ms, sys: 14.5 ms, total: 79.6 ms
Wall time: 7.24 s

Since the weights are now 8-bit and take up half the memory, we will increase the batch size to 512.

	
		%%writefile accelerate_scripts/15_accelerate_base_code_fp8_bs512.py
 
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
 
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
 
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
 
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
 
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
 
BS = 512
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
 
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
 
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
 
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
 
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
 
for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
 
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
 
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
 
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
 
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Recopilamos las predicciones de todos los dispositivos
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
 
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    
accelerator.print(f"Accuracy = {accuracy['accuracy']}")

	
		Writing accelerate_scripts/15_accelerate_base_code_fp8_bs512.py

We run it

	
		%%time
 
!accelerate launch accelerate_scripts/15_accelerate_base_code_fp8_bs512.py

Model Inference

Usage of the Hugging Face Ecosystem

Let's see how to perform inference with large models using the transformers library from Hugging Face.

Inference with `pipeline`

If we use the Hugging Face ecosystem, it's very simple, as everything happens under the hood without us having to do much. In the case of using pipeline, which is the easiest way to perform inference with the transformers library, we just need to specify the model we want to use and, importantly, pass device_map="auto". This will make accelerate distribute the model across different GPUs, CPU RAM, or hard drive if necessary.

There are more possible values for device_map, which we will see later, but for now stick with "auto".

We are going to use the Llama3 8B model, which, as its name suggests, is a model with around 8 billion parameters. Since each parameter by default is in FP32 format, which corresponds to 4 bytes (32 bits), this means that if we multiply 8 billion parameters by 4 bytes, we get that it would require a GPU with around 32 GB of VRAM.

In my case, I have 2 GPUs with 24 GB of VRAM each, so it wouldn't fit into a single GPU. But thanks to setting device_map="auto", accelerate will distribute the model across both GPUs and I will be able to perform inference.

	
		%%writefile accelerate_scripts/16_inference_with_pipeline.py
 
from transformers import pipeline
 
checkpoints = "meta-llama/Meta-Llama-3-8B-Instruct"
generator = pipeline(model=checkpoints, device_map="auto")
 
prompt = "Conoces accelerate de hugging face?"
output = generator(prompt)
print(output)

	
		Overwriting accelerate_scripts/09_inference_with_pipeline.py

Now we run it, only as the pipeline uses accelerate under the hood, we don't need to run it with accelerate launch script.py but instead with python script.py.

	
		!python accelerate_scripts/16_inference_with_pipeline.py

Loading checkpoint shards: 100%|██████████████████| 4/4 [00:09&lt;00:00, 2.27s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
[{'generated_text': 'Conoces accelerate de hugging face? ¿Qué es el modelo de lenguaje de transformers y cómo se utiliza en el marco de hugging face? ¿Cómo puedo utilizar modelos de lenguaje de transformers en mi aplicación? ¿Qué son los tokenizers y cómo se utilizan en el marco de hugging face? ¿Cómo puedo crear un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los datasets y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar datasets para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning'}]

As can be seen, it has not responded, but has continued to ask questions. This is because Llama3 is a language model that predicts the next token, so with the prompt I passed it, it considered that the next best tokens were ones that correspond to more questions. This makes sense because sometimes people have doubts about a topic and generate many questions, so to get it to answer the question, we need to condition it a bit.

	
		%%writefile accelerate_scripts/17_inference_with_pipeline_condition.py
 
from transformers import pipeline
 
checkpoints = "meta-llama/Meta-Llama-3-8B-Instruct"
generator = pipeline(model=checkpoints, device_map="auto")
 
prompt = "Conoces accelerate de hugging face?"
messages = [
    {
        "role": "system",
        "content": "Eres un chatbot amigable que siempre intenta solucionar las dudas",
    },
    {"role": "user", "content": f"{prompt}"},
]
output = generator(messages)
print(output[0]['generated_text'][-1])

	
		Overwriting accelerate_scripts/10_inference_with_pipeline_condition.py

As you can see, a message has been generated with roles, conditioning the model and with the prompt

	
		!python accelerate_scripts/17_inference_with_pipeline_condition.py

Loading checkpoint shards: 100%|██████████████████| 4/4 [00:09&lt;00:00, 2.41s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
{'role': 'assistant', 'content': '¡Hola!

Sí, conozco Accelerate de Hugging Face. Accelerate es una biblioteca de Python desarrollada por Hugging Face que se enfoca en simplificar y acelerar el entrenamiento y la evaluación de modelos de lenguaje en diferentes dispositivos y entornos.

Con Accelerate, puedes entrenar modelos de lenguaje en diferentes plataformas y dispositivos, como GPUs, TPUs, CPUs y servidores, sin necesidad de cambiar el código de tu modelo. Esto te permite aprovechar al máximo la potencia de cálculo de tus dispositivos y reducir el tiempo de entrenamiento.

Accelerate también ofrece varias características adicionales, como:

* Soporte para diferentes frameworks de machine learning, como TensorFlow, PyTorch y JAX.
* Integración con diferentes sistemas de almacenamiento y procesamiento de datos, como Amazon S3 y Google Cloud Storage.
* Soporte para diferentes protocolos de comunicación, como HTTP y gRPC.
* Herramientas para monitorear y depurar tus modelos en tiempo real.

En resumen, Accelerate es una herramienta muy útil para desarrolladores de modelos de lenguaje que buscan simplificar y acelerar el proceso de entrenamiento y evaluación de sus modelos.

¿Tienes alguna pregunta específica sobre Accelerate o necesitas ayuda para implementarlo en tu proyecto?'}

Now the response does answer our prompt

Inference with `AutoClass`

Lastly, we will see how to perform inference using only AutoClass.

	
		%%writefile accelerate_scripts/18_inference_with_autoclass.py
 
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
 
checkpoints = "meta-llama/Meta-Llama-3-8B-Instruct"
 
tokenizer = AutoTokenizer.from_pretrained(checkpoints, device_map="auto")
model = AutoModelForCausalLM.from_pretrained(checkpoints, device_map="auto")
streamer = TextStreamer(tokenizer)
 
prompt = "Conoces accelerate de hugging face?"
tokens_input = tokenizer([prompt], return_tensors="pt").to(model.device)
 
_ = model.generate(**tokens_input, streamer=streamer, max_new_tokens=500, do_sample=True, top_k=50, top_p=0.95, temperature=0.7)

	
		Overwriting accelerate_scripts/11_inference_with_autoclass.py

As can be seen, the streamer object has been created and is then passed to the model's generate method. This is useful for printing each word as it is generated, rather than waiting for the entire output to be generated before printing it.

	
		!python accelerate_scripts/18_inference_with_autoclass.py

	
		Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:09&lt;00:00,  2.28s/it]
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
&lt;|begin_of_text|&gt;Conoces accelerate de hugging face? Si es así, puedes utilizar la biblioteca `transformers` de Hugging Face para crear un modelo de lenguaje que pueda predecir la siguiente palabra en una secuencia de texto.
Aquí te muestro un ejemplo de cómo hacerlo:
```
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Cargar el modelo y el tokenizador
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Cargar el conjunto de datos
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
# Preprocesar los datos
train_texts = train_df["text"]
train_labels = train_df["label"]
test_texts = test_df["text"]
# Convertir los textos en entradas para el modelo
train_encodings = tokenizer.batch_encode_plus(train_texts,
                                              add_special_tokens=True,
                                              max_length=512,
                                              return_attention_mask=True,
                                              return_tensors='pt')
test_encodings = tokenizer.batch_encode_plus(test_texts,
                                             add_special_tokens=True,
                                             max_length=512,
                                             return_attention_mask=True,
                                             return_tensors='pt')
# Crear un dataloader para entrenar el modelo
train_dataset = torch.utils.data.TensorDataset(train_encodings["input_ids"],
                                               train_encodings["attention_mask"],
                                               torch.tensor(train_labels))
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
# Entrenar el modelo
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
for epoch in range(5):
    model.train()
    total_loss = 0
    for batch in train_loader:
        input_ids = batch[0].to(device)
        attention_mask = batch[1].to(device)
        labels = batch[2].to(device)
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}, Loss: {total

Use PyTorch

Normally, the way to make inferences with PyTorch is to create a model with randomly initialized weights and then load a state dict with the pretrained model's weights. So, to get that state dict, we'll first take a small shortcut and download it.

	
		import torch
import torchvision.models as models
 
model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
torch.save(model.state_dict(), 'accelerate_scripts/resnet152_pretrained.pth')

	
		Downloading: "https://download.pytorch.org/models/resnet152-394f9c45.pth" to /home/maximo.fernandez/.cache/torch/hub/checkpoints/resnet152-394f9c45.pth
100%|██████████| 230M/230M [02:48&lt;00:00, 1.43MB/s]

Now that we have the state dict, we are going to perform inference as it is typically done in PyTorch.

	
		import torch
import torchvision.models as models
 
device = "cuda" if torch.cuda.is_available() else "cpu" # Set device
 
resnet152 = models.resnet152().to(device) # Create model with random weights and move to device
state_dict = torch.load('accelerate_scripts/resnet152_pretrained.pth', map_location=device) # Load pretrained weights into device memory
resnet152.load_state_dict(state_dict) # Load this weights into the model
 
input = torch.rand(1, 3, 224, 224).to(device) # Random image with batch size 1
output = resnet152(input)
output.shape

	
		torch.Size([1, 1000])

Let's explain what has happened

When we did resnet152 = models.resnet152().to(device), a ResNet152 with random weights was loaded into the GPU memory.
When we did state_dict = torch.load('accelerate_scripts/resnet152_pretrained.pth', map_location=device), a dictionary with the trained weights was loaded into the GPU memory.
When we did resnet152.load_state_dict(state_dict), those pretrained weights were assigned to the model.

That is, the model has been loaded twice into the GPU memory.

You might be wondering why we did first

model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
torch.save(model.state_dict(), 'accelerate_scripts/resnet152_pretrained.pth')

To do later

resnet152 = models.resnet152().to(device)
state_dict = torch.load('accelerate_scripts/resnet152_pretrained.pth', map_location=device)
resnet152.load_state_dict(state_dict)

And why don't we use directly

model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)

And we stop saving the state dict to load it later. Well, because Pytorch does the same thing under the hood. So, to see the whole process, we have broken down into several lines what Pytorch does in one.

This way of working has been effective so far, while models were manageable by user GPUs. But since the arrival of LLMs, this approach no longer makes sense.

For example, a 6B parameter model would occupy 24 GB in memory, and since it is loaded twice with this way of working, you would need a 48 GB GPU.

So to fix this, the way to load a pretrained model in PyTorch is:

Create an empty model with init_empty_weights that will not occupy RAM memory* Then load the weights with load_checkpoint_and_dispatch which will load a checkpoint into the empty model and distribute the weights for each layer across all available devices (GPU, CPU, RAM, and hard drive), thanks to setting device_map="auto"

	
		import torch
import torchvision.models as models
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
 
with init_empty_weights():
    resnet152 = models.resnet152()
 
resnet152 = load_checkpoint_and_dispatch(resnet152, checkpoint='accelerate_scripts/resnet152_pretrained.pth', device_map="auto")
 
device = "cuda" if torch.cuda.is_available() else "cpu"
 
input = torch.rand(1, 3, 224, 224).to(device) # Random image with batch size 1
output = resnet152(input)
output.shape

	
		torch.Size([1, 1000])

How accelerate works under the hood

In this video you can see graphically how accelerate works under the hood

Initialization of an empty model

Accelerate creates the skeleton of an empty model using init_empty_weights so that it occupies as little memory as possible.

For example, let's see how much RAM I have available on my computer now.

	
		import psutil
 
def get_ram_info():
    ram = dict(psutil.virtual_memory()._asdict())
    print(f"Total RAM: {(ram['total']/1024/1024/1024):.2f} GB, Available RAM: {(ram['available']/1024/1024/1024):.2f} GB, Used RAM: {(ram['used']/1024/1024/1024):.2f} GB")
 
get_ram_info()

	
		Total RAM: 31.24 GB, Available RAM: 22.62 GB, Used RAM: 7.82 GB

I have about 22 GB of RAM available.

Now let's try to create a model with 5000x1000x1000 parameters, that is, 5B parameters. If each parameter is in FP32, it would require 20 GB of RAM.

	
		import torch
from torch import nn
 
model = nn.Sequential(*[nn.Linear(5000, 1000) for _ in range(1000)])

If we look at the RAM again

	
		get_ram_info()

	
		Total RAM: 31.24 GB, Available RAM: 3.77 GB, Used RAM: 26.70 GB

As we can see, now we only have 3 GB of RAM available

Now we are going to delete the model to free up RAM

	
		del model
get_ram_info()

	
		Total RAM: 31.24 GB, Available RAM: 22.44 GB, Used RAM: 8.03 GB

We have about 22 GB of RAM available again.

Let's now use init_empty_weights from accelerate and then check the RAM

	
		from accelerate import init_empty_weights
 
with init_empty_weights():
    model = nn.Sequential(*[nn.Linear(5000, 1000) for _ in range(1000)])
 
get_ram_info()

	
		Total RAM: 31.24 GB, Available RAM: 22.32 GB, Used RAM: 8.16 GB

We previously had exactly 22.44 GB free, and after creating the model with init_empty_weights we have 22.32 GB. The RAM savings are huge! Almost no RAM was used to create the model.

This is based on the meta-device introduced in PyTorch 1.9, so it is important that to use accelerate we have a version of Pytorch later than that.

Loading the weights

Once we have initialized the model, we need to load its weights, which we do through load_checkpoint_and_dispatch, which, as its name suggests, loads the weights and dispatches them to the necessary device or devices.

Continue reading

MCP: Complete Guide to Create servers and clients MCP (Model Context Protocol) with FastMCP

Learn what is the Model Context Protocol (MCP), the open-source standard developed by Anthropic that revolutionizes how AI models interact with external tools. In this practical and detailed guide, I take you step by step in creating an MCP server and client from scratch using the fastmcp library. You will build an "intelligent" AI agent with Claude Sonnet, capable of interacting with the GitHub API to query issues and repository information. We will cover from basic concepts to advanced features like filtering tools by tags, server composition, static resources and dynamic templates (resource templates), prompt generation, and secure authentication. Discover how MCP can standardize and simplify the integration of tools in your AI applications, analogously to how USB unified peripherals!

Agents patterns

Are your agents falling short? Elevate your AI projects with advanced patterns: ReAct, planning, multi-agents, and more. Practical guide with code!

LangGraph: Revolutionize your AI agents

🚀 Revolutionize your AI agents! 🧠 LangGraph is not just another library, it's the orchestration framework that gives you total control to build complex agents, with long-term memory and even human intervention! Say goodbye to basic chatbots, it's time to create true intelligence. Dive into this post and discover it!

Last posts -->

Have you seen these projects?

Horeca chatbot

Naviground

Subtify

View all projects -->

Do you want to apply AI in your project? Contact me!

Do you want to improve with these tips?

Memory profiler

See the memory usage of a script

DataLoader with pin_memory and num_workers

Increase DataLoader performance with pin_memory and num_workers

py-smi

Python library to get GPU data like `nvidia-smi`

Last tips -->

Use this locally

Hugging Face spaces allow us to run models with very simple demos, but what if the demo breaks? Or if the user deletes it? That's why I've created docker containers with some interesting spaces, to be able to use them locally, whatever happens. In fact, if you click on any project view button, it may take you to a space that doesn't work.

Flow edit

FLUX.1-RealismLora

token_hmr

View all containers -->

Do you want to apply AI in your project? Contact me!

Do you want to train your model with these datasets?

short-jokes-dataset

Dataset with jokes in English

opus100

Dataset with translations from English to Spanish

netflix_titles

Dataset with Netflix movies and series

View more datasets -->

Installation

Configuration

Training

Optimization of Training

Base code

Script with the base code

Code with accelerate

Execution of processes

Execution of code in a single process

Code Execution in All Processes

Code Execution in Process X

Synchronizing Processes

Save and load the state dict

Save the model

Save the pretrained model

Training in notebooks

Training in FP16

Training in BF16

Training in FP8

Model Inference

Usage of the Hugging Face Ecosystem

Inference with pipeline

Inference with AutoClass

Use PyTorch

How accelerate works under the hood

Initialization of an empty model

Loading the weights

Continue reading

MCP: Complete Guide to Create servers and clients MCP (Model Context Protocol) with FastMCP

Agents patterns

LangGraph: Revolutionize your AI agents

Have you seen these projects?

Do you want to apply AI in your project? Contact me!

Do you want to improve with these tips?

Memory profiler

DataLoader with pin_memory and num_workers

py-smi

Use this locally

Do you want to apply AI in your project? Contact me!

Do you want to train your model with these datasets?

short-jokes-dataset

opus100

netflix_titles

Save the `pretrained` model

Inference with `pipeline`

Inference with `AutoClass`