HF Accelerate: Train Models on GPU/TPU

HF Accelerate: Train Models on GPU/TPU HF Accelerate: Train Models on GPU/TPU

Disclaimer: This post has been translated to English using a machine translation model. Please, let me know if you find any mistakes.

Accelerate is a Hugging Face library that allows running the same PyTorch code in any distributed setup by adding only four lines of code.

Installationlink image 29

To install accelerate with pip, simply run:

pip install accelerate

And with conda:

conda install -c conda-forge accelerate

Configurationlink image 30

In each environment where accelerate is installed, the first thing to do is to configure it. To do this, run in a terminal:

`accelerate config`
	
<> Input Python
!accelerate config
Copied
	
>_ Output
--------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
--------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
no
accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml

In my case, the answers have been

  • In which compute environment are you running?
  • [x] "This machine"
  • [_] "AWS (Amazon SageMaker)"

I want to set it up on my computer

  • What type of machine are you using?
  • [_] multi-CPU
  • [_] multi-XPU
  • [x] multi-GPU
  • [_] multi-NPU
  • [_] TPU

Since I have 2 GPUs and want to run distributed codes on them, I choose multi-GPU

  • How many different machines will you use (use more than 1 for multi-node training)? [1]:
  • 1

I choose 1 because I'm only going to run it on my computer

  • Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]:
  • no

With this option, you can choose to have accelerate check for errors during execution, but it would make it slower, so I choose no, and if there are any errors I change it to yes.

  • Do you wish to optimize your script with torch dynamo? [yes/NO]:
  • no
  • Do you want to use FullyShardedDataParallel? [yes/NO]:
  • no
  • Do you want to use Megatron-LM? [yes/NO]:
  • no
  • How many GPU(s) should be used for distributed training? [1]:
  • 2

I choose 2 because I have 2 GPUs

  • What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]:
  • 0.1

I choose 0,1 because I want to use both GPUs

  • Do you wish to use FP16 or BF16 (mixed precision)?
  • [x] no
  • [_] fp16
  • [_] bf16
  • [_] fp8

For now I choose no, because to simplify the code when not using accelerate we are going to train in fp32, but ideally we would use fp16

The configuration will be saved in ~/.cache/huggingface/accelerate/default_config.yaml and can be modified at any time. Let's see what's inside.

	
<> Input Python
!cat ~/.cache/huggingface/accelerate/default_config.yaml
Copied
	
>_ Output
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Another way to view the configuration we have is by running in a terminal:

accelerate env
	
<> Input Python
!accelerate env
Copied
	
>_ Output
Copy-and-paste the text below in your GitHub issue
- `Accelerate` version: 0.28.0
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.31
- Python version: 3.11.8
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 31.24 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- gpu_ids: 0,1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []

Once we have configured accelerate we can test if we have done it correctly by running in a terminal:

accelerate test
	
<> Input Python
!accelerate test
Copied
	
>_ Output
Running: accelerate-launch ~/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: DistributedType.MULTI_GPU Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout:
stdout: Mixed precision type: fp16
stdout:
stdout: Distributed environment: DistributedType.MULTI_GPU Backend: nccl
stdout: Num processes: 2
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout:
stdout: Mixed precision type: fp16
stdout:
stdout:
stdout: **Test process execution**
stdout:
stdout: **Test split between processes as a list**
stdout:
stdout: **Test split between processes as a dict**
stdout:
stdout: **Test split between processes as a tensor**
stdout:
stdout: **Test random number generator synchronization**
stdout: All rng are properly synched.
stdout:
stdout: **DataLoader integration test**
stdout: 0 1 tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout: 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout: 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:1') &lt;class 'accelerate.data_loader.DataLoaderShard'&gt;
stdout: tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout: 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout: 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout: 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:0') &lt;class 'accelerate.data_loader.DataLoaderShard'&gt;
stdout: Non-shuffled dataloader passing.
stdout: Shuffled dataloader passing.
stdout: Non-shuffled central dataloader passing.
stdout: Shuffled central dataloader passing.
stdout:
stdout: **Training integration test**
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributed setup with no batch split.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributes setup with batch split.
stdout: FP16 training check.
stdout: FP16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Keep fp32 wrapper check.
stdout: Keep fp32 wrapper check.
stdout: BF16 training check.
stdout: BF16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout:
stdout: Training yielded the same results on one CPU or distributed setup with no batch split.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: FP16 training check.
stdout: Training yielded the same results on one CPU or distributes setup with batch split.
stdout: FP16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Keep fp32 wrapper check.
stdout: Keep fp32 wrapper check.
stdout: BF16 training check.
stdout: BF16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout:
stdout: **Breakpoint trigger test**
Test is a success! You are ready for your distributed training!

We see that it ends by saying Test is a success! You are ready for your distributed training! so everything is correct.

Traininglink image 31

Optimization of Traininglink image 32

Base codelink image 33

Let's start by creating a basic training code and then we'll optimize it to see how it's done and how it improves.

First, let's find a dataset. In my case, I will use the tweet_eval dataset, which is a tweet classification dataset. Specifically, I will download the emoji subset that classifies tweets with emojis.

	
<> Input Python
from datasets import load_dataset
dataset = load_dataset("tweet_eval", "emoji")
dataset
Copied
	
>_ Output
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 45000
})
test: Dataset({
features: ['text', 'label'],
num_rows: 50000
})
validation: Dataset({
features: ['text', 'label'],
num_rows: 5000
})
})
	
<> Input Python
dataset["train"].info
Copied
	
>_ Output
DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜'], id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='parquet', dataset_name='tweet_eval', config_name='emoji', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=3808792, num_examples=45000, shard_lengths=None, dataset_name='tweet_eval'), 'test': SplitInfo(name='test', num_bytes=4262151, num_examples=50000, shard_lengths=None, dataset_name='tweet_eval'), 'validation': SplitInfo(name='validation', num_bytes=396704, num_examples=5000, shard_lengths=None, dataset_name='tweet_eval')}, download_checksums={'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/train-00000-of-00001.parquet': {'num_bytes': 2609973, 'checksum': None}, 'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/test-00000-of-00001.parquet': {'num_bytes': 3047341, 'checksum': None}, 'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/validation-00000-of-00001.parquet': {'num_bytes': 281994, 'checksum': None}}, download_size=5939308, post_processing_size=None, dataset_size=8467647, size_in_bytes=14406955)

Let's see the classes

	
<> Input Python
print(dataset["train"].info.features["label"].names)
Copied
	
>_ Output
['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜']

And the number of classes

	
<> Input Python
num_classes = len(dataset["train"].info.features["label"].names)
num_classes
Copied
	
>_ Output
20

We see that the dataset has 20 classes

Let's look at the maximum sequence of each split

	
<> Input Python
max_len_train = 0
max_len_val = 0
max_len_test = 0
split = "train"
for i in range(len(dataset[split])):
len_i = len(dataset[split][i]["text"])
if len_i &gt; max_len_train:
max_len_train = len_i
split = "validation"
for i in range(len(dataset[split])):
len_i = len(dataset[split][i]["text"])
if len_i &gt; max_len_val:
max_len_val = len_i
split = "test"
for i in range(len(dataset[split])):
len_i = len(dataset[split][i]["text"])
if len_i &gt; max_len_test:
max_len_test = len_i
max_len_train, max_len_val, max_len_test
Copied
	
>_ Output
(142, 139, 167)

So we define the maximum sequence in general as 130 for tokenization

	
<> Input Python
max_len = 130
Copied

We are interested in the tokenized dataset, not the raw sequences, so we create a tokenizer

	
<> Input Python
from transformers import AutoTokenizer
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
Copied

We create a tokenization function

	
<> Input Python
def tokenize_function(dataset):
return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
Copied

And now we tokenize the dataset

	
<> Input Python
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
Copied
	
>_ Output
Map: 0%| | 0/45000 [00:00&lt;?, ? examples/s]
	
>_ Output
Map: 0%| | 0/5000 [00:00&lt;?, ? examples/s]
	
>_ Output
Map: 0%| | 0/50000 [00:00&lt;?, ? examples/s]

As we can see, now we have the tokens (input_ids) and the attention masks (attention_mask), but let's take a look at what kind of data we have.

	
<> Input Python
type(tokenized_dataset["train"][0]["input_ids"]), type(tokenized_dataset["train"][0]["attention_mask"]), type(tokenized_dataset["train"][0]["label"])
Copied
	
>_ Output
(list, list, int)
	
<> Input Python
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
type(tokenized_dataset["train"][0]["label"]), type(tokenized_dataset["train"][0]["input_ids"]), type(tokenized_dataset["train"][0]["attention_mask"])
Copied
	
>_ Output
(torch.Tensor, torch.Tensor, torch.Tensor)

We create a DataLoader

	
<> Input Python
import torch
from torch.utils.data import DataLoader
BS = 64
dataloader = {
"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
Copied

We load the model

	
<> Input Python
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
Copied

Let's see how the model is.

	
<> Input Python
model
Copied
	
>_ Output
RobertaForSequenceClassification(
(roberta): RobertaModel(
(embeddings): RobertaEmbeddings(
(word_embeddings): Embedding(50265, 768, padding_idx=1)
(position_embeddings): Embedding(514, 768, padding_idx=1)
(token_type_embeddings): Embedding(1, 768)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): RobertaEncoder(
(layer): ModuleList(
(0-11): 12 x RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): RobertaSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): RobertaIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
(intermediate_act_fn): GELUActivation()
)
(output): RobertaOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
)
)
(classifier): RobertaClassificationHead(
(dense): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(out_proj): Linear(in_features=768, out_features=2, bias=True)
)
)

Let's take a look at its last layer

	
<> Input Python
model.classifier.out_proj
Copied
	
>_ Output
Linear(in_features=768, out_features=2, bias=True)
	
<> Input Python
model.classifier.out_proj.in_features, model.classifier.out_proj.out_features
Copied
	
>_ Output
(768, 2)

We have seen that our dataset has 20 classes, but this model is trained for 2 classes, so we need to modify the last layer.

	
<> Input Python
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
model.classifier.out_proj
Copied
	
>_ Output
Linear(in_features=768, out_features=20, bias=True)

Now yes

Now we create a loss function

	
<> Input Python
loss_function = torch.nn.CrossEntropyLoss()
Copied

An optimizer

	
<> Input Python
from torch.optim import Adam
optimizer = Adam(model.parameters(), lr=5e-4)
Copied

And lastly, a metric

	
<> Input Python
import evaluate
metric = evaluate.load("accuracy")
Copied

Let's check that everything is fine with a sample.

	
<> Input Python
sample = next(iter(dataloader["train"]))
Copied
	
<> Input Python
sample["input_ids"].shape, sample["attention_mask"].shape
Copied
	
>_ Output
(torch.Size([64, 130]), torch.Size([64, 130]))

Now we feed that sample to the model

	
<> Input Python
model.to("cuda")
ouputs = model(input_ids=sample["input_ids"].to("cuda"), attention_mask=sample["attention_mask"].to("cuda"))
ouputs.logits.shape
Copied
	
>_ Output
torch.Size([64, 20])

We see that the model outputs 64 batches, which is fine because we set BS = 20 and each one with 20 outputs, which is good because we modified the model to have an output of 20 values.

We obtain the one with the highest value

	
<> Input Python
predictions = torch.argmax(ouputs.logits, axis=-1)
predictions.shape
Copied
	
>_ Output
torch.Size([64])

We obtain the loss

	
<> Input Python
loss = loss_function(ouputs.logits, sample["label"].to("cuda"))
loss.item()
Copied
	
>_ Output
2.9990389347076416

And the accuracy

	
<> Input Python
accuracy = metric.compute(predictions=predictions, references=sample["label"])["accuracy"]
accuracy
Copied
	
>_ Output
0.015625

We can now create a small training loop

	
<> Input Python
from fastprogress.fastprogress import master_bar, progress_bar
epochs = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
master_progress_bar = master_bar(range(epochs))
for i in master_progress_bar:
model.train()
progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"].to(device)
attention_mask = batch["attention_mask"].to(device)
labels = batch["label"].to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
master_progress_bar.child.comment = f'loss: {loss}'
loss.backward()
optimizer.step()
model.eval()
progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
for batch in progress_bar_validation:
input_ids = batch["input_ids"].to(device)
attention_mask = batch["attention_mask"].to(device)
labels = batch["label"].to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']} "
Copied
	
>_ Output
&lt;IPython.core.display.HTML object&gt;
	
>_ Output
&lt;IPython.core.display.HTML object&gt;

Script with the base codelink image 34

In most of the accelerate documentation, it is explained how to use accelerate with scripts, so for now we will do it this way and at the end we will explain how to do it with a notebook.

First, we are going to create a folder where we will save the scripts

	
<> Input Python
!mkdir accelerate_scripts
Copied

Now we write the base code in a script

	
<> Input Python
%%writefile accelerate_scripts/01_code_base.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 64
dataloader = {
"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
model.train()
progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"].to(device)
attention_mask = batch["attention_mask"].to(device)
labels = batch["label"].to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
master_progress_bar.child.comment = f'loss: {loss}'
loss.backward()
optimizer.step()
model.eval()
progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
for batch in progress_bar_validation:
input_ids = batch["input_ids"].to(device)
attention_mask = batch["attention_mask"].to(device)
labels = batch["label"].to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']} "
print(f"Accuracy = {accuracy['accuracy']}")
Copied
	
>_ Output
Overwriting accelerate_scripts/01_code_base.py

And now we run it

	
<> Input Python
%%time
!python accelerate_scripts/01_code_base.py
Copied
	
>_ Output
Accuracy = 0.2112
CPU times: user 2.12 s, sys: 391 ms, total: 2.51 s
Wall time: 3min 36s

We see that on my computer it took about 3 and a half minutes

Code with acceleratelink image 35

Now we replace some things

  • First, we import Accelerator and initialize it
from accelerate import Accelerator
accelerator = Accelerator()
  • We don't do the typical anymore

``` python

torch.device("cuda" if torch.cuda.is_available() else "cpu")

```

  • But we let accelerate choose the device.
device = accelerator.device
  • We pass the relevant elements for training through the prepare method and no longer do model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = prepare(model, optimizer, dataloader["train"], dataloader["validation"])
  • We no longer send the data and model to the GPU with .to(device) since accelerate has taken care of it with the prepare method.
  • Instead of performing backpropagation with loss.backward(), we let accelerate handle it with
accelerator.backward(loss)
  • When calculating the metric in the validation loop, we need to gather the values from all points, especially if we are doing distributed training, for this we do
predictions = accelerator.gather_for_metrics(predictions)
	
<> Input Python
%%writefile accelerate_scripts/02_accelerate_base_code.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 64
dataloader = {
"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
model.train()
progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
master_progress_bar.child.comment = f'loss: {loss}'
# loss.backward()
accelerator.backward(loss)
optimizer.step()
print(f"End of training epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")
model.eval()
progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
for batch in progress_bar_validation:
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
# Recopilamos las predicciones de todos los dispositivos
predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
print(f"End of validation epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")
master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']} "
print(f"Accuracy = {accuracy['accuracy']}")
Copied
	
>_ Output
Overwriting accelerate_scripts/02_accelerate_base_code.py

If you notice, I've added these two lines print(f"End of training epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}") and the line print(f"End of validation epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}"), I added them on purpose because they will reveal something very important.

Now let's run it. To execute the accelerate scripts, use the command accelerate launch

accelerate launch script.py
	
<> Input Python
%%time
!accelerate launch accelerate_scripts/02_accelerate_base_code.py
Copied
	
>_ Output
End of training epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([64])
End of training epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([64])
End of validation epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([8])
Accuracy = 0.206
End of validation epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([8])
Accuracy = 0.206
CPU times: user 1.6 s, sys: 272 ms, total: 1.88 s
Wall time: 2min 37s

We see that before it took about 3 and a half minutes, and now it takes around 2 and a half minutes. Quite an improvement. Additionally, if we look at the prints, we can see that they have been printed twice.

And how can this be? Well, because accelerate has parallelized the training across the two GPUs I have, so it was much faster.

Moreover, when I ran the first script, that is, when I didn't use accelerate, the GPU was almost full, while when I ran the second one, that is, the one that uses accelerate, both GPUs were barely utilized, so we can increase the batch size to try to fill both. Let's do it!

	
<> Input Python
%%writefile accelerate_scripts/03_accelerate_base_code_more_bs.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
model.train()
progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
master_progress_bar.child.comment = f'loss: {loss}'
# loss.backward()
accelerator.backward(loss)
optimizer.step()
model.eval()
progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
for batch in progress_bar_validation:
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
# Recopilamos las predicciones de todos los dispositivos
predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']} "
print(f"Accuracy = {accuracy['accuracy']}")
Copied
	
>_ Output
Overwriting accelerate_scripts/03_accelerate_base_code_more_bs.py

I have removed the extra prints, as we have already seen that the code is running on both GPUs, and I increased the batch size from 64 to 128. Let's run it and see.

	
<> Input Python
%%time
!accelerate launch accelerate_scripts/03_accelerate_base_code_more_bs.py
Copied
	
>_ Output
Accuracy = 0.1052
Accuracy = 0.1052
CPU times: user 1.41 s, sys: 180 ms, total: 1.59 s
Wall time: 2min 22s

Increasing the batch size has reduced the execution time by a few seconds.

Execution of processeslink image 36

Execution of code in a single processlink image 37

We have previously seen that the prints were printed twice, this is because accelerate creates as many processes as there are devices where the code is executed, in my case it creates two processes due to having two GPUs.

However, not all code should be executed in all processes, for example, the prints slow down the code a lot, especially if they are executed multiple times, if checkpoints are saved, they would be saved twice, etc.

To be able to execute part of the code in a single process, it has to be encapsulated in a function and decorated with accelerator.on_local_main_process. For example, in the following code you will see that I have created the following function

@accelerator.on_local_main_process
def print_something(something):
print(something)

Another option is to include the code within an if accelerator.is_local_main_process as in the following code

if accelerator.is_local_main_process:
print("Something")
	
<> Input Python
%%writefile accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
@accelerator.on_local_main_process
def print_something(something):
print(something)
master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
model.train()
progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
master_progress_bar.child.comment = f'loss: {loss}'
# loss.backward()
accelerator.backward(loss)
optimizer.step()
model.eval()
progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
for batch in progress_bar_validation:
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
# Recopilamos las predicciones de todos los dispositivos
predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']} "
# print(f"Accuracy = {accuracy['accuracy']}")
print_something(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
print(f"End of script with {accuracy['accuracy']} accuracy")
Copied
	
>_ Output
Overwriting accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py

Let's run it and see

	
<> Input Python
%%time
!accelerate launch accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py
Copied
	
>_ Output
Accuracy = 0.2098
End of script with 0.2098 accuracy
CPU times: user 1.38 s, sys: 197 ms, total: 1.58 s
Wall time: 2min 22s

Now the print has only been executed once

However, although not visible much, the progress bars run in each process.

I haven't found a way to avoid this with the progress bars from fastprogress, but I have with those from tqdm, so I'm going to replace the fastprogress progress bars with tqdm ones, and to make them run in a single process, you need to add the argument disable=not accelerator.is_local_main_process

	
<> Input Python
%%writefile accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
@accelerator.on_local_main_process
def print_something(something):
print(something)
for i in range(EPOCHS):
model.train()
# progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
# master_progress_bar.child.comment = f'loss: {loss}'
# loss.backward()
accelerator.backward(loss)
optimizer.step()
model.eval()
# progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_validation:
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
# Recopilamos las predicciones de todos los dispositivos
predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
# print(f"Accuracy = {accuracy['accuracy']}")
print_something(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
print(f"End of script with {accuracy['accuracy']} accuracy")
Copied
	
>_ Output
Overwriting accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py
	
<> Input Python
%%time
!accelerate launch accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py
Copied
	
>_ Output
100%|█████████████████████████████████████████| 176/176 [02:01&lt;00:00, 1.45it/s]
100%|███████████████████████████████████████████| 20/20 [00:06&lt;00:00, 3.30it/s]
Accuracy = 0.2166
End of script with 0.2166 accuracy
CPU times: user 1.33 s, sys: 195 ms, total: 1.52 s
Wall time: 2min 22s

We have shown an example of how to print in a single process, and this has been a way to execute processes in a single process. But if what you want is just to print in a single process, the print method from accelerate can be used. Let's see the same example as before with this method.

	
<> Input Python
%%writefile accelerate_scripts/06_accelerate_base_code_print_one_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
for i in range(EPOCHS):
model.train()
# progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
# master_progress_bar.child.comment = f'loss: {loss}'
# loss.backward()
accelerator.backward(loss)
optimizer.step()
model.eval()
# progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_validation:
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
# Recopilamos las predicciones de todos los dispositivos
predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
# print(f"Accuracy = {accuracy['accuracy']}")
accelerator.print(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
print(f"End of script with {accuracy['accuracy']} accuracy")
Copied
	
>_ Output
Writing accelerate_scripts/06_accelerate_base_code_print_one_process.py

We run it

	
<> Input Python
%%time
!accelerate launch accelerate_scripts/06_accelerate_base_code_print_one_process.py
Copied
	
>_ Output
Map: 100%|██████████████████████| 45000/45000 [00:02&lt;00:00, 15433.52 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 11406.61 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:02&lt;00:00, 15036.87 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 14932.76 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 14956.60 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:00&lt;00:00, 1.46it/s]
100%|███████████████████████████████████████████| 20/20 [00:05&lt;00:00, 3.33it/s]
Accuracy = 0.2134
End of script with 0.2134 accuracy
CPU times: user 1.4 s, sys: 189 ms, total: 1.59 s
Wall time: 2min 27s

Code Execution in All Processeslink image 38

However, there is code that must run in all processes, for example, if we upload the checkpoints to the hub, so here we have two options: wrap the code in a function and decorate it with accelerator.on_main_process

@accelerator.on_main_process
def do_my_thing():
"Something done once per server"
do_thing_once()

or put the code inside an if accelerator.is_main_process

if accelerator.is_main_process:
repo.push_to_hub()

Since we are only doing training to showcase the accelerate library and the model we are training is not good, it doesn't make sense to upload the checkpoints to the hub right now, so I will do an example with prints.

	
<> Input Python
%%writefile accelerate_scripts/07_accelerate_base_code_some_code_in_all_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
@accelerator.on_local_main_process
def print_in_one_process(something):
print(something)
@accelerator.on_main_process
def print_in_all_processes(something):
print(something)
for i in range(EPOCHS):
model.train()
progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
# loss.backward()
accelerator.backward(loss)
optimizer.step()
model.eval()
progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_validation:
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
# Recopilamos las predicciones de todos los dispositivos
predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
print_in_one_process(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
print(f"End of script with {accuracy['accuracy']} accuracy")
print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")
if accelerator.is_main_process:
print(f"All process: End of script with {accuracy['accuracy']} accuracy")
Copied
	
>_ Output
Overwriting accelerate_scripts/06_accelerate_base_code_some_code_in_all_process.py

Let's run it to see

	
<> Input Python
%%time
!accelerate launch accelerate_scripts/07_accelerate_base_code_some_code_in_all_process.py
Copied
	
>_ Output
Map: 100%|██████████████████████| 45000/45000 [00:03&lt;00:00, 14518.44 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:03&lt;00:00, 14368.77 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 16466.33 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 14806.14 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 14253.33 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 14337.07 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:00&lt;00:00, 1.46it/s]
100%|███████████████████████████████████████████| 20/20 [00:05&lt;00:00, 3.34it/s]
Accuracy = 0.2092
End of script with 0.2092 accuracy
All process: Accuracy = 0.2092
All process: End of script with 0.2092 accuracy
CPU times: user 1.42 s, sys: 216 ms, total: 1.64 s
Wall time: 2min 27s

Code Execution in Process Xlink image 39

Finally, we can specify in which process we want to execute code. For this, we need to create a function and decorate it with @accelerator.on_process(process_index=0)

@accelerator.on_process(process_index=0)
def do_my_thing():
"Something done on process index 0"
do_thing_on_index_zero()

or decorate it with @accelerator.on_local_process(local_process_idx=0)

@accelerator.on_local_process(local_process_index=0)def do_my_thing():
"Something done on process index 0 on each server"
do_thing_on_index_zero_on_each_server()

Here I have put process 0, but any number can be put

	
<> Input Python
%%writefile accelerate_scripts/08_accelerate_base_code_some_code_in_some_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
@accelerator.on_local_main_process
def print_in_one_process(something):
print(something)
@accelerator.on_main_process
def print_in_all_processes(something):
print(something)
@accelerator.on_process(process_index=0)
def print_in_process_0(something):
print("Process 0: " + something)
@accelerator.on_local_process(local_process_index=1)
def print_in_process_1(something):
print("Process 1: " + something)
for i in range(EPOCHS):
model.train()
progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
# loss.backward()
accelerator.backward(loss)
optimizer.step()
model.eval()
progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_validation:
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
# Recopilamos las predicciones de todos los dispositivos
predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
print_in_one_process(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
print(f"End of script with {accuracy['accuracy']} accuracy")
print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")
if accelerator.is_main_process:
print(f"All process: End of script with {accuracy['accuracy']} accuracy")
print_in_process_0("End of process 0")
print_in_process_1("End of process 1")
Copied
	
>_ Output
Overwriting accelerate_scripts/07_accelerate_base_code_some_code_in_some_process.py

We run it

	
<> Input Python
%%time
!accelerate launch accelerate_scripts/08_accelerate_base_code_some_code_in_some_process.py
Copied
	
>_ Output
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 15735.58 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 14906.20 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:02&lt;00:00, 1.44it/s]
100%|███████████████████████████████████████████| 20/20 [00:06&lt;00:00, 3.27it/s]
Process 1: End of process 1
Accuracy = 0.2128
End of script with 0.2128 accuracy
All process: Accuracy = 0.2128
All process: End of script with 0.2128 accuracy
Process 0: End of process 0
CPU times: user 1.42 s, sys: 295 ms, total: 1.71 s
Wall time: 2min 37s

Synchronizing Processeslink image 40

If we have code that needs to run on all processes, it's interesting to wait for it to finish on all processes before doing another task, so for this we use accelerator.wait_for_everyone()

To see it, we are going to introduce a delay in one of the print functions in a process.

I've also added a break in the training loop so that it doesn't train for too long, which isn't what we're interested in right now.

	
<> Input Python
%%writefile accelerate_scripts/09_accelerate_base_code_sync_all_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
import time
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
@accelerator.on_local_main_process
def print_in_one_process(something):
print(something)
@accelerator.on_main_process
def print_in_all_processes(something):
print(something)
@accelerator.on_process(process_index=0)
def print_in_process_0(something):
time.sleep(2)
print("Process 0: " + something)
@accelerator.on_local_process(local_process_index=1)
def print_in_process_1(something):
print("Process 1: " + something)
for i in range(EPOCHS):
model.train()
progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
# loss.backward()
accelerator.backward(loss)
optimizer.step()
break
model.eval()
progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_validation:
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
# Recopilamos las predicciones de todos los dispositivos
predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
print_in_one_process(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
print(f"End of script with {accuracy['accuracy']} accuracy")
print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")
if accelerator.is_main_process:
print(f"All process: End of script with {accuracy['accuracy']} accuracy")
print_in_one_process("Printing with delay in process 0")
print_in_process_0("End of process 0")
print_in_process_1("End of process 1")
accelerator.wait_for_everyone()
print_in_one_process("End of script")
Copied
	
>_ Output
Overwriting accelerate_scripts/08_accelerate_base_code_sync_all_process.py

We run it

	
<> Input Python
!accelerate launch accelerate_scripts/09_accelerate_base_code_sync_all_process.py
Copied
	
>_ Output
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 14218.23 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 14666.25 examples/s]
0%| | 0/176 [00:00&lt;?, ?it/s]
100%|███████████████████████████████████████████| 20/20 [00:05&lt;00:00, 3.58it/s]
Process 1: End of process 1
Accuracy = 0.212
End of script with 0.212 accuracy
All process: Accuracy = 0.212
All process: End of script with 0.212 accuracy
Printing with delay in process 0
Process 0: End of process 0
End of script

As can be seen, first Process 1: End of process 1 is printed and then the rest. This is because the remaining prints are either done in process 0 or in all processes, so until the 2-second delay we set is over, the rest of the code does not execute.

Save and load the state dictlink image 41

When we train, sometimes we save the state so we can continue at another time.

To save the state, we will have to use the methods save_state() and load_state()

	
<> Input Python
%%writefile accelerate_scripts/10_accelerate_save_and_load_checkpoints.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
@accelerator.on_local_main_process
def print_something(something):
print(something)
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
for i in range(EPOCHS):
model.train()
progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
# loss.backward()
accelerator.backward(loss)
optimizer.step()
model.eval()
progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_validation:
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
# Recopilamos las predicciones de todos los dispositivos
predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
# Guardamos los pesos
accelerator.save_state("accelerate_scripts/checkpoints")
print_something(f"Accuracy = {accuracy['accuracy']}")
# Cargamos los pesos
accelerator.load_state("accelerate_scripts/checkpoints")
Copied
	
>_ Output
Overwriting accelerate_scripts/09_accelerate_save_and_load_checkpoints.py

We run it

	
<> Input Python
!accelerate launch accelerate_scripts/10_accelerate_save_and_load_checkpoints.py
Copied
	
>_ Output
100%|█████████████████████████████████████████| 176/176 [01:58&lt;00:00, 1.48it/s]
100%|███████████████████████████████████████████| 20/20 [00:05&lt;00:00, 3.40it/s]
Accuracy = 0.2142

Save the modellink image 42

When the prepare method was used, the model was wrapped to be able to save it to the necessary devices. Therefore, when saving it, we need to use the save_model method, which first unwraps it and then saves it. Additionally, if we use the parameter safe_serialization=True, the model will be saved as a safe tensor.

	
<> Input Python
%%writefile accelerate_scripts/11_accelerate_save_model.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
@accelerator.on_local_main_process
def print_something(something):
print(something)
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
for i in range(EPOCHS):
model.train()
progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
# loss.backward()
accelerator.backward(loss)
optimizer.step()
model.eval()
progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_validation:
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
# Recopilamos las predicciones de todos los dispositivos
predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
# Guardamos el modelo
accelerator.wait_for_everyone()
accelerator.save_model(model, "accelerate_scripts/model", safe_serialization=True)
print_something(f"Accuracy = {accuracy['accuracy']}")
Copied
	
>_ Output
Writing accelerate_scripts/11_accelerate_save_model.py

We run it

	
<> Input Python
!accelerate launch accelerate_scripts/11_accelerate_save_model.py
Copied
	
>_ Output
100%|█████████████████████████████████████████| 176/176 [01:58&lt;00:00, 1.48it/s]
100%|███████████████████████████████████████████| 20/20 [00:05&lt;00:00, 3.35it/s]
Accuracy = 0.214

Save the pretrained modellink image 43

In models that use the transformers library, we must save the model using the save_pretrained method to be able to load it with the from_pretrained method. Before saving it, you need to unwrap it using the unwrap_model method.

	
<> Input Python
%%writefile accelerate_scripts/12_accelerate_save_pretrained.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
@accelerator.on_local_main_process
def print_something(something):
print(something)
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
for i in range(EPOCHS):
model.train()
progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
# loss.backward()
accelerator.backward(loss)
optimizer.step()
model.eval()
progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_validation:
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
# Recopilamos las predicciones de todos los dispositivos
predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
# Guardamos el modelo pretrained
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
"accelerate_scripts/model_pretrained",
is_main_process=accelerator.is_main_process,
save_function=accelerator.save,
)
print_something(f"Accuracy = {accuracy['accuracy']}")
Copied
	
>_ Output
Writing accelerate_scripts/11_accelerate_save_pretrained.py

We run it

	
<> Input Python
!accelerate launch accelerate_scripts/12_accelerate_save_pretrained.py
Copied
	
>_ Output
Map: 100%|██████████████████████| 45000/45000 [00:02&lt;00:00, 15152.47 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:02&lt;00:00, 15119.13 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 12724.70 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 12397.49 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 15247.21 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 15138.03 examples/s]
100%|█████████████████████████████████████████| 176/176 [01:59&lt;00:00, 1.48it/s]
100%|███████████████████████████████████████████| 20/20 [00:05&lt;00:00, 3.37it/s]
Accuracy = 0.21

Now we could load it

	
<> Input Python
from transformers import AutoModel
checkpoints = "accelerate_scripts/model_pretrained"
tokenizer = AutoModel.from_pretrained(checkpoints)
Copied
	
>_ Output
Some weights of RobertaModel were not initialized from the model checkpoint at accelerate_scripts/model_pretrained and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Training in notebookslink image 44

So far we have seen how to run scripts, but if you want to run the code in a notebook, we can write the same code as before, but encapsulated in a function

First we import the libraries

	
<> Input Python
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
import time
# from accelerate import Accelerator
Copied

Now we create the function

	
<> Input Python
def train_code(batch_size: int = 64):
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = batch_size
dataloader = {
"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
for i in range(EPOCHS):
model.train()
progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
# loss.backward()
accelerator.backward(loss)
optimizer.step()
model.eval()
progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_validation:
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
# Recopilamos las predicciones de todos los dispositivos
predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
accelerator.print(f"Accuracy = {accuracy['accuracy']}")
Copied

To run the training in the notebook, we use the notebook_launcher function, to which we pass the function we want to execute, the arguments of that function, and the number of GPUs on which we will train using the num_processes variable.

	
<> Input Python
from accelerate import notebook_launcher
args = (128,)
notebook_launcher(train_code, args, num_processes=2)
Copied
	
>_ Output
Launching training on 2 GPUs.
	
>_ Output
100%|██████████| 176/176 [02:01&lt;00:00, 1.45it/s]
100%|██████████| 20/20 [00:06&lt;00:00, 3.31it/s]
	
>_ Output
Accuracy = 0.2112

Training in FP16link image 45

When we first set up accelerate, it asked us Do you wish to use FP16 or BF16 (mixed precision)? and we said no, so now we are going to tell it yes, that we want FP16.

So far we have trained in FP32, which means that each weight of the model is a 32-bit floating-point number, and now we are going to use a 16-bit floating-point number, meaning the model will take up less space. As a result, two things will happen: we will be able to use a larger batch size, and it will also be faster.

First we run accelerate config again and tell it that we want FP16

	
<> Input Python
!accelerate config
Copied
	
>_ Output
--------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
--------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
fp16
accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml

Now we create a script to train, with the same batch size as before, to see if it takes less time to train

	
<> Input Python
%%writefile accelerate_scripts/13_accelerate_base_code_fp16_bs128.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
for i in range(EPOCHS):
model.train()
progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
# loss.backward()
accelerator.backward(loss)
optimizer.step()
model.eval()
progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_validation:
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
# Recopilamos las predicciones de todos los dispositivos
predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
accelerator.print(f"Accuracy = {accuracy['accuracy']}")
Copied
	
>_ Output
Overwriting accelerate_scripts/12_accelerate_base_code_fp16_bs128.py

Let's run it to see how long it takes

	
<> Input Python
%%time
!accelerate launch accelerate_scripts/13_accelerate_base_code_fp16_bs128.py
Copied
	
>_ Output
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 14983.76 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 14315.47 examples/s]
100%|█████████████████████████████████████████| 176/176 [01:01&lt;00:00, 2.88it/s]
100%|███████████████████████████████████████████| 20/20 [00:02&lt;00:00, 6.84it/s]
Accuracy = 0.2094
CPU times: user 812 ms, sys: 163 ms, total: 976 ms
Wall time: 1min 27s

When we ran this training in FP32 it took about 2 and a half minutes, and now roughly 1 and a half minutes. Let's see if instead of training with a batch size of 128, we do it with one of 256.

	
<> Input Python
%%writefile accelerate_scripts/14_accelerate_base_code_fp16_bs256.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 256
dataloader = {
"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
for i in range(EPOCHS):
model.train()
progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
# loss.backward()
accelerator.backward(loss)
optimizer.step()
model.eval()
progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_validation:
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
# Recopilamos las predicciones de todos los dispositivos
predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
accelerator.print(f"Accuracy = {accuracy['accuracy']}")
Copied
	
>_ Output
Overwriting accelerate_scripts/13_accelerate_base_code_fp16_bs256.py

We run it

	
<> Input Python
%%time
!accelerate launch accelerate_scripts/14_accelerate_base_code_fp16_bs256.py
Copied
	
>_ Output
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 15390.30 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00&lt;00:00, 14990.92 examples/s]
100%|███████████████████████████████████████████| 88/88 [00:54&lt;00:00, 1.62it/s]
100%|███████████████████████████████████████████| 10/10 [00:02&lt;00:00, 3.45it/s]
Accuracy = 0.2236
CPU times: user 670 ms, sys: 91.6 ms, total: 761 ms
Wall time: 1min 12s

It has only dropped by about 15 seconds

Training in BF16link image 46

We have previously trained in FP16 and now we are going to do it in BF16, what is the difference?

FP32_FP16_BF16

As we can see in the image, while FP16 compared to FP32 has fewer bits in the mantissa and the exponent, which makes its range much smaller, BF16 compared to FP32 has the same number of bits for the exponent but fewer in the mantissa, which means that BF16 has the same range of numbers as FP32, but is less precise.

This is beneficial because in FP16 some calculations could result in very high numbers, which cannot be represented in the FP16 format. Additionally, there are certain HW devices that are optimized for this format.

Just like before, we run accelerate config and indicate that we want BF16

	
<> Input Python
!accelerate config
Copied
	
>_ Output
--------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
--------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16
accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml

Now we run the last script we created, that is, with a batch size of 256

	
<> Input Python
%%time
!accelerate launch accelerate_scripts/14_accelerate_base_code_fp16_bs256.py
Copied
	
>_ Output
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 14814.95 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03&lt;00:00, 14506.83 examples/s]
100%|███████████████████████████████████████████| 88/88 [00:51&lt;00:00, 1.70it/s]
100%|███████████████████████████████████████████| 10/10 [00:03&lt;00:00, 3.21it/s]
Accuracy = 0.2112
CPU times: user 688 ms, sys: 144 ms, total: 832 ms
Wall time: 1min 17s

It took a similar amount of time as before, which is normal since we trained a model with 16-bit weights, just like before.

Training in FP8link image 47

Now we are going to train in FP8 format, which, as the name suggests, is a floating-point format where each weight has 8 bits, so we run accelerate config to tell it that we want FP8

	
<> Input Python
!accelerate config
Copied
	
>_ Output
--------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
--------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
fp8
accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml

Now we run the last script, the one with a batch size of 256

	
<> Input Python
%%time
!accelerate launch accelerate_scripts/14_accelerate_base_code_fp16_bs256.py
Copied
	
>_ Output
Traceback (most recent call last):
File "/home/wallabot/Documentos/web/portafolio/posts/accelerate_scripts/13_accelerate_base_code_fp16_bs256.py", line 12, in &lt;module&gt;
accelerator = Accelerator()
^^^^^^^^^^^^^
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/accelerator.py", line 371, in __init__
self.state = AcceleratorState(
^^^^^^^^^^^^^^^^^
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/state.py", line 790, in __init__
raise ValueError(
ValueError: Using `fp8` precision requires `transformer_engine` or `MS-AMP` to be installed.
Traceback (most recent call last):
File "/home/wallabot/Documentos/web/portafolio/posts/accelerate_scripts/13_accelerate_base_code_fp16_bs256.py", line 12, in &lt;module&gt;
accelerator = Accelerator()
^^^^^^^^^^^^^
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/accelerator.py", line 371, in __init__
self.state = AcceleratorState(
^^^^^^^^^^^^^^^^^
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/state.py", line 790, in __init__
raise ValueError(
ValueError: Using `fp8` precision requires `transformer_engine` or `MS-AMP` to be installed.
[2024-05-13 21:40:56,455] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 501480) of binary: /home/wallabot/miniconda3/envs/nlp/bin/python
Traceback (most recent call last):
File "/home/wallabot/miniconda3/envs/nlp/bin/accelerate", line 8, in &lt;module&gt;
sys.exit(main())
^^^^^^
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1048, in launch_command
multi_gpu_launcher(args)
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/commands/launch.py", line 702, in multi_gpu_launcher
distrib_run.run(args)
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
accelerate_scripts/13_accelerate_base_code_fp16_bs256.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-05-13_21:40:56
host : wallabot
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 501481)
error_file: &lt;N/A&gt;
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-05-13_21:40:56
host : wallabot
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 501480)
error_file: &lt;N/A&gt;
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
CPU times: user 65.1 ms, sys: 14.5 ms, total: 79.6 ms
Wall time: 7.24 s

Since the weights are now 8-bit and take up half the memory, we will increase the batch size to 512.

	
<> Input Python
%%writefile accelerate_scripts/15_accelerate_base_code_fp8_bs512.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Importamos e inicializamos Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 512
dataloader = {
"train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
"validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
"test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
for i in range(EPOCHS):
model.train()
progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_train:
optimizer.zero_grad()
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
loss = loss_function(outputs['logits'], labels)
# loss.backward()
accelerator.backward(loss)
optimizer.step()
model.eval()
progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
for batch in progress_bar_validation:
input_ids = batch["input_ids"]#.to(device)
attention_mask = batch["attention_mask"]#.to(device)
labels = batch["label"]#.to(device)
with torch.no_grad():
outputs = model(input_ids=input_ids, attention_mask=attention_mask)
predictions = torch.argmax(outputs['logits'], axis=-1)
# Recopilamos las predicciones de todos los dispositivos
predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)
accuracy = metric.add_batch(predictions=predictions, references=labels)
accuracy = metric.compute()
accelerator.print(f"Accuracy = {accuracy['accuracy']}")
Copied
	
>_ Output
Writing accelerate_scripts/15_accelerate_base_code_fp8_bs512.py

We run it

	
<> Input Python
%%time
!accelerate launch accelerate_scripts/15_accelerate_base_code_fp8_bs512.py
Copied

Model Inferencelink image 48

Usage of the Hugging Face Ecosystemlink image 49

Let's see how to perform inference with large models using the transformers library from Hugging Face.

Inference with pipelinelink image 50

If we use the Hugging Face ecosystem, it's very simple, as everything happens under the hood without us having to do much. In the case of using pipeline, which is the easiest way to perform inference with the transformers library, we just need to specify the model we want to use and, importantly, pass device_map="auto". This will make accelerate distribute the model across different GPUs, CPU RAM, or hard drive if necessary.

There are more possible values for device_map, which we will see later, but for now stick with "auto".

We are going to use the Llama3 8B model, which, as its name suggests, is a model with around 8 billion parameters. Since each parameter by default is in FP32 format, which corresponds to 4 bytes (32 bits), this means that if we multiply 8 billion parameters by 4 bytes, we get that it would require a GPU with around 32 GB of VRAM.

In my case, I have 2 GPUs with 24 GB of VRAM each, so it wouldn't fit into a single GPU. But thanks to setting device_map="auto", accelerate will distribute the model across both GPUs and I will be able to perform inference.

	
<> Input Python
%%writefile accelerate_scripts/16_inference_with_pipeline.py
from transformers import pipeline
checkpoints = "meta-llama/Meta-Llama-3-8B-Instruct"
generator = pipeline(model=checkpoints, device_map="auto")
prompt = "Conoces accelerate de hugging face?"
output = generator(prompt)
print(output)
Copied
	
>_ Output
Overwriting accelerate_scripts/09_inference_with_pipeline.py

Now we run it, only as the pipeline uses accelerate under the hood, we don't need to run it with accelerate launch script.py but instead with python script.py.

	
<> Input Python
!python accelerate_scripts/16_inference_with_pipeline.py
Copied
	
>_ Output
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:09&lt;00:00, 2.27s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
[{'generated_text': 'Conoces accelerate de hugging face? ¿Qué es el modelo de lenguaje de transformers y cómo se utiliza en el marco de hugging face? ¿Cómo puedo utilizar modelos de lenguaje de transformers en mi aplicación? ¿Qué son los tokenizers y cómo se utilizan en el marco de hugging face? ¿Cómo puedo crear un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los datasets y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar datasets para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar finetuning para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los checkpoints y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar checkpoints para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los evaluadores y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar evaluadores para evaluar el rendimiento de un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los pre-trainados y cómo se utilizan en el marco de hugging face? ¿Cómo puedo utilizar pre-trainados para entrenar un modelo de lenguaje personalizado utilizando transformers y hugging face? ¿Qué son los finetuning'}]

As can be seen, it has not responded, but has continued to ask questions. This is because Llama3 is a language model that predicts the next token, so with the prompt I passed it, it considered that the next best tokens were ones that correspond to more questions. This makes sense because sometimes people have doubts about a topic and generate many questions, so to get it to answer the question, we need to condition it a bit.

	
<> Input Python
%%writefile accelerate_scripts/17_inference_with_pipeline_condition.py
from transformers import pipeline
checkpoints = "meta-llama/Meta-Llama-3-8B-Instruct"
generator = pipeline(model=checkpoints, device_map="auto")
prompt = "Conoces accelerate de hugging face?"
messages = [
{
"role": "system",
"content": "Eres un chatbot amigable que siempre intenta solucionar las dudas",
},
{"role": "user", "content": f"{prompt}"},
]
output = generator(messages)
print(output[0]['generated_text'][-1])
Copied
	
>_ Output
Overwriting accelerate_scripts/10_inference_with_pipeline_condition.py

As you can see, a message has been generated with roles, conditioning the model and with the prompt

	
<> Input Python
!python accelerate_scripts/17_inference_with_pipeline_condition.py
Copied
	
>_ Output
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:09&lt;00:00, 2.41s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
{'role': 'assistant', 'content': '¡Hola! Sí, conozco Accelerate de Hugging Face. Accelerate es una biblioteca de Python desarrollada por Hugging Face que se enfoca en simplificar y acelerar el entrenamiento y la evaluación de modelos de lenguaje en diferentes dispositivos y entornos. Con Accelerate, puedes entrenar modelos de lenguaje en diferentes plataformas y dispositivos, como GPUs, TPUs, CPUs y servidores, sin necesidad de cambiar el código de tu modelo. Esto te permite aprovechar al máximo la potencia de cálculo de tus dispositivos y reducir el tiempo de entrenamiento. Accelerate también ofrece varias características adicionales, como: * Soporte para diferentes frameworks de machine learning, como TensorFlow, PyTorch y JAX. * Integración con diferentes sistemas de almacenamiento y procesamiento de datos, como Amazon S3 y Google Cloud Storage. * Soporte para diferentes protocolos de comunicación, como HTTP y gRPC. * Herramientas para monitorear y depurar tus modelos en tiempo real. En resumen, Accelerate es una herramienta muy útil para desarrolladores de modelos de lenguaje que buscan simplificar y acelerar el proceso de entrenamiento y evaluación de sus modelos. ¿Tienes alguna pregunta específica sobre Accelerate o necesitas ayuda para implementarlo en tu proyecto?'}

Now the response does answer our prompt

Inference with AutoClasslink image 51

Lastly, we will see how to perform inference using only AutoClass.

	
<> Input Python
%%writefile accelerate_scripts/18_inference_with_autoclass.py
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
checkpoints = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoints, device_map="auto")
model = AutoModelForCausalLM.from_pretrained(checkpoints, device_map="auto")
streamer = TextStreamer(tokenizer)
prompt = "Conoces accelerate de hugging face?"
tokens_input = tokenizer([prompt], return_tensors="pt").to(model.device)
_ = model.generate(**tokens_input, streamer=streamer, max_new_tokens=500, do_sample=True, top_k=50, top_p=0.95, temperature=0.7)
Copied
	
>_ Output
Overwriting accelerate_scripts/11_inference_with_autoclass.py

As can be seen, the streamer object has been created and is then passed to the model's generate method. This is useful for printing each word as it is generated, rather than waiting for the entire output to be generated before printing it.

	
<> Input Python
!python accelerate_scripts/18_inference_with_autoclass.py
Copied
	
>_ Output
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████| 4/4 [00:09&lt;00:00, 2.28s/it]
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
&lt;|begin_of_text|&gt;Conoces accelerate de hugging face? Si es así, puedes utilizar la biblioteca `transformers` de Hugging Face para crear un modelo de lenguaje que pueda predecir la siguiente palabra en una secuencia de texto.
Aquí te muestro un ejemplo de cómo hacerlo:
```
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Cargar el modelo y el tokenizador
model_name = "bert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Cargar el conjunto de datos
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
# Preprocesar los datos
train_texts = train_df["text"]
train_labels = train_df["label"]
test_texts = test_df["text"]
# Convertir los textos en entradas para el modelo
train_encodings = tokenizer.batch_encode_plus(train_texts,
add_special_tokens=True,
max_length=512,
return_attention_mask=True,
return_tensors='pt')
test_encodings = tokenizer.batch_encode_plus(test_texts,
add_special_tokens=True,
max_length=512,
return_attention_mask=True,
return_tensors='pt')
# Crear un dataloader para entrenar el modelo
train_dataset = torch.utils.data.TensorDataset(train_encodings["input_ids"],
train_encodings["attention_mask"],
torch.tensor(train_labels))
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
# Entrenar el modelo
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
for epoch in range(5):
model.train()
total_loss = 0
for batch in train_loader:
input_ids = batch[0].to(device)
attention_mask = batch[1].to(device)
labels = batch[2].to(device)
optimizer.zero_grad()
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()
total_loss += loss.item()
print(f"Epoch {epoch+1}, Loss: {total

Use PyTorchlink image 52

Normally, the way to make inferences with PyTorch is to create a model with randomly initialized weights and then load a state dict with the pretrained model's weights. So, to get that state dict, we'll first take a small shortcut and download it.

	
<> Input Python
import torch
import torchvision.models as models
model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
torch.save(model.state_dict(), 'accelerate_scripts/resnet152_pretrained.pth')
Copied
	
>_ Output
Downloading: "https://download.pytorch.org/models/resnet152-394f9c45.pth" to /home/maximo.fernandez/.cache/torch/hub/checkpoints/resnet152-394f9c45.pth
100%|██████████| 230M/230M [02:48&lt;00:00, 1.43MB/s]

Now that we have the state dict, we are going to perform inference as it is typically done in PyTorch.

	
<> Input Python
import torch
import torchvision.models as models
device = "cuda" if torch.cuda.is_available() else "cpu" # Set device
resnet152 = models.resnet152().to(device) # Create model with random weights and move to device
state_dict = torch.load('accelerate_scripts/resnet152_pretrained.pth', map_location=device) # Load pretrained weights into device memory
resnet152.load_state_dict(state_dict) # Load this weights into the model
input = torch.rand(1, 3, 224, 224).to(device) # Random image with batch size 1
output = resnet152(input)
output.shape
Copied
	
>_ Output
torch.Size([1, 1000])

Let's explain what has happened

  • When we did resnet152 = models.resnet152().to(device), a ResNet152 with random weights was loaded into the GPU memory.
  • When we did state_dict = torch.load('accelerate_scripts/resnet152_pretrained.pth', map_location=device), a dictionary with the trained weights was loaded into the GPU memory.
  • When we did resnet152.load_state_dict(state_dict), those pretrained weights were assigned to the model.

That is, the model has been loaded twice into the GPU memory.

You might be wondering why we did first

model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
torch.save(model.state_dict(), 'accelerate_scripts/resnet152_pretrained.pth')

To do later

resnet152 = models.resnet152().to(device)
state_dict = torch.load('accelerate_scripts/resnet152_pretrained.pth', map_location=device)
resnet152.load_state_dict(state_dict)

And why don't we use directly

model = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)

And we stop saving the state dict to load it later. Well, because Pytorch does the same thing under the hood. So, to see the whole process, we have broken down into several lines what Pytorch does in one.

This way of working has been effective so far, while models were manageable by user GPUs. But since the arrival of LLMs, this approach no longer makes sense.

For example, a 6B parameter model would occupy 24 GB in memory, and since it is loaded twice with this way of working, you would need a 48 GB GPU.

So to fix this, the way to load a pretrained model in PyTorch is:

  • Create an empty model with init_empty_weights that will not occupy RAM memory* Then load the weights with load_checkpoint_and_dispatch which will load a checkpoint into the empty model and distribute the weights for each layer across all available devices (GPU, CPU, RAM, and hard drive), thanks to setting device_map="auto"
	
<> Input Python
import torch
import torchvision.models as models
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
with init_empty_weights():
resnet152 = models.resnet152()
resnet152 = load_checkpoint_and_dispatch(resnet152, checkpoint='accelerate_scripts/resnet152_pretrained.pth', device_map="auto")
device = "cuda" if torch.cuda.is_available() else "cpu"
input = torch.rand(1, 3, 224, 224).to(device) # Random image with batch size 1
output = resnet152(input)
output.shape
Copied
	
>_ Output
torch.Size([1, 1000])

How accelerate works under the hoodlink image 53

In this video you can see graphically how accelerate works under the hood

Initialization of an empty modellink image 54

Accelerate creates the skeleton of an empty model using init_empty_weights so that it occupies as little memory as possible.

For example, let's see how much RAM I have available on my computer now.

	
<> Input Python
import psutil
def get_ram_info():
ram = dict(psutil.virtual_memory()._asdict())
print(f"Total RAM: {(ram['total']/1024/1024/1024):.2f} GB, Available RAM: {(ram['available']/1024/1024/1024):.2f} GB, Used RAM: {(ram['used']/1024/1024/1024):.2f} GB")
get_ram_info()
Copied
	
>_ Output
Total RAM: 31.24 GB, Available RAM: 22.62 GB, Used RAM: 7.82 GB

I have about 22 GB of RAM available.

Now let's try to create a model with 5000x1000x1000 parameters, that is, 5B parameters. If each parameter is in FP32, it would require 20 GB of RAM.

	
<> Input Python
import torch
from torch import nn
model = nn.Sequential(*[nn.Linear(5000, 1000) for _ in range(1000)])
Copied

If we look at the RAM again

	
<> Input Python
get_ram_info()
Copied
	
>_ Output
Total RAM: 31.24 GB, Available RAM: 3.77 GB, Used RAM: 26.70 GB

As we can see, now we only have 3 GB of RAM available

Now we are going to delete the model to free up RAM

	
<> Input Python
del model
get_ram_info()
Copied
	
>_ Output
Total RAM: 31.24 GB, Available RAM: 22.44 GB, Used RAM: 8.03 GB

We have about 22 GB of RAM available again.

Let's now use init_empty_weights from accelerate and then check the RAM

	
<> Input Python
from accelerate import init_empty_weights
with init_empty_weights():
model = nn.Sequential(*[nn.Linear(5000, 1000) for _ in range(1000)])
get_ram_info()
Copied
	
>_ Output
Total RAM: 31.24 GB, Available RAM: 22.32 GB, Used RAM: 8.16 GB

We previously had exactly 22.44 GB free, and after creating the model with init_empty_weights we have 22.32 GB. The RAM savings are huge! Almost no RAM was used to create the model.

This is based on the meta-device introduced in PyTorch 1.9, so it is important that to use accelerate we have a version of Pytorch later than that.

Loading the weightslink image 55

Once we have initialized the model, we need to load its weights, which we do through load_checkpoint_and_dispatch, which, as its name suggests, loads the weights and dispatches them to the necessary device or devices.

Continue reading

Last posts -->

Have you seen these projects?

Gymnasia

Gymnasia Gymnasia
React Native
Expo
TypeScript
FastAPI
Next.js
OpenAI
Anthropic

Mobile personal training app with AI assistant, exercise library, workout tracking, diet and body measurements

Horeca chatbot

Horeca chatbot Horeca chatbot
Python
LangChain
PostgreSQL
PGVector
React
Kubernetes
Docker
GitHub Actions

Chatbot conversational for cooks of hotels and restaurants. A cook, kitchen manager or room service of a hotel or restaurant can talk to the chatbot to get information about recipes and menus. But it also implements agents, with which it can edit or create new recipes or menus

View all projects -->
>_ Available for projects

Do you have an AI project?

Let's talk.

maximofn@gmail.com

Machine Learning and AI specialist. I develop solutions with generative AI, intelligent agents and custom models.

Do you want to watch any talk?

Last talks -->

Do you want to improve with these tips?

Last tips -->

Use this locally

Hugging Face spaces allow us to run models with very simple demos, but what if the demo breaks? Or if the user deletes it? That's why I've created docker containers with some interesting spaces, to be able to use them locally, whatever happens. In fact, if you click on any project view button, it may take you to a space that doesn't work.

Flow edit

Flow edit Flow edit

FLUX.1-RealismLora

FLUX.1-RealismLora FLUX.1-RealismLora
View all containers -->
>_ Available for projects

Do you have an AI project?

Let's talk.

maximofn@gmail.com

Machine Learning and AI specialist. I develop solutions with generative AI, intelligent agents and custom models.

Do you want to train your model with these datasets?

short-jokes-dataset

HuggingFace

Dataset with jokes in English

Use: Fine-tuning text generation models for humor

231K rows 2 columns 45 MB
View on HuggingFace →

opus100

HuggingFace

Dataset with translations from English to Spanish

Use: Training English-Spanish translation models

1M rows 2 columns 210 MB
View on HuggingFace →

netflix_titles

HuggingFace

Dataset with Netflix movies and series

Use: Netflix catalog analysis and recommendation systems

8.8K rows 12 columns 3.5 MB
View on HuggingFace →
View more datasets -->