Hugging Face Accelerate: Training Models on GPU/TPU (1/2)

**Hugging Face Accelerate series**

  1. 👉 Installation, configuration and training (you are here)
  2. Saving, mixed precision and inference

Accelerate is a Hugging Face library that lets you run the same PyTorch code on any distributed setup by adding only four lines of code.

Installation

To install accelerate with pip, simply run:

pip install accelerate

And with conda:

conda install -c conda-forge accelerate

Configuration

In every environment where accelerate is installed, the first thing to do is configure it. To do so, we run the following in a terminal:

accelerate config
	
< > Input
Python
!accelerate config
>_ Output
			
--------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
--------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
no
accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml

In my case, the answers were:

  • In which compute environment are you running?
    • [x] "This machine"
    • [_] "AWS (Amazon SageMaker)"

> I want to configure it on my own computer

  • Which type of machine are you using?
    • [_] multi-CPU
    • [_] multi-XPU
    • [x] multi-GPU
    • [_] multi-NPU
    • [_] TPU

> Since I have 2 GPUs and want to run distributed code on them, I choose `multi-GPU`

  • How many different machines will you use (use more than 1 for multi-node training)? [1]:
    • 1

> I choose `1` because I am only going to run on my computer

  • Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]:
    • no

> With this option, `accelerate` can check for errors during execution, but that would make it slower, so I choose `no`, and if errors appear I will change it to `yes`

  • Do you wish to optimize your script with torch dynamo?[yes/NO]:
    • no
  • Do you want to use DeepSpeed? [yes/NO]:
    • no
  • Do you want to use FullyShardedDataParallel? [yes/NO]:
    • no
  • Do you want to use Megatron-LM ? [yes/NO]:
    • no
  • How many GPU(s) should be used for distributed training? [1]:
    • 2

> I choose `2` because I have 2 GPUs

  • What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]:
    • 0,1

> I choose `0,1` because I want to use both GPUs

  • Do you wish to use FP16 or BF16 (mixed precision)?
    • [x] no
    • [_] fp16
    • [_] bf16
    • [_] fp8

> For now I choose `no` because, to keep the code simple when I am not using `accelerate`, we are going to train in fp32, although ideally we would use fp16

The configuration is saved in ~/.cache/huggingface/accelerate/default_config.yaml and can be modified at any time. Let's see what is inside

	
< > Input
Python
!cat ~/.cache/huggingface/accelerate/default_config.yaml
>_ Output
			
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
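
Since the file is made of flat `key: value` pairs, it can also be inspected from Python. The following is a minimal sketch that uses only the standard library and assumes the flat layout shown above; `parse_flat_yaml` is a hypothetical helper, not part of accelerate, and a real YAML parser would be preferable for anything more complex.

```python
from pathlib import Path

# Default location reported by `accelerate config` (see above).
CONFIG_PATH = Path.home() / ".cache/huggingface/accelerate/default_config.yaml"


def parse_flat_yaml(text):
    """Parse flat `key: value` lines into a dict of strings (hypothetical helper)."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not a key/value pair.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip().strip("'\"")
    return config


# A few of the lines shown above, used here so the sketch runs without the file.
example = """
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1
num_processes: 2
"""
config = parse_flat_yaml(example)
print(config["distributed_type"], config["num_processes"])
```

To inspect the real file, the same helper could be called as `parse_flat_yaml(CONFIG_PATH.read_text())`, provided the file exists on the machine.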

Another way to see the current configuration is to run the following in a terminal:

accelerate env
	
< > Input
Python
!accelerate env
>_ Output
			
Copy-and-paste the text below in your GitHub issue
- `Accelerate` version: 0.28.0
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.31
- Python version: 3.11.8
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 31.24 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: fp16
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- gpu_ids: 0,1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []

Once accelerate is configured, we can check that everything is set up correctly by running the following in a terminal:

accelerate test
	
< > Input
Python
!accelerate test
>_ Output
			
Running: accelerate-launch ~/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: DistributedType.MULTI_GPU Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout:
stdout: Mixed precision type: fp16
stdout:
stdout: Distributed environment: DistributedType.MULTI_GPU Backend: nccl
stdout: Num processes: 2
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout:
stdout: Mixed precision type: fp16
stdout:
stdout:
...
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Keep fp32 wrapper check.
stdout: Keep fp32 wrapper check.
stdout: BF16 training check.
stdout: BF16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout:
stdout: **Breakpoint trigger test**
Test is a success! You are ready for your distributed training!

We see that it ends with Test is a success! You are ready for your distributed training!, so everything is correct.

Training

Training optimization

Base code

First we are going to write a base training script, and then we will optimize it to see how it is done and how much it improves

First we need a dataset. In my case I am going to use tweet_eval, a tweet classification dataset; specifically, I am going to download the emoji subset, which labels each tweet with an emoji

	
< > Input
Python
from datasets import load_dataset
dataset = load_dataset("tweet_eval", "emoji")
dataset
>_ Output
			
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 45000
})
test: Dataset({
features: ['text', 'label'],
num_rows: 50000
})
validation: Dataset({
features: ['text', 'label'],
num_rows: 5000
})
})
	
< > Input
Python
dataset["train"].info
>_ Output
			
DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜'], id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='parquet', dataset_name='tweet_eval', config_name='emoji', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=3808792, num_examples=45000, shard_lengths=None, dataset_name='tweet_eval'), 'test': SplitInfo(name='test', num_bytes=4262151, num_examples=50000, shard_lengths=None, dataset_name='tweet_eval'), 'validation': SplitInfo(name='validation', num_bytes=396704, num_examples=5000, shard_lengths=None, dataset_name='tweet_eval')}, download_checksums={'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/train-00000-of-00001.parquet': {'num_bytes': 2609973, 'checksum': None}, 'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/test-00000-of-00001.parquet': {'num_bytes': 3047341, 'checksum': None}, 'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/validation-00000-of-00001.parquet': {'num_bytes': 281994, 'checksum': None}}, download_size=5939308, post_processing_size=None, dataset_size=8467647, size_in_bytes=14406955)

Let's look at the classes

	
< > Input
Python
print(dataset["train"].info.features["label"].names)
>_ Output
			
['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜']

And the number of classes

	
< > Input
Python
num_classes = len(dataset["train"].info.features["label"].names)
num_classes
>_ Output
			
20

We see that the dataset has 20 classes

Let's look at the maximum text length, in characters, of each split

	
< > Input
Python
max_len_train = 0
max_len_val = 0
max_len_test = 0

split = "train"
for i in range(len(dataset[split])):
    len_i = len(dataset[split][i]["text"])
    if len_i > max_len_train:
        max_len_train = len_i

split = "validation"
for i in range(len(dataset[split])):
    len_i = len(dataset[split][i]["text"])
    if len_i > max_len_val:
        max_len_val = len_i

split = "test"
for i in range(len(dataset[split])):
    len_i = len(dataset[split][i]["text"])
    if len_i > max_len_test:
        max_len_test = len_i

max_len_train, max_len_val, max_len_test
>_ Output
			
(142, 139, 167)
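
The three loops above measure text length in characters. As a more compact alternative, the same measurement can be written with `max` and a generator expression. This is only a sketch on a small hypothetical `toy_dataset` dictionary (not the real tweet_eval data), so it runs without downloading anything; with the real dataset, `dataset[split]["text"]` would play the role of each list.

```python
# Hypothetical stand-in for the real dataset: split name -> list of texts.
toy_dataset = {
    "train": ["good morning", "what a beautiful sunset tonight"],
    "validation": ["hello"],
    "test": ["merry christmas everyone", "ok"],
}


def max_text_length(texts):
    """Return the length in characters of the longest text (0 if there are none)."""
    return max((len(text) for text in texts), default=0)


# One entry per split instead of one hand-written loop per split.
max_lengths = {split: max_text_length(texts) for split, texts in toy_dataset.items()}
print(max_lengths)
```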

These lengths are in characters, not tokens. For tokenization we set the maximum sequence length to 130 tokens; any longer sequence is truncated

	
< > Input
Python
max_len = 130

We are interested in the tokenized dataset, not the raw sequences, so we create a tokenizer

	
< > Input
Python
from transformers import AutoTokenizer
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

We create a tokenization function

	
< > Input
Python
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
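
With `padding="max_length"` and `truncation=True`, every tokenized sequence comes out with exactly `max_len` entries: longer sequences are cut, and shorter ones are filled with a padding token that the attention mask marks with 0. The sketch below imitates that behaviour on plain lists of token ids. It is only an illustration, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the padding id of 1 is an assumption made for the example, and real tokenizers also take care of special tokens when truncating.

```python
def pad_or_truncate(token_ids, max_len, pad_id=1):
    """Return (input_ids, attention_mask), both with exactly max_len entries."""
    kept = token_ids[:max_len]              # truncation: drop anything past max_len
    n_pad = max_len - len(kept)             # number of padding slots still needed
    input_ids = kept + [pad_id] * n_pad     # padding: fill up to max_len
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence gets padded, a long one gets truncated.
short_ids, short_mask = pad_or_truncate([0, 713, 16, 2], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```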

And now we tokenize the dataset

	
< > Input
Python
tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
>_ Output
			
Map: 0%| | 0/45000 [00:00<?, ? examples/s]
Map: 0%| | 0/5000 [00:00<?, ? examples/s]
Map: 0%| | 0/50000 [00:00<?, ? examples/s]

We now have the tokens (input_ids) and the attention masks (attention_mask). Let's check what data types they have

	
< > Input
Python
type(tokenized_dataset["train"][0]["input_ids"]), type(tokenized_dataset["train"][0]["attention_mask"]), type(tokenized_dataset["train"][0]["label"])
>_ Output
			
(list, list, int)
	
< > Input
Python
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
type(tokenized_dataset["train"][0]["label"]), type(tokenized_dataset["train"][0]["input_ids"]), type(tokenized_dataset["train"][0]["attention_mask"])
>_ Output
			
(torch.Tensor, torch.Tensor, torch.Tensor)

We create the DataLoaders

	
< > Input
Python
import torch
from torch.utils.data import DataLoader
BS = 64
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

We load the model

	
< > Input
Python
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)

Let's see what the model looks like

	
< > Input
Python
model
>_ Output
			
RobertaForSequenceClassification(
(roberta): RobertaModel(
(embeddings): RobertaEmbeddings(
(word_embeddings): Embedding(50265, 768, padding_idx=1)
(position_embeddings): Embedding(514, 768, padding_idx=1)
(token_type_embeddings): Embedding(1, 768)
(LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): RobertaEncoder(
(layer): ModuleList(
(0-11): 12 x RobertaLayer(
(attention): RobertaAttention(
(self): RobertaSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): RobertaSelfOutput(
...
)
)
)
)
(classifier): RobertaClassificationHead(
(dense): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
(out_proj): Linear(in_features=768, out_features=2, bias=True)
)
)

Let's look at its last layer

	
< > Input
Python
model.classifier.out_proj
>_ Output
			
Linear(in_features=768, out_features=2, bias=True)
	
< > Input
Python
model.classifier.out_proj.in_features, model.classifier.out_proj.out_features
>_ Output
			
(768, 2)

We have seen that our dataset has 20 classes, but this model was trained for 2 classes, so we have to replace the last layer

	
< > Input
Python
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
model.classifier.out_proj
>_ Output
			
Linear(in_features=768, out_features=20, bias=True)

That's better

Now we create a loss function

	
< > Input
Python
loss_function = torch.nn.CrossEntropyLoss()

An optimizer

	
< > Input
Python
from torch.optim import Adam
optimizer = Adam(model.parameters(), lr=5e-4)

And finally, a metric

	
< > Input
Python
import evaluate
metric = evaluate.load("accuracy")

Let's check that everything works with one sample

	
< > Input
Python
sample = next(iter(dataloader["train"]))
	
< > Input
Python
sample["input_ids"].shape, sample["attention_mask"].shape
>_ Output
			
(torch.Size([64, 130]), torch.Size([64, 130]))

Now we feed that sample to the model

	
< > Input
Python
model.to("cuda")
ouputs = model(input_ids=sample["input_ids"].to("cuda"), attention_mask=sample["attention_mask"].to("cuda"))
ouputs.logits.shape
>_ Output
			
torch.Size([64, 20])

We see that the model returns 64 rows, one per sample in the batch, which is right because we set BS = 64, and each row has 20 values, which is also right because we changed the model to output 20 values

We take the class with the highest value

	
< > Input
Python
predictions = torch.argmax(ouputs.logits, axis=-1)
predictions.shape
>_ Output
			
torch.Size([64])

We compute the loss

	
< > Input
Python
loss = loss_function(ouputs.logits, sample["label"].to("cuda"))
loss.item()
>_ Output
			
2.9990389347076416
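
A quick sanity check on this value: a classifier that assigns the same probability to each of the 20 classes has a cross-entropy of ln(20). The sketch below computes that reference value with the standard library. Since the new output layer is randomly initialized, a loss close to this reference (roughly 3.0) is what one would expect before any training.

```python
import math

num_classes = 20

# Cross-entropy of a uniform prediction: -log(1 / num_classes) = log(num_classes)
uniform_loss = math.log(num_classes)
print(f"{uniform_loss:.4f}")
```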

And the accuracy

	
< > Input
Python
accuracy = metric.compute(predictions=predictions, references=sample["label"])["accuracy"]
accuracy
>_ Output
			
0.015625
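
Accuracy is simply the fraction of predictions that match their labels. The value above is consistent with exactly one correct prediction in a batch of 64 (1/64 = 0.015625); for comparison, guessing uniformly among 20 classes would give about 1/20 = 0.05 on average. Here is a minimal sketch of the computation, using a hypothetical `accuracy_score` helper on made-up lists rather than the `evaluate` metric:

```python
def accuracy_score(predictions, references):
    """Fraction of positions where the prediction equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up batch of 64 samples in which only the first prediction is right.
references = [3] * 64
predictions = [3] + [7] * 63
print(accuracy_score(predictions, references))
```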

Now we can write a small training loop

	
< > Input
Python
from fastprogress.fastprogress import master_bar, progress_bar

epochs = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

master_progress_bar = master_bar(range(epochs))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
        loss.backward()
        optimizer.step()

    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']} "

Script with the base code

Most of the accelerate documentation explains how to use accelerate with scripts, so for now we will do it that way, and at the end we will explain how to do it from a notebook

First we create a folder to store the scripts

	
< > Input
Python
!mkdir accelerate_scripts

Now we write the base code into a script

	
< > Input
Python
%%writefile accelerate_scripts/01_code_base.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 64
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
        loss.backward()
        optimizer.step()

    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']} "

print(f"Accuracy = {accuracy['accuracy']}")
>_ Output
			
Overwriting accelerate_scripts/01_code_base.py

And now we run it

	
< > Input
Python
%%time
!python accelerate_scripts/01_code_base.py
>_ Output
			
Accuracy = 0.2112
CPU times: user 2.12 s, sys: 391 ms, total: 2.51 s
Wall time: 3min 36s

We see that on my computer it took about 3 and a half minutes

Code with accelerate

Now we replace a few things

  • First, we import Accelerator and initialize it
from accelerate import Accelerator
accelerator = Accelerator()
  • We no longer do the usual

``` python

torch.device("cuda" if torch.cuda.is_available() else "cpu")

```

  • Instead, we let accelerate choose the device with
device = accelerator.device
  • We pass the objects involved in training through the prepare method, and we no longer call model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
  • We no longer send the data and the model to the GPU with .to(device), because accelerate takes care of that in the prepare method
  • Instead of doing backpropagation with loss.backward(), we let accelerate do it with
accelerator.backward(loss)
  • When computing the metric in the validation loop, we need to gather the predictions and labels from all processes in case we are doing distributed training. To do so we call
predictions = accelerator.gather_for_metrics(predictions)
labels = accelerator.gather_for_metrics(labels)
	
< > Input
Python
%%writefile accelerate_scripts/02_accelerate_base_code.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 64
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
    print(f"End of training epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")

    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions and labels from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    print(f"End of validation epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")

    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']} "

print(f"Accuracy = {accuracy['accuracy']}")
>_ Output
			
Overwriting accelerate_scripts/02_accelerate_base_code.py

Note that I have added two lines, print(f"End of training epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}") and print(f"End of validation epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}"). I added them on purpose because they are going to reveal something very important

Now we run it. Accelerate scripts are run with the accelerate launch command

accelerate launch script.py
	
< > Input
Python
%%time
!accelerate launch accelerate_scripts/02_accelerate_base_code.py
>_ Output
			
End of training epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([64])
End of training epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([64])
End of validation epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([8])
Accuracy = 0.206
End of validation epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([8])
Accuracy = 0.206
CPU times: user 1.6 s, sys: 272 ms, total: 1.88 s
Wall time: 2min 37s
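
One detail in this output: on the last validation step the logits still have 64 rows per process, but the gathered labels have only 8. That matches the arithmetic below, under the assumption that each process receives its own batch of `BS` samples, that the last incomplete batch is padded with repeated samples, and that `gather_for_metrics` drops that padding again. The function name is hypothetical, and the sketch only reproduces the count, not the library's internals.

```python
def last_gathered_size(num_samples, batch_size, num_processes):
    """Real (non-padding) samples left for the final step across all processes."""
    global_batch = batch_size * num_processes
    remainder = num_samples % global_batch
    # If the dataset divides evenly, the last step is a full global batch.
    return remainder if remainder else global_batch


# Validation split: 5000 samples, BS = 64 per process, 2 processes.
print(last_gathered_size(num_samples=5000, batch_size=64, num_processes=2))
```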

We see that before it took about 3 and a half minutes and now it takes roughly 2 and a half minutes, a considerable improvement. Also, looking at the prints, we can see that each one was printed twice.

How can that be? Because accelerate has parallelized the training across my two GPUs, which is why it ran about a minute faster.

Moreover, when I ran the first script, that is, the one without accelerate, the GPU was almost full, whereas when I ran the second one, the one that uses accelerate, both GPUs had very low utilization, so we can increase the batch size to try to fill both of them. Let's do it!
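
To put a number on the improvement, the wall times reported above can be converted into a speedup and a parallel efficiency. This is plain arithmetic on the two measurements from this post (3 min 36 s and 2 min 37 s); keep in mind that they cover the whole script, including dataset loading and tokenization, not just the training loop.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair into seconds."""
    return 60 * minutes + seconds


baseline = to_seconds(3, 36)      # single process, script 01
accelerated = to_seconds(2, 37)   # two GPUs with accelerate, script 02
num_gpus = 2

speedup = baseline / accelerated
efficiency = speedup / num_gpus   # 1.0 would mean perfect scaling
print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
```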

	
< > Input
Python
%%writefile accelerate_scripts/03_accelerate_base_code_more_bs.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions and labels from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']} "

print(f"Accuracy = {accuracy['accuracy']}")
>_ Output
			
Overwriting accelerate_scripts/03_accelerate_base_code_more_bs.py

I have removed the extra prints, because we have already seen that the code runs on both GPUs, and I have increased the batch size from 64 to 128. Let's run it and see

	
< > Input
Python
%%time
!accelerate launch accelerate_scripts/03_accelerate_base_code_more_bs.py
>_ Output
			
Accuracy = 0.1052
Accuracy = 0.1052
CPU times: user 1.41 s, sys: 180 ms, total: 1.59 s
Wall time: 2min 22s

Increasing the batch size has reduced the execution time by a few seconds

Running processes

Running code in a single process

Earlier we saw that the prints appeared twice. This is because accelerate creates as many processes as there are devices running the code; in my case it creates two processes because I have two GPUs.

However, not all code should run in every process. For example, prints slow the code down considerably, so there is no point in running them several times, and checkpoints would be saved twice.

To run part of the code in a single process, wrap it in a function and decorate it with accelerator.on_local_main_process, which runs it only in the main process of each machine. For example, the script below defines the following function

@accelerator.on_local_main_process
def print_something(something):
    print(something)

Another option is to put the code inside an if accelerator.is_local_main_process block, as in the following code

if accelerator.is_local_main_process:
    print("Something")
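To make the idea concrete, here is a minimal sketch of how a decorator of this kind can work. It is not Accelerate's implementation: `on_rank`, `simulate` and the `current_rank` argument are hypothetical names used only for illustration, and the process index is passed in by hand instead of being provided by the launcher.

```python
import functools


def on_rank(target_rank, current_rank):
    """Hypothetical helper: run the decorated function only when
    current_rank equals target_rank; otherwise do nothing and return None."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if current_rank == target_rank:
                return func(*args, **kwargs)
            return None
        return wrapper
    return decorator


def simulate(num_processes):
    """Pretend to run the same script once per process and collect the output."""
    printed = []
    for rank in range(num_processes):
        @on_rank(target_rank=0, current_rank=rank)
        def print_something(something):
            printed.append(f"[process {rank}] {something}")

        # Every process calls the function, but only rank 0 executes its body.
        print_something("Accuracy = ...")
    return printed


messages = simulate(num_processes=2)
print(messages)
```

Both simulated processes call `print_something`, but only one message is collected, which is the behaviour we want for prints and checkpoint saving.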
	
< > Input
Python
%%writefile accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
# Runs only in the main process of each machine
@accelerator.on_local_main_process
def print_something(something):
    print(something)

master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']} "
# print(f"Accuracy = {accuracy['accuracy']}")
print_something(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
>_ Output
			
Overwriting accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py

Let's run it

	
< > Input
Python
%%time
!accelerate launch accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py
>_ Output
			
Accuracy = 0.2098
End of script with 0.2098 accuracy
CPU times: user 1.38 s, sys: 197 ms, total: 1.58 s
Wall time: 2min 22s

Now the message has been printed only once

However, although it is hard to notice, the progress bars are still drawn by every process.

I haven't found a way to avoid this with the fastprogress progress bars, but there is one with tqdm. So I'm going to replace the fastprogress bars with tqdm ones, and to show them in a single process we pass the argument disable=not accelerator.is_local_main_process

	
< > Input
Python
%%writefile accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
# Runs only in the main process of each machine
@accelerator.on_local_main_process
def print_something(something):
    print(something)

for i in range(EPOCHS):
    model.train()
    # progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    # Show the bar only in the local main process
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # master_progress_bar.child.comment = f'loss: {loss}'
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
    model.eval()
    # progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
# print(f"Accuracy = {accuracy['accuracy']}")
print_something(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
>_ Output
			
Overwriting accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py
	
< > Input
Python
%%time
!accelerate launch accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py
>_ Output
			
100%|█████████████████████████████████████████| 176/176 [02:01<00:00, 1.45it/s]
100%|███████████████████████████████████████████| 20/20 [00:06<00:00, 3.30it/s]
Accuracy = 0.2166
End of script with 0.2166 accuracy
CPU times: user 1.33 s, sys: 195 ms, total: 1.52 s
Wall time: 2min 22s

We have shown an example of printing in a single process, which is one way of running code in a single process. But if all you want is to print in a single process, you can use Accelerate's print method. Let's see the same example as before with this method

	
< > Input
Python
%%writefile accelerate_scripts/06_accelerate_base_code_print_one_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
for i in range(EPOCHS):
    model.train()
    # progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # master_progress_bar.child.comment = f'loss: {loss}'
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
    model.eval()
    # progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
# print(f"Accuracy = {accuracy['accuracy']}")
# accelerator.print prints only once, so no decorated helper is needed
accelerator.print(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
>_ Output
			
Writing accelerate_scripts/06_accelerate_base_code_print_one_process.py

Let's run it

	
< > Input
Python
%%time
!accelerate launch accelerate_scripts/06_accelerate_base_code_print_one_process.py
>_ Output
			
Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15433.52 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 11406.61 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15036.87 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14932.76 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14956.60 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:00<00:00, 1.46it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00, 3.33it/s]
Accuracy = 0.2134
End of script with 0.2134 accuracy
CPU times: user 1.4 s, sys: 189 ms, total: 1.59 s
Wall time: 2min 27s
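
The 176 and 20 iterations shown by the progress bars can be checked with a little arithmetic. The sketch below assumes that the prepared dataloaders give each of the two processes its own batch of 128 samples per step, so every step consumes 256 samples in total; `batches_per_process` is a hypothetical helper written for this check, not part of Accelerate.

```python
import math


def batches_per_process(num_samples, batch_size, num_processes):
    """Iterations each process runs when every process receives
    its own batch of `batch_size` samples at each step."""
    global_batch = batch_size * num_processes
    return math.ceil(num_samples / global_batch)


# Dataset sizes taken from the Map progress bars above
train_steps = batches_per_process(45_000, 128, 2)
validation_steps = batches_per_process(5_000, 128, 2)
print(train_steps, validation_steps)  # 176 20
```

With a single process the same training split would need 352 iterations, which is why spreading the batches over two GPUs roughly halves the length of the loop.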

Running code once across all processes

However, some code should run only once across all processes, even when training on several machines, for example pushing the checkpoints to the Hub. Here again we have two options: wrap the code in a function and decorate it with accelerator.on_main_process

@accelerator.on_main_process
def do_my_thing():
    "Something done once across all servers"
    do_thing_once()

or put the code inside an if accelerator.is_main_process block

if accelerator.is_main_process:
    repo.push_to_hub()
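
The difference between the local main process and the main process only shows up with more than one machine. The following sketch simulates it with plain Python; `ProcessInfo`, `build_processes` and the numbering of global indices as `node * gpus_per_node + local_index` are assumptions made for illustration, not Accelerate's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessInfo:
    node: int          # machine the process runs on
    local_index: int   # index of the process within its machine
    global_index: int  # index of the process across all machines

    @property
    def is_local_main(self):
        # One such process per machine
        return self.local_index == 0

    @property
    def is_main(self):
        # A single such process in the whole job
        return self.global_index == 0


def build_processes(num_nodes, gpus_per_node):
    """Enumerate every process of a hypothetical multi-node job."""
    return [
        ProcessInfo(node, local, node * gpus_per_node + local)
        for node in range(num_nodes)
        for local in range(gpus_per_node)
    ]


processes = build_processes(num_nodes=2, gpus_per_node=2)
local_mains = [p.global_index for p in processes if p.is_local_main]
mains = [p.global_index for p in processes if p.is_main]
print("local main processes:", local_mains)  # [0, 2]
print("main process:", mains)                # [0]
```

On a single machine with two GPUs, as in this post, both lists contain only process 0, so the two kinds of decorator print the same thing.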

Since we are only training to demonstrate the accelerate library and the model we are training isn't good, there is no point in pushing the checkpoints to the Hub right now, so I'll do an example with prints

	
< > Input
Python
%%writefile accelerate_scripts/07_accelerate_base_code_some_code_in_all_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
# Runs only in the main process of each machine
@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)

# Runs only in the global main process, once across all machines
@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
print_in_one_process(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")
if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")
>_ Output
			
Overwriting accelerate_scripts/06_accelerate_base_code_some_code_in_all_process.py

Let's run it

	
< > Input
Python
%%time
!accelerate launch accelerate_scripts/07_accelerate_base_code_some_code_in_all_process.py
>_ Output
			
Map: 100%|██████████████████████| 45000/45000 [00:03<00:00, 14518.44 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:03<00:00, 14368.77 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 16466.33 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14806.14 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14253.33 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14337.07 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:00<00:00, 1.46it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00, 3.34it/s]
Accuracy = 0.2092
End of script with 0.2092 accuracy
All process: Accuracy = 0.2092
All process: End of script with 0.2092 accuracy
CPU times: user 1.42 s, sys: 216 ms, total: 1.64 s
Wall time: 2min 27s

Running code in process X

Finally, we can specify which process should run a piece of code. To do this, create a function and decorate it with @accelerator.on_process(process_index=0)

@accelerator.on_process(process_index=0)
def do_my_thing():
    "Something done on process index 0"
    do_thing_on_index_zero()

or decorate it with @accelerator.on_local_process(local_process_index=0)

@accelerator.on_local_process(local_process_index=0)
def do_my_thing():
    "Something done on process index 0 on each server"
    do_thing_on_index_zero_on_each_server()

I've used process 0 here, but any process index can be used

	
< > Input
Python
%%writefile accelerate_scripts/08_accelerate_base_code_some_code_in_some_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
# Runs only in the main process of each machine
@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)

# Runs only in the global main process, once across all machines
@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)

# Runs only in the process with global index 0
@accelerator.on_process(process_index=0)
def print_in_process_0(something):
    print("Process 0: " + something)

# Runs only in the process with local index 1 on each machine
@accelerator.on_local_process(local_process_index=1)
def print_in_process_1(something):
    print("Process 1: " + something)

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
print_in_one_process(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")
if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")
print_in_process_0("End of process 0")
print_in_process_1("End of process 1")
>_ Output
			
Overwriting accelerate_scripts/07_accelerate_base_code_some_code_in_some_process.py

Let's run it

	
< > Input
Python
%%time
!accelerate launch accelerate_scripts/08_accelerate_base_code_some_code_in_some_process.py
>_ Output
			
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 15735.58 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14906.20 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:02<00:00, 1.44it/s]
100%|███████████████████████████████████████████| 20/20 [00:06<00:00, 3.27it/s]
Process 1: End of process 1
Accuracy = 0.2128
End of script with 0.2128 accuracy
All process: Accuracy = 0.2128
All process: End of script with 0.2128 accuracy
Process 0: End of process 0
CPU times: user 1.42 s, sys: 295 ms, total: 1.71 s
Wall time: 2min 37s

Synchronizing processes

If we have code that must run in all processes, it is useful to wait until it has finished in all of them before moving on to another task. For that we use accelerator.wait_for_everyone()

To see it in action, let's add a delay to one of the single-process print functions

I've also put a break in the training loop so that it doesn't spend long training, which isn't what we care about right now

	
< > Input
Python
%%writefile accelerate_scripts/09_accelerate_base_code_sync_all_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
import time
# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()
dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)
max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")
EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

# Runs on the main process of each machine (local process index 0)
@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)

# Runs only on the global main process (despite the function name)
@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)

# Runs only on the process with global index 0, after a 2-second delay
@accelerator.on_process(process_index=0)
def print_in_process_0(something):
    time.sleep(2)
    print("Process 0: " + something)

# Runs only on the process with local index 1 on each machine
@accelerator.on_local_process(local_process_index=1)
def print_in_process_1(something):
    print("Process 1: " + something)

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
        break

    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

    print_in_one_process(f"Accuracy = {accuracy['accuracy']}")

if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")

print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")
if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")

print_in_one_process("Printing with delay in process 0")
print_in_process_0("End of process 0")
print_in_process_1("End of process 1")
accelerator.wait_for_everyone()

print_in_one_process("End of script")
>_ Output
			
Overwriting accelerate_scripts/08_accelerate_base_code_sync_all_process.py

We run it

	
< > Input
Python
!accelerate launch accelerate_scripts/09_accelerate_base_code_sync_all_process.py
>_ Output
			
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14218.23 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14666.25 examples/s]
0%| | 0/176 [00:00<?, ?it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00, 3.58it/s]
Process 1: End of process 1
Accuracy = 0.212
End of script with 0.212 accuracy
All process: Accuracy = 0.212
All process: End of script with 0.212 accuracy
Printing with delay in process 0
Process 0: End of process 0
End of script

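As you can see, `Process 1: End of process 1` is printed first and everything else comes afterwards. Every other message is printed by process 0, which also has to sit through the 2-second delay inside `print_in_process_0`, whereas process 1 skips those calls and reaches `print_in_process_1` directly. `accelerator.wait_for_everyone()` then holds process 1 until process 0 has caught up, so both processes are done before the final `End of script` is printed.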
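The ordering described above can be reproduced without GPUs or Accelerate. The following is only a minimal sketch under stated assumptions: it uses two Python threads as stand-ins for the two processes and a `threading.Barrier` as a stand-in for `accelerator.wait_for_everyone()`. The names `worker`, `log` and `run_simulation` are hypothetical helpers invented for this illustration, and the delay is shortened to 0.5 seconds.

```python
import threading
import time

WORLD_SIZE = 2  # assumption: two workers, like the two GPUs in the example

# Stand-in for accelerator.wait_for_everyone(): blocks until all workers arrive
barrier = threading.Barrier(WORLD_SIZE)

events = []  # ordered record of every message that was printed
events_lock = threading.Lock()


def log(message):
    """Print a message and remember the order in which it appeared."""
    with events_lock:
        events.append(message)
    print(message)


def worker(rank):
    # Rank 0 mimics print_in_process_0: it sleeps before printing
    if rank == 0:
        time.sleep(0.5)
        log("Process 0: End of process 0")
    # Rank 1 mimics print_in_process_1: nothing delays it
    if rank == 1:
        log("Process 1: End of process 1")
    # Nobody continues past this point until every worker has reached it
    barrier.wait()
    # Only rank 0 prints the final message, like print_in_one_process
    if rank == 0:
        log("End of script")


def run_simulation():
    """Run both workers and return the messages in the order they were printed."""
    events.clear()
    threads = [threading.Thread(target=worker, args=(rank,)) for rank in range(WORLD_SIZE)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return list(events)


if __name__ == "__main__":
    run_simulation()
```

Rank 1 prints first because nothing holds it back, rank 0 prints after its delay, and the barrier keeps rank 1 waiting so that the final message comes last. Real Accelerate processes are separate OS processes with their own standard output, so the interleaving of their messages in a terminal can be less predictable than in this thread-based sketch.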

---

➡️ **Continue in part two:** Saving, mixed precision and inference, where we will see how to save and load models, train with mixed precision, and run inference with the Hugging Face ecosystem.
