Hugging Face Accelerate
Accelerate is a Hugging Face library that lets you run the same PyTorch code on any distributed configuration by adding just four lines of code.
This notebook has been automatically translated to make it accessible to more people; please let me know if you spot any errors.
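As a preview, the four added lines are essentially the ones marked below. This is a minimal sketch with a dummy model and dataset, just to show where they go; the full, real example comes later in this post.

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator  # 1: import Accelerate

accelerator = Accelerator()  # 2: initialize it

# Dummy model, optimizer and data, standing in for your own training objects
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=8)

# 3: let Accelerate place everything on the right device(s)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    accelerator.backward(loss)  # 4: instead of loss.backward()
    optimizer.step()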
Installation
To install accelerate with pip, just run:
pip install accelerate
And with conda:
conda install -c conda-forge accelerate
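To check that the installation worked, you can print the installed version from a terminal (any recent version should behave as described in this post):

python -c "import accelerate; print(accelerate.__version__)"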
Configuration
In every environment where accelerate is installed, the first thing to do is configure it. To do so, we run this in a terminal:
accelerate config
!accelerate config
--------------------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------------------
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: no
Do you want to use FullyShardedDataParallel? [yes/NO]: no
Do you want to use Megatron-LM ? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:0,1
--------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
no
accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml
In my case, the answers were:

- In which compute environment are you running?
- [x] This machine
- [_] AWS (Amazon SageMaker)

I want to configure it on my own computer.

- What type of machine are you using?
- [_] multi-CPU
- [_] multi-XPU
- [x] multi-GPU
- [_] multi-NPU
- [_] TPU
Since I have 2 GPUs and want to run distributed code on them, I chose multi-GPU.
- How many different machines will you use (use more than 1 for multi-node training)? [1]:
- 1

I chose 1 because I am only going to run this on my own computer.
- Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]:
- no

With this option you can have accelerate check for errors at runtime, but it would make execution slower, so I chose no; if errors show up, I will change it to yes.
- Do you wish to optimize your script with torch dynamo? [yes/NO]:
- no
- Do you want to use DeepSpeed? [yes/NO]:
- no
- Do you want to use FullyShardedDataParallel? [yes/NO]:
- no
- Do you want to use Megatron-LM? [yes/NO]:
- no
- How many GPU(s) should be used for distributed training? [1]:
- 2

I chose 2 because I have 2 GPUs.
- What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]:
- 0,1

I chose 0,1 because I want to use both GPUs.
- Do you wish to use FP16 or BF16 (mixed precision)?
- [x] no
- [_] fp16
- [_] bf16
- [_] fp8

For now I chose no because, to keep the code simple, we will train in fp32 when not using accelerate, but ideally we would use fp16.
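As a side note, mixed precision does not have to be fixed at configuration time; it can also be requested when creating the Accelerator. A minimal sketch, using the mixed_precision argument of the Accelerator API:

from accelerate import Accelerator

# Overrides the mixed_precision value stored in default_config.yaml for this run
accelerator = Accelerator(mixed_precision="fp16")

The same override is available from a terminal with accelerate launch --mixed_precision fp16 script.py.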
The configuration is stored at ~/.cache/huggingface/accelerate/default_config.yaml and can be modified at any time. Let's see what is inside it.
!cat ~/.cache/huggingface/accelerate/default_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Another way to see the configuration we have is to run this in a terminal:

accelerate env
!accelerate env
Copy-and-paste the text below in your GitHub issue
- `Accelerate` version: 0.28.0
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.31
- Python version: 3.11.8
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 31.24 GB
- GPU type: NVIDIA GeForce RTX 3090
- `Accelerate` default config:
  - compute_environment: LOCAL_MACHINE
  - distributed_type: MULTI_GPU
  - mixed_precision: fp16
  - use_cpu: False
  - debug: False
  - num_processes: 2
  - machine_rank: 0
  - num_machines: 1
  - gpu_ids: 0,1
  - rdzv_backend: static
  - same_network: True
  - main_training_function: main
  - downcast_bf16: no
  - tpu_use_cluster: False
  - tpu_use_sudo: False
  - tpu_env: []
Once accelerate is configured, we can check that we did everything right by running this in a terminal:

accelerate test
!accelerate test
Running: accelerate-launch ~/miniconda3/envs/nlp/lib/python3.11/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: DistributedType.MULTI_GPU Backend: nccl
stdout: Num processes: 2
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout:
stdout: Mixed precision type: fp16
stdout:
stdout: Distributed environment: DistributedType.MULTI_GPU Backend: nccl
stdout: Num processes: 2
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout:
stdout: Mixed precision type: fp16
stdout:
stdout:
stdout: **Test process execution**
stdout:
stdout: **Test split between processes as a list**
stdout:
stdout: **Test split between processes as a dict**
stdout:
stdout: **Test split between processes as a tensor**
stdout:
stdout: **Test random number generator synchronization**
stdout: All rng are properly synched.
stdout:
stdout: **DataLoader integration test**
stdout: 0 1 tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout:         18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout:         36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout:         54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:1') <class 'accelerate.data_loader.DataLoaderShard'>
stdout: tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
stdout:         18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
stdout:         36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
stdout:         54, 55, 56, 57, 58, 59, 60, 61, 62, 63], device='cuda:0') <class 'accelerate.data_loader.DataLoaderShard'>
stdout: Non-shuffled dataloader passing.
stdout: Shuffled dataloader passing.
stdout: Non-shuffled central dataloader passing.
stdout: Shuffled central dataloader passing.
stdout:
stdout: **Training integration test**
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributed setup with no batch split.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Training yielded the same results on one CPU or distributes setup with batch split.
stdout: FP16 training check.
stdout: FP16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Keep fp32 wrapper check.
stdout: Keep fp32 wrapper check.
stdout: BF16 training check.
stdout: BF16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout:
stdout: Training yielded the same results on one CPU or distributed setup with no batch split.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: FP16 training check.
stdout: Training yielded the same results on one CPU or distributes setup with batch split.
stdout: FP16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Keep fp32 wrapper check.
stdout: Keep fp32 wrapper check.
stdout: BF16 training check.
stdout: BF16 training check.
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout: Model dtype: torch.float32, torch.float32. Input dtype: torch.float32
stdout:
stdout: **Breakpoint trigger test**
Test is a success! You are ready for your distributed training!
We can see that it ends by saying `Test is a success! You are ready for your distributed training!`, so everything is correct.
Training

Training optimization

Base code

First we will build some basic training code, and then we will optimize it to see how it is done and how much it improves.

First, let's find a dataset. In my case I will use the tweet_eval dataset, which is a tweet-classification dataset; specifically, I will download its emoji subset, which classifies tweets by emoji.
from datasets import load_dataset

dataset = load_dataset("tweet_eval", "emoji")
dataset
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 45000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})
dataset["train"].info
DatasetInfo(description='', citation='', homepage='', license='', features={'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜'], id=None)}, post_processed=None, supervised_keys=None, task_templates=None, builder_name='parquet', dataset_name='tweet_eval', config_name='emoji', version=0.0.0, splits={'train': SplitInfo(name='train', num_bytes=3808792, num_examples=45000, shard_lengths=None, dataset_name='tweet_eval'), 'test': SplitInfo(name='test', num_bytes=4262151, num_examples=50000, shard_lengths=None, dataset_name='tweet_eval'), 'validation': SplitInfo(name='validation', num_bytes=396704, num_examples=5000, shard_lengths=None, dataset_name='tweet_eval')}, download_checksums={'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/train-00000-of-00001.parquet': {'num_bytes': 2609973, 'checksum': None}, 'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/test-00000-of-00001.parquet': {'num_bytes': 3047341, 'checksum': None}, 'hf://datasets/tweet_eval@b3a375baf0f409c77e6bc7aa35102b7b3534f8be/emoji/validation-00000-of-00001.parquet': {'num_bytes': 281994, 'checksum': None}}, download_size=5939308, post_processing_size=None, dataset_size=8467647, size_in_bytes=14406955)
Let's have a look at the classes.
print(dataset["train"].info.features["label"].names)
['❤', '😍', '😂', '💕', '🔥', '😊', '😎', '✨', '💙', '😘', '📷', '🇺🇸', '☀', '💜', '😉', '💯', '😁', '🎄', '📸', '😜']
And the number of classes.
num_classes = len(dataset["train"].info.features["label"].names)
num_classes
20
We can see that the dataset has 20 classes.

Now let's look at the maximum sequence length in each split.
max_len_train = 0
max_len_val = 0
max_len_test = 0

split = "train"
for i in range(len(dataset[split])):
    len_i = len(dataset[split][i]["text"])
    if len_i > max_len_train:
        max_len_train = len_i

split = "validation"
for i in range(len(dataset[split])):
    len_i = len(dataset[split][i]["text"])
    if len_i > max_len_val:
        max_len_val = len_i

split = "test"
for i in range(len(dataset[split])):
    len_i = len(dataset[split][i]["text"])
    if len_i > max_len_test:
        max_len_test = len_i

max_len_train, max_len_val, max_len_test
(142, 139, 167)
So we set the overall maximum sequence length to 130 for tokenization.
max_len = 130
We are interested in the tokenized dataset, not the raw sequences, so we create a tokenizer.
from transformers import AutoTokenizer

checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
We create a tokenization function.
def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")
And now we tokenize the dataset.
tokenized_dataset = {
"train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
"validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
"test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
As we can see, we now have the tokens (input_ids) and the attention masks (attention_mask), but let's check what kind of data we have.
type(tokenized_dataset["train"][0]["input_ids"]), type(tokenized_dataset["train"][0]["attention_mask"]), type(tokenized_dataset["train"][0]["label"])
(list, list, int)

They are lists and integers, so we convert them into PyTorch tensors:
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])type(tokenized_dataset["train"][0]["label"]), type(tokenized_dataset["train"][0]["input_ids"]), type(tokenized_dataset["train"][0]["attention_mask"])
(torch.Tensor, torch.Tensor, torch.Tensor)
We create a dataloader.
import torch
from torch.utils.data import DataLoader

BS = 64
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}
We load the model.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
Let's see what the model looks like.
model
RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): RobertaIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): RobertaOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
  )
  (classifier): RobertaClassificationHead(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
    (out_proj): Linear(in_features=768, out_features=2, bias=True)
  )
)
Let's look at its last layer.
model.classifier.out_proj
Linear(in_features=768, out_features=2, bias=True)
model.classifier.out_proj.in_features, model.classifier.out_proj.out_features
(768, 2)
We saw that our dataset has 20 classes, but this model was trained for 2 classes, so we have to modify the last layer.
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)
model.classifier.out_proj
Linear(in_features=768, out_features=20, bias=True)
Now it has 20 outputs.

Next we create a loss function.
loss_function = torch.nn.CrossEntropyLoss()
An optimizer.
from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=5e-4)
And finally, a metric.
import evaluate

metric = evaluate.load("accuracy")
Let's check that everything works with one sample.
sample = next(iter(dataloader["train"]))
sample["input_ids"].shape, sample["attention_mask"].shape
(torch.Size([64, 130]), torch.Size([64, 130]))
Now we feed that sample to the model.
model.to("cuda")ouputs = model(input_ids=sample["input_ids"].to("cuda"), attention_mask=sample["attention_mask"].to("cuda"))ouputs.logits.shape
torch.Size([64, 20])
We see that the model outputs a batch of 64 elements, which is right because we set BS = 64, each with 20 outputs, which is right because we changed the model to output 20 values.

We take the one with the highest value.
predictions = torch.argmax(outputs.logits, axis=-1)
predictions.shape
torch.Size([64])
We get the loss.
loss = loss_function(outputs.logits, sample["label"].to("cuda"))
loss.item()
2.9990389347076416
And the accuracy.
accuracy = metric.compute(predictions=predictions, references=sample["label"])["accuracy"]
accuracy
0.015625

The accuracy is very low (exactly 1 correct prediction out of the 64 in the batch), which is expected since the new classification head has not been trained yet.
Now we can create a small training loop.
from fastprogress.fastprogress import master_bar, progress_bar

epochs = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

master_progress_bar = master_bar(range(epochs))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
        loss.backward()
        optimizer.step()

    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"
Script with the base code

Most of the accelerate documentation explains how to use accelerate with scripts, so that is what we will do for now; at the end we will explain how to do it from a notebook.

First, let's create a folder in which to save the scripts.
!mkdir accelerate_scripts
Now we write the base code into a script.
%%writefile accelerate_scripts/01_code_base.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)

max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 64
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
        loss.backward()
        optimizer.step()

    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"

print(f"Accuracy = {accuracy['accuracy']}")
Overwriting accelerate_scripts/01_code_base.py
And now we run it.
%%time
!python accelerate_scripts/01_code_base.py
Accuracy = 0.2112
CPU times: user 2.12 s, sys: 391 ms, total: 2.51 s
Wall time: 3min 36s
On my computer it took about 3 and a half minutes.
Code with accelerate
Now let's replace a few things:

- First, we import the Accelerator and initialize it.

from accelerate import Accelerator
accelerator = Accelerator()
- We no longer do the typical

torch.device("cuda" if torch.cuda.is_available() else "cpu")

Instead, we let accelerate choose the device with

device = accelerator.device
- We pass the elements relevant to training through the prepare method, and we no longer do model.to(device).

model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])
- We no longer send the data and the model to the GPU with .to(device), since accelerate has taken care of that with the prepare method.

- Instead of doing the backpropagation with loss.backward(), we let accelerate do it with

accelerator.backward(loss)

- When computing the metric in the validation loop, we need to gather the values from all processes, in case we are doing distributed training.

predictions = accelerator.gather_for_metrics(predictions)
%%writefile accelerate_scripts/02_accelerate_base_code.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)

max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 64
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
    print(f"End of training epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")

    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    print(f"End of validation epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}")
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"

print(f"Accuracy = {accuracy['accuracy']}")
Overwriting accelerate_scripts/02_accelerate_base_code.py
Notice that I added these two lines: print(f"End of training epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}") and print(f"End of validation epoch {i}, outputs['logits'].shape: {outputs['logits'].shape}, labels.shape: {labels.shape}"). I added them on purpose, because they will reveal something very important.
Now let's run it. To run accelerate scripts, we use the accelerate launch command.

accelerate launch script.py
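The launcher also accepts command-line overrides of the saved configuration; for example, the number of processes can be set per run with the --num_processes flag of the accelerate CLI:

accelerate launch --num_processes 2 accelerate_scripts/02_accelerate_base_code.py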
%%time
!accelerate launch accelerate_scripts/02_accelerate_base_code.py
End of training epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([64])
End of training epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([64])
End of validation epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([8])
Accuracy = 0.206
End of validation epoch 0, outputs['logits'].shape: torch.Size([64, 20]), labels.shape: torch.Size([8])
Accuracy = 0.206
CPU times: user 1.6 s, sys: 272 ms, total: 1.88 s
Wall time: 2min 37s
We see that before it took about 3 and a half minutes, and now it takes about 2 and a half minutes; a significant improvement. Also, if we look at the prints, we see that they were printed twice.

How is this possible? Because accelerate parallelized the training over the two GPUs I have, which is why it got much faster. Since each process works on its own batch of 64, the effective batch size is doubled.

Also, when I ran the first script, i.e. the one without accelerate, the GPU was almost full, while when I ran the second one, i.e. the one using accelerate, both GPUs were barely used, so we can increase the batch size to try to fill them both.
%%writefile accelerate_scripts/03_accelerate_base_code_more_bs.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)

max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"

print(f"Accuracy = {accuracy['accuracy']}")
Overwriting accelerate_scripts/03_accelerate_base_code_more_bs.py
I removed the extra prints, since we already saw that the code runs on both GPUs, and I increased the batch size from 64 to 128.
%%time
!accelerate launch accelerate_scripts/03_accelerate_base_code_more_bs.py
Accuracy = 0.1052
Accuracy = 0.1052
CPU times: user 1.41 s, sys: 180 ms, total: 1.59 s
Wall time: 2min 22s
Increasing the batch size cut the run time by a few seconds.
Process execution

Running code in a single process
Earlier we saw that the print was executed twice; this is because accelerate creates as many processes as there are devices the code runs on. In my case it creates two processes, because I have two GPUs.

However, not all code should run in every process: prints, for example, slow the code down when executed several times, and if checkpoints are saved, they would be saved twice, etc.
To run part of the code in a single process, you have to encapsulate it in a function and decorate it with accelerator.on_local_main_process. For example, in the code below you will see that I created the following function:
@accelerator.on_local_main_process
def print_something(something):
    print(something)
Another option is to put the code inside an if accelerator.is_local_main_process block, like the following:
if accelerator.is_local_main_process:
    print("Something")
%%writefile accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
from fastprogress.fastprogress import master_bar, progress_bar

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)

max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

@accelerator.on_local_main_process
def print_something(something):
    print(something)

master_progress_bar = master_bar(range(EPOCHS))
for i in master_progress_bar:
    model.train()
    progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        master_progress_bar.child.comment = f'loss: {loss}'
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()
    master_progress_bar.main_bar.comment = f"Validation accuracy: {accuracy['accuracy']}\n"

# print(f"Accuracy = {accuracy['accuracy']}")
print_something(f"Accuracy = {accuracy['accuracy']}")

if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
Overwriting accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py
Let's run it and see.
%%time
!accelerate launch accelerate_scripts/04_accelerate_base_code_some_code_in_one_process.py
Accuracy = 0.2098
End of script with 0.2098 accuracy
CPU times: user 1.38 s, sys: 197 ms, total: 1.58 s
Wall time: 2min 22s
Now the print was only executed once.

However, although it is not very visible here, the progress bars still run in every process.

I could not find a way around this with the fastprogress progress bars, but I could with the tqdm ones, so I will replace the fastprogress progress bars with tqdm ones and, to run them in a single process, add the argument disable=not accelerator.is_local_main_process.
%%writefile accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)

max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

@accelerator.on_local_main_process
def print_something(something):
    print(something)

for i in range(EPOCHS):
    model.train()
    # progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # master_progress_bar.child.comment = f'loss: {loss}'
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    # progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

# print(f"Accuracy = {accuracy['accuracy']}")
print_something(f"Accuracy = {accuracy['accuracy']}")

if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
Overwriting accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py
%%time
!accelerate launch accelerate_scripts/05_accelerate_base_code_some_code_in_one_process.py
100%|█████████████████████████████████████████| 176/176 [02:01<00:00,  1.45it/s]
100%|███████████████████████████████████████████| 20/20 [00:06<00:00,  3.30it/s]
Accuracy = 0.2166
End of script with 0.2166 accuracy
CPU times: user 1.33 s, sys: 195 ms, total: 1.52 s
Wall time: 2min 22s
We showed an example of printing from a single process, which was really an example of running code in a single process. But if all you want is to print from a single process, you can use accelerate's print method. Let's see the same example with this method.
%%writefile accelerate_scripts/06_accelerate_base_code_print_one_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)

max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

for i in range(EPOCHS):
    model.train()
    # progress_bar_train = progress_bar(dataloader["train"], parent=master_progress_bar)
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # master_progress_bar.child.comment = f'loss: {loss}'
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    # progress_bar_validation = progress_bar(dataloader["validation"], parent=master_progress_bar)
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

# print(f"Accuracy = {accuracy['accuracy']}")
accelerator.print(f"Accuracy = {accuracy['accuracy']}")

if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")
Writing accelerate_scripts/06_accelerate_base_code_print_one_process.py
We run it.
%%time
!accelerate launch accelerate_scripts/06_accelerate_base_code_print_one_process.py
Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15433.52 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 11406.61 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15036.87 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14932.76 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14956.60 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:00<00:00,  1.46it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00,  3.33it/s]
Accuracy = 0.2134
End of script with 0.2134 accuracy
CPU times: user 1.4 s, sys: 189 ms, total: 1.59 s
Wall time: 2min 27s
Running code in all processes

However, there is code that should run only once across all processes, for example pushing checkpoints to the Hub. Here we have two options: encapsulate the code in a function and decorate it with accelerator.on_main_process,
@accelerator.on_main_process
def do_my_thing():
    "Something done once per server"
    do_thing_once()
or put the code inside an if accelerator.is_main_process block.
if accelerator.is_main_process:
    repo.push_to_hub()
Since we are training only to demonstrate the accelerate library and the model we are training is not good, there is no point in pushing the checkpoints to the Hub, so I will do an example with prints instead.
%%writefile accelerate_scripts/07_accelerate_base_code_some_code_in_all_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)

max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}
tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device

# model.to(device)
model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)

@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)
        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)
        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

print_in_one_process(f"Accuracy = {accuracy['accuracy']}")

if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")

print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")

if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")
Overwriting accelerate_scripts/07_accelerate_base_code_some_code_in_all_process.py
We run it to see the result.
%%time
!accelerate launch accelerate_scripts/07_accelerate_base_code_some_code_in_all_process.py
Map: 100%|██████████████████████| 45000/45000 [00:03<00:00, 14518.44 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:03<00:00, 14368.77 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 16466.33 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14806.14 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14253.33 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14337.07 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:00<00:00,  1.46it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00,  3.34it/s]
Accuracy = 0.2092
End of script with 0.2092 accuracy
All process: Accuracy = 0.2092
All process: End of script with 0.2092 accuracy
CPU times: user 1.42 s, sys: 216 ms, total: 1.64 s
Wall time: 2min 27s
Running code on process X
Finally, we can specify which process we want a piece of code to run on. To do so, we create a function and decorate it with @accelerator.on_process(process_index=0)
@accelerator.on_process(process_index=0)
def do_my_thing():
    "Something done on process index 0"
    do_thing_on_index_zero()
or decorate it with @accelerator.on_local_process(local_process_index=0)
@accelerator.on_local_process(local_process_index=0)
def do_my_thing():
    "Something done on process index 0 on each server"
    do_thing_on_index_zero_on_each_server()
Here I used process 0, but you can use any process index.
%%writefile accelerate_scripts/08_accelerate_base_code_some_code_in_some_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)

max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}

tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)

model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)

@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)

@accelerator.on_process(process_index=0)
def print_in_process_0(something):
    print("Process 0: " + something)

@accelerator.on_local_process(local_process_index=1)
def print_in_process_1(something):
    print("Process 1: " + something)

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)

        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

print_in_one_process(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")

print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")
if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")

print_in_process_0("End of process 0")
print_in_process_1("End of process 1")
Overwriting accelerate_scripts/08_accelerate_base_code_some_code_in_some_process.py
We run it.
%%time
!accelerate launch accelerate_scripts/08_accelerate_base_code_some_code_in_some_process.py
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 15735.58 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 14906.20 examples/s]
100%|█████████████████████████████████████████| 176/176 [02:02<00:00,  1.44it/s]
100%|███████████████████████████████████████████| 20/20 [00:06<00:00,  3.27it/s]
Process 1: End of process 1
Accuracy = 0.2128
End of script with 0.2128 accuracy
All process: Accuracy = 0.2128
All process: End of script with 0.2128 accuracy
Process 0: End of process 0
CPU times: user 1.42 s, sys: 295 ms, total: 1.71 s
Wall time: 2min 37s
Synchronizing processes
If we have code that needs to run on all processes, it is useful to wait until it has finished on every process before moving on to the next task; for that we use accelerator.wait_for_everyone().
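Stripped of the training code, the pattern is just a barrier. A minimal sketch, assuming an Accelerator has already been created:

from accelerate import Accelerator

accelerator = Accelerator()

# ... work that each process does at its own pace ...

# Block every process here until all of them have arrived
accelerator.wait_for_everyone()

# From this point on, all processes are back in step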
To see this in action, we add a delay to the print function that runs on one of the processes.
I also added a break in the training loop so that it does not spend much time training, which is not what interests us right now.
%%writefile accelerate_scripts/09_accelerate_base_code_sync_all_process.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
import time

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)

max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}

tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)

model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

@accelerator.on_local_main_process
def print_in_one_process(something):
    print(something)

@accelerator.on_main_process
def print_in_all_processes(something):
    print(something)

@accelerator.on_process(process_index=0)
def print_in_process_0(something):
    time.sleep(2)
    print("Process 0: " + something)

@accelerator.on_local_process(local_process_index=1)
def print_in_process_1(something):
    print("Process 1: " + something)

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()
        break

    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)

        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

print_in_one_process(f"Accuracy = {accuracy['accuracy']}")
if accelerator.is_local_main_process:
    print(f"End of script with {accuracy['accuracy']} accuracy")

print_in_all_processes(f"All process: Accuracy = {accuracy['accuracy']}")
if accelerator.is_main_process:
    print(f"All process: End of script with {accuracy['accuracy']} accuracy")

print_in_one_process("Printing with delay in process 0")
print_in_process_0("End of process 0")
print_in_process_1("End of process 1")

accelerator.wait_for_everyone()

print_in_one_process("End of script")
Overwriting accelerate_scripts/09_accelerate_base_code_sync_all_process.py
We run it.
!accelerate launch accelerate_scripts/09_accelerate_base_code_sync_all_process.py
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14218.23 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 14666.25 examples/s]
  0%|          | 0/176 [00:00<?, ?it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00,  3.58it/s]
Process 1: End of process 1
Accuracy = 0.212
End of script with 0.212 accuracy
All process: Accuracy = 0.212
All process: End of script with 0.212 accuracy
Printing with delay in process 0
Process 0: End of process 0
End of script
As you can see, Process 1: End of process 1 is printed first and the rest afterwards, because the remaining prints run on process 0 or on all processes, so until the 2-second delay we added has elapsed, the rest of the code does not run.
Saving and loading the state dict
When training, we sometimes save the state so that we can resume it later.
To save the state we use the save_state() method, and to load it back, the load_state() method.
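Both methods take a folder: save_state() writes the states of everything that went through prepare() (model, optimizer, dataloaders, RNG states) into it, and load_state() restores them from it. A minimal sketch, using the same checkpoint folder as the script below:

# Save model, optimizer and RNG states to a checkpoint folder
accelerator.save_state("accelerate_scripts/checkpoints")

# Restore everything from that folder to resume training
accelerator.load_state("accelerate_scripts/checkpoints")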
%%writefile accelerate_scripts/10_accelerate_save_and_load_checkpoints.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)

max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}

tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

@accelerator.on_local_main_process
def print_something(something):
    print(something)

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)

model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)

        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

# Save the weights
accelerator.save_state("accelerate_scripts/checkpoints")

print_something(f"Accuracy = {accuracy['accuracy']}")

# Load the weights
accelerator.load_state("accelerate_scripts/checkpoints")
Overwriting accelerate_scripts/10_accelerate_save_and_load_checkpoints.py
We run it.
!accelerate launch accelerate_scripts/10_accelerate_save_and_load_checkpoints.py
100%|█████████████████████████████████████████| 176/176 [01:58<00:00,  1.48it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00,  3.40it/s]
Accuracy = 0.2142
Saving the model
When the prepare method was used, the model was wrapped so that it could be placed on the required devices. Therefore, when saving it we have to use the save_model method, which first unwraps the model and then saves it. In addition, if we use the safe_serialization=True parameter, the model is saved as safetensors.
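The saving itself reduces to a single call. A minimal sketch, using the same output folder as the script below:

# Unwraps the prepared model and writes its weights; with safe_serialization=True
# the weights are saved as model.safetensors instead of pytorch_model.bin
accelerator.save_model(model, "accelerate_scripts/model", safe_serialization=True)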
%%writefile accelerate_scripts/11_accelerate_save_model.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)

max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}

tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

@accelerator.on_local_main_process
def print_something(something):
    print(something)

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)

model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)

        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

# Save the model
accelerator.wait_for_everyone()
accelerator.save_model(model, "accelerate_scripts/model", safe_serialization=True)

print_something(f"Accuracy = {accuracy['accuracy']}")
Writing accelerate_scripts/11_accelerate_save_model.py
We run it.
!accelerate launch accelerate_scripts/11_accelerate_save_model.py
100%|█████████████████████████████████████████| 176/176 [01:58<00:00,  1.48it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00,  3.35it/s]
Accuracy = 0.214
Saving the pretrained model
For models from the transformers library, we should save the model with the save_pretrained method so that it can be loaded later with the from_pretrained method. Before saving, the model must be unwrapped with the unwrap_model method.
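The core pattern, extracted from the script below: unwrap first, then save with save_pretrained, making sure only the main process writes to disk and that accelerate's save function is used.

# Unwrap the model prepared by accelerate, then save it in transformers format
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    "accelerate_scripts/model_pretrained",
    is_main_process=accelerator.is_main_process,  # only rank 0 writes to disk
    save_function=accelerator.save,
)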
%%writefile accelerate_scripts/12_accelerate_save_pretrained.py
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm

# Import and initialize Accelerator
from accelerate import Accelerator
accelerator = Accelerator()

dataset = load_dataset("tweet_eval", "emoji")
num_classes = len(dataset["train"].info.features["label"].names)

max_len = 130
checkpoints = "cardiffnlp/twitter-roberta-base-irony"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)

def tokenize_function(dataset):
    return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

tokenized_dataset = {
    "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
    "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
}

tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

BS = 128
dataloader = {
    "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
    "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
    "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
}

model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

loss_function = torch.nn.CrossEntropyLoss()
optimizer = Adam(model.parameters(), lr=5e-4)
metric = evaluate.load("accuracy")

@accelerator.on_local_main_process
def print_something(something):
    print(something)

EPOCHS = 1
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device = accelerator.device
# model.to(device)

model, optimizer, dataloader["train"], dataloader["validation"] = accelerator.prepare(model, optimizer, dataloader["train"], dataloader["validation"])

for i in range(EPOCHS):
    model.train()
    progress_bar_train = tqdm.tqdm(dataloader["train"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_train:
        optimizer.zero_grad()
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_function(outputs['logits'], labels)
        # loss.backward()
        accelerator.backward(loss)
        optimizer.step()

    model.eval()
    progress_bar_validation = tqdm.tqdm(dataloader["validation"], disable=not accelerator.is_local_main_process)
    for batch in progress_bar_validation:
        input_ids = batch["input_ids"]#.to(device)
        attention_mask = batch["attention_mask"]#.to(device)
        labels = batch["label"]#.to(device)
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.argmax(outputs['logits'], axis=-1)

        # Gather the predictions from all devices
        predictions = accelerator.gather_for_metrics(predictions)
        labels = accelerator.gather_for_metrics(labels)

        accuracy = metric.add_batch(predictions=predictions, references=labels)
    accuracy = metric.compute()

# Save the pretrained model
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    "accelerate_scripts/model_pretrained",
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
)

print_something(f"Accuracy = {accuracy['accuracy']}")
Writing accelerate_scripts/12_accelerate_save_pretrained.py
We run it.
!accelerate launch accelerate_scripts/12_accelerate_save_pretrained.py
Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15152.47 examples/s]
Map: 100%|██████████████████████| 45000/45000 [00:02<00:00, 15119.13 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 12724.70 examples/s]
Map: 100%|████████████████████████| 5000/5000 [00:00<00:00, 12397.49 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 15247.21 examples/s]
Map: 100%|██████████████████████| 50000/50000 [00:03<00:00, 15138.03 examples/s]
100%|█████████████████████████████████████████| 176/176 [01:59<00:00,  1.48it/s]
100%|███████████████████████████████████████████| 20/20 [00:05<00:00,  3.37it/s]
Accuracy = 0.21
Now we can load it. Note that since we load it with AutoModel, only the base model is loaded, which is why the warning below says the pooler weights are newly initialized: the saved sequence classification checkpoint does not contain a pooler, and its classification head is not part of the base model class.
from transformers import AutoModel

checkpoints = "accelerate_scripts/model_pretrained"
model = AutoModel.from_pretrained(checkpoints)
Some weights of RobertaModel were not initialized from the model checkpoint at accelerate_scripts/model_pretrained and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Training in notebooks
So far we have seen how to run scripts, but if we want to run the code in a notebook, we can write the same code as before, only wrapped in a function.
First, we import the libraries.
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
import time
# from accelerate import Accelerator
Now we create the function.
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import evaluate
import tqdm
import time
# from accelerate import Accelerator

def train_code(batch_size: int = 64):
    from accelerate import Accelerator
    accelerator = Accelerator()

    dataset = load_dataset("tweet_eval", "emoji")
    num_classes = len(dataset["train"].info.features["label"].names)

    max_len = 130
    checkpoints = "cardiffnlp/twitter-roberta-base-irony"
    tokenizer = AutoTokenizer.from_pretrained(checkpoints)

    def tokenize_function(dataset):
        return tokenizer(dataset["text"], max_length=max_len, padding="max_length", truncation=True, return_tensors="pt")

    tokenized_dataset = {
        "train": dataset["train"].map(tokenize_function, batched=True, remove_columns=["text"]),
        "validation": dataset["validation"].map(tokenize_function, batched=True, remove_columns=["text"]),
        "test": dataset["test"].map(tokenize_function, batched=True, remove_columns=["text"]),
    }

    tokenized_dataset["train"].set_format(type="torch", columns=['input_ids', 'attention_mask', 'label'])
    tokenized_dataset["validation"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])
    tokenized_dataset["test"].set_format(type="torch", columns=['label', 'input_ids', 'attention_mask'])

    BS = batch_size
    dataloader = {
        "train": DataLoader(tokenized_dataset["train"], batch_size=BS, shuffle=True),
        "validation": DataLoader(tokenized_dataset["validation"], batch_size=BS, shuffle=True),
        "test": DataLoader(tokenized_dataset["test"], batch_size=BS, shuffle=True),
    }

    model = AutoModelForSequenceClassification.from_pretrained(checkpoints)
    model.classifier.out_proj = torch.nn.Linear(in_features=model.classifier.out_proj.in_features, out_features=num_classes, bias=True)

    loss_function = torch.nn.CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=5e-4)
    metric = evaluate.load("accuracy")