Fine tuning SMLs

Disclaimer: This post has been translated to English using a machine translation model. Please let me know if you find any mistakes.

In this post, we are going to see how to fine-tune small language models, both for text classification and for text generation. First, we will see how to do it with the Hugging Face libraries, since Hugging Face has become a very important player in the AI ecosystem.

But although the Hugging Face libraries are important and useful, it is also very important to understand how the training actually happens under the hood, so we are going to repeat the classification and text generation trainings in pure PyTorch.

Fine tuning for text classification with Hugging Face

Login

To be able to upload the training results to the Hub, we first need to log in, and for that we need a token.

To create a token, you need to go to the settings/tokens page of your account, and you will see something like this

User-Access-Token-dark

We click on New token and a window will appear to create a new token

new-token-dark

We give the token a name and create it with the write role, or with the Fine-grained role, which allows us to select exactly which permissions the token will have.

Once created, we copy and paste it below.

	
from huggingface_hub import notebook_login
notebook_login()
Copy

Dataset

Now we download a dataset; in this case, one of Amazon product reviews.

	
from datasets import load_dataset
dataset = load_dataset("mteb/amazon_reviews_multi", "en")
Copy

Let's take a look at it a bit

	
dataset
Copy
	
DatasetDict({
train: Dataset({
features: ['id', 'text', 'label', 'label_text'],
num_rows: 200000
})
validation: Dataset({
features: ['id', 'text', 'label', 'label_text'],
num_rows: 5000
})
test: Dataset({
features: ['id', 'text', 'label', 'label_text'],
num_rows: 5000
})
})

We see that it has a training set with 200,000 samples, a validation set with 5,000 samples, and a test set with 5,000 samples.

Let's take a look at an example from the training set

	
from random import randint
idx = randint(0, len(dataset['train']) - 1)
dataset['train'][idx]
Copy
	
{'id': 'en_0907914',
'text': 'Mixed with fir it’s passable Not the scent I had hoped for . Love the scent of cedar, but this one missed',
'label': 3,
'label_text': '3'}

We see that the review is in the text field and the rating given by the user is in the label field.

As we are going to build a text classification model, we need to know how many classes we will have.

	
num_classes = len(dataset['train'].unique('label'))
num_classes
Copy
	
5

We will have 5 classes. Now let's look at the values these classes take, to know whether the score starts at 0 or at 1. For this, we use the unique method.

	
dataset.unique('label')
Copy
	
{'train': [0, 1, 2, 3, 4],
'validation': [0, 1, 2, 3, 4],
'test': [0, 1, 2, 3, 4]}

The minimum value will be 0

To train, the labels need to be in a field called labels, while in our dataset they are in a field called label, so we create a new labels field with the same value as label.

We create a function that does what we want

	
def set_labels(example):
    example['labels'] = example['label']
    return example
Copy

We apply the function to the dataset

	
dataset = dataset.map(set_labels)
Copy

Let's see what the dataset looks like

	
dataset['train'][idx]
Copy
	
{'id': 'en_0907914',
'text': 'Mixed with fir it’s passable Not the scent I had hoped for . Love the scent of cedar, but this one missed',
'label': 3,
'label_text': '3',
'labels': 3}

Tokenizer

Since we have the reviews in text form in the dataset, we need to tokenize them so that we can feed the tokens into the model.

	
from transformers import AutoTokenizer
checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Copy

Now we create a function to tokenize the text. We will do it in such a way that all sequences have the same length: the tokenizer will truncate when necessary and add padding tokens when necessary. We also specify that it should return PyTorch tensors.

We set the length of each sequence to 768 tokens. Be careful not to confuse the two meanings of this number: 768 is the embedding dimension of the small GPT-2 model (as we saw in the GPT2 post), while the maximum sequence length is a separate parameter; GPT-2's context window is actually 1024 tokens, so 768 is simply a valid choice below that limit.
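If you want to check both numbers yourself, they are available in the model configuration (a quick check using AutoConfig):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("openai-community/gpt2")
print(config.n_embd)       # 768  -> embedding dimension
print(config.n_positions)  # 1024 -> maximum context length
```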

	
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=768, return_tensors="pt")
Copy

Let's try to tokenize a text

	
tokens = tokenize_function(dataset['train'][idx])
Copy
	
---------------------------------------------------------------------------ValueError Traceback (most recent call last)Cell In[11], line 1
----> 1 tokens = tokenize_function(dataset['train'][idx])
Cell In[10], line 2, in tokenize_function(examples)
1 def tokenize_function(examples):
----> 2 return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=768, return_tensors="pt")
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2883, in PreTrainedTokenizerBase.__call__(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2881 if not self._in_target_context_manager:
2882 self._switch_to_input_mode()
-> 2883 encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
2884 if text_target is not None:
2885 self._switch_to_target_mode()
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2989, in PreTrainedTokenizerBase._call_one(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2969 return self.batch_encode_plus(
2970 batch_text_or_text_pairs=batch_text_or_text_pairs,
2971 add_special_tokens=add_special_tokens,
(...)
2986 **kwargs,
2987 )
2988 else:
-> 2989 return self.encode_plus(
2990 text=text,
2991 text_pair=text_pair,
2992 add_special_tokens=add_special_tokens,
2993 padding=padding,
2994 truncation=truncation,
2995 max_length=max_length,
2996 stride=stride,
2997 is_split_into_words=is_split_into_words,
2998 pad_to_multiple_of=pad_to_multiple_of,
2999 return_tensors=return_tensors,
3000 return_token_type_ids=return_token_type_ids,
3001 return_attention_mask=return_attention_mask,
3002 return_overflowing_tokens=return_overflowing_tokens,
3003 return_special_tokens_mask=return_special_tokens_mask,
3004 return_offsets_mapping=return_offsets_mapping,
3005 return_length=return_length,
3006 verbose=verbose,
3007 **kwargs,
3008 )
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:3053, in PreTrainedTokenizerBase.encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
3032 """
3033 Tokenize and prepare for the model a sequence or a pair of sequences.
3034
(...)
3049 method).
3050 """
3052 # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
-> 3053 padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
3054 padding=padding,
3055 truncation=truncation,
3056 max_length=max_length,
3057 pad_to_multiple_of=pad_to_multiple_of,
3058 verbose=verbose,
3059 **kwargs,
3060 )
3062 return self._encode_plus(
3063 text=text,
3064 text_pair=text_pair,
(...)
3080 **kwargs,
3081 )
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:2788, in PreTrainedTokenizerBase._get_padding_truncation_strategies(self, padding, truncation, max_length, pad_to_multiple_of, verbose, **kwargs)
2786 # Test if we have a padding token
2787 if padding_strategy != PaddingStrategy.DO_NOT_PAD and (self.pad_token is None or self.pad_token_id < 0):
-> 2788 raise ValueError(
2789 "Asking to pad but the tokenizer does not have a padding token. "
2790 "Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` "
2791 "or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`."
2792 )
2794 # Check that we will truncate to a multiple of pad_to_multiple_of if both are provided
2795 if (
2796 truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE
2797 and padding_strategy != PaddingStrategy.DO_NOT_PAD
(...)
2800 and (max_length % pad_to_multiple_of != 0)
2801 ):
ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

We get an error because the GPT-2 tokenizer does not have a padding token and asks us to assign one; it also suggests doing tokenizer.pad_token = tokenizer.eos_token, so that is what we do.

	
tokenizer.pad_token = tokenizer.eos_token
Copy

We test the tokenization function again

	
tokens = tokenize_function(dataset['train'][idx])
tokens['input_ids'].shape, tokens['attention_mask'].shape
Copy
	
(torch.Size([1, 768]), torch.Size([1, 768]))

Now that we have checked that the function tokenizes correctly, we apply it to the dataset, in batches so that it runs faster.

We also take the opportunity to remove the columns we won't need.

	
dataset = dataset.map(tokenize_function, batched=True, remove_columns=['text', 'label', 'id', 'label_text'])
Copy

We now see how the dataset looks.

	
dataset
Copy
	
DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 5000
    })
})

We see that we have the fields 'labels', 'input_ids', and 'attention_mask', which is what we are interested in for training.

Model

We instantiate a model for sequence classification and specify the number of classes we have

	
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_classes)
Copy
	
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at openai-community/gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

It tells us that the weights of the score layer have been initialized randomly and that we need to retrain them. Let's see why this happens.

This would be the GPT-2 model for text generation (causal language modeling)

	
from transformers import AutoModelForCausalLM
casual_model = AutoModelForCausalLM.from_pretrained(checkpoint)
Copy

Let's look at its architecture

	
casual_model
Copy
	
GPT2LMHeadModel(
(transformer): GPT2Model(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(drop): Dropout(p=0.1, inplace=False)
(h): ModuleList(
(0-11): 12 x GPT2Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): GPT2Attention(
(c_attn): Conv1D()
(c_proj): Conv1D()
(attn_dropout): Dropout(p=0.1, inplace=False)
(resid_dropout): Dropout(p=0.1, inplace=False)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): GPT2MLP(
(c_fc): Conv1D()
(c_proj): Conv1D()
(act): NewGELUActivation()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

And now the architecture of the model we are going to use for classifying the reviews

	
model
Copy
	
GPT2ForSequenceClassification(
(transformer): GPT2Model(
(wte): Embedding(50257, 768)
(wpe): Embedding(1024, 768)
(drop): Dropout(p=0.1, inplace=False)
(h): ModuleList(
(0-11): 12 x GPT2Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): GPT2Attention(
(c_attn): Conv1D()
(c_proj): Conv1D()
(attn_dropout): Dropout(p=0.1, inplace=False)
(resid_dropout): Dropout(p=0.1, inplace=False)
)
(ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(mlp): GPT2MLP(
(c_fc): Conv1D()
(c_proj): Conv1D()
(act): NewGELUActivation()
(dropout): Dropout(p=0.1, inplace=False)
)
)
)
(ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(score): Linear(in_features=768, out_features=5, bias=False)
)

There are two things to mention here.

  • The first is that, in both, the first layer has dimensions 50257x768, corresponding to the 50257 possible tokens of the GPT-2 vocabulary and the 768 dimensions of the embedding. Note that this 768 is the size of each token's embedding vector; it is independent of the 768-token sequence length we happened to choose when tokenizing the reviews.
  • The second is that the causal model (the text generation one, in the casual_model variable) ends in a Linear layer that produces 50257 values: it is in charge of predicting the next token and assigns a score to every possible token. The classification model, on the other hand, ends in a Linear layer that produces only 5 values, one per class, which will give us the probability that the review belongs to each class.

That's why we were getting the message that the weights of the score layer had been initialized randomly: the transformers library removed the 768x50257 Linear layer and added a 768x5 Linear layer initialized with random values, which we need to train for our specific problem.
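We can check this directly by printing the output heads of the two models, based on the architectures shown above:

```python
# The causal model keeps the original language-modeling head...
print(casual_model.lm_head)  # Linear(in_features=768, out_features=50257, bias=False)
# ...while the classification model has a freshly initialized 5-class head
print(model.score)           # Linear(in_features=768, out_features=5, bias=False)
```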

We delete the causal model because we are not going to use it.

	
del casual_model
Copy

Trainer

Let's now configure the training arguments

	
from transformers import TrainingArguments
metric_name = "accuracy"
model_name = "GPT2-small-finetuned-amazon-reviews-en-classification"
LR = 2e-5
BS_TRAIN = 28
BS_EVAL = 40
EPOCHS = 3
WEIGHT_DECAY = 0.01
training_args = TrainingArguments(
    model_name,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=LR,
    per_device_train_batch_size=BS_TRAIN,
    per_device_eval_batch_size=BS_EVAL,
    num_train_epochs=EPOCHS,
    weight_decay=WEIGHT_DECAY,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=True,
)
Copy

We define a metric for the validation dataloader

	
import numpy as np
from evaluate import load
metric = load("accuracy")
def compute_metrics(eval_pred):
    print(eval_pred)
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
Copy

We now define the trainer

	
from transformers import Trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
Copy

We train

	
trainer.train()
Copy
	
0%| | 0/600000 [00:00<?, ?it/s]
	
---------------------------------------------------------------------------ValueError Traceback (most recent call last)Cell In[21], line 1
----> 1 trainer.train()
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/trainer.py:1876, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1873 try:
1874 # Disable progress bars when uploading models during checkpoints to avoid polluting stdout
1875 hf_hub_utils.disable_progress_bars()
-> 1876 return inner_training_loop(
1877 args=args,
1878 resume_from_checkpoint=resume_from_checkpoint,
1879 trial=trial,
1880 ignore_keys_for_eval=ignore_keys_for_eval,
1881 )
1882 finally:
1883 hf_hub_utils.enable_progress_bars()
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/trainer.py:2178, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2175 rng_to_sync = True
2177 step = -1
-> 2178 for step, inputs in enumerate(epoch_iterator):
2179 total_batched_samples += 1
2181 if self.args.include_num_input_tokens_seen:
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/accelerate/data_loader.py:454, in DataLoaderShard.__iter__(self)
452 # We iterate one batch ahead to check when we are at the end
453 try:
--> 454 current_batch = next(dataloader_iter)
455 except StopIteration:
456 yield
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/torch/utils/data/dataloader.py:631, in _BaseDataLoaderIter.__next__(self)
628 if self._sampler_iter is None:
629 # TODO(https://github.com/pytorch/pytorch/issues/76750)
630 self._reset() # type: ignore[call-arg]
--> 631 data = self._next_data()
632 self._num_yielded += 1
633 if self._dataset_kind == _DatasetKind.Iterable and \
634 self._IterableDataset_len_called is not None and \
635 self._num_yielded > self._IterableDataset_len_called:
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/torch/utils/data/dataloader.py:675, in _SingleProcessDataLoaderIter._next_data(self)
673 def _next_data(self):
674 index = self._next_index() # may raise StopIteration
--> 675 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
676 if self._pin_memory:
677 data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py:54, in _MapDatasetFetcher.fetch(self, possibly_batched_index)
52 else:
53 data = self.dataset[possibly_batched_index]
---> 54 return self.collate_fn(data)
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/data/data_collator.py:271, in DataCollatorWithPadding.__call__(self, features)
270 def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
--> 271 batch = pad_without_fast_tokenizer_warning(
272 self.tokenizer,
273 features,
274 padding=self.padding,
275 max_length=self.max_length,
276 pad_to_multiple_of=self.pad_to_multiple_of,
277 return_tensors=self.return_tensors,
278 )
279 if "label" in batch:
280 batch["labels"] = batch["label"]
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/data/data_collator.py:66, in pad_without_fast_tokenizer_warning(tokenizer, *pad_args, **pad_kwargs)
63 tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = True
65 try:
---> 66 padded = tokenizer.pad(*pad_args, **pad_kwargs)
67 finally:
68 # Restore the state of the warning.
69 tokenizer.deprecation_warnings["Asking-to-pad-a-fast-tokenizer"] = warning_state
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:3299, in PreTrainedTokenizerBase.pad(self, encoded_inputs, padding, max_length, pad_to_multiple_of, return_attention_mask, return_tensors, verbose)
3297 # The model's main input name, usually `input_ids`, has be passed for padding
3298 if self.model_input_names[0] not in encoded_inputs:
-> 3299 raise ValueError(
3300 "You should supply an encoding or a list of encodings to this method "
3301 f"that includes {self.model_input_names[0]}, but you provided {list(encoded_inputs.keys())}"
3302 )
3304 required_input = encoded_inputs[self.model_input_names[0]]
3306 if required_input is None or (isinstance(required_input, Sized) and len(required_input) == 0):
ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided ['label', 'labels']

We get an error again, this time because the model does not have a padding token assigned, so, just as we did with the tokenizer, we assign it one.

	
model.config.pad_token_id = model.config.eos_token_id
Copy

We recreate the training arguments with the model, which now has a padding token, recreate the trainer, and train again.

	
training_args = TrainingArguments(
    model_name,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=LR,
    per_device_train_batch_size=BS_TRAIN,
    per_device_eval_batch_size=BS_EVAL,
    num_train_epochs=EPOCHS,
    weight_decay=WEIGHT_DECAY,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=True,
    logging_dir="./runs",
)
trainer = Trainer(
    model,
    training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
Copy

Now that we've seen everything is in order, we can train

	
trainer.train()
Copy
	
<transformers.trainer_utils.EvalPrediction object at 0x782767ea1450>
<transformers.trainer_utils.EvalPrediction object at 0x782767eeefe0>
<transformers.trainer_utils.EvalPrediction object at 0x782767eecfd0>
	
TrainOutput(global_step=21429, training_loss=0.7846888848762739, metrics={'train_runtime': 26367.7801, 'train_samples_per_second': 22.755, 'train_steps_per_second': 0.813, 'total_flos': 2.35173445632e+17, 'train_loss': 0.7846888848762739, 'epoch': 3.0})

Evaluation

Once trained, we evaluate on the test dataset

	
trainer.evaluate(eval_dataset=dataset['test'])
Copy
	
<transformers.trainer_utils.EvalPrediction object at 0x7826ddfded40>
	
{'eval_loss': 0.7973636984825134,
'eval_accuracy': 0.6626,
'eval_runtime': 76.3016,
'eval_samples_per_second': 65.529,
'eval_steps_per_second': 1.638,
'epoch': 3.0}

Publish the model

We already have our model trained, so we can share it with the world. First, we create a **model card**.

	
trainer.create_model_card()
Copy

And we can publish it now. Since the first thing we did was log in to the Hugging Face Hub, we can upload it to our hub without any issues.

	
trainer.push_to_hub()
Copy

Usage of the model

We free up memory as much as possible

	
import torch
import gc
def clear_hardwares():
    torch.clear_autocast_cache()
    torch.cuda.ipc_collect()
    torch.cuda.empty_cache()
    gc.collect()

clear_hardwares()
clear_hardwares()
Copy

Since we have uploaded the model to our hub, we can download and use it

	
from transformers import pipeline
user = "maximofn"
checkpoints = f"{user}/{model_name}"
task = "text-classification"
classifier = pipeline(task, model=checkpoints, tokenizer=checkpoints)
Copy

If we want it to return the probability of all classes, we simply use the classifier we just instantiated, with the parameter top_k=None

	
labels = classifier("I love this product", top_k=None)
labels
Copy
	
[{'label': 'LABEL_4', 'score': 0.8253807425498962},
{'label': 'LABEL_3', 'score': 0.15411493182182312},
{'label': 'LABEL_2', 'score': 0.013907806016504765},
{'label': 'LABEL_0', 'score': 0.003939222544431686},
{'label': 'LABEL_1', 'score': 0.0026572425849735737}]

If we only want the class with the highest probability, we do the same but with the parameter top_k=1

	
label = classifier("I love this product", top_k=1)
label
Copy
	
[{'label': 'LABEL_4', 'score': 0.8253807425498962}]

And if we want n classes, we do the same but with the parameter top_k=n

	
two_labels = classifier("I love this product", top_k=2)
two_labels
Copy
	
[{'label': 'LABEL_4', 'score': 0.8253807425498962},
{'label': 'LABEL_3', 'score': 0.15411493182182312}]

We can also test the model with AutoModel and AutoTokenizer

	
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "GPT2-small-finetuned-amazon-reviews-en-classification"
user = "maximofn"
checkpoint = f"{user}/{model_name}"
num_classes = 5
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_classes).half().eval().to("cuda")
Copy
	
tokens = tokenizer.encode("I love this product", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model(tokens)
logits = output.logits
labels = torch.softmax(logits, dim=1).cpu().numpy().tolist()
labels[0]
Copy
	
[0.003963470458984375,
0.0026721954345703125,
0.01397705078125,
0.154541015625,
0.82470703125]

If you want to try the model further, you can check it out at Maximofn/GPT2-small-finetuned-amazon-reviews-en-classification

Fine-tuning for Text Generation with Hugging Face

To make sure I don't run into VRAM issues, I restart the notebook.

Login

To be able to upload the training results to the Hub, we first need to log in, and for that we need a token.

To create a token, you need to go to the settings/tokens page of your account, and you will see something like this

User-Access-Token-dark

We click on New token and a window will appear to create a new token

new-token-dark

We give the token a name and create it with the write role, or with the Fine-grained role, which allows us to select exactly which permissions the token will have.

Once created, we copy and paste it below.

	
from huggingface_hub import notebook_login
notebook_login()
Copy

Dataset

We are going to use an English jokes dataset

	
from datasets import load_dataset
jokes = load_dataset("Maximofn/short-jokes-dataset")
jokes
Copy
	
DatasetDict({
train: Dataset({
features: ['ID', 'Joke'],
num_rows: 231657
})
})

We see that it is a single training set with more than 200 thousand jokes, so later we will have to split it into training and evaluation sets.

Let's see a sample

	
from random import randint
idx = randint(0, len(jokes['train']) - 1)
jokes['train'][idx]
Copy
	
{'ID': 198387,
'Joke': 'My hot dislexic co-worker said she had an important massage to give me in her office... When I got there, she told me it can wait until I put on some clothes.'}

We see that each sample has a joke ID, which doesn't interest us at all, and the joke itself.

In case you have limited GPU memory, I create a subset of the dataset; choose the percentage of jokes you want to use.

	
percent_of_train_dataset = 1 # If you want 50% of the dataset, set this to 0.5
subset_dataset = jokes["train"].select(range(int(len(jokes["train"]) * percent_of_train_dataset)))
subset_dataset
Copy
	
Dataset({
features: ['ID', 'Joke'],
num_rows: 231657
})

Now we divide the subset into a training set and a validation set

	
percent_of_train_dataset = 0.90
split_dataset = subset_dataset.train_test_split(train_size=int(subset_dataset.num_rows * percent_of_train_dataset), seed=19, shuffle=False)
train_dataset = split_dataset["train"]
validation_test_dataset = split_dataset["test"]
split_dataset = validation_test_dataset.train_test_split(train_size=int(validation_test_dataset.num_rows * 0.5), seed=19, shuffle=False)
validation_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]
print(f"Size of the train set: {len(train_dataset)}. Size of the validation set: {len(validation_dataset)}. Size of the test set: {len(test_dataset)}")
Copy
	
Size of the train set: 208491. Size of the validation set: 11583. Size of the test set: 11583

Tokenizer

We instantiate the tokenizer and assign its padding token so that we don't get an error as before.

	
from transformers import AutoTokenizer
checkpoints = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
Copy

Let's add two new tokens, for the start and the end of a joke, to have more control.

	
new_tokens = ['<SJ>', '<EJ>'] # Start and end of joke tokens
num_added_tokens = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added_tokens} tokens")
Copy
	
Added 2 tokens

We create a function that adds the new tokens to each joke

	
joke_column = "Joke"
def format_joke(example):
    example[joke_column] = '<SJ> ' + example['Joke'] + ' <EJ>'
    return example
Copy

We build the list of columns we need to remove

	
remove_columns = [column for column in train_dataset.column_names if column != joke_column]
remove_columns
Copy
	
['ID']

We format the dataset and remove the columns we don't need

	
train_dataset = train_dataset.map(format_joke, remove_columns=remove_columns)
validation_dataset = validation_dataset.map(format_joke, remove_columns=remove_columns)
test_dataset = test_dataset.map(format_joke, remove_columns=remove_columns)
train_dataset, validation_dataset, test_dataset
Copy
	
(Dataset({
features: ['Joke'],
num_rows: 208491
}),
Dataset({
features: ['Joke'],
num_rows: 11583
}),
Dataset({
features: ['Joke'],
num_rows: 11583
}))

Now we create a function to tokenize the jokes

	
def tokenize_function(examples):
    return tokenizer(examples[joke_column], padding="max_length", truncation=True, max_length=768, return_tensors="pt")
Copy

We tokenize the dataset and remove the column with the text

	
train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=[joke_column])
validation_dataset = validation_dataset.map(tokenize_function, batched=True, remove_columns=[joke_column])
test_dataset = test_dataset.map(tokenize_function, batched=True, remove_columns=[joke_column])
train_dataset, validation_dataset, test_dataset
Copy
	
(Dataset({
features: ['input_ids', 'attention_mask'],
num_rows: 208491
}),
Dataset({
features: ['input_ids', 'attention_mask'],
num_rows: 11583
}),
Dataset({
features: ['input_ids', 'attention_mask'],
num_rows: 11583
}))

Model

Now we instantiate the model for text generation and assign the padding token to the end of string token.

	
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(checkpoints)
model.config.pad_token_id = model.config.eos_token_id
Copy

We see the size of the model's vocabulary

	
vocab_size = model.config.vocab_size
vocab_size
Copy
	
50257

It has 50257 tokens, which is the size of the GPT-2 vocabulary. But since we said we were going to create two new tokens for the start and end of a joke, we add them to the model.

	
model.resize_token_embeddings(len(tokenizer))
new_vocab_size = model.config.vocab_size
print(f"Old vocab size: {vocab_size}. New vocab size: {new_vocab_size}. Added {new_vocab_size - vocab_size} tokens")
Copy
	
Old vocab size: 50257. New vocab size: 50259. Added 2 tokens

The two new tokens have been added.

Training

We set the training parameters

	
from transformers import TrainingArguments
metric_name = "accuracy"
model_name = "GPT2-small-finetuned-Maximofn-short-jokes-dataset-casualLM"
output_dir = f"./training_results"
LR = 2e-5
BS_TRAIN = 28
BS_EVAL = 32
EPOCHS = 3
WEIGHT_DECAY = 0.01
WARMUP_STEPS = 100
training_args = TrainingArguments(
    model_name,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=LR,
    per_device_train_batch_size=BS_TRAIN,
    per_device_eval_batch_size=BS_EVAL,
    warmup_steps=WARMUP_STEPS,
    num_train_epochs=EPOCHS,
    weight_decay=WEIGHT_DECAY,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    fp16=True,
    load_best_model_at_end=True,
    # metric_for_best_model=metric_name,
    push_to_hub=True,
)
Copy

This time we don't use metric_for_best_model; we explain why after defining the trainer.

We define the trainer

	
from transformers import Trainer
trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
    # compute_metrics=compute_metrics,
)
Copy

In this case, we don't pass a compute_metrics function, so during evaluation the loss will be used to evaluate the model. That is why we didn't set metric_for_best_model in the arguments: when it is not set, the Trainer falls back to the loss to decide which model is best.

We train

	
trainer.train()
Copy
	
0%| | 0/625473 [00:00<?, ?it/s]
	
---------------------------------------------------------------------------ValueError Traceback (most recent call last)Cell In[19], line 1
----> 1 trainer.train()
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1883 hf_hub_utils.enable_progress_bars()
1884 else:
-> 1885 return inner_training_loop(
1886 args=args,
1887 resume_from_checkpoint=resume_from_checkpoint,
1888 trial=trial,
1889 ignore_keys_for_eval=ignore_keys_for_eval,
1890 )
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/trainer.py:2216, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2213 self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
2215 with self.accelerator.accumulate(model):
-> 2216 tr_loss_step = self.training_step(model, inputs)
2218 if (
2219 args.logging_nan_inf_filter
2220 and not is_torch_xla_available()
2221 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
2222 ):
2223 # if loss is nan or inf simply add the average of previous logged losses
2224 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/trainer.py:3238, in Trainer.training_step(self, model, inputs)
3235 return loss_mb.reduce_mean().detach().to(self.args.device)
3237 with self.compute_loss_context_manager():
-> 3238 loss = self.compute_loss(model, inputs)
3240 del inputs
3241 torch.cuda.empty_cache()
File ~/miniconda3/envs/nlp_/lib/python3.11/site-packages/transformers/trainer.py:3282, in Trainer.compute_loss(self, model, inputs, return_outputs)
3280 else:
3281 if isinstance(outputs, dict) and "loss" not in outputs:
-> 3282 raise ValueError(
3283 "The model did not return a loss from the inputs, only the following keys: "
3284 f"{','.join(outputs.keys())}. For reference, the inputs it received are {','.join(inputs.keys())}."
3285 )
3286 # We don't use .loss here since the model may return tuples instead of ModelOutput.
3287 loss = outputs["loss"] if isinstance(outputs, dict) else outputs[0]
ValueError: The model did not return a loss from the inputs, only the following keys: logits,past_key_values. For reference, the inputs it received are input_ids,attention_mask.

As we can see, it gives us an error telling us that the model does not return the loss value, which is essential for training. Let's see why.

Let's first see what an example from the dataset looks like

	
idx = randint(0, len(train_dataset) - 1)
sample = train_dataset[idx]
sample
Copy
	
{'input_ids': [50257, 4162, 750, 262, 18757, 6451, 2245, 2491, 30, 4362, 340, 373, 734, 10032, 13, 220, 50258, 50256, 50256, ..., 50256, 50256, 50256],
'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ..., 0, 0, 0]}

As we can see, we have a dictionary with the input_ids and the attention_mask. If we pass it to the model, we get this

	
import torch
output = model(
    input_ids=torch.Tensor(sample["input_ids"]).long().unsqueeze(0).to(model.device),
    attention_mask=torch.Tensor(sample["attention_mask"]).long().unsqueeze(0).to(model.device),
)
print(output.loss)
Copy
	
None

As we can see, it does not return a loss because it expects a value for labels, which we have not given it. In the previous example, fine-tuning for text classification, we said the labels had to go in a dataset field called labels, but in this case we don't have that field.

If we now assign the labels to the input_ids and look at the loss again

	
import torch
output = model(
    input_ids=torch.Tensor(sample["input_ids"]).long().unsqueeze(0).to(model.device),
    attention_mask=torch.Tensor(sample["attention_mask"]).long().unsqueeze(0).to(model.device),
    labels=torch.Tensor(sample["input_ids"]).long().unsqueeze(0).to(model.device),
)
print(output.loss)
Copy
	
tensor(102.1873, device='cuda:0', grad_fn=<NllLossBackward0>)

Now we get a loss

Therefore, we have two options: add a labels field to the dataset with the values of input_ids, or use a data collator from the transformers library, in this case DataCollatorForLanguageModeling.
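For reference, the first option would be a simple map over the dataset; a minimal sketch, with a hypothetical add_labels helper (not what we will use here):

```python
# Sketch of option 1: copy input_ids into a labels field
def add_labels(examples):
    examples["labels"] = examples["input_ids"].copy()
    return examples

# train_dataset = train_dataset.map(add_labels, batched=True)
```

We will go with the second option, the data collator. Let's take a look at it.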

	
from transformers import DataCollatorForLanguageModeling
my_data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
Copy

We pass the sample through this data_collator

	
collated_sample = my_data_collator([sample]).to(model.device)
Copy

We see what the output is

	
for key, value in collated_sample.items():
    print(f"{key} ({value.shape}): {value}")
Copy
	
input_ids (torch.Size([1, 768])): tensor([[50257, 4162, 750, 262, 18757, 6451, 2245, 2491, 30, 4362,
340, 373, 734, 10032, 13, 220, 50258, 50256, ..., 50256, 50256]],
device='cuda:0')
attention_mask (torch.Size([1, 768])): tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ..., 0, 0]],
device='cuda:0')
labels (torch.Size([1, 768])): tensor([[50257, 4162, 750, 262, 18757, 6451, 2245, 2491, 30, 4362,
340, 373, 734, 10032, 13, 220, 50258, -100, ..., -100, -100]],
device='cuda:0')

As can be seen, the data_collator has created a labels field and assigned it the values of input_ids. The padded tokens have been given the value -100. Since we defined the data_collator with mlm=False, we are doing plain (causal) Language Modeling rather than Masked Language Modeling, so no original token is masked; only the padding positions get -100.
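The value -100 is special because it is the default ignore_index of PyTorch's CrossEntropyLoss, so the padded positions simply don't contribute to the loss. A minimal standalone sketch to verify this behavior:

```python
import torch
from torch.nn import CrossEntropyLoss

logits = torch.randn(5, 10)                   # 5 positions, vocabulary of 10 tokens
labels = torch.tensor([1, 2, 3, -100, -100])  # the last two positions are padding

loss_fn = CrossEntropyLoss()            # ignore_index defaults to -100
print(loss_fn(logits, labels))          # computed only over the 3 real positions
print(loss_fn(logits[:3], labels[:3]))  # same value
```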

Let's see if we get a loss with this data_collator

	
output = model(**collated_sample)
output.loss
Copy
	
tensor(102.7181, device='cuda:0', grad_fn=<NllLossBackward0>)

So we redefine the trainer with the data_collator and train again.

	
from transformers import DataCollatorForLanguageModeling
trainer = Trainer(
    model,
    training_args,
    train_dataset=train_dataset,
    eval_dataset=validation_dataset,
    tokenizer=tokenizer,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
Copy
	
trainer.train()
Copy
	
There were missing keys in the checkpoint model loaded: ['lm_head.weight'].
	
TrainOutput(global_step=22341, training_loss=3.505178199598342, metrics={'train_runtime': 9209.5353, 'train_samples_per_second': 67.916, 'train_steps_per_second': 2.426, 'total_flos': 2.45146666696704e+17, 'train_loss': 3.505178199598342, 'epoch': 3.0})

Evaluation

Once trained, we evaluate the model on the test dataset

	
trainer.evaluate(eval_dataset=test_dataset)
Copy
	
{'eval_loss': 3.201305866241455,
'eval_runtime': 65.0033,
'eval_samples_per_second': 178.191,
'eval_steps_per_second': 5.569,
'epoch': 3.0}
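Since this is a causal language model, an intuitive way to read that eval_loss is as perplexity, its exponential:

```python
import math

eval_loss = 3.201305866241455  # the value reported above
print(f"Perplexity: {math.exp(eval_loss):.2f}")  # ~24.56
```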

Publish the model

We create the model card

	
trainer.create_model_card()
Copy

We publish it

	
trainer.push_to_hub()
Copy
	
CommitInfo(commit_url='https://huggingface.co/Maximofn/GPT2-small-finetuned-Maximofn-short-jokes-dataset-casualLM/commit/d107b3bb0e02076483238f9975697761015ec390', commit_message='End of training', commit_description='', oid='d107b3bb0e02076483238f9975697761015ec390', pr_url=None, pr_revision=None, pr_num=None)

Usage of the model

We free up memory as much as possible

	
import torch
import gc
def clear_hardwares():
    torch.clear_autocast_cache()
    torch.cuda.ipc_collect()
    torch.cuda.empty_cache()
    gc.collect()

clear_hardwares()
clear_hardwares()
Copy

We download the model and the tokenizer

	
from transformers import AutoTokenizer, AutoModelForCausalLM
user = "maximofn"
checkpoints = f"{user}/{model_name}"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
model = AutoModelForCausalLM.from_pretrained(checkpoints)
model.config.pad_token_id = model.config.eos_token_id
Copy
	
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

We check that the tokenizer and the model have the 2 extra tokens we added

	
tokenizer_vocab = tokenizer.get_vocab()
model_vocab = model.config.vocab_size
print(f"tokenizer_vocab: {len(tokenizer_vocab)}. model_vocab: {model_vocab}")
Copy
	
tokenizer_vocab: 50259. model_vocab: 50259

We see that they have 50259 tokens, that is, the 50257 tokens of GPT2 plus the 2 that we have added.

We create a function to generate jokes

	
def generate_joke(prompt_text):
    text = f"<SJ> {prompt_text}"
    tokens = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**tokens, max_new_tokens=256, eos_token_id=tokenizer.encode("<EJ>")[-1])
    return tokenizer.decode(output[0], skip_special_tokens=False)
Copy

We generate a joke

	
generate_joke("Why didn't the frog cross the road?")
Copy
	
Setting `pad_token_id` to `eos_token_id`:50258 for open-end generation.
	
"<SJ> Why didn't the frog cross the road? Because he was frog-in-the-face. <EJ>"

If you want to try the model further, you can check it out at Maximofn/GPT2-small-finetuned-Maximofn-short-jokes-dataset-casualLM

Fine tuning for text classification with PyTorch

We repeat the training, this time with PyTorch.

We restart the notebook to make sure we start from a clean state.

Dataset

We download the same dataset that we used when training with the Hugging Face libraries

	
from datasets import load_dataset
dataset = load_dataset("mteb/amazon_reviews_multi", "en")
Copy

We create a variable with the number of classes

	
num_classes = len(dataset['train'].unique('label'))
num_classes
Copy
	
5

Earlier we processed the whole dataset to create a field called labels, but now that is not necessary: since we are going to program everything ourselves, we adapt to the dataset as it is.

Tokenizer

We create the tokenizer and assign the padding token so that it doesn't give us an error like before.

	
from transformers import AutoTokenizer
checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token
Copy

We create a function to tokenize the dataset

	
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=768, return_tensors="pt")
Copy

We tokenize it. We remove columns that we don't need, but now we keep the text column.

	
dataset = dataset.map(tokenize_function, batched=True, remove_columns=['id', 'label_text'])
Copy
	
dataset
Copy
	
DatasetDict({
train: Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask'],
num_rows: 200000
})
validation: Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask'],
num_rows: 5000
})
test: Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask'],
num_rows: 5000
})
})
	
percentage = 1
subset_train = dataset['train'].select(range(int(len(dataset['train']) * percentage)))
percentage = 1
subset_validation = dataset['validation'].select(range(int(len(dataset['validation']) * percentage)))
subset_test = dataset['test'].select(range(int(len(dataset['test']) * percentage)))
print(f"len subset_train: {len(subset_train)}, len subset_validation: {len(subset_validation)}, len subset_test: {len(subset_test)}")
Copy
	
len subset_train: 200000, len subset_validation: 5000, len subset_test: 5000

Model

We import the weights and assign the padding token

	
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=num_classes)
model.config.pad_token_id = model.config.eos_token_id
Copy
	
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at openai-community/gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Device

We create the device where everything will be executed

	
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
Copy

We move the model to the device and, while we're at it, convert it to FP16 to use less memory. (Training a model entirely in FP16 can be numerically unstable; we will see the consequences of this later.)

	
model.half().to(device)
print()
Copy
	

PyTorch Dataset

We create a PyTorch dataset

	
from torch.utils.data import Dataset
class ReviewsDataset(Dataset):
    def __init__(self, huggingface_dataset):
        self.dataset = huggingface_dataset

    def __getitem__(self, idx):
        label = self.dataset[idx]['label']
        input_ids = torch.tensor(self.dataset[idx]['input_ids'])
        attention_mask = torch.tensor(self.dataset[idx]['attention_mask'])
        return input_ids, attention_mask, label

    def __len__(self):
        return len(self.dataset)
Copy

We instantiate the datasets

	
train_dataset = ReviewsDataset(subset_train)
validation_dataset = ReviewsDataset(subset_validation)
test_dataset = ReviewsDataset(subset_test)
Copy

Let's see a sample

	
input_ids, at_mask, label = train_dataset[0]
input_ids.shape, at_mask.shape, label
Copy
	
(torch.Size([768]), torch.Size([768]), 0)

PyTorch DataLoader

We now create the PyTorch DataLoaders

	
from torch.utils.data import DataLoader
BS = 12
train_loader = DataLoader(train_dataset, batch_size=BS, shuffle=True)
validation_loader = DataLoader(validation_dataset, batch_size=BS)
test_loader = DataLoader(test_dataset, batch_size=BS)
Copy

Let's see a sample

	
input_ids, at_mask, labels = next(iter(train_loader))
input_ids.shape, at_mask.shape, labels
Copy
	
(torch.Size([12, 768]),
torch.Size([12, 768]),
tensor([2, 1, 2, 0, 3, 3, 0, 4, 3, 3, 4, 2]))

To make sure everything is fine, we pass a sample through the model to check that it works. First, we move the tensors to the device.

	
input_ids = input_ids.to(device)
at_mask = at_mask.to(device)
labels = labels.to(device)
Copy

Now we pass it to the model

	
output = model(input_ids=input_ids, attention_mask=at_mask, labels=labels)
output.keys()
Copy
	
odict_keys(['loss', 'logits', 'past_key_values'])

As we can see, it gives us the loss and the logits

	
output['loss']
Copy
	
tensor(5.9414, device='cuda:0', dtype=torch.float16,
grad_fn=<NllLossBackward0>)
	
output['logits']
Copy
	
tensor([[ 6.1953e+00, -1.2275e+00, -2.4824e+00, 5.8867e+00, -1.4734e+01],
[ 5.4062e+00, -8.4570e-01, -2.3203e+00, 5.1055e+00, -1.1555e+01],
[ 6.1641e+00, -9.3066e-01, -2.5664e+00, 6.0039e+00, -1.4570e+01],
[ 5.2266e+00, -4.2358e-01, -2.0801e+00, 4.7461e+00, -1.1570e+01],
[ 3.8184e+00, -2.3460e-03, -1.7666e+00, 3.4160e+00, -7.7969e+00],
[ 4.1641e+00, -4.8169e-01, -1.6914e+00, 3.9941e+00, -8.7734e+00],
[ 4.6758e+00, -3.0298e-01, -2.1641e+00, 4.1055e+00, -9.3359e+00],
[ 4.1953e+00, -3.2471e-01, -2.1875e+00, 3.9375e+00, -8.3438e+00],
[-1.1650e+00, 1.3564e+00, -6.2158e-01, -6.8115e-01, 4.8672e+00],
[ 4.4961e+00, -8.7891e-02, -2.2793e+00, 4.2812e+00, -9.3359e+00],
[ 4.9336e+00, -2.6627e-03, -2.1543e+00, 4.3711e+00, -1.0742e+01],
[ 5.9727e+00, -4.3152e-02, -1.4551e+00, 4.3438e+00, -1.2117e+01]],
device='cuda:0', dtype=torch.float16, grad_fn=<IndexBackward0>)

Metric

Let's create a function to get the metric, which in this case will be the accuracy

	
def predicted_labels(logits):
    percent = torch.softmax(logits, dim=1)
    predictions = torch.argmax(percent, dim=1)
    return predictions
Copy
	
def compute_accuracy(logits, labels):
    predictions = predicted_labels(logits)
    correct = (predictions == labels).float()
    return correct.mean()
Copy

Let's see if it calculates it correctly

	
compute_accuracy(output['logits'], labels).item()
Copy
	
0.1666666716337204

Optimizer

Since we are going to need an optimizer, we create one

	
from transformers import AdamW
LR = 2e-5
optimizer = AdamW(model.parameters(), lr=LR)
Copy
	
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:588: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
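As the warning itself suggests, you can avoid it by using the PyTorch implementation, which has the same interface for our purposes:

```python
# Non-deprecated optimizer suggested by the warning
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=LR)
```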

Training

We create the training loop

	
from tqdm import tqdm

EPOCHS = 3

for epoch in range(EPOCHS):
    model.train()
    train_loss = 0
    progresbar = tqdm(train_loader, total=len(train_loader), desc=f'Epoch {epoch + 1}')
    for input_ids, at_mask, labels in progresbar:
        input_ids = input_ids.to(device)
        at_mask = at_mask.to(device)
        labels = labels.to(device)
        output = model(input_ids=input_ids, attention_mask=at_mask, labels=labels)
        loss = output['loss']
        train_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        progresbar.set_postfix({'train_loss': loss.item()})
    train_loss /= len(train_loader)
    progresbar.set_postfix({'train_loss': train_loss})

    model.eval()
    valid_loss = 0
    accuracy = 0  # reset the accumulated accuracy at each epoch
    progresbar = tqdm(validation_loader, total=len(validation_loader), desc=f'Epoch {epoch + 1}')
    for input_ids, at_mask, labels in progresbar:
        input_ids = input_ids.to(device)
        at_mask = at_mask.to(device)
        labels = labels.to(device)
        output = model(input_ids=input_ids, attention_mask=at_mask, labels=labels)
        loss = output['loss']
        valid_loss += loss.item()
        step_accuracy = compute_accuracy(output['logits'], labels)
        accuracy += step_accuracy
        progresbar.set_postfix({'valid_loss': loss.item(), 'accuracy': step_accuracy.item()})
    valid_loss /= len(validation_loader)
    accuracy /= len(validation_loader)
    progresbar.set_postfix({'valid_loss': valid_loss, 'accuracy': accuracy})
Copy
	
Epoch 1: 100%|██████████| 16667/16667 [44:13<00:00, 6.28it/s, train_loss=nan]
Epoch 1: 100%|██████████| 417/417 [00:32<00:00, 12.72it/s, valid_loss=nan, accuracy=0]
Epoch 2: 100%|██████████| 16667/16667 [44:06<00:00, 6.30it/s, train_loss=nan]
Epoch 2: 100%|██████████| 417/417 [00:32<00:00, 12.77it/s, valid_loss=nan, accuracy=0]
Epoch 3: 100%|██████████| 16667/16667 [44:03<00:00, 6.30it/s, train_loss=nan]
Epoch 3: 100%|██████████| 417/417 [00:32<00:00, 12.86it/s, valid_loss=nan, accuracy=0]
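Notice that the training and validation losses come out as nan and the accuracy collapses to 0. Training a model that was fully converted to FP16 with model.half() is numerically fragile: gradients easily overflow or underflow, which is most likely what happened here. A common fix is to keep the model in FP32 (skip the .half() call) and use automatic mixed precision with a gradient scaler; a minimal sketch of the inner training step under that assumption:

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 underflow

for input_ids, at_mask, labels in train_loader:
    input_ids, at_mask, labels = input_ids.to(device), at_mask.to(device), labels.to(device)
    optimizer.zero_grad()
    # the forward pass runs in FP16 where it is safe, FP32 where it is not
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output = model(input_ids=input_ids, attention_mask=at_mask, labels=labels)
        loss = output["loss"]
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales gradients, skips the step if they overflowed
    scaler.update()
```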

Usage of the model

Let's test the model we have trained

First we tokenize a text

	
input_tokens = tokenize_function({"text": "I love this product. It is amazing."})
input_tokens['input_ids'].shape, input_tokens['attention_mask'].shape
Copy
	
(torch.Size([1, 768]), torch.Size([1, 768]))

Now we pass it to the model

	
output = model(input_ids=input_tokens['input_ids'].to(device), attention_mask=input_tokens['attention_mask'].to(device))
output['logits']
Copy
	
tensor([[nan, nan, nan, nan, nan]], device='cuda:0', dtype=torch.float16,
grad_fn=<IndexBackward0>)

We see the predictions from those logits. (Since the logits are nan, the argmax is meaningless: the training diverged, as the nan losses already showed.)

	
predicted = predicted_labels(output['logits'])
predicted
Copy
	
tensor([0], device='cuda:0')

Fine tuning for text generation with PyTorch

We repeat the training, this time with PyTorch.

We restart the notebook to make sure we start from a clean state.

Dataset

We download the jokes dataset again

	
from datasets import load_dataset
jokes = load_dataset("Maximofn/short-jokes-dataset")
jokes
Copy
	
DatasetDict({
train: Dataset({
features: ['ID', 'Joke'],
num_rows: 231657
})
})

We create a subset in case there is limited memory

	
percent_of_train_dataset = 1 # If you want 50% of the dataset, set this to 0.5
subset_dataset = jokes["train"].select(range(int(len(jokes["train"]) * percent_of_train_dataset)))
subset_dataset
Copy
	
Dataset({
features: ['ID', 'Joke'],
num_rows: 231657
})

We divide the dataset into training, validation, and test subsets.

	
percent_of_train_dataset = 0.90
split_dataset = subset_dataset.train_test_split(train_size=int(subset_dataset.num_rows * percent_of_train_dataset), seed=19, shuffle=False)
train_dataset = split_dataset["train"]
validation_test_dataset = split_dataset["test"]
split_dataset = validation_test_dataset.train_test_split(train_size=int(validation_test_dataset.num_rows * 0.5), seed=19, shuffle=False)
validation_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]
print(f"Size of the train set: {len(train_dataset)}. Size of the validation set: {len(validation_dataset)}. Size of the test set: {len(test_dataset)}")
Copy
	
Size of the train set: 208491. Size of the validation set: 11583. Size of the test set: 11583

Tokenizer

We initialize the tokenizer and assign the padding token to end of string

	
from transformers import AutoTokenizer
checkpoints = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoints)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
Copy

We add the special start of joke and end of joke tokens

	
new_tokens = ['<SJ>', '<EJ>'] # Start and end of joke tokens
num_added_tokens = tokenizer.add_tokens(new_tokens)
print(f"Added {num_added_tokens} tokens")
Copy
	
Added 2 tokens

We add them to the dataset

	
joke_column = "Joke"
def format_joke(example):
    example[joke_column] = '<SJ> ' + example['Joke'] + ' <EJ>'
    return example
remove_columns = [column for column in train_dataset.column_names if column != joke_column]
train_dataset = train_dataset.map(format_joke, remove_columns=remove_columns)
validation_dataset = validation_dataset.map(format_joke, remove_columns=remove_columns)
test_dataset = test_dataset.map(format_joke, remove_columns=remove_columns)
train_dataset, validation_dataset, test_dataset
Copy
	
(Dataset({
features: ['Joke'],
num_rows: 208491
}),
Dataset({
features: ['Joke'],
num_rows: 11583
}),
Dataset({
features: ['Joke'],
num_rows: 11583
}))

We tokenize the dataset

	
def tokenize_function(examples):
    return tokenizer(examples[joke_column], padding="max_length", truncation=True, max_length=768, return_tensors="pt")
train_dataset = train_dataset.map(tokenize_function, batched=True, remove_columns=[joke_column])
validation_dataset = validation_dataset.map(tokenize_function, batched=True, remove_columns=[joke_column])
test_dataset = test_dataset.map(tokenize_function, batched=True, remove_columns=[joke_column])
train_dataset, validation_dataset, test_dataset
Copy
	
(Dataset({
features: ['input_ids', 'attention_mask'],
num_rows: 208491
}),
Dataset({
features: ['input_ids', 'attention_mask'],
num_rows: 11583
}),
Dataset({
features: ['input_ids', 'attention_mask'],
num_rows: 11583
}))

Model

We instantiate the model, assign the padding token, and add the new joke start and end tokens.

	
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(checkpoints)
model.config.pad_token_id = model.config.eos_token_id
model.resize_token_embeddings(len(tokenizer))
Copy
	
Embedding(50259, 768)

Device

We create the device and move the model to it, again converting it to FP16.

	
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.half().to(device)
print()
Copy
	

PyTorch Dataset

We create a PyTorch dataset

	
from torch.utils.data import Dataset
class JokesDataset(Dataset):
    def __init__(self, huggingface_dataset):
        self.dataset = huggingface_dataset

    def __getitem__(self, idx):
        input_ids = torch.tensor(self.dataset[idx]['input_ids'])
        attention_mask = torch.tensor(self.dataset[idx]['attention_mask'])
        return input_ids, attention_mask

    def __len__(self):
        return len(self.dataset)
Copy

We instantiate the training, validation, and test datasets.

	
train_pytorch_dataset = JokesDataset(train_dataset)
validation_pytorch_dataset = JokesDataset(validation_dataset)
test_pytorch_dataset = JokesDataset(test_dataset)
Copy

Let's see a sample

	
input_ids, attention_mask = train_pytorch_dataset[0]
input_ids.shape, attention_mask.shape
Copy
	
(torch.Size([768]), torch.Size([768]))

PyTorch DataLoader

We create the dataloaders

	
from torch.utils.data import DataLoader
BS = 28
train_loader = DataLoader(train_pytorch_dataset, batch_size=BS, shuffle=True)
validation_loader = DataLoader(validation_pytorch_dataset, batch_size=BS)
test_loader = DataLoader(test_pytorch_dataset, batch_size=BS)
Copy

We see a sample

	
input_ids, attention_mask = next(iter(train_loader))
input_ids.shape, attention_mask.shape
Copy
	
(torch.Size([28, 768]), torch.Size([28, 768]))

We pass it to the model

	
output = model(input_ids.to(device), attention_mask=attention_mask.to(device))
output.keys()
Copy
	
odict_keys(['logits', 'past_key_values'])

As we can see, we don't have a loss value. As we saw before, we need to pass it labels, which for causal language modeling are the input_ids themselves.

	
output = model(input_ids.to(device), attention_mask=attention_mask.to(device), labels=input_ids.to(device))
output.keys()
Copy
	
odict_keys(['loss', 'logits', 'past_key_values'])

Now we have loss

	
output['loss'].item()
Copy
	
80.5625

Optimizer

We create an optimizer

	
from transformers import AdamW

LR = 2e-5
optimizer = AdamW(model.parameters(), lr=LR)
Copy
	
/usr/local/lib/python3.10/dist-packages/transformers/optimization.py:588: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(

Training

We create the training loop

	
from tqdm import tqdm

EPOCHS = 3

for epoch in range(EPOCHS):
    model.train()
    train_loss = 0
    progresbar = tqdm(train_loader, total=len(train_loader), desc=f'Epoch {epoch + 1}')
    for input_ids, at_mask in progresbar:
        input_ids = input_ids.to(device)
        at_mask = at_mask.to(device)
        output = model(input_ids=input_ids, attention_mask=at_mask, labels=input_ids)
        loss = output['loss']
        train_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        progresbar.set_postfix({'train_loss': loss.item()})
    train_loss /= len(train_loader)
    progresbar.set_postfix({'train_loss': train_loss})
Copy
	
Epoch 1: 100%|██████████| 7447/7447 [51:07<00:00, 2.43it/s, train_loss=nan]
Epoch 2: 100%|██████████| 7447/7447 [51:06<00:00, 2.43it/s, train_loss=nan]
Epoch 3: 100%|██████████| 7447/7447 [51:07<00:00, 2.43it/s, train_loss=nan]

Usage of the model

We test the model. Note that, as in the classification case, the training loss ended up as nan, so we shouldn't expect much from the generations.

	
def generate_text(decoded_joke, max_new_tokens=100, stop_token='<EJ>', top_k=0, temperature=1.0):
    input_tokens = tokenize_function({'Joke': decoded_joke})
    output = model(input_tokens['input_ids'].to(device), attention_mask=input_tokens['attention_mask'].to(device))
    for _ in range(max_new_tokens):
        # greedy decoding: take the most probable next token
        nex_token = torch.argmax(output['logits'][:, -1, :], dim=-1).item()
        nex_token_decoded = tokenizer.decode(nex_token)
        if nex_token_decoded == stop_token:
            break
        decoded_joke = decoded_joke + nex_token_decoded
        # re-tokenize the extended text and run the model again
        input_tokens = tokenize_function({'Joke': decoded_joke})
        output = model(input_tokens['input_ids'].to(device), attention_mask=input_tokens['attention_mask'].to(device))
    return decoded_joke
Copy
	
generated_text = generate_text("<SJ> Why didn't the frog cross the road")
generated_text
Copy
	
"<SJ> Why didn't the frog cross the road!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
