Disclaimer: This post has been translated to English using a machine translation model. Please let me know if you find any mistakes.
Optimum is an extension of the Transformers library that provides a set of performance optimization tools for training and running inference with models on specific hardware, with maximum efficiency.
The AI ecosystem is evolving rapidly, and every day more specialized hardware emerges along with its own optimizations. Optimum therefore allows users to use any of this hardware efficiently, with the same ease as Transformers.
Optimum allows optimization for the following hardware platforms:
- Nvidia
- AMD
- Intel
- AWS
- TPU
- Habana
- FuriosaAI
In addition, it offers acceleration for the following open-source integrations:
- ONNX Runtime
- Exporters: Export PyTorch or TensorFlow models to different formats such as ONNX or TFLite (see the sketch after this list)
- BetterTransformer
- Torch FX
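As a minimal sketch of the ONNX Runtime and Exporters integrations (the checkpoint, task and output directory below are only illustrative), a Transformers model can be exported to ONNX and run through Optimum like this:

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

# Illustrative checkpoint; any sequence classification model works the same way
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# export=True converts the PyTorch weights to ONNX on the fly
ort_model = ORTModelForSequenceClassification.from_pretrained(checkpoint, export=True)

# The exported model can be used like any other Transformers model
classifier = pipeline("text-classification", model=ort_model, tokenizer=tokenizer)
print(classifier("I love learning about performance optimization"))

# And it can be saved to disk to reuse it later with ONNX Runtime
ort_model.save_pretrained("onnx_model")
```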
Installation
To install Optimum, simply run:

```bash
pip install optimum
```
But if you want to install it with support for all hardware platforms, you can do it like this:
| Accelerator | Installation |
|---|---|
| ONNX Runtime | `pip install --upgrade --upgrade-strategy eager optimum[onnxruntime]` |
| Intel Neural Compressor | `pip install --upgrade --upgrade-strategy eager optimum[neural-compressor]` |
| OpenVINO | `pip install --upgrade --upgrade-strategy eager optimum[openvino]` |
| AMD Instinct GPUs and Ryzen AI NPU | `pip install --upgrade --upgrade-strategy eager optimum[amd]` |
| AWS Trainium & Inferentia | `pip install --upgrade --upgrade-strategy eager optimum[neuronx]` |
| Habana Gaudi Processor (HPU) | `pip install --upgrade --upgrade-strategy eager optimum[habana]` |
| FuriosaAI | `pip install --upgrade --upgrade-strategy eager optimum[furiosa]` |
The `--upgrade --upgrade-strategy eager` flags are needed to ensure that the different packages are upgraded to the latest possible version.
Since most people use PyTorch on Nvidia GPUs, and especially since I have an Nvidia GPU, this post will only cover the use of Optimum with Nvidia GPUs and PyTorch.
BetterTransformer
BetterTransformer is a native PyTorch optimization that achieves a 1.25x to 4x speedup in the inference of Transformer-based models.
BetterTransformer is an API that allows leveraging modern hardware features to accelerate the training and inference of transformer models in PyTorch, using more efficient attention implementations and the fast path of the native `nn.TransformerEncoderLayer`.
BetterTransformer uses two types of accelerations:

- `Flash Attention`: an implementation of attention that uses `sparse` techniques to reduce computational complexity. Attention is one of the most expensive operations in transformer models, and `Flash Attention` makes it more efficient.
- `Memory-Efficient Attention`: another implementation of attention that uses the `scaled_dot_product_attention` function from PyTorch. This function is more memory-efficient than PyTorch's standard attention implementation.
In addition, PyTorch 2.0 includes a native scaled dot product attention (SDPA) operator as part of `torch.nn.functional`.
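As a minimal sketch of this operator (the tensor shapes are arbitrary and only for illustration), `scaled_dot_product_attention` can be called directly:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch of 2, 8 heads, 16 tokens, head dimension 64
query = torch.rand(2, 8, 16, 64, device="cuda", dtype=torch.float16)
key = torch.rand(2, 8, 16, 64, device="cuda", dtype=torch.float16)
value = torch.rand(2, 8, 16, 64, device="cuda", dtype=torch.float16)

# PyTorch selects the most efficient available backend (Flash Attention,
# memory-efficient attention or the math fallback) for this call
output = F.scaled_dot_product_attention(query, key, value)
print(output.shape)  # torch.Size([2, 8, 16, 64])
```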
Optimum provides this functionality together with the Transformers library.
Inference with AutoModel
First, let's see how normal inference works with Transformers and `AutoModel`:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

tokenizer.pad_token = tokenizer.eos_token

input_tokens = tokenizer(["Me encanta aprender de"], return_tensors="pt", padding=True).to("cuda")
output_tokens = model.generate(**input_tokens, max_length=50)

sentence_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
sentence_output
```
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Me encanta aprender de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de'
Now let's see how it would be optimized with BetterTransformer and Optimum.
What we need to do is convert the model using the `transform` method of `BetterTransformer`:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model_hf = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# Convert the model to a BetterTransformer model
model = BetterTransformer.transform(model_hf, keep_original_model=True)

tokenizer.pad_token = tokenizer.eos_token

input_tokens = tokenizer(["Me encanta aprender de"], return_tensors="pt", padding=True).to("cuda")
output_tokens = model.generate(**input_tokens, max_length=50)

sentence_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
sentence_output
```
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Me encanta aprender de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de la vie de'
Inference with Pipeline
Just like before, we first see how normal inference would be done with Transformers and `pipeline`:
```python
from transformers import pipeline

pipe = pipeline(task="fill-mask", model="distilbert-base-uncased")
pipe("I am a student at [MASK] University.")
```
[{'score': 0.05116177722811699,'token': 8422,'token_str': 'stanford','sequence': 'i am a student at stanford university.'},{'score': 0.04033993184566498,'token': 5765,'token_str': 'harvard','sequence': 'i am a student at harvard university.'},{'score': 0.03990468755364418,'token': 7996,'token_str': 'yale','sequence': 'i am a student at yale university.'},{'score': 0.0361952930688858,'token': 10921,'token_str': 'cornell','sequence': 'i am a student at cornell university.'},{'score': 0.03303057327866554,'token': 9173,'token_str': 'princeton','sequence': 'i am a student at princeton university.'}]
Now let's see how to optimize it. For this we use `pipeline` from Optimum instead of the one from Transformers. Additionally, we need to specify that we want to use `bettertransformer` as the accelerator.
```python
from optimum.pipelines import pipeline

# Use the BetterTransformer pipeline
pipe = pipeline(task="fill-mask", model="distilbert-base-uncased", accelerator="bettertransformer")
pipe("I am a student at [MASK] University.")
```
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
/home/wallabot/miniconda3/envs/nlp/lib/python3.11/site-packages/optimum/bettertransformer/models/encoder_models.py:868: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at /opt/conda/conda-bld/pytorch_1708025845868/work/aten/src/ATen/NestedTensorImpl.cpp:177.)
hidden_states = torch._nested_tensor_from_mask(hidden_states, attn_mask)
[{'score': 0.05116180703043938,'token': 8422,'token_str': 'stanford','sequence': 'i am a student at stanford university.'},{'score': 0.040340032428503036,'token': 5765,'token_str': 'harvard','sequence': 'i am a student at harvard university.'},{'score': 0.039904672652482986,'token': 7996,'token_str': 'yale','sequence': 'i am a student at yale university.'},{'score': 0.036195311695337296,'token': 10921,'token_str': 'cornell','sequence': 'i am a student at cornell university.'},{'score': 0.03303062543272972,'token': 9173,'token_str': 'princeton','sequence': 'i am a student at princeton university.'}]
Training
For training with Optimum we do the same as with `AutoModel` inference: we convert the model using the `transform` method of `BetterTransformer`.
When we finish training, we revert the model back to its original form using the `reverse` method of `BetterTransformer`, so that we can save it and upload it to the Hugging Face Hub.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

checkpoint = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model_hf = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

# Convert the model to a BetterTransformer model
model = BetterTransformer.transform(model_hf, keep_original_model=True)

##############################################################################
# do your training here
##############################################################################

# Convert the model back to a Hugging Face model
model_hf = BetterTransformer.reverse(model)

model_hf.save_pretrained("fine_tuned_model")
model_hf.push_to_hub("fine_tuned_model")
```