BLIP-2: Multimodal Vision-Language Model

Introduction

BLIP-2 is a vision-language model capable of taking an image or a video as input and holding a conversation about it, answering questions or providing context about what the input shows with remarkable accuracy 🤯

GitHub

Paper

Installation

To install this tool, it's best to create a new Anaconda environment:

```bash
conda create -n blip2 python=3.9
```

Now we activate the environment:

```bash
conda activate blip2
```

We install all the necessary packages:

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install -c anaconda pillow
conda install -y -c anaconda requests
conda install -y -c anaconda jupyter
```

Finally, we install BLIP-2:

```bash
pip install salesforce-lavis
```
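
Before moving on, you can run a quick sanity check (my addition, assuming the installs above succeeded): the imports should work and, with a compatible GPU and drivers, CUDA should be available.

```python
import torch
import lavis  # installed by the salesforce-lavis package

print(torch.__version__)
print(torch.cuda.is_available())  # True if the GPU setup is correct
```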

Usage

We load the necessary libraries:

```python
import torch
from PIL import Image
import requests
from lavis.models import load_model_and_preprocess
```

We load an example image:

```python
img_url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/12_-_The_Mystical_King_Cobra_and_Coffee_Forests.jpg/800px-12_-_The_Mystical_King_Cobra_and_Coffee_Forests.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
display(raw_image.resize((500, 500)))  # display() works in Jupyter; use raw_image.show() outside a notebook
```

Output:

```
<PIL.Image.Image image mode=RGB size=500x500>
```

We select the GPU if one is available:

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device
```

Output:

```
device(type='cuda')
```

We choose a model. On my machine, with 32 GB of RAM and a 3060 GPU with 12 GB of VRAM, I couldn't run all of them, so next to each model I've added an ok comment if I was able to use it, or the error I got if I wasn't. If your machine has the same amount of RAM and VRAM, you'll know which ones you can use; if not, you'll have to try them yourself.

```python
# name = "blip2_opt"; model_type = "pretrain_opt2.7b"       # ok
# name = "blip2_opt"; model_type = "caption_coco_opt2.7b"   # FAIL VRAM
# name = "blip2_opt"; model_type = "pretrain_opt6.7b"       # FAIL RAM
# name = "blip2_opt"; model_type = "caption_coco_opt6.7b"   # FAIL RAM
# name = "blip2"; model_type = "pretrain"                   # FAIL type error
# name = "blip2"; model_type = "coco"                       # ok
name = "blip2_t5"; model_type = "pretrain_flant5xl"         # ok
# name = "blip2_t5"; model_type = "caption_coco_flant5xl"   # FAIL VRAM
# name = "blip2_t5"; model_type = "pretrain_flant5xxl"      # FAIL

model, vis_processors, _ = load_model_and_preprocess(
    name=name, model_type=model_type, is_eval=True, device=device
)
vis_processors.keys()
```

Output:

```
Loading checkpoint shards: 0%|          | 0/2 [00:00<?, ?it/s]
dict_keys(['train', 'eval'])
```
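
If none of these fit in your memory, LAVIS can list every architecture and model type it ships; printing model_zoo (as shown in the LAVIS README) gives the full table:

```python
from lavis.models import model_zoo

# Prints a table with every architecture (blip2_opt, blip2_t5, ...)
# and the model types available for each
print(model_zoo)
```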

We prepare the image to feed it into the model:

```python
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
```
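
As a quick sanity check (my addition, not part of the original post), you can inspect the preprocessed tensor; the eval processor returns a normalized image tensor, and unsqueeze(0) adds the batch dimension:

```python
# Expect a batched image tensor, e.g. torch.Size([1, 3, 224, 224]) for
# the pretrained checkpoints (the exact size depends on the model type)
print(image.shape)
```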

We analyze the image without asking anything

```python
model.generate({"image": image})
```

Output:

```
['a black and white snake']
```
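
generate also accepts decoding options. For example, LAVIS's BLIP-2 models expose parameters such as use_nucleus_sampling and num_captions, so you can sample several candidate captions instead of taking the single beam-search result (parameter names taken from LAVIS's generate signature; check your installed version if they differ):

```python
# Sample 3 candidate captions with nucleus sampling instead of beam search
model.generate(
    {"image": image},
    use_nucleus_sampling=True,
    num_captions=3,
)
```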

We analyze the image by asking questions

To hold a conversation about the image, we build up a prompt that accumulates every question and answer, so each new question is asked with the full context of the exchange so far. We start with an empty prompt:

```python
prompt = None
```

```python
def prepare_prompt(prompt, question):
    # Append the new question to the running prompt so the model
    # keeps the context of the previous questions and answers
    if prompt is None:
        prompt = question + " Answer:"
    else:
        prompt = prompt + " " + question + " Answer:"
    return prompt
```

```python
def get_answer(prompt, question, model):
    prompt = prepare_prompt(prompt, question)
    answer = model.generate(
        {
            "image": image,
            "prompt": prompt
        }
    )
    answer = answer[0]
    # Append the answer to the prompt so the next question is asked
    # with the whole conversation as context
    prompt = prompt + " " + answer + "."
    return prompt, answer
```

```python
question = "What's in the picture?"
prompt, answer = get_answer(prompt, question, model)
print(f"Question: {question}")
print(f"Answer: {answer}")
```

Output:

```
Question: What's in the picture?
Answer: a snake
```

```python
question = "What kind of snake?"
prompt, answer = get_answer(prompt, question, model)
print(f"Question: {question}")
print(f"Answer: {answer}")
```

Output:

```
Question: What kind of snake?
Answer: cobra
```

```python
question = "Is it poisonous?"
prompt, answer = get_answer(prompt, question, model)
print(f"Question: {question}")
print(f"Answer: {answer}")
```

Output:

```
Question: Is it poisonous?
Answer: yes
```

```python
question = "If it bites me, can I die?"
prompt, answer = get_answer(prompt, question, model)
print(f"Question: {question}")
print(f"Answer: {answer}")
```

Output:

```
Question: If it bites me, can I die?
Answer: yes
```
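
The same pattern can be wrapped in a loop; this is a small sketch of mine (not from the original post) that runs a whole list of questions through the chained prompt:

```python
# Ask a list of questions in a row, reusing the prompt-chaining
# pattern above so each answer feeds into the next question
questions = [
    "What's in the picture?",
    "What kind of snake?",
    "Is it poisonous?",
]

prompt = None
for question in questions:
    prompt, answer = get_answer(prompt, question, model)
    print(f"Question: {question}")
    print(f"Answer: {answer}")
```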
