BLIP-2: Multimodal Vision-Language Model

Introduction

BLIP-2 is a vision-language model capable of taking an image or a video as input and holding a conversation about it, answering questions or providing context about what the input shows with remarkable accuracy 🤯

GitHub

Paper

Installation

To install this tool, it's best to create a new Anaconda environment:

```bash
conda create -n blip2 python=3.9
```

Now we activate the environment:

```bash
conda activate blip2
```

We install all the necessary packages:

```bash
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install -c anaconda pillow
conda install -y -c anaconda requests
conda install -y -c anaconda jupyter
```

Finally, we install BLIP-2:

```bash
pip install salesforce-lavis
```
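
Before moving on, you can run a quick sanity check (my addition, assuming the installs above succeeded): the imports should work and, with a compatible GPU and drivers, CUDA should be available.

```python
import torch
import lavis  # installed by the salesforce-lavis package

print(torch.__version__)
print(torch.cuda.is_available())  # True if the GPU setup is correct
```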

Usage

We load the necessary libraries:

```python
import torch
from PIL import Image
import requests
from lavis.models import load_model_and_preprocess
```

We load an example image:

```python
img_url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/12_-_The_Mystical_King_Cobra_and_Coffee_Forests.jpg/800px-12_-_The_Mystical_King_Cobra_and_Coffee_Forests.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
display(raw_image.resize((500, 500)))  # display() works in Jupyter; use raw_image.show() outside a notebook
```

Output:

```
<PIL.Image.Image image mode=RGB size=500x500>
```

We select the GPU if one is available:

```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device
```

Output:

```
device(type='cuda')
```

We choose a model. On my machine, with 32 GB of RAM and a 3060 GPU with 12 GB of VRAM, I couldn't run all of them, so next to each model I've added an ok comment if I was able to use it, or the error I got if I wasn't. If your machine has the same amount of RAM and VRAM, you'll know which ones you can use; if not, you'll have to try them yourself.

```python
# name = "blip2_opt"; model_type = "pretrain_opt2.7b"       # ok
# name = "blip2_opt"; model_type = "caption_coco_opt2.7b"   # FAIL VRAM
# name = "blip2_opt"; model_type = "pretrain_opt6.7b"       # FAIL RAM
# name = "blip2_opt"; model_type = "caption_coco_opt6.7b"   # FAIL RAM
# name = "blip2"; model_type = "pretrain"                   # FAIL type error
# name = "blip2"; model_type = "coco"                       # ok
name = "blip2_t5"; model_type = "pretrain_flant5xl"         # ok
# name = "blip2_t5"; model_type = "caption_coco_flant5xl"   # FAIL VRAM
# name = "blip2_t5"; model_type = "pretrain_flant5xxl"      # FAIL

model, vis_processors, _ = load_model_and_preprocess(
    name=name, model_type=model_type, is_eval=True, device=device
)
vis_processors.keys()
```

Output:

```
Loading checkpoint shards: 0%|          | 0/2 [00:00<?, ?it/s]
dict_keys(['train', 'eval'])
```
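
If none of these fit in your memory, LAVIS can list every architecture and model type it ships; printing model_zoo (as shown in the LAVIS README) gives the full table:

```python
from lavis.models import model_zoo

# Prints a table with every architecture (blip2_opt, blip2_t5, ...)
# and the model types available for each
print(model_zoo)
```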

We prepare the image to feed it into the model:

```python
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
```
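
As a quick sanity check (my addition, not part of the original post), you can inspect the preprocessed tensor; the eval processor returns a normalized image tensor, and unsqueeze(0) adds the batch dimension:

```python
# Expect a batched image tensor, e.g. torch.Size([1, 3, 224, 224]) for
# the pretrained checkpoints (the exact size depends on the model type)
print(image.shape)
```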

We analyze the image without asking anything

```python
model.generate({"image": image})
```

Output:

```
['a black and white snake']
```
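
generate also accepts decoding options. For example, LAVIS's BLIP-2 models expose parameters such as use_nucleus_sampling and num_captions, so you can sample several candidate captions instead of taking the single beam-search result (parameter names taken from LAVIS's generate signature; check your installed version if they differ):

```python
# Sample 3 candidate captions with nucleus sampling instead of beam search
model.generate(
    {"image": image},
    use_nucleus_sampling=True,
    num_captions=3,
)
```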

We analyze the image by asking questions

To hold a conversation about the image, we build up a prompt that accumulates every question and answer, so each new question is asked with the full context of the exchange so far. We start with an empty prompt:

```python
prompt = None
```

```python
def prepare_prompt(prompt, question):
    # Append the new question to the running prompt so the model
    # keeps the context of the previous questions and answers
    if prompt is None:
        prompt = question + " Answer:"
    else:
        prompt = prompt + " " + question + " Answer:"
    return prompt
```

```python
def get_answer(prompt, question, model):
    prompt = prepare_prompt(prompt, question)
    answer = model.generate(
        {
            "image": image,
            "prompt": prompt
        }
    )
    answer = answer[0]
    # Append the answer to the prompt so the next question is asked
    # with the whole conversation as context
    prompt = prompt + " " + answer + "."
    return prompt, answer
```

```python
question = "What's in the picture?"
prompt, answer = get_answer(prompt, question, model)
print(f"Question: {question}")
print(f"Answer: {answer}")
```

Output:

```
Question: What's in the picture?
Answer: a snake
```

```python
question = "What kind of snake?"
prompt, answer = get_answer(prompt, question, model)
print(f"Question: {question}")
print(f"Answer: {answer}")
```

Output:

```
Question: What kind of snake?
Answer: cobra
```

```python
question = "Is it poisonous?"
prompt, answer = get_answer(prompt, question, model)
print(f"Question: {question}")
print(f"Answer: {answer}")
```

Output:

```
Question: Is it poisonous?
Answer: yes
```

```python
question = "If it bites me, can I die?"
prompt, answer = get_answer(prompt, question, model)
print(f"Question: {question}")
print(f"Answer: {answer}")
```

Output:

```
Question: If it bites me, can I die?
Answer: yes
```
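
The same pattern can be wrapped in a loop; this is a small sketch of mine (not from the original post) that runs a whole list of questions through the chained prompt:

```python
# Ask a list of questions in a row, reusing the prompt-chaining
# pattern above so each answer feeds into the next question
questions = [
    "What's in the picture?",
    "What kind of snake?",
    "Is it poisonous?",
]

prompt = None
for question in questions:
    prompt, answer = get_answer(prompt, question, model)
    print(f"Question: {question}")
    print(f"Answer: {answer}")
```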
