Disclaimer: This post has been translated to English using a machine translation model. Please let me know if you find any mistakes.
Introduction
Blip2 is an artificial intelligence model that takes an image or video as input and can hold a conversation about it, answering questions or providing context about what the input shows with great accuracy 🤯
Installation
To install this tool, it's best to create a new Anaconda environment.
!$ conda create -n blip2 python=3.9
Now we activate the environment
!$ conda activate blip2
We install all the necessary modules
!$ conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
!$ conda install -c anaconda pillow
!$ conda install -y -c anaconda requests
!$ conda install -y -c anaconda jupyter
Finally we install Blip2
!$ pip install salesforce-lavis
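If you want to check that everything is installed correctly, a quick sanity check like this one (optional, not part of the installation steps above) should confirm that PyTorch sees the GPU and that lavis imports without errors
import torch
from lavis.models import load_model_and_preprocess  # fails if salesforce-lavis is not installed

print(torch.__version__)
print(torch.cuda.is_available())  # should print True if the GPU and CUDA drivers are set up correctly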
Usage
We load the necessary libraries
import torch
from PIL import Image
import requests
from lavis.models import load_model_and_preprocess
We load an example image
img_url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/12_-_The_Mystical_King_Cobra_and_Coffee_Forests.jpg/800px-12_-_The_Mystical_King_Cobra_and_Coffee_Forests.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
display(raw_image.resize((500, 500)))
<PIL.Image.Image image mode=RGB size=500x500>
We use the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device
device(type='cuda')
We select a model. In my case, with a computer that has 32 GB of RAM and a 3060 GPU with 12 GB of VRAM, I can't use all of them, so I've added an ok comment next to the models I was able to use, and the error I got for the ones I couldn't. If you have a computer with the same amount of RAM and VRAM, you'll know which ones you can use; if not, you'll have to test them.
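Before choosing a model, it can help to check how much VRAM the GPU actually has. This is an optional sketch using standard PyTorch calls, assuming a single-GPU machine (device index 0).
import torch

if torch.cuda.is_available():
    gpu = torch.cuda.get_device_properties(0)  # properties of the first GPU
    print(gpu.name)
    print(f"{gpu.total_memory / 1024**3:.1f} GB of VRAM")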
# name = "blip2_opt"; model_type = "pretrain_opt2.7b" # ok# name = "blip2_opt"; model_type = "caption_coco_opt2.7b" # FAIL VRAM# name = "blip2_opt"; model_type = "pretrain_opt6.7b" # FAIL RAM# name = "blip2_opt"; model_type = "caption_coco_opt6.7b" # FAIL RAM# name = "blip2"; model_type = "pretrain" # FAIL type error# name = "blip2"; model_type = "coco" # okname = "blip2_t5"; model_type = "pretrain_flant5xl" # ok# name = "blip2_t5"; model_type = "caption_coco_flant5xl" # FAIL VRAM# name = "blip2_t5"; model_type = "pretrain_flant5xxl" # FAILmodel, vis_processors, _ = load_model_and_preprocess(name=name, model_type=model_type, is_eval=True, device=device)vis_processors.keys()
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]
dict_keys(['train', 'eval'])
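Once the model is loaded, we can count its parameters to get a rough idea of its size, which helps explain why the larger variants fail with RAM or VRAM errors. This is an optional check using plain PyTorch; the exact number depends on the variant you loaded.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f} B parameters")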
We prepare the image to feed it into the model
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
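If you want to verify the preprocessing, you can inspect the resulting tensor. The exact shape depends on the visual processor of the model you loaded, but it should be a single RGB image at a square resolution.
print(image.shape)   # e.g. torch.Size([1, 3, 224, 224]), depending on the processor
print(image.device)  # should match the device selected above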
We analyze the image without asking anything
model.generate({"image": image})
['a black and white snake']
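model.generate also accepts generation arguments. For example, with nucleus sampling we can ask for several different captions. This is a sketch based on the generate signature in LAVIS; the exact arguments may vary between versions.
model.generate(
    {"image": image},
    use_nucleus_sampling=True,  # sample instead of using beam search
    num_captions=3,             # return three different captions
)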
We analyze the image by asking questions about it
prompt = None
def prepare_prompt(prompt, question):
    if prompt is None:
        prompt = question + " Answer:"
    else:
        prompt = prompt + " " + question + " Answer:"
    return prompt
def get_answer(prompt, question, model):
    prompt = prepare_prompt(prompt, question)
    answer = model.generate({"image": image, "prompt": prompt})
    answer = answer[0]
    prompt = prompt + " " + answer + "."
    return prompt, answer
question = "What's in the picture?"prompt, answer = get_answer(prompt, question, model)print(f"Question: {question}")print(f"Answer: {answer}")
Question: What's in the picture?
Answer: a snake
question = "What kind of snake?"prompt, answer = get_answer(prompt, question, model)print(f"Question: {question}")print(f"Answer: {answer}")
Question: What kind of snake?
Answer: cobra
question = "Is it poisonous?"prompt, answer = get_answer(prompt, question, model)print(f"Question: {question}")print(f"Answer: {answer}")
Question: Is it poisonous?
Answer: yes
question = "If it bites me, can I die?"prompt, answer = get_answer(prompt, question, model)print(f"Question: {question}")print(f"Answer: {answer}")
Question: If it bites me, can I die?
Answer: yes
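Since get_answer keeps appending each question and answer to the prompt, we can print it at the end to see the full conversation context the model has been receiving
print(prompt)
Given the answers above, it should look like this
What's in the picture? Answer: a snake. What kind of snake? Answer: cobra. Is it poisonous? Answer: yes. If it bites me, can I die? Answer: yes.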