Fine-Tuning Florence-2: Computer Vision

18 of july of 2024

Disclaimer: This post has been translated to English using a machine translation model. Please, let me know if you find any mistakes.

In the post Florence-2 we already explained the Florence-2 model and saw how to use it. So in this post, we are going to see how to fine-tune it.

Fine tuning for Document VQA

This fine-tuning is based on the post by Merve Noyan, Andres Marafioti and Piotr Skalski, Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models, in which they explain that although this method is very comprehensive, it does not allow asking questions about documents, so they perform a retraining with the DocumentVQA dataset.

Dataset

First, we download the dataset. I'm leaving the dataset_percentage variable in case you don't want to download everything.

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		from datasets import load_dataset
 
dataset_percentage = 100
data_train = load_dataset("HuggingFaceM4/DocumentVQA", split=f"train[:{dataset_percentage}%]")
data_validation = load_dataset("HuggingFaceM4/DocumentVQA", split=f"validation[:{dataset_percentage}%]")
data_test = load_dataset("HuggingFaceM4/DocumentVQA", split=f"test[:{dataset_percentage}%]")
 
data_train, data_validation, data_test
	
	Copied

>_ Output

			
				(Dataset({
     features: ['questionId', 'question', 'question_types', 'image', 'docId', 'ucsf_document_id', 'ucsf_document_page_no', 'answers'],
     num_rows: 39463
}),
Dataset({
     features: ['questionId', 'question', 'question_types', 'image', 'docId', 'ucsf_document_id', 'ucsf_document_page_no', 'answers'],
     num_rows: 5349
}),
Dataset({
     features: ['questionId', 'question', 'question_types', 'image', 'docId', 'ucsf_document_id', 'ucsf_document_page_no', 'answers'],
     num_rows: 5188
}))

We make a subset of the dataset if you want to speed up the training, in my case I use 100% of the data

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		percentage = 1
 
subset_data_train = data_train.select(range(int(len(data_train) * percentage)))
subset_data_validation = data_validation.select(range(int(len(data_validation) * percentage)))
subset_data_test = data_test.select(range(int(len(data_test) * percentage)))
 
print(f"train dataset length: {len(subset_data_train)}, validation dataset length: {len(subset_data_validation)}, test dataset length: {len(subset_data_test)}")
	
	Copied

>_ Output

			
				train dataset length: 39463, validation dataset length: 5349, test dataset length: 5188

We also instantiate the model

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		from transformers import AutoModelForCausalLM, AutoProcessor
import torch
 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
checkpoints = 'microsoft/Florence-2-base-ft'
model = AutoModelForCausalLM.from_pretrained(checkpoints, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(checkpoints, trust_remote_code=True)
	
	Copied

Just like in the post Florence-2 we create a function to ask the model for responses

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		def create_prompt(task_prompt, text_input=None):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    return prompt
	
	Copied

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		def generate_answer(task_prompt, text_input=None, image=None, device="cpu"):
    # Create prompt
    prompt = create_prompt(task_prompt, text_input)
 
    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")
 
    # Get inputs
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
 
    # Get outputs
    generated_ids = model.generate(
      input_ids=inputs["input_ids"],
      pixel_values=inputs["pixel_values"],
      max_new_tokens=1024,
      early_stopping=False,
      do_sample=False,
      num_beams=3,
    )
 
    # Decode the generated IDs
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
 
    # Post-process the generated text
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )
 
    return parsed_answer
	
	Copied

We test the model with 3 documents from the dataset, with the task DocVQA to see if we get anything.

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		for idx in range(3):
  print(generate_answer(task_prompt="&lt;DocVQA&gt;", text_input='What do you see in this image?', image=data_train[idx]['image'], device=model.device))
  display(data_train[idx]['image'].resize([350, 350]))
	
	Copied

>_ Output

			
				{'&lt;DocVQA&gt;': 'docvQA'}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

>_ Output

			
				{'&lt;DocVQA&gt;': 'docvQA'}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

>_ Output

			
				{'&lt;DocVQA&gt;': 'DocVQA&gt;'}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		for idx in range(3):
  print(generate_answer(task_prompt="DocVQA", text_input='What do you see in this image?', image=data_train[idx]['image'], device=model.device))
  display(data_train[idx]['image'].resize([350, 350]))
	
	Copied

>_ Output

			
				{'DocVQA': 'unanswerable'}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

>_ Output

			
				{'DocVQA': 'unanswerable'}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

>_ Output

			
				{'DocVQA': '499150498'}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

We see that the answers are not good.

We are now testing with the OCR task

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		for idx in range(3):
  print(generate_answer(task_prompt="&lt;OCR&gt;", image=data_train[idx]['image'], device=model.device))
  display(data_train[idx]['image'].resize([350, 350]))
	
	Copied

>_ Output

			
				{'&lt;OCR&gt;': 'ConfidentialDATE:11/8/18RJT FR APPROVALBUBJECT: Rl gdasPROPOSED RELEASE DATE:for responseFOR RELEASE TO!CONTRACT: P. CARTERROUTE TO!NameIntiifnPeggy CarterAce11/fesMura PayneDavid Fishhel037Tom Gisis Com-Diane BarrowsEd BlackmerTow KuckerReturn to Peggy Carter, PR, 16 Raynolds BuildingLLS. 2015Source: https://www.industrydocuments.ucsf.edu/docs/xnbl0037'}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

>_ Output

			
				{'&lt;OCR&gt;': 'ConfidentialDATE:11/8/18RJT FR APPROVALBUBJECT: Rl gdasPROPOSED RELEASE DATE:for responseFOR RELEASE TO!CONTRACT: P. CARTERROUTE TO!NameIntiifnPeggy CarterAce11/fesMura PayneDavid Fishhel037Tom Gisis Com-Diane BarrowsEd BlackmerTow KuckerReturn to Peggy Carter, PR, 16 Raynolds BuildingLLS. 2015Source: https://www.industrydocuments.ucsf.edu/docs/xnbl0037'}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

>_ Output

			
				{'&lt;OCR&gt;': 'BSABROWN &amp; WILLIAMSON JOBACCO CORPORATIONRESEARCH &amp; DEVELOPMENTINTERNAL CORRESPONDENCETO:R. H. HoneycuttCC:C.J. CookFROM:May 9, 1995SUBJECT: Review of Existing Brainstorming Ideas/43The major function of the Product Innovation Ideas is developed marketable novel productsthat would be profile of the manufacturer and sell. Novel is defined as: a new kind, or differentfrom anything seen in known before, Innovation things as something is available. The products mayintroduced and the most technologies, materials and know, available to give a uniquetaste or tok.The first task of the product innovation was was an easy-view review and then a list ofexisting brainstorming ideas. These were group was used for two major categories that may differapparance and lerato,Ideas are grouped into two major products that may offercategories include a combination print of the above, flowers, and packaged and brand directions.ApparanceThis category is used in a novel cigarette constructions that yield visually different products withminimal changes in smokecigarette.Two cigarettes in one.Multi-plug in your.C-Switch menthol or non non smoking cigarette.E-Switch with ORPORated perforations to enable smoke to separate unburned section forfuture smoking.Tout smoking.Bobace section 30 mm.Novelcigarette constructions and permit a significant reduction in tobacco weight whilemaintaining fast smoking mechanics and visual reduction for tobacco weight.higher basis weight paper, potential reduction for cigarette weight.Easter or in an ebony agent for tobacco, e.g. starch.Colored tow and cigarette papers; seasonal promotions, eg. pastel colored cigarettes forEaster and in an Ebony brand containing a mixture of all black (black paper and tow)and all white cigarettes.499150498Source: https://www.industrydocuments.ucs.edu/docs/mxj0037'}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

We obtain the text from the documents, but not what the documents are about.

Lastly, we test with the CAPTION tasks

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		for idx in range(3):
  print(generate_answer(task_prompt="&lt;CAPTION&gt;", image=data_train[idx]['image'], device=model.device))
  print(generate_answer(task_prompt="&lt;DETAILED_CAPTION&gt;", image=data_train[idx]['image'], device=model.device))
  print(generate_answer(task_prompt="&lt;MORE_DETAILED_CAPTION&gt;", image=data_train[idx]['image'], device=model.device))
  display(data_train[idx]['image'].resize([350, 350]))
	
	Copied

>_ Output

			
				{'&lt;CAPTION&gt;': 'A certificate is stamped with the date of 18/18.'}
{'&lt;DETAILED_CAPTION&gt;': 'In this image we can see a paper with some text on it.'}
{'&lt;MORE_DETAILED_CAPTION&gt;': 'A letter is written in black ink on a white paper. The letters are written in a cursive language. The letter is addressed to peggy carter. '}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

>_ Output

			
				{'&lt;CAPTION&gt;': 'A certificate is stamped with the date of 18/18.'}
{'&lt;DETAILED_CAPTION&gt;': 'In this image we can see a paper with some text on it.'}
{'&lt;MORE_DETAILED_CAPTION&gt;': 'A letter is written in black ink on a white paper. The letters are written in a cursive language. The letter is addressed to peggy carter. '}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

>_ Output

			
				{'&lt;CAPTION&gt;': "a paper that says 'brown &amp; williamson tobacco corporation research &amp; development' on it"}
{'&lt;DETAILED_CAPTION&gt;': 'In this image we can see a paper with some text on it.'}
{'&lt;MORE_DETAILED_CAPTION&gt;': 'The image is a page from a book titled "Brown &amp; Williamson Jobacco Corporation Research &amp; Development".  The page is white and has black text.  The title of the page is "R. H. Honeycutt" at the top.  There is a logo of the company BSA in the top right corner.  A paragraph is written in black text below the title.'}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

Let's not accept these responses either, so we are going to do some fine-tuning.

Fine tuning

First we create a Pytorch dataset

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		from torch.utils.data import Dataset
 
class DocVQADataset(Dataset):
    def __init__(self, data):
        self.data = data
 
    def __len__(self):
        return len(self.data)
 
    def __getitem__(self, idx):
        example = self.data[idx]
        question = "&lt;DocVQA&gt;" + example['question']
        first_answer = example['answers'][0]
        image = example['image']
        if image.mode != "RGB":
            image = image.convert("RGB")
        return question, first_answer, image
	
	Copied

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		train_dataset = DocVQADataset(subset_data_train)
val_dataset = DocVQADataset(subset_data_validation)
	
	Copied

Let's see it

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		train_dataset[0]
	
	Copied

>_ Output

			
				('&lt;DocVQA&gt;what is the date mentioned in this letter?',
'1/8/93',
&lt;PIL.Image.Image image mode=RGB size=1695x2025&gt;)

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		data_train[0]
	
	Copied

>_ Output

			
				{'questionId': 337,
'question': 'what is the date mentioned in this letter?',
'question_types': ['handwritten', 'form'],
'image': &lt;PIL.PngImagePlugin.PngImageFile image mode=L size=1695x2025&gt;,
'docId': 279,
'ucsf_document_id': 'xnbl0037',
'ucsf_document_page_no': '1',
'answers': ['1/8/93']}

We create a DataLoader

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		import os
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import (AdamW, AutoProcessor, get_scheduler)
 
def collate_fn(batch):
    questions, answers, images = zip(*batch)
    inputs = processor(text=list(questions), images=list(images), return_tensors="pt", padding=True).to(device)
    return inputs, answers
 
# Create DataLoader
batch_size = 8
num_workers = 0
 
train_loader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=collate_fn, num_workers=num_workers, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, collate_fn=collate_fn, num_workers=num_workers)
	
	Copied

Let's see a sample

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		sample = next(iter(train_loader))
	
	Copied

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		sample
	
	Copied

>_ Output

			
				({'input_ids': tensor([[    0, 41552, 42291,   846,  1864,   250, 15698, 12375,    16,     5,
           3383,     9,   331,     9,  2042,   116,     2,     1,     1,     1,
              1,     1,     1],
         [    0, 41552, 42291,   846,  1864,   250, 15698,  2264,    16,     5,
          11968,   196,   205, 22922,   346, 17487,     2,     1,     1,     1,
              1,     1,     1],
         [    0, 41552, 42291,   846,  1864,   250, 15698,  2264,    16,     5,
           1229,    13,   403,   690,   116,     2,     1,     1,     1,     1,
              1,     1,     1],
         [    0, 41552, 42291,   846,  1864,   250, 15698,  2264,    16,     5,
           5480,  1280,   116,     2,     1,     1,     1,     1,     1,     1,
              1,     1,     1],
         [    0, 41552, 42291,   846,  1864,   250, 15698, 12196,    16,     5,
           1842,   346,    13,    20,  4680, 41828, 42237,     8, 30147, 17487,
              2,     1,     1],
         [    0, 41552, 42291,   846,  1864,   250, 15698,   560,    61,   675,
            473,    42,  1013,   266,  9943,     7,   116,     2,     1,     1,
              1,     1,     1],
         [    0, 41552, 42291,   846,  1864,   250, 15698, 12196,    16,     5,
           1280,     9, 39432,   642,  6228,  2394,  2801,    11,     5,   576,
            266, 17487,     2],
         [    0, 41552, 42291,   846,  1864,   250, 15698,  2264,    16,  1982,
             11,     5,  6655,  2325,    23,     5,   299,   235,     9,     5,
           3780,   116,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
         ...
         '97.00',
  '123',
  '1 January 1979 - 31 December 1979',
  '$2,720.14',
  'GPI'))

The raw sample is a lot of information, so let's check the length of the sample

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		len(sample)
	
	Copied

>_ Output

We get a length of 2 because we have the input to the model and the response

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		sample_inputs = sample[0]
sample_answers = sample[1]
	
	Copied

We see the input

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		sample_inputs
	
	Copied

>_ Output

			
				{'input_ids': tensor([[    0, 41552, 42291,   846,  1864,   250, 15698, 12375,    16,     5,
          3383,     9,   331,     9,  2042,   116,     2,     1,     1,     1,
             1,     1,     1],
        [    0, 41552, 42291,   846,  1864,   250, 15698,  2264,    16,     5,
         11968,   196,   205, 22922,   346, 17487,     2,     1,     1,     1,
             1,     1,     1],
        [    0, 41552, 42291,   846,  1864,   250, 15698,  2264,    16,     5,
          1229,    13,   403,   690,   116,     2,     1,     1,     1,     1,
             1,     1,     1],
        [    0, 41552, 42291,   846,  1864,   250, 15698,  2264,    16,     5,
          5480,  1280,   116,     2,     1,     1,     1,     1,     1,     1,
             1,     1,     1],
        [    0, 41552, 42291,   846,  1864,   250, 15698, 12196,    16,     5,
          1842,   346,    13,    20,  4680, 41828, 42237,     8, 30147, 17487,
             2,     1,     1],
        [    0, 41552, 42291,   846,  1864,   250, 15698,   560,    61,   675,
           473,    42,  1013,   266,  9943,     7,   116,     2,     1,     1,
             1,     1,     1],
        [    0, 41552, 42291,   846,  1864,   250, 15698, 12196,    16,     5,
          1280,     9, 39432,   642,  6228,  2394,  2801,    11,     5,   576,
           266, 17487,     2],
        [    0, 41552, 42291,   846,  1864,   250, 15698,  2264,    16,  1982,
            11,     5,  6655,  2325,    23,     5,   299,   235,     9,     5,
          3780,   116,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        ...
        [ 2.6400,  2.6400,  2.6400,  ...,  1.3502,  0.7925,  1.3502],
          [ 2.6400,  2.6400,  2.6400,  ...,  0.9319,  1.4025,  0.8448],
          [ 2.6400,  2.6400,  2.6400,  ...,  1.0365,  1.2282,  0.8099]]]])}

The raw input also has too much information, so let's look at the keys

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		sample_inputs.keys()
	
	Copied

>_ Output

			
				dict_keys(['input_ids', 'attention_mask', 'pixel_values'])

As we can see, we have the input_ids and the attention_mask which correspond to the input text, and the pixel_values which correspond to the image. Let's check the dimension of each one.

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		sample_inputs['input_ids'].shape, sample_inputs['attention_mask'].shape, sample_inputs['pixel_values'].shape
	
	Copied

>_ Output

			
				(torch.Size([8, 23]), torch.Size([8, 23]), torch.Size([8, 3, 768, 768]))

In all of them there are 8 elements, because when creating the dataloader we set a batch size of 8. In the input_ids and attention_mask, each element has 28 tokens, and in the pixel_values, each element has 3 channels, 768 pixels in height, and 768 pixels in width.

Let's now look at the answers

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		sample_answers
	
	Copied

>_ Output

			
				('JAMES A. RHODES',
'1-800-992-3284',
'$50,000',
'97.00',
'123',
'1 January 1979 - 31 December 1979',
'$2,720.14',
'GPI')

We obtained 8 responses, for the same reason as before, because when creating the dataloader we set a batch size of 8

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		len(sample_answers)
	
	Copied

>_ Output

We create a function to perform the fine tuning

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		def train_model(train_loader, val_loader, model, processor, epochs=10, lr=1e-6):
    optimizer = AdamW(model.parameters(), lr=lr)
    num_training_steps = epochs * len(train_loader)
    lr_scheduler = get_scheduler(
        name="linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )
 
    for epoch in range(epochs):
 
        # Training phase
        print(f"
Training Epoch {epoch + 1}/{epochs}")
        model.train()
        train_loss = 0
        i = -1
        for batch in tqdm(train_loader, desc=f"Training Epoch {epoch + 1}/{epochs}"):
            i += 1
            inputs, answers = batch
 
            input_ids = inputs["input_ids"]
            pixel_values = inputs["pixel_values"]
            labels = processor.tokenizer(text=answers, return_tensors="pt", padding=True, return_token_type_ids=False).input_ids.to(device)
 
            outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=labels)
            loss = outputs.loss
 
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
 
            train_loss += loss.item()
 
        avg_train_loss = train_loss / len(train_loader)
        print(f"Average Training Loss: {avg_train_loss}")
 
        # Validation phase
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch in tqdm(val_loader, desc=f"Validation Epoch {epoch + 1}/{epochs}"):
                inputs, answers = batch
 
                input_ids = inputs["input_ids"]
                pixel_values = inputs["pixel_values"]
                labels = processor.tokenizer(text=answers, return_tensors="pt", padding=True, return_token_type_ids=False).input_ids.to(device)
 
                outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=labels)
                loss = outputs.loss
 
                val_loss += loss.item()
 
        avg_val_loss = val_loss / len(val_loader)
        print(f"Average Validation Loss: {avg_val_loss}")
	
	Copied

We train

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		train_model(train_loader, val_loader, model, processor, epochs=3, lr=1e-6)
	
	Copied

>_ Output

			
				Training Epoch 1/3

>_ Output

			
				Training Epoch 1/3: 100%|██████████| 4933/4933 [2:45:28&lt;00:00,  2.01s/it]

>_ Output

			
				Average Training Loss: 1.153514638062836

>_ Output

			
				Validation Epoch 1/3: 100%|██████████| 669/669 [13:52&lt;00:00,  1.24s/it]

>_ Output

			
				Average Validation Loss: 0.7698153616646124
Training Epoch 2/3

>_ Output

			
				Training Epoch 2/3: 100%|██████████| 4933/4933 [2:42:51&lt;00:00,  1.98s/it]

>_ Output

			
				Average Training Loss: 0.6530420315007687

>_ Output

			
				Validation Epoch 2/3: 100%|██████████| 669/669 [13:48&lt;00:00,  1.24s/it]

>_ Output

			
				Average Validation Loss: 0.725301219375946
Training Epoch 3/3

>_ Output

			
				Training Epoch 3/3: 100%|██████████| 4933/4933 [2:42:52&lt;00:00,  1.98s/it]

>_ Output

			
				Average Training Loss: 0.5878197003753292

>_ Output

			
				Validation Epoch 3/3: 100%|██████████| 669/669 [13:45&lt;00:00,  1.23s/it]

>_ Output

			
				Average Validation Loss: 0.716769086751079

>_ Output

Test the fine-tuned model

We now test the model on a few documents from the test set

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		for idx in range(3):
  print(generate_answer(task_prompt="&lt;DocVQA&gt;", text_input='What do you see in this image?', image=data_test[idx]['image'], device=model.device))
  display(data_test[idx]['image'].resize([350, 350]))
	
	Copied

>_ Output

			
				{'&lt;DocVQA&gt;': 'CAGR 19%'}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

>_ Output

			
				{'&lt;DocVQA&gt;': 'memorandum'}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

>_ Output

			
				{'&lt;DocVQA&gt;': '14000'}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

We see that it gives us information

Let's now retest on the test set to compare with what came out before training

	
		
			< >
			Input
		
		
			Python
			
		
	
	
		for idx in range(3):
  print(generate_answer(task_prompt="&lt;DocVQA&gt;", text_input='What do you see in this image?', image=data_train[idx]['image'], device=model.device))
  display(data_train[idx]['image'].resize([350, 350]))
	
	Copied

>_ Output

			
				{'&lt;DocVQA&gt;': 'Confidential'}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

>_ Output

			
				{'&lt;DocVQA&gt;': 'Confidential'}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

>_ Output

			
				{'&lt;DocVQA&gt;': 'Brown &amp; Williamson Tobacco Corporation Research &amp; Development'}

>_ Output

			
				&lt;PIL.Image.Image image mode=L size=350x350&gt;

It doesn't yield very good results, but we have only trained for 3 epochs. Although it could be improved by training more, what can be seen is that when we previously used the task tag <DocVQA>, we didn't get a response, but now we do.

Continue reading

Deep Research with LangGraph: Build an AI Research Assistant from Scratch

Learn how neural networks work from scratch with a practical linear regression example. This beginner-friendly tutorial explains artificial neurons, parameter initialization, loss functions, and mean squared error (MSE) with step-by-step code examples in Python.

MCP Elicitation: Implementing Elicitation in Servers with FastMCP and Python

Learn how to implement elicitation in MCP (Model Context Protocol) servers with FastMCP. Complete step-by-step tutorial ...

MCP Durability: Server and Client with Persistence for Long-Running Tasks

Learn to build durable MCP server and client for long-running tasks with persistence. Complete Model Context Protocol tu...

Last posts -->

Have you seen these projects?

Gymnasia

Horeca chatbot

Naviground

View all projects -->

>_ Available for projects

Do you have an AI project?

Let's talk.

maximofn@gmail.com

Machine Learning and AI specialist. I develop solutions with generative AI, intelligent agents and custom models.

Write me LinkedIn

Do you want to watch any talk?

Tomorrow's Agents: Deciphering the Mysteries of Planning, UX and Memory

AI agents, powered by LLMs, promise to transform applications. But are they simple executors today or future intelligent collaborators? To reach their...

Create your own Apple intelligence

Learn to create an IA system to execute efficiently on a device

Last talks -->

Do you want to improve with these tips?

o1 prompt engineering

Create better prompts for o1 following an example

Memory profiler

See the memory usage of a script

DataLoader with pin_memory and num_workers

Increase DataLoader performance with pin_memory and num_workers

Last tips -->

Use this locally

Hugging Face spaces allow us to run models with very simple demos, but what if the demo breaks? Or if the user deletes it? That's why I've created docker containers with some interesting spaces, to be able to use them locally, whatever happens. In fact, if you click on any project view button, it may take you to a space that doesn't work.