Fine-Tuning Florence-2: AI Vision

July 18, 2024


In the Florence-2 post we explained the Florence-2 model and saw how to use it. In this post we will see how to fine-tune it.

Fine-tuning for document VQA

This fine-tuning is based on the post by Merve Noyan, Andres Marafioti, and Piotr Skalski, Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models, in which they explain that, although the model is very capable, it cannot answer questions about documents, so they retrain it on the DocumentVQA dataset.

Dataset

First, we download the dataset. The dataset_percentage variable is there in case you do not want to download all of it.
	
		from datasets import load_dataset
 
dataset_percentage = 100
data_train = load_dataset("HuggingFaceM4/DocumentVQA", split=f"train[:{dataset_percentage}%]")
data_validation = load_dataset("HuggingFaceM4/DocumentVQA", split=f"validation[:{dataset_percentage}%]")
data_test = load_dataset("HuggingFaceM4/DocumentVQA", split=f"test[:{dataset_percentage}%]")
 
data_train, data_validation, data_test
	

>_ Output

			
				(Dataset({
     features: ['questionId', 'question', 'question_types', 'image', 'docId', 'ucsf_document_id', 'ucsf_document_page_no', 'answers'],
     num_rows: 39463
}),
Dataset({
     features: ['questionId', 'question', 'question_types', 'image', 'docId', 'ucsf_document_id', 'ucsf_document_page_no', 'answers'],
     num_rows: 5349
}),
Dataset({
     features: ['questionId', 'question', 'question_types', 'image', 'docId', 'ucsf_document_id', 'ucsf_document_page_no', 'answers'],
     num_rows: 5188
}))

We then take a subset of the dataset in case you want training to go faster. In my case I use 100% of the data (percentage = 1).
	
		percentage = 1
 
subset_data_train = data_train.select(range(int(len(data_train) * percentage)))
subset_data_validation = data_validation.select(range(int(len(data_validation) * percentage)))
subset_data_test = data_test.select(range(int(len(data_test) * percentage)))
 
print(f"train dataset length: {len(subset_data_train)}, validation dataset length: {len(subset_data_validation)}, test dataset length: {len(subset_data_test)}")
	

>_ Output

			
				train dataset length: 39463, validation dataset length: 5349, test dataset length: 5188
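
The two knobs above act at different stages: dataset_percentage goes into the split string that load_dataset receives, while percentage is a fraction applied to a split that is already loaded. Here is a minimal sketch of that arithmetic, using the split sizes printed above. The helper names split_spec and subset_size are hypothetical and are not part of the original notebook.

```python
def split_spec(split: str, dataset_percentage: int) -> str:
    """Build the slicing string passed as `split=` to load_dataset."""
    return f"{split}[:{dataset_percentage}%]"


def subset_size(num_rows: int, percentage: float) -> int:
    """Number of rows kept by `.select(range(int(len(data) * percentage)))`."""
    return int(num_rows * percentage)


print(split_spec("train", 100))  # train[:100%]
print(subset_size(39463, 1))     # 39463
print(subset_size(39463, 0.1))   # 3946
```

Because int() truncates, a fraction that does not divide the split evenly rounds the subset down.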

We also instantiate the model and its processor
	
		from transformers import AutoModelForCausalLM, AutoProcessor
import torch
 
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
 
checkpoints = 'microsoft/Florence-2-base-ft'
model = AutoModelForCausalLM.from_pretrained(checkpoints, trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained(checkpoints, trust_remote_code=True)
	
As in the Florence-2 post, we create a function to ask the model for answers
	
		def create_prompt(task_prompt, text_input=None):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    return prompt
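
To make the prompt format concrete, here is a small sketch of what create_prompt returns with and without a question. The function is repeated so the snippet runs on its own; the example question is only illustrative. Note that the task tag and the question are concatenated with no separator.

```python
def create_prompt(task_prompt, text_input=None):
    # Same logic as the function above, repeated so the snippet is self-contained
    if text_input is None:
        return task_prompt
    return task_prompt + text_input


print(create_prompt("<OCR>"))
# <OCR>
print(create_prompt("<DocVQA>", "What is the date mentioned in this letter?"))
# <DocVQA>What is the date mentioned in this letter?
```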
	
	
		def generate_answer(task_prompt, text_input=None, image=None, device="cpu"):
    # Create prompt
    prompt = create_prompt(task_prompt, text_input)
 
    # Ensure the image is in RGB mode
    if image.mode != "RGB":
        image = image.convert("RGB")
 
    # Get inputs
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
 
    # Get outputs
    generated_ids = model.generate(
      input_ids=inputs["input_ids"],
      pixel_values=inputs["pixel_values"],
      max_new_tokens=1024,
      early_stopping=False,
      do_sample=False,
      num_beams=3,
    )
 
    # Decode the generated IDs
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
 
    # Post-process the generated text
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height)
    )
 
    return parsed_answer
	
We test the model on 3 documents from the dataset with the <DocVQA> task, to see whether we get anything.

for idx in range(3):
  print(generate_answer(task_prompt="<DocVQA>", text_input='What do you see in this image?', image=data_train[idx]['image'], device=model.device))
  display(data_train[idx]['image'].resize([350, 350]))

>_ Output

{'<DocVQA>': 'docvQA'}
{'<DocVQA>': 'docvQA'}
{'<DocVQA>': 'DocVQA>'}

	
		
We also try the task name without the angle brackets, DocVQA.

for idx in range(3):
  print(generate_answer(task_prompt="DocVQA", text_input='What do you see in this image?', image=data_train[idx]['image'], device=model.device))
  display(data_train[idx]['image'].resize([350, 350]))

>_ Output

{'DocVQA': 'unanswerable'}
{'DocVQA': 'unanswerable'}
{'DocVQA': '499150498'}

We can see that the answers are not good.

Now we try the OCR task

for idx in range(3):
  print(generate_answer(task_prompt="<OCR>", image=data_train[idx]['image'], device=model.device))
  display(data_train[idx]['image'].resize([350, 350]))

>_ Output

{'<OCR>': 'ConfidentialDATE:11/8/18RJT FR APPROVALBUBJECT: Rl gdasPROPOSED RELEASE DATE:for responseFOR RELEASE TO!CONTRACT: P. CARTERROUTE TO!NameIntiifnPeggy CarterAce11/fesMura PayneDavid Fishhel037Tom Gisis Com-Diane BarrowsEd BlackmerTow KuckerReturn to Peggy Carter, PR, 16 Raynolds BuildingLLS. 2015Source: https://www.industrydocuments.ucsf.edu/docs/xnbl0037'}
{'<OCR>': 'ConfidentialDATE:11/8/18RJT FR APPROVALBUBJECT: Rl gdasPROPOSED RELEASE DATE:for responseFOR RELEASE TO!CONTRACT: P. CARTERROUTE TO!NameIntiifnPeggy CarterAce11/fesMura PayneDavid Fishhel037Tom Gisis Com-Diane BarrowsEd BlackmerTow KuckerReturn to Peggy Carter, PR, 16 Raynolds BuildingLLS. 2015Source: https://www.industrydocuments.ucsf.edu/docs/xnbl0037'}
{'<OCR>': 'BSABROWN & WILLIAMSON JOBACCO CORPORATIONRESEARCH & DEVELOPMENTINTERNAL CORRESPONDENCETO:R. H. HoneycuttCC:C.J. CookFROM:May 9, 1995SUBJECT: Review of Existing Brainstorming Ideas/43The major function of the Product Innovation Ideas is developed marketable novel productsthat would be profile of the manufacturer and sell. Novel is defined as: a new kind, or differentfrom anything seen in known before, Innovation things as something is available. The products mayintroduced and the most technologies, materials and know, available to give a uniquetaste or tok.The first task of the product innovation was was an easy-view review and then a list ofexisting brainstorming ideas. These were group was used for two major categories that may differapparance and lerato,Ideas are grouped into two major products that may offercategories include a combination print of the above, flowers, and packaged and brand directions.ApparanceThis category is used in a novel cigarette constructions that yield visually different products withminimal changes in smokecigarette.Two cigarettes in one.Multi-plug in your.C-Switch menthol or non non smoking cigarette.E-Switch with ORPORated perforations to enable smoke to separate unburned section forfuture smoking.Tout smoking.Bobace section 30 mm.Novelcigarette constructions and permit a significant reduction in tobacco weight whilemaintaining fast smoking mechanics and visual reduction for tobacco weight.higher basis weight paper, potential reduction for cigarette weight.Easter or in an ebony agent for tobacco, e.g. starch.Colored tow and cigarette papers; seasonal promotions, eg. pastel colored cigarettes forEaster and in an Ebony brand containing a mixture of all black (black paper and tow)and all white cigarettes.499150498Source: https://www.industrydocuments.ucs.edu/docs/mxj0037'}

We get the text of the documents, but not what the documents are about.

Finally, we try the CAPTION tasks

for idx in range(3):
  print(generate_answer(task_prompt="<CAPTION>", image=data_train[idx]['image'], device=model.device))
  print(generate_answer(task_prompt="<DETAILED_CAPTION>", image=data_train[idx]['image'], device=model.device))
  print(generate_answer(task_prompt="<MORE_DETAILED_CAPTION>", image=data_train[idx]['image'], device=model.device))
  display(data_train[idx]['image'].resize([350, 350]))

>_ Output

{'<CAPTION>': 'A certificate is stamped with the date of 18/18.'}
{'<DETAILED_CAPTION>': 'In this image we can see a paper with some text on it.'}
{'<MORE_DETAILED_CAPTION>': 'A letter is written in black ink on a white paper. The letters are written in a cursive language. The letter is addressed to peggy carter. '}
{'<CAPTION>': 'A certificate is stamped with the date of 18/18.'}
{'<DETAILED_CAPTION>': 'In this image we can see a paper with some text on it.'}
{'<MORE_DETAILED_CAPTION>': 'A letter is written in black ink on a white paper. The letters are written in a cursive language. The letter is addressed to peggy carter. '}
{'<CAPTION>': "a paper that says 'brown & williamson tobacco corporation research & development' on it"}
{'<DETAILED_CAPTION>': 'In this image we can see a paper with some text on it.'}
{'<MORE_DETAILED_CAPTION>': 'The image is a page from a book titled "Brown & Williamson Jobacco Corporation Research & Development".  The page is white and has black text.  The title of the page is "R. H. Honeycutt" at the top.  There is a logo of the company BSA in the top right corner.  A paragraph is written in black text below the title.'}

These answers are not good enough either, so let's fine-tune the model.

Fine-tuning

First we create a PyTorch dataset
	
		from torch.utils.data import Dataset
 
class DocVQADataset(Dataset):
    def __init__(self, data):
        self.data = data
 
    def __len__(self):
        return len(self.data)
 
    def __getitem__(self, idx):
        example = self.data[idx]
        question = "<DocVQA>" + example['question']
        first_answer = example['answers'][0]
        image = example['image']
        if image.mode != "RGB":
            image = image.convert("RGB")
        return question, first_answer, image
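
To see what __getitem__ returns without loading the real dataset, here is a small sketch that runs the same logic on two made-up records. FakeImage is a hypothetical stand-in for a PIL image (it only mimics mode and convert), and DocVQASketch is a plain Python class rather than a torch.utils.data.Dataset subclass, so the snippet has no dependencies. Neither name is part of the original notebook.

```python
class FakeImage:
    """Hypothetical stand-in for a PIL image: only `mode` and `convert` exist."""

    def __init__(self, mode):
        self.mode = mode

    def convert(self, mode):
        return FakeImage(mode)


class DocVQASketch:
    """Same indexing logic as DocVQADataset, without the torch dependency."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        example = self.data[idx]
        question = "<DocVQA>" + example["question"]
        first_answer = example["answers"][0]  # any further answers are ignored
        image = example["image"]
        if image.mode != "RGB":
            image = image.convert("RGB")
        return question, first_answer, image


records = [  # made-up records, shaped like the DocumentVQA rows
    {"question": "what is the date?", "answers": ["1/8/93", "1-8-93"], "image": FakeImage("L")},
    {"question": "who signed it?", "answers": ["P. Carter"], "image": FakeImage("RGB")},
]
sketch = DocVQASketch(records)
question, answer, image = sketch[0]
print(len(sketch), question, answer, image.mode)
# 2 <DocVQA>what is the date? 1/8/93 RGB
```

Only the first entry of answers is used as the training target, so questions with several accepted answers are trained against just one of them.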
	
	
		train_dataset = DocVQADataset(subset_data_train)
val_dataset = DocVQADataset(subset_data_validation)
	
Let's look at one item
	
		train_dataset[0]
	
>_ Output

('<DocVQA>what is the date mentioned in this letter?',
'1/8/93',
<PIL.Image.Image image mode=RGB size=1695x2025>)

For comparison, this is the raw record in the original dataset:
	
		data_train[0]
	
>_ Output

{'questionId': 337,
'question': 'what is the date mentioned in this letter?',
'question_types': ['handwritten', 'form'],
'image': <PIL.PngImagePlugin.PngImageFile image mode=L size=1695x2025>,
'docId': 279,
'ucsf_document_id': 'xnbl0037',
'ucsf_document_page_no': '1',
'answers': ['1/8/93']}

We create a DataLoader
	
		import os
from torch.utils.data import DataLoader
from tqdm import tqdm
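from torch.optim import AdamW  # the AdamW shipped with transformers is deprecated; PyTorch's is the suggested replacement
from transformers import AutoProcessor, get_scheduler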
 
def collate_fn(batch):
    questions, answers, images = zip(*batch)
    inputs = processor(text=list(questions), images=list(images), return_tensors="pt", padding=True).to(device)
    return inputs, answers
 
# Create DataLoader
batch_size = 8
num_workers = 0
 
train_loader = DataLoader(train_dataset, batch_size=batch_size, collate_fn=collate_fn, num_workers=num_workers, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, collate_fn=collate_fn, num_workers=num_workers)
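
The key step in collate_fn is zip(*batch), which turns a list of (question, answer, image) tuples into three tuples, one per field. Here is a minimal sketch with placeholder strings standing in for the images:

```python
# Each dataset item is a (question, answer, image) tuple; the images are
# replaced by placeholder strings here.
batch = [
    ("<DocVQA>q1", "a1", "img1"),
    ("<DocVQA>q2", "a2", "img2"),
    ("<DocVQA>q3", "a3", "img3"),
]

questions, answers, images = zip(*batch)

print(questions)  # ('<DocVQA>q1', '<DocVQA>q2', '<DocVQA>q3')
print(answers)    # ('a1', 'a2', 'a3')
print(images)     # ('img1', 'img2', 'img3')
```

The questions and images are converted to lists before being passed to the processor, while the answers are returned as the tuple that zip produces.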
	
Let's look at one batch
	
		sample = next(iter(train_loader))
	
	
		sample
	

>_ Output

			
				({'input_ids': tensor([[    0, 41552, 42291,   846,  1864,   250, 15698, 12375,    16,     5,
           3383,     9,   331,     9,  2042,   116,     2,     1,     1,     1,
              1,     1,     1],
         [    0, 41552, 42291,   846,  1864,   250, 15698,  2264,    16,     5,
          11968,   196,   205, 22922,   346, 17487,     2,     1,     1,     1,
              1,     1,     1],
         [    0, 41552, 42291,   846,  1864,   250, 15698,  2264,    16,     5,
           1229,    13,   403,   690,   116,     2,     1,     1,     1,     1,
              1,     1,     1],
         [    0, 41552, 42291,   846,  1864,   250, 15698,  2264,    16,     5,
           5480,  1280,   116,     2,     1,     1,     1,     1,     1,     1,
              1,     1,     1],
         [    0, 41552, 42291,   846,  1864,   250, 15698, 12196,    16,     5,
           1842,   346,    13,    20,  4680, 41828, 42237,     8, 30147, 17487,
              2,     1,     1],
         [    0, 41552, 42291,   846,  1864,   250, 15698,   560,    61,   675,
            473,    42,  1013,   266,  9943,     7,   116,     2,     1,     1,
              1,     1,     1],
         [    0, 41552, 42291,   846,  1864,   250, 15698, 12196,    16,     5,
           1280,     9, 39432,   642,  6228,  2394,  2801,    11,     5,   576,
            266, 17487,     2],
         [    0, 41552, 42291,   846,  1864,   250, 15698,  2264,    16,  1982,
             11,     5,  6655,  2325,    23,     5,   299,   235,     9,     5,
           3780,   116,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
         ...
         '97.00',
  '123',
  '1 January 1979 - 31 December 1979',
  '$2,720.14',
  'GPI'))

The raw sample is a lot of information, so let's look at its length
	
		len(sample)
	
We get a length of 2 because the sample holds the model inputs and the answers
	
		sample_inputs = sample[0]
sample_answers = sample[1]
	
Let's look at the inputs
	
		sample_inputs
	

>_ Output

			
				{'input_ids': tensor([[    0, 41552, 42291,   846,  1864,   250, 15698, 12375,    16,     5,
          3383,     9,   331,     9,  2042,   116,     2,     1,     1,     1,
             1,     1,     1],
        [    0, 41552, 42291,   846,  1864,   250, 15698,  2264,    16,     5,
         11968,   196,   205, 22922,   346, 17487,     2,     1,     1,     1,
             1,     1,     1],
        [    0, 41552, 42291,   846,  1864,   250, 15698,  2264,    16,     5,
          1229,    13,   403,   690,   116,     2,     1,     1,     1,     1,
             1,     1,     1],
        [    0, 41552, 42291,   846,  1864,   250, 15698,  2264,    16,     5,
          5480,  1280,   116,     2,     1,     1,     1,     1,     1,     1,
             1,     1,     1],
        [    0, 41552, 42291,   846,  1864,   250, 15698, 12196,    16,     5,
          1842,   346,    13,    20,  4680, 41828, 42237,     8, 30147, 17487,
             2,     1,     1],
        [    0, 41552, 42291,   846,  1864,   250, 15698,   560,    61,   675,
           473,    42,  1013,   266,  9943,     7,   116,     2,     1,     1,
             1,     1,     1],
        [    0, 41552, 42291,   846,  1864,   250, 15698, 12196,    16,     5,
          1280,     9, 39432,   642,  6228,  2394,  2801,    11,     5,   576,
           266, 17487,     2],
        [    0, 41552, 42291,   846,  1864,   250, 15698,  2264,    16,  1982,
            11,     5,  6655,  2325,    23,     5,   299,   235,     9,     5,
          3780,   116,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        ...
        [ 2.6400,  2.6400,  2.6400,  ...,  1.3502,  0.7925,  1.3502],
          [ 2.6400,  2.6400,  2.6400,  ...,  0.9319,  1.4025,  0.8448],
          [ 2.6400,  2.6400,  2.6400,  ...,  1.0365,  1.2282,  0.8099]]]])}

The raw inputs are also a lot of information, so let's look at the keys
	
		sample_inputs.keys()
	

>_ Output

			
				dict_keys(['input_ids', 'attention_mask', 'pixel_values'])

As we can see, we have input_ids and attention_mask, which correspond to the input text, and pixel_values, which correspond to the image. Let's look at the shape of each one.
	
		sample_inputs['input_ids'].shape, sample_inputs['attention_mask'].shape, sample_inputs['pixel_values'].shape
	

>_ Output

			
				(torch.Size([8, 23]), torch.Size([8, 23]), torch.Size([8, 3, 768, 768]))

All of them have 8 elements, because we set a batch size of 8 when creating the DataLoader. In input_ids and attention_mask each element has 23 tokens, and in pixel_values each element has 3 channels, 768 pixels of height, and 768 pixels of width.
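
The token length is not fixed. With padding=True the questions are padded to the longest one in the batch, so another batch can have a different second dimension. The sketch below mimics that padding on toy token ids. The pad id of 1 is read off the input_ids printed earlier, and pad_batch is a hypothetical helper, not the processor's actual implementation.

```python
PAD_ID = 1  # pad token id visible in the input_ids printed earlier


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token id lists to the longest one and build the attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Toy token ids, not real tokenizer output
ids, mask = pad_batch([[0, 11, 12, 2], [0, 21, 2]])
print(ids)   # [[0, 11, 12, 2], [0, 21, 2, 1]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The attention mask is 0 exactly where padding was added, which matches the trailing zeros in the attention_mask rows of the sample.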

Now let's look at the answers
	
		sample_answers
	

>_ Output

			
				('JAMES A. RHODES',
'1-800-992-3284',
'$50,000',
'97.00',
'123',
'1 January 1979 - 31 December 1979',
'$2,720.14',
'GPI')

We get 8 answers for the same reason as before: we set a batch size of 8 when creating the DataLoader
	
		len(sample_answers)
	
We create a function to do the fine-tuning
	
		def train_model(train_loader, val_loader, model, processor, epochs=10, lr=1e-6):
    optimizer = AdamW(model.parameters(), lr=lr)
    num_training_steps = epochs * len(train_loader)
    lr_scheduler = get_scheduler(
        name="linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )
 
    for epoch in range(epochs):
 
        # Training phase
        print(f"\nTraining Epoch {epoch + 1}/{epochs}")
        model.train()
        train_loss = 0
        i = -1
        for batch in tqdm(train_loader, desc=f"Training Epoch {epoch + 1}/{epochs}"):
            i += 1
            inputs, answers = batch
 
            input_ids = inputs["input_ids"]
            pixel_values = inputs["pixel_values"]
            labels = processor.tokenizer(text=answers, return_tensors="pt", padding=True, return_token_type_ids=False).input_ids.to(device)
 
            outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=labels)
            loss = outputs.loss
 
            loss.backward()
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
 
            train_loss += loss.item()
 
        avg_train_loss = train_loss / len(train_loader)
        print(f"Average Training Loss: {avg_train_loss}")
 
        # Validation phase
        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch in tqdm(val_loader, desc=f"Validation Epoch {epoch + 1}/{epochs}"):
                inputs, answers = batch
 
                input_ids = inputs["input_ids"]
                pixel_values = inputs["pixel_values"]
                labels = processor.tokenizer(text=answers, return_tensors="pt", padding=True, return_token_type_ids=False).input_ids.to(device)
 
                outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=labels)
                loss = outputs.loss
 
                val_loss += loss.item()
 
        avg_val_loss = val_loss / len(val_loader)
        print(f"Average Validation Loss: {avg_val_loss}")
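
The "linear" scheduler with num_warmup_steps=0 means the learning rate starts at lr and decays in a straight line to 0 at the last training step. The sketch below reproduces that rule in plain Python, as far as I understand the schedule. linear_lr is a hypothetical helper for illustration, and the 4933 steps per epoch assume the full training split with a batch size of 8.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


total_steps = 3 * 4933  # epochs * len(train_loader), assuming the full training split
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, 1e-6, total_steps))
```

With 3 epochs the model therefore spends the whole last epoch at less than a third of the initial learning rate, which is already a small 1e-6.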
	
We train
	
		train_model(train_loader, val_loader, model, processor, epochs=3, lr=1e-6)
	
>_ Output

Training Epoch 1/3
Training Epoch 1/3: 100%|██████████| 4933/4933 [2:45:28<00:00,  2.01s/it]
Average Training Loss: 1.153514638062836
Validation Epoch 1/3: 100%|██████████| 669/669 [13:52<00:00,  1.24s/it]
Average Validation Loss: 0.7698153616646124
Training Epoch 2/3
Training Epoch 2/3: 100%|██████████| 4933/4933 [2:42:51<00:00,  1.98s/it]
Average Training Loss: 0.6530420315007687
Validation Epoch 2/3: 100%|██████████| 669/669 [13:48<00:00,  1.24s/it]
Average Validation Loss: 0.725301219375946
Training Epoch 3/3
Training Epoch 3/3: 100%|██████████| 4933/4933 [2:42:52<00:00,  1.98s/it]
Average Training Loss: 0.5878197003753292
Validation Epoch 3/3: 100%|██████████| 669/669 [13:45<00:00,  1.23s/it]
				Average Validation Loss: 0.716769086751079
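
The step counts in the progress bars follow directly from the split sizes and the batch size, since the DataLoader keeps the last, smaller batch by default. A quick check of that arithmetic:

```python
import math

batch_size = 8
train_rows, val_rows = 39463, 5349  # split sizes printed when the dataset was loaded

train_batches = math.ceil(train_rows / batch_size)
val_batches = math.ceil(val_rows / batch_size)
print(train_batches, val_batches)  # 4933 669

# Rough epoch time from the progress bar: batches * seconds per batch, in hours
print(train_batches * 2.01 / 3600)
```

The time estimate comes out at roughly two and three quarter hours per training epoch, in line with the 2:45:28 reported for the first epoch.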


Testing the fine-tuned model

We now test the model on some documents from the test set

for idx in range(3):
  print(generate_answer(task_prompt="<DocVQA>", text_input='What do you see in this image?', image=data_test[idx]['image'], device=model.device))
  display(data_test[idx]['image'].resize([350, 350]))

>_ Output

{'<DocVQA>': 'CAGR 19%'}
{'<DocVQA>': 'memorandum'}
{'<DocVQA>': '14000'}

We can see that it now gives us information.

Now let's test again on the same training-set documents we used before, to compare with what came out before training

for idx in range(3):
  print(generate_answer(task_prompt="<DocVQA>", text_input='What do you see in this image?', image=data_train[idx]['image'], device=model.device))
  display(data_train[idx]['image'].resize([350, 350]))

>_ Output

{'<DocVQA>': 'Confidential'}
{'<DocVQA>': 'Confidential'}
{'<DocVQA>': 'Brown & Williamson Tobacco Corporation Research & Development'}

The results are not very good, but we only trained for 3 epochs. Although they could improve with more training, what we can see is that before, when we used the <DocVQA> task tag, we got no answer, and now we do.
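
The comparison above is done by eye. One common way to put a number on DocVQA answers is ANLS (Average Normalized Levenshtein Similarity), which gives partial credit for near-miss strings and takes the best score over all accepted answers. The sketch below is not part of the original notebook. The 0.5 threshold and the lowercasing are assumptions based on how the metric is usually applied, and the predictions are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity over a list of questions.

    `references[i]` is the list of accepted answers for `predictions[i]`.
    """
    scores = []
    for prediction, answers in zip(predictions, references):
        best = 0.0
        for answer in answers:
            p, a = prediction.strip().lower(), answer.strip().lower()
            distance = levenshtein(p, a) / max(len(p), len(a), 1)
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


# Made-up predictions and references, only to exercise the function
print(anls(["1/8/93", "memo"], [["1/8/93"], ["memorandum"]]))  # 0.5
```

To use it here, one would collect the generate_answer outputs for the validation questions and pass them together with the answers lists from the dataset.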
