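Visual Question Answering

Visual Question Answering (VQA) is the task of answering open-ended questions based on an image. Models that support this task typically take an image and a question as input and output an answer expressed in natural language.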

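Some noteworthy use cases for VQA include:

Accessibility applications for visually impaired people.
Education: asking questions about visual material presented in lectures or textbooks. VQA can also be used in interactive museum exhibits or at historical sites.
Customer service and e-commerce: VQA can improve the user experience by letting users ask questions about products.
Image retrieval: VQA models can be used to retrieve images with specific characteristics. For example, a user can ask "Is there a dog?" to find all images containing dogs in a set of images.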

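In this guide you'll learn how to:

Fine-tune a classification VQA model, specifically ViLT, on the Graphcore/vqa dataset.
Use your fine-tuned ViLT for inference.
Run zero-shot VQA inference with a generative model, such as BLIP-2.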

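Fine-tuning ViLT

The ViLT model incorporates text embeddings into a Vision Transformer (ViT), giving it a minimal design for Vision-and-Language Pre-training (VLP). The model can be used for several downstream tasks. For the VQA task, a classifier head (a linear layer on top of the final hidden state of the `[CLS]` token) is placed on top and randomly initialized. Visual Question Answering is thus treated as a **classification problem**.

More recent models, such as BLIP, BLIP-2, and InstructBLIP, treat VQA as a generative task. Later in this guide, we illustrate how to use them for zero-shot VQA inference.

Before you begin, make sure you have all the necessary libraries installed.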

pip install -q transformers datasets

We encourage you to share your model with the community. Log in to your Hugging Face account to upload it to the 🤗 Hub. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Let's define the model checkpoint as a global variable.

>>> model_checkpoint = "dandelin/vilt-b32-mlm"

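Load the data

For illustration purposes, this guide uses a very small sample of the annotated visual question answering `Graphcore/vqa` dataset. You can find the full dataset on the 🤗 Hub.

As an alternative to the `Graphcore/vqa` dataset, you can download the same data manually from the official VQA dataset page. If you prefer to follow the tutorial with your own custom data, check out the guide on creating an image dataset in the 🤗 Datasets documentation.

Let's load the first 200 examples from the validation split and explore the dataset's features: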

>>> from datasets import load_dataset

>>> dataset = load_dataset("Graphcore/vqa", split="validation[:200]")
>>> dataset
Dataset({
    features: ['question', 'question_type', 'question_id', 'image_id', 'answer_type', 'label'],
    num_rows: 200
})

Let's take a look at an example to understand the dataset's features:

>>> dataset[0]
{'question': 'Where is he looking?',
 'question_type': 'none of the above',
 'question_id': 262148000,
 'image_id': '/root/.cache/huggingface/datasets/downloads/extracted/ca733e0e000fb2d7a09fbcc94dbfe7b5a30750681d0e965f8e0a23b1c2f98c75/val2014/COCO_val2014_000000262148.jpg',
 'answer_type': 'other',
 'label': {'ids': ['at table', 'down', 'skateboard', 'table'],
  'weights': [0.30000001192092896,
   1.0,
   0.30000001192092896,
   0.30000001192092896]}}

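The features relevant to the task include:

question: the question to be answered from the image
image_id: the path to the image the question refers to
label: the annotations

We can remove the rest of the features, as they won't be needed: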

>>> dataset = dataset.remove_columns(['question_type', 'question_id', 'answer_type'])

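As you can see, the `label` feature contains several answers to the same question (called `ids` here) collected by different human annotators. This is because the answer to a question can be subjective. In this case, the question is "Where is he looking?". Some people annotated this with "down", others with "at table", another one with "skateboard", and so on.

Take a look at the image and consider which answer you would give: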

>>> from PIL import Image

>>> image = Image.open(dataset[0]['image_id'])
>>> image

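Due to the ambiguity of the questions and answers, datasets like this are treated as a multi-label classification problem (as multiple answers are possibly valid). Moreover, rather than just creating a one-hot encoded vector, one creates a soft encoding based on the number of times a certain answer appeared in the annotations.

For instance, in the example above, because the answer "down" was selected far more often than the other answers, it has a score (called `weights` in the dataset) of 1.0, and the rest of the answers have scores below 1.0.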
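To make the difference between the two encodings concrete, here is a minimal sketch in plain Python. The six-answer vocabulary is a hypothetical stand-in (the real one is built from the dataset later in this guide), and the weights are copied from the example above and rounded.

```python
# Hypothetical answer vocabulary; "up" and "left" are added only so that
# some entries of the target vectors stay at zero.
vocab = ["at table", "down", "skateboard", "table", "up", "left"]
answer_to_index = {answer: i for i, answer in enumerate(vocab)}

# Annotated answers and (rounded) weights for "Where is he looking?".
answers = ["at table", "down", "skateboard", "table"]
weights = [0.3, 1.0, 0.3, 0.3]

# Hard multi-label target: every annotated answer counts equally.
multi_hot = [0.0] * len(vocab)
for answer in answers:
    multi_hot[answer_to_index[answer]] = 1.0

# Soft target: each annotated answer keeps its annotator-agreement weight.
soft_target = [0.0] * len(vocab)
for answer, weight in zip(answers, weights):
    soft_target[answer_to_index[answer]] = weight

print(multi_hot)
print(soft_target)
```

The soft target preserves the information that "down" was the consensus answer, which the hard encoding discards.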

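To later instantiate the model with an appropriate classification head, let's create two dictionaries: one that maps the label name to an integer and another that maps the integer back to the label name: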

>>> import itertools

>>> labels = [item['ids'] for item in dataset['label']]
>>> flattened_labels = list(itertools.chain(*labels))
>>> unique_labels = list(set(flattened_labels))

>>> label2id = {label: idx for idx, label in enumerate(unique_labels)}
>>> id2label = {idx: label for label, idx in label2id.items()}
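As a small illustration of the two mappings, the sketch below builds them from a hypothetical list of answers (not taken from the dataset) and checks that they are inverses of each other. It sorts the unique labels, which is a deliberate variation on the snippet above: the iteration order of a `set` of strings is not guaranteed to be the same across Python runs, so sorting keeps the IDs reproducible.

```python
import itertools

# Hypothetical answer lists for three questions.
toy_labels = [["down", "table"], ["yes"], ["table", "skateboard"]]

flattened = list(itertools.chain(*toy_labels))
# sorted() instead of list(): gives a stable label order across runs.
unique = sorted(set(flattened))

toy_label2id = {label: idx for idx, label in enumerate(unique)}
toy_id2label = {idx: label for label, idx in toy_label2id.items()}

# Round trip: name -> id -> name returns the original name.
round_trip_ok = all(toy_id2label[toy_label2id[name]] == name for name in unique)
print(toy_label2id, round_trip_ok)
```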

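Now that we have the mappings, we can replace the string answers with their IDs, and flatten the dataset for more convenient further preprocessing.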

>>> def replace_ids(inputs):
...   inputs["label"]["ids"] = [label2id[x] for x in inputs["label"]["ids"]]
...   return inputs


>>> dataset = dataset.map(replace_ids)
>>> flat_dataset = dataset.flatten()
>>> flat_dataset.features
{'question': Value(dtype='string', id=None),
 'image_id': Value(dtype='string', id=None),
 'label.ids': Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None),
 'label.weights': Sequence(feature=Value(dtype='float64', id=None), length=-1, id=None)}
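The effect of flattening on a single record can be sketched in plain Python. This is only an illustration of the resulting dot-separated column names, not the 🤗 Datasets implementation; `flatten_record` is a hypothetical helper.

```python
def flatten_record(record, parent_key=""):
    """Flatten nested dicts into one level with dot-separated keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the prefix along.
            flat.update(flatten_record(value, full_key))
        else:
            flat[full_key] = value
    return flat


record = {
    "question": "Where is he looking?",
    "label": {"ids": [0, 1, 2, 3], "weights": [0.3, 1.0, 0.3, 0.3]},
}
flat_record = flatten_record(record)
print(sorted(flat_record))
```

After flattening, `label.ids` and `label.weights` are ordinary top-level columns, which is why the preprocessing function below can index them directly.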

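Preprocessing data

The next step is to load a ViLT processor to prepare the image and text data for the model. ViltProcessor wraps a BERT tokenizer and a ViLT image processor into a single convenient processor: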

>>> from transformers import ViltProcessor

>>> processor = ViltProcessor.from_pretrained(model_checkpoint)

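To preprocess the data, we need to encode the images and questions using the ViltProcessor. The processor uses BertTokenizerFast to tokenize the text and create `input_ids`, `attention_mask`, and `token_type_ids` for the text data. For the image, the processor uses ViltImageProcessor to resize and normalize the image and create `pixel_values` and `pixel_mask`.

All these preprocessing steps are done under the hood; we only need to call the `processor`. However, we still need to prepare the target labels. The target for each example is a vector in which each element corresponds to a possible answer (label). For correct answers, the element holds their respective score (weight), while the remaining elements are set to zero.

The following function applies the `processor` to the images and questions and formats the labels as described above: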

>>> import torch

>>> def preprocess_data(examples):
...     image_paths = examples['image_id']
...     images = [Image.open(image_path) for image_path in image_paths]
...     texts = examples['question']

...     encoding = processor(images, texts, padding="max_length", truncation=True, return_tensors="pt")

...     for k, v in encoding.items():
...           encoding[k] = v.squeeze()

...     targets = []

...     for labels, scores in zip(examples['label.ids'], examples['label.weights']):
...         target = torch.zeros(len(id2label))

...         for label, score in zip(labels, scores):
...             target[label] = score

...         targets.append(target)

...     encoding["labels"] = targets

...     return encoding

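To apply the preprocessing function over the entire dataset, use the 🤗 Datasets `map` function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once. At this point, feel free to remove the columns you don't need.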

>>> # 'question_type', 'question_id' and 'answer_type' were already removed above, so only the remaining raw columns are dropped here
>>> processed_dataset = flat_dataset.map(preprocess_data, batched=True, remove_columns=['question', 'image_id', 'label.ids', 'label.weights'])
>>> processed_dataset
Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'pixel_values', 'pixel_mask', 'labels'],
    num_rows: 200
})

As a final step, create a batch of examples using DefaultDataCollator:

>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator()

Train the model

You're ready to start training your model now! Load ViLT with ViltForQuestionAnswering. Specify the number of labels along with the label mappings:

>>> from transformers import ViltForQuestionAnswering

>>> model = ViltForQuestionAnswering.from_pretrained(model_checkpoint, num_labels=len(id2label), id2label=id2label, label2id=label2id)

At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments:

>>> from transformers import TrainingArguments

>>> repo_id = "MariaK/vilt_finetuned_200"

>>> training_args = TrainingArguments(
...     output_dir=repo_id,
...     per_device_train_batch_size=4,
...     num_train_epochs=20,
...     save_steps=200,
...     logging_steps=50,
...     learning_rate=5e-5,
...     save_total_limit=2,
...     remove_unused_columns=False,
...     push_to_hub=True,
... )
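It can help to work out what these settings imply for the 200-example sample. The arithmetic below is a rough sketch that assumes a single device and no gradient accumulation (neither is stated in this guide); with several devices or accumulation, the step counts shrink accordingly.

```python
import math

num_examples = 200
per_device_train_batch_size = 4
num_train_epochs = 20
save_steps = 200
logging_steps = 50

# Assumptions: one device, gradient_accumulation_steps=1.
steps_per_epoch = math.ceil(num_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# How often checkpoints are written and the loss is logged over the run.
checkpoints_written = total_steps // save_steps
log_events = total_steps // logging_steps

print(steps_per_epoch, total_steps, checkpoints_written, log_events)
```

Under these assumptions the run takes 1000 optimizer steps and writes five checkpoints, of which `save_total_limit=2` keeps only the two most recent on disk.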

Pass the training arguments to Trainer along with the model, dataset, processor, and data collator.

>>> from transformers import Trainer

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     data_collator=data_collator,
...     train_dataset=processed_dataset,
...     processing_class=processor,
... )

Call train() to fine-tune your model.

>>> trainer.train()

Once training is completed, share your final model on the 🤗 Hub with the push_to_hub() method:

>>> trainer.push_to_hub()

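Inference

Now that you have fine-tuned a ViLT model and uploaded it to the 🤗 Hub, you can use it for inference. The simplest way to try out your fine-tuned model for inference is to use it in a Pipeline.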

>>> from transformers import pipeline

>>> pipe = pipeline("visual-question-answering", model="MariaK/vilt_finetuned_200")

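The model in this guide has only been trained on 200 examples, so don't expect a lot from it. Let's see if it at least learned something from the data, and take the first example from the dataset to illustrate inference: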

>>> example = dataset[0]
>>> image = Image.open(example['image_id'])
>>> question = example['question']
>>> print(question)
Where is he looking?
>>> pipe(image, question, top_k=1)
[{'score': 0.5498199462890625, 'answer': 'down'}]

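Even though it is not very confident, the model has indeed learned something. With more examples and longer training, you'll get far better results!

You can also manually replicate the results of the pipeline if you'd like:

Take an image and a question, and prepare them for the model using the processor from your model.
Forward the result of preprocessing through the model.
From the logits, get the most likely answer's ID, and find the actual answer in `id2label`.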

>>> processor = ViltProcessor.from_pretrained("MariaK/vilt_finetuned_200")

>>> image = Image.open(example['image_id'])
>>> question = example['question']

>>> # prepare inputs
>>> inputs = processor(image, question, return_tensors="pt")

>>> model = ViltForQuestionAnswering.from_pretrained("MariaK/vilt_finetuned_200")

>>> # forward pass
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> logits = outputs.logits
>>> idx = logits.argmax(-1).item()
>>> print("Predicted answer:", model.config.id2label[idx])
Predicted answer: down
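The manual version above returns only the top answer. To get pipeline-style scores for several candidates, one option is to pass each logit through a sigmoid and rank the results. The sketch below does this in plain Python on made-up logits; it assumes independent per-answer sigmoid scores, which fits the multi-label framing used in this guide, but the pipeline's exact post-processing may differ. `top_k_answers`, `toy_logits`, and `toy_id2label` are hypothetical names.

```python
import math


def top_k_answers(logits, id2label, k=1):
    """Rank answers by sigmoid(logit) and return the k best."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [{"score": scores[i], "answer": id2label[i]} for i in ranked[:k]]


# Made-up logits for a three-answer vocabulary.
toy_logits = [-1.5, 0.2, -0.7]
toy_id2label = {0: "table", 1: "down", 2: "skateboard"}

best_two = top_k_answers(toy_logits, toy_id2label, k=2)
print(best_two)
```

Because each score comes from its own sigmoid, the scores do not need to sum to one, unlike a softmax over the answer vocabulary.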

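Zero-shot VQA

The previous model treated VQA as a classification task. Some recent models, such as BLIP, BLIP-2, and InstructBLIP, approach VQA as a generative task. Let's take BLIP-2 as an example. It introduced a new vision-language pre-training paradigm in which any combination of a pre-trained vision encoder and LLM can be used (learn more in the BLIP-2 blog post). This enables state-of-the-art results on multiple vision-language tasks, including visual question answering.

Let's illustrate how you can use this model for VQA. First, let's load the model. Here we'll explicitly send the model to a GPU, if available, which we didn't need to do earlier when training, as Trainer handles this automatically: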

>>> from transformers import AutoProcessor, Blip2ForConditionalGeneration
>>> import torch
>>> from accelerate.test_utils.testing import get_backend

>>> processor = AutoProcessor.from_pretrained("Salesforce/blip2-opt-2.7b")
>>> model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16)
>>> device, _, _ = get_backend() # automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
>>> model.to(device)

The model takes an image and text as input, so let's use the exact same image/question pair from the first example in the VQA dataset:

>>> example = dataset[0]
>>> image = Image.open(example['image_id'])
>>> question = example['question']

To use BLIP-2 for the visual question answering task, the textual prompt has to follow a specific format: `Question: {} Answer:`.

>>> prompt = f"Question: {question} Answer:"
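When several questions need to be asked, it may be convenient to keep the format in one place. The helper below is a small sketch; `PROMPT_TEMPLATE` and `build_vqa_prompt` are hypothetical names, and stripping surrounding whitespace from the question is a choice made here, not a BLIP-2 requirement.

```python
PROMPT_TEMPLATE = "Question: {} Answer:"


def build_vqa_prompt(question: str) -> str:
    """Wrap a question in the prompt format described above."""
    return PROMPT_TEMPLATE.format(question.strip())


questions = ["Where is he looking?", "  Is there a dog? "]
prompts = [build_vqa_prompt(q) for q in questions]
print(prompts)
```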

Now we need to preprocess the image/prompt with the model's processor, pass the processed input through the model, and decode the output:

>>> inputs = processor(image, text=prompt, return_tensors="pt").to(device, torch.float16)

>>> generated_ids = model.generate(**inputs, max_new_tokens=10)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
>>> print(generated_text)
He is looking at the crowd

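As you can see, the model recognized the crowd and the direction of the face (looking down); however, it seems to miss the fact that the crowd is behind the skater. Still, in cases where acquiring human-annotated datasets is not feasible, this approach can quickly produce useful results.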
