Image captioning

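Image captioning is the task of generating a text description for a given image. A common real-world application is helping visually impaired people navigate different situations. Image captioning therefore improves content accessibility by describing images to people.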

This guide will show you how to:

Fine-tune an image captioning model.
Use the fine-tuned model for inference.

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate -q
pip install jiwer -q

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

from huggingface_hub import notebook_login

notebook_login()

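Load the Pokémon BLIP captions dataset

Use the 🤗 Datasets library to load a dataset that consists of {image-caption} pairs. To create your own image captioning dataset in PyTorch, you can follow this notebook.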

from datasets import load_dataset

ds = load_dataset("lambdalabs/pokemon-blip-captions")
ds

DatasetDict({
    train: Dataset({
        features: ['image', 'text'],
        num_rows: 833
    })
})

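The dataset has two features, `image` and `text`.

Many image captioning datasets contain multiple captions per image. In those cases, a common strategy is to randomly sample one caption from the available ones during training.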
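
The random-sampling strategy can be sketched as follows. This is a minimal illustration using only the standard library; the `captions` field, the example captions, and the `sample_caption` helper are hypothetical and not part of the dataset used in this guide, which has a single caption per image.

```python
import random


def sample_caption(example, rng=random):
    """Return a copy of the example with one randomly chosen caption as `text`."""
    # `captions` is a hypothetical field holding all captions for one image.
    chosen = rng.choice(example["captions"])
    return {"image": example["image"], "text": chosen}


example = {
    "image": "placeholder-image",  # stands in for a real PIL image
    "captions": ["a red dragon", "a cartoon dragon with wings", "a dragon breathing fire"],
}
rng = random.Random(0)  # seeded so the sketch is repeatable
sampled = sample_caption(example, rng)
print(sampled["text"])
```

Applied on the fly (for example inside a transform), a helper like this lets the model see a different caption for the same image across epochs.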

Split the dataset's train split into a train and test set with the train_test_split method:

ds = ds["train"].train_test_split(test_size=0.1)
train_ds = ds["train"]
test_ds = ds["test"]

Let's visualize a couple of samples from the training set.

from textwrap import wrap
import matplotlib.pyplot as plt
import numpy as np


def plot_images(images, captions):
    plt.figure(figsize=(20, 20))
    for i in range(len(images)):
        ax = plt.subplot(1, len(images), i + 1)
        caption = captions[i]
        caption = "\n".join(wrap(caption, 12))
        plt.title(caption)
        plt.imshow(images[i])
        plt.axis("off")


sample_images_to_visualize = [np.array(train_ds[i]["image"]) for i in range(5)]
sample_captions = [train_ds[i]["text"] for i in range(5)]
plot_images(sample_images_to_visualize, sample_captions)
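
The `wrap(caption, 12)` call inside `plot_images` keeps long captions from running into neighboring subplots. A small standalone sketch of that step, using the sample caption that appears in the inference section of this guide:

```python
from textwrap import wrap

caption = "a drawing of a pink and blue pokemon"
# Break the caption into lines of at most 12 characters, splitting on whitespace.
lines = wrap(caption, 12)
title = "\n".join(lines)
print(title)
```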

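Preprocess the dataset

Since the dataset has two modalities (image and text), the preprocessing pipeline will preprocess both the images and the captions.

To do so, load the processor class associated with the model you are about to fine-tune.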

from transformers import AutoProcessor

checkpoint = "microsoft/git-base"
processor = AutoProcessor.from_pretrained(checkpoint)

The processor will internally preprocess the image (which includes resizing and pixel scaling) and tokenize the caption.

def transforms(example_batch):
    images = [x for x in example_batch["image"]]
    captions = [x for x in example_batch["text"]]
    inputs = processor(images=images, text=captions, padding="max_length")
    inputs.update({"labels": inputs["input_ids"]})
    return inputs


train_ds.set_transform(transforms)
test_ds.set_transform(transforms)
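
The `transforms` function copies `input_ids` into `labels` without shifting them. That works because causal language models in Transformers shift the labels internally when computing the loss, so each position is trained to predict the next token. A rough sketch of that pairing follows; the token IDs and the `next_token_pairs` helper are illustrative assumptions, and real models operate on tensors rather than lists.

```python
def next_token_pairs(input_ids):
    """Pair each token with the token a causal LM is trained to predict after it."""
    labels = list(input_ids)  # labels start as a plain copy of the inputs
    # Position i is scored against the label at position i + 1.
    return list(zip(input_ids[:-1], labels[1:]))


token_ids = [101, 1037, 5059, 102]  # made-up IDs, not produced by a real tokenizer
pairs = next_token_pairs(token_ids)
print(pairs)
```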

With the dataset ready, you can now set up the model for fine-tuning.

Load a base model

Load "microsoft/git-base" into an `AutoModelForCausalLM` object.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(checkpoint)

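Evaluate

Image captioning models are typically evaluated with the Rouge Score or Word Error Rate (WER). For this guide, you will use the Word Error Rate (WER).

We use the 🤗 Evaluate library to do so. For potential limitations and other gotchas of the WER, refer to this guide.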

from evaluate import load
import torch

wer = load("wer")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predicted = logits.argmax(-1)
    decoded_labels = processor.batch_decode(labels, skip_special_tokens=True)
    decoded_predictions = processor.batch_decode(predicted, skip_special_tokens=True)
    wer_score = wer.compute(predictions=decoded_predictions, references=decoded_labels)
    return {"wer_score": wer_score}
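
For intuition about what the metric measures: WER is the word-level edit distance (substitutions, deletions, and insertions) between a prediction and its reference, divided by the number of reference words, so lower is better. The following is a minimal pure-Python sketch of that definition for a single sentence pair. `word_error_rate` is a hypothetical helper, not the implementation used by 🤗 Evaluate, and the example captions are made up.

```python
def word_error_rate(reference, prediction):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), prediction.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


score = word_error_rate("a drawing of a pink pokemon", "a drawing of a blue pokemon")
print(score)
```

Here one of the six reference words is substituted, so the score is one sixth. A score of 0 means the prediction matches the reference word for word.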

Train!

Now, you are ready to start fine-tuning the model. You will use the 🤗 Trainer for this.

First, define the training arguments using TrainingArguments.

from transformers import TrainingArguments, Trainer

model_name = checkpoint.split("/")[1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-pokemon",
    learning_rate=5e-5,
    num_train_epochs=50,
    fp16=True,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,
    save_total_limit=3,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    logging_steps=50,
    remove_unused_columns=False,
    push_to_hub=True,
    label_names=["labels"],
    load_best_model_at_end=True,
)
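
To get a feel for what these settings imply, the sketch below works out the effective batch size and an approximate number of optimizer steps. It assumes a single device and roughly 749 training examples (the 833-row dataset minus a 10% test split). Both are assumptions, `training_schedule` is a hypothetical helper, and the exact step count reported by Trainer can differ slightly depending on the library version and hardware setup.

```python
import math


def training_schedule(num_examples, batch_size, grad_accum_steps, num_epochs, num_devices=1):
    """Estimate effective batch size and optimizer steps for a training run."""
    effective_batch = batch_size * grad_accum_steps * num_devices
    # Number of batches each device's dataloader yields per epoch.
    batches_per_epoch = math.ceil(num_examples / (batch_size * num_devices))
    # One optimizer step happens every `grad_accum_steps` batches.
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return effective_batch, steps_per_epoch, steps_per_epoch * num_epochs


effective_batch, steps_per_epoch, total_steps = training_schedule(
    num_examples=749, batch_size=32, grad_accum_steps=2, num_epochs=50
)
print(effective_batch, steps_per_epoch, total_steps)
```

Under these assumptions, the run takes about 600 optimizer steps, so evaluating and saving every 50 steps produces around a dozen evaluations and checkpoints, of which save_total_limit=3 keeps only a few on disk.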

Then pass them along with the datasets and the model to 🤗 Trainer.

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)

To start training, simply call train() on the Trainer object.

trainer.train()

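You should see the training loss drop smoothly as training progresses.

Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use it: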

trainer.push_to_hub()

Inference

Load a sample image from a URL to test the model.

from PIL import Image
import requests

url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/pokemon.png"
image = Image.open(requests.get(url, stream=True).raw)
image

Prepare the image for the model.

from accelerate.test_utils.testing import get_backend
# automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
device, _, _ = get_backend()
inputs = processor(images=image, return_tensors="pt").to(device)
pixel_values = inputs.pixel_values

Call `generate` and decode the predictions.

generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_caption)

a drawing of a pink and blue pokemon

Looks like the fine-tuned model generated a pretty good caption!
