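Fine-Tuning a Vision Transformer Model With a Custom Biomedical Dataset

This guide walks through fine-tuning a Vision Transformer (ViT) model on a custom biomedical dataset. It covers loading and preparing the dataset, setting up image transforms for the different data splits, configuring and initializing the ViT model, and defining the training process with evaluation and visualization tools.

Dataset Info

The custom dataset was assembled by hand and contains 780 images in 3 classes (benign, malignant, normal).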

[Figure: dataset information (datasetinfo.png)]

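Model Info

The model we will fine-tune is Google's "vit-large-patch16-224". It was pre-trained on ImageNet-21k (14M images, 21,843 classes) and then fine-tuned on ImageNet 2012 (1M images, 1,000 classes) at a resolution of 224x224. Google also publishes several other ViT models with different image sizes and patch sizes.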
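The checkpoint name encodes the patch size (16) and the input resolution (224), which together determine how many tokens the transformer processes per image. The arithmetic below is a small sketch; it assumes the standard ViT design in which one extra classification token is prepended to the patch sequence.

```python
# Patch arithmetic for a ViT with 16x16 patches on 224x224 inputs.
image_size = 224  # input resolution taken from the checkpoint name
patch_size = 16   # patch edge length taken from the checkpoint name

patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2

# Assumption: one classification ([CLS]) token is prepended, as in the standard ViT design.
sequence_length = num_patches + 1

print(patches_per_side, num_patches, sequence_length)  # 14 196 197
```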

Let's get started.

Getting Started

First, let's install the libraries.

!pip install datasets transformers accelerate torch torchvision scikit-learn matplotlib wandb

(Optional) We will push the model to the Hugging Face Hub, so we need to log in.

# from huggingface_hub import notebook_login
# notebook_login()

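Dataset Preparation

The Datasets library automatically extracts the images and classes from the dataset. For details, see the Datasets library documentation.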

from datasets import load_dataset

dataset = load_dataset("emre570/breastcancer-ultrasound-images")
dataset

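We now have the dataset, but it has no validation set. To create one, we compute the validation fraction as the ratio of the test set size to the training set size, so the validation set ends up about as large as the test set. We then split the training set into new training and validation subsets.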

# Get the numbers of each set
test_num = len(dataset["test"])
train_num = len(dataset["train"])

val_size = test_num / train_num

train_val_split = dataset["train"].train_test_split(test_size=val_size)
train_val_split
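To make the ratio concrete, here is a small sketch of the arithmetic. The counts are hypothetical values chosen for illustration, `split_sizes` is a helper introduced only for this example, and the rounding rule (a fractional test size is rounded up to whole samples) is an assumption about how the split is computed.

```python
import math


def split_sizes(train_num: int, test_num: int) -> tuple[int, int]:
    """Return (new_train_size, validation_size) when test_size = test_num / train_num."""
    val_fraction = test_num / train_num
    # Assumption: a float test_size is rounded up to a whole number of samples.
    val_size = math.ceil(val_fraction * train_num)
    return train_num - val_size, val_size


# Hypothetical counts, for illustration only.
print(split_sizes(624, 156))  # (468, 156)
```

`train_test_split` also accepts an integer `test_size` (an absolute number of samples) and a `seed` argument, so passing `test_size=test_num` with a fixed seed is one way to avoid floating-point rounding and make the split reproducible.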

We now have the separated training and validation subsets. Let's combine them with the test set.

from datasets import DatasetDict

dataset = DatasetDict(
    {"train": train_val_split["train"], "validation": train_val_split["test"], "test": dataset["test"]}
)
dataset

Great! Our dataset is ready. Let's assign the subsets to separate variables so they are easy to reference later.

train_ds = dataset["train"]
val_ds = dataset["validation"]
test_ds = dataset["test"]

We can see that each image is a PIL.Image with an associated label.

train_ds[0]

We can also inspect the features of the training set.

train_ds.features

Let's show one image from each class in the dataset.

>>> import matplotlib.pyplot as plt

>>> # Initialize a set to keep track of shown labels
>>> shown_labels = set()

>>> # Initialize the figure for plotting
>>> plt.figure(figsize=(10, 10))

>>> # Loop through the dataset and plot the first image of each label
>>> for i, sample in enumerate(train_ds):
...     label = train_ds.features["label"].names[sample["label"]]
...     if label not in shown_labels:
...         plt.subplot(1, len(train_ds.features["label"].names), len(shown_labels) + 1)
...         plt.imshow(sample["image"])
...         plt.title(label)
...         plt.axis("off")
...         shown_labels.add(label)
...         if len(shown_labels) == len(train_ds.features["label"].names):
...             break

>>> plt.show()

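Data Processing

The dataset is ready, but we are not yet ready to fine-tune. We will follow these steps in order:

Label mapping: We convert between label IDs and their corresponding names, which is useful for model training and evaluation.
Image processing: Next, we use ViTImageProcessor to standardize the input image size and apply the normalization specific to the pretrained model. We also define different torchvision transforms for training, validation, and testing to improve model generalization.
Transform functions: We implement functions that apply the transforms to the dataset, converting images into the format and dimensions the ViT model expects.
Data loading: We set up a custom collate function to batch images and labels correctly, and create a DataLoader for efficient loading and batching during training.
Batch preparation: We retrieve and display the shapes of the data in a sample batch to verify that processing is correct and the inputs are ready for the model.

Label Mapping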

id2label = {id: label for id, label in enumerate(train_ds.features["label"].names)}
label2id = {label: id for id, label in id2label.items()}
id2label, id2label[train_ds[0]["label"]]
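The two dictionaries are inverses of each other. The sketch below reproduces the idea without the dataset; the class order is an assumption based on the order in which the classes are printed later in this guide.

```python
# Assumed class order (benign, malignant, normal); the real order comes from
# train_ds.features["label"].names.
class_names = ["benign", "malignant", "normal"]

id2label = {idx: name for idx, name in enumerate(class_names)}
label2id = {name: idx for idx, name in id2label.items()}

# Round trip: mapping an ID to its name and back returns the same ID.
assert all(label2id[id2label[idx]] == idx for idx in id2label)

print(id2label)               # {0: 'benign', 1: 'malignant', 2: 'normal'}
print(label2id["malignant"])  # 1
```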

Image Processing

from transformers import ViTImageProcessor

model_name = "google/vit-large-patch16-224"
processor = ViTImageProcessor.from_pretrained(model_name)

from torchvision.transforms import (
    CenterCrop,
    Compose,
    Normalize,
    RandomHorizontalFlip,
    RandomResizedCrop,
    ToTensor,
    Resize,
)

image_mean, image_std = processor.image_mean, processor.image_std
size = processor.size["height"]

normalize = Normalize(mean=image_mean, std=image_std)

train_transforms = Compose(
    [
        RandomResizedCrop(size),
        RandomHorizontalFlip(),
        ToTensor(),
        normalize,
    ]
)
val_transforms = Compose(
    [
        Resize(size),
        CenterCrop(size),
        ToTensor(),
        normalize,
    ]
)
test_transforms = Compose(
    [
        Resize(size),
        CenterCrop(size),
        ToTensor(),
        normalize,
    ]
)
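The last two steps of each pipeline are plain per-channel arithmetic: `ToTensor` scales 8-bit pixel values to the range [0, 1], and `Normalize` subtracts the mean and divides by the standard deviation. The sketch below illustrates this on single values. The mean and standard deviation of 0.5 are an assumption about what `processor.image_mean` and `processor.image_std` contain for this checkpoint, and both helper functions are hypothetical names used only here.

```python
def to_unit_range(pixel: int) -> float:
    """Mimic the scaling step: map an 8-bit value in [0, 255] to [0.0, 1.0]."""
    return pixel / 255.0


def normalize(value: float, mean: float, std: float) -> float:
    """Mimic per-channel normalization: (value - mean) / std."""
    return (value - mean) / std


# Assumed statistics; check processor.image_mean and processor.image_std for the real values.
mean, std = 0.5, 0.5

for pixel in (0, 128, 255):
    print(pixel, round(normalize(to_unit_range(pixel), mean, std), 3))
```

Under these assumed statistics, black pixels map to -1.0 and white pixels to 1.0, so the model sees inputs roughly centered on zero.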

Creating the Transform Functions

def apply_train_transforms(examples):
    examples["pixel_values"] = [train_transforms(image.convert("RGB")) for image in examples["image"]]
    return examples


def apply_val_transforms(examples):
    examples["pixel_values"] = [val_transforms(image.convert("RGB")) for image in examples["image"]]
    return examples


def apply_test_transforms(examples):
    # Use the test pipeline defined above (it is identical to the validation pipeline)
    examples["pixel_values"] = [test_transforms(image.convert("RGB")) for image in examples["image"]]
    return examples

Applying the Transform Functions to Each Set

train_ds.set_transform(apply_train_transforms)
val_ds.set_transform(apply_val_transforms)
test_ds.set_transform(apply_test_transforms)

train_ds.features

train_ds[0]

It looks like the pixel values have been converted to tensors.

Data Loading

import torch
from torch.utils.data import DataLoader


def collate_fn(examples):
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}


train_dl = DataLoader(train_ds, collate_fn=collate_fn, batch_size=4)

Batch Preparation

>>> batch = next(iter(train_dl))
>>> for k, v in batch.items():
...     if isinstance(v, torch.Tensor):
...         print(k, v.shape)

pixel_values torch.Size([4, 3, 224, 224])
labels torch.Size([4])

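Great! Now we are ready for the fine-tuning process.

Fine-Tuning the Model

We will now configure and fine-tune the model. We first initialize the model with our label mappings and the pretrained weights, allowing for size mismatches. The training arguments define the learning process, including the save strategy, batch sizes, and number of epochs, and results are logged with Weights & Biases. We then instantiate the Hugging Face Trainer to manage training and evaluation, using the custom data collator and the model's image processor. Finally, after training, we evaluate the model on the test set and print the metrics to assess its accuracy.

First, we load our model.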

from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    model_name, id2label=id2label, label2id=label2id, ignore_mismatched_sizes=True
)

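There is a subtle detail here: the `ignore_mismatched_sizes` parameter.

When you fine-tune a pretrained model on a new dataset, the input image size or details of the model architecture (such as the number of labels in the classification layer) may not exactly match what the model was originally trained with. This can happen for various reasons, for example when a model trained on one kind of image data (such as natural images from ImageNet) is applied to a very different kind (such as medical images or images from specialized cameras).

Setting `ignore_mismatched_sizes` to `True` lets the model load without raising an error: layers whose shapes do not match the checkpoint are newly initialized at the size the new configuration requires instead of being loaded.

For example, this model was trained on 1,000 classes, so its classification layer has an output of size `torch.Size([1000])`. Our dataset has 3 classes, so we need an output of size `torch.Size([3])`. If we loaded the checkpoint directly, it would raise an error because the number of classes does not match.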
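The idea can be sketched as a shape comparison between checkpoint parameters and model parameters. This is a simplified illustration, not the Transformers implementation: `plan_weight_loading` is a hypothetical helper, the parameter names are illustrative, and the hidden size of 1024 is the ViT-Large value assumed for this checkpoint.

```python
def plan_weight_loading(checkpoint_shapes, model_shapes, ignore_mismatched_sizes=False):
    """Hypothetical helper: decide which parameters load and which are re-initialized."""
    loaded, reinitialized = [], []
    for name, target_shape in model_shapes.items():
        source_shape = checkpoint_shapes.get(name)
        if source_shape == target_shape:
            loaded.append(name)
        elif ignore_mismatched_sizes:
            # Shape differs: keep the new layer's fresh weights instead of failing.
            reinitialized.append(name)
        else:
            raise ValueError(f"size mismatch for {name}: {source_shape} vs {target_shape}")
    return loaded, reinitialized


hidden = 1024  # assumed hidden size of ViT-Large
checkpoint_shapes = {
    "encoder.weight": (hidden, hidden),       # illustrative backbone parameter
    "classifier.weight": (1000, hidden),      # 1,000 ImageNet classes
    "classifier.bias": (1000,),
}
model_shapes = {
    "encoder.weight": (hidden, hidden),
    "classifier.weight": (3, hidden),         # 3 classes in our dataset
    "classifier.bias": (3,),
}

print(plan_weight_loading(checkpoint_shapes, model_shapes, ignore_mismatched_sizes=True))

try:
    plan_weight_loading(checkpoint_shapes, model_shapes)
except ValueError as err:
    print("Without the flag:", err)
```

In this sketch the backbone weights load unchanged and only the classification layer starts from fresh weights, which is why the model still needs fine-tuning before its predictions are meaningful.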

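Next, define the training arguments for this model from Google.

(Optional) Note that metrics will be logged to Weights & Biases because we set the `report_to` argument to `wandb`. W&B will ask for an API key, so you should create an account and an API key. If you don't want this, you can remove the `report_to` argument or set it to `"none"`.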

from transformers import TrainingArguments, Trainer
import numpy as np

train_args = TrainingArguments(
    output_dir="output-models",
    save_total_limit=2,
    report_to="wandb",
    save_strategy="epoch",
    eval_strategy="epoch",  # called evaluation_strategy in transformers releases before 4.41
    learning_rate=2e-5,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=4,
    num_train_epochs=40,
    weight_decay=0.01,
    load_best_model_at_end=True,
    logging_dir="logs",
    remove_unused_columns=False,
)
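These settings determine how many optimizer steps the run performs. The sketch below works through the arithmetic with the batch size and epoch count from the arguments above; the training set size of 468 is a hypothetical value, and the calculation assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept.

```python
import math


def total_optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Steps per epoch (last partial batch included) multiplied by the number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# 468 training examples is a hypothetical count; batch size 10 and 40 epochs
# come from the training arguments above.
print(total_optimizer_steps(468, 10, 40))  # 1880
```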

We can now start the fine-tuning process with `Trainer`.

trainer = Trainer(
    model,
    train_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=collate_fn,
    tokenizer=processor,
)
trainer.train()
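The results in this guide report an accuracy value, but the `Trainer` call above does not pass a `compute_metrics` function, and without one only the loss is reported. The function below is a minimal sketch of how accuracy could be computed; it is an assumption about how the reported accuracy was obtained, not the original code.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Return accuracy given a (logits, labels) pair."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # highest-scoring class per example
    return {"accuracy": float((predictions == labels).mean())}


# Toy check with made-up logits for 4 examples and 3 classes.
logits = np.array([[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.3, 0.2, 0.1], [0.1, 0.2, 3.0]])
labels = np.array([0, 1, 2, 2])
print(compute_metrics((logits, labels)))  # {'accuracy': 0.75}
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`. The metric then appears with an `eval_` prefix during evaluation and a `test_` prefix in the output of `trainer.predict`.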

Epoch	Training Loss	Validation Loss	Accuracy
40	0.174700	0.596288	0.903846

Fine-tuning is complete. Let's go on to evaluate the model on the test set.

>>> outputs = trainer.predict(test_ds)
>>> print(outputs.metrics)

{'test_loss': 0.40843912959098816, 'test_runtime': 4.9934, 'test_samples_per_second': 31.242, 'test_steps_per_second': 7.81}

{'test_loss': 0.3219967782497406, 'test_accuracy': 0.9102564102564102, 'test_runtime': 4.0543, 'test_samples_per_second': 38.478, 'test_steps_per_second': 9.619}
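Besides the metrics, `trainer.predict` returns raw logits in `outputs.predictions`. The sketch below shows one way to turn a single row of logits into class probabilities and a predicted label with a softmax; the logits are made-up values and the class order is assumed.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class_names = ["benign", "malignant", "normal"]  # assumed class order
logits = [2.0, 0.5, -1.0]                         # made-up scores for one image

probs = softmax(logits)
predicted = class_names[probs.index(max(probs))]
print([round(p, 3) for p in probs], predicted)
```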

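(Optional) Pushing the Model to the Hub

We can push the model to the Hugging Face Hub with `push_to_hub`.

model.push_to_hub("your_model_name")

Great! Let's visualize the results.

Results

We have finished fine-tuning. Now let's see how well our model predicts the classes, using scikit-learn's confusion matrix display and recall scores.

What Is a Confusion Matrix?

A confusion matrix is a table layout for visualizing the performance of an algorithm, usually a supervised learning model, on a set of test data whose true values are known. It is especially useful for inspecting a classification model because it shows how often each true label is paired with each predicted label.

Let's plot our model's confusion matrix.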

>>> from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

>>> y_true = outputs.label_ids
>>> y_pred = outputs.predictions.argmax(1)

>>> labels = train_ds.features["label"].names
>>> cm = confusion_matrix(y_true, y_pred)
>>> disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
>>> disp.plot(xticks_rotation=45)
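To see what the matrix contains, here is a small pure-Python sketch that counts (true, predicted) pairs on made-up labels. It follows the convention that rows are true classes and columns are predicted classes; `confusion_counts` is a hypothetical helper, not part of scikit-learn.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


# Made-up labels for 6 examples and 3 classes.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

for row in confusion_counts(toy_true, toy_pred, 3):
    print(row)
# [1, 1, 0]
# [0, 2, 0]
# [1, 0, 1]
```

Entries on the diagonal are correct predictions; every off-diagonal entry is a specific kind of mistake, such as class 2 being predicted as class 0 in the last row.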

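What Is Recall?

Recall is a performance metric for classification tasks that measures a model's ability to find all relevant instances in a dataset. Specifically, recall is the proportion of actual positive instances that the model correctly predicts as positive.

Let's print the recall scores using scikit-learn.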

>>> from sklearn.metrics import recall_score

>>> # Calculate the recall scores
>>> # 'None' calculates recall for each class separately
>>> recall = recall_score(y_true, y_pred, average=None)

>>> # Print the recall for each class
>>> for label, score in zip(labels, recall):
...     print(f"Recall for {label}: {score:.2f}")

Recall for benign: 0.90
Recall for malignant: 0.86
Recall for normal: 0.78
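Per-class recall is the number of true positives divided by the sum of true positives and false negatives for that class. The sketch below computes it by hand on made-up labels; `per_class_recall` is a hypothetical helper, and returning 0.0 for a class with no true examples is a choice made for this example.

```python
def per_class_recall(y_true, y_pred, num_classes):
    """Recall per class: true positives / (true positives + false negatives)."""
    recalls = []
    for cls in range(num_classes):
        true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        actual = true_pos + false_neg
        # Choice for this sketch: report 0.0 when the class never occurs in y_true.
        recalls.append(true_pos / actual if actual else 0.0)
    return recalls


# Made-up labels for 6 examples and 3 classes.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
print(per_class_recall(toy_true, toy_pred, 3))  # [0.5, 1.0, 0.5]
```

In a medical setting, recall for the malignant class is often the number to watch, because a false negative there means a missed malignant case.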

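Conclusion

In this cookbook, we covered how to train a ViT model on a medical dataset. It walks through the key steps: dataset preparation, image preprocessing, model configuration, training, evaluation, and result visualization. By combining Hugging Face's Transformers library, scikit-learn, and PyTorch Torchvision, it supports efficient model training and evaluation and gives insight into the model's performance and its ability to classify biomedical images accurately.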