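Vision Transformers for Object Detection

This section describes how to use Vision Transformers for object detection. We will see how to fine-tune an existing pre-trained object detection model for our own use case. Before starting, check out this HuggingFace Space, where you can play around with the final output.

Introduction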

Object detection example

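Object detection is a computer vision task that involves identifying objects in an image or video and determining where they are. It consists of two main steps:

First, recognizing the types of objects present (such as cars, people, or animals).
Second, determining their precise locations by drawing bounding boxes around them.

These models typically take images (static images or video frames) as input, and each image may contain multiple objects. For example, consider an image containing several objects such as cars, people, and bicycles. After processing the input, these models produce a set of numbers that conveys the following information:

The location of each object (the XY coordinates of its bounding box).
The class of each object.

Object detection has many applications. One of the most significant is autonomous driving, where object detection is used to detect the different objects around the car (pedestrians, road signs, traffic lights, and so on), and those detections become one of the inputs for decision making.

To deepen your understanding of the inner workings of object detection, check out our dedicated chapter on object detection 🤗.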

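The Need to Fine-tune Models in Object Detection 🤔

Should you build a new model or adapt an existing one? That is a great question. Training an object detection model from scratch means:

Repeating research that has already been done, over and over.
Writing repetitive model code, training it, and maintaining separate repositories for different use cases.
A lot of experimentation and wasted resources.

Instead of doing all of that, take a well-performing pre-trained model (one that does a good job of recognizing general features) and tweak or re-tune its weights (or some part of them) to adapt it to your use case. We assume that the pre-trained model has already learned enough to extract the significant features in an image for locating and classifying objects. So when new objects are introduced, the same model can be trained for a short time and start detecting them by drawing on both the features it already learned and the new ones.

By the end of this tutorial, you should be able to build a complete pipeline for an object detection use case, from loading a dataset and fine-tuning a model to running inference.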

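Installing Necessary Libraries

Let's start with installation. Just run the cell below to install the necessary packages. In this tutorial we will use Hugging Face Transformers and PyTorch.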

!pip install -U -q datasets transformers[torch] evaluate timm albumentations accelerate

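Scenario

To make this tutorial more interesting, let's look at a real-world example. Consider this scenario: construction workers need the utmost safety when working on a site. Basic safety protocol requires wearing a helmet at all times. With so many workers on site, it is hard to keep an eye on everyone all the time.

But if we had a camera system that could detect people in real time and tell whether each of them is wearing a helmet, that would be great, right?

So we are going to fine-tune a lightweight object detection model to do exactly that. Let's dive in.

Dataset

For this scenario, we will use the hardhat dataset provided by Northeastern University (China). We can download and load this dataset with 🤗 datasets.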

from datasets import load_dataset

dataset = load_dataset("anindya64/hardhat")
dataset

This will give you the following data structure:

DatasetDict({
    train: Dataset({
        features: ['image', 'image_id', 'width', 'height', 'objects'],
        num_rows: 5297
    })
    test: Dataset({
        features: ['image', 'image_id', 'width', 'height', 'objects'],
        num_rows: 1766
    })
})

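The output above is a DatasetDict, an efficient dict-like structure that holds the whole dataset in its train and test splits. As you can see, each split (train and test) has features and num_rows. Under features we have image (a Pillow object), the id of the image, its width and height, and objects. Now let's see what each data point (in the train or test set) looks like. To do that, run the following line: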

dataset["train"][0]

This will give you the following structure:

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x375>,
 'image_id': 1,
 'width': 500,
 'height': 375,
 'objects': {'id': [1, 1],
  'area': [3068.0, 690.0],
  'bbox': [[178.0, 84.0, 52.0, 59.0], [111.0, 144.0, 23.0, 30.0]],
  'category': ['helmet', 'helmet']}}

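As you can see, `objects` is another dictionary containing the object ids (which here are the class ids), the area of each object, the bounding box coordinates (`bbox`), and the category (or label). Here is a more detailed explanation of each key and value of a data element.

image: A Pillow Image object, which lets you view the image directly without first loading it from a path.
image_id: The number identifying the image within the training file.
width: The width of the image.
height: The height of the image.
objects: Another dictionary containing the annotation information. It contains the following:
- id: A list whose length equals the number of objects; each value is a class index.
- area: The area of each object.
- bbox: The bounding box coordinates of each object, given as [x, y, width, height].
- category: The class of each object, as a string.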
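To make the `bbox` and `area` fields concrete, here is a minimal sketch in plain Python. It assumes the boxes are in [x, y, width, height] order, as the drawing code later in this section treats them; the helper names `coco_to_corners` and `box_area` are made up for this illustration and are not part of any library.

```python
def coco_to_corners(box):
    """Convert an [x, y, w, h] box to (x1, y1, x2, y2) corner coordinates."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def box_area(box):
    """Area of an [x, y, w, h] box: width times height."""
    _, _, w, h = box
    return w * h


# Values copied from the sample data point shown above.
bboxes = [[178.0, 84.0, 52.0, 59.0], [111.0, 144.0, 23.0, 30.0]]

corners = [coco_to_corners(b) for b in bboxes]
areas = [box_area(b) for b in bboxes]

print(corners)  # [(178.0, 84.0, 230.0, 143.0), (111.0, 144.0, 134.0, 174.0)]
print(areas)    # [3068.0, 690.0]
```

The computed areas agree with the `area` list of the sample, which suggests that `area` is simply the box width times the box height for this dataset.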

Now let's extract the train and test samples properly. For this tutorial we have 5,297 training samples and 1,766 test samples.

# First, extract out the train and test set

train_dataset = dataset["train"]
test_dataset = dataset["test"]

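Now that we know what a sample data point contains, let's plot one. We will first draw the image and then draw the corresponding bounding boxes on it.

Here is what we are going to do:

Get the image and its corresponding height and width.
Create a draw object that makes it easy to draw text and lines on the image.
Get the annotations dictionary from the sample.
Iterate over it.
For each annotation, get the bounding box coordinates: x (where the box starts horizontally), y (where the box starts vertically), w (the width of the box), and h (the height of the box).
If the bounding box values are normalized, scale them to pixels; otherwise leave them as they are.
Finally, draw the rectangle and the category text.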

import numpy as np
from PIL import Image, ImageDraw


def draw_image_from_idx(dataset, idx):
    sample = dataset[idx]
    image = sample["image"]
    annotations = sample["objects"]
    draw = ImageDraw.Draw(image)
    width, height = sample["width"], sample["height"]

    for i in range(len(annotations["id"])):
        box = annotations["bbox"][i]
        class_idx = annotations["id"][i]
        x, y, w, h = tuple(box)
        if max(box) > 1.0:
            x1, y1 = int(x), int(y)
            x2, y2 = int(x + w), int(y + h)
        else:
            x1 = int(x * width)
            y1 = int(y * height)
            x2 = int((x + w) * width)
            y2 = int((y + h) * height)
        draw.rectangle((x1, y1, x2, y2), outline="red", width=1)
        draw.text((x1, y1), annotations["category"][i], fill="white")
    return image


draw_image_from_idx(dataset=train_dataset, idx=10)

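We now have a function that plots a single image, so let's write a simple function that uses it to plot several images. This will help us with some analysis.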

import matplotlib.pyplot as plt


def plot_images(dataset, indices):
    """
    Plot images and their annotations.
    """
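    num_cols = 3
    # Ceiling division so a trailing partial row still gets its own subplot row
    num_rows = -(-len(indices) // num_cols)
    # squeeze=False keeps `axes` two-dimensional even when there is a single row
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 10), squeeze=False)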

    for i, idx in enumerate(indices):
        row = i // num_cols
        col = i % num_cols

        # Draw image
        image = draw_image_from_idx(dataset, idx)

        # Display image on the corresponding subplot
        axes[row, col].imshow(image)
        axes[row, col].axis("off")

    plt.tight_layout()
    plt.show()


# Now use the function to plot images

plot_images(train_dataset, range(9))

Running the function gives us a nice collage like the one below.

input-image-plot

AutoImageProcessor

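Before fine-tuning the model, we must preprocess the data so that it exactly matches the approach used during pre-training. HuggingFace AutoImageProcessor takes care of processing the image data to create the pixel_values, pixel_mask, and labels that a DETR model can train with.

Now let's instantiate the image processor from the same checkpoint as the model we want to fine-tune.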

from transformers import AutoImageProcessor

checkpoint = "facebook/detr-resnet-50-dc5"
image_processor = AutoImageProcessor.from_pretrained(checkpoint)

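Preprocessing the Dataset

Before passing the images to image_processor, we also apply several augmentations to the images and their corresponding bounding boxes.

Put simply, augmentations are a set of random transformations such as rotations and resizing. They are used to obtain more samples and to make the vision model more robust to different image conditions. We will use the albumentations library for this. It lets you create random transformations of images and so increase the effective number of training samples.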

import albumentations
import numpy as np
import torch

transform = albumentations.Compose(
    [
        albumentations.Resize(480, 480),
        albumentations.HorizontalFlip(p=1.0),
        albumentations.RandomBrightnessContrast(p=1.0),
    ],
    bbox_params=albumentations.BboxParams(format="coco", label_fields=["category"]),
)
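To see what the geometric steps of this pipeline do to a bounding box, here is a small plain-Python sketch. It is an illustration of the arithmetic only, not the albumentations implementation, and the helper names `resize_box` and `hflip_box` are hypothetical. It assumes [x, y, width, height] pixel boxes (the "coco" format named in `BboxParams`) and uses the first box and the 500x375 image size of the sample shown earlier. The brightness and contrast step changes pixel values only, so it leaves boxes untouched.

```python
def resize_box(box, orig_size, new_size):
    """Scale an [x, y, w, h] box from orig_size to new_size, both (width, height)."""
    x, y, w, h = box
    sx = new_size[0] / orig_size[0]
    sy = new_size[1] / orig_size[1]
    return [x * sx, y * sy, w * sx, h * sy]


def hflip_box(box, image_width):
    """Mirror an [x, y, w, h] box horizontally; only the x coordinate changes."""
    x, y, w, h = box
    return [image_width - x - w, y, w, h]


box = [178.0, 84.0, 52.0, 59.0]  # first box of the sample shown earlier
resized = resize_box(box, (500, 375), (480, 480))
flipped = hflip_box(resized, 480)

print([round(v, 2) for v in flipped])  # [259.2, 107.52, 49.92, 75.52]
```

Note that the resize changes the aspect ratio here (500x375 to 480x480), so the x and y directions are scaled by different factors.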

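After initializing all the transformations, we need a function that formats the annotations and returns a list of annotations in a very specific format.

This is because image_processor expects the annotations in the following format: {'image_id': int, 'annotations': List[Dict]}, where each dictionary is a COCO object annotation.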

def formatted_anns(image_id, category, area, bbox):
    annotations = []
    for i in range(0, len(category)):
        new_ann = {
            "image_id": image_id,
            "category_id": category[i],
            "iscrowd": 0,  # COCO spells this key in lowercase
            "area": area[i],
            "bbox": list(bbox[i]),
        }
        annotations.append(new_ann)

    return annotations

Finally, we combine the image and annotation transformations so that they run over a whole batch of the dataset.

Here is the final code:

# transforming a batch


def transform_aug_ann(examples):
    image_ids = examples["image_id"]
    images, bboxes, area, categories = [], [], [], []
    for image, objects in zip(examples["image"], examples["objects"]):
        # Keep RGB channel order so training matches what the pipeline sees at inference
        image = np.array(image.convert("RGB"))
        out = transform(image=image, bboxes=objects["bbox"], category=objects["id"])

        area.append(objects["area"])
        images.append(out["image"])
        bboxes.append(out["bboxes"])
        categories.append(out["category"])

    targets = [
        {"image_id": id_, "annotations": formatted_anns(id_, cat_, ar_, box_)}
        for id_, cat_, ar_, box_ in zip(image_ids, categories, area, bboxes)
    ]

    return image_processor(images=images, annotations=targets, return_tensors="pt")

Finally, all you have to do is apply this preprocessing function to the entire dataset. You can do that with the with_transform method of 🤗 Datasets.

# Apply transformations for both train and test dataset

train_dataset_transformed = train_dataset.with_transform(transform_aug_ann)
test_dataset_transformed = test_dataset.with_transform(transform_aug_ann)

Now let's see what a sample of the transformed training dataset looks like:

train_dataset_transformed[0]

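This returns a dictionary of tensors. What we mainly need here are pixel_values, which represents the image, pixel_mask, which is the attention mask, and labels. Here is what one data point looks like: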

{'pixel_values': tensor([[[-0.1657, -0.1657, -0.1657,  ..., -0.3369, -0.4739, -0.5767],
          [-0.1657, -0.1657, -0.1657,  ..., -0.3369, -0.4739, -0.5767],
          [-0.1657, -0.1657, -0.1828,  ..., -0.3541, -0.4911, -0.5938],
          ...,
          [-0.4911, -0.5596, -0.6623,  ..., -0.7137, -0.7650, -0.7993],
          [-0.4911, -0.5596, -0.6794,  ..., -0.7308, -0.7993, -0.8335],
          [-0.4911, -0.5596, -0.6794,  ..., -0.7479, -0.8164, -0.8507]],
 
         [[-0.0924, -0.0924, -0.0924,  ...,  0.0651, -0.0749, -0.1800],
          [-0.0924, -0.0924, -0.0924,  ...,  0.0651, -0.0924, -0.2150],
          [-0.0924, -0.0924, -0.1099,  ...,  0.0476, -0.1275, -0.2500],
          ...,
          [-0.0924, -0.1800, -0.3200,  ..., -0.4426, -0.4951, -0.5301],
          [-0.0924, -0.1800, -0.3200,  ..., -0.4601, -0.5126, -0.5651],
          [-0.0924, -0.1800, -0.3200,  ..., -0.4601, -0.5301, -0.5826]],
 
         [[ 0.1999,  0.1999,  0.1999,  ...,  0.6705,  0.5136,  0.4091],
          [ 0.1999,  0.1999,  0.1999,  ...,  0.6531,  0.4962,  0.3916],
          [ 0.1999,  0.1999,  0.1825,  ...,  0.6356,  0.4614,  0.3568],
          ...,
          [ 0.4788,  0.3916,  0.2696,  ...,  0.1825,  0.1302,  0.0953],
          [ 0.4788,  0.3916,  0.2696,  ...,  0.1651,  0.0953,  0.0605],
          [ 0.4788,  0.3916,  0.2696,  ...,  0.1476,  0.0779,  0.0431]]]),
 'pixel_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]]),
 'labels': {'size': tensor([800, 800]), 'image_id': tensor([1]), 'class_labels': tensor([1, 1]), 'boxes': tensor([[0.5920, 0.3027, 0.1040, 0.1573],
         [0.7550, 0.4240, 0.0460, 0.0800]]), 'area': tensor([8522.2217, 1916.6666]), 'iscrowd': tensor([0, 0]), 'orig_size': tensor([480, 480])}}
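The `boxes` values in `labels` are no longer pixel coordinates. They appear to be normalized (center_x, center_y, width, height) values relative to the image size. The sketch below checks this in plain Python, assuming the two sample boxes have first been resized to 480x480 and flipped horizontally by the augmentation step (the pixel values listed are derived under that assumption). The helper name `to_normalized_center` is hypothetical, and the last lines are an observation about `area` rather than a documented guarantee.

```python
def to_normalized_center(box, image_size):
    """Convert an [x, y, w, h] pixel box to normalized [cx, cy, w, h]."""
    x, y, w, h = box
    width, height = image_size
    return [(x + w / 2) / width, (y + h / 2) / height, w / width, h / height]


# The two sample boxes after a 480x480 resize and a horizontal flip (assumed).
augmented = [[259.2, 107.52, 49.92, 75.52], [351.36, 184.32, 22.08, 38.4]]

normalized = [to_normalized_center(b, (480, 480)) for b in augmented]
for b in normalized:
    print([round(v, 4) for v in b])
# [0.592, 0.3027, 0.104, 0.1573]
# [0.755, 0.424, 0.046, 0.08]

# The reported areas are consistent with the original areas scaled by the
# 480 -> 800 resize that the image processor applies.
scale = (800 / 480) ** 2
print(round(3068.0 * scale, 2))  # 8522.22
print(round(690.0 * scale, 2))   # 1916.67
```

These numbers match the `boxes` and `area` tensors printed above, up to float32 rounding.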

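We are almost there 🚀. As a last preprocessing step, we need to write a custom collate_fn. So what is a collate_fn?

A collate_fn is responsible for taking a list of samples from a dataset and converting them into a batch in a format suitable for the model's input.

In general, a `DataCollator` performs tasks such as padding and truncation. In a custom collate function, we define how the data should be grouped into batches or, put simply, how each batch is represented.

A data collator mainly puts the data together and then preprocesses it. Let's write our collate function.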

def collate_fn(batch):
    pixel_values = [item["pixel_values"] for item in batch]
    encoding = image_processor.pad(pixel_values, return_tensors="pt")
    labels = [item["labels"] for item in batch]
    batch = {}
    batch["pixel_values"] = encoding["pixel_values"]
    batch["pixel_mask"] = encoding["pixel_mask"]
    batch["labels"] = labels
    return batch
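To build intuition for what padding and the pixel mask mean, here is a toy sketch in plain Python that uses nested lists as single-channel "images". It is a conceptual illustration, not the image processor's implementation: the function name `pad_batch` is hypothetical, and it assumes padding is added on the bottom and right with zeros.

```python
def pad_batch(images, pad_value=0.0):
    """Pad 2-D images (lists of rows) to the largest height and width in the batch.

    Returns the padded images and, for each one, a mask holding 1 for real
    pixels and 0 for padding.
    """
    max_h = max(len(img) for img in images)
    max_w = max(len(img[0]) for img in images)
    padded, masks = [], []
    for img in images:
        h, w = len(img), len(img[0])
        # Extend each row on the right, then append full padding rows below.
        padded.append(
            [row + [pad_value] * (max_w - w) for row in img]
            + [[pad_value] * max_w for _ in range(max_h - h)]
        )
        masks.append(
            [[1] * w + [0] * (max_w - w) for _ in range(h)]
            + [[0] * max_w for _ in range(max_h - h)]
        )
    return padded, masks


small = [[1.0, 2.0], [3.0, 4.0]]  # height 2, width 2
wide = [[5.0, 6.0, 7.0]]          # height 1, width 3
padded, masks = pad_batch([small, wide])

print(padded[0])  # [[1.0, 2.0, 0.0], [3.0, 4.0, 0.0]]
print(masks[1])   # [[1, 1, 1], [0, 0, 0]]
```

After padding, every image in the batch has the same shape, so the batch can be stacked into one tensor, while the mask tells the model which pixels to ignore.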

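Training a DETR Model

All the heavy lifting is done by now. What remains is to assemble the pieces of the puzzle one by one. Let's go!

The training procedure involves the following steps:

Load the base (pre-trained) model with AutoModelForObjectDetection, using the same checkpoint as in the preprocessing.
Define all the hyperparameters and additional arguments in TrainingArguments.
Pass the training arguments to the HuggingFace Trainer along with the model, the datasets, the image processor, and the data collator.
Call the train() method and fine-tune your model.

When loading the model from the same checkpoint used for preprocessing, remember to pass the label2id and id2label maps for the dataset's classes. Additionally, we specify ignore_mismatched_sizes=True to replace the existing classification head with a new one.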

from transformers import AutoModelForObjectDetection

id2label = {0: "head", 1: "helmet", 2: "person"}
label2id = {v: k for k, v in id2label.items()}


model = AutoModelForObjectDetection.from_pretrained(
    checkpoint,
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
)
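The label maps above are written out by hand. As a hedged alternative, they could be derived from the annotations themselves, since each sample pairs class ids with category names. The sketch below uses a hypothetical helper, `build_label_maps`, on toy records: the first record mirrors the sample shown earlier, and the second is made up for illustration.

```python
def build_label_maps(samples):
    """Collect id -> name pairs from `objects` annotations and invert them."""
    id2label = {}
    for sample in samples:
        objects = sample["objects"]
        for class_id, name in zip(objects["id"], objects["category"]):
            id2label[class_id] = name
    id2label = dict(sorted(id2label.items()))  # order by class id
    label2id = {name: class_id for class_id, name in id2label.items()}
    return id2label, label2id


toy_samples = [
    {"objects": {"id": [1, 1], "category": ["helmet", "helmet"]}},  # as in the sample above
    {"objects": {"id": [0, 2], "category": ["head", "person"]}},    # hypothetical record
]

derived_id2label, derived_label2id = build_label_maps(toy_samples)
print(derived_id2label)  # {0: 'head', 1: 'helmet', 2: 'person'}
print(derived_label2id)  # {'head': 0, 'helmet': 1, 'person': 2}
```

On the real dataset, passing only the `objects` column to such a helper would avoid decoding every image just to read its labels.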

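Before proceeding, log in to the Hugging Face Hub so that your model is uploaded as it trains. That way you don't need to manage checkpoints and save them somewhere else.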

from huggingface_hub import notebook_login

notebook_login()

Once that is done, we can start training the model. We first define the training arguments and then a trainer object that uses them, as shown below:

from transformers import TrainingArguments
from transformers import Trainer

# Define the training arguments

training_args = TrainingArguments(
    output_dir="detr-resnet-50-hardhat-finetuned",
    per_device_train_batch_size=8,
    num_train_epochs=3,  # has no effect here: a positive max_steps overrides it
    max_steps=1000,
    fp16=True,
    save_steps=10,
    logging_steps=30,
    learning_rate=1e-5,
    weight_decay=1e-4,
    save_total_limit=2,
    remove_unused_columns=False,
    push_to_hub=True,
)

# Define the trainer

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=collate_fn,
    train_dataset=train_dataset_transformed,
    eval_dataset=test_dataset_transformed,
    tokenizer=image_processor,
)

trainer.train()
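It helps to work out how much training these arguments actually request. The arithmetic below is a rough sketch that assumes a single device and no gradient accumulation (neither is stated in this tutorial), and uses the training-set size reported earlier.

```python
import math

num_samples = 5297  # training rows reported by the DatasetDict above
batch_size = 8      # per_device_train_batch_size
max_steps = 1000    # training stops after this many optimizer steps
save_steps = 10     # a checkpoint is written every 10 steps
num_devices = 1     # assumption: single GPU
grad_accum = 1      # assumption: no gradient accumulation

steps_per_epoch = math.ceil(num_samples / (batch_size * num_devices * grad_accum))
epochs_covered = max_steps / steps_per_epoch
checkpoints_written = max_steps // save_steps

print(steps_per_epoch)           # 663
print(round(epochs_covered, 2))  # 1.51
print(checkpoints_written)       # 100
```

Under these assumptions the run covers about one and a half passes over the training data rather than three epochs, and it saves a checkpoint 100 times (with save_total_limit=2 keeping only the two most recent locally). A larger save_steps would reduce that overhead.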

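Once training is finished, you can delete the model, since the checkpoints have already been uploaded to the HuggingFace Hub.

del model
# Release cached GPU memory that is no longer in use
torch.cuda.empty_cache()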

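Testing and Inference

Now we will run inference with our newly fine-tuned model. In this tutorial we will test on this image:

input-test-image

We first write some very simple code to run object detection inference on new images. We start with inference on a single image and then put everything together into a function.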

import requests
from transformers import pipeline

# download a sample image

url = "https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/test-helmet-object-detection.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# make the object detection pipeline

obj_detector = pipeline(
    "object-detection", model="anindya64/detr-resnet-50-dc5-hardhat-finetuned"
)
results = obj_detector(train_dataset[0]["image"])

print(results)

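Now let's write a very simple function that plots the results on our image. From each result we take the score, the label, and the corresponding bounding box coordinates, which we use to draw on the image.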

def plot_results(image, results, threshold=0.7):
    image = Image.fromarray(np.uint8(image))
    draw = ImageDraw.Draw(image)
    for result in results:
        score = result["score"]
        label = result["label"]
        box = result["box"]
        # Only draw detections whose confidence exceeds the threshold
        if score > threshold:
            # Read the corners by key rather than relying on dict ordering
            x, y = box["xmin"], box["ymin"]
            x2, y2 = box["xmax"], box["ymax"]
            draw.rectangle((x, y, x2, y2), outline="red", width=1)
            draw.text((x, y), label, fill="white")
            draw.text(
                (x + 0.5, y - 0.5),
                text=str(score),
                fill="green" if score > 0.7 else "red",
            )
    return image
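The thresholding step can be tried out without loading a model. The sketch below assumes each result is a dictionary with "score", "label", and a "box" holding xmin, ymin, xmax, and ymax, which is the shape that plot_results reads. The scores and coordinates are invented for illustration, and the helper name `filter_detections` is hypothetical.

```python
def filter_detections(results, threshold=0.7):
    """Keep only detections whose score is strictly above the threshold."""
    return [r for r in results if r["score"] > threshold]


# Made-up results in the assumed output format (not real model output).
mock_results = [
    {"score": 0.95, "label": "helmet",
     "box": {"xmin": 10, "ymin": 20, "xmax": 60, "ymax": 80}},
    {"score": 0.40, "label": "head",
     "box": {"xmin": 100, "ymin": 30, "xmax": 140, "ymax": 75}},
    {"score": 0.72, "label": "person",
     "box": {"xmin": 5, "ymin": 15, "xmax": 150, "ymax": 300}},
]

kept = filter_detections(mock_results, threshold=0.7)
print([r["label"] for r in kept])  # ['helmet', 'person']
```

Raising the threshold trades recall for precision: fewer boxes are drawn, but the ones that remain are the detections the model is most confident about.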

Finally, apply this function to the test image we downloaded earlier.

results = obj_detector(image)
plot_results(image, results)

This will plot the following output:

output-test-image-plot

Now let's combine everything into one simple function.

def predict(image, pipeline, threshold=0.7):
    results = pipeline(image)
    return plot_results(image, results, threshold)


# Let's test for another test image

img = test_dataset[0]["image"]
predict(img, obj_detector)

We can even use our inference function to plot several images from a small set of test samples.

from tqdm.auto import tqdm


def plot_images(dataset, indices):
    """
    Plot images and their annotations.
    """
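    num_cols = 3
    # Ceiling division so a trailing partial row still gets its own subplot row
    num_rows = -(-len(indices) // num_cols)
    # squeeze=False keeps `axes` two-dimensional even when there is a single row
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 10), squeeze=False)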

    for i, idx in tqdm(enumerate(indices), total=len(indices)):
        row = i // num_cols
        col = i % num_cols

        # Draw image
        image = predict(dataset[idx]["image"], obj_detector)

        # Display image on the corresponding subplot
        axes[row, col].imshow(image)
        axes[row, col].axis("off")

    plt.tight_layout()
    plt.show()


plot_images(test_dataset, range(6))

Running this function gives output like the following:

test-sample-output-plot

Well, that's not bad. With further fine-tuning we can improve the results. You can find this fine-tuned checkpoint here.

Community Computer Vision Course