Zero-shot object detection

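Traditionally, models used for object detection require labeled image datasets for training, and they can only detect the set of classes present in the training data.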

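Zero-shot object detection is supported by the OWL-ViT model, which takes a different approach. OWL-ViT is an open-vocabulary object detector: it can detect objects in images based on free-text queries, without the need to fine-tune the model on labeled datasets.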

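OWL-ViT leverages multi-modal representations to perform open-vocabulary detection. It combines CLIP with lightweight object classification and localization heads. ViT processes image patches as input, while free-text queries are embedded with the text encoder of CLIP. These text embeddings are then used as input to the object classification and localization heads, which associate regions of the image with their corresponding textual descriptions. The authors of OWL-ViT first trained CLIP from scratch and then fine-tuned OWL-ViT end to end on standard object detection datasets using a bipartite matching loss.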
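
The idea of scoring image patches against embedded text queries can be illustrated with a toy sketch. This is not OWL-ViT's actual code: the embeddings are random NumPy arrays, the dimensions are arbitrary, and `classify_patches` is a hypothetical helper that uses plain cosine similarity, omitting any learned transformations the real classification head applies.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def classify_patches(patch_embeds, text_embeds):
    """Return (best query index, best similarity) for every image patch.

    Hypothetical helper: assumes both inputs live in the same embedding space.
    """
    sims = l2_normalize(patch_embeds) @ l2_normalize(text_embeds).T  # (patches, queries)
    return sims.argmax(axis=1), sims.max(axis=1)


rng = np.random.default_rng(0)
text_embeds = rng.normal(size=(3, 8))  # 3 made-up text queries, embedding size 8
# 4 toy patches built as slightly noisy copies of queries 2, 0, 1 and 2
patch_embeds = text_embeds[[2, 0, 1, 2]] + 0.01 * rng.normal(size=(4, 8))

best_query, best_sim = classify_patches(patch_embeds, text_embeds)
print(best_query, best_sim.round(3))
```

Because the queries are only embedded, not baked into a fixed output layer, changing the set of detectable classes amounts to changing the rows of `text_embeds`.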

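With this approach, the model can detect objects based on textual descriptions without prior training on labeled datasets for those specific classes.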

In this guide, you will learn how to use OWL-ViT:

to detect objects based on text prompts
for batch object detection
for image-guided object detection

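Before you begin, make sure you have all the necessary libraries installed. Besides transformers, the examples in this guide import PyTorch, Pillow, NumPy, scikit-image, requests, and matplotlib.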

pip install -q transformers

Zero-shot object detection pipeline

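The simplest way to try out inference with OWL-ViT is to use it in a pipeline(). Instantiate a pipeline for zero-shot object detection from a checkpoint on the Hugging Face Hub. Note that, as its name indicates, the checkpoint used below is an OWLv2 checkpoint.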

>>> from transformers import pipeline

>>> checkpoint = "google/owlv2-base-patch16-ensemble"
>>> detector = pipeline(model=checkpoint, task="zero-shot-object-detection")

Next, choose an image you'd like to detect objects in. Here we'll use the image of astronaut Eileen Collins, which is part of the NASA Great Images dataset.

>>> import skimage
>>> import numpy as np
>>> from PIL import Image

>>> image = skimage.data.astronaut()
>>> image = Image.fromarray(np.uint8(image)).convert("RGB")

>>> image

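Pass the image and the candidate object labels to look for to the pipeline. Here we pass the image directly; other suitable options include a local path to an image or an image URL. We also pass text descriptions for all the items we want to query the image for.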

>>> predictions = detector(
...     image,
...     candidate_labels=["human face", "rocket", "nasa badge", "star-spangled banner"],
... )
>>> predictions
[{'score': 0.3571370542049408,
  'label': 'human face',
  'box': {'xmin': 180, 'ymin': 71, 'xmax': 271, 'ymax': 178}},
 {'score': 0.28099656105041504,
  'label': 'nasa badge',
  'box': {'xmin': 129, 'ymin': 348, 'xmax': 206, 'ymax': 427}},
 {'score': 0.2110239565372467,
  'label': 'rocket',
  'box': {'xmin': 350, 'ymin': -1, 'xmax': 468, 'ymax': 288}},
 {'score': 0.13790413737297058,
  'label': 'star-spangled banner',
  'box': {'xmin': 1, 'ymin': 1, 'xmax': 105, 'ymax': 509}},
 {'score': 0.11950037628412247,
  'label': 'nasa badge',
  'box': {'xmin': 277, 'ymin': 338, 'xmax': 327, 'ymax': 380}},
 {'score': 0.10649408400058746,
  'label': 'rocket',
  'box': {'xmin': 358, 'ymin': 64, 'xmax': 424, 'ymax': 280}}]
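
Since the pipeline returns a plain list of dictionaries, ordinary Python is enough to post-filter it. The following is a minimal sketch, assuming the `score` / `label` / `box` structure shown above; `group_by_label` and its `min_score` parameter are hypothetical names, and the sample entries are rounded copies of three of the predictions above.

```python
from collections import defaultdict


def group_by_label(predictions, min_score=0.2):
    """Keep predictions at or above min_score and group their boxes by label."""
    grouped = defaultdict(list)
    for pred in predictions:
        if pred["score"] >= min_score:
            grouped[pred["label"]].append(pred["box"])
    return dict(grouped)


# Rounded copies of three predictions from the output above
sample = [
    {"score": 0.357, "label": "human face",
     "box": {"xmin": 180, "ymin": 71, "xmax": 271, "ymax": 178}},
    {"score": 0.281, "label": "nasa badge",
     "box": {"xmin": 129, "ymin": 348, "xmax": 206, "ymax": 427}},
    {"score": 0.120, "label": "nasa badge",
     "box": {"xmin": 277, "ymin": 338, "xmax": 327, "ymax": 380}},
]

confident = group_by_label(sample)               # default cutoff of 0.2
everything = group_by_label(sample, min_score=0.1)
print({label: len(boxes) for label, boxes in everything.items()})
```

Raising the cutoff trades recall for precision: here the second, lower-scoring "nasa badge" box is dropped at the default cutoff.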

Let's visualize the predictions:

>>> from PIL import ImageDraw

>>> draw = ImageDraw.Draw(image)

>>> for prediction in predictions:
...     box = prediction["box"]
...     label = prediction["label"]
...     score = prediction["score"]
...     xmin, ymin, xmax, ymax = box.values()
...     draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
...     draw.text((xmin, ymin), f"{label}: {round(score,2)}", fill="white")

>>> image
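
One of the predicted boxes above has `ymin` equal to -1, slightly outside the image. A small helper can read the box by key (rather than relying on the dictionary's value order) and clip it to the image bounds before drawing. This is a sketch under stated assumptions: `box_to_xyxy` is a hypothetical name, and the 512 x 512 size is the size of the scikit-image astronaut picture.

```python
def box_to_xyxy(box, width, height):
    """Read a pipeline-style box dict by key and clip it to the image bounds."""
    xmin = min(max(box["xmin"], 0), width - 1)
    ymin = min(max(box["ymin"], 0), height - 1)
    xmax = min(max(box["xmax"], 0), width - 1)
    ymax = min(max(box["ymax"], 0), height - 1)
    return xmin, ymin, xmax, ymax


# The "rocket" box from the predictions above, on an assumed 512 x 512 image
rocket = {"xmin": 350, "ymin": -1, "xmax": 468, "ymax": 288}
clipped = box_to_xyxy(rocket, width=512, height=512)
print(clipped)
```

In the drawing loop, this could be used as `draw.rectangle(box_to_xyxy(box, *image.size), outline="red", width=1)`, since a PIL image reports its size as (width, height).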

Text-prompted zero-shot object detection by hand

Now that you've seen how to use the zero-shot object detection pipeline, let's reproduce the same result manually.

Start by loading the model and its associated processor from a checkpoint on the Hugging Face Hub. Here we'll use the same checkpoint as before:

>>> from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

>>> model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)
>>> processor = AutoProcessor.from_pretrained(checkpoint)

Let's use a different image to switch things up.

>>> import requests

>>> url = "https://unsplash.com/photos/oj0zeY2Ltk4/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MTR8fHBpY25pY3xlbnwwfHx8fDE2Nzc0OTE1NDk&force=true&w=640"
>>> im = Image.open(requests.get(url, stream=True).raw)
>>> im

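Use the processor to prepare the inputs for the model. The processor combines an image processor, which prepares the image for the model by resizing and normalizing it, and a CLIPTokenizer, which takes care of the text inputs.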

>>> text_queries = ["hat", "book", "sunglasses", "camera"]
>>> inputs = processor(text=text_queries, images=im, return_tensors="pt")

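Pass the inputs through the model, post-process the outputs, and visualize the results. Since the image processor resized the image before feeding it to the model, you need to use the post_process_object_detection() method to make sure the predicted bounding boxes have correct coordinates relative to the original image. Note that a PIL image reports its size as (width, height), which is why the size is reversed with [::-1] to produce the (height, width) order used for target_sizes.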

>>> import torch

>>> with torch.no_grad():
...     outputs = model(**inputs)
...     target_sizes = torch.tensor([im.size[::-1]])
...     results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)[0]

>>> draw = ImageDraw.Draw(im)

>>> scores = results["scores"].tolist()
>>> labels = results["labels"].tolist()
>>> boxes = results["boxes"].tolist()

>>> for box, score, label in zip(boxes, scores, labels):
...     xmin, ymin, xmax, ymax = box
...     draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
...     draw.text((xmin, ymin), f"{text_queries[label]}: {round(score,2)}", fill="white")

>>> im
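
To make the coordinate conversion performed in post-processing more concrete, here is a simplified sketch. It assumes the model emits boxes as normalized (center_x, center_y, width, height) values in the range 0 to 1 and ignores any padding the image processor may apply, so it is an illustration rather than the library's implementation. `rescale_boxes` is a hypothetical helper, and the 640 x 480 size is made up.

```python
import numpy as np


def rescale_boxes(norm_cxcywh, target_size):
    """Convert normalized (cx, cy, w, h) boxes to absolute (xmin, ymin, xmax, ymax).

    target_size is (height, width), the same order used for target_sizes above.
    """
    height, width = target_size
    boxes = np.asarray(norm_cxcywh, dtype=float)
    cx, cy, w, h = boxes.T
    # Center format -> corner format, still in normalized units
    xyxy = np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)
    # Scale x coordinates by the width and y coordinates by the height
    return xyxy * np.array([width, height, width, height], dtype=float)


pil_size = (640, 480)          # PIL-style (width, height), assumed for illustration
target_size = pil_size[::-1]   # (height, width)
abs_boxes = rescale_boxes([[0.5, 0.5, 0.5, 0.5]], target_size)
print(abs_boxes)
```

A box centered in the image and covering half of each dimension maps to the middle region of the original picture, regardless of the size the model saw internally.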

Batch processing

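You can pass multiple sets of images and text queries to search for different (or the same) objects in several images. Let's use both the astronaut image and the beach image together. For batch processing, pass the text queries to the processor as a nested list, with one inner list per image, and the images as a list of PIL images, PyTorch tensors, or NumPy arrays.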

>>> images = [image, im]
>>> text_queries = [
...     ["human face", "rocket", "nasa badge", "star-spangled banner"],
...     ["hat", "book", "sunglasses", "camera"],
... ]
>>> inputs = processor(text=text_queries, images=images, return_tensors="pt")

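Previously, for post-processing, you passed the size of the single image as a tensor, but you can also pass a tuple or, in the case of several images, a list of tuples. Let's create predictions for the two examples and visualize the second one (image_idx = 1).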

>>> with torch.no_grad():
...     outputs = model(**inputs)
...     target_sizes = [x.size[::-1] for x in images]
...     results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)

>>> image_idx = 1
>>> draw = ImageDraw.Draw(images[image_idx])

>>> scores = results[image_idx]["scores"].tolist()
>>> labels = results[image_idx]["labels"].tolist()
>>> boxes = results[image_idx]["boxes"].tolist()

>>> for box, score, label in zip(boxes, scores, labels):
...     xmin, ymin, xmax, ymax = box
...     draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
...     draw.text((xmin, ymin), f"{text_queries[image_idx][label]}: {round(score,2)}", fill="white")

>>> images[image_idx]
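
In a batch, each label index refers to the query list of its own image, so decoding has to pair each result with the matching inner list. The sketch below shows this for all images at once. `decode_batch` is a hypothetical helper and the results are mocked with plain lists; it assumes the real results expose `labels` and `scores` with a `.tolist()` method, as used in the code above.

```python
def decode_batch(results, text_queries):
    """Map label indices to query strings, image by image."""
    decoded = []
    for per_image, queries in zip(results, text_queries):
        labels, scores = per_image["labels"], per_image["scores"]
        # Tensors expose .tolist(); plain lists pass through unchanged
        labels = labels.tolist() if hasattr(labels, "tolist") else list(labels)
        scores = scores.tolist() if hasattr(scores, "tolist") else list(scores)
        decoded.append([(queries[i], round(s, 2)) for i, s in zip(labels, scores)])
    return decoded


# Mocked results for two images (made-up indices and scores)
mock_results = [
    {"labels": [0, 2], "scores": [0.36, 0.28]},
    {"labels": [1], "scores": [0.42]},
]
mock_queries = [
    ["human face", "rocket", "nasa badge"],
    ["hat", "book"],
]
decoded = decode_batch(mock_results, mock_queries)
print(decoded)
```

Index 1 means "rocket" for the first image but "book" for the second, which is why the lookup uses `text_queries[image_idx][label]` in the code above.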

Image-guided object detection

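In addition to zero-shot object detection with text queries, OWL-ViT offers image-guided object detection. This means you can use an image query to find similar objects in the target image. Unlike text queries, only a single example image is allowed.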

Let's take an image with two cats on a couch as the target image, and an image of a single cat as the query.

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image_target = Image.open(requests.get(url, stream=True).raw)

>>> query_url = "http://images.cocodataset.org/val2017/000000524280.jpg"
>>> query_image = Image.open(requests.get(query_url, stream=True).raw)

Let's take a quick look at the images:

>>> import matplotlib.pyplot as plt

>>> fig, ax = plt.subplots(1, 2)
>>> ax[0].imshow(image_target)
>>> ax[1].imshow(query_image)

In the preprocessing step, you now need to use `query_images` instead of text queries:

>>> inputs = processor(images=image_target, query_images=query_image, return_tensors="pt")

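For predictions, instead of passing the inputs to the model directly, pass them to image_guided_detection(). Draw the predictions as before, except that now there are no labels.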

>>> with torch.no_grad():
...     outputs = model.image_guided_detection(**inputs)
...     target_sizes = torch.tensor([image_target.size[::-1]])
...     results = processor.post_process_image_guided_detection(outputs=outputs, target_sizes=target_sizes)[0]

>>> draw = ImageDraw.Draw(image_target)

>>> scores = results["scores"].tolist()
>>> boxes = results["boxes"].tolist()

>>> for box, score in zip(boxes, scores):
...     xmin, ymin, xmax, ymax = box
...     draw.rectangle((xmin, ymin, xmax, ymax), outline="white", width=4)

>>> image_target
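
The drawing loop above unpacks the scores but does not use them. If only the strongest matches are of interest, the boxes can be ranked by score first. The following is a minimal sketch with made-up boxes and scores; `top_k_boxes` is a hypothetical helper and operates on the plain lists obtained with `.tolist()`.

```python
def top_k_boxes(boxes, scores, k=2):
    """Return the k highest-scoring (box, score) pairs, best first."""
    ranked = sorted(zip(scores, boxes), key=lambda pair: pair[0], reverse=True)
    return [(box, score) for score, box in ranked[:k]]


# Made-up detections standing in for results["boxes"] and results["scores"]
mock_boxes = [[10, 20, 110, 220], [300, 40, 620, 380], [5, 5, 50, 50]]
mock_scores = [0.91, 0.97, 0.40]

kept = top_k_boxes(mock_boxes, mock_scores, k=2)
for box, score in kept:
    print(box, score)
```

With a target image containing two cats, keeping the two best-scoring boxes is one plausible choice, though the right value of `k` depends on the image.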
