Image Feature Extraction

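Image feature extraction is the task of extracting semantically meaningful features from a given image. It has many use cases, including image similarity and image retrieval. Moreover, most computer vision models can be used for image feature extraction: you remove the task-specific head (image classification, object detection, and so on) and take the features. These features are useful at a higher level, for example for edge detection and corner detection. Depending on how deep the model is, they may also contain information about the real world (for example, what a cat looks like). These outputs can therefore be used to train new classifiers on a specific dataset.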

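In this guide, you will:

- Learn to build a simple image similarity system on top of the image-feature-extraction pipeline.
- Accomplish the same task with bare model inference.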

Image Similarity using the image-feature-extraction Pipeline

We have two images of cats sitting on top of fish nets; one of them is generated.

from PIL import Image
import requests

img_urls = ["https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png", "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.jpeg"]
image_real = Image.open(requests.get(img_urls[0], stream=True).raw).convert("RGB")
image_gen = Image.open(requests.get(img_urls[1], stream=True).raw).convert("RGB")

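Let's see the pipeline in action. First, initialize the pipeline. If you don't pass a model to it, the pipeline is automatically initialized with google/vit-base-patch16-224. If you want to calculate similarity, set pool to True.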

import torch
from transformers import pipeline
from accelerate.test_utils.testing import get_backend
# automatically detects the underlying device type (CUDA, CPU, XPU, MPS, etc.)
DEVICE, _, _ = get_backend()
# the checkpoint is passed through the `model` argument
pipe = pipeline(task="image-feature-extraction", model="google/vit-base-patch16-384", device=DEVICE, pool=True)

To infer with pipe, pass both images to it.

outputs = pipe([image_real, image_gen])

The output contains the pooled embeddings of the two images.

# get the length of a single output
print(len(outputs[0][0]))
# show outputs
print(outputs)

# 768
# [[[-0.03909236937761307, 0.43381670117378235, -0.06913255900144577,

To get the similarity score, we need to pass the embeddings to a similarity function.

from torch.nn.functional import cosine_similarity

similarity_score = cosine_similarity(torch.Tensor(outputs[0]),
                                     torch.Tensor(outputs[1]), dim=1)

print(similarity_score)

# tensor([0.6043])
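
To make explicit what the similarity function computes, here is a minimal dependency-free sketch of cosine similarity: the dot product of two vectors divided by the product of their norms. The helper name `cosine` and the toy vectors are illustrative assumptions and are not part of transformers or torch.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for pooled embeddings (hypothetical values).
print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```

With the pipeline output above, each image's pooled embedding sits one level down in the nested list, so a call along the lines of `cosine(outputs[0][0], outputs[1][0])` would be the equivalent usage.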

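If you want the last hidden states before pooling, don't pass any value for the pool parameter, since it defaults to False. These hidden states are useful for training new classifiers or models on top of the model's features.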

pipe = pipeline(task="image-feature-extraction", model="google/vit-base-patch16-224", device=DEVICE)
outputs = pipe(image_real)

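Since the outputs are unpooled, we get the last hidden states, where the first dimension is the batch size and the last two are the embedding shape.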

import numpy as np
print(np.array(outputs).shape)
# (1, 197, 768)
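
If you need a single vector per image from unpooled hidden states, two simple options are averaging over the token dimension or taking the first token. The sketch below works on nested lists laid out as (batch, tokens, hidden), the same layout as the pipeline output. The helper names `mean_pool` and `first_token` are hypothetical, and neither is guaranteed to reproduce what `pool=True` returns, since that depends on the model's own pooling layer.

```python
def mean_pool(hidden_states):
    """Average over the token axis of a (batch, tokens, hidden) nested list."""
    pooled = []
    for tokens in hidden_states:
        n = len(tokens)
        # zip(*tokens) walks the hidden dimensions one column at a time
        pooled.append([sum(col) / n for col in zip(*tokens)])
    return pooled

def first_token(hidden_states):
    """Return the first token's vector for each item in the batch."""
    return [list(tokens[0]) for tokens in hidden_states]

# Toy input: batch of 1, 3 tokens, hidden size 2 (hypothetical values).
toy = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
print(mean_pool(toy))    # [[3.0, 4.0]]
print(first_token(toy))  # [[1.0, 2.0]]
```

Applied to the output above, `mean_pool(outputs)` would return one 768-dimensional vector for the single image in the batch.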

Getting Features and Similarities using AutoModel

We can also use the AutoModel class of transformers to get the features. AutoModel loads any transformers model without a task-specific head, so we can use it to get the features.

from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModel.from_pretrained("google/vit-base-patch16-224").to(DEVICE)

Let's write a simple inference function. We first pass the inputs to the processor and then pass its outputs to the model.

def infer(image):
  inputs = processor(image, return_tensors="pt").to(DEVICE)
  outputs = model(**inputs)
  return outputs.pooler_output

We can pass images directly to this function and get the embeddings.

embed_real = infer(image_real)
embed_gen = infer(image_gen)

We can compute the similarity over the embeddings again.

from torch.nn.functional import cosine_similarity

similarity_score = cosine_similarity(embed_real, embed_gen, dim=1)
print(similarity_score)

# tensor([0.6061], device='cuda:0', grad_fn=<SumBackward1>)
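
The same pairwise comparison extends to image retrieval: score a query embedding against every embedding in a collection and sort by similarity. The following is a minimal sketch in plain Python, assuming the embeddings have already been converted to lists of floats. The function name `rank_by_similarity` and the toy vectors are hypothetical.

```python
import math

def rank_by_similarity(query, gallery):
    """Return (index, score) pairs sorted from most to least similar."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        if norm == 0.0:
            raise ValueError("cosine similarity is undefined for a zero vector")
        return dot / norm

    scores = [(i, cosine(query, emb)) for i, emb in enumerate(gallery)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy 2-dimensional "embeddings" (hypothetical values).
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.1]
ranking = rank_by_similarity(query, gallery)
print([index for index, _ in ranking])  # [0, 2, 1]
```

To feed it the tensors returned by `infer`, one option is to convert each with `embed.detach().cpu().squeeze(0).tolist()` first.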
