OWL-ViT

Overview

OWL-ViT (short for Vision Transformer for Open-World Localization) was proposed in Simple Open-Vocabulary Object Detection with Vision Transformers by Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. OWL-ViT is an open-vocabulary object detection network trained on a variety of (image, text) pairs. It can be used to query an image with one or more text queries to search for and detect the target objects described in the text.

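The abstract from the paper is the following:

Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.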

OWL-ViT architecture. Taken from the original paper.

This model was contributed by adirik. The original code can be found here.

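Usage tips

OWL-ViT is a zero-shot text-conditioned object detection model. It uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and then fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection.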
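The open-vocabulary classification step described above can be sketched in a few lines of plain Python. This is only an illustration, not the library's implementation: the embeddings are made-up toy values, the helper names `l2_normalize` and `classify_tokens` are hypothetical, and any learned projection or rescaling applied by the real classification head is left out.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def classify_tokens(token_embeds, query_embeds):
    """Return one similarity score per (image token, text query) pair."""
    tokens = [l2_normalize(t) for t in token_embeds]
    queries = [l2_normalize(q) for q in query_embeds]
    return [[sum(a * b for a, b in zip(t, q)) for q in queries] for t in tokens]


# Toy 3-dimensional embeddings, invented for illustration only.
token_embeds = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]  # two image output tokens
query_embeds = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # two text queries

logits = classify_tokens(token_embeds, query_embeds)
# For each image token, pick the text query with the highest similarity.
best_query_per_token = [row.index(max(row)) for row in logits]
print(best_query_per_token)  # [0, 1]
```

Because the text embeddings play the role of classifier weights, changing the set of detectable classes only requires encoding a different list of queries.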

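OwlViTImageProcessor can be used to resize (or rescale) and normalize images for the model, and CLIPTokenizer is used to encode the text. OwlViTProcessor wraps OwlViTImageProcessor and CLIPTokenizer into a single instance to both encode the text and prepare the images. The following example shows how to perform object detection using OwlViTProcessor and OwlViTForObjectDetection.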

>>> import requests
>>> from PIL import Image
>>> import torch

>>> from transformers import OwlViTProcessor, OwlViTForObjectDetection

>>> processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
>>> model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> text_labels = [["a photo of a cat", "a photo of a dog"]]
>>> inputs = processor(text=text_labels, images=image, return_tensors="pt")
>>> outputs = model(**inputs)

>>> # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
>>> target_sizes = torch.tensor([(image.height, image.width)])
>>> # Convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
>>> results = processor.post_process_grounded_object_detection(
...     outputs=outputs, target_sizes=target_sizes, threshold=0.1, text_labels=text_labels
... )
>>> # Retrieve predictions for the first image for the corresponding text queries
>>> result = results[0]
>>> boxes, scores, text_labels = result["boxes"], result["scores"], result["text_labels"]
>>> for box, score, text_label in zip(boxes, scores, text_labels):
...     box = [round(i, 2) for i in box.tolist()]
...     print(f"Detected {text_label} with confidence {round(score.item(), 3)} at location {box}")
Detected a photo of a cat with confidence 0.707 at location [324.97, 20.44, 640.58, 373.29]
Detected a photo of a cat with confidence 0.717 at location [1.46, 55.26, 315.55, 472.17]
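The post-processing call above rescales the predicted boxes to the original image size and returns them in Pascal VOC corner format. A rough sketch of that kind of conversion follows. It assumes the raw predictions are normalized `(center_x, center_y, width, height)` boxes. Both that input format and the helper name `to_pascal_voc` are assumptions made for illustration.

```python
def to_pascal_voc(box, image_height, image_width):
    """Convert a normalized (cx, cy, w, h) box to absolute (xmin, ymin, xmax, ymax)."""
    cx, cy, w, h = box
    xmin = (cx - w / 2) * image_width
    ymin = (cy - h / 2) * image_height
    xmax = (cx + w / 2) * image_width
    ymax = (cy + h / 2) * image_height
    return [round(v, 2) for v in (xmin, ymin, xmax, ymax)]


# A box centred in a 480x640 (height x width) image, spanning half of each side.
converted = to_pascal_voc((0.5, 0.5, 0.5, 0.5), image_height=480, image_width=640)
print(converted)  # [160.0, 120.0, 480.0, 360.0]
```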

Resources

A demo notebook on using OWL-ViT for zero-shot and one-shot (image-guided) object detection can be found here.

OwlViTConfig

class transformers.OwlViTConfig

( text_config = None, vision_config = None, projection_dim = 512, logit_scale_init_value = 2.6592, return_dict = True, **kwargs )

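Parameters

text_config (dict, optional) — Dictionary of configuration options used to initialize OwlViTTextConfig.
vision_config (dict, optional) — Dictionary of configuration options used to initialize OwlViTVisionConfig.
projection_dim (int, optional, defaults to 512) — Dimensionality of the text and vision projection layers.
logit_scale_init_value (float, optional, defaults to 2.6592) — The initial value of the logit_scale parameter. The default follows the original OWL-ViT implementation.
return_dict (bool, optional, defaults to True) — Whether or not the model should return a dictionary. If False, a tuple is returned.
kwargs (optional) — Dictionary of keyword arguments.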
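The default of 2.6592 for logit_scale_init_value is easier to read once it is exponentiated. The sketch below assumes the CLIP convention, in which the stored value is the natural log of the multiplier applied to similarity scores and the initial temperature is 0.07. Neither assumption is stated in the parameter description above.

```python
import math

logit_scale_init_value = 2.6592  # default listed for OwlViTConfig
assumed_temperature = 0.07       # assumption: CLIP-style initial temperature

# Under the assumed convention, the stored value is log(1 / temperature).
log_inverse_temperature = math.log(1 / assumed_temperature)
# The multiplier applied to similarities is the exponential of the stored value.
initial_multiplier = math.exp(logit_scale_init_value)

print(f"log(1 / 0.07) = {log_inverse_temperature:.4f}")
print(f"exp(2.6592)   = {initial_multiplier:.2f}")
```

Under these assumptions the two quantities agree to about four decimal places, which suggests the default was obtained by truncating log(1 / 0.07).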

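OwlViTConfig is the configuration class used to store the configuration of an OwlViTModel. It is used to instantiate an OWL-ViT model according to the specified arguments, defining the text model and vision model configurations. Instantiating a configuration with the defaults yields a configuration similar to that of the OWL-ViT google/owlvit-base-patch32 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the PretrainedConfig documentation for more information.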
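As a short illustration of the signature above, the following sketch builds an OwlViTConfig from two plain dictionaries. The override values (6 hidden layers, patch size 16) are arbitrary choices for the example, not recommended settings, and it assumes the nested configurations are exposed as the `text_config` and `vision_config` attributes.

```python
from transformers import OwlViTConfig

# Override a couple of sub-config options. Everything else keeps its default.
config = OwlViTConfig(
    text_config={"num_hidden_layers": 6},  # arbitrary value for illustration
    vision_config={"patch_size": 16},      # arbitrary value for illustration
    projection_dim=512,
)

# The dictionaries are turned into OwlViTTextConfig / OwlViTVisionConfig objects.
print(config.text_config.num_hidden_layers)
print(config.vision_config.patch_size)
print(config.vision_config.image_size)
```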

from_text_vision_configs

( text_config: dict, vision_config: dict, **kwargs ) → OwlViTConfig

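Returns

OwlViTConfig — An instance of a configuration object.

Instantiate an OwlViTConfig (or a derived class) from an OWL-ViT text model configuration and an OWL-ViT vision model configuration.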

OwlViTTextConfig

class transformers.OwlViTTextConfig

( vocab_size = 49408, hidden_size = 512, intermediate_size = 2048, num_hidden_layers = 12, num_attention_heads = 8, max_position_embeddings = 16, hidden_act = 'quick_gelu', layer_norm_eps = 1e-05, attention_dropout = 0.0, initializer_range = 0.02, initializer_factor = 1.0, pad_token_id = 0, bos_token_id = 49406, eos_token_id = 49407, **kwargs )

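Parameters

vocab_size (int, optional, defaults to 49408) — Vocabulary size of the OWL-ViT text model. Defines the number of different tokens that can be represented by the input_ids passed when calling OwlViTTextModel.
hidden_size (int, optional, defaults to 512) — Dimensionality of the encoder layers and the pooler layer.
intermediate_size (int, optional, defaults to 2048) — Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 8) — Number of attention heads for each attention layer in the Transformer encoder.
max_position_embeddings (int, optional, defaults to 16) — The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512, 1024, or 2048).
hidden_act (str or function, optional, defaults to "quick_gelu") — The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "selu", "gelu_new", and "quick_gelu" are supported.
layer_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the layer normalization layers.
attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer used to initialize all weight matrices.
initializer_factor (float, optional, defaults to 1.0) — A factor for initializing all weight matrices (should be kept at 1; used internally for initialization testing).
pad_token_id (int, optional, defaults to 0) — The ID of the padding token in the input sequences.
bos_token_id (int, optional, defaults to 49406) — The ID of the beginning-of-sequence token in the input sequences.
eos_token_id (int, optional, defaults to 49407) — The ID of the end-of-sequence token in the input sequences.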

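This is the configuration class used to store the configuration of an OwlViTTextModel. It is used to instantiate an OWL-ViT text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of the OWL-ViT google/owlvit-base-patch32 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the PretrainedConfig documentation for more information.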
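The default hidden_act, "quick_gelu", may be less familiar than the other listed activations. The sketch below compares it with the exact GELU, assuming the common definition of quick GELU as `x * sigmoid(1.702 * x)`. The function names are illustrative and this is not the library's own code.

```python
import math


def quick_gelu(x):
    """Sigmoid-based GELU approximation: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


def gelu(x):
    """Exact GELU, written with the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Print both activations side by side for a few sample inputs.
for x in (-2.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  quick_gelu={quick_gelu(x):.4f}  gelu={gelu(x):.4f}")
```

The two curves stay close over this range, so the quick variant trades a small approximation error for a cheaper computation.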

Example

>>> from transformers import OwlViTTextConfig, OwlViTTextModel

>>> # Initializing an OwlViTTextConfig with google/owlvit-base-patch32 style configuration
>>> configuration = OwlViTTextConfig()

>>> # Initializing an OwlViTTextModel (with random weights) from the google/owlvit-base-patch32 style configuration
>>> model = OwlViTTextModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

OwlViTVisionConfig

class transformers.OwlViTVisionConfig

( hidden_size = 768, intermediate_size = 3072, num_hidden_layers = 12, num_attention_heads = 12, num_channels = 3, image_size = 768, patch_size = 32, hidden_act = 'quick_gelu', layer_norm_eps = 1e-05, attention_dropout = 0.0, initializer_range = 0.02, initializer_factor = 1.0, **kwargs )

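Parameters

hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
intermediate_size (int, optional, defaults to 3072) — Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
num_channels (int, optional, defaults to 3) — Number of channels in the input images.
image_size (int, optional, defaults to 768) — The size (resolution) of each image.
patch_size (int, optional, defaults to 32) — The size (resolution) of each patch.
hidden_act (str or function, optional, defaults to "quick_gelu") — The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "selu", "gelu_new", and "quick_gelu" are supported.
layer_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the layer normalization layers.
attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer used to initialize all weight matrices.
initializer_factor (float, optional, defaults to 1.0) — A factor for initializing all weight matrices (should be kept at 1; used internally for initialization testing).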

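This is the configuration class used to store the configuration of an OwlViTVisionModel. It is used to instantiate an OWL-ViT image encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of the OWL-ViT google/owlvit-base-patch32 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the PretrainedConfig documentation for more information.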
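Together, image_size and patch_size determine how many patch tokens the vision encoder produces. Since the detection heads are attached to every output token, that count also bounds the number of candidate boxes per image. A small worked example with the default values, assuming square images split into non-overlapping square patches:

```python
image_size = 768  # default OwlViTVisionConfig.image_size
patch_size = 32   # default OwlViTVisionConfig.patch_size

# Number of patches along one side of the image, then over the whole grid.
patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2

print(patches_per_side, num_patches)  # 24 576
```

Halving patch_size at the same image_size would quadruple the token count, which is why smaller patches cost noticeably more compute.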

Example

>>> from transformers import OwlViTVisionConfig, OwlViTVisionModel

>>> # Initializing an OwlViTVisionConfig with google/owlvit-base-patch32 style configuration
>>> configuration = OwlViTVisionConfig()

>>> # Initializing an OwlViTVisionModel (with random weights) from the google/owlvit-base-patch32 style configuration
>>> model = OwlViTVisionModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
