ViTPose

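Overview

The ViTPose model was proposed in "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation" by Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. ViTPose employs a standard, non-hierarchical Vision Transformer as the backbone for keypoint estimation. A simple decoder head is added on top to predict heatmaps from a given image. Despite its simplicity, the model achieves state-of-the-art results on the challenging MS COCO Keypoint Detection benchmark. The model was further improved in "ViTPose++: Vision Transformer for Generic Body Pose Estimation", where the authors employ a mixture-of-experts (MoE) module in the ViT backbone and pre-train on more data, which further improves performance.

The abstract from the paper is the following:

Although no specific domain knowledge is considered in the design, plain vision transformers have shown excellent performance in visual recognition tasks. However, little effort has been made to reveal the potential of such simple structures for pose estimation tasks. In this paper, we show the surprisingly good capabilities of plain vision transformers for pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model called ViTPose. Specifically, ViTPose employs plain and non-hierarchical vision transformers as backbones to extract features for a given person instance and a lightweight decoder for pose estimation. It can be scaled up from 100M to 1B parameters by taking advantage of the scalable model capacity and high parallelism of transformers, setting a new Pareto front between throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, pre-training and fine-tuning strategy, as well as dealing with multiple pose tasks. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our basic ViTPose model outperforms representative methods on the challenging MS COCO Keypoint Detection benchmark, while the largest model sets a new state-of-the-art.

ViTPose architecture. Taken from the original paper.

This model was contributed by nielsr and sangbumchoi. The original code can be found here.

Usage Tips

ViTPose is a so-called top-down keypoint detection model. This means that one first uses an object detector, such as RT-DETR, to detect people (or other instances) in an image. ViTPose then takes the cropped images as input and predicts the keypoints for each of them.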

import torch
import requests
import numpy as np

from PIL import Image

from transformers import AutoProcessor, RTDetrForObjectDetection, VitPoseForPoseEstimation

device = "cuda" if torch.cuda.is_available() else "cpu"

url = "http://images.cocodataset.org/val2017/000000000139.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# ------------------------------------------------------------------------
# Stage 1. Detect humans on the image
# ------------------------------------------------------------------------

# You can choose any detector of your choice
person_image_processor = AutoProcessor.from_pretrained("PekingU/rtdetr_r50vd_coco_o365")
person_model = RTDetrForObjectDetection.from_pretrained("PekingU/rtdetr_r50vd_coco_o365", device_map=device)

inputs = person_image_processor(images=image, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = person_model(**inputs)

results = person_image_processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([(image.height, image.width)]), threshold=0.3
)
result = results[0]  # take first image results

# In the COCO dataset, the person class has label index 0
person_boxes = result["boxes"][result["labels"] == 0]
person_boxes = person_boxes.cpu().numpy()

# Convert boxes from VOC (x1, y1, x2, y2) to COCO (x1, y1, w, h) format
person_boxes[:, 2] = person_boxes[:, 2] - person_boxes[:, 0]
person_boxes[:, 3] = person_boxes[:, 3] - person_boxes[:, 1]

# ------------------------------------------------------------------------
# Stage 2. Detect keypoints for each person found
# ------------------------------------------------------------------------

image_processor = AutoProcessor.from_pretrained("usyd-community/vitpose-base-simple")
model = VitPoseForPoseEstimation.from_pretrained("usyd-community/vitpose-base-simple", device_map=device)

inputs = image_processor(image, boxes=[person_boxes], return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

pose_results = image_processor.post_process_pose_estimation(outputs, boxes=[person_boxes])
image_pose_result = pose_results[0]  # results for first image
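
The box conversion in Stage 1 overwrites the detector output in place. As a minimal sketch of the same arithmetic, the helper below (a hypothetical name, not part of Transformers) converts VOC-style boxes to COCO-style boxes and returns a new list. It uses plain Python lists rather than the NumPy array used above, so it runs without any dependencies.

```python
def xyxy_to_xywh(boxes):
    """Convert [x1, y1, x2, y2] boxes to [x, y, width, height] without mutating the input."""
    return [[x1, y1, x2 - x1, y2 - y1] for x1, y1, x2, y2 in boxes]


# Two made-up boxes in VOC (corner) format
voc_boxes = [[10.0, 20.0, 50.0, 80.0], [0.0, 0.0, 5.0, 7.5]]
coco_boxes = xyxy_to_xywh(voc_boxes)
print(coco_boxes)
```

Because the input is left untouched, the original corner-format boxes remain available, for example for drawing the detections.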

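ViTPose++ models

The best checkpoints are those from the ViTPose++ paper. ViTPose++ models employ a so-called Mixture-of-Experts (MoE) architecture in the ViT backbone, resulting in better performance.

The ViTPose+ checkpoints use 6 experts, hence 6 different dataset indices can be passed. An overview of the dataset indices is provided below:

0: COCO validation 2017 dataset, using an object detector that gets 56 AP on the "person" class
1: AiC dataset
2: MPII dataset
3: AP-10K dataset
4: APT-36K dataset
5: COCO-WholeBody dataset

Pass the dataset_index argument in the forward pass of the model to indicate which experts to use for each example in the batch. Example usage is shown below: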

image_processor = AutoProcessor.from_pretrained("usyd-community/vitpose-plus-base")
model = VitPoseForPoseEstimation.from_pretrained("usyd-community/vitpose-plus-base", device_map=device)

inputs = image_processor(image, boxes=[person_boxes], return_tensors="pt").to(device)

dataset_index = torch.tensor([0], device=device) # must be a tensor of shape (batch_size,)

with torch.no_grad():
    outputs = model(**inputs, dataset_index=dataset_index)
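
Since dataset_index needs one entry per sample in the batch, it can help to build it from a readable dataset name. The sketch below is only an illustration: the index order is copied from the list above, while the short names and the helper function are hypothetical and not part of the library.

```python
# Index order copied from the overview above; the short names are hypothetical labels.
DATASET_TO_INDEX = {
    "coco": 0,
    "aic": 1,
    "mpii": 2,
    "ap10k": 3,
    "apt36k": 4,
    "coco_wholebody": 5,
}


def build_dataset_index(dataset_name, batch_size):
    """Return one expert index per sample in the batch, as a plain Python list."""
    if dataset_name not in DATASET_TO_INDEX:
        raise ValueError(f"Unknown dataset name: {dataset_name!r}")
    return [DATASET_TO_INDEX[dataset_name]] * batch_size


indices = build_dataset_index("mpii", batch_size=3)
print(indices)
```

The resulting list can then be wrapped as torch.tensor(indices, device=device) before it is passed as dataset_index.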

Visualization

To visualize the keypoints, one can use the [supervision library](https://github.com/roboflow/supervision) (requires pip install supervision):

import supervision as sv

xy = torch.stack([pose_result['keypoints'] for pose_result in image_pose_result]).cpu().numpy()
scores = torch.stack([pose_result['scores'] for pose_result in image_pose_result]).cpu().numpy()

key_points = sv.KeyPoints(
    xy=xy, confidence=scores
)

edge_annotator = sv.EdgeAnnotator(
    color=sv.Color.GREEN,
    thickness=1
)
vertex_annotator = sv.VertexAnnotator(
    color=sv.Color.RED,
    radius=2
)
annotated_frame = edge_annotator.annotate(
    scene=image.copy(),
    key_points=key_points
)
annotated_frame = vertex_annotator.annotate(
    scene=annotated_frame,
    key_points=key_points
)

Alternatively, one can visualize the keypoints using OpenCV (requires pip install opencv-python):

import math
import cv2

def draw_points(image, keypoints, scores, pose_keypoint_color, keypoint_score_threshold, radius, show_keypoint_weight):
    if pose_keypoint_color is not None:
        assert len(pose_keypoint_color) == len(keypoints)
    for kid, (kpt, kpt_score) in enumerate(zip(keypoints, scores)):
        x_coord, y_coord = int(kpt[0]), int(kpt[1])
        if kpt_score > keypoint_score_threshold:
            color = tuple(int(c) for c in pose_keypoint_color[kid])
            if show_keypoint_weight:
                cv2.circle(image, (int(x_coord), int(y_coord)), radius, color, -1)
                transparency = max(0, min(1, kpt_score))
                cv2.addWeighted(image, transparency, image, 1 - transparency, 0, dst=image)
            else:
                cv2.circle(image, (int(x_coord), int(y_coord)), radius, color, -1)

def draw_links(image, keypoints, scores, keypoint_edges, link_colors, keypoint_score_threshold, thickness, show_keypoint_weight, stick_width = 2):
    height, width, _ = image.shape
    if keypoint_edges is not None and link_colors is not None:
        assert len(link_colors) == len(keypoint_edges)
        for sk_id, sk in enumerate(keypoint_edges):
            x1, y1, score1 = (int(keypoints[sk[0], 0]), int(keypoints[sk[0], 1]), scores[sk[0]])
            x2, y2, score2 = (int(keypoints[sk[1], 0]), int(keypoints[sk[1], 1]), scores[sk[1]])
            if (
                x1 > 0
                and x1 < width
                and y1 > 0
                and y1 < height
                and x2 > 0
                and x2 < width
                and y2 > 0
                and y2 < height
                and score1 > keypoint_score_threshold
                and score2 > keypoint_score_threshold
            ):
                color = tuple(int(c) for c in link_colors[sk_id])
                if show_keypoint_weight:
                    X = (x1, x2)
                    Y = (y1, y2)
                    mean_x = np.mean(X)
                    mean_y = np.mean(Y)
                    length = ((Y[0] - Y[1]) ** 2 + (X[0] - X[1]) ** 2) ** 0.5
                    angle = math.degrees(math.atan2(Y[0] - Y[1], X[0] - X[1]))
                    polygon = cv2.ellipse2Poly(
                        (int(mean_x), int(mean_y)), (int(length / 2), int(stick_width)), int(angle), 0, 360, 1
                    )
                    cv2.fillConvexPoly(image, polygon, color)
                    # Blend by the mean confidence of the two endpoints (keypoints only holds x, y)
                    transparency = max(0, min(1, 0.5 * (score1 + score2)))
                    cv2.addWeighted(image, transparency, image, 1 - transparency, 0, dst=image)
                else:
                    cv2.line(image, (x1, y1), (x2, y2), color, thickness=thickness)


# Note: keypoint_edges and color palette are dataset-specific
keypoint_edges = model.config.edges

palette = np.array(
    [
        [255, 128, 0],
        [255, 153, 51],
        [255, 178, 102],
        [230, 230, 0],
        [255, 153, 255],
        [153, 204, 255],
        [255, 102, 255],
        [255, 51, 255],
        [102, 178, 255],
        [51, 153, 255],
        [255, 153, 153],
        [255, 102, 102],
        [255, 51, 51],
        [153, 255, 153],
        [102, 255, 102],
        [51, 255, 51],
        [0, 255, 0],
        [0, 0, 255],
        [255, 0, 0],
        [255, 255, 255],
    ]
)

link_colors = palette[[0, 0, 0, 0, 7, 7, 7, 9, 9, 9, 9, 9, 16, 16, 16, 16, 16, 16, 16]]
keypoint_colors = palette[[16, 16, 16, 16, 16, 9, 9, 9, 9, 9, 9, 0, 0, 0, 0, 0, 0]]

numpy_image = np.array(image)

for pose_result in image_pose_result:
    scores = np.array(pose_result["scores"])
    keypoints = np.array(pose_result["keypoints"])

    # draw each point on image
    draw_points(numpy_image, keypoints, scores, keypoint_colors, keypoint_score_threshold=0.3, radius=4, show_keypoint_weight=False)

    # draw links
    draw_links(numpy_image, keypoints, scores, keypoint_edges, link_colors, keypoint_score_threshold=0.3, thickness=1, show_keypoint_weight=False)

pose_image = Image.fromarray(numpy_image)
pose_image
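
draw_links above only draws an edge when both endpoints lie strictly inside the image and both scores exceed the threshold. As a small sketch of that condition on its own, the hypothetical helper below isolates the check so it can be tested without OpenCV. The name and signature are illustrative only.

```python
def edge_is_drawable(p1, p2, score1, score2, width, height, threshold=0.3):
    """Return True when both endpoints are strictly inside the image and confident enough."""

    def inside(point):
        x, y = point
        return 0 < x < width and 0 < y < height

    return inside(p1) and inside(p2) and score1 > threshold and score2 > threshold


# Both endpoints inside a 100x100 image, both scores above the threshold
visible = edge_is_drawable((10, 10), (20, 20), 0.9, 0.8, width=100, height=100)
# Second endpoint lies outside the image
clipped = edge_is_drawable((10, 10), (120, 20), 0.9, 0.8, width=100, height=100)
# First endpoint has a low confidence score
uncertain = edge_is_drawable((10, 10), (20, 20), 0.1, 0.8, width=100, height=100)
print(visible, clipped, uncertain)
```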

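Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with ViTPose. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

A demo of ViTPose on images and video can be found here.
A notebook illustrating inference and visualization can be found here.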

VitPoseImageProcessor

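class transformers.VitPoseImageProcessor

( do_affine_transform: bool = True size: typing.Optional[dict[str, int]] = None do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None **kwargs )

Parameters

do_affine_transform (bool, optional, defaults to True): Whether to apply an affine transformation to the input images.
size (dict[str, int], optional, defaults to {"height": 256, "width": 192}): Resolution of the image after affine_transform is applied. Only has an effect if do_affine_transform is set to True. Can be overridden by size in the preprocess method.
do_rescale (bool, optional, defaults to True): Whether to apply the scaling factor (to convert pixel values to floats between 0.0 and 1.0).
rescale_factor (int or float, optional, defaults to 1/255): Scale factor to use if rescaling the image. Can be overridden by rescale_factor in the preprocess method.
do_normalize (bool, optional, defaults to True): Whether to normalize the input with mean and standard deviation.
image_mean (float or list[float], optional, defaults to [0.485, 0.456, 0.406]): The sequence of means for each channel, to be used when normalizing images.
image_std (float or list[float], optional, defaults to [0.229, 0.224, 0.225]): The sequence of standard deviations for each channel, to be used when normalizing images.

Constructs a VitPose image processor.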
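
To make the rescaling and normalization parameters above concrete, here is a minimal sketch of the per-pixel arithmetic, assuming rescaling is applied before normalization (the usual order, but an assumption here). The helper name is hypothetical and the code does not call the image processor itself.

```python
import math

# Default values taken from the parameter list above
IMAGE_MEAN = [0.485, 0.456, 0.406]
IMAGE_STD = [0.229, 0.224, 0.225]
RESCALE_FACTOR = 1 / 255


def rescale_and_normalize(pixel, rescale_factor=RESCALE_FACTOR, mean=IMAGE_MEAN, std=IMAGE_STD):
    """Rescale one RGB pixel, then apply (value - mean) / std per channel."""
    return [(channel * rescale_factor - m) / s for channel, m, s in zip(pixel, mean, std)]


white = rescale_and_normalize([255, 255, 255])
black = rescale_and_normalize([0, 0, 0])
print(white, black)
```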

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] boxes: typing.Union[list[list[float]], numpy.ndarray] do_affine_transform: typing.Optional[bool] = None size: typing.Optional[dict[str, int]] = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Union[str, transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None ) → BatchFeature

Parameters

images (ImageInput): Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
boxes (list[list[list[float]]] or np.ndarray): List or array of bounding boxes for each image. Each box should be a list of 4 floats representing the bounding box coordinates in COCO format (top_left_x, top_left_y, width, height).
do_affine_transform (bool, optional, defaults to self.do_affine_transform): Whether to apply an affine transformation to the input images.
size (dict[str, int], optional, defaults to self.size): Dictionary in the format {"height": h, "width": w} specifying the size of the output image after resizing.
do_rescale (bool, optional, defaults to self.do_rescale): Whether to rescale the image values to the range [0, 1].
rescale_factor (float, optional, defaults to self.rescale_factor): Rescale factor to rescale the image by if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize): Whether to normalize the image.
image_mean (float or list[float], optional, defaults to self.image_mean): Image mean to use if do_normalize is set to True.
image_std (float or list[float], optional, defaults to self.image_std): Image standard deviation to use if do_normalize is set to True.
return_tensors (str or TensorType, optional, defaults to 'np'): If set, will return tensors of a particular framework. Acceptable values are:
- 'tf': Return TensorFlow tf.constant objects.
- 'pt': Return PyTorch torch.Tensor objects.
- 'np': Return NumPy np.ndarray objects.
- 'jax': Return JAX jnp.ndarray objects.

BatchFeature

A BatchFeature with the following fields:

pixel_values: Pixel values to be fed to a model, of shape (batch_size, num_channels, height, width).

Preprocess an image or a batch of images.

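post_process_pose_estimation

( outputs: VitPoseEstimatorOutput boxes: typing.Union[list[list[list[float]]], numpy.ndarray] kernel_size: int = 11 threshold: typing.Optional[float] = None target_sizes: typing.Union[transformers.utils.generic.TensorType, list[tuple]] = None ) → list[list[Dict]]

Parameters

outputs (VitPoseEstimatorOutput): VitPoseForPoseEstimation model outputs.
boxes (list[list[list[float]]] or np.ndarray): List or array of bounding boxes for each image. Each box should be a list of 4 floats representing the bounding box coordinates in COCO format (top_left_x, top_left_y, width, height).
kernel_size (int, optional, defaults to 11): Gaussian kernel size (K) for modulation.
threshold (float, optional, defaults to None): Score threshold to keep object detection predictions.
target_sizes (torch.Tensor or list[tuple[int, int]], optional): Tensor of shape (batch_size, 2) or list of tuples containing the target size (height, width) of each image in the batch. If unset, predictions will be resized with the default value.

list[list[Dict]]

A list of dictionaries, each dictionary containing the keypoints and boxes for an image in the batch as predicted by the model.

Transform the heatmaps into keypoint predictions and transform them back to the image coordinate system.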
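
The core idea of heatmap decoding can be sketched in a few lines: take the location of the heatmap maximum and map it back into the bounding box in image coordinates. The helper below is a deliberately simplified illustration under that assumption. It is not the library's implementation, which among other things applies the Gaussian-kernel modulation controlled by kernel_size. The function name is hypothetical.

```python
def heatmap_peak_to_image_xy(heatmap, box):
    """Map the argmax of a 2D heatmap (list of rows) into a COCO-format box (x, y, w, h)."""
    heat_h, heat_w = len(heatmap), len(heatmap[0])
    # Find the highest-scoring cell together with its row and column
    best_score, best_row, best_col = max(
        (value, row, col)
        for row, values in enumerate(heatmap)
        for col, value in enumerate(values)
    )
    box_x, box_y, box_w, box_h = box
    # Linear mapping from heatmap cells to image pixels inside the box
    x = box_x + best_col * box_w / heat_w
    y = box_y + best_row * box_h / heat_h
    return x, y, best_score


# A made-up 4x3 heatmap for a single keypoint, with its peak at row 2, column 1
heatmap = [
    [0.0, 0.1, 0.0],
    [0.0, 0.2, 0.1],
    [0.1, 0.9, 0.2],
    [0.0, 0.1, 0.0],
]
keypoint = heatmap_peak_to_image_xy(heatmap, box=(10, 20, 30, 40))
print(keypoint)
```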

VitPoseConfig

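class transformers.VitPoseConfig

( backbone_config: typing.Optional[transformers.configuration_utils.PretrainedConfig] = None backbone: typing.Optional[str] = None use_pretrained_backbone: bool = False use_timm_backbone: bool = False backbone_kwargs: typing.Optional[dict] = None initializer_range: float = 0.02 scale_factor: int = 4 use_simple_decoder: bool = True **kwargs )

Parameters

backbone_config (PretrainedConfig or dict, optional, defaults to VitPoseBackboneConfig()): The configuration of the backbone model. Currently, only a `backbone_config` with `vitpose_backbone` as `model_type` is supported.
backbone (str, optional): Name of the backbone to use when `backbone_config` is `None`. If `use_pretrained_backbone` is `True`, this will load the corresponding pretrained weights from the timm or transformers library. If `use_pretrained_backbone` is `False`, this loads the backbone's config and uses that to initialize the backbone with random weights.
use_pretrained_backbone (bool, optional, defaults to False): Whether to use pretrained weights for the backbone.
use_timm_backbone (bool, optional, defaults to False): Whether to load `backbone` from the timm library. If `False`, the backbone is loaded from the transformers library.
backbone_kwargs (dict, optional): Keyword arguments to be passed to AutoBackbone when loading from a checkpoint, e.g. `{'out_indices': (0, 1, 2, 3)}`. Cannot be specified if `backbone_config` is set.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
scale_factor (int, optional, defaults to 4): Factor to upscale the feature maps coming from the ViT backbone.
use_simple_decoder (bool, optional, defaults to True): Whether to use a `VitPoseSimpleDecoder` to decode the feature maps from the backbone into heatmaps. Otherwise it uses `VitPoseClassicDecoder`.

This is the configuration class to store the configuration of a VitPoseForPoseEstimation. It is used to instantiate a VitPose model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the VitPose usyd-community/vitpose-base-simple architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.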
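
As a rough illustration of what scale_factor means for the output resolution, the sketch below estimates the heatmap size from the input size. It assumes a backbone patch size of 16, which is not stated on this page, and the helper name is hypothetical. Treat the result as an estimate under those assumptions rather than a guarantee about any particular checkpoint.

```python
def heatmap_size(image_size, patch_size=16, scale_factor=4):
    """Estimate the heatmap (height, width) for a given input (height, width).

    patch_size=16 is an assumption; scale_factor=4 is the documented default.
    """
    height, width = image_size
    # The backbone yields one feature per patch; the decoder upscales by scale_factor
    return (height // patch_size * scale_factor, width // patch_size * scale_factor)


# Default processor resolution from this page: height 256, width 192
estimated = heatmap_size((256, 192))
print(estimated)
```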

Example:

>>> from transformers import VitPoseConfig, VitPoseForPoseEstimation

>>> # Initializing a VitPose configuration
>>> configuration = VitPoseConfig()

>>> # Initializing a model (with random weights) from the configuration
>>> model = VitPoseForPoseEstimation(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

VitPoseForPoseEstimation

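class transformers.VitPoseForPoseEstimation

( config: VitPoseConfig )

Parameters

config (VitPoseConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The VitPose model with a pose estimation head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.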

forward

( pixel_values: Tensor dataset_index: typing.Optional[torch.Tensor] = None flip_pairs: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → `transformers.models.vitpose.modeling_vitpose.VitPoseEstimatorOutput` or `tuple(torch.FloatTensor)`

Parameters

pixel_values (`torch.Tensor` of shape `(batch_size, num_channels, image_size, image_size)`): The tensors corresponding to the input images. Pixel values can be obtained using VitPoseImageProcessor. See `VitPoseImageProcessor.__call__` for details.
dataset_index (`torch.Tensor` of shape `(batch_size,)`): Index to use in the Mixture-of-Experts (MoE) blocks of the backbone. This corresponds to the dataset index used during training. For a single dataset, index 0 refers to that dataset. For multiple datasets, index 0 refers to dataset A (e.g. MPII) and index 1 refers to dataset B (e.g. CrowdPose).
flip_pairs (`torch.Tensor`, optional): Whether to mirror pairs of keypoints (for example, left ear and right ear).
labels (`torch.Tensor`, optional): Labels for computing the loss. Loss computation is currently not supported for this model, as noted for `loss` in the return values.
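output_attentions (bool, optional): Whether to return the attention tensors of all attention layers. See `attentions` under returned tensors for more detail.
output_hidden_states (bool, optional): Whether to return the hidden states of all layers. See `hidden_states` under returned tensors for more detail.
return_dict (bool, optional): Whether to return a ModelOutput instead of a plain tuple.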

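`transformers.models.vitpose.modeling_vitpose.VitPoseEstimatorOutput` or `tuple(torch.FloatTensor)`

A `transformers.models.vitpose.modeling_vitpose.VitPoseEstimatorOutput` or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (VitPoseConfig) and inputs.

loss (`torch.FloatTensor` of shape `(1,)`, optional, returned when `labels` is provided): Loss is not supported at this moment. See https://github.com/ViTAE-Transformer/ViTPose/tree/main/mmpose/models/losses for further detail.
heatmaps (`torch.FloatTensor` of shape `(batch_size, num_keypoints, height, width)`): Heatmaps as predicted by the model.
hidden_states (tuple(torch.FloatTensor), optional, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each stage) of shape `(batch_size, sequence_length, hidden_size)`. Hidden states (also called feature maps) of the model at the output of each stage.
attentions (`tuple[torch.FloatTensor, ...]`, optional, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The VitPoseForPoseEstimation forward method overrides the `__call__` special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example: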

>>> from transformers import AutoImageProcessor, VitPoseForPoseEstimation
>>> import torch
>>> from PIL import Image
>>> import requests

>>> processor = AutoImageProcessor.from_pretrained("usyd-community/vitpose-base-simple")
>>> model = VitPoseForPoseEstimation.from_pretrained("usyd-community/vitpose-base-simple")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> boxes = [[[412.8, 157.61, 53.05, 138.01], [384.43, 172.21, 15.12, 35.74]]]
>>> inputs = processor(image, boxes=boxes, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)
>>> heatmaps = outputs.heatmaps
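
The flip_pairs argument refers to left/right keypoint pairs that swap roles when an image is mirrored horizontally. The sketch below illustrates that general idea on plain (x, y) coordinates. It is not the model's internal handling of flip_pairs, and the helper name and the keypoint indices are hypothetical.

```python
def flip_keypoints(keypoints, flip_pairs, image_width):
    """Mirror (x, y) keypoints horizontally, then swap each left/right index pair."""
    flipped = [(image_width - 1 - x, y) for x, y in keypoints]
    for left, right in flip_pairs:
        flipped[left], flipped[right] = flipped[right], flipped[left]
    return flipped


# Hypothetical layout: index 0 = nose, 1 = left ear, 2 = right ear
keypoints = [(50, 10), (40, 12), (60, 12)]
mirrored = flip_keypoints(keypoints, flip_pairs=[(1, 2)], image_width=100)
print(mirrored)
```

After the flip, the point stored at the "left ear" index is again the ear that appears on the left-ear side of the mirrored image, which is why the indices are swapped in addition to mirroring the x coordinate.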
