VideoMAE

Overview

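The VideoMAE model was proposed in VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training by Zhan Tong, Yibing Song, Jue Wang, Limin Wang. VideoMAE extends masked autoencoders (MAE) to video, and the authors report state-of-the-art performance on several video classification benchmarks.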

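The abstract from the paper is the following:

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking and reconstruction. These simple designs turn out to be effective for overcoming information leakage caused by the temporal correlation during video reconstruction. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. This is partially ascribed to the challenging task of video reconstruction to enforce high-level structure learning. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue in SSVP. Notably, our VideoMAE with the vanilla ViT backbone can achieve 83.9% on Kinetics-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51 without using any extra data.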

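VideoMAE pre-training. Taken from the original paper.

This model was contributed by nielsr. The original code can be found here.

Using Scaled Dot Product Attention (SDPA)

PyTorch includes a native scaled dot-product attention (SDPA) operator as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation or the GPU Inference page for more information.

SDPA is used by default for `torch>=2.1.1` when an implementation is available, but you may also set `attn_implementation="sdpa"` in `from_pretrained()` to explicitly request SDPA.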

import torch
from transformers import VideoMAEForVideoClassification

# Explicitly request SDPA and load the weights in half precision.
model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics", attn_implementation="sdpa", torch_dtype=torch.float16)
...

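For the best speedups, we recommend loading the model in half precision (e.g. `torch.float16` or `torch.bfloat16`).

On a local benchmark (A100-40GB, PyTorch 2.3.0, OS Ubuntu 22.04) with float32 and the MCG-NJU/videomae-base-finetuned-kinetics model, we saw the following speedups during inference.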

Batch size	Average inference time (ms), eager mode	Average inference time (ms), SDPA mode	Speedup, eager / SDPA (x)
1	37	10	3.7
2	24	18	1.33
4	43	32	1.34
8	84	60	1.4
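
The speedup column can be reproduced from the two timing columns. The short sketch below assumes the speedup is defined as eager time divided by SDPA time, which is the only reading consistent with the numbers in the table; the helper name `speedup` is illustrative and not part of any library.

```python
# Timings (ms) copied from the benchmark table above: batch size -> (eager, sdpa).
timings_ms = {1: (37, 10), 2: (24, 18), 4: (43, 32), 8: (84, 60)}


def speedup(eager_ms, sdpa_ms):
    """Return how many times faster SDPA is than eager attention."""
    return eager_ms / sdpa_ms


speedups = {batch: round(speedup(eager, sdpa), 2) for batch, (eager, sdpa) in timings_ms.items()}
for batch, factor in speedups.items():
    print(f"batch size {batch}: {factor}x")
```

The rounded ratios match the last column of the table, so the largest gain in this benchmark is at batch size 1.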

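Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with VideoMAE. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

Video classification

A notebook that shows how to fine-tune a VideoMAE model on a custom dataset.
Video classification task guide
A 🤗 Space showing how to perform inference with a video classification model.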

VideoMAEConfig

class transformers.VideoMAEConfig

< source >

( image_size = 224 patch_size = 16 num_channels = 3 num_frames = 16 tubelet_size = 2 hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.0 attention_probs_dropout_prob = 0.0 initializer_range = 0.02 layer_norm_eps = 1e-12 qkv_bias = True use_mean_pooling = True decoder_num_attention_heads = 6 decoder_hidden_size = 384 decoder_num_hidden_layers = 4 decoder_intermediate_size = 1536 norm_pix_loss = True **kwargs )

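Parameters

image_size (int, optional, defaults to 224): The size (resolution) of each image.
patch_size (int, optional, defaults to 16): The size (resolution) of each patch.
num_channels (int, optional, defaults to 3): The number of input channels.
num_frames (int, optional, defaults to 16): The number of frames in each video.
tubelet_size (int, optional, defaults to 2): The temporal size of each tubelet, i.e. the number of consecutive frames grouped into one tubelet.
hidden_size (int, optional, defaults to 768): Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, optional, defaults to 12): Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12): Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, optional, defaults to 3072): Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (str or function, optional, defaults to "gelu"): The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "selu" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.0): The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (float, optional, defaults to 0.0): The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated normal initializer used to initialize all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-12): The epsilon used by the layer normalization layers.
qkv_bias (bool, optional, defaults to True): Whether to add a bias to the queries, keys and values.
use_mean_pooling (bool, optional, defaults to True): Whether to mean pool the final hidden states instead of using the final hidden state of the [CLS] token.
decoder_num_attention_heads (int, optional, defaults to 6): Number of attention heads for each attention layer in the decoder.
decoder_hidden_size (int, optional, defaults to 384): Dimensionality of the decoder.
decoder_num_hidden_layers (int, optional, defaults to 4): Number of hidden layers in the decoder.
decoder_intermediate_size (int, optional, defaults to 1536): Dimensionality of the "intermediate" (i.e., feed-forward) layer in the decoder.
norm_pix_loss (bool, optional, defaults to True): Whether to normalize the target patch pixels.

This is the configuration class to store the configuration of a VideoMAEModel. It is used to instantiate a VideoMAE model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the VideoMAE MCG-NJU/videomae-base architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation of PretrainedConfig for more information.

Example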

>>> from transformers import VideoMAEConfig, VideoMAEModel

>>> # Initializing a VideoMAE videomae-base style configuration
>>> configuration = VideoMAEConfig()

>>> # Randomly initializing a model from the configuration
>>> model = VideoMAEModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
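
Several quantities used elsewhere on this page follow directly from the default configuration values. The sketch below derives them with plain arithmetic, using the sequence-length formula given in the bool_masked_pos documentation; the variable names are illustrative and nothing here calls the library.

```python
# Default values copied from the VideoMAEConfig signature documented above.
image_size, patch_size = 224, 16
num_frames, tubelet_size = 16, 2
hidden_size, num_attention_heads = 768, 12

patches_per_frame = (image_size // patch_size) ** 2   # 14 * 14 spatial patches
temporal_slots = num_frames // tubelet_size            # tubelets along the time axis
sequence_length = temporal_slots * patches_per_frame   # tokens seen by the encoder
head_dim = hidden_size // num_attention_heads          # width of each attention head

print(patches_per_frame, temporal_slots, sequence_length, head_dim)
# prints: 196 8 1568 64
```

The sequence length of 1568 and hidden size of 768 are the last two dimensions of the hidden state shape shown in the VideoMAEModel example.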

VideoMAEFeatureExtractor

class transformers.VideoMAEFeatureExtractor

< source >

( *args **kwargs )

__call__

< source >

( images **kwargs )

Preprocess an image or a batch of images.

VideoMAEImageProcessor

class transformers.VideoMAEImageProcessor

< source >

( do_resize: bool = True size: typing.Optional[dict[str, int]] = None resample: Resampling = <Resampling.BILINEAR: 2> do_center_crop: bool = True crop_size: typing.Optional[dict[str, int]] = None do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None **kwargs )

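Parameters

do_resize (bool, optional, defaults to True): Whether to resize the image's (height, width) dimensions to the specified size. Can be overridden by the do_resize parameter in the preprocess method.
size (dict[str, int], optional, defaults to {"shortest_edge": 224}): Size of the output image after resizing. The shortest edge of the image is resized to size["shortest_edge"] while maintaining the aspect ratio of the original image. Can be overridden by the size parameter in the preprocess method.
resample (PILImageResampling, optional, defaults to Resampling.BILINEAR): Resampling filter to use if resizing the image. Can be overridden by the resample parameter in the preprocess method.
do_center_crop (bool, optional, defaults to True): Whether to center crop the image to the specified crop_size. Can be overridden by the do_center_crop parameter in the preprocess method.
crop_size (dict[str, int], optional, defaults to {"height": 224, "width": 224}): Size of the image after applying the center crop. Can be overridden by the crop_size parameter in the preprocess method.
do_rescale (bool, optional, defaults to True): Whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in the preprocess method.
rescale_factor (int or float, optional, defaults to 1/255): The scale factor to use if do_rescale is set to True. Can be overridden by the rescale_factor parameter in the preprocess method.
do_normalize (bool, optional, defaults to True): Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
image_mean (float or list[float], optional, defaults to IMAGENET_STANDARD_MEAN): The mean to use if normalizing the image. This is a float or a list of floats whose length equals the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or list[float], optional, defaults to IMAGENET_STANDARD_STD): The standard deviation to use if normalizing the image. This is a float or a list of floats whose length equals the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.

Constructs a VideoMAE image processor.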
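
With the default flags, each frame goes through four steps in order: shortest-edge resize, center crop, rescale, and normalize. The sketch below works through the arithmetic of each step in plain Python. It is an illustration, not the library's implementation: the helper names are made up, the rounding in the resize step is simple truncation and may differ from the library by a pixel, and the mean and standard deviation of 0.5 are assumed values for IMAGENET_STANDARD_MEAN and IMAGENET_STANDARD_STD.

```python
def shortest_edge_size(height, width, shortest_edge=224):
    """Scale (height, width) so the shorter side equals shortest_edge.

    Rounding is plain truncation here; the library's exact rounding is not
    reproduced and may differ by a pixel.
    """
    short, long = min(height, width), max(height, width)
    new_long = int(shortest_edge * long / short)
    return (shortest_edge, new_long) if height <= width else (new_long, shortest_edge)


def center_crop_box(height, width, crop_height=224, crop_width=224):
    """Return (top, left, bottom, right) of a centered crop window."""
    top = (height - crop_height) // 2
    left = (width - crop_width) // 2
    return top, left, top + crop_height, left + crop_width


def rescale_and_normalize(pixel, mean=0.5, std=0.5, rescale_factor=1 / 255):
    """Map a 0-255 pixel value to the normalized range (mean/std of 0.5 are assumed)."""
    return (pixel * rescale_factor - mean) / std


resized = shortest_edge_size(240, 320)   # a 240x320 frame keeps its 3:4 aspect ratio
print(resized)                           # (224, 298)
print(center_crop_box(*resized))         # (0, 37, 224, 261)
print(rescale_and_normalize(0))          # -1.0
```

Under these assumptions a pixel value of 0 maps to -1.0 and a value of 255 maps to approximately 1.0, so the model sees inputs in roughly the range [-1, 1].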

preprocess

< source >

( videos: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_resize: typing.Optional[bool] = None size: typing.Optional[dict[str, int]] = None resample: Resampling = None do_center_crop: typing.Optional[bool] = None crop_size: typing.Optional[dict[str, int]] = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )

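Parameters

videos (ImageInput): Video frames to preprocess. Expects a single video (a list of frames) or a batch of videos with pixel values ranging from 0 to 255. If passing in frames with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional, defaults to self.do_resize): Whether to resize the image.
size (dict[str, int], optional, defaults to self.size): Size of the image after resizing.
resample (PILImageResampling, optional, defaults to self.resample): Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_center_crop (bool, optional, defaults to self.do_center_crop): Whether to center crop the image.
crop_size (dict[str, int], optional, defaults to self.crop_size): Size of the image after applying the center crop.
do_rescale (bool, optional, defaults to self.do_rescale): Whether to rescale the image values to the range [0, 1].
rescale_factor (float, optional, defaults to self.rescale_factor): The scale factor to use if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize): Whether to normalize the image.
image_mean (float or list[float], optional, defaults to self.image_mean): Image mean.
image_std (float or list[float], optional, defaults to self.image_std): Image standard deviation.
return_tensors (str or TensorType, optional): The type of tensors to return. Can be one of:
- Unset: returns a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': returns a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': returns a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': returns a batch of type np.ndarray.
- TensorType.JAX or 'jax': returns a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST): The channel dimension format of the output image. Can be one of:
- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: uses the channel dimension format inferred from the input image.
input_data_format (ChannelDimension or str, optional): The channel dimension format of the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.

Preprocess a video or a batch of videos.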
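
The two channel layouts named by data_format and input_data_format differ only in the order of the axes. As a small illustration, the sketch below converts a nested-list image from channels-last to channels-first; the helper `channels_last_to_first` is hypothetical and only mirrors what the layout names mean, not how the library performs the conversion.

```python
def channels_last_to_first(image):
    """Convert a nested-list image from (height, width, channels) to (channels, height, width)."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    return [[[image[y][x][c] for x in range(width)] for y in range(height)] for c in range(channels)]


# A 2x2 RGB image in channels-last layout: each innermost list is one pixel.
image_hwc = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
image_chw = channels_last_to_first(image_hwc)
print(image_chw[0])  # first channel as a height x width grid: [[1, 4], [7, 10]]
```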

VideoMAEModel

class transformers.VideoMAEModel

< source >

( config )

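Parameters

config (VideoMAEConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare VideoMAE model outputting raw hidden states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward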

< source >

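( pixel_values: FloatTensor bool_masked_pos: typing.Optional[torch.BoolTensor] = None head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_frames, num_channels, height, width)): The tensors corresponding to the input video frames. Pixel values can be obtained using VideoMAEImageProcessor. See VideoMAEImageProcessor.__call__ for details.
bool_masked_pos (torch.BoolTensor of shape (batch_size, sequence_length), optional): Boolean masked positions. Indicates which patches are masked (1) and which are not (0). Each video in the batch must have the same number of masked patches. If None, all patches are considered. The sequence length is (num_frames // tubelet_size) * (image_size // patch_size) ** 2.
head_mask (torch.Tensor of shape (num_heads,) or (num_layers, num_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional): Whether to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional): Whether to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional): Whether to return a ModelOutput instead of a plain tuple.

Returns

transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VideoMAEConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)): Sequence of hidden states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The VideoMAEModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example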

>>> import av
>>> import numpy as np

>>> from transformers import AutoImageProcessor, VideoMAEModel
>>> from huggingface_hub import hf_hub_download

>>> np.random.seed(0)


>>> def read_video_pyav(container, indices):
...     '''
...     Decode the video with PyAV decoder.
...     Args:
...         container (`av.container.input.InputContainer`): PyAV container.
...         indices (`list[int]`): List of frame indices to decode.
...     Returns:
...         result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
...     '''
...     frames = []
...     container.seek(0)
...     start_index = indices[0]
...     end_index = indices[-1]
...     for i, frame in enumerate(container.decode(video=0)):
...         if i > end_index:
...             break
...         if i >= start_index and i in indices:
...             frames.append(frame)
...     return np.stack([x.to_ndarray(format="rgb24") for x in frames])


>>> def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
...     '''
...     Sample a given number of frame indices from the video.
...     Args:
...         clip_len (`int`): Total number of frames to sample.
...         frame_sample_rate (`int`): Sample every n-th frame.
...         seg_len (`int`): Maximum allowed index of sample's last frame.
...     Returns:
...         indices (`list[int]`): List of sampled frame indices
...     '''
...     converted_len = int(clip_len * frame_sample_rate)
...     end_idx = np.random.randint(converted_len, seg_len)
...     start_idx = end_idx - converted_len
...     indices = np.linspace(start_idx, end_idx, num=clip_len)
...     indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
...     return indices


>>> # video clip consists of 300 frames (10 seconds at 30 FPS)
>>> file_path = hf_hub_download(
...     repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
... )
>>> container = av.open(file_path)

>>> # sample 16 frames
>>> indices = sample_frame_indices(clip_len=16, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
>>> video = read_video_pyav(container, indices)

>>> image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
>>> model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

>>> # prepare video for the model
>>> inputs = image_processor(list(video), return_tensors="pt")

>>> # forward pass
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 1568, 768]

VideoMAEForPreTraining

VideoMAEForPreTraining includes the decoder on top for self-supervised pre-training.

class transformers.VideoMAEForPreTraining

< source >

( config )

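Parameters

config (VideoMAEConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The VideoMAE model transformer with the decoder on top for self-supervised pre-training.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward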

< source >

( pixel_values: FloatTensor bool_masked_pos: BoolTensor head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.models.videomae.modeling_videomae.VideoMAEForPreTrainingOutput or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.FloatTensor of shape (batch_size, num_frames, num_channels, height, width)): The tensors corresponding to the input video frames. Pixel values can be obtained using VideoMAEImageProcessor. See VideoMAEImageProcessor.__call__ for details.
bool_masked_pos (torch.BoolTensor of shape (batch_size, sequence_length)): Boolean masked positions. Indicates which patches are masked (1) and which are not (0). Each video in the batch must have the same number of masked patches. The sequence length is (num_frames // tubelet_size) * (image_size // patch_size) ** 2.
head_mask (torch.Tensor of shape (num_heads,) or (num_layers, num_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional): Whether to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional): Whether to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional): Whether to return a ModelOutput instead of a plain tuple.

Returns

transformers.models.videomae.modeling_videomae.VideoMAEForPreTrainingOutput or tuple(torch.FloatTensor)

A transformers.models.videomae.modeling_videomae.VideoMAEForPreTrainingOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VideoMAEConfig) and inputs.

loss (torch.FloatTensor of shape (1,)): Pixel reconstruction loss.
logits (torch.FloatTensor of shape (batch_size, num_masked_patches, tubelet_size * patch_size ** 2 * num_channels)): Pixel reconstruction logits for the masked patches.
hidden_states (tuple[torch.FloatTensor], optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple[torch.FloatTensor], optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The VideoMAEForPreTraining forward method overrides the __call__ special method.
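
To make the pixel reconstruction loss and the norm_pix_loss option more concrete, the sketch below shows one common way such an objective is defined in the MAE family: each target patch is standardized with its own mean and standard deviation, and a mean squared error is taken over masked patches only. This is an assumption-laden illustration in plain Python, not the library's code; the function names, the epsilon value, and the toy data are all made up.

```python
from statistics import mean, stdev


def normalize_patch(pixels, eps=1e-6):
    """Standardize one patch's target pixels with that patch's own mean and std."""
    mu, sigma = mean(pixels), stdev(pixels)  # stdev uses the unbiased (n - 1) estimator
    return [(p - mu) / (sigma + eps) for p in pixels]


def masked_mse(predictions, targets, masked):
    """Mean squared error averaged over the pixels of masked patches only."""
    errors = [
        (p - t) ** 2
        for pred_patch, target_patch, is_masked in zip(predictions, targets, masked)
        if is_masked
        for p, t in zip(pred_patch, target_patch)
    ]
    return mean(errors)


# Two toy patches of four pixels each; only the second patch is masked.
targets = [normalize_patch([0.1, 0.2, 0.3, 0.4]), normalize_patch([0.5, 0.5, 0.9, 0.9])]
predictions = [[0.0] * 4, [0.0] * 4]
loss = masked_mse(predictions, targets, masked=[False, True])
print(round(loss, 3))
```

Because only masked patches contribute, the visible first patch has no effect on the loss regardless of how well it is predicted.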

Example

>>> from transformers import AutoImageProcessor, VideoMAEForPreTraining
>>> import numpy as np
>>> import torch

>>> num_frames = 16
>>> video = list(np.random.randint(0, 256, (num_frames, 3, 224, 224)))

>>> image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base")
>>> model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base")

>>> pixel_values = image_processor(video, return_tensors="pt").pixel_values

>>> num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
>>> seq_length = (num_frames // model.config.tubelet_size) * num_patches_per_frame
>>> bool_masked_pos = torch.randint(0, 2, (1, seq_length)).bool()

>>> outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
>>> loss = outputs.loss
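
The example above draws an independent random mask per token, whereas the paper abstract describes tube masking at a very high ratio (90% to 95%). A minimal sketch of tube masking in plain Python is shown below, assuming tokens are ordered with time as the outer axis and that 90% of spatial positions are masked; the function name `tube_mask` and its parameters are hypothetical and not part of the library.

```python
import random


def tube_mask(temporal_slots, patches_per_frame, mask_ratio=0.9, seed=0):
    """Return a flat boolean mask where True marks a masked token.

    The same spatial positions are masked in every temporal slot, so a masked
    patch cannot be recovered by copying it from a neighbouring frame.
    """
    rng = random.Random(seed)
    num_masked = int(mask_ratio * patches_per_frame)
    masked_positions = set(rng.sample(range(patches_per_frame), num_masked))
    frame_mask = [pos in masked_positions for pos in range(patches_per_frame)]
    return frame_mask * temporal_slots  # repeat the per-frame mask along time


mask = tube_mask(temporal_slots=8, patches_per_frame=196)
print(len(mask), sum(mask))
# prints: 1568 1408
```

Wrapping the result as `torch.tensor([mask])` gives a boolean tensor of shape (1, 1568) that could be passed as bool_masked_pos for a single video with the default configuration. Every video in a batch then has the same number of masked patches, as the parameter documentation requires.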

VideoMAEForVideoClassification

class transformers.VideoMAEForVideoClassification

< source >

( config )

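Parameters

config (VideoMAEConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The VideoMAE model transformer with a video classification head on top (a linear layer on top of the average pooled hidden states of all tokens), e.g. for Kinetics-400.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward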

< source >

( pixel_values: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.Tensor of shape (batch_size, num_frames, num_channels, height, width), optional): The tensors corresponding to the input video frames. Pixel values can be obtained using VideoMAEImageProcessor. See VideoMAEImageProcessor.__call__ for details.
head_mask (torch.Tensor of shape (num_heads,) or (num_layers, num_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
labels (torch.LongTensor of shape (batch_size,), optional): Labels for computing the video classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (mean squared error); if config.num_labels > 1, a classification loss is computed (cross-entropy).
output_attentions (bool, optional): Whether to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional): Whether to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional): Whether to return a ModelOutput instead of a plain tuple.

Returns

transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.ImageClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (VideoMAEConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided): Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)): Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True): Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True): Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

The VideoMAEForVideoClassification forward method overrides the __call__ special method.

Example

>>> import av
>>> import torch
>>> import numpy as np

>>> from transformers import AutoImageProcessor, VideoMAEForVideoClassification
>>> from huggingface_hub import hf_hub_download

>>> np.random.seed(0)


>>> def read_video_pyav(container, indices):
...     '''
...     Decode the video with PyAV decoder.
...     Args:
...         container (`av.container.input.InputContainer`): PyAV container.
...         indices (`list[int]`): List of frame indices to decode.
...     Returns:
...         result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
...     '''
...     frames = []
...     container.seek(0)
...     start_index = indices[0]
...     end_index = indices[-1]
...     for i, frame in enumerate(container.decode(video=0)):
...         if i > end_index:
...             break
...         if i >= start_index and i in indices:
...             frames.append(frame)
...     return np.stack([x.to_ndarray(format="rgb24") for x in frames])


>>> def sample_frame_indices(clip_len, frame_sample_rate, seg_len):
...     '''
...     Sample a given number of frame indices from the video.
...     Args:
...         clip_len (`int`): Total number of frames to sample.
...         frame_sample_rate (`int`): Sample every n-th frame.
...         seg_len (`int`): Maximum allowed index of sample's last frame.
...     Returns:
...         indices (`list[int]`): List of sampled frame indices
...     '''
...     converted_len = int(clip_len * frame_sample_rate)
...     end_idx = np.random.randint(converted_len, seg_len)
...     start_idx = end_idx - converted_len
...     indices = np.linspace(start_idx, end_idx, num=clip_len)
...     indices = np.clip(indices, start_idx, end_idx - 1).astype(np.int64)
...     return indices


>>> # video clip consists of 300 frames (10 seconds at 30 FPS)
>>> file_path = hf_hub_download(
...     repo_id="nielsr/video-demo", filename="eating_spaghetti.mp4", repo_type="dataset"
... )
>>> container = av.open(file_path)

>>> # sample 16 frames
>>> indices = sample_frame_indices(clip_len=16, frame_sample_rate=1, seg_len=container.streams.video[0].frames)
>>> video = read_video_pyav(container, indices)

>>> image_processor = AutoImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
>>> model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")

>>> inputs = image_processor(list(video), return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)
...     logits = outputs.logits

>>> # model predicts one of the 400 Kinetics-400 classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
eating spaghetti
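
The example above reports only the top class. Since the logits are scores before SoftMax, class probabilities and a top-k ranking can be obtained by applying a softmax first. The sketch below does this in plain Python on made-up values; `toy_logits` and `toy_id2label` are hypothetical stand-ins for `logits[0].tolist()` and `model.config.id2label`, and the helper names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Hypothetical logits and label map, for illustration only.
toy_logits = [2.0, 0.5, -1.0, 4.0]
toy_id2label = {0: "cooking", 1: "running", 2: "swimming", 3: "eating spaghetti"}
for label, prob in top_k(toy_logits, toy_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

Subtracting the maximum logit before exponentiating leaves the probabilities unchanged and avoids overflow for large scores.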

