TVP

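Overview

The text-visual prompting (TVP) framework was proposed in the paper Text-Visual Prompting for Efficient 2D Temporal Video Grounding by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding.

The abstract from the paper is the following:

In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting/ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, the TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming, which calls for intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (that we call "prompts") into both visual inputs and textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train vision encoder and language encoder in a 2D TVG model and improves the performance of cross-modal feature fusion using only low-complexity sparse 2D visual features. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5× inference acceleration over TVG using 3D visual features.

This research addresses temporal video grounding (TVG), the task of pinpointing the start and end times of a specific event in a long video, as described by a text sentence. Text-visual prompting (TVP) is proposed to enhance TVG. TVP integrates specially designed patterns, known as "prompts", into both the visual (image-based) and textual (word-based) input components of a TVG model. These prompts provide additional spatial-temporal context, improving the model's ability to accurately determine event timings in the video. The approach employs 2D visual inputs in place of 3D ones. Although 3D inputs offer more spatial-temporal detail, they are also more time-consuming to process. Using 2D inputs together with prompting aims to provide a similar level of context and accuracy more efficiently.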
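
To make the grounding objective concrete, here is a minimal sketch of the temporal IoU between a predicted moment and a ground-truth moment. It illustrates only the overlap measure. The TDIoU loss named in the abstract adds distance and duration terms that are not reproduced here. The function name `temporal_iou` is an illustrative assumption, not part of the paper or of the library.

```python
def temporal_iou(pred, target):
    """IoU of two (start, end) intervals given in seconds."""
    pred_start, pred_end = pred
    target_start, target_end = target
    intersection = max(0.0, min(pred_end, target_end) - max(pred_start, target_start))
    union = (pred_end - pred_start) + (target_end - target_start) - intersection
    return intersection / union if union > 0 else 0.0


# A prediction of 2s-6s against a ground truth of 4s-8s overlaps for 2s out of 6s.
print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```

A value of 1.0 means the predicted moment matches the annotation exactly, and 0.0 means the two do not overlap.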

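TVP architecture. Taken from the original paper.

This model was contributed by Jiqing Feng. The original code can be found here.

Usage tips and examples

Prompts are optimized perturbation patterns that are added to the input video frames or text features. A universal set means that exactly the same set of prompts is used for any input: the prompts are added consistently to all video frames and text features, regardless of the input's content.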
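
The idea of a universal prompt can be illustrated with a small NumPy sketch: one perturbation pattern is stored once and combined with every frame, whatever the frame contains. The border layout, the `pad` width, and the function name `apply_universal_prompt` are assumptions made for illustration. They do not reproduce the library's visual prompter.

```python
import numpy as np


def apply_universal_prompt(frames, prompt, pad, mode="add"):
    """Combine one shared prompt with the border of every frame.

    frames: (num_frames, height, width, channels); prompt: (height, width, channels).
    """
    border = np.ones(prompt.shape[:2], dtype=bool)
    border[pad:-pad, pad:-pad] = False  # True only on the outer band of width `pad`
    out = frames.astype(np.float32)  # astype returns a copy, so the input is untouched
    if mode == "add":
        out[:, border] += prompt[border]
    elif mode == "replace":
        out[:, border] = prompt[border]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out


frames = np.zeros((4, 8, 8, 3), dtype=np.float32)
prompt = np.full((8, 8, 3), 0.5, dtype=np.float32)
prompted = apply_universal_prompt(frames, prompt, pad=2, mode="replace")
print(prompted[0, 0, 0], prompted[0, 4, 4])
```

The same `prompt` array is reused for all four frames, which is what makes the set "universal". In training, its values would be learned rather than fixed.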

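TVP consists of a visual encoder and a cross-modal encoder. A universal set of visual prompts and text prompts is integrated into the sampled video frames and the textual features, respectively. In particular, a set of different visual prompts is applied, in order, to the uniformly sampled frames of one untrimmed video.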
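
Uniform frame sampling, as done with np.linspace in the decoding helper of the usage example, can be sketched as follows. The function name `uniform_frame_indices` is an assumption for illustration.

```python
import numpy as np


def uniform_frame_indices(total_frames, num_frames):
    """Return `num_frames` indices spread evenly over `total_frames` decoded frames."""
    index = np.linspace(0, total_frames - 1, num_frames)
    # Clipping guards against stepping past the last frame; the cast truncates toward zero.
    return np.clip(index, 0, total_frames - 1).astype(np.int64)


print(uniform_frame_indices(total_frames=10, num_frames=4))
```

Each selected frame would then receive its own visual prompt, in order.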

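The goal of this model is to incorporate trainable prompts into both the visual inputs and the textual features in order to address temporal video grounding (TVG). In principle, any visual encoder and any cross-modal encoder can be used in the proposed architecture.

TvpProcessor wraps BertTokenizer and TvpImageProcessor into a single instance that encodes the text and prepares the images, respectively.

The following example shows how to run temporal video grounding using TvpProcessor and TvpForVideoGrounding.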

import av
import cv2
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoProcessor, TvpForVideoGrounding


def pyav_decode(container, sampling_rate, num_frames, clip_idx, num_clips, target_fps):
    '''
    Convert the video from its original fps to the target_fps and decode the video with PyAV decoder.
    Args:
        container (container): pyav container.
        sampling_rate (int): frame sampling rate (interval between two sampled frames).
        num_frames (int): number of frames to sample.
        clip_idx (int): if clip_idx is -1, perform random temporal sampling.
            If clip_idx is larger than -1, uniformly split the video to num_clips
            clips, and select the clip_idx-th video clip.
        num_clips (int): overall number of clips to uniformly sample from the given video.
        target_fps (int): the input video may have different fps, convert it to
            the target video fps before frame sampling.
    Returns:
        frames (tensor): decoded frames from the video. Return None if no
            video stream was found.
        fps (float): the number of frames per second of the video.
    '''
    video = container.streams.video[0]
    fps = float(video.average_rate)
    clip_size = sampling_rate * num_frames / target_fps * fps
    delta = max(num_frames - clip_size, 0)
    start_idx = delta * clip_idx / num_clips
    end_idx = start_idx + clip_size - 1
    timebase = video.duration / num_frames
    video_start_pts = int(start_idx * timebase)
    video_end_pts = int(end_idx * timebase)
    seek_offset = max(video_start_pts - 1024, 0)
    container.seek(seek_offset, any_frame=False, backward=True, stream=video)
    frames = {}
    for frame in container.decode(video=0):
        if frame.pts < video_start_pts:
            continue
        frames[frame.pts] = frame
        if frame.pts > video_end_pts:
            break
    frames = [frames[pts] for pts in sorted(frames)]
    return frames, fps


def decode(container, sampling_rate, num_frames, clip_idx, num_clips, target_fps):
    '''
    Decode the video and perform temporal sampling.
    Args:
        container (container): pyav container.
        sampling_rate (int): frame sampling rate (interval between two sampled frames).
        num_frames (int): number of frames to sample.
        clip_idx (int): if clip_idx is -1, perform random temporal sampling.
            If clip_idx is larger than -1, uniformly split the video to num_clips
            clips, and select the clip_idx-th video clip.
        num_clips (int): overall number of clips to uniformly sample from the given video.
        target_fps (int): the input video may have different fps, convert it to
            the target video fps before frame sampling.
    Returns:
        frames (tensor): decoded frames from the video.
    '''
    assert clip_idx >= -2, "Not a valid clip_idx {}".format(clip_idx)
    frames, fps = pyav_decode(container, sampling_rate, num_frames, clip_idx, num_clips, target_fps)
    clip_size = sampling_rate * num_frames / target_fps * fps
    index = np.linspace(0, clip_size - 1, num_frames)
    index = np.clip(index, 0, len(frames) - 1).astype(np.int64)
    frames = np.array([frames[idx].to_rgb().to_ndarray() for idx in index])
    frames = frames.transpose(0, 3, 1, 2)
    return frames


file = hf_hub_download(repo_id="Intel/tvp_demo", filename="AK2KG.mp4", repo_type="dataset")
model = TvpForVideoGrounding.from_pretrained("Intel/tvp-base")

decoder_kwargs = dict(
    container=av.open(file, metadata_errors="ignore"),
    sampling_rate=1,
    num_frames=model.config.num_frames,
    clip_idx=0,
    num_clips=1,
    target_fps=3,
)
raw_sampled_frms = decode(**decoder_kwargs)

text = "a person is sitting on a bed."
processor = AutoProcessor.from_pretrained("Intel/tvp-base")
model_inputs = processor(
    text=[text], videos=list(raw_sampled_frms), return_tensors="pt", max_text_length=100
)

model_inputs["pixel_values"] = model_inputs["pixel_values"].to(model.dtype)
output = model(**model_inputs)

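def get_video_duration(filename):
    # Duration in seconds = frame count / frames per second; returns -1 if the file cannot be opened.
    cap = cv2.VideoCapture(filename)
    if cap.isOpened():
        rate = cap.get(cv2.CAP_PROP_FPS)
        frame_num = cap.get(cv2.CAP_PROP_FRAME_COUNT)
        cap.release()
        return frame_num / rate
    return -1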

duration = get_video_duration(file)
start, end = processor.post_process_video_grounding(output.logits, duration)

print(f"The time slot of the video corresponding to the text \"{text}\" is from {start}s to {end}s")
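
The last step of the example converts the model output into seconds. Here is a minimal sketch of that conversion, assuming the logits hold the start and end of the moment as fractions of the video duration. `to_seconds` is a hypothetical helper written for illustration, not the library's post_process_video_grounding.

```python
def to_seconds(normalized_span, duration):
    """Scale a (start, end) pair of fractions in [0, 1] to seconds, rounded to 0.1 s."""
    start, end = normalized_span
    return round(start * duration, 1), round(end * duration, 1)


print(to_seconds((0.25, 0.5), duration=30.0))
```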

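Tips

This implementation of TVP uses BertTokenizer to generate text embeddings and a ResNet-50 model to compute visual embeddings.
A pretrained tvp-base checkpoint has been released.
Please refer to Table 2 for TVP's performance on the temporal video grounding task.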

TvpConfig

class transformers.TvpConfig

( backbone_config = None backbone = None use_pretrained_backbone = False use_timm_backbone = False backbone_kwargs = None distance_loss_weight = 1.0 duration_loss_weight = 0.1 visual_prompter_type = 'framepad' visual_prompter_apply = 'replace' visual_prompt_size = 96 max_img_size = 448 num_frames = 48 vocab_size = 30522 hidden_size = 768 intermediate_size = 3072 num_hidden_layers = 12 num_attention_heads = 12 max_position_embeddings = 512 max_grid_col_position_embeddings = 100 max_grid_row_position_embeddings = 100 hidden_dropout_prob = 0.1 hidden_act = 'gelu' layer_norm_eps = 1e-12 initializer_range = 0.02 attention_probs_dropout_prob = 0.1 **kwargs )

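Parameters

backbone_config (PretrainedConfig or dict, optional): The configuration of the backbone model.
backbone (str, optional): Name of the backbone to use when backbone_config is None. If use_pretrained_backbone is True, this loads the corresponding pretrained weights from the timm or transformers library. If use_pretrained_backbone is False, this loads the backbone's config and uses it to initialize the backbone with random weights.
use_pretrained_backbone (bool, optional, defaults to False): Whether to use pretrained weights for the backbone.
use_timm_backbone (bool, optional, defaults to False): Whether to load the backbone from the timm library. If False, the backbone is loaded from the transformers library.
backbone_kwargs (dict, optional): Keyword arguments passed to AutoBackbone when loading from a checkpoint, e.g. {'out_indices': (0, 1, 2, 3)}. Cannot be specified if backbone_config is set.
distance_loss_weight (float, optional, defaults to 1.0): The weight of the distance loss.
duration_loss_weight (float, optional, defaults to 0.1): The weight of the duration loss.
visual_prompter_type (str, optional, defaults to "framepad"): Visual prompt type, i.e. the type of padding. "framepad" means padding on each frame. Should be one of "framepad" or "framedownpad".
visual_prompter_apply (str, optional, defaults to "replace"): The way the visual prompt is applied. "replace" means the prompt values overwrite the original values in the visual inputs. Should be one of "replace", "add", or "remove".
visual_prompt_size (int, optional, defaults to 96): The size of the visual prompt.
max_img_size (int, optional, defaults to 448): The maximum size of a frame.
num_frames (int, optional, defaults to 48): The number of frames extracted from a video.
vocab_size (int, optional, defaults to 30522): Vocabulary size of the Tvp text model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling TvpModel.
hidden_size (int, optional, defaults to 768): Dimensionality of the encoder layers.
intermediate_size (int, optional, defaults to 3072): Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
num_hidden_layers (int, optional, defaults to 12): Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12): Number of attention heads for each attention layer in the Transformer encoder.
max_position_embeddings (int, optional, defaults to 512): The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
max_grid_col_position_embeddings (int, optional, defaults to 100): The maximum number of horizontal patches in a video frame.
max_grid_row_position_embeddings (int, optional, defaults to 100): The maximum number of vertical patches in a video frame.
hidden_dropout_prob (float, optional, defaults to 0.1): The dropout probability of the hidden layers.
hidden_act (str or function, optional, defaults to "gelu"): The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "selu", "gelu_new" and "quick_gelu" are supported.
layer_norm_eps (float, optional, defaults to 1e-12): The epsilon used by the layer normalization layers.
initializer_range (float, optional, defaults to 0.02): The standard deviation of the truncated_normal_initializer used to initialize all weight matrices.
attention_probs_dropout_prob (float, optional, defaults to 0.1): The dropout probability of the attention layers.

This is the configuration class to store the configuration of a TvpModel. It is used to instantiate a Tvp model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the Tvp Intel/tvp-base architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation of PretrainedConfig for more information.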
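
How distance_loss_weight and duration_loss_weight could enter training can be sketched as a weighted sum. The three terms below are simplified stand-ins, an assumption rather than the library's loss code: an IoU term, the distance between interval midpoints, and the squared difference of interval lengths. The last two are normalized by the video duration. The function name `grounding_loss` is hypothetical.

```python
def grounding_loss(pred, target, duration, distance_loss_weight=1.0, duration_loss_weight=0.1):
    """Weighted sum of IoU, midpoint-distance and length terms for (start, end) in seconds."""
    (ps, pe), (ts, te) = pred, target
    intersection = max(0.0, min(pe, te) - max(ps, ts))
    hull = max(pe, te) - min(ps, ts)  # smallest interval covering both spans
    iou_term = 1.0 - intersection / hull if hull > 0 else 1.0
    distance_term = abs((ps + pe) / 2 - (ts + te) / 2) / duration
    duration_term = ((pe - ps) - (te - ts)) ** 2 / duration**2
    return iou_term + distance_loss_weight * distance_term + duration_loss_weight * duration_term


print(grounding_loss((2.0, 6.0), (4.0, 8.0), duration=20.0))
```

With the default weights, the midpoint distance counts ten times as much as the length mismatch, and a perfect prediction yields a loss of zero.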

from_backbone_config

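( backbone_config: PretrainedConfig **kwargs ) → TvpConfig

Parameters

backbone_config (PretrainedConfig): The backbone configuration.

Returns

TvpConfig

An instance of a configuration object.

Instantiate a TvpConfig (or a derived class) from a pre-trained backbone model configuration.

to_dict

( ) → dict[str, any]

Returns

dict[str, any]

Dictionary of all the attributes that make up this configuration instance.

Serializes this instance to a Python dictionary. Overrides the default to_dict().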

TvpImageProcessor

class transformers.TvpImageProcessor

( do_resize: bool = True size: typing.Optional[dict[str, int]] = None resample: Resampling = <Resampling.BILINEAR: 2> do_center_crop: bool = True crop_size: typing.Optional[dict[str, int]] = None do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_pad: bool = True pad_size: typing.Optional[dict[str, int]] = None constant_values: typing.Union[float, collections.abc.Iterable[float]] = 0 pad_mode: PaddingMode = <PaddingMode.CONSTANT: 'constant'> do_normalize: bool = True do_flip_channel_order: bool = True image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None **kwargs )

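Parameters

do_resize (bool, optional, defaults to True): Whether to resize the image's (height, width) dimensions to the specified size. Can be overridden by the do_resize parameter in the preprocess method.
size (dict[str, int], optional, defaults to {"longest_edge": 448}): Size of the output image after resizing. The longest edge of the image is resized to size["longest_edge"] while the aspect ratio of the original image is kept. Can be overridden by the size parameter in the preprocess method.
resample (PILImageResampling, optional, defaults to Resampling.BILINEAR): Resampling filter to use when resizing the image. Can be overridden by the resample parameter in the preprocess method.
do_center_crop (bool, optional, defaults to True): Whether to center crop the image to the specified crop_size. Can be overridden by the do_center_crop parameter in the preprocess method.
crop_size (dict[str, int], optional, defaults to {"height": 448, "width": 448}): Size of the image after applying the center crop. Can be overridden by the crop_size parameter in the preprocess method.
do_rescale (bool, optional, defaults to True): Whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in the preprocess method.
rescale_factor (int or float, optional, defaults to 1/255): The scale factor to use when rescaling the image. Can be overridden by the rescale_factor parameter in the preprocess method.
do_pad (bool, optional, defaults to True): Whether to pad the image. Can be overridden by the do_pad parameter in the preprocess method.
pad_size (dict[str, int], optional, defaults to {"height": 448, "width": 448}): Size of the image after applying the padding. Can be overridden by the pad_size parameter in the preprocess method.
constant_values (Union[float, Iterable[float]], optional, defaults to 0): The fill value to use when padding the image.
pad_mode (PaddingMode, optional, defaults to PaddingMode.CONSTANT): The mode to use when padding.
do_normalize (bool, optional, defaults to True): Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
do_flip_channel_order (bool, optional, defaults to True): Whether to flip the color channels from RGB to BGR. Can be overridden by the do_flip_channel_order parameter in the preprocess method.
image_mean (float or list[float], optional, defaults to IMAGENET_STANDARD_MEAN): Mean to use when normalizing the image. This is a float or a list of floats whose length is the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or list[float], optional, defaults to IMAGENET_STANDARD_STD): Standard deviation to use when normalizing the image. This is a float or a list of floats whose length is the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.

Constructs a Tvp image processor.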
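
The interaction between size (longest_edge) and pad_size can be sketched with plain arithmetic: the longest edge is scaled to the target while the aspect ratio is kept, and the shorter edge is then padded up to the pad size. The helper below is an illustrative assumption. It uses integer division and pads at the bottom and right, and the image processor's exact rounding and padding placement may differ.

```python
def resize_and_pad_shape(height, width, longest_edge=448, pad_size=448):
    """Return the resized (height, width) and the (bottom, right) padding to reach pad_size."""
    longest = max(height, width)
    # Integer arithmetic avoids floating-point surprises such as 360 * 0.7 != 252.
    new_height = height * longest_edge // longest
    new_width = width * longest_edge // longest
    pad_bottom = max(pad_size - new_height, 0)
    pad_right = max(pad_size - new_width, 0)
    return (new_height, new_width), (pad_bottom, pad_right)


print(resize_and_pad_shape(360, 640))
```

For a 360x640 frame, this gives a 252x448 image that needs 196 rows of padding to become square.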

preprocess

( videos: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor'], list[typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]], list[list[typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]]]] do_resize: typing.Optional[bool] = None size: typing.Optional[dict[str, int]] = None resample: Resampling = None do_center_crop: typing.Optional[bool] = None crop_size: typing.Optional[dict[str, int]] = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_pad: typing.Optional[bool] = None pad_size: typing.Optional[dict[str, int]] = None constant_values: typing.Union[float, collections.abc.Iterable[float], NoneType] = None pad_mode: PaddingMode = None do_normalize: typing.Optional[bool] = None do_flip_channel_order: typing.Optional[bool] = None image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: ChannelDimension = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None )

Parameters

videos (ImageInput or list[ImageInput] or list[list[ImageInput]]): Frames to preprocess.
do_resize (bool, optional, defaults to self.do_resize): Whether to resize the image.
size (dict[str, int], optional, defaults to self.size): Size of the image after resizing.
resample (PILImageResampling, optional, defaults to self.resample): Resampling filter to use when resizing the image. This can be one of the PILImageResampling enum values. Only has an effect if do_resize is set to True.
do_center_crop (bool, optional, defaults to self.do_center_crop): Whether to center crop the image.
crop_size (dict[str, int], optional, defaults to self.crop_size): Size of the image after applying the center crop.
do_rescale (bool, optional, defaults to self.do_rescale): Whether to rescale the image values to the range [0, 1].
rescale_factor (float, optional, defaults to self.rescale_factor): Rescale factor to apply to the image if do_rescale is set to True.
do_pad (bool, optional, defaults to True): Whether to pad the image.
pad_size (dict[str, int], optional, defaults to {"height": 448, "width": 448}): Size of the image after applying the padding.
constant_values (Union[float, Iterable[float]], optional, defaults to 0): The fill value to use when padding the image.
pad_mode (PaddingMode, optional, defaults to PaddingMode.CONSTANT): The mode to use when padding.
do_normalize (bool, optional, defaults to self.do_normalize): Whether to normalize the image.
do_flip_channel_order (bool, optional, defaults to self.do_flip_channel_order): Whether to flip the channel order of the image.
image_mean (float or list[float], optional, defaults to self.image_mean): Image mean.
image_std (float or list[float], optional, defaults to self.image_std): Image standard deviation.
return_tensors (str or TensorType, optional): The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST): The channel dimension format of the output image. Can be one of:
- ChannelDimension.FIRST: image in (num_channels, height, width) format.
- ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format inferred from the input image.
input_data_format (ChannelDimension or str, optional): The channel dimension format of the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.

Preprocess an image or a batch of images.
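
The two channel layouts named by data_format, and the channel flip controlled by do_flip_channel_order, can be illustrated with NumPy. The helper names below are assumptions for illustration and are not functions of the library.

```python
import numpy as np


def to_channels_first(frame):
    """(height, width, num_channels) -> (num_channels, height, width)."""
    return np.transpose(frame, (2, 0, 1))


def flip_channel_order(frame_chw):
    """Reverse the channel axis of a channels-first frame, e.g. RGB -> BGR."""
    return frame_chw[::-1]


frame = np.zeros((4, 6, 3), dtype=np.float32)
frame[..., 0] = 1.0  # mark the first (red) channel
chw = to_channels_first(frame)
bgr = flip_channel_order(chw)
print(chw.shape, bgr[2].max())
```

After the flip, the marked red channel sits at index 2, which is the BGR position for red.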

TvpProcessor

class transformers.TvpProcessor


( image_processor = None tokenizer = None **kwargs )

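Parameters

image_processor (TvpImageProcessor, optional): The image processor is a required input.
tokenizer (BertTokenizerFast, optional): The tokenizer is a required input.

Constructs a TVP processor which wraps a TVP image processor and a Bert tokenizer into a single processor.

TvpProcessor offers all the functionalities of TvpImageProcessor and BertTokenizerFast. See the docstrings of __call__() and decode() for more information.

__call__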

( text = None videos = None return_tensors = None **kwargs ) → BatchEncoding

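Parameters

text (str, list[str], list[list[str]]): The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
videos (list[PIL.Image.Image], list[np.ndarray], list[torch.Tensor], list[list[PIL.Image.Image]], list[list[np.ndarray]], list[list[torch.Tensor]]): The video or batch of videos to be prepared. Each video should be a list of frames, which can be either PIL images or NumPy arrays. In the case of NumPy arrays or PyTorch tensors, each frame should have shape (H, W, C), where H and W are the frame height and width, and C is the number of channels.
return_tensors (str or TensorType, optional): If set, returns tensors of a particular framework. Acceptable values are:
- 'tf': Return TensorFlow tf.constant objects.
- 'pt': Return PyTorch torch.Tensor objects.
- 'np': Return NumPy np.ndarray objects.
- 'jax': Return JAX jnp.ndarray objects.

Returns

BatchEncoding

A BatchEncoding with the following fields:

input_ids: List of token ids to be fed to the model. Returned when text is not None.
attention_mask: List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names, and if text is not None).
pixel_values: Pixel values to be fed to the model. Returned when videos is not None.

Main method to prepare one or several sequence(s) and image(s) for the model. If text is not None, this method forwards the text and kwargs arguments to BertTokenizerFast's __call__() to encode the text. If videos is not None, this method forwards the videos and kwargs arguments to TvpImageProcessor's __call__() to prepare the image(s). Please refer to the docstrings of these two methods for more information.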

TvpModel

class transformers.TvpModel


( config )

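Parameters

config (TvpConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare Tvp Model transformer outputting a BaseModelOutputWithPooling object without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward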

( input_ids: typing.Optional[torch.LongTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None attention_mask: typing.Optional[torch.LongTensor] = None head_mask: typing.Optional[torch.FloatTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None interpolate_pos_encoding: bool = False )

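Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional): Indices of input sequence tokens in the vocabulary. Padding is ignored by default.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
pixel_values (torch.FloatTensor of shape (batch_size, num_frames, num_channels, height, width), optional): The tensors corresponding to the input frames. Pixel values can be obtained using TvpImageProcessor. See TvpImageProcessor.__call__ for details (TvpProcessor uses TvpImageProcessor for processing images).
attention_mask (torch.LongTensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional): Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.
interpolate_pos_encoding (bool, defaults to False): Whether to interpolate the pre-trained position encodings.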
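
The relationship between input_ids and attention_mask can be sketched in plain Python: shorter sequences are right-padded, and the mask marks real tokens with 1 and padding with 0. The helper name `pad_and_mask` and the token ids are illustrative assumptions. In practice the tokenizer produces both fields.

```python
def pad_and_mask(sequences, pad_token_id=0):
    """Right-pad token id lists to equal length and build the matching attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


ids, mask = pad_and_mask([[101, 2023, 102], [101, 102]])
print(ids)
print(mask)
```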

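The TvpModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples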

>>> import torch
>>> from transformers import AutoTokenizer, TvpModel

>>> model = TvpModel.from_pretrained("Jiqing/tiny-random-tvp")

>>> tokenizer = AutoTokenizer.from_pretrained("Jiqing/tiny-random-tvp")

>>> pixel_values = torch.rand(1, 1, 3, 448, 448)
>>> text_inputs = tokenizer("This is an example input", return_tensors="pt")
>>> output = model(text_inputs.input_ids, pixel_values, text_inputs.attention_mask)

TvpForVideoGrounding

class transformers.TvpForVideoGrounding


( config )

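Parameters

config (TvpConfig): Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

Tvp Model with a video grounding head on top, computing IoU, distance, and duration losses.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward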

( input_ids: typing.Optional[torch.LongTensor] = None pixel_values: typing.Optional[torch.FloatTensor] = None attention_mask: typing.Optional[torch.LongTensor] = None labels: typing.Optional[tuple[torch.Tensor]] = None head_mask: typing.Optional[torch.FloatTensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None interpolate_pos_encoding: bool = False )

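Parameters

input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional): Indices of input sequence tokens in the vocabulary. Padding is ignored by default.

Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.__call__() for details.

What are input IDs?
pixel_values (torch.FloatTensor of shape (batch_size, num_frames, num_channels, height, width), optional): The tensors corresponding to the input frames. Pixel values can be obtained using TvpImageProcessor. See TvpImageProcessor.__call__ for details (TvpProcessor uses TvpImageProcessor for processing images).
attention_mask (torch.LongTensor of shape (batch_size, sequence_length), optional): Mask to avoid performing attention on padding token indices. Mask values are selected in [0, 1]:
- 1 for tokens that are not masked,
- 0 for tokens that are masked.
What are attention masks?
labels (torch.FloatTensor of shape (batch_size, 3), optional): The labels contain the duration, start time, and end time of the video moment corresponding to the text.
head_mask (torch.FloatTensor of shape (num_heads,) or (num_layers, num_heads), optional): Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional): Whether or not to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional): Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional): Whether or not to return a ModelOutput instead of a plain tuple.
interpolate_pos_encoding (bool, defaults to False): Whether to interpolate the pre-trained position encodings.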
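
Here is a sketch of how annotations could be packed into the labels layout described above, with one (duration, start, end) row per example, in seconds. The helper name `build_labels` and its validation rule are assumptions for illustration. The result is a plain nested list that could then be converted with torch.tensor.

```python
def build_labels(annotations):
    """Pack (duration, start, end) tuples into rows, checking 0 <= start < end <= duration."""
    rows = []
    for duration, start, end in annotations:
        if not 0.0 <= start < end <= duration:
            raise ValueError(f"invalid moment: start={start}, end={end}, duration={duration}")
        rows.append([float(duration), float(start), float(end)])
    return rows


labels = build_labels([(30.0, 4.5, 12.0), (18.0, 0.0, 6.0)])
print(labels)
```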

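The TvpForVideoGrounding forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Examples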

>>> import torch
>>> from transformers import AutoTokenizer, TvpForVideoGrounding

>>> model = TvpForVideoGrounding.from_pretrained("Jiqing/tiny-random-tvp")

>>> tokenizer = AutoTokenizer.from_pretrained("Jiqing/tiny-random-tvp")

>>> pixel_values = torch.rand(1, 1, 3, 448, 448)
>>> text_inputs = tokenizer("This is an example input", return_tensors="pt")
>>> output = model(text_inputs.input_ids, pixel_values, text_inputs.attention_mask)
