Vision Transformer (ViT)

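Vision Transformer (ViT) is a Transformer adapted for computer vision tasks. An image is split into smaller fixed-size patches, which are treated as a sequence of tokens, similar to words in NLP tasks. ViT requires fewer resources to pretrain than convolutional architectures, and its performance on large datasets can be transferred to smaller downstream tasks.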

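You can find all the original ViT checkpoints under the Google organization.

Click on the ViT models in the right sidebar for more examples of how to apply ViT to different computer vision tasks.

The examples below demonstrate how to classify an image with Pipeline or the AutoModel class.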

Pipeline
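
The snippet that belonged to this tab did not survive extraction. The following is a minimal sketch, assuming the `transformers` library with a PyTorch backend is installed; the helper name `classify_with_pipeline` and the default checkpoint are illustrative choices, not part of the source page.

```python
def classify_with_pipeline(image, checkpoint="google/vit-base-patch16-224"):
    """Classify one image with the high-level Pipeline API.

    `image` may be a local path, a URL, or a PIL image.
    The helper name and the default checkpoint are illustrative assumptions.
    """
    # Imported lazily so that defining the helper does not need the heavy dependency.
    from transformers import pipeline

    classifier = pipeline(task="image-classification", model=checkpoint)
    # The pipeline returns a list of {"label": ..., "score": ...} dictionaries.
    return classifier(image)
```

Calling `classify_with_pipeline("cat.jpg")` (a hypothetical local file) downloads the checkpoint on first use and returns the predicted labels with their scores.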

AutoModel
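
The snippet for this tab is also missing from the extracted page. A minimal sketch under the same assumptions (`transformers` and `torch` installed); the helper name `classify_with_automodel` is hypothetical, and the calls mirror the ViTForImageClassification example further down this page.

```python
def classify_with_automodel(image, checkpoint="google/vit-base-patch16-224"):
    """Classify one PIL image with AutoImageProcessor and AutoModelForImageClassification."""
    # Imported lazily so that defining the helper does not need the heavy dependencies.
    import torch
    from transformers import AutoImageProcessor, AutoModelForImageClassification

    image_processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = AutoModelForImageClassification.from_pretrained(checkpoint)
    model.eval()

    # Resize, rescale and normalize the image into a batch of pixel_values.
    inputs = image_processor(image, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**inputs).logits

    # Map the highest-scoring class index back to its label.
    predicted_class_id = logits.argmax(dim=-1).item()
    return model.config.id2label[predicted_class_id]
```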

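Notes

The best results are obtained with supervised pretraining; during fine-tuning, it may be better to use images with a resolution higher than 224x224.
Use ViTImageProcessorFast to resize (or rescale) and normalize images to the expected size.
The patch and image resolution are reflected in the checkpoint name. For example, google/vit-base-patch16-224 is a **base-sized** architecture with a patch resolution of 16x16 and a fine-tuning resolution of 224x224.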
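
The naming convention above can be read mechanically. The following sketch assumes names of the form `vit-<size>-patch<P>-<R>`; the function `parse_vit_checkpoint` is a hypothetical helper, not a library API.

```python
import re


def parse_vit_checkpoint(name):
    """Split a name like 'google/vit-base-patch16-224' into its parts."""
    match = re.search(r"vit-(?P<size>[a-z]+)-patch(?P<patch>\d+)-(?P<res>\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized ViT checkpoint name: {name!r}")
    return {
        "size": match.group("size"),
        "patch_size": int(match.group("patch")),
        "image_size": int(match.group("res")),
    }


info = parse_vit_checkpoint("google/vit-base-patch16-224")
print(info)  # {'size': 'base', 'patch_size': 16, 'image_size': 224}
```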

ViTConfig

class transformers.ViTConfig

( hidden_size = 768 num_hidden_layers = 12 num_attention_heads = 12 intermediate_size = 3072 hidden_act = 'gelu' hidden_dropout_prob = 0.0 attention_probs_dropout_prob = 0.0 initializer_range = 0.02 layer_norm_eps = 1e-12 image_size = 224 patch_size = 16 num_channels = 3 qkv_bias = True encoder_stride = 16 pooler_output_size = None pooler_act = 'tanh' **kwargs )

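Parameters

hidden_size (int, optional, defaults to 768) — Dimensionality of the encoder layers and the pooler layer.
num_hidden_layers (int, optional, defaults to 12) — Number of hidden layers in the Transformer encoder.
num_attention_heads (int, optional, defaults to 12) — Number of attention heads for each attention layer in the Transformer encoder.
intermediate_size (int, optional, defaults to 3072) — Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
hidden_act (str or function, optional, defaults to "gelu") — The non-linear activation function (function or string) in the encoder and pooler. If a string, "gelu", "relu", "selu" and "gelu_new" are supported.
hidden_dropout_prob (float, optional, defaults to 0.0) — The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
attention_probs_dropout_prob (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer used to initialize all weight matrices.
layer_norm_eps (float, optional, defaults to 1e-12) — The epsilon used by the layer normalization layers.
image_size (int, optional, defaults to 224) — The size (resolution) of each image.
patch_size (int, optional, defaults to 16) — The size (resolution) of each patch.
num_channels (int, optional, defaults to 3) — The number of input channels.
qkv_bias (bool, optional, defaults to True) — Whether to add a bias to the queries, keys and values.
encoder_stride (int, optional, defaults to 16) — Factor by which the decoder head increases the spatial resolution for masked image modeling.
pooler_output_size (int, optional) — Dimensionality of the pooler layer. If None, defaults to hidden_size.
pooler_act (str, optional, defaults to "tanh") — The activation function to be used by the pooler. Keys of ACT2FN are supported for Flax and PyTorch, and elements of https://www.tensorflow.org/api_docs/python/tf/keras/activations are supported for TensorFlow.

This is the configuration class to store the configuration of a ViTModel. It is used to instantiate a ViT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of the ViT google/vit-base-patch16-224 architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.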
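
To make the relationship between these parameters concrete, here is a small sketch of the sizes they imply. `VitShape` is a hypothetical stand-in that copies four defaults from the signature above; it is not the library's ViTConfig, and the divisibility check is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class VitShape:
    """Hypothetical stand-in holding a few ViTConfig fields with their documented defaults."""

    hidden_size: int = 768
    num_attention_heads: int = 12
    image_size: int = 224
    patch_size: int = 16

    def head_dim(self):
        # Each attention head works on an equal slice of the hidden dimension.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads

    def num_patches(self):
        # Square image cut into non-overlapping square patches.
        return (self.image_size // self.patch_size) ** 2

    def sequence_length(self):
        # One extra position for the [CLS] token.
        return self.num_patches() + 1


shape = VitShape()
print(shape.head_dim(), shape.num_patches(), shape.sequence_length())  # 64 196 197
```

The sequence length of 197 agrees with the `[1, 197, 768]` output shape shown in the TFViTModel example on this page.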

Example

>>> from transformers import ViTConfig, ViTModel

>>> # Initializing a ViT vit-base-patch16-224 style configuration
>>> configuration = ViTConfig()

>>> # Initializing a model (with random weights) from the vit-base-patch16-224 style configuration
>>> model = ViTModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

ViTFeatureExtractor

class transformers.ViTFeatureExtractor

( *args **kwargs )

__call__

( images **kwargs )

Preprocess an image or a batch of images.

ViTImageProcessor

class transformers.ViTImageProcessor

( do_resize: bool = True size: typing.Optional[dict[str, int]] = None resample: Resampling = <Resampling.BILINEAR: 2> do_rescale: bool = True rescale_factor: typing.Union[int, float] = 0.00392156862745098 do_normalize: bool = True image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None do_convert_rgb: typing.Optional[bool] = None **kwargs )

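Parameters

do_resize (bool, optional, defaults to True) — Whether to resize the image's (height, width) dimensions to the specified (size["height"], size["width"]). Can be overridden by the do_resize parameter in the preprocess method.
size (dict, optional, defaults to {"height": 224, "width": 224}) — Size of the output image after resizing. Can be overridden by the size parameter in the preprocess method.
resample (PILImageResampling, optional, defaults to Resampling.BILINEAR) — Resampling filter to use if resizing the image. Can be overridden by the resample parameter in the preprocess method.
do_rescale (bool, optional, defaults to True) — Whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in the preprocess method.
rescale_factor (int or float, optional, defaults to 1/255) — Scale factor to use if rescaling the image. Can be overridden by the rescale_factor parameter in the preprocess method.
do_normalize (bool, optional, defaults to True) — Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.
image_mean (float or list[float], optional, defaults to IMAGENET_STANDARD_MEAN) — Mean to use if normalizing the image. This is a float or a list of floats whose length is the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.
image_std (float or list[float], optional, defaults to IMAGENET_STANDARD_STD) — Standard deviation to use if normalizing the image. This is a float or a list of floats whose length is the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.
do_convert_rgb (bool, optional) — Whether to convert the image to RGB.

Constructs a ViT image processor.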
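
The rescale and normalize steps are plain arithmetic on each channel. A minimal sketch on a single pixel, assuming that IMAGENET_STANDARD_MEAN and IMAGENET_STANDARD_STD are both 0.5 per channel; `rescale_and_normalize` is a hypothetical helper, not the processor's implementation.

```python
IMAGE_MEAN = [0.5, 0.5, 0.5]  # assumed value of IMAGENET_STANDARD_MEAN
IMAGE_STD = [0.5, 0.5, 0.5]   # assumed value of IMAGENET_STANDARD_STD


def rescale_and_normalize(pixel, rescale_factor=1 / 255, mean=IMAGE_MEAN, std=IMAGE_STD):
    """Apply do_rescale, then do_normalize, to one (R, G, B) pixel with values in 0..255."""
    # do_rescale: map 0..255 to 0..1.
    rescaled = [channel * rescale_factor for channel in pixel]
    # do_normalize: subtract the mean and divide by the standard deviation per channel.
    return [(value - m) / s for value, m, s in zip(rescaled, mean, std)]


normalized = rescale_and_normalize((0, 255, 51))
print(normalized)  # approximately [-1.0, 1.0, -0.6]
```

Under these assumptions the full 0 to 255 range lands in roughly -1 to 1, which is also why inputs that are already in 0 to 1 need do_rescale=False.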

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] do_resize: typing.Optional[bool] = None size: typing.Optional[dict[str, int]] = None resample: Resampling = None do_rescale: typing.Optional[bool] = None rescale_factor: typing.Optional[float] = None do_normalize: typing.Optional[bool] = None image_mean: typing.Union[float, list[float], NoneType] = None image_std: typing.Union[float, list[float], NoneType] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None data_format: typing.Union[str, transformers.image_utils.ChannelDimension] = <ChannelDimension.FIRST: 'channels_first'> input_data_format: typing.Union[str, transformers.image_utils.ChannelDimension, NoneType] = None do_convert_rgb: typing.Optional[bool] = None )

Parameters

images (ImageInput) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional, defaults to self.do_resize) — Whether to resize the image.
size (dict[str, int], optional, defaults to self.size) — Dictionary in the format {"height": h, "width": w} specifying the size of the output image after resizing.
resample (PILImageResampling filter, optional, defaults to self.resample) — PILImageResampling filter to use if resizing the image, e.g. PILImageResampling.BILINEAR. Only has an effect if do_resize is set to True.
do_rescale (bool, optional, defaults to self.do_rescale) — Whether to rescale the image values to the range [0, 1].
rescale_factor (float, optional, defaults to self.rescale_factor) — Rescale factor to apply to the image if do_rescale is set to True.
do_normalize (bool, optional, defaults to self.do_normalize) — Whether to normalize the image.
image_mean (float or list[float], optional, defaults to self.image_mean) — Image mean to use if do_normalize is set to True.
image_std (float or list[float], optional, defaults to self.image_std) — Image standard deviation to use if do_normalize is set to True.
return_tensors (str or TensorType, optional) — The type of tensors to return. Can be one of:
- Unset: Return a list of np.ndarray.
- TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
- TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
- TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
- TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.
data_format (ChannelDimension or str, optional, defaults to ChannelDimension.FIRST) — The channel dimension format for the output image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- Unset: Use the channel dimension format of the input image.
input_data_format (ChannelDimension or str, optional) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.
do_convert_rgb (bool, optional, defaults to self.do_convert_rgb) — Whether to convert the image to RGB.

Preprocess an image or a batch of images.
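
The two channel layouts named by data_format and input_data_format differ only in axis order. A dependency-free sketch on nested lists; `to_channels_first` is a hypothetical helper for illustration, not how the processor is implemented.

```python
def to_channels_first(image):
    """Convert a (height, width, num_channels) nested list to (num_channels, height, width)."""
    height, width, num_channels = len(image), len(image[0]), len(image[0][0])
    return [
        [[image[row][col][channel] for col in range(width)] for row in range(height)]
        for channel in range(num_channels)
    ]


# A 1x2 RGB image in "channels_last" layout: one red pixel, one green pixel.
image = [[[255, 0, 0], [0, 255, 0]]]
print(to_channels_first(image))  # [[[255, 0]], [[0, 255]], [[0, 0]]]
```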

ViTImageProcessorFast

class transformers.ViTImageProcessorFast

( **kwargs: typing_extensions.Unpack[transformers.image_processing_utils_fast.DefaultFastImageProcessorKwargs] )

Constructs a fast ViT image processor.

preprocess

( images: typing.Union[ForwardRef('PIL.Image.Image'), numpy.ndarray, ForwardRef('torch.Tensor'), list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']] *args **kwargs: typing_extensions.Unpack[transformers.image_processing_utils_fast.DefaultFastImageProcessorKwargs] ) → <class 'transformers.image_processing_base.BatchFeature'>

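Parameters

images (Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, list['PIL.Image.Image'], list[numpy.ndarray], list['torch.Tensor']]) — Image to preprocess. Expects a single image or a batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.
do_resize (bool, optional) — Whether to resize the image.
size (dict[str, int], optional) — Describes the maximum input dimensions to the model.
default_to_square (bool, optional) — Whether to default to a square image when resizing, if size is an int.
resample (Union[PILImageResampling, F.InterpolationMode, NoneType]) — Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.
do_center_crop (bool, optional) — Whether to center crop the image.
crop_size (dict[str, int], optional) — Size of the output image after applying center_crop.
do_rescale (bool, optional) — Whether to rescale the image.
rescale_factor (Union[int, float, NoneType]) — Rescale factor to apply to the image if do_rescale is set to True.
do_normalize (bool, optional) — Whether to normalize the image.
image_mean (Union[float, list[float], NoneType]) — Image mean to use for normalization. Only has an effect if do_normalize is set to True.
image_std (Union[float, list[float], NoneType]) — Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.
do_convert_rgb (bool, optional) — Whether to convert the image to RGB.
return_tensors (Union[str, ~utils.generic.TensorType, NoneType]) — Returns stacked tensors if set to "pt", otherwise returns a list of tensors.
data_format (~image_utils.ChannelDimension, optional) — Only ChannelDimension.FIRST is supported. Added for compatibility with slow processors.
input_data_format (Union[str, ~image_utils.ChannelDimension, NoneType]) — The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:
- "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
- "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
- "none" or ChannelDimension.NONE: image in (height, width) format.
device (torch.device, optional) — The device to process the images on. If unset, the device is inferred from the input images.
disable_grouping (bool, optional) — Whether to disable grouping of images by size so that they are processed individually rather than in batches. If None, it is set to True if the images are on CPU, and False otherwise. This choice is based on empirical observations, as detailed here: https://github.com/huggingface/transformers/pull/38157

<class 'transformers.image_processing_base.BatchFeature'>

data (dict) — Dictionary of lists/arrays/tensors returned by the __call__ method ("pixel_values", etc.).
tensor_type (Union[None, str, TensorType], optional) — You can give a tensor_type here to convert the lists of integers into PyTorch/TensorFlow/NumPy tensors at initialization.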
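
The disable_grouping parameter refers to grouping images by size so that same-sized images can be processed as one batch. The idea can be sketched as follows; `group_by_size` is a hypothetical illustration of the concept and makes no claim about how the fast processor implements it.

```python
from collections import defaultdict


def group_by_size(image_sizes):
    """Map each (height, width) to the indices of the images that share it."""
    groups = defaultdict(list)
    for index, size in enumerate(image_sizes):
        groups[size].append(index)
    return dict(groups)


sizes = [(224, 224), (384, 384), (224, 224)]
print(group_by_size(sizes))  # {(224, 224): [0, 2], (384, 384): [1]}
```

Each group could then be stacked and transformed in one call, with the recorded indices used to restore the original order afterwards.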

ViTModel

class transformers.ViTModel

( config: ViTConfig add_pooling_layer: bool = True use_mask_token: bool = False )

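Parameters

config (ViTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
add_pooling_layer (bool, optional, defaults to True) — Whether to add a pooling layer.
use_mask_token (bool, optional, defaults to False) — Whether to use a mask token for masked image modeling.

The bare ViT Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.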

forward

( pixel_values: typing.Optional[torch.Tensor] = None bool_masked_pos: typing.Optional[torch.BoolTensor] = None head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None interpolate_pos_encoding: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.__call__() for details.
bool_masked_pos (torch.BoolTensor of shape (batch_size, num_patches), optional) — Boolean masked positions. Indicates which patches are masked (1) and which are not (0).
head_mask (torch.Tensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
interpolate_pos_encoding (bool, optional) — Whether to interpolate the pre-trained position encodings.
return_dict (bool, optional) — Whether to return a ModelOutput instead of a plain tuple.

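transformers.modeling_outputs.BaseModelOutputWithPooling or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithPooling or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ViTConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. For example, for BERT-family models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.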
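
As a reminder of what "attention weights after the softmax" means, the sketch below computes the weights for one query and uses them to average two value vectors. It is a toy illustration in plain Python with hypothetical helper names, not the model's attention code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attend(scores, values):
    """Weighted average of value vectors, using softmax(scores) as the weights."""
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * vector[i] for w, vector in zip(weights, values)) for i in range(dim)]
    return weights, mixed


weights, mixed = attend([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights, mixed)  # [0.5, 0.5] [0.5, 0.5]
```

Each entry of `attentions` holds such weights for every head and every pair of sequence positions, so each row sums to 1.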

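The ViTModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example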
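
The example for this section is missing from the extracted page. As a stand-in, here is a minimal sketch that runs a randomly initialized ViTModel on a dummy batch, assuming `transformers` and `torch` are installed; the helper name `vit_last_hidden_state_shape` is hypothetical and nothing is downloaded.

```python
def vit_last_hidden_state_shape(image_size=224, patch_size=16):
    """Run a randomly initialized ViTModel on a dummy batch and return the output shape."""
    # Imported lazily so that defining the helper does not need the heavy dependencies.
    import torch
    from transformers import ViTConfig, ViTModel

    config = ViTConfig(image_size=image_size, patch_size=patch_size)
    model = ViTModel(config).eval()  # random weights, no checkpoint is loaded

    # Dummy batch of one all-zero image in (batch, channels, height, width) layout.
    pixel_values = torch.zeros(1, config.num_channels, image_size, image_size)
    with torch.no_grad():
        outputs = model(pixel_values=pixel_values)
    return list(outputs.last_hidden_state.shape)
```

With the default arguments the returned shape should be `[1, 197, 768]`: 196 patches plus the classification token, each with hidden_size 768.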

ViTForMaskedImageModeling

class transformers.ViTForMaskedImageModeling

( config: ViTConfig )

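Parameters

config (ViTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

ViT Model with a decoder on top for masked image modeling, as proposed in SimMIM.

Note that we provide a script to pre-train this model on custom data in our examples directory.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.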

forward

( pixel_values: typing.Optional[torch.Tensor] = None bool_masked_pos: typing.Optional[torch.BoolTensor] = None head_mask: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None interpolate_pos_encoding: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.MaskedImageModelingOutput or tuple(torch.FloatTensor)

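Parameters

pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.__call__() for details.
bool_masked_pos (torch.BoolTensor of shape (batch_size, num_patches)) — Boolean masked positions. Indicates which patches are masked (1) and which are not (0).
head_mask (torch.Tensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
interpolate_pos_encoding (bool, optional) — Whether to interpolate the pre-trained position encodings.
return_dict (bool, optional) — Whether to return a ModelOutput instead of a plain tuple.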

transformers.modeling_outputs.MaskedImageModelingOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.MaskedImageModelingOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ViTConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when bool_masked_pos is provided) — Reconstruction loss.
reconstruction (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Reconstructed / completed images.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the model at the output of each stage.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
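
The page only says that `loss` is a reconstruction loss computed when bool_masked_pos is given. One plausible reading, in the spirit of SimMIM, is an L1 error averaged over masked pixels only; the sketch below illustrates that idea on a single-channel toy image. Both helpers are hypothetical, and the exact loss used by the library is an assumption here.

```python
def patch_mask_to_pixel_mask(bool_masked_pos, grid_size, patch_size):
    """Expand a flat per-patch mask into a (height, width) per-pixel mask of 0/1."""
    side = grid_size * patch_size
    return [
        [
            int(bool_masked_pos[(row // patch_size) * grid_size + col // patch_size])
            for col in range(side)
        ]
        for row in range(side)
    ]


def masked_l1(target, prediction, pixel_mask):
    """Mean absolute error over masked pixels only (single-channel images)."""
    total = sum(
        abs(t - p) * m
        for t_row, p_row, m_row in zip(target, prediction, pixel_mask)
        for t, p, m in zip(t_row, p_row, m_row)
    )
    count = sum(sum(row) for row in pixel_mask)
    return total / count if count else 0.0


# A 2x2 grid of 1x1 patches where only the first patch is masked.
mask = patch_mask_to_pixel_mask([True, False, False, False], grid_size=2, patch_size=1)
print(mask)  # [[1, 0], [0, 0]]
print(masked_l1([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]], mask))  # 1.0
```

Only the masked pixel contributes to the loss, so errors on visible patches do not affect training.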

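The ViTForMaskedImageModeling forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example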

>>> from transformers import AutoImageProcessor, ViTForMaskedImageModeling
>>> import torch
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = ViTForMaskedImageModeling.from_pretrained("google/vit-base-patch16-224-in21k")

>>> num_patches = (model.config.image_size // model.config.patch_size) ** 2
>>> pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
>>> # create random boolean mask of shape (batch_size, num_patches)
>>> bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()

>>> outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
>>> loss, reconstructed_pixel_values = outputs.loss, outputs.reconstruction
>>> list(reconstructed_pixel_values.shape)
[1, 3, 224, 224]

ViTForImageClassification

class transformers.ViTForImageClassification

( config: ViTConfig )

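Parameters

config (ViTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

ViT Model transformer with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token), e.g. for ImageNet.

Note that it is possible to fine-tune ViT on images with a higher resolution than the ones it was trained on by setting interpolate_pos_encoding to True in the forward of the model. This interpolates the pre-trained position embeddings to the higher resolution.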
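
To see why interpolation is needed, compare the patch grid at the pre-training resolution with the grid at a higher fine-tuning resolution. The sketch below is plain arithmetic; `position_grid` is a hypothetical helper and the 384x384 resolution is only an example value.

```python
def position_grid(image_size, patch_size):
    """Return (grid_side, num_patches, sequence_length) for a square image."""
    side = image_size // patch_size
    return side, side * side, side * side + 1  # +1 for the [CLS] token


pretrained = position_grid(224, 16)
finetune = position_grid(384, 16)
print(pretrained)  # (14, 196, 197)
print(finetune)    # (24, 576, 577)
```

The pre-trained model holds position embeddings for a 14x14 grid, so feeding a 24x24 grid requires resampling those embeddings, which is what interpolate_pos_encoding=True requests.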

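This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.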

forward

( pixel_values: typing.Optional[torch.Tensor] = None head_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.Tensor] = None output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None interpolate_pos_encoding: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)

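Parameters

pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size), optional) — The tensors corresponding to the input images. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.__call__() for details.
head_mask (torch.Tensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
labels (torch.LongTensor of shape (batch_size,), optional) — Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (mean-square loss); if config.num_labels > 1 a classification loss is computed (cross-entropy).
output_attentions (bool, optional) — Whether to return the attention tensors of all attention layers. See attentions under returned tensors for more detail.
output_hidden_states (bool, optional) — Whether to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
interpolate_pos_encoding (bool, optional) — Whether to interpolate the pre-trained position encodings.
return_dict (bool, optional) — Whether to return a ModelOutput instead of a plain tuple.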
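
The rule stated for labels (regression when config.num_labels == 1, classification otherwise) can be sketched for a single example as follows. This is a simplified illustration with a hypothetical helper name; the library's own loss code may handle more cases than the two described here.

```python
import math


def classification_or_regression_loss(logits, label, num_labels):
    """Loss for one example: squared error if num_labels == 1, else cross-entropy."""
    if num_labels == 1:
        # Regression: a single output compared with a float target.
        return (logits[0] - label) ** 2
    # Classification: negative log-probability of the true class (stable log-sum-exp).
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]


print(classification_or_regression_loss([2.0], 1.5, num_labels=1))     # 0.25
print(classification_or_regression_loss([0.0, 0.0], 0, num_labels=2))  # log(2), about 0.693
```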

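transformers.modeling_outputs.ImageClassifierOutput or tuple(torch.FloatTensor)

A transformers.modeling_outputs.ImageClassifierOutput or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ViTConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, plus one for the output of each stage) of shape (batch_size, sequence_length, hidden_size). Hidden-states (also called feature maps) of the model at the output of each stage.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.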

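The ViTForImageClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example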

>>> from transformers import AutoImageProcessor, ViTForImageClassification
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = image_processor(image, return_tensors="pt")

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
...

TFViTModel

class transformers.TFViTModel

( config: ViTConfig *inputs add_pooling_layer = True **kwargs )

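Parameters

config (ViTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare ViT Model transformer outputting raw hidden-states without any specific head on top.

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

having all inputs as keyword arguments (like PyTorch models), or
having all inputs as a list, tuple or dict in the first positional argument.

The second format is supported because Keras methods prefer it when passing inputs to models and layers. Because of this support, when using methods like model.fit() things should "just work": pass your inputs and labels in any format that model.fit() supports. If, however, you want to use the second format outside of Keras methods like fit() and predict(), such as when creating your own layers or models with the Keras Functional API, there are three possibilities for gathering all the input tensors in the first positional argument:

a single tensor with pixel_values only and nothing else: model(pixel_values)
a list of varying length with one or several input tensors in the order given in the docstring: model([pixel_values, attention_mask]) or model([pixel_values, attention_mask, token_type_ids])
a dictionary with one or several input tensors associated with the input names given in the docstring: model({"pixel_values": pixel_values, "token_type_ids": token_type_ids})

Note that when creating models and layers with subclassing you do not need to worry about any of this, as you can pass inputs like you would to any other Python function.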

call

( pixel_values: TFModelInputType | None = None head_mask: np.ndarray | tf.Tensor | None = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None interpolate_pos_encoding: Optional[bool] = None return_dict: Optional[bool] = None training: bool = False ) → transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or tuple(tf.Tensor)

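Parameters

pixel_values (np.ndarray, tf.Tensor, list[tf.Tensor], dict[str, tf.Tensor] or dict[str, np.ndarray], and each example must have the shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.__call__() for details.
head_mask (np.ndarray or tf.Tensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether to return the attention tensors of all attention layers. See attentions under returned tensors for more detail. This argument can be used only in eager mode; in graph mode the value in the config is used instead.
output_hidden_states (bool, optional) — Whether to return the hidden states of all layers. See hidden_states under returned tensors for more detail. This argument can be used only in eager mode; in graph mode the value in the config is used instead.
interpolate_pos_encoding (bool, optional) — Whether to interpolate the pre-trained position encodings.
return_dict (bool, optional) — Whether to return a ModelOutput instead of a plain tuple. This argument can be used in eager mode; in graph mode the value is always set to True.
training (bool, optional, defaults to False) — Whether to use the model in training mode (some modules, such as dropout modules, behave differently between training and evaluation).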

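transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or tuple(tf.Tensor)

A transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ViTConfig) and inputs.

last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
pooler_output (tf.Tensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.

This output is usually not a good summary of the semantic content of the input; averaging or pooling the sequence of hidden-states over the whole input sequence is often better.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.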

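The TFViTModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example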

>>> from transformers import AutoImageProcessor, TFViTModel
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = TFViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

>>> inputs = image_processor(image, return_tensors="tf")
>>> outputs = model(**inputs)

>>> last_hidden_states = outputs.last_hidden_state
>>> list(last_hidden_states.shape)
[1, 197, 768]

TFViTForImageClassification

class transformers.TFViTForImageClassification

( config: ViTConfig *inputs **kwargs )

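Parameters

config (ViTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

ViT Model transformer with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token), e.g. for ImageNet.

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TensorFlow models and layers in transformers accept two formats as input:

having all inputs as keyword arguments (like PyTorch models), or
having all inputs as a list, tuple or dict in the first positional argument.

When using the second format outside of Keras methods like fit() and predict(), there are three possibilities for gathering all the input tensors in the first positional argument:

a single tensor with pixel_values only and nothing else: model(pixel_values)
a list of varying length with one or several input tensors in the order given in the docstring: model([pixel_values, attention_mask]) or model([pixel_values, attention_mask, token_type_ids])
a dictionary with one or several input tensors associated with the input names given in the docstring: model({"pixel_values": pixel_values, "token_type_ids": token_type_ids})

Note that when creating models and layers with subclassing you do not need to worry about any of this, as you can pass inputs like you would to any other Python function.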

call

( pixel_values: TFModelInputType | None = None head_mask: np.ndarray | tf.Tensor | None = None output_attentions: Optional[bool] = None output_hidden_states: Optional[bool] = None interpolate_pos_encoding: Optional[bool] = None return_dict: Optional[bool] = None labels: np.ndarray | tf.Tensor | None = None training: Optional[bool] = False ) → transformers.modeling_tf_outputs.TFSequenceClassifierOutput or tuple(tf.Tensor)

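Parameters

pixel_values (np.ndarray, tf.Tensor, list[tf.Tensor], dict[str, tf.Tensor] or dict[str, np.ndarray], and each example must have the shape (batch_size, num_channels, height, width)) — Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.__call__() for details.
head_mask (np.ndarray or tf.Tensor of shape (num_heads,) or (num_layers, num_heads), optional) — Mask to nullify selected heads of the self-attention modules. Mask values are selected in [0, 1]:
- 1 indicates the head is not masked,
- 0 indicates the head is masked.
output_attentions (bool, optional) — Whether to return the attention tensors of all attention layers. See attentions under returned tensors for more detail. This argument can be used only in eager mode; in graph mode the value in the config is used instead.
output_hidden_states (bool, optional) — Whether to return the hidden states of all layers. See hidden_states under returned tensors for more detail. This argument can be used only in eager mode; in graph mode the value in the config is used instead.
interpolate_pos_encoding (bool, optional) — Whether to interpolate the pre-trained position encodings.
return_dict (bool, optional) — Whether to return a ModelOutput instead of a plain tuple. This argument can be used in eager mode; in graph mode the value is always set to True.
training (bool, optional, defaults to False) — Whether to use the model in training mode (some modules, such as dropout modules, behave differently between training and evaluation).
labels (tf.Tensor or np.ndarray of shape (batch_size,), optional) — Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (mean-square loss); if config.num_labels > 1 a classification loss is computed (cross-entropy).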

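transformers.modeling_tf_outputs.TFSequenceClassifierOutput or tuple(tf.Tensor)

A transformers.modeling_tf_outputs.TFSequenceClassifierOutput or a tuple of tf.Tensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ViTConfig) and inputs.

loss (tf.Tensor of shape (batch_size, ), optional, returned when labels is provided) — Classification (or regression if config.num_labels==1) loss.
logits (tf.Tensor of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.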

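The TFViTForImageClassification forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example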

>>> from transformers import AutoImageProcessor, TFViTForImageClassification
>>> import tensorflow as tf
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = TFViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = image_processor(image, return_tensors="tf")
>>> logits = model(**inputs).logits

>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = int(tf.math.argmax(logits, axis=-1))
>>> print(model.config.id2label[predicted_label])
Egyptian cat

FlaxViTModel

class transformers.FlaxViTModel

( config: ViTConfig input_shape = None seed: int = 0 dtype: dtype = <class 'jax.numpy.float32'> _do_init: bool = True **kwargs )

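Parameters

config (ViTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
dtype (jax.numpy.dtype, optional, defaults to jax.numpy.float32) — The data type of the computation. Can be one of jax.numpy.float32, jax.numpy.float16 (on GPUs) and jax.numpy.bfloat16 (on TPUs).

This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified, all the computation is performed with the given dtype.

Note that this only specifies the dtype of the computation and does not influence the dtype of the model parameters.

If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().

The bare ViT Model transformer outputting raw hidden-states without any specific head on top.

This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading, saving and converting weights from PyTorch models).

This model is also a flax.linen.Module subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matters related to general usage and behavior.

Finally, this model supports inherent JAX features such as:

Just-In-Time (JIT) compilation
Automatic Differentiation
Vectorization
Parallelization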

__call__

( pixel_values params: typing.Optional[dict] = None dropout_rng: jax.random.PRNGKey = None train: bool = False output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or tuple(jnp.ndarray)

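transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or tuple(jnp.ndarray)

A transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling or a tuple of jnp.ndarray (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ViTConfig) and inputs.

last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) — Sequence of hidden-states at the output of the last layer of the model.
pooler_output (jnp.ndarray of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.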

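The FlaxViTPreTrainedModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example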

>>> from transformers import AutoImageProcessor, FlaxViTModel
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = FlaxViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

>>> inputs = image_processor(images=image, return_tensors="np")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state

FlaxViTForImageClassification

class transformers.FlaxViTForImageClassification

( config: ViTConfig input_shape = None seed: int = 0 dtype: dtype = <class 'jax.numpy.float32'> _do_init: bool = True **kwargs )

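Parameters

config (ViTConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
dtype (jax.numpy.dtype, optional, defaults to jax.numpy.float32) — The data type of the computation. Can be one of jax.numpy.float32, jax.numpy.float16 (on GPUs) and jax.numpy.bfloat16 (on TPUs).

This can be used to enable mixed-precision training or half-precision inference on GPUs or TPUs. If specified, all the computation is performed with the given dtype.

Note that this only specifies the dtype of the computation and does not influence the dtype of the model parameters.

If you wish to change the dtype of the model parameters, see to_fp16() and to_bf16().

ViT Model transformer with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token), e.g. for ImageNet.

This model inherits from FlaxPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading, saving and converting weights from PyTorch models).

This model is also a flax.linen.Module subclass. Use it as a regular Flax linen Module and refer to the Flax documentation for all matters related to general usage and behavior.

Finally, this model supports inherent JAX features such as:

Just-In-Time (JIT) compilation
Automatic Differentiation
Vectorization
Parallelization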

__call__

( pixel_values params: typing.Optional[dict] = None dropout_rng: jax.random.PRNGKey = None train: bool = False output_attentions: typing.Optional[bool] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput or tuple(jnp.ndarray)

transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput or tuple(jnp.ndarray)

A transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput or a tuple of jnp.ndarray (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (ViTConfig) and inputs.

logits (jnp.ndarray of shape (batch_size, config.num_labels)) — Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) — Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) — Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).

Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.

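The FlaxViTPreTrainedModel forward method overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.

Example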

>>> from transformers import AutoImageProcessor, FlaxViTForImageClassification
>>> from PIL import Image
>>> import jax
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = FlaxViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = image_processor(images=image, return_tensors="np")
>>> outputs = model(**inputs)
>>> logits = outputs.logits

>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_class_idx = jax.numpy.argmax(logits, axis=-1)
>>> print("Predicted class:", model.config.id2label[predicted_class_idx.item()])
