SwiftFormer

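Overview

The SwiftFormer model was proposed in SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications by Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan.

The SwiftFormer paper introduces a novel efficient additive attention mechanism that replaces the quadratic matrix multiplication operations in the self-attention computation with linear element-wise multiplications. A series of models called "SwiftFormer" is built on top of this mechanism and achieves state-of-the-art performance in terms of both accuracy and mobile inference speed. Even the small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2x faster than MobileViT-v2.

The abstract from the paper is the following:

Self-attention has become a de facto choice for capturing global context in various vision applications. However, its quadratic computational complexity with respect to image resolution limits its use in real-time applications, especially for deployment on resource-constrained mobile devices. Although hybrid approaches have been proposed to combine the advantages of convolutions and self-attention for a better speed-accuracy trade-off, the expensive matrix multiplication operations in self-attention remain a bottleneck. In this work, we introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications. Our design shows that the key-value interaction can be replaced with a linear layer without sacrificing any accuracy. Unlike previous state-of-the-art methods, our efficient formulation of self-attention enables its usage at all stages of the network. Using our proposed efficient additive attention, we build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed. Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2x faster compared to MobileViT-v2.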
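
To make the complexity argument concrete, the following is a minimal NumPy sketch of an additive attention layer in the spirit of the abstract. It is a simplification under stated assumptions, not the authors' implementation: the weight names (`w_query`, `w_key`, `w_score`, `w_out`) are hypothetical, the weights are random, and details such as the normalization and any final projection may differ from the released code. What it illustrates is that no `n_tokens x n_tokens` matrix is ever formed.

```python
import numpy as np


def additive_attention(x, w_query, w_key, w_score, w_out):
    """Simplified additive attention over tokens x of shape (n_tokens, dim)."""
    query = x @ w_query                                   # (n_tokens, dim)
    key = x @ w_key                                       # (n_tokens, dim)
    dim = query.shape[-1]

    # One scalar score per token, obtained from a learnable vector.
    scores = (query @ w_score) / np.sqrt(dim)             # (n_tokens,)
    scores = scores / (np.linalg.norm(scores) + 1e-12)    # normalize across tokens

    # Pool all queries into a single global query vector.
    global_query = (scores[:, None] * query).sum(axis=0)  # (dim,)

    # Element-wise interaction between the global query and every key,
    # followed by a linear layer and a residual connection to the queries.
    return (global_query * key) @ w_out + query           # (n_tokens, dim)


rng = np.random.default_rng(0)
n_tokens, dim = 196, 48  # arbitrary sizes chosen for illustration
x = rng.standard_normal((n_tokens, dim))
w_query, w_key, w_out = (rng.standard_normal((dim, dim)) * dim**-0.5 for _ in range(3))
w_score = rng.standard_normal(dim)

out = additive_attention(x, w_query, w_key, w_score, w_out)
print(out.shape)
```

Every step costs time proportional to `n_tokens`, whereas standard self-attention would build an `n_tokens x n_tokens` score matrix.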

This model was contributed by shehan97. The TensorFlow version was contributed by joaocmd. The original code can be found here.

SwiftFormerConfig

class transformers.SwiftFormerConfig

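< source >

( image_size = 224 num_channels = 3 depths = [3, 3, 6, 4] embed_dims = [48, 56, 112, 220] mlp_ratio = 4 downsamples = [True, True, True, True] hidden_act = 'gelu' down_patch_size = 3 down_stride = 2 down_pad = 1 drop_path_rate = 0.0 drop_mlp_rate = 0.0 drop_conv_encoder_rate = 0.0 use_layer_scale = True layer_scale_init_value = 1e-05 batch_norm_eps = 1e-05 **kwargs )

Parameters

image_size (int, optional, defaults to 224) - The size (resolution) of each image.
num_channels (int, optional, defaults to 3) - The number of input channels.
depths (list[int], optional, defaults to [3, 3, 6, 4]) - Depth of each stage.
embed_dims (list[int], optional, defaults to [48, 56, 112, 220]) - The embedding dimension at each stage.
mlp_ratio (int, optional, defaults to 4) - Ratio of the MLP hidden dimension to its input dimension.
downsamples (list[bool], optional, defaults to [True, True, True, True]) - Whether or not to downsample inputs between two stages.
hidden_act (str, optional, defaults to "gelu") - The non-linear activation function (string). "gelu", "relu", "selu" and "gelu_new" are supported.
down_patch_size (int, optional, defaults to 3) - The size of patches in downsampling layers.
down_stride (int, optional, defaults to 2) - The stride of convolution kernels in downsampling layers.
down_pad (int, optional, defaults to 1) - Padding in downsampling layers.
drop_path_rate (float, optional, defaults to 0.0) - Rate at which to increase the dropout probability in DropPath.
drop_mlp_rate (float, optional, defaults to 0.0) - Dropout rate for the MLP component of SwiftFormer.
drop_conv_encoder_rate (float, optional, defaults to 0.0) - Dropout rate for the ConvEncoder component of SwiftFormer.
use_layer_scale (bool, optional, defaults to True) - Whether to scale outputs from token mixers.
layer_scale_init_value (float, optional, defaults to 1e-05) - Factor by which outputs from token mixers are scaled.
batch_norm_eps (float, optional, defaults to 1e-05) - The epsilon used by the batch normalization layers.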
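
As a rough illustration of how `down_patch_size`, `down_stride`, and `down_pad` interact, the sketch below applies the standard convolution output-size formula to a feature map and derives each stage's MLP hidden width from `embed_dims` and `mlp_ratio`. The helper `conv_output_size` and the 56x56 starting resolution are assumptions made for illustration; they are not part of the library.

```python
def conv_output_size(size, kernel_size=3, stride=2, padding=1):
    """Spatial size after a convolution (standard formula, dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


embed_dims = [48, 56, 112, 220]  # default embed_dims
mlp_ratio = 4                    # default mlp_ratio

# Hidden width of the MLP in each stage: input dimension times mlp_ratio.
mlp_hidden_dims = [dim * mlp_ratio for dim in embed_dims]
print(mlp_hidden_dims)

# Repeatedly downsample a hypothetical 56x56 feature map with the default
# down_patch_size=3, down_stride=2, down_pad=1.
sizes = [56]
for _ in range(3):
    sizes.append(conv_output_size(sizes[-1]))
print(sizes)
```

With the default kernel, stride, and padding, each downsampling step halves the spatial resolution.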

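This is the configuration class to store the configuration of a SwiftFormerModel. It is used to instantiate a SwiftFormer model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the SwiftFormer MBZUAI/swiftformer-xs architecture.

Configuration objects inherit from PretrainedConfig and can be used to control the model outputs. Read the documentation from PretrainedConfig for more information.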

Example

>>> from transformers import SwiftFormerConfig, SwiftFormerModel

>>> # Initializing a SwiftFormer configuration with default values (similar to MBZUAI/swiftformer-xs)
>>> configuration = SwiftFormerConfig()

>>> # Initializing a model (with random weights) from that configuration
>>> model = SwiftFormerModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config

SwiftFormerModel

class transformers.SwiftFormerModel

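< source >

( config: SwiftFormerConfig )

Parameters

config (SwiftFormerConfig) - Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare SwiftFormer Model outputting raw hidden-states without any specific head on top.

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

< source >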

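( pixel_values: typing.Optional[torch.Tensor] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.BaseModelOutputWithNoAttention or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size), optional) - The tensors corresponding to the input images. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.__call__() for details.
output_hidden_states (bool, optional) - Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) - Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.modeling_outputs.BaseModelOutputWithNoAttention or tuple(torch.FloatTensor)

A transformers.modeling_outputs.BaseModelOutputWithNoAttention or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (SwiftFormerConfig) and inputs.

last_hidden_state (torch.FloatTensor of shape (batch_size, num_channels, height, width)) - Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) - Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape (batch_size, num_channels, height, width).

Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.

The SwiftFormerModel forward method, overrides the __call__ special method.

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.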

Example

SwiftFormerForImageClassification

class transformers.SwiftFormerForImageClassification

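< source >

( config: SwiftFormerConfig )

Parameters

config (SwiftFormerConfig) - Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

SwiftFormer Model with an image classification head on top (e.g. for ImageNet).

This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.

forward

< source >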

( pixel_values: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.Tensor] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None ) → transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or tuple(torch.FloatTensor)

Parameters

pixel_values (torch.Tensor of shape (batch_size, num_channels, image_size, image_size), optional) - The tensors corresponding to the input images. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.__call__() for details.
labels (torch.LongTensor of shape (batch_size,), optional) - Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (mean-square loss); if config.num_labels > 1 a classification loss is computed (cross-entropy).
output_hidden_states (bool, optional) - Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) - Whether or not to return a ModelOutput instead of a plain tuple.

Returns

transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or tuple(torch.FloatTensor)

A transformers.modeling_outputs.ImageClassifierOutputWithNoAttention or a tuple of torch.FloatTensor (if return_dict=False is passed or when config.return_dict=False) comprising various elements depending on the configuration (SwiftFormerConfig) and inputs.

loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) - Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) - Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) - Tuple of torch.FloatTensor (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each stage) of shape (batch_size, num_channels, height, width). Hidden-states (also called feature maps) of the model at the output of each stage.
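
Because the returned logits are pre-SoftMax scores, they usually need to be normalized before being read as probabilities. The sketch below does this in plain Python for a made-up three-class logit vector; the `softmax` helper and the numbers are illustrative assumptions, not library code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 0.5, -1.0]  # made-up scores for three classes
probabilities = softmax(logits)

# The predicted class is the index with the highest probability,
# which is also the index of the largest logit.
predicted_class = max(range(len(logits)), key=lambda i: probabilities[i])
print(predicted_class, probabilities)
```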

The SwiftFormerForImageClassification forward method, overrides the __call__ special method.

Example

>>> from transformers import AutoImageProcessor, SwiftFormerForImageClassification
>>> import torch
>>> from datasets import load_dataset

>>> dataset = load_dataset("huggingface/cats-image")
>>> image = dataset["test"]["image"][0]

>>> image_processor = AutoImageProcessor.from_pretrained("MBZUAI/swiftformer-xs")
>>> model = SwiftFormerForImageClassification.from_pretrained("MBZUAI/swiftformer-xs")

>>> inputs = image_processor(image, return_tensors="pt")

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> # model predicts one of the 1000 ImageNet classes
>>> predicted_label = logits.argmax(-1).item()
>>> print(model.config.id2label[predicted_label])
...

TFSwiftFormerModel

class transformers.TFSwiftFormerModel

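< source >

( config: SwiftFormerConfig *inputs **kwargs )

Parameters

config (SwiftFormerConfig) - Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

The bare TFSwiftFormer Model transformer outputting raw hidden-states without any specific head on top.

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TF 2.0 models accept two formats as inputs:

having all inputs as keyword arguments (like PyTorch models), or
having all inputs as a list, tuple or dict in the first positional argument. This second option is useful when using the keras.Model.fit method, which currently requires having all the tensors in the first argument of the model call function: model(inputs). If you choose this second option, there are three possibilities you can use to gather all the input tensors in the first positional argument:
a single tensor with input_ids only and nothing else: model(input_ids)
a list of varying length with one or several input tensors, in the order given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])
a dictionary with one or several input tensors associated with the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})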

call

< source >

( pixel_values: typing.Optional[tensorflow.python.framework.tensor.Tensor] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None training: bool = False )

Parameters

pixel_values (tf.Tensor of shape (batch_size, num_channels, height, width)) - Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.__call__() for details.
output_hidden_states (bool, optional) - Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) - Whether or not to return a ModelOutput instead of a plain tuple.
training (bool, optional, defaults to False) - Whether or not to run the model in training mode.

The TFSwiftFormerModel forward method, overrides the __call__ special method.

TFSwiftFormerForImageClassification

class transformers.TFSwiftFormerForImageClassification

< source >

( config: SwiftFormerConfig **kwargs )

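Parameters

config (SwiftFormerConfig) - Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.

TFSwiftFormer Model transformer with an image classification head on top (e.g. for ImageNet).

This model inherits from TFPreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).

This model is also a keras.Model subclass. Use it as a regular TF 2.0 Keras Model and refer to the TF 2.0 documentation for all matters related to general usage and behavior.

TF 2.0 models accept two formats as inputs:

having all inputs as keyword arguments (like PyTorch models), or
having all inputs as a list, tuple or dict in the first positional argument. This second option is useful when using the keras.Model.fit method, which currently requires having all the tensors in the first argument of the model call function: model(inputs). If you choose this second option, there are three possibilities you can use to gather all the input tensors in the first positional argument:
a single tensor with input_ids only and nothing else: model(input_ids)
a list of varying length with one or several input tensors, in the order given in the docstring: model([input_ids, attention_mask]) or model([input_ids, attention_mask, token_type_ids])
a dictionary with one or several input tensors associated with the input names given in the docstring: model({"input_ids": input_ids, "token_type_ids": token_type_ids})

call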

< source >

( pixel_values: typing.Optional[tensorflow.python.framework.tensor.Tensor] = None labels: typing.Optional[tensorflow.python.framework.tensor.Tensor] = None output_hidden_states: typing.Optional[bool] = None return_dict: typing.Optional[bool] = None training: bool = False )

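Parameters

pixel_values (tf.Tensor of shape (batch_size, num_channels, height, width)) - Pixel values. Pixel values can be obtained using AutoImageProcessor. See ViTImageProcessor.__call__() for details.
output_hidden_states (bool, optional) - Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.
return_dict (bool, optional) - Whether or not to return a ModelOutput instead of a plain tuple.
training (bool, optional, defaults to False) - Whether or not to run the model in training mode.
labels (tf.Tensor of shape (batch_size,), optional) - Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (mean-square loss); if config.num_labels > 1 a classification loss is computed (cross-entropy).

The TFSwiftFormerForImageClassification forward method, overrides the __call__ special method.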
