MobileViT

Overview

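The MobileViT model was proposed in MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer by Sachin Mehta and Mohammad Rastegari. MobileViT introduces a new layer that replaces the local processing of convolutions with global processing using transformers.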

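The abstract from the paper is the following:

Light-weight convolutional neural networks (CNNs) are the de facto standard for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low-latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters.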

This model was contributed by matthijs. The TensorFlow version of the model was contributed by sayakpaul. The original code and weights can be found here.

Usage tips

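MobileViT is more like a CNN than a Transformer model. It does not work on sequence data but on batches of images. Unlike ViT, there are no embeddings. The backbone model outputs a feature map. You can follow this tutorial for a lightweight introduction.
You can use MobileViTImageProcessor to prepare images for the model. Note that if you do your own preprocessing, the pretrained checkpoints expect images to be in BGR pixel order (not RGB).
The available image classification checkpoints are pre-trained on ImageNet-1k (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).
The segmentation model uses a DeepLabV3 head. The available semantic segmentation checkpoints are pre-trained on PASCAL VOC.
As the name suggests, MobileViT was designed to be performant and efficient on mobile phones. The TensorFlow versions of the MobileViT models are fully compatible with TensorFlow Lite.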
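The BGR note above only matters when you bypass MobileViTImageProcessor. The following is a minimal, dependency-free sketch of the channel flip. It assumes the image is held as nested lists of 8-bit RGB pixels and that the only other step is rescaling values to the range [0, 1]; resizing and cropping are left out. The helper name `rgb_to_bgr_scaled` is hypothetical and not part of Transformers.

```python
def rgb_to_bgr_scaled(image, max_value=255):
    """Flip each pixel from RGB to BGR order and rescale values to [0, 1].

    `image` is a list of rows; each row is a list of (R, G, B) tuples.
    The function name and the rescale-only preprocessing are assumptions
    made for illustration.
    """
    return [
        [tuple(channel / max_value for channel in reversed(pixel)) for pixel in row]
        for row in image
    ]


# A 1x2 image in RGB order: one pure-red pixel and one pure-blue pixel.
rgb_image = [[(255, 0, 0), (0, 0, 255)]]
bgr_image = rgb_to_bgr_scaled(rgb_image)
print(bgr_image)  # red now sits in the last channel, blue in the first
```

For a NumPy array of shape (height, width, 3), the same flip can be written as `image[..., ::-1]`.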

You can use the following code to convert a MobileViT checkpoint (be it image classification or semantic segmentation) into a TensorFlow Lite model:

from transformers import TFMobileViTForImageClassification
import tensorflow as tf

# Load the pretrained TensorFlow checkpoint.
model_ckpt = "apple/mobilevit-xx-small"
model = TFMobileViTForImageClassification.from_pretrained(model_ckpt)

# Convert the Keras model with TensorFlow Lite's default optimizations.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Allow selected TensorFlow ops in addition to the TFLite builtin ops.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()

# Write the serialized model to "mobilevit-xx-small.tflite".
tflite_filename = model_ckpt.split("/")[-1] + ".tflite"
with open(tflite_filename, "wb") as f:
    f.write(tflite_model)

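The resulting model will be just **about 1 MB**, making it a good fit for mobile applications where resources and network bandwidth are constrained.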
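To check the size for your own checkpoint, you could measure the file written by the conversion code. This is a small sketch using only the standard library. The helper name `file_size_mb` is hypothetical, and it counts 1 MB as 1024 * 1024 bytes, which is an assumption about the unit intended above.

```python
import os


def file_size_mb(path):
    """Return the size of the file at `path`, counting 1 MB as 1024 * 1024 bytes."""
    return os.path.getsize(path) / (1024 * 1024)


# File name produced by the conversion code for "apple/mobilevit-xx-small".
tflite_filename = "mobilevit-xx-small.tflite"
if os.path.exists(tflite_filename):
    print(f"{tflite_filename}: {file_size_mb(tflite_filename):.2f} MB")
```

The existence check keeps the snippet safe to run before the conversion has produced a file.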

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with MobileViT.

Image classification

MobileViTForImageClassification is supported by this example script and notebook.
See also: Image classification task guide

Semantic segmentation

Semantic segmentation task guide

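If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.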

