Export a model to ONNX with optimum.exporters.onnx

Summary

Exporting a model to ONNX is as simple as

optimum-cli export onnx --model gpt2 gpt2_onnx/

Check out the help for more options

optimum-cli export onnx --help

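Why use ONNX?

If you need to deploy 🤗 Transformers or 🤗 Diffusers models in production environments, we recommend exporting them to a serialized format that can be loaded and executed on specialized runtimes and hardware. In this guide, we'll show you how to export these models to ONNX (Open Neural Network eXchange).

ONNX is an open standard that defines a common set of operators and a common file format to represent deep learning models in a wide variety of frameworks, including PyTorch and TensorFlow. When a model is exported to the ONNX format, these operators are used to construct a computational graph (often called an *intermediate representation*) that represents the flow of data through the neural network.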
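To make the idea of a computational graph built from standard operators more concrete, here is a toy sketch in plain Python. It is not the ONNX API: the `OPS` table, the `graph` list and the `run_graph` helper are hypothetical names chosen for illustration, and only the operator names (`Mul`, `Add`, `Relu`) are borrowed from the ONNX operator set.

```python
# Toy illustration only: a graph is a list of nodes, each naming an operator,
# its input values, and the value it produces. This is NOT the onnx package.
OPS = {
    "Mul": lambda a, b: a * b,
    "Add": lambda a, b: a + b,
    "Relu": lambda a: max(a, 0.0),
}

# y = Relu(x * w + b), written as three nodes in execution order.
graph = [
    ("Mul", ["x", "w"], "xw"),
    ("Add", ["xw", "b"], "z"),
    ("Relu", ["z"], "y"),
]


def run_graph(nodes, inputs):
    """Evaluate the nodes in order, storing every intermediate value by name."""
    values = dict(inputs)
    for op_type, input_names, output_name in nodes:
        args = [values[name] for name in input_names]
        values[output_name] = OPS[op_type](*args)
    return values


if __name__ == "__main__":
    print(run_graph(graph, {"x": 2.0, "w": 3.0, "b": 1.0})["y"])
```

Because every node only refers to a named operator and named values, any runtime that implements the same operators can execute the same graph, which is the property the ONNX format relies on.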

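By exposing a graph with standardized operators and data types, ONNX makes it easy to switch between frameworks. For example, a model trained in PyTorch can be exported to the ONNX format and then imported into TensorRT or OpenVINO.

Once exported, a model can be optimized for inference via techniques such as graph optimization and quantization. Check the optimum.onnxruntime subpackage to optimize and run ONNX models!

🤗 Optimum provides support for the ONNX export by leveraging configuration objects. These configuration objects come ready-made for a number of model architectures, and are designed to be easily extendable to other architectures.

To check the supported architectures, go to the configuration reference page.

Exporting a model to ONNX using the CLI

To export a 🤗 Transformers or 🤗 Diffusers model to ONNX, you first need to install some extra dependencies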

pip install optimum[exporters]

The Optimum ONNX export can be used through the Optimum command line

optimum-cli export onnx --help

usage: optimum-cli <command> [<args>] export onnx [-h] -m MODEL [--task TASK] [--monolith] [--device DEVICE] [--opset OPSET] [--atol ATOL]
                                                  [--framework {pt,tf}] [--pad_token_id PAD_TOKEN_ID] [--cache_dir CACHE_DIR] [--trust-remote-code]
                                                  [--no-post-process] [--optimize {O1,O2,O3,O4}] [--batch_size BATCH_SIZE]
                                                  [--sequence_length SEQUENCE_LENGTH] [--num_choices NUM_CHOICES] [--width WIDTH] [--height HEIGHT]
                                                  [--num_channels NUM_CHANNELS] [--feature_size FEATURE_SIZE] [--nb_max_frames NB_MAX_FRAMES]
                                                  [--audio_sequence_length AUDIO_SEQUENCE_LENGTH]
                                                  output

optional arguments:
  -h, --help            show this help message and exit

Required arguments:
  -m MODEL, --model MODEL
                        Model ID on huggingface.co or path on disk to load model from.
  output                Path indicating the directory where to store generated ONNX model.

Optional arguments:
  --task TASK           The task to export the model for. If not specified, the task will be auto-inferred based on the model. Available tasks depend on the model, but are among: ['default', 'fill-mask', 'text-generation', 'text2text-generation', 'text-classification', 'token-classification', 'multiple-choice', 'object-detection', 'question-answering', 'image-classification', 'image-segmentation', 'masked-im', 'semantic-segmentation', 'automatic-speech-recognition', 'audio-classification', 'audio-frame-classification', 'automatic-speech-recognition', 'audio-xvector', 'image-to-text', 'zero-shot-object-detection', 'image-to-image', 'inpainting', 'text-to-image']. For decoder models, use `xxx-with-past` to export the model using past key values in the decoder.
  --monolith            Force to export the model as a single ONNX file. By default, the ONNX exporter may break the model in several ONNX files, for example for encoder-decoder models where the encoder should be run only once while the decoder is looped over.
  --device DEVICE       The device to use to do the export. Defaults to "cpu".
  --opset OPSET         If specified, ONNX opset version to export the model with. Otherwise, the default opset will be used.
  --atol ATOL           If specified, the absolute difference tolerance when validating the model. Otherwise, the default atol for the model will be used.
  --framework {pt,tf}   The framework to use for the ONNX export. If not provided, will attempt to use the local checkpoint's original framework or what is available in the environment.
  --pad_token_id PAD_TOKEN_ID
                        This is needed by some models, for some tasks. If not provided, will attempt to use the tokenizer to guess it.
  --cache_dir CACHE_DIR
                        Path indicating where to store cache.
  --trust-remote-code   Allows to use custom code for the modeling hosted in the model repository. This option should only be set for repositories you trust and in which you have read the code, as it will execute on your local machine arbitrary code present in the model repository.
  --no-post-process     Allows to disable any post-processing done by default on the exported ONNX models. For example, the merging of decoder and decoder-with-past models into a single ONNX model file to reduce memory usage.
  --optimize {O1,O2,O3,O4}
                        Allows to run ONNX Runtime optimizations directly during the export. Some of these optimizations are specific to ONNX Runtime, and the resulting ONNX will not be usable with other runtime as OpenVINO or TensorRT. Possible options:
                            - O1: Basic general optimizations
                            - O2: Basic and extended general optimizations, transformers-specific fusions
                            - O3: Same as O2 with GELU approximation
                            - O4: Same as O3 with mixed precision (fp16, GPU-only, requires `--device cuda`)
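When the export has to be scripted, the options listed in the help output above can be assembled from Python and handed to the CLI. The sketch below is one possible way to do that; `build_export_command` and `run_export` are hypothetical helper names, and only flags that appear in the help text (`--model`, `--task`, `--opset`, `--device`, `--monolith`) are used.

```python
import subprocess
from typing import List, Optional


def build_export_command(
    model: str,
    output: str,
    task: Optional[str] = None,
    opset: Optional[int] = None,
    device: Optional[str] = None,
    monolith: bool = False,
) -> List[str]:
    """Build the argument list for `optimum-cli export onnx` (hypothetical helper)."""
    command = ["optimum-cli", "export", "onnx", "--model", model]
    if task is not None:
        command += ["--task", task]
    if opset is not None:
        command += ["--opset", str(opset)]
    if device is not None:
        command += ["--device", device]
    if monolith:
        command.append("--monolith")
    command.append(output)  # positional output directory goes last
    return command


def run_export(command: List[str]) -> None:
    """Run the export; raises CalledProcessError if the CLI exits with an error."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the command; call run_export(...) to actually launch it.
    print(" ".join(build_export_command("gpt2", "gpt2_onnx/")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with model paths that contain spaces.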

Exporting a checkpoint can be done as follows

optimum-cli export onnx --model distilbert-base-uncased-distilled-squad distilbert_base_uncased_squad_onnx/

You should see the following logs (along with potential logs from PyTorch / TensorFlow that were hidden here for clarity)

Automatic task detection to question-answering.
Framework not specified. Using pt to export the model.
Using framework PyTorch: 1.12.1

Validating ONNX model...
        -[✓] ONNX model output names match reference model (start_logits, end_logits)
        - Validating ONNX Model output "start_logits":
                -[✓] (2, 16) matches (2, 16)
                -[✓] all values close (atol: 0.0001)
        - Validating ONNX Model output "end_logits":
                -[✓] (2, 16) matches (2, 16)
                -[✓] all values close (atol: 0.0001)
All good, model saved at: distilbert_base_uncased_squad_onnx/model.onnx

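This exports an ONNX graph of the checkpoint defined by the --model argument. As you can see, the task was automatically detected. This was possible because the model is on the Hub.

For local models, providing the --task argument is needed, otherwise the export defaults to the model architecture without any task-specific head

optimum-cli export onnx --model local_path --task question-answering distilbert_base_uncased_squad_onnx/

Note that providing the --task argument for a model on the Hub disables the automatic task detection.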
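The `all values close (atol: 0.0001)` lines in the validation log shown earlier refer to an absolute-tolerance comparison between the reference model outputs and the ONNX model outputs. The sketch below illustrates that kind of check on flat lists of floats; it is an assumption about the general technique, not the exporter's actual validation code, and `max_abs_diff` and `all_close` are hypothetical names.

```python
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equally sized outputs."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def all_close(reference: Sequence[float], candidate: Sequence[float], atol: float = 1e-4) -> bool:
    """True when every element differs by at most `atol` (absolute tolerance)."""
    return max_abs_diff(reference, candidate) <= atol


if __name__ == "__main__":
    reference_logits = [0.5, -1.25, 3.0]
    onnx_logits = [0.50004, -1.25003, 3.0]
    print(all_close(reference_logits, onnx_logits))
```

A looser tolerance can be requested with the `--atol` option when small numerical differences are expected, for example after optimizations that change precision.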

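The resulting model.onnx file can then be run on one of the many accelerators that support the ONNX standard. For example, we can load and run the model with ONNX Runtime using the optimum.onnxruntime package as follows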

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert_base_uncased_squad_onnx")
>>> model = ORTModelForQuestionAnswering.from_pretrained("distilbert_base_uncased_squad_onnx")
>>> inputs = tokenizer("What am I using?", "Using DistilBERT with ONNX Runtime!", return_tensors="pt")
>>> outputs = model(**inputs)

Printing the outputs would give

QuestionAnsweringModelOutput(loss=None, start_logits=tensor([[-4.7652, -1.0452, -7.0409, -4.6864, -4.0277, -6.2021, -4.9473,  2.6287,
          7.6111, -1.2488, -2.0551, -0.9350,  4.9758, -0.7707,  2.1493, -2.0703,
         -4.3232, -4.9472]]), end_logits=tensor([[ 0.4382, -1.6502, -6.3654, -6.0661, -4.1482, -3.5779, -0.0774, -3.6168,
         -1.8750, -2.8910,  6.2582,  0.5425, -3.7699,  3.8232, -1.5073,  6.2311,
          3.3604, -0.0772]]), hidden_states=None, attentions=None)
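As a rough illustration of how such an output can be turned into an answer span, the sketch below takes the logits printed above (copied as plain Python lists) and picks the most likely start and end token positions with a simple argmax. This greedy selection is an assumption made for brevity; real question-answering pipelines usually apply additional constraints, such as requiring the end to come after the start.

```python
# Logits copied from the printed QuestionAnsweringModelOutput above.
start_logits = [-4.7652, -1.0452, -7.0409, -4.6864, -4.0277, -6.2021, -4.9473, 2.6287,
                7.6111, -1.2488, -2.0551, -0.9350, 4.9758, -0.7707, 2.1493, -2.0703,
                -4.3232, -4.9472]
end_logits = [0.4382, -1.6502, -6.3654, -6.0661, -4.1482, -3.5779, -0.0774, -3.6168,
              -1.8750, -2.8910, 6.2582, 0.5425, -3.7699, 3.8232, -1.5073, 6.2311,
              3.3604, -0.0772]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


start_index = argmax(start_logits)
end_index = argmax(end_logits)

if __name__ == "__main__":
    print(start_index, end_index)
    # With the real objects from the example above, the span could then be decoded with:
    # tokenizer.decode(inputs["input_ids"][0, start_index : end_index + 1])
```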

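As you can see, converting a model to ONNX does not mean leaving the Hugging Face ecosystem. You end up with an API similar to that of regular 🤗 Transformers models!

It is also possible to export the model to ONNX directly from the ORTModelForQuestionAnswering class by doing the following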

>>> model = ORTModelForQuestionAnswering.from_pretrained("distilbert-base-uncased-distilled-squad", export=True)

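For more information, check the optimum.onnxruntime documentation page on this topic.

The process is identical for TensorFlow checkpoints on the Hub. For example, we can export a pure TensorFlow checkpoint from the Keras organization as follows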

optimum-cli export onnx --model keras-io/transformers-qa distilbert_base_cased_squad_onnx/

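Exporting a model to be used with Optimum's ORTModel

Models exported through optimum-cli export onnx can be used directly in ORTModel. This is especially useful for encoder-decoder models, where the export splits the encoder and decoder into two .onnx files, because the encoder is usually run only once while the decoder may be run several times in autoregressive generation tasks.

Exporting a model using past keys/values in the decoder

When exporting a decoder model used for generation, it can be useful to encapsulate the reuse of past keys and values in the exported ONNX. This avoids recomputing the same intermediate activations during generation.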
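To give a feel for what the reuse of past keys and values saves, here is a simplified counting model in plain Python. It only counts how many token positions need their key/value projections computed per decoder layer; the function names are hypothetical, and the model deliberately ignores everything else that happens in a decoder.

```python
def kv_projections_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Every generation step re-processes the whole sequence seen so far."""
    total = 0
    seq_len = prompt_len
    for _ in range(new_tokens):
        total += seq_len  # keys/values recomputed for every position
        seq_len += 1      # the generated token is appended to the sequence
    return total


def kv_projections_with_cache(prompt_len: int, new_tokens: int) -> int:
    """The first step fills the cache; later steps only process the newest token."""
    if new_tokens == 0:
        return 0
    return prompt_len + (new_tokens - 1)


if __name__ == "__main__":
    # Prompt of 8 tokens, 5 generated tokens.
    print(kv_projections_without_cache(8, 5), kv_projections_with_cache(8, 5))
```

In this simplified model the cost without a cache grows quadratically with the number of generated tokens, while the cost with a cache grows linearly.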

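In the ONNX export, past keys/values are reused by default. This behavior corresponds to --task text2text-generation-with-past, --task text-generation-with-past, or --task automatic-speech-recognition-with-past. If for any reason you want to disable the export with past keys/values reuse, you need to explicitly pass the task text2text-generation, text-generation or automatic-speech-recognition to optimum-cli export onnx.

A model exported with past keys/values can be reused directly in Optimum's ORTModel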

optimum-cli export onnx --model gpt2 gpt2_onnx/

and

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForCausalLM

>>> tokenizer = AutoTokenizer.from_pretrained("./gpt2_onnx/")
>>> model = ORTModelForCausalLM.from_pretrained("./gpt2_onnx/")
>>> inputs = tokenizer("My name is Arthur and I live in", return_tensors="pt")
>>> gen_tokens = model.generate(**inputs)
>>> print(tokenizer.batch_decode(gen_tokens))
# prints ['My name is Arthur and I live in the United States of America. I am a member of the']

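Selecting a task

Specifying --task should not be necessary in most cases when exporting a model from the Hugging Face Hub.

However, in case you need to check which tasks the ONNX export supports for a given model architecture, we have you covered. First, you can consult the list of supported tasks for both PyTorch and TensorFlow.

For each model architecture, you can find the list of supported tasks via the TasksManager. For example, for DistilBERT, for the ONNX export, we have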

>>> from optimum.exporters.tasks import TasksManager

>>> distilbert_tasks = list(TasksManager.get_supported_tasks_for_model_type("distilbert", "onnx").keys())
>>> print(distilbert_tasks)
['default', 'fill-mask', 'text-classification', 'multiple-choice', 'token-classification', 'question-answering']

You can then pass one of these tasks to the --task argument in the optimum-cli export onnx command, as mentioned above.
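A small guard like the following can catch an unsupported task before the export is launched. It is a sketch that hard-codes the DistilBERT task list printed above so that it runs without any dependency; in practice the list would come from the TasksManager call shown earlier, and `check_task` is a hypothetical helper name.

```python
from typing import List

# Copied from the TasksManager output above for DistilBERT and the ONNX export.
DISTILBERT_ONNX_TASKS = [
    "default",
    "fill-mask",
    "text-classification",
    "multiple-choice",
    "token-classification",
    "question-answering",
]


def check_task(task: str, supported_tasks: List[str]) -> str:
    """Return the task unchanged if it is supported, otherwise raise a ValueError."""
    if task not in supported_tasks:
        raise ValueError(
            f"Task {task!r} is not supported; choose one of: {', '.join(supported_tasks)}"
        )
    return task


if __name__ == "__main__":
    print(check_task("question-answering", DISTILBERT_ONNX_TASKS))
```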

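Custom export of Transformers models

Customizing the export of official Transformers models

Optimum allows advanced users to take finer-grained control over the configuration of the ONNX export. This is especially useful if you want to export models with different keyword arguments, for example using output_attentions=True or output_hidden_states=True.

To support these use cases, main_export supports two arguments, model_kwargs and custom_onnx_configs, which are used in the following fashion

model_kwargs allows overriding some of the default arguments to the model's forward; in practice the model is called as model(**reference_model_inputs, **model_kwargs).
custom_onnx_configs should be a Dict[str, OnnxConfig], mapping from the submodel name (usually model, encoder_model, decoder_model, or decoder_with_past_model; see the reference) to a custom ONNX configuration for the given submodel.

A complete example is given below, allowing the export of a model with output_attentions=True.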

from optimum.exporters.onnx import main_export
from optimum.exporters.onnx.model_configs import WhisperOnnxConfig
from transformers import AutoConfig

from optimum.exporters.onnx.base import ConfigBehavior
from typing import Dict

class CustomWhisperOnnxConfig(WhisperOnnxConfig):
    @property
    def outputs(self) -> Dict[str, Dict[int, str]]:
        common_outputs = super().outputs

        if self._behavior is ConfigBehavior.ENCODER:
            for i in range(self._config.encoder_layers):
                common_outputs[f"encoder_attentions.{i}"] = {0: "batch_size"}
        elif self._behavior is ConfigBehavior.DECODER:
            for i in range(self._config.decoder_layers):
                common_outputs[f"decoder_attentions.{i}"] = {
                    0: "batch_size",
                    2: "decoder_sequence_length",
                    3: "past_decoder_sequence_length + 1"
                }
            for i in range(self._config.decoder_layers):
                common_outputs[f"cross_attentions.{i}"] = {
                    0: "batch_size",
                    2: "decoder_sequence_length",
                    3: "encoder_sequence_length_out"
                }

        return common_outputs

    @property
    def torch_to_onnx_output_map(self):
        if self._behavior is ConfigBehavior.ENCODER:
            # The encoder export uses WhisperEncoder that returns the key "attentions"
            return {"attentions": "encoder_attentions"}
        else:
            return {}

model_id = "openai/whisper-tiny.en"
config = AutoConfig.from_pretrained(model_id)

custom_whisper_onnx_config = CustomWhisperOnnxConfig(
        config=config,
        task="automatic-speech-recognition",
)

encoder_config = custom_whisper_onnx_config.with_behavior("encoder")
decoder_config = custom_whisper_onnx_config.with_behavior("decoder", use_past=False)
decoder_with_past_config = custom_whisper_onnx_config.with_behavior("decoder", use_past=True)

custom_onnx_configs={
    "encoder_model": encoder_config,
    "decoder_model": decoder_config,
    "decoder_with_past_model": decoder_with_past_config,
}

main_export(
    model_id,
    output="custom_whisper_onnx",
    no_post_process=True,
    model_kwargs={"output_attentions": True},
    custom_onnx_configs=custom_onnx_configs
)

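For tasks that require only a single ONNX file (e.g. encoder-only), an exported model with custom inputs/outputs can then be used with the class optimum.onnxruntime.ORTModelForCustomTasks for inference with ONNX Runtime on CPU or GPU.

Customizing the export of Transformers models with custom modeling

Optimum supports the export of Transformers models with custom modeling that use trust_remote_code=True. Such models are not officially supported in the Transformers library, but are usable with its functionality such as pipelines and generation.

Examples of such models are THUDM/chatglm2-6b and mosaicml/mpt-30b.

To export custom models, a dictionary custom_onnx_configs needs to be passed to main_export(), with the ONNX config definition for all the subparts of the model to export (for example, encoder and decoder subparts). The example below shows the approach for MPT models such as mosaicml/mpt-7b; the code itself uses the fxmarty/tiny-mpt-random-remote-code checkpoint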

from optimum.exporters.onnx import main_export

from transformers import AutoConfig

from optimum.exporters.onnx.config import TextDecoderOnnxConfig
from optimum.utils import NormalizedTextConfig, DummyPastKeyValuesGenerator
from typing import Dict


class MPTDummyPastKeyValuesGenerator(DummyPastKeyValuesGenerator):
    """
    MPT swaps the two last dimensions for the key cache compared to usual transformers
    decoder models, thus the redefinition here.
    """
    def generate(self, input_name: str, framework: str = "pt"):
        past_key_shape = (
            self.batch_size,
            self.num_attention_heads,
            self.hidden_size // self.num_attention_heads,
            self.sequence_length,
        )
        past_value_shape = (
            self.batch_size,
            self.num_attention_heads,
            self.sequence_length,
            self.hidden_size // self.num_attention_heads,
        )
        return [
            (
                self.random_float_tensor(past_key_shape, framework=framework),
                self.random_float_tensor(past_value_shape, framework=framework),
            )
            for _ in range(self.num_layers)
        ]

class CustomMPTOnnxConfig(TextDecoderOnnxConfig):
    DUMMY_INPUT_GENERATOR_CLASSES = (MPTDummyPastKeyValuesGenerator,) + TextDecoderOnnxConfig.DUMMY_INPUT_GENERATOR_CLASSES
    DUMMY_PKV_GENERATOR_CLASS = MPTDummyPastKeyValuesGenerator

    DEFAULT_ONNX_OPSET = 14  # aten::tril operator requires opset>=14
    NORMALIZED_CONFIG_CLASS = NormalizedTextConfig.with_args(
        hidden_size="d_model",
        num_layers="n_layers",
        num_attention_heads="n_heads"
    )

    def add_past_key_values(self, inputs_or_outputs: Dict[str, Dict[int, str]], direction: str):
        """
        Adapted from https://github.com/huggingface/optimum/blob/v1.9.0/optimum/exporters/onnx/base.py#L625
        """
        if direction not in ["inputs", "outputs"]:
            raise ValueError(f'direction must either be "inputs" or "outputs", but {direction} was given')

        if direction == "inputs":
            decoder_sequence_name = "past_sequence_length"
            name = "past_key_values"
        else:
            decoder_sequence_name = "past_sequence_length + 1"
            name = "present"

        for i in range(self._normalized_config.num_layers):
            inputs_or_outputs[f"{name}.{i}.key"] = {0: "batch_size", 3: decoder_sequence_name}
            inputs_or_outputs[f"{name}.{i}.value"] = {0: "batch_size", 2: decoder_sequence_name}


model_id = "fxmarty/tiny-mpt-random-remote-code"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)

onnx_config = CustomMPTOnnxConfig(
    config=config,
    task="text-generation",
    use_past_in_inputs=False,
)
onnx_config_with_past = CustomMPTOnnxConfig(config, task="text-generation", use_past=True)

custom_onnx_configs = {
    "decoder_model": onnx_config,
    "decoder_with_past_model": onnx_config_with_past,
}

main_export(
    model_id,
    output="mpt_onnx",
    task="text-generation-with-past",
    trust_remote_code=True,
    custom_onnx_configs=custom_onnx_configs,
    no_post_process=True,
    legacy=True,
    opset=14
)
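The layout handled by MPTDummyPastKeyValuesGenerator in the example above can be checked in isolation. The sketch below recomputes the two shapes from the same quantities used in its generate method, without any dependency on Optimum; `mpt_past_shapes` is a hypothetical helper name and the numbers in the usage are arbitrary.

```python
from typing import Tuple

Shape = Tuple[int, int, int, int]


def mpt_past_shapes(
    batch_size: int,
    num_attention_heads: int,
    hidden_size: int,
    sequence_length: int,
) -> Tuple[Shape, Shape]:
    """Return (key_shape, value_shape) following the layout used in the example above."""
    head_dim = hidden_size // num_attention_heads
    # Key cache: the sequence axis is last (axis 3).
    key_shape = (batch_size, num_attention_heads, head_dim, sequence_length)
    # Value cache: the sequence axis is axis 2, as in usual decoder models.
    value_shape = (batch_size, num_attention_heads, sequence_length, head_dim)
    return key_shape, value_shape


if __name__ == "__main__":
    print(mpt_past_shapes(batch_size=2, num_attention_heads=4, hidden_size=64, sequence_length=10))
```

This is why add_past_key_values in the example marks axis 3 as the dynamic sequence axis for keys and axis 2 for values.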

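Moreover, the advanced argument fn_get_submodels of main_export allows customizing how the submodels are extracted when the model needs to be exported as several submodels. An example of such a function can be [consulted here](link to utils.py relevant code once merged).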
