Export a model to Inferentia

Summary

Exporting a PyTorch model to a Neuron model is as simple as:

optimum-cli export neuron \
  --model bert-base-uncased \
  --sequence_length 128 \
  --batch_size 1 \
  bert_neuron/

Check the help output for more options:

optimum-cli export neuron --help

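Why compile to a Neuron model?

AWS provides two generations of the Inferentia accelerator, built for machine learning inference with higher throughput, lower latency, and lower cost: inf2 (NeuronCore-v2) and inf1 (NeuronCore-v1).

In production environments, to deploy 🤗 Transformers models on Neuron devices, you need to compile your models and export them to a serialized format before inference. Through ahead-of-time (AOT) compilation with the Neuron compiler (neuronx-cc or neuron-cc), your models are converted to serialized and optimized TorchScript modules.

Although pre-compilation avoids overhead during inference, a compiled Neuron model has some limitations: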

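The input shapes and data types used during compilation cannot be changed.
Neuron models are specialized for each hardware and SDK version, which means:
- Models compiled with Neuron cannot be executed in non-Neuron environments.
- Models compiled for inf1 (NeuronCore-v1) are not compatible with inf2 (NeuronCore-v2), and vice versa.
- Models compiled for one SDK version are (generally) not compatible with another SDK version.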

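In this guide, we show you how to export your models to serialized models optimized for Neuron devices.

🤗 Optimum supports the Neuron export through configuration objects. These configuration objects come ready-made for a number of model architectures and are designed to be easily extendable to other architectures.

To see the supported architectures, visit the configuration reference page.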

Exporting a model to Neuron using the CLI

To export a 🤗 Transformers model to Neuron, you first need to install some extra dependencies:

For Inf2

pip install optimum-neuron[neuronx]

For Inf1

pip install optimum-neuron[neuron]

The Optimum Neuron export can be used through the Optimum command line:

optimum-cli export neuron --help

usage: optimum-cli export neuron [-h] -m MODEL [--task TASK] [--atol ATOL] [--cache_dir CACHE_DIR] [--trust-remote-code]
                                 [--compiler_workdir COMPILER_WORKDIR] [--disable-validation] [--auto_cast {none,matmul,all}]
                                 [--auto_cast_type {bf16,fp16,tf32}] [--dynamic-batch-size] [--num_cores NUM_CORES] [--unet UNET]
                                 [--output_hidden_states] [--output_attentions] [--batch_size BATCH_SIZE]
                                 [--sequence_length SEQUENCE_LENGTH] [--num_beams NUM_BEAMS] [--num_choices NUM_CHOICES]
                                 [--num_channels NUM_CHANNELS] [--width WIDTH] [--height HEIGHT]
                                 [--num_images_per_prompt NUM_IMAGES_PER_PROMPT] [-O1 | -O2 | -O3]
                                 output

optional arguments:
  -h, --help            show this help message and exit
  -O1                   Enables the core performance optimizations in the compiler, while also minimizing compile time.
  -O2                   [Default] Provides the best balance between model performance and compile time.
  -O3                   May provide additional model execution performance but may incur longer compile times and higher host
                        memory usage during model compilation.

Required arguments:
  -m MODEL, --model MODEL
                        Model ID on huggingface.co or path on disk to load model from.
  output                Path indicating the directory where to store generated Neuronx compiled TorchScript model.

Optional arguments:
  --task TASK           The task to export the model for. If not specified, the task will be auto-inferred based on the model.
                        Available tasks depend on the model, but are among: ['audio-classification', 'audio-frame-
                        classification', 'audio-xvector', 'automatic-speech-recognition', 'conversational', 'depth-estimation',
                        'feature-extraction', 'fill-mask', 'image-classification', 'image-segmentation', 'image-to-image',
                        'image-to-text', 'mask-generation', 'masked-im', 'multiple-choice', 'object-detection', 'question-
                        answering', 'semantic-segmentation', 'text-to-audio', 'text-generation', 'text2text-generation', 'text-
                        classification', 'token-classification', 'zero-shot-image-classification', 'zero-shot-object-detection',
                        'stable-diffusion', 'stable-diffusion-xl'].
  --atol ATOL           If specified, the absolute difference tolerance when validating the model. Otherwise, the default atol
                        for the model will be used.
  --cache_dir CACHE_DIR
                        Path indicating where to store cache.
  --trust-remote-code   Allow to use custom code for the modeling hosted in the model repository. This option should only be set
                        for repositories you trust and in which you have read the code, as it will execute on your local machine
                        arbitrary code present in the model repository.
  --compiler_workdir COMPILER_WORKDIR
                        Path indicating the directory where to store intermediary files generated by Neuronx compiler.
  --disable-validation  Whether to disable the validation of inference on neuron device compared to the outputs of original
                        PyTorch model on CPU.
  --auto_cast {none,matmul,all}
                        Whether to cast operations from FP32 to lower precision to speed up the inference. Can be `"none"`,
                        `"matmul"` or `"all"`.
  --auto_cast_type {bf16,fp16,tf32}
                        The data type to cast FP32 operations to when auto-cast mode is enabled. Can be `"bf16"`, `"fp16"` or
                        `"tf32"`.
  --dynamic-batch-size  Enable dynamic batch size for neuron compiled model. If this option is enabled, the input batch size can
                        be a multiple of the batch size during the compilation, but it comes with a potential tradeoff in terms
                        of latency.
  --num_cores NUM_CORES
                        The number of cores on which the model should be deployed (text-generation only).
  --unet UNET           UNet model ID on huggingface.co or path on disk to load model from. This will replace the unet in the
                        original Stable Diffusion pipeline.
  --output_hidden_states
                        Whether or not for the traced model to return the hidden states of all layers.
  --output_attentions   Whether or not for the traced model to return the attentions tensors of all attention layers.

Input shapes:
  --batch_size BATCH_SIZE
                        Batch size that the Neuronx-cc compiler exported model will be able to take as input.
  --sequence_length SEQUENCE_LENGTH
                        Sequence length that the Neuronx-cc compiler exported model will be able to take as input.
  --num_beams NUM_BEAMS
                        Number of beams for beam search that the Neuronx-cc compiler exported model will be able to take as
                        input.
  --num_choices NUM_CHOICES
                        Only for the multiple-choice task. Num choices that the Neuronx-cc compiler exported model will be able
                        to take as input.
  --num_channels NUM_CHANNELS
                        Image tasks only. Number of channels that the Neuronx-cc compiler exported model will be able to take as
                        input.
  --width WIDTH         Image tasks only. Width that the Neuronx-cc compiler exported model will be able to take as input.
  --height HEIGHT       Image tasks only. Height that the Neuronx-cc compiler exported model will be able to take as input.
  --num_images_per_prompt NUM_IMAGES_PER_PROMPT
                        Stable diffusion only. Number of images per prompt that the Neuronx-cc compiler exported model will be
                        able to take as input.

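Exporting standard (non-LLM) models

Most models on the Hugging Face Hub can be exported directly with torch trace and then converted to serialized and optimized TorchScript modules.

NEFF: Neuron Executable File Format, a binary executable for Neuron devices.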

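When exporting a model, two sets of export arguments apply:

compiler_args are optional arguments for the compiler. They usually control how the compiler trades off inference performance (latency and throughput) against accuracy.
input_shapes are the mandatory static shape information that you need to send to the Neuron compiler.

Run the following command to see all export parameters: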

optimum-cli export neuron -h

Exporting a standard NLP model can be done as follows:

optimum-cli export neuron --model distilbert-base-uncased-distilled-squad \
                          --batch_size 1 --sequence_length 16 \
                          --auto_cast matmul --auto_cast_type fp16 \
                          distilbert_base_uncased_squad_neuron/

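Here, the model is exported with a static input shape of (1, 16), and the compiler arguments specify that matrix multiplication operations are performed in float16 precision for faster inference.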
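The same export can also be launched from a script. The sketch below only assembles the argument list for the command shown above, using flags that appear in the help output; `build_export_command` is a hypothetical helper name, not part of Optimum, and actually running the command requires a machine with the Neuron SDK installed.

```python
import shlex


def build_export_command(model, output_dir, batch_size, sequence_length,
                         auto_cast=None, auto_cast_type=None):
    """Return the optimum-cli argument list for a Neuron export (hypothetical helper)."""
    cmd = [
        "optimum-cli", "export", "neuron",
        "--model", model,
        # Static input shapes are mandatory for the Neuron compiler.
        "--batch_size", str(batch_size),
        "--sequence_length", str(sequence_length),
    ]
    # Compiler arguments are optional; only add them when requested.
    if auto_cast is not None:
        cmd += ["--auto_cast", auto_cast]
    if auto_cast_type is not None:
        cmd += ["--auto_cast_type", auto_cast_type]
    cmd.append(output_dir)  # the positional output directory comes last
    return cmd


command = build_export_command(
    "distilbert-base-uncased-distilled-squad",
    "distilbert_base_uncased_squad_neuron/",
    batch_size=1,
    sequence_length=16,
    auto_cast="matmul",
    auto_cast_type="fp16",
)
print(shlex.join(command))
# On a Neuron instance, the export could then be started with:
#   subprocess.run(command, check=True)
```

Keeping the command as a list rather than a single string avoids shell-quoting problems when the model path contains spaces.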

After the export, you should see the following logs, which validate the model on Neuron devices by comparing it with the PyTorch model on CPU:

Validating Neuron model...
        -[✓] Neuron model output names match reference model (last_hidden_state)
        - Validating Neuron Model output "last_hidden_state":
                -[✓] (1, 16, 32) matches (1, 16, 32)
                -[✓] all values close (atol: 0.0001)
The Neuronx export succeeded and the exported model was saved at: distilbert_base_uncased_squad_neuron/
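The "all values close (atol: 0.0001)" line in the log above refers to an absolute tolerance check between the Neuron output and the CPU reference. The sketch below illustrates that kind of comparison on plain nested lists; it is a simplified illustration with hypothetical function names and made-up values, and the exporter's actual validation code may differ.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two nested lists of floats."""
    if isinstance(reference, (list, tuple)):
        if not isinstance(candidate, (list, tuple)) or len(reference) != len(candidate):
            raise ValueError("shape mismatch between reference and candidate")
        return max(
            (max_abs_diff(r, c) for r, c in zip(reference, candidate)),
            default=0.0,
        )
    return abs(reference - candidate)


def outputs_match(reference, candidate, atol=1e-4):
    """True when every value differs from the reference by at most atol."""
    return max_abs_diff(reference, candidate) <= atol


# Illustrative values only, not real model outputs.
cpu_output = [[0.12, -0.53], [0.98, 0.07]]
neuron_output = [[0.12004, -0.53001], [0.97995, 0.07]]

print(outputs_match(cpu_output, neuron_output))             # default atol of 1e-4
print(outputs_match(cpu_output, neuron_output, atol=1e-5))  # stricter tolerance
```

When lower-precision casting is enabled, a looser tolerance may be needed, which is what the --atol option is for.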

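This exports a Neuron-compiled TorchScript module of the checkpoint defined by the --model argument.

As you can see, the task was detected automatically. This is possible because the model is on the Hub. For local models, the --task argument is required; otherwise the export defaults to the model architecture without any task-specific head: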

optimum-cli export neuron --model local_path --task question-answering --batch_size 1 --sequence_length 16 --dynamic-batch-size distilbert_base_uncased_squad_neuron/

Note that providing the --task argument for a model on the Hub disables automatic task detection. The resulting model.neuron file can then be loaded and run on Neuron devices.
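The command above also passes --dynamic-batch-size, which according to the help text allows the input batch size to be a multiple of the compiled batch size. One way to picture this is as splitting the input into chunks of the compiled size. The sketch below shows only that idea with a hypothetical helper; it is not the Neuron runtime's implementation.

```python
def split_for_dynamic_batch(samples, compiled_batch_size):
    """Split samples into chunks of exactly compiled_batch_size (conceptual sketch)."""
    if compiled_batch_size < 1:
        raise ValueError("compiled_batch_size must be at least 1")
    if len(samples) % compiled_batch_size != 0:
        raise ValueError(
            f"batch of {len(samples)} is not a multiple of the compiled "
            f"batch size {compiled_batch_size}"
        )
    return [
        samples[start:start + compiled_batch_size]
        for start in range(0, len(samples), compiled_batch_size)
    ]


questions = ["q1", "q2", "q3", "q4"]
chunks = split_for_dynamic_batch(questions, compiled_batch_size=2)
print(chunks)
```

Each chunk costs one pass through the compiled model, which is consistent with the latency tradeoff mentioned in the help text.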

For each model architecture, you can find the list of supported tasks via the TasksManager (optimum.exporters.tasks.TasksManager). For example, for DistilBERT, the Neuron export supports:

>>> from optimum.exporters.tasks import TasksManager
>>> from optimum.exporters.neuron.model_configs import *  # Register neuron specific configs to the TasksManager

>>> distilbert_tasks = list(TasksManager.get_supported_tasks_for_model_type("distilbert", "neuron").keys())
>>> print(distilbert_tasks)
['feature-extraction', 'fill-mask', 'multiple-choice', 'question-answering', 'text-classification', 'token-classification']

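You can then pass one of these tasks to the --task argument of the optimum-cli export neuron command, as mentioned above.

Once exported, the Neuron model can be used directly for inference with the NeuronModelForXXX classes: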

>>> from transformers import AutoTokenizer
>>> from optimum.neuron import NeuronModelForSequenceClassification

>>> tokenizer = AutoTokenizer.from_pretrained("./distilbert-base-uncased-finetuned-sst-2-english_neuron/")
>>> model = NeuronModelForSequenceClassification.from_pretrained("./distilbert-base-uncased-finetuned-sst-2-english_neuron/")

>>> inputs = tokenizer("Hamilton is considered to be the best musical of human history.", return_tensors="pt")
>>> logits = model(**inputs).logits
>>> print(model.config.id2label[logits.argmax().item()])
'POSITIVE'

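As you can see, there is no need to pass the Neuron arguments used during the export: they are saved in the config.json file and are restored automatically by the NeuronModelForXXX class.

Note that inputs are always padded to the shapes used for compilation, and padding adds computation overhead. Choose static shapes that are at least as large as the inputs you will feed the model during inference, but not much larger.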
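To make the padding overhead concrete, here is a minimal sketch of padding one tokenized input to a static sequence length. The helper name, the pad token id of 0, and the token ids are illustrative assumptions; in practice the NeuronModelForXXX classes handle padding for you.

```python
def pad_to_static_length(token_ids, sequence_length, pad_token_id=0):
    """Pad token_ids to sequence_length and return (input_ids, attention_mask)."""
    if len(token_ids) > sequence_length:
        raise ValueError(
            f"input of {len(token_ids)} tokens exceeds the compiled "
            f"sequence length {sequence_length}"
        )
    padding = sequence_length - len(token_ids)
    input_ids = list(token_ids) + [pad_token_id] * padding
    attention_mask = [1] * len(token_ids) + [0] * padding
    return input_ids, attention_mask


tokens = [101, 7592, 2088, 999, 102]  # illustrative ids for a 5-token input
input_ids, attention_mask = pad_to_static_length(tokens, sequence_length=16)

# Fraction of positions that carry padding rather than real tokens.
wasted = attention_mask.count(0) / len(attention_mask)
print(len(input_ids), wasted)
```

Here 11 of the 16 positions are padding, so a compiled sequence length closer to the typical input length would waste less compute.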

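Exporting Stable Diffusion to Neuron

With the Optimum CLI, you can compile components of the Stable Diffusion pipeline to accelerate inference on Neuron devices.

So far, we support exporting the following components of the pipeline:

CLIP text encoder
U-Net
VAE encoder
VAE decoder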

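These blocks were chosen because they account for most of the computation in the pipeline, and performance benchmarking has shown that running them on Neuron yields a significant performance benefit.

Also, feel free to tweak the compilation configuration to find the best tradeoff between performance and accuracy for your use case. By default, we suggest casting FP32 matrix multiplication operations to BF16, which offers good performance with a moderate sacrifice in accuracy. Check the guide in the AWS Neuron documentation to better understand your compilation options.

Exporting a Stable Diffusion checkpoint with the CLI can be done as follows: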

optimum-cli export neuron --model stabilityai/stable-diffusion-2-1-base \
  --task stable-diffusion \
  --batch_size 1 \
  --height 512 `# height in pixels of generated image, eg. 512, 768` \
  --width 512 `# width in pixels of generated image, eg. 512, 768` \
  --num_images_per_prompt 4 `# number of images to generate per prompt, defaults to 1` \
  --auto_cast matmul `# cast only matrix multiplication operations` \
  --auto_cast_type bf16 `# cast operations from FP32 to BF16` \
  sd_neuron/

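Exporting Stable Diffusion XL to Neuron

As with Stable Diffusion, you can use the Optimum CLI to compile components of the SDXL pipeline for inference on Neuron devices.

We support exporting the following components of the pipeline to boost speed:

Text encoder
Second text encoder
U-Net (three times larger than the UNet in the Stable Diffusion pipeline)
VAE encoder
VAE decoder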

Stable Diffusion XL works especially well with images between 768 and 1024 pixels.

Exporting an SDXL checkpoint with the CLI can be done as follows:

optimum-cli export neuron --model stabilityai/stable-diffusion-xl-base-1.0 \
  --task stable-diffusion-xl \
  --batch_size 1 \
  --height 1024 `# height in pixels of generated image, eg. 768, 1024` \
  --width 1024 `# width in pixels of generated image, eg. 768, 1024` \
  --num_images_per_prompt 4 `# number of images to generate per prompt, defaults to 1` \
  --auto_cast matmul `# cast only matrix multiplication operations` \
  --auto_cast_type bf16 `# cast operations from FP32 to BF16` \
  sd_neuron/

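Exporting LLMs to Neuron

LLM models are not exported using Torch tracing. Instead, they are converted directly to Neuron graphs, into which the transformers checkpoint weights can be loaded.

As with standard NLP models, you need to specify static parameters when exporting an LLM model: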

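batch_size is the number of input sequences that the model will accept. Defaults to 1.
sequence_length is the maximum number of tokens in an input sequence. Defaults to max_position_embeddings (n_positions for older models).
auto_cast_type specifies the format used to encode the weights. It can be one of fp32 (float32), fp16 (float16) or bf16 (bfloat16). Defaults to fp32.
num_cores is the number of Neuron cores used when instantiating the model. Each Neuron core has 16 GB of memory, which means that larger models need to be split across multiple cores. Defaults to 1.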

optimum-cli export neuron --model meta-llama/Meta-Llama-3-8B \
  --batch_size 1 \
  --sequence_length 4096 \
  --auto_cast_type fp16 `# cast operations from BF16 to FP16` \
  --num_cores 2 \
  llama3_neuron/
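As a rough aid for choosing num_cores, the sketch below estimates how many cores the weights alone would occupy, based on the per-core memory figure given above. It is a back-of-the-envelope lower bound under stated assumptions: the 16 GB figure is read as 16 GiB, parameter counts are approximate, and memory for the KV cache, activations, and runtime overhead is ignored, so real deployments typically need more cores than this estimate. The helper name is hypothetical.

```python
import math

# Bytes used per parameter for each weight format.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}


def min_cores_for_weights(num_params, auto_cast_type="fp32",
                          core_memory_bytes=16 * 1024**3):
    """Lower bound on Neuron cores needed to hold the weights only (rough estimate)."""
    weight_bytes = num_params * BYTES_PER_PARAM[auto_cast_type]
    return max(1, math.ceil(weight_bytes / core_memory_bytes))


# Approximate parameter count for an 8B-parameter model.
params_8b = 8_000_000_000
for dtype in ("fp32", "fp16"):
    print(dtype, min_cores_for_weights(params_8b, dtype))
```

The estimate also ignores any constraints the runtime places on valid core counts, so treat it as a starting point rather than a sizing rule.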

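An important restriction is that LLM models can only be exported on a Neuron platform, because they are tailored to fit the actual devices during export.

Exporting LLM models can take much longer than exporting standard models (sometimes more than one hour).

As explained earlier, the Neuron model parameters are static. In particular, this means that during inference: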

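The batch_size of the inputs must not exceed the batch_size used during export.
The length of the input sequences must not exceed the sequence_length used during export.
The maximum number of tokens (input + generated) cannot exceed the sequence_length used during export.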
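These constraints can be checked before calling the model. The sketch below is a hypothetical pre-flight check, not part of the library; it treats the exported values as inclusive upper bounds, which is an assumption about the boundary case.

```python
def check_generation_request(input_lengths, max_new_tokens,
                             export_batch_size, export_sequence_length):
    """Return a list of violated constraints (an empty list means the request fits)."""
    errors = []
    if len(input_lengths) > export_batch_size:
        errors.append(
            f"batch of {len(input_lengths)} exceeds exported batch_size {export_batch_size}"
        )
    longest = max(input_lengths, default=0)
    if longest > export_sequence_length:
        errors.append(
            f"input of {longest} tokens exceeds exported sequence_length "
            f"{export_sequence_length}"
        )
    if longest + max_new_tokens > export_sequence_length:
        errors.append(
            f"input + generated tokens ({longest + max_new_tokens}) exceed "
            f"exported sequence_length {export_sequence_length}"
        )
    return errors


# A model exported with batch_size=1 and sequence_length=4096.
print(check_generation_request([12], 100, 1, 4096))      # fits
print(check_generation_request([4000], 200, 1, 4096))    # too many total tokens
print(check_generation_request([10, 12], 100, 1, 4096))  # batch too large
```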

Once exported, Neuron models can simply be reloaded using the NeuronModelForCausalLM class. As with the original transformers models, use generate() instead of forward() to generate text sequences.

import torch
from transformers import AutoTokenizer
-from transformers import AutoModelForCausalLM
+from optimum.neuron import NeuronModelForCausalLM

# Load the model: the exported Neuron model replaces the PyTorch checkpoint
-model = AutoModelForCausalLM.from_pretrained("gpt2")
+model = NeuronModelForCausalLM.from_pretrained("./gpt2-neuron")

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token_id = tokenizer.eos_token_id

tokens = tokenizer("I really wish ", return_tensors="pt")
with torch.inference_mode():
    sample_output = model.generate(
        **tokens,
        do_sample=True,
        min_length=128,
        max_length=256,
        temperature=0.7,
    )
    outputs = [tokenizer.decode(tok) for tok in sample_output]
    print(outputs)

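Generation is highly configurable. See https://huggingface.co/docs/transformers/generation_strategies for details.

Please note that:

For each model architecture, default values are provided for all parameters, but values passed to the generate method take precedence.
Generation parameters can be stored in a generation_config.json file. When such a file is present in the model directory, it is parsed to set the default parameters (values passed to the generate method still take precedence).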
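The precedence described above (architecture defaults, then generation_config.json, then arguments passed to generate) can be pictured as successive dictionary updates. The sketch below illustrates only that ordering with a hypothetical helper and made-up default values; it is not how the library resolves parameters internally.

```python
import json
import tempfile
from pathlib import Path


def resolve_generation_params(architecture_defaults, model_dir, **generate_kwargs):
    """Merge parameters from lowest to highest precedence (illustrative sketch)."""
    params = dict(architecture_defaults)           # 1. architecture defaults
    config_file = Path(model_dir) / "generation_config.json"
    if config_file.is_file():
        params.update(json.loads(config_file.read_text()))  # 2. file overrides defaults
    params.update(generate_kwargs)                 # 3. explicit arguments win
    return params


with tempfile.TemporaryDirectory() as model_dir:
    # Simulate a model directory that ships a generation_config.json file.
    Path(model_dir, "generation_config.json").write_text(
        json.dumps({"temperature": 0.7, "max_length": 256})
    )
    resolved = resolve_generation_params(
        {"temperature": 1.0, "do_sample": False, "max_length": 20},
        model_dir,
        do_sample=True,
        max_length=128,
    )

print(resolved)
```

In this example temperature comes from the file, while do_sample and max_length come from the explicit arguments.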

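Exporting a model to Neuron programmatically via NeuronModel

As an alternative to optimum-cli, you can also export a model to Neuron from your own Python script or notebook using the optimum.neuron.NeuronModelForXXX model classes.

Here is an example: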

>>> from optimum.neuron import NeuronModelForSequenceClassification

>>> input_shapes = {"batch_size": 1, "sequence_length": 64}  # mandatory shapes
>>> model = NeuronModelForSequenceClassification.from_pretrained(
...   "distilbert-base-uncased-finetuned-sst-2-english", export=True, **input_shapes
... )

# Save the model
>>> model.save_pretrained("./distilbert-base-uncased-finetuned-sst-2-english_neuron/")

# Push the neuron model to HF Hub
>>> model.push_to_hub(
...     "a_local_path_for_compiled_neuron_model", repository_id="my-neuron-repo"
... )

This example can be adapted to other model types, using the same export parameters as optimum-cli.

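Exporting Neuron models using NeuronX TGI

The NeuronX TGI image includes not only the NeuronX runtime, but also all the packages and tools required to export Neuron models.

Use the following command to export a model to Neuron using a TGI image: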

docker run --entrypoint optimum-cli \
       -v $(pwd)/data:/data \
       --privileged \
       ghcr.io/huggingface/neuronx-tgi:latest \
       export neuron \
       --model <organization>/<model> \
       --batch_size 1 \
       --sequence_length 4096 \
       --auto_cast_type fp16 \
       --num_cores 2 \
       /data/<neuron_model_path>

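The exported model will be saved under ./data/<neuron_model_path>.

Export options and Docker / SageMaker environment variables

You must make sure that the options used for compilation match the options used for deployment.

You can see examples of these parameters in the .env and docker-compose.yaml files in the TGI Neuron backend documentation.

For Docker and SageMaker, these options correspond to the following environment variables, shown with their optimum-cli equivalents: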

HF_AUTO_CAST_TYPE = auto_cast_type
MAX_BATCH_SIZE = batch_size
MAX_TOTAL_TOKENS = sequence_length
HF_NUM_CORES = num_cores
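To keep export and deployment settings in sync, the mapping above can be generated from a single set of values. The sketch below uses a hypothetical helper to produce the environment variables listed above and print them in .env style; the values mirror the TGI export command shown earlier.

```python
def export_options_to_env(batch_size, sequence_length, auto_cast_type, num_cores):
    """Map optimum-cli export options to the deployment environment variables."""
    return {
        "HF_AUTO_CAST_TYPE": str(auto_cast_type),
        "MAX_BATCH_SIZE": str(batch_size),
        "MAX_TOTAL_TOKENS": str(sequence_length),
        "HF_NUM_CORES": str(num_cores),
    }


env = export_options_to_env(
    batch_size=1, sequence_length=4096, auto_cast_type="fp16", num_cores=2
)
for name, value in env.items():
    print(f"{name}={value}")
```

Deriving both the export command and the deployment environment from the same values reduces the chance of a mismatch between compilation and serving.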
