Export your model

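To export a model hosted on the Hub, you can use our space. After conversion, a repository is pushed under your namespace; this repository can be either public or private.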

Using the CLI

To export your model to the OpenVINO IR format with the CLI:

optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B ov_model/

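To export a private model or a model that requires access, you can either run huggingface-cli login to log in permanently, or set the environment variable HF_TOKEN to a token that has access to the model. See the authentication documentation for more information.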
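The token-based route can also be scripted. The following is a minimal sketch, not part of the library: `build_export_command` and `env_with_token` are hypothetical helper names introduced here for illustration. It only assembles the command line and an environment containing HF_TOKEN; the actual export is left commented out because it needs optimum-intel installed and a valid token.

```python
import os


def build_export_command(model, output_dir, task=None):
    """Return the argv list for `optimum-cli export openvino` (hypothetical helper)."""
    cmd = ["optimum-cli", "export", "openvino", "--model", model]
    if task is not None:
        # --task is only needed when it cannot be inferred, e.g. for local models.
        cmd += ["--task", task]
    cmd.append(output_dir)
    return cmd


def env_with_token(token, base_env=None):
    """Return a copy of the environment with HF_TOKEN set (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env["HF_TOKEN"] = token
    return env


cmd = build_export_command("meta-llama/Meta-Llama-3-8B", "ov_model/")

# To actually run the export, pass your own token:
# import subprocess
# subprocess.run(cmd, env=env_with_token("<your token>"), check=True)
```

Passing the token through the `env` argument keeps it scoped to the child process instead of changing the environment of the calling script.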

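The model argument can either be the ID of a model hosted on the Hub or a path to a model stored locally. For local models, you need to specify the task for which the model should be loaded before export, chosen from the list of supported tasks.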

optimum-cli export openvino --model local_llama --task text-generation-with-past ov_model/

Check out the help for more options:

usage: optimum-cli export openvino [-h] -m MODEL [--task TASK] [--framework {pt,tf}] [--trust-remote-code]
[--weight-format {fp32,fp16,int8,int4,mxfp4,nf4,cb4}]
[--quant-mode {int8,f8e4m3,f8e5m2,nf4_f8e4m3,nf4_f8e5m2,cb4_f8e4m3,int4_f8e4m3,int4_f8e5m2}]
[--library {transformers,diffusers,timm,sentence_transformers,open_clip}]
[--cache_dir CACHE_DIR] [--pad-token-id PAD_TOKEN_ID] [--ratio RATIO] [--sym]
[--group-size GROUP_SIZE] [--backup-precision {none,int8_sym,int8_asym}]
[--dataset DATASET] [--all-layers] [--awq] [--scale-estimation] [--gptq]
[--lora-correction] [--sensitivity-metric SENSITIVITY_METRIC]
[--quantization-statistics-path QUANTIZATION_STATISTICS_PATH]
[--num-samples NUM_SAMPLES] [--disable-stateful] [--disable-convert-tokenizer]
[--smooth-quant-alpha SMOOTH_QUANT_ALPHA]
output

optional arguments:
-h, --help show this help message and exit

Required arguments:
-m MODEL, --model MODEL
Model ID on huggingface.co or path on disk to load model from.
output Path indicating the directory where to store the generated OV model.

Optional arguments:
--task TASK The task to export the model for. If not specified, the task will be auto-inferred based on
the model. Available tasks depend on the model, but are among: ['image-to-image',
'image-segmentation', 'inpainting', 'sentence-similarity', 'text-to-audio', 'image-to-text',
'automatic-speech-recognition', 'token-classification', 'text-to-image', 'audio-classification',
'feature-extraction', 'semantic-segmentation', 'masked-im', 'audio-xvector',
'audio-frame-classification', 'text2text-generation', 'multiple-choice', 'depth-estimation',
'image-classification', 'fill-mask', 'zero-shot-object-detection', 'object-detection',
'question-answering', 'zero-shot-image-classification', 'mask-generation', 'text-generation',
'text-classification']. For decoder models, use 'xxx-with-past' to export the model using past
key values in the decoder.
--framework {pt,tf} The framework to use for the export. If not provided, will attempt to use the local
checkpoint's original framework or what is available in the environment.
--trust-remote-code Allows to use custom code for the modeling hosted in the model repository. This option should
only be set for repositories you trust and in which you have read the code, as it will execute
on your local machine arbitrary code present in the model repository.
--weight-format {fp32,fp16,int8,int4,mxfp4,nf4,cb4}
The weight format of the exported model. Option 'cb4' represents a codebook with 16
fixed fp8 values in E4M3 format.
--quant-mode {int8,f8e4m3,f8e5m2,nf4_f8e4m3,nf4_f8e5m2,cb4_f8e4m3,int4_f8e4m3,int4_f8e5m2}
Quantization precision mode. This is used for applying full model quantization including
activations.
--library {transformers,diffusers,timm,sentence_transformers,open_clip}
The library used to load the model before export. If not provided, will attempt to infer the
local checkpoint's library
--cache_dir CACHE_DIR
The path to a directory in which the downloaded model should be cached if the standard cache
should not be used.
--pad-token-id PAD_TOKEN_ID
This is needed by some models, for some tasks. If not provided, will attempt to use the
tokenizer to guess it.
--ratio RATIO A parameter used when applying 4-bit quantization to control the ratio between 4-bit and 8-bit
quantization. If set to 0.8, 80% of the layers will be quantized to int4 while 20% will be
quantized to int8. This helps to achieve better accuracy at the sacrifice of the model size
and inference latency. Default value is 1.0. Note: If dataset is provided, and the ratio is
less than 1.0, then data-aware mixed precision assignment will be applied.
--sym Whether to apply symmetric quantization. This argument is related to integer-typed
--weight-format and --quant-mode options. In case of full or mixed quantization (--quant-mode)
symmetric quantization will be applied to weights in any case, so only activation quantization
will be affected by --sym argument. For weight-only quantization (--weight-format) --sym
argument does not affect backup precision. Examples: (1) --weight-format int8 --sym => int8
symmetric quantization of weights; (2) --weight-format int4 => int4 asymmetric quantization of
weights; (3) --weight-format int4 --sym --backup-precision int8_asym => int4 symmetric
quantization of weights with int8 asymmetric backup precision; (4) --quant-mode int8 --sym =>
weights and activations are quantized to int8 symmetric data type; (5) --quant-mode int8 =>
activations are quantized to int8 asymmetric data type, weights -- to int8 symmetric data type;
(6) --quant-mode int4_f8e5m2 --sym => activations are quantized to f8e5m2 data type, weights --
to int4 symmetric data type.
--group-size GROUP_SIZE
The group size to use for quantization. Recommended value is 128 and -1 uses per-column
quantization.
--backup-precision {none,int8_sym,int8_asym}
Defines a backup precision for mixed-precision weight compression. Only valid for 4-bit weight
formats. If not provided, backup precision is int8_asym. 'none' stands for original floating-
point precision of the model weights, in this case weights are retained in their original
precision without any quantization. 'int8_sym' stands for 8-bit integer symmetric quantization
without zero point. 'int8_asym' stands for 8-bit integer asymmetric quantization with zero
points per each quantization group.
--dataset DATASET The dataset used for data-aware compression or quantization with NNCF. For language models you
can use the one from the list ['auto','wikitext2','c4','c4-new']. With 'auto' the dataset will
be collected from model's generations. For diffusion models it should be on of
['conceptual_captions','laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit']. For
visual language models the dataset must be set to 'contextual'. Note: if none of the data-aware
compression algorithms are selected and ratio parameter is omitted or equals 1.0, the dataset
argument will not have an effect on the resulting model.
--all-layers Whether embeddings and last MatMul layers should be compressed to INT4. If not provided an
weight compression is applied, they are compressed to INT8.
--awq Whether to apply AWQ algorithm. AWQ improves generation quality of INT4-compressed LLMs. If
dataset is provided, a data-aware activation-based version of the algorithm will be executed,
which requires additional time. Otherwise, data-free AWQ will be applied which relies on
per-column magnitudes of weights instead of activations. Note: it is possible that there will
be no matching patterns in the model to apply AWQ, in such case it will be skipped.
--scale-estimation Indicates whether to apply a scale estimation algorithm that minimizes the L2 error between
the original and compressed layers. Providing a dataset is required to run scale estimation.
Please note, that applying scale estimation takes additional memory and time.
--gptq Indicates whether to apply GPTQ algorithm that optimizes compressed weights in a layer-wise
fashion to minimize the difference between activations of a compressed and original layer.
Please note, that applying GPTQ takes additional memory and time.
--lora-correction Indicates whether to apply LoRA Correction algorithm. When enabled, this algorithm introduces
low-rank adaptation layers in the model that can recover accuracy after weight compression at
some cost of inference latency. Please note, that applying LoRA Correction algorithm takes
additional memory and time.
--sensitivity-metric SENSITIVITY_METRIC
The sensitivity metric for assigning quantization precision to layers. It can be one of the
following: ['weight_quantization_error', 'hessian_input_activation',
'mean_activation_variance', 'max_activation_variance', 'mean_activation_magnitude'].
--quantization-statistics-path QUANTIZATION_STATISTICS_PATH
Directory path to dump/load data-aware weight-only quantization statistics. This is useful when
running data-aware quantization multiple times on the same model and dataset to avoid
recomputing statistics. This option is applicable exclusively for weight-only quantization.
Please note that the statistics depend on the dataset, so if you change the dataset, you should
also change the statistics path to avoid confusion.
--num-samples NUM_SAMPLES
The maximum number of samples to take from the dataset for quantization.
--disable-stateful Disable stateful converted models, stateless models will be generated instead. Stateful models
are produced by default when this key is not used. In stateful models all kv-cache inputs and
outputs are hidden in the model and are not exposed as model inputs and outputs. If --disable-
stateful option is used, it may result in sub-optimal inference performance. Use it when you
intentionally want to use a stateless model, for example, to be compatible with existing
OpenVINO native inference code that expects KV-cache inputs and outputs in the model.
--disable-convert-tokenizer
Do not add converted tokenizer and detokenizer OpenVINO models.
--smooth-quant-alpha SMOOTH_QUANT_ALPHA
SmoothQuant alpha parameter that improves the distribution of activations before MatMul layers
and reduces quantization error. Valid only when activations quantization is enabled.

When exporting your model, you can also apply fp16, 8-bit or 4-bit weight-only quantization to the Linear, Convolutional and Embedding layers by setting `--weight-format` to `fp16`, `int8` or `int4` respectively.

Export with INT8 weight compression:

optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B --weight-format int8 ov_model/

Export with INT4 weight compression:

optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B --weight-format int4 ov_model/

Export with INT4 weight compression and the data-free AWQ algorithm:

optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B --weight-format int4 --awq ov_model/

Export with INT4 weight compression and the data-aware AWQ and Scale Estimation algorithms:

optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B \
    --weight-format int4 --awq --scale-estimation --dataset wikitext2 ov_model/

For more information on the quantization parameters, check out the documentation.
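When the export is driven from a script, the weight-compression flags used in the commands above can be assembled programmatically. This is a sketch under stated assumptions: the `WeightCompression` class is hypothetical and is not an optimum-intel API. It only encodes the flags shown above, plus the rule from the help text that scale estimation requires a dataset.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WeightCompression:
    """Hypothetical container for a few `optimum-cli export openvino` flags."""

    weight_format: str = "int4"
    awq: bool = False
    scale_estimation: bool = False
    dataset: Optional[str] = None

    def to_args(self) -> List[str]:
        # The help text states that scale estimation needs a dataset.
        if self.scale_estimation and self.dataset is None:
            raise ValueError("--scale-estimation requires --dataset")
        args = ["--weight-format", self.weight_format]
        if self.awq:
            # Without a dataset this selects data-free AWQ, with one data-aware AWQ.
            args.append("--awq")
        if self.scale_estimation:
            args.append("--scale-estimation")
        if self.dataset is not None:
            args += ["--dataset", self.dataset]
        return args


data_aware = WeightCompression(awq=True, scale_estimation=True, dataset="wikitext2")
flags = data_aware.to_args()
```

The resulting list can be placed between the `--model` argument and the output directory of the command.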

By default, models with more than 1 billion parameters are exported to the OpenVINO format with 8-bit weights. You can disable this with `--weight-format fp32`.
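The default rule can be written out as a small decision function. This is an illustrative sketch of the behaviour as described in the sentence above, not the exporter's actual implementation; `default_weight_format` is a hypothetical name. The text does not say what is applied to smaller models, so the sketch returns `None` for them rather than guessing.

```python
ONE_BILLION = 1_000_000_000


def default_weight_format(num_parameters, requested=None):
    """Sketch of the documented default (hypothetical helper).

    An explicit --weight-format always wins. Otherwise, models with more than
    1 billion parameters get 8-bit weights. For smaller models the source text
    does not state a default, so None is returned.
    """
    if requested is not None:
        return requested
    if num_parameters > ONE_BILLION:
        return "int8"
    return None


large_model_default = default_weight_format(8_000_000_000)
large_model_fp32 = default_weight_format(8_000_000_000, requested="fp32")
```

Here an 8-billion-parameter model ends up with `int8` unless `fp32` is requested explicitly.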

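Besides weight-only quantization, you can also apply full model quantization, including activations, by setting `--quant-mode` to the preferred precision. This quantizes both the weights and the activations of Linear, Convolutional and some other layers to the selected mode. See the example below.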

optimum-cli export openvino -m openai/whisper-large-v3-turbo --quant-mode int8 --dataset librispeech --num-samples 32 --smooth-quant-alpha 0.9 ./whisper-large-v3-turbo

Default quantization configs

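For some models, we maintain a set of default quantization configs (link). To apply the default 4-bit weight-only quantization, provide --weight-format int4 without any additional arguments. For int8 weight and activation quantization, provide --quant-mode int8. For example: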

optimum-cli export openvino -m microsoft/Phi-4-mini-instruct --weight-format int4 ./Phi-4-mini-instruct

or

optimum-cli export openvino -m openai/clip-vit-base-patch16 --quant-mode int8 ./clip-vit-base-patch16

Decoder models

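For models with a decoder, reuse of past keys and values is enabled by default. This avoids recomputing the same intermediate activations at every generation step. To export the model without this feature, remove the -with-past suffix when specifying the task.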

With K-V cache	Without K-V cache
`text-generation-with-past`	`text-generation`
`text2text-generation-with-past`	`text2text-generation`
`automatic-speech-recognition-with-past`	`automatic-speech-recognition`
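The mapping in the table is purely a matter of the `-with-past` suffix, so it can be expressed with two small string helpers. This is a sketch for scripts that build the `--task` value; `with_past` and `without_past` are hypothetical names, not functions from optimum-intel.

```python
SUFFIX = "-with-past"


def without_past(task):
    """Return the task name for export without K-V cache reuse."""
    return task[: -len(SUFFIX)] if task.endswith(SUFFIX) else task


def with_past(task):
    """Return the task name for export with K-V cache reuse."""
    return task if task.endswith(SUFFIX) else task + SUFFIX


no_cache_task = without_past("text-generation-with-past")
cache_task = with_past("automatic-speech-recognition")
```

Both helpers are idempotent, so applying them to a task name that is already in the desired form leaves it unchanged.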

Diffusion models

When Stable Diffusion models are exported to the OpenVINO format, they are decomposed into different components that are later combined during inference:

Text encoder
U-Net
VAE encoder
VAE decoder

To export your Stable Diffusion XL model to the OpenVINO IR format with the CLI, you can do as follows:

optimum-cli export openvino --model stabilityai/stable-diffusion-xl-base-1.0 ov_sdxl/

You can also apply hybrid quantization during model export. For example:

optimum-cli export openvino --model stabilityai/stable-diffusion-xl-base-1.0 \
    --weight-format int8 --dataset conceptual_captions ov_sdxl/

For more information about hybrid quantization, take a look at this Jupyter notebook.

When loading your model

You can also load your PyTorch checkpoint and convert it to the OpenVINO format on the fly by setting export=True when loading your model.

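To easily save the resulting model, you can use the save_pretrained() method, which saves both the BIN and XML files describing the graph. Save the tokenizer to the same directory so that the matching tokenizer can be loaded easily alongside the model.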

- from transformers import AutoModelForCausalLM
+ from optimum.intel import OVModelForCausalLM
  from transformers import AutoTokenizer

  model_id = "meta-llama/Meta-Llama-3-8B"
- model = AutoModelForCausalLM.from_pretrained(model_id)
+ model = OVModelForCausalLM.from_pretrained(model_id, export=True)
  tokenizer = AutoTokenizer.from_pretrained(model_id)

  save_directory = "ov_model"
  model.save_pretrained(save_directory)
  tokenizer.save_pretrained(save_directory)

After loading your model

from transformers import AutoModelForCausalLM
from optimum.exporters.openvino import export_from_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
export_from_model(model, output="ov_model", task="text-generation-with-past")

Once the model is exported, you can load your OpenVINO model by replacing the AutoModelForXxx class with the corresponding OVModelForXxx class.
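As an illustration of that class swap, the sketch below loads an exported directory with `OVModelForCausalLM` and generates text. It assumes optimum-intel and transformers are installed and that the directory (for example `ov_model`) contains both the exported model and its tokenizer. The wrapper function `generate_with_openvino` and its parameters are hypothetical names chosen for this example.

```python
def generate_with_openvino(model_dir, prompt, max_new_tokens=20):
    """Load an exported OpenVINO model and generate a continuation of `prompt`."""
    # Imported here so that defining the function does not require optimum-intel.
    from optimum.intel import OVModelForCausalLM
    from transformers import AutoTokenizer

    # No export=True here: the directory already holds the OpenVINO IR files.
    model = OVModelForCausalLM.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call, assuming the model was exported to "ov_model" beforehand:
# print(generate_with_openvino("ov_model", "OpenVINO is"))
```

The rest of the generation code is the same as with the transformers class; only the model class and the source directory change.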
