Quantization
🤗 Optimum provides the `optimum.onnxruntime` package, which enables you to apply quantization to many models hosted on the Hugging Face Hub using the ONNX Runtime quantization tool.
The quantization process is abstracted via the ORTConfig and ORTQuantizer classes. The former allows you to specify how quantization should be done, while the latter effectively handles the quantization.
You can read the quantization concept guide to learn about quantization. It explains the main concepts that you will be using when performing quantization with the ORTQuantizer.
Quantizing a model with the Optimum CLI
The Optimum ONNX Runtime quantization tool can be used through the Optimum command-line interface:
optimum-cli onnxruntime quantize --help
usage: optimum-cli <command> [<args>] onnxruntime quantize [-h] --onnx_model ONNX_MODEL -o OUTPUT [--per_channel] (--arm64 | --avx2 | --avx512 | --avx512_vnni | --tensorrt | -c CONFIG)

options:
  -h, --help            show this help message and exit
  --arm64               Quantization for the ARM64 architecture.
  --avx2                Quantization with AVX-2 instructions.
  --avx512              Quantization with AVX-512 instructions.
  --avx512_vnni         Quantization with AVX-512 and VNNI instructions.
  --tensorrt            Quantization for NVIDIA TensorRT optimizer.
  -c CONFIG, --config CONFIG
                        `ORTConfig` file to use to optimize the model.

Required arguments:
  --onnx_model ONNX_MODEL
                        Path to the repository where the ONNX models to quantize are located.
  -o OUTPUT, --output OUTPUT
                        Path to the directory where to store generated ONNX model.

Optional arguments:
  --per_channel         Compute the quantization parameters on a per-channel basis.
Quantizing an ONNX model can be done as follows:
optimum-cli onnxruntime quantize --onnx_model onnx_model_location/ --avx512 -o quantized_model/
This quantizes all the ONNX files in `onnx_model_location` with AVX-512 instructions.
Creating an ORTQuantizer
The ORTQuantizer class is used to quantize your ONNX model. The class can be initialized using the `from_pretrained()` method, which supports different checkpoint formats.
- Using an already initialized `ORTModelForXXX` class.
>>> from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
# Loading ONNX Model from the Hub
>>> ort_model = ORTModelForSequenceClassification.from_pretrained(
... "optimum/distilbert-base-uncased-finetuned-sst-2-english"
... )
# Create a quantizer from a ORTModelForXXX
>>> quantizer = ORTQuantizer.from_pretrained(ort_model)
- Using a local ONNX model from a directory.
>>> from optimum.onnxruntime import ORTQuantizer
# This assumes a model.onnx exists in path/to/model
>>> quantizer = ORTQuantizer.from_pretrained("path/to/model")
Applying dynamic quantization
The ORTQuantizer class can be used to dynamically quantize your ONNX model. Below is a simple end-to-end example of how to dynamically quantize distilbert-base-uncased-finetuned-sst-2-english:
>>> from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig
# Load PyTorch model and convert to ONNX
>>> onnx_model = ORTModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", export=True)
# Create quantizer
>>> quantizer = ORTQuantizer.from_pretrained(onnx_model)
# Define the quantization strategy by creating the appropriate configuration
>>> dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
# Quantize the model
>>> model_quantized_path = quantizer.quantize(
... save_dir="path/to/output/model",
... quantization_config=dqconfig,
... )
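Once quantization has completed, the quantized model can be loaded back for inference like any other ONNX model. A minimal sketch, assuming the default `model_quantized.onnx` output file name produced by `quantize()` (the exact name depends on the original file name and the `file_suffix` argument):
>>> from transformers import AutoTokenizer, pipeline
>>> from optimum.onnxruntime import ORTModelForSequenceClassification

# Load the quantized model back; "model_quantized.onnx" is the assumed
# default output file name of quantize()
>>> quantized_model = ORTModelForSequenceClassification.from_pretrained(
...     "path/to/output/model", file_name="model_quantized.onnx"
... )
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
>>> classifier = pipeline("text-classification", model=quantized_model, tokenizer=tokenizer)
>>> classifier("I love the new quantized model!")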
Static quantization example
The ORTQuantizer class can be used to statically quantize your ONNX model. Below is a simple end-to-end example of how to statically quantize distilbert-base-uncased-finetuned-sst-2-english:
>>> from functools import partial
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTQuantizer, ORTModelForSequenceClassification
>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig, AutoCalibrationConfig
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# Load the PyTorch model, convert it to ONNX, create the quantizer and set up the config
>>> onnx_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> quantizer = ORTQuantizer.from_pretrained(onnx_model)
>>> qconfig = AutoQuantizationConfig.arm64(is_static=True, per_channel=False)
# Create the calibration dataset
>>> def preprocess_fn(ex, tokenizer):
... return tokenizer(ex["sentence"])
>>> calibration_dataset = quantizer.get_calibration_dataset(
... "glue",
... dataset_config_name="sst2",
... preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
... num_samples=50,
... dataset_split="train",
... )
# Create the calibration configuration containing the parameters related to calibration.
>>> calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)
# Perform the calibration step: computes the activations quantization ranges
>>> ranges = quantizer.fit(
... dataset=calibration_dataset,
... calibration_config=calibration_config,
... operators_to_quantize=qconfig.operators_to_quantize,
... )
# Apply static quantization on the model
>>> model_quantized_path = quantizer.quantize(
... save_dir="path/to/output/model",
... calibration_tensors_range=ranges,
... quantization_config=qconfig,
... )
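As a quick sanity check, you can compare the on-disk size of the original and the quantized model; INT8 weights should make the file roughly four times smaller. A minimal sketch, assuming the default `model.onnx` and `model_quantized.onnx` file names:
>>> import os

# Compare file sizes; the file names are assumptions based on the
# defaults used above
>>> fp32_size = os.path.getsize(os.path.join(onnx_model.model_save_dir, "model.onnx"))
>>> int8_size = os.path.getsize(os.path.join("path/to/output/model", "model_quantized.onnx"))
>>> print(f"FP32: {fp32_size / 1e6:.1f} MB, INT8: {int8_size / 1e6:.1f} MB")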
Quantizing Seq2Seq models
The ORTQuantizer class currently does not support multi-file models such as ORTModelForSeq2SeqLM. To quantize a Seq2Seq model, each component of the model must be quantized individually.
Currently, only dynamic quantization is supported for Seq2Seq models.
- Load the seq2seq model as an `ORTModelForSeq2SeqLM`.
>>> from optimum.onnxruntime import ORTQuantizer, ORTModelForSeq2SeqLM
>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig
# load Seq2Seq model and set model file directory
>>> model_id = "optimum/t5-small"
>>> onnx_model = ORTModelForSeq2SeqLM.from_pretrained(model_id)
>>> model_dir = onnx_model.model_save_dir
- Define the quantizers for the encoder, the decoder, and the decoder with past key values.
# Create encoder quantizer
>>> encoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="encoder_model.onnx")
# Create decoder quantizer
>>> decoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="decoder_model.onnx")
# Create decoder with past key values quantizer
>>> decoder_wp_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="decoder_with_past_model.onnx")
# Create Quantizer list
>>> quantizer = [encoder_quantizer, decoder_quantizer, decoder_wp_quantizer]
- Quantize all the models.
# Define the quantization strategy by creating the appropriate configuration
>>> dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
# Quantize the models individually
>>> for q in quantizer:
...     q.quantize(save_dir=".", quantization_config=dqconfig)
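The quantized components can then be loaded back into a single model. A sketch, assuming the default `_quantized` file-name suffix and per-component `file_name` arguments of `ORTModelForSeq2SeqLM.from_pretrained()` (parameter names may differ across optimum versions):
>>> from optimum.onnxruntime import ORTModelForSeq2SeqLM

# The per-component file name arguments and the "_quantized" suffix are
# assumptions; check your optimum version for the exact parameter names
>>> quantized_model = ORTModelForSeq2SeqLM.from_pretrained(
...     ".",
...     encoder_file_name="encoder_model_quantized.onnx",
...     decoder_file_name="decoder_model_quantized.onnx",
...     decoder_with_past_file_name="decoder_with_past_model_quantized.onnx",
... )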