Overview

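Quantization lowers the memory requirements of loading and using a model by storing the weights in a lower precision while trying to preserve as much accuracy as possible. Weights are typically stored in full-precision (fp32) floating point representations, but half-precision (fp16 or bf16) is increasingly popular given the large size of models today. Some quantization methods can reduce the precision even further to integer representations, like int8 or int4.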
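To make the memory savings concrete, here is a minimal back-of-the-envelope sketch. It assumes a hypothetical 7-billion-parameter model (not one named in this document) and counts only the raw weight storage. Real footprints are somewhat larger, since quantized formats also store metadata such as scales, and inference needs memory for activations as well.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory needed to store the weights, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 7-billion-parameter model, chosen only for illustration.
NUM_PARAMS = 7_000_000_000

footprints = {
    name: weight_memory_gb(NUM_PARAMS, bits)
    for name, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("int4", 4)]
}

for name, gb in footprints.items():
    print(f"{name:>9}: {gb:.1f} GB")
```

Under these assumptions, the weights alone take about 28 GB in fp32, 14 GB in half precision, 7 GB in int8, and 3.5 GB in int4.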

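Transformers supports many quantization methods, each with its own pros and cons, so you can pick the best one for your specific use case. Some methods require calibration for greater accuracy and extreme compression (1-2 bits), while other methods work out of the box with on-the-fly quantization.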
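The difference between the two families can be illustrated with a toy example. The sketch below uses per-tensor absmax int8 quantization, a deliberately simple scheme chosen for illustration and not the algorithm of any particular library in the table. It is "on-the-fly" in the sense that the scale is derived from the weights alone, with no calibration data. Calibration-based methods instead use sample inputs to choose quantization parameters that minimize the error on real activations.

```python
def quantize_absmax_int8(weights):
    """Quantize a list of floats to int8 values using a single absmax scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to approximate floating point weights."""
    return [q * scale for q in quantized]


# Toy weights, chosen only for illustration.
weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

# The round-trip error per weight is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of at most half the scale.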

Use the Space below to help you pick a quantization method depending on your hardware and the number of bits to quantize to.

Quantization Method	On-the-fly quantization	CPU	CUDA GPU	ROCm GPU	Metal (Apple Silicon)	Intel GPU	torch.compile()	Number of bits	PEFT Fine Tuning	Serializable with 🤗Transformers	🤗Transformers Support	Link to library
AQLM	🔴	🟢	🟢	🔴	🔴	🔴	🟢	1/2	🟢	🟢	🟢	https://github.com/Vahe1994/AQLM
AutoRound	🔴	🟢	🟢	🔴	🔴	🟢	🔴	2/3/4/8	🔴	🟢	🟢	https://github.com/intel/auto-round
AWQ	🔴	🟢	🟢	🟢	🔴	🟢	?	4	🟢	🟢	🟢	https://github.com/casper-hansen/AutoAWQ
bitsandbytes	🟢	🟡	🟢	🟡	🔴	🟡	🟢	4/8	🟢	🟢	🟢	https://github.com/bitsandbytes-foundation/bitsandbytes
compressed-tensors	🔴	🟢	🟢	🟢	🔴	🔴	🔴	1-8	🟢	🟢	🟢	https://github.com/neuralmagic/compressed-tensors
EETQ	🟢	🔴	🟢	🔴	🔴	🔴	?	8	🟢	🟢	🟢	https://github.com/NetEase-FuXi/EETQ
GGUF / GGML (llama.cpp)	🟢	🟢	🟢	🔴	🟢	🔴	🔴	1-8	🔴	See Notes	See Notes	https://github.com/ggerganov/llama.cpp
GPTQModel	🔴	🟢	🟢	🟢	🟢	🟢	🔴	2/3/4/8	🟢	🟢	🟢	https://github.com/ModelCloud/GPTQModel
AutoGPTQ	🔴	🔴	🟢	🟢	🔴	🔴	🔴	2/3/4/8	🟢	🟢	🟢	https://github.com/AutoGPTQ/AutoGPTQ
HIGGS	🟢	🔴	🟢	🔴	🔴	🔴	🟢	2/4	🔴	🟢	🟢	https://github.com/HanGuo97/flute
HQQ	🟢	🟢	🟢	🔴	🔴	🔴	🟢	1-8	🟢	🔴	🟢	https://github.com/mobiusml/hqq/
optimum-quanto	🟢	🟢	🟢	🔴	🟢	🔴	🟢	2/4/8	🔴	🔴	🟢	https://github.com/huggingface/optimum-quanto
FBGEMM_FP8	🟢	🔴	🟢	🔴	🔴	🔴	🔴	8	🔴	🟢	🟢	https://github.com/pytorch/FBGEMM
torchao	🟢	🟢	🟢	🔴	🟡	🔴		4/8		🟢🔴	🟢	https://github.com/pytorch/ao
VPTQ	🔴	🔴	🟢	🟡	🔴	🔴	🟢	1-8	🔴	🟢	🟢	https://github.com/microsoft/VPTQ
FINEGRAINED_FP8	🟢	🔴	🟢	🔴	🔴	🔴	🔴	8	🔴	🟢	🟢
SpQR	🔴	🔴	🟢	🔴	🔴	🔴	🟢	3	🔴	🟢	🟢	https://github.com/Vahe1994/SpQR/
Quark	🔴	🟢	🟢	🟢	🟢	🟢	?	2/4/6/8/9/16	🔴	🔴	🟢	https://quark.docs.amd.com/latest/
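As an example of how a row in the table translates into usage, the sketch below loads a model with on-the-fly 4-bit bitsandbytes quantization. It assumes a CUDA GPU and the `torch`, `transformers`, `accelerate`, and `bitsandbytes` packages. The helper name `load_in_4bit` and its `model_id` argument are hypothetical and only serve the illustration, and the choice of bfloat16 as the compute dtype is an assumption rather than a recommendation from this page.

```python
def load_in_4bit(model_id):
    """Load a causal language model with on-the-fly 4-bit quantization."""
    # Imported lazily so this sketch can be imported without the heavy
    # dependencies installed; they are only needed when the function runs.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Assumption: bfloat16 compute is supported by the target GPU.
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    # No calibration step: weights are quantized while they are loaded.
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quantization_config,
        device_map="auto",
    )


# Usage (supply your own model id from the Hub):
# model = load_in_4bit("<model id>")
# print(model.get_memory_footprint())  # size in bytes
```

Other methods in the table follow the same pattern with their own configuration class, while methods without on-the-fly support typically expect a checkpoint that was already quantized with the linked library.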

Resources

If you are new to quantization, we recommend checking out these beginner-friendly quantization courses offered in collaboration with DeepLearning.AI.

User-Friendly Quantization Tools

If you are looking for a user-friendly quantization experience, you can use the following community Spaces and notebooks.
