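Model quantization

bitsandbytes integration

Accelerate brings bitsandbytes quantization to your model. You can now load any PyTorch model in 8-bit or 4-bit precision with a few lines of code.

If you want to use Transformers models with bitsandbytes, you should follow this documentation.

To learn more about how bitsandbytes quantization works, check out the blog posts on 8-bit quantization and 4-bit quantization.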

Prerequisites

You will need to install the following dependencies:

Install the bitsandbytes library

pip install bitsandbytes

For non-CUDA devices, refer to the bitsandbytes installation guide.

Install the latest accelerate from source

pip install git+https://github.com/huggingface/accelerate.git

Install minGPT and huggingface_hub to run the examples

git clone https://github.com/karpathy/minGPT.git
pip install minGPT/
pip install huggingface_hub

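How it works

First, we need to initialize the model. To save memory, we can use the context manager init_empty_weights() to initialize an empty model, so that no memory is allocated for the weights at this stage.

We take the GPT2 model from the minGPT library as an example.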

from accelerate import init_empty_weights
from mingpt.model import GPT

model_config = GPT.get_default_config()
model_config.model_type = 'gpt2-xl'
model_config.vocab_size = 50257
model_config.block_size = 1024

with init_empty_weights():
    empty_model = GPT(model_config)

Then, we need the path to the model weights. The path can be a state_dict file (e.g. "pytorch_model.bin") or a folder containing sharded checkpoints.

from huggingface_hub import snapshot_download
weights_location = snapshot_download(repo_id="marcsun13/gpt2-xl-linear-sharded")

Finally, you need to set your quantization configuration with BnbQuantizationConfig.

Here is an example for 8-bit quantization:

from accelerate.utils import BnbQuantizationConfig
bnb_quantization_config = BnbQuantizationConfig(load_in_8bit=True, llm_int8_threshold = 6)
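
To give a rough intuition for what storing weights in 8 bits means, the following is a minimal sketch of absmax quantization in plain Python. It is an illustration only, not the bitsandbytes implementation: the function names absmax_quantize and dequantize are made up for this example, a single scale is used for the whole list, and the outlier handling that llm_int8_threshold controls is not modeled at all.

```python
def absmax_quantize(values):
    """Map floats to integer codes in [-127, 127] using one absmax scale.

    Hypothetical helper for illustration; not part of bitsandbytes or Accelerate.
    """
    max_abs = max(abs(v) for v in values)
    # Avoid dividing by zero when every value is 0.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.5, -1.27, 0.02, 1.0]
codes, scale = absmax_quantize(weights)
restored = dequantize(codes, scale)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.4f} -> code {c:+4d} -> {r:+.4f}")
```

Each weight is stored as a one-byte code plus a shared scale, and the round trip loses at most half a quantization step per value.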

Here is an example for 4-bit quantization:

import torch  # needed for torch.bfloat16 below
from accelerate.utils import BnbQuantizationConfig
bnb_quantization_config = BnbQuantizationConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4")

To quantize the empty model with the selected configuration, use load_and_quantize_model().

from accelerate.utils import load_and_quantize_model
quantized_model = load_and_quantize_model(empty_model, weights_location=weights_location, bnb_quantization_config=bnb_quantization_config)

Saving and loading an 8-bit model

You can save your 8-bit model with accelerate using save_model().

from accelerate import Accelerator
# Named "accelerator" so the variable does not shadow the accelerate package
accelerator = Accelerator()
new_weights_location = "path/to/save_directory"
accelerator.save_model(quantized_model, new_weights_location)

# Reload the saved 8-bit weights into the empty model
quantized_model_from_saved = load_and_quantize_model(empty_model, weights_location=new_weights_location, bnb_quantization_config=bnb_quantization_config, device_map = "auto")

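Note that 4-bit model serialization is currently not supported.

Offload modules to CPU and disk

If your GPU does not have enough space to store the entire model, you can offload some modules to CPU or disk. This uses big model inference under the hood. Check this documentation for more details.

For 8-bit quantization, the selected modules will be converted to 8-bit precision.

For 4-bit quantization, the selected modules will be kept in the torch_dtype that the user passed in BnbQuantizationConfig. We will add support for converting these offloaded modules to 4-bit when 4-bit serialization becomes possible.

You just need to pass a custom device_map to offload modules to CPU/disk. The offloaded modules will be dispatched to the GPU when needed. Here is an example: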

device_map = {
    "transformer.wte": 0,
    "transformer.wpe": 0,
    "transformer.drop": 0,
    "transformer.h": "cpu",
    "transformer.ln_f": "disk",
    "lm_head": "disk",
}
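
To make the mapping above concrete, here is a small sketch of how a submodule name can be matched against such a device_map by prefix, so that an entry like "transformer.h" covers every block beneath it. The helper resolve_device is hypothetical and written only for this illustration; it is not Accelerate's actual dispatch code, and the submodule name "transformer.h.0.attn.c_attn" is used only as a plausible example of a nested module.

```python
def resolve_device(module_name, device_map):
    """Return the device of the longest device_map key that covers module_name.

    Hypothetical helper for illustration; not part of the Accelerate API.
    A key covers a module if it equals the name or is a dotted prefix of it.
    """
    best_key = None
    for key in device_map:
        if module_name == key or module_name.startswith(key + "."):
            if best_key is None or len(key) > len(best_key):
                best_key = key
    if best_key is None:
        raise KeyError(f"no device_map entry covers {module_name!r}")
    return device_map[best_key]


device_map = {
    "transformer.wte": 0,
    "transformer.wpe": 0,
    "transformer.drop": 0,
    "transformer.h": "cpu",
    "transformer.ln_f": "disk",
    "lm_head": "disk",
}

for name in ["transformer.wte", "transformer.h.0.attn.c_attn", "lm_head"]:
    print(name, "->", resolve_device(name, device_map))
```

In this map the embeddings stay on GPU 0, all transformer blocks are offloaded to CPU, and the final layer norm and the head are offloaded to disk.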

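Fine-tune a quantized model

It is not possible to perform pure 8-bit or 4-bit training on these models. However, you can train them with parameter-efficient fine-tuning (PEFT) methods, for example by training adapters on top of them. See the peft library for more details.

Currently, you cannot add adapters on top of an arbitrary quantized model. However, with the official support for adapters in Transformers models, you can fine-tune quantized models. If you want to fine-tune a Transformers model, follow this documentation instead. Check out this demo to see how to fine-tune a 4-bit Transformers model.

Note that you don't need to pass device_map when loading the model for training; it is automatically loaded onto your GPU. device_map="auto" should be used for inference only.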

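Example demo - running GPT2 1.5b on Google Colab

Check out the Google Colab demo for running quantized models on a GPT2 model. The GPT2-1.5B checkpoint is in FP32 and uses 6GB of memory. After quantization, it uses 1.6GB with 8-bit modules and 1.2GB with 4-bit modules.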
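
The FP32 figure can be checked with simple arithmetic. The sketch below is a back-of-the-envelope estimate, assuming exactly 1.5e9 parameters (the "1.5B" in the model name is itself approximate) and counting weight storage only. The helper name weight_memory_gb is made up for this illustration.

```python
NUM_PARAMS = 1.5e9  # assumption: "1.5B" taken literally


def weight_memory_gb(num_params, bits_per_param):
    """Idealized weight storage in GB (1 GB = 1e9 bytes), ignoring any overhead."""
    return num_params * bits_per_param / 8 / 1e9


for label, bits in [("FP32", 32), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

The FP32 estimate matches the 6GB quoted above. The reported 1.6GB and 1.2GB are higher than the idealized 8-bit and 4-bit estimates, which is what you would expect if some modules are not quantized and the quantized format stores some extra state; the sketch models neither effect.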
