torchao
torchao is a PyTorch architecture optimization library that supports custom high-performance data types, quantization, and sparsity. It composes with native PyTorch features such as torch.compile for faster inference and training.
Refer to the table below for additional torchao features.
Feature | Description |
---|---|
Quantization Aware Training (QAT) | Train quantized models with minimal accuracy loss (see the QAT README) |
Float8 training | High-throughput training with the float8 format (see torchtitan and the Accelerate docs) |
Sparsity support | Semi-structured (2:4) sparsity for faster inference (see the Accelerating Neural Network Training with Semi-Structured (2:4) Sparsity blog post) |
Optimizer quantization | Reduce optimizer state memory with 4-bit and 8-bit variants of Adam |
KV cache quantization | Long-context inference with lower memory (see KV cache quantization) |
Custom kernel support | Use your own torch.compile compatible ops |
FSDP2 | Composes with FSDP2 for training |
Refer to the torchao README.md for more details about the library.
torchao supports the following quantization techniques (a mapping to the corresponding config classes is sketched after the list).
- A16W8 Float8 dynamic quantization
- A16W8 Float8 weight-only quantization
- A8W8 Int8 dynamic quantization
- A16W8 Int8 weight-only quantization
- A16W4 Int4 weight-only quantization
- A16W4 Int4 weight-only quantization + 2:4 sparsity
- Autoquantization
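As a quick orientation, the sketch below maps these techniques to the torchao config classes used in the examples on this page (a non-exhaustive summary, assuming torchao >= 0.10.0).
from transformers import TorchAoConfig
from torchao.quantization import (
    Float8DynamicActivationFloat8WeightConfig,  # Float8 dynamic quantization
    Float8WeightOnlyConfig,                     # Float8 weight-only quantization
    Int8DynamicActivationInt8WeightConfig,      # Int8 dynamic quantization
    Int8WeightOnlyConfig,                       # Int8 weight-only quantization
    Int4WeightOnlyConfig,                       # Int4 weight-only quantization
)
# Every config plugs into TorchAoConfig the same way, for example:
quantization_config = TorchAoConfig(quant_type=Int8WeightOnlyConfig())
# Int4 weight-only + 2:4 sparsity uses Int4WeightOnlyConfig(layout=MarlinSparseLayout()),
# and autoquantization is requested with TorchAoConfig("autoquant") -- see the sections below.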
torchao also supports module-level configuration through a dictionary that maps a module's fully qualified name to its quantization config. This lets you skip quantization for certain layers and use different quantization configs for different modules (see the per-module quantization examples below).
Check the table below to see if your hardware is compatible.
Component | Compatibility |
---|---|
CUDA versions | ✅ cu118, cu126, cu128 |
CPU | ✅ set device_map="cpu" (see the examples below) |
Install torchao from PyPI or the PyTorch index with the following commands.
# Updating 🤗 Transformers to the latest version, as the example script below uses the new auto compilation
# Stable release from PyPI, which will default to CUDA 12.6
pip install --upgrade torchao transformers
If your torchao version is below 0.10.0, you need to upgrade it; refer to the deprecation notice below for more details.
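A quick way to confirm the installed versions (a minimal sketch, assuming both packages are installed in the current environment):
from importlib.metadata import version
# torchao >= 0.10.0 is required for the AOBaseConfig-based API used in the examples below
print("torchao:", version("torchao"))
print("transformers:", version("transformers"))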
Quantization examples
TorchAO provides a wide variety of quantization configurations. Each configuration can be further customized with parameters such as group_size, scheme, and layout to optimize for specific hardware and model architectures. For a full list of available configurations, see the quantization API documentation.
You can manually choose the quantization type and settings, or have them chosen automatically.
Create a TorchAoConfig and specify the quantization type and the group_size of the weights to quantize (for int8 weight-only and int4 weight-only quantization). Set cache_implementation to "static" to automatically torch.compile the forward method.
The examples below show the recommended quantization methods for different hardware (for example, A100 GPU, H100 GPU, CPU).
H100 GPU
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, Float8WeightOnlyConfig
quant_config = Float8DynamicActivationFloat8WeightConfig()
# or float8 weight only quantization
# quant_config = Float8WeightOnlyConfig()
quantization_config = TorchAoConfig(quant_type=quant_config)
# Load and quantize the model
quantized_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
torch_dtype="auto",
device_map="auto",
quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int4WeightOnlyConfig
from torchao.dtypes import MarlinSparseLayout
quant_config = Int4WeightOnlyConfig(layout=MarlinSparseLayout())
quantization_config = TorchAoConfig(quant_type=quant_config)
# Load and quantize the model with sparsity. A sparse checkpoint is needed to accelerate without accuracy loss
quantized_model = AutoModelForCausalLM.from_pretrained(
"RedHatAI/Sparse-Llama-3.1-8B-2of4",
torch_dtype=torch.float16,
device_map="cuda",
quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained("RedHatAI/Sparse-Llama-3.1-8B-2of4")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
A100 GPU
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int8DynamicActivationInt8WeightConfig, Int8WeightOnlyConfig
quant_config = Int8DynamicActivationInt8WeightConfig()
# or int8 weight only quantization
# quant_config = Int8WeightOnlyConfig()
quantization_config = TorchAoConfig(quant_type=quant_config)
# Load and quantize the model
quantized_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
torch_dtype="auto",
device_map="auto",
quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int4WeightOnlyConfig
from torchao.dtypes import MarlinSparseLayout
quant_config = Int4WeightOnlyConfig(layout=MarlinSparseLayout())
quantization_config = TorchAoConfig(quant_type=quant_config)
# Load and quantize the model with sparsity. A sparse checkpoint is needed to accelerate without accuracy loss
quantized_model = AutoModelForCausalLM.from_pretrained(
"RedHatAI/Sparse-Llama-3.1-8B-2of4",
torch_dtype=torch.float16,
device_map="cuda",
quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained("RedHatAI/Sparse-Llama-3.1-8B-2of4")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
CPU
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int8DynamicActivationInt8WeightConfig, Int8WeightOnlyConfig
quant_config = Int8DynamicActivationInt8WeightConfig()
# quant_config = Int8WeightOnlyConfig()
quantization_config = TorchAoConfig(quant_type=quant_config)
# Load and quantize the model
quantized_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
torch_dtype="auto",
device_map="cpu",
quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt")
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
print(tokenizer.decode(output[0], skip_special_tokens=True))
Per-module quantization
1. Skipping quantization for certain layers
With ModuleFqnToConfig, we can specify a default configuration for all layers while skipping quantization for certain layers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
model_id = "meta-llama/Llama-3.1-8B-Instruct"
from torchao.quantization import Int4WeightOnlyConfig, ModuleFqnToConfig
config = Int4WeightOnlyConfig(group_size=128)
# set default to int4 (for linears), and skip quantizing `model.layers.0.self_attn.q_proj`
quant_config = ModuleFqnToConfig({"_default": config, "model.layers.0.self_attn.q_proj": None})
quantization_config = TorchAoConfig(quant_type=quant_config)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
# lm_head is not quantized and model.layers.0.self_attn.q_proj is not quantized
print("quantized model:", quantized_model)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Manual Testing
prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
2. Quantizing different layers with different quantization configs
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
model_id = "facebook/opt-125m"
from torchao.quantization import Int4WeightOnlyConfig, ModuleFqnToConfig, Int8DynamicActivationInt4WeightConfig, IntxWeightOnlyConfig, PerAxis, MappingType
weight_dtype = torch.int8
granularity = PerAxis(0)
mapping_type = MappingType.ASYMMETRIC
embedding_config = IntxWeightOnlyConfig(
weight_dtype=weight_dtype,
granularity=granularity,
mapping_type=mapping_type,
)
linear_config = Int8DynamicActivationInt4WeightConfig(group_size=128)
quant_config = ModuleFqnToConfig({"_default": linear_config, "model.decoder.embed_tokens": embedding_config, "model.decoder.embed_positions": None})
# set `include_embedding` to True in order to include embedding in quantization
# when `include_embedding` is True, we'll remove input embedding from `modules_not_to_convert` as well
quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True)
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cpu", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
print("quantized model:", quantized_model)
# make sure embedding is quantized
print("embed_tokens weight:", quantized_model.model.decoder.embed_tokens.weight)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Manual Testing
prompt = "Hey, are you conscious? Can you talk to me?"
inputs = tokenizer(prompt, return_tensors="pt").to("cpu")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128, cache_implementation="static")
output_text = tokenizer.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Autoquant
If you want to automatically choose the quantization type for quantizable layers (nn.Linear), you can use the autoquant API. The autoquant API chooses a quantization type by micro-benchmarking the input types and shapes and compiling each individual linear layer. Note: autoquant is currently only supported on GPUs.
Create a TorchAoConfig and set it to "autoquant". Set cache_implementation to "static" to automatically torch.compile the forward method. Finally, call finalize_autoquant on the quantized model to finalize the quantization and log the input shapes.
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
quantization_config = TorchAoConfig("autoquant", min_sqnr=None)
quantized_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
torch_dtype="auto",
device_map="auto",
quantization_config=quantization_config
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
# explicitly call `finalize_autoquant` (may be refactored and removed in the future)
quantized_model.finalize_autoquant()
print(tokenizer.decode(output[0], skip_special_tokens=True))
Serialization
torchao implements torch.Tensor subclasses for maximum flexibility in supporting new quantized torch.Tensor formats. Safetensors serialization and deserialization does not work with torchao.
To avoid arbitrary user code execution, torchao sets weights_only to True in torch.load to ensure only tensors are loaded. Any known user functions can be allowlisted with add_safe_globals.
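If a checkpoint references a function from your own code, you can allowlist it before loading. The sketch below uses a hypothetical my_custom_fn as a stand-in for such a function.
import torch

def my_custom_fn():  # hypothetical user function referenced by a checkpoint
    pass

# Allowlist it so torch.load(..., weights_only=True) can resolve the reference
torch.serialization.add_safe_globals([my_custom_fn])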
# don't serialize model with Safetensors
output_dir = "llama3-8b-int4wo-128"
quantized_model.save_pretrained(output_dir, safe_serialization=False)
Loading quantized models
Loading a quantized model depends on the quantization scheme. For schemes like int8 and float8, you can quantize the model on any device and load it on any device. The example below quantizes a model on the CPU and then loads it on CUDA.
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int8WeightOnlyConfig
quant_config = Int8WeightOnlyConfig(group_size=128)
quantization_config = TorchAoConfig(quant_type=quant_config)
# Load and quantize the model
quantized_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
torch_dtype="auto",
device_map="cpu",
quantization_config=quantization_config
)
# save the quantized model
output_dir = "llama-3.1-8b-torchao-int8-cuda"
quantized_model.save_pretrained(output_dir, safe_serialization=False)
# reload the quantized model
reloaded_model = AutoModelForCausalLM.from_pretrained(
output_dir,
device_map="auto",
torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
output = reloaded_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
For int4, the model can only be loaded on the same device it was quantized on, because the layout is device-specific. The example below quantizes and loads the model on the CPU.
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int4WeightOnlyConfig
from torchao.dtypes import Int4CPULayout
quant_config = Int4WeightOnlyConfig(group_size=128, layout=Int4CPULayout())
quantization_config = TorchAoConfig(quant_type=quant_config)
# Load and quantize the model
quantized_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.1-8B-Instruct",
torch_dtype="auto",
device_map="cpu",
quantization_config=quantization_config
)
# save the quantized model
output_dir = "llama-3.1-8b-torchao-int4-cpu"
quantized_model.save_pretrained(output_dir, safe_serialization=False)
# reload the quantized model
reloaded_model = AutoModelForCausalLM.from_pretrained(
output_dir,
device_map="cpu",
torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt")
output = reloaded_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
⚠️ Deprecation notice
Starting with version 0.10.0, the string-based quantization configuration API (for example, TorchAoConfig("int4_weight_only", group_size=128)) is **deprecated** and will be removed in a future release. Use the new AOBaseConfig-based approach instead:
# Old way (deprecated)
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)

# New way (recommended)
from torchao.quantization import Int4WeightOnlyConfig
quant_config = Int4WeightOnlyConfig(group_size=128)
quantization_config = TorchAoConfig(quant_type=quant_config)
The new API offers greater flexibility, better type safety, and full access to the features available in torchao.
Here is how to migrate from the commonly used string identifiers to their AOBaseConfig equivalents:
Old string API | New AOBaseConfig API |
---|---|
"int4_weight_only" | Int4WeightOnlyConfig() |
"int8_weight_only" | Int8WeightOnlyConfig() |
"int8_dynamic_activation_int8_weight" | Int8DynamicActivationInt8WeightConfig() |
All configuration objects accept parameters for customization (for example, group_size, scheme, layout).
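For instance, a minimal sketch of a customized config (group_size=64 is an illustrative choice; check the torchao quantization API docs for the parameters each config accepts):
from torchao.quantization import Int4WeightOnlyConfig
from transformers import TorchAoConfig

# group_size controls the granularity of the int4 weight quantization
quant_config = Int4WeightOnlyConfig(group_size=64)
quantization_config = TorchAoConfig(quant_type=quant_config)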
Resources
For a better sense of expected performance, view the benchmarks for various models on the CUDA and XPU backends. You can also run the code below to benchmark a model yourself.
from torch._inductor.utils import do_bench_using_profiling
from typing import Callable
def benchmark_fn(func: Callable, *args, **kwargs) -> float:
"""Thin wrapper around do_bench_using_profiling"""
no_args = lambda: func(*args, **kwargs)
time = do_bench_using_profiling(no_args)
return time * 1e3
# `quantized_model`, `input_ids`, and `model_name` are assumed to be defined by the examples above
MAX_NEW_TOKENS = 1000
print("int4wo-128 model:", benchmark_fn(quantized_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS, cache_implementation="static"))
bf16_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
output = bf16_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static") # auto-compile
print("bf16 model:", benchmark_fn(bf16_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS, cache_implementation="static"))
For best performance, you can call torchao.quantization.utils.recommended_inductor_config_setter() to apply the recommended settings.
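A minimal sketch of applying it before running inference (assumes the function is exposed at this path in your torchao version, as referenced above):
from torchao.quantization.utils import recommended_inductor_config_setter

# Applies the torch._inductor settings that torchao recommends for quantized inference
recommended_inductor_config_setter()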
Refer to Other available quantization techniques for more examples and documentation.
Issues
If you run into any issues with the Transformers integration, please open an issue in the Transformers repository. For issues directly related to torchao, please open an issue in the torchao repository.