Quantization

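Quantization represents data with fewer bits while trying to preserve as much of the original precision as possible. In practice, this usually means converting a data type to one that encodes the same information in fewer bits. For example, if your model weights are stored as 32-bit floating points and are quantized to 16-bit floating points, the model size is halved, which makes the model easier to store and reduces memory usage. Lower precision can also speed up inference, because calculations on fewer bits generally take less time.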
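The size reduction described above is simple arithmetic, sketched below. The parameter count is an assumed round number chosen only for illustration, and the helper `weight_size_gib` is hypothetical, not part of Diffusers. The estimate covers the weights alone; real quantized checkpoints also store some extra metadata (such as scaling factors), so actual sizes are somewhat larger.

```python
def weight_size_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate storage needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumed parameter count, used only to make the comparison concrete.
num_params = 12_000_000_000

# Each halving of the bit width halves the estimated weight storage.
for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_size_gib(num_params, bits):6.1f} GiB")
```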

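Diffusers supports multiple quantization backends to make large diffusion models like Flux more accessible. This guide shows how to use the PipelineQuantizationConfig class to quantize a pipeline during its initialization from a pretrained or non-quantized checkpoint.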

Pipeline-level quantization

There are two ways to use PipelineQuantizationConfig, depending on how much control you want over the quantization specifications of each model in the pipeline.

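- For basic and simple use cases, you only need to define quant_backend, quant_kwargs, and components_to_quantize.
- For more granular control, provide a quant_mapping that contains the quantization specifications for the individual model components.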

Simple quantization

Initialize PipelineQuantizationConfig with the following parameters.

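- quant_backend specifies which quantization backend to use. Currently supported backends are bitsandbytes_4bit, bitsandbytes_8bit, gguf, quanto, and torchao.
- quant_kwargs contains the quantization arguments specific to the chosen backend.
- components_to_quantize specifies which components of the pipeline to quantize. Typically, you should quantize the most compute-intensive components, such as the transformer. If a pipeline has more than one text encoder, as FluxPipeline does, the text encoder is another component to consider quantizing. The example below quantizes the T5 text encoder in FluxPipeline while leaving the CLIP model unchanged.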

import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)

Pass pipeline_quant_config to from_pretrained() to quantize the pipeline.

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("photo of a cute dog").images[0]
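As a variation on the example above, the sketch below swaps in the bitsandbytes_8bit backend from the list of supported backends. It assumes bitsandbytes is installed and that load_in_8bit is the matching key for quant_kwargs, mirroring load_in_4bit in the 4-bit example. Quantizing only the transformer is an illustrative choice, not a recommendation.

```python
from diffusers.quantizers import PipelineQuantizationConfig

# Assumption: the 8-bit backend takes load_in_8bit, analogous to load_in_4bit above.
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_8bit",
    quant_kwargs={"load_in_8bit": True},
    # Only the transformer is quantized here; other components keep their original precision.
    components_to_quantize=["transformer"],
)
```

The resulting config is passed to from_pretrained() through the quantization_config argument in the same way as the 4-bit config.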

quant_mapping

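The quant_mapping argument provides more flexible options for quantizing each individual component in a pipeline, such as combining different quantization backends.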

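Initialize PipelineQuantizationConfig and pass a quant_mapping to it. The quant_mapping lets you specify the quantization options for each component in the pipeline, such as the transformer and the text encoder.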

The example below uses two quantization backends, QuantoConfig and transformers.BitsAndBytesConfig, for the transformer and the text encoder respectively.

import torch
from diffusers import DiffusionPipeline
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers.quantizers.quantization_config import QuantoConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={
        "transformer": QuantoConfig(weights_dtype="int8"),
        "text_encoder_2": TransformersBitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
        ),
    }
)

There is a separate bitsandbytes backend in Transformers. For components that come from Transformers, you need to import and use transformers.BitsAndBytesConfig. For example, text_encoder_2 in FluxPipeline is a T5EncoderModel from Transformers, so you need to use transformers.BitsAndBytesConfig instead of diffusers.BitsAndBytesConfig.

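If you would rather not manage these different imports, or are not sure where each pipeline component comes from, use the simple quantization method described above.

The example below uses diffusers.BitsAndBytesConfig for the transformer and transformers.BitsAndBytesConfig for text_encoder_2.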

import torch
from diffusers import DiffusionPipeline
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from diffusers.quantizers import PipelineQuantizationConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig

pipeline_quant_config = PipelineQuantizationConfig(
    quant_mapping={
        "transformer": DiffusersBitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
        "text_encoder_2": TransformersBitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
        ),
    }
)

Pass pipeline_quant_config to from_pretrained() to quantize the pipeline.

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("photo of a cute dog").images[0]

Resources

Check out the resources below to learn more about quantization.

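- If you are new to quantization, we recommend the following beginner-friendly courses, created in collaboration with DeepLearning.AI.
    - Quantization Fundamentals with Hugging Face
    - Quantization in Depth
- If you are interested in adding a new quantization method, refer to the Contribute new quantization method guide.
- The Transformers quantization Overview summarizes the pros and cons of the different quantization backends.
- Read the Exploring Quantization Backends in Diffusers blog post for a brief introduction to each quantization backend, how to choose a backend, and how to combine quantization with other memory optimizations.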
