Diffusers


Compiling and offloading quantized models

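Optimizing a model usually means trading inference speed against memory usage. Caching, for example, speeds up inference but also increases memory consumption, because it has to store the outputs of intermediate attention layers. A more balanced strategy combines model quantization, torch.compile, and one of several offloading methods.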

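For image generation, combining quantization with model offloading often gives the best trade-off between quality, speed, and memory. Group offloading is less effective for image generation: when the compute kernels finish quickly, the data transfer usually cannot be fully overlapped with computation, which leaves some communication overhead between the CPU and GPU.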

For video generation, combining quantization with group offloading tends to work better, because video models are more compute-bound.
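The overlap argument above can be made concrete with a toy timing model. This is only a sketch under stated assumptions: the function name `exposed_transfer_ms` and all timings are hypothetical, and real prefetching with CUDA streams is more involved. It assumes the transfer of the next group starts when the current group begins computing, so only the part of a transfer that outlasts the compute step is visible as a stall.

```python
def exposed_transfer_ms(compute_ms, transfer_ms):
    """Return the transfer time (ms) that is NOT hidden behind computation.

    Toy model: while group i computes, group i+1 is copied to the GPU.
    The first transfer has nothing to hide behind, so it is always exposed.
    """
    exposed = transfer_ms[0]
    for prev_compute, next_transfer in zip(compute_ms, transfer_ms[1:]):
        # A stall occurs only if the copy takes longer than the compute step.
        exposed += max(0, next_transfer - prev_compute)
    return exposed


# Hypothetical per-group timings in milliseconds (not measured values).
transfers = [5, 5, 5, 5]
fast_kernels = exposed_transfer_ms([2, 2, 2, 2], transfers)  # kernels finish early
slow_kernels = exposed_transfer_ms([8, 8, 8, 8], transfers)  # compute-bound

print(f"exposed transfer with fast kernels: {fast_kernels} ms")
print(f"exposed transfer with slow kernels: {slow_kernels} ms")
```

In this model the compute-bound case hides every transfer except the first, while the case with fast kernels stalls on each group. That is the same reason group offloading suits compute-bound video models better than image models.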

The table below compares combinations of optimization strategies and their effect on latency and memory usage for Flux.

Combination	Latency (s)	Memory usage (GB)
quantization	32.602	14.9453
quantization, torch.compile	25.847	14.9448
quantization, torch.compile, model CPU offloading	32.312	12.2369

These results were benchmarked on Flux with an RTX 4090. The transformer and text_encoder components were quantized. If you want to evaluate your own model, refer to the benchmarking script.
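To read the benchmark numbers as relative changes, the short sketch below computes each combination's latency and memory usage against the quantization-only baseline. The values are copied from the table; the helper name `relative_change` is an illustrative choice.

```python
# (latency in seconds, memory usage in GB), copied from the benchmark table.
results = {
    "quantization": (32.602, 14.9453),
    "quantization, torch.compile": (25.847, 14.9448),
    "quantization, torch.compile, model CPU offloading": (32.312, 12.2369),
}


def relative_change(value, baseline):
    """Percentage change of value relative to baseline (negative = lower)."""
    return (value - baseline) / baseline * 100


base_latency, base_memory = results["quantization"]

changes = {}
for name, (latency, memory) in results.items():
    changes[name] = (
        relative_change(latency, base_latency),
        relative_change(memory, base_memory),
    )
    print(f"{name}: latency {changes[name][0]:+.1f}%, memory {changes[name][1]:+.1f}%")
```

Against the baseline, torch.compile cuts latency by roughly 21% with essentially unchanged memory. Adding model CPU offloading gives back almost all of that speedup in exchange for roughly 18% less memory.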

This guide shows how to compile and offload a quantized model with bitsandbytes. Make sure you are using PyTorch nightly and the latest version of bitsandbytes.

pip install -U bitsandbytes

Quantization and torch.compile

Start by quantizing a model to reduce the memory needed to store it, then compile it to speed up inference.

Set Dynamo's `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models.

import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

torch._dynamo.config.capture_dynamic_output_shape_ops = True

# quantize the transformer and the second text encoder to 4-bit NF4 with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder_2"],
)
pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
).to("cuda")

# compile the transformer (the first call is slow because it triggers compilation)
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.transformer.compile(mode="max-autotune", fullgraph=True)
pipeline("""
    cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California
    highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
).images[0]

Quantization, torch.compile, and offloading

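If you need to reduce memory usage further after quantization and torch.compile, try offloading. Offloading moves layers or whole model components from the CPU to the GPU only when they are needed for computation.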

When offloading, raise Dynamo's `cache_size_limit` to avoid excessive recompilation, and set `capture_dynamic_output_shape_ops = True` to handle dynamic outputs when compiling bitsandbytes models.

Model CPU offloading
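A minimal sketch of this combination, assuming the same Flux pipeline and quantization settings as the example in "Quantization and torch.compile". The function name `build_cpu_offloaded_pipeline` and the `cache_size_limit` value of 1000 are illustrative assumptions, not recommendations from this guide. Running it requires a CUDA GPU and downloads the model weights.

```python
# Illustrative value (an assumption): large enough to avoid hitting the
# recompilation limit while components are moved between devices.
DYNAMO_CACHE_SIZE_LIMIT = 1000


def build_cpu_offloaded_pipeline(model_id="black-forest-labs/FLUX.1-dev"):
    """Load a 4-bit quantized pipeline with model CPU offloading and a compiled transformer."""
    # Heavy dependencies are imported lazily so that defining this function
    # needs neither a GPU nor the libraries themselves.
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.quantizers import PipelineQuantizationConfig

    # Dynamo settings described in the text above.
    torch._dynamo.config.cache_size_limit = DYNAMO_CACHE_SIZE_LIMIT
    torch._dynamo.config.capture_dynamic_output_shape_ops = True

    # Same quantization settings as the earlier example.
    quant_config = PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit",
        quant_kwargs={
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_compute_dtype": torch.bfloat16,
        },
        components_to_quantize=["transformer", "text_encoder_2"],
    )
    pipeline = DiffusionPipeline.from_pretrained(
        model_id,
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
    )

    # Offloading manages device placement, so this sketch does not call .to("cuda").
    pipeline.enable_model_cpu_offload()

    # Compile after offloading is configured.
    pipeline.transformer.compile()
    return pipeline
```

A typical use would be `pipeline = build_cpu_offloaded_pipeline()` followed by `pipeline(prompt).images[0]`. The first call is slow because it includes compilation.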

Group offloading
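A minimal sketch of group offloading applied to a quantized, compiled transformer. It reuses the Flux model ID only to stay consistent with the earlier example, although the text above notes that group offloading tends to pay off more for video models. The function name, the `cache_size_limit` value, and the choice of `offload_type="leaf_level"` with `use_stream=True` are assumptions for illustration. Running it requires a CUDA GPU and downloads the model weights.

```python
# Illustrative value (an assumption), not a recommendation from this guide.
GROUP_OFFLOAD_CACHE_SIZE_LIMIT = 1000


def build_group_offloaded_pipeline(model_id="black-forest-labs/FLUX.1-dev"):
    """Load a pipeline whose 4-bit transformer uses group offloading and is compiled."""
    # Lazy imports: defining this function needs neither a GPU nor the libraries.
    import torch
    from diffusers import DiffusionPipeline
    from diffusers.quantizers import PipelineQuantizationConfig

    torch._dynamo.config.cache_size_limit = GROUP_OFFLOAD_CACHE_SIZE_LIMIT
    torch._dynamo.config.capture_dynamic_output_shape_ops = True

    # Only the transformer is quantized in this sketch.
    quant_config = PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit",
        quant_kwargs={
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_compute_dtype": torch.bfloat16,
        },
        components_to_quantize=["transformer"],
    )
    pipeline = DiffusionPipeline.from_pretrained(
        model_id,
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
    )

    # Move groups of transformer layers to the GPU just before they are used.
    # use_stream=True prefetches the next group on a separate CUDA stream so
    # the copy can overlap with computation.
    pipeline.transformer.enable_group_offload(
        onload_device=torch.device("cuda"),
        offload_device=torch.device("cpu"),
        offload_type="leaf_level",
        use_stream=True,
    )

    # The other components (VAE, text encoders) still need a device placement
    # or their own offloading; that step is left out of this sketch.

    pipeline.transformer.compile()
    return pipeline
```

Leaf-level offloading keeps the fewest layers on the GPU at once, so it depends most on stream prefetching to hide the transfers.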
