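Reduce memory usage

Modern diffusion models like Flux and Wan have billions of parameters and take up a lot of memory during inference. This is a challenge because common GPUs often don't have enough memory. To overcome the memory limitations, you can use multiple GPUs (if available), offload some of the pipeline components to the CPU, and more.

This guide shows you how to reduce memory usage.

Keep in mind that these techniques may need to be adjusted depending on the model. For example, a transformer-based diffusion model may not benefit from these memory optimizations as much as a UNet-based model.

Multi-GPU

If you have access to more than one GPU, there are a few options for efficiently loading and distributing a large model across your hardware. These features are supported by the Accelerate library, so make sure it is installed first.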

pip install -U accelerate

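Sharded checkpoints

Loading a large checkpoint in several shards is useful because the shards are loaded one at a time. This keeps memory usage low, requiring only enough memory for the model size plus the largest shard size. We recommend sharding when the fp32 checkpoint is larger than 5GB. The default shard size is 5GB.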
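As a rough illustration of the memory claim above, the sketch below greedily packs tensors into shards and estimates peak loading memory as the full model size plus the largest shard. The tensor sizes and the `plan_shards` helper are hypothetical and are not part of Diffusers; the library's actual sharding logic may differ in detail.

```python
def plan_shards(tensor_sizes_gb, max_shard_size_gb=5.0):
    """Greedily pack tensors, in order, into shards no larger than the limit.

    A single tensor larger than the limit ends up in a shard of its own.
    """
    shards, current, current_size = [], [], 0.0
    for size in tensor_sizes_gb:
        # Start a new shard when adding this tensor would exceed the limit.
        if current and current_size + size > max_shard_size_gb:
            shards.append(current)
            current, current_size = [], 0.0
        current.append(size)
        current_size += size
    if current:
        shards.append(current)
    return shards


# Hypothetical per-tensor sizes (GB) for a 10 GB checkpoint.
tensor_sizes_gb = [2.0, 2.0, 1.5, 3.0, 1.0, 0.5]

shards = plan_shards(tensor_sizes_gb, max_shard_size_gb=5.0)
shard_sizes = [sum(shard) for shard in shards]
model_size = sum(tensor_sizes_gb)

# Peak memory while loading: the model itself plus the shard currently being read.
peak = model_size + max(shard_sizes)

print(f"shard sizes (GB): {shard_sizes}")
print(f"estimated peak while loading (GB): {peak}")
```

Under this simplified model, a smaller `max_shard_size` lowers the loading peak at the cost of producing more files.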

Shard a checkpoint in save_pretrained() with the max_shard_size parameter.

from diffusers import AutoModel

unet = AutoModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
unet.save_pretrained("sdxl-unet-sharded", max_shard_size="5GB")

Now you can use the sharded checkpoint, instead of the regular checkpoint, to save memory.

import torch
from diffusers import AutoModel, StableDiffusionXLPipeline

unet = AutoModel.from_pretrained(
    "username/sdxl-unet-sharded", torch_dtype=torch.float16
)
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    unet=unet,
    torch_dtype=torch.float16
).to("cuda")

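Device placement

Device placement is an experimental feature and the API may change. Only the balanced strategy is supported at the moment. We plan to support additional mapping strategies in the future.

The device_map parameter controls how the model components in a pipeline, or the layers in an individual model, are distributed across devices.

Pipeline level

Model level

When designing your own device_map, it should be a dictionary that maps a model's specific module names or layers to a device identifier (an integer for a GPU, cpu for the CPU, and disk for disk).

Call hf_device_map on a model to see how its layers are distributed, and then design your own.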

print(transformer.hf_device_map)
{'pos_embed': 0, 'time_text_embed': 0, 'context_embedder': 0, 'x_embedder': 0, 'transformer_blocks': 0, 'single_transformer_blocks.0': 0, 'single_transformer_blocks.1': 0, 'single_transformer_blocks.2': 0, 'single_transformer_blocks.3': 0, 'single_transformer_blocks.4': 0, 'single_transformer_blocks.5': 0, 'single_transformer_blocks.6': 0, 'single_transformer_blocks.7': 0, 'single_transformer_blocks.8': 0, 'single_transformer_blocks.9': 0, 'single_transformer_blocks.10': 'cpu', 'single_transformer_blocks.11': 'cpu', 'single_transformer_blocks.12': 'cpu', 'single_transformer_blocks.13': 'cpu', 'single_transformer_blocks.14': 'cpu', 'single_transformer_blocks.15': 'cpu', 'single_transformer_blocks.16': 'cpu', 'single_transformer_blocks.17': 'cpu', 'single_transformer_blocks.18': 'cpu', 'single_transformer_blocks.19': 'cpu', 'single_transformer_blocks.20': 'cpu', 'single_transformer_blocks.21': 'cpu', 'single_transformer_blocks.22': 'cpu', 'single_transformer_blocks.23': 'cpu', 'single_transformer_blocks.24': 'cpu', 'single_transformer_blocks.25': 'cpu', 'single_transformer_blocks.26': 'cpu', 'single_transformer_blocks.27': 'cpu', 'single_transformer_blocks.28': 'cpu', 'single_transformer_blocks.29': 'cpu', 'single_transformer_blocks.30': 'cpu', 'single_transformer_blocks.31': 'cpu', 'single_transformer_blocks.32': 'cpu', 'single_transformer_blocks.33': 'cpu', 'single_transformer_blocks.34': 'cpu', 'single_transformer_blocks.35': 'cpu', 'single_transformer_blocks.36': 'cpu', 'single_transformer_blocks.37': 'cpu', 'norm_out': 'cpu', 'proj_out': 'cpu'}

For example, the device_map below places single_transformer_blocks.10 through single_transformer_blocks.20 on a second GPU (1).

import torch
from diffusers import AutoModel

device_map = {
    'pos_embed': 0, 'time_text_embed': 0, 'context_embedder': 0, 'x_embedder': 0, 'transformer_blocks': 0, 'single_transformer_blocks.0': 0, 'single_transformer_blocks.1': 0, 'single_transformer_blocks.2': 0, 'single_transformer_blocks.3': 0, 'single_transformer_blocks.4': 0, 'single_transformer_blocks.5': 0, 'single_transformer_blocks.6': 0, 'single_transformer_blocks.7': 0, 'single_transformer_blocks.8': 0, 'single_transformer_blocks.9': 0, 'single_transformer_blocks.10': 1, 'single_transformer_blocks.11': 1, 'single_transformer_blocks.12': 1, 'single_transformer_blocks.13': 1, 'single_transformer_blocks.14': 1, 'single_transformer_blocks.15': 1, 'single_transformer_blocks.16': 1, 'single_transformer_blocks.17': 1, 'single_transformer_blocks.18': 1, 'single_transformer_blocks.19': 1, 'single_transformer_blocks.20': 1, 'single_transformer_blocks.21': 'cpu', 'single_transformer_blocks.22': 'cpu', 'single_transformer_blocks.23': 'cpu', 'single_transformer_blocks.24': 'cpu', 'single_transformer_blocks.25': 'cpu', 'single_transformer_blocks.26': 'cpu', 'single_transformer_blocks.27': 'cpu', 'single_transformer_blocks.28': 'cpu', 'single_transformer_blocks.29': 'cpu', 'single_transformer_blocks.30': 'cpu', 'single_transformer_blocks.31': 'cpu', 'single_transformer_blocks.32': 'cpu', 'single_transformer_blocks.33': 'cpu', 'single_transformer_blocks.34': 'cpu', 'single_transformer_blocks.35': 'cpu', 'single_transformer_blocks.36': 'cpu', 'single_transformer_blocks.37': 'cpu', 'norm_out': 'cpu', 'proj_out': 'cpu'
}

transformer = AutoModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev", 
    subfolder="transformer",
    device_map=device_map,
    torch_dtype=torch.bfloat16
)
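Writing a long device_map by hand is error-prone, so it can help to generate it. The sketch below rebuilds the same mapping as the example above with a loop; `build_device_map` and its parameters are hypothetical helper names, not a Diffusers API, and the module names are taken from the hf_device_map output shown earlier.

```python
def build_device_map(num_single_blocks=38, gpu0_until=10, gpu1_until=21):
    """Build a device_map for the transformer shown above.

    Blocks [0, gpu0_until) go to GPU 0, blocks [gpu0_until, gpu1_until)
    go to GPU 1, and the remaining blocks stay on the CPU.
    """
    device_map = {
        "pos_embed": 0,
        "time_text_embed": 0,
        "context_embedder": 0,
        "x_embedder": 0,
        "transformer_blocks": 0,
    }
    for i in range(num_single_blocks):
        if i < gpu0_until:
            device = 0
        elif i < gpu1_until:
            device = 1
        else:
            device = "cpu"
        device_map[f"single_transformer_blocks.{i}"] = device
    # The output layers stay on the CPU, as in the hand-written map.
    device_map["norm_out"] = "cpu"
    device_map["proj_out"] = "cpu"
    return device_map


device_map = build_device_map()
print(device_map["single_transformer_blocks.10"])
print(device_map["single_transformer_blocks.21"])
```

The resulting dictionary can be passed as device_map to from_pretrained() in the same way as the hand-written one.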

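Pass a dictionary mapping each device to its maximum memory usage to enforce a limit. If a device is not in max_memory, it is ignored and pipeline components won't be distributed to it.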

import torch
from diffusers import AutoModel, StableDiffusionXLPipeline

max_memory = {0:"1GB", 1:"1GB"}
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    device_map="balanced",
    max_memory=max_memory
)

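By default, Diffusers uses the maximum memory of all devices, but if the models don't fit on the GPUs, you'll need to use a single GPU and offload to the CPU with the methods below.

enable_model_cpu_offload() only works on a single GPU, but a very large model may not fit on it
enable_sequential_cpu_offload() may work, but it is extremely slow and also limited to a single GPU

Use the reset_device_map() method to reset the device_map. This is necessary if you want to use methods like .to(), enable_sequential_cpu_offload(), and enable_model_cpu_offload() on a pipeline that was device-mapped.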

pipeline.reset_device_map()

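VAE slicing

VAE slicing saves memory by splitting a large batch of inputs into single-image batches and processing them separately. This method works best when generating more than one image at a time.

For example, if you're generating 4 images at once, decoding increases peak activation memory by 4x. VAE slicing reduces this by decoding only 1 image at a time instead of all 4 at once.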
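The idea can be sketched without a real VAE. In the toy code below, `decode` and `toy_decode_fn` are hypothetical stand-ins (not Diffusers functions), and the largest chunk passed to the decoder at once is used as a proxy for peak activation memory.

```python
def decode(latents, decode_fn, slice_size=None):
    """Decode all latents at once, or in slices of `slice_size` items.

    Returns the decoded items and the largest chunk that was decoded
    in one call, a proxy for peak activation memory.
    """
    if slice_size is None:
        slice_size = max(len(latents), 1)
    images, peak_batch = [], 0
    for start in range(0, len(latents), slice_size):
        chunk = latents[start:start + slice_size]
        peak_batch = max(peak_batch, len(chunk))
        images.extend(decode_fn(chunk))
    return images, peak_batch


def toy_decode_fn(chunk):
    # Stand-in for a VAE decoder: any per-item transformation works here.
    return [x * 2 for x in chunk]


latents = [1, 2, 3, 4]
full_images, full_peak = decode(latents, toy_decode_fn)
sliced_images, sliced_peak = decode(latents, toy_decode_fn, slice_size=1)

print(full_peak, sliced_peak)
```

Both paths produce the same outputs; only the amount of work held in flight at any one time changes.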

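Call enable_vae_slicing() to enable sliced VAE. You can expect a small performance boost when decoding multi-image batches, and there should be no performance impact for single-image batches.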

import torch
from diffusers import AutoModel, StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
pipeline.enable_vae_slicing()
pipeline(["An astronaut riding a horse on Mars"]*32).images[0]
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

The AutoencoderKLWan and AsymmetricAutoencoderKL classes don't support slicing.

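VAE tiling

VAE tiling saves memory by dividing an image into smaller overlapping tiles instead of processing the entire image at once. This also reduces peak memory usage because the GPU only processes one tile at a time.

Call enable_vae_tiling() to enable VAE tiling. The generated image may have some tone variation from tile to tile because the tiles are decoded separately, but there shouldn't be any sharp, obvious seams between them. Tiling is disabled for resolutions below a preset (but configurable) limit. For example, this limit is 512x512 for the VAE in StableDiffusionPipeline.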
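To make the notion of overlapping tiles concrete, here is a minimal sketch that computes tile origins for an image. The `tile_starts` helper, the tile size of 512, and the overlap of 128 are assumptions for illustration only; this is not the tiling or blending logic Diffusers uses.

```python
def tile_starts(length, tile_size, overlap):
    """Start offsets of overlapping tiles that cover the range [0, length)."""
    if not 0 <= overlap < tile_size:
        raise ValueError("overlap must be in [0, tile_size)")
    if length <= tile_size:
        # Small inputs fit in a single tile, so no tiling is needed.
        return [0]
    stride = tile_size - overlap
    starts = list(range(0, length - tile_size, stride))
    # Align the last tile with the edge so the whole range is covered.
    starts.append(length - tile_size)
    return starts


height, width = 1024, 1024
tiles = [
    (y, x)
    for y in tile_starts(height, tile_size=512, overlap=128)
    for x in tile_starts(width, tile_size=512, overlap=128)
]

print(tile_starts(1024, 512, 128))
print(len(tiles))
```

Each (y, x) pair is the top-left corner of one tile. Only one tile needs to be on the GPU at a time, and the overlapping regions are where neighboring tiles would be blended.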

import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

pipeline = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipeline.enable_vae_tiling()

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-sdxl-init.png")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
pipeline(prompt, image=init_image, strength=0.5).images[0]
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

AutoencoderKLWan and AsymmetricAutoencoderKL don't support tiling.

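Offloading

Offloading strategies move layers or models that aren't currently active to the CPU to avoid increasing GPU memory. These strategies can be combined with quantization and torch.compile to balance inference speed and memory usage.

Refer to the Compile and offloading quantized models guide for more details.

CPU offloading

CPU offloading selectively moves weights from the GPU to the CPU. When a component is required, it is transferred to the GPU, and when it is no longer required, it is moved back to the CPU. This method works on submodules rather than whole models. It saves memory by avoiding storing the entire model on the GPU.

CPU offloading dramatically reduces memory usage, but it is also very slow because submodules are passed back and forth between devices multiple times. Because of how slow it is, it is often impractical.

Don't move the pipeline to CUDA before calling enable_sequential_cpu_offload(), otherwise the amount of memory saved is only minimal (refer to this issue for more details). This is a stateful operation that installs hooks on the model.

Call enable_sequential_cpu_offload() to enable it on a pipeline.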

import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipeline.enable_sequential_cpu_offload()

pipeline(
    prompt="An astronaut riding a horse on Mars",
    guidance_scale=0.,
    height=768,
    width=1360,
    num_inference_steps=4,
    max_sequence_length=256,
).images[0]
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

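Model offloading

Model offloading moves entire models to the GPU instead of selectively moving some layers or model components. One of the main pipeline models, usually the text encoder, UNet, and VAE, is placed on the GPU while the other components are held on the CPU. Components like the UNet that run multiple times stay on the GPU until they are completely finished and no longer needed. This eliminates the communication overhead of CPU offloading and makes model offloading a faster alternative. The tradeoff is that the memory savings won't be as large.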
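The memory tradeoff between the two strategies can be sketched with back-of-the-envelope arithmetic. The component sizes and submodule counts below are made-up numbers, the helper functions are hypothetical, and the model ignores activations and assumes equally sized submodules, so treat the output as an illustration rather than a measurement.

```python
# Hypothetical component sizes in GB, each split into equally sized submodules.
components = {
    "text_encoder": {"size_gb": 1.5, "num_submodules": 12},
    "unet": {"size_gb": 5.0, "num_submodules": 25},
    "vae": {"size_gb": 0.5, "num_submodules": 5},
}


def peak_no_offload(components):
    # Every component lives on the GPU at the same time.
    return sum(c["size_gb"] for c in components.values())


def peak_model_offload(components):
    # Model offloading: one whole component is on the GPU at a time.
    return max(c["size_gb"] for c in components.values())


def peak_cpu_offload(components):
    # CPU (sequential) offloading: one submodule is on the GPU at a time.
    return max(c["size_gb"] / c["num_submodules"] for c in components.values())


print(f"no offloading:    {peak_no_offload(components):.2f} GB")
print(f"model offloading: {peak_model_offload(components):.2f} GB")
print(f"CPU offloading:   {peak_cpu_offload(components):.2f} GB")
```

In this simplified picture, CPU offloading has the lowest weight footprint but moves many small pieces per step, while model offloading moves each component once and keeps the largest one resident.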

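Keep in mind that if models are reused outside the pipeline after hooks have been installed (see Removing Hooks for more details), you need to run the entire pipeline and models in the expected order to properly offload them. This is a stateful operation that installs hooks on the model.

Call enable_model_cpu_offload() to enable it on a pipeline.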

import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()

pipeline(
    prompt="An astronaut riding a horse on Mars",
    guidance_scale=0.,
    height=768,
    width=1360,
    num_inference_steps=4,
    max_sequence_length=256,
).images[0]
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")

enable_model_cpu_offload() also helps when you use the encode_prompt() method on its own to generate the text encoder hidden states.

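Group offloading

Group offloading moves groups of internal layers (torch.nn.ModuleList or torch.nn.Sequential) to the CPU. It uses less memory than model offloading and it is faster than CPU offloading because it reduces communication overhead.

Group offloading may not work with all models if the forward implementation contains weight-dependent device casting of inputs, because it may clash with group offloading's device casting mechanism.

Call enable_group_offload() to enable it for standard Diffusers model components that inherit from ModelMixin. For other model components that don't inherit from ModelMixin, such as a generic torch.nn.Module, use apply_group_offloading() instead.

The offload_type parameter can be set to block_level or leaf_level.

block_level offloads groups of layers based on the num_blocks_per_group parameter. For example, if num_blocks_per_group=2 on a model with 40 layers, 2 layers are onloaded and offloaded at a time (20 onloads/offloads in total). This drastically reduces memory requirements.
leaf_level offloads individual layers at the lowest level and is equivalent to CPU offloading. However, it can be made faster with streams, without giving up inference speed.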
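The block_level arithmetic above can be checked with a few lines of code. The `group_layers` helper is a hypothetical illustration of how consecutive layers could be grouped; it is not the grouping code used by Diffusers.

```python
def group_layers(num_layers, num_blocks_per_group):
    """Split layer indices into consecutive groups of at most
    `num_blocks_per_group` layers each."""
    return [
        list(range(start, min(start + num_blocks_per_group, num_layers)))
        for start in range(0, num_layers, num_blocks_per_group)
    ]


groups = group_layers(num_layers=40, num_blocks_per_group=2)

print(len(groups))   # number of onload/offload rounds
print(groups[0])     # layers moved together in the first round
print(groups[-1])    # layers moved together in the last round
```

Larger groups mean fewer transfers but more layers resident on the GPU at once; a group size of 1 moves one block at a time.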

import torch
from diffusers import CogVideoXPipeline
from diffusers.hooks import apply_group_offloading
from diffusers.utils import export_to_video

onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Use the enable_group_offload method for Diffusers model implementations
pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level")
pipeline.vae.enable_group_offload(onload_device=onload_device, offload_type="leaf_level")

# Use the apply_group_offloading method for other model components
apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2)

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
export_to_video(video, "output.mp4", fps=8)

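CUDA stream

The use_stream parameter can be enabled on CUDA devices that support asynchronous data transfer streams to reduce overall execution time compared to CPU offloading. It overlaps data transfer and computation by using layer prefetching. The next layer to be executed is loaded onto the GPU while the current layer is still being executed. This can increase CPU memory significantly, so make sure you have 2x the amount of memory as the model size.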
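To see why prefetching helps, consider a simplified timing model in which each layer has a transfer time and a compute time. The functions and numbers below are hypothetical and ignore real-world effects such as stream synchronization costs; they only illustrate how overlapping the next layer's transfer with the current layer's compute shortens the total.

```python
def sequential_time(load_times, compute_times):
    """No overlap: every layer is transferred, then executed."""
    return sum(load_times) + sum(compute_times)


def prefetch_time(load_times, compute_times):
    """Overlap: while layer i computes, layer i + 1 is transferred.

    Each step takes as long as the slower of the two overlapping operations.
    """
    total = load_times[0]  # the first layer cannot be hidden behind compute
    for i, compute in enumerate(compute_times):
        next_load = load_times[i + 1] if i + 1 < len(load_times) else 0.0
        total += max(compute, next_load)
    return total


# Hypothetical per-layer timings in milliseconds for a 4-layer model.
load_times = [2.0, 2.0, 2.0, 2.0]
compute_times = [3.0, 3.0, 3.0, 3.0]

print(sequential_time(load_times, compute_times))
print(prefetch_time(load_times, compute_times))
```

When transfers are faster than compute, as in these numbers, almost all of the transfer time is hidden. When transfers are slower, the total is bounded by the transfers instead.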

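Set record_stream=True for more of a speedup at the cost of slightly increased memory usage. Refer to the torch.Tensor.record_stream docs to learn more.

When use_stream=True is set on VAEs with tiling enabled, make sure to do a dummy forward pass (possibly with dummy inputs) before inference to avoid device mismatch errors. This may not work on all implementations, so feel free to open an issue if you encounter any problems.

If you're using block_level group offloading with use_stream enabled, the num_blocks_per_group parameter should be set to 1, otherwise a warning will be raised.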

pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", use_stream=True, record_stream=True)

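The low_cpu_mem_usage parameter can be set to True to reduce CPU memory usage when using streams during group offloading. It is best suited to leaf_level offloading and to cases where CPU memory is the bottleneck. Memory is saved by creating pinned tensors on the fly instead of pre-pinning them. However, this may increase overall execution time.

Offloading to disk

Group offloading can consume significant system memory depending on the model size. On systems with limited memory, try group offloading to disk as secondary memory.

Set the offload_to_disk_path parameter in either enable_group_offload() or apply_group_offloading() to offload the model to disk.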

pipeline.transformer.enable_group_offload(onload_device=onload_device, offload_device=offload_device, offload_type="leaf_level", offload_to_disk_path="path/to/disk")

apply_group_offloading(pipeline.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2, offload_to_disk_path="path/to/disk")

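Refer to these two tables to compare the speed and memory tradeoffs.

Layerwise casting

Combine layerwise casting with group offloading for even more memory savings.

Layerwise casting stores weights in a smaller data format (for example, torch.float8_e4m3fn or torch.float8_e5m2) to use less memory and upcasts them to a higher precision, like torch.float16 or torch.bfloat16, for computation. Certain layers (normalization and modulation-related weights) are skipped because storing them in fp8 can degrade generation quality.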
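A quick estimate shows where the savings come from: an fp8 weight takes 1 byte while a bf16 or fp16 weight takes 2. The sketch below assumes a hypothetical model with 5 billion parameters, 1% of which belong to skipped layers that stay in bf16. Both numbers and the `weight_memory_gb` helper are illustrative assumptions, not measurements of a specific checkpoint.

```python
# Storage size in bytes for one parameter of each data type.
BYTES_PER_PARAM = {
    "float32": 4,
    "float16": 2,
    "bfloat16": 2,
    "float8_e4m3fn": 1,
    "float8_e5m2": 1,
}


def weight_memory_gb(num_params, storage_dtype, skipped_params=0, skipped_dtype="bfloat16"):
    """Estimate weight memory when most parameters are stored in
    `storage_dtype` and `skipped_params` of them stay in `skipped_dtype`."""
    cast_bytes = (num_params - skipped_params) * BYTES_PER_PARAM[storage_dtype]
    kept_bytes = skipped_params * BYTES_PER_PARAM[skipped_dtype]
    return (cast_bytes + kept_bytes) / 1024**3


num_params = 5_000_000_000   # hypothetical model size
skipped = num_params // 100  # hypothetical share of skipped norm/modulation weights

bf16_gb = weight_memory_gb(num_params, "bfloat16")
fp8_gb = weight_memory_gb(num_params, "float8_e4m3fn", skipped_params=skipped)

print(f"bf16 storage: {bf16_gb:.2f} GB")
print(f"fp8 storage:  {fp8_gb:.2f} GB")
```

The estimate covers weights only. Activations are still computed in compute_dtype, so the overall saving during inference is smaller than the weight saving alone suggests.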

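Layerwise casting may not work with all models if the forward implementation contains internal typecasting of weights. The current implementation of layerwise casting assumes the forward pass is independent of the weight precision and that the input data types are always specified in compute_dtype (see here for an incompatible implementation).

Layerwise casting may also fail on custom modeling implementations with PEFT layers. There are some checks available, but they are not extensively tested or guaranteed to work in all cases.

Call enable_layerwise_casting() to set the storage and computation data types.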

import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b",
    subfolder="transformer",
    torch_dtype=torch.bfloat16
)
transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16)

pipeline = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b",
    transformer=transformer,
    torch_dtype=torch.bfloat16
).to("cuda")
prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
export_to_video(video, "output.mp4", fps=8)

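The apply_layerwise_casting() method can also be used if you need more control and flexibility. It can be partially applied to model layers by calling it on specific internal modules. Use the skip_modules_pattern or skip_modules_classes parameters to specify modules to avoid, such as the normalization and modulation layers.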

import torch
from diffusers import CogVideoXTransformer3DModel
from diffusers.hooks import apply_layerwise_casting

transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b",
    subfolder="transformer",
    torch_dtype=torch.bfloat16
)

# skip the normalization layer
apply_layerwise_casting(
    transformer,
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
    skip_modules_pattern=["norm"],
    non_blocking=True,
)

torch.channels_last

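torch.channels_last flips how tensors are stored from (batch size, channels, height, width) to (batch size, height, width, channels). This aligns the tensors with how the hardware sequentially accesses the tensors stored in memory and avoids skipping around in memory to access the pixel values.

Not all operators currently support the channels-last format, so it may result in worse performance, but it is still worth trying.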

print(pipeline.unet.conv_out.state_dict()["weight"].stride())  # (2880, 9, 3, 1)
pipeline.unet.to(memory_format=torch.channels_last)  # in-place operation
print(
    pipeline.unet.conv_out.state_dict()["weight"].stride()
)  # (2880, 1, 960, 320) having a stride of 1 for the 2nd dimension proves that it works
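The strides printed above can be reproduced by hand, which helps show what the memory format actually changes. The sketch below assumes the conv_out weight has shape (4, 320, 3, 3), which is inferred from the printed strides rather than stated in this guide; the two helper functions are illustrative and not part of PyTorch.

```python
def contiguous_strides(n, c, h, w):
    """Strides (in elements) of an NCHW tensor in the default layout,
    where width is the fastest-changing dimension."""
    return (c * h * w, h * w, w, 1)


def channels_last_strides(n, c, h, w):
    """Strides of the same logical NCHW tensor stored as NHWC,
    where channels are the fastest-changing dimension."""
    return (h * w * c, 1, w * c, c)


shape = (4, 320, 3, 3)  # assumed (out_channels, in_channels, kernel_h, kernel_w)

print(contiguous_strides(*shape))
print(channels_last_strides(*shape))
```

The tensor's shape is the same in both cases. Only the order in which elements are laid out in memory differs, which is why the channel dimension ends up with a stride of 1.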

torch.jit.trace

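torch.jit.trace records the operations a model performs on a sample input and creates a new, optimized representation of the model based on the recorded execution path. During tracing, the model is optimized to reduce overhead from Python and dynamic control flow, and operations are fused together for more efficiency. The returned executable or ScriptFunction can be compiled.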

import time
import torch
from diffusers import StableDiffusionPipeline
import functools

# torch disable grad
torch.set_grad_enabled(False)

# set variables
n_experiments = 2
unet_runs_per_experiment = 50

# load sample inputs
def generate_inputs():
    sample = torch.randn((2, 4, 64, 64), device="cuda", dtype=torch.float16)
    timestep = torch.rand(1, device="cuda", dtype=torch.float16) * 999
    encoder_hidden_states = torch.randn((2, 77, 768), device="cuda", dtype=torch.float16)
    return sample, timestep, encoder_hidden_states


pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")
unet = pipeline.unet
unet.eval()
unet.to(memory_format=torch.channels_last)  # use channels_last memory format
unet.forward = functools.partial(unet.forward, return_dict=False)  # set return_dict=False as default

# warmup
for _ in range(3):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet(*inputs)

# trace
print("tracing..")
unet_traced = torch.jit.trace(unet, inputs)
unet_traced.eval()
print("done tracing")

# warmup and optimize graph
for _ in range(5):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet_traced(*inputs)

# benchmarking
with torch.inference_mode():
    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet_traced(*inputs)
        torch.cuda.synchronize()
        print(f"unet traced inference took {time.time() - start_time:.2f} seconds")
    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet(*inputs)
        torch.cuda.synchronize()
        print(f"unet inference took {time.time() - start_time:.2f} seconds")

# save the model
unet_traced.save("unet_traced.pt")

Replace the pipeline's UNet with the traced version.

import torch
from diffusers import StableDiffusionPipeline
from dataclasses import dataclass

@dataclass
class UNet2DConditionOutput:
    sample: torch.Tensor

pipeline = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")

# use jitted unet
unet_traced = torch.jit.load("unet_traced.pt")

# del pipeline.unet
# wrap the traced UNet so it exposes the attributes and output type the pipeline uses
class TracedUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.in_channels = pipeline.unet.config.in_channels
        self.device = pipeline.unet.device

    def forward(self, latent_model_input, t, encoder_hidden_states):
        sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)

pipeline.unet = TracedUNet()

prompt = "An astronaut riding a horse on Mars"
with torch.inference_mode():
    image = pipeline([prompt] * 1, num_inference_steps=50).images[0]

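Memory-efficient attention

Memory-efficient attention optimizes for both memory usage and inference speed.

The Transformer attention mechanism is memory-intensive, especially for long sequences, so you can try using different and more memory-efficient attention types.

By default, if PyTorch >= 2.0 is installed, scaled dot-product attention (SDPA) is used. You don't need to make any additional changes to your code.

SDPA supports FlashAttention and xFormers as well as a native C++ PyTorch implementation. It automatically selects the most optimal implementation based on your input.

You can explicitly use xFormers with the enable_xformers_memory_efficient_attention() method.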

# pip install xformers
import torch
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
pipeline.enable_xformers_memory_efficient_attention()

Call disable_xformers_memory_efficient_attention() to disable it.

pipeline.disable_xformers_memory_efficient_attention()
