ParaAttention

To apply first block cache to FLUX.1-dev, call `apply_cache_on_pipe` as shown below. 0.08 is the default residual difference threshold for FLUX models.

import time
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

apply_cache_on_pipe(pipe, residual_diff_threshold=0.08)

# Enable memory savings
# pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()

begin = time.time()
image = pipe(
    "A cat holding a sign that says hello world",
    num_inference_steps=28,
).images[0]
end = time.time()
print(f"Time: {end - begin:.2f}s")

print("Saving image to flux.png")
image.save("flux.png")
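To make the role of `residual_diff_threshold` more concrete, here is a minimal sketch of the kind of decision a first block cache makes at each denoising step. It assumes the cache compares the first transformer block's residual with the one from the previous step using a relative mean absolute difference, and reuses the cached result when that value falls below the threshold. The function names and the exact metric are illustrative assumptions, not ParaAttention's actual implementation.

```python
from typing import Optional, Sequence


def relative_diff(prev: Sequence[float], curr: Sequence[float]) -> float:
    """Mean absolute change between two residuals, relative to the previous one."""
    mean_abs_change = sum(abs(c - p) for p, c in zip(prev, curr)) / len(prev)
    mean_abs_prev = sum(abs(p) for p in prev) / len(prev)
    if mean_abs_prev == 0.0:
        # No meaningful reference magnitude: treat as "too different to reuse".
        return float("inf")
    return mean_abs_change / mean_abs_prev


def can_use_cache(
    prev: Optional[Sequence[float]],
    curr: Sequence[float],
    threshold: float = 0.08,  # assumed to play the role of residual_diff_threshold
) -> bool:
    """Return True when the current first-block residual is close enough to the
    previous one that the remaining blocks could, hypothetically, be skipped."""
    if prev is None:  # first step: nothing cached yet
        return False
    return relative_diff(prev, curr) < threshold


if __name__ == "__main__":
    previous = [1.0, -2.0, 3.0]
    current = [1.05, -2.1, 3.1]
    print(f"relative diff: {relative_diff(previous, current):.4f}")
    print("reuse at 0.08:", can_use_cache(previous, current, threshold=0.08))
    print("reuse at 0.03:", can_use_cache(previous, current, threshold=0.03))
```

Under this assumption, a larger threshold lets more steps count as close enough to reuse, which is consistent with the shorter wall times reported for higher rdt values.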

Optimization	Original	FBCache rdt=0.06	FBCache rdt=0.08	FBCache rdt=0.10	FBCache rdt=0.12
Wall time (s)	26.36	21.83	17.01	16.00	13.78

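Compared to the baseline, first block cache (rdt=0.08) reduces the inference time to 17.01 seconds, which is 1.55x faster, with almost no loss in quality.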

import time
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

apply_cache_on_pipe(
    pipe,
    residual_diff_threshold=0.12,  # Use a larger value to make the cache take effect
)

from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only

quantize_(pipe.text_encoder, float8_weight_only())
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
pipe.transformer = torch.compile(
    pipe.transformer, mode="max-autotune-no-cudagraphs",
)

# Enable memory savings
# pipe.enable_model_cpu_offload()
# pipe.enable_sequential_cpu_offload()

for i in range(2):
    begin = time.time()
    image = pipe(
        "A cat holding a sign that says hello world",
        num_inference_steps=28,
    ).images[0]
    end = time.time()
    if i == 0:
        print(f"Warm up time: {end - begin:.2f}s")
    else:
        print(f"Time: {end - begin:.2f}s")

print("Saving image to flux.png")
image.save("flux.png")

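Combined with first block cache (rdt=0.12), fp8 dynamic quantization and torch.compile reduce the inference time to 7.56 seconds, which is 3.48x faster than the baseline.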
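The loop in the script above calls the pipeline twice because the first call through a `torch.compile`-wrapped model includes compilation, so only the second call reflects steady-state speed. A minimal sketch of that pattern as a reusable helper follows; the name `time_with_warmup` is hypothetical, and the lambda is only a stand-in for the `pipe(...)` call.

```python
import time
from typing import Callable, List, Tuple


def time_with_warmup(
    fn: Callable[[], object], warmup: int = 1, runs: int = 1
) -> Tuple[List[float], List[float]]:
    """Call fn `warmup + runs` times.

    Returns (warmup_times, measured_times), both in seconds.
    """
    warmup_times: List[float] = []
    measured_times: List[float] = []
    for i in range(warmup + runs):
        begin = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - begin
        # The first `warmup` calls are recorded separately from the measured ones.
        if i < warmup:
            warmup_times.append(elapsed)
        else:
            measured_times.append(elapsed)
    return warmup_times, measured_times


if __name__ == "__main__":
    # Stand-in workload; in the script above this would be the pipe(...) call.
    warm, timed = time_with_warmup(lambda: sum(range(100_000)), warmup=1, runs=2)
    print(f"Warm up time: {warm[0]:.4f}s")
    print(f"Time: {min(timed):.4f}s")
```

Reporting the warm-up time separately, as the script does, keeps the one-off compilation cost out of the number that is compared against the baseline.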

The following code example combines first block cache, fp8 dynamic quantization, torch.compile, and context parallelism for the fastest inference speed.

import time
import torch
import torch.distributed as dist
from diffusers import FluxPipeline

dist.init_process_group()

torch.cuda.set_device(dist.get_rank())

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
).to("cuda")

from para_attn.context_parallel import init_context_parallel_mesh
from para_attn.context_parallel.diffusers_adapters import parallelize_pipe
from para_attn.parallel_vae.diffusers_adapters import parallelize_vae

mesh = init_context_parallel_mesh(
    pipe.device.type,
    max_ring_dim_size=2,
)
parallelize_pipe(
    pipe,
    mesh=mesh,
)
parallelize_vae(pipe.vae, mesh=mesh._flatten())

from para_attn.first_block_cache.diffusers_adapters import apply_cache_on_pipe

apply_cache_on_pipe(
    pipe,
    residual_diff_threshold=0.12,  # Use a larger value to make the cache take effect
)

from torchao.quantization import quantize_, float8_dynamic_activation_float8_weight, float8_weight_only

quantize_(pipe.text_encoder, float8_weight_only())
quantize_(pipe.transformer, float8_dynamic_activation_float8_weight())
torch._inductor.config.reorder_for_compute_comm_overlap = True
pipe.transformer = torch.compile(
    pipe.transformer, mode="max-autotune-no-cudagraphs",
)

# Enable memory savings
# pipe.enable_model_cpu_offload(gpu_id=dist.get_rank())
# pipe.enable_sequential_cpu_offload(gpu_id=dist.get_rank())

for i in range(2):
    begin = time.time()
    image = pipe(
        "A cat holding a sign that says hello world",
        num_inference_steps=28,
        output_type="pil" if dist.get_rank() == 0 else "pt",
    ).images[0]
    end = time.time()
    if dist.get_rank() == 0:
        if i == 0:
            print(f"Warm up time: {end - begin:.2f}s")
        else:
            print(f"Time: {end - begin:.2f}s")

if dist.get_rank() == 0:
    print("Saving image to flux.png")
    image.save("flux.png")

dist.destroy_process_group()

Save the script as `run_flux.py` and launch it with torchrun.

# Use --nproc_per_node to specify the number of GPUs
torchrun --nproc_per_node=2 run_flux.py
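torchrun starts one process per GPU and tells each process who it is through environment variables such as `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`, which `dist.init_process_group()` reads by default. The sketch below, built around a hypothetical helper `read_torchrun_env`, shows how a script could inspect these values; the fallback defaults for a plain `python` launch are an assumption made for illustration.

```python
import os
from typing import Dict, Mapping, Optional


def read_torchrun_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, int]:
    """Read the process identity that torchrun exports to each worker.

    Falls back to single-process values when the variables are absent
    (an assumption for running the script without torchrun).
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", "0")),              # global process index
        "local_rank": int(env.get("LOCAL_RANK", "0")),  # index within this node
        "world_size": int(env.get("WORLD_SIZE", "1")),  # total number of processes
    }


if __name__ == "__main__":
    info = read_torchrun_env()
    print(f"rank {info['rank']} of {info['world_size']} (local rank {info['local_rank']})")
```

On a single node the global rank and the local rank coincide, which is why selecting the device with `torch.cuda.set_device(dist.get_rank())` works in the script; on a multi-node launch, the local rank would be the value to use for device selection.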

With 2 NVIDIA L20 GPUs, the inference time drops to 8.20 seconds, which is 3.21x faster than the baseline. With 4 L20 GPUs, the inference time is 3.90 seconds, which is 6.75x faster.

GPU type	Number of GPUs	Optimization	Wall time (s)	Speedup
NVIDIA L20	1	Baseline	26.36	1.00x
NVIDIA L20	1	FBCache (rdt=0.08)	17.01	1.55x
NVIDIA L20	1	FP8 DQ	13.40	1.96x
NVIDIA L20	1	FBCache (rdt=0.12) + FP8 DQ	7.56	3.48x
NVIDIA L20	2	FBCache (rdt=0.12) + FP8 DQ + CP	4.92	5.35x
NVIDIA L20	4	FBCache (rdt=0.12) + FP8 DQ + CP	3.90	6.75x
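The speedup column is the baseline wall time divided by the optimized wall time. The short sketch below recomputes it from the times in the table. Because the reported times are rounded to two decimals, the recomputed ratios can differ from the table in the last digit.

```python
BASELINE_SECONDS = 26.36

# (configuration, wall time in seconds, speedup reported in the table)
RESULTS = [
    ("FBCache (rdt=0.08), 1 GPU", 17.01, 1.55),
    ("FP8 DQ, 1 GPU", 13.40, 1.96),
    ("FBCache (rdt=0.12) + FP8 DQ, 1 GPU", 7.56, 3.48),
    ("FBCache (rdt=0.12) + FP8 DQ + CP, 2 GPUs", 4.92, 5.35),
    ("FBCache (rdt=0.12) + FP8 DQ + CP, 4 GPUs", 3.90, 6.75),
]


def speedup(baseline_seconds: float, seconds: float) -> float:
    """How many times faster a run is than the baseline."""
    return baseline_seconds / seconds


if __name__ == "__main__":
    for name, seconds, reported in RESULTS:
        computed = speedup(BASELINE_SECONDS, seconds)
        print(f"{name}: {computed:.2f}x (table: {reported:.2f}x)")
```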
