
Video generation

Video generation models extend image generation (which can be thought of as a single-frame video) to data that also varies across space and time. Keeping all of this data (text, spatial, and temporal) consistent and aligned from frame to frame is a major challenge when generating long, high-resolution videos.

Modern video models address this challenge with the diffusion transformer (DiT) architecture. It reduces computational costs and allows more efficient scaling to larger and higher-quality image and video data.

Take a look at some of the capabilities of these video models below.

Wan2.1
# pip install ftfy
import torch
import numpy as np
from diffusers import AutoModel, WanPipeline
from diffusers.hooks.group_offloading import apply_group_offloading
from diffusers.utils import export_to_video, load_image
from transformers import UMT5EncoderModel

text_encoder = UMT5EncoderModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="text_encoder", torch_dtype=torch.bfloat16)
vae = AutoModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="vae", torch_dtype=torch.float32)
transformer = AutoModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)

# group-offloading
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
apply_group_offloading(text_encoder,
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=4
)
transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True
)

pipeline = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    vae=vae,
    transformer=transformer,
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16
)
pipeline.to("cuda")

prompt = """
The camera rushes from far to near in a low-angle shot, 
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in 
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. 
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic 
shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""
negative_prompt = """
Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, 
low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, 
misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
"""

output = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)

This guide covers the basics of video generation, such as which parameters to configure and how to reduce memory usage.

If you're interested in learning how to use a specific model, refer to its pipeline API reference.

Pipeline parameters

There are several parameters to configure in a pipeline that affect the quality or speed of video generation. Experimenting with different parameter values is important for discovering the right balance between quality and speed.

num_frames

A frame is a still image that is played in sequence with other frames to create motion or a video. Control the number of generated frames with num_frames. Increasing num_frames improves perceived motion smoothness and visual coherence, which is especially important for videos with dynamic content. A higher num_frames value also increases the video duration at a given playback frame rate.

Some video models expect more specific num_frames values at inference time. For example, HunyuanVideoPipeline recommends a num_frames value of the form 4 * k + 1, such as 49 or 129. Always check a pipeline's API reference for any recommended values.
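As a quick sketch (the helper below is hypothetical and not part of Diffusers), you can snap an arbitrary frame count to the nearest 4 * k + 1 value and estimate the resulting clip duration for whatever frame rate you pass to export_to_video.

# hypothetical helper: round a requested frame count to the 4 * k + 1 form
def valid_num_frames(requested: int) -> int:
    k = max(round((requested - 1) / 4), 1)
    return 4 * k + 1

num_frames = valid_num_frames(130)  # -> 129
fps = 15  # example export frame rate
print(f"{num_frames} frames is about {num_frames / fps:.1f}s at {fps} fps")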

import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipeline = LTXPipeline.from_pretrained(
    "Lightricks/LTX-Video", torch_dtype=torch.bfloat16
).to("cuda")

prompt = """
A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman 
with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The 
camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and 
natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be 
real-life footage
"""

video = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=768,
    height=512,
    num_frames=161,
    decode_timestep=0.03,
    decode_noise_scale=0.025,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "output.mp4", fps=24)

guidance_scale

Guidance scale, or "cfg", controls how closely the generated frames adhere to the input conditions (text, image, or both). Increasing guidance_scale produces frames that follow the input conditions more closely and contain finer details, but it risks introducing artifacts and reducing output diversity. Lower guidance_scale values encourage looser prompt adherence and more output diversity, but the details may not be as good. If the value is too low, the model may ignore your prompt entirely and generate something close to random noise.

import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video

pipeline = CogVideoXPipeline.from_pretrained(
  "THUDM/CogVideoX-2b",
  torch_dtype=torch.float16
).to("cuda")

prompt = """
A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over
a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, 
with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an 
oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at 
a playful environment. The scene captures the innocence and imagination of childhood, 
with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.
"""

video = pipeline(
  prompt=prompt,
  guidance_scale=6,
  num_inference_steps=50
).frames[0]
export_to_video(video, "output.mp4", fps=8)
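Because the best guidance_scale depends on the model and prompt, it can help to sweep a few values and compare the results. The loop below is a minimal sketch that reuses the pipeline and prompt from the example above; the values are illustrative.

# illustrative sweep: lower values favor diversity, higher values favor prompt adherence
for scale in [3.0, 6.0, 9.0]:
    video = pipeline(
        prompt=prompt,
        guidance_scale=scale,
        num_inference_steps=50,
    ).frames[0]
    export_to_video(video, f"output_cfg_{scale}.mp4", fps=8)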

negative_prompt

A negative prompt is useful for excluding content you don't want to see in the generated video. It is commonly used to refine the quality and alignment of a generated video by steering the model away from undesirable elements such as "blurry, distorted, ugly". This can produce cleaner, more focused videos.

# pip install ftfy
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from diffusers.utils import export_to_video

vae = AutoencoderKLWan.from_pretrained(
  "Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="vae", torch_dtype=torch.float32
)
pipeline = WanPipeline.from_pretrained(
  "Wan-AI/Wan2.1-T2V-14B-Diffusers", vae=vae, torch_dtype=torch.bfloat16
)
pipeline.scheduler = UniPCMultistepScheduler.from_config(
  pipeline.scheduler.config, flow_shift=5.0
)
pipeline.to("cuda")

pipeline.load_lora_weights("benjamin-paine/steamboat-willie-14b", adapter_name="steamboat-willie")
pipeline.set_adapters("steamboat-willie")

pipeline.enable_model_cpu_offload()

# use "steamboat willie style" to trigger the LoRA
prompt = """
steamboat willie style, golden era animation, The camera rushes from far to near in a low-angle shot, 
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in 
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. 
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts 
dynamic shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""

output = pipeline(
  prompt=prompt,
  num_frames=81,
  guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)

Reduce memory usage

Recent video models such as HunyuanVideoPipeline and WanPipeline have more than 10 billion parameters and require a lot of memory, often more than is available on consumer hardware. Diffusers provides several techniques for reducing the memory requirements of these large models.

Refer to the Reduce memory usage guide for more details about other memory-saving techniques.

One of these techniques is group offloading, which moves groups of internal model layers (such as torch.nn.Sequential) to the CPU when they aren't in use. The layers are only loaded onto the GPU when they are needed for computation, so not all model components have to stay in GPU memory. For a 14 billion parameter model like WanPipeline, group offloading can lower the required memory to about 13GB of VRAM.

# pip install ftfy
import torch
import numpy as np
from diffusers import AutoModel, WanPipeline
from diffusers.hooks.group_offloading import apply_group_offloading
from diffusers.utils import export_to_video, load_image
from transformers import UMT5EncoderModel

text_encoder = UMT5EncoderModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="text_encoder", torch_dtype=torch.bfloat16)
vae = AutoModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="vae", torch_dtype=torch.float32)
transformer = AutoModel.from_pretrained("Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)

# group-offloading
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
apply_group_offloading(text_encoder,
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="block_level",
    num_blocks_per_group=4
)
transformer.enable_group_offload(
    onload_device=onload_device,
    offload_device=offload_device,
    offload_type="leaf_level",
    use_stream=True
)

pipeline = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    vae=vae,
    transformer=transformer,
    text_encoder=text_encoder,
    torch_dtype=torch.bfloat16
)
pipeline.to("cuda")

prompt = """
The camera rushes from far to near in a low-angle shot, 
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in 
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. 
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic 
shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""
negative_prompt = """
Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, 
low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, 
misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
"""

output = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)
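To verify the savings on your own hardware, you can check the peak GPU memory used during generation with PyTorch's built-in memory stats (a minimal sketch; the exact number depends on resolution and num_frames).

# rough check of peak VRAM usage against the ~13GB figure mentioned above
print(f"Peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")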

Another option for reducing memory is quantization, which stores the model weights in a lower-precision data type. Quantization may affect video quality, however, depending on the specific video model. Refer to the Quantization overview to learn more about the supported quantization backends.

The example below quantizes the model with bitsandbytes.

# pip install ftfy

import torch
from diffusers import AutoModel, WanPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from transformers import UMT5EncoderModel
from diffusers.utils import export_to_video

# quantize transformer and text encoder weights with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
  quant_backend="bitsandbytes_4bit",
  quant_kwargs={"load_in_4bit": True},
  components_to_quantize=["transformer", "text_encoder"]
)

vae = AutoModel.from_pretrained(
  "Wan-AI/Wan2.1-T2V-14B-Diffusers", subfolder="vae", torch_dtype=torch.float32
)
pipeline = WanPipeline.from_pretrained(
  "Wan-AI/Wan2.1-T2V-14B-Diffusers", vae=vae, quantization_config=pipeline_quant_config, torch_dtype=torch.bfloat16
)
pipeline.scheduler = UniPCMultistepScheduler.from_config(
  pipeline.scheduler.config, flow_shift=5.0
)
pipeline.to("cuda")

pipeline.load_lora_weights("benjamin-paine/steamboat-willie-14b", adapter_name="steamboat-willie")
pipeline.set_adapters("steamboat-willie")

pipeline.enable_model_cpu_offload()

# use "steamboat willie style" to trigger the LoRA
prompt = """
steamboat willie style, golden era animation, The camera rushes from far to near in a low-angle shot, 
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in 
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. 
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts 
dynamic shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""

output = pipeline(
  prompt=prompt,
  num_frames=81,
  guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)
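The quant_kwargs are forwarded to the backend's configuration, so they can be tuned further. For example, the config below is a sketch that enables NF4 quantization with a bfloat16 compute dtype using bitsandbytes options.

# sketch: NF4 4-bit quantization with a bfloat16 compute dtype
pipeline_quant_config = PipelineQuantizationConfig(
  quant_backend="bitsandbytes_4bit",
  quant_kwargs={
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": torch.bfloat16,
  },
  components_to_quantize=["transformer", "text_encoder"]
)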

Inference speed

torch.compile can speed up inference by using optimized kernels. The first compilation takes longer, but once the pipeline is compiled, subsequent runs are much faster. It is best to compile the pipeline once and then reuse it without changing anything; changes such as the image size trigger recompilation.

The example below compiles the transformer in the pipeline and uses the "max-autotune" mode to maximize performance.

import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video

pipeline = CogVideoXPipeline.from_pretrained(
  "THUDM/CogVideoX-2b",
  torch_dtype=torch.float16
).to("cuda")

# torch.compile
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.transformer = torch.compile(
    pipeline.transformer, mode="max-autotune", fullgraph=True
)

prompt = """
A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. 
The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. 
Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, 
with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting.
"""

video = pipeline(
  prompt=prompt,
  guidance_scale=6,
  num_inference_steps=50
).frames[0]
export_to_video(video, "output.mp4", fps=8)
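Because the first call pays the compilation cost, a quick timing loop (a sketch; actual timings depend on your hardware) shows the expected pattern of a slow first run followed by faster subsequent runs.

import time

# the first call triggers compilation; later calls with the same shapes reuse the compiled kernels
for i in range(2):
    start = time.perf_counter()
    _ = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
    print(f"run {i}: {time.perf_counter() - start:.1f}s")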