
Wan2.1

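Wan-2.1 by the Wan Team.

This report presents Wan, a comprehensive and open suite of video foundation models designed to push the boundaries of video generation. Built upon the mainstream diffusion transformer paradigm, Wan achieves significant advancements in generative capabilities through a series of innovations, including our novel VAE, scalable pre-training strategies, large-scale data curation, and automated evaluation metrics. These contributions collectively enhance the model's performance and versatility. Specifically, Wan is characterized by four key features:
Leading Performance: The 14B model of Wan, trained on a vast dataset comprising billions of images and videos, demonstrates the scaling laws of video generation with respect to both data and model size. It consistently outperforms existing open-source models as well as state-of-the-art commercial solutions across multiple internal and external benchmarks, demonstrating a clear and significant performance superiority.
Comprehensiveness: Wan offers two capable models, with 1.3B and 14B parameters, for efficiency and effectiveness respectively. It also covers multiple downstream applications, including image-to-video, instruction-guided video editing, and personal video generation, encompassing up to eight tasks.
Consumer-Grade Efficiency: The 1.3B model demonstrates exceptional resource efficiency, requiring only 8.19 GB of VRAM, making it compatible with a wide range of consumer-grade GPUs.
Openness: We open-source the entire Wan series, including source code and all models, with the goal of fostering the growth of the video generation community. This openness seeks to significantly expand the creative possibilities of video production in the industry and provide academia with high-quality video foundation models. All the code and models are available at this link.

You can find all the original Wan2.1 checkpoints under the Wan-AI organization.

Diffusers supports the following Wan models

Click on the Wan2.1 models in the right sidebar for more examples of video generation.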

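Text-to-Video Generation

The example below demonstrates how to generate a video from text, optimized for memory or inference speed.

T2V memory

T2V inference speed

First-Last-Frame-to-Video Generation

The example below demonstrates how to use the image-to-video pipeline to generate a video from a text description, a starting frame, and an ending frame.

Usage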

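Any-to-Video Controllable Generation

Wan VACE supports various generation techniques that enable controllable video generation. Some of its capabilities include:

Control to Video (Depth, Pose, Sketch, Flow, Grayscale, Scribble, Layout, Bounding Box, etc.). Recommended library for preprocessing videos to obtain control videos: huggingface/controlnet_aux
Image/Video to Video (first frame, last frame, starting clip, ending clip, random clips)
Inpainting and Outpainting
Subject to Video (faces, objects, characters, etc.)
Composition to Video (reference anything, animate anything, swap anything, expand anything, move anything, etc.)

The code snippets available in this pull request demonstrate some examples of how videos can be generated with controllability signals.

The general rule of thumb when preparing inputs for the VACE pipeline is that the input images, or the frames of a video that you want to use for conditioning, should have a corresponding black mask. A black mask signifies that the model will not generate new content for that area and will only use those parts to condition the generation process. For parts or frames that should be generated by the model, the mask should be white.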
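As a minimal sketch of this masking rule, the helper below assigns one grayscale value per frame: 0 (black) for frames supplied as conditioning and 255 (white) for frames the model should generate. The name frame_mask_values and its arguments are hypothetical and not part of Diffusers; each value could then be turned into a full-frame mask with PIL.Image.new("L", (width, height), value), as the WanVACEPipeline example on this page does.

```python
from typing import Iterable, List

BLACK = 0    # conditioning frame: the model keeps this content
WHITE = 255  # frame that the model should generate


def frame_mask_values(num_frames: int, conditioned: Iterable[int]) -> List[int]:
    """Return one grayscale mask value per frame (hypothetical helper)."""
    conditioned_set = set(conditioned)
    for index in conditioned_set:
        if not 0 <= index < num_frames:
            raise ValueError(f"frame index {index} is outside 0..{num_frames - 1}")
    return [BLACK if i in conditioned_set else WHITE for i in range(num_frames)]


# First and last frames are given; everything in between is generated.
values = frame_mask_values(num_frames=81, conditioned=[0, 80])
print(values[0], values[1], values[-1])  # 0 255 0
```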

Notes

Wan2.1 supports LoRAs with load_lora_weights().

# pip install ftfy
import torch
from diffusers import AutoModel, WanPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from diffusers.utils import export_to_video

vae = AutoModel.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", subfolder="vae", torch_dtype=torch.float32
)
pipeline = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", vae=vae, torch_dtype=torch.bfloat16
)
pipeline.scheduler = UniPCMultistepScheduler.from_config(
    pipeline.scheduler.config, flow_shift=5.0
)
pipeline.to("cuda")

pipeline.load_lora_weights("benjamin-paine/steamboat-willie-1.3b", adapter_name="steamboat-willie")
pipeline.set_adapters("steamboat-willie")

pipeline.enable_model_cpu_offload()

# use "steamboat willie style" to trigger the LoRA
prompt = """
steamboat willie style, golden era animation, The camera rushes from far to near in a low-angle shot, 
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in 
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground. 
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic 
shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""

output = pipeline(
    prompt=prompt,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)

WanTransformer3DModel and AutoencoderKLWan support loading from single files with from_single_file().

# pip install ftfy
import torch
from diffusers import WanPipeline, AutoModel

vae = AutoModel.from_single_file(
    "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/blob/main/split_files/vae/wan_2.1_vae.safetensors"
)
transformer = AutoModel.from_single_file(
    "https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/blob/main/split_files/diffusion_models/wan2.1_t2v_1.3B_bf16.safetensors",
    torch_dtype=torch.bfloat16
)
pipeline = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    vae=vae,
    transformer=transformer,
    torch_dtype=torch.bfloat16
)

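Set the AutoencoderKLWan dtype to torch.float32 for better decoding quality.
The number of frames (num_frames) should have the form 4 * k + 1 for an integer k, for example 81 = 4 * 20 + 1.
Try lower shift values (2.0 to 5.0) for lower-resolution videos and higher shift values (7.0 to 12.0) for higher-resolution videos.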
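The 4 * k + 1 rule and the shift guidance can be sketched in plain Python. The sketch assumes the rule constrains the frame count passed as num_frames (the examples on this page use 81 = 4 * 20 + 1). The helper names and the 720-pixel threshold are hypothetical, while the 3.0 and 5.0 shift values are taken from the code comments in the examples on this page ("5.0 for 720P, 3.0 for 480P").

```python
def snap_num_frames(requested: int) -> int:
    """Round a requested frame count down to the form 4 * k + 1."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    return (requested - 1) // 4 * 4 + 1


def clip_duration_seconds(num_frames: int, fps: int = 16) -> float:
    """Length of the exported clip when written at the given frame rate."""
    return num_frames / fps


def pick_flow_shift(height: int) -> float:
    """Assumed rule of thumb: 3.0 below 720 pixels of height, 5.0 otherwise."""
    return 5.0 if height >= 720 else 3.0


print(snap_num_frames(100))            # 97
print(clip_duration_seconds(81, 16))   # 5.0625
print(pick_flow_shift(480), pick_flow_shift(720))  # 3.0 5.0
```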

WanPipeline

class diffusers.WanPipeline

( tokenizer: AutoTokenizer text_encoder: UMT5EncoderModel transformer: WanTransformer3DModel vae: AutoencoderKLWan scheduler: FlowMatchEulerDiscreteScheduler )

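Parameters

tokenizer (T5Tokenizer) — Tokenizer from T5, specifically the google/umt5-xxl variant.
text_encoder (T5EncoderModel) — T5, specifically the google/umt5-xxl variant.
transformer (WanTransformer3DModel) — Conditional Transformer to denoise the input latents.
scheduler (UniPCMultistepScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.

Pipeline for text-to-video generation using Wan.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__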

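( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str]] = None height: int = 480 width: int = 832 num_frames: int = 81 num_inference_steps: int = 50 guidance_scale: float = 5.0 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'np' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) → ~WanPipelineOutput or tuple

Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, pass prompt_embeds instead.
negative_prompt (str or List[str], optional) — The prompt or prompts to avoid during video generation. If not defined, pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
height (int, defaults to 480) — The height in pixels of the generated video.
width (int, defaults to 832) — The width in pixels of the generated video.
num_frames (int, defaults to 81) — The number of frames in the generated video.
num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher-quality video at the expense of slower inference.
guidance_scale (float, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate videos that are closely linked to the text prompt, usually at the expense of lower quality.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
output_type (str, optional, defaults to "np") — The output format of the generated video. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a WanPipelineOutput instead of a plain tuple.
attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
max_sequence_length (int, defaults to 512) — The maximum sequence length of the text encoder. If the prompt is longer than this, it will be truncated. If the prompt is shorter, it will be padded to this length.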
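A callback passed as callback_on_step_end could look like the sketch below. It relies only on the call signature listed above, and it assumes the pipeline expects the callback to return the callback_kwargs dictionary (possibly modified). The class name StepLogger is hypothetical, and the final lines exercise the callback with stand-in values rather than a real pipeline.

```python
from typing import Any, Dict, List


class StepLogger:
    """Records which denoising steps ran and which tensors were exposed."""

    def __init__(self) -> None:
        self.steps: List[int] = []
        self.seen_keys: List[str] = []

    def __call__(self, pipeline: Any, step: int, timestep: int,
                 callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
        self.steps.append(step)
        self.seen_keys = sorted(callback_kwargs)
        # Return the dictionary so the caller can read back any changes.
        return callback_kwargs


logger = StepLogger()
# Stand-in call: a real pipeline would pass itself and real tensors.
result = logger(None, 0, 999, {"latents": [0.0]})
print(logger.steps, logger.seen_keys)  # [0] ['latents']
```

With a real pipeline, the object would be passed as pipe(..., callback_on_step_end=logger, callback_on_step_end_tensor_inputs=["latents"]).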

~WanPipelineOutput or tuple

If return_dict is True, a WanPipelineOutput is returned; otherwise a tuple is returned whose first element is the generated video frames.

The call function to the pipeline for generation.

Example

>>> import torch
>>> from diffusers.utils import export_to_video
>>> from diffusers import AutoencoderKLWan, WanPipeline
>>> from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

>>> # Available models: Wan-AI/Wan2.1-T2V-14B-Diffusers, Wan-AI/Wan2.1-T2V-1.3B-Diffusers
>>> model_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers"
>>> vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
>>> pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
>>> flow_shift = 5.0  # 5.0 for 720P, 3.0 for 480P
>>> pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=flow_shift)
>>> pipe.to("cuda")

>>> prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."
>>> negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

>>> output = pipe(
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     height=720,
...     width=1280,
...     num_frames=81,
...     guidance_scale=5.0,
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=16)

encode_prompt

( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None max_sequence_length: int = 226 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )

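Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
num_videos_per_prompt (int, optional, defaults to 1) — Number of videos that should be generated per prompt.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
device (torch.device, optional) — The torch device on which to place the resulting embeddings.
dtype (torch.dtype, optional) — The torch dtype.

Encodes the prompt into text encoder hidden states.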

WanImageToVideoPipeline

class diffusers.WanImageToVideoPipeline

( tokenizer: AutoTokenizer text_encoder: UMT5EncoderModel image_encoder: CLIPVisionModel image_processor: CLIPImageProcessor transformer: WanTransformer3DModel vae: AutoencoderKLWan scheduler: FlowMatchEulerDiscreteScheduler )

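Parameters

tokenizer (T5Tokenizer) — Tokenizer from T5, specifically the google/umt5-xxl variant.
text_encoder (T5EncoderModel) — T5, specifically the google/umt5-xxl variant.
image_encoder (CLIPVisionModel) — CLIP, specifically the clip-vit-huge-patch14 variant.
transformer (WanTransformer3DModel) — Conditional Transformer to denoise the input latents.
scheduler (UniPCMultistepScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.

Pipeline for image-to-video generation using Wan.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__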

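( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str]] = None height: int = 480 width: int = 832 num_frames: int = 81 num_inference_steps: int = 50 guidance_scale: float = 5.0 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None image_embeds: typing.Optional[torch.Tensor] = None last_image: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'np' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) → ~WanPipelineOutput or tuple

Parameters

image (PipelineImageInput) — The input image to condition the generation on. Must be an image, a list of images, or a torch.Tensor.
prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, one has to pass prompt_embeds instead.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
height (int, defaults to 480) — The height of the generated video.
width (int, defaults to 832) — The width of the generated video.
num_frames (int, defaults to 81) — The number of frames in the generated video.
num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher-quality video at the expense of slower inference.
guidance_scale (float, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate videos that are closely linked to the text prompt, usually at the expense of lower quality.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, they are generated from the negative_prompt input argument.
image_embeds (torch.Tensor, optional) — Pre-generated image embeddings. Can be used to easily tweak image inputs (weighting). If not provided, image embeddings are generated from the image input argument.
output_type (str, optional, defaults to "np") — The output format of the generated video. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a WanPipelineOutput instead of a plain tuple.
attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
max_sequence_length (int, defaults to 512) — The maximum sequence length of the text encoder. If the prompt is longer than this, it will be truncated. If the prompt is shorter, it will be padded to this length.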
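To make the role of guidance_scale concrete, the sketch below combines an unconditional and a text-conditioned prediction with weight w, following the classifier-free guidance formula that the guidance_scale description refers to. Plain Python lists stand in for tensors and the function name is hypothetical; this is an illustration, not the pipeline's implementation.

```python
from typing import List, Sequence


def apply_guidance(uncond: Sequence[float], cond: Sequence[float], w: float) -> List[float]:
    """Return uncond + w * (cond - uncond), element by element."""
    if len(uncond) != len(cond):
        raise ValueError("predictions must have the same length")
    return [u + w * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]
cond = [1.0, 3.0]
print(apply_guidance(uncond, cond, w=1.0))  # [1.0, 3.0], identical to the conditioned prediction
print(apply_guidance(uncond, cond, w=5.0))  # [5.0, 11.0], pushed further along cond - uncond
```

With w = 1 the result equals the conditioned prediction, which is why guidance only has an effect for values above 1.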

~WanPipelineOutput or tuple

The call function to the pipeline for generation.

Example

>>> import torch
>>> import numpy as np
>>> from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
>>> from diffusers.utils import export_to_video, load_image
>>> from transformers import CLIPVisionModel

>>> # Available models: Wan-AI/Wan2.1-I2V-14B-480P-Diffusers, Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
>>> model_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"
>>> image_encoder = CLIPVisionModel.from_pretrained(
...     model_id, subfolder="image_encoder", torch_dtype=torch.float32
... )
>>> vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
>>> pipe = WanImageToVideoPipeline.from_pretrained(
...     model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
... )
>>> pipe.to("cuda")

>>> image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
... )
>>> max_area = 480 * 832
>>> aspect_ratio = image.height / image.width
>>> mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
>>> height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
>>> width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
>>> image = image.resize((width, height))
>>> prompt = (
...     "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in "
...     "the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."
... )
>>> negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

>>> output = pipe(
...     image=image,
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     height=height,
...     width=width,
...     num_frames=81,
...     guidance_scale=5.0,
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=16)
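The height and width arithmetic in the example above can be isolated into a small function. The sketch below uses only the standard library. The function name is hypothetical, and mod_value = 16 is an assumed stand-in for pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1], which should be read from the loaded pipeline in practice.

```python
import math
from typing import Tuple


def fit_resolution(image_width: int, image_height: int,
                   max_area: int = 480 * 832, mod_value: int = 16) -> Tuple[int, int]:
    """Return (width, height) close to max_area that keep the aspect ratio
    and are multiples of mod_value."""
    aspect_ratio = image_height / image_width
    height = round(math.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(math.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    return width, height


width, height = fit_resolution(1024, 1024)
print(width, height)  # 624 624
```

Rounding down to a multiple of mod_value keeps the resized image within the target area while matching the granularity the model works at.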

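encode_prompt

Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
num_videos_per_prompt (int, optional, defaults to 1) — Number of videos that should be generated per prompt.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
device (torch.device, optional) — The torch device on which to place the resulting embeddings.
dtype (torch.dtype, optional) — The torch dtype.

Encodes the prompt into text encoder hidden states.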

WanVACEPipeline

class diffusers.WanVACEPipeline

( tokenizer: AutoTokenizer text_encoder: UMT5EncoderModel transformer: WanVACETransformer3DModel vae: AutoencoderKLWan scheduler: FlowMatchEulerDiscreteScheduler )

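Parameters

tokenizer (T5Tokenizer) — Tokenizer from T5, specifically the google/umt5-xxl variant.
text_encoder (T5EncoderModel) — T5, specifically the google/umt5-xxl variant.
transformer (WanVACETransformer3DModel) — Conditional Transformer to denoise the input latents.
scheduler (UniPCMultistepScheduler) — A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.

Pipeline for controllable generation using Wan.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__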

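( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str]] = None video: typing.Optional[typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]]] = None mask: typing.Optional[typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]]] = None reference_images: typing.Optional[typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]]] = None conditioning_scale: typing.Union[float, typing.List[float], torch.Tensor] = 1.0 height: int = 480 width: int = 832 num_frames: int = 81 num_inference_steps: int = 50 guidance_scale: float = 5.0 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'np' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) → ~WanPipelineOutput or tuple

Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, one has to pass `prompt_embeds` instead.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
video (List[PIL.Image.Image], optional) — The input video to be used as a starting point for the generation. The video should be a list of PIL images, a numpy array, or a torch tensor. Currently, the pipeline only supports generating one video at a time.
mask (List[PIL.Image.Image], optional) — The input mask defines which video regions to condition on and which to generate. Black areas in the mask indicate conditioning regions, while white areas indicate regions for generation. The mask should be a list of PIL images, a numpy array, or a torch tensor. Currently, generating one video at a time is supported.
reference_images (List[PIL.Image.Image], optional) — A list of one or more reference images used as extra conditioning for the generation. For example, if you are inpainting a video to change a character, you can pass reference images of the new character here. Refer to the Diffusers examples and the original user guide for a full list of supported tasks and use cases.
conditioning_scale (float, List[float], torch.Tensor, defaults to 1.0) — The conditioning scale applied when adding the control conditioning latent stream to the denoising latent stream in each control layer of the model. If a float is provided, it is applied uniformly to all layers. If a list or tensor is provided, it should have the same length as the number of control layers in the model (`len(transformer.config.vace_layers)`).
height (int, defaults to 480) — The height in pixels of the generated video.
width (int, defaults to 832) — The width in pixels of the generated video.
num_frames (int, defaults to 81) — The number of frames in the generated video.
num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher-quality video at the expense of slower inference.
guidance_scale (float, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. `guidance_scale` is defined as `w` of equation 2 of the Imagen Paper. Guidance is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages the model to generate videos that are closely linked to the text `prompt`, usually at the expense of lower quality.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random `generator`.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the `prompt` input argument.
output_type (str, optional, defaults to "np") — The output format of the generated video. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a WanPipelineOutput instead of a plain tuple.
attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as the `callback_kwargs` argument. You can only include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
max_sequence_length (int, defaults to 512) — The maximum sequence length of the text encoder. If the prompt is longer than this, it will be truncated. If the prompt is shorter, it will be padded to this length.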
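The two accepted forms of conditioning_scale can be illustrated with a small normalisation helper. This is a sketch of the rule described in the parameter list, not the pipeline's code. The function name is hypothetical, and num_layers stands in for len(transformer.config.vace_layers).

```python
from typing import List, Sequence, Union


def expand_conditioning_scale(scale: Union[float, Sequence[float]], num_layers: int) -> List[float]:
    """Return one conditioning scale per control layer."""
    if isinstance(scale, (int, float)):
        # A single number applies uniformly to every control layer.
        return [float(scale)] * num_layers
    scales = [float(s) for s in scale]
    if len(scales) != num_layers:
        raise ValueError(f"expected {num_layers} scales, got {len(scales)}")
    return scales


print(expand_conditioning_scale(1.0, 3))         # [1.0, 1.0, 1.0]
print(expand_conditioning_scale([1.0, 0.5], 2))  # [1.0, 0.5]
```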

~WanPipelineOutput or tuple

The call function to the pipeline for generation.

Example

>>> import torch
>>> import PIL.Image
>>> from diffusers import AutoencoderKLWan, WanVACEPipeline
>>> from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
>>> from diffusers.utils import export_to_video, load_image
>>> def prepare_video_and_mask(first_img: PIL.Image.Image, last_img: PIL.Image.Image, height: int, width: int, num_frames: int):
...     first_img = first_img.resize((width, height))
...     last_img = last_img.resize((width, height))
...     frames = []
...     frames.append(first_img)
...     # Ideally, this should be 127.5 to match original code, but they perform computation on numpy arrays
...     # whereas we are passing PIL images. If you choose to pass numpy arrays, you can set it to 127.5 to
...     # match the original code.
...     frames.extend([PIL.Image.new("RGB", (width, height), (128, 128, 128))] * (num_frames - 2))
...     frames.append(last_img)
...     mask_black = PIL.Image.new("L", (width, height), 0)
...     mask_white = PIL.Image.new("L", (width, height), 255)
...     mask = [mask_black, *[mask_white] * (num_frames - 2), mask_black]
...     return frames, mask

>>> # Available checkpoints: Wan-AI/Wan2.1-VACE-1.3B-diffusers, Wan-AI/Wan2.1-VACE-14B-diffusers
>>> model_id = "Wan-AI/Wan2.1-VACE-1.3B-diffusers"
>>> vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
>>> pipe = WanVACEPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
>>> flow_shift = 3.0  # 5.0 for 720P, 3.0 for 480P
>>> pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=flow_shift)
>>> pipe.to("cuda")

>>> prompt = "CG animation style, a small blue bird takes off from the ground, flapping its wings. The bird's feathers are delicate, with a unique pattern on its chest. The background shows a blue sky with white clouds under bright sunshine. The camera follows the bird upward, capturing its flight and the vastness of the sky from a close-up, low-angle perspective."
>>> negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
>>> first_frame = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_first_frame.png"
... )
>>> last_frame = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/flf2v_input_last_frame.png"
... )

>>> height = 512
>>> width = 512
>>> num_frames = 81
>>> video, mask = prepare_video_and_mask(first_frame, last_frame, height, width, num_frames)

>>> output = pipe(
...     video=video,
...     mask=mask,
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     height=height,
...     width=width,
...     num_frames=num_frames,
...     num_inference_steps=30,
...     guidance_scale=5.0,
...     generator=torch.Generator().manual_seed(42),
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=16)

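encode_prompt

Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
num_videos_per_prompt (int, optional, defaults to 1) — Number of videos that should be generated per prompt.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the `prompt` input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
device (torch.device, optional) — The torch device on which to place the resulting embeddings.
dtype (torch.dtype, optional) — The torch dtype.

Encodes the prompt into text encoder hidden states.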

WanVideoToVideoPipeline

class diffusers.WanVideoToVideoPipeline

( tokenizer: AutoTokenizer text_encoder: UMT5EncoderModel transformer: WanTransformer3DModel vae: AutoencoderKLWan scheduler: FlowMatchEulerDiscreteScheduler )

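Parameters

tokenizer (T5Tokenizer) — Tokenizer from T5, specifically the google/umt5-xxl variant.
text_encoder (T5EncoderModel) — T5, specifically the google/umt5-xxl variant.
transformer (WanTransformer3DModel) — Conditional Transformer to denoise the input latents.
scheduler (UniPCMultistepScheduler) — A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.

Pipeline for video-to-video generation using Wan.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__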

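( video: typing.List[PIL.Image.Image] = None prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str]] = None height: int = 480 width: int = 832 num_inference_steps: int = 50 timesteps: typing.Optional[typing.List[int]] = None guidance_scale: float = 5.0 strength: float = 0.8 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'np' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) → ~WanPipelineOutput or tuple

Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, one has to pass `prompt_embeds` instead.
height (int, defaults to 480) — The height in pixels of the generated video.
width (int, defaults to 832) — The width in pixels of the generated video.
num_frames (int, defaults to 81) — The number of frames in the generated video.
num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher-quality video at the expense of slower inference.
guidance_scale (float, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate videos that are closely linked to the text prompt, usually at the expense of lower quality.
strength (float, defaults to 0.8) — Higher strength leads to more differences between the original video and the generated video.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
output_type (str, optional, defaults to "np") — The output format of the generated video. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a WanPipelineOutput instead of a plain tuple.
attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
max_sequence_length (int, defaults to 512) — The maximum sequence length of the text encoder. If the prompt is longer than this, it will be truncated. If the prompt is shorter, it will be padded to this length.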
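One common way image-to-image and video-to-video pipelines apply strength is to skip the early part of the denoising schedule, so that only a fraction of num_inference_steps is actually run. The sketch below illustrates that convention. It is an assumption about how strength behaves here rather than a statement of this pipeline's exact implementation, and the function name is hypothetical.

```python
def steps_actually_run(num_inference_steps: int, strength: float) -> int:
    """Number of denoising steps executed under the assumed convention."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


print(steps_actually_run(50, 1.0))  # 50, the whole schedule is run
print(steps_actually_run(50, 0.5))  # 25, only the later half of the schedule is run
```

Under this convention, lower strength keeps more of the input video because fewer denoising steps are applied to it.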

~WanPipelineOutput or tuple

The call function to the pipeline for generation.

Example

>>> import torch
>>> from diffusers.utils import export_to_video, load_video
>>> from diffusers import AutoencoderKLWan, WanVideoToVideoPipeline
>>> from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

>>> # Available models: Wan-AI/Wan2.1-T2V-14B-Diffusers, Wan-AI/Wan2.1-T2V-1.3B-Diffusers
>>> model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
>>> vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
>>> pipe = WanVideoToVideoPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
>>> flow_shift = 3.0  # 5.0 for 720P, 3.0 for 480P
>>> pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=flow_shift)
>>> pipe.to("cuda")

>>> prompt = "A robot standing on a mountain top. The sun is setting in the background"
>>> negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
>>> video = load_video(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4"
... )
>>> output = pipe(
...     video=video,
...     prompt=prompt,
...     negative_prompt=negative_prompt,
...     height=480,
...     width=720,
...     guidance_scale=5.0,
...     strength=0.7,
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=16)

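encode_prompt

Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
num_videos_per_prompt (int, optional, defaults to 1) — Number of videos that should be generated per prompt.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
device (torch.device, optional) — The torch device on which to place the resulting embeddings.
dtype (torch.dtype, optional) — The torch dtype.

Encodes the prompt into text encoder hidden states.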

WanPipelineOutput

class diffusers.pipelines.wan.pipeline_output.WanPipelineOutput

( frames: Tensor )

Parameters

frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs. It can be a nested list of length batch_size, with each sub-list containing a denoised PIL image sequence of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).

Output class for Wan pipelines.
