HunyuanVideo

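HunyuanVideo is a 13B-parameter diffusion transformer model designed to be competitive with closed-source video foundation models and to give the wider community access to such a model. The model uses a "dual-stream to single-stream" architecture: video and text tokens are first processed independently, then concatenated and fed into the transformer to fuse the multimodal information. A pretrained multimodal large language model (MLLM) serves as the text encoder because it offers better image-text alignment, better image detail description, and better reasoning, and it can act as a zero-shot learner when system instructions are added to the user prompt. Finally, HunyuanVideo uses a 3D causal variational autoencoder to process video data at its original resolution and frame rate more efficiently.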
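The "dual-stream to single-stream" data flow described above can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration (the token counts, the hidden size, the random weights, and the single attention step); it is not the model's actual block design, only the shape of the idea: separate weights per modality first, then one joint sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 8  # toy hidden size, chosen only for illustration

# Toy inputs: 16 video tokens and 4 text tokens with the same hidden size.
video_tokens = rng.standard_normal((16, hidden_dim))
text_tokens = rng.standard_normal((4, hidden_dim))

# Dual-stream stage: each modality is transformed with its own weights.
w_video = rng.standard_normal((hidden_dim, hidden_dim))
w_text = rng.standard_normal((hidden_dim, hidden_dim))
video_hidden = np.tanh(video_tokens @ w_video)
text_hidden = np.tanh(text_tokens @ w_text)

# Single-stream stage: concatenate along the sequence axis so that every
# token can attend to every other token, regardless of modality.
tokens = np.concatenate([video_hidden, text_hidden], axis=0)

# One plain self-attention step over the joint sequence (softmax over keys).
scores = tokens @ tokens.T / np.sqrt(hidden_dim)
scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)
fused = weights @ tokens

print(tokens.shape, fused.shape)
```

After concatenation, each fused row is a weighted mix of both video and text tokens, which is the "fusion" the single-stream stage refers to.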

You can find all the original HunyuanVideo checkpoints under the Tencent organization.

Click on the HunyuanVideo models in the right sidebar for more examples of video generation tasks.

The examples below use a checkpoint from hunyuanvideo-community because its weights are stored in a layout compatible with Diffusers.

The examples below demonstrate how to generate a video optimized for memory or inference speed.

Notes

HunyuanVideo supports loading LoRAs with load_lora_weights().

import torch
from diffusers import HunyuanVideoPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import export_to_video

# quantize weights to int4 with bitsandbytes
pipeline_quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16,
    },
    components_to_quantize=["transformer"],
)

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    quantization_config=pipeline_quant_config,
    torch_dtype=torch.bfloat16,
)

# load LoRA weights
pipeline.load_lora_weights("https://huggingface.co/lucataco/hunyuan-steamboat-willie-10", adapter_name="steamboat-willie")
pipeline.set_adapters("steamboat-willie", 0.9)

# model-offloading and tiling
pipeline.enable_model_cpu_offload()
pipeline.vae.enable_tiling()

# use "In the style of SWR" to trigger the LoRA
prompt = """
In the style of SWR. A black and white animated scene featuring a fluffy teddy bear sits on a bed of soft pillows surrounded by children's toys.
"""
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "output.mp4", fps=15)

Refer to the table below for recommended inference values.

parameter	recommended value
text encoder dtype	`torch.float16`
transformer dtype	`torch.bfloat16`
VAE dtype	`torch.float16`
`num_frames (k)`	4 * `k` + 1

Try lower `shift` values (`2.0` to `5.0`) for lower-resolution videos and higher `shift` values (`7.0` to `12.0`) for higher-resolution images.
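The table gives the recommended frame count in the form 4 * `k` + 1. A small helper can check a value or snap a requested count to the nearest value of that form. The function names below are hypothetical and are not part of Diffusers; this is only a sketch of the arithmetic.

```python
def is_valid_num_frames(num_frames: int) -> bool:
    """Return True if num_frames can be written as 4 * k + 1 for an integer k >= 0."""
    return num_frames >= 1 and (num_frames - 1) % 4 == 0


def nearest_valid_num_frames(requested: int) -> int:
    """Snap a requested frame count to the nearest value of the form 4 * k + 1.

    Ties (a request exactly halfway between two valid values) round up.
    """
    if requested < 1:
        raise ValueError("requested must be at least 1")
    k = (requested - 1 + 2) // 4  # round (requested - 1) / 4 half up
    return 4 * k + 1


print(nearest_valid_num_frames(60))   # 61, the value used in the examples on this page
print(nearest_valid_num_frames(129))  # 129, the default num_frames
```

Both frame counts that appear on this page, 61 and 129, already satisfy the rule (`k` = 15 and `k` = 32).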

HunyuanVideoPipeline

class diffusers.HunyuanVideoPipeline

( text_encoder: LlamaModel tokenizer: LlamaTokenizerFast transformer: HunyuanVideoTransformer3DModel vae: AutoencoderKLHunyuanVideo scheduler: FlowMatchEulerDiscreteScheduler text_encoder_2: CLIPTextModel tokenizer_2: CLIPTokenizer )

Parameters

text_encoder (LlamaModel) — Llava Llama3-8B.
tokenizer (LlamaTokenizer) — Tokenizer from Llava Llama3-8B.
transformer (HunyuanVideoTransformer3DModel) — Conditional transformer used to denoise the encoded video latents.
scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler used in combination with transformer to denoise the encoded video latents.
vae (AutoencoderKLHunyuanVideo) — Variational autoencoder (VAE) model used to encode videos to latent representations and decode them back.
text_encoder_2 (CLIPTextModel) — CLIP, specifically the clip-vit-large-patch14 variant.
tokenizer_2 (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.

Pipeline for text-to-video generation using HunyuanVideo.

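This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).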

__call__

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str]] = None negative_prompt_2: typing.Union[str, typing.List[str]] = None height: int = 720 width: int = 1280 num_frames: int = 129 num_inference_steps: int = 50 sigmas: typing.List[float] = None true_cfg_scale: float = 1.0 guidance_scale: float = 6.0 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] prompt_template: typing.Dict[str, typing.Any] = {'template': '<|start_header_id|>system<|end_header_id|>\n\nDescribe the video by detailing the following aspects: 1. The main content and theme of the video.2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.4. background environment, light, style and atmosphere.5. camera angles, movements, and transitions used in the video:<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>', 'crop_start': 95} max_sequence_length: int = 256 ) → ~HunyuanVideoPipelineOutput or tuple
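The default `prompt_template` in the signature contains a single `{}` placeholder in the user turn. The sketch below shows how a user prompt would slot into it, assuming the template is filled with plain `str.format`; that is an assumption about the pipeline internals, and `build_llama_prompt` is a hypothetical helper, not a Diffusers function. The `crop_start` value is copied from the signature; it presumably marks how many leading template tokens are dropped from the encoder output, which this sketch does not model.

```python
# Default template copied from the __call__ signature on this page.
DEFAULT_PROMPT_TEMPLATE = {
    "template": (
        "<|start_header_id|>system<|end_header_id|>\n\n"
        "Describe the video by detailing the following aspects: "
        "1. The main content and theme of the video."
        "2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects."
        "3. Actions, events, behaviors temporal relationships, physical movement changes of the objects."
        "4. background environment, light, style and atmosphere."
        "5. camera angles, movements, and transitions used in the video:<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>"
    ),
    "crop_start": 95,
}


def build_llama_prompt(prompt: str, prompt_template: dict = DEFAULT_PROMPT_TEMPLATE) -> str:
    """Insert the user prompt into the template's single '{}' placeholder."""
    return prompt_template["template"].format(prompt)


text = build_llama_prompt("A cat walks on the grass, realistic")
print(text)
```

The system instructions stay fixed while only the user turn changes, which matches the overview's remark that the MLLM can act as a zero-shot learner when system instructions are added to the user prompt.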

Parameters

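prompt (str or List[str], optional) — The prompt or prompts to guide video generation. If not defined, prompt_embeds must be passed instead.
prompt_2 (str or List[str], optional) — The prompt or prompts to send to tokenizer_2 and text_encoder_2. If not defined, prompt is used instead.
negative_prompt (str or List[str], optional) — The prompt or prompts that should not guide video generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when guidance is not used, that is, when true_cfg_scale is not greater than 1.
negative_prompt_2 (str or List[str], optional) — The prompt or prompts that should not guide video generation, sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in all text encoders.
height (int, defaults to 720) — The height in pixels of the generated video.
width (int, defaults to 1280) — The width in pixels of the generated video.
num_frames (int, defaults to 129) — The number of frames in the generated video.
num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to higher-quality output at the expense of slower inference.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers that support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed is used.
true_cfg_scale (float, optional, defaults to 1.0) — When > 1.0 and a negative_prompt is provided, enables true classifier-free guidance.
guidance_scale (float, defaults to 6.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages output that is closely linked to the text prompt, usually at the expense of lower quality. Note that the only available HunyuanVideo model is CFG-distilled, which means that traditional guidance between unconditional and conditional latents is not applied.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, for example prompt weighting. If not provided, pooled text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, for example prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
negative_pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, for example prompt weighting. If not provided, pooled negative_prompt_embeds are generated from the negative_prompt input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a HunyuanVideoPipelineOutput instead of a plain tuple.
attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
clip_skip (int, optional) — Number of layers to skip from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs includes all tensors specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list are passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.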
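As a sketch of the `callback_on_step_end` contract described in the parameter list, the following plain-Python callback records each step and hands the tensors back unchanged. `make_step_logger` is a hypothetical helper, not part of Diffusers. Returning the `callback_kwargs` dict is an assumption based on how Diffusers pipelines generally read updated tensors back from the callback's return value.

```python
from typing import Any, Callable, Dict, List, Tuple


def make_step_logger(log: List[Tuple[int, float, List[str]]]) -> Callable:
    """Build a callback that appends (step, timestep, tensor names) to `log`."""

    def on_step_end(pipeline: Any, step: int, timestep: Any, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
        # float() also works for zero-dimensional tensors, which is how
        # timesteps are commonly passed during denoising.
        log.append((step, float(timestep), sorted(callback_kwargs)))
        # Hand the (possibly modified) tensors back to the pipeline.
        return callback_kwargs

    return on_step_end


# Standalone check with stand-in values instead of a real pipeline run.
step_log: List[Tuple[int, float, List[str]]] = []
callback = make_step_logger(step_log)
returned = callback(None, 0, 999, {"latents": [0.0]})
print(step_log)

# With a loaded pipeline, the callback would be passed like this:
# pipe(prompt="...", callback_on_step_end=callback,
#      callback_on_step_end_tensor_inputs=["latents"])
```

Only names listed in `callback_on_step_end_tensor_inputs` show up in `callback_kwargs`, so the default of `['latents']` yields a single-entry dict.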

~HunyuanVideoPipelineOutput or tuple

If return_dict is True, a HunyuanVideoPipelineOutput is returned; otherwise a tuple is returned whose first element is the generated video frames.

The call function to the pipeline for generation.

Examples

>>> import torch
>>> from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
>>> from diffusers.utils import export_to_video

>>> model_id = "hunyuanvideo-community/HunyuanVideo"
>>> transformer = HunyuanVideoTransformer3DModel.from_pretrained(
...     model_id, subfolder="transformer", torch_dtype=torch.bfloat16
... )
>>> pipe = HunyuanVideoPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.float16)
>>> pipe.vae.enable_tiling()
>>> pipe.to("cuda")

>>> output = pipe(
...     prompt="A cat walks on the grass, realistic",
...     height=320,
...     width=512,
...     num_frames=61,
...     num_inference_steps=30,
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=15)

disable_vae_slicing

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method goes back to computing decoding in one step.

disable_vae_tiling

( )

Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method goes back to computing decoding in one step.

enable_vae_slicing

( )

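Enable sliced VAE decoding. When this option is enabled, the VAE splits the input tensor into slices and computes decoding in several steps. This is useful to save some memory and allow larger batch sizes.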
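The idea behind sliced decoding can be shown with a conceptual sketch in plain Python. It is not the AutoencoderKLHunyuanVideo implementation: `decode_in_slices` and `toy_decode` are hypothetical stand-ins, and the only assumption is that the decoder treats each sample in the batch independently.

```python
from typing import Callable, List, Sequence


def decode_in_slices(latents: Sequence, decode: Callable[[Sequence], List], slice_size: int = 1) -> List:
    """Decode a batch a few samples at a time and concatenate the results.

    Only `slice_size` samples go through `decode` at once, so peak memory
    depends on the slice size rather than on the full batch size.
    """
    outputs: List = []
    for start in range(0, len(latents), slice_size):
        outputs.extend(decode(latents[start:start + slice_size]))
    return outputs


def toy_decode(batch: Sequence) -> List:
    """Stand-in for a VAE decoder that acts on each sample independently."""
    return [[2 * value for value in sample] for sample in batch]


batch = [[1, 2], [3, 4], [5, 6]]
print(decode_in_slices(batch, toy_decode) == toy_decode(batch))
```

Because each sample is decoded independently, the sliced result matches the one-step result; the cost is more decoder calls.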

enable_vae_tiling

( )

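Enable tiled VAE decoding. When this option is enabled, the VAE splits the input tensor into tiles and computes encoding and decoding in several steps. This is useful for saving a large amount of memory and for processing larger images.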

HunyuanVideoPipelineOutput

class diffusers.pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput

( frames: Tensor )

Parameters

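frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs. It can be a nested list of length batch_size, with each sub-list containing a denoised PIL image sequence of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).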

Output class for HunyuanVideo pipelines.
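To make the array layout of `frames` concrete, the sketch below builds a placeholder NumPy array with the documented axis order and indexes it the way the examples on this page do with `.frames[0]`. The sizes are arbitrary toy values and the array holds zeros rather than real pipeline output.

```python
import numpy as np

# Toy sizes chosen for illustration; real outputs are much larger.
batch_size, num_frames, channels, height, width = 2, 5, 3, 32, 48

# Placeholder with the documented layout:
# (batch_size, num_frames, channels, height, width)
frames = np.zeros((batch_size, num_frames, channels, height, width), dtype=np.float32)

# Selecting index 0 on the batch axis gives the first video.
first_video = frames[0]  # (num_frames, channels, height, width)

# Many image libraries expect channels-last frames, so move the channel axis.
first_video_hwc = first_video.transpose(0, 2, 3, 1)  # (num_frames, height, width, channels)

print(first_video.shape, first_video_hwc.shape)
```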
