CogVideoX

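CogVideoX is a text-to-video generation model focused on creating videos that are more coherent with the prompt. It achieves this through the following methods:

A 3D variational autoencoder that compresses videos spatially and temporally, improving both the compression rate and video accuracy.
An expert transformer block that helps align text and video, and a 3D full attention module for capturing and creating spatially and temporally accurate videos.

Testing across the video instruction dimensions found that CogVideoX performs well on subject consistency, dynamic information, background consistency, object information, smooth motion, color, scene, appearance style, and temporal style, but does not achieve good results on human action, spatial relationships, and multiple objects.

Fine-tuning with Diffusers can help make up for these shortcomings.

Data Preparation

The training script accepts data in two formats.

The first format is suited to small-scale training. The second uses a CSV file and is more appropriate for streaming data in large-scale training. In the future, Diffusers will support tags.

Small format

Two files, where one contains line-separated prompts and the other contains line-separated paths to the video data (each video path must be relative to the path you pass as `--instance_data_root`). Let's look at an example to understand this better!

Assume you have specified `--instance_data_root` as `/dataset`, and that this directory contains the files `prompts.txt` and `videos.txt`.

The `prompts.txt` file should contain line-separated prompts.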

A black and white animated sequence featuring a rabbit, named Rabbity Ribfried, and an anthropomorphic goat in a musical, playful environment, showcasing their evolving interaction.
A black and white animated sequence on a ship's deck features a bulldog character, named Bully Bulldoger, showcasing exaggerated facial expressions and body language. The character progresses from confident to focused, then to strained and distressed, displaying a range of emotions as it navigates challenges. The ship's interior remains static in the background, with minimalistic details such as a bell and open door. The character's dynamic movements and changing expressions drive the narrative, with no camera movement to distract from its evolving reactions and physical gestures.
...

The `videos.txt` file should contain line-separated paths to the video files. Note that each path must be relative to the `--instance_data_root` directory.

videos/00000.mp4
videos/00001.mp4
...

Overall, if you run the `tree` command on the dataset root directory, your dataset should look like this:

/dataset
├── prompts.txt
├── videos.txt
├── videos
    ├── videos/00000.mp4
    ├── videos/00001.mp4
    ├── ...

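When using this format, `--caption_column` must be `prompts.txt` and `--video_column` must be `videos.txt`.

Stream format

You can use a single CSV file. For this example, assume you have a `metadata.csv` file. The expected format is: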
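Because the two files are matched line by line, it can help to sanity-check them before launching a run. The following is a minimal sketch, not part of the training script; the helper name `check_small_format` is hypothetical, and it assumes the layout described above (`prompts.txt`, `videos.txt`, and video paths relative to the dataset root).

```python
from pathlib import Path


def check_small_format(data_root: str) -> int:
    """Validate a prompts.txt / videos.txt dataset and return the number of pairs."""
    root = Path(data_root)
    prompts = (root / "prompts.txt").read_text(encoding="utf-8").splitlines()
    videos = (root / "videos.txt").read_text(encoding="utf-8").splitlines()

    # Drop blank lines (for example, one left by a trailing newline).
    prompts = [line.strip() for line in prompts if line.strip()]
    videos = [line.strip() for line in videos if line.strip()]

    if len(prompts) != len(videos):
        raise ValueError(
            f"{len(prompts)} prompts but {len(videos)} video paths; "
            "the two files must align line by line"
        )

    # Video paths are interpreted relative to the dataset root.
    missing = [v for v in videos if not (root / v).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing video files under {root}: {missing[:5]}")

    return len(prompts)
```

Calling `check_small_format("/dataset")` on the example layout would return the number of prompt/video pairs, or raise if the files are out of sync.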

<CAPTION_COLUMN>,<PATH_TO_VIDEO_COLUMN>
"""A black and white animated sequence featuring a rabbit, named Rabbity Ribfried, and an anthropomorphic goat in a musical, playful environment, showcasing their evolving interaction.""","""00000.mp4"""
"""A black and white animated sequence on a ship's deck features a bulldog character, named Bully Bulldoger, showcasing exaggerated facial expressions and body language. The character progresses from confident to focused, then to strained and distressed, displaying a range of emotions as it navigates challenges. The ship's interior remains static in the background, with minimalistic details such as a bell and open door. The character's dynamic movements and changing expressions drive the narrative, with no camera movement to distract from its evolving reactions and physical gestures.""","""00001.mp4"""
...

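In this case, `--instance_data_root` should be the location where the videos are stored, and `--dataset_name` should be either a path to a local folder or a `load_dataset`-compatible dataset hosted on the Hub. Assuming you have videos of Minecraft gameplay at https://huggingface.co/datasets/my-awesome-username/minecraft-videos, you would specify `my-awesome-username/minecraft-videos`.

When using this format, `--caption_column` must be `<CAPTION_COLUMN>` and `--video_column` must be `<PATH_TO_VIDEO_COLUMN>`.

You are not strictly limited to the CSV format. Any format works as long as the `load_dataset` method supports it and can load the basic caption and video-path columns. The reason for going through this dataset organization when loading video data is that `load_dataset` does not fully support all kinds of video formats.

[!NOTE] CogVideoX works best for video generation with long, descriptive, LLM-augmented prompts. We recommend pre-processing your videos by first generating a summary with a VLM and then augmenting the prompts with an LLM. To generate the captions above, we used MiniCPM-V-26 and Llama-3.1-8B-Instruct. A very barebones, no-frills example of this is available here. The officially recommended model for prompt augmentation is ChatGLM, and a length of 50-100 words is considered good.

[!NOTE] Your dataset is expected to be pre-processed. If it is not, some basic pre-processing can be done by adjusting the following parameters: `--height`, `--width`, `--fps`, `--max_num_frames`, `--skip_frames_start`, and `--skip_frames_end`. Currently, when the training batch size is greater than 1, all videos in the dataset must contain the same number of frames.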
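If you build the CSV programmatically, the standard-library `csv` module handles quoting of captions that contain commas. The sketch below is only an illustration: the helper name `write_metadata_csv` and the column names `caption` and `video` are assumptions, and whichever names you choose must match what you pass to `--caption_column` and `--video_column`.

```python
import csv
from pathlib import Path


def write_metadata_csv(rows, csv_path, caption_column="caption", video_column="video"):
    """Write (caption, relative_video_path) pairs to a CSV file with a header row."""
    with Path(csv_path).open("w", newline="", encoding="utf-8") as f:
        # Quote every field so commas inside captions do not split columns.
        writer = csv.writer(f, quoting=csv.QUOTE_ALL)
        writer.writerow([caption_column, video_column])
        writer.writerows(rows)


if __name__ == "__main__":
    write_metadata_csv(
        [("A black and white animated sequence featuring a rabbit.", "00000.mp4")],
        "metadata.csv",
    )
```

With these hypothetical column names, the training command would use `--caption_column caption --video_column video`.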

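Training

You need to set up your development environment by installing the necessary dependencies. The following packages are required:

PyTorch 2.0 or above, depending on the training features you are using (the latest or a nightly version may be needed for quantized/DeepSpeed training)
`pip install diffusers transformers accelerate peft huggingface_hub` for everything related to modeling and training.
`pip install datasets decord` for loading video training data.
`pip install bitsandbytes` for memory-optimized training with the 8-bit Adam or AdamW optimizer.
`pip install wandb` optionally, for monitoring training logs.
`pip install deepspeed` optionally, for DeepSpeed training.
`pip install prodigyopt` optionally, if you would like to train with the Prodigy optimizer.

To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date, as we update the example scripts frequently and install some example-specific dependencies. To do this, run the following steps in a new virtual environment:

Before running the script, make sure to install the library from source: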
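As a quick check before training, the snippet below reports which of the packages listed above cannot be imported in the current environment. It is a convenience sketch only; the split into required and optional groups is an assumption based on the list above (here `bitsandbytes` is treated as optional because it is only needed for the 8-bit optimizers).

```python
from importlib.util import find_spec

# Grouping is an assumption drawn from the package list above.
REQUIRED = ["torch", "diffusers", "transformers", "accelerate", "peft",
            "huggingface_hub", "datasets", "decord"]
OPTIONAL = ["bitsandbytes", "wandb", "deepspeed", "prodigyopt"]


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    print("missing required:", missing_packages(REQUIRED))
    print("missing optional:", missing_packages(OPTIONAL))
```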

git clone https://github.com/huggingface/diffusers
cd diffusers
pip install -e .

Then navigate to the example folder containing the training script and install the dependencies required by the script you are using:

cd examples/cogvideo
pip install -r requirements.txt

And initialize an 🤗 Accelerate environment with:

accelerate config

Or, for a default Accelerate configuration without answering questions about your environment, use:

accelerate config default

Or, if your environment doesn't support an interactive shell (for example, a notebook):

from accelerate.utils import write_basic_config
write_basic_config()

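When running `accelerate config`, enabling `torch.compile` can yield significant speedups. The PEFT library is used as the backend for LoRA training, so make sure `peft>=0.6.0` is installed in your environment.

If you would like to push your model to the Hub with a neat model card after training completes, make sure you are logged in: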

huggingface-cli login

# Alternatively, you could upload your model manually using:
# huggingface-cli upload my-cool-account-name/my-cool-lora-name /path/to/awesome/lora

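Make sure your data is prepared as described in Data Preparation. Once it is ready, you can begin training!

Assuming you are training on 50 videos of a similar concept, we have found 1500-2000 steps to work well. The official recommendation, however, is 100 videos with a total of 4000 steps. Assuming you are training on a single GPU with a `--train_batch_size` of `1`:

1500 steps on 50 videos corresponds to 30 training epochs.
4000 steps on 100 videos corresponds to 40 training epochs.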
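The step-to-epoch conversion above can be reproduced with a few lines of arithmetic. This is a rough sketch: the function name and the `num_gpus` parameter are illustrative, and the exact step count in a real run may differ slightly depending on how the dataloader rounds partial batches.

```python
import math


def epochs_for_steps(num_steps, num_videos, train_batch_size=1,
                     gradient_accumulation_steps=1, num_gpus=1):
    """Number of passes over the dataset needed to reach `num_steps` optimizer steps."""
    samples_per_step = train_batch_size * gradient_accumulation_steps * num_gpus
    steps_per_epoch = math.ceil(num_videos / samples_per_step)
    return math.ceil(num_steps / steps_per_epoch)


print(epochs_for_steps(1500, 50))   # 30
print(epochs_for_steps(4000, 100))  # 40
```

The result is the value you would pass as `--num_train_epochs`, as in the command below with `--num_train_epochs 30`.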

#!/bin/bash

GPU_IDS="0"

accelerate launch --gpu_ids $GPU_IDS examples/cogvideo/train_cogvideox_lora.py \
  --pretrained_model_name_or_path THUDM/CogVideoX-2b \
  --cache_dir <CACHE_DIR> \
  --instance_data_root <PATH_TO_WHERE_VIDEO_FILES_ARE_STORED> \
  --dataset_name my-awesome-name/my-awesome-dataset \
  --caption_column <CAPTION_COLUMN> \
  --video_column <PATH_TO_VIDEO_COLUMN> \
  --id_token <ID_TOKEN> \
  --validation_prompt "<ID_TOKEN> Spiderman swinging over buildings:::A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance" \
  --validation_prompt_separator ::: \
  --num_validation_videos 1 \
  --validation_epochs 10 \
  --seed 42 \
  --rank 64 \
  --lora_alpha 64 \
  --mixed_precision fp16 \
  --output_dir /raid/aryan/cogvideox-lora \
  --height 480 --width 720 --fps 8 --max_num_frames 49 --skip_frames_start 0 --skip_frames_end 0 \
  --train_batch_size 1 \
  --num_train_epochs 30 \
  --checkpointing_steps 1000 \
  --gradient_accumulation_steps 1 \
  --learning_rate 1e-3 \
  --lr_scheduler cosine_with_restarts \
  --lr_warmup_steps 200 \
  --lr_num_cycles 1 \
  --enable_slicing \
  --enable_tiling \
  --optimizer Adam \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --max_grad_norm 1.0 \
  --report_to wandb

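To better track our training experiments, we use the following flags in the command above:

`--report_to wandb` ensures the training runs are tracked on Weights and Biases. To use it, be sure to install `wandb` with `pip install wandb`.
`validation_prompt` and `validation_epochs` allow the script to perform a few validation inference runs. This lets us qualitatively check whether training is progressing as expected.

Setting `--id_token` is not required. From limited experimentation, we found that training with it works better than without (as it resembles Dreambooth training). When provided, the ID token is prepended to each prompt. So, if your ID token is "DISNEY" and your prompt is "Spiderman swinging over buildings", the effective prompt used in training is "DISNEY Spiderman swinging over buildings". When not provided, you either train without any additional token or can augment your dataset to apply the token where you wish before starting training.

[!NOTE] You can pass `--use_8bit_adam` to reduce the memory requirements of training.

[!IMPORTANT] The following settings were tested at the time CogVideoX LoRA training support was added:

Our testing was primarily done on CogVideoX-2b. We will work on CogVideoX-5b and CogVideoX-5b-I2V soon.

One dataset of 70 training videos with a resolution of 200 x 480 x 720 (frames x height x width). Using frame skipping in data pre-processing, we created two smaller datasets of 49 frames and 16 frames from it, both for faster experimentation and because the maximum recommended by the CogVideoX team is 49 frames. Out of the 70 videos, we created three subsets of 10, 25, and 50 videos. All videos were similar in the nature of the concept being trained.

25+ videos worked best for training new concepts and styles.

We found that it is better to train with an identifier token, which can be specified with `--id_token`. This is similar to Dreambooth-style training, but regular fine-tuning without such a token works too.

The trained concept seemed to work decently well when combined with completely unrelated prompts. We expect even better results if CogVideoX-5B is fine-tuned.

The original repository uses a `lora_alpha` of `1`. We found this unsuitable in many runs, possibly due to differences in the modeling backend and training setup. Our recommendation is to set `lora_alpha` to either `rank` or `rank // 2`.

If you are training on data whose captions produce poor results with the original model, a `rank` of 64 or above is good, and is also the recommendation of the CogVideoX team. If generations for your training captions are already moderately good, a `rank` of 16/32 should work. We found that setting the rank too low, for example `4`, is not ideal and does not produce promising results.

Both the CogVideoX authors and our own limited experiments recommend a total of 4000 training steps and 100 training videos for the best results. While that may yield the best results, we found that 2000 steps and 25 videos can also be sufficient.

When training with the Prodigy optimizer, you can follow the recommendations in this blog. Prodigy tends to overfit quickly. From my very limited testing, I found a learning rate of `0.5` to be suitable, together with `--prodigy_use_bias_correction`, `--prodigy_safeguard_warmup`, and `--prodigy_decouple`.

The learning rate recommended by the CogVideoX authors, and by our experiments with Adam/AdamW, is between `1e-3` and `1e-4` for a dataset of 25+ videos.

Note that our testing is not exhaustive due to limited time for exploration. Our recommendation is to experiment with the different knobs and dials to find the best settings for your data.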
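The effect of the ID token on a prompt can be illustrated with a small helper. This is a sketch of the behavior described above, not the training script's actual code; the function name `apply_id_token` is hypothetical.

```python
def apply_id_token(prompt, id_token=None):
    """Prepend the identifier token to a prompt, or return the prompt unchanged."""
    if not id_token:
        return prompt
    return f"{id_token} {prompt}"


print(apply_id_token("Spiderman swinging over buildings", "DISNEY"))
# DISNEY Spiderman swinging over buildings
```

The same helper could be applied to every line of a prompts file if you prefer to bake the token into the dataset before training.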

Inference

Once you have trained a LoRA model, inference can be done simply by loading the LoRA weights into the `CogVideoXPipeline`.

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
# pipe.load_lora_weights("/path/to/lora/weights", adapter_name="cogvideox-lora") # Or,
pipe.load_lora_weights("my-awesome-hf-username/my-awesome-lora-name", adapter_name="cogvideox-lora") # If loading from the HF Hub
pipe.to("cuda")

# Assuming lora_alpha=32 and rank=64 for training. If different, set accordingly
pipe.set_adapters(["cogvideox-lora"], [32 / 64])

prompt = "A vast, shimmering ocean flows gracefully under a twilight sky, its waves undulating in a mesmerizing dance of blues and greens. The surface glints with the last rays of the setting sun, casting golden highlights that ripple across the water. Seagulls soar above, their cries blending with the gentle roar of the waves. The horizon stretches infinitely, where the ocean meets the sky in a seamless blend of hues. Close-ups reveal the intricate patterns of the waves, capturing the fluidity and dynamic beauty of the sea in motion."
frames = pipe(prompt, guidance_scale=6, use_dynamic_cfg=True).frames[0]
export_to_video(frames, "output.mp4", fps=8)
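
The adapter weight passed to `set_adapters` in the example above is `lora_alpha / rank`, which follows the standard LoRA scaling convention. The helper below is a small sketch for computing that value from your own training settings; the function name is hypothetical.

```python
def lora_adapter_scale(lora_alpha, rank):
    """Scale applied to the LoRA update under the standard alpha / rank convention."""
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    return lora_alpha / rank


print(lora_adapter_scale(32, 64))  # 0.5
print(lora_adapter_scale(64, 64))  # 1.0
```

For a LoRA trained with `--rank 64 --lora_alpha 64`, as in the training command earlier, this gives a weight of `1.0`, so the call would be `pipe.set_adapters(["cogvideox-lora"], [lora_adapter_scale(64, 64)])`.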

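Reducing memory usage

When testing with the diffusers library, all optimizations included in the library were enabled. This setup has not been tested for actual memory usage on devices other than the **NVIDIA A100 / H100** architectures. In general, it can be adapted to all devices with the **NVIDIA Ampere architecture** and above. If the optimizations are disabled, memory consumption increases, with peak memory usage about 3 times the values in the table. However, speed increases by about 3-4 times. You can selectively disable some optimizations, including: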

pipe.enable_sequential_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

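For multi-GPU inference, the `enable_sequential_cpu_offload()` optimization needs to be disabled.
Using INT8 models significantly reduces inference speed. This trade-off accommodates GPUs with less memory while keeping the loss in video quality minimal.
The CogVideoX-2B model was trained in FP16 precision, and all CogVideoX-5B models were trained in BF16 precision. We recommend running inference in the precision the model was trained in.
PytorchAO and Optimum-quanto can be used to quantize the text encoder, transformer, and VAE modules to reduce the memory requirements of CogVideoX. This allows the model to run on a free T4 Colab or on GPUs with less memory! Also note that TorchAO quantization is fully compatible with `torch.compile`, which can significantly improve inference speed. FP8 precision must be used on NVIDIA H100 and above devices, and requires installing the `torch`, `torchao`, `diffusers`, and `accelerate` Python packages from source. CUDA 12.4 is recommended.
The inference speed tests also used the memory optimizations above. Without memory optimizations, inference speed increases by about 10%. Only the `diffusers` version of the model supports quantization.
The model only supports English input; other languages can be translated into English via large-model refinement.
Memory usage for model fine-tuning was tested in an `8 * H100` environment, and the program automatically uses the `Zero 2` optimization. If a specific number of GPUs is marked in the table, fine-tuning must use that number of GPUs or more.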

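| Attribute | CogVideoX-2B | CogVideoX-5B |
|---|---|---|
| Model name | CogVideoX-2B | CogVideoX-5B |
| Inference precision | FP16 (recommended), BF16, FP32, FP8, INT8; INT4 not supported | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported |
| Single-GPU inference VRAM | FP16: 12.5 GB using diffusers; INT8: 7.8 GB using diffusers and torchao | BF16: 20.7 GB using diffusers; INT8: 11.4 GB using diffusers and torchao |
| Multi-GPU inference VRAM | FP16: 10 GB* using diffusers | BF16: 15 GB* using diffusers |
| Inference speed | Single A100: ~90 seconds; single H100: ~45 seconds | Single A100: ~180 seconds; single H100: ~90 seconds |
| Fine-tuning precision | FP16 | BF16 |
| Fine-tuning VRAM consumption | 47 GB (bs=1, LoRA); 61 GB (bs=2, LoRA); 62 GB (bs=1, SFT) | 63 GB (bs=1, LoRA); 80 GB (bs=2, LoRA); 75 GB (bs=1, SFT) |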
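
The precision recommendation above (FP16 for CogVideoX-2B, BF16 for the CogVideoX-5B models) can be encoded as a small lookup so that the inference dtype follows the checkpoint. This is an illustrative sketch; the helper name and the matching on the model identifier are assumptions, not part of diffusers.

```python
# Training precision per model size, taken from the notes and table above.
RECOMMENDED_PRECISION = {"2b": "float16", "5b": "bfloat16"}


def recommended_dtype_name(model_id):
    """Return the torch dtype name matching the precision the checkpoint was trained in."""
    name = model_id.lower()
    for size, dtype_name in RECOMMENDED_PRECISION.items():
        if f"cogvideox-{size}" in name:
            return dtype_name
    raise ValueError(f"No precision recommendation known for {model_id!r}")


print(recommended_dtype_name("THUDM/CogVideoX-2b"))  # float16
```

With `torch` imported, the result can be turned into a dtype via `getattr(torch, recommended_dtype_name(model_id))` and passed as `torch_dtype` to `CogVideoXPipeline.from_pretrained`.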
