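🧪 This pipeline is for research purposes only.

Text-to-video generation

The ModelScope Text-to-Video Technical Report is by Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang.

The abstract from the paper is:

This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion). ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame generation and smooth movement transitions. The model can adapt to varying frame numbers during training and inference, making it suitable for both image-text and video-text datasets. ModelScopeT2V brings together three components (i.e., VQGAN, a text encoder, and a denoising UNet), comprising 1.7 billion parameters in total, of which 0.5 billion are dedicated to temporal capabilities. The model outperforms state-of-the-art methods across three evaluation metrics. The code and an online demo are available at https://modelscope.cn/models/damo/text-to-video-synthesis/summary.

You can find additional information about Text-to-Video on the project page and in the original codebase, and try it out in a demo. Official checkpoints can be found at damo-vilab and cerspense.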

Usage example

text-to-video-ms-1.7b

Let's start by generating a short video with the default length of 16 frames (2 seconds at 8 fps):

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe = pipe.to("cuda")

prompt = "Spiderman is surfing"
video_frames = pipe(prompt).frames[0]
video_path = export_to_video(video_frames)
video_path
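
The relationship between frame count and clip length used above (16 frames at 8 fps is 2 seconds) is simple arithmetic. The sketch below makes it explicit. The helper names `frames_for_duration` and `duration_seconds` are hypothetical and not part of Diffusers, and the 8 fps default is the playback rate assumed in the text.

```python
# Hypothetical helpers (not part of Diffusers) relating clip length to frame count.

def frames_for_duration(seconds: float, fps: int = 8) -> int:
    """Number of frames needed for a clip of `seconds` at `fps` frames per second."""
    return round(seconds * fps)


def duration_seconds(num_frames: int, fps: int = 8) -> float:
    """Length in seconds of a clip with `num_frames` frames played at `fps`."""
    return num_frames / fps


print(frames_for_duration(2))   # frames for the default 2-second clip
print(frames_for_duration(8))   # value to pass as num_frames for an 8-second clip
print(duration_seconds(24))     # length of a 24-frame clip
```

The result of `frames_for_duration` is the value you would pass as `num_frames` when calling the pipeline.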

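Diffusers supports different optimization techniques to improve the latency and memory footprint of a pipeline. Since videos are often more memory-heavy than images, we can enable CPU offloading and VAE slicing to keep the memory footprint under control.

Let's generate an 8-second video (64 frames) on the same GPU using CPU offloading and VAE slicing: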

import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.enable_model_cpu_offload()

# memory optimization
pipe.enable_vae_slicing()

prompt = "Darth Vader surfing a wave"
video_frames = pipe(prompt, num_frames=64).frames[0]
video_path = export_to_video(video_frames)
video_path

It takes just **7 GB of GPU memory** to generate the 64 video frames using PyTorch 2.0, "fp16" precision, and the techniques mentioned above.
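
To see why VAE slicing lowers peak memory, the pure-Python sketch below decodes a batch of latents either all at once or one slice at a time. It only illustrates the idea: `fake_decode` is a stand-in for a real VAE decoder, and the peak-batch counter is a rough proxy for activation memory, not a measurement.

```python
# Conceptual sketch of sliced decoding; not the Diffusers implementation.
from typing import Callable, List

peak_batch = 0  # largest number of latents handed to the decoder in one call


def fake_decode(latents: List[float]) -> List[float]:
    """Stand-in decoder: doubles each value and records the batch size it saw."""
    global peak_batch
    peak_batch = max(peak_batch, len(latents))
    return [2.0 * z for z in latents]


def decode_sliced(latents: List[float], decode: Callable, slice_size: int = 1) -> List[float]:
    """Decode `slice_size` latents at a time and concatenate the results."""
    out: List[float] = []
    for start in range(0, len(latents), slice_size):
        out.extend(decode(latents[start:start + slice_size]))
    return out


latents = [float(i) for i in range(64)]  # one toy latent per frame of a 64-frame clip

full = fake_decode(latents)              # everything in one call
peak_full = peak_batch

peak_batch = 0
sliced = decode_sliced(latents, fake_decode)  # one frame per call
peak_sliced = peak_batch

print(full == sliced, peak_full, peak_sliced)
```

Both paths produce the same frames. The sliced path never hands more than one latent to the decoder at a time, which is the trade the optimization makes: more calls in exchange for a smaller working set.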

We can also easily swap in a different scheduler, using the same method we would use for Stable Diffusion:

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "Spiderman is surfing"
video_frames = pipe(prompt, num_inference_steps=25).frames[0]
video_path = export_to_video(video_frames)
video_path

Here are some sample outputs:

An astronaut riding a horse.

Darth Vader surfing in waves.

cerspense/zeroscope_v2_576w & cerspense/zeroscope_v2_XL

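The Zeroscope models are watermark-free and have been trained on specific sizes such as `576x320` and `1024x576`. First generate a video using the lower-resolution checkpoint `cerspense/zeroscope_v2_576w` with TextToVideoSDPipeline; the result can then be upscaled using VideoToVideoSDPipeline and `cerspense/zeroscope_v2_XL`.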

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video
from PIL import Image

pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

# memory optimization
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()

prompt = "Darth Vader surfing a wave"
video_frames = pipe(prompt, num_frames=24).frames[0]
video_path = export_to_video(video_frames)
video_path

Now the video can be upscaled:

pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_XL", torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

# memory optimization
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()

video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]

video_frames = pipe(prompt, video=video, strength=0.6).frames[0]
video_path = export_to_video(video_frames)
video_path

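Here are some sample outputs:

Darth Vader surfing in waves.

Tips

Video generation is memory-intensive. One way to reduce memory usage is to call `enable_forward_chunking` on the pipeline's UNet so that the entire feed-forward layer is not run at once. Breaking it up into chunks processed in a loop is more memory-efficient.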
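
The idea behind forward chunking can be sketched without any deep-learning library. A feed-forward layer acts on each position independently, so applying it chunk by chunk gives the same result while only one chunk's worth of intermediate activations is alive at a time. The names below (`feed_forward`, `chunked_feed_forward`) are illustrative stand-ins, not Diffusers APIs.

```python
# Conceptual sketch of chunked feed-forward; stand-in functions, not Diffusers code.
from typing import List


def feed_forward(tokens: List[float]) -> List[float]:
    """Toy position-wise layer: expand to a 4x wider hidden state, apply ReLU, project back."""
    out = []
    for x in tokens:
        hidden = [max(0.0, x * w) for w in (1.0, -1.0, 0.5, 2.0)]  # 4x expansion + ReLU
        out.append(sum(hidden))                                     # projection back to 1 value
    return out


def chunked_feed_forward(tokens: List[float], chunk_size: int) -> List[float]:
    """Run feed_forward over `chunk_size` positions at a time and join the pieces."""
    out: List[float] = []
    for start in range(0, len(tokens), chunk_size):
        out.extend(feed_forward(tokens[start:start + chunk_size]))
    return out


tokens = [0.5, -1.0, 2.0, 3.0, -0.25, 1.5]
full_result = feed_forward(tokens)
chunked_result = chunked_feed_forward(tokens, chunk_size=1)
print(full_result == chunked_result)
```

Smaller chunks lower the peak size of the expanded hidden state at the cost of more loop iterations, so chunking trades some speed for memory.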

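Check out the Text or image-to-video guide for more details on how certain parameters affect video generation and how to optimize inference by reducing memory usage.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.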

TextToVideoSDPipeline

class diffusers.TextToVideoSDPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet3DConditionModel scheduler: KarrasDiffusionSchedulers )

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_frames: int = 16 num_inference_steps: int = 50 guidance_scale: float = 9.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'np' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None ) → TextToVideoSDPipelineOutput or tuple

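Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide video generation. If not defined, you need to pass prompt_embeds.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated video.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated video.
num_frames (int, optional, defaults to 16) — The number of video frames to generate. The default of 16 frames at 8 frames per second amounts to 2 seconds of video.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher-quality video at the expense of slower inference.
guidance_scale (float, optional, defaults to 9.0) — A higher guidance scale value encourages the model to generate video closely linked to the text prompt, at the expense of lower visual quality. Guidance is enabled when guidance_scale > 1.
negative_prompt (str or List[str], optional) — The prompt or prompts describing what to exclude from the generated video. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt. This parameter does not appear in the signature shown above.
eta (float, optional, defaults to 0.0) — Corresponds to the parameter eta (η) from the DDIM paper. Only applies to DDIMScheduler and is ignored by other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator used to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator. Latents should have shape (batch_size, num_channel, num_frames, height, width).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
output_type (str, optional, defaults to "np") — The output format of the generated video. Choose between torch.Tensor or np.array.
return_dict (bool, optional, defaults to True) — Whether to return a TextToVideoSDPipelineOutput instead of a plain tuple.
callback (Callable, optional) — A function called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
clip_skip (int, optional) — Number of layers to skip from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.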
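
The effect of `guidance_scale` can be illustrated with the usual classifier-free guidance combination, in which the prediction moves from the unconditional output toward the text-conditioned one. This is a minimal sketch on plain Python lists, assuming that standard formula. `apply_guidance` is a hypothetical helper, not a Diffusers function.

```python
# Minimal sketch of classifier-free guidance on plain lists (hypothetical helper).
from typing import List


def apply_guidance(uncond: List[float], text: List[float], guidance_scale: float) -> List[float]:
    """Return uncond + guidance_scale * (text - uncond), element by element."""
    return [u + guidance_scale * (t - u) for u, t in zip(uncond, text)]


uncond = [0.0, 1.0, 2.0]  # toy unconditional noise prediction
text = [1.0, 1.0, 0.0]    # toy text-conditioned noise prediction

# A scale of 1 reproduces the text-conditioned prediction; larger values extrapolate past it.
print(apply_guidance(uncond, text, 1.0))
print(apply_guidance(uncond, text, 9.0))
```

With a scale of 1 the result equals the text-conditioned prediction, so the unconditional branch adds nothing. This is why guidance is described as enabled only above 1.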

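Returns: TextToVideoSDPipelineOutput or tuple

If return_dict is True, a TextToVideoSDPipelineOutput is returned; otherwise a tuple is returned whose first element is a list with the generated frames.

The call function to the pipeline for generation.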
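
The expected shape of `latents` follows from the requested video size. The sketch below computes it under two assumptions that are not stated on this page: a VAE scale factor of 8 and 4 latent channels, as in Stable Diffusion-style autoencoders. `latent_shape` is a hypothetical helper.

```python
# Hypothetical helper; vae_scale_factor=8 and num_channels=4 are assumptions
# typical of Stable Diffusion-style VAEs, not values taken from this page.
from typing import Tuple


def latent_shape(
    batch_size: int,
    num_frames: int,
    height: int,
    width: int,
    num_channels: int = 4,
    vae_scale_factor: int = 8,
) -> Tuple[int, int, int, int, int]:
    """Shape (batch_size, num_channel, num_frames, height, width) in latent space."""
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError("height and width must be divisible by the VAE scale factor")
    return (
        batch_size,
        num_channels,
        num_frames,
        height // vae_scale_factor,
        width // vae_scale_factor,
    )


# A 24-frame clip at 576x320 (width x height), the lower Zeroscope resolution.
print(latent_shape(batch_size=1, num_frames=24, height=320, width=576))
```

Note that the frame axis comes before the spatial axes, so doubling `num_frames` doubles the latent tensor size.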

Examples:

>>> import torch
>>> from diffusers import TextToVideoSDPipeline
>>> from diffusers.utils import export_to_video

>>> pipe = TextToVideoSDPipeline.from_pretrained(
...     "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16"
... )
>>> pipe.enable_model_cpu_offload()

>>> prompt = "Spiderman is surfing"
>>> video_frames = pipe(prompt).frames[0]
>>> video_path = export_to_video(video_frames)
>>> video_path

encode_prompt

< source >

( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

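Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
device (torch.device) — The torch device.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance.
negative_prompt (str or List[str], optional) — The prompt or prompts describing what should not guide generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — Number of layers to skip from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.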
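
The `clip_skip` convention can be read as an index into the list of hidden states returned by the text encoder. The sketch below works under that assumption and uses strings in place of tensors. `select_hidden_state` is a hypothetical helper, not part of the pipeline.

```python
# Hypothetical helper illustrating the clip_skip convention with labels instead of tensors.
from typing import List, Optional


def select_hidden_state(hidden_states: List[str], clip_skip: Optional[int] = None) -> str:
    """Pick the encoder output used for prompt embeddings.

    hidden_states[-1] is the final layer's output; clip_skip=1 selects the
    pre-final layer, clip_skip=2 the one before that, and so on.
    """
    if clip_skip is None:
        return hidden_states[-1]
    return hidden_states[-(clip_skip + 1)]


layers = ["layer_10", "layer_11", "layer_12"]  # last three encoder outputs, in order
print(select_hidden_state(layers))               # final layer
print(select_hidden_state(layers, clip_skip=1))  # pre-final layer
```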

VideoToVideoSDPipeline

class diffusers.VideoToVideoSDPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet3DConditionModel scheduler: KarrasDiffusionSchedulers )

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None video: typing.Union[typing.List[numpy.ndarray], torch.Tensor] = None strength: float = 0.6 num_inference_steps: int = 50 guidance_scale: float = 15.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'np' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None ) → TextToVideoSDPipelineOutput or tuple

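Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide video generation. If not defined, you need to pass prompt_embeds.
video (List[np.ndarray] or torch.Tensor) — Video frames or a tensor representing a video batch, used as the starting point for the process. Video latents can also be passed directly as video, in which case they are not encoded again.
strength (float, optional, defaults to 0.6) — Indicates the extent to which the reference video is transformed. Must be between 0 and 1. video is used as a starting point, and the larger the strength, the more noise is added to it. The number of denoising steps depends on the amount of noise initially added. When strength is 1, the added noise is at its maximum and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 essentially ignores video.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher-quality video at the expense of slower inference.
guidance_scale (float, optional, defaults to 15.0) — A higher guidance scale value encourages the model to generate video closely linked to the text prompt, at the expense of lower visual quality. Guidance is enabled when guidance_scale > 1.
negative_prompt (str or List[str], optional) — The prompt or prompts describing what to exclude from the generated video. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
eta (float, optional, defaults to 0.0) — Corresponds to the parameter eta (η) from the DDIM paper. Only applies to DDIMScheduler and is ignored by other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator used to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator. Latents should have shape (batch_size, num_channel, num_frames, height, width).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
output_type (str, optional, defaults to "np") — The output format of the generated video. Choose between torch.Tensor or np.array.
return_dict (bool, optional, defaults to True) — Whether to return a TextToVideoSDPipelineOutput instead of a plain tuple.
callback (Callable, optional) — A function called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
clip_skip (int, optional) — Number of layers to skip from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.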
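
How `strength` translates into the number of denoising steps actually run can be sketched as follows. The rule used here (scale `num_inference_steps` by `strength` and skip the remainder) is the common image-to-image convention. It is assumed for illustration rather than quoted from the pipeline source, and `effective_steps` is a hypothetical helper.

```python
# Hypothetical helper; assumes the common img2img rule for mapping strength to steps.

def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Number of denoising iterations run for a given strength in [0, 1]."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    init_steps = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = max(num_inference_steps - init_steps, 0)  # steps skipped at the start
    return num_inference_steps - t_start


for s in (0.25, 0.5, 1.0):
    print(s, effective_steps(50, s))
```

Under this rule a lower strength both preserves more of the input video and shortens the run, since fewer iterations are needed to remove the smaller amount of added noise.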

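Returns: TextToVideoSDPipelineOutput or tuple

If return_dict is True, a TextToVideoSDPipelineOutput is returned; otherwise a tuple is returned whose first element is a list with the generated frames.

The call function to the pipeline for generation.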
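
The interaction between `callback` and `callback_steps` can be sketched with a stand-in loop. This is not the pipeline's denoising loop. `run_with_callback` and the decreasing toy timesteps are assumptions made for illustration.

```python
# Stand-in loop showing when a callback fires; not the actual pipeline loop.
from typing import Callable, List, Optional, Tuple


def run_with_callback(
    num_inference_steps: int,
    callback: Optional[Callable[[int, int, list], None]] = None,
    callback_steps: int = 1,
) -> None:
    """Call `callback(step, timestep, latents)` every `callback_steps` iterations."""
    latents: list = []  # placeholder for the latent tensor
    timesteps = range(num_inference_steps - 1, -1, -1)  # toy, decreasing timesteps
    for step, timestep in enumerate(timesteps):
        if callback is not None and step % callback_steps == 0:
            callback(step, timestep, latents)


seen: List[Tuple[int, int]] = []
run_with_callback(10, callback=lambda step, t, latents: seen.append((step, t)), callback_steps=4)
print(seen)
```

A callback of this shape is a convenient place to log progress or inspect intermediate latents without modifying the pipeline itself.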

Examples:

>>> import torch
>>> from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
>>> from diffusers.utils import export_to_video
>>> from PIL import Image

>>> pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
>>> pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
>>> pipe.to("cuda")

>>> prompt = "spiderman running in the desert"
>>> video_frames = pipe(prompt, num_inference_steps=40, height=320, width=576, num_frames=24).frames[0]
>>> # save low-res video
>>> video_path = export_to_video(video_frames, output_video_path="./video_576_spiderman.mp4")

>>> # let's offload the text-to-video model
>>> pipe.to("cpu")

>>> # and load the video-to-video model
>>> pipe = DiffusionPipeline.from_pretrained(
...     "cerspense/zeroscope_v2_XL", torch_dtype=torch.float16, revision="refs/pr/15"
... )
>>> pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
>>> pipe.enable_model_cpu_offload()

>>> # The VAE consumes A LOT of memory, let's make sure we run it in sliced mode
>>> pipe.vae.enable_slicing()

>>> # now let's upscale it
>>> video = [Image.fromarray(frame).resize((1024, 576)) for frame in video_frames]

>>> # and denoise it
>>> video_frames = pipe(prompt, video=video, strength=0.6).frames[0]
>>> video_path = export_to_video(video_frames, output_video_path="./video_1024_spiderman.mp4")
>>> video_path

encode_prompt

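< source >

Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
device (torch.device) — The torch device.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance.
negative_prompt (str or List[str], optional) — The prompt or prompts describing what should not guide generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — Number of layers to skip from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.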

TextToVideoSDPipelineOutput

class diffusers.pipelines.text_to_video_synthesis.TextToVideoSDPipelineOutput

< source >

( frames: typing.Union[torch.Tensor, numpy.ndarray, typing.List[typing.List[PIL.Image.Image]]] )

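Parameters

frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs. It can be a nested list of length batch_size, with each sub-list containing a denoised PIL image sequence of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).

Output class for text-to-video pipelines.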
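
The examples on this page read the result as `.frames[0]`. The sketch below shows why, using a hypothetical stand-in class (`FakeOutput`) with the nested-list layout of `frames`: the outer index selects a video in the batch, the inner one a frame.

```python
# Hypothetical stand-in for the output class; strings replace real images.
from dataclasses import dataclass
from typing import List


@dataclass
class FakeOutput:
    frames: List[List[str]]  # frames[batch_index][frame_index]


batch_size, num_frames = 2, 16
output = FakeOutput(
    frames=[[f"video{b}_frame{f}" for f in range(num_frames)] for b in range(batch_size)]
)

video_frames = output.frames[0]  # all frames of the first video in the batch
print(len(output.frames), len(video_frames), video_frames[0])
```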
