Allegro

Allegro: Open the Black Box of Commercial-Level Video Generation Model, from RhymesAI, by Yuan Zhou, Qiuyue Wang, Yuxuan Cai, Huan Yang.

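The abstract from the paper is:

Significant advancements have been made in the field of video generation, with the open-source community contributing a wealth of research papers and tools for training high-quality models. However, despite these efforts, the available information and resources remain insufficient for achieving commercial-level performance. In this report, we open the black box and introduce Allegro, an advanced video generation model that excels in both quality and temporal consistency. We also highlight the current limitations in the field and present a comprehensive methodology for training high-performance, commercial-level video generation models, addressing key aspects such as data, model architecture, training pipeline, and evaluation. Our user study shows that Allegro surpasses existing open-source models and most commercial models, ranking just behind Hailuo and Kling. Code: https://github.com/rhymes-ai/Allegro, Model: https://huggingface.co/rhymes-ai/Allegro, Gallery: https://rhymes.ai/allegro_gallery.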

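Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.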

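Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower-precision data type. However, its impact on video quality may vary from one video model to another.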
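As a rough illustration of why a lower-precision data type reduces memory, weight storage scales linearly with the number of bytes per parameter. The sketch below is back-of-the-envelope arithmetic only: the parameter count is a hypothetical placeholder rather than Allegro's actual size, and real memory usage also includes activations and quantization metadata.

```python
# Back-of-the-envelope estimate of weight storage (illustrative only).
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the approximate size of the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Hypothetical parameter count, chosen only for the example.
hypothetical_params = 3_000_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gib(hypothetical_params, dtype):.2f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which is the effect that 8-bit loading aims for relative to float16.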

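Refer to the Quantization overview to learn more about supported quantization backends and how to select one that fits your use case. The example below demonstrates how to load a quantized AllegroPipeline for inference with bitsandbytes.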

import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, AllegroTransformer3DModel, AllegroPipeline
from diffusers.utils import export_to_video
from transformers import BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "rhymes-ai/Allegro",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = AllegroTransformer3DModel.from_pretrained(
    "rhymes-ai/Allegro",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = AllegroPipeline.from_pretrained(
    "rhymes-ai/Allegro",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = (
    "A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, "
    "the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this "
    "location might be a popular spot for docking fishing boats."
)
video = pipeline(prompt, guidance_scale=7.5, max_sequence_length=512).frames[0]
export_to_video(video, "harbor.mp4", fps=15)

AllegroPipeline

class diffusers.AllegroPipeline

< source >

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel vae: AutoencoderKLAllegro transformer: AllegroTransformer3DModel scheduler: KarrasDiffusionSchedulers )

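Parameters

vae (AutoencoderKLAllegro) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
text_encoder (T5EncoderModel) — Frozen text encoder. PixArt-Alpha uses T5, specifically the t5-v1_1-xxl variant.
tokenizer (T5Tokenizer) — Tokenizer of class T5Tokenizer.
transformer (AllegroTransformer3DModel) — A text-conditioned AllegroTransformer3DModel to denoise the encoded video latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with transformer to denoise the encoded video latents.

Pipeline for text-to-video generation using Allegro.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).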

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: str = '' num_inference_steps: int = 100 timesteps: typing.List[int] = None guidance_scale: float = 7.5 num_frames: typing.Optional[int] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_videos_per_prompt: int = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] clean_caption: bool = True max_sequence_length: int = 512 ) → AllegroPipelineOutput or tuple

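Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, prompt_embeds must be passed instead.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (that is, when guidance_scale is less than 1).
num_inference_steps (int, optional, defaults to 100) — The number of denoising steps. More denoising steps usually lead to a higher-quality video at the expense of slower inference.
timesteps (List[int], optional) — Custom timesteps to use for the denoising process. If not defined, num_inference_steps equally spaced timesteps are used. Must be in descending order.
guidance_scale (float, optional, defaults to 7.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate videos closely linked to the text prompt, usually at the expense of lower video quality.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
num_frames (int, optional, defaults to 88) — Controls the number of frames in the generated video.
height (int, optional, defaults to self.unet.config.sample_size) — The height in pixels of the generated video.
width (int, optional, defaults to self.unet.config.sample_size) — The width in pixels of the generated video.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://huggingface.co/papers/2010.02502. Only applies to schedulers.DDIMScheduler and is ignored for other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — One or more torch generators used to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, for example with prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for the text embeddings.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. For PixArt-Sigma, this negative prompt should be the empty string. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
negative_prompt_attention_mask (torch.Tensor, optional) — Pre-generated attention mask for the negative text embeddings.
output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between PIL (PIL.Image.Image) and np.array.
return_dict (bool, optional, defaults to True) — Whether to return an AllegroPipelineOutput instead of a plain tuple.
callback (Callable, optional) — A function that is called every callback_steps steps during inference. It is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
clean_caption (bool, optional, defaults to True) — Whether to clean the caption before creating embeddings. Requires beautifulsoup4 and ftfy to be installed. If the dependencies are not installed, the embeddings are created from the raw prompt.
max_sequence_length (int, defaults to 512) — Maximum sequence length to use with the prompt.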
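The guidance_scale entry refers to classifier-free guidance. A minimal sketch of the usual combination step follows, assuming the common formulation in which the unconditional prediction is pushed toward the text-conditioned one by a factor w. The function name and the toy values are illustrative and are not taken from the pipeline's source.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and text-conditioned predictions element-wise.

    Uses the common classifier-free guidance form:
        uncond + w * (text - uncond)
    """
    return [
        u + guidance_scale * (t - u)
        for u, t in zip(noise_uncond, noise_text)
    ]


# Toy predictions standing in for real model outputs.
uncond = [0.0, 1.0, -1.0]
text = [1.0, 1.0, 0.0]

print(apply_guidance(uncond, text, 1.0))  # reduces to the text-conditioned prediction
print(apply_guidance(uncond, text, 7.5))  # extrapolates beyond it
```

With w = 1 the result is just the text-conditioned prediction, which is consistent with guidance only taking effect for guidance_scale > 1.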

Returns

AllegroPipelineOutput or tuple

If return_dict is True, an AllegroPipelineOutput is returned; otherwise a tuple is returned whose first element is a list with the generated videos.
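Code that has to work with either return style can normalize the result first. The helper below is a hypothetical convenience function, not part of Diffusers, and it is exercised here with stand-in objects rather than a real pipeline call.

```python
from types import SimpleNamespace


def first_video(result):
    """Return the first generated video from either return style."""
    if isinstance(result, tuple):
        # return_dict=False: the first tuple element is the list of videos.
        videos = result[0]
    else:
        # return_dict=True: the output object exposes a `frames` attribute.
        videos = result.frames
    return videos[0]


# Stand-ins for real pipeline results; strings replace actual frames.
fake_videos = [["frame0", "frame1"], ["frame0", "frame1"]]
as_object = SimpleNamespace(frames=fake_videos)
as_tuple = (fake_videos,)

print(first_video(as_object) == first_video(as_tuple))
```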

Function invoked when calling the pipeline for generation.

Examples:

>>> import torch
>>> from diffusers import AutoencoderKLAllegro, AllegroPipeline
>>> from diffusers.utils import export_to_video

>>> vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro", subfolder="vae", torch_dtype=torch.float32)
>>> pipe = AllegroPipeline.from_pretrained("rhymes-ai/Allegro", vae=vae, torch_dtype=torch.bfloat16).to("cuda")
>>> pipe.enable_vae_tiling()

>>> prompt = (
...     "A seaside harbor with bright sunlight and sparkling seawater, with many boats in the water. From an aerial view, "
...     "the boats vary in size and color, some moving and some stationary. Fishing boats in the water suggest that this "
...     "location might be a popular spot for docking fishing boats."
... )
>>> video = pipe(prompt, guidance_scale=7.5, max_sequence_length=512).frames[0]
>>> export_to_video(video, "output.mp4", fps=15)

disable_vae_slicing

< source >

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method goes back to computing decoding in one step.

disable_vae_tiling

< source >

( )

Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method goes back to computing decoding in one step.

enable_vae_slicing

< source >

( )

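Enable sliced VAE decoding. When this option is enabled, the VAE splits the input tensor into slices and computes decoding in several steps. This is useful to save some memory and allow larger batch sizes.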
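Conceptually, sliced decoding processes a batch one slice at a time instead of all at once, trading a little speed for a lower peak memory footprint. The sketch below illustrates that idea with a toy decode function operating on plain lists. It is an assumption-laden illustration, not the library's implementation.

```python
def decode(latent_batch):
    """Toy stand-in for a VAE decoder: doubles every value."""
    return [[2 * value for value in latent] for latent in latent_batch]


def decode_sliced(latent_batch):
    """Decode one batch element at a time and concatenate the results."""
    decoded = []
    for latent in latent_batch:
        # Only a single-element batch is passed to the decoder at a time.
        decoded.extend(decode([latent]))
    return decoded


batch = [[1, 2], [3, 4], [5, 6]]
# Both paths yield the same output; only the work done per call differs.
print(decode_sliced(batch) == decode(batch))
```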

enable_vae_tiling

< source >

( )

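Enable tiled VAE decoding. When this option is enabled, the VAE splits the input tensor into tiles and computes decoding and encoding in several steps. This is useful for saving a large amount of memory and for processing larger images.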
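To give a feel for what splitting into tiles involves, the sketch below computes tile start offsets that cover a spatial extent. The tile size, stride, and latent dimensions are hypothetical values, and a real tiled decoder would typically also blend overlapping regions, which this sketch does not attempt.

```python
def tile_starts(extent: int, tile_size: int, stride: int) -> list:
    """Return start offsets of tiles that together cover [0, extent).

    Assumes 0 < stride <= tile_size so that neighbouring tiles touch or overlap.
    """
    if extent <= tile_size:
        return [0]
    starts = list(range(0, extent - tile_size, stride))
    # Add a final tile aligned to the end so the whole extent is covered.
    starts.append(extent - tile_size)
    return starts


# Hypothetical latent height and width, for illustration only.
height, width = 10, 8
tiles = [(y, x) for y in tile_starts(height, 4, 3) for x in tile_starts(width, 4, 3)]

print(tile_starts(height, 4, 3))
print(len(tiles))
```

Each (y, x) pair marks the top-left corner of one tile, so only one tile-sized region needs to be processed at a time.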

encode_prompt

< source >

( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True negative_prompt: str = '' num_videos_per_prompt: int = 1 device: typing.Optional[torch.device] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None clean_caption: bool = False max_sequence_length: int = 512 **kwargs )

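Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
negative_prompt (str or List[str], optional) — The prompt not to guide the video generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (that is, when guidance_scale is less than 1). For PixArt-Alpha, this should be "".
do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos that should be generated per prompt.
device (torch.device, optional) — The torch device on which to place the resulting embeddings.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, for example with prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. For PixArt-Alpha, this should be the embedding of the "" string.
clean_caption (bool, defaults to False) — If True, the function preprocesses and cleans the provided caption before encoding.
max_sequence_length (int, defaults to 512) — Maximum sequence length to use for the prompt.

Encodes the prompt into text encoder hidden states.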
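The relationship between max_sequence_length and an attention mask can be sketched without a tokenizer. The helper below is hypothetical: it truncates or pads a list of token ids to a fixed length and marks real tokens with 1 and padding with 0. The token ids and the padding id of 0 are assumptions made for the example; the pipeline itself relies on T5Tokenizer for this step.

```python
def pad_to_max_length(token_ids, max_sequence_length=512, pad_id=0):
    """Truncate or pad token ids and build the matching attention mask."""
    kept = list(token_ids[:max_sequence_length])
    mask = [1] * len(kept)
    padding = max_sequence_length - len(kept)
    return kept + [pad_id] * padding, mask + [0] * padding


# Hypothetical token ids; a real tokenizer would produce these.
ids, attention_mask = pad_to_max_length([37, 912, 4, 1], max_sequence_length=8)
print(ids)
print(attention_mask)
```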

AllegroPipelineOutput

class diffusers.pipelines.allegro.pipeline_output.AllegroPipelineOutput

< source >

( frames: typing.Union[torch.Tensor, numpy.ndarray, typing.List[typing.List[PIL.Image.Image]]] )

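Parameters

frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs. It can be a nested list of length batch_size, with each sub-list containing a denoised sequence of PIL images of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).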
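To make the nested-list layout concrete, the sketch below builds a stand-in for frames and reads off the batch size, the frame count, and the clip duration. Strings stand in for PIL images, and the batch size of 2 is arbitrary; the 88-frame count and 15 fps mirror the documented default and the export example on this page.

```python
batch_size, num_frames, fps = 2, 88, 15

# Nested list: one sub-list of frames per generated video.
frames = [
    [f"video{b}_frame{i}" for i in range(num_frames)]
    for b in range(batch_size)
]

first = frames[0]                    # what `.frames[0]` selects
duration_seconds = len(first) / fps  # clip length when exported at `fps`

print(len(frames), len(first), round(duration_seconds, 2))
```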

Output class for Allegro pipelines.
