EasyAnimate

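EasyAnimate by Alibaba PAI.

The description from its GitHub page: EasyAnimate is a pipeline based on the transformer architecture, designed for generating AI images and videos, and for training baseline models and Lora models for Diffusion Transformer. We support direct prediction from pre-trained EasyAnimate models, allowing for the generation of videos with various resolutions, approximately 6 seconds in length, at 8fps (EasyAnimateV5.1, 1 to 49 frames). Additionally, users can train their own baseline and Lora models for specific style transformations.

This pipeline was contributed by bubbliiiing. The original codebase can be found here. The original weights can be found under hf.co/alibaba-pai.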

There are two official EasyAnimate checkpoints for text-to-video and video-to-video.

Checkpoint	Recommended inference dtype
`alibaba-pai/EasyAnimateV5.1-12b-zh`	torch.float16
`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`	torch.float16

There is one official EasyAnimate checkpoint available for image-to-video and video-to-video.

Checkpoint	Recommended inference dtype
`alibaba-pai/EasyAnimateV5.1-12b-zh-InP`	torch.float16

There are two official EasyAnimate checkpoints available for control-to-video.

Checkpoint	Recommended inference dtype
`alibaba-pai/EasyAnimateV5.1-12b-zh-Control`	torch.float16
`alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera`	torch.float16

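For the EasyAnimateV5.1 series:

Text-to-video (T2V) and image-to-video (I2V) work at multiple resolutions. The width and height can each range from 256 to 1024.
Both T2V and I2V models support generating 1 to 49 frames and work best within this range. Exporting videos at 8 FPS is recommended.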
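
The limits above can be checked before a long generation run is started. The following is a minimal sketch, not part of the Diffusers API: the helper names `check_generation_settings` and `clip_duration_seconds` are hypothetical, and the checks cover only the ranges stated on this page (the pipeline itself may enforce further constraints).

```python
# Limits quoted on this page for the EasyAnimateV5.1 series.
MIN_SIDE, MAX_SIDE = 256, 1024
MIN_FRAMES, MAX_FRAMES = 1, 49
EXPORT_FPS = 8


def check_generation_settings(height, width, num_frames):
    """Raise ValueError if a setting falls outside the documented ranges."""
    for name, value in (("height", height), ("width", width)):
        if not MIN_SIDE <= value <= MAX_SIDE:
            raise ValueError(f"{name}={value} is outside {MIN_SIDE}..{MAX_SIDE}")
    if not MIN_FRAMES <= num_frames <= MAX_FRAMES:
        raise ValueError(
            f"num_frames={num_frames} is outside {MIN_FRAMES}..{MAX_FRAMES}"
        )


def clip_duration_seconds(num_frames, fps=EXPORT_FPS):
    """Playback length of a clip exported at the given frame rate."""
    return num_frames / fps


check_generation_settings(height=512, width=512, num_frames=49)
print(clip_duration_seconds(49))  # 6.125
```

At 8 FPS, the maximum of 49 frames plays for 6.125 seconds, which matches the "approximately 6 seconds" quoted in the description.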

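Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower-precision data type. However, the effect of quantization on video quality can vary from one video model to another.

Refer to the Quantization overview to learn more about the supported quantization backends and how to select one that fits your use case. The example below demonstrates how to load a quantized EasyAnimatePipeline for inference with bitsandbytes.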

import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, EasyAnimateTransformer3DModel, EasyAnimatePipeline
from diffusers.utils import export_to_video

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = EasyAnimateTransformer3DModel.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = EasyAnimatePipeline.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh",
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A cat walks on the grass, realistic style."
negative_prompt = "bad detailed"
video = pipeline(prompt=prompt, negative_prompt=negative_prompt, num_frames=49, num_inference_steps=30).frames[0]
export_to_video(video, "cat.mp4", fps=8)
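
To put the memory savings from quantization in perspective, the weight footprint can be estimated from the parameter count and the bytes stored per weight. This is a rough sketch only: it assumes that "12b" in the checkpoint name means about 12 billion parameters, and it ignores activations, the other pipeline components, and any bookkeeping overhead added by the quantization backend. The helper `weight_memory_gib` is hypothetical.

```python
# Assumption: "12b" in the checkpoint name means roughly 12 billion parameters.
ASSUMED_PARAMS = 12_000_000_000

# Bytes needed to store one weight in each data type.
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype):
    """Memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 2**30


for dtype in BYTES_PER_WEIGHT:
    print(f"{dtype}: {weight_memory_gib(ASSUMED_PARAMS, dtype):.1f} GiB")
```

Under these assumptions, 8-bit storage halves the weight memory relative to torch.float16, which is the dtype recommended in the tables above.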

EasyAnimatePipeline

class diffusers.EasyAnimatePipeline

( vae: AutoencoderKLMagvit text_encoder: typing.Union[transformers.models.qwen2_vl.modeling_qwen2_vl.Qwen2VLForConditionalGeneration, transformers.models.bert.modeling_bert.BertModel] tokenizer: typing.Union[transformers.models.qwen2.tokenization_qwen2.Qwen2Tokenizer, transformers.models.bert.tokenization_bert.BertTokenizer] transformer: EasyAnimateTransformer3DModel scheduler: FlowMatchEulerDiscreteScheduler )

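Parameters

vae (AutoencoderKLMagvit) — Variational Auto-Encoder (VAE) model used to encode videos to latent representations and decode them back.
text_encoder (Optional[~transformers.Qwen2VLForConditionalGeneration, ~transformers.BertModel]) — EasyAnimate uses qwen2 vl in V5.1.
tokenizer (Optional[~transformers.Qwen2Tokenizer, ~transformers.BertTokenizer]) — A Qwen2Tokenizer or BertTokenizer used to tokenize text.
transformer (EasyAnimateTransformer3DModel) — The EasyAnimate model designed by the EasyAnimate team.
scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler used in combination with EasyAnimate to denoise the encoded image latents.

Pipeline for text-to-video generation using EasyAnimate.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).

EasyAnimate uses one text encoder, qwen2 vl, in V5.1.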

__call__

( prompt: typing.Union[str, typing.List[str]] = None num_frames: typing.Optional[int] = 49 height: typing.Optional[int] = 512 width: typing.Optional[int] = 512 num_inference_steps: typing.Optional[int] = 50 guidance_scale: typing.Optional[float] = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: typing.Optional[float] = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None timesteps: typing.Optional[typing.List[int]] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] guidance_rescale: float = 0.0 ) → StableDiffusionPipelineOutput or tuple

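Returns

StableDiffusionPipelineOutput or tuple

If return_dict is True, StableDiffusionPipelineOutput is returned; otherwise a tuple is returned, where the first element is a list of the generated images and the second element is a list of bools indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content. Note that the examples on this page read the result through the frames attribute, which is the field documented under EasyAnimatePipelineOutput, so the returned object appears to behave like an EasyAnimatePipelineOutput in practice.

Generates images or video using the EasyAnimate pipeline based on the provided prompts.

Examples: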

>>> import torch
>>> from diffusers import EasyAnimatePipeline
>>> from diffusers.utils import export_to_video

>>> # Models: "alibaba-pai/EasyAnimateV5.1-12b-zh"
>>> pipe = EasyAnimatePipeline.from_pretrained(
...     "alibaba-pai/EasyAnimateV5.1-7b-zh-diffusers", torch_dtype=torch.float16
... ).to("cuda")
>>> prompt = (
...     "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
...     "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
...     "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
...     "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
...     "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
...     "atmosphere of this unique musical performance."
... )
>>> sample_size = (512, 512)
>>> video = pipe(
...     prompt=prompt,
...     guidance_scale=6,
...     negative_prompt="bad detailed",
...     height=sample_size[0],
...     width=sample_size[1],
...     num_inference_steps=50,
... ).frames[0]
>>> export_to_video(video, "output.mp4", fps=8)

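prompt (str or List[str], optional): Text prompts to guide the image or video generation. If not provided, use prompt_embeds instead.
num_frames (int, optional): Length of the generated video, in frames.
height (int, optional): Height of the generated image in pixels.
width (int, optional): Width of the generated image in pixels.
num_inference_steps (int, optional, defaults to 50): Number of denoising steps during generation. More steps generally yield higher-quality images but slow down inference.
guidance_scale (float, optional, defaults to 5.0): Encourages the model to align outputs with the prompts. A higher value may decrease image quality.
negative_prompt (str or List[str], optional): Prompts indicating what to exclude from the generation. If not specified, use negative_prompt_embeds.
num_images_per_prompt (int, optional, defaults to 1): Number of images to generate for each prompt.
eta (float, optional, defaults to 0.0): Applies to DDIM scheduling. Controlled by the eta parameter from the related literature.
generator (torch.Generator or List[torch.Generator], optional): A generator to ensure reproducibility of the generation.
latents (torch.Tensor, optional): Predefined latent tensors to condition generation.
prompt_embeds (torch.Tensor, optional): Text embeddings for the prompts. Overrides the prompt string inputs for more flexibility.
negative_prompt_embeds (torch.Tensor, optional): Embeddings for the negative prompts. Overrides the string inputs if defined.
prompt_attention_mask (torch.Tensor, optional): Attention mask for the primary prompt embeddings.
negative_prompt_attention_mask (torch.Tensor, optional): Attention mask for the negative prompt embeddings.
output_type (str, optional, defaults to "pil"): Format of the generated output, either PIL images or a NumPy array.
return_dict (bool, optional, defaults to True): If True, returns a structured output. Otherwise returns a simple tuple.
callback_on_step_end (Callable, optional): Function called at the end of each denoising step.
callback_on_step_end_tensor_inputs (List[str], optional): Tensor names to include in the callback function calls.
guidance_rescale (float, optional, defaults to 0.0): Adjusts noise levels based on the guidance scale.

The description also lists the following entries, which do not appear in the __call__ signature above:

original_size (Tuple[int, int], optional, defaults to (1024, 1024)): The original dimensions of the output.
target_size (Tuple[int, int], optional): The desired output dimensions, used for calculations.
crops_coords_top_left (Tuple[int, int], optional, defaults to (0, 0)): Coordinates for cropping.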

encode_prompt

( prompt: typing.Union[str, typing.List[str]] num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None max_sequence_length: int = 256 )

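Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
device (torch.device) — The torch device.
dtype (torch.dtype) — The torch dtype.
num_images_per_prompt (int) — Number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when guidance is not used (i.e., if guidance_scale is less than 1).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative prompt embeddings are generated from the negative_prompt input argument.
prompt_attention_mask (torch.Tensor, optional) — Attention mask for the prompt. Required when prompt_embeds is passed directly.
negative_prompt_attention_mask (torch.Tensor, optional) — Attention mask for the negative prompt. Required when negative_prompt_embeds is passed directly.
max_sequence_length (int, optional) — Maximum sequence length to use for the prompt.

Encodes the prompt into text encoder hidden states.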

EasyAnimatePipelineOutput

class diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput

( frames: Tensor )

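Parameters

frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs. It can be a nested list of length batch_size, with each sub-list containing a sequence of denoised PIL images of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).

Output class for EasyAnimate pipelines.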
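
The five-axis layout described for array or tensor outputs can be made explicit by naming each axis. The following is an illustrative sketch only: `FramesShape` and `describe_frames` are hypothetical helpers, not part of Diffusers, and they operate on a plain shape tuple so that no model or array library is needed.

```python
from typing import NamedTuple


class FramesShape(NamedTuple):
    """Named axes for a (batch_size, num_frames, channels, height, width) shape."""

    batch_size: int
    num_frames: int
    channels: int
    height: int
    width: int


def describe_frames(shape):
    """Name the five axes of an array-like `frames` output."""
    if len(shape) != 5:
        raise ValueError(f"expected 5 dimensions, got {len(shape)}")
    return FramesShape(*shape)


info = describe_frames((1, 49, 3, 512, 512))
print(info.num_frames, info.height, info.width)  # 49 512 512
```

For the nested-list form, indexing `frames[0]`, as the examples on this page do, selects the frame sequence of the first video in the batch.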
