ConsisID

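Identity-Preserving Text-to-Video Generation by Frequency Decomposition, from Peking University, University of Rochester, and other institutions, by Yuan Shenghai, Huang Jinfa, He Xianyi, Ge Yunyang, Shi Yujun, Chen Liuhan, Luo Jiebo, and Yuan Li.

The abstract from the paper is:

Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not yet been resolved: (1) a tuning-free pipeline without tedious case-by-case finetuning, and (2) a frequency-aware heuristic identity-preserving control scheme based on the diffusion Transformer (DiT). To achieve these goals, we propose **ConsisID**, a tuning-free DiT-based controllable IPT2V model that keeps human **identity** **consistent** in the generated video. Inspired by prior findings in frequency analysis of vision/diffusion Transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features (e.g., profile, proportions) and high-frequency intrinsic features (e.g., identity markers that remain unaffected by pose changes). First, from a low-frequency perspective, we introduce a global facial extractor, which encodes the reference image and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into the shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into the Transformer blocks, enhancing the model's ability to preserve fine-grained features. To leverage the frequency information for identity preservation, we propose a hierarchical training strategy that transforms a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our **ConsisID** achieves excellent results in generating high-quality, identity-preserving videos, making a solid step towards more effective IPT2V. The ConsisID model weights are publicly available at https://github.com/PKU-YuanGroup/ConsisID.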

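Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

This pipeline was contributed by SHYuanBest. The original codebase can be found here. The original weights can be found under hf.co/BestWishYsh.

There are two official ConsisID checkpoints for identity-preserving text-to-video generation.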

| Checkpoint | Recommended inference dtype |
|---|---|
| `BestWishYsh/ConsisID-preview` | torch.bfloat16 |
| `BestWishYsh/ConsisID-1.5` | torch.bfloat16 |

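Memory optimization

ConsisID requires about 44 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) at an output resolution of 720x480 (W x H), which makes it impossible to run on consumer GPUs or on a free-tier T4 Colab. The following memory optimizations can be used to reduce the memory footprint. For replication, you can refer to this script.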

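| Feature (applied on top of the previous row) | Max memory allocated | Max memory reserved |
|---|---|---|
| - | 37 GB | 44 GB |
| enable_model_cpu_offload | 22 GB | 25 GB |
| enable_sequential_cpu_offload | 16 GB | 22 GB |
| vae.enable_slicing | 16 GB | 22 GB |
| vae.enable_tiling | 5 GB | 7 GB |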
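The switches in the table can be applied to a loaded pipeline with a few method calls. The following is a minimal sketch, not an official recipe: the helper name `apply_memory_optimizations` and its arguments are hypothetical, and it assumes `pipe` exposes the standard Diffusers methods `enable_model_cpu_offload`, `enable_sequential_cpu_offload`, `vae.enable_slicing`, and `vae.enable_tiling`. Only one offloading mode is selected at a time here.

```python
def apply_memory_optimizations(pipe, offload="model", vae_slicing=True, vae_tiling=True):
    """Hypothetical helper: switch on memory-saving features of a loaded pipeline.

    offload: "model", "sequential", or None (no CPU offloading).
    """
    if offload == "model":
        # Moves whole sub-models to the GPU only while they are in use.
        pipe.enable_model_cpu_offload()
    elif offload == "sequential":
        # Offloads at a finer granularity: lower memory, slower inference.
        pipe.enable_sequential_cpu_offload()
    elif offload is not None:
        raise ValueError(f"unknown offload mode: {offload!r}")

    if vae_slicing:
        # Decode the batch one sample at a time.
        pipe.vae.enable_slicing()
    if vae_tiling:
        # Decode each frame in spatial tiles.
        pipe.vae.enable_tiling()
    return pipe


# Assumed usage (requires a GPU and the downloaded checkpoint):
# pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
# apply_memory_optimizations(pipe, offload="sequential")
```

When CPU offloading is enabled, the pipeline is usually not moved to the GPU with `pipe.to("cuda")` beforehand, since the offloading hooks manage device placement themselves.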

ConsisIDPipeline

class diffusers.ConsisIDPipeline

( tokenizer: T5Tokenizer text_encoder: T5EncoderModel vae: AutoencoderKLCogVideoX transformer: ConsisIDTransformer3DModel scheduler: CogVideoXDPMScheduler )

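Parameters

vae (AutoencoderKLCogVideoX) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.
text_encoder (T5EncoderModel) — Frozen text encoder. ConsisID uses T5, specifically the t5-v1_1-xxl variant.
tokenizer (T5Tokenizer) — Tokenizer of class T5Tokenizer.
transformer (ConsisIDTransformer3DModel) — A text-conditioned ConsisIDTransformer3DModel to denoise the encoded video latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with transformer to denoise the encoded video latents.

Pipeline for image-to-video generation using ConsisID.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).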

__call__

( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 480 width: int = 720 num_frames: int = 49 num_inference_steps: int = 50 guidance_scale: float = 6.0 use_dynamic_cfg: bool = False num_videos_per_prompt: int = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: str = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 226 id_vit_hidden: typing.Optional[torch.Tensor] = None id_cond: typing.Optional[torch.Tensor] = None kps_cond: typing.Optional[torch.Tensor] = None ) → ConsisIDPipelineOutput or tuple

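Parameters

image (PipelineImageInput) — The input image to condition the generation on. Must be an image, a list of images, or a torch.Tensor.
prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, prompt_embeds must be passed instead.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
height (int, optional, defaults to self.transformer.config.sample_height * self.vae_scale_factor_spatial) — The height in pixels of the generated video. Set to 480 by default for the best results.
width (int, optional, defaults to self.transformer.config.sample_width * self.vae_scale_factor_spatial) — The width in pixels of the generated video. Set to 720 by default for the best results.
num_frames (int, defaults to 49) — Number of frames to generate. Must be divisible by self.vae_scale_factor_temporal. The generated video will contain 1 extra frame because ConsisID is conditioned with (num_seconds * fps + 1) frames, where num_seconds is 6 and fps is 8 (6 * 8 + 1 = 49). However, since videos can be saved at any fps, the only condition that needs to be satisfied is the divisibility mentioned above.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher-quality video at the expense of slower inference.
guidance_scale (float, optional, defaults to 6) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate output closely linked to the text prompt, usually at the expense of lower image quality.
use_dynamic_cfg (bool, optional, defaults to False) — If True, dynamically adjusts the guidance scale during inference. This lets the model use a progressive guidance scale that balances text-guided generation against image quality over the course of the inference steps. Typically, early inference steps use a higher guidance scale for more faithful generation, while later steps reduce it for more diverse and natural results.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling with the supplied random generator.
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated frames. Choose between PIL (PIL.Image.Image) and np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a ConsisIDPipelineOutput instead of a plain tuple.
attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
max_sequence_length (int, defaults to 226) — Maximum sequence length in the encoded prompt. Must be consistent with self.transformer.config.max_text_seq_length, otherwise it may lead to poor results.
id_vit_hidden (Optional[torch.Tensor], optional) — The tensor representing the hidden features extracted from the face model, which are used to condition the local facial extractor. This is crucial for the model to obtain high-frequency information about the face. If not provided, the local facial extractor will not run normally.
id_cond (Optional[torch.Tensor], optional) — The tensor representing the hidden features extracted from the clip model, which are used to condition the local facial extractor. This is crucial for the model to edit facial features. If not provided, the local facial extractor will not run normally.
kps_cond (Optional[torch.Tensor], optional) — A tensor that determines whether the global facial extractor uses keypoint information for conditioning. If provided, this tensor controls whether facial keypoints, such as eye, nose, and mouth landmarks, are used during generation. This helps ensure that the model retains more low-frequency facial information.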
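The relationship between clip length, frame rate, and num_frames can be checked with a little arithmetic. The sketch below uses the figures quoted in the memory optimization section (49 frames, about 6 seconds at 8 FPS); the helper names `frames_for_duration` and `playback_seconds` are hypothetical and not part of Diffusers.

```python
def frames_for_duration(num_seconds: int, fps: int) -> int:
    """Frame count for a clip: num_seconds * fps frames plus one extra frame."""
    return num_seconds * fps + 1


def playback_seconds(num_frames: int, fps: int) -> float:
    """Playback length when a clip of num_frames is saved at the given fps."""
    return num_frames / fps


print(frames_for_duration(6, 8))  # 49
print(playback_seconds(49, 8))    # 6.125
```

Because the frame rate is only chosen when the video is saved, the same 49 frames could just as well be exported at a different fps, which changes the playback length but not the generation itself.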

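ConsisIDPipelineOutput or tuple

ConsisIDPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated frames.

Function invoked when calling the pipeline for generation.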
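A step-end callback can be used to observe the denoising loop. The following is a minimal sketch that follows the callback_on_step_end signature listed in the parameters. The names `log_step` and `seen_steps` are illustrative, and the sketch assumes, as in other Diffusers pipelines, that the callback returns the callback_kwargs dictionary so the pipeline can read tensors back from it.

```python
from typing import Any, Dict, List

# Step indices observed so far (illustrative module-level state).
seen_steps: List[int] = []


def log_step(pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Record the step index, report which tensors were exposed, and pass them back unchanged."""
    seen_steps.append(step)
    print(f"step={step} timestep={timestep} tensors={sorted(callback_kwargs)}")
    return callback_kwargs


# Assumed usage with a loaded pipeline and prepared inputs:
# video = pipe(
#     image=image,
#     prompt=prompt,
#     callback_on_step_end=log_step,
#     callback_on_step_end_tensor_inputs=["latents"],
# )
```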

Example

>>> import torch
>>> from diffusers import ConsisIDPipeline
>>> from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
>>> from diffusers.utils import export_to_video
>>> from huggingface_hub import snapshot_download

>>> snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")
>>> (
...     face_helper_1,
...     face_helper_2,
...     face_clip_model,
...     face_main_model,
...     eva_transform_mean,
...     eva_transform_std,
... ) = prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
>>> pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> # ConsisID works well with long and well-described prompts. Make sure the face in the image is clearly visible (e.g., preferably half-body or full-body).
>>> prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
>>> image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_input.png?download=true"

>>> id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(
...     face_helper_1,
...     face_clip_model,
...     face_helper_2,
...     eva_transform_mean,
...     eva_transform_std,
...     face_main_model,
...     "cuda",
...     torch.bfloat16,
...     image,
...     is_align_face=True,
... )

>>> video = pipe(
...     image=image,
...     prompt=prompt,
...     num_inference_steps=50,
...     guidance_scale=6.0,
...     use_dynamic_cfg=False,
...     id_vit_hidden=id_vit_hidden,
...     id_cond=id_cond,
...     kps_cond=face_kps,
...     generator=torch.Generator("cuda").manual_seed(42),
... )
>>> export_to_video(video.frames[0], "output.mp4", fps=8)

encode_prompt

( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None max_sequence_length: int = 226 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )

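Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
num_videos_per_prompt (int, optional, defaults to 1) — Number of videos that should be generated per prompt.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
device (torch.device, optional) — The torch device to place the resulting embeddings on.
dtype (torch.dtype, optional) — The torch dtype of the resulting embeddings.

Encodes the prompt into text encoder hidden states.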
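Precomputed embeddings can be reused across several generation calls instead of re-encoding the same prompt. The following is a sketch under stated assumptions: `precompute_embeds` is a hypothetical helper, and it assumes that encode_prompt returns a `(prompt_embeds, negative_prompt_embeds)` pair, as in related Diffusers video pipelines. Check the return value against the installed version before relying on it.

```python
def precompute_embeds(pipe, prompt, negative_prompt=None, max_sequence_length=226):
    """Hypothetical helper: encode a prompt once and return kwargs for later pipeline calls."""
    # Assumption: encode_prompt returns (prompt_embeds, negative_prompt_embeds).
    prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(
        prompt=prompt,
        negative_prompt=negative_prompt,
        do_classifier_free_guidance=True,
        num_videos_per_prompt=1,
        max_sequence_length=max_sequence_length,
    )
    return {
        "prompt_embeds": prompt_embeds,
        "negative_prompt_embeds": negative_prompt_embeds,
    }


# Assumed usage: pass the embeddings in place of the text prompt.
# embeds = precompute_embeds(pipe, prompt)
# video = pipe(image=image, id_vit_hidden=id_vit_hidden, id_cond=id_cond, kps_cond=face_kps, **embeds)
```

When embeddings are passed this way, the prompt argument is left unset, in line with the parameter descriptions, which require either prompt or prompt_embeds.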

ConsisIDPipelineOutput

class diffusers.pipelines.consisid.pipeline_output.ConsisIDPipelineOutput

( frames: Tensor )

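Parameters

frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs. It can be a nested list of length batch_size, with each sub-list containing a denoised PIL image sequence of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).

Output class for ConsisID pipelines.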
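Since the pipeline returns either a ConsisIDPipelineOutput or a plain tuple depending on return_dict, downstream code sometimes needs to handle both forms. The following is a minimal sketch; `first_video` is a hypothetical helper name, and it relies only on the `frames` attribute and the tuple layout described in this page.

```python
def first_video(result):
    """Return the frames of the first generated video from either return form."""
    # With return_dict=False the pipeline returns a tuple whose first element holds the videos.
    frames = result[0] if isinstance(result, tuple) else result.frames
    # Index 0 selects the first video in the batch.
    return frames[0]


# Assumed usage:
# export_to_video(first_video(pipe(...)), "output.mp4", fps=8)
```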
