UniDiffuser

The UniDiffuser model was proposed in One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale by Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu.

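The abstract from the paper is:

This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is that learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model: it perturbs data in all modalities instead of a single modality, inputs individual timesteps in different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead. In particular, UniDiffuser is able to produce perceptually realistic samples in all tasks, and its quantitative results (e.g., the FID and CLIP score) are not only superior to existing general-purpose models but also comparable to bespoke models (e.g., Stable Diffusion and DALL-E 2) in representative tasks (e.g., text-to-image generation).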
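As a rough illustration of the "setting proper timesteps" idea in the abstract, the sketch below maps each generation task to the pair of perturbation levels it would use for the image and the text at sampling step t. This is our reading of the paper's description, not code from the library: the function name, the task labels, and the value T = 1000 are assumptions made for the example.

```python
from typing import Tuple

# Assumed maximum timestep, chosen only for illustration.
T = 1000


def timesteps_for_task(task: str, t: int) -> Tuple[int, int]:
    """Return (image_timestep, text_timestep) for sampling step t.

    A timestep of 0 means the modality is clean (used as the condition);
    a timestep of T means it is fully noised (effectively marginalized out).
    """
    if task == "joint":      # image and text are denoised together
        return (t, t)
    if task == "text2img":   # clean text conditions the image
        return (t, 0)
    if task == "img2text":   # clean image conditions the text
        return (0, t)
    if task == "img":        # image only; text stays at pure noise
        return (t, T)
    if task == "text":       # text only; image stays at pure noise
        return (T, t)
    raise ValueError(f"unknown task: {task}")


for task in ("joint", "text2img", "img2text", "img", "text"):
    print(task, timesteps_for_task(task, 500))
```

The point of the sketch is that a single noise-prediction network can serve all five tasks, because the task is encoded entirely in which timesteps are fed to it.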

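You can find the original codebase at thu-ml/unidiffuser and additional checkpoints at thu-ml.

There is currently an issue on PyTorch 1.X where the output images are all black or the pixel values become NaNs. This issue can be mitigated by switching to PyTorch 2.X.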
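One way to guard against the PyTorch 1.X issue is to check the installed version before building the pipeline. The helper below is a minimal sketch; the function name is hypothetical, and it only inspects the major version number of the string it is given.

```python
import re


def is_torch_2_or_newer(version: str) -> bool:
    """Return True if a version string such as '2.1.0+cu118' has major version >= 2."""
    match = re.match(r"(\d+)\.", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)) >= 2


try:
    import torch
except ImportError:
    torch = None  # PyTorch is not installed, so there is nothing to check

if torch is not None and not is_torch_2_or_newer(torch.__version__):
    print(f"PyTorch {torch.__version__} detected; consider upgrading to 2.X.")
```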

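This pipeline was contributed by dg845. ❤️

Usage Examples

Because the UniDiffuser model is trained to model the joint distribution of (image, text) pairs, it can perform a diverse range of generation tasks:

Unconditional Image and Text Generation

Unconditional generation from a UniDiffuserPipeline (where we start from only latents sampled from a standard Gaussian prior) produces an (image, text) pair: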

import torch

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Unconditional image and text generation. The generation task is automatically inferred.
sample = pipe(num_inference_steps=20, guidance_scale=8.0)
image = sample.images[0]
text = sample.text[0]
image.save("unidiffuser_joint_sample_image.png")
print(text)

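This is also called "joint" generation in the UniDiffuser paper, since we are sampling from the joint image-text distribution.

Note that the generation task is inferred from the inputs used when calling the pipeline. It is also possible to specify the unconditional generation task ("mode") manually with UniDiffuserPipeline.set_joint_mode():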

# Equivalent to the above.
pipe.set_joint_mode()
sample = pipe(num_inference_steps=20, guidance_scale=8.0)

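When the mode is set manually, subsequent calls to the pipeline use the set mode without attempting to infer the mode. You can reset the mode with UniDiffuserPipeline.reset_mode(), after which the pipeline will once again infer the mode.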
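The interaction between a manually set mode and inference from inputs can be pictured with the simplified sketch below. It is not the pipeline's actual implementation: the function name is hypothetical, only the prompt and image inputs are considered, and the preference for the prompt when both are supplied is an arbitrary choice made for the example.

```python
from typing import Any, Optional


def resolve_mode(
    manual_mode: Optional[str] = None,
    prompt: Optional[Any] = None,
    image: Optional[Any] = None,
) -> str:
    """Hypothetical helper: pick a generation mode from a manual setting or the inputs."""
    if manual_mode is not None:
        return manual_mode   # a manually set mode always wins
    if prompt is not None:
        return "text2img"    # a prompt implies text-conditioned image generation
    if image is not None:
        return "img2text"    # an image implies image-conditioned text generation
    return "joint"           # no inputs: unconditional joint generation


print(resolve_mode())                                   # inferred from (no) inputs
print(resolve_mode(prompt="an elephant under the sea"))
print(resolve_mode(manual_mode="img"))                  # manual setting overrides inference
```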

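You can also generate only an image or only text (which the UniDiffuser paper calls "marginal" generation, since we sample from the marginal distribution of images and text, respectively):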

# Unlike other generation tasks, image-only and text-only generation don't use classifier-free guidance
# Image-only generation
pipe.set_image_mode()
sample_image = pipe(num_inference_steps=20).images[0]
# Text-only generation
pipe.set_text_mode()
sample_text = pipe(num_inference_steps=20).text[0]

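Text-to-Image Generation

UniDiffuser is also capable of sampling from conditional distributions; that is, the distribution of images conditioned on a text prompt or the distribution of texts conditioned on an image. Here is an example of sampling from the conditional image distribution (text-to-image generation or text-conditioned image generation):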

import torch

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Text-to-image generation
prompt = "an elephant under the sea"

sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image

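The text2img mode requires that either an input prompt or prompt_embeds be supplied. You can set the text2img mode manually with UniDiffuserPipeline.set_text_to_image_mode().

Image-to-Text Generation

Similarly, UniDiffuser can also produce text samples given an image (image-to-text or image-conditioned text generation):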

import torch

from diffusers import UniDiffuserPipeline
from diffusers.utils import load_image

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))

sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)

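The img2text mode requires that an input image be supplied. You can set the img2text mode manually with UniDiffuserPipeline.set_image_to_text_mode().

Image Variation

The UniDiffuser authors suggest performing image variation through a "round-trip" generation method: given an input image, we first perform an image-to-text generation, and then perform a text-to-image generation on the output of the first generation. This produces a new image which is semantically similar to the input image: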

import torch

from diffusers import UniDiffuserPipeline
from diffusers.utils import load_image

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Image variation can be performed with an image-to-text generation followed by a text-to-image generation:
# 1. Image-to-text generation
image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg"
init_image = load_image(image_url).resize((512, 512))

sample = pipe(image=init_image, num_inference_steps=20, guidance_scale=8.0)
i2t_text = sample.text[0]
print(i2t_text)

# 2. Text-to-image generation
sample = pipe(prompt=i2t_text, num_inference_steps=20, guidance_scale=8.0)
final_image = sample.images[0]
final_image.save("unidiffuser_image_variation_sample.png")
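The round trip above can be wrapped in a small helper that produces several variations from one caption. This is a sketch under stated assumptions: the helper name and its num_variations parameter are hypothetical, the pipeline is assumed to have no manually set mode (so the mode is inferred on each call), and FakePipe is a stand-in that only mimics the call pattern so the example runs without loading a model.

```python
from types import SimpleNamespace


def image_variations(pipe, image, num_variations=1, num_inference_steps=20, guidance_scale=8.0):
    """Return (caption, images) produced by an image -> text -> image round trip."""
    # 1. Image-to-text: caption the input image.
    caption = pipe(
        image=image, num_inference_steps=num_inference_steps, guidance_scale=guidance_scale
    ).text[0]
    # 2. Text-to-image: generate one new image per requested variation.
    images = []
    for _ in range(num_variations):
        sample = pipe(
            prompt=caption, num_inference_steps=num_inference_steps, guidance_scale=guidance_scale
        )
        images.append(sample.images[0])
    return caption, images


class FakePipe:
    """Stand-in for a pipeline; returns strings instead of real images and captions."""

    def __call__(self, prompt=None, image=None, **kwargs):
        if image is not None:
            return SimpleNamespace(images=None, text=[f"caption of {image}"])
        return SimpleNamespace(images=[f"image of '{prompt}'"], text=None)


caption, variations = image_variations(FakePipe(), "input.png", num_variations=2)
print(caption)
print(variations)
```

With a real UniDiffuserPipeline, you would pass the loaded pipe and a PIL image in place of FakePipe() and "input.png".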

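Text Variation

Similarly, text variation can be performed on an input prompt with a text-to-image generation followed by an image-to-text generation: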

import torch

from diffusers import UniDiffuserPipeline

device = "cuda"
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)

# Text variation can be performed with a text-to-image generation followed by an image-to-text generation:
# 1. Text-to-image generation
prompt = "an elephant under the sea"

sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0)
t2i_image = sample.images[0]
t2i_image.save("unidiffuser_text2img_sample_image.png")

# 2. Image-to-text generation
sample = pipe(image=t2i_image, num_inference_steps=20, guidance_scale=8.0)
final_prompt = sample.text[0]
print(final_prompt)

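Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.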

UniDiffuserPipeline

class diffusers.UniDiffuserPipeline


( vae: AutoencoderKL text_encoder: CLIPTextModel image_encoder: CLIPVisionModelWithProjection clip_image_processor: CLIPImageProcessor clip_tokenizer: CLIPTokenizer text_decoder: UniDiffuserTextDecoder text_tokenizer: GPT2Tokenizer unet: UniDiffuserModel scheduler: KarrasDiffusionSchedulers )

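Parameters

vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. This is part of the UniDiffuser image representation along with the CLIP vision encoding.
text_encoder (CLIPTextModel) — Frozen text encoder (clip-vit-large-patch14).
image_encoder (CLIPVisionModelWithProjection) — A CLIPVisionModelWithProjection to encode images as part of its image representation along with the VAE latent representation.
clip_image_processor (CLIPImageProcessor) — CLIPImageProcessor to preprocess an image before CLIP encoding it with image_encoder.
clip_tokenizer (CLIPTokenizer) — A CLIPTokenizer to tokenize the prompt before encoding it with text_encoder.
text_decoder (UniDiffuserTextDecoder) — Frozen text decoder. This is a GPT-style model which is used to generate text from the UniDiffuser embedding.
text_tokenizer (GPT2Tokenizer) — A GPT2Tokenizer to decode text for text generation; used along with the text_decoder.
unet (UniDiffuserModel) — A U-ViT model with U-Net-style skip connections between transformer layers to denoise the encoded image latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image and/or text latents. The original UniDiffuser paper uses the DPMSolverMultistepScheduler scheduler.

Pipeline for a bimodal image-text model which supports unconditional text and image generation, text-conditioned image generation, image-conditioned text generation, and joint image-text generation.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).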

__call__

( prompt: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[torch.Tensor, PIL.Image.Image, NoneType] = None height: typing.Optional[int] = None width: typing.Optional[int] = None data_type: typing.Optional[int] = 1 num_inference_steps: int = 50 guidance_scale: float = 8.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 num_prompts_per_image: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_latents: typing.Optional[torch.Tensor] = None vae_latents: typing.Optional[torch.Tensor] = None clip_latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 ) → ImageTextPipelineOutput or tuple

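Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds. Required for text-conditioned image generation (text2img) mode.
image (torch.Tensor or PIL.Image.Image, optional) — Image or tensor representing an image batch. Required for image-conditioned text generation (img2text) mode.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image.
data_type (int, optional, defaults to 1) — The data type (either 0 or 1). Only used if you are loading a checkpoint which supports a data type embedding; this is added for compatibility with the UniDiffuser-v1 checkpoint.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 8.0) — A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance is enabled when guidance_scale > 1.
negative_prompt (str or List[str], optional) — The prompt or prompts describing what not to include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale <= 1). Used in text-conditioned image generation (text2img) mode.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt. Used in text2img (text-conditioned image generation) and img mode. If the mode is joint and both num_images_per_prompt and num_prompts_per_image are supplied, min(num_images_per_prompt, num_prompts_per_image) samples are generated.
num_prompts_per_image (int, optional, defaults to 1) — The number of prompts to generate per image. Used in img2text (image-conditioned text generation) and text mode. If the mode is joint and both num_images_per_prompt and num_prompts_per_image are supplied, min(num_images_per_prompt, num_prompts_per_image) samples are generated.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler, and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for joint image-text generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator. This assumes a full set of VAE, CLIP, and text latents; if supplied, it overrides the values of prompt_latents, vae_latents, and clip_latents.
prompt_latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for text generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
vae_latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
clip_latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument. Used in text-conditioned image generation (text2img) mode.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument. Used in text-conditioned image generation (text2img) mode.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return an ImageTextPipelineOutput instead of a plain tuple.
callback (Callable, optional) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.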
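The way num_images_per_prompt and num_prompts_per_image interact with the generation mode can be summarized with a short sketch. The function below is hypothetical and simply restates the parameter descriptions above; it is not taken from the pipeline's source.

```python
def samples_per_input(mode: str, num_images_per_prompt: int = 1, num_prompts_per_image: int = 1) -> int:
    """Hypothetical helper: number of samples generated per input for a given mode."""
    if mode in ("text2img", "img"):
        return num_images_per_prompt      # image-producing modes
    if mode in ("img2text", "text"):
        return num_prompts_per_image      # text-producing modes
    if mode == "joint":
        # Joint generation yields (image, text) pairs, so the smaller count is used.
        return min(num_images_per_prompt, num_prompts_per_image)
    raise ValueError(f"unknown mode: {mode}")


print(samples_per_input("text2img", num_images_per_prompt=4, num_prompts_per_image=2))
print(samples_per_input("joint", num_images_per_prompt=4, num_prompts_per_image=2))
```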

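Returns

ImageTextPipelineOutput or tuple

If return_dict is True, an ImageTextPipelineOutput is returned; otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of generated texts.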
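Code that must accept both return forms can normalize them with a small helper. The sketch below is an assumption-laden illustration: unpack_output is a hypothetical name, and SimpleNamespace is used as a stand-in for ImageTextPipelineOutput so the example runs without the library.

```python
from types import SimpleNamespace


def unpack_output(output):
    """Return (images, text) whether the pipeline returned a tuple or an output object."""
    if isinstance(output, tuple):
        images, text = output            # return_dict=False: plain (images, text) tuple
        return images, text
    return output.images, output.text    # return_dict=True: attribute access


# Stand-ins for the two return forms.
as_object = SimpleNamespace(images=["img0"], text=["a caption"])
as_tuple = (["img0"], ["a caption"])

print(unpack_output(as_object))
print(unpack_output(as_tuple))
```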

The call function to the pipeline for generation.
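The callback and callback_steps parameters can be used to record progress during denoising. The sketch below defines a callback with the documented signature callback(step, timestep, latents) and exercises it with a simulated loop; the loop, the timestep values, and the helper name are hypothetical and only stand in for the pipeline's own denoising loop.

```python
def make_progress_recorder():
    """Return a callback plus the list it appends (step, timestep) pairs to."""
    history = []

    def callback(step, timestep, latents):
        # latents is ignored here; a real callback could inspect or save it.
        history.append((step, int(timestep)))

    return callback, history


callback, history = make_progress_recorder()

# Simulated denoising loop: invoke the callback every `callback_steps` iterations.
callback_steps = 2
timesteps = [900, 700, 500, 300, 100]
for i, t in enumerate(timesteps):
    if i % callback_steps == 0:
        callback(i, t, None)

print(history)
```

With a real pipeline, the same callback would be passed as pipe(..., callback=callback, callback_steps=2).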

disable_vae_slicing

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method will go back to computing decoding in one step.

disable_vae_tiling

( )

Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method will go back to computing decoding in one step.

enable_vae_slicing

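( )

Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor into slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.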
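Conceptually, sliced decoding trades one large decode call for several small ones that produce the same result. The toy sketch below illustrates that idea with plain Python lists; it is not the VAE's implementation, and decode_in_slices and toy_decode are hypothetical names.

```python
def decode_in_slices(latents, decode_batch):
    """Decode a batch one sample at a time and join the results."""
    decoded = []
    for sample in latents:
        # Each call sees a batch of size 1, so peak memory depends on one sample only.
        decoded.extend(decode_batch([sample]))
    return decoded


def toy_decode(batch):
    """Toy stand-in for a decoder: doubles every value of every sample."""
    return [[2 * value for value in sample] for sample in batch]


latents = [[1, 2], [3, 4], [5, 6]]
sliced = decode_in_slices(latents, toy_decode)
full = toy_decode(latents)
print(sliced == full)
```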

enable_vae_tiling

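( )

Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and for processing larger images.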

encode_prompt


( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt: typing.Optional[typing.Union[str, typing.List[str]]] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

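Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
device (torch.device) — The torch device.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts describing what not to include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., when guidance_scale <= 1).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale that is applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.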
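The effect of clip_skip can be pictured as choosing which layer's output to read. The sketch below assumes hidden_states is a list ordered from the first layer to the final layer; the helper name is hypothetical, and any post-processing the encoder applies afterwards is left out.

```python
def select_hidden_state(hidden_states, clip_skip=None):
    """Pick the layer output used for prompt embeddings.

    clip_skip=None (or 0) selects the final layer; clip_skip=1 selects the
    pre-final layer, clip_skip=2 the one before that, and so on.
    """
    if clip_skip is None:
        return hidden_states[-1]
    return hidden_states[-(clip_skip + 1)]


layers = ["layer_1", "layer_2", "layer_3", "final_layer"]
print(select_hidden_state(layers))               # final layer
print(select_hidden_state(layers, clip_skip=1))  # pre-final layer
```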

reset_mode

( )

Removes a manually set mode; after calling this, the pipeline will infer the mode from its inputs.

set_image_mode

( )

Manually set the generation mode to unconditional ("marginal") image generation.

set_image_to_text_mode

( )

Manually set the generation mode to image-conditioned text generation.

set_joint_mode

( )

Manually set the generation mode to unconditional joint image-text generation.

set_text_mode

( )

Manually set the generation mode to unconditional ("marginal") text generation.

set_text_to_image_mode

( )

Manually set the generation mode to text-conditioned image generation.

ImageTextPipelineOutput

class diffusers.ImageTextPipelineOutput


( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray, NoneType] text: typing.Union[typing.List[str], typing.List[typing.List[str]], NoneType] )

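Parameters

images (List[PIL.Image.Image] or np.ndarray) — List of denoised PIL images of length batch_size or NumPy array of shape (batch_size, height, width, num_channels).
text (List[str] or List[List[str]]) — List of generated text strings of length batch_size, or a list of lists of strings whose outer list has length batch_size.

Output class for joint image-text pipelines.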
