T2I-Adapter

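T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models, by Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie.

Using the pretrained models, we can provide control images (for example, a depth map) to control Stable Diffusion text-to-image generation so that it follows the structure of the depth image and fills in the details.

The abstract of the paper is the following:

The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.

This model was contributed by the community contributor HimariO ❤️.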

StableDiffusionAdapterPipeline

class diffusers.StableDiffusionAdapterPipeline

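( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel adapter: typing.Union[diffusers.models.adapter.T2IAdapter, diffusers.models.adapter.MultiAdapter, typing.List[diffusers.models.adapter.T2IAdapter]] scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor requires_safety_checker: bool = True )

Parameters

adapter (T2IAdapter or MultiAdapter or List[T2IAdapter]) — Provides additional conditioning to the unet during the denoising process. If you set multiple adapters as a list, the outputs from each adapter are added together to create one combined additional conditioning.
adapter_weights (List[float], optional, defaults to None) — List of floats. The output of each adapter is multiplied by its weight before the outputs are added together.
vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text encoder. Stable Diffusion uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.
tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the encoded image latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
safety_checker (StableDiffusionSafetyChecker) — Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the model card for details.
feature_extractor (CLIPImageProcessor) — Model that extracts features from generated images to be used as inputs for the safety_checker.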
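
As a rough illustration of how adapter_weights combines several adapters, the sketch below scales each adapter's output by its weight and sums the results element-wise. The helper name combine_adapter_features, the use of plain Python lists in place of tensors, and the equal-weight default are assumptions made for this example. It is not the library's MultiAdapter implementation.

```python
from typing import List, Optional


def combine_adapter_features(
    adapter_outputs: List[List[float]],
    adapter_weights: Optional[List[float]] = None,
) -> List[float]:
    """Weighted element-wise sum of per-adapter feature vectors (hypothetical helper)."""
    if not adapter_outputs:
        raise ValueError("at least one adapter output is required")
    if adapter_weights is None:
        # Assumption for this sketch: equal weighting when no weights are given.
        adapter_weights = [1.0 / len(adapter_outputs)] * len(adapter_outputs)
    if len(adapter_weights) != len(adapter_outputs):
        raise ValueError("need exactly one weight per adapter")

    combined = [0.0] * len(adapter_outputs[0])
    for weight, features in zip(adapter_weights, adapter_outputs):
        for i, value in enumerate(features):
            # Scale first, then accumulate into the shared conditioning.
            combined[i] += weight * value
    return combined


depth_features = [1.0, 2.0, 3.0]    # stand-in for a depth adapter's output
sketch_features = [4.0, 0.0, -2.0]  # stand-in for a sketch adapter's output
combined = combine_adapter_features([depth_features, sketch_features], [0.75, 0.25])
```

Raising one weight relative to the others makes that adapter's control signal dominate the combined conditioning.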

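Pipeline for text-to-image generation using Stable Diffusion augmented with T2I-Adapter https://huggingface.co/papers/2302.08453

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).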

__call__

( prompt: typing.Union[str, typing.List[str]] = None image: typing.Union[torch.Tensor, PIL.Image.Image, typing.List[PIL.Image.Image]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None adapter_conditioning_scale: typing.Union[float, typing.List[float]] = 1.0 clip_skip: typing.Optional[int] = None ) → ~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput or tuple

Parameters

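prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, prompt_embeds must be passed instead.
image (torch.Tensor, PIL.Image.Image, List[torch.Tensor] or List[PIL.Image.Image] or List[List[PIL.Image.Image]]) — The adapter input condition. The adapter uses this input condition to generate guidance for the UNet. If the type is torch.Tensor, it is passed to the adapter as is. A PIL.Image.Image is also accepted. The control image is automatically resized to fit the output image.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers whose set_timesteps method supports a timesteps argument. If not defined, the default behavior when num_inference_steps is passed is used. Must be in descending order.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers whose set_timesteps method supports a sigmas argument. If not defined, the default behavior when num_inference_steps is passed is used.
guidance_scale (float, optional, defaults to 7.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://huggingface.co/papers/2010.02502. Only applies to schedulers.DDIMScheduler and is ignored for other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput instead of a plain tuple.
callback (Callable, optional) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttnProcessor as defined under self.processor in diffusers.models.attention_processor.
adapter_conditioning_scale (float or List[float], optional, defaults to 1.0) — The outputs of the adapter are multiplied by adapter_conditioning_scale before they are added to the residual in the original unet. If multiple adapters are specified at initialization, you can set the corresponding scales as a list.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.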
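
To make the role of guidance_scale concrete, here is a minimal sketch of the classifier-free guidance combination that the parameter description refers to. The function name apply_guidance and the use of short Python lists as stand-ins for noise predictions are assumptions for illustration only.

```python
from typing import List


def apply_guidance(
    noise_uncond: List[float],
    noise_text: List[float],
    guidance_scale: float,
) -> List[float]:
    """Push the unconditional prediction towards the text-conditioned one."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


noise_uncond = [0.0, 1.0]  # prediction for the empty or negative prompt
noise_text = [1.0, 3.0]    # prediction for the text prompt
no_guidance = apply_guidance(noise_uncond, noise_text, 1.0)
guided = apply_guidance(noise_uncond, noise_text, 7.5)
```

With guidance_scale equal to 1 the result is just the text-conditioned prediction, which matches the statement that guidance only takes effect for guidance_scale > 1.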

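~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput or tuple

~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) content, according to the safety_checker.

Function invoked when calling the pipeline for generation.

Examples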

>>> from PIL import Image
>>> from diffusers.utils import load_image
>>> import torch
>>> from diffusers import StableDiffusionAdapterPipeline, T2IAdapter

>>> image = load_image(
...     "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_ref.png"
... )

>>> color_palette = image.resize((8, 8))
>>> color_palette = color_palette.resize((512, 512), resample=Image.Resampling.NEAREST)

>>> adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_color_sd14v1", torch_dtype=torch.float16)
>>> pipe = StableDiffusionAdapterPipeline.from_pretrained(
...     "CompVis/stable-diffusion-v1-4",
...     adapter=adapter,
...     torch_dtype=torch.float16,
... )

>>> pipe.to("cuda")

>>> out_image = pipe(
...     "At night, glowing cubes in front of the beach",
...     image=color_palette,
... ).images[0]

enable_attention_slicing

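( slice_size: typing.Union[int, str, NoneType] = 'auto' )

Parameters

slice_size (str or int, optional, defaults to "auto") — When "auto", halves the input to the attention heads, so attention is computed in two steps. If "max", maximum memory is saved by running only one slice at a time. If a number is provided, uses as many slices as attention_head_dim // slice_size. In this case, attention_head_dim must be a multiple of slice_size.

Enable sliced attention computation. When this option is enabled, the attention module splits the input tensor into slices to compute attention in several steps. For more than one attention head, the computation is performed sequentially over each head. This is useful to save some memory in exchange for a small speed decrease.

⚠️ Don't enable attention slicing if you're already using scaled_dot_product_attention (SDPA) from PyTorch 2.0 or xFormers. These attention computations are already very memory efficient, so you won't need to enable this function. If you enable attention slicing with SDPA or xFormers, it can lead to serious slowdowns!

Examples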

>>> import torch
>>> from diffusers import StableDiffusionPipeline

>>> pipe = StableDiffusionPipeline.from_pretrained(
...     "stable-diffusion-v1-5/stable-diffusion-v1-5",
...     torch_dtype=torch.float16,
...     use_safetensors=True,
... )

>>> prompt = "a photo of an astronaut riding a horse on mars"
>>> pipe.enable_attention_slicing()
>>> image = pipe(prompt).images[0]
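
The slice_size rule described for enable_attention_slicing can be read as a small function from the setting to a slice count. The helper below, num_attention_slices, is hypothetical and only mirrors the wording of the parameter description. In particular, treating "max" as a slice size of 1 is this sketch's interpretation, not a statement about the library's internals.

```python
from typing import Union


def num_attention_slices(attention_head_dim: int, slice_size: Union[int, str] = "auto") -> int:
    """Number of slices implied by the documented slice_size rule (hypothetical helper)."""
    if slice_size == "auto":
        # "auto" halves the input, so attention is computed in two steps.
        return 2
    if slice_size == "max":
        # "max" runs one slice at a time (interpreted here as slice_size == 1).
        return attention_head_dim
    if not isinstance(slice_size, int) or slice_size <= 0:
        raise ValueError("slice_size must be 'auto', 'max', or a positive integer")
    if attention_head_dim % slice_size != 0:
        raise ValueError("attention_head_dim must be a multiple of slice_size")
    return attention_head_dim // slice_size


slices_auto = num_attention_slices(8, "auto")
slices_max = num_attention_slices(8, "max")
slices_four = num_attention_slices(8, 4)
```

More slices mean lower peak memory per step but more sequential work, which is the trade-off the description mentions.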

disable_attention_slicing

( )

Disable sliced attention computation. If enable_attention_slicing was previously called, attention is computed in one step.

enable_vae_slicing

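( )

Enable sliced VAE decoding. When this option is enabled, the VAE splits the input tensor into slices and computes decoding in several steps. This is useful to save some memory and allow larger batch sizes.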
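
The idea behind sliced decoding can be sketched without any model: decode one sample at a time and concatenate the results, so that peak memory depends on a single sample instead of the whole batch. The names decode_sliced and toy_decode are placeholders invented for this example, and toy_decode merely stands in for a real VAE decoder.

```python
from typing import Callable, List

Latent = List[float]


def decode_sliced(
    latents: List[Latent],
    decode_one: Callable[[List[Latent]], List[Latent]],
) -> List[Latent]:
    """Decode a batch one sample at a time and concatenate the results."""
    decoded: List[Latent] = []
    for latent in latents:
        # Each call sees a batch of size 1, so peak memory scales with one sample.
        decoded.extend(decode_one([latent]))
    return decoded


def toy_decode(batch: List[Latent]) -> List[Latent]:
    """Stand-in for a VAE decoder: doubles every value (purely illustrative)."""
    return [[2.0 * value for value in latent] for latent in batch]


batch = [[0.5, 1.0], [1.5, 2.0], [2.5, 3.0]]
sliced_result = decode_sliced(batch, toy_decode)
full_result = toy_decode(batch)
```

Because each sample is decoded independently, the sliced result equals the full-batch result in this sketch.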

disable_vae_slicing

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method goes back to computing decoding in one step.

enable_xformers_memory_efficient_attention

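( attention_op: typing.Optional[typing.Callable] = None )

Parameters

attention_op (Callable, optional) — Override the default None operator for use as the op argument to the memory_efficient_attention() function of xFormers.

Enable memory efficient attention from xFormers. When this option is enabled, you should observe lower GPU memory usage and a potential speed-up during inference. A speed-up during training is not guaranteed.

⚠️ When memory efficient attention and sliced attention are both enabled, memory efficient attention takes precedence.

Examples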

>>> import torch
>>> from diffusers import DiffusionPipeline
>>> from xformers.ops import MemoryEfficientAttentionFlashAttentionOp

>>> pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16)
>>> pipe = pipe.to("cuda")
>>> pipe.enable_xformers_memory_efficient_attention(attention_op=MemoryEfficientAttentionFlashAttentionOp)
>>> # Workaround for not accepting attention shape using VAE for Flash Attention
>>> pipe.vae.enable_xformers_memory_efficient_attention(attention_op=None)

disable_xformers_memory_efficient_attention

( )

Disable memory efficient attention from xFormers.

encode_prompt

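( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
device (torch.device) — The torch device.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale that is applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.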
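
The clip_skip rule above amounts to an index into the text encoder's per-layer outputs. The sketch below uses strings as stand-ins for hidden-state tensors. The helper select_hidden_state is hypothetical and shows only the indexing, not any further processing the pipeline may apply.

```python
from typing import Optional, Sequence


def select_hidden_state(hidden_states: Sequence[str], clip_skip: Optional[int] = None) -> str:
    """Pick the layer output used for prompt embeddings (hypothetical helper)."""
    if clip_skip is None:
        # No skipping: use the final layer's output.
        return hidden_states[-1]
    if not 0 <= clip_skip < len(hidden_states):
        raise ValueError("clip_skip must be between 0 and the number of layers minus 1")
    # clip_skip=1 skips the final layer and uses the pre-final one.
    return hidden_states[-(clip_skip + 1)]


layers = ["layer_1", "layer_2", "layer_3", "layer_4"]
default_choice = select_hidden_state(layers)
pre_final = select_hidden_state(layers, clip_skip=1)
two_back = select_hidden_state(layers, clip_skip=2)
```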

get_guidance_scale_embedding

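( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

w (torch.Tensor) — Guidance scale values for which to generate embedding vectors, which are subsequently used to enrich the timestep embeddings.
embedding_dim (int, optional, defaults to 512) — Dimension of the embeddings to generate.
dtype (torch.dtype, optional, defaults to torch.float32) — Data type of the generated embeddings.

torch.Tensor

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298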
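
For intuition about the (len(w), embedding_dim) output shape, the following is a pure-Python sketch of a sinusoidal embedding of guidance scales. The helper guidance_scale_embedding is a simplified stand-in written for this page. The scaling of w by 1000 and the frequency range up to 10000 are assumptions of the sketch and should be checked against the referenced source before relying on them.

```python
import math
from typing import List


def guidance_scale_embedding(w: List[float], embedding_dim: int = 512) -> List[List[float]]:
    """Sinusoidal embedding: one row of embedding_dim values per guidance scale."""
    if embedding_dim < 4:
        raise ValueError("this sketch needs embedding_dim >= 4")
    half_dim = embedding_dim // 2
    # Geometric frequency ladder from 1 down to 1/10000.
    step = math.log(10000.0) / (half_dim - 1)
    freqs = [math.exp(-step * i) for i in range(half_dim)]

    rows = []
    for scale in w:
        scaled = scale * 1000.0  # assumption: scales are multiplied by 1000 first
        row = [math.sin(scaled * f) for f in freqs] + [math.cos(scaled * f) for f in freqs]
        if embedding_dim % 2 == 1:
            row.append(0.0)  # zero-pad odd dimensions
        rows.append(row)
    return rows


embedding = guidance_scale_embedding([0.0, 6.5], embedding_dim=8)
```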

StableDiffusionXLAdapterPipeline

class diffusers.StableDiffusionXLAdapterPipeline

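( vae: AutoencoderKL text_encoder: CLIPTextModel text_encoder_2: CLIPTextModelWithProjection tokenizer: CLIPTokenizer tokenizer_2: CLIPTokenizer unet: UNet2DConditionModel adapter: typing.Union[diffusers.models.adapter.T2IAdapter, diffusers.models.adapter.MultiAdapter, typing.List[diffusers.models.adapter.T2IAdapter]] scheduler: KarrasDiffusionSchedulers force_zeros_for_empty_prompt: bool = True feature_extractor: CLIPImageProcessor = None image_encoder: CLIPVisionModelWithProjection = None )

Parameters

adapter (T2IAdapter or MultiAdapter or List[T2IAdapter]) — Provides additional conditioning to the unet during the denoising process. If you set multiple adapters as a list, the outputs from each adapter are added together to create one combined additional conditioning.
adapter_weights (List[float], optional, defaults to None) — List of floats. The output of each adapter is multiplied by its weight before the outputs are added together.
vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel) — Frozen text encoder. Stable Diffusion uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.
tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the encoded image latents.
scheduler (SchedulerMixin) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
safety_checker (StableDiffusionSafetyChecker) — Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the model card for details.
feature_extractor (CLIPImageProcessor) — Model that extracts features from generated images to be used as inputs for the safety_checker.

Pipeline for text-to-image generation using Stable Diffusion augmented with T2I-Adapter https://huggingface.co/papers/2302.08453

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).

The pipeline also inherits the following loading methods:

load_textual_inversion() for loading textual inversion embeddings
from_single_file() for loading .ckpt files
load_lora_weights() for loading LoRA weights
save_lora_weights() for saving LoRA weights
load_ip_adapter() for loading IP Adapters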

__call__

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 timesteps: typing.List[int] = None sigmas: typing.List[float] = None denoising_end: typing.Optional[float] = None guidance_scale: float = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[typing.List[torch.Tensor]] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 original_size: typing.Optional[typing.Tuple[int, int]] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Optional[typing.Tuple[int, int]] = None negative_original_size: typing.Optional[typing.Tuple[int, int]] = None negative_crops_coords_top_left: typing.Tuple[int, int] = (0, 0) negative_target_size: typing.Optional[typing.Tuple[int, int]] = None adapter_conditioning_scale: typing.Union[float, typing.List[float]] = 1.0 adapter_conditioning_factor: float = 1.0 clip_skip: typing.Optional[int] = None ) → ~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput or tuple

Parameters

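prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, prompt_embeds must be passed instead.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used in both text encoders.
image (torch.Tensor, PIL.Image.Image, List[torch.Tensor] or List[PIL.Image.Image] or List[List[PIL.Image.Image]]) — The adapter input condition. The adapter uses this input condition to generate guidance for the UNet. If the type is torch.Tensor, it is passed to the adapter as is. A PIL.Image.Image is also accepted. The control image is automatically resized to fit the output image.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image. Anything below 512 pixels won't work well for stabilityai/stable-diffusion-xl-base-1.0 and checkpoints that are not specifically fine-tuned on low resolutions.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image. Anything below 512 pixels won't work well for stabilityai/stable-diffusion-xl-base-1.0 and checkpoints that are not specifically fine-tuned on low resolutions.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers whose set_timesteps method supports a timesteps argument. If not defined, the default behavior when num_inference_steps is passed is used. Must be in descending order.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers whose set_timesteps method supports a sigmas argument. If not defined, the default behavior when num_inference_steps is passed is used.
denoising_end (float, optional) — When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be completed before it is intentionally terminated early. As a result, the returned sample still retains a substantial amount of noise, as determined by the discrete timesteps selected by the scheduler. The denoising_end parameter should ideally be used when this pipeline forms part of a "Mixture of Denoisers" multi-pipeline setup, as elaborated in Refining the Image Output.
guidance_scale (float, optional, defaults to 5.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text encoders.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://huggingface.co/papers/2010.02502. Only applies to schedulers.DDIMScheduler and is ignored for other schedulers.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings are generated from the prompt input argument.
negative_pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds are generated from the negative_prompt input argument.
ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
ip_adapter_image_embeds (List[torch.Tensor], optional) — Pre-generated image embeddings for IP-Adapter. It should be a list whose length equals the number of IP-Adapters. Each element should be a tensor of shape (batch_size, num_images, emb_dim). It should contain the negative image embedding if do_classifier_free_guidance is set to True. If not provided, embeddings are computed from the ip_adapter_image input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.stable_diffusion_xl.StableDiffusionAdapterPipelineOutput instead of a plain tuple.
callback (Callable, optional) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
guidance_rescale (float, optional, defaults to 0.0) — Guidance rescale factor proposed by Common Diffusion Noise Schedules and Sample Steps are Flawed. guidance_rescale is defined as φ in equation 16 of that paper. The guidance rescale factor should fix overexposure when using zero terminal SNR.
original_size (Tuple[int], optional, defaults to (1024, 1024)) — If original_size is not the same as target_size, the image will appear to be down- or upsampled. original_size defaults to (height, width) if not specified. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952.
crops_coords_top_left (Tuple[int], optional, defaults to (0, 0)) — crops_coords_top_left can be used to generate an image that appears to be "cropped" from the position crops_coords_top_left downwards. Favorable, well-centered images are usually achieved by setting crops_coords_top_left to (0, 0). Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952.
target_size (Tuple[int], optional, defaults to (1024, 1024)) — For most cases, target_size should be set to the desired height and width of the generated image. If not specified, it defaults to (height, width). Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952.
negative_original_size (Tuple[int], optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a specific image resolution. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
negative_crops_coords_top_left (Tuple[int], optional, defaults to (0, 0)) — To negatively condition the generation process based on specific crop coordinates. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
negative_target_size (Tuple[int], optional, defaults to (1024, 1024)) — To negatively condition the generation process based on a target image resolution. It should be the same as target_size in most cases. Part of SDXL's micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952. For more information, refer to this issue thread: https://github.com/huggingface/diffusers/issues/4208.
adapter_conditioning_scale (float or List[float], optional, defaults to 1.0) — The outputs of the adapter are multiplied by adapter_conditioning_scale before they are added to the residual in the original unet. If multiple adapters are specified at initialization, you can set the corresponding scales as a list.
adapter_conditioning_factor (float, optional, defaults to 1.0) — The fraction of timesteps for which the adapter should be applied. If adapter_conditioning_factor is 0.0, the adapter is not applied at all. If adapter_conditioning_factor is 1.0, the adapter is applied for all timesteps. If adapter_conditioning_factor is 0.5, the adapter is applied for half of the timesteps.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.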
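
The effect of adapter_conditioning_factor can be pictured as a per-step on/off mask. The helper adapter_active_steps below is hypothetical. It assumes the adapter is applied to the earliest steps of the schedule and switched off afterwards, which is one plausible reading of "fraction of timesteps" and not a confirmed description of the pipeline's loop.

```python
from typing import List


def adapter_active_steps(
    num_inference_steps: int,
    adapter_conditioning_factor: float = 1.0,
) -> List[bool]:
    """Mark which denoising steps receive adapter features (hypothetical helper)."""
    if not 0.0 <= adapter_conditioning_factor <= 1.0:
        raise ValueError("adapter_conditioning_factor must be between 0.0 and 1.0")
    # Assumption: the adapter is applied to the earliest steps, then switched off.
    cutoff = int(num_inference_steps * adapter_conditioning_factor)
    return [step < cutoff for step in range(num_inference_steps)]


all_steps = adapter_active_steps(10, 1.0)
half_steps = adapter_active_steps(10, 0.5)
no_steps = adapter_active_steps(10, 0.0)
```

A value below 1.0 lets the control image fix the coarse layout while leaving the remaining steps to the text prompt alone.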

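~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput or tuple

~pipelines.stable_diffusion.StableDiffusionAdapterPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.

Examples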

>>> import torch
>>> from diffusers import T2IAdapter, StableDiffusionXLAdapterPipeline, DDPMScheduler
>>> from diffusers.utils import load_image

>>> sketch_image = load_image("https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch.png").convert("L")

>>> model_id = "stabilityai/stable-diffusion-xl-base-1.0"

>>> adapter = T2IAdapter.from_pretrained(
...     "Adapter/t2iadapter",
...     subfolder="sketch_sdxl_1.0",
...     torch_dtype=torch.float16,
...     adapter_type="full_adapter_xl",
... )
>>> scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

>>> pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
...     model_id, adapter=adapter, torch_dtype=torch.float16, variant="fp16", scheduler=scheduler
... ).to("cuda")

>>> generator = torch.manual_seed(42)
>>> sketch_image_out = pipe(
...     prompt="a photo of a dog in real world, high quality",
...     negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality",
...     image=sketch_image,
...     generator=generator,
...     guidance_scale=7.5,
... ).images[0]

enable_attention_slicing

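( slice_size: typing.Union[int, str, NoneType] = 'auto' )

Parameters

slice_size (str or int, optional, defaults to "auto") — When "auto", halves the input to the attention heads, so attention is computed in two steps. If "max", maximum memory is saved by running only one slice at a time. If a number is provided, uses as many slices as attention_head_dim // slice_size. In this case, attention_head_dim must be a multiple of slice_size.

Examples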

>>> import torch
>>> from diffusers import StableDiffusionPipeline

>>> pipe = StableDiffusionPipeline.from_pretrained(
...     "stable-diffusion-v1-5/stable-diffusion-v1-5",
...     torch_dtype=torch.float16,
...     use_safetensors=True,
... )

>>> prompt = "a photo of an astronaut riding a horse on mars"
>>> pipe.enable_attention_slicing()
>>> image = pipe(prompt).images[0]

disable_attention_slicing

( )

Disable sliced attention computation. If enable_attention_slicing was previously called, attention is computed in one step.

enable_vae_slicing

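( )

Enable sliced VAE decoding. When this option is enabled, the VAE splits the input tensor into slices and computes decoding in several steps. This is useful to save some memory and allow larger batch sizes.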

disable_vae_slicing

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method goes back to computing decoding in one step.

enable_xformers_memory_efficient_attention

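( attention_op: typing.Optional[typing.Callable] = None )

Parameters

attention_op (Callable, optional) — Override the default None operator for use as the op argument to the memory_efficient_attention() function of xFormers.

Enable memory efficient attention from xFormers. When this option is enabled, you should observe lower GPU memory usage and a potential speed-up during inference. A speed-up during training is not guaranteed.

⚠️ When memory efficient attention and sliced attention are both enabled, memory efficient attention takes precedence.

Examples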

>>> import torch
>>> from diffusers import DiffusionPipeline
>>> from xformers.ops import MemoryEfficientAttentionFlashAttentionOp

>>> pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16)
>>> pipe = pipe.to("cuda")
>>> pipe.enable_xformers_memory_efficient_attention(attention_op=MemoryEfficientAttentionFlashAttentionOp)
>>> # Workaround for not accepting attention shape using VAE for Flash Attention
>>> pipe.vae.enable_xformers_memory_efficient_attention(attention_op=None)

disable_xformers_memory_efficient_attention

( )

Disable memory efficient attention from xFormers.

encode_prompt

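( prompt: str prompt_2: typing.Optional[str] = None device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Optional[str] = None negative_prompt_2: typing.Optional[str] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None pooled_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used in both text encoders.
device (torch.device) — The torch device.
num_images_per_prompt (int) — The number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., if guidance_scale is less than 1).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text encoders.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative text embeddings are generated from the negative_prompt input argument.
pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings are generated from the prompt input argument.
negative_pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative text embeddings are generated from the negative_prompt input argument.
lora_scale (float, optional) — A LoRA scale that is applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.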
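
The fallback behavior described for prompt_2 and negative_prompt_2 can be sketched as follows. The helper resolve_prompts is hypothetical and handles only single strings. Treating a missing negative prompt as an empty string is an assumption of this sketch.

```python
from typing import Optional, Tuple


def resolve_prompts(
    prompt: str,
    prompt_2: Optional[str] = None,
    negative_prompt: Optional[str] = None,
    negative_prompt_2: Optional[str] = None,
) -> Tuple[Tuple[str, str], Tuple[str, str]]:
    """Return ((prompt, prompt_2), (negative_prompt, negative_prompt_2)) after fallbacks."""
    # The second encoder reuses the first prompt unless one is given explicitly.
    prompt_2 = prompt_2 or prompt
    # Assumption: a missing negative prompt is treated as an empty string.
    negative_prompt = negative_prompt or ""
    negative_prompt_2 = negative_prompt_2 or negative_prompt
    return (prompt, prompt_2), (negative_prompt, negative_prompt_2)


positive, negative = resolve_prompts("a photo of a dog", negative_prompt="low quality")
styled, _ = resolve_prompts("a photo of a dog", prompt_2="oil painting")
```

Passing a distinct prompt_2 is a way to send different text to each of the two text encoders. Leaving it unset keeps both encoders on the same prompt.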

get_guidance_scale_embedding

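( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor

Parameters

w (torch.Tensor) — Guidance scale values for which to generate embedding vectors, which are subsequently used to enrich the timestep embeddings.
embedding_dim (int, optional, defaults to 512) — Dimension of the embeddings to generate.
dtype (torch.dtype, optional, defaults to torch.float32) — Data type of the generated embeddings.

torch.Tensor

Embedding vectors with shape (len(w), embedding_dim).

See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298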
