Stable Diffusion 3

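Stable Diffusion 3 (SD3) was proposed in Scaling Rectified Flow Transformers for High-Resolution Image Synthesis by Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Muller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach.

The abstract from the paper is:

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations.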

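Usage Example

As the model is gated, before using it with diffusers you first need to go to the Stable Diffusion 3 Medium Hugging Face page, fill in the form and accept the terms of use. Once you have been granted access, you need to log in so that your system knows you have accepted the terms.

Use the command below to log in:

huggingface-cli login

The SD3 pipeline uses three text encoders to generate an image. Model offloading is necessary in order for it to run on most commodity hardware. Please use the torch.float16 data type for additional memory savings.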

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe.to("cuda")

image = pipe(
    prompt="a photo of a cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    height=1024,
    width=1024,
    guidance_scale=7.0,
).images[0]

image.save("sd3_hello_world.png")

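Note: Stable Diffusion 3.5 can also be run using the SD3 pipeline, and all mentioned optimizations and techniques apply to it as well. In total there are three official models in the SD3 family.

Image Prompting with IP-Adapters

An IP-Adapter lets you prompt SD3 with images, in addition to the text prompt. This is especially useful when you have a reference image and want to describe complex concepts that are difficult to articulate in text. To load and use an IP-Adapter, you need:

image_encoder: Pre-trained vision model used to obtain image features, usually a CLIP image encoder.
feature_extractor: Image processor that prepares the input image for the chosen image_encoder.
ip_adapter_id: Checkpoint containing parameters of the image cross-attention layers and the image projection.

IP-Adapters are trained for a specific model architecture, so they also work with finetuned variations of the base model. You can use the ~SD3IPAdapterMixin.set_ip_adapter_scale function to adjust how strongly the output aligns with the image prompt. The higher the value, the more closely the model follows the image prompt. A default value of 0.5 is typically a good balance, ensuring the model considers both the text and image prompts equally.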

import torch
from PIL import Image

from diffusers import StableDiffusion3Pipeline
from transformers import SiglipVisionModel, SiglipImageProcessor

image_encoder_id = "google/siglip-so400m-patch14-384"
ip_adapter_id = "InstantX/SD3.5-Large-IP-Adapter"

feature_extractor = SiglipImageProcessor.from_pretrained(
    image_encoder_id,
    torch_dtype=torch.float16
)
image_encoder = SiglipVisionModel.from_pretrained(
    image_encoder_id,
    torch_dtype=torch.float16
).to("cuda")

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.float16,
    feature_extractor=feature_extractor,
    image_encoder=image_encoder,
).to("cuda")

pipe.load_ip_adapter(ip_adapter_id)
pipe.set_ip_adapter_scale(0.6)

ref_img = Image.open("image.jpg").convert('RGB')

image = pipe(
    width=1024,
    height=1024,
    prompt="a cat",
    negative_prompt="lowres, low quality, worst quality",
    num_inference_steps=24,
    guidance_scale=5.0,
    ip_adapter_image=ref_img
).images[0]

image.save("result.jpg")

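Figure: IP-Adapter examples with the prompt "a cat".

Check out IP-Adapter to learn more about how IP-Adapters work.

Memory Optimisations for SD3

SD3 uses three text encoders, one of which is the very large T5-XXL model. This makes it challenging to run the model on GPUs with less than 24GB of VRAM, even when using fp16 precision. The following section outlines a few memory optimizations in Diffusers that make it easier to run SD3 on low resource hardware.

Running Inference with Model Offloading

The most basic memory optimization available in Diffusers allows you to offload the components of the model to the CPU during inference in order to save memory, at the cost of a slight increase in inference latency. Model offloading only moves a model component onto the GPU when it needs to be executed, while keeping the remaining components on the CPU.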

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="a photo of a cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    height=1024,
    width=1024,
    guidance_scale=7.0,
).images[0]

image.save("sd3_hello_world.png")

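Dropping the T5 Text Encoder during Inference

Removing the memory-intensive 4.7B parameter T5-XXL text encoder during inference can significantly decrease the memory requirements for SD3 with only a slight loss in performance.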

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    text_encoder_3=None,
    tokenizer_3=None,
    torch_dtype=torch.float16
)
pipe.to("cuda")

image = pipe(
    prompt="a photo of a cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    height=1024,
    width=1024,
    guidance_scale=7.0,
).images[0]

image.save("sd3_hello_world-no-T5.png")

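Using a Quantized Version of the T5 Text Encoder

We can leverage the bitsandbytes library to load and quantize the T5-XXL text encoder to 8-bit precision. This allows you to keep using all three text encoders while only slightly impacting performance.

First install the bitsandbytes library.

pip install bitsandbytes

Then load the T5-XXL model using the BitsAndBytesConfig.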

import torch
from diffusers import StableDiffusion3Pipeline
from transformers import T5EncoderModel, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_id = "stabilityai/stable-diffusion-3-medium-diffusers"
text_encoder = T5EncoderModel.from_pretrained(
    model_id,
    subfolder="text_encoder_3",
    quantization_config=quantization_config,
)
pipe = StableDiffusion3Pipeline.from_pretrained(
    model_id,
    text_encoder_3=text_encoder,
    device_map="balanced",
    torch_dtype=torch.float16
)

image = pipe(
    prompt="a photo of a cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    height=1024,
    width=1024,
    guidance_scale=7.0,
).images[0]

image.save("sd3_hello_world-8bit-T5.png")

You can find the end-to-end script here.
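
As a rough back-of-the-envelope check on why quantizing T5-XXL helps, the sketch below estimates the weight storage of a 4.7B-parameter model at different precisions. It assumes memory is dominated by the weights and ignores activations, quantization metadata and framework overhead, so real usage will differ. The helper name `weight_memory_gb` is hypothetical and not part of any library.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Estimate weight storage in GB (1 GB = 1e9 bytes); weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


t5_params = 4.7e9  # parameter count quoted above for the T5-XXL text encoder
for dtype in ("float32", "float16", "int8"):
    print(f"{dtype}: ~{weight_memory_gb(t5_params, dtype):.1f} GB")
```

Under these assumptions, moving from fp16 to 8-bit roughly halves the space taken by the T5 weights, which is why the quantized encoder is an alternative to dropping T5 entirely.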

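Performance Optimizations for SD3

Using Torch Compile to Speed Up Inference

Using compiled components in the SD3 pipeline can speed up inference by as much as 4X. The following code snippet demonstrates how to compile the Transformer and VAE components of the SD3 pipeline.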

import torch
from diffusers import StableDiffusion3Pipeline

torch.set_float32_matmul_precision("high")

torch._inductor.config.conv_1x1_as_mm = True
torch._inductor.config.coordinate_descent_tuning = True
torch._inductor.config.epilogue_fusion = False
torch._inductor.config.coordinate_descent_check_all_directions = True

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16
).to("cuda")
pipe.set_progress_bar_config(disable=True)

pipe.transformer.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)

pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)

# Warm Up
prompt = "a photo of a cat holding a sign that says hello world"
for _ in range(3):
    _ = pipe(prompt=prompt, generator=torch.manual_seed(1))

# Run Inference
image = pipe(prompt=prompt, generator=torch.manual_seed(1)).images[0]
image.save("sd3_hello_world.png")

Check out the full script here.
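
To check what speed-up compilation gives on your own hardware, a small timing helper along the following lines may be useful. This is a minimal sketch: `benchmark` is a hypothetical helper, and the workload timed here is a CPU stand-in so the snippet runs anywhere. It follows the same pattern as the snippet above, where a few warm-up calls are made before the measured run.

```python
import time
from statistics import mean


def benchmark(fn, warmup: int = 3, runs: int = 5) -> float:
    """Return the mean wall-clock seconds per call of `fn` after warm-up."""
    for _ in range(warmup):
        fn()  # warm-up calls absorb one-off costs such as compilation
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Stand-in workload so the sketch runs without a GPU; swap in a pipeline call.
avg = benchmark(lambda: sum(range(100_000)))
print(f"mean latency: {avg * 1e3:.2f} ms")
```

When timing GPU work directly, keep in mind that CUDA kernels launch asynchronously, so the device should be synchronized (for example with `torch.cuda.synchronize()`) before the clock is read.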

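Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have a varying impact on image quality depending on the model.

Refer to the Quantization overview to learn more about supported quantization backends and how to select a backend that suits your use case. The example below demonstrates how to load a quantized StableDiffusion3Pipeline for inference with bitsandbytes.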

import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, SD3Transformer2DModel, StableDiffusion3Pipeline
from transformers import BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    subfolder="text_encoder_3",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = SD3Transformer2DModel.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    text_encoder_3=text_encoder_8bit,  # the T5 encoder is the third text encoder
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "a tiny astronaut hatching from an egg on the moon"
image = pipeline(prompt, num_inference_steps=28, guidance_scale=7.0).images[0]
image.save("sd3.png")

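Using Long Prompts with the T5 Text Encoder

By default, the T5 Text Encoder prompt uses a maximum sequence length of 256. This can be adjusted by setting max_sequence_length to accept fewer or more tokens. Keep in mind that longer sequences require additional resources and result in longer generation times, such as during batch inference.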

prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature’s body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree.  As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight"

image = pipe(
    prompt=prompt,
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]

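Sending a different prompt to the T5 Text Encoder

You can send a different prompt to the CLIP Text Encoders and the T5 Text Encoder to prevent the prompt from being truncated by the CLIP Text Encoders and to improve generation.

The prompt for the CLIP Text Encoders is still truncated to the 77 token limit.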

prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. A river of warm, melted butter, pancake-like foliage in the background, a towering pepper mill standing in for a tree."

prompt_3 = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature’s body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree.  As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight"

image = pipe(
    prompt=prompt,
    prompt_3=prompt_3,
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]
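
The `__call__` parameter descriptions state that `prompt_2` and `prompt_3` fall back to `prompt` when they are not defined. The sketch below illustrates that rule in isolation. `resolve_prompts` is a hypothetical helper written for this illustration and is not part of diffusers.

```python
from typing import Optional, Tuple


def resolve_prompts(
    prompt: str,
    prompt_2: Optional[str] = None,
    prompt_3: Optional[str] = None,
) -> Tuple[str, str, str]:
    """Return the text that each of the three encoders would receive."""
    # Unset prompts fall back to the main prompt, as documented for
    # `prompt_2` and `prompt_3`.
    return prompt, prompt_2 or prompt, prompt_3 or prompt


clip_1, clip_2, t5 = resolve_prompts("a short prompt", prompt_3="a much longer prompt")
print(clip_1, "|", clip_2, "|", t5)
```

In the example above, only the T5 encoder sees the long description, while both CLIP encoders receive the short prompt.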

Tiny AutoEncoder for Stable Diffusion 3

Tiny AutoEncoder for Stable Diffusion (TAESD3) is a tiny distilled version of Stable Diffusion 3's VAE by Ollin Boer Bohan that can decode StableDiffusion3Pipeline latents almost instantly.

To use with Stable Diffusion 3:

import torch
from diffusers import StableDiffusion3Pipeline, AutoencoderTiny

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd3", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "slice of delicious New York-style berry cheesecake"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("cheesecake.png")

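Loading the original checkpoints via from_single_file

The SD3Transformer2DModel and StableDiffusion3Pipeline classes support loading the original checkpoints via the from_single_file method. This method allows you to load the original checkpoint files that were used to train the models.

Loading the original checkpoints for the SD3Transformer2DModel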

from diffusers import SD3Transformer2DModel

model = SD3Transformer2DModel.from_single_file("https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium.safetensors")

Loading the single checkpoint for the StableDiffusion3Pipeline

Loading the single file checkpoint without T5

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_single_file(
    "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips.safetensors",
    torch_dtype=torch.float16,
    text_encoder_3=None
)
pipe.enable_model_cpu_offload()

image = pipe("a picture of a cat holding a sign that says hello world").images[0]
image.save('sd3-single-file.png')

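Loading the single file checkpoint with T5

The following example loads a checkpoint stored in an 8-bit floating point format, which requires PyTorch 2.3 or later.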

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_single_file(
    "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips_t5xxlfp8.safetensors",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

image = pipe("a picture of a cat holding a sign that says hello world").images[0]
image.save('sd3-single-file-t5-fp8.png')

Loading the single file checkpoint for the Stable Diffusion 3.5 Transformer Model

import torch
from diffusers import SD3Transformer2DModel, StableDiffusion3Pipeline

transformer = SD3Transformer2DModel.from_single_file(
    "https://huggingface.co/stabilityai/stable-diffusion-3.5-large-turbo/blob/main/sd3.5_large.safetensors",
    torch_dtype=torch.bfloat16,
)
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
image = pipe("a cat holding a sign that says hello world").images[0]
image.save("sd35.png")

StableDiffusion3Pipeline

class diffusers.StableDiffusion3Pipeline

( transformer: SD3Transformer2DModel scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKL text_encoder: CLIPTextModelWithProjection tokenizer: CLIPTokenizer text_encoder_2: CLIPTextModelWithProjection tokenizer_2: CLIPTokenizer text_encoder_3: T5EncoderModel tokenizer_3: T5TokenizerFast image_encoder: SiglipVisionModel = None feature_extractor: SiglipImageProcessor = None )

Parameters

transformer (SD3Transformer2DModel) — Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModelWithProjection) — CLIP, specifically the clip-vit-large-patch14 variant, with an additional projection layer that is initialized with a diagonal matrix with `hidden_size` as its dimension.
text_encoder_2 (CLIPTextModelWithProjection) — CLIP, specifically the laion/CLIP-ViT-bigG-14-laion2B-39B-b160k variant.
text_encoder_3 (T5EncoderModel) — Frozen text encoder. Stable Diffusion 3 uses T5, specifically the t5-v1_1-xxl variant.
tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
tokenizer_2 (CLIPTokenizer) — Second tokenizer of class CLIPTokenizer.
tokenizer_3 (T5TokenizerFast) — Tokenizer of class T5Tokenizer.
image_encoder (SiglipVisionModel, optional) — Pre-trained vision model for IP Adapter.
feature_extractor (SiglipImageProcessor, optional) — Image processor for IP Adapter.

__call__

( prompt: typing.Union[str, typing.List[str]] = None prompt_2: typing.Union[str, typing.List[str], NoneType] = None prompt_3: typing.Union[str, typing.List[str], NoneType] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 28 sigmas: typing.Optional[typing.List[float]] = None guidance_scale: float = 7.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_3: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True joint_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 256 skip_guidance_layers: typing.List[int] = None skip_layer_guidance_scale: float = 2.8 skip_layer_guidance_stop: float = 0.2 skip_layer_guidance_start: float = 0.01 mu: typing.Optional[float] = None ) → ~pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput or tuple

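Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, `prompt_embeds` must be passed instead.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` will be used instead.
prompt_3 (str or List[str], optional) — The prompt or prompts to be sent to `tokenizer_3` and `text_encoder_3`. If not defined, `prompt` will be used instead.
height (int, optional, defaults to self.transformer.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image. This is set to 1024 by default for the best results.
width (int, optional, defaults to self.transformer.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image. This is set to 1024 by default for the best results.
num_inference_steps (int, optional, defaults to 28) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers that support a `sigmas` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used.
guidance_scale (float, optional, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. `guidance_scale` is defined as `w` of equation 2 of the Imagen Paper. Guidance is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages generating images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, `negative_prompt_embeds` must be passed instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to `tokenizer_2` and `text_encoder_2`. If not defined, `negative_prompt` will be used instead.
negative_prompt_3 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to `tokenizer_3` and `text_encoder_3`. If not defined, `negative_prompt` will be used instead.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the `negative_prompt` input argument.
pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated from the `prompt` input argument.
negative_pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from the `negative_prompt` input argument.
ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
ip_adapter_image_embeds (torch.Tensor, optional) — Pre-generated image embeddings for IP-Adapter. Should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should contain the negative image embedding if `do_classifier_free_guidance` is set to `True`. If not provided, embeddings are computed from the `ip_adapter_image` input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL (PIL.Image.Image) and np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a `~pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput` instead of a plain tuple.
joint_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the `AttentionProcessor` as defined under `self.processor` in diffusers.models.attention_processor.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as the `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
max_sequence_length (int, defaults to 256) — Maximum sequence length to use with the `prompt`.
skip_guidance_layers (List[int], optional) — A list of integers that specify layers to skip during guidance. If not provided, all layers will be used for guidance. If provided, the guidance will only be applied to the layers specified in the list. The value recommended by StabilityAI for Stable Diffusion 3.5 Medium is [7, 8, 9].
skip_layer_guidance_scale (float, optional) — The scale of the guidance for the layers specified in `skip_guidance_layers`. The guidance will be applied to the layers specified in `skip_guidance_layers` with a scale of `skip_layer_guidance_scale`. The guidance will be applied to the rest of the layers with a scale of `1`.
skip_layer_guidance_stop (float, optional) — The point at which the guidance for the layers specified in `skip_guidance_layers` will stop. The guidance will be applied to the layers specified in `skip_guidance_layers` until the fraction of steps specified in `skip_layer_guidance_stop`. The value recommended by StabilityAI for Stable Diffusion 3.5 Medium is 0.2.
skip_layer_guidance_start (float, optional) — The point at which the guidance for the layers specified in `skip_guidance_layers` will start. The guidance will be applied to the layers specified in `skip_guidance_layers` from the fraction of steps specified in `skip_layer_guidance_start`. The value recommended by StabilityAI for Stable Diffusion 3.5 Medium is 0.01.
mu (float, optional) — `mu` value used for `dynamic_shifting`.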
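
To make the start and stop fractions concrete, the sketch below lists which denoising steps would fall inside the skip-layer guidance window for a given step count. It assumes the window is the set of step indices `i` with `num_inference_steps * start < i < num_inference_steps * stop`. That comparison is an assumption about the implementation rather than something stated in the parameter descriptions, and `skip_layer_steps` is a hypothetical helper.

```python
from typing import List


def skip_layer_steps(
    num_inference_steps: int, start: float = 0.01, stop: float = 0.2
) -> List[int]:
    """Step indices that fall strictly inside the (start, stop) fraction window."""
    lower = num_inference_steps * start
    upper = num_inference_steps * stop
    return [i for i in range(num_inference_steps) if lower < i < upper]


# Defaults from the signature: 28 steps, start=0.01, stop=0.2.
print(skip_layer_steps(28))  # [1, 2, 3, 4, 5]
```

Under this assumption, the default settings restrict skip-layer guidance to a handful of early steps.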

~pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput or tuple

`~pipelines.stable_diffusion_3.StableDiffusion3PipelineOutput` if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.

Function invoked when calling the pipeline for generation.
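
The `callback_on_step_end` hook from the parameter list can be sketched as follows. The function name `log_latents` is hypothetical, and the loop at the end is a dry run with placeholder values so the snippet runs without downloading a model. The sketch assumes the callback should return the (possibly modified) `callback_kwargs` dictionary.

```python
from typing import Any, Dict, List

seen_steps: List[int] = []


def log_latents(
    pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]
) -> Dict[str, Any]:
    """Record each step index and hand the tensors back unchanged."""
    seen_steps.append(step)
    # Only tensors named in `callback_on_step_end_tensor_inputs` are present.
    _ = callback_kwargs.get("latents")
    return callback_kwargs


# Dry run with placeholder values instead of a real pipeline and tensors.
for step, timestep in enumerate([1000, 500, 0]):
    log_latents(None, step, timestep, {"latents": [0.0]})

print(seen_steps)  # [0, 1, 2]
```

With a loaded pipeline, such a function would be passed as `pipe(prompt, callback_on_step_end=log_latents, callback_on_step_end_tensor_inputs=["latents"])`.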

Examples:

>>> import torch
>>> from diffusers import StableDiffusion3Pipeline

>>> pipe = StableDiffusion3Pipeline.from_pretrained(
...     "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
... )
>>> pipe.to("cuda")
>>> prompt = "A cat holding a sign that says hello world"
>>> image = pipe(prompt).images[0]
>>> image.save("sd3.png")

encode_image

( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] device: device ) → torch.Tensor

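Parameters

image (PipelineImageInput) — Input image to be encoded.
device (torch.device) — Torch device.

Returns torch.Tensor

The encoded image feature representation.

Encodes the given image into a feature representation using a pre-trained image encoder.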

encode_prompt

( prompt: typing.Union[str, typing.List[str]] prompt_2: typing.Union[str, typing.List[str]] prompt_3: typing.Union[str, typing.List[str]] device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_3: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None clip_skip: typing.Optional[int] = None max_sequence_length: int = 256 lora_scale: typing.Optional[float] = None )

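Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` is used in all text encoders.
prompt_3 (str or List[str], optional) — The prompt or prompts to be sent to `tokenizer_3` and `text_encoder_3`. If not defined, `prompt` is used in all text encoders.
device (torch.device) — Torch device.
num_images_per_prompt (int) — Number of images that should be generated per prompt.
do_classifier_free_guidance (bool) — Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, `negative_prompt_embeds` must be passed instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to `tokenizer_2` and `text_encoder_2`. If not defined, `negative_prompt` is used in all text encoders.
negative_prompt_3 (str or List[str], optional) — The prompt or prompts not to guide the image generation, to be sent to `tokenizer_3` and `text_encoder_3`. If not defined, `negative_prompt` is used in all text encoders.
prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the `prompt` input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the `negative_prompt` input argument.
pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated from the `prompt` input argument.
negative_pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from the `negative_prompt` input argument.
clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
lora_scale (float, optional) — A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.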

prepare_ip_adapter_image_embeds

( ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None ip_adapter_image_embeds: typing.Optional[torch.Tensor] = None device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True )

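Parameters

ip_adapter_image (PipelineImageInput, optional) — The input image to extract features from for IP-Adapter.
ip_adapter_image_embeds (torch.Tensor, optional) — Precomputed image embeddings.
device (torch.device, optional) — Torch device.
num_images_per_prompt (int, defaults to 1) — Number of images that should be generated per prompt.
do_classifier_free_guidance (bool, defaults to True) — Whether to use classifier-free guidance or not.

Prepares image embeddings for use in the IP-Adapter.

Either `ip_adapter_image` or `ip_adapter_image_embeds` must be passed.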
