Text-to-image

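When you think of diffusion models, text-to-image is usually one of the first things that comes to mind. Text-to-image generates an image from a text description (for example, "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"), which is also known as a *prompt*.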

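From a high level, a diffusion model takes a prompt and some random initial noise, and iteratively removes the noise to construct an image. The *denoising* process is guided by the prompt, and once it ends after a predetermined number of time steps, the image representation is decoded into an image.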
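The loop structure described above can be illustrated with a toy sketch. This is not a real diffusion model: the `target` list is a hypothetical stand-in for "the image the prompt describes", and the "noise prediction" is simply the gap between the current sample and that target, where a real pipeline would call a trained network.

```python
import random


def toy_denoise(target, num_steps=50, seed=0):
    """Toy illustration of iterative denoising (not a real diffusion model)."""
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per "pixel".
    sample = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(num_steps):
        remaining = num_steps - step
        # Stand-in for the model: the predicted noise is the gap to the target.
        predicted_noise = [s - t for s, t in zip(sample, target)]
        # Remove a fraction of the predicted noise at every step.
        sample = [s - n / remaining for s, n in zip(sample, predicted_noise)]
    return sample


target = [0.2, 0.8, 0.5]
result = toy_denoise(target)
```

After the final step the sample matches the target up to floating-point error, which mirrors how the real process ends with a clean image representation once the predetermined number of steps is exhausted.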

Read the How does Stable Diffusion work? blog post to learn more about how a latent diffusion model works.

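You can generate images from a prompt in 🤗 Diffusers in two steps:

Load a checkpoint into the AutoPipelineForText2Image class, which automatically detects the appropriate pipeline class to use based on the checkpoint.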

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
	"stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

Pass a prompt to the pipeline to generate an image.

image = pipeline(
	"stained glass of darth vader, backlight, centered composition, masterpiece, photorealistic, 8k"
).images[0]
image

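Popular models

The most common text-to-image models are Stable Diffusion v1.5, Stable Diffusion XL (SDXL), and Kandinsky 2.2. There are also ControlNet models or adapters that can be used with text-to-image models for more direct control over image generation. The results from each model differ slightly because of their architecture and training process, but no matter which model you choose, their usage is more or less the same. Let's use the same prompt for each model and compare their results.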

Stable Diffusion v1.5

Stable Diffusion v1.5 is a latent diffusion model initialized from Stable Diffusion v1-4 and finetuned for 595K steps on 512x512 images from the LAION-Aesthetics V2 dataset. You can use this model like this:

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
	"stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
generator = torch.Generator("cuda").manual_seed(31)
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", generator=generator).images[0]
image

Stable Diffusion XL

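SDXL is a larger version of the previous Stable Diffusion models, and it involves a two-stage model process that adds even more detail to an image. It also includes some additional *micro-conditionings* to generate high-quality images centered on their subjects. Take a look at the more comprehensive SDXL guide to learn how to use it. In general, you can use SDXL like this: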

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
generator = torch.Generator("cuda").manual_seed(31)
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", generator=generator).images[0]
image

Kandinsky 2.2

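The Kandinsky model is a bit different from the Stable Diffusion models because it also uses an image prior model to create embeddings, which are used to better align text and images in the diffusion model.

The easiest way to use Kandinsky 2.2 is: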

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
	"kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
).to("cuda")
generator = torch.Generator("cuda").manual_seed(31)
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", generator=generator).images[0]
image

ControlNet

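ControlNet models are auxiliary models or adapters that are finetuned on top of text-to-image models, such as Stable Diffusion v1.5. Using ControlNet models in combination with text-to-image models offers diverse options for more explicit control over how an image is generated. With ControlNet, you add an additional conditioning input image to the model. For example, if you provide an image of a human pose (usually represented as multiple keypoints connected into a skeleton) as a conditioning input, the model generates an image that follows the pose in that image. Check out the more in-depth ControlNet guide to learn more about other conditioning inputs and how to use them.

In this example, let's condition the ControlNet with a human pose estimation image. Load the ControlNet model pretrained on human pose estimations: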

from diffusers import ControlNetModel, AutoPipelineForText2Image
from diffusers.utils import load_image
import torch

controlnet = ControlNetModel.from_pretrained(
	"lllyasviel/control_v11p_sd15_openpose", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
pose_image = load_image("https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png")

Pass the `controlnet` to AutoPipelineForText2Image, and provide the prompt and the pose estimation image:

pipeline = AutoPipelineForText2Image.from_pretrained(
	"stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, variant="fp16"
).to("cuda")
generator = torch.Generator("cuda").manual_seed(31)
image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", image=pose_image, generator=generator).images[0]
image

[Figure: images generated from the same prompt with Stable Diffusion v1.5, Stable Diffusion XL, Kandinsky 2.2, and ControlNet (pose conditioning)]

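Configure pipeline parameters

There are a number of parameters that can be configured in the pipeline that affect how an image is generated. You can change the image's output size, specify a negative prompt to improve image quality, and more. This section dives deeper into how to use these parameters.

Height and width

The `height` and `width` parameters control the height and width (in pixels) of the generated image. By default, the Stable Diffusion v1.5 model outputs 512x512 images, but you can change this to any size that is a multiple of 8. For example, to create a rectangular image: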

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
	"stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
image = pipeline(
	"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", height=768, width=512
).images[0]
image

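Other models may have different default image sizes depending on the image sizes in the training dataset. For example, SDXL's default image size is 1024x1024, and using lower `height` and `width` values may result in lower quality images. Make sure you check the model's API reference first!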
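Because the size must be a multiple of 8, it can help to snap arbitrary dimensions before passing them to the pipeline. The helper below is a minimal sketch; `snap_to_multiple` is a hypothetical name and is not part of 🤗 Diffusers.

```python
def snap_to_multiple(value, multiple=8):
    """Round a pixel dimension down to the nearest multiple (hypothetical helper)."""
    if value < multiple:
        raise ValueError(f"value must be at least {multiple}, got {value}")
    return (value // multiple) * multiple


# Snap a requested size before using it as `height` and `width`.
height = snap_to_multiple(770)
width = snap_to_multiple(515)
```

The snapped values could then be passed as `height=height, width=width` in the pipeline call shown earlier in this section.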

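Guidance scale

The `guidance_scale` parameter affects how much the prompt influences image generation. A lower value gives the model "creativity" to generate images that are more loosely related to the prompt. Higher `guidance_scale` values push the model to follow the prompt more closely, and if this value is too high, you may observe some artifacts in the generated image.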

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
	"stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = pipeline(
	"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", guidance_scale=3.5
).images[0]
image

[Figure: images generated with guidance_scale = 2.5, guidance_scale = 7.5, and guidance_scale = 10.5]
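To see why larger values follow the prompt more closely, it may help to look at how the scale is commonly applied. The sketch below assumes classifier-free guidance, where the model makes one noise prediction without the prompt and one with it, and the scale extrapolates from the first toward the second. The function name and the plain Python lists are illustrative assumptions; a real pipeline works on tensors.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Combine unconditional and prompt-conditioned noise predictions.

    Illustrative sketch of classifier-free guidance on plain lists.
    """
    return [
        u + guidance_scale * (t - u)
        for u, t in zip(noise_uncond, noise_text)
    ]


noise_uncond = [0.0, 1.0]  # prediction made without the prompt
noise_text = [1.0, 3.0]    # prediction made with the prompt

weak = apply_guidance(noise_uncond, noise_text, guidance_scale=1.0)
strong = apply_guidance(noise_uncond, noise_text, guidance_scale=7.5)
```

With a scale of 1.0 the result equals the prompt-conditioned prediction, while larger scales push the result further in the direction of the prompt, which is consistent with the artifacts that can appear at very high values.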

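Negative prompt

Just like a prompt guides generation, a *negative prompt* steers the model away from things you don't want it to generate. This is commonly used to improve overall image quality by removing poor or flawed image features such as "low resolution" or "bad details". You can also use a negative prompt to remove or modify the content and style of an image.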

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
	"stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = pipeline(
	prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
	negative_prompt="ugly, deformed, disfigured, poor details, bad anatomy",
).images[0]
image

[Figure: images generated with negative_prompt = "ugly, deformed, disfigured, poor details, bad anatomy" and negative_prompt = "astronaut"]

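Generator

A torch.Generator object enables reproducibility in a pipeline by setting a manual seed. You can use a `Generator` to generate batches of images and iteratively improve on an image generated from a seed, as detailed in the Improve image quality with deterministic generation guide.

You can set a seed and a `Generator` as shown below. Creating an image with a seeded `Generator` should return the same result each time instead of randomly generating a new image.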

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
	"stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
generator = torch.Generator(device="cuda").manual_seed(30)
image = pipeline(
	"Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
	generator=generator,
).images[0]
image
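The idea behind seeding can be shown without a GPU. The sketch below uses Python's standard-library `random.Random` as a stand-in for `torch.Generator`, so it only illustrates the principle that the same seed yields the same initial noise; `sample_initial_noise` is a hypothetical helper and not a Diffusers function.

```python
import random


def sample_initial_noise(seed, num_values=4):
    """Draw "initial noise" from a seeded generator (stand-in for torch.Generator)."""
    # A fresh generator is created for every call so the draw starts from the seed.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_values)]


same_a = sample_initial_noise(seed=30)
same_b = sample_initial_noise(seed=30)
different = sample_initial_noise(seed=31)

print(same_a == same_b)     # same seed, same noise
print(same_a == different)  # different seed, different noise
```

The same reasoning suggests re-creating or re-seeding the `Generator` before each pipeline call, since drawing from a generator advances its internal state.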

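Control image generation

In addition to configuring a pipeline's parameters, there are several ways to exert more control over how an image is generated, such as prompt weighting and ControlNet models.

Prompt weighting

Prompt weighting is a technique for increasing or decreasing the importance of concepts in a prompt to emphasize or minimize certain features in an image. We recommend using the Compel library to help you generate the weighted prompt embeddings.

Learn how to create the prompt embeddings in the Prompt weighting guide. This example focuses on how to use the prompt embeddings in the pipeline.

Once you've created the embeddings, you can pass them to the `prompt_embeds` parameter (and to `negative_prompt_embeds` if you're using a negative prompt) in the pipeline.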

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
	"stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = pipeline(
	prompt_embeds=prompt_embeds, # generated from Compel
	negative_prompt_embeds=negative_prompt_embeds, # generated from Compel
).images[0]

ControlNet

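As you saw in the ControlNet section, these models offer a more flexible and accurate way to generate images by incorporating an additional conditioning image input. Each ControlNet model is pretrained on a particular type of conditioning image to generate new images that resemble it. For example, if you take a ControlNet model pretrained on depth maps, you can give the model a depth map as a conditioning input and it will generate an image that preserves the spatial information in it. This is quicker and easier than specifying the depth information in a prompt. You can even combine multiple conditioning inputs with a MultiControlNet!

There are many types of conditioning inputs you can use, and 🤗 Diffusers supports ControlNet for Stable Diffusion and SDXL models. Take a look at the more comprehensive ControlNet guide to learn how you can use these models.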

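Optimize

Diffusion models are large, and the iterative nature of denoising an image is computationally expensive and intensive. But this doesn't mean you need access to powerful, or even multiple, GPUs to use them. There are many optimization techniques for running diffusion models on consumer and free-tier resources. For example, you can load model weights in half-precision to save GPU memory and increase speed, or offload the model to the CPU to save even more GPU memory.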
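The saving from half-precision follows from simple arithmetic: a float32 weight takes 4 bytes and a float16 weight takes 2. The sketch below estimates the memory needed for weights alone; the parameter count is an illustrative assumption, not the size of any particular checkpoint, and activations and other buffers are ignored.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(num_params, dtype):
    """Estimate the memory taken by model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


num_params = 1_000_000_000  # illustrative parameter count (assumption)
full = weight_memory_gib(num_params, "float32")
half = weight_memory_gib(num_params, "float16")

print(f"float32: {full:.2f} GiB, float16: {half:.2f} GiB")
```

Whatever the parameter count, the float16 estimate is half of the float32 one, which is why passing `torch_dtype=torch.float16` appears throughout the examples in this guide.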

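PyTorch 2.0 also supports a more memory-efficient attention mechanism called scaled dot product attention, which is enabled automatically if you're using PyTorch 2.0. You can combine it with torch.compile to speed up your code even more: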

from diffusers import AutoPipelineForText2Image
import torch

pipeline = AutoPipelineForText2Image.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16").to("cuda")
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)

For more tips on how to optimize your code to save memory and speed up inference, read the Accelerate inference and Reduce memory usage guides.
