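Introduction to Stable Diffusion

This chapter introduces the building blocks of Stable Diffusion, a generative artificial intelligence (generative AI) model that produces unique photorealistic images from text and image prompts. It was first released in 2022, through a collaboration between Stability AI, RunwayML, and the CompVis group at LMU Munich that followed the publication of the paper.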

What will you learn from this chapter?

- The fundamental components of Stable Diffusion
- How to use the text-to-image, image-to-image, and inpainting pipelines

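How does Stable Diffusion work?

To keep this section engaging, we will answer a series of questions that cover the fundamental components of the Stable Diffusion process. We discuss each component only briefly, since they are already covered in our Diffusers course. You can also visit our previous chapters, which cover GANs and diffusion models in detail.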

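What strategies does Stable Diffusion use to learn new information?
- It uses the forward and reverse processes of diffusion models. In the forward process, we add Gaussian noise to an image until only random noise remains. The final noisy version of the image is usually unrecognizable.
- In the reverse process, a trained neural network gradually denoises an image, starting from pure noise, until an actual image emerges.

Both processes run for a finite number of steps T (T=1000 in the DDPM paper). The process starts at time $t_0$ by sampling a real image from the data distribution. At each time step t, the forward process samples some noise from a Gaussian distribution and adds it to the image from the previous time step. For more mathematical intuition, read the Hugging Face blog post on diffusion models.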
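
The forward process described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a training-ready implementation: it assumes the linear noise schedule from the DDPM paper (beta rising from 1e-4 to 0.02 over T=1000 steps), treats an image as a flat list of pixel values, and uses hypothetical helper names (`linear_beta_schedule`, `forward_step`, `forward_process`, `signal_fraction`).

```python
import math
import random


def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Noise variances beta_1..beta_T, evenly spaced (assumed DDPM linear schedule)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def forward_step(x_prev, beta_t, rng):
    """One forward step: shrink the previous image slightly and add Gaussian noise."""
    keep = math.sqrt(1.0 - beta_t)
    noise_scale = math.sqrt(beta_t)
    return [keep * v + noise_scale * rng.gauss(0.0, 1.0) for v in x_prev]


def forward_process(x0, betas, rng):
    """Apply every forward step in order, starting from the clean image x0."""
    x = list(x0)
    for beta_t in betas:
        x = forward_step(x, beta_t, rng)
    return x


def signal_fraction(betas):
    """Cumulative product of (1 - beta_t): the share of the original signal's variance left."""
    alpha_bar = 1.0
    for beta_t in betas:
        alpha_bar *= 1.0 - beta_t
    return alpha_bar


if __name__ == "__main__":
    betas = linear_beta_schedule()
    toy_image = [1.0] * 256                  # a flat, all-white toy "image"
    noisy = forward_process(toy_image, betas, random.Random(0))
    print(signal_fraction(betas))            # very close to zero after T steps
    print(sum(noisy) / len(noisy))           # sample mean of the fully noised image
```

Under this schedule, `signal_fraction(betas)` falls below 1e-4, so almost nothing of the original image survives the last step, which is why the final noisy version is unrecognizable.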

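Since our images can be large, how do we compress them?

Large images take more computing power to process, and this becomes especially apparent in an operation called self-attention. The required computation grows quadratically with the number of pixels. For example, an image that is 128 pixels wide and high has four times as many pixels as one that is 64 pixels wide and high. Because of how self-attention works, processing the larger image takes not four times the memory and compute but sixteen times (4 times 4 equals 16). This makes high-resolution images challenging to process, since they demand substantial resources. Latent diffusion models address this by using a Variational Auto-Encoder (VAE) to shrink images to a more manageable size. The idea is that many images contain repeated or unnecessary information. After training on a large amount of data, a VAE can compress an image into a smaller, more compact form that still retains the essential features of the original.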
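
The arithmetic above can be checked directly. The sketch below counts the pairwise comparisons that self-attention performs when every pixel is treated as one token, and shows how much smaller a latent is than the image it encodes. The downsampling factor of 8 and the 4 latent channels are assumptions based on the commonly used Stable Diffusion VAE configuration; they are not stated in this chapter, and the function names are illustrative.

```python
def attention_pairs(height, width):
    """Number of token pairs self-attention compares when each pixel is one token."""
    tokens = height * width
    return tokens * tokens


def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (channels, height, width) of the latent under the assumed VAE settings."""
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, image_channels=3, downsample=8, latent_channels=4):
    """How many times fewer values the latent holds than the RGB image."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (image_channels * height * width) / (c * h * w)


if __name__ == "__main__":
    small = attention_pairs(64, 64)
    large = attention_pairs(128, 128)
    print(large // small)               # 16: four times the pixels, sixteen times the pairs
    print(latent_shape(512, 512))       # (4, 64, 64)
    print(compression_ratio(512, 512))  # 48.0
```

Running the diffusion process on the 64x64 latent instead of the 512x512 image is what keeps the self-attention cost manageable.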

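Since we are using prompts, how do we fuse text with images?

At inference time, we feed in a description of the image we want to see together with some pure noise as a starting point, and the model does its best to "denoise" the random input into something that matches the description. SD uses a pre-trained transformer model based on CLIP. CLIP's text encoder was designed to process image captions into a form that can be used to compare images and text, so it is well suited to building useful representations from image descriptions. The input prompt is first tokenized (based on a large vocabulary in which each word or sub-word is assigned a specific token) and then passed through the CLIP text encoder, which produces a 768-dimensional (SD 1.X) or 1024-dimensional (SD 2.X) vector for each token. For consistency, prompts are always padded or truncated to 77 tokens, so the final representation used as conditioning is a tensor of shape 77x768 (SD 1.X) or 77x1024 (SD 2.X) per prompt.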
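
The padding and truncation step can be illustrated without loading a real tokenizer. The sketch below is an assumption-laden toy: `pad_or_truncate` and `conditioning_shape` are hypothetical helpers, and the default start, end, and padding ids are placeholders modeled on the CLIP vocabulary rather than values taken from this chapter. A real pipeline would delegate this work to its tokenizer.

```python
def pad_or_truncate(token_ids, max_length=77, bos_id=49406, eos_id=49407, pad_id=49407):
    """Return exactly max_length ids: start marker, prompt tokens, end marker, padding."""
    body = list(token_ids)[: max_length - 2]   # leave room for the start and end markers
    sequence = [bos_id] + body + [eos_id]
    sequence += [pad_id] * (max_length - len(sequence))
    return sequence


def conditioning_shape(max_length=77, embedding_dim=768):
    """Per-prompt shape of the text conditioning (768 for SD 1.X, 1024 for SD 2.X)."""
    return (max_length, embedding_dim)


if __name__ == "__main__":
    short_prompt = [320, 1125]                 # made-up token ids for a two-token prompt
    long_prompt = list(range(1000, 1200))      # made-up ids for a 200-token prompt
    print(len(pad_or_truncate(short_prompt)))  # 77
    print(len(pad_or_truncate(long_prompt)))   # 77
    print(conditioning_shape(embedding_dim=1024))  # (77, 1024)
```

Because every prompt ends up with the same length, the U-Net always receives conditioning of a fixed shape, whatever the length of the original text.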

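How do we add a good inductive bias?

Because we are trying to generate something new (for example, a realistic Pokémon), we need a way to go beyond the images we have seen before (for example, an anime Pokémon). This is where the U-Net and self-attention come into play. Given a noisy version of an image, the model's task is to predict the denoised version based on additional clues, such as a text description of the image. How do we actually feed this conditioning information into the U-Net so it can use it when making predictions? The answer is cross-attention. Cross-attention layers are scattered throughout the U-Net, and each spatial location in the U-Net can "attend" to different tokens in the text conditioning, bringing in relevant information from the prompt.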
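
To make "each spatial location attends to the text tokens" concrete, here is a minimal sketch of scaled dot-product cross-attention in plain Python. It is a simplification under stated assumptions: the learned query, key, and value projections and the multiple heads of a real U-Net layer are omitted, vectors are plain lists, and `cross_attention` is a hypothetical helper name.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """queries: one vector per spatial location; keys and values: one vector per text token."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity between this spatial location and every text token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        # Weighted mix of the token values, pulled into this spatial location
        mixed = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        weights.append(w)
        outputs.append(mixed)
    return outputs, weights


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]           # two spatial locations
    keys = [[10.0, 0.0], [0.0, 10.0]]            # two text tokens
    values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # information carried by each token
    outputs, weights = cross_attention(queries, keys, values)
    print(weights)   # each location puts nearly all of its weight on a different token
```

The queries come from the image side and the keys and values from the text side, which is what distinguishes cross-attention from self-attention.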

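How to use text-to-image, image-to-image, and inpainting models in Diffusers

This section covers useful use cases and how to perform these tasks with the Diffusers library.

Text-to-image inference steps: the idea is to pass in a text prompt and convert it into an output image.

With the diffusers library, you can get text-to-image working in 2 steps.

First, install the diffusers library.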

pip install diffusers

Next, initialize the pipeline, pass in the prompt, and run inference.

from diffusers import AutoPipelineForText2Image
import torch

# Load the pretrained pipeline in half precision and move it to the GPU
pipeline = AutoPipelineForText2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16"
).to("cuda")
# Seed the random number generator so the result is reproducible
generator = torch.Generator(device="cuda").manual_seed(31)
# Run inference and take the first generated image
image = pipeline(
    "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k",
    generator=generator,
).images[0]

Image-to-image inference steps: similarly, we initialize the pipeline, but pass in both an image and a text prompt.

import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image, make_image_grid

pipeline = AutoPipelineForImage2Image.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()

# Load an image to pass to the pipeline:
init_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"
)

# Pass a prompt and image to the pipeline to generate an image:
prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
image = pipeline(prompt, image=init_image).images[0]
make_image_grid([init_image, image], rows=1, cols=2)
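
For some intuition on how image-to-image reuses the input image: image-to-image pipelines in Diffusers accept a `strength` argument between 0 and 1 (not used in the example above) that controls how much noise is added to the initial image before denoising starts. The sketch below mirrors the way the number of executed denoising steps is commonly derived from `strength`. The helper name is hypothetical, and the exact behavior may differ between pipelines and library versions.

```python
def denoising_steps_to_run(num_inference_steps, strength):
    """Steps actually executed when starting from a partially noised input image."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


if __name__ == "__main__":
    print(denoising_steps_to_run(50, 0.5))  # 25: half the schedule, output stays close to the input
    print(denoising_steps_to_run(50, 1.0))  # 50: the input image is essentially ignored
```

Lower values keep the result closer to the input image, while values near 1 give the prompt more freedom.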

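Inpainting steps: for the inpainting pipeline, we need to pass in an image, a text prompt, and a mask based on an object in the image, which indicates the region to inpaint. In this example, we also pass in a negative prompt to steer inference away from features we want to avoid.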

# Load the pipeline
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image, make_image_grid

pipeline = AutoPipelineForInpainting.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()

# Load the base and mask images:
init_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
)
mask_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png"
)

# Create a prompt to inpaint the image with and pass it to the pipeline with the base and mask images:
prompt = (
    "a black cat with glowing eyes, cute, adorable, disney, pixar, highly detailed, 8k"
)
negative_prompt = "bad anatomy, deformed, ugly, disfigured"
image = pipeline(
    prompt=prompt,
    negative_prompt=negative_prompt,
    image=init_image,
    mask_image=mask_image,
).images[0]
make_image_grid([init_image, mask_image, image], rows=1, cols=3)
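
Conceptually, the mask tells the pipeline which pixels it may change. The sketch below illustrates that idea on tiny grayscale images stored as nested lists, assuming the convention that white mask pixels (value 1.0) mark the region to repaint and black pixels (value 0.0) mark the region to keep. It is a simplified illustration of masking, not the pipeline's internal implementation, and `composite_with_mask` is a hypothetical helper.

```python
def composite_with_mask(original, generated, mask):
    """Take generated pixels where the mask is 1.0 and original pixels where it is 0.0."""
    return [
        [m * g + (1.0 - m) * o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


if __name__ == "__main__":
    original = [[0.2, 0.2], [0.2, 0.2]]    # toy 2x2 input image
    generated = [[0.9, 0.9], [0.9, 0.9]]   # toy 2x2 model output
    mask = [[1.0, 0.0], [0.0, 1.0]]        # repaint the two diagonal pixels only
    print(composite_with_mask(original, generated, mask))  # [[0.9, 0.2], [0.2, 0.9]]
```

Pixels outside the mask are left untouched, which is why the area around the repainted object stays consistent with the input image.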

Further reading