Understanding pipelines, models and schedulers

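🧨 Diffusers is designed to be a user-friendly and flexible toolbox for building diffusion systems tailored to your use case. At the core of the toolbox are models and schedulers. While the DiffusionPipeline bundles these components together for convenience, you can also unbundle the pipeline and use the models and schedulers separately to create new diffusion systems.

In this tutorial, you'll learn how to use models and schedulers to assemble a diffusion system for inference, starting with a basic pipeline and then progressing to the Stable Diffusion pipeline.

Deconstruct a basic pipeline

A pipeline is a quick and easy way to run a model for inference, requiring no more than four lines of code to generate an image: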

>>> from diffusers import DDPMPipeline

>>> ddpm = DDPMPipeline.from_pretrained("google/ddpm-cat-256", use_safetensors=True).to("cuda")
>>> image = ddpm(num_inference_steps=25).images[0]
>>> image

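That was easy, but how did the pipeline do it? Let's break down the pipeline and look at what happens under the hood.

In the example above, the pipeline contains a UNet2DModel model and a DDPMScheduler. The pipeline denoises an image by taking random noise the size of the desired output and passing it through the model several times. At each timestep, the model predicts the *noise residual*, and the scheduler uses it to predict a less noisy image. The pipeline repeats this process until it has run the specified number of inference steps.

To recreate the pipeline with the model and scheduler separately, let's write our own denoising process.

Load the model and scheduler: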

>>> from diffusers import DDPMScheduler, UNet2DModel

>>> scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
>>> model = UNet2DModel.from_pretrained("google/ddpm-cat-256", use_safetensors=True).to("cuda")

Set the number of timesteps to run the denoising process for:

>>> scheduler.set_timesteps(50)

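Setting the scheduler timesteps creates a tensor of evenly spaced elements, 50 in this example. Each element corresponds to a timestep at which the model denoises the image. When you create the denoising loop later, you'll iterate over this tensor to denoise the image: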

>>> scheduler.timesteps
tensor([980, 960, 940, 920, 900, 880, 860, 840, 820, 800, 780, 760, 740, 720,
    700, 680, 660, 640, 620, 600, 580, 560, 540, 520, 500, 480, 460, 440,
    420, 400, 380, 360, 340, 320, 300, 280, 260, 240, 220, 200, 180, 160,
    140, 120, 100,  80,  60,  40,  20,   0])
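The spacing in this tensor can be reproduced by hand. The sketch below assumes the scheduler was trained with 1000 timesteps (an assumption, since the value is not stated above; `scheduler.config.num_train_timesteps` should confirm it) and that inference timesteps are taken at a fixed stride, which matches the printed output. The helper name `evenly_spaced_timesteps` is hypothetical and not part of Diffusers.

```python
def evenly_spaced_timesteps(num_train_timesteps, num_inference_steps):
    """Return descending timesteps taken at a fixed stride.

    Illustrative helper only; the real scheduler may support other spacings.
    """
    stride = num_train_timesteps // num_inference_steps
    ascending = [i * stride for i in range(num_inference_steps)]
    return ascending[::-1]  # denoising runs from the noisiest timestep down to 0


# Assumed training length of 1000 timesteps, 50 inference steps as above
timesteps = evenly_spaced_timesteps(num_train_timesteps=1000, num_inference_steps=50)
print(timesteps[:3], timesteps[-3:])  # [980, 960, 940] [40, 20, 0]
```

Under the same assumption, halving the number of inference steps doubles the stride, so each step has to remove more noise.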

Create some random noise with the same shape as the desired output:

>>> import torch

>>> sample_size = model.config.sample_size
>>> noise = torch.randn((1, 3, sample_size, sample_size), device="cuda")

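Now write a loop to iterate over the timesteps. At each timestep, the model does a UNet2DModel.forward() pass and returns the noisy residual. The scheduler's step() method takes the noisy residual, the timestep, and the input, and predicts the image at the previous timestep. This output becomes the next input to the model in the denoising loop, and the process repeats until it reaches the end of the `timesteps` array.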

>>> input = noise

>>> for t in scheduler.timesteps:
...     with torch.no_grad():
...         noisy_residual = model(input, t).sample
...     previous_noisy_sample = scheduler.step(noisy_residual, t, input).prev_sample
...     input = previous_noisy_sample

This is the entire denoising process, and you can use this same pattern to write any diffusion system.
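To make the pattern explicit, here is a minimal sketch of the loop as a reusable function. `ToyModel` and `ToyScheduler` are hypothetical stand-ins written for this illustration: they work on plain Python lists and return plain values, whereas the real Diffusers objects return outputs with `.sample` and `.prev_sample` attributes. Only the control flow is meant to carry over.

```python
class ToyModel:
    """Stand-in for a UNet: 'predicts' that half of every value is noise."""

    def __call__(self, sample, timestep):
        return [0.5 * x for x in sample]


class ToyScheduler:
    """Stand-in for a scheduler: subtracts the predicted noise from the sample."""

    def set_timesteps(self, num_inference_steps):
        # Descending timesteps, e.g. [2, 1, 0] for three steps
        self.timesteps = list(range(num_inference_steps - 1, -1, -1))

    def step(self, noise_pred, timestep, sample):
        return [x - n for x, n in zip(sample, noise_pred)]


def denoise(model, scheduler, sample, num_inference_steps):
    """Set the timesteps, then alternate model prediction and scheduler step."""
    scheduler.set_timesteps(num_inference_steps)
    for t in scheduler.timesteps:
        noise_pred = model(sample, t)                   # predict the noise residual
        sample = scheduler.step(noise_pred, t, sample)  # compute the less noisy sample
    return sample


result = denoise(ToyModel(), ToyScheduler(), [8.0, -4.0], num_inference_steps=3)
print(result)  # [1.0, -0.5]
```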

The last step is to convert the denoised output into an image:

>>> from PIL import Image
>>> import numpy as np

>>> image = (input / 2 + 0.5).clamp(0, 1).squeeze()
>>> image = (image.permute(1, 2, 0) * 255).round().to(torch.uint8).cpu().numpy()
>>> image = Image.fromarray(image)
>>> image
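The arithmetic in this conversion can be checked on single values. The sketch below assumes the model output lies roughly in [-1, 1], which is what the `/ 2 + 0.5` rescaling implies; the helper name `to_uint8` is hypothetical.

```python
def to_uint8(value):
    """Map a value in [-1, 1] to an 8-bit pixel value in [0, 255]."""
    unit = value / 2 + 0.5           # shift [-1, 1] to [0, 1]
    unit = min(max(unit, 0.0), 1.0)  # clamp anything that falls outside
    return round(unit * 255)         # scale to [0, 255] and round


print([to_uint8(v) for v in (-1.0, 0.0, 1.0, 1.7)])  # [0, 128, 255, 255]
```

The clamp matters because nothing guarantees the denoised output stays inside the expected range; without it, out-of-range values would wrap around when cast to `uint8`.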

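In the next section, you'll put your skills to the test and break down the more complex Stable Diffusion pipeline. The steps are more or less the same. You'll initialize the necessary components and set the number of timesteps to create a `timesteps` array. The denoising loop iterates over the `timesteps` array; at each timestep, the model outputs a noise residual, and the scheduler uses it to predict a less noisy image at the previous timestep. This process repeats until you reach the end of the `timesteps` array.

Let's try it out!

Deconstruct the Stable Diffusion pipeline

Stable Diffusion is a text-to-image *latent diffusion* model. It is called a latent diffusion model because it works with a lower-dimensional representation of the image instead of the actual pixel space, which makes it more memory efficient. The VAE encoder compresses the image into a smaller representation, and the VAE decoder converts the compressed representation back into an image. For text-to-image models, you'll also need a tokenizer and a text encoder to generate text embeddings. From the previous example, you already know you need a UNet model and a scheduler.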
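A quick count of tensor elements gives a feel for the memory saving. The figures below are assumptions for illustration: a 512x512 RGB image, the downsampling factor of 8 used later in this tutorial, and 4 latent channels (check `unet.config.in_channels` for the actual value).

```python
def num_elements(channels, height, width):
    """Number of values in a single image-shaped tensor."""
    return channels * height * width


pixel_elements = num_elements(3, 512, 512)             # RGB image in pixel space
latent_elements = num_elements(4, 512 // 8, 512 // 8)  # assumed 4-channel latent

ratio = pixel_elements // latent_elements
print(pixel_elements, latent_elements, ratio)  # 786432 16384 48
```

Under these assumptions, the UNet processes about 48 times fewer values per image than it would in pixel space.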

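As you can see, this is already more complex than the DDPM pipeline, which only contains a UNet model. The Stable Diffusion model has three separate pretrained models: the VAE, the UNet, and the text encoder.

💡 Read the "How does Stable Diffusion work?" blog for more details about how the VAE, UNet, and text encoder models work.

Now that you know what the Stable Diffusion pipeline needs, load all these components with the from_pretrained() method. You can find them in the pretrained CompVis/stable-diffusion-v1-4 checkpoint, where each component is stored in a separate subfolder: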

>>> from PIL import Image
>>> import torch
>>> from transformers import CLIPTextModel, CLIPTokenizer
>>> from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler

>>> vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", use_safetensors=True)
>>> tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
>>> text_encoder = CLIPTextModel.from_pretrained(
...     "CompVis/stable-diffusion-v1-4", subfolder="text_encoder", use_safetensors=True
... )
>>> unet = UNet2DConditionModel.from_pretrained(
...     "CompVis/stable-diffusion-v1-4", subfolder="unet", use_safetensors=True
... )

Instead of the default PNDMScheduler, swap in the UniPCMultistepScheduler to see how easy it is to plug in a different scheduler:

>>> from diffusers import UniPCMultistepScheduler

>>> scheduler = UniPCMultistepScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")

To speed up inference, move the models to a GPU since, unlike the scheduler, they have trainable weights:

>>> torch_device = "cuda"
>>> vae.to(torch_device)
>>> text_encoder.to(torch_device)
>>> unet.to(torch_device)

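Create text embeddings

The next step is to tokenize the text to generate embeddings. The text is used to condition the UNet model and steer the diffusion process towards something that resembles the input prompt.

💡 The `guidance_scale` parameter determines how much weight is given to the prompt when generating an image.

Feel free to choose any prompt you like if you want to generate something else!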

>>> prompt = ["a photograph of an astronaut riding a horse"]
>>> height = 512  # default height of Stable Diffusion
>>> width = 512  # default width of Stable Diffusion
>>> num_inference_steps = 25  # Number of denoising steps
>>> guidance_scale = 7.5  # Scale for classifier-free guidance
>>> generator = torch.manual_seed(0)  # Seed generator to create the initial latent noise
>>> batch_size = len(prompt)

Tokenize the text and generate the embeddings from the prompt:

>>> text_input = tokenizer(
...     prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt"
... )

>>> with torch.no_grad():
...     text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]

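You'll also need to generate the *unconditional text embeddings*, which are the embeddings for the padding token. These need to have the same shape (`batch_size` and `seq_length`) as the conditional `text_embeddings`: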

>>> max_length = text_input.input_ids.shape[-1]
>>> uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt")
>>> with torch.no_grad():
...     uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]

Let's concatenate the conditional and unconditional embeddings into a batch to avoid doing two forward passes:

>>> text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
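The batching idea can be sketched without any model weights. `CountingModel` below is a hypothetical stand-in for a batched forward pass, and plain lists stand in for tensors. The point is that one call covers both halves, and that the output must be split in the same order the inputs were concatenated (unconditional first, then conditional).

```python
class CountingModel:
    """Stand-in for a batched forward pass; records how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, batch):
        self.calls += 1
        return [2 * x for x in batch]  # arbitrary per-item computation


model = CountingModel()
uncond = [0.0]  # stand-in for the unconditional embeddings (batch of 1)
cond = [3.0]    # stand-in for the prompt embeddings (batch of 1)

batch = uncond + cond  # same order as torch.cat([uncond, cond])
output = model(batch)  # a single call handles both halves

half = len(output) // 2
out_uncond, out_cond = output[:half], output[half:]  # split in the same order
print(model.calls, out_uncond, out_cond)  # 1 [0.0] [6.0]
```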

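Create random noise

Next, generate some initial random noise as a starting point for the diffusion process. This is the latent representation of the image, and it will be gradually denoised. At this point, the `latents` image is smaller than the final image size, but that's okay because the model will transform it into the final 512x512 image dimensions later.

💡 The height and width are divided by 8 because the `vae` model has 3 down-sampling layers. You can check by running the following: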

2 ** (len(vae.config.block_out_channels) - 1) == 8

>>> # `generator` is a CPU generator, so sample on the CPU and then move the latents to the GPU
>>> latents = torch.randn(
...     (batch_size, unet.config.in_channels, height // 8, width // 8),
...     generator=generator,
... ).to(torch_device)
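The latent shape follows directly from the image size and the VAE's downsampling factor. The sketch below computes it from the number of VAE resolution levels rather than hard-coding 8. The helper name `latent_shape` is hypothetical, and the values passed in (4 latent channels, 4 entries in `vae.config.block_out_channels`) are assumptions consistent with the check above, not values read from the checkpoint.

```python
def latent_shape(batch_size, latent_channels, height, width, num_vae_blocks):
    """Shape of the initial latents for a VAE with `num_vae_blocks` resolution levels."""
    scale = 2 ** (num_vae_blocks - 1)  # every level after the first halves height and width
    if height % scale or width % scale:
        raise ValueError(f"height and width must be divisible by {scale}")
    return (batch_size, latent_channels, height // scale, width // scale)


# Assumed values: batch of 1, 4 latent channels, 4 VAE blocks (factor 8)
print(latent_shape(1, 4, 512, 512, num_vae_blocks=4))  # (1, 4, 64, 64)
```

Rejecting sizes that are not divisible by the factor avoids silently producing an image slightly smaller than the one requested.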

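Denoise the image

Start by scaling the input with the initial noise distribution, *sigma*, the noise scale value, which is required for improved schedulers like UniPCMultistepScheduler: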

>>> latents = latents * scheduler.init_noise_sigma

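The last step is to create the denoising loop that progressively transforms the pure noise in `latents` into an image described by your prompt. Remember, the denoising loop needs to do three things:

1. Set the scheduler's timesteps to use during denoising.
2. Iterate over the timesteps.
3. At each timestep, call the UNet model to predict the noise residual and pass it to the scheduler to compute the previous noisy sample.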

>>> from tqdm.auto import tqdm

>>> scheduler.set_timesteps(num_inference_steps)

>>> for t in tqdm(scheduler.timesteps):
...     # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
...     latent_model_input = torch.cat([latents] * 2)
...
...     latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)
...
...     # predict the noise residual
...     with torch.no_grad():
...         noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
...
...     # perform guidance
...     noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
...     noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
...
...     # compute the previous noisy sample x_t -> x_t-1
...     latents = scheduler.step(noise_pred, t, latents).prev_sample
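The guidance step in the loop is a simple extrapolation, and it can be worked through on plain numbers. The sketch below applies the same formula to Python lists instead of tensors; the helper name `apply_guidance` is hypothetical.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: start at the unconditional prediction and
    move along the direction toward the text-conditioned prediction."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


noise_uncond = [0.0, 2.0]
noise_text = [1.0, 1.0]

print(apply_guidance(noise_uncond, noise_text, 0.0))  # [0.0, 2.0]  prompt ignored
print(apply_guidance(noise_uncond, noise_text, 1.0))  # [1.0, 1.0]  plain conditional prediction
print(apply_guidance(noise_uncond, noise_text, 7.5))  # [7.5, -5.5] pushed past it
```

A scale above 1 overshoots the text-conditioned prediction, which is why larger values of `guidance_scale` make the result follow the prompt more strongly.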

Decode the image

The final step is to use the `vae` to decode the latent representation into an image and get the decoded output with `sample`:

>>> # scale and decode the image latents with vae
>>> latents = 1 / 0.18215 * latents
>>> with torch.no_grad():
...     image = vae.decode(latents).sample

Lastly, convert the image to a `PIL.Image` to see your generated image!

>>> image = (image / 2 + 0.5).clamp(0, 1).squeeze()
>>> image = (image.permute(1, 2, 0) * 255).round().to(torch.uint8).cpu().numpy()
>>> image = Image.fromarray(image)
>>> image

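Next steps

From basic to complex pipelines, you've seen that all you really need to write your own diffusion system is a denoising loop. The loop should set the scheduler's timesteps, iterate over them, and alternate between calling the UNet model to predict the noise residual and passing it to the scheduler to compute the previous noisy sample.

This is what 🧨 Diffusers is designed for: to make it intuitive and easy to write your own diffusion system using models and schedulers.

For your next steps, feel free to:

- Learn how to build and contribute a pipeline to 🧨 Diffusers. We can't wait to see what you'll come up with!
- Explore the existing pipelines in the library, and see if you can deconstruct and build a pipeline from scratch using models and schedulers separately.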
