Fine-Tuning and Guidance

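In this notebook, we'll cover two main approaches for adapting existing diffusion models:

- With **fine-tuning**, we'll re-train existing models on new data to change the type of output they produce.
- With **guidance**, we'll take an existing model and steer the generation process at inference time for additional control.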

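What you will learn:

By the end of this notebook, you will know how to:

- Create a sampling loop and generate samples faster using a new scheduler
- Fine-tune an existing diffusion model on new data, including:
  - Using gradient accumulation to get around some of the issues with small batches
  - Logging samples to Weights and Biases during training to monitor progress (via the accompanying example script)
  - Saving the resulting pipeline and uploading it to the Hub
- Guide the sampling process with additional loss functions to add control over existing models, including:
  - Exploring different guidance approaches with a simple color-based loss
  - Using CLIP to guide generation using a text prompt
  - Sharing a custom sampling loop using Gradio and 🤗 Spaces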

❓ If you have any questions, please post them in the #diffusion-models-class channel on the Hugging Face Discord server. If you haven't signed up yet, you can do so here: https://huggingface.co/join/discord

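Setup and Imports

To save your fine-tuned models to the Hugging Face Hub, you'll need to log in with a **token that has write access**. The code below will prompt you for this and link to the relevant tokens page of your account. You'll also need a Weights and Biases account if you'd like to use the training script to log samples as the model trains. Again, the code will prompt you to log in where needed.

Apart from that, the only setup is installing a few dependencies, importing everything we'll need, and specifying which device we'll use: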

%pip install -qq diffusers datasets accelerate wandb open-clip-torch

>>> # Code to log in to the Hugging Face Hub, needed for sharing models
>>> # Make sure you use a token with WRITE access
>>> from huggingface_hub import notebook_login

>>> notebook_login()

Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful

import numpy as np
import torch
import torch.nn.functional as F
import torchvision
from datasets import load_dataset
from diffusers import DDIMScheduler, DDPMPipeline
from matplotlib import pyplot as plt
from PIL import Image
from torchvision import transforms
from tqdm.auto import tqdm

device = "mps" if torch.backends.mps.is_available() else "cuda" if torch.cuda.is_available() else "cpu"

Loading a Pre-Trained Pipeline

To begin this notebook, let's load an existing pipeline and see what we can do with it:

image_pipe = DDPMPipeline.from_pretrained("google/ddpm-celebahq-256")
image_pipe.to(device)

Generating images is as simple as running the pipeline's `__call__` method, that is, calling it like a function:

>>> images = image_pipe().images
>>> images[0]

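Neat, but slow! So, before we get to the main topics of today, let's take a peek at the actual sampling loop and see how we can use a fancier sampler to speed this up.

Faster Sampling with DDIM

At every step, the model is fed a noisy input and asked to predict the noise (and thus an estimate of what the fully denoised image might look like). Initially these predictions are not very good, which is why we break the process down into many steps. However, using 1000+ steps has been found to be unnecessary, and a flurry of recent research has explored how to achieve good samples with as few steps as possible.

In the 🤗 Diffusers library, these **sampling methods are handled by a scheduler**, which must perform each update via the `step()` function. To generate an image, we begin with random noise $x$. Then, for every timestep in the scheduler's noise schedule, we feed the noisy input $x$ to the model and pass the resulting prediction to the `step()` function. This returns an output with a `prev_sample` attribute. It is called "previous" because we're going "backwards" in time from high noise to low noise (the opposite of the forward diffusion process).

Let's see this in action! First, we load a scheduler, here a `DDIMScheduler` based on the paper Denoising Diffusion Implicit Models, which can give decent samples in far fewer steps than the original DDPM implementation: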

# Create new scheduler and set num inference steps
scheduler = DDIMScheduler.from_pretrained("google/ddpm-celebahq-256")
scheduler.set_timesteps(num_inference_steps=40)

You can see that this scheduler will run a total of 40 steps, each equivalent to a jump of 25 steps in the original 1000-step schedule:

scheduler.timesteps
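The arithmetic behind that list can be sketched in plain Python. This is only an illustration: it assumes evenly strided ("leading"-style) spacing with no offset, which may not match every scheduler configuration, and the variable names are ours rather than the library's.

```python
# Sketch of evenly strided inference timesteps (assumes "leading"-style
# spacing with no offset; a real scheduler's config may differ).
num_train_timesteps = 1000  # length of the original training schedule
num_inference_steps = 40    # steps we actually want to run

stride = num_train_timesteps // num_inference_steps  # 1000 // 40 = 25
# Count down from high noise to low noise, one stride at a time
timesteps = [i * stride for i in range(num_inference_steps)][::-1]

print(stride)          # 25
print(timesteps[:3])   # [975, 950, 925]
print(timesteps[-1])   # 0
```

Each inference step therefore skips over 25 of the original training timesteps.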

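Let's create 4 random images and run through the sampling loop, viewing both the current $x$ and the predicted denoised version as the process progresses: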

>>> # The random starting point
>>> x = torch.randn(4, 3, 256, 256).to(device)  # Batch of 4, 3-channel 256 x 256 px images

>>> # Loop through the sampling timesteps
>>> for i, t in tqdm(enumerate(scheduler.timesteps)):

...     # Prepare model input
...     model_input = scheduler.scale_model_input(x, t)

...     # Get the prediction
...     with torch.no_grad():
...         noise_pred = image_pipe.unet(model_input, t)["sample"]

...     # Calculate what the updated sample should look like with the scheduler
...     scheduler_output = scheduler.step(noise_pred, t, x)

...     # Update x
...     x = scheduler_output.prev_sample

...     # Occasionally display both x and the predicted denoised images
...     if i % 10 == 0 or i == len(scheduler.timesteps) - 1:
...         fig, axs = plt.subplots(1, 2, figsize=(12, 5))

...         grid = torchvision.utils.make_grid(x, nrow=4).permute(1, 2, 0)
...         axs[0].imshow(grid.cpu().clip(-1, 1) * 0.5 + 0.5)
...         axs[0].set_title(f"Current x (step {i})")

...         pred_x0 = scheduler_output.pred_original_sample  # Not available for all schedulers
...         grid = torchvision.utils.make_grid(pred_x0, nrow=4).permute(1, 2, 0)
...         axs[1].imshow(grid.cpu().clip(-1, 1) * 0.5 + 0.5)
...         axs[1].set_title(f"Predicted denoised images (step {i})")
...         plt.show()

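As you can see, the initial predictions are not great, but as the process goes on the predicted outputs get more and more refined. If you're curious what math is happening inside that `step()` function, inspect the (well-commented) code with: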

# ??scheduler.step
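If you'd rather see the core arithmetic without the library, here is a minimal sketch of a deterministic DDIM-style update (eta = 0) applied to a single number. It assumes the model predicts noise, ignores options such as sample clipping, and uses made-up `alpha_bar` values; `ddim_step` is a hypothetical helper, not the Diffusers API.

```python
import math


def ddim_step(x_t, noise_pred, alpha_bar_t, alpha_bar_prev):
    """Deterministic (eta = 0) DDIM-style update for one scalar value."""
    # Estimate the fully denoised value implied by the noise prediction
    pred_x0 = (x_t - math.sqrt(1 - alpha_bar_t) * noise_pred) / math.sqrt(alpha_bar_t)
    # Re-noise that estimate to the (lower) noise level of the previous timestep
    prev_x = math.sqrt(alpha_bar_prev) * pred_x0 + math.sqrt(1 - alpha_bar_prev) * noise_pred
    return prev_x, pred_x0


# Toy check: build a noisy value from a known clean value and known noise
x0_true, eps = 0.4, -1.2
alpha_bar_t, alpha_bar_prev = 0.3, 0.6  # illustrative values, not from a real schedule
x_t = math.sqrt(alpha_bar_t) * x0_true + math.sqrt(1 - alpha_bar_t) * eps

prev_x, pred_x0 = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev)
print(pred_x0)  # matches x0_true (up to rounding) when the noise prediction is exact
```

The two return values play the roles of `prev_sample` and `pred_original_sample` in the loop above.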

You can also drop in this new scheduler in place of the original one that came with the pipeline, and sample like so:

>>> image_pipe.scheduler = scheduler
>>> images = image_pipe(num_inference_steps=40).images
>>> images[0]

Alright, we can now get samples in a reasonable time! This should speed things up as we move through the rest of this notebook :)

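Fine-Tuning

Now for the fun bit! Given this pre-trained pipeline, how might we re-train the model to generate images based on new training data?

It turns out that this looks nearly identical to training a model from scratch (as we saw in Unit 1), except that we begin with the existing model. Let's see this in action and talk about a few additional considerations as we go.

First, the dataset: you could try this vintage faces dataset or these anime faces for something closer to the original training data of this faces model, but just for fun let's instead use the same small butterflies dataset we used to train from scratch in Unit 1. Run the code below to download the butterflies dataset and create a dataloader we can sample a batch of images from: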

>>> # @markdown load and prepare a dataset:
>>> # Not on Colab? Comments with #@ enable UI tweaks like headings or user inputs
>>> # but can safely be ignored if you're working on a different platform.

>>> dataset_name = "huggan/smithsonian_butterflies_subset"  # @param
>>> dataset = load_dataset(dataset_name, split="train")
>>> image_size = 256  # @param
>>> batch_size = 4  # @param
>>> preprocess = transforms.Compose(
...     [
...         transforms.Resize((image_size, image_size)),
...         transforms.RandomHorizontalFlip(),
...         transforms.ToTensor(),
...         transforms.Normalize([0.5], [0.5]),
...     ]
... )


>>> def transform(examples):
...     images = [preprocess(image.convert("RGB")) for image in examples["image"]]
...     return {"images": images}


>>> dataset.set_transform(transform)

>>> train_dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

>>> print("Previewing batch:")
>>> batch = next(iter(train_dataloader))
>>> grid = torchvision.utils.make_grid(batch["images"], nrow=4)
>>> plt.imshow(grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5)

Previewing batch:

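Consideration 1: Our batch size here (4) is quite small, since we're training a fairly large model at a large image size (256px) and will run out of GPU RAM if we push the batch size too high. You can reduce the image size to speed things up and allow for larger batches, but these models were designed and originally trained for 256px generation.

Now for the training loop. We'll update the weights of the pre-trained model by setting the optimization target to `image_pipe.unet.parameters()`. The rest is nearly identical to the example training loop from Unit 1. This takes about 10 minutes to run on Colab, so now is a good time to grab a coffee or tea while you wait: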

>>> num_epochs = 2  # @param
>>> lr = 1e-5  # @param
>>> grad_accumulation_steps = 2  # @param

>>> optimizer = torch.optim.AdamW(image_pipe.unet.parameters(), lr=lr)

>>> losses = []

>>> for epoch in range(num_epochs):
...     for step, batch in tqdm(enumerate(train_dataloader), total=len(train_dataloader)):
...         clean_images = batch["images"].to(device)
...         # Sample noise to add to the images
...         noise = torch.randn(clean_images.shape).to(clean_images.device)
...         bs = clean_images.shape[0]

...         # Sample a random timestep for each image
...         timesteps = torch.randint(
...             0,
...             image_pipe.scheduler.num_train_timesteps,
...             (bs,),
...             device=clean_images.device,
...         ).long()

...         # Add noise to the clean images according to the noise magnitude at each timestep
...         # (this is the forward diffusion process)
...         noisy_images = image_pipe.scheduler.add_noise(clean_images, noise, timesteps)

...         # Get the model prediction for the noise
...         noise_pred = image_pipe.unet(noisy_images, timesteps, return_dict=False)[0]

...         # Compare the prediction with the actual noise:
...         loss = F.mse_loss(
...             noise_pred, noise
...         )  # NB - trying to predict noise (eps) not (noisy_ims-clean_ims) or just (clean_ims)

...         # Store for later plotting
...         losses.append(loss.item())

...         # Backpropagate; gradients add up across calls until zero_grad() is called
...         loss.backward()

...         # Gradient accumulation:
...         if (step + 1) % grad_accumulation_steps == 0:
...             optimizer.step()
...             optimizer.zero_grad()

...     print(f"Epoch {epoch} average loss: {sum(losses[-len(train_dataloader):])/len(train_dataloader)}")

>>> # Plot the loss curve:
>>> plt.plot(losses)

Epoch 0 average loss: 0.013324214214226231

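Consideration 2: Our loss signal is extremely noisy, since we're only working with four examples at random noise levels for each step. This is not ideal for training. One fix is to use an extremely low learning rate to limit the size of the update at each step. It would be even better if we could find some way to get the same benefits we would get from using a larger batch size without the memory requirements skyrocketing...

Enter gradient accumulation. If we call `loss.backward()` multiple times before running `optimizer.step()` and `optimizer.zero_grad()`, PyTorch accumulates (sums) the gradients, effectively merging the signal from several batches to give a single (better) estimate, which is then used to update the parameters. This results in fewer total updates being made, just as we'd see if we used a larger batch size. This is something many frameworks will handle for you (for example, 🤗 Accelerate makes this easy), but it is nice to see it implemented from scratch since this is a useful technique for dealing with training under GPU memory constraints! As you can see from the code above (after the `# Gradient accumulation` comment), there really isn't much code needed.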

# Exercise: See if you can add gradient accumulation to the training loop in Unit 1.
# How does it perform? Think how you might adjust the learning rate based on the
# number of gradient accumulation steps - should it stay the same as before?
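As a hint for the exercise above, here is a small pure-Python sketch of the arithmetic of gradient accumulation on a toy one-parameter model. This is not the notebook's training code: the model, the data, and the `mse_grad` helper are made up for illustration, and the gradient is computed analytically rather than with autograd.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w, computed analytically."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]

# One big batch of 8
full_grad = mse_grad(w, xs, ys)

# Two micro-batches of 4, with gradients summed as repeated backward() calls would do
accumulation_steps = 2
micro_batch_size = len(xs) // accumulation_steps
accumulated = 0.0
for start in range(0, len(xs), micro_batch_size):
    stop = start + micro_batch_size
    accumulated += mse_grad(w, xs[start:stop], ys[start:stop])

print(accumulated / accumulation_steps, full_grad)  # these two agree
```

Because the accumulated gradient is a sum, it is `accumulation_steps` times larger than the gradient of one big averaged batch. A common way to compensate is to divide each loss by the number of accumulation steps before calling `backward()`.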

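Consideration 3: This still takes a lot of time, and printing out a one-line update every epoch is not enough feedback to give us a good idea of what is going on. We should probably:

- Generate a few samples occasionally to examine the performance qualitatively as the model trains
- Log things like the loss and sample generations during training, perhaps using something like Weights and Biases or TensorBoard

I created a quick script (finetune_model.py) that takes the training code given above and adds minimal logging functionality. You can see the logs from one training run below: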

%wandb johnowhitaker/dm_finetune/2upaa341 # You'll need a W&B account for this to work - skip if you don't want to log in

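It's fun to see how the generated samples change as training progresses. Even though the loss doesn't appear to be improving much, we can see a progression away from the original domain (images of bedrooms) and towards the new training data (WikiArt). Further down in this notebook there is commented-out code for fine-tuning a model using this script as an alternative to running the cell above.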

# Exercise: see if you can modify the official example training script we saw
# in Unit 1 to begin with a pre-trained model rather than training from scratch.
# Compare it to the minimal script linked above - what extra features is the minimal script missing?

Generating some images with this model, we can see that these faces are already looking mighty strange!

>>> # @markdown Generate and plot some images:
>>> x = torch.randn(8, 3, 256, 256).to(device)  # Batch of 8
>>> for i, t in tqdm(enumerate(scheduler.timesteps)):
...     model_input = scheduler.scale_model_input(x, t)
...     with torch.no_grad():
...         noise_pred = image_pipe.unet(model_input, t)["sample"]
...     x = scheduler.step(noise_pred, t, x).prev_sample
>>> grid = torchvision.utils.make_grid(x, nrow=4)
>>> plt.imshow(grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5)

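Consideration 4: Fine-tuning can be quite unpredictable! If we trained for a lot longer, we might see some perfect butterflies. But the intermediate steps can be extremely interesting in their own right, especially if your interests are more towards the artistic side! Explore training for very short or very long periods of time, and varying the learning rate, to see how this affects the kinds of output the final model produces.

Code for fine-tuning a model using the minimal example script we used on the WikiArt demo model

If you'd like to train a model similar to the one I made on WikiArt, you can uncomment and run the cells below. Since this takes a while and may exhaust your GPU memory, I recommend doing this after working through the rest of this notebook.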

## To download the fine-tuning script:
# !wget https://github.com/huggingface/diffusion-models-class/raw/main/unit2/finetune_model.py

## To run the script, training the face model on some vintage faces
## (ideally run this in a terminal):
# !python finetune_model.py --image_size 128 --batch_size 8 --num_epochs 16\
#     --grad_accumulation_steps 2 --start_model "google/ddpm-celebahq-256"\
#     --dataset_name "Norod78/Vintage-Faces-FFHQAligned" --wandb_project 'dm-finetune'\
#     --log_samples_every 100 --save_model_every 1000 --model_save_name 'vintageface'

Saving and Loading Fine-Tuned Pipelines

Now that we've fine-tuned the U-Net in our diffusion model, let's save it to a local folder by running:

image_pipe.save_pretrained("my-finetuned-model")

As we saw in Unit 1, this will save the config, model, and scheduler:

>>> !ls {"my-finetuned-model"}

model_index.json  scheduler  unet

Next, you can follow the same steps outlined in Unit 1's Introduction to Diffusers to push the model to the Hub for future use:

# @title Upload a locally saved pipeline to the hub

# Code to upload a pipeline saved locally to the hub
from huggingface_hub import HfApi, ModelCard, create_repo, get_full_repo_name

# Set up repo and upload files
model_name = "ddpm-celebahq-finetuned-butterflies-2epochs"  # @param What you want it called on the hub
local_folder_name = (
    "my-finetuned-model"  # @param Created by the script or one you created via image_pipe.save_pretrained('save_name')
)
description = "Describe your model here"  # @param
hub_model_id = get_full_repo_name(model_name)
create_repo(hub_model_id)
api = HfApi()
api.upload_folder(folder_path=f"{local_folder_name}/scheduler", path_in_repo="", repo_id=hub_model_id)
api.upload_folder(folder_path=f"{local_folder_name}/unet", path_in_repo="", repo_id=hub_model_id)
api.upload_file(
    path_or_fileobj=f"{local_folder_name}/model_index.json",
    path_in_repo="model_index.json",
    repo_id=hub_model_id,
)

# Add a model card (optional but nice!)
content = f"""
---
license: mit
tags:
- pytorch
- diffusers
- unconditional-image-generation
- diffusion-models-class
---

# Example Fine-Tuned Model for Unit 2 of the [Diffusion Models Class 🧨](https://github.com/huggingface/diffusion-models-class)

{description}

## Usage

```python
from diffusers import DDPMPipeline

pipeline = DDPMPipeline.from_pretrained('{hub_model_id}')
image = pipeline().images[0]
image
```
"""

card = ModelCard(content)
card.push_to_hub(hub_model_id)

Congratulations, you've now fine-tuned your first diffusion model!

For the rest of this notebook we'll use a [model](https://huggingface.co/johnowhitaker/sd-class-wikiart-from-bedrooms) I fine-tuned from [this model trained on LSUN bedrooms](https://huggingface.co/google/ddpm-bedroom-256) for approximately one epoch on the [WikiArt dataset](https://huggingface.co/datasets/huggan/wikiart). If you'd prefer, you can skip this cell and use the faces/butterflies pipeline we fine-tuned in the previous section or load one from the Hub instead:

>>> # Load the pretrained pipeline
>>> pipeline_name = "johnowhitaker/sd-class-wikiart-from-bedrooms"
>>> image_pipe = DDPMPipeline.from_pretrained(pipeline_name).to(device)

>>> # Sample some images with a DDIM Scheduler over 40 steps
>>> scheduler = DDIMScheduler.from_pretrained(pipeline_name)
>>> scheduler.set_timesteps(num_inference_steps=40)

>>> # Random starting point (batch of 8 images)
>>> x = torch.randn(8, 3, 256, 256).to(device)

>>> # Minimal sampling loop
>>> for i, t in tqdm(enumerate(scheduler.timesteps)):
...     model_input = scheduler.scale_model_input(x, t)
...     with torch.no_grad():
...         noise_pred = image_pipe.unet(model_input, t)["sample"]
...     x = scheduler.step(noise_pred, t, x).prev_sample

>>> # View the results
>>> grid = torchvision.utils.make_grid(x, nrow=4)
>>> plt.imshow(grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5)

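Consideration 5: It is often hard to tell how well fine-tuning is working, and what "good performance" means may vary by use case. For example, if you're fine-tuning a text-conditioned model like Stable Diffusion on a small dataset, you probably want it to **retain** most of its original training so that it can understand arbitrary prompts not covered by your new dataset, while **adapting** to better match the style of your new training data. This could mean using a low learning rate alongside something like exponential model averaging, as demonstrated in this great blog post about creating a Pokémon version of Stable Diffusion. In a different situation, you may want to completely re-train a model on new data (such as our bedroom -> WikiArt example), in which case a larger learning rate and more training make sense. Even though the loss plot doesn't show much improvement, the samples clearly show a move away from the original data and towards more "arty" outputs, although they remain mostly incoherent.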
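To make the idea of exponential model averaging a little more concrete, here is a toy sketch on plain Python floats. The `ema_update` helper, the parameter names, and the decay value are assumptions for illustration only; real training code would apply the same blend to every model tensor after each optimizer step.

```python
def ema_update(ema_params, params, decay=0.999):
    """Blend the current parameters into the running average, in place."""
    for name, value in params.items():
        ema_params[name] = decay * ema_params[name] + (1 - decay) * value
    return ema_params


# Toy "model" with two scalar parameters
params = {"w": 1.0, "b": 0.0}
ema_params = dict(params)  # the average starts as a copy of the weights

# Pretend an optimizer step moved the weights, then update the average
params = {"w": 2.0, "b": -1.0}
ema_update(ema_params, params, decay=0.9)
print(ema_params)  # the average has moved only a tenth of the way to the new weights
```

The averaged copy changes more slowly than the raw weights, which is one way to hold on to more of the original model's behavior during fine-tuning.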

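This leads us to the next section, where we examine how we might add additional guidance to such a model for better control over the outputs.

Guidance

What do we do if we want some control over the generated samples? For example, say we wanted to bias the generated images towards a specific color. How would we go about that? Enter **guidance**, a technique for adding additional control to the sampling process.

The first step is to create our conditioning function: some measure (loss) that we'd like to minimize. Here's one for the color example, which compares the pixels of an image to a target color (by default, a sort of light teal) and returns the average error: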

def color_loss(images, target_color=(0.1, 0.9, 0.5)):
    """Given a target color (R, G, B) return a loss for how far away on average
    the images' pixels are from that color. Defaults to a light teal: (0.1, 0.9, 0.5)"""
    target = torch.tensor(target_color).to(images.device) * 2 - 1  # Map target color to (-1, 1)
    target = target[None, :, None, None]  # Get shape right to work with the images (b, c, h, w)
    error = torch.abs(images - target).mean()  # Mean absolute difference between the image pixels and the target color
    return error
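To get a feel for the numbers this loss produces, here is a pure-Python version that works on a short list of (r, g, b) pixels instead of a tensor. `color_loss_pixels` is a hypothetical stand-in written for this illustration; it follows the same mapping and mean absolute error as `color_loss` above but is not a drop-in replacement.

```python
def color_loss_pixels(pixels, target_color=(0.1, 0.9, 0.5)):
    """Mean absolute error between (r, g, b) pixels in [-1, 1] and a target color in [0, 1]."""
    target = [c * 2 - 1 for c in target_color]  # map the target to [-1, 1]
    errors = [abs(p - t) for pixel in pixels for p, t in zip(pixel, target)]
    return sum(errors) / len(errors)


teal = (0.1 * 2 - 1, 0.9 * 2 - 1, 0.5 * 2 - 1)  # the target itself, in [-1, 1]
gray = (0.0, 0.0, 0.0)  # mid-gray in [-1, 1]

print(color_loss_pixels([teal, teal]))  # 0.0: every pixel matches the target
print(color_loss_pixels([gray, gray]))  # about 0.533: the mean of 0.8, 0.8 and 0.0
```

A loss of zero means the image is a flat field of the target color, so in practice guidance trades this loss off against what the model wants to draw.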

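Next, we'll make a modified version of the sampling loop where, at each step, we do the following:

- Create a new version of x that has requires_grad = True
- Calculate the denoised version (x0)
- Feed the predicted x0 through our loss function
- Find the **gradient** of this loss function with respect to x
- Use this conditioning gradient to modify x before we step with the scheduler, hopefully pushing x in a direction that will lead to lower loss according to our guidance function

There are two variants here that you can explore. In the first, we set requires_grad on x **after** we get our noise prediction from the UNet, which is more memory efficient (since we don't have to trace gradients back through the diffusion model), but gives a less accurate gradient. In the second, we set requires_grad on x **first**, then feed it through the UNet and calculate the predicted x0.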
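Before running the full loops, it may help to see the first (shortcut) variant on a single number, with the gradient worked out by hand instead of by autograd. This is a sketch under stated assumptions: the values are made up, the noise prediction is treated as a constant (exactly the approximation the shortcut makes), and `predicted_x0` is a hypothetical helper.

```python
import math


def predicted_x0(x, noise_pred, alpha_bar):
    """Denoised estimate implied by a (fixed) noise prediction."""
    return (x - math.sqrt(1 - alpha_bar) * noise_pred) / math.sqrt(alpha_bar)


x, noise_pred, alpha_bar = 1.5, 0.3, 0.5  # illustrative values only
target = -0.8  # desired pixel value in [-1, 1]
guidance_loss_scale = 0.1

loss_before = abs(predicted_x0(x, noise_pred, alpha_bar) - target)

# d(loss)/dx with noise_pred held constant: sign(x0 - target) * d(x0)/dx
sign = 1.0 if predicted_x0(x, noise_pred, alpha_bar) > target else -1.0
grad = sign / math.sqrt(alpha_bar)

# Nudge x against the gradient, as the guided loop does before scheduler.step()
x = x - guidance_loss_scale * grad
loss_after = abs(predicted_x0(x, noise_pred, alpha_bar) - target)

print(loss_before, loss_after)  # the loss is lower after the nudge
```

In the second variant the noise prediction itself depends on x, so the gradient picks up an extra term that flows back through the UNet. That term is what costs the extra memory.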

>>> # Variant 1: shortcut method

>>> # The guidance scale determines the strength of the effect
>>> guidance_loss_scale = 40  # Explore changing this to 5, or 100

>>> x = torch.randn(8, 3, 256, 256).to(device)

>>> for i, t in tqdm(enumerate(scheduler.timesteps)):

...     # Prepare the model input
...     model_input = scheduler.scale_model_input(x, t)

...     # predict the noise residual
...     with torch.no_grad():
...         noise_pred = image_pipe.unet(model_input, t)["sample"]

...     # Set x.requires_grad to True
...     x = x.detach().requires_grad_()

...     # Get the predicted x0
...     x0 = scheduler.step(noise_pred, t, x).pred_original_sample

...     # Calculate loss
...     loss = color_loss(x0) * guidance_loss_scale
...     if i % 10 == 0:
...         print(i, "loss:", loss.item())

...     # Get gradient
...     cond_grad = -torch.autograd.grad(loss, x)[0]

...     # Modify x based on this gradient
...     x = x.detach() + cond_grad

...     # Now step with scheduler
...     x = scheduler.step(noise_pred, t, x).prev_sample

>>> # View the output
>>> grid = torchvision.utils.make_grid(x, nrow=4)
>>> im = grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5
>>> Image.fromarray(np.array(im * 255).astype(np.uint8))

0 loss: 27.279136657714844
10 loss: 11.286816596984863
20 loss: 10.683112144470215
30 loss: 10.942476272583008

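The second option requires nearly double the GPU RAM to run, even though we only generate a batch of four images instead of eight. See if you can spot the difference, and think through why this way is more "accurate":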

>>> # Variant 2: setting x.requires_grad before calculating the model predictions

>>> guidance_loss_scale = 40
>>> x = torch.randn(4, 3, 256, 256).to(device)

>>> for i, t in tqdm(enumerate(scheduler.timesteps)):

...     # Set requires_grad before the model forward pass
...     x = x.detach().requires_grad_()
...     model_input = scheduler.scale_model_input(x, t)

...     # predict (with grad this time)
...     noise_pred = image_pipe.unet(model_input, t)["sample"]

...     # Get the predicted x0:
...     x0 = scheduler.step(noise_pred, t, x).pred_original_sample

...     # Calculate loss
...     loss = color_loss(x0) * guidance_loss_scale
...     if i % 10 == 0:
...         print(i, "loss:", loss.item())

...     # Get gradient
...     cond_grad = -torch.autograd.grad(loss, x)[0]

...     # Modify x based on this gradient
...     x = x.detach() + cond_grad

...     # Now step with scheduler
...     x = scheduler.step(noise_pred, t, x).prev_sample


>>> grid = torchvision.utils.make_grid(x, nrow=4)
>>> im = grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5
>>> Image.fromarray(np.array(im * 255).astype(np.uint8))

0 loss: 30.750328063964844
10 loss: 18.550724029541016
20 loss: 17.515094757080078
30 loss: 17.55681037902832

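In the second variant, the memory requirements are higher and the effect is less pronounced, so you may think that it is inferior. However, the outputs are arguably closer to the types of images the model was trained on, and you can always increase the guidance scale for a stronger effect. Which approach you use will ultimately come down to what works best experimentally.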

# Exercise: pick your favourite colour and look up its values in RGB space.
# Edit the `color_loss()` line in the cell above to receive these new RGB values and examine the outputs - do they match what you expect?

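CLIP Guidance

Guiding towards a color gives us a little bit of control, but what if we could just type some text describing what we want?

CLIP is a model created by OpenAI that allows us to compare images to text captions. This is extremely powerful, since it allows us to quantify how well an image matches a prompt. And since the process is differentiable, we can use this as a loss function to guide our diffusion model!

We won't go too much into the details here. The basic approach is as follows:

- Embed the text prompt to get a 512-dimensional CLIP embedding of the text
- For every step in the diffusion model process:
  - Make several variants of the predicted denoised image (having multiple variations gives a cleaner loss signal)
  - For each one, embed the image with CLIP and compare this embedding with the text embedding of the prompt (using a measure called "Great Circle Distance Squared")
- Calculate the gradient of this loss with respect to the current noisy x and use this gradient to modify x before updating it with the scheduler

For a deeper explanation of CLIP, check out this lesson on the topic or this report on the OpenCLIP project, which we're using to load the CLIP model. Run the next cell to load a CLIP model: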

# @markdown load a CLIP model and define the loss function
import open_clip

clip_model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
clip_model.to(device)

# Transforms to resize and augment an image + normalize to match CLIP's training data
tfms = torchvision.transforms.Compose(
    [
        torchvision.transforms.RandomResizedCrop(224),  # Random CROP each time
        torchvision.transforms.RandomAffine(5),  # One possible random augmentation: skews the image
        torchvision.transforms.RandomHorizontalFlip(),  # You can add additional augmentations if you like
        torchvision.transforms.Normalize(
            mean=(0.48145466, 0.4578275, 0.40821073),
            std=(0.26862954, 0.26130258, 0.27577711),
        ),
    ]
)


# And define a loss function that takes an image, embeds it and compares with
# the text features of the prompt
def clip_loss(image, text_features):
    image_features = clip_model.encode_image(tfms(image))  # Note: applies the above transforms
    input_normed = torch.nn.functional.normalize(image_features.unsqueeze(1), dim=2)
    embed_normed = torch.nn.functional.normalize(text_features.unsqueeze(0), dim=2)
    dists = input_normed.sub(embed_normed).norm(dim=2).div(2).arcsin().pow(2).mul(2)  # Squared Great Circle Distance
    return dists.mean()
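The chained tensor expression in `clip_loss` is compact, so here is the same distance written out for two small vectors in plain Python. This is an illustrative sketch only: `squared_great_circle_distance` is a hypothetical helper, and the 2-D vectors stand in for 512-dimensional CLIP embeddings.

```python
import math


def squared_great_circle_distance(a, b):
    """2 * arcsin(chord / 2) ** 2 between the unit-normalized vectors a and b."""
    norm_a = math.sqrt(sum(v * v for v in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    unit_a = [v / norm_a for v in a]
    unit_b = [v / norm_b for v in b]
    # Straight-line (chord) distance between the two points on the unit sphere
    chord = math.sqrt(sum((p - q) ** 2 for p, q in zip(unit_a, unit_b)))
    half_chord = min(1.0, chord / 2)  # guard against rounding just above 1
    return 2 * math.asin(half_chord) ** 2


same = squared_great_circle_distance([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = squared_great_circle_distance([1.0, 0.0], [0.0, 3.0])  # 90 degrees apart

print(same)        # 0.0 up to rounding
print(orthogonal)  # (pi / 2) ** 2 / 2, about 1.2337
```

Since `arcsin(chord / 2)` is half the angle between the two unit vectors, the result is half the squared angle: it depends only on direction, not on the length of either embedding.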

With a loss function defined, our guided sampling loop looks similar to the previous examples, replacing `color_loss()` with our new CLIP-based loss function:

>>> # @markdown applying guidance using CLIP

>>> prompt = "Red Rose (still life), red flower painting"  # @param

>>> # Explore changing this
>>> guidance_scale = 8  # @param
>>> n_cuts = 4  # @param

>>> # More steps -> more time for the guidance to have an effect
>>> scheduler.set_timesteps(50)

>>> # We embed a prompt with CLIP as our target
>>> text = open_clip.tokenize([prompt]).to(device)
>>> with torch.no_grad(), torch.cuda.amp.autocast():
...     text_features = clip_model.encode_text(text)


>>> x = torch.randn(4, 3, 256, 256).to(device)  # RAM usage is high, you may want only 1 image at a time

>>> for i, t in tqdm(enumerate(scheduler.timesteps)):

...     model_input = scheduler.scale_model_input(x, t)

...     # predict the noise residual
...     with torch.no_grad():
...         noise_pred = image_pipe.unet(model_input, t)["sample"]

...     cond_grad = 0

...     for cut in range(n_cuts):

...         # Set requires grad on x
...         x = x.detach().requires_grad_()

...         # Get the predicted x0:
...         x0 = scheduler.step(noise_pred, t, x).pred_original_sample

...         # Calculate loss
...         loss = clip_loss(x0, text_features) * guidance_scale

...         # Get gradient (scale by n_cuts since we want the average)
...         cond_grad -= torch.autograd.grad(loss, x)[0] / n_cuts

...     if i % 25 == 0:
...         print("Step:", i, ", Guidance loss:", loss.item())

...     # Modify x based on this gradient
...     # NB: alphas_cumprod is indexed here by the loop counter i, not by the timestep t
...     alpha_bar = scheduler.alphas_cumprod[i]
...     x = x.detach() + cond_grad * alpha_bar.sqrt()  # Note the additional scaling factor here!

...     # Now step with scheduler
...     x = scheduler.step(noise_pred, t, x).prev_sample


>>> grid = torchvision.utils.make_grid(x.detach(), nrow=4)
>>> im = grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5
>>> Image.fromarray(np.array(im * 255).astype(np.uint8))

Step: 0 , Guidance loss: 7.437869548797607
Step: 25 , Guidance loss: 7.174620628356934

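Those look sort of like roses! It's not perfect, but if you play around with the settings you can get some pleasing images with this.

If you examine the code above you'll see I'm scaling the conditioning gradient by a factor of `alpha_bar.sqrt()`. There is some theory showing the "right" way to scale these gradients, but in practice this is also something you can experiment with. For some types of guidance, you might want most of the effect concentrated in the early steps; for others (say, a style loss focused on textures) you may prefer that they only kick in towards the end of the generation process. Some possible schedules are shown below: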

>>> # @markdown Plotting some possible schedules:
>>> plt.plot([1 for a in scheduler.alphas_cumprod], label="no scaling")
>>> plt.plot([a for a in scheduler.alphas_cumprod], label="alpha_bar")
>>> plt.plot([a.sqrt() for a in scheduler.alphas_cumprod], label="alpha_bar.sqrt()")
>>> plt.plot([(1 - a).sqrt() for a in scheduler.alphas_cumprod], label="(1-alpha_bar).sqrt()")
>>> plt.legend()
>>> plt.title("Possible guidance scaling schedules")
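If you can't run the plot, the same trends can be read off a few numbers. The sketch below rebuilds `alphas_cumprod` in plain Python under an assumed linear beta schedule (1e-4 to 0.02 over 1000 steps, the DDPM paper's defaults). The scheduler you loaded may be configured differently, so treat the printed values as illustrative.

```python
import math

# Assumed linear beta schedule; a real scheduler's config may use other values.
num_train_timesteps = 1000
beta_start, beta_end = 1e-4, 0.02

alphas_cumprod = []
running = 1.0
for i in range(num_train_timesteps):
    beta = beta_start + (beta_end - beta_start) * i / (num_train_timesteps - 1)
    running *= 1 - beta  # cumulative product of (1 - beta)
    alphas_cumprod.append(running)

# Columns: timestep, alpha_bar, sqrt(alpha_bar), sqrt(1 - alpha_bar)
for t in (999, 500, 0):  # high noise -> low noise
    a = alphas_cumprod[t]
    print(t, round(a, 4), round(math.sqrt(a), 4), round(math.sqrt(1 - a), 4))
```

Under this assumption, `alpha_bar.sqrt()` is close to zero at the noisiest timesteps and close to one at the end, so scaling by it shifts the guidance towards the late steps, while `(1 - alpha_bar).sqrt()` does the opposite.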

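Experiment with different schedules, guidance scales, and any other tricks you can think of (clipping the gradients within some range is a popular modification) to see how good you can get this! Also make sure you try swapping in other models. Perhaps the faces model we loaded at the start: can you reliably guide it to produce a male face? What if you combine CLIP guidance with the color loss we used earlier? And so on.

If you check out some code for CLIP-guided diffusion in practice, you'll see a more complex approach with a better class for picking random cutouts from the images and lots of additional tweaks to the loss function for better performance. Before text-conditioned diffusion models came along, this was the best text-to-image system there was! Our little toy version here has lots of room to improve, but it captures the core idea: thanks to guidance plus the amazing capabilities of CLIP, we can add text control to an unconditional diffusion model 🎨.

Sharing a Custom Sampling Loop as a Gradio Demo

Perhaps you've figured out a fun loss to guide generation with, and you now want to share both your fine-tuned model and this custom sampling strategy with the world...

Enter Gradio. Gradio is a free and open-source tool that allows users to easily create and share interactive machine learning models through a simple web interface. With Gradio, users can build custom interfaces for their machine learning models, which can then be shared with others through a unique URL. It is also integrated into 🤗 Spaces, which makes it easy to host demos and share them with others.

We'll put our core logic in a function that takes some inputs and produces an image as the output. This can then be wrapped in a simple interface that allows the user to specify some parameters (which are passed as inputs to the main generate function). There are many components available. For this example we'll use a slider for the guidance scale and a color picker to define the target color.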

%pip install -q gradio # Install the library

import gradio as gr
from PIL import Image, ImageColor


# The function that does the hard work
def generate(color, guidance_loss_scale):
    target_color = ImageColor.getcolor(color, "RGB")  # Target color as RGB
    target_color = [a / 255 for a in target_color]  # Rescale from (0, 255) to (0, 1)
    x = torch.randn(1, 3, 256, 256).to(device)
    for i, t in tqdm(enumerate(scheduler.timesteps)):
        model_input = scheduler.scale_model_input(x, t)
        with torch.no_grad():
            noise_pred = image_pipe.unet(model_input, t)["sample"]
        x = x.detach().requires_grad_()
        x0 = scheduler.step(noise_pred, t, x).pred_original_sample
        loss = color_loss(x0, target_color) * guidance_loss_scale
        cond_grad = -torch.autograd.grad(loss, x)[0]
        x = x.detach() + cond_grad
        x = scheduler.step(noise_pred, t, x).prev_sample
    grid = torchvision.utils.make_grid(x, nrow=4)
    im = grid.permute(1, 2, 0).cpu().clip(-1, 1) * 0.5 + 0.5
    im = Image.fromarray(np.array(im * 255).astype(np.uint8))
    im.save("test.jpeg")
    return im


# See the gradio docs for the types of inputs and outputs available
inputs = [
    gr.ColorPicker(label="color", value="#55FFAA"),  # Add any inputs you need here
    gr.Slider(label="guidance_scale", minimum=0, maximum=30, value=3),
]
outputs = gr.Image(label="result")

# And the minimal interface
demo = gr.Interface(
    fn=generate,
    inputs=inputs,
    outputs=outputs,
    examples=[
        ["#BB2266", 3],
        ["#44CCAA", 5],  # You can provide some example inputs to get people started
    ],
)
demo.launch(debug=True)  # debug=True allows you to see errors and output in Colab
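The first two lines of `generate` turn the color picker's hex string into the 0-to-1 floats that `color_loss` expects. If you want to check that conversion without PIL, here is a pure-Python sketch. `hex_to_unit_rgb` is a hypothetical helper written for this illustration; it is not part of PIL or Gradio and only handles the six-digit form.

```python
def hex_to_unit_rgb(color):
    """Convert a '#RRGGBB' string to three floats in [0, 1]."""
    digits = color.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected '#RRGGBB', got {color!r}")
    # Read two hex digits per channel and rescale from 0-255 to 0-1
    return [int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4)]


rgb = hex_to_unit_rgb("#55FFAA")
print(rgb)  # roughly [0.333, 1.0, 0.667]
```

These are the values that would be passed on as `target_color` for the example swatch.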

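It is possible to build much more complicated interfaces, with fancy styling and a large array of possible inputs, but for this demo we keep it as simple as possible.

Demos on 🤗 Spaces run on CPU by default, so it is nice to prototype your interface in Colab (as above) before migrating. When you're ready to share your demo, you'll create a Space, set up a `requirements.txt` file listing the libraries your code will use, and then place all of the code in an `app.py` file that defines the relevant functions and the interface.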

[Screenshot: the demo Space]

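Fortunately for us, there is also the option to "Duplicate" a Space. You can visit my demo Space here (shown above) and click "Duplicate this Space" to get a template that you can then modify to use your own model and guidance function.

In the settings, you can configure your Space to run on fancier hardware (charged per hour). Made something amazing and want to share it on better hardware, but don't have the money? Let us know via Discord and we'll see if we can help!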

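Summary and Next Steps

We've covered a lot in this notebook! Let's recap the core ideas:

- It's relatively easy to load existing models and sample them with different schedulers
- Fine-tuning looks just like training from scratch, except that by starting from an existing model we hope to get better results more quickly
- To fine-tune large models on big images, we can use tricks like gradient accumulation to get around batch size limitations
- Logging sample images is important for fine-tuning, where a loss curve might not show much useful information
- Guidance allows us to take an unconditional model and steer the generation process based on some guidance/loss function, where at each step we find the gradient of the loss with respect to the noisy image x and update it according to this gradient before moving on to the next timestep
- Guiding with CLIP lets us control unconditional models with text!

To put this into practice, here are some specific next steps you can take:

- Fine-tune your own model and push it to the Hub. This will involve picking a starting point (for example, a model trained on faces, bedrooms, cats, or the WikiArt example above) and a dataset (perhaps these animal faces or your own images), and then running either the code in this notebook or the example script (demo usage is shown earlier in this notebook).
- Explore guidance using your fine-tuned model, either using one of the example guidance functions (color_loss or CLIP) or inventing your own.
- Share a demo based on this using Gradio, either modifying the example Space to use your own model or creating your own custom version with more functionality.

We look forward to seeing your results on Discord, Twitter, and elsewhere 🤗!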
