Explaining the SDXL latent space

Community article published May 20, 2024

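A short background story
The 4 channels of the SDXL latents
    The 8-bit pixel space has 3 channels
    The SDXL latent representation of an image has 4 channels
    Direct conversion of SDXL latents to RGB with a linear approximation
    A probable reason why the SDXL color range is biased towards yellow
What needs correcting?
Let's take an example output from SDXL
A complete demonstration
Increasing color range / removing color bias
Long prompts at high guidance scales becoming possible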

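Introduction

I started getting some odd questions after this article was scraped and rewritten by other websites. If you are reading this anywhere other than Hugging Face, here is the original: https://huggingface.co/blog/TimothyAlexisVass/explaining-the-sdxl-latent-space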

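A short background story

Special thanks to Ollin Boer Bohan, Haoming, Cristina Segalin and Birchlabs for their help, discussion and knowledge!

I was creating correction filters for the SDXL inference process, for a diffusion model UI I am building.

After many years of experience with image correction, I wanted the fundamental capability to improve the actual output from SDXL. There were many techniques I wanted to have available in the user experience, so I set out to solve this myself. I noticed that SDXL output is almost always either noisy in regular patterns or overly smooth. Because of how SD models work, the color space always needs white balancing, and the color range is biased and restricted.

Making corrections in post-processing, after the image has been generated and converted to 8-bit RGB, makes little sense if the information and color range can be improved before the actual output.

To build filters and correction tools, the most important thing is to understand the data you are working with.

This led me to an experimental exploration of the SDXL latents, with the intention of understanding them. The tensor that diffusion models based on the SDXL architecture work with looks like this: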

[batch_size, 4 channels, height (y), width (x)]

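My first question was simply "**What exactly are these 4 channels?**". Most of the answers I received were along the lines of "It's not something a human can understand."

But it is most definitely understandable. It is even very easy to understand, and useful to know.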

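The 4 channels of the SDXL latents

For a 1024×1024 pixel image generated by SDXL, the latent tensor is 128×128 pixels, where every pixel in the latent space represents 64 (8×8) pixels in the pixel space. If we generate the latents and decode them into a standard 8-bit JPG image, then...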

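The 8-bit pixel space has 3 channels

Red (R), green (G) and blue (B), each with 256 possible values ranging from 0 to 255. So, to store the full information of 64 pixels, we need to be able to store 64×256 = 16,384 values per channel in every latent pixel.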
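The arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the constant names are illustrative, and the factor of 8 per side is taken from the 1024 to 128 example above.

```python
# Sketch of the size arithmetic; the names are illustrative, not from any library.
IMAGE_SIZE = 1024          # SDXL output width/height in pixels
DOWNSCALE = 8              # each latent pixel covers an 8x8 block of image pixels
VALUES_PER_CHANNEL = 256   # one 8-bit channel holds values 0-255

latent_size = IMAGE_SIZE // DOWNSCALE
pixels_per_latent_pixel = DOWNSCALE * DOWNSCALE
values_per_latent_channel = pixels_per_latent_pixel * VALUES_PER_CHANNEL

print(latent_size, pixels_per_latent_pixel, values_per_latent_channel)
# 128 64 16384
```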

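The SDXL latent representation of an image has 4 channels

Click the heading for an interactive demo!

0: Luminance
1: Cyan/Red => equivalent to rgb(0, 255, 255)/rgb(255, 0, 0)
2: Lime/Medium Purple => equivalent to rgb(127, 255, 0)/rgb(127, 0, 255)
3: Pattern/structure.

If each value can range between -4 and 4 at the point of decoding, then in half-precision 16-bit floating-point format, each of the 4 channels in a latent pixel can hold 16,384 distinct values.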

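Direct conversion of SDXL latents to RGB with a linear approximation

With this understanding, we can create an approximation function that converts the latents directly to RGB: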

import torch
from PIL import Image

def latents_to_rgb(latents):
    # Rows are the R, G and B outputs; columns are latent channels 0-3
    weights = (
        (60, -60, 25, -70),
        (60,  -5, 15, -50),
        (60,  10, -5, -35)
    )

    # Transpose to [latent channel, RGB channel] for the einsum below
    weights_tensor = torch.t(torch.tensor(weights, dtype=latents.dtype).to(latents.device))
    biases_tensor = torch.tensor((150, 140, 130), dtype=latents.dtype).to(latents.device)
    rgb_tensor = torch.einsum("...lxy,lr -> ...rxy", latents, weights_tensor) + biases_tensor.unsqueeze(-1).unsqueeze(-1)
    # Clamp to the 8-bit range and take the first image in the batch
    image_array = rgb_tensor.clamp(0, 255)[0].byte().cpu().numpy()
    image_array = image_array.transpose(1, 2, 0)  # [channels, height, width] -> [height, width, channels]

    return Image.fromarray(image_array)

Here are the latents_to_rgb result and a regular decoded output, resized for comparison:
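To see what each channel contributes, the same weights and biases can be applied to a single latent pixel in plain Python. This is a minimal sketch rather than part of the original tooling: `latent_pixel_to_rgb` is a hypothetical helper, and it truncates to integers the same way `.byte()` does for in-range values.

```python
# Per-pixel version of the linear approximation, using the same weights and biases.
WEIGHTS = (
    (60, -60, 25, -70),  # red
    (60,  -5, 15, -50),  # green
    (60,  10, -5, -35),  # blue
)
BIASES = (150, 140, 130)

def latent_pixel_to_rgb(latent_pixel):
    """Map one latent pixel (4 channel values) to a clamped 8-bit RGB tuple."""
    rgb = []
    for row, bias in zip(WEIGHTS, BIASES):
        value = sum(w * c for w, c in zip(row, latent_pixel)) + bias
        rgb.append(int(min(max(value, 0), 255)))
    return tuple(rgb)

print(latent_pixel_to_rgb((0, 0, 0, 0)))   # (150, 140, 130): neutral latent
print(latent_pixel_to_rgb((1, 0, 0, 0)))   # (210, 200, 190): channel 0 raises luminance
print(latent_pixel_to_rgb((0, 1, 0, 0)))   # (90, 135, 140): channel 1 up shifts to cyan
print(latent_pixel_to_rgb((0, -1, 0, 0)))  # (210, 145, 120): channel 1 down shifts to red
```

Note that the all-zero latent already maps to a slightly warm gray, since the biases are (150, 140, 130).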

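A probable reason why the SDXL color range is biased towards yellow

Relatively few things in nature are blue or white. These colors are most prominent in the sky, in pleasant weather. So the model, which knows reality through images, thinks in luminance (channel 0), cyan/red (channel 1) and lime/medium purple (channel 2), where red and green are primary and blue is secondary. This is why SDXL generations are so often biased towards yellow (red + green).

During inference, the values in the tensor begin at min < -30 and max > 30, while the min/max boundary at the time of decoding is around -4 to 4. At a higher guidance_scale, the difference between min and max is larger.

One key to understanding the boundary is to look at what happens in the decoding process: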

decoded = vae.decode(latents / vae.config.scaling_factor).sample  # SDXL VAE scaling_factor = 0.13025
decoded = decoded.div(2).add(0.5).clamp(0, 1)  # The dynamics outside of 0 to 1 at this point will be lost

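If the values at this point fall outside the range 0 to 1, some information is lost in the clamp. So if we make corrections during denoising to give the VAE the values it expects, we may get better results.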
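The loss in the clamp can be illustrated on single values. This is a minimal sketch with a hypothetical helper, `to_unit_range`, which mirrors the `div(2).add(0.5).clamp(0, 1)` step for one number and assumes the decoder output is nominally in the range -1 to 1.

```python
def to_unit_range(decoded_value):
    """Mirror decoded.div(2).add(0.5).clamp(0, 1) for a single value."""
    return min(max(decoded_value / 2 + 0.5, 0.0), 1.0)

print(to_unit_range(0.0))   # 0.5
print(to_unit_range(1.0))   # 1.0
print(to_unit_range(1.4))   # 1.0, indistinguishable from 1.0 after the clamp
print(to_unit_range(-1.6))  # 0.0
```

Two different decoder outputs, 1.0 and 1.4, end up as the same pixel value, which is the lost dynamics the comment in the decoding code refers to.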

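What needs correcting?

How do you sharpen a blurry image, white balance it, improve detail, increase contrast or increase the color range? The best way is to begin with an image that is sharp, correctly white balanced, with good contrast, crisp details and a wide range.

It is far easier to blur a sharp image, shift the color balance, reduce contrast, introduce nonsensical details and limit the color range than it is to improve any of them.

SDXL has a very prominent tendency towards color bias and towards putting values outside the actual boundaries (left image). This is easily solved by centering the values and bringing them within the boundaries (right image):

Original output outside of boundaries

Exaggerated correction for illustrative purposes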

def center_tensor(input_tensor, per_channel_shift=1, full_tensor_shift=1, channels=(0, 1, 2, 3)):
    # Shift each selected channel of the first batch item towards zero mean (in place)
    for channel in channels:
        input_tensor[0, channel] -= input_tensor[0, channel].mean() * per_channel_shift
    # Then shift the whole tensor towards zero mean
    return input_tensor - input_tensor.mean() * full_tensor_shift

Let's take an example output from SDXL

seed: 77777777
guidance_scale: 20 # A high guidance scale can be fixed too
steps with base: 23
steps with refiner: 10

prompt: Cinematic.Beautiful smile action woman in detailed white mecha gundam armor with red details,green details,blue details,colorful,star wars universe,lush garden,flowers,volumetric lighting,perfect eyes,perfect teeth,blue sky,bright,intricate details,extreme detail of environment,infinite focus,well lit,interesting clothes,radial gradient fade,directional particle lighting,wow

negative_prompt: helmet, bokeh, painting, artwork, blocky, blur, ugly, old, boring, photoshopped, tired, wrinkles, scar, gray hair, big forehead, crosseyed, dumb, stupid, cockeyed, disfigured, crooked, blurry, unrealistic, grayscale, bad anatomy, unnatural irises, no pupils, blurry eyes, dark eyes, extra limbs, deformed, disfigured eyes, out of frame, no irises, assymetrical face, broken fingers, extra fingers, disfigured hands

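Note that I deliberately chose a high guidance scale.

How can we fix this image? It is half painting, half photograph. The color range is biased towards yellow. On the right is a fixed generation with exactly the same settings.

Even with guidance_scale set to a reasonable value of 7.5, we can still conclude that the fixed output is better, without nonsensical details and with correct white balance.

There are many things we can do in the latent space to improve a generation in general, and there are some very simple things we can do to target specific errors in a generation:

Outlier removal

This controls the amount of nonsensical detail by pruning the values that are farthest from the mean of the distribution. It also helps when generating at a higher guidance_scale.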

import torch

# Shrinking towards the mean (will also remove outliers)
def soft_clamp_tensor(input_tensor, threshold=3.5, boundary=4):
    # Nothing to do if every value is already within the decoding range of about -4 to 4
    if max(abs(input_tensor.max()), abs(input_tensor.min())) < 4:
        return input_tensor
    channel_dim = 1

    # Remap values above the threshold into the range threshold..boundary
    max_vals = input_tensor.max(channel_dim, keepdim=True)[0]
    max_replace = ((input_tensor - threshold) / (max_vals - threshold)) * (boundary - threshold) + threshold
    over_mask = (input_tensor > threshold)

    # Remap values below -threshold into the range -boundary..-threshold
    min_vals = input_tensor.min(channel_dim, keepdim=True)[0]
    min_replace = ((input_tensor + threshold) / (min_vals + threshold)) * (-boundary + threshold) - threshold
    under_mask = (input_tensor < -threshold)

    return torch.where(over_mask, max_replace, torch.where(under_mask, min_replace, input_tensor))
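The remapping formula is easier to follow on single numbers. The sketch below is a simplification, not the tensor code itself: `soft_clamp_value` is a hypothetical helper that uses one overall maximum magnitude (`max_abs`, assumed to be larger than `threshold`) instead of the per-pixel maxima and minima taken along the channel dimension.

```python
def soft_clamp_value(value, max_abs, threshold=3.5, boundary=4.0):
    """Remap a value beyond +/-threshold so that +/-max_abs lands on +/-boundary.

    Assumes max_abs > threshold, otherwise the division below is undefined.
    """
    if value > threshold:
        return (value - threshold) / (max_abs - threshold) * (boundary - threshold) + threshold
    if value < -threshold:
        return (value + threshold) / (-max_abs + threshold) * (-boundary + threshold) - threshold
    return value

print(soft_clamp_value(6.0, max_abs=6.0))    # 4.0: the largest value lands on the boundary
print(soft_clamp_value(4.75, max_abs=6.0))   # 3.75: halfway between threshold and boundary
print(soft_clamp_value(2.0, max_abs=6.0))    # 2.0: values inside the threshold are untouched
print(soft_clamp_value(-6.0, max_abs=6.0))   # -4.0
```

Values between the threshold and the maximum are compressed linearly instead of being cut off, so their relative order is preserved.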

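Color balancing and increased range

I have two main methods for achieving this. The first is to shrink towards the mean while normalizing the values (which also removes outliers), and the second is to correct the values when they become biased towards some color. This also helps when generating at a higher guidance_scale.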

# Center tensor (balance colors)
def center_tensor(input_tensor, channel_shift=1, full_shift=1, channels=(0, 1, 2, 3)):
    # Shift each selected channel of the first batch item towards zero mean (in place)
    for channel in channels:
        input_tensor[0, channel] -= input_tensor[0, channel].mean() * channel_shift
    # Then shift the whole tensor towards zero mean
    return input_tensor - input_tensor.mean() * full_shift
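The effect of the two shift parameters can be seen on a tiny example. This is a sketch in plain Python with nested lists standing in for the channels of one latent; `center_channels` is a hypothetical stand-in that follows the same two steps as the tensor version.

```python
def center_channels(channels, channel_shift=1.0, full_shift=1.0):
    """Shift each channel towards zero mean, then shift all values together."""
    shifted = []
    for values in channels:
        channel_mean = sum(values) / len(values)
        shifted.append([v - channel_mean * channel_shift for v in values])
    flat = [v for values in shifted for v in values]
    full_mean = sum(flat) / len(flat)
    return [[v - full_mean * full_shift for v in values] for values in shifted]

biased = [[1.0, 3.0], [10.0, 14.0]]  # both channels sit well above zero

print(center_channels(biased))
# [[-1.0, 1.0], [-2.0, 2.0]]: every channel is fully centered

print(center_channels(biased, channel_shift=0.5, full_shift=0.0))
# [[0.0, 2.0], [4.0, 8.0]]: each channel moves only halfway towards zero mean
```

A shift below 1 gives a partial correction, which is how the callback further down applies gentler centering at some timesteps.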

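Tensor maximizing

This is basically done by multiplying the tensors by a very small amount, such as 1e-5, for a few steps, and by making sure the final tensor uses the full possible range (close to -4/4) before it is converted to RGB. Remember that in pixel space it is easier to reduce contrast, saturation and sharpness with the dynamics intact than it is to increase them.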

# Maximize/normalize tensor
def maximize_tensor(input_tensor, boundary=4, channels=(0, 1, 2)):
    min_val = input_tensor.min()
    max_val = input_tensor.max()

    # Scale so that the largest absolute value lands on the boundary
    normalization_factor = boundary / max(abs(min_val), abs(max_val))
    input_tensor[0, list(channels)] *= normalization_factor

    return input_tensor
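The normalization factor can be checked on a short list of numbers. This is a minimal sketch, assuming at least one non-zero value; `maximize_values` is a hypothetical helper that scales every value, whereas the tensor version only scales the selected channels.

```python
def maximize_values(values, boundary=4.0):
    """Scale values so the largest magnitude lands exactly on the boundary."""
    largest = max(abs(min(values)), abs(max(values)))
    factor = boundary / largest
    return [v * factor for v in values]

print(maximize_values([-1.0, 0.5, 2.0]))  # [-2.0, 1.0, 4.0]
print(maximize_values([-8.0, 2.0, 4.0]))  # [-4.0, 1.0, 2.0]
```

The same function stretches a range that is too narrow and compresses one that is too wide, because the factor is simply the boundary divided by the largest magnitude.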

Example callback implementation

def callback(pipe, step_index, timestep, cbk):
    # Earliest steps: softly clamp the most extreme values
    if timestep > 950:
        threshold = max(cbk["latents"].max(), abs(cbk["latents"].min())) * 0.998
        cbk["latents"] = soft_clamp_tensor(cbk["latents"], threshold*0.998, threshold)
    # Early steps: partially center the tensor to balance colors
    if timestep > 700:
        cbk["latents"] = center_tensor(cbk["latents"], 0.8, 0.8)
    # Final steps: center again and stretch to the full range
    if timestep > 1 and timestep < 100:
        cbk["latents"] = center_tensor(cbk["latents"], 0.6, 1.0)
        cbk["latents"] = maximize_tensor(cbk["latents"])
    return cbk

# base, prompt and guidance_scale are defined elsewhere
image = base(
    prompt,
    guidance_scale = guidance_scale,
    callback_on_step_end=callback,
    callback_on_step_end_inputs=["latents"]
).images[0]
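The timestep conditions in the callback define a simple schedule. The sketch below only restates those conditions and applies no corrections itself; `corrections_for` is a hypothetical helper, and it assumes the usual convention that timesteps count down from roughly 1000 at the noisiest step to 0 at the end.

```python
def corrections_for(timestep):
    """Return the names of the corrections the callback applies at a timestep."""
    steps = []
    if timestep > 950:
        steps.append("soft_clamp")
    if timestep > 700:
        steps.append("center(0.8, 0.8)")
    if 1 < timestep < 100:
        steps.append("center(0.6, 1.0)")
        steps.append("maximize")
    return steps

for t in (980, 800, 500, 50):
    print(t, corrections_for(t))
# 980 ['soft_clamp', 'center(0.8, 0.8)']
# 800 ['center(0.8, 0.8)']
# 500 []
# 50 ['center(0.6, 1.0)', 'maximize']
```

Under that assumption, outlier control and centering happen while the image is still mostly noise, the middle of the process is left alone, and the range is maximized just before decoding.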

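This simple implementation of the three methods is used in the last set of images, with the woman in the garden.

A complete demonstration

Click the heading or this link for an interactive demo!

This demonstration uses a more advanced implementation of the techniques: it detects outliers using the Z-score, shifts towards the mean dynamically, and applies a strength to each technique.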
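For readers unfamiliar with the Z-score, the idea is to measure how many standard deviations a value lies from the mean. The following is a generic sketch of Z-score outlier detection in plain Python, not the code used in the demonstration; the function name and the threshold of 2.5 are assumptions made for illustration.

```python
from statistics import mean, pstdev

def zscore_outliers(values, z_threshold=2.5):
    """Return the indices of values more than z_threshold deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing is an outlier
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > z_threshold]

values = [0.0] * 9 + [10.0]  # nine typical values and one extreme one
print(zscore_outliers(values))  # [9]
```

Unlike a fixed threshold such as 3.5, this adapts to the current spread of the values, which changes considerably over the course of denoising.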

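Original SDXL (too yellow) and slight modification (white balanced)

Medium modification and hard modification (both with all 3 techniques applied)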

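Increasing color range / removing color bias

Below, SDXL has limited the color range to red and green in the regular output, because nothing in the prompt suggests that there is such a thing as blue. This is a rather good generation, but the color range has been restricted.

If you give someone a palette of black, red, green and yellow and then ask them to paint a clear blue sky, the natural response is to ask you for blue and white.

To include blue in the generation, we can simply realign the color space when it becomes restricted, and SDXL will appropriately include the complete color spectrum in the generation.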

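Long prompts at high guidance scales becoming possible

Here is a typical scenario where the increased color range makes the whole prompt possible.
This example applies the simple, hard modification shown earlier, to illustrate the difference more clearly.

prompt: Photograph of woman in red dress in a luxury garden surrounded with blue, yellow, purple and flowers in many colors, high class, award-winning photography, Portra 400, full format. blue sky, intricate details even to the smallest particle, extreme detail of the environment, sharp portrait, well lit, interesting outfit, beautiful shadows, bright, photoquality, ultra realistic, masterpiece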