Image Interpolation with Stable Diffusion

Authored by: Rustam Akimov

This notebook shows how to use Stable Diffusion to interpolate between images. Image interpolation with Stable Diffusion is the process of creating intermediate images that smoothly transition from one given image to another, using a diffusion-based generative model.

Here are some different use cases for image interpolation with Stable Diffusion:

  • Data augmentation: Stable Diffusion can augment the training data for machine learning models by generating synthetic images that lie between existing data points. This can improve the generalization and robustness of machine learning models, especially in tasks such as image generation, classification, or object detection.
  • Product design and prototyping: Stable Diffusion can assist product design by generating variations of product designs or prototypes with subtle differences. This can be useful for exploring design alternatives, conducting user studies, or visualizing design iterations before committing to physical prototypes.
  • Content generation for media production: In media production such as film and video editing, Stable Diffusion can be used to generate intermediate frames between key frames, enabling smoother transitions and enhancing visual storytelling. This can save time and resources compared to manual frame-by-frame editing.

In the context of image interpolation, Stable Diffusion models are typically used to navigate a high-dimensional latent space. Each dimension represents a specific feature that the model has learned. By walking through this latent space and interpolating between different latent representations of images, the model is able to generate a sequence of intermediate images that show a smooth transition between the original images. There are two types of latents in Stable Diffusion: prompt latents and image latents.

Latent space walking involves moving through the latent space along a path defined by two or more points (representing images). By carefully selecting these points and the path between them, it is possible to control the features of the generated images, such as style, content, and other visual aspects. A minimal sketch of this idea is shown right below.
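To make the idea concrete, here is a minimal, self-contained sketch (not part of the notebook's actual pipeline) that walks linearly between two randomly sampled image latents. The 1 × 4 × 64 × 64 shape and the num_steps value are illustrative assumptions that match the 512 × 512 resolution used later in this notebook.

import torch

# Illustrative sketch only: Stable Diffusion v1.5 works with two kinds of latents,
#   - prompt latents: text-encoder outputs of shape (1, 77, 768)
#   - image latents:  UNet inputs of shape (1, 4, height // 8, width // 8)
# Walking the latent space means sampling points along a path between two such tensors.

num_steps = 10                      # assumed number of points on the path
start = torch.randn(1, 4, 64, 64)   # first image latent (for 512x512 images)
end = torch.randn(1, 4, 64, 64)     # second image latent

# A simple linear path: each t in [0, 1] yields one intermediate latent.
path = [torch.lerp(start, end, float(t)) for t in torch.linspace(0, 1, num_steps)]
print(len(path), path[0].shape)     # 10 intermediate latents, each shaped like a UNet input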

In this notebook, we will explore examples of image interpolation using Stable Diffusion and demonstrate how latent space walking can be implemented and used to create smooth transitions between images. We will provide code snippets and visualizations that illustrate this process in action, offering a deeper understanding of how generative models can manipulate and morph image representations in meaningful ways.

First, let's install all the required modules.

!pip install -q diffusers transformers xformers accelerate
!pip install -q numpy scipy ftfy Pillow

Import modules

import torch
import numpy as np
import os

import time

from PIL import Image
from IPython import display as IPdisplay
from tqdm.auto import tqdm

from diffusers import StableDiffusionPipeline
from diffusers import (
    DDIMScheduler,
    PNDMScheduler,
    LMSDiscreteScheduler,
    DPMSolverMultistepScheduler,
    EulerAncestralDiscreteScheduler,
    EulerDiscreteScheduler,
)
from transformers import logging

logging.set_verbosity_error()

Let's check whether CUDA is available.

print(torch.cuda.is_available())

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

These settings are used to optimize the performance of PyTorch models on CUDA-enabled GPUs, especially when using mixed precision training or inference, which can offer benefits in both speed and memory usage.
Source: https://huggingface.co/docs/diffusers/optimization/fp16#memory-efficient-attention

torch.backends.cudnn.benchmark = True
torch.backends.cuda.matmul.allow_tf32 = True

Model

The runwayml/stable-diffusion-v1-5 model and the LMSDiscreteScheduler scheduler were chosen to generate images. Despite being an older technology, SD1.5 remains popular due to its fast performance, minimal memory requirements, and the large number of community fine-tuned models built on top of it. However, you are free to experiment with other models and schedulers and compare the results.

model_name_or_path = "runwayml/stable-diffusion-v1-5"

scheduler = LMSDiscreteScheduler(
    beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000
)


pipe = StableDiffusionPipeline.from_pretrained(
    model_name_or_path,
    scheduler=scheduler,
    torch_dtype=torch.float32,
).to(device)

# Disable image generation progress bar, we'll display our own
pipe.set_progress_bar_config(disable=True)

These methods are designed to reduce the memory consumed by the GPU. If you have enough VRAM, you can skip this cell.

More detailed information can be found here: https://huggingface.co/docs/diffusers/en/optimization/opt_overview
In particular, information about the following methods can be found here: https://huggingface.co/docs/diffusers/optimization/memory

# Offloading the weights to the CPU and only loading them on the GPU can reduce memory consumption to less than 3GB.
pipe.enable_model_cpu_offload()

# Tighter ordering of memory tensors.
pipe.unet.to(memory_format=torch.channels_last)

# Decoding large batches of images with limited VRAM or batches with 32 images or more by decoding the batches of latents one image at a time.
pipe.enable_vae_slicing()

# Splitting the image into overlapping tiles, decoding the tiles, and then blending the outputs together to compose the final image.
pipe.enable_vae_tiling()

# Using Flash Attention; If you have PyTorch >= 2.0 installed, you should not expect a speed-up for inference when enabling xformers.
pipe.enable_xformers_memory_efficient_attention()

The display_images function converts a list of image arrays into a GIF, saves it at the specified path, and returns the GIF object for display. It names the GIF file using the current time and handles any errors by printing them out.

def display_images(images, save_path):
    try:
        # Convert each image in the 'images' list from an array to an Image object.
        images = [Image.fromarray(np.array(image[0], dtype=np.uint8)) for image in images]

        # Generate a file name based on the current time, replacing colons with hyphens
        # to ensure the filename is valid for file systems that don't allow colons.
        filename = time.strftime("%H:%M:%S", time.localtime()).replace(":", "-")
        # Save the first image in the list as a GIF file at the 'save_path' location.
        # The rest of the images in the list are added as subsequent frames to the GIF.
        # The GIF will play each frame for 100 milliseconds and will loop indefinitely.
        images[0].save(
            f"{save_path}/{filename}.gif",
            save_all=True,
            append_images=images[1:],
            duration=100,
            loop=0,
        )
    except Exception as e:
        # If there is an error during the process, print the exception message.
        print(e)

    # Return the saved GIF as an IPython display object so it can be displayed in a notebook.
    return IPdisplay.Image(f"{save_path}/{filename}.gif")

Generation parameters

  • seed: This variable sets a specific random seed for reproducibility.
  • generator: This is set to a PyTorch random number generator object if a seed is provided, otherwise it is None. It ensures that the operations using it produce reproducible results.
  • guidance_scale: This parameter controls how closely the model should follow the prompt in text-to-image generation tasks; higher values lead to stronger adherence to the prompt.
  • num_inference_steps: This specifies the number of steps the model takes to generate an image. More steps can produce higher-quality images but take longer to generate.
  • num_interpolation_steps: This determines the number of steps used when interpolating between two points in the latent space, affecting the smoothness of the transitions in the generated animation.
  • height: The height of the generated images, in pixels.
  • width: The width of the generated images, in pixels.
  • save_path: The file system path where the generated GIFs will be saved.

# The seed is set to "None", because we want different results each time we run the generation.
seed = None

if seed is not None:
    generator = torch.manual_seed(seed)
else:
    generator = None

# The guidance scale is set to its normal range (7 - 10).
guidance_scale = 8

# The number of inference steps was chosen empirically to generate an acceptable picture within an acceptable time.
num_inference_steps = 15

# The higher you set this value, the smoother the interpolations will be. However, the generation time will increase. This value was chosen empirically.
num_interpolation_steps = 30

# I would not recommend less than 512 on either dimension. This is because this model was trained on 512x512 image resolution.
height = 512
width = 512

# The path where the generated GIFs will be saved
save_path = "/output"

if not os.path.exists(save_path):
    os.makedirs(save_path)

Example 1: Prompt interpolation

In this example, interpolating between the positive and negative prompt embeddings allows us to explore the space between the two conceptual points defined by the prompts, potentially producing a variety of images that gradually blend the characteristics dictated by them. Here, the interpolation consists of adding scaled deltas to the original embeddings, creating a series of new embeddings that will later be used to generate images with smooth transitions between different states derived from the original prompt.

Example 1

First, we need to tokenize both the positive and negative text prompts and obtain their embeddings. The positive prompt steers the image generation toward the desired features, while the negative prompt steers it away from unwanted ones.

# The text prompt that describes the desired output image.
prompt = "Epic shot of Sweden, ultra detailed lake with an ren dear, nostalgic vintage, ultra cozy and inviting, wonderful light atmosphere, fairy, little photorealistic, digital painting, sharp focus, ultra cozy and inviting, wish to be there. very detailed, arty, should rank high on youtube for a dream trip."
# A negative prompt that can be used to steer the generation away from certain features.
negative_prompt = "poorly drawn,cartoon, 2d, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry"

# The step size for the interpolation in the latent space.
step_size = 0.001

# Tokenizing and encoding the prompt into embeddings.
prompt_tokens = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
prompt_embeds = pipe.text_encoder(prompt_tokens.input_ids.to(device))[0]


# Tokenizing and encoding the negative prompt into embeddings.
if negative_prompt is None:
    negative_prompt = [""]

negative_prompt_tokens = pipe.tokenizer(
    negative_prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
negative_prompt_embeds = pipe.text_encoder(negative_prompt_tokens.input_ids.to(device))[0]

Now let's look at the part of the code that generates a random initial vector from a normal distribution, structured to match the dimensions expected by the diffusion model (UNet). This allows the results to be reproduced by optionally using a random number generator. After creating the initial vector, the code performs a series of interpolations between the two embeddings (positive and negative prompts), incrementally adding a small step size at each iteration. The results are stored in a list named "walked_embeddings".

# Generating initial latent vectors from a random normal distribution, with the option to use a generator for reproducibility.
latents = torch.randn(
    (1, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)

walked_embeddings = []

# Interpolating between embeddings for the given number of interpolation steps.
for i in range(num_interpolation_steps):
    walked_embeddings.append([prompt_embeds + step_size * i, negative_prompt_embeds + step_size * i])

Finally, let's generate a series of images from the interpolated embeddings and then display them. We iterate over the array of embeddings, using each one to generate an image with the specified characteristics such as height, width, and the other parameters relevant to image generation, and collect the images into a list. Once generation is complete, we call the display_images function to save and display these images as a GIF at the given save path.

# Generating images using the interpolated embeddings.
images = []
for latent in tqdm(walked_embeddings):
    images.append(
        pipe(
            height=height,
            width=width,
            num_images_per_prompt=1,
            prompt_embeds=latent[0],
            negative_prompt_embeds=latent[1],
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latents,
        ).images
    )

# Display of saved generated images.
display_images(images, save_path)

Example 2: Diffusion latent space interpolation for a single prompt

Unlike the first example, in this one we interpolate between two latents of the diffusion model itself rather than between prompts. Note that in this case we use the slerp function for the interpolation. However, nothing stops us from instead adding a constant value to one of the embeddings.

Example 2

The function presented below implements spherical linear interpolation (slerp), a method of interpolating along the surface of a sphere. It is commonly used in computer graphics to animate rotations smoothly, and it can also be used in machine learning to interpolate between high-dimensional data points, such as the latent vectors used in generative models.

The source is Andrej Karpathy's gist: https://gist.github.com/karpathy/00103b0037c5aaea32fe1da1af553355
For a more detailed explanation of this method, see: https://en.wikipedia.org/wiki/Slerp

def slerp(v0, v1, num, t0=0, t1=1):
    v0 = v0.detach().cpu().numpy()
    v1 = v1.detach().cpu().numpy()

    def interpolation(t, v0, v1, DOT_THRESHOLD=0.9995):
        """helper function to spherically interpolate two arrays v1 v2"""
        dot = np.sum(v0 * v1 / (np.linalg.norm(v0) * np.linalg.norm(v1)))
        if np.abs(dot) > DOT_THRESHOLD:
            v2 = (1 - t) * v0 + t * v1
        else:
            theta_0 = np.arccos(dot)
            sin_theta_0 = np.sin(theta_0)
            theta_t = theta_0 * t
            sin_theta_t = np.sin(theta_t)
            s0 = np.sin(theta_0 - theta_t) / sin_theta_0
            s1 = sin_theta_t / sin_theta_0
            v2 = s0 * v0 + s1 * v1
        return v2

    t = np.linspace(t0, t1, num)

    v3 = torch.tensor(np.array([interpolation(t[i], v0, v1) for i in range(num)]))

    return v3

# The text prompt that describes the desired output image.
prompt = (
    "Sci-fi digital painting of an alien landscape with otherworldly plants, strange creatures, and distant planets."
)
# A negative prompt that can be used to steer the generation away from certain features.
negative_prompt = "poorly drawn,cartoon, 3d, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry"

# Generating initial latent vectors from a random normal distribution. In this example two latent vectors are generated, which will serve as start and end points for the interpolation.
# These vectors are shaped to fit the input requirements of the diffusion model's U-Net architecture.
latents = torch.randn(
    (2, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)

# Getting our latent embeddings
interpolated_latents = slerp(latents[0], latents[1], num_interpolation_steps)

# Generating images using the interpolated embeddings.
images = []
for latent_vector in tqdm(interpolated_latents):
    images.append(
        pipe(
            prompt,
            height=height,
            width=width,
            negative_prompt=negative_prompt,
            num_images_per_prompt=1,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latent_vector[None, ...],
        ).images
    )

# Display of saved generated images.
display_images(images, save_path)

Example 3: Interpolation between multiple prompts

In contrast to the first example, where we walked away from a single prompt, in this example we interpolate between any number of prompts. To do so, we take consecutive pairs of prompts and create smooth transitions between them. We then chain together the interpolations of these consecutive pairs and instruct the model to generate images based on them. For the interpolation we use the slerp function, as in the second example.

Example 3

Once again, let's tokenize the multiple positive and negative text prompts and obtain their embeddings.

# Text prompts that describes the desired output image.
prompts = [
    "A cute dog in a beautiful field of lavander colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, kodak portra 400, film grain",
    "A cute cat in a beautiful field of lavander colorful flowers everywhere, perfect lighting, leica summicron 35mm f2.0, kodak portra 400, film grain",
]
# Negative prompts that can be used to steer the generation away from certain features.
negative_prompts = [
    "poorly drawn,cartoon, 2d, sketch, cartoon, drawing, anime, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry",
    "poorly drawn,cartoon, 2d, sketch, cartoon, drawing, anime, disfigured, bad art, deformed, poorly drawn, extra limbs, close up, b&w, weird colors, blurry",
]

# NOTE: The number of prompts must match the number of negative prompts

batch_size = len(prompts)

# Tokenizing and encoding prompts into embeddings.
prompts_tokens = pipe.tokenizer(
    prompts,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
prompts_embeds = pipe.text_encoder(prompts_tokens.input_ids.to(device))[0]

# Tokenizing and encoding negative prompts into embeddings.
if negative_prompts is None:
    negative_prompts = [""] * batch_size

negative_prompts_tokens = pipe.tokenizer(
    negative_prompts,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
)
negative_prompts_embeds = pipe.text_encoder(negative_prompts_tokens.input_ids.to(device))[0]

As mentioned earlier, we take consecutive pairs of prompts and create smooth transitions between them using the slerp function.

# Generating initial U-Net latent vectors from a random normal distribution.
latents = torch.randn(
    (1, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)

# Interpolating between embeddings pairs for the given number of interpolation steps.
interpolated_prompt_embeds = []
interpolated_negative_prompts_embeds = []
for i in range(batch_size - 1):
    interpolated_prompt_embeds.append(slerp(prompts_embeds[i], prompts_embeds[i + 1], num_interpolation_steps))
    interpolated_negative_prompts_embeds.append(
        slerp(
            negative_prompts_embeds[i],
            negative_prompts_embeds[i + 1],
            num_interpolation_steps,
        )
    )

interpolated_prompt_embeds = torch.cat(interpolated_prompt_embeds, dim=0).to(device)

interpolated_negative_prompts_embeds = torch.cat(interpolated_negative_prompts_embeds, dim=0).to(device)

Finally, we need to generate images from the embeddings.

# Generating images using the interpolated embeddings.
images = []
for prompt_embeds, negative_prompt_embeds in tqdm(
    zip(interpolated_prompt_embeds, interpolated_negative_prompts_embeds),
    total=len(interpolated_prompt_embeds),
):
    images.append(
        pipe(
            height=height,
            width=width,
            num_images_per_prompt=1,
            prompt_embeds=prompt_embeds[None, ...],
            negative_prompt_embeds=negative_prompt_embeds[None, ...],
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latents,
        ).images
    )

# Display of saved generated images.
display_images(images, save_path)

Example 4: Circular walk through the diffusion latent space for a single prompt

This example was taken from: https://keras.io/examples/generative/random_walks_with_stable_diffusion/

Let's imagine that we have two noise components, which we will call x and y. We move an angle t from 0 to 2π, and at each step we add the cosine of t times x and the sine of t times y to the result, i.e. noise(t) = cos(t) · x + sin(t) · y. Since cos(0) = cos(2π) = 1 and sin(0) = sin(2π) = 0, at the end of the walk we arrive at the same noise values we started with, so the path closes on itself and the resulting animation loops seamlessly. A small sanity check of this property is sketched right below.
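As a quick sanity check (not part of the original notebook), the short snippet below verifies that the first and last points of such a trigonometric walk coincide, which is what makes the resulting GIF loop seamlessly; the tensor shape is just an illustrative stand-in for the UNet latents used below.

import numpy as np
import torch

steps = 30
x = torch.randn(1, 4, 64, 64)  # stand-in for the noise component "x"
y = torch.randn(1, 4, 64, 64)  # stand-in for the noise component "y"

angles = torch.linspace(0, 2, steps) * np.pi
loop = [torch.cos(a) * x + torch.sin(a) * y for a in angles]

# cos(0) = cos(2*pi) = 1 and sin(0) = sin(2*pi) = 0, so the walk returns to its starting point.
print(torch.allclose(loop[0], loop[-1], atol=1e-5))  # True (up to float32 rounding)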

Example 4

# The text prompt that describes the desired output image.
prompt = "Beautiful sea sunset, warm light, Aivazovsky style"
# A negative prompt that can be used to steer the generation away from certain features
negative_prompt = "picture frames"

# Generating initial latent vectors from a random normal distribution to create a loop interpolation between them.
latents = torch.randn(
    (2, 1, pipe.unet.config.in_channels, height // 8, width // 8),
    generator=generator,
)


# Calculation of looped embeddings
walk_noise_x = latents[0].to(device)
walk_noise_y = latents[1].to(device)

# Walking on a trigonometric circle
walk_scale_x = torch.cos(torch.linspace(0, 2, num_interpolation_steps) * np.pi).to(device)
walk_scale_y = torch.sin(torch.linspace(0, 2, num_interpolation_steps) * np.pi).to(device)

# Applying interpolation to noise
noise_x = torch.tensordot(walk_scale_x, walk_noise_x, dims=0)
noise_y = torch.tensordot(walk_scale_y, walk_noise_y, dims=0)

circular_latents = noise_x + noise_y

# Generating images using the interpolated embeddings.
images = []
for latent_vector in tqdm(circular_latents):
    images.append(
        pipe(
            prompt,
            height=height,
            width=width,
            negative_prompt=negative_prompt,
            num_images_per_prompt=1,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
            generator=generator,
            latents=latent_vector,
        ).images
    )

# Display of saved generated images.
display_images(images, save_path)

Next Steps

Next, you can explore different parameters, such as the guidance scale, seed, and number of interpolation steps, to observe how they affect the generated images. Additionally, consider trying out different prompts and schedulers to further refine your results. Another valuable step is to implement linear interpolation (linspace) instead of spherical linear interpolation (slerp) and compare the results to gain deeper insight into the interpolation process; a hedged sketch of this is given below.
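As one possible starting point (an illustrative sketch, not part of the original notebook), a linear counterpart to the slerp function from Example 2 could look like the following. It keeps the same signature, so it can be dropped into Example 2 in place of slerp to compare the two interpolation paths.

import numpy as np
import torch


def lerp(v0, v1, num, t0=0, t1=1):
    # Linear interpolation with the same signature as the slerp function above,
    # so it can be used as a drop-in replacement for comparison.
    v0 = v0.detach().cpu().numpy()
    v1 = v1.detach().cpu().numpy()
    t = np.linspace(t0, t1, num)
    v2 = np.array([(1 - ti) * v0 + ti * v1 for ti in t])
    return torch.tensor(v2)


# Example usage, mirroring Example 2:
# interpolated_latents = lerp(latents[0], latents[1], num_interpolation_steps)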
