Evaluating Diffusion Models

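This document is now outdated, since dedicated evaluation frameworks for image-generation diffusion models have emerged. Please check out works such as HEIM, T2I-Compbench, and GenEval.

Evaluation of generative models like Stable Diffusion is subjective in nature. But as practitioners and researchers, we often have to make careful choices among many different possibilities. So, when working with different generative models (such as GANs, diffusion models, etc.), how do we choose one over the other?

Qualitative evaluation of such models can be error-prone and might incorrectly influence a decision. Quantitative metrics, however, do not necessarily correspond to image quality. So, usually, a combination of qualitative and quantitative evaluation provides a stronger signal when choosing one model over another.

In this document, we provide a non-exhaustive overview of qualitative and quantitative methods for evaluating diffusion models. For the quantitative methods, we specifically focus on how to implement them alongside diffusers.

The methods shown in this document can also be used to evaluate different noise schedulers while keeping the underlying generation model fixed.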

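Scenarios

We cover diffusion models with the following pipelines:

Text-guided image generation (such as the StableDiffusionPipeline).
Text-guided image generation, additionally conditioned on an input image (such as the StableDiffusionImg2ImgPipeline and StableDiffusionInstructPix2PixPipeline).
Class-conditioned image generation models (such as the DiTPipeline).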

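Qualitative evaluation

Qualitative evaluation typically involves human assessment of generated images. Quality is measured across aspects such as compositionality, image-text alignment, and spatial relations. Common prompts provide a degree of uniformity for these subjective judgments. DrawBench and PartiPrompts are prompt datasets used for qualitative benchmarking. They were introduced by Imagen and Parti, respectively.

From the official Parti website:

PartiPrompts (P2) is a rich set of over 1600 prompts in English that we release as part of this work. P2 can be used to measure model capabilities across various categories and challenge aspects.

parti-prompts

PartiPrompts has the following columns:

Prompt
Category of the prompt (such as "Abstract", "World Knowledge", etc.)
Challenge reflecting the difficulty (such as "Basic", "Complex", "Writing & Symbols", etc.)

These benchmarks allow for side-by-side human evaluation of different image generation models.

For this, the 🧨 Diffusers team has built Open Parti Prompts, a community-driven qualitative benchmark based on Parti Prompts, to compare state-of-the-art open-source diffusion models:

Open Parti Prompts Game: for 10 Parti prompts, 4 generated images are shown and the user selects the image that suits the prompt best.
Open Parti Prompts Leaderboard: a leaderboard comparing the currently best open-source diffusion models to each other.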
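
As a rough illustration of how such side-by-side selections could be summarized, the sketch below tallies hypothetical votes into per-model win rates. The model names and votes are made up, `win_rates` is a hypothetical helper, and this is not necessarily how the Open Parti Prompts Leaderboard aggregates its results.

```python
from collections import Counter


def win_rates(votes):
    """Map each model to the fraction of votes in which it was picked as best."""
    counts = Counter(votes)
    total = len(votes)
    return {model: count / total for model, count in counts.items()}


# Hypothetical selections: one entry per (prompt, rater) pair, naming the chosen model.
votes = ["model_a", "model_b", "model_a", "model_c", "model_a", "model_b", "model_d", "model_a"]
rates = win_rates(votes)

# Print models from most to least frequently preferred.
for model, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: {rate:.3f}")
```

With only a handful of votes, such rates are noisy; a real comparison would need many prompts and raters.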

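To manually compare images, let's see how we can use diffusers on a couple of PartiPrompts.

Below we show some prompts sampled across different challenges: Basic, Complex, Linguistic Structures, Imagination, and Writing & Symbols. Here we are using PartiPrompts as a dataset.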

from datasets import load_dataset

# prompts = load_dataset("nateraw/parti-prompts", split="train")
# prompts = prompts.shuffle()
# sample_prompts = [prompts[i]["Prompt"] for i in range(5)]

# Fixing these sample prompts in the interest of reproducibility.
sample_prompts = [
    "a corgi",
    "a hot air balloon with a yin-yang symbol, with the moon visible in the daytime sky",
    "a car with no windows",
    "a cube made of porcupine",
    'The saying "BE EXCELLENT TO EACH OTHER" written on a red brick wall with a graffiti image of a green alien wearing a tuxedo. A yellow fire hydrant is on a sidewalk in the foreground.',
]

Now we can use these prompts to generate some images using Stable Diffusion (v1-4 checkpoint):

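import torch
from diffusers import StableDiffusionPipeline

# Load the v1-4 checkpoint so that sd_pipeline is defined before it is called.
model_ckpt = "CompVis/stable-diffusion-v1-4"
sd_pipeline = StableDiffusionPipeline.from_pretrained(model_ckpt, torch_dtype=torch.float16).to("cuda")

seed = 0
generator = torch.manual_seed(seed)

images = sd_pipeline(sample_prompts, num_images_per_prompt=1, generator=generator).images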

parti-prompts-14

We can also set `num_images_per_prompt` accordingly to compare different images for the same prompt. Running the same pipeline but with a different checkpoint (v1-5) yields:

parti-prompts-15

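Once several images have been generated from all the prompts using multiple models (under evaluation), these results are presented to human evaluators for scoring. For more details on the DrawBench and PartiPrompts benchmarks, refer to their respective papers.

It is useful to look at some inference samples while a model is training to measure the training progress. Our training scripts support this utility, with additional support for logging to TensorBoard and Weights & Biases.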

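Quantitative evaluation

In this section, we walk you through how to evaluate three different diffusion pipelines using:

CLIP score
CLIP directional similarity
FID

Text-guided image generation

CLIP score measures the compatibility of image-caption pairs. Higher CLIP scores imply higher compatibility 🔼. The CLIP score is a quantitative measurement of the qualitative concept "compatibility". Image-caption pair compatibility can also be thought of as the semantic similarity between the image and the caption. CLIP score was found to have high correlation with human judgment.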
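
For intuition, the score can be read as a scaled cosine similarity between the image embedding and the caption embedding, clipped at zero. The sketch below assumes the formulation max(100 × cos(E_I, E_C), 0), which is our understanding of how torchmetrics documents the metric. It uses made-up 3-d vectors in place of real CLIP embeddings, and `clip_score_from_embeddings` is a hypothetical helper rather than part of diffusers or torchmetrics.

```python
import numpy as np


def clip_score_from_embeddings(image_emb, text_emb):
    """Return 100 * cosine similarity, clipped at 0, for one image-caption pair."""
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    cosine = image_emb @ text_emb / (np.linalg.norm(image_emb) * np.linalg.norm(text_emb))
    return float(max(100.0 * cosine, 0.0))


# Toy vectors standing in for real CLIP embeddings (illustrative only).
aligned = clip_score_from_embeddings([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
orthogonal = clip_score_from_embeddings([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
opposed = clip_score_from_embeddings([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0])
print(aligned, orthogonal, opposed)
```

Perfectly aligned embeddings reach the maximum of 100, while unrelated or opposed embeddings are floored at 0.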

Let's first load a StableDiffusionPipeline:

from diffusers import StableDiffusionPipeline
import torch

model_ckpt = "CompVis/stable-diffusion-v1-4"
sd_pipeline = StableDiffusionPipeline.from_pretrained(model_ckpt, torch_dtype=torch.float16).to("cuda")

Then generate some images with multiple prompts:

prompts = [
    "a photo of an astronaut riding a horse on mars",
    "A high tech solarpunk utopia in the Amazon rainforest",
    "A pikachu fine dining with a view to the Eiffel Tower",
    "A mecha robot in a favela in expressionist style",
    "an insect robot preparing a delicious meal",
    "A small cabin on top of a snowy mountain in the style of Disney, artstation",
]

images = sd_pipeline(prompts, num_images_per_prompt=1, output_type="np").images

print(images.shape)
# (6, 512, 512, 3)

And then, we calculate the CLIP score.

from torchmetrics.functional.multimodal import clip_score
from functools import partial

clip_score_fn = partial(clip_score, model_name_or_path="openai/clip-vit-base-patch16")

def calculate_clip_score(images, prompts):
    images_int = (images * 255).astype("uint8")
    clip_score = clip_score_fn(torch.from_numpy(images_int).permute(0, 3, 1, 2), prompts).detach()
    return round(float(clip_score), 4)

sd_clip_score = calculate_clip_score(images, prompts)
print(f"CLIP score: {sd_clip_score}")
# CLIP score: 35.7038

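In the above example, we generated one image per prompt. If we generated multiple images per prompt, we would have to take the average score over the images generated for each prompt.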
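
As a sketch of that averaging step, assume the per-image scores have already been computed and arranged as one row per prompt. The values below are placeholders rather than measured results, and `average_clip_scores` is a hypothetical helper.

```python
import numpy as np


def average_clip_scores(per_image_scores):
    """Average scores over images for each prompt, then over prompts.

    per_image_scores has shape (num_prompts, num_images_per_prompt).
    """
    scores = np.asarray(per_image_scores, dtype=np.float64)
    per_prompt = scores.mean(axis=1)  # one averaged score per prompt
    overall = float(per_prompt.mean())  # single summary number for the checkpoint
    return per_prompt, overall


# Placeholder scores: 2 prompts, 3 images each (not real measurements).
placeholder_scores = [[30.0, 32.0, 34.0], [20.0, 25.0, 30.0]]
per_prompt, overall = average_clip_scores(placeholder_scores)
print(per_prompt, overall)
```

Keeping the per-prompt averages around, and not only the overall number, makes it easier to spot prompts on which a checkpoint is consistently weak.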

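Now, if we wanted to compare two checkpoints compatible with the StableDiffusionPipeline, we should pass a generator while calling the pipeline. First, we generate images with a fixed seed using the v1-4 Stable Diffusion checkpoint: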

seed = 0
generator = torch.manual_seed(seed)

images = sd_pipeline(prompts, num_images_per_prompt=1, generator=generator, output_type="np").images

Then we load the v1-5 checkpoint to generate images:

model_ckpt_1_5 = "stable-diffusion-v1-5/stable-diffusion-v1-5"
sd_pipeline_1_5 = StableDiffusionPipeline.from_pretrained(model_ckpt_1_5, torch_dtype=torch.float16).to("cuda")

images_1_5 = sd_pipeline_1_5(prompts, num_images_per_prompt=1, generator=generator, output_type="np").images

And finally, we compare their CLIP scores:

sd_clip_score_1_4 = calculate_clip_score(images, prompts)
print(f"CLIP Score with v-1-4: {sd_clip_score_1_4}")
# CLIP Score with v-1-4: 34.9102

sd_clip_score_1_5 = calculate_clip_score(images_1_5, prompts)
print(f"CLIP Score with v-1-5: {sd_clip_score_1_5}")
# CLIP Score with v-1-5: 36.2137

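It seems like the v1-5 checkpoint performs better than its predecessor. Note, however, that the number of prompts we used to compute the CLIP scores is quite low. For a more practical evaluation, this number should be much higher, and the prompts should be diverse.

By construction, this score has some limitations. The captions in the training dataset were crawled from the web and extracted from `alt` and similar tags associated with an image on the internet. They are not necessarily representative of what a human being would use to describe an image. Hence we had to "engineer" some prompts here.

Image-conditioned text-to-image generation

In this case, we condition the generation pipeline on an input image as well as a text prompt. Let's take the StableDiffusionInstructPix2PixPipeline as an example. It takes an edit instruction as an input prompt and an input image to be edited.

Here is one example: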

edit-instruction

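One strategy to evaluate such a model is to measure the **consistency** of the change between the two images (in CLIP space) with the change between the two image captions (as shown in CLIP-Guided Domain Adaptation of Image Generators). This is referred to as the "**CLIP directional similarity**".

Caption 1 corresponds to the input image (image 1) that is to be edited.
Caption 2 corresponds to the edited image (image 2). It should reflect the edit instruction.

Following is a pictorial overview: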

edit-consistency
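
In vector terms, the metric is the cosine similarity between the image-embedding difference and the caption-embedding difference. The sketch below illustrates this with hand-picked 2-d vectors in place of CLIP embeddings; `directional_similarity` is a hypothetical helper used only for illustration.

```python
import numpy as np


def directional_similarity(img_one, img_two, txt_one, txt_two):
    """Cosine similarity between the image change and the caption change."""
    img_delta = np.asarray(img_two, dtype=np.float64) - np.asarray(img_one, dtype=np.float64)
    txt_delta = np.asarray(txt_two, dtype=np.float64) - np.asarray(txt_one, dtype=np.float64)
    denom = np.linalg.norm(img_delta) * np.linalg.norm(txt_delta)
    return float(img_delta @ txt_delta / denom)


# Toy embeddings (illustrative only): the edit moves the image along the x-axis.
img_one, img_two = [0.0, 1.0], [1.0, 1.0]

# Caption change points the same way as the image change.
same_direction = directional_similarity(img_one, img_two, [0.0, 0.0], [2.0, 0.0])
# Caption change is perpendicular to the image change.
unrelated = directional_similarity(img_one, img_two, [0.0, 0.0], [0.0, 3.0])
print(same_direction, unrelated)
```

Only the direction of each change matters here, not its magnitude, which is why the caption change of length 2 still yields a similarity of 1.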

We have prepared a mini dataset to implement this metric. Let's first load the dataset.

from datasets import load_dataset

dataset = load_dataset("sayakpaul/instructpix2pix-demo", split="train")
dataset.features

{'input': Value(dtype='string', id=None),
 'edit': Value(dtype='string', id=None),
 'output': Value(dtype='string', id=None),
 'image': Image(decode=True, id=None)}

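Here we have:

input is a caption that corresponds to the image.
edit denotes the edit instruction.
output denotes the modified caption reflecting the edit instruction.

Let's take a look at a sample.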

idx = 0
print(f"Original caption: {dataset[idx]['input']}")
print(f"Edit instruction: {dataset[idx]['edit']}")
print(f"Modified caption: {dataset[idx]['output']}")

Original caption: 2. FAROE ISLANDS: An archipelago of 18 mountainous isles in the North Atlantic Ocean between Norway and Iceland, the Faroe Islands has 'everything you could hope for', according to Big 7 Travel. It boasts 'crystal clear waterfalls, rocky cliffs that seem to jut out of nowhere and velvety green hills'
Edit instruction: make the isles all white marble
Modified caption: 2. WHITE MARBLE ISLANDS: An archipelago of 18 mountainous white marble isles in the North Atlantic Ocean between Norway and Iceland, the White Marble Islands has 'everything you could hope for', according to Big 7 Travel. It boasts 'crystal clear waterfalls, rocky cliffs that seem to jut out of nowhere and velvety green hills'

And here is the image:

dataset[idx]["image"]

edit-dataset

We will first edit the images of our dataset with the edit instruction and compute the directional similarity.

Let's first load the StableDiffusionInstructPix2PixPipeline:

from diffusers import StableDiffusionInstructPix2PixPipeline

instruct_pix2pix_pipeline = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

Now, we perform the edits:

import numpy as np


def edit_image(input_image, instruction):
    image = instruct_pix2pix_pipeline(
        instruction,
        image=input_image,
        output_type="np",
        generator=generator,
    ).images[0]
    return image

input_images = []
original_captions = []
modified_captions = []
edited_images = []

for idx in range(len(dataset)):
    input_image = dataset[idx]["image"]
    edit_instruction = dataset[idx]["edit"]
    edited_image = edit_image(input_image, edit_instruction)

    input_images.append(np.array(input_image))
    original_captions.append(dataset[idx]["input"])
    modified_captions.append(dataset[idx]["output"])
    edited_images.append(edited_image)

To measure the directional similarity, we first load CLIP's image and text encoders:

from transformers import (
    CLIPTokenizer,
    CLIPTextModelWithProjection,
    CLIPVisionModelWithProjection,
    CLIPImageProcessor,
)

clip_id = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(clip_id)
text_encoder = CLIPTextModelWithProjection.from_pretrained(clip_id).to("cuda")
image_processor = CLIPImageProcessor.from_pretrained(clip_id)
image_encoder = CLIPVisionModelWithProjection.from_pretrained(clip_id).to("cuda")

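Notice that we are using a particular CLIP checkpoint, i.e., `openai/clip-vit-large-patch14`. This is because the Stable Diffusion pre-training was performed with this CLIP variant. For more details, refer to the documentation.

Next, we prepare a PyTorch `nn.Module` to compute directional similarity: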

import torch.nn as nn
import torch.nn.functional as F


class DirectionalSimilarity(nn.Module):
    def __init__(self, tokenizer, text_encoder, image_processor, image_encoder):
        super().__init__()
        self.tokenizer = tokenizer
        self.text_encoder = text_encoder
        self.image_processor = image_processor
        self.image_encoder = image_encoder

    def preprocess_image(self, image):
        image = self.image_processor(image, return_tensors="pt")["pixel_values"]
        return {"pixel_values": image.to("cuda")}

    def tokenize_text(self, text):
        inputs = self.tokenizer(
            text,
            max_length=self.tokenizer.model_max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        return {"input_ids": inputs.input_ids.to("cuda")}

    def encode_image(self, image):
        preprocessed_image = self.preprocess_image(image)
        image_features = self.image_encoder(**preprocessed_image).image_embeds
        image_features = image_features / image_features.norm(dim=1, keepdim=True)
        return image_features

    def encode_text(self, text):
        tokenized_text = self.tokenize_text(text)
        text_features = self.text_encoder(**tokenized_text).text_embeds
        text_features = text_features / text_features.norm(dim=1, keepdim=True)
        return text_features

    def compute_directional_similarity(self, img_feat_one, img_feat_two, text_feat_one, text_feat_two):
        sim_direction = F.cosine_similarity(img_feat_two - img_feat_one, text_feat_two - text_feat_one)
        return sim_direction

    def forward(self, image_one, image_two, caption_one, caption_two):
        img_feat_one = self.encode_image(image_one)
        img_feat_two = self.encode_image(image_two)
        text_feat_one = self.encode_text(caption_one)
        text_feat_two = self.encode_text(caption_two)
        directional_similarity = self.compute_directional_similarity(
            img_feat_one, img_feat_two, text_feat_one, text_feat_two
        )
        return directional_similarity

Let's put `DirectionalSimilarity` to use now.

dir_similarity = DirectionalSimilarity(tokenizer, text_encoder, image_processor, image_encoder)
scores = []

for i in range(len(input_images)):
    original_image = input_images[i]
    original_caption = original_captions[i]
    edited_image = edited_images[i]
    modified_caption = modified_captions[i]

    similarity_score = dir_similarity(original_image, edited_image, original_caption, modified_caption)
    scores.append(float(similarity_score.detach().cpu()))

print(f"CLIP directional similarity: {np.mean(scores)}")
# CLIP directional similarity: 0.0797976553440094

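Like the CLIP score, the higher the CLIP directional similarity, the better it is.

It should be noted that the `StableDiffusionInstructPix2PixPipeline` exposes two arguments, namely `image_guidance_scale` and `guidance_scale`, that let you control the quality of the final edited image. We encourage you to experiment with these two arguments and see how they affect the directional similarity.

We can extend the idea of this metric to measure how similar the original image and the edited version are. To do that, we can just do `F.cosine_similarity(img_feat_two, img_feat_one)`. For these kinds of edits, we would still want the primary semantics of the images to be preserved as much as possible, i.e., a high similarity score.

We can use these metrics for similar pipelines such as the StableDiffusionPix2PixZeroPipeline.

Both CLIP score and CLIP directional similarity rely on the CLIP model, which can make the evaluations biased.

**Extending metrics like IS, FID (discussed later), or KID can be difficult** when the model under evaluation was pre-trained on a large image-captioning dataset (such as the LAION-5B dataset). This is because underlying these metrics is an InceptionNet (pre-trained on the ImageNet-1k dataset) used for extracting intermediate image features. The pre-training dataset of Stable Diffusion may have limited overlap with the pre-training dataset of InceptionNet, so it is not a good candidate here for feature extraction.

Using the above metrics helps evaluate class-conditioned models, for example DiT, which was pre-trained conditioned on the ImageNet-1k classes.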

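Class-conditioned image generation

Class-conditioned generative models are usually pre-trained on a class-labeled dataset such as ImageNet-1k. Popular metrics for evaluating these models include Fréchet Inception Distance (FID), Kernel Inception Distance (KID), and Inception Score (IS). In this document, we focus on FID (Heusel et al.). We show how to compute it with the DiTPipeline, which uses the DiT model under the hood.

FID aims to measure how similar two datasets of images are. As per this resource:

Fréchet Inception Distance is a measure of similarity between two datasets of images. It was shown to correlate well with the human judgment of visual quality and is most often used to evaluate the quality of samples of Generative Adversarial Networks. FID is calculated by computing the Fréchet distance between two Gaussians fitted to feature representations of the Inception network.

These two datasets are essentially the dataset of real images and the dataset of fake images (generated images in our case). FID is usually calculated with two large datasets. However, for this document, we will work with two mini datasets.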
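
To make the quoted definition concrete, the sketch below computes the Fréchet distance between two Gaussians fitted to feature sets, using the standard closed form ||μ_r − μ_f||² + Tr(Σ_r + Σ_f − 2(Σ_r Σ_f)^(1/2)). It is a minimal illustration and not the torchmetrics implementation: random vectors stand in for Inception features, and `frechet_distance` is a hypothetical helper.

```python
import numpy as np


def frechet_distance(feats_real, feats_fake):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, feature_dim) with feature_dim > 1.
    """
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)

    # Tr(sqrt(cov_r @ cov_f)) is taken as the sum of square roots of the
    # product's eigenvalues, assuming they are real and non-negative (as they
    # are for covariance matrices); tiny numerical negatives are clipped to 0.
    eigvals = np.linalg.eigvals(cov_r @ cov_f)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    mean_term = np.sum((mu_r - mu_f) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_f) - 2.0 * trace_sqrt)


# Random vectors standing in for Inception features (illustrative only).
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
shifted = real + 3.0  # same spread, every feature shifted by 3

same = frechet_distance(real, real)
moved = frechet_distance(real, shifted)
print(same, moved)
```

Identical feature sets give a distance of approximately zero, while shifting every one of the 4 features by 3 leaves the covariance unchanged and adds 4 × 3² = 36 through the mean term.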

Let's first download a few images from the ImageNet-1k training set:

from zipfile import ZipFile
import requests


def download(url, local_filepath):
    r = requests.get(url)
    with open(local_filepath, "wb") as f:
        f.write(r.content)
    return local_filepath

dummy_dataset_url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/sample-imagenet-images.zip"
local_filepath = download(dummy_dataset_url, dummy_dataset_url.split("/")[-1])

with ZipFile(local_filepath, "r") as zipper:
    zipper.extractall(".")

from PIL import Image
import os
import numpy as np

dataset_path = "sample-imagenet-images"
image_paths = sorted([os.path.join(dataset_path, x) for x in os.listdir(dataset_path)])

real_images = [np.array(Image.open(path).convert("RGB")) for path in image_paths]

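These are 10 images from the following ImageNet-1k classes: "cassette_player", "chain_saw" (x2), "church", "gas_pump" (x3), "parachute" (x2), and "tench".

real-images
Real images.

Now that the images are loaded, let's apply some lightweight pre-processing on them to use them for FID calculation.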

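# Aliased as TF so it does not shadow torch.nn.functional, imported as F earlier.
from torchvision.transforms import functional as TF
import torch


def preprocess_image(image):
    # HWC uint8 array -> 1xCxHxW float tensor in [0, 1], center-cropped to 256x256.
    image = torch.tensor(image).unsqueeze(0)
    image = image.permute(0, 3, 1, 2) / 255.0
    return TF.center_crop(image, (256, 256))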

real_images = torch.cat([preprocess_image(image) for image in real_images])
print(real_images.shape)
# torch.Size([10, 3, 256, 256])

We now load the `DiTPipeline` to generate images conditioned on the above-mentioned classes.

from diffusers import DiTPipeline, DPMSolverMultistepScheduler

dit_pipeline = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", torch_dtype=torch.float16)
dit_pipeline.scheduler = DPMSolverMultistepScheduler.from_config(dit_pipeline.scheduler.config)
dit_pipeline = dit_pipeline.to("cuda")

seed = 0
generator = torch.manual_seed(seed)


words = [
    "cassette player",
    "chainsaw",
    "chainsaw",
    "church",
    "gas pump",
    "gas pump",
    "gas pump",
    "parachute",
    "parachute",
    "tench",
]

class_ids = dit_pipeline.get_label_ids(words)
output = dit_pipeline(class_labels=class_ids, generator=generator, output_type="np")

fake_images = output.images
fake_images = torch.tensor(fake_images)
fake_images = fake_images.permute(0, 3, 1, 2)
print(fake_images.shape)
# torch.Size([10, 3, 256, 256])

Now, we can compute the FID using `torchmetrics`.

from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(normalize=True)
fid.update(real_images, real=True)
fid.update(fake_images, real=False)

print(f"FID: {float(fid.compute())}")
# FID: 177.7147216796875

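The lower the FID, the better it is. Several things can influence FID here:

Number of images (both real and fake)
Randomness induced in the diffusion process
Number of inference steps in the diffusion process
The scheduler being used in the diffusion process

For the last two points, it is therefore good practice to run the evaluation across different seeds and inference steps, and then report an average result.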
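
One way to organize such repeated runs is sketched below. It is only a scaffold: `evaluate_across_runs` is a hypothetical helper, and `dummy_metric` returns made-up numbers where a real run would generate images with the given seed and step count and then compute FID.

```python
from itertools import product
from statistics import mean, stdev


def evaluate_across_runs(metric_fn, seeds, step_counts):
    """Call metric_fn(seed, num_inference_steps) for every combination and summarize."""
    results = {
        (seed, steps): metric_fn(seed, steps) for seed, steps in product(seeds, step_counts)
    }
    values = list(results.values())
    return results, mean(values), stdev(values)


def dummy_metric(seed, num_inference_steps):
    # Stand-in for a real evaluation run; returns a made-up deterministic number.
    return 100.0 + seed - 0.1 * num_inference_steps


results, avg, spread = evaluate_across_runs(dummy_metric, seeds=[0, 1, 2], step_counts=[25, 50])
print(f"mean={avg:.2f} std={spread:.2f} over {len(results)} runs")
```

Reporting the spread alongside the mean gives readers a sense of how sensitive the metric is to these settings.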

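FID results tend to be fragile as they depend on a lot of factors:

The specific Inception model used during computation.
The implementation accuracy of the computation.
The image format (not the same if we start from PNGs vs JPGs).

Keeping that in mind, FID is often most useful when comparing similar runs, but it is hard to reproduce paper results unless the authors carefully disclose the FID measurement code.

These points apply to other related metrics too, such as KID and IS.

As a final step, let's visually inspect the `fake_images`.

fake-images
Fake images.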
