InstructPix2Pix

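InstructPix2Pix is a Stable Diffusion model trained to edit images from human-provided instructions. For example, your prompt can be "turn the clouds rainy" and the model will edit the input image accordingly. The model is conditioned on the text prompt (or editing instruction) and the input image.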

This guide explores the train_instruct_pix2pix.py training script to help you become familiar with it and learn how to adapt it to your own use case.

Before running the script, make sure you install the library from source:

git clone https://github.com/huggingface/diffusers
cd diffusers
pip install .

Then navigate to the example folder containing the training script and install the dependencies the script requires:

cd examples/instruct_pix2pix
pip install -r requirements.txt

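🤗 Accelerate is a library that helps you train on multiple GPUs/TPUs or with mixed precision. It automatically configures your training setup based on your hardware and environment. Take a look at the 🤗 Accelerate Quick tour to learn more.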

Initialize an 🤗 Accelerate environment:

accelerate config

To set up a default 🤗 Accelerate environment without choosing any configurations:

accelerate config default

Or, if your environment doesn't support an interactive shell (a notebook, for example), you can use:

from accelerate.utils import write_basic_config

write_basic_config()

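Lastly, if you want to train a model on your own dataset, take a look at the Create a dataset for training guide to learn how to create a dataset that works with the training script.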

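The following sections highlight the parts of the training script that matter most for understanding how to modify it, but they don't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the script, and let us know if you have any questions or concerns.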

Script parameters

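The training script has many parameters to help you customize your training run. All of the parameters and their descriptions are found in the parse_args() function. Default values are provided for most parameters and work well, but you can also set your own values in the training command.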

For example, to increase the resolution of the input image:

accelerate launch train_instruct_pix2pix.py \
  --resolution=512

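Many of the basic and important parameters are described in the Text-to-image training guide, so this guide focuses only on the parameters relevant to InstructPix2Pix: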

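--original_image_column: the original image before the edits are made
--edited_image_column: the image after the edits are made
--edit_prompt_column: the instructions to edit the image
--conditioning_dropout_prob: the dropout probability for the original image and the edit prompt during training, which enables classifier-free guidance (CFG) for one or both conditioning inputs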
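To make the four options above concrete, here is a minimal sketch of how they could be declared with argparse. It is not the script's actual parse_args(): build_parser is a hypothetical helper, and the default column names are assumptions that you should check against parse_args() and your own dataset.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical, trimmed-down parser covering only the four options above."""
    parser = argparse.ArgumentParser(description="InstructPix2Pix options (sketch)")
    # Default column names are assumptions; use the names found in your dataset.
    parser.add_argument("--original_image_column", type=str, default="input_image")
    parser.add_argument("--edited_image_column", type=str, default="edited_image")
    parser.add_argument("--edit_prompt_column", type=str, default="edit_prompt")
    # In this sketch, None means that no conditioning dropout is applied.
    parser.add_argument("--conditioning_dropout_prob", type=float, default=None)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--edit_prompt_column=instruction", "--conditioning_dropout_prob=0.05"]
)
print(args.original_image_column, args.edit_prompt_column, args.conditioning_dropout_prob)
```

Options you don't pass keep their defaults, so a dataset that already uses the default column names only needs --conditioning_dropout_prob.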

Training script

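The dataset preprocessing code and training loop are found in the main() function. This is where you'll make changes to the training script to adapt it to your own use case.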

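As with the script parameters, a walkthrough of the training script is provided in the Text-to-image training guide. This guide instead looks at the parts of the script that are specific to InstructPix2Pix.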

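The script begins by changing the number of input channels in the UNet's first convolutional layer to 8, so that the encoded original image can be passed in alongside the noisy latents as an additional conditioning input. The new layer's weights start at zero, and the pretrained weights are copied into its first four input channels: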

in_channels = 8
out_channels = unet.conv_in.out_channels
unet.register_to_config(in_channels=in_channels)

with torch.no_grad():
    new_conv_in = nn.Conv2d(
        in_channels, out_channels, unet.conv_in.kernel_size, unet.conv_in.stride, unet.conv_in.padding
    )
    new_conv_in.weight.zero_()
    new_conv_in.weight[:, :4, :, :].copy_(unet.conv_in.weight)
    unet.conv_in = new_conv_in
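The reason for zero-initializing the new weights is that the expanded layer then starts out producing exactly the same output as the pretrained layer, whatever the extra channels contain. The following is a simplified sketch of that idea in plain Python, assuming a 1x1 convolution applied to a single pixel; conv1x1 and expand_in_channels are hypothetical helpers, not part of the training script.

```python
def conv1x1(weight, pixel):
    """Apply a 1x1 convolution to one pixel: out[o] = sum_i weight[o][i] * pixel[i]."""
    return [sum(w * x for w, x in zip(row, pixel)) for row in weight]


def expand_in_channels(weight, new_in_channels):
    """Pad every output row with zeros so the layer accepts more input channels."""
    return [row + [0.0] * (new_in_channels - len(row)) for row in weight]


# Toy "pretrained" layer: 2 output channels, 4 input channels.
pretrained_weight = [
    [0.5, -1.0, 0.25, 2.0],
    [1.5, 0.0, -0.5, 1.0],
]
expanded_weight = expand_in_channels(pretrained_weight, 8)

noisy_latent = [1.0, 2.0, 3.0, 4.0]    # channels the pretrained layer already handles
image_latent = [9.0, -7.0, 5.0, -3.0]  # extra conditioning channels

before = conv1x1(pretrained_weight, noisy_latent)
after = conv1x1(expanded_weight, noisy_latent + image_latent)
print(before == after)  # the zero weights ignore the extra channels at initialization
```

During training the zero weights receive gradients like any others, so the model gradually learns to use the conditioning channels.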

These UNet parameters are updated by the optimizer:

optimizer = optimizer_cls(
    unet.parameters(),
    lr=args.learning_rate,
    betas=(args.adam_beta1, args.adam_beta2),
    weight_decay=args.adam_weight_decay,
    eps=args.adam_epsilon,
)

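Next, the original and edited images are preprocessed and the edit instructions are tokenized. It is important that the same image transformations are applied to both the original and the edited images.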

def preprocess_train(examples):
    preprocessed_images = preprocess_images(examples)

    original_images, edited_images = preprocessed_images.chunk(2)
    original_images = original_images.reshape(-1, 3, args.resolution, args.resolution)
    edited_images = edited_images.reshape(-1, 3, args.resolution, args.resolution)

    examples["original_pixel_values"] = original_images
    examples["edited_pixel_values"] = edited_images

    captions = list(examples[edit_prompt_column])
    examples["input_ids"] = tokenize_captions(captions)
    return examples
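The chunk(2) call above suggests that both images of a pair go through the transforms together, so any random decision is shared between them. The sketch below illustrates why that matters, using a random horizontal flip on tiny list-based images; hflip and paired_random_flip are hypothetical helpers and not the script's preprocess_images.

```python
import random


def hflip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in image]


def paired_random_flip(original, edited, rng):
    """Draw ONE random decision and apply it to both images of a pair."""
    if rng.random() < 0.5:
        return hflip(original), hflip(edited)
    return original, edited


original = [[1, 2, 3], [4, 5, 6]]
edited = [[1, 2, 9], [4, 5, 9]]

for seed in range(10):
    out_original, out_edited = paired_random_flip(original, edited, random.Random(seed))
    # Either both images were flipped or neither was, so they stay aligned.
    assert (out_original == hflip(original)) == (out_edited == hflip(edited))
```

If each image drew its own random decision, the edited image could end up mirrored relative to its source, and the model would be trained on a misaligned pair.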

Finally, in the training loop, the script starts by encoding the edited images into latent space:

latents = vae.encode(batch["edited_pixel_values"].to(weight_dtype)).latent_dist.sample()
latents = latents * vae.config.scaling_factor

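Then, the script applies dropout to the original image and edit instruction embeddings to support CFG. This is what enables the model to modulate the influence of the edit instruction and the original image on the edited image.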

encoder_hidden_states = text_encoder(batch["input_ids"])[0]
original_image_embeds = vae.encode(batch["original_pixel_values"].to(weight_dtype)).latent_dist.mode()

if args.conditioning_dropout_prob is not None:
    # One uniform draw per sample decides which conditioning inputs are dropped.
    random_p = torch.rand(bsz, device=latents.device, generator=generator)
    # Drop the edit prompt when random_p < 2 * p by swapping in the empty-prompt embedding.
    prompt_mask = random_p < 2 * args.conditioning_dropout_prob
    prompt_mask = prompt_mask.reshape(bsz, 1, 1)
    null_conditioning = text_encoder(tokenize_captions([""]).to(accelerator.device))[0]
    encoder_hidden_states = torch.where(prompt_mask, null_conditioning, encoder_hidden_states)

    # Drop the original image when p <= random_p < 3 * p by zeroing its latents.
    image_mask_dtype = original_image_embeds.dtype
    image_mask = 1 - (
        (random_p >= args.conditioning_dropout_prob).to(image_mask_dtype)
        * (random_p < 3 * args.conditioning_dropout_prob).to(image_mask_dtype)
    )
    image_mask = image_mask.reshape(bsz, 1, 1, 1)
    original_image_embeds = image_mask * original_image_embeds
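Because both masks are derived from the same random draw, the interval [0, 1) is split into four bands: only the prompt is dropped, both inputs are dropped, only the image is dropped, or nothing is dropped. The sketch below reproduces that logic for a single sample in plain Python; conditioning_dropout is a hypothetical helper written for illustration.

```python
def conditioning_dropout(random_p: float, dropout_prob: float) -> tuple[bool, bool]:
    """Return (drop_prompt, drop_image) for one sample, mirroring the mask logic above."""
    drop_prompt = random_p < 2 * dropout_prob
    drop_image = dropout_prob <= random_p < 3 * dropout_prob
    return drop_prompt, drop_image


p = 0.05
for r in (0.02, 0.07, 0.12, 0.50):
    print(r, conditioning_dropout(r, p))
# 0.02 -> (True, False)   prompt dropped only
# 0.07 -> (True, True)    both dropped
# 0.12 -> (False, True)   image dropped only
# 0.50 -> (False, False)  nothing dropped
```

With a dropout probability of p, each of the first three cases occurs with probability p, and the sample keeps both conditioning inputs with probability 1 - 3p.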

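That's pretty much it! Aside from the differences described here, the rest of the script is very similar to the Text-to-image training script, so feel free to check it out for more details. If you want to learn more about how the training loop works, take a look at the Understanding pipelines, models and schedulers tutorial, which breaks down the basic pattern of the denoising process.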

Launch the script

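Once you're happy with the changes to your script, or if you're okay with the default configuration, you're ready to launch the training script! 🚀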

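This guide uses the fusing/instructpix2pix-1000-samples dataset, which is a smaller version of the original dataset. You can also create and use your own dataset if you'd like (see the Create a dataset for training guide).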

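Set the MODEL_NAME environment variable to the name of the model (either a model ID on the Hub or a path to a local model), and the DATASET_ID environment variable to the name of the dataset on the Hub. The script creates a subfolder in your repository and saves all the components (feature extractor, scheduler, text encoder, UNet, etc.) to it.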

For better results, try longer training runs with a larger dataset. We've only tested this training script on a smaller-scale dataset.

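To monitor training progress with Weights and Biases, add the `--report_to=wandb` parameter to the training command, and specify a validation image with `--val_image_url` and a validation prompt with `--validation_prompt`. This can be really useful for debugging the model.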

If you're training on more than one GPU, add the --multi_gpu parameter to the accelerate launch command.

accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --dataset_name=$DATASET_ID \
    --enable_xformers_memory_efficient_attention \
    --resolution=256 \
    --random_flip \
    --train_batch_size=4 \
    --gradient_accumulation_steps=4 \
    --gradient_checkpointing \
    --max_train_steps=15000 \
    --checkpointing_steps=5000 \
    --checkpoints_total_limit=1 \
    --learning_rate=5e-05 \
    --max_grad_norm=1 \
    --lr_warmup_steps=0 \
    --conditioning_dropout_prob=0.05 \
    --mixed_precision=fp16 \
    --seed=42 \
    --push_to_hub
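As a rough sanity check on the command above, the batch size and gradient accumulation settings can be combined into an effective batch size and an approximate number of passes over the data. This is back-of-the-envelope arithmetic under stated assumptions: a single process, a dataset of about 1,000 samples (inferred from the dataset name), and max_train_steps counting optimizer updates rather than individual forward passes.

```python
train_batch_size = 4
gradient_accumulation_steps = 4
max_train_steps = 15_000
num_processes = 1     # assumption: a single GPU
dataset_size = 1_000  # assumption: inferred from the dataset name

# Samples that contribute to one optimizer update.
effective_batch_size = train_batch_size * gradient_accumulation_steps * num_processes
# Total samples processed over the whole run, and the rough number of epochs.
samples_seen = effective_batch_size * max_train_steps
approx_epochs = samples_seen / dataset_size

print(effective_batch_size, samples_seen, approx_epochs)  # 16 240000 240.0
```

Under these assumptions the run revisits the small dataset many times, which is one reason a larger dataset is likely to help.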

After training is finished, you can use your new InstructPix2Pix model for inference:

import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

# Replace "your_cool_model" with the Hub ID or local path of your trained model.
pipeline = StableDiffusionInstructPix2PixPipeline.from_pretrained("your_cool_model", torch_dtype=torch.float16).to("cuda")
# A fixed seed makes the result reproducible.
generator = torch.Generator("cuda").manual_seed(0)

image = load_image("https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/test_pix2pix_4.png")
prompt = "add some ducks to the lake"
num_inference_steps = 20
image_guidance_scale = 1.5
guidance_scale = 10

edited_image = pipeline(
    prompt,
    image=image,
    num_inference_steps=num_inference_steps,
    image_guidance_scale=image_guidance_scale,
    guidance_scale=guidance_scale,
    generator=generator,
).images[0]
edited_image.save("edited_image.png")

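You should experiment with different num_inference_steps, image_guidance_scale, and guidance_scale values to see how they affect inference speed and quality. The guidance scale parameters are especially impactful because they control how much the original image and the edit instructions affect the edited image.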
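To build intuition for the two guidance scales, the sketch below combines three noise predictions (unconditional, image only, and image plus text) the way two-scale classifier-free guidance is commonly formulated for InstructPix2Pix. It is an illustration on plain lists of numbers, assuming that formulation; combine_noise_predictions is a hypothetical helper and not the pipeline's API.

```python
def combine_noise_predictions(uncond, image_only, image_and_text, image_guidance_scale, guidance_scale):
    """Two-scale classifier-free guidance over per-element noise predictions."""
    return [
        u + image_guidance_scale * (i - u) + guidance_scale * (t - i)
        for u, i, t in zip(uncond, image_only, image_and_text)
    ]


# Toy predictions for two elements of a noise tensor.
uncond = [0.0, 1.0]
image_only = [1.0, 1.0]
image_and_text = [2.0, 0.0]

# With both scales at 1, the result equals the fully conditioned prediction.
print(combine_noise_predictions(uncond, image_only, image_and_text, 1.0, 1.0))   # [2.0, 0.0]
# The values from the inference example push much harder along the text direction.
print(combine_noise_predictions(uncond, image_only, image_and_text, 1.5, 10.0))  # [11.5, -9.0]
```

In this formulation, raising image_guidance_scale keeps the output closer to the input image, while raising guidance_scale makes the edit follow the instruction more strongly.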

Stable Diffusion XL

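Stable Diffusion XL (SDXL) is a powerful text-to-image model that generates high-resolution images, and it adds a second text encoder to its architecture. Use the train_instruct_pix2pix_sdxl.py script to train an SDXL model to follow image editing instructions.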

The SDXL training script is discussed in more detail in the SDXL training guide.

Next steps

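Congratulations on training your own InstructPix2Pix model! 🥳 To learn more about the model, it may be helpful to:

Read the Instruction-tuning Stable Diffusion with InstructPix2Pix blog post to learn more about some experiments we've done with InstructPix2Pix, dataset preparation, and results for different instructions.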
