IP-Adapter

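IP-Adapter is a lightweight adapter designed to integrate image-based guidance with text-to-image diffusion models. The adapter uses an image encoder to extract image features, which are passed to newly added cross-attention layers in the UNet; only these new layers are fine-tuned. The original UNet model and the existing cross-attention layers for the text features are frozen. Decoupling the cross-attention for image and text features enables more fine-grained and controllable generation.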
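To make the idea of decoupled cross-attention more concrete, here is a minimal toy sketch in plain Python. It is not the Diffusers implementation: the real layers apply learned projections to the text and image features, which are skipped here, and the names `attention`, `decoupled_cross_attention`, and the tiny example vectors are assumptions made for illustration. The sketch only shows that the text branch and the image branch attend separately and that the image branch is added with a scale factor.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    weights = softmax(scores)
    return [sum(w * value[i] for w, value in zip(weights, values)) for i in range(len(values[0]))]


def decoupled_cross_attention(query, text_kv, image_kv, scale):
    """Attend to text and image features separately, then add the scaled image branch."""
    text_out = attention(query, *text_kv)    # stands in for the frozen text branch
    image_out = attention(query, *image_kv)  # stands in for the newly added image branch
    return [t + scale * i for t, i in zip(text_out, image_out)]


# Toy (keys, values) pairs standing in for text and image features.
query = [1.0, 0.0]
text_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [0.0, 0.0]])
image_kv = ([[0.0, 1.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 2.0]])

text_only = decoupled_cross_attention(query, text_kv, image_kv, scale=0.0)
blended = decoupled_cross_attention(query, text_kv, image_kv, scale=0.5)
print(text_only, blended)
```

With `scale=0.0` the result equals the text branch alone, and raising the scale increases the contribution of the image branch.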

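IP-Adapter files are typically around 100MB because they contain only the adapter weights, not the base model. This means you need to load a model first, and then load the IP-Adapter with load_ip_adapter().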

IP-Adapters are available for many models, such as Flux and Stable Diffusion 3. The examples in this guide use Stable Diffusion and Stable Diffusion XL.

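Use the set_ip_adapter_scale() method to scale the influence of the IP-Adapter during generation. A value of 1.0 applies the image prompt at full strength, so it tends to dominate the text prompt, while 0.5 typically produces a balanced result between the text and image prompts.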

import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

pipeline = AutoPipelineForText2Image.from_pretrained(
  "stabilityai/stable-diffusion-xl-base-1.0",
  torch_dtype=torch.float16
).to("cuda")
pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="sdxl_models",
  weight_name="ip-adapter_sdxl.bin"
)
pipeline.set_ip_adapter_scale(0.8)

Pass an image to ip_adapter_image along with a text prompt to generate an image.

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
pipeline(
    prompt="a polar bear sitting in a chair drinking a milkshake",
    ip_adapter_image=image,
    negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
).images[0]

IP-Adapter can also be used for other tasks, such as image-to-image, inpainting, and video generation.

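Model variants

There are two variants of IP-Adapter: Plus and FaceID. The Plus variant uses patch embeddings and the ViT-H image encoder. The FaceID variant uses face embeddings generated with InsightFace.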


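Image embeddings

prepare_ip_adapter_image_embeds generates image embeddings that you can reuse when you run the pipeline multiple times or work with more than one image. Loading and encoding multiple images every time you use the pipeline is inefficient. It is more efficient to precompute the image embeddings, save them to disk, and load them when you need them.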

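import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

pipeline = AutoPipelineForText2Image.from_pretrained(
  "stabilityai/stable-diffusion-xl-base-1.0",
  torch_dtype=torch.float16
).to("cuda")
# The IP-Adapter and its image encoder must be loaded before images can be encoded.
pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="sdxl_models",
  weight_name="ip-adapter_sdxl.bin"
)

# Any image prompt works here; this one is reused from the first example.
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")

image_embeds = pipeline.prepare_ip_adapter_image_embeds(
    ip_adapter_image=image,
    ip_adapter_image_embeds=None,
    device="cuda",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
)

# Save the embeddings to disk so they can be reused later.
torch.save(image_embeds, "image_embeds.ipadpt")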

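Reload the image embeddings by passing them to the ip_adapter_image_embeds parameter. Set image_encoder_folder to None because the image encoder is no longer needed to generate the image embeddings.

You can also load image embeddings from other sources, such as ComfyUI.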

# Assumes `pipeline` is a freshly loaded base pipeline without an IP-Adapter.
pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="sdxl_models",
  image_encoder_folder=None,
  weight_name="ip-adapter_sdxl.bin"
)
pipeline.set_ip_adapter_scale(0.8)
image_embeds = torch.load("image_embeds.ipadpt")
# Fix the seed for reproducible results.
generator = torch.Generator(device="cpu").manual_seed(0)
pipeline(
    prompt="a polar bear sitting in a chair drinking a milkshake",
    ip_adapter_image_embeds=image_embeds,
    negative_prompt="deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality",
    num_inference_steps=100,
    generator=generator,
).images[0]

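Masking

Binary masks assign an IP-Adapter image to a specific area of the output image, which is useful for composing multiple IP-Adapter images. Each IP-Adapter image requires its own binary mask.

Load IPAdapterMaskProcessor to preprocess the image masks. For the best results, provide the output height and width so that masks with different aspect ratios are resized appropriately. If the input masks already match the aspect ratio of the generated image, you don't need to set height and width.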

import torch
from diffusers import AutoPipelineForText2Image
from diffusers.image_processor import IPAdapterMaskProcessor
from diffusers.utils import load_image

pipeline = AutoPipelineForText2Image.from_pretrained(
  "stabilityai/stable-diffusion-xl-base-1.0",
  torch_dtype=torch.float16
).to("cuda")

mask1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask1.png")
mask2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_mask2.png")

processor = IPAdapterMaskProcessor()
masks = processor.preprocess([mask1, mask2], height=1024, width=1024)

Provide the IP-Adapter images and their scales as lists. Pass the preprocessed masks to cross_attention_kwargs in the pipeline.

face_image1 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl1.png")
face_image2 = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_mask_girl2.png")

pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="sdxl_models",
  weight_name=["ip-adapter-plus-face_sdxl_vit-h.safetensors"]
)
pipeline.set_ip_adapter_scale([[0.7, 0.7]])

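# One adapter is loaded, so ip_images holds a single entry containing both face images.
ip_images = [[face_image1, face_image2]]
# Reshape the masks from (2, 1, H, W) to (1, 2, H, W): one entry for the adapter, one mask per image.
masks = [masks.reshape(1, masks.shape[0], masks.shape[2], masks.shape[3])]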

pipeline(
  prompt="2 girls",
  ip_adapter_image=ip_images,
  negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
  cross_attention_kwargs={"ip_adapter_masks": masks}
).images[0]
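As a purely conceptual illustration of how binary masks assign each image prompt to a region, the toy sketch below fills a tiny 2x4 canvas from two complementary masks. It does not reproduce the Diffusers internals; the helper `apply_region_masks` is hypothetical, and single numbers stand in for the features contributed by each IP-Adapter image.

```python
def apply_region_masks(masks, features):
    """Fill a canvas so each pixel receives the feature whose binary mask is 1 there."""
    height, width = len(masks[0]), len(masks[0][0])
    canvas = [[0.0] * width for _ in range(height)]
    for mask, feature in zip(masks, features):
        for y in range(height):
            for x in range(width):
                # A mask value of 0 blocks the feature, a value of 1 lets it through.
                canvas[y][x] += mask[y][x] * feature
    return canvas


# Two complementary masks: the first covers the left half, the second the right half.
mask_left = [[1, 1, 0, 0], [1, 1, 0, 0]]
mask_right = [[0, 0, 1, 1], [0, 0, 1, 1]]

# 10.0 and 20.0 stand in for the features of the first and second image prompt.
canvas = apply_region_masks([mask_left, mask_right], [10.0, 20.0])
print(canvas)
```

The left half of the canvas only receives the first feature and the right half only the second, which mirrors why each IP-Adapter image needs its own mask.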

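Applications

The following sections cover some popular applications of IP-Adapter.

Face models

Generating faces and preserving their details can be challenging. To help generate more accurate faces, there are checkpoints conditioned specifically on images of cropped faces. You can find the face models in the h94/IP-Adapter repository or the h94/IP-Adapter-FaceID repository. The FaceID checkpoints use FaceID embeddings from InsightFace instead of CLIP image embeddings.

We recommend using DDIMScheduler or EulerDiscreteScheduler with the face models.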


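Multiple IP-Adapters

Combine multiple IP-Adapters to generate images in more diverse styles. For example, you can use IP-Adapter Face to generate consistent faces and characters, and IP-Adapter Plus to render those faces in a specific style.

Load the image encoder with CLIPVisionModelWithProjection.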

import torch
from diffusers import AutoPipelineForText2Image, DDIMScheduler
from transformers import CLIPVisionModelWithProjection
from diffusers.utils import load_image

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter",
    subfolder="models/image_encoder",
    torch_dtype=torch.float16,
)

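Load the base model, the scheduler, and the following IP-Adapters:

ip-adapter-plus_sdxl_vit-h uses patch embeddings and the ViT-H image encoder.
ip-adapter-plus-face_sdxl_vit-h also uses patch embeddings and the ViT-H image encoder, but it is conditioned on images of cropped faces.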

pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    image_encoder=image_encoder,
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="sdxl_models",
  weight_name=["ip-adapter-plus_sdxl_vit-h.safetensors", "ip-adapter-plus-face_sdxl_vit-h.safetensors"]
)
pipeline.set_ip_adapter_scale([0.7, 0.3])
# enable_model_cpu_offload to reduce memory usage
pipeline.enable_model_cpu_offload()

Load a face image and a folder of images in the style you want to apply.

face_image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/women_input.png")
style_folder = "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/style_ziggy"
style_images = [load_image(f"{style_folder}/img{i}.png") for i in range(10)]

Pass the style images and the face image as a list to ip_adapter_image.

generator = torch.Generator(device="cpu").manual_seed(0)

pipeline(
    prompt="wonderwoman",
    ip_adapter_image=[style_images, face_image],
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    generator=generator,
).images[0]
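The nesting of the ip_adapter_image list in the example above can be easy to misread, so the sketch below spells it out: one entry per loaded adapter, in the same order as weight_name, where an entry is either a single image or a list of images. The helper `count_images_per_adapter` is hypothetical and not part of Diffusers, and plain strings stand in for the PIL images.

```python
def count_images_per_adapter(adapter_names, ip_adapter_image):
    """Pair each loaded adapter with its entry and count the images it receives."""
    if len(ip_adapter_image) != len(adapter_names):
        raise ValueError(
            f"Expected {len(adapter_names)} entries, one per adapter, "
            f"but got {len(ip_adapter_image)}."
        )
    counts = {}
    for name, entry in zip(adapter_names, ip_adapter_image):
        # An entry can be a single image or a list of images for that adapter.
        images = entry if isinstance(entry, list) else [entry]
        counts[name] = len(images)
    return counts


adapters = ["ip-adapter-plus_sdxl_vit-h", "ip-adapter-plus-face_sdxl_vit-h"]
style_images = [f"style_{i}" for i in range(10)]  # placeholders for the 10 style images
face_image = "face"                               # placeholder for the face image

counts = count_images_per_adapter(adapters, [style_images, face_image])
print(counts)
```

Here the Plus adapter receives the ten style images and the Plus Face adapter receives the single face image, matching the order of the two scales passed to set_ip_adapter_scale().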

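Instant generation

Latent Consistency Models (LCM) can generate images in 4 steps or fewer, unlike other diffusion models that require many more steps, which makes generation feel "instantaneous". IP-Adapters are compatible with LCM models, so they can also generate images instantly.

Load the IP-Adapter weights, then load the LoRA weights with load_lora_weights().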

import torch
from diffusers import DiffusionPipeline, LCMScheduler
from diffusers.utils import load_image

pipeline = DiffusionPipeline.from_pretrained(
  "sd-dreambooth-library/herge-style",
  torch_dtype=torch.float16
)

pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="models",
  weight_name="ip-adapter_sd15.bin"
)
pipeline.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")
pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
# enable_model_cpu_offload to reduce memory usage
pipeline.enable_model_cpu_offload()

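Try a lower IP-Adapter scale so that generation is conditioned more on the style of the checkpoint you want to apply, and remember to include the checkpoint's special token in your prompt to trigger that style.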

pipeline.set_ip_adapter_scale(0.4)

prompt = "herge_style woman in armor, best quality, high quality"

ip_adapter_image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png")
pipeline(
    prompt=prompt,
    ip_adapter_image=ip_adapter_image,
    num_inference_steps=4,
    guidance_scale=1,
).images[0]

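Structural control

For structural control, combine IP-Adapter with a ControlNet conditioned on depth maps, edge maps, pose estimations, and more.

The example below loads a ControlNetModel checkpoint conditioned on depth maps and combines it with an IP-Adapter.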

import torch
from diffusers.utils import load_image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
  "lllyasviel/control_v11f1p_sd15_depth",
  torch_dtype=torch.float16
)

pipeline = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")
pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="models",
  weight_name="ip-adapter_sd15.bin"
)

Pass the depth map and the IP-Adapter image to the pipeline.

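# depth_map and ip_adapter_image are assumed to be PIL images you have already
# loaded, for example with load_image().
pipeline(
  prompt="best quality, high quality",
  image=depth_map,
  ip_adapter_image=ip_adapter_image,
  negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
).images[0]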

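Style and layout control

For style and layout control, combine IP-Adapter with InstantStyle. InstantStyle separates *style* (color, texture, overall feel) from *content*. It applies the style only in the style-specific blocks of the model to prevent it from distorting other areas of the image. This produces images with stronger and more consistent styles and better control over the layout.

The IP-Adapter is activated only for specific parts of the model. Use the set_ip_adapter_scale() method to scale the influence of the IP-Adapter in different layers. The example below activates the IP-Adapter in the second layer of the model's down block_2 and up block_0. Down block_2 is where the IP-Adapter injects layout information, and up block_0 is where it injects the style.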

import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

pipeline = AutoPipelineForText2Image.from_pretrained(
  "stabilityai/stable-diffusion-xl-base-1.0",
  torch_dtype=torch.float16
).to("cuda")
pipeline.load_ip_adapter(
  "h94/IP-Adapter",
  subfolder="sdxl_models",
  weight_name="ip-adapter_sdxl.bin"
)

scale = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
pipeline.set_ip_adapter_scale(scale)

Load the style image and generate an image.

style_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg")

pipeline(
    prompt="a cat, masterpiece, best quality, high quality",
    ip_adapter_image=style_image,
    negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry",
    guidance_scale=5,
).images[0]

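You can also insert the IP-Adapter in all of the model's layers. This tends to generate images that focus more on the image prompt and may reduce the diversity of the generated images. To apply only the style, activate the IP-Adapter only in up block_0, the style layer.

You don't need to specify every layer in the scale dictionary. Layers that aren't included are set to 0, which means the IP-Adapter is disabled there.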

scale = {
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
pipeline.set_ip_adapter_scale(scale)

pipeline(
    prompt="a cat, masterpiece, best quality, high quality",
    ip_adapter_image=style_image,
    negative_prompt="text, watermark, lowres, low quality, worst quality, deformed, glitch, low contrast, noisy, saturation, blurry",
    guidance_scale=5,
).images[0]
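To make the indexing in the two scale dictionaries above easier to read, the sketch below lists which entries are non-zero. The helper `active_layers` is hypothetical and only inspects the dictionary; it does not query the pipeline or the actual number of layers in each block.

```python
def active_layers(scale):
    """Return (direction, block, layer_index, value) for every non-zero scale entry."""
    active = []
    for direction, blocks in scale.items():
        for block, values in blocks.items():
            for layer_index, value in enumerate(values):
                if value != 0.0:
                    active.append((direction, block, layer_index, value))
    return active


# Layout and style: second layer (index 1) of down block_2 and up block_0.
layout_and_style = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
# Style only: blocks that are not listed are treated as disabled (scale 0).
style_only = {
    "up": {"block_0": [0.0, 1.0, 0.0]},
}

print(active_layers(layout_and_style))
print(active_layers(style_only))
```

In both dictionaries the active entry sits at index 1, which is the "second layer" referred to in the text.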

[Figure: image generated with the IP-Adapter applied in all layers]
