Image-to-image
Image-to-image is similar to text-to-image, but in addition to a prompt, you can also pass an initial image as a starting point for the diffusion process. The initial image is encoded to latent space and noise is added to it. Then the latent diffusion model takes a prompt and the noisy latent image, predicts the added noise, and removes the predicted noise from the initial latent image to get the new latent image. Lastly, a decoder decodes the new latent image back into an image.
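To make that process concrete, here is a minimal, illustrative sketch of a bare-bones image-to-image loop built directly from Stable Diffusion v1.5 components. Treat it as a conceptual aid rather than a reference implementation: it omits classifier-free guidance and other details, and the variable names are our own. The pipelines used throughout this guide handle all of this for you.

import torch
import numpy as np
from diffusers import AutoencoderKL, DDIMScheduler, UNet2DConditionModel
from diffusers.utils import load_image
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
device = "cuda"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")

# encode the prompt into text embeddings
tokens = tokenizer("cat wizard", padding="max_length", max_length=tokenizer.model_max_length, return_tensors="pt")
with torch.no_grad():
    prompt_embeds = text_encoder(tokens.input_ids.to(device))[0]

# encode the initial image into latent space and scale the latents
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png").resize((512, 512))
pixels = torch.from_numpy(np.array(init_image)).float() / 127.5 - 1.0
pixels = pixels.permute(2, 0, 1).unsqueeze(0).to(device)
with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor

# add noise to the latents; strength determines how many denoising steps remain
strength, num_inference_steps = 0.8, 50
scheduler.set_timesteps(num_inference_steps)
timesteps = scheduler.timesteps[num_inference_steps - int(num_inference_steps * strength):]
latents = scheduler.add_noise(latents, torch.randn_like(latents), timesteps[:1])

# denoising loop: predict the added noise and remove it step by step
for t in timesteps:
    with torch.no_grad():
        noise_pred = unet(latents, t, encoder_hidden_states=prompt_embeds).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# decode the new latents back into an image
with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor).sample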
With 🤗 Diffusers, this is as easy as 1-2-3:
- Load a checkpoint into the AutoPipelineForImage2Image class; this pipeline automatically handles loading the correct pipeline class based on the checkpoint:
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image, make_image_grid
pipeline = AutoPipelineForImage2Image.from_pretrained(
"kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
You'll notice throughout the guide, we use enable_model_cpu_offload() and enable_xformers_memory_efficient_attention() to save memory and increase inference speed. If you're using PyTorch 2.0, then you don't need to call enable_xformers_memory_efficient_attention() on your pipeline because it'll already be using PyTorch 2.0's native scaled-dot product attention.
- Load an image to pass to the pipeline:
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
- Pass a prompt and image to the pipeline to generate an image:
prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"
image = pipeline(prompt, image=init_image).images[0]
make_image_grid([init_image, image], rows=1, cols=2)


Popular models
The most popular image-to-image models are Stable Diffusion v1.5, Stable Diffusion XL (SDXL), and Kandinsky 2.2. The results from the Stable Diffusion and Kandinsky models vary due to their architecture differences and training process; you can generally expect SDXL to produce higher quality images than Stable Diffusion v1.5. Let's take a quick look at how to use each of these models and compare their results.
Stable Diffusion v1.5
Stable Diffusion v1.5 is a latent diffusion model initialized from an earlier checkpoint and further finetuned for 595K steps on 512x512 images. To use this pipeline for image-to-image, you'll need to prepare an initial image to pass to the pipeline. Then you can pass a prompt and the image to the pipeline to generate a new image:
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image
pipeline = AutoPipelineForImage2Image.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
# prepare image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(url)
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# pass prompt and image to pipeline
image = pipeline(prompt, image=init_image).images[0]
make_image_grid([init_image, image], rows=1, cols=2)


Stable Diffusion XL (SDXL)
SDXL is a more powerful version of the Stable Diffusion model. It uses a larger base model, and an additional refiner model to increase the quality of the base model's output. Read the SDXL guide for a more detailed walkthrough of how to use this model, and other techniques it uses to produce high quality images.
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image
pipeline = AutoPipelineForImage2Image.from_pretrained(
"stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
# prepare image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-sdxl-init.png"
init_image = load_image(url)
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# pass prompt and image to pipeline
image = pipeline(prompt, image=init_image, strength=0.5).images[0]
make_image_grid([init_image, image], rows=1, cols=2)


Kandinsky 2.2
The Kandinsky model is different from the Stable Diffusion models because it uses an image prior model to create image embeddings. The embeddings help create a better alignment between text and images, allowing the latent diffusion model to generate better images.
The simplest way to use Kandinsky 2.2 is:
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image
pipeline = AutoPipelineForImage2Image.from_pretrained(
"kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
# prepare image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(url)
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# pass prompt and image to pipeline
image = pipeline(prompt, image=init_image).images[0]
make_image_grid([init_image, image], rows=1, cols=2)


Configure pipeline parameters
There are several important parameters you can configure in the pipeline that'll affect the image generation process and image quality. Let's take a closer look at what these parameters do and how changing them affects the output.
Strength
`strength` is one of the most important parameters to consider and it'll have a huge impact on your generated image. It determines how much the generated image resembles the initial image. In other words:
- 📈 a higher `strength` value gives the model more "creativity" to generate an image that's different from the initial image; a `strength` value of 1.0 means the initial image is more or less ignored
- 📉 a lower `strength` value means the generated image is more similar to the initial image
The `strength` and `num_inference_steps` parameters are related because `strength` determines the number of noise steps to add. For example, if `num_inference_steps` is 50 and `strength` is 0.8, then this means adding 40 (50 * 0.8) steps of noise to the initial image and then denoising for 40 steps to get the newly generated image.
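Concretely, the step arithmetic mirrors what the image-to-image pipeline computes internally (a small illustration; the variable names here are just for exposition):

num_inference_steps = 50
strength = 0.8

# noise steps added to the initial image == denoising steps that actually run
init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
t_start = max(num_inference_steps - init_timestep, 0)
print(init_timestep, t_start)  # 40 10 -> denoising starts 10 steps in and runs for 40 steps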
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image
pipeline = AutoPipelineForImage2Image.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
# prepare image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(url)
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# pass prompt and image to pipeline
image = pipeline(prompt, image=init_image, strength=0.8).images[0]
make_image_grid([init_image, image], rows=1, cols=2)



Guidance scale
The `guidance_scale` parameter is used to control how closely aligned the generated image and text prompt are. A higher `guidance_scale` value means your generated image is more aligned with the prompt, while a lower `guidance_scale` value means your generated image has more space to deviate from the prompt.
You can combine `guidance_scale` with `strength` for even more precise control over how expressive the model is. For example, combine a high `strength + guidance_scale` for maximum creativity, or use a combination of low `strength` and low `guidance_scale` to generate an image that resembles the initial image but is not as strictly bound to the prompt.
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image
pipeline = AutoPipelineForImage2Image.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
# prepare image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(url)
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# pass prompt and image to pipeline
image = pipeline(prompt, image=init_image, guidance_scale=8.0).images[0]
make_image_grid([init_image, image], rows=1, cols=2)
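To experiment with the interplay described above, you can vary both parameters together; the values below are illustrative starting points rather than recommendations:

# high strength + high guidance_scale: maximum creativity
creative = pipeline(prompt, image=init_image, strength=1.0, guidance_scale=12.0).images[0]
# low strength + low guidance_scale: stay close to the initial image
faithful = pipeline(prompt, image=init_image, strength=0.3, guidance_scale=3.0).images[0]
make_image_grid([init_image, creative, faithful], rows=1, cols=3)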



Negative prompt
A negative prompt conditions the model to *not* include things in an image, and it can be used to improve image quality or modify an image. For example, you can improve image quality by including negative prompts like "poor details" or "blurry" to encourage the model to generate a higher quality image. Or you can modify an image by specifying things to exclude from an image.
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image
pipeline = AutoPipelineForImage2Image.from_pretrained(
"stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
# prepare image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(url)
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
negative_prompt = "ugly, deformed, disfigured, poor details, bad anatomy"
# pass prompt and image to pipeline
image = pipeline(prompt, negative_prompt=negative_prompt, image=init_image).images[0]
make_image_grid([init_image, image], rows=1, cols=2)


Chained image-to-image pipelines
There are some other interesting ways you can use an image-to-image pipeline aside from just generating an image (although that is pretty cool too). You can take it a step further and chain it with other pipelines.
Text-to-image-to-image
Chaining a text-to-image and image-to-image pipeline allows you to generate an image from text and use the generated image as the initial image for the image-to-image pipeline. This is useful if you want to generate an image entirely from scratch. For example, let's chain a Stable Diffusion and a Kandinsky model.
Start by generating an image with the text-to-image pipeline:
from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image
import torch
from diffusers.utils import make_image_grid
pipeline = AutoPipelineForText2Image.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
text2image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k").images[0]
text2image
Now you can pass this generated image to the image-to-image pipeline:
pipeline = AutoPipelineForImage2Image.from_pretrained(
"kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
image2image = pipeline("Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", image=text2image).images[0]
make_image_grid([text2image, image2image], rows=1, cols=2)
Image-to-image-to-image
You can also chain multiple image-to-image pipelines together to create more interesting images. This can be useful for iteratively performing style transfer on an image, generating short GIFs, restoring color to an image, or restoring missing areas of an image.
Start by generating an image:
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image
pipeline = AutoPipelineForImage2Image.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
# prepare image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(url)
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# pass prompt and image to pipeline
image = pipeline(prompt, image=init_image, output_type="latent").images[0]
It is important to specify `output_type="latent"` in the pipeline to keep all the outputs in latent space and avoid an unnecessary decode-encode step. This only works if the chained pipelines are using the same VAE.
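If you're not sure whether two checkpoints share a VAE, one way to guarantee it (an illustrative option, not something this guide requires; the checkpoints chained below are all SD v1.5-based and already compatible) is to reuse the first pipeline's VAE when loading the next one:

# illustrative: passing the first pipeline's VAE guarantees a shared latent space
next_pipeline = AutoPipelineForImage2Image.from_pretrained(
    "ogkalu/Comic-Diffusion", vae=pipeline.vae, torch_dtype=torch.float16
)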
Pass the latent output from this pipeline to the next pipeline to generate an image in a comic book art style:
pipeline = AutoPipelineForImage2Image.from_pretrained(
"ogkalu/Comic-Diffusion", torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
# need to include the token "charliebo artstyle" in the prompt to use this checkpoint
image = pipeline("Astronaut in a jungle, charliebo artstyle", image=image, output_type="latent").images[0]
Repeat one more time to generate the final image in a pixel art style:
pipeline = AutoPipelineForImage2Image.from_pretrained(
"kohbanye/pixel-art-style", torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
# need to include the token "pixelartstyle" in the prompt to use this checkpoint
image = pipeline("Astronaut in a jungle, pixelartstyle", image=image).images[0]
make_image_grid([init_image, image], rows=1, cols=2)
Image-to-upscaler-to-super-resolution
Another way you can chain your image-to-image pipeline is with an upscaler and super-resolution pipeline to really increase the level of detail in an image.
Start with an image-to-image pipeline:
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image
pipeline = AutoPipelineForImage2Image.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
# prepare image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(url)
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
# pass prompt and image to pipeline
image_1 = pipeline(prompt, image=init_image, output_type="latent").images[0]
It is important to specify `output_type="latent"` in the pipeline to keep all the outputs in *latent* space and avoid an unnecessary decode-encode step. This only works if the chained pipelines are using the same VAE.
Chain it to an upscaler pipeline to increase the image resolution:
from diffusers import StableDiffusionLatentUpscalePipeline
upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(
"stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16, use_safetensors=True
)
upscaler.enable_model_cpu_offload()
upscaler.enable_xformers_memory_efficient_attention()
image_2 = upscaler(prompt, image=image_1).images[0]
Finally, chain it to a super-resolution pipeline to further enhance the resolution:
from diffusers import StableDiffusionUpscalePipeline
super_res = StableDiffusionUpscalePipeline.from_pretrained(
"stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
super_res.enable_model_cpu_offload()
super_res.enable_xformers_memory_efficient_attention()
image_3 = super_res(prompt, image=image_2).images[0]
make_image_grid([init_image, image_3.resize((512, 512))], rows=1, cols=2)
Control image generation
Trying to generate an image that looks exactly the way you want can be difficult, which is why controlled generation techniques and models are so useful. While you can use the `negative_prompt` to partially control image generation, there are more robust methods like prompt weighting and ControlNets.
Prompt weighting
Prompt weighting allows you to scale the representation of each concept in a prompt. For example, in a prompt like "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", you can choose to increase or decrease the embeddings of "astronaut" and "jungle". The Compel library offers a simple syntax for adjusting prompt weights and generating the embeddings. You can learn how to create the embeddings in the Prompt weighting guide.
AutoPipelineForImage2Image has a `prompt_embeds` (and `negative_prompt_embeds` if you're using a negative prompt) parameter where you can pass the embeddings, which replaces the `prompt` parameter.
from diffusers import AutoPipelineForImage2Image
import torch
pipeline = AutoPipelineForImage2Image.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
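# A hedged sketch of creating the embeddings with Compel (assumes `pip install compel`;
# the "+" weighting syntax and the prompts below are illustrative, not from this guide)
from compel import Compel
from diffusers.utils import load_image

compel = Compel(tokenizer=pipeline.tokenizer, text_encoder=pipeline.text_encoder)
prompt_embeds = compel("astronaut+ in a jungle, cold color palette, muted colors, detailed, 8k")
negative_prompt_embeds = compel("ugly, deformed, disfigured, poor details, bad anatomy")
# reuse the initial image from the earlier examples
init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png")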
image = pipeline(prompt_embeds=prompt_embeds, # generated from Compel
negative_prompt_embeds=negative_prompt_embeds, # generated from Compel
image=init_image,
).images[0]
ControlNet
ControlNets provide a more flexible and accurate way to control image generation because you can use an additional conditioning image. The conditioning image can be a canny image, depth map, image segmentation, and even scribbles! Whatever type of conditioning image you choose, the ControlNet generates an image that preserves the information in it.
For example, let's condition an image with a depth map to keep the spatial information in the image.
from diffusers.utils import load_image, make_image_grid
# prepare image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
init_image = load_image(url)
init_image = init_image.resize((958, 960)) # resize to depth image dimensions
depth_image = load_image("https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/control.png")
make_image_grid([init_image, depth_image], rows=1, cols=2)
Load a ControlNet model conditioned on depth maps and the AutoPipelineForImage2Image:
from diffusers import ControlNetModel, AutoPipelineForImage2Image
import torch
controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16, variant="fp16", use_safetensors=True)
pipeline = AutoPipelineForImage2Image.from_pretrained(
"stable-diffusion-v1-5/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
Now generate a new image conditioned on the depth map, initial image, and prompt:
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image_control_net = pipeline(prompt, image=init_image, control_image=depth_image).images[0]
make_image_grid([init_image, depth_image, image_control_net], rows=1, cols=3)



Let's apply a new style to the image generated from the ControlNet by chaining it with an image-to-image pipeline:
pipeline = AutoPipelineForImage2Image.from_pretrained(
"nitrosocke/elden-ring-diffusion", torch_dtype=torch.float16,
)
pipeline.enable_model_cpu_offload()
# remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed
pipeline.enable_xformers_memory_efficient_attention()
prompt = "elden ring style astronaut in a jungle" # include the token "elden ring style" in the prompt
negative_prompt = "ugly, deformed, disfigured, poor details, bad anatomy"
image_elden_ring = pipeline(prompt, negative_prompt=negative_prompt, image=image_control_net, strength=0.45, guidance_scale=10.5).images[0]
make_image_grid([init_image, depth_image, image_control_net, image_elden_ring], rows=2, cols=2)

Optimize
Running diffusion models is computationally expensive and intensive, but with a few optimization tricks, it is entirely possible to run them on consumer and free-tier GPUs. For example, you can use a more memory-efficient form of attention such as PyTorch 2.0's scaled-dot product attention or xFormers (you can use one or the other, but there's no need to use both). You can also offload the model to the GPU while the other pipeline components wait on the CPU.
+ pipeline.enable_model_cpu_offload()
+ pipeline.enable_xformers_memory_efficient_attention()
With `torch.compile`, you can boost your inference speed even more by wrapping your UNet with it:
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)