Kandinsky 2.2

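Kandinsky 2.2 was created by Arseniy Shakhmatov, Anton Razzhigaev, Aleksandr Nikolich, Vladimir Arkhipkin, Igor Pavlov, Andrey Kuznetsov, and Denis Dimitrov.

The description from its GitHub page is:

Kandinsky 2.2 brings substantial improvements upon its predecessor, Kandinsky 2.1, by introducing a new, more powerful image encoder, CLIP-ViT-G, and ControlNet support. The switch to CLIP-ViT-G as the image encoder significantly increases the model's capability to generate more aesthetic pictures and better understand text, thus enhancing the model's overall performance. The addition of the ControlNet mechanism allows the model to effectively control the process of generating images. This leads to more accurate and visually appealing outputs and opens new possibilities for text-guided image manipulation.

The original codebase can be found at ai-forever/Kandinsky-2.

Check out the Kandinsky Community organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting.

Make sure to check out the schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the section on reusing components across pipelines to learn how to efficiently load the same components into multiple pipelines.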

KandinskyV22PriorPipeline

class diffusers.KandinskyV22PriorPipeline

( prior: PriorTransformer image_encoder: CLIPVisionModelWithProjection text_encoder: CLIPTextModelWithProjection tokenizer: CLIPTokenizer scheduler: UnCLIPScheduler image_processor: CLIPImageProcessor )

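Parameters

prior (PriorTransformer): The canonical unCLIP prior used to approximate the image embedding from the text embedding.
image_encoder (CLIPVisionModelWithProjection): Frozen image encoder.
text_encoder (CLIPTextModelWithProjection): Frozen text encoder.
tokenizer (CLIPTokenizer): Tokenizer of class CLIPTokenizer.
scheduler (UnCLIPScheduler): A scheduler used in combination with prior to generate the image embedding.
image_processor (CLIPImageProcessor): An image processor used to preprocess images for CLIP.

Pipeline for generating the image prior for Kandinsky.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).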

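__call__

( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: int = 1 num_inference_steps: int = 25 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None guidance_scale: float = 4.0 output_type: typing.Optional[str] = 'pt' return_dict: bool = True callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] ) → KandinskyPriorPipelineOutput or tuple

Parameters

prompt (str or List[str]): The prompt or prompts to guide the image generation.
negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
num_inference_steps (int, optional, defaults to 25): The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
generator (torch.Generator or List[torch.Generator], optional): One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional): Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
guidance_scale (float, optional, defaults to 4.0): Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
output_type (str, optional, defaults to "pt"): The output format of the generated image. Choose between "np" (np.array) or "pt" (torch.Tensor).
return_dict (bool, optional, defaults to True): Whether or not to return a KandinskyPriorPipelineOutput instead of a plain tuple.
callback_on_step_end (Callable, optional): A function called at the end of each denoising step during inference. It is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
callback_on_step_end_tensor_inputs (List, optional): The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as the `callback_kwargs` argument. You can only include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.

Returns

KandinskyPriorPipelineOutput or tuple

The function invoked when calling the pipeline for generation.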
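
The guidance_scale parameter can be read as a weight in classifier-free guidance. The following is a minimal sketch in plain Python, not the pipeline's actual implementation. It assumes the common formulation in which the guided prediction equals the unconditional prediction plus guidance_scale times the difference between the conditional and unconditional predictions. The function name apply_guidance and the toy vectors are hypothetical.

```python
from typing import List


def apply_guidance(uncond: List[float], cond: List[float], guidance_scale: float) -> List[float]:
    """Combine unconditional and conditional predictions (illustrative only)."""
    # guided = uncond + w * (cond - uncond); w = 1 returns the conditional prediction unchanged.
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond_pred = [0.0, 1.0, 2.0]  # toy "unconditional" prediction
cond_pred = [1.0, 1.0, 4.0]    # toy "prompt-conditioned" prediction

no_guidance = apply_guidance(uncond_pred, cond_pred, guidance_scale=1.0)
default_guidance = apply_guidance(uncond_pred, cond_pred, guidance_scale=4.0)
```

With a scale of 1 the result is just the conditional prediction. Larger values push the result further away from the unconditional one, which is why high scales follow the prompt more closely but can cost image quality.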

Example

>>> from diffusers import KandinskyV22Pipeline, KandinskyV22PriorPipeline
>>> import torch

>>> pipe_prior = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior")
>>> pipe_prior.to("cuda")
>>> prompt = "red cat, 4k photo"
>>> image_emb, negative_image_emb = pipe_prior(prompt).to_tuple()

>>> pipe = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder")
>>> pipe.to("cuda")
>>> image = pipe(
...     image_embeds=image_emb,
...     negative_image_embeds=negative_image_emb,
...     height=768,
...     width=768,
...     num_inference_steps=50,
... ).images
>>> image[0].save("cat.png")

interpolate

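( images_and_prompts: typing.List[typing.Union[str, PIL.Image.Image, torch.Tensor]] weights: typing.List[float] num_images_per_prompt: int = 1 num_inference_steps: int = 25 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None negative_prior_prompt: typing.Optional[str] = None negative_prompt: str = '' guidance_scale: float = 4.0 device = None ) → KandinskyPriorPipelineOutput or tuple

Parameters

images_and_prompts (List[Union[str, PIL.Image.Image, torch.Tensor]]): A list of prompts and images to guide the image generation.
weights (List[float]): A list of weights, one for each condition in `images_and_prompts`.
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
num_inference_steps (int, optional, defaults to 25): The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
generator (torch.Generator or List[torch.Generator], optional): One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional): Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
negative_prior_prompt (str, optional): The prompt not to guide the prior diffusion process. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
guidance_scale (float, optional, defaults to 4.0): Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.

Returns

KandinskyPriorPipelineOutput or tuple

The function invoked when using the prior pipeline for interpolation.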
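
Conceptually, interpolate mixes the conditions in images_and_prompts according to weights. Here is a minimal sketch, assuming each condition has already been turned into an embedding vector and that the result is the weighted sum of those vectors. The helper mix_embeddings and the three-dimensional toy vectors are hypothetical stand-ins for real CLIP image embeddings.

```python
from typing import List, Sequence


def mix_embeddings(embeddings: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Return the weighted sum of equally sized embedding vectors (illustrative only)."""
    if len(embeddings) != len(weights):
        raise ValueError("embeddings and weights must have the same length")
    dim = len(embeddings[0])
    mixed = [0.0] * dim
    for emb, weight in zip(embeddings, weights):
        if len(emb) != dim:
            raise ValueError("all embeddings must have the same dimension")
        for i, value in enumerate(emb):
            mixed[i] += weight * value
    return mixed


# Toy stand-ins for the embeddings of "a cat", img1 and img2.
toy_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mixed = mix_embeddings(toy_embeddings, weights=[0.3, 0.3, 0.4])
```

Under this reading, a larger weight pulls the mixed embedding, and therefore the decoded image, closer to the corresponding prompt or image.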

Example

>>> from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
>>> from diffusers.utils import load_image
>>> import PIL
>>> import torch
>>> from torchvision import transforms

>>> pipe_prior = KandinskyV22PriorPipeline.from_pretrained(
...     "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
... )
>>> pipe_prior.to("cuda")
>>> img1 = load_image(
...     "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
...     "/kandinsky/cat.png"
... )
>>> img2 = load_image(
...     "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
...     "/kandinsky/starry_night.jpeg"
... )
>>> images_texts = ["a cat", img1, img2]
>>> weights = [0.3, 0.3, 0.4]
>>> out = pipe_prior.interpolate(images_texts, weights)
>>> pipe = KandinskyV22Pipeline.from_pretrained(
...     "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
... )
>>> pipe.to("cuda")
>>> image = pipe(
...     image_embeds=out.image_embeds,
...     negative_image_embeds=out.negative_image_embeds,
...     height=768,
...     width=768,
...     num_inference_steps=50,
... ).images[0]
>>> image.save("starry_cat.png")

KandinskyV22Pipeline

class diffusers.KandinskyV22Pipeline

( unet: UNet2DConditionModel scheduler: DDPMScheduler movq: VQModel )

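Parameters

scheduler (Union[DDIMScheduler,DDPMScheduler]): A scheduler used in combination with unet to generate image latents.
unet (UNet2DConditionModel): Conditional U-Net architecture to denoise the image embedding.
movq (VQModel): MoVQ decoder used to generate the image from the latents.

Pipeline for text-to-image generation using Kandinsky.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).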

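__call__

( image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] negative_image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] height: int = 512 width: int = 512 num_inference_steps: int = 100 guidance_scale: float = 4.0 num_images_per_prompt: int = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → ImagePipelineOutput or tuple

Parameters

image_embeds (torch.Tensor or List[torch.Tensor]): The CLIP image embeddings for the text prompt, used to condition the image generation.
negative_image_embeds (torch.Tensor or List[torch.Tensor]): The CLIP image embeddings for the negative text prompt, used to condition the image generation.
height (int, optional, defaults to 512): The height in pixels of the generated image.
width (int, optional, defaults to 512): The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 100): The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 4.0): Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional): One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional): Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array), or "pt" (torch.Tensor).
return_dict (bool, optional, defaults to True): Whether or not to return an ImagePipelineOutput instead of a plain tuple.
callback_on_step_end (Callable, optional): A function called at the end of each denoising step during inference. It is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional): The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

Returns

ImagePipelineOutput or tuple

The function invoked when calling the pipeline for generation.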
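
A function passed as callback_on_step_end receives the pipeline, the step index, the timestep, and a callback_kwargs dictionary, as described in the parameter list. The sketch below is a hedged illustration only: the callback record_latent_norm and the stand-in loop are hypothetical, plain lists replace real tensors, and returning callback_kwargs from the callback is an assumption about how the pipeline consumes the result.

```python
import math
from typing import Any, Dict, List

norm_log: List[float] = []


def record_latent_norm(pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical callback: log the L2 norm of the latents at every step."""
    latents = callback_kwargs["latents"]
    norm_log.append(math.sqrt(sum(x * x for x in latents)))
    # Hand the (possibly modified) tensors back to the caller.
    return callback_kwargs


# Stand-in for the denoising loop; a real pipeline would invoke the callback itself.
fake_latents = [3.0, 4.0]
for step, timestep in enumerate([999, 500, 0]):
    result = record_latent_norm(None, step, timestep, {"latents": fake_latents})
    fake_latents = [x * 0.5 for x in result["latents"]]
```

With a real pipeline, such a function would be supplied through the callback_on_step_end argument, with callback_on_step_end_tensor_inputs=["latents"] selecting which tensors appear in callback_kwargs.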

Example

>>> from diffusers import KandinskyV22Pipeline, KandinskyV22PriorPipeline
>>> import torch

>>> pipe_prior = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior")
>>> pipe_prior.to("cuda")
>>> prompt = "red cat, 4k photo"
>>> out = pipe_prior(prompt)
>>> image_emb = out.image_embeds
>>> zero_image_emb = out.negative_image_embeds
>>> pipe = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder")
>>> pipe.to("cuda")
>>> image = pipe(
...     image_embeds=image_emb,
...     negative_image_embeds=zero_image_emb,
...     height=768,
...     width=768,
...     num_inference_steps=50,
... ).images
>>> image[0].save("cat.png")

KandinskyV22CombinedPipeline

class diffusers.KandinskyV22CombinedPipeline

( unet: UNet2DConditionModel scheduler: DDPMScheduler movq: VQModel prior_prior: PriorTransformer prior_image_encoder: CLIPVisionModelWithProjection prior_text_encoder: CLIPTextModelWithProjection prior_tokenizer: CLIPTokenizer prior_scheduler: UnCLIPScheduler prior_image_processor: CLIPImageProcessor )

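Parameters

scheduler (Union[DDIMScheduler,DDPMScheduler]): A scheduler used in combination with unet to generate image latents.
unet (UNet2DConditionModel): Conditional U-Net architecture to denoise the image embedding.
movq (VQModel): MoVQ decoder used to generate the image from the latents.
prior_prior (PriorTransformer): The canonical unCLIP prior used to approximate the image embedding from the text embedding.
prior_image_encoder (CLIPVisionModelWithProjection): Frozen image encoder.
prior_text_encoder (CLIPTextModelWithProjection): Frozen text encoder.
prior_tokenizer (CLIPTokenizer): Tokenizer of class CLIPTokenizer.
prior_scheduler (UnCLIPScheduler): A scheduler used in combination with prior to generate the image embedding.
prior_image_processor (CLIPImageProcessor): An image processor used to preprocess images for CLIP.

Combined pipeline for text-to-image generation using Kandinsky.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).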

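__call__

( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_inference_steps: int = 100 guidance_scale: float = 4.0 num_images_per_prompt: int = 1 height: int = 512 width: int = 512 prior_guidance_scale: float = 4.0 prior_num_inference_steps: int = 25 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 return_dict: bool = True prior_callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None prior_callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] ) → ImagePipelineOutput or tuple

Parameters

prompt (str or List[str]): The prompt or prompts to guide the image generation.
negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
num_inference_steps (int, optional, defaults to 100): The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
height (int, optional, defaults to 512): The height in pixels of the generated image.
width (int, optional, defaults to 512): The width in pixels of the generated image.
prior_guidance_scale (float, optional, defaults to 4.0): Guidance scale for the prior, as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
prior_num_inference_steps (int, optional, defaults to 25): The number of denoising steps for the prior. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 4.0): Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
generator (torch.Generator or List[torch.Generator], optional): One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional): Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array), or "pt" (torch.Tensor).
return_dict (bool, optional, defaults to True): Whether or not to return an ImagePipelineOutput instead of a plain tuple.
prior_callback_on_step_end (Callable, optional): A function called at the end of each denoising step during inference of the prior pipeline. It is called with the following arguments: prior_callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict).
prior_callback_on_step_end_tensor_inputs (List, optional): The list of tensor inputs for the prior_callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your prior pipeline class.
callback_on_step_end (Callable, optional): A function called at the end of each denoising step during inference of the decoder pipeline. It is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional): The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

Returns

ImagePipelineOutput or tuple

The function invoked when calling the pipeline for generation.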

Example

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k"

image = pipe(prompt=prompt, num_inference_steps=25).images[0]

enable_sequential_cpu_offload

( gpu_id: typing.Optional[int] = None device: typing.Union[torch.device, str] = None )

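Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, the state dicts of unet, text_encoder, vae, and the safety checker are saved to CPU and then moved to torch.device('meta'); they are loaded onto the GPU only when the forward method of their specific submodule is called. Note that offloading happens on a submodule basis. Memory savings are higher than with enable_model_cpu_offload, but performance is lower.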

KandinskyV22ControlnetPipeline

class diffusers.KandinskyV22ControlnetPipeline

( unet: UNet2DConditionModel scheduler: DDPMScheduler movq: VQModel )

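Parameters

scheduler (DDIMScheduler): A scheduler used in combination with unet to generate image latents.
unet (UNet2DConditionModel): Conditional U-Net architecture to denoise the image embedding.
movq (VQModel): MoVQ decoder used to generate the image from the latents.

Pipeline for text-to-image generation using Kandinsky.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).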

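__call__

( image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] negative_image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] hint: Tensor height: int = 512 width: int = 512 num_inference_steps: int = 100 guidance_scale: float = 4.0 num_images_per_prompt: int = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 return_dict: bool = True ) → ImagePipelineOutput or tuple

Parameters

prompt (str or List[str]): The prompt or prompts to guide the image generation.
hint (torch.Tensor): The ControlNet condition.
image_embeds (torch.Tensor or List[torch.Tensor]): The CLIP image embeddings for the text prompt, used to condition the image generation.
negative_image_embeds (torch.Tensor or List[torch.Tensor]): The CLIP image embeddings for the negative text prompt, used to condition the image generation.
negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
height (int, optional, defaults to 512): The height in pixels of the generated image.
width (int, optional, defaults to 512): The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 100): The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 4.0): Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional): One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional): Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array), or "pt" (torch.Tensor).
callback (Callable, optional): A function that is called every callback_steps steps during inference. It is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1): The frequency at which the callback function is called. If not specified, the callback is called at every step.
return_dict (bool, optional, defaults to True): Whether or not to return an ImagePipelineOutput instead of a plain tuple.

Returns

ImagePipelineOutput or tuple

The function invoked when calling the pipeline for generation.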
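
The callback and callback_steps parameters describe a simpler hook that fires once every callback_steps steps. Below is a minimal sketch of that cadence, assuming the callback is invoked whenever the step index is a multiple of callback_steps. The function run_with_callback is a hypothetical stand-in for the denoising loop, not the pipeline's code.

```python
from typing import Callable, List, Optional, Tuple


def run_with_callback(
    timesteps: List[int],
    callback: Optional[Callable[[int, int, List[float]], None]] = None,
    callback_steps: int = 1,
) -> None:
    """Hypothetical loop that mimics when a step callback would fire."""
    latents = [0.0]  # placeholder for the latents tensor
    for step, timestep in enumerate(timesteps):
        if callback is not None and step % callback_steps == 0:
            callback(step, timestep, latents)


seen: List[Tuple[int, int]] = []
run_with_callback(
    timesteps=[900, 700, 500, 300, 100],
    callback=lambda step, timestep, latents: seen.append((step, timestep)),
    callback_steps=2,
)
```

With callback_steps=2 the callback runs on every second step of the toy schedule; the default of 1 would call it on every step.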


KandinskyV22PriorEmb2EmbPipeline

class diffusers.KandinskyV22PriorEmb2EmbPipeline

( prior: PriorTransformer image_encoder: CLIPVisionModelWithProjection text_encoder: CLIPTextModelWithProjection tokenizer: CLIPTokenizer scheduler: UnCLIPScheduler image_processor: CLIPImageProcessor )

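Parameters

prior (PriorTransformer): The canonical unCLIP prior used to approximate the image embedding from the text embedding.
image_encoder (CLIPVisionModelWithProjection): Frozen image encoder.
text_encoder (CLIPTextModelWithProjection): Frozen text encoder.
tokenizer (CLIPTokenizer): Tokenizer of class CLIPTokenizer.
scheduler (UnCLIPScheduler): A scheduler used in combination with prior to generate the image embedding.

Pipeline for generating the image prior for Kandinsky.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).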

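__call__

( prompt: typing.Union[str, typing.List[str]] image: typing.Union[torch.Tensor, typing.List[torch.Tensor], PIL.Image.Image, typing.List[PIL.Image.Image]] strength: float = 0.3 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: int = 1 num_inference_steps: int = 25 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None guidance_scale: float = 4.0 output_type: typing.Optional[str] = 'pt' return_dict: bool = True ) → KandinskyPriorPipelineOutput or tuple

Parameters

prompt (str or List[str]): The prompt or prompts to guide the image generation.
strength (float, optional, defaults to 0.3): Conceptually, indicates how much to transform the reference emb. Must be between 0 and 1. image is used as a starting point, and more noise is added the larger the strength. The number of denoising steps depends on the amount of noise initially added.
emb (torch.Tensor): The image embedding.
negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
num_inference_steps (int, optional, defaults to 25): The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
generator (torch.Generator or List[torch.Generator], optional): One or a list of torch generator(s) to make generation deterministic.
guidance_scale (float, optional, defaults to 4.0): Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
output_type (str, optional, defaults to "pt"): The output format of the generated image. Choose between "np" (np.array) or "pt" (torch.Tensor).
return_dict (bool, optional, defaults to True): Whether or not to return a KandinskyPriorPipelineOutput instead of a plain tuple.

Returns

KandinskyPriorPipelineOutput or tuple

The function invoked when calling the pipeline for generation.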

Example

>>> from diffusers import KandinskyV22Pipeline, KandinskyV22PriorEmb2EmbPipeline
>>> from diffusers.utils import load_image
>>> import torch

>>> pipe_prior = KandinskyV22PriorEmb2EmbPipeline.from_pretrained(
...     "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
... )
>>> pipe_prior.to("cuda")

>>> prompt = "red cat, 4k photo"
>>> img = load_image(
...     "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
...     "/kandinsky/cat.png"
... )
>>> image_emb, negative_image_emb = pipe_prior(prompt, image=img, strength=0.2).to_tuple()

>>> pipe = KandinskyV22Pipeline.from_pretrained(
...     "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
... )
>>> pipe.to("cuda")

>>> image = pipe(
...     image_embeds=image_emb,
...     negative_image_embeds=negative_image_emb,
...     height=768,
...     width=768,
...     num_inference_steps=100,
... ).images

>>> image[0].save("cat.png")

interpolate

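( images_and_prompts: typing.List[typing.Union[str, PIL.Image.Image, torch.Tensor]] weights: typing.List[float] num_images_per_prompt: int = 1 num_inference_steps: int = 25 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None negative_prior_prompt: typing.Optional[str] = None negative_prompt: str = '' guidance_scale: float = 4.0 device = None ) → KandinskyPriorPipelineOutput or tuple

Parameters

images_and_prompts (List[Union[str, PIL.Image.Image, torch.Tensor]]): A list of prompts and images to guide the image generation.
weights (List[float]): A list of weights, one for each condition in images_and_prompts.
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
num_inference_steps (int, optional, defaults to 25): The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
generator (torch.Generator or List[torch.Generator], optional): One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional): Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
negative_prior_prompt (str, optional): The prompt not to guide the prior diffusion process. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
guidance_scale (float, optional, defaults to 4.0): Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.

Returns

KandinskyPriorPipelineOutput or tuple

The function invoked when using the prior pipeline for interpolation.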

Example

>>> from diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22Pipeline
>>> from diffusers.utils import load_image
>>> import torch

>>> pipe_prior = KandinskyV22PriorEmb2EmbPipeline.from_pretrained(
...     "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16
... )
>>> pipe_prior.to("cuda")

>>> img1 = load_image(
...     "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
...     "/kandinsky/cat.png"
... )

>>> img2 = load_image(
...     "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main"
...     "/kandinsky/starry_night.jpeg"
... )

>>> images_texts = ["a cat", img1, img2]
>>> weights = [0.3, 0.3, 0.4]
>>> image_emb, zero_image_emb = pipe_prior.interpolate(images_texts, weights).to_tuple()

>>> pipe = KandinskyV22Pipeline.from_pretrained(
...     "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
... )
>>> pipe.to("cuda")

>>> image = pipe(
...     image_embeds=image_emb,
...     negative_image_embeds=zero_image_emb,
...     height=768,
...     width=768,
...     num_inference_steps=150,
... ).images[0]

>>> image.save("starry_cat.png")

KandinskyV22Img2ImgPipeline

class diffusers.KandinskyV22Img2ImgPipeline

( unet: UNet2DConditionModel scheduler: DDPMScheduler movq: VQModel )

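Parameters

scheduler (DDIMScheduler): A scheduler used in combination with unet to generate image latents.
unet (UNet2DConditionModel): Conditional U-Net architecture to denoise the image embedding.
movq (VQModel): MoVQ decoder used to generate the image from the latents.

Pipeline for image-to-image generation using Kandinsky.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).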

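__call__

( image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] image: typing.Union[torch.Tensor, PIL.Image.Image, typing.List[torch.Tensor], typing.List[PIL.Image.Image]] negative_image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] height: int = 512 width: int = 512 num_inference_steps: int = 100 guidance_scale: float = 4.0 strength: float = 0.3 num_images_per_prompt: int = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → ImagePipelineOutput or tuple

Parameters

image_embeds (torch.Tensor or List[torch.Tensor]): The CLIP image embeddings for the text prompt, used to condition the image generation.
image (torch.Tensor, PIL.Image.Image, np.ndarray, List[torch.Tensor], List[PIL.Image.Image], or List[np.ndarray]): Image, or tensor representing an image batch, that will be used as the starting point for the process. Image latents can also be passed as image; latents passed directly are not encoded again.
strength (float, optional, defaults to 0.3): Conceptually, indicates how much to transform the reference image. Must be between 0 and 1. image is used as a starting point, and more noise is added the larger the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, the added noise is maximal and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 therefore essentially ignores image.
negative_image_embeds (torch.Tensor or List[torch.Tensor]): The CLIP image embeddings for the negative text prompt, used to condition the image generation.
height (int, optional, defaults to 512): The height in pixels of the generated image.
width (int, optional, defaults to 512): The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 100): The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 4.0): Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional): One or a list of torch generator(s) to make generation deterministic.
output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array), or "pt" (torch.Tensor).
return_dict (bool, optional, defaults to True): Whether or not to return an ImagePipelineOutput instead of a plain tuple.
callback_on_step_end (Callable, optional): A function called at the end of each denoising step during inference. It is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional): The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

Returns

ImagePipelineOutput or tuple

The function invoked when calling the pipeline for generation.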
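
The interaction between strength and num_inference_steps can be made concrete with a little arithmetic. This sketch assumes the scheme commonly used by image-to-image pipelines, in which roughly num_inference_steps * strength denoising steps are actually executed. The helper effective_steps is hypothetical, and the exact rounding inside the library may differ.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Estimate how many denoising steps run for a given strength (illustrative only)."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Noise is added up to this many steps "into" the schedule ...
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    # ... and denoising starts from there, skipping the earlier steps.
    t_start = max(num_inference_steps - init_timestep, 0)
    return num_inference_steps - t_start


steps_default = effective_steps(100, 0.3)  # the defaults shown in the signature
steps_full = effective_steps(100, 1.0)     # strength 1 runs every step
```

Under these assumptions, a low strength keeps the output close to the input image and is also cheaper, because fewer denoising steps are executed.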


KandinskyV22Img2ImgCombinedPipeline

class diffusers.KandinskyV22Img2ImgCombinedPipeline

( unet: UNet2DConditionModel scheduler: DDPMScheduler movq: VQModel prior_prior: PriorTransformer prior_image_encoder: CLIPVisionModelWithProjection prior_text_encoder: CLIPTextModelWithProjection prior_tokenizer: CLIPTokenizer prior_scheduler: UnCLIPScheduler prior_image_processor: CLIPImageProcessor )

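Parameters

scheduler (Union[DDIMScheduler,DDPMScheduler]): A scheduler used in combination with unet to generate image latents.
unet (UNet2DConditionModel): Conditional U-Net architecture to denoise the image embedding.
movq (VQModel): MoVQ decoder used to generate the image from the latents.
prior_prior (PriorTransformer): The canonical unCLIP prior used to approximate the image embedding from the text embedding.
prior_image_encoder (CLIPVisionModelWithProjection): Frozen image encoder.
prior_text_encoder (CLIPTextModelWithProjection): Frozen text encoder.
prior_tokenizer (CLIPTokenizer): Tokenizer of class CLIPTokenizer.
prior_scheduler (UnCLIPScheduler): A scheduler used in combination with prior to generate the image embedding.
prior_image_processor (CLIPImageProcessor): An image processor used to preprocess images for CLIP.

Combined pipeline for image-to-image generation using Kandinsky.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).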

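__call__

( prompt: typing.Union[str, typing.List[str]] image: typing.Union[torch.Tensor, PIL.Image.Image, typing.List[torch.Tensor], typing.List[PIL.Image.Image]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_inference_steps: int = 100 guidance_scale: float = 4.0 strength: float = 0.3 num_images_per_prompt: int = 1 height: int = 512 width: int = 512 prior_guidance_scale: float = 4.0 prior_num_inference_steps: int = 25 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 return_dict: bool = True prior_callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None prior_callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] ) → ImagePipelineOutput or tuple

Parameters

prompt (str or List[str]): The prompt or prompts to guide the image generation.
image (torch.Tensor, PIL.Image.Image, np.ndarray, List[torch.Tensor], List[PIL.Image.Image], or List[np.ndarray]): Image, or tensor representing an image batch, that will be used as the starting point for the process. Latents passed directly as image are not encoded again.
negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
guidance_scale (float, optional, defaults to 4.0): Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
strength (float, optional, defaults to 0.3): Conceptually, indicates how much to transform the reference image. Must be between 0 and 1. image is used as a starting point, and more noise is added the larger the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, the added noise is maximal and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 therefore essentially ignores image.
num_inference_steps (int, optional, defaults to 100): The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
height (int, optional, defaults to 512): The height in pixels of the generated image.
width (int, optional, defaults to 512): The width in pixels of the generated image.
prior_guidance_scale (float, optional, defaults to 4.0): Guidance scale for the prior, as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images closely linked to the text prompt, usually at the expense of lower image quality.
prior_num_inference_steps (int, optional, defaults to 25): The number of denoising steps for the prior. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
generator (torch.Generator or List[torch.Generator], optional): One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional): Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array), or "pt" (torch.Tensor).
callback (Callable, optional): A function that is called every callback_steps steps during inference. It is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1): The frequency at which the callback function is called. If not specified, the callback is called at every step.
return_dict (bool, optional, defaults to True): Whether or not to return an ImagePipelineOutput instead of a plain tuple.

Returns

ImagePipelineOutput or tuple

The function invoked when calling the pipeline for generation.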

Example

from diffusers import AutoPipelineForImage2Image
import torch
import requests
from io import BytesIO
from PIL import Image
import os

pipe = AutoPipelineForImage2Image.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"

response = requests.get(url)
original_image = Image.open(BytesIO(response.content)).convert("RGB")
original_image.thumbnail((768, 768))

image = pipe(prompt=prompt, image=original_image, num_inference_steps=25).images[0]

enable_model_cpu_offload

( gpu_id: typing.Optional[int] = None device: typing.Union[torch.device, str] = None )

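Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared to enable_sequential_cpu_offload, this method moves one whole model at a time to the GPU when its forward method is called, and the model remains on the GPU until the next model runs. Memory savings are lower than with enable_sequential_cpu_offload, but performance is much better due to the iterative execution of the unet.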

enable_sequential_cpu_offload

( gpu_id: typing.Optional[int] = None device: typing.Union[torch.device, str] = None )

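Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, the state dicts of unet, text_encoder, vae, and the safety checker are saved to CPU and then moved to torch.device('meta'); they are loaded onto the GPU only when the forward method of their specific submodule is called. Note that offloading happens on a submodule basis. Memory savings are higher than with enable_model_cpu_offload, but performance is lower.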

KandinskyV22ControlnetImg2ImgPipeline

class diffusers.KandinskyV22ControlnetImg2ImgPipeline

( unet: UNet2DConditionModel scheduler: DDPMScheduler movq: VQModel )

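Parameters

scheduler (DDIMScheduler): A scheduler used in combination with unet to generate image latents.
unet (UNet2DConditionModel): Conditional U-Net architecture to denoise the image embedding.
movq (VQModel): MoVQ decoder used to generate the image from the latents.

Pipeline for image-to-image generation using Kandinsky.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).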

call

( image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] image: typing.Union[torch.Tensor, PIL.Image.Image, typing.List[torch.Tensor], typing.List[PIL.Image.Image]] negative_image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] hint: Tensor height: int = 512 width: int = 512 num_inference_steps: int = 100 guidance_scale: float = 4.0 strength: float = 0.3 num_images_per_prompt: int = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None output_type: typing.Optional[str] = 'pil' callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 return_dict: bool = True ) → ImagePipelineOutput 或 tuple

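Parameters

image_embeds (torch.Tensor or List[torch.Tensor]) — The CLIP image embeddings for the text prompt, used to condition the image generation.
image (torch.Tensor, PIL.Image.Image, np.ndarray, List[torch.Tensor], List[PIL.Image.Image], or List[np.ndarray]) — Image, or tensor representing an image batch, that will be used as the starting point for the process. If latents are passed directly as image, they are not encoded again.
strength (float, optional, defaults to 0.3) — Conceptually, indicates how much to transform the reference image. Must be between 0 and 1. image is used as the starting point, and the larger the strength, the more noise is added to it. The number of denoising steps depends on the amount of noise initially added. When strength is 1, the added noise is maximal and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 therefore essentially ignores image.
hint (torch.Tensor) — The controlnet condition.
negative_image_embeds (torch.Tensor or List[torch.Tensor]) — The CLIP image embeddings for the negative text prompt, used to condition the image generation.
height (int, optional, defaults to 512) — The height in pixels of the generated image.
width (int, optional, defaults to 512) — The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 100) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages generating images that are closely linked to the text prompt, usually at the expense of lower image quality.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array), or "pt" (torch.Tensor).
callback (Callable, optional) — A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1) — The frequency at which the callback function is called. If not specified, the callback is called at every step.
return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.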
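
The relationship between strength and the number of denoising steps can be sketched as follows. This assumes the pipeline follows the truncation rule commonly used by diffusers image-to-image pipelines, which keeps only the last int(num_inference_steps * strength) steps of the schedule. effective_steps is a hypothetical helper and is not part of the library.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Number of denoising steps actually run for a given strength.

    Assumption: the schedule is truncated to its last
    int(num_inference_steps * strength) steps.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError(f"strength must be in [0, 1], got {strength}")
    return min(int(num_inference_steps * strength), num_inference_steps)


# Compare a few strengths at the signature's default of 100 steps.
for strength in (0.3, 0.5, 1.0):
    print(strength, effective_steps(100, strength))
```

Under this assumption, the defaults in the call signature (num_inference_steps=100, strength=0.3) run 30 denoising steps, and strength=1.0 runs all 100.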

Returns

ImagePipelineOutput or tuple

Function invoked when calling the pipeline for generation.

Example
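
As a minimal sketch of how the legacy callback and callback_steps arguments interact, the stand-in loop below invokes a callback every callback_steps steps. The loop (fake_denoising_loop), its toy timesteps and its float latents are hypothetical placeholders for the pipeline's internal denoising loop and its torch.Tensor latents. Only the callback signature callback(step, timestep, latents) comes from the parameter description.

```python
from typing import Callable, List, Optional, Tuple


def fake_denoising_loop(
    num_inference_steps: int,
    callback: Optional[Callable[[int, int, float], None]] = None,
    callback_steps: int = 1,
) -> None:
    """Hypothetical stand-in for a pipeline's denoising loop."""
    latents = 1.0  # placeholder for a latent tensor
    for step in range(num_inference_steps):
        timestep = num_inference_steps - 1 - step  # toy countdown timestep
        latents *= 0.5  # placeholder for one denoising update
        # Fire the callback only on every callback_steps-th step.
        if callback is not None and step % callback_steps == 0:
            callback(step, timestep, latents)


seen: List[Tuple[int, int]] = []


def log_progress(step: int, timestep: int, latents: float) -> None:
    """Record which steps triggered the callback."""
    seen.append((step, timestep))


fake_denoising_loop(num_inference_steps=10, callback=log_progress, callback_steps=3)
print(seen)
```

With callback_steps=3 and 10 steps, the callback fires on steps 0, 3, 6 and 9 in this sketch. With the default callback_steps=1 it fires on every step.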

KandinskyV22InpaintPipeline

class diffusers.KandinskyV22InpaintPipeline

( unet: UNet2DConditionModel scheduler: DDPMScheduler movq: VQModel )

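Parameters

scheduler (Union[DDIMScheduler, DDPMScheduler]) — A scheduler to be used in combination with unet to generate image latents.
unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the image embedding.
movq (VQModel) — MoVQ decoder to generate the image from the latents.

Pipeline for text-guided image inpainting using Kandinsky 2.2

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).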

call

( image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] image: typing.Union[torch.Tensor, PIL.Image.Image] mask_image: typing.Union[torch.Tensor, PIL.Image.Image, numpy.ndarray] negative_image_embeds: typing.Union[torch.Tensor, typing.List[torch.Tensor]] height: int = 512 width: int = 512 num_inference_steps: int = 100 guidance_scale: float = 4.0 num_images_per_prompt: int = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → ImagePipelineOutput or tuple

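Parameters

image_embeds (torch.Tensor or List[torch.Tensor]) — The CLIP image embeddings for the text prompt, used to condition the image generation.
image (torch.Tensor or PIL.Image.Image) — Image, or tensor representing an image batch, to be inpainted: parts of the image are masked out with mask_image and repainted according to the prompt (supplied here through image_embeds).
mask_image (torch.Tensor, PIL.Image.Image, or np.ndarray) — Tensor representing an image batch used to mask image. White pixels in the mask are repainted, while black pixels are preserved. If mask_image is a PIL image, it is converted to a single channel (luminance) before use. If it is a tensor, it should contain one color channel (L) instead of 3, so the expected shape is (B, H, W, 1).
negative_image_embeds (torch.Tensor or List[torch.Tensor]) — The CLIP image embeddings for the negative text prompt, used to condition the image generation.
height (int, optional, defaults to 512) — The height in pixels of the generated image.
width (int, optional, defaults to 512) — The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 100) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages generating images that are closely linked to the text prompt, usually at the expense of lower image quality.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array), or "pt" (torch.Tensor).
return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list are passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.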
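
The effect of guidance_scale can be sketched with the usual classifier-free guidance combination, in which the final noise prediction extrapolates from the unconditional prediction toward the conditional one. The function below is an illustrative sketch on plain Python lists and is not the pipeline's code: apply_guidance is a hypothetical name, the numbers are made up, and real pipelines operate on torch.Tensor batches.

```python
from typing import List


def apply_guidance(
    noise_uncond: List[float],
    noise_cond: List[float],
    guidance_scale: float,
) -> List[float]:
    """Combine unconditional and conditional noise predictions.

    Uses the standard classifier-free guidance form:
        uncond + guidance_scale * (cond - uncond)
    """
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


uncond = [0.0, 1.0, 2.0]
cond = [1.0, 1.0, 0.0]

# guidance_scale = 1 reproduces the conditional prediction (no extra guidance).
print(apply_guidance(uncond, cond, 1.0))
# The default of 4.0 pushes further in the conditional direction.
print(apply_guidance(uncond, cond, 4.0))
```

Elements where the two predictions agree are unchanged at any scale. Elements where they differ are pushed further from the unconditional value as guidance_scale grows, which is why large values trade image quality for prompt adherence.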

Returns

ImagePipelineOutput or tuple

Function invoked when calling the pipeline for generation.

Example

KandinskyV22InpaintCombinedPipeline

class diffusers.KandinskyV22InpaintCombinedPipeline

( unet: UNet2DConditionModel scheduler: DDPMScheduler movq: VQModel prior_prior: PriorTransformer prior_image_encoder: CLIPVisionModelWithProjection prior_text_encoder: CLIPTextModelWithProjection prior_tokenizer: CLIPTokenizer prior_scheduler: UnCLIPScheduler prior_image_processor: CLIPImageProcessor )

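Parameters

scheduler (Union[DDIMScheduler, DDPMScheduler]) — A scheduler to be used in combination with unet to generate image latents.
unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the image embedding.
movq (VQModel) — MoVQ decoder to generate the image from the latents.
prior_prior (PriorTransformer) — The canonical unCLIP prior to approximate the image embedding from the text embedding.
prior_image_encoder (CLIPVisionModelWithProjection) — Frozen image encoder.
prior_text_encoder (CLIPTextModelWithProjection) — Frozen text encoder.
prior_tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
prior_scheduler (UnCLIPScheduler) — A scheduler to be used in combination with prior to generate the image embedding.
prior_image_processor (CLIPImageProcessor) — An image processor used to preprocess images for CLIP.

Combined pipeline for inpainting generation using Kandinsky

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).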

call

( prompt: typing.Union[str, typing.List[str]] image: typing.Union[torch.Tensor, PIL.Image.Image, typing.List[torch.Tensor], typing.List[PIL.Image.Image]] mask_image: typing.Union[torch.Tensor, PIL.Image.Image, typing.List[torch.Tensor], typing.List[PIL.Image.Image]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_inference_steps: int = 100 guidance_scale: float = 4.0 num_images_per_prompt: int = 1 height: int = 512 width: int = 512 prior_guidance_scale: float = 4.0 prior_num_inference_steps: int = 25 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True prior_callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None prior_callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → ImagePipelineOutput or tuple

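Parameters

prompt (str or List[str]) — The prompt or prompts to guide the image generation.
image (torch.Tensor, PIL.Image.Image, np.ndarray, List[torch.Tensor], List[PIL.Image.Image], or List[np.ndarray]) — Image, or tensor representing an image batch, that will be used as the starting point for the process. Image latents can also be passed as image; if latents are passed directly, they are not encoded again.
mask_image (np.array) — Tensor representing an image batch used to mask image. White pixels in the mask are repainted, while black pixels are preserved. If mask_image is a PIL image, it is converted to a single channel (luminance) before use. If it is a tensor, it should contain one color channel (L) instead of 3, so the expected shape is (B, H, W, 1).
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
guidance_scale (float, optional, defaults to 4.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages generating images that are closely linked to the text prompt, usually at the expense of lower image quality.
num_inference_steps (int, optional, defaults to 100) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
height (int, optional, defaults to 512) — The height in pixels of the generated image.
width (int, optional, defaults to 512) — The width in pixels of the generated image.
prior_guidance_scale (float, optional, defaults to 4.0) — Guidance scale for the prior, as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages generating images that are closely linked to the text prompt, usually at the expense of lower image quality.
prior_num_inference_steps (int, optional, defaults to 25) — The number of denoising steps for the prior. More denoising steps usually lead to a higher quality image at the expense of slower inference.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between "pil" (PIL.Image.Image), "np" (np.array), or "pt" (torch.Tensor).
return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.
prior_callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: prior_callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict).
prior_callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the prior_callback_on_step_end function. The tensors specified in the list are passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list are passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.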
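
A callback matching the callback_on_step_end signature can be sketched as below. FakePipeline and its run method are hypothetical stand-ins for a real pipeline and its denoising loop, and the float latents stand in for a torch.Tensor. Returning callback_kwargs from the callback, and the loop reading values back from it, is an assumption based on common diffusers usage rather than something stated in the parameter description.

```python
from typing import Any, Callable, Dict, List


class FakePipeline:
    """Hypothetical stand-in for a pipeline; not a diffusers class."""

    _callback_tensor_inputs = ["latents"]

    def run(
        self,
        num_inference_steps: int,
        callback_on_step_end: Callable[..., Dict[str, Any]],
        callback_on_step_end_tensor_inputs: List[str],
    ) -> float:
        # Only names listed in _callback_tensor_inputs may be requested.
        for name in callback_on_step_end_tensor_inputs:
            if name not in self._callback_tensor_inputs:
                raise ValueError(f"{name!r} is not an allowed callback input")

        latents = 1.0  # placeholder for a latent tensor
        for step in range(num_inference_steps):
            timestep = num_inference_steps - 1 - step  # toy countdown timestep
            latents *= 0.5  # placeholder for one denoising update
            local_values = {"latents": latents}
            callback_kwargs = {k: local_values[k] for k in callback_on_step_end_tensor_inputs}
            outputs = callback_on_step_end(self, step, timestep, callback_kwargs)
            # Assumption: the callback may hand back modified values.
            latents = outputs.pop("latents", latents)
        return latents


history: List[int] = []


def record_step(
    pipe: FakePipeline, step: int, timestep: int, callback_kwargs: Dict[str, Any]
) -> Dict[str, Any]:
    """Log the step index and hand the tensors back unchanged."""
    history.append(step)
    return callback_kwargs


final_latents = FakePipeline().run(4, record_step, ["latents"])
print(history, final_latents)
```

In the combined pipeline, prior_callback_on_step_end takes the same arguments, so a function like record_step could in principle be passed for either stage.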

Returns

ImagePipelineOutput or tuple

Function invoked when calling the pipeline for generation.

Example

from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image
import torch
import numpy as np

pipe = AutoPipelineForInpainting.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()

prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"

original_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png"
)

# Masks use 1 (white) for pixels to repaint and 0 (black) for pixels to keep.
mask = np.zeros((768, 768), dtype=np.float32)
# Let's mask out an area above the cat's head
mask[:250, 250:-250] = 1

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    image=original_image,
    mask_image=mask,
    num_inference_steps=25,
).images[0]

enable_sequential_cpu_offload

( gpu_id: typing.Optional[int] = None device: typing.Union[torch.device, str] = None )

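Offloads all models to CPU using accelerate, significantly reducing memory usage. When called, the state dicts of unet, text_encoder, vae and safety checker are saved to CPU and the modules are then moved to torch.device('meta'); each is loaded onto the GPU only when the forward method of its specific submodule is called. Note that offloading happens on a submodule basis. Memory savings are higher than with enable_model_cpu_offload, but performance is lower.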

