Diffusers


Semantic Guidance

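Semantic Guidance for Diffusion Models was proposed in SEGA: Instructing Text-to-Image Models using Semantic Guidance and provides strong semantic control over image generation. Small changes to the text prompt usually result in entirely different output images. With SEGA, however, a variety of changes to the image can be controlled easily and intuitively while staying true to the original image composition.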

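The abstract from the paper is:

Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) generalizes to any generative architecture using classifier-free guidance. More importantly, it allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on both latent and pixel-based diffusion models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of tasks, thus providing strong evidence for its versatility, flexibility, and improvements over existing methods.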
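To make the idea of steering classifier-free guidance along semantic directions more concrete, here is a minimal numeric sketch. It assumes the noise estimates are plain lists of floats and that each edit concept adds a scaled direction (its estimate minus the unconditional estimate) on top of the usual classifier-free guidance term. The function name `cfg_with_semantic_guidance` and its arguments are hypothetical and chosen for illustration only; the thresholding, warmup, and momentum steps described later on this page are deliberately left out.

```python
from typing import List, Sequence


def cfg_with_semantic_guidance(
    eps_uncond: Sequence[float],
    eps_text: Sequence[float],
    eps_edits: Sequence[Sequence[float]],
    guidance_scale: float,
    edit_scales: Sequence[float],
    reverse: Sequence[bool],
) -> List[float]:
    """Combine classifier-free guidance with per-concept semantic directions.

    Illustrative assumption: every input is a flat list of floats standing in
    for a noise-estimate tensor of the same shape.
    """
    combined = []
    for i, (uncond, text) in enumerate(zip(eps_uncond, eps_text)):
        # Standard classifier-free guidance term.
        value = uncond + guidance_scale * (text - uncond)
        for eps_edit, scale, is_reversed in zip(eps_edits, edit_scales, reverse):
            # Direction pointing from "no concept" towards the edit concept.
            direction = eps_edit[i] - uncond
            if is_reversed:
                # Reversed direction pushes the image away from the concept.
                direction = -direction
            value += scale * direction
        combined.append(value)
    return combined
```

With a single concept, flipping its `reverse` entry changes only the sign of the semantic term, which is why the same concept description can be used either to add or to remove an attribute.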

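Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.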

SemanticStableDiffusionPipeline

class diffusers.SemanticStableDiffusionPipeline

< source >

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor requires_safety_checker: bool = True )

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 guidance_scale: float = 7.5 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: int = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 editing_prompt: typing.Union[str, typing.List[str], NoneType] = None editing_prompt_embeddings: typing.Optional[torch.Tensor] = None reverse_editing_direction: typing.Union[bool, typing.List[bool], NoneType] = False edit_guidance_scale: typing.Union[float, typing.List[float], NoneType] = 5 edit_warmup_steps: typing.Union[int, typing.List[int], NoneType] = 10 edit_cooldown_steps: typing.Union[int, typing.List[int], NoneType] = None edit_threshold: typing.Union[float, typing.List[float], NoneType] = 0.9 edit_momentum_scale: typing.Optional[float] = 0.1 edit_mom_beta: typing.Optional[float] = 0.4 edit_weights: typing.Optional[typing.List[float]] = None sem_guidance: typing.Optional[typing.List[torch.Tensor]] = None ) → ~pipelines.semantic_stable_diffusion.SemanticStableDiffusionPipelineOutput or tuple

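Parameters

prompt (str or List[str]): The prompt or prompts to guide image generation.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor): The height in pixels of the generated image.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor): The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 7.5): A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
negative_prompt (str or List[str], optional): The prompt or prompts to guide what to not include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
eta (float, optional, defaults to 0.0): Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional): A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional): Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True): Whether or not to return a SemanticStableDiffusionPipelineOutput instead of a plain tuple.
callback (Callable, optional): A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1): The frequency at which the callback function is called. If not specified, the callback is called at every step.
editing_prompt (str or List[str], optional): The prompt or prompts to use for semantic guidance. Semantic guidance is disabled by setting editing_prompt = None. The guidance direction of each prompt should be specified via reverse_editing_direction.
editing_prompt_embeddings (torch.Tensor, optional): Pre-computed embeddings to use for semantic guidance. The guidance direction of each embedding should be specified via reverse_editing_direction.
reverse_editing_direction (bool or List[bool], optional, defaults to False): Whether the corresponding prompt in editing_prompt should be increased (False) or decreased (True).
edit_guidance_scale (float or List[float], optional, defaults to 5): Guidance scale for semantic guidance. If provided as a list, the values should correspond to editing_prompt.
edit_warmup_steps (int or List[int], optional, defaults to 10): Number of diffusion steps (for each prompt) for which semantic guidance is not applied. Momentum is calculated for those steps and applied once all warmup periods are over.
edit_cooldown_steps (int or List[int], optional, defaults to None): Number of diffusion steps (for each prompt) after which semantic guidance is no longer applied.
edit_threshold (float or List[float], optional, defaults to 0.9): Threshold of semantic guidance.
edit_momentum_scale (float, optional, defaults to 0.1): Scale of the momentum to be added to the semantic guidance at each diffusion step. If set to 0.0, momentum is disabled. Momentum is already built up during warmup (for diffusion steps smaller than edit_warmup_steps). Momentum is only added to the latent guidance once all warmup periods are finished.
edit_mom_beta (float, optional, defaults to 0.4): Defines how semantic guidance momentum builds up. edit_mom_beta indicates how much of the previous momentum is kept. Momentum is already built up during warmup (for diffusion steps smaller than edit_warmup_steps).
edit_weights (List[float], optional, defaults to None): Indicates how much each individual concept should influence the overall guidance. If no weights are provided, all concepts are applied equally.
sem_guidance (List[torch.Tensor], optional): List of pre-generated guidance vectors to be applied at generation. The length of the list has to correspond to num_inference_steps.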
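Several of the editing parameters above accept either a single value or one value per editing prompt, and the warmup, cooldown, and momentum settings interact over the course of the diffusion steps. The sketch below illustrates one plausible reading of those descriptions using plain Python numbers. The helpers `per_concept`, `is_edit_active`, and `momentum_step` are hypothetical names introduced here, and the exact update order inside the library is an assumption rather than something this page confirms.

```python
from typing import Optional, Sequence, Tuple, Union

Number = Union[int, float]


def per_concept(value: Union[Number, Sequence[Number]], index: int) -> Number:
    """Resolve a setting for concept `index`.

    A list or tuple is read per concept; a scalar is shared by all concepts.
    """
    if isinstance(value, (list, tuple)):
        return value[index]
    return value


def is_edit_active(step: int, warmup: int, cooldown: Optional[int]) -> bool:
    """Assumed rule: guidance applies once `step` reaches `warmup`
    and stops once `step` reaches `cooldown` (when a cooldown is set)."""
    if step < warmup:
        return False
    if cooldown is not None and step >= cooldown:
        return False
    return True


def momentum_step(
    guidance: float, momentum: float, scale: float, beta: float
) -> Tuple[float, float]:
    """Add scaled momentum to the guidance, then update the momentum.

    `beta` is the fraction of the previous momentum that is kept,
    mirroring the description of edit_mom_beta (assumed update order).
    """
    guidance = guidance + scale * momentum
    momentum = beta * momentum + (1 - beta) * guidance
    return guidance, momentum
```

For instance, `per_concept([4, 5, 5, 5.4], 3)` picks the scale for the fourth editing prompt, while `per_concept(5, 3)` returns the shared default. Setting the momentum scale to 0.0 in `momentum_step` leaves the guidance unchanged, which matches the note that momentum is then disabled.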

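~pipelines.semantic_stable_diffusion.SemanticStableDiffusionPipelineOutput or tuple

If return_dict is True, a ~pipelines.semantic_stable_diffusion.SemanticStableDiffusionPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content.

The call function to the pipeline for generation.

Examples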

>>> import torch
>>> from diffusers import SemanticStableDiffusionPipeline

>>> pipe = SemanticStableDiffusionPipeline.from_pretrained(
...     "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")

>>> out = pipe(
...     prompt="a photo of the face of a woman",
...     num_images_per_prompt=1,
...     guidance_scale=7,
...     editing_prompt=[
...         "smiling, smile",  # Concepts to apply
...         "glasses, wearing glasses",
...         "curls, wavy hair, curly hair",
...         "beard, full beard, mustache",
...     ],
...     reverse_editing_direction=[
...         False,
...         False,
...         False,
...         False,
...     ],  # Direction of guidance i.e. increase all concepts
...     edit_warmup_steps=[10, 10, 10, 10],  # Warmup period for each concept
...     edit_guidance_scale=[4, 5, 5, 5.4],  # Guidance scale for each concept
...     edit_threshold=[
...         0.99,
...         0.975,
...         0.925,
...         0.96,
...     ],  # Threshold for each concept. Threshold equals the percentile of the latent space that will be discarded. I.e. threshold=0.99 uses 1% of the latent dimensions
...     edit_momentum_scale=0.3,  # Momentum scale that will be added to the latent guidance
...     edit_mom_beta=0.6,  # Momentum beta
...     edit_weights=[1, 1, 1, 1],  # Weights of the individual concepts against each other (one per editing prompt)
... )
>>> image = out.images[0]
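The comment on edit_threshold in the example says that the threshold equals the percentile of the latent space that is discarded. A minimal sketch of that idea on a flat list of numbers follows. It assumes the cutoff is a linearly interpolated quantile of the absolute guidance values and that entries below the cutoff are zeroed. The helpers `quantile` and `apply_edit_threshold` are hypothetical and are not the library's implementation.

```python
from typing import List, Sequence


def quantile(values: Sequence[float], q: float) -> float:
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def apply_edit_threshold(guidance: Sequence[float], threshold: float) -> List[float]:
    """Keep only the entries whose magnitude reaches the `threshold` quantile.

    All other entries are set to 0.0, so a threshold of 0.99 leaves roughly
    the largest 1% of entries (by magnitude) in place.
    """
    cutoff = quantile([abs(g) for g in guidance], threshold)
    return [g if abs(g) >= cutoff else 0.0 for g in guidance]
```

A higher threshold therefore confines the edit to fewer latent dimensions, which is one way to read why the example uses values close to 1 for localized changes such as glasses or a smile.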

SemanticStableDiffusionPipelineOutput

class diffusers.pipelines.semantic_stable_diffusion.pipeline_output.SemanticStableDiffusionPipelineOutput

< source >

( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] nsfw_content_detected: typing.Optional[typing.List[bool]] )

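Parameters

images (List[PIL.Image.Image] or np.ndarray): List of denoised PIL images of length batch_size, or a NumPy array of shape (batch_size, height, width, num_channels).
nsfw_content_detected (List[bool]): List indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content, or None if safety checking could not be performed.

Output class for Stable Diffusion pipelines.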
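Because nsfw_content_detected may be None when no safety check was performed, code that consumes the output usually has to handle both cases. The helper below is a small sketch of that handling. `keep_unflagged` is a hypothetical name, and it works on any sequence of images paired with the optional flag list, so it is shown here without depending on the pipeline itself.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def keep_unflagged(
    images: Sequence[T], nsfw_content_detected: Optional[Sequence[bool]]
) -> List[T]:
    """Return the images that were not flagged by the safety checker.

    When no flags are available (None), every image is returned unchanged.
    """
    if nsfw_content_detected is None:
        return list(images)
    return [
        image
        for image, flagged in zip(images, nsfw_content_detected)
        if not flagged
    ]
```

With an output object `out` from the pipeline, this would be called as `keep_unflagged(out.images, out.nsfw_content_detected)`.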
