GLIGEN (Grounded Language-to-Image Generation)

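The GLIGEN model was created by researchers and engineers from the University of Wisconsin-Madison, Columbia University, and Microsoft. StableDiffusionGLIGENPipeline and StableDiffusionGLIGENTextImagePipeline can generate photorealistic images conditioned on grounding inputs. In addition to the text and bounding boxes used with StableDiffusionGLIGENPipeline, if an input image is given, StableDiffusionGLIGENTextImagePipeline can insert objects described by text at the regions defined by the bounding boxes. Otherwise, it generates an image described by the caption/prompt and inserts objects described by text at the regions defined by the bounding boxes. The model is trained on the COCO2014D and COCO2014CD datasets and uses a frozen CLIP ViT-L/14 text encoder to condition itself on grounding inputs.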

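The abstract from the paper is:

Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.

Make sure to check out the Stable Diffusion Tips section to learn how to explore the tradeoff between scheduler speed and quality and how to reuse pipeline components efficiently!

If you want to use one of the official checkpoints for a task, explore the gligen Hub organization!

StableDiffusionGLIGENPipeline was contributed by Nikhil Gajendrakumar, and StableDiffusionGLIGENTextImagePipeline was contributed by Nguyễn Công Tú Anh.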

StableDiffusionGLIGENPipeline

class diffusers.StableDiffusionGLIGENPipeline

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor requires_safety_checker: bool = True )

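Parameters

vae (AutoencoderKL): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel): Frozen text encoder (clip-vit-large-patch14).
tokenizer (CLIPTokenizer): A CLIPTokenizer to tokenize text.
unet (UNet2DConditionModel): A UNet2DConditionModel to denoise the encoded image latents.
scheduler (SchedulerMixin): A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
safety_checker (StableDiffusionSafetyChecker): Classification module that estimates whether generated images could be considered offensive or harmful. Refer to the model card for more details about a model's potential harms.
feature_extractor (CLIPImageProcessor): A CLIPImageProcessor to extract features from generated images; used as input to the safety_checker.

Pipeline for text-to-image generation using Stable Diffusion with Grounded-Language-to-Image Generation (GLIGEN).

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).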

__call__

( prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 guidance_scale: float = 7.5 gligen_scheduled_sampling_beta: float = 0.3 gligen_phrases: typing.List[str] = None gligen_boxes: typing.List[typing.List[float]] = None gligen_inpaint_image: typing.Optional[PIL.Image.Image] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None ) → StableDiffusionPipelineOutput or tuple

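Parameters

prompt (str or List[str], optional): The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor): The height in pixels of the generated image.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor): The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 7.5): A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
gligen_phrases (List[str]): The phrases to guide what to include in each of the regions defined by the corresponding gligen_boxes. There should be exactly one phrase per bounding box.
gligen_boxes (List[List[float]]): The bounding boxes that identify rectangular regions of the image to be filled with the content described by the corresponding gligen_phrases. Each box is a List[float] of 4 elements [xmin, ymin, xmax, ymax], where each value lies in [0, 1].
gligen_inpaint_image (PIL.Image.Image, optional): The input image. If provided, it is inpainted with the objects described by gligen_boxes and gligen_phrases. Otherwise, the call is treated as a generation task on a blank input image.
gligen_scheduled_sampling_beta (float, defaults to 0.3): Scheduled Sampling factor from GLIGEN: Open-Set Grounded Text-to-Image Generation. The Scheduled Sampling factor is only varied for scheduled sampling during inference, for improved quality and controllability.
negative_prompt (str or List[str], optional): The prompt or prompts to guide what to not include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
eta (float, optional, defaults to 0.0): Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional): A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional): Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True): Whether or not to return a StableDiffusionPipelineOutput instead of a plain tuple.
callback (Callable, optional): A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1): The frequency at which the callback function is called. If not specified, the callback is called at every step.
cross_attention_kwargs (dict, optional): A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
guidance_rescale (float, optional, defaults to 0.0): Guidance rescale factor from Common Diffusion Noise Schedules and Sample Steps are Flawed. The guidance rescale factor should fix overexposure when using zero terminal SNR.
clip_skip (int, optional): Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.

Returns

StableDiffusionPipelineOutput or tuple

If return_dict is True, a StableDiffusionPipelineOutput is returned; otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content.

The call function to the pipeline for generation.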
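To make the callback and callback_steps parameters more concrete, the sketch below builds a callback with the documented signature callback(step, timestep, latents) and imitates the "every callback_steps steps" cadence with a plain loop. The names make_step_logger and simulate_denoising are hypothetical helpers written for this illustration; in a real run the pipeline itself invokes the callback, and the exact step bookkeeping inside the library may differ.

```python
from typing import Any, Callable, List, Tuple


def make_step_logger() -> Tuple[Callable[[int, int, Any], None], List[Tuple[int, int]]]:
    """Return a callback and the list it appends (step, timestep) pairs to."""
    seen: List[Tuple[int, int]] = []

    def callback(step: int, timestep: int, latents: Any) -> None:
        # In a real run `latents` is a torch.Tensor; this logger ignores it.
        seen.append((step, timestep))

    return callback, seen


def simulate_denoising(timesteps, callback, callback_steps=1):
    """Hypothetical stand-in for the pipeline loop (not the library code)."""
    for step, timestep in enumerate(timesteps):
        if step % callback_steps == 0:
            callback(step, timestep, None)


callback, seen = make_step_logger()
simulate_denoising([999, 799, 599, 399, 199], callback, callback_steps=2)
print(seen)
```

With a loaded pipeline, the same callback object would be passed as pipe(..., callback=callback, callback_steps=2).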

Examples:

>>> import torch
>>> from diffusers import StableDiffusionGLIGENPipeline
>>> from diffusers.utils import load_image

>>> # Insert objects described by text at the region defined by bounding boxes
>>> pipe = StableDiffusionGLIGENPipeline.from_pretrained(
...     "masterful/gligen-1-4-inpainting-text-box", variant="fp16", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")

>>> input_image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gligen/livingroom_modern.png"
... )
>>> prompt = "a birthday cake"
>>> boxes = [[0.2676, 0.6088, 0.4773, 0.7183]]
>>> phrases = ["a birthday cake"]

>>> images = pipe(
...     prompt=prompt,
...     gligen_phrases=phrases,
...     gligen_inpaint_image=input_image,
...     gligen_boxes=boxes,
...     gligen_scheduled_sampling_beta=1,
...     output_type="pil",
...     num_inference_steps=50,
... ).images

>>> images[0].save("./gligen-1-4-inpainting-text-box.jpg")

>>> # Generate an image described by the prompt and
>>> # insert objects described by text at the region defined by bounding boxes
>>> pipe = StableDiffusionGLIGENPipeline.from_pretrained(
...     "masterful/gligen-1-4-generation-text-box", variant="fp16", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")

>>> prompt = "a waterfall and a modern high speed train running through the tunnel in a beautiful forest with fall foliage"
>>> boxes = [[0.1387, 0.2051, 0.4277, 0.7090], [0.4980, 0.4355, 0.8516, 0.7266]]
>>> phrases = ["a waterfall", "a modern high speed train running through the tunnel"]

>>> images = pipe(
...     prompt=prompt,
...     gligen_phrases=phrases,
...     gligen_boxes=boxes,
...     gligen_scheduled_sampling_beta=1,
...     output_type="pil",
...     num_inference_steps=50,
... ).images

>>> images[0].save("./gligen-1-4-generation-text-box.jpg")
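The gligen_boxes values in the examples above are fractions of the image size, in the order [xmin, ymin, xmax, ymax]. If boxes are available in pixel coordinates, a small conversion step is needed first. The following is a minimal sketch of such a conversion; normalize_boxes is a hypothetical helper, not part of diffusers, and it rejects empty or out-of-image boxes as an extra safety check that the pipeline itself is not known to perform.

```python
from typing import List, Sequence


def normalize_boxes(
    pixel_boxes: Sequence[Sequence[float]], image_width: int, image_height: int
) -> List[List[float]]:
    """Convert pixel [xmin, ymin, xmax, ymax] boxes to fractions in [0, 1]."""
    normalized = []
    for xmin, ymin, xmax, ymax in pixel_boxes:
        inside_x = 0 <= xmin < xmax <= image_width
        inside_y = 0 <= ymin < ymax <= image_height
        if not (inside_x and inside_y):
            raise ValueError(f"box {(xmin, ymin, xmax, ymax)} is empty or outside the image")
        normalized.append(
            [xmin / image_width, ymin / image_height, xmax / image_width, ymax / image_height]
        )
    return normalized


boxes = normalize_boxes([[128, 256, 384, 448]], image_width=512, image_height=512)
print(boxes)
```

The resulting list can be passed directly as gligen_boxes, with one entry of gligen_phrases per box.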

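enable_vae_slicing

( )

Enable sliced VAE decoding. When this option is enabled, the VAE splits the input tensor into slices and computes decoding in several steps. This is useful to save some memory and allow larger batch sizes.

disable_vae_slicing

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method goes back to computing decoding in one step.

enable_vae_tiling

( )

Enable tiled VAE decoding. When this option is enabled, the VAE splits the input tensor into tiles and computes decoding and encoding in several steps. This is useful for saving a large amount of memory and for processing larger images.

disable_vae_tiling

( )

Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method goes back to computing decoding in one step.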

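enable_model_cpu_offload

( gpu_id: typing.Optional[int] = None device: typing.Union[torch.device, str] = None )

Parameters

gpu_id (int, optional): The ID of the accelerator to use in inference. If not specified, it defaults to 0.
device (torch.device or str, optional, defaults to None): The PyTorch device type of the accelerator to use in inference. If not specified, the available accelerator is detected automatically and used.

Offloads all models to CPU using accelerate, reducing memory usage with a low impact on performance. Compared to enable_sequential_cpu_offload, this method moves one whole model at a time to the accelerator when its forward method is called, and the model remains on the accelerator until the next model runs. Memory savings are lower than with enable_sequential_cpu_offload, but performance is much better due to the iterative execution of the unet.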

prepare_latents

( batch_size num_channels_latents height width dtype device generator latents = None )

enable_fuser

( enabled = True )

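encode_prompt

( prompt device num_images_per_prompt do_classifier_free_guidance negative_prompt = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )

Parameters

prompt (str or List[str], optional): The prompt to be encoded.
device (torch.device): The torch device.
num_images_per_prompt (int): The number of images that should be generated per prompt.
do_classifier_free_guidance (bool): Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
prompt_embeds (torch.Tensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
lora_scale (float, optional): A LoRA scale that is applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional): Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.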
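The clip_skip description above ("a value of 1 means the output of the pre-final layer") amounts to an indexing rule over the text encoder's per-layer outputs. The sketch below shows only that rule, using strings as stand-ins for hidden-state tensors; select_hidden_state is a hypothetical name, and any extra processing the library may apply to the selected layer (for example a final normalization) is deliberately left out.

```python
def select_hidden_state(hidden_states, clip_skip=None):
    """Pick the layer output used for prompt embeddings.

    `hidden_states` is assumed to be ordered from the earliest layer to the
    final layer. `clip_skip=None` selects the final layer; `clip_skip=1`
    selects the pre-final layer, and so on.
    """
    if clip_skip is None:
        return hidden_states[-1]
    if not 0 <= clip_skip < len(hidden_states):
        raise ValueError("clip_skip must be between 0 and len(hidden_states) - 1")
    return hidden_states[-(clip_skip + 1)]


layers = ["layer_10_output", "layer_11_output", "layer_12_output"]
print(select_hidden_state(layers))
print(select_hidden_state(layers, clip_skip=1))
```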

StableDiffusionGLIGENTextImagePipeline

class diffusers.StableDiffusionGLIGENTextImagePipeline

( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer processor: CLIPProcessor image_encoder: CLIPVisionModelWithProjection image_project: CLIPImageProjection unet: UNet2DConditionModel scheduler: KarrasDiffusionSchedulers safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor requires_safety_checker: bool = True )

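Parameters

vae (AutoencoderKL): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (CLIPTextModel): Frozen text encoder (clip-vit-large-patch14).
tokenizer (CLIPTokenizer): A CLIPTokenizer to tokenize text.
processor (CLIPProcessor): A CLIPProcessor to process the reference image.
image_encoder (CLIPVisionModelWithProjection): Frozen image encoder (clip-vit-large-patch14).
image_project (CLIPImageProjection): A CLIPImageProjection to project image embeddings into the phrase embedding space.
unet (UNet2DConditionModel): A UNet2DConditionModel to denoise the encoded image latents.
scheduler (SchedulerMixin): A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler, LMSDiscreteScheduler, or PNDMScheduler.
safety_checker (StableDiffusionSafetyChecker): Classification module that estimates whether generated images could be considered offensive or harmful. Refer to the model card for more details about a model's potential harms.
feature_extractor (CLIPImageProcessor): A CLIPImageProcessor to extract features from generated images; used as input to the safety_checker.

Pipeline for text-to-image generation using Stable Diffusion with Grounded-Language-to-Image Generation (GLIGEN).

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).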

__call__

( prompt: typing.Union[str, typing.List[str]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 guidance_scale: float = 7.5 gligen_scheduled_sampling_beta: float = 0.3 gligen_phrases: typing.List[str] = None gligen_images: typing.List[PIL.Image.Image] = None input_phrases_mask: typing.Union[int, typing.List[int]] = None input_images_mask: typing.Union[int, typing.List[int]] = None gligen_boxes: typing.List[typing.List[float]] = None gligen_inpaint_image: typing.Optional[PIL.Image.Image] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: float = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback: typing.Optional[typing.Callable[[int, int, torch.Tensor], NoneType]] = None callback_steps: int = 1 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None gligen_normalize_constant: float = 28.7 clip_skip: int = None ) → StableDiffusionPipelineOutput or tuple

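Parameters

prompt (str or List[str], optional): The prompt or prompts to guide image generation. If not defined, you need to pass prompt_embeds.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor): The height in pixels of the generated image.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor): The width in pixels of the generated image.
num_inference_steps (int, optional, defaults to 50): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
guidance_scale (float, optional, defaults to 7.5): A higher guidance scale value encourages the model to generate images closely linked to the text prompt at the expense of lower image quality. Guidance scale is enabled when guidance_scale > 1.
gligen_phrases (List[str]): The phrases to guide what to include in each of the regions defined by the corresponding gligen_boxes. There should be exactly one phrase per bounding box.
gligen_images (List[PIL.Image.Image]): The images to guide what to include in each of the regions defined by the corresponding gligen_boxes. There should be exactly one image per bounding box.
input_phrases_mask (int or List[int]): Mask for the phrase inputs, one value per bounding box. A value of 1 keeps the corresponding entry of gligen_phrases; a value of 0 masks it out (for example, a placeholder phrase).
input_images_mask (int or List[int]): Mask for the image inputs, one value per bounding box. A value of 1 keeps the corresponding entry of gligen_images; a value of 0 masks it out (for example, a placeholder image).
gligen_boxes (List[List[float]]): The bounding boxes that identify rectangular regions of the image to be filled with the content described by the corresponding gligen_phrases. Each box is a List[float] of 4 elements [xmin, ymin, xmax, ymax], where each value lies in [0, 1].
gligen_inpaint_image (PIL.Image.Image, optional): The input image. If provided, it is inpainted with the objects described by gligen_boxes and gligen_phrases. Otherwise, the call is treated as a generation task on a blank input image.
gligen_scheduled_sampling_beta (float, defaults to 0.3): Scheduled Sampling factor from GLIGEN: Open-Set Grounded Text-to-Image Generation. The Scheduled Sampling factor is only varied for scheduled sampling during inference, for improved quality and controllability.
negative_prompt (str or List[str], optional): The prompt or prompts to guide what to not include in image generation. If not defined, you need to pass negative_prompt_embeds instead. Ignored when not using guidance (guidance_scale < 1).
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
eta (float, optional, defaults to 0.0): Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler and is ignored in other schedulers.
generator (torch.Generator or List[torch.Generator], optional): A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional): Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True): Whether or not to return a StableDiffusionPipelineOutput instead of a plain tuple.
callback (Callable, optional): A function that is called every callback_steps steps during inference. The function is called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
callback_steps (int, optional, defaults to 1): The frequency at which the callback function is called. If not specified, the callback is called at every step.
cross_attention_kwargs (dict, optional): A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
gligen_normalize_constant (float, optional, defaults to 28.7): The normalization value for the image embeddings.
clip_skip (int, optional): Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.

Returns

StableDiffusionPipelineOutput or tuple

The call function to the pipeline for generation.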
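One way to read gligen_scheduled_sampling_beta is as the fraction of early denoising steps during which the grounding layers stay active, with the remaining steps running without grounding. That reading follows the scheduled-sampling idea in the GLIGEN paper, but the exact rule used inside the pipeline is an assumption here, and grounding_step_count and grounding_schedule are hypothetical helpers written only to make the arithmetic visible.

```python
from typing import List


def grounding_step_count(num_inference_steps: int, beta: float) -> int:
    """Number of initial steps assumed to run with grounding enabled."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is expected to lie in [0, 1]")
    return int(beta * num_inference_steps)


def grounding_schedule(num_inference_steps: int, beta: float) -> List[bool]:
    """True for steps assumed to use grounding, False for the rest."""
    cutoff = grounding_step_count(num_inference_steps, beta)
    return [step < cutoff for step in range(num_inference_steps)]


# Default beta of 0.3 with the default 50 steps.
print(grounding_step_count(50, 0.3))
# beta=1, as used in the examples on this page, keeps grounding on throughout.
print(sum(grounding_schedule(50, 1.0)))
```

Under this reading, a lower beta lets the later steps refine the image freely, while beta=1 enforces the boxes for the whole run.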

Examples:

>>> import torch
>>> from diffusers import StableDiffusionGLIGENTextImagePipeline
>>> from diffusers.utils import load_image

>>> # Insert objects described by image at the region defined by bounding boxes
>>> pipe = StableDiffusionGLIGENTextImagePipeline.from_pretrained(
...     "anhnct/Gligen_Inpainting_Text_Image", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")

>>> input_image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gligen/livingroom_modern.png"
... )
>>> prompt = "a backpack"
>>> boxes = [[0.2676, 0.4088, 0.4773, 0.7183]]
>>> phrases = None
>>> gligen_image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gligen/backpack.jpeg"
... )

>>> images = pipe(
...     prompt=prompt,
...     gligen_phrases=phrases,
...     gligen_inpaint_image=input_image,
...     gligen_boxes=boxes,
...     gligen_images=[gligen_image],
...     gligen_scheduled_sampling_beta=1,
...     output_type="pil",
...     num_inference_steps=50,
... ).images

>>> images[0].save("./gligen-inpainting-text-image-box.jpg")

>>> # Generate an image described by the prompt and
>>> # insert objects described by text and image at the region defined by bounding boxes
>>> pipe = StableDiffusionGLIGENTextImagePipeline.from_pretrained(
...     "anhnct/Gligen_Text_Image", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")

>>> prompt = "a flower sitting on the beach"
>>> boxes = [[0.0, 0.09, 0.53, 0.76]]
>>> phrases = ["flower"]
>>> gligen_image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gligen/pexels-pixabay-60597.jpg"
... )

>>> images = pipe(
...     prompt=prompt,
...     gligen_phrases=phrases,
...     gligen_images=[gligen_image],
...     gligen_boxes=boxes,
...     gligen_scheduled_sampling_beta=1,
...     output_type="pil",
...     num_inference_steps=50,
... ).images

>>> images[0].save("./gligen-generation-text-image-box.jpg")

>>> # Generate an image described by the prompt and
>>> # transfer style described by image at the region defined by bounding boxes
>>> pipe = StableDiffusionGLIGENTextImagePipeline.from_pretrained(
...     "anhnct/Gligen_Text_Image", torch_dtype=torch.float16
... )
>>> pipe = pipe.to("cuda")

>>> prompt = "a dragon flying on the sky"
>>> boxes = [[0.4, 0.2, 1.0, 0.8], [0.0, 1.0, 0.0, 1.0]]  # Set `[0.0, 1.0, 0.0, 1.0]` for the style

>>> gligen_image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
... )

>>> gligen_placeholder = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
... )

>>> images = pipe(
...     prompt=prompt,
...     gligen_phrases=[
...         "dragon",
...         "placeholder",
...     ],  # Can use any text instead of `placeholder` token, because we will use mask here
...     gligen_images=[
...         gligen_placeholder,
...         gligen_image,
...     ],  # Can use any image in gligen_placeholder, because we will use mask here
...     input_phrases_mask=[1, 0],  # Set 0 for the placeholder token
...     input_images_mask=[0, 1],  # Set 0 for the placeholder image
...     gligen_boxes=boxes,
...     gligen_scheduled_sampling_beta=1,
...     output_type="pil",
...     num_inference_steps=50,
... ).images

>>> images[0].save("./gligen-generation-text-image-box-style-transfer.jpg")
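In the style-transfer example above, each box has either a real phrase or a real image, and the missing side is filled with a placeholder whose mask value is 0. The sketch below automates that bookkeeping. build_grounding_inputs is a hypothetical helper, not part of diffusers; strings stand in for PIL images so the block runs without any downloads.

```python
from typing import Any, List, Optional, Sequence, Tuple


def build_grounding_inputs(
    entries: Sequence[Tuple[Optional[str], Optional[Any]]],
    placeholder_image: Any,
    placeholder_phrase: str = "placeholder",
) -> Tuple[List[str], List[Any], List[int], List[int]]:
    """Turn per-box (phrase, image) pairs into phrases, images and 0/1 masks.

    A missing phrase or image is replaced by a placeholder and masked with 0,
    mirroring the convention used in the example above.
    """
    phrases, images, phrases_mask, images_mask = [], [], [], []
    for phrase, image in entries:
        if phrase is None and image is None:
            raise ValueError("each box needs at least a phrase or an image")
        phrases.append(placeholder_phrase if phrase is None else phrase)
        images.append(placeholder_image if image is None else image)
        phrases_mask.append(0 if phrase is None else 1)
        images_mask.append(0 if image is None else 1)
    return phrases, images, phrases_mask, images_mask


phrases, images, phrases_mask, images_mask = build_grounding_inputs(
    [("dragon", None), (None, "style_image")],
    placeholder_image="any_image",
)
print(phrases, images, phrases_mask, images_mask)
```

The four returned lists correspond to gligen_phrases, gligen_images, input_phrases_mask, and input_images_mask, in box order.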

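enable_vae_slicing

( )

Enable sliced VAE decoding. When this option is enabled, the VAE splits the input tensor into slices and computes decoding in several steps. This is useful to save some memory and allow larger batch sizes.

disable_vae_slicing

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method goes back to computing decoding in one step.

enable_vae_tiling

( )

Enable tiled VAE decoding. When this option is enabled, the VAE splits the input tensor into tiles and computes decoding and encoding in several steps. This is useful for saving a large amount of memory and for processing larger images.

disable_vae_tiling

( )

Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method goes back to computing decoding in one step.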

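enable_model_cpu_offload

( gpu_id: typing.Optional[int] = None device: typing.Union[torch.device, str] = None )

Parameters

gpu_id (int, optional): The ID of the accelerator to use in inference. If not specified, it defaults to 0.
device (torch.device or str, optional, defaults to None): The PyTorch device type of the accelerator to use in inference. If not specified, the available accelerator is detected automatically and used.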

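prepare_latents

( batch_size num_channels_latents height width dtype device generator latents = None )

enable_fuser

( enabled = True )

complete_mask

( has_mask max_objs device )

Based on the input mask value (0 or 1) given for each phrase and image, masks the features corresponding to those phrases and images.

crop

( im new_width new_height )

Crop the input image to the specified dimensions.

draw_inpaint_mask_from_boxes

( boxes size )

Create an inpainting mask based on the given boxes. This function generates an inpainting mask from the provided boxes to mark the regions that need to be inpainted.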
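To illustrate what a box-derived inpainting mask looks like, the sketch below fills a small grid from normalized [xmin, ymin, xmax, ymax] boxes. It uses nested Python lists rather than tensors, and the value convention (0.0 inside a box for "repaint here", 1.0 elsewhere for "keep") is an assumption chosen for the illustration rather than a statement about the library's internals. inpaint_mask_from_boxes is a hypothetical name.

```python
from typing import List, Sequence


def inpaint_mask_from_boxes(
    boxes: Sequence[Sequence[float]], height: int, width: int
) -> List[List[float]]:
    """Return a height x width grid: 0.0 inside any box, 1.0 elsewhere."""
    mask = [[1.0] * width for _ in range(height)]
    for xmin, ymin, xmax, ymax in boxes:
        # Scale normalized coordinates to grid indices (truncating).
        x0, x1 = int(xmin * width), int(xmax * width)
        y0, y1 = int(ymin * height), int(ymax * height)
        for row in range(y0, y1):
            for col in range(x0, x1):
                mask[row][col] = 0.0
    return mask


mask = inpaint_mask_from_boxes([[0.25, 0.5, 0.75, 1.0]], height=4, width=4)
for row in mask:
    print(row)
```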

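encode_prompt

Parameters

prompt (str or List[str], optional): The prompt to be encoded.
device (torch.device): The torch device.
num_images_per_prompt (int): The number of images that should be generated per prompt.
do_classifier_free_guidance (bool): Whether to use classifier-free guidance or not.
negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
prompt_embeds (torch.Tensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional): Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
lora_scale (float, optional): A LoRA scale that is applied to all LoRA layers of the text encoder if LoRA layers are loaded.
clip_skip (int, optional): Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer is used for computing the prompt embeddings.

Encodes the prompt into text encoder hidden states.

get_clip_feature

( input normalize_constant device is_image = False )

Get image and phrase embeddings using the pretrained CLIP model. The image embedding is transformed into the phrase embedding space through a projection.

get_cross_attention_kwargs_with_grounded

( hidden_size gligen_phrases gligen_images gligen_boxes input_phrases_mask input_images_mask repeat_batch normalize_constant max_objs device )

Prepare the cross-attention kwargs containing information about the grounded input (boxes, mask, image embedding, phrase embedding).

get_cross_attention_kwargs_without_grounded

( hidden_size repeat_batch max_objs device )

Prepare the cross-attention kwargs without information about the grounded input (boxes, mask, image embedding, phrase embedding); all of them are zero tensors.

target_size_center_crop

( im new_hw )

Crop and resize the image to the target size while keeping the center.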
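The "crop while keeping the center" step can be pictured as computing a centered square box before resizing. The sketch below computes only that box, as a (left, top, right, bottom) tuple in pixels; center_square_box is a hypothetical helper, and integer rounding toward the top-left is an assumption of this sketch rather than a documented behavior.

```python
from typing import Tuple


def center_square_box(width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered square."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


print(center_square_box(640, 480))
print(center_square_box(300, 500))
```

With Pillow, a tuple in this order can be passed to Image.crop, followed by Image.resize((new_hw, new_hw)) to reach the target size.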

StableDiffusionPipelineOutput

class diffusers.pipelines.stable_diffusion.StableDiffusionPipelineOutput

( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] nsfw_content_detected: typing.Optional[typing.List[bool]] )

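Parameters

images (List[PIL.Image.Image] or np.ndarray): List of denoised PIL images of length batch_size, or a NumPy array of shape (batch_size, height, width, num_channels).
nsfw_content_detected (List[bool]): List indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content, or None if safety checking could not be performed.

Output class for Stable Diffusion pipelines.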
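Because nsfw_content_detected may be either a list of flags or None, code that consumes the output usually needs to handle both cases. The sketch below shows one way to do that with a hypothetical helper, keep_unflagged_images; strings stand in for images so the block runs on its own, and treating None as "keep everything" is a design choice of this sketch, not library behavior.

```python
from typing import Any, List, Optional, Sequence


def keep_unflagged_images(
    images: Sequence[Any], nsfw_content_detected: Optional[Sequence[bool]]
) -> List[Any]:
    """Return the images whose nsfw flag is False; keep all if no flags exist."""
    if nsfw_content_detected is None:
        # Safety checking could not be performed, so nothing is filtered here.
        return list(images)
    if len(nsfw_content_detected) != len(images):
        raise ValueError("expected one nsfw flag per image")
    return [img for img, flagged in zip(images, nsfw_content_detected) if not flagged]


print(keep_unflagged_images(["img_0", "img_1", "img_2"], [False, True, False]))
print(keep_unflagged_images(["img_0"], None))
```

With a real pipeline result named output, the call would be keep_unflagged_images(output.images, output.nsfw_content_detected).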
