VisualCloze

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning is an innovative, in-context-learning-based universal image generation framework that offers key capabilities:

  1. Support for various in-domain tasks
  2. Generalization to unseen tasks through in-context learning
  3. Unification of multiple tasks into one step, generating both the target image and intermediate results
  4. Support for reverse-engineering conditions from a target image

Overview

The abstract from the paper is:

Recent progress in diffusion models significantly advances various image generation tasks. However, the current mainstream approach remains focused on building task-specific models, which have limited efficiency when supporting a wide range of different needs. While universal models attempt to address this limitation, they face critical challenges, including generalizable task instruction, appropriate task distributions, and unified architectural design. To tackle these challenges, we propose VisualCloze, a universal image generation framework, which supports a wide range of in-domain tasks, generalization to unseen ones, unseen unification of multiple tasks, and reverse generation. Unlike existing methods that rely on language-based task instruction, leading to task ambiguity and weak generalization, we integrate visual in-context learning, allowing models to identify tasks from visual demonstrations. Meanwhile, the inherent sparsity of visual task distributions hampers the learning of transferable knowledge across tasks. To this end, we introduce Graph200K, a graph-structured dataset that establishes various interrelated tasks, enhancing task density and transferable knowledge. Furthermore, we uncover that our unified image generation formulation shares a consistent objective with image infilling, enabling us to leverage the strong generative priors of pre-trained infilling models without modifying the architecture. The code, dataset, and models are available at https://visualcloze.github.io.

Inference

Model loading

VisualCloze is a two-stage cascade pipeline, consisting of `VisualClozeGenerationPipeline` and `VisualClozeUpsamplingPipeline`.

  • In `VisualClozeGenerationPipeline`, each image is downsampled before being concatenated into a grid layout, to avoid excessively high resolutions. VisualCloze releases two models suitable for diffusers, namely VisualClozePipeline-384 and VisualClozePipeline-512, which downsample images to resolutions of 384 and 512, respectively.
  • `VisualClozeUpsamplingPipeline` uses SDEdit to achieve high-resolution image synthesis.

`VisualClozePipeline` integrates the two stages to support convenient end-to-end sampling, while also allowing users to use each pipeline independently as needed.

Input Specifications

Task and Content Prompts

  • Task prompt: required, describes the intention of the generation task
  • Content prompt: optional description or caption of the target image
  • Pass `None` when a content prompt is not needed
  • For batch inference, pass `List[str|None]`

Image Input Format

  • Format: `List[List[Image|None]]`
  • Structure:
    • All rows except the last represent in-context examples
    • The last row represents the current query (with the target image set to `None`)
  • For batch inference, pass `List[List[List[Image|None]]]`, as in the sketch below
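
For instance, here is a minimal sketch of this structure (the file names below are hypothetical placeholders for your own condition and target images):

from diffusers.utils import load_image

# One grid per query. Each inner list is one row: all rows except the last are
# in-context examples, and the last row is the query with the target slot set to None.
single_query = [
    [load_image("example_condition.jpg"), load_image("example_target.jpg")],  # in-context example
    [load_image("query_condition.jpg"), None],                                # query row
]

# Batch inference: a list of grids, with the prompts passed as parallel lists.
images = [single_query, single_query]
task_prompts = ["<task prompt>", "<task prompt>"]
content_prompts = ["<content prompt>", None]  # None where no content prompt is needed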

Resolution Control

  • Default behavior:
    • First-stage initial generation: an area of `pipe.resolution`^2
    • Second-stage upsampling: a factor of 3
  • Custom resolution: adjust with the `upsampling_height` and `upsampling_width` arguments, as sketched below
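
As a minimal sketch (assuming `pipe`, `image_paths`, `task_prompt`, and `content_prompt` are prepared as in the examples below), overriding the default resolution, or skipping upsampling entirely, looks like this:

# Override the default 3x upsampling factor with an explicit output size.
image = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_width=1024,
    upsampling_height=1024,
    upsampling_strength=0.4,
).images[0][0]

# upsampling_strength=0 skips the second stage and returns the first-stage
# result at the pipeline's base resolution (see upsampling_strength below).
image_low_res = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_strength=0,
).images[0][0]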

Examples

For comprehensive examples covering a wide range of tasks, refer to the online demo and the GitHub repository. Below are simple examples of three cases: mask-to-image conversion, edge detection, and subject-driven generation.

Mask-to-Image Example

import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image

pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Load in-context images (make sure the paths are correct and accessible)
image_paths = [
    # in-context examples
    [
        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_mask.jpg'),
        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_image.jpg'),
    ],
    # query with the target image
    [
        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_query_mask.jpg'),
        None, # No image needed for the target image
    ],
]

# Task and content prompt
task_prompt = "In each row, a logical task is demonstrated to achieve [IMAGE2] an aesthetically pleasing photograph based on [IMAGE1] sam 2-generated masks with rich color coding."
content_prompt = """Majestic photo of a golden eagle perched on a rocky outcrop in a mountainous landscape. 
The eagle is positioned in the right foreground, facing left, with its sharp beak and keen eyes prominently visible. 
Its plumage is a mix of dark brown and golden hues, with intricate feather details. 
The background features a soft-focus view of snow-capped mountains under a cloudy sky, creating a serene and grandiose atmosphere. 
The foreground includes rugged rocks and patches of green moss. Photorealistic, medium depth of field, 
soft natural lighting, cool color palette, high contrast, sharp focus on the eagle, blurred background, 
tranquil, majestic, wildlife photography."""

# Run the pipeline
image_result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_width=1344,
    upsampling_height=768,
    upsampling_strength=0.4,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0][0]

# Save the resulting image
image_result.save("visualcloze.png")

Edge Detection Example

import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image

pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Load in-context images (make sure the paths are correct and accessible)
image_paths = [
    # in-context examples
    [
        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_incontext-example-1_image.jpg'),
        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_incontext-example-1_edge.jpg'),
    ],
    [
        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_incontext-example-2_image.jpg'),
        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_incontext-example-2_edge.jpg'),
    ],
    # query with the target image
    [
        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_edgedetection_query_image.jpg'),
        None, # No image needed for the target image
    ],
]

# Task and content prompt
task_prompt = "Each row illustrates a pathway from [IMAGE1] a sharp and beautifully composed photograph to [IMAGE2] edge map with natural well-connected outlines using a clear logical task."
content_prompt = ""

# Run the pipeline
image_result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_width=864,
    upsampling_height=1152,
    upsampling_strength=0.4,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0][0]

# Save the resulting image
image_result.save("visualcloze.png")

Subject-Driven Generation Example

import torch
from diffusers import VisualClozePipeline
from diffusers.utils import load_image

pipe = VisualClozePipeline.from_pretrained("VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Load in-context images (make sure the paths are correct and accessible)
image_paths = [
    # in-context examples
    [
        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-1_reference.jpg'),
        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-1_depth.jpg'),
        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-1_image.jpg'),
    ],
    [
        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-2_reference.jpg'),
        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-2_depth.jpg'),
        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_incontext-example-2_image.jpg'),
    ],
    # query with the target image
    [
        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_query_reference.jpg'),
        load_image('https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_subjectdriven_query_depth.jpg'),
        None, # No image needed for the target image
    ],
]

# Task and content prompt
task_prompt = """Each row describes a process that begins with [IMAGE1] an image containing the key object, 
[IMAGE2] depth map revealing gray-toned spatial layers and results in 
[IMAGE3] an image with artistic quality, a high-quality image with exceptional detail."""
content_prompt = """A vintage porcelain collector's item. Beneath a blossoming cherry tree in early spring, 
this treasure is photographed up close, with soft pink petals drifting through the air and vibrant blossoms framing the scene."""

# Run the pipeline
image_result = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    upsampling_width=1024,
    upsampling_height=1024,
    upsampling_strength=0.2,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0][0]

# Save the resulting image
image_result.save("visualcloze.png")

Using Each Pipeline Independently

import torch
from diffusers import VisualClozeGenerationPipeline, FluxFillPipeline as VisualClozeUpsamplingPipeline
from diffusers.utils import load_image
from PIL import Image

pipe = VisualClozeGenerationPipeline.from_pretrained(
    "VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image_paths = [
    # in-context examples
    [
        load_image(
            "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_mask.jpg"
        ),
        load_image(
            "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_image.jpg"
        ),
    ],
    # query with the target image
    [
        load_image(
            "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_query_mask.jpg"
        ),
        None,  # No image needed for the target image
    ],
]
task_prompt = "In each row, a logical task is demonstrated to achieve [IMAGE2] an aesthetically pleasing photograph based on [IMAGE1] sam 2-generated masks with rich color coding."
content_prompt = "Majestic photo of a golden eagle perched on a rocky outcrop in a mountainous landscape. The eagle is positioned in the right foreground, facing left, with its sharp beak and keen eyes prominently visible. Its plumage is a mix of dark brown and golden hues, with intricate feather details. The background features a soft-focus view of snow-capped mountains under a cloudy sky, creating a serene and grandiose atmosphere. The foreground includes rugged rocks and patches of green moss. Photorealistic, medium depth of field, soft natural lighting, cool color palette, high contrast, sharp focus on the eagle, blurred background, tranquil, majestic, wildlife photography."

# Stage 1: Generate initial image
image = pipe(
    task_prompt=task_prompt,
    content_prompt=content_prompt,
    image=image_paths,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0][0]

# Stage 2 (optional): Upsample the generated image
pipe_upsample = VisualClozeUpsamplingPipeline.from_pipe(pipe)
pipe_upsample.to("cuda")

mask_image = Image.new("RGB", image.size, (255, 255, 255))

image = pipe_upsample(
    image=image,
    mask_image=mask_image,
    prompt=content_prompt,
    width=1344,
    height=768,
    strength=0.4,
    guidance_scale=30,
    num_inference_steps=30,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]

image.save("visualcloze.png")


class diffusers.VisualClozePipeline


( scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer text_encoder_2: T5EncoderModel tokenizer_2: T5TokenizerFast transformer: FluxTransformer2DModel resolution: int = 384 )

Parameters

  • transformer (FluxTransformer2DModel) — Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
  • scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
  • vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
  • text_encoder (CLIPTextModel) — CLIP, specifically the clip-vit-large-patch14 variant.
  • text_encoder_2 (T5EncoderModel) — The second text encoder, T5, specifically the google/t5-v1_1-xxl variant.
  • tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
  • tokenizer_2 (T5TokenizerFast) — Second tokenizer, of class T5TokenizerFast.
  • resolution (int, optional, defaults to 384) — The resolution of each image when concatenating the images from the query and in-context examples.

The VisualCloze pipeline for image generation with visual context. Reference: https://github.com/lzyhha/VisualCloze/tree/main. This pipeline is designed to generate an image based on visual in-context examples.

__call__


( task_prompt: typing.Union[str, typing.List[str]] = None content_prompt: typing.Union[str, typing.List[str]] = None image: typing.Optional[torch.FloatTensor] = None upsampling_height: typing.Optional[int] = None upsampling_width: typing.Optional[int] = None num_inference_steps: int = 50 sigmas: typing.Optional[typing.List[float]] = None guidance_scale: float = 30.0 num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True joint_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 upsampling_strength: float = 1.0 ) ~pipelines.flux.FluxPipelineOutput or tuple

Parameters

  • task_prompt (str or List[str], optional) — The prompt or prompts to define the task intention.
  • content_prompt (str or List[str], optional) — The prompt or prompts to define the content or caption of the target image to be generated.
  • image (torch.Tensor, PIL.Image.Image, np.ndarray, List[torch.Tensor], List[PIL.Image.Image], or List[np.ndarray]) — `Image`, numpy array, or tensor representing an image batch to be used as the starting point. For both numpy arrays and pytorch tensors, the expected value range is between `[0, 1]`. If it's a tensor or a list of tensors, the expected shape should be `(B, C, H, W)` or `(C, H, W)`. If it is a numpy array or a list of arrays, the expected shape should be `(B, H, W, C)` or `(H, W, C)`.
  • upsampling_height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image (that is, the output image) after upsampling via SDEdit. By default, the image is upsampled by a factor of three, and the base resolution is determined by the pipeline's resolution parameter. When only one of upsampling_height or upsampling_width is specified, the other is set automatically from the aspect ratio.
  • upsampling_width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image (that is, the output image) after upsampling via SDEdit. By default, the image is upsampled by a factor of three, and the base resolution is determined by the pipeline's resolution parameter. When only one of upsampling_height or upsampling_width is specified, the other is set automatically from the aspect ratio.
  • num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
  • sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers whose set_timesteps method supports a sigmas argument. If not defined, the default behavior when num_inference_steps is passed will be used.
  • guidance_scale (float, optional, defaults to 30.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen Paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages generating images that are closely linked to the text prompt, usually at the expense of lower image quality.
  • num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
  • generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
  • latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling with the supplied random generator.
  • prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
  • pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings are generated from the prompt input argument.
  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.flux.FluxPipelineOutput instead of a plain tuple.
  • joint_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
  • callback_on_step_end (Callable, optional) — A function called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
  • callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
  • max_sequence_length (int, defaults to 512) — The maximum sequence length to use with the prompt.
  • upsampling_strength (float, optional, defaults to 1.0) — Indicates the extent to which to transform the reference image when upsampling the results. Must be between 0 and 1. The generated image is used as a starting point, and more noise is added the higher the upsampling_strength. The number of denoising steps depends on the amount of noise initially added. When upsampling_strength is 1, the added noise is maximal and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 0 skips the upsampling step and outputs the results at the resolution of self.resolution.

Returns

~pipelines.flux.FluxPipelineOutput or tuple

~pipelines.flux.FluxPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.

The call function invoked when calling the VisualCloze pipeline for generation.

Examples

>>> import torch
>>> from diffusers import VisualClozePipeline
>>> from diffusers.utils import load_image

>>> image_paths = [
...     # in-context examples
...     [
...         load_image(
...             "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_mask.jpg"
...         ),
...         load_image(
...             "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_image.jpg"
...         ),
...     ],
...     # query with the target image
...     [
...         load_image(
...             "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_query_mask.jpg"
...         ),
...         None,  # No image needed for the target image
...     ],
... ]
>>> task_prompt = "In each row, a logical task is demonstrated to achieve [IMAGE2] an aesthetically pleasing photograph based on [IMAGE1] sam 2-generated masks with rich color coding."
>>> content_prompt = "Majestic photo of a golden eagle perched on a rocky outcrop in a mountainous landscape. The eagle is positioned in the right foreground, facing left, with its sharp beak and keen eyes prominently visible. Its plumage is a mix of dark brown and golden hues, with intricate feather details. The background features a soft-focus view of snow-capped mountains under a cloudy sky, creating a serene and grandiose atmosphere. The foreground includes rugged rocks and patches of green moss. Photorealistic, medium depth of field, soft natural lighting, cool color palette, high contrast, sharp focus on the eagle, blurred background, tranquil, majestic, wildlife photography."
>>> pipe = VisualClozePipeline.from_pretrained(
...     "VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16
... )
>>> pipe.to("cuda")

>>> image = pipe(
...     task_prompt=task_prompt,
...     content_prompt=content_prompt,
...     image=image_paths,
...     upsampling_width=1344,
...     upsampling_height=768,
...     upsampling_strength=0.4,
...     guidance_scale=30,
...     num_inference_steps=30,
...     max_sequence_length=512,
...     generator=torch.Generator("cpu").manual_seed(0),
... ).images[0][0]
>>> image.save("visualcloze.png")

VisualClozeGenerationPipeline

class diffusers.VisualClozeGenerationPipeline


( scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer text_encoder_2: T5EncoderModel tokenizer_2: T5TokenizerFast transformer: FluxTransformer2DModel resolution: int = 384 )

Parameters

  • transformer (FluxTransformer2DModel) — Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
  • scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
  • vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
  • text_encoder (CLIPTextModel) — CLIP, specifically the clip-vit-large-patch14 variant.
  • text_encoder_2 (T5EncoderModel) — T5, specifically the google/t5-v1_1-xxl variant.
  • tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
  • tokenizer_2 (T5TokenizerFast) — Second tokenizer, of class T5TokenizerFast.
  • resolution (int, optional, defaults to 384) — The resolution of each image when concatenating the query and in-context example images.

The VisualCloze pipeline for image generation with visual context. Reference: https://github.com/lzyhha/VisualCloze/tree/main. This pipeline is designed to generate an image based on visual in-context examples.

__call__


( task_prompt: typing.Union[str, typing.List[str]] = None content_prompt: typing.Union[str, typing.List[str]] = None image: typing.Optional[torch.FloatTensor] = None num_inference_steps: int = 50 sigmas: typing.Optional[typing.List[float]] = None guidance_scale: float = 30.0 num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.FloatTensor] = None prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True joint_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) ~pipelines.flux.FluxPipelineOutput or tuple

Parameters

  • task_prompt (str or List[str], optional) — The prompt or prompts to define the task intention.
  • content_prompt (str or List[str], optional) — The prompt or prompts to define the content or caption of the image to be generated.
  • image (torch.Tensor, PIL.Image.Image, np.ndarray, List[torch.Tensor], List[PIL.Image.Image], or List[np.ndarray]) — Image, numpy array, or tensor representing an image batch to be used as the starting point. For both numpy arrays and pytorch tensors, the expected value range is between [0, 1]. If it's a tensor or a list of tensors, the expected shape should be (B, C, H, W) or (C, H, W). If it is a numpy array or a list of arrays, the expected shape should be (B, H, W, C) or (H, W, C).
  • num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
  • sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers whose set_timesteps method supports a sigmas argument. If not defined, the default behavior when num_inference_steps is passed will be used.
  • guidance_scale (float, optional, defaults to 30.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w in equation 2 of the Imagen Paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages generating images that are closely linked to the text prompt, usually at the expense of lower image quality.
  • num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
  • generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
  • latents (torch.FloatTensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling with the supplied random generator.
  • prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
  • pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings are generated from the prompt input argument.
  • output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.flux.FluxPipelineOutput instead of a plain tuple.
  • joint_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
  • callback_on_step_end (Callable, optional) — A function called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
  • callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
  • max_sequence_length (int, defaults to 512) — The maximum sequence length to use with the prompt.

Returns

~pipelines.flux.FluxPipelineOutput or tuple

~pipelines.flux.FluxPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.

The call function invoked when calling the VisualCloze pipeline for generation.

Examples

>>> import torch
>>> from diffusers import VisualClozeGenerationPipeline, FluxFillPipeline as VisualClozeUpsamplingPipeline
>>> from diffusers.utils import load_image
>>> from PIL import Image

>>> image_paths = [
...     # in-context examples
...     [
...         load_image(
...             "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_mask.jpg"
...         ),
...         load_image(
...             "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_incontext-example-1_image.jpg"
...         ),
...     ],
...     # query with the target image
...     [
...         load_image(
...             "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/visualcloze/visualcloze_mask2image_query_mask.jpg"
...         ),
...         None,  # No image needed for the target image
...     ],
... ]
>>> task_prompt = "In each row, a logical task is demonstrated to achieve [IMAGE2] an aesthetically pleasing photograph based on [IMAGE1] sam 2-generated masks with rich color coding."
>>> content_prompt = "Majestic photo of a golden eagle perched on a rocky outcrop in a mountainous landscape. The eagle is positioned in the right foreground, facing left, with its sharp beak and keen eyes prominently visible. Its plumage is a mix of dark brown and golden hues, with intricate feather details. The background features a soft-focus view of snow-capped mountains under a cloudy sky, creating a serene and grandiose atmosphere. The foreground includes rugged rocks and patches of green moss. Photorealistic, medium depth of field, soft natural lighting, cool color palette, high contrast, sharp focus on the eagle, blurred background, tranquil, majestic, wildlife photography."
>>> pipe = VisualClozeGenerationPipeline.from_pretrained(
...     "VisualCloze/VisualClozePipeline-384", resolution=384, torch_dtype=torch.bfloat16
... )
>>> pipe.to("cuda")

>>> image = pipe(
...     task_prompt=task_prompt,
...     content_prompt=content_prompt,
...     image=image_paths,
...     guidance_scale=30,
...     num_inference_steps=30,
...     max_sequence_length=512,
...     generator=torch.Generator("cpu").manual_seed(0),
... ).images[0][0]

>>> # optional, upsampling the generated image
>>> pipe_upsample = VisualClozeUpsamplingPipeline.from_pipe(pipe)
>>> pipe_upsample.to("cuda")

>>> mask_image = Image.new("RGB", image.size, (255, 255, 255))

>>> image = pipe_upsample(
...     image=image,
...     mask_image=mask_image,
...     prompt=content_prompt,
...     width=1344,
...     height=768,
...     strength=0.4,
...     guidance_scale=30,
...     num_inference_steps=30,
...     max_sequence_length=512,
...     generator=torch.Generator("cpu").manual_seed(0),
... ).images[0]

>>> image.save("visualcloze.png")

disable_vae_slicing


( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method goes back to computing decoding in one step.

disable_vae_tiling


( )

Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method goes back to computing decoding in one step.

enable_vae_slicing


( )

Enable sliced VAE decoding. When this option is enabled, the VAE splits the input tensor into slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.

enable_vae_tiling


( )

Enable tiled VAE decoding. When this option is enabled, the VAE splits the input tensor into tiles to compute encoding and decoding in several steps. This is useful for saving a large amount of memory and allowing the processing of larger images.
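
A minimal sketch of how these memory-saving toggles might be used around inference (assuming `pipe` is a loaded `VisualClozeGenerationPipeline`):

# Reduce peak VAE memory when decoding large concatenated grids or
# high-resolution upsampled outputs.
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()

# ... run pipe(...) here ...

# Restore single-step decoding once memory is no longer a constraint.
pipe.disable_vae_slicing()
pipe.disable_vae_tiling()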

encode_prompt


( layout_prompt: typing.Union[str, typing.List[str]] task_prompt: typing.Union[str, typing.List[str]] content_prompt: typing.Union[str, typing.List[str]] device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.FloatTensor] = None pooled_prompt_embeds: typing.Optional[torch.FloatTensor] = None max_sequence_length: int = 512 lora_scale: typing.Optional[float] = None )

Parameters

  • layout_prompt (str or List[str], optional) — The prompt or prompts to define the number of in-context examples and the number of images involved in the task.
  • task_prompt (str or List[str], optional) — The prompt or prompts to define the task intention.
  • content_prompt (str or List[str], optional) — The prompt or prompts to define the content or caption of the image to be generated.
  • device — (torch.device): The torch device.
  • num_images_per_prompt (int) — The number of images that should be generated per prompt.
  • prompt_embeds (torch.FloatTensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
  • pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings are generated from the prompt input argument.
  • lora_scale (float, optional) — A LoRA scale applied to all LoRA layers of the text encoder if LoRA layers are loaded.
