Diffusers


OmniGen

OmniGen: Unified Image Generation, by Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, and Zheng Liu of BAAI.

The abstract from the paper is:

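The emergence of Large Language Models (LLMs) has unified language generation tasks and revolutionized human-machine interaction. However, in the realm of image generation, a unified model capable of handling various tasks within a single framework remains largely unexplored. In this work, we introduce OmniGen, a new diffusion model for unified image generation. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports various downstream tasks, such as image editing, subject-driven generation, and visual conditional generation. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional plugins. Moreover, compared to existing diffusion models, it is more user-friendly and can complete complex tasks end-to-end through instructions without the need for extra intermediate steps, greatly simplifying the image generation workflow. 3) Knowledge Transfer: Benefiting from learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model's reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and we will release our resources at https://github.com/VectorSpaceLab/OmniGen to foster future advancements.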

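Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.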

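This pipeline was contributed by staoxiao. The original codebase can be found at https://github.com/VectorSpaceLab/OmniGen. The original weights can be found under hf.co/shitao.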

Inference

First, load the pipeline:

import torch
from diffusers import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1-diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")

For text-to-image, pass a text prompt. By default, OmniGen generates a 1024x1024 image. You can set the height and width parameters to generate images of a different size.

prompt = "Realistic photo. A young woman sits on a sofa, holding a book and facing the camera. She wears delicate silver hoop earrings adorned with tiny, sparkling diamonds that catch the light, with her long chestnut hair cascading over her shoulders. Her eyes are focused and gentle, framed by long, dark lashes. She is dressed in a cozy cream sweater, which complements her warm, inviting smile. Behind her, there is a table with a cup of water in a sleek, minimalist blue mug. The background is a serene indoor setting with soft natural light filtering through a window, adorned with tasteful art and flowers, creating a cozy and peaceful ambiance. 4K, HD."
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=3,
    generator=torch.Generator(device="cpu").manual_seed(111),
).images[0]
image.save("output.png")
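
When experimenting with the height and width parameters mentioned above, it can help to snap arbitrary values to a size the model is likely to accept. The helper below is a minimal sketch and is not part of Diffusers: `snap_size` is a hypothetical name, and the multiple of 16 is an assumption (8x VAE downsampling followed by 2x2 patching). Check the pipeline's own input validation for the actual constraint.

```python
def snap_size(value: int, multiple: int = 16) -> int:
    """Round `value` down to the nearest positive multiple of `multiple`.

    Hypothetical helper; the default multiple of 16 is an assumption,
    not a documented OmniGen requirement.
    """
    if value < multiple:
        raise ValueError(f"value must be at least {multiple}, got {value}")
    return (value // multiple) * multiple


# Example: choose a landscape size close to 1000x750.
width, height = snap_size(1000), snap_size(750)
print(width, height)
```

The resulting values can then be passed as width and height in the pipeline call shown above.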

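OmniGen supports multimodal inputs. When the input includes an image, add the placeholder <img><|image_1|></img> to the text prompt to represent the image. It is recommended to enable use_input_image_size_as_output so that the edited image keeps the same size as the original image.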

from diffusers.utils import load_image

# The placeholder marks where the first input image is referenced in the prompt.
prompt = "<img><|image_1|></img> Remove the woman's earrings. Replace the mug with a clear glass filled with sparkling iced cola."
input_images = [load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png")]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(222),
).images[0]
image.save("output.png")
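
With more than one input image, each image needs its own numbered placeholder in the prompt. The sketch below builds such placeholders and checks that a prompt references exactly the images it will be given. `image_tag` and `check_prompt` are hypothetical helpers, not Diffusers functions, and requiring every image to be referenced is this sketch's own convention rather than a documented rule.

```python
import re

PLACEHOLDER = "<img><|image_{i}|></img>"


def image_tag(i: int) -> str:
    """Return the placeholder for the i-th input image (1-based)."""
    return PLACEHOLDER.format(i=i)


def check_prompt(prompt: str, num_images: int) -> None:
    """Raise ValueError unless the prompt references images 1..num_images."""
    found = {int(n) for n in re.findall(r"<img><\|image_(\d+)\|></img>", prompt)}
    expected = set(range(1, num_images + 1))
    if found != expected:
        raise ValueError(
            f"prompt references images {sorted(found)}, expected {sorted(expected)}"
        )


prompt = f"The woman in {image_tag(1)} waves to the man in {image_tag(2)}."
check_prompt(prompt, num_images=2)  # passes silently
print(prompt)
```

A prompt that passes the check can be handed to the pipeline together with an input_images list of the same length.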

OmniGenPipeline

class diffusers.OmniGenPipeline


( transformer: OmniGenTransformer2DModel scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKL tokenizer: LlamaTokenizer )

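Parameters

transformer (OmniGenTransformer2DModel) — Autoregressive Transformer architecture for OmniGen.
scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
vae (AutoencoderKL) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
tokenizer (LlamaTokenizer) — Text tokenizer of class LlamaTokenizer.

The OmniGen pipeline for multimodal-to-image generation.

Reference: https://huggingface.co/papers/2409.11340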

__call__


( prompt: typing.Union[str, typing.List[str]] input_images: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None num_inference_steps: int = 50 max_input_image_size: int = 1024 timesteps: typing.List[int] = None guidance_scale: float = 2.5 img_guidance_scale: float = 1.6 use_input_image_size_as_output: bool = False num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] )

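Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide image generation. If the input includes images, add the placeholder <img><|image_i|></img> to the prompt to indicate the position of the i-th image.
input_images (PipelineImageInput or List[PipelineImageInput], optional) — The list of input images. The "<|image_i|>" placeholder in the prompt is replaced with the i-th image in the list.
height (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The height in pixels of the generated image. This is set to 1024 by default for best results.
width (int, optional, defaults to self.unet.config.sample_size * self.vae_scale_factor) — The width in pixels of the generated image. This is set to 1024 by default for best results.
num_inference_steps (int, optional, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
max_input_image_size (int, optional, defaults to 1024) — The maximum size of an input image. Input images are cropped to this maximum size.
timesteps (List[int], optional) — Custom timesteps to use for the denoising process with schedulers that support a timesteps argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used. Must be in descending order.
guidance_scale (float, optional, defaults to 2.5) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
img_guidance_scale (float, optional, defaults to 1.6) — The image guidance scale, defined as in equation 3 of InstructPix2Pix.
use_input_image_size_as_output (bool, defaults to False) — Whether to use the input image size as the output image size. Useful for single-image inputs, such as image editing tasks.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generators to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL: PIL.Image.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a ~pipelines.flux.FluxPipelineOutput instead of a plain tuple.
callback_on_step_end (Callable, optional) — A function called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You can only include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.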
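
To make the two guidance parameters more concrete, the sketch below applies the combination from equation 3 of InstructPix2Pix, which the img_guidance_scale entry cites, to plain Python lists. The name `combine_guidance` and the three prediction names are hypothetical, and treating this as exactly what the pipeline computes internally is an assumption.

```python
from typing import List


def combine_guidance(
    uncond: List[float],
    img_cond: List[float],
    cond: List[float],
    guidance_scale: float = 2.5,
    img_guidance_scale: float = 1.6,
) -> List[float]:
    """Combine three noise predictions with text and image guidance.

    uncond:   prediction with neither text nor image conditioning
    img_cond: prediction with image conditioning only
    cond:     prediction with both image and text conditioning
    """
    return [
        u + img_guidance_scale * (i - u) + guidance_scale * (c - i)
        for u, i, c in zip(uncond, img_cond, cond)
    ]


# With both scales at 1 the result equals the fully conditioned prediction.
print(combine_guidance([0.0, 1.0], [0.5, 1.0], [1.0, 2.0], 1.0, 1.0))
```

Raising guidance_scale pushes the result further along the text direction (cond minus img_cond), while raising img_guidance_scale pushes it further along the image direction (img_cond minus uncond).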

Function invoked when calling the pipeline for generation.

Examples

>>> import torch
>>> from diffusers import OmniGenPipeline

>>> pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1-diffusers", torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> prompt = "A cat holding a sign that says hello world"
>>> # Depending on the variant being used, the pipeline call will slightly vary.
>>> # Refer to the pipeline documentation for more details.
>>> image = pipe(prompt, num_inference_steps=50, guidance_scale=2.5).images[0]
>>> image.save("output.png")
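
The callback_on_step_end parameter documented above takes a function with four arguments. The sketch below shows one possible shape for such a callback and exercises it with stand-in values instead of a real pipeline. `log_step` and `seen_steps` are hypothetical names, and returning callback_kwargs unchanged is an assumption based on how Diffusers callbacks generally behave.

```python
from typing import Any, Dict, List

seen_steps: List[int] = []


def log_step(
    pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]
) -> Dict[str, Any]:
    """Record each step index and hand the tensors back unchanged."""
    seen_steps.append(step)
    return callback_kwargs


# Stand-in invocation; with a real pipeline you would instead pass
# callback_on_step_end=log_step to the pipe(...) call.
returned = log_step(None, 0, 999, {"latents": "placeholder"})
print(seen_steps, returned)
```

By default only the latents tensor is listed in callback_on_step_end_tensor_inputs, so that is the only key such a callback would receive.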