Diffusers documentation

Lumina2

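Lumina Image 2.0: A Unified and Efficient Image Generative Model is a 2 billion parameter flow-based diffusion transformer capable of generating diverse images from text descriptions.

The abstract from the paper is:

We introduce Lumina-Image 2.0, an advanced text-to-image model that surpasses previous state-of-the-art methods across multiple benchmarks, while also shedding light on its potential to evolve into a generalist vision intelligence model. Lumina-Image 2.0 exhibits three key properties: (1) Unification: it adopts a unified architecture that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and facilitating task expansion. In addition, since high-quality captioners can provide semantically better-aligned text-image training pairs, we introduce a unified captioning system, UniCaptioner, which generates comprehensive and precise captions for the model. This not only accelerates model convergence but also enhances prompt adherence, variable-length prompt handling, and task generalization via prompt templates. (2) Efficiency: to improve the efficiency of the unified architecture, we develop a set of optimization techniques that improve semantic learning and fine-grained texture generation during training while incorporating inference-time acceleration strategies without compromising image quality. (3) Transparency: we open-source all training details, code, and models to ensure full reproducibility, aiming to bridge the gap between well-resourced closed-source research teams and independent developers.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

Using Single File loading with Lumina Image 2.0

Single file loading for Lumina Image 2.0 is available for the Lumina2Transformer2DModel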

import torch
from diffusers import Lumina2Transformer2DModel, Lumina2Pipeline

ckpt_path = "https://huggingface.co/Alpha-VLLM/Lumina-Image-2.0/blob/main/consolidated.00-of-01.pth"
transformer = Lumina2Transformer2DModel.from_single_file(
    ckpt_path, torch_dtype=torch.bfloat16
)

pipe = Lumina2Pipeline.from_pretrained(
    "Alpha-VLLM/Lumina-Image-2.0", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
image = pipe(
    "a cat holding a sign that says hello",
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("lumina-single-file.png")

Using GGUF Quantized Checkpoints with Lumina Image 2.0

GGUF quantized checkpoints for the Lumina2Transformer2DModel can be loaded via from_single_file with the GGUFQuantizationConfig

import torch
from diffusers import Lumina2Transformer2DModel, Lumina2Pipeline, GGUFQuantizationConfig

ckpt_path = "https://huggingface.co/calcuis/lumina-gguf/blob/main/lumina2-q4_0.gguf"
transformer = Lumina2Transformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

pipe = Lumina2Pipeline.from_pretrained(
    "Alpha-VLLM/Lumina-Image-2.0", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
image = pipe(
    "a cat holding a sign that says hello",
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("lumina-gguf.png")

Lumina2Pipeline

class diffusers.Lumina2Pipeline

< source >

( transformer: Lumina2Transformer2DModel scheduler: FlowMatchEulerDiscreteScheduler vae: AutoencoderKL text_encoder: Gemma2PreTrainedModel tokenizer: typing.Union[transformers.models.gemma.tokenization_gemma.GemmaTokenizer, transformers.models.gemma.tokenization_gemma_fast.GemmaTokenizerFast] )

Parameters

vae (AutoencoderKL): Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
text_encoder (Gemma2PreTrainedModel): Frozen Gemma2 text encoder.
tokenizer (GemmaTokenizer or GemmaTokenizerFast): Gemma tokenizer.
transformer (Lumina2Transformer2DModel): A text-conditioned Lumina2Transformer2DModel to denoise the encoded image latents.
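scheduler (SchedulerMixin): A scheduler to be used in combination with transformer to denoise the encoded image latents.

Pipeline for text-to-image generation using Lumina-T2I.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all pipelines (such as downloading or saving, running on a particular device, etc.).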

__call__

< source >

( prompt: typing.Union[str, typing.List[str]] = None width: typing.Optional[int] = None height: typing.Optional[int] = None num_inference_steps: int = 30 guidance_scale: float = 4.0 negative_prompt: typing.Union[str, typing.List[str]] = None sigmas: typing.List[float] = None num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] system_prompt: typing.Optional[str] = None cfg_trunc_ratio: float = 1.0 cfg_normalization: bool = True max_sequence_length: int = 256 ) → ImagePipelineOutput or tuple

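Parameters

prompt (str or List[str], optional): The prompt or prompts to guide the image generation. If not defined, you must pass prompt_embeds instead.
negative_prompt (str or List[str], optional): The prompt or prompts not to guide the image generation. If not defined, you must pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
num_inference_steps (int, optional, defaults to 30): The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
sigmas (List[float], optional): Custom sigmas to use for the denoising process with schedulers that support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
guidance_scale (float, optional, defaults to 4.0): Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen Paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.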
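height (int, optional): The height in pixels of the generated image. The signature default is None, in which case the pipeline chooses the size itself.
width (int, optional): The width in pixels of the generated image. The signature default is None, in which case the pipeline chooses the size itself.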
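eta (float, optional, defaults to 0.0): Corresponds to parameter eta (η) in the DDIM paper: https://huggingface.co/papers/2010.02502. Only applies to schedulers.DDIMScheduler, will be ignored for others.
generator (torch.Generator or List[torch.Generator], optional): One or a list of torch generator(s) to make generation deterministic.
latents (torch.Tensor, optional): Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random generator.
prompt_embeds (torch.Tensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
prompt_attention_mask (torch.Tensor, optional): Pre-generated attention mask for text embeddings.
negative_prompt_embeds (torch.Tensor, optional): Pre-generated negative text embeddings. For Lumina-T2I this negative prompt should be "". If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
negative_prompt_attention_mask (torch.Tensor, optional): Pre-generated attention mask for negative text embeddings.
output_type (str, optional, defaults to "pil"): The output format of the generated image. Choose between PIL (PIL.Image.Image) and np.array.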
return_dict (bool, optional, defaults to True): Whether or not to return an ImagePipelineOutput instead of a plain tuple.
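attention_kwargs (Dict[str, Any], optional): A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
callback_on_step_end (Callable, optional): A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional): The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
system_prompt (str, optional): The system prompt to use for the image generation.
cfg_trunc_ratio (float, optional, defaults to 1.0): The ratio of the timestep interval over which the normalization-based guidance scale is applied.
cfg_normalization (bool, optional, defaults to True): Whether to apply the normalization-based guidance scale.
max_sequence_length (int, defaults to 256): Maximum sequence length to use with the prompt.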
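
The way guidance_scale, cfg_trunc_ratio, and cfg_normalization might interact can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline's actual code: the helper names guidance_active and guided_prediction are hypothetical, the truncation rule (guide only during the first cfg_trunc_ratio fraction of steps) and the normalization rule (rescale the guided vector to the norm of the conditional prediction) are assumptions, and predictions are modeled as flat lists of floats rather than tensors.

```python
import math
from typing import List


def guidance_active(step_index: int, num_inference_steps: int,
                    cfg_trunc_ratio: float = 1.0) -> bool:
    """Assumed truncation rule: guide only while the fraction of completed
    steps is at most cfg_trunc_ratio (1.0 means guide on every step)."""
    return (step_index + 1) / num_inference_steps <= cfg_trunc_ratio


def guided_prediction(cond: List[float], uncond: List[float],
                      guidance_scale: float = 4.0,
                      cfg_normalization: bool = True) -> List[float]:
    """Classifier-free guidance on flat vectors (hypothetical helper)."""
    # uncond + w * (cond - uncond) is the same as w * cond + (1 - w) * uncond.
    guided = [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]
    if not cfg_normalization:
        return guided
    # Assumed normalization: keep the guided direction, but rescale it to
    # the L2 norm of the conditional prediction.
    cond_norm = math.sqrt(sum(c * c for c in cond))
    guided_norm = math.sqrt(sum(g * g for g in guided))
    if guided_norm == 0.0:
        return guided
    return [g * cond_norm / guided_norm for g in guided]
```

With guidance_scale set to 1 the guided vector equals the conditional prediction, which matches the note above that guidance only takes effect for guidance_scale > 1.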

Returns

ImagePipelineOutput or tuple

If return_dict is True, an ImagePipelineOutput is returned; otherwise a tuple is returned whose first element is a list with the generated images.
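
Code that has to work with either return style can branch on the type of the result. The sketch below uses a hypothetical helper, first_image, and simple stand-in objects in place of a real pipeline result, so it runs without loading a model.

```python
from types import SimpleNamespace


def first_image(output):
    """Return the first generated image from either return style.

    `output` is either an object with an `.images` list (return_dict=True)
    or a tuple whose first element is the list of images (return_dict=False).
    """
    images = output[0] if isinstance(output, tuple) else output.images
    return images[0]


# Stand-ins for real pipeline results, used only for illustration.
dict_style = SimpleNamespace(images=["image-0", "image-1"])
tuple_style = (["image-0", "image-1"],)
```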

Function invoked when calling the pipeline for generation.

Example

>>> import torch
>>> from diffusers import Lumina2Pipeline

>>> pipe = Lumina2Pipeline.from_pretrained("Alpha-VLLM/Lumina-Image-2.0", torch_dtype=torch.bfloat16)
>>> # Enable memory optimizations.
>>> pipe.enable_model_cpu_offload()

>>> prompt = "Upper body of a young woman in a Victorian-era outfit with brass goggles and leather straps. Background shows an industrial revolution cityscape with smoky skies and tall, metal structures"
>>> image = pipe(prompt).images[0]
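
A callback_on_step_end function can be sketched as follows. The factory name make_step_logger is hypothetical; the callback takes the four arguments listed in the parameter description and returns callback_kwargs unchanged, which follows the general Diffusers callback convention and is assumed to apply to this pipeline as well.

```python
from typing import Any, Callable, Dict, List


def make_step_logger(log: List[int]) -> Callable[..., Dict[str, Any]]:
    """Build a callback that appends each finished step index to `log`."""

    def on_step_end(pipe, step: int, timestep,
                    callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
        log.append(step)
        # Hand the tensors back unmodified so denoising continues as usual.
        return callback_kwargs

    return on_step_end


# Usage sketch (needs the model weights, so it is left commented out):
# steps_seen = []
# image = pipe(
#     prompt,
#     callback_on_step_end=make_step_logger(steps_seen),
#     callback_on_step_end_tensor_inputs=["latents"],
# ).images[0]
```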

disable_vae_slicing

< source >

( )

Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method will go back to computing decoding in one step.

disable_vae_tiling

< source >

( )

Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method will go back to computing decoding in one step.

enable_vae_slicing

< source >

( )

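Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor into slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.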

enable_vae_tiling

< source >

( )

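Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and for processing larger images.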
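
The idea of splitting an input into tiles can be illustrated with a small, library-independent sketch. The function tile_boxes and its tile and overlap parameters are hypothetical and chosen for illustration; the VAE's real tile sizes, overlap, and blending are not described here.

```python
from typing import List, Tuple


def tile_boxes(height: int, width: int, tile: int = 4,
               overlap: int = 1) -> List[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) boxes that cover a height x width
    grid with overlapping tiles of at most tile x tile cells."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    stride = tile - overlap

    def starts(size: int) -> range:
        # Stop once the previous tile already reaches the end of the axis.
        return range(0, max(size - overlap, 1), stride)

    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in starts(height)
        for left in starts(width)
    ]
```

Each box can then be processed on its own, so peak memory depends on the tile size rather than on the full image size.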

encode_prompt

< source >

( prompt: typing.Union[str, typing.List[str]] do_classifier_free_guidance: bool = True negative_prompt: typing.Union[str, typing.List[str]] = None num_images_per_prompt: int = 1 device: typing.Optional[torch.device] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None system_prompt: typing.Optional[str] = None max_sequence_length: int = 256 )

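Parameters

prompt (str or List[str], optional): The prompt to be encoded.
negative_prompt (str or List[str], optional): The prompt not to guide the image generation. If not defined, you must pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1). For Lumina-T2I, this should be "".
do_classifier_free_guidance (bool, optional, defaults to True): Whether to use classifier-free guidance.
num_images_per_prompt (int, optional, defaults to 1): Number of images that should be generated per prompt.
device (torch.device, optional): The torch device on which to place the resulting embeddings.
prompt_embeds (torch.Tensor, optional): Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional): Pre-generated negative text embeddings. For Lumina-T2I, it should be the embeddings of the "" string.
max_sequence_length (int, defaults to 256): Maximum sequence length to use for the prompt.

Encodes the prompt into text encoder hidden states.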
