Cosmos

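Cosmos World Foundation Model Platform for Physical AI, by NVIDIA.

Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses, available via https://github.com/NVIDIA/Cosmos.

Make sure to check out the Schedulers guide to learn how to trade off scheduler speed against quality, and see the section on reusing components across pipelines to learn how to efficiently load the same components into multiple pipelines.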

CosmosTextToWorldPipeline

class diffusers.CosmosTextToWorldPipeline

< source >

( text_encoder: T5EncoderModel tokenizer: T5TokenizerFast transformer: CosmosTransformer3DModel vae: AutoencoderKLCosmos scheduler: EDMEulerScheduler safety_checker: CosmosSafetyChecker = None )

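Parameters

text_encoder (T5EncoderModel) — Frozen text encoder. Cosmos uses T5, specifically the t5-11b variant.
tokenizer (T5TokenizerFast) — Tokenizer of class T5Tokenizer.
transformer (CosmosTransformer3DModel) — Conditional Transformer to denoise the encoded image latents.
scheduler (EDMEulerScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
vae (AutoencoderKLCosmos) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.

Pipeline for text-to-world generation using Cosmos Predict1.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

< source >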

( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 704 width: int = 1280 num_frames: int = 121 num_inference_steps: int = 36 guidance_scale: float = 7.0 fps: int = 30 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) → ~CosmosPipelineOutput or tuple

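Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, prompt_embeds must be passed instead.
height (int, defaults to 704) — The height in pixels of the generated video.
width (int, defaults to 1280) — The width in pixels of the generated video.
num_frames (int, defaults to 121) — The number of frames in the generated video.
num_inference_steps (int, defaults to 36) — The number of denoising steps. More denoising steps usually lead to higher-quality output at the expense of slower inference.
guidance_scale (float, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1.
fps (int, defaults to 30) — The frames per second of the generated video.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated frames. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a CosmosPipelineOutput instead of a plain tuple.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function, or a subclass of PipelineCallback or MultiPipelineCallbacks, that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

~CosmosPipelineOutput or tuple

If return_dict is True, CosmosPipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content.

The call function to the pipeline for generation.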
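As a quick sanity check on how num_frames and fps relate, the clip length is the frame count divided by the frame rate. The helper below is an illustrative sketch; clip_duration_seconds is a hypothetical name and not part of diffusers.

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Return the playback length in seconds of a clip with num_frames frames at fps."""
    if num_frames <= 0 or fps <= 0:
        raise ValueError("num_frames and fps must be positive")
    return num_frames / fps


# Defaults from the signature above: 121 frames at 30 fps.
default_duration = clip_duration_seconds(121, 30)
print(f"{default_duration:.2f} s")  # prints "4.03 s"
```

Exporting the result at the same fps used for generation, as the example in this section does, keeps playback at the intended speed.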

Example

>>> import torch
>>> from diffusers import CosmosTextToWorldPipeline
>>> from diffusers.utils import export_to_video

>>> model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Text2World"
>>> pipe = CosmosTextToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> prompt = "A sleek, humanoid robot stands in a vast warehouse filled with neatly stacked cardboard boxes on industrial shelves. The robot's metallic body gleams under the bright, even lighting, highlighting its futuristic design and intricate joints. A glowing blue light emanates from its chest, adding a touch of advanced technology. The background is dominated by rows of boxes, suggesting a highly organized storage system. The floor is lined with wooden pallets, enhancing the industrial setting. The camera remains static, capturing the robot's poised stance amidst the orderly environment, with a shallow depth of field that keeps the focus on the robot while subtly blurring the background for a cinematic effect."

>>> output = pipe(prompt=prompt).frames[0]
>>> export_to_video(output, "output.mp4", fps=30)

encode_prompt

< source >

( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_videos_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )

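Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos that should be generated per prompt.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
device (torch.device, optional) — The torch device on which to place the resulting embeddings.
dtype (torch.dtype, optional) — The torch dtype.

Encodes the prompt into text encoder hidden states.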
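The num_videos_per_prompt argument implies that each prompt's embedding is needed once per requested video. The sketch below illustrates that bookkeeping with plain Python lists standing in for tensors. expand_per_prompt is a hypothetical helper, and the ordering it uses (all copies of one prompt kept adjacent) is an assumption about layout, not a statement about the pipeline's internals.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def expand_per_prompt(embeddings: Sequence[T], num_videos_per_prompt: int) -> List[T]:
    """Repeat each per-prompt entry so there is one entry per generated video."""
    if num_videos_per_prompt < 1:
        raise ValueError("num_videos_per_prompt must be at least 1")
    return [emb for emb in embeddings for _ in range(num_videos_per_prompt)]


# Two prompts, two videos each: an effective batch of four entries.
batch = expand_per_prompt(["emb_robot", "emb_highway"], 2)
print(len(batch))
```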

CosmosVideoToWorldPipeline

class diffusers.CosmosVideoToWorldPipeline

< source >

( text_encoder: T5EncoderModel tokenizer: T5TokenizerFast transformer: CosmosTransformer3DModel vae: AutoencoderKLCosmos scheduler: EDMEulerScheduler safety_checker: CosmosSafetyChecker = None )

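Parameters

text_encoder (T5EncoderModel) — Frozen text encoder. Cosmos uses T5, specifically the t5-11b variant.
tokenizer (T5TokenizerFast) — Tokenizer of class T5Tokenizer.
transformer (CosmosTransformer3DModel) — Conditional Transformer to denoise the encoded image latents.
scheduler (EDMEulerScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
vae (AutoencoderKLCosmos) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.

Pipeline for image-to-world and video-to-world generation using Cosmos Predict1.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__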

< source >

( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None video: typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]] = None prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 704 width: int = 1280 num_frames: int = 121 num_inference_steps: int = 36 guidance_scale: float = 7.0 input_frames_guidance: bool = False augment_sigma: float = 0.001 fps: int = 30 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) → ~CosmosPipelineOutput or tuple

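Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, prompt_embeds must be passed instead.
height (int, defaults to 704) — The height in pixels of the generated video.
width (int, defaults to 1280) — The width in pixels of the generated video.
num_frames (int, defaults to 121) — The number of frames in the generated video.
num_inference_steps (int, defaults to 36) — The number of denoising steps. More denoising steps usually lead to higher-quality output at the expense of slower inference.
guidance_scale (float, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1.
fps (int, defaults to 30) — The frames per second of the generated video.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated frames. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a CosmosPipelineOutput instead of a plain tuple.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function, or a subclass of PipelineCallback or MultiPipelineCallbacks, that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

~CosmosPipelineOutput or tuple

The call function to the pipeline for generation.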
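The callback_on_step_end signature described above can be exercised without loading a model. The following sketch defines a callback with the documented parameters and drives it from a stand-in loop. FakePipeline, the step count, and the string placeholder for latents are all hypothetical and exist only so the example runs on its own. Returning callback_kwargs from the callback is an assumption based on how such callbacks are commonly written.

```python
from typing import Any, Callable, Dict, List

seen_steps: List[int] = []


def log_step(pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Record each step index and hand the tensors back unchanged."""
    seen_steps.append(step)
    return callback_kwargs


class FakePipeline:
    """Hypothetical stand-in that only mimics the per-step callback hook."""

    def run(self, num_inference_steps: int, callback: Callable[..., Dict[str, Any]]) -> Dict[str, Any]:
        kwargs: Dict[str, Any] = {"latents": "placeholder-latents"}
        for step in range(num_inference_steps):
            timestep = num_inference_steps - step  # placeholder value, not a real schedule
            kwargs = callback(self, step, timestep, kwargs)
        return kwargs


result = FakePipeline().run(num_inference_steps=3, callback=log_step)
print(seen_steps)
```

With a real pipeline, the same function would be passed as callback_on_step_end=log_step, with callback_on_step_end_tensor_inputs left at its default of ['latents'].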

Examples

Image conditioning

>>> import torch
>>> from diffusers import CosmosVideoToWorldPipeline
>>> from diffusers.utils import export_to_video, load_image

>>> model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Video2World"
>>> pipe = CosmosVideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> prompt = "The video depicts a long, straight highway stretching into the distance, flanked by metal guardrails. The road is divided into multiple lanes, with a few vehicles visible in the far distance. The surrounding landscape features dry, grassy fields on one side and rolling hills on the other. The sky is mostly clear with a few scattered clouds, suggesting a bright, sunny day."
>>> image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input.jpg"
... )

>>> video = pipe(image=image, prompt=prompt).frames[0]
>>> export_to_video(video, "output.mp4", fps=30)

Video conditioning

>>> import torch
>>> from diffusers import CosmosVideoToWorldPipeline
>>> from diffusers.utils import export_to_video, load_video

>>> model_id = "nvidia/Cosmos-1.0-Diffusion-7B-Video2World"
>>> pipe = CosmosVideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.transformer = torch.compile(pipe.transformer)
>>> pipe.to("cuda")

>>> prompt = "The video depicts a winding mountain road covered in snow, with a single vehicle traveling along it. The road is flanked by steep, rocky cliffs and sparse vegetation. The landscape is characterized by rugged terrain and a river visible in the distance. The scene captures the solitude and beauty of a winter drive through a mountainous region."
>>> video = load_video(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cosmos/cosmos-video2world-input-vid.mp4"
... )[
...     :21
... ]  # This example uses only the first 21 frames

>>> video = pipe(video=video, prompt=prompt).frames[0]
>>> export_to_video(video, "output.mp4", fps=30)

encode_prompt

< source >

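Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos that should be generated per prompt.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
device (torch.device, optional) — The torch device on which to place the resulting embeddings.
dtype (torch.dtype, optional) — The torch dtype.

Encodes the prompt into text encoder hidden states.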

Cosmos2TextToImagePipeline

class diffusers.Cosmos2TextToImagePipeline

< source >

( text_encoder: T5EncoderModel tokenizer: T5TokenizerFast transformer: CosmosTransformer3DModel vae: AutoencoderKLWan scheduler: FlowMatchEulerDiscreteScheduler safety_checker: CosmosSafetyChecker = None )

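Parameters

text_encoder (T5EncoderModel) — Frozen text encoder. Cosmos uses T5, specifically the t5-11b variant.
tokenizer (T5TokenizerFast) — Tokenizer of class T5Tokenizer.
transformer (CosmosTransformer3DModel) — Conditional Transformer to denoise the encoded image latents.
scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.

Pipeline for text-to-image generation using Cosmos Predict2.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__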

< source >

( prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 768 width: int = 1360 num_inference_steps: int = 35 guidance_scale: float = 7.0 num_images_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 ) → ~CosmosImagePipelineOutput or tuple

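Parameters

prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, prompt_embeds must be passed instead.
height (int, defaults to 768) — The height in pixels of the generated image.
width (int, defaults to 1360) — The width in pixels of the generated image.
num_inference_steps (int, defaults to 35) — The number of denoising steps. More denoising steps usually lead to a higher-quality image at the expense of slower inference.
guidance_scale (float, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1.
num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a CosmosImagePipelineOutput instead of a plain tuple.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function, or a subclass of PipelineCallback or MultiPipelineCallbacks, that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.

~CosmosImagePipelineOutput or tuple

If return_dict is True, CosmosImagePipelineOutput is returned, otherwise a tuple is returned where the first element is a list with the generated images and the second element is a list of bools indicating whether the corresponding generated image contains "not-safe-for-work" (nsfw) content.

The call function to the pipeline for generation.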
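The guidance_scale description refers to w in equation 2 of the Imagen paper, which combines a conditional and an unconditional noise prediction as w * cond + (1 - w) * uncond. The sketch below applies that formula element-wise to plain Python lists. apply_guidance is a hypothetical helper, and the pipeline's own implementation may use a different but related parameterization.

```python
from typing import List


def apply_guidance(cond: List[float], uncond: List[float], w: float) -> List[float]:
    """Combine predictions as w * cond + (1 - w) * uncond (Imagen, equation 2)."""
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [w * c + (1.0 - w) * u for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]
uncond = [0.5, 1.0]

# w = 1 reproduces the conditional prediction, i.e. no extra guidance.
print(apply_guidance(cond, uncond, 1.0))  # [1.0, 2.0]
# w = 7 (the documented default) extrapolates away from the unconditional prediction.
print(apply_guidance(cond, uncond, 7.0))  # [4.0, 8.0]
```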

Example

>>> import torch
>>> from diffusers import Cosmos2TextToImagePipeline

>>> # Available checkpoints: nvidia/Cosmos-Predict2-2B-Text2Image, nvidia/Cosmos-Predict2-14B-Text2Image
>>> model_id = "nvidia/Cosmos-Predict2-2B-Text2Image"
>>> pipe = Cosmos2TextToImagePipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> prompt = "A close-up shot captures a vibrant yellow scrubber vigorously working on a grimy plate, its bristles moving in circular motions to lift stubborn grease and food residue. The dish, once covered in remnants of a hearty meal, gradually reveals its original glossy surface. Suds form and bubble around the scrubber, creating a satisfying visual of cleanliness in progress. The sound of scrubbing fills the air, accompanied by the gentle clinking of the dish against the sink. As the scrubber continues its task, the dish transforms, gleaming under the bright kitchen lights, symbolizing the triumph of cleanliness over mess."
>>> negative_prompt = "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality."

>>> output = pipe(
...     prompt=prompt, negative_prompt=negative_prompt, generator=torch.Generator().manual_seed(1)
... ).images[0]
>>> output.save("output.png")

encode_prompt

< source >

( prompt: typing.Union[str, typing.List[str]] negative_prompt: typing.Union[str, typing.List[str], NoneType] = None do_classifier_free_guidance: bool = True num_images_per_prompt: int = 1 prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None max_sequence_length: int = 512 device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None )

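Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
num_images_per_prompt (int, optional, defaults to 1) — The number of images that should be generated per prompt.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
device (torch.device, optional) — The torch device on which to place the resulting embeddings.
dtype (torch.dtype, optional) — The torch dtype.

Encodes the prompt into text encoder hidden states.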

Cosmos2VideoToWorldPipeline

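class diffusers.Cosmos2VideoToWorldPipeline

< source >

Parameters

text_encoder (T5EncoderModel) — Frozen text encoder. Cosmos uses T5, specifically the t5-11b variant.
tokenizer (T5TokenizerFast) — Tokenizer of class T5Tokenizer.
transformer (CosmosTransformer3DModel) — Conditional Transformer to denoise the encoded image latents.
scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
vae (AutoencoderKLWan) — Variational Auto-Encoder (VAE) model to encode and decode videos to and from latent representations.

Pipeline for video-to-world generation using Cosmos Predict2.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

< source >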

( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] = None video: typing.List[typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]]] = None prompt: typing.Union[str, typing.List[str]] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None height: int = 704 width: int = 1280 num_frames: int = 93 num_inference_steps: int = 35 guidance_scale: float = 7.0 fps: int = 16 num_videos_per_prompt: typing.Optional[int] = 1 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] max_sequence_length: int = 512 sigma_conditioning: float = 0.0001 ) → ~CosmosPipelineOutput or tuple

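Parameters

image (PIL.Image.Image, np.ndarray, torch.Tensor, optional) — The image to be used as a conditioning input for the video generation.
video (List[PIL.Image.Image], np.ndarray, torch.Tensor, optional) — The video to be used as a conditioning input for the video generation.
prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, prompt_embeds must be passed instead.
height (int, defaults to 704) — The height in pixels of the generated video.
width (int, defaults to 1280) — The width in pixels of the generated video.
num_frames (int, defaults to 93) — The number of frames in the generated video.
num_inference_steps (int, defaults to 35) — The number of denoising steps. More denoising steps usually lead to higher-quality output at the expense of slower inference.
guidance_scale (float, defaults to 7.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1.
fps (int, defaults to 16) — The frames per second of the generated video.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling with the supplied random generator.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
output_type (str, optional, defaults to "pil") — The output format of the generated frames. Choose between PIL.Image or np.array.
return_dict (bool, optional, defaults to True) — Whether or not to return a CosmosPipelineOutput instead of a plain tuple.
callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function, or a subclass of PipelineCallback or MultiPipelineCallbacks, that is called at the end of each denoising step during inference with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
max_sequence_length (int, defaults to 512) — The maximum number of tokens in the prompt. If the prompt exceeds this length, it will be truncated. If the prompt is shorter than this length, it will be padded.
sigma_conditioning (float, defaults to 0.0001) — The sigma value used for scaling conditioning latents. Ideally, it should not be changed or should be set to a small value close to zero.

~CosmosPipelineOutput or tuple

The call function to the pipeline for generation.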
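The max_sequence_length entry above says prompts are truncated or padded to a fixed token count. A minimal sketch of that behavior on a list of token IDs follows. fit_to_length is a hypothetical helper, the pad ID of 0 is an assumption, and a real tokenizer would also handle special tokens and attention masks.

```python
from typing import List


def fit_to_length(token_ids: List[int], max_sequence_length: int = 512, pad_id: int = 0) -> List[int]:
    """Truncate or right-pad token_ids so the result has exactly max_sequence_length entries."""
    if max_sequence_length < 1:
        raise ValueError("max_sequence_length must be positive")
    clipped = token_ids[:max_sequence_length]
    # pad_id = 0 is an assumed padding value for illustration.
    return clipped + [pad_id] * (max_sequence_length - len(clipped))


short = fit_to_length([11, 12, 13], max_sequence_length=5)
long = fit_to_length(list(range(1, 9)), max_sequence_length=5)
print(short, long)
```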

Example

>>> import torch
>>> from diffusers import Cosmos2VideoToWorldPipeline
>>> from diffusers.utils import export_to_video, load_image

>>> # Available checkpoints: nvidia/Cosmos-Predict2-2B-Video2World, nvidia/Cosmos-Predict2-14B-Video2World
>>> model_id = "nvidia/Cosmos-Predict2-2B-Video2World"
>>> pipe = Cosmos2VideoToWorldPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
>>> pipe.to("cuda")

>>> prompt = "A close-up shot captures a vibrant yellow scrubber vigorously working on a grimy plate, its bristles moving in circular motions to lift stubborn grease and food residue. The dish, once covered in remnants of a hearty meal, gradually reveals its original glossy surface. Suds form and bubble around the scrubber, creating a satisfying visual of cleanliness in progress. The sound of scrubbing fills the air, accompanied by the gentle clinking of the dish against the sink. As the scrubber continues its task, the dish transforms, gleaming under the bright kitchen lights, symbolizing the triumph of cleanliness over mess."
>>> negative_prompt = "The video captures a series of frames showing ugly scenes, static with no motion, motion blur, over-saturation, shaky footage, low resolution, grainy texture, pixelated images, poorly lit areas, underexposed and overexposed scenes, poor color balance, washed out colors, choppy sequences, jerky movements, low frame rate, artifacting, color banding, unnatural transitions, outdated special effects, fake elements, unconvincing visuals, poorly edited content, jump cuts, visual noise, and flickering. Overall, the video is of poor quality."
>>> image = load_image(
...     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/yellow-scrubber.png"
... )

>>> video = pipe(
...     image=image, prompt=prompt, negative_prompt=negative_prompt, generator=torch.Generator().manual_seed(1)
... ).frames[0]
>>> export_to_video(video, "output.mp4", fps=16)

encode_prompt

< source >

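Parameters

prompt (str or List[str], optional) — The prompt to be encoded.
negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, negative_prompt_embeds must be passed instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
do_classifier_free_guidance (bool, optional, defaults to True) — Whether to use classifier-free guidance or not.
num_videos_per_prompt (int, optional, defaults to 1) — The number of videos that should be generated per prompt.
prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings are generated from the prompt input argument.
negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
device (torch.device, optional) — The torch device on which to place the resulting embeddings.
dtype (torch.dtype, optional) — The torch dtype.

Encodes the prompt into text encoder hidden states.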

CosmosPipelineOutput

class diffusers.pipelines.cosmos.pipeline_output.CosmosPipelineOutput

< source >

( frames: Tensor )

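Parameters

frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs. It can be a nested list of length batch_size, with each sub-list containing a denoised PIL image sequence of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).

Output class for Cosmos any-to-world/video pipelines.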
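To make the (batch_size, num_frames, channels, height, width) layout concrete, the sketch below computes the shape and element count for a single video at the CosmosTextToWorldPipeline defaults (121 frames, 704 by 1280). frames_shape is a hypothetical helper, and channels=3 (RGB) is an assumption.

```python
from math import prod
from typing import Tuple


def frames_shape(
    batch_size: int, num_frames: int, height: int, width: int, channels: int = 3
) -> Tuple[int, int, int, int, int]:
    """Return the (batch_size, num_frames, channels, height, width) shape tuple."""
    return (batch_size, num_frames, channels, height, width)


shape = frames_shape(batch_size=1, num_frames=121, height=704, width=1280)
num_values = prod(shape)  # total number of scalar values in the array
print(shape, num_values)
```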

CosmosImagePipelineOutput

class diffusers.pipelines.cosmos.pipeline_output.CosmosImagePipelineOutput

< source >

( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )

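Parameters

images (List[PIL.Image.Image] or np.ndarray) — List of denoised PIL images of length batch_size, or a NumPy array of shape (batch_size, height, width, num_channels). The PIL images or NumPy array represent the denoised images of the diffusion pipeline.

Output class for Cosmos any-to-image pipelines.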
