AutoencoderDC

The 2D autoencoder model used in SANA, introduced in DCAE by Junyu Chen*, Han Cai*, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han of the MIT HAN Lab.

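The abstract from the paper is:

We present Deep Compression Autoencoder (DC-AE), a new family of autoencoder models for accelerating high-resolution diffusion models. Existing autoencoder models have demonstrated impressive results at a moderate spatial compression ratio (e.g., 8x), but fail to maintain satisfactory reconstruction accuracy for high spatial compression ratios (e.g., 64x). We address this challenge by introducing two key techniques: (1) Residual Autoencoding, where we design our models to learn residuals based on the space-to-channel transformed features to alleviate the optimization difficulty of high spatial-compression autoencoders; (2) Decoupled High-Resolution Adaptation, an efficient decoupled three-phase training strategy for mitigating the generalization penalty of high spatial-compression autoencoders. With these designs, we improve the autoencoder's spatial compression ratio up to 128 while maintaining the reconstruction quality. Applying our DC-AE to latent diffusion models, we achieve significant speedup without accuracy drop. For example, on ImageNet 512x512, our DC-AE provides 19.1x inference speedup and 17.9x training speedup on H100 GPU for UViT-H while achieving a better FID, compared with the widely used SD-VAE-f8 autoencoder. Our code is available at this https URL.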
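The space-to-channel transformation mentioned in the abstract can be illustrated with a small pure-Python sketch. It only shows the rearrangement of pixels into channels (one common ordering, on nested lists shaped `[channels][height][width]`); it is not the residual autoencoding block from the paper, and the function name `space_to_channel` is made up for this illustration.

```python
def space_to_channel(x, r):
    """Move each r x r spatial patch into r*r channels.

    x is a nested list shaped [channels][height][width]; the result is
    shaped [channels * r * r][height // r][width // r].
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    if height % r or width % r:
        raise ValueError("height and width must be divisible by r")
    out = []
    for c in range(channels):
        for i in range(r):          # row offset inside each patch
            for j in range(r):      # column offset inside each patch
                out.append([
                    [x[c][h * r + i][w * r + j] for w in range(width // r)]
                    for h in range(height // r)
                ])
    return out


# One 4x4 channel becomes four 2x2 channels; no values are lost.
image = [[[0, 1, 2, 3],
          [4, 5, 6, 7],
          [8, 9, 10, 11],
          [12, 13, 14, 15]]]
packed = space_to_channel(image, 2)
print(len(packed), len(packed[0]), len(packed[0][0]))
```

Because the rearrangement keeps every value, spatial resolution is traded for channel depth without discarding information, which is why it can serve as a shortcut path alongside learned layers.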

The following DCAE models are released and supported in Diffusers.

Diffusers format	Original format
`mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers`	`mit-han-lab/dc-ae-f32c32-sana-1.0`
`mit-han-lab/dc-ae-f32c32-in-1.0-diffusers`	`mit-han-lab/dc-ae-f32c32-in-1.0`
`mit-han-lab/dc-ae-f32c32-mix-1.0-diffusers`	`mit-han-lab/dc-ae-f32c32-mix-1.0`
`mit-han-lab/dc-ae-f64c128-in-1.0-diffusers`	`mit-han-lab/dc-ae-f64c128-in-1.0`
`mit-han-lab/dc-ae-f64c128-mix-1.0-diffusers`	`mit-han-lab/dc-ae-f64c128-mix-1.0`
`mit-han-lab/dc-ae-f128c512-in-1.0-diffusers`	`mit-han-lab/dc-ae-f128c512-in-1.0`
`mit-han-lab/dc-ae-f128c512-mix-1.0-diffusers`	`mit-han-lab/dc-ae-f128c512-mix-1.0`
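The checkpoint names above appear to encode the model geometry: assuming the `f<N>c<M>` tag means a spatial compression factor of N and M latent channels (the DC-AE naming convention as understood here, not something this page states), the latent shape for a given image size can be estimated as below. The helper names are hypothetical.

```python
import re


def parse_dcae_name(repo_id):
    """Return (spatial_compression, latent_channels) from an 'f<N>c<M>' tag."""
    match = re.search(r"-f(\d+)c(\d+)-", repo_id)
    if match is None:
        raise ValueError(f"no f<N>c<M> tag found in {repo_id!r}")
    return int(match.group(1)), int(match.group(2))


def latent_shape(repo_id, height, width):
    """Estimated latent (channels, height, width) for one image."""
    factor, channels = parse_dcae_name(repo_id)
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by the compression factor")
    return channels, height // factor, width // factor


for name in (
    "mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers",
    "mit-han-lab/dc-ae-f64c128-in-1.0-diffusers",
    "mit-han-lab/dc-ae-f128c512-mix-1.0-diffusers",
):
    print(name, latent_shape(name, 1024, 1024))
```

Under this reading, a higher compression factor shrinks the spatial grid the diffusion model has to process while the channel count grows to carry the information.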

This model was contributed by lawrence-cj.

Load a model in Diffusers format with from_pretrained().

import torch
from diffusers import AutoencoderDC

ae = AutoencoderDC.from_pretrained("mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers", torch_dtype=torch.float32).to("cuda")

Load a model in Diffusers via from_single_file

from diffusers import AutoencoderDC

ckpt_path = "https://huggingface.co/mit-han-lab/dc-ae-f32c32-sana-1.0/blob/main/model.safetensors"
model = AutoencoderDC.from_single_file(ckpt_path)

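The AutoencoderDC model has in and mix single-file checkpoint variants that have matching checkpoint keys but use different scaling factors. Diffusers cannot infer the correct config file from the checkpoint alone, so it defaults to configuring the model with the mix variant config file. To override the automatically determined config, pass the config argument when single-file loading an in variant checkpoint.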

from diffusers import AutoencoderDC

ckpt_path = "https://huggingface.co/mit-han-lab/dc-ae-f128c512-in-1.0/blob/main/model.safetensors"
model = AutoencoderDC.from_single_file(ckpt_path, config="mit-han-lab/dc-ae-f128c512-in-1.0-diffusers")

AutoencoderDC

class diffusers.AutoencoderDC

(
    in_channels: int = 3,
    latent_channels: int = 32,
    attention_head_dim: int = 32,
    encoder_block_types: typing.Union[str, typing.Tuple[str]] = 'ResBlock',
    decoder_block_types: typing.Union[str, typing.Tuple[str]] = 'ResBlock',
    encoder_block_out_channels: typing.Tuple[int, ...] = (128, 256, 512, 512, 1024, 1024),
    decoder_block_out_channels: typing.Tuple[int, ...] = (128, 256, 512, 512, 1024, 1024),
    encoder_layers_per_block: typing.Tuple[int] = (2, 2, 2, 3, 3, 3),
    decoder_layers_per_block: typing.Tuple[int] = (3, 3, 3, 3, 3, 3),
    encoder_qkv_multiscales: typing.Tuple[typing.Tuple[int, ...], ...] = ((), (), (), (5,), (5,), (5,)),
    decoder_qkv_multiscales: typing.Tuple[typing.Tuple[int, ...], ...] = ((), (), (), (5,), (5,), (5,)),
    upsample_block_type: str = 'pixel_shuffle',
    downsample_block_type: str = 'pixel_unshuffle',
    decoder_norm_types: typing.Union[str, typing.Tuple[str]] = 'rms_norm',
    decoder_act_fns: typing.Union[str, typing.Tuple[str]] = 'silu',
    scaling_factor: float = 1.0
)

Parameters

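in_channels (int, defaults to 3): The number of input channels in samples.
latent_channels (int, defaults to 32): The number of channels in the latent space representation.
encoder_block_types (Union[str, Tuple[str]], defaults to "ResBlock"): The type(s) of block to use in the encoder.
decoder_block_types (Union[str, Tuple[str]], defaults to "ResBlock"): The type(s) of block to use in the decoder.
encoder_block_out_channels (Tuple[int, ...], defaults to (128, 256, 512, 512, 1024, 1024)): The number of output channels for each block in the encoder.
decoder_block_out_channels (Tuple[int, ...], defaults to (128, 256, 512, 512, 1024, 1024)): The number of output channels for each block in the decoder.
encoder_layers_per_block (Tuple[int], defaults to (2, 2, 2, 3, 3, 3)): The number of layers per block in the encoder.
decoder_layers_per_block (Tuple[int], defaults to (3, 3, 3, 3, 3, 3)): The number of layers per block in the decoder.
encoder_qkv_multiscales (Tuple[Tuple[int, ...], ...], defaults to ((), (), (), (5,), (5,), (5,))): Multi-scale configurations for the encoder's QKV (query-key-value) transformations.
decoder_qkv_multiscales (Tuple[Tuple[int, ...], ...], defaults to ((), (), (), (5,), (5,), (5,))): Multi-scale configurations for the decoder's QKV (query-key-value) transformations.
upsample_block_type (str, defaults to "pixel_shuffle"): The type of block to use for upsampling in the decoder.
downsample_block_type (str, defaults to "pixel_unshuffle"): The type of block to use for downsampling in the encoder.
decoder_norm_types (Union[str, Tuple[str]], defaults to "rms_norm"): The normalization type(s) to use in the decoder.
decoder_act_fns (Union[str, Tuple[str]], defaults to "silu"): The activation function(s) to use in the decoder.
scaling_factor (float, defaults to 1.0): The multiplicative inverse of the root mean square of the latent features. It is used to scale the latent space to unit variance when training the diffusion model. Before being passed to the diffusion model, the latents are scaled with z = z * scaling_factor. When decoding, the latents are scaled back to the original scale with z = 1 / scaling_factor * z.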
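The scaling_factor formulas can be checked with a small standalone sketch. It uses toy Gaussian values in place of real latents (an assumption for illustration only) and computes the factor as the inverse root mean square, as the parameter description states.

```python
import math
import random


def rms(values):
    """Root mean square of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


random.seed(0)
# Toy stand-in for latent features: zero-mean values with standard deviation 3.
latents = [random.gauss(0.0, 3.0) for _ in range(10_000)]

# Multiplicative inverse of the RMS of the latent features.
scaling_factor = 1.0 / rms(latents)

# Before the diffusion model: z = z * scaling_factor
scaled = [z * scaling_factor for z in latents]

# Before decoding: z = 1 / scaling_factor * z
restored = [1.0 / scaling_factor * z for z in scaled]

print(rms(scaled))  # close to 1.0 up to floating-point error
```

For zero-mean latents an RMS of 1 corresponds to unit variance, which is the property the diffusion model is trained against. With a loaded model, the stored value would normally be read from the model config rather than recomputed.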

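An autoencoder model introduced in DCAE and used in SANA.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods implemented for all models (such as downloading or saving).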

wrapper

( *args, **kwargs )

disable_slicing

( )

Disable sliced AE decoding. If enable_slicing was previously enabled, this method goes back to computing decoding in one step.

disable_tiling

( )

Disable tiled AE decoding. If enable_tiling was previously enabled, this method goes back to computing decoding in one step.

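enable_slicing

( )

Enable sliced AE decoding. When this option is enabled, the AE splits the input tensor into slices and computes decoding in several steps. This is useful for saving memory and allowing larger batch sizes.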
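The idea behind sliced decoding can be sketched without the library: split the batch into slices, decode each slice on its own, and concatenate the results. The `decode_batch` function below is a hypothetical stand-in for a real decoder, and `sliced_decode` is an illustration of the concept, not the Diffusers implementation.

```python
def decode_batch(latents):
    """Hypothetical stand-in decoder: doubles every value in every sample."""
    return [[2 * v for v in sample] for sample in latents]


def sliced_decode(latents, slice_size=1):
    """Decode a batch slice by slice and concatenate the outputs."""
    outputs = []
    for start in range(0, len(latents), slice_size):
        # Only `slice_size` samples are passed to the decoder at a time.
        outputs.extend(decode_batch(latents[start:start + slice_size]))
    return outputs


batch = [[1, 2], [3, 4], [5, 6]]
print(sliced_decode(batch) == decode_batch(batch))
```

The result matches one-step decoding because samples in a batch are decoded independently; the saving comes from the smaller peak memory per decoder call, at the cost of more calls.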

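enable_tiling

(
    tile_sample_min_height: typing.Optional[int] = None,
    tile_sample_min_width: typing.Optional[int] = None,
    tile_sample_stride_height: typing.Optional[float] = None,
    tile_sample_stride_width: typing.Optional[float] = None
)

Parameters

tile_sample_min_height (int, optional): The minimum height required for a sample to be separated into tiles across the height dimension.
tile_sample_min_width (int, optional): The minimum width required for a sample to be separated into tiles across the width dimension.
tile_sample_stride_height (int, optional): The minimum amount of overlap between two consecutive vertical tiles. This ensures that no tiling artifacts are produced across the height dimension.
tile_sample_stride_width (int, optional): The stride between two consecutive horizontal tiles. This ensures that no tiling artifacts are produced across the width dimension.

Enable tiled AE decoding. When this option is enabled, the AE splits the input tensor into tiles and computes decoding and encoding in several steps. This is useful for saving a large amount of memory and for processing larger images.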
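A minimal sketch of how a tile size and a stride could partition an image is shown below. It assumes the stride is the step between consecutive tile starts, so neighbouring tiles overlap by the tile size minus the stride; the sizes 512 and 448 are example values chosen here, and both helper names are hypothetical.

```python
def tile_spans(length, tile_size, stride):
    """(start, end) spans along one axis; trailing tiles are clipped to `length`."""
    return [(start, min(start + tile_size, length))
            for start in range(0, length, stride)]


def tile_grid(height, width, tile_h, tile_w, stride_h, stride_w):
    """All (row_span, col_span) pairs covering a height x width sample."""
    return [(rows, cols)
            for rows in tile_spans(height, tile_h, stride_h)
            for cols in tile_spans(width, tile_w, stride_w)]


rows = tile_spans(1024, 512, 448)
print(rows)  # consecutive full tiles share 512 - 448 = 64 rows
print(len(tile_grid(1024, 1024, 512, 512, 448, 448)))
```

Each tile is then processed on its own, so peak memory depends on the tile size rather than the full image size. A real tiled implementation typically also blends the overlapping borders to hide seams; that step is omitted here.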

DecoderOutput

class diffusers.models.autoencoders.vae.DecoderOutput

( sample: Tensor, commit_loss: typing.Optional[torch.FloatTensor] = None )

Parameters

sample (torch.Tensor of shape (batch_size, num_channels, height, width)): The decoded output sample from the last layer of the model.

Output of the decoding method.
