DeepSpeed

DeepSpeed implements everything described in the ZeRO paper. Some of the salient optimizations are:

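Optimizer state partitioning (ZeRO stage 1)
Gradient partitioning (ZeRO stage 2)
Parameter partitioning (ZeRO stage 3)
Custom mixed precision training handling
A range of fast CUDA-extension-based optimizers
ZeRO-Offload to CPU and Disk/NVMe
Hierarchical partitioning of model parameters (ZeRO++)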

ZeRO-Offload has its own dedicated paper: ZeRO-Offload: Democratizing Billion-Scale Model Training. NVMe support is described in the paper ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning.

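DeepSpeed ZeRO-2 is primarily used only for training, as its features are of no use to inference.

DeepSpeed ZeRO-3 can be used for inference as well, since it allows large models to be loaded across multiple GPUs when they would not fit on a single GPU.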

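Accelerate integrates DeepSpeed in 2 ways:

Integration of the DeepSpeed features via a deepspeed config file specified in accelerate config. You just supply your custom config file or use our template. Most of this document is focused on this feature. This supports all the core features of DeepSpeed and gives the user a lot of flexibility. The user may have to change a few lines of code depending on the config.
Integration via deepspeed_plugin. This supports a subset of the DeepSpeed features and uses default options for the rest of the configuration. The user does not need to change any code, which suits those who are fine with most of the default settings of DeepSpeed.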

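What is integrated?

Training

Accelerate integrates all features of DeepSpeed ZeRO. This includes ZeRO stages 1, 2 and 3, as well as ZeRO-Offload, ZeRO-Infinity (which can offload to disk/NVMe) and ZeRO++. Below is a short description of data parallelism using ZeRO (Zero Redundancy Optimizer), along with a diagram from this blog post.

(Source: link)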

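a. Stage 1: shards optimizer states across data parallel workers/GPUs

b. Stage 2: shards optimizer states + gradients across data parallel workers/GPUs

c. Stage 3: shards optimizer states + gradients + model parameters across data parallel workers/GPUs

d. Optimizer offload: offloads the gradients + optimizer states to CPU/disk, building on top of ZeRO Stage 2

e. Param offload: offloads the model parameters to CPU/disk, building on top of ZeRO Stage 3

f. Hierarchical partitioning: enables efficient multi-node training with data-parallel training across nodes and ZeRO-3 sharding within a node, built on top of ZeRO Stage 3.

Note: with respect to disk offload, the disk should be an NVMe drive for decent speed, but it technically works on any disk.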
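To make the effect of stages 1 to 3 more concrete, here is a rough sketch of the per-GPU memory needed for model states. It uses the accounting from the ZeRO paper for mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients and 12 for fp32 optimizer states). The function name and the example model size are illustrative assumptions, and activations, temporary buffers and fragmentation are ignored.

```python
def zero_model_state_bytes(num_params: int, num_gpus: int, stage: int) -> float:
    """Estimate per-GPU bytes used by model states under a given ZeRO stage."""
    if stage not in (0, 1, 2, 3):
        raise ValueError(f"unknown ZeRO stage: {stage}")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    weights = 2 * num_params    # fp16 parameters
    grads = 2 * num_params      # fp16 gradients
    optim = 12 * num_params     # fp32 weights copy, momentum and variance

    if stage >= 1:              # stage 1 shards the optimizer states
        optim /= num_gpus
    if stage >= 2:              # stage 2 additionally shards the gradients
        grads /= num_gpus
    if stage >= 3:              # stage 3 additionally shards the parameters
        weights /= num_gpus
    return weights + grads + optim


if __name__ == "__main__":
    # Hypothetical setup: a 7.5B-parameter model on 64 GPUs.
    for stage in range(4):
        gigabytes = zero_model_state_bytes(7_500_000_000, 64, stage) / 1e9
        print(f"stage {stage}: {gigabytes:.1f} GB per GPU")
```

For this hypothetical setup the estimate drops from 120 GB without ZeRO to about 31.4 GB, 16.6 GB and 1.9 GB for stages 1, 2 and 3 respectively.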

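Inference

DeepSpeed ZeRO Inference supports ZeRO Stage 3 with ZeRO-Infinity. It uses the same ZeRO protocol as training, but it does not use an optimizer or a learning rate scheduler, and only Stage 3 is relevant. For more details, see: deepspeed-zero-inference.

How does it work?

Prerequisites: install DeepSpeed version >=0.6.5. Please refer to the DeepSpeed installation details for more information.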
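The version prerequisite can be checked from Python before launching. The following is a minimal sketch that uses only the standard library; the helper names are hypothetical, and the version parsing is deliberately simplistic (it keeps the first three numeric components and ignores local suffixes such as `+cu118`).

```python
import re
from importlib import metadata

MIN_DEEPSPEED = (0, 6, 5)  # minimum version stated in the prerequisites


def parse_version(text: str) -> tuple:
    """Turn a string such as '0.14.5' or '0.6.5+cu118' into (major, minor, patch)."""
    numbers = []
    for piece in text.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def deepspeed_is_recent_enough(minimum: tuple = MIN_DEEPSPEED) -> bool:
    """Return True if an installed DeepSpeed meets the minimum version."""
    try:
        installed = metadata.version("deepspeed")
    except metadata.PackageNotFoundError:
        return False  # DeepSpeed is not installed at all
    return parse_version(installed) >= minimum


if __name__ == "__main__":
    print("DeepSpeed OK" if deepspeed_is_recent_enough() else "Install DeepSpeed >= 0.6.5")
```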

We will first look at the easy-to-use integration via accelerate config, followed by the more flexible and feature-rich deepspeed config file integration.

Accelerate DeepSpeed Plugin

On your machine(s) just run:

accelerate config

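and answer the questions asked. It will ask whether you want to use a config file for DeepSpeed, to which you should answer no. Then answer the questions that follow to generate a basic DeepSpeed config. This will generate a config file that will be used automatically to properly set the default options when running: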

accelerate launch my_script.py --args_to_my_script

For instance, here is how you would run the NLP example examples/nlp_example.py (from the root of the repo) with the DeepSpeed Plugin:

ZeRO Stage-2 DeepSpeed Plugin Example

compute_environment: LOCAL_MACHINE
deepspeed_config:
 gradient_accumulation_steps: 1
 gradient_clipping: 1.0
 offload_optimizer_device: none
 offload_param_device: none
 zero3_init_flag: true
 zero_stage: 2
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false

accelerate launch examples/nlp_example.py --mixed_precision fp16

ZeRO Stage-3 with CPU Offload DeepSpeed Plugin Example

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false

accelerate launch examples/nlp_example.py --mixed_precision fp16

Currently, Accelerate supports the following config through the CLI:

`zero_stage`: [0] Disabled, [1] optimizer state partitioning, [2] optimizer+gradient state partitioning and [3] optimizer+gradient+parameter partitioning
`gradient_accumulation_steps`: Number of training steps to accumulate gradients before averaging and applying them.
`gradient_clipping`: Enable gradient clipping with value.
`offload_optimizer_device`: [none] Disable optimizer offloading, [cpu] offload optimizer to CPU, [nvme] offload optimizer to NVMe SSD. Only applicable with ZeRO >= Stage-2.
`offload_optimizer_nvme_path`: NVMe path to which optimizer states are offloaded. If unspecified, will default to 'none'.
`offload_param_device`: [none] Disable parameter offloading, [cpu] offload parameters to CPU, [nvme] offload parameters to NVMe SSD. Only applicable with ZeRO Stage-3.
`offload_param_nvme_path`: NVMe path to which parameters are offloaded. If unspecified, will default to 'none'.
`zero3_init_flag`: Decides whether to enable `deepspeed.zero.Init` for constructing massive models. Only applicable with ZeRO Stage-3.
`zero3_save_16bit_model`: Decides whether to save 16-bit model weights when using ZeRO Stage-3.
`mixed_precision`: `no` for FP32 training, `fp16` for FP16 mixed-precision training and `bf16` for BF16 mixed-precision training.
`deepspeed_moe_layer_cls_names`: Comma-separated list of transformer Mixture-of-Experts (MoE) layer class names (case-sensitive) to wrap, e.g., `MixtralSparseMoeBlock`, `Qwen2MoeSparseMoeBlock`, `JetMoEAttention,JetMoEBlock` ...
`deepspeed_hostfile`: DeepSpeed hostfile for configuring multi-node compute resources.
`deepspeed_exclusion_filter`: DeepSpeed exclusion filter string when using a multi-node setup.
`deepspeed_inclusion_filter`: DeepSpeed inclusion filter string when using a multi-node setup.
`deepspeed_multinode_launcher`: DeepSpeed multi-node launcher to use, e.g. `pdsh`, `standard`, `openmpi`, `mvapich`, `mpich`, `slurm`, `nossh` (requires DeepSpeed >= 0.14.5). If unspecified, will default to `pdsh`.
`deepspeed_config_file`: path to the DeepSpeed config file in `json` format. See the next section for more details on this.

To be able to tweak more options, you will need to use a DeepSpeed config file.
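Several of the options listed above are only applicable at certain ZeRO stages. The following is a small sketch that checks a plugin-style option dictionary against those stated constraints. The helper `check_plugin_options` is hypothetical and is not part of Accelerate; it only restates the "only applicable with" remarks from the list.

```python
def check_plugin_options(options: dict) -> list:
    """Return human-readable notes about options that do not apply at the chosen stage."""
    notes = []
    stage = options["zero_stage"]
    if stage not in (0, 1, 2, 3):
        notes.append(f"zero_stage must be 0, 1, 2 or 3, got {stage!r}")
        return notes

    # Optimizer offloading is only applicable with ZeRO >= Stage-2.
    if options.get("offload_optimizer_device", "none") != "none" and stage < 2:
        notes.append("offload_optimizer_device only applies with ZeRO >= Stage-2")

    # Parameter offloading is only applicable with ZeRO Stage-3.
    if options.get("offload_param_device", "none") != "none" and stage != 3:
        notes.append("offload_param_device only applies with ZeRO Stage-3")

    # deepspeed.zero.Init is only applicable with ZeRO Stage-3.
    if options.get("zero3_init_flag", False) and stage != 3:
        notes.append("zero3_init_flag only applies with ZeRO Stage-3")

    return notes


if __name__ == "__main__":
    # Values copied from the ZeRO Stage-2 plugin example above.
    stage2_example = {
        "offload_optimizer_device": "none",
        "offload_param_device": "none",
        "zero3_init_flag": True,
        "zero_stage": 2,
    }
    for note in check_plugin_options(stage2_example):
        print(note)
```

Applied to the Stage-2 plugin example, the sketch reports that `zero3_init_flag` is set although it is only applicable with Stage-3; the Stage-3 example passes without notes.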

DeepSpeed Config File

On your machine(s) just run:

accelerate config

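and answer the questions asked. It will ask whether you want to use a config file for deepspeed, to which you answer yes and provide the path to the deepspeed config file. This will generate a config file that will be used automatically to properly set the default options when running: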

accelerate launch my_script.py --args_to_my_script

For instance, here is how you would run the NLP example examples/by_feature/deepspeed_with_config_support.py (from the root of the repo) with the DeepSpeed Config File:

ZeRO Stage-2 DeepSpeed Config File Example

compute_environment: LOCAL_MACHINE
deepspeed_config:
 deepspeed_config_file: /home/ubuntu/accelerate/examples/deepspeed_config_templates/zero_stage2_config.json
 zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false

The contents of zero_stage2_config.json are:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "weight_decay": "auto",
            "torch_adam": true,
            "adam_w_mode": true
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": "auto",
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

accelerate launch examples/by_feature/deepspeed_with_config_support.py \
--config_name "gpt2-large" \
--tokenizer_name "gpt2-large" \
--dataset_name "wikitext" \
--dataset_config_name "wikitext-2-raw-v1" \
--block_size 128 \
--output_dir "./clm/clm_deepspeed_stage2_accelerate" \
--learning_rate 5e-4 \
--per_device_train_batch_size 24 \
--per_device_eval_batch_size 24 \
--num_train_epochs 3 \
--with_tracking \
--report_to "wandb"
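The Stage-2 config above leaves several fields as "auto". As a quick way to see which ones, the following sketch walks a config dictionary and lists every "auto" field by its dotted path. The helper name is hypothetical, and the embedded JSON is an abridged copy of the config shown above rather than the full file.

```python
import json


def find_auto_fields(config: dict, prefix: str = "") -> list:
    """Return the dotted paths of all values that are the string "auto"."""
    found = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            found.extend(find_auto_fields(value, path))  # recurse into sections
        elif value == "auto":
            found.append(path)
    return found


# Abridged copy of the Stage-2 config shown above.
STAGE2_CONFIG = json.loads("""
{
    "optimizer": {"type": "AdamW", "params": {"lr": "auto", "weight_decay": "auto"}},
    "zero_optimization": {"stage": 2, "reduce_bucket_size": "auto"},
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
""")

if __name__ == "__main__":
    for path in find_auto_fields(STAGE2_CONFIG):
        print(path)
```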

ZeRO Stage-3 with CPU Offload DeepSpeed Config File Example

compute_environment: LOCAL_MACHINE
deepspeed_config:
 deepspeed_config_file: /home/ubuntu/accelerate/examples/deepspeed_config_templates/zero_stage3_offload_config.json
 zero3_init_flag: true
distributed_type: DEEPSPEED
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false

The contents of zero_stage3_offload_config.json are:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "sub_group_size": 1e9,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": "auto"
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

accelerate launch examples/by_feature/deepspeed_with_config_support.py \
--config_name "gpt2-large" \
--tokenizer_name "gpt2-large" \
--dataset_name "wikitext" \
--dataset_config_name "wikitext-2-raw-v1" \
--block_size 128 \
--output_dir "./clm/clm_deepspeed_stage3_offload_accelerate" \
--learning_rate 5e-4 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 32 \
--num_train_epochs 3 \
--with_tracking \
--report_to "wandb"

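ZeRO++ Config Example

You can use the features of ZeRO++ by setting the appropriate config parameters. Note that ZeRO++ is an extension of ZeRO Stage 3. Here is how the config file can be modified, taken from DeepSpeed's ZeRO++ tutorial: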

{
    "zero_optimization": {
        "stage": 3,
        "reduce_bucket_size": "auto",

        "zero_quantized_weights": true,
        "zero_hpz_partition_size": 8,
        "zero_quantized_gradients": true,

        "contiguous_gradients": true,
        "overlap_comm": true
    }
}

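For hierarchical partitioning, the partition size zero_hpz_partition_size should ideally be set to the number of GPUs per node. (For example, the config file above assumes 8 GPUs per node.)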
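As a sketch of that rule, the helper below returns a copy of a DeepSpeed config dictionary with the ZeRO++ keys from the snippet above applied and `zero_hpz_partition_size` set to the number of GPUs per node. The function name `with_zeropp` is hypothetical, and the number of GPUs per node has to be supplied by the caller.

```python
import copy


def with_zeropp(config: dict, gpus_per_node: int) -> dict:
    """Return a copy of `config` with the ZeRO++ keys from the tutorial snippet set."""
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")

    updated = copy.deepcopy(config)  # leave the caller's config untouched
    zero = updated.setdefault("zero_optimization", {})
    zero["stage"] = 3  # ZeRO++ is an extension of ZeRO Stage 3
    zero["zero_quantized_weights"] = True
    zero["zero_hpz_partition_size"] = gpus_per_node  # ideally GPUs per node
    zero["zero_quantized_gradients"] = True
    return updated


if __name__ == "__main__":
    base = {"zero_optimization": {"stage": 3, "reduce_bucket_size": "auto"}}
    print(with_zeropp(base, gpus_per_node=8))
```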

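Important code changes when using DeepSpeed Config File

DeepSpeed Optimizers and Schedulers. For more information on these, see the DeepSpeed Optimizers and DeepSpeed Schedulers documentation. We will look at the changes needed in the code when using these.

a. DS Optim + DS Scheduler: the case when both the optimizer and scheduler keys are present in the DeepSpeed config file. In this situation, those will be used, and the user has to use accelerate.utils.DummyOptim and accelerate.utils.DummyScheduler to replace the PyTorch/custom optimizers and schedulers in their code. Below is the snippet from examples/by_feature/deepspeed_with_config_support.py showing this: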
```
 # Creates Dummy Optimizer if `optimizer` was specified in the config file else creates Adam Optimizer
 optimizer_cls = (
     torch.optim.AdamW
     if accelerator.state.deepspeed_plugin is None
     or "optimizer" not in accelerator.state.deepspeed_plugin.deepspeed_config
     else DummyOptim
 )
 optimizer = optimizer_cls(optimizer_grouped_parameters, lr=args.learning_rate)

 # Creates Dummy Scheduler if `scheduler` was specified in the config file else creates `args.lr_scheduler_type` Scheduler
 if (
     accelerator.state.deepspeed_plugin is None
     or "scheduler" not in accelerator.state.deepspeed_plugin.deepspeed_config
 ):
     lr_scheduler = get_scheduler(
         name=args.lr_scheduler_type,
         optimizer=optimizer,
         num_warmup_steps=args.num_warmup_steps,
         num_training_steps=args.max_train_steps,
     )
 else:
     lr_scheduler = DummyScheduler(
         optimizer, total_num_steps=args.max_train_steps, warmup_num_steps=args.num_warmup_steps
     )
```
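b. Custom Optim + Custom Scheduler: the case when both the optimizer and scheduler keys are absent from the DeepSpeed config file. In this situation, no code changes are needed from the user, and this is also the case when using the integration via the DeepSpeed Plugin. In the example above, we can see that the code remains unchanged if the optimizer and scheduler keys are absent from the DeepSpeed config file.

c. Custom Optim + DS Scheduler: the case when only the scheduler key is present in the DeepSpeed config file. In this situation, the user has to use accelerate.utils.DummyScheduler to replace the PyTorch/custom scheduler in their code.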

d. DS Optim + Custom Scheduler: the case when only the optimizer key is present in the DeepSpeed config file. This will result in an error, because you can only use DS Scheduler when using DS Optim.
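The four cases above can be summarized in a small decision helper. This is only a sketch of the rules as described in this section; the function `required_dummies` is hypothetical and does not import or call Accelerate.

```python
def required_dummies(ds_config: dict) -> dict:
    """Report which dummy objects the training script needs for a DeepSpeed config."""
    has_optimizer = "optimizer" in ds_config
    has_scheduler = "scheduler" in ds_config

    # Case d: DS Optim + Custom Scheduler is described above as an error.
    if has_optimizer and not has_scheduler:
        raise ValueError(
            "config has an 'optimizer' key but no 'scheduler' key: "
            "a DS optimizer can only be combined with a DS scheduler"
        )

    return {
        "DummyOptim": has_optimizer,      # case a
        "DummyScheduler": has_scheduler,  # cases a and c
    }


if __name__ == "__main__":
    print(required_dummies({"optimizer": {}, "scheduler": {}}))  # case a
    print(required_dummies({}))                                  # case b
    print(required_dummies({"scheduler": {}}))                   # case c
```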
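
Notice the auto values in the example DeepSpeed config files above. These are handled automatically by the prepare method based on the model, dataloaders, dummy optimizer and dummy scheduler provided to the prepare method. Only the auto fields specified in the examples above are handled by the prepare method; the rest have to be specified explicitly by the user.

The auto values are calculated as: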

reduce_bucket_size: hidden_size * hidden_size
stage3_prefetch_bucket_size: int(0.9 * hidden_size * hidden_size)
stage3_param_persistence_threshold: 10 * hidden_size

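For the auto feature to work for these 3 config entries, Accelerate will use model.config.hidden_size or max(model.config.hidden_sizes) as hidden_size. If neither of these is available, the launch will fail and you will have to set these 3 config entries manually. Remember that the first 2 config entries are the communication buffers: the larger they are, the more efficient the communication will be, and the more GPU memory they will consume, so it is a tunable performance trade-off.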
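The three formulas above can be written out as a short sketch. The helper names are hypothetical, and the model config is mocked with `SimpleNamespace` so the example does not depend on any model library; the hidden size of 1024 is an arbitrary illustrative value.

```python
from types import SimpleNamespace


def resolve_hidden_size(model_config) -> int:
    """Pick hidden_size, falling back to max(hidden_sizes), as described above."""
    if hasattr(model_config, "hidden_size"):
        return model_config.hidden_size
    if hasattr(model_config, "hidden_sizes"):
        return max(model_config.hidden_sizes)
    raise ValueError("no hidden_size or hidden_sizes: set the 3 entries manually")


def auto_bucket_values(model_config) -> dict:
    """Compute the three "auto" entries from the model's hidden size."""
    hidden_size = resolve_hidden_size(model_config)
    return {
        "reduce_bucket_size": hidden_size * hidden_size,
        "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),
        "stage3_param_persistence_threshold": 10 * hidden_size,
    }


if __name__ == "__main__":
    # Hypothetical model config with hidden_size = 1024.
    print(auto_bucket_values(SimpleNamespace(hidden_size=1024)))
```

With a hidden size of 1024 this gives 1048576, 943718 and 10240 for the three entries.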

Things to note when using DeepSpeed Config File

Below is a sample script using deepspeed_config_file in different scenarios.

Code test.py:

from accelerate import Accelerator
from accelerate.state import AcceleratorState


def main():
    accelerator = Accelerator()
    accelerator.print(f"{AcceleratorState()}")


if __name__ == "__main__":
    main()

Scenario 1: a manually edited accelerate config file that contains deepspeed_config_file along with other entries.

Contents of the accelerate config:

command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: 'cpu'
  offload_param_device: 'cpu'
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
  deepspeed_config_file: 'ds_config.json'
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false

ds_config.json:

{
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": false,
        "offload_optimizer": {
            "device": "none"
        },
        "offload_param": {
            "device": "none"
        }
    },
    "gradient_clipping": 1.0,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": 10,
    "steps_per_print": 2000000
}

Output of accelerate launch test.py:

ValueError: When using `deepspeed_config_file`, the following accelerate config variables will be ignored:
['gradient_accumulation_steps', 'gradient_clipping', 'zero_stage', 'offload_optimizer_device', 'offload_param_device',
'zero3_save_16bit_model', 'mixed_precision'].
Please specify them appropriately in the DeepSpeed config file.
If you are using an accelerate config file, remove other config variables mentioned in the above specified list.
The easiest method is to create a new config following the questionnaire via `accelerate config`.
It will only ask for the necessary config variables when using `deepspeed_config_file`.
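The conflict behind this error can be illustrated with a simplified sketch. It is not Accelerate's actual implementation: the function `find_ambiguous_keys` is hypothetical, and it merely reports which of the variables named in the error message appear in a parsed accelerate config next to `deepspeed_config_file`.

```python
# Variable names copied from the error message above.
IGNORED_WITH_CONFIG_FILE = [
    "gradient_accumulation_steps",
    "gradient_clipping",
    "zero_stage",
    "offload_optimizer_device",
    "offload_param_device",
    "zero3_save_16bit_model",
    "mixed_precision",
]


def find_ambiguous_keys(accelerate_config: dict) -> list:
    """List config variables that conflict with `deepspeed_config_file`."""
    ds_section = accelerate_config.get("deepspeed_config", {})
    if "deepspeed_config_file" not in ds_section:
        return []  # no DeepSpeed config file, so nothing is ignored
    return [
        key
        for key in IGNORED_WITH_CONFIG_FILE
        if key in ds_section or key in accelerate_config
    ]


if __name__ == "__main__":
    # Abridged version of the Scenario 1 accelerate config.
    scenario_1 = {
        "deepspeed_config": {
            "gradient_accumulation_steps": 1,
            "zero_stage": 3,
            "deepspeed_config_file": "ds_config.json",
        },
        "distributed_type": "DEEPSPEED",
    }
    print(find_ambiguous_keys(scenario_1))
```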

Scenario 2: use the solution suggested by the error message to create a new accelerate config, and check that no ambiguity error is thrown now.

Run accelerate config:

$ accelerate config
-------------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
-------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]:
Do you wish to optimize your script with torch dynamo?[yes/NO]:
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: yes
Please enter the path to the json DeepSpeed config file: ds_config.json
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: yes
How many GPU(s) should be used for distributed training? [1]:4
accelerate configuration saved at ds_config_sample.yaml

Contents of the accelerate config:

compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: ds_config.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false

Output of accelerate launch test.py:

Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: bf16
ds_config: {'bf16': {'enabled': True}, 'zero_optimization': {'stage': 3, 'stage3_gather_16bit_weights_on_model_save': False, 'offload_optimizer': {'device': 'none'}, 'offload_param': {'device': 'none'}}, 'gradient_clipping': 1.0, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 10, 'steps_per_print': inf, 'fp16': {'enabled': False}}

Scenario 3: set the DeepSpeed-related accelerate launch command arguments to "auto" in the DeepSpeed config file and check that things work as expected.

New ds_config.json with "auto" for the accelerate launch DeepSpeed command arguments:

{
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": "auto",
        "stage3_gather_16bit_weights_on_model_save": "auto",
        "offload_optimizer": {
            "device": "auto"
        },
        "offload_param": {
            "device": "auto"
        }
    },
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "steps_per_print": 2000000
}

Output of accelerate launch --mixed_precision="fp16" --zero_stage=3 --gradient_accumulation_steps=5 --gradient_clipping=1.0 --offload_param_device="cpu" --offload_optimizer_device="nvme" --zero3_save_16bit_model="true" test.py:

Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
ds_config: {'bf16': {'enabled': False}, 'zero_optimization': {'stage': 3, 'stage3_gather_16bit_weights_on_model_save': True, 'offload_optimizer': {'device': 'nvme'}, 'offload_param': {'device': 'cpu'}}, 'gradient_clipping': 1.0, 'train_batch_size': 'auto', 'train_micro_batch_size_per_gpu': 'auto', 'gradient_accumulation_steps': 5, 'steps_per_print': inf, 'fp16': {'enabled': True, 'auto_cast': True}}

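Note:

The remaining "auto" values are handled in the accelerator.prepare() call, as explained in point 2 of Important code changes when using DeepSpeed Config File.
Only when gradient_accumulation_steps is auto will the value passed while creating the Accelerator object via Accelerator(gradient_accumulation_steps=k) be used. When using the DeepSpeed Plugin, the value from the plugin will be used, and it will override the value passed while creating the Accelerator object.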
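The substitution seen in Scenario 3 can be sketched as follows. This is an illustration of the observed behaviour under stated assumptions, not Accelerate's code: the function `fill_auto` and the mapping between launch arguments and config keys are inferred from the output shown above, and entries such as train_batch_size are deliberately left as "auto".

```python
import copy


def fill_auto(ds_config: dict, launch_args: dict) -> dict:
    """Replace "auto" entries with values taken from launch arguments."""
    resolved = copy.deepcopy(ds_config)
    zero = resolved.get("zero_optimization", {})

    precision = launch_args.get("mixed_precision")
    bf16_enabled = None if precision is None else precision == "bf16"

    # (section, key, value) triples; the mapping is an assumption based on Scenario 3.
    candidates = [
        (resolved.get("bf16", {}), "enabled", bf16_enabled),
        (zero, "stage", launch_args.get("zero_stage")),
        (zero, "stage3_gather_16bit_weights_on_model_save",
         launch_args.get("zero3_save_16bit_model")),
        (zero.get("offload_optimizer", {}), "device",
         launch_args.get("offload_optimizer_device")),
        (zero.get("offload_param", {}), "device",
         launch_args.get("offload_param_device")),
        (resolved, "gradient_clipping", launch_args.get("gradient_clipping")),
        (resolved, "gradient_accumulation_steps",
         launch_args.get("gradient_accumulation_steps")),
    ]
    for section, key, value in candidates:
        # Only touch entries that are still "auto" and have a launch value.
        if section.get(key) == "auto" and value is not None:
            section[key] = value
    return resolved


if __name__ == "__main__":
    config = {
        "bf16": {"enabled": "auto"},
        "zero_optimization": {"stage": "auto", "offload_param": {"device": "auto"}},
        "gradient_accumulation_steps": "auto",
        "train_batch_size": "auto",
    }
    args = {"mixed_precision": "fp16", "zero_stage": 3,
            "offload_param_device": "cpu", "gradient_accumulation_steps": 5}
    print(fill_auto(config, args))
```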

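Saving and loading

Saving and loading of models is unchanged for ZeRO Stage-1 and Stage-2.

Under ZeRO Stage-3, state_dict contains just the placeholders, since the model weights are partitioned across multiple GPUs. ZeRO Stage-3 has 2 options:

a. Saving the entire 16-bit model weights so they can be loaded directly later on using model.load_state_dict(torch.load("pytorch_model.bin")). For this, either set zero_optimization.stage3_gather_16bit_weights_on_model_save to True in the DeepSpeed config file or set zero3_save_16bit_model to True in the DeepSpeed Plugin. Note that this option requires consolidating the weights on one GPU, which can be slow and memory demanding, so only use this feature when needed. Below is the snippet from examples/by_feature/deepspeed_with_config_support.py showing this: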

unwrapped_model = accelerator.unwrap_model(model)

# New Code #
# Saves the whole/unpartitioned fp16 model when in ZeRO Stage-3 to the output directory if
# `stage3_gather_16bit_weights_on_model_save` is True in DeepSpeed Config file or
# `zero3_save_16bit_model` is True in DeepSpeed Plugin.
# For Zero Stages 1 and 2, models are saved as usual in the output directory.
# The model name saved is `pytorch_model.bin`
unwrapped_model.save_pretrained(
    args.output_dir,
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),
)

b. To get the 32-bit weights, first save the model using model.save_checkpoint(). Below is the snippet from examples/by_feature/deepspeed_with_config_support.py showing this:

success = model.save_checkpoint(PATH, ckpt_id, checkpoint_state_dict)
status_msg = f"checkpointing: PATH={PATH}, ckpt_id={ckpt_id}"
if success:
    logging.info(f"Success {status_msg}")
else:
    logging.warning(f"Failure {status_msg}")

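This will create the ZeRO model and optimizer partitions along with the zero_to_fp32.py script in the checkpoint directory. You can use this script to do offline consolidation. It requires no configuration files or GPUs. Here is an example of its usage: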

$ cd /path/to/checkpoint_dir
$ ./zero_to_fp32.py . pytorch_model.bin
Processing zero checkpoint at global_step1
Detected checkpoint of type zero stage 3, world_size: 2
Saving fp32 state dict to pytorch_model.bin (total_numel=60506624)

To get the 32-bit model for saving/inference, you can do the following:

from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

unwrapped_model = accelerator.unwrap_model(model)
fp32_model = load_state_dict_from_zero_checkpoint(unwrapped_model, checkpoint_dir)

If you are only interested in the state_dict, you can do the following:

from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)

Note that all these functions require roughly 2x the size of the final checkpoint in memory (general RAM).
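As a back-of-the-envelope sketch of that 2x rule, the helper below estimates the RAM needed from the number of parameters. It assumes the final checkpoint holds fp32 values at 4 bytes each and ignores file metadata; the function name is hypothetical, and the example count is the total_numel reported in the zero_to_fp32.py output above.

```python
def consolidation_ram_bytes(total_numel: int,
                            bytes_per_param: int = 4,
                            overhead_factor: float = 2.0) -> int:
    """Estimate the RAM needed to consolidate a ZeRO checkpoint into fp32."""
    checkpoint_bytes = total_numel * bytes_per_param  # approximate fp32 checkpoint size
    return int(checkpoint_bytes * overhead_factor)    # roughly 2x, per the note above


if __name__ == "__main__":
    needed = consolidation_ram_bytes(60_506_624)
    print(f"about {needed / 2**20:.0f} MiB of RAM")
```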

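ZeRO Inference

DeepSpeed ZeRO Inference supports ZeRO Stage 3 with ZeRO-Infinity. It uses the same ZeRO protocol as training, but it does not use an optimizer or a learning rate scheduler, and only Stage 3 is relevant. With the accelerate integration, you just need to prepare the model and dataloader as shown below: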

model, eval_dataloader = accelerator.prepare(model, eval_dataloader)

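A few caveats to be aware of

The current integration does not support DeepSpeed's Pipeline Parallelism.
The current integration does not support mpu, which limits the tensor parallelism that is supported in Megatron-LM.
The current integration does not support multiple models.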

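Multi-node DeepSpeed

DeepSpeed supports multi-node inference and training over a variety of different launchers. You can specify a different launcher by setting the deepspeed_multinode_launcher config in the CLI or in the DeepSpeed config file.

Currently, accelerate supports passing configuration for the following DeepSpeed multi-node launchers: pdsh (default), standard, openmpi, mvapich, mpich, slurm, nossh (requires DeepSpeed >= 0.14.5).

Please read the DeepSpeed documentation for more information on the different launchers. By default, DeepSpeed will attempt to use passwordless SSH connections from the main node to the other nodes to run the launcher command. In this configuration, the accelerate launch command only needs to be run on the main node. If you use the nossh launcher, you will need to run the accelerate launch command on every node using a copied configuration.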
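For multi-node runs, DeepSpeed reads a hostfile (see the deepspeed_hostfile option) in which each line names a host and the number of GPU slots on it, in the form `hostname slots=N`. The following sketch renders such a file from a dictionary. The helper name and the host names `worker-1` and `worker-2` are hypothetical placeholders.

```python
def render_hostfile(nodes: dict) -> str:
    """Render DeepSpeed hostfile text from a {hostname: gpu_count} mapping."""
    lines = []
    for hostname, gpu_count in nodes.items():
        if gpu_count < 1:
            raise ValueError(f"{hostname} must have at least one GPU slot")
        lines.append(f"{hostname} slots={gpu_count}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical two-node cluster with 8 GPUs per node.
    print(render_hostfile({"worker-1": 8, "worker-2": 8}), end="")
```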

DeepSpeed Resources

The documentation for the DeepSpeed internals can be found here.

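Finally, please remember that Accelerate only integrates DeepSpeed. Therefore, if you have any problems or questions regarding DeepSpeed usage, please file an issue on the DeepSpeed GitHub.

For those interested in the similarities and differences between FSDP and DeepSpeed, please check out the concept guide here!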
