Using multiple models with DeepSpeed

This guide assumes that you have read and understood the DeepSpeed usage guide.

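Running multiple models with Accelerate and DeepSpeed is useful for:

Knowledge distillation
Post-training techniques like RLHF (see the TRL library for more examples)
Training multiple models at once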

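Currently, Accelerate has a **very experimental** API to help you use multiple models.

This tutorial focuses on two common use cases:

Knowledge distillation, where a smaller student model is trained to mimic a larger, better-performing teacher model. If the student model fits on a single GPU, we can use ZeRO-2 for training and ZeRO-3 to shard the teacher model for inference. This is significantly faster than using ZeRO-3 for both models.
Training multiple *disjoint* models at once.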

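Knowledge distillation

Knowledge distillation is a good example of using multiple models while training only one of them.

Normally, you would use a single utils.DeepSpeedPlugin for both models. In this case, however, there are two separate configurations. Accelerate allows you to create and use multiple plugins **if and only if** they are in a `dict`, so that you can reference and enable the proper plugin when needed.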

from accelerate.utils import DeepSpeedPlugin

zero2_plugin = DeepSpeedPlugin(hf_ds_config="zero2_config.json")
zero3_plugin = DeepSpeedPlugin(hf_ds_config="zero3_config.json")

deepspeed_plugins = {"student": zero2_plugin, "teacher": zero3_plugin}

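The `zero2_config.json` should be configured for full training (so specify `scheduler` and `optimizer` if you are not using your own), while `zero3_config.json` should only be configured for the inference model, as shown in the example below.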

{
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": "auto",
        "stage3_max_reuse_distance": "auto"
    },
    "train_micro_batch_size_per_gpu": 1
}

An example `zero2_config.json` configuration is shown below.

{
    "bf16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "weight_decay": "auto",
            "torch_adam": true,
            "adam_w_mode": true
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        }
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}

Even if a particular model is not being trained, DeepSpeed still raises an error if `train_micro_batch_size_per_gpu` is not specified.
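
Hand-written JSON is easy to get subtly wrong (a trailing comma is enough to make it unparsable), so one option is to generate both files from Python dictionaries. The following is a minimal sketch under that assumption: the `write_config` helper is hypothetical, and its check for `train_micro_batch_size_per_gpu` is only a local convenience, not DeepSpeed's own validation.

```python
import json
from pathlib import Path

# Inference-only ZeRO-3 settings for the teacher, mirroring the example above.
zero3_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": "auto",
        "stage3_max_reuse_distance": "auto",
    },
    "train_micro_batch_size_per_gpu": 1,
}

# Full-training ZeRO-2 settings for the student, mirroring the example above.
zero2_config = {
    "bf16": {"enabled": "auto"},
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "weight_decay": "auto",
            "torch_adam": True,
            "adam_w_mode": True,
        },
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
        },
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}


def write_config(config, path):
    """Hypothetical helper: sanity-check a config dict and write it as JSON."""
    if "train_micro_batch_size_per_gpu" not in config:
        raise ValueError(f"{path}: `train_micro_batch_size_per_gpu` is missing")
    path = Path(path)
    # json.dumps never emits trailing commas, so the file is always parsable.
    path.write_text(json.dumps(config, indent=4))
    return path


zero3_path = write_config(zero3_config, "zero3_config.json")
zero2_path = write_config(zero2_config, "zero2_config.json")
```

The resulting files can then be passed to `DeepSpeedPlugin(hf_ds_config=...)` exactly as in the earlier snippet.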

Next, create a single Accelerator and pass in both configurations.

from accelerate import Accelerator

accelerator = Accelerator(deepspeed_plugins=deepspeed_plugins)

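Now let's see how to use them.

Student model

By default, Accelerate sets the first item in the `dict` as the default or enabled plugin (here, the `"student"` plugin). You can verify this with the utils.deepspeed.get_active_deepspeed_plugin() function, which returns the plugin that is currently enabled.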

from accelerate.utils.deepspeed import get_active_deepspeed_plugin

active_plugin = get_active_deepspeed_plugin(accelerator.state)
assert active_plugin is deepspeed_plugins["student"]

`AcceleratorState` also keeps the active DeepSpeed plugin in `state.deepspeed_plugin`.

assert active_plugin is accelerator.deepspeed_plugin
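
When logging or debugging, it can be handy to know the *name* of the active plugin rather than the object itself. The sketch below is one way to do that. `active_plugin_name` is a hypothetical helper, not part of Accelerate, and plain `object()` instances stand in for real plugins so the example runs without Accelerate or DeepSpeed installed.

```python
def active_plugin_name(plugins, active_plugin):
    """Hypothetical helper: return the dict key whose value is the active plugin."""
    for name, plugin in plugins.items():
        if plugin is active_plugin:
            return name
    raise KeyError("active plugin is not one of the registered plugins")


# Stand-in objects (assumption: real code would use DeepSpeedPlugin instances).
student_plugin, teacher_plugin = object(), object()
plugins = {"student": student_plugin, "teacher": teacher_plugin}

# dicts preserve insertion order, so the first key is the plugin enabled by default.
default_name = next(iter(plugins))
print(default_name)
print(active_plugin_name(plugins, teacher_plugin))
```

With real objects, the call would look like `active_plugin_name(deepspeed_plugins, get_active_deepspeed_plugin(accelerator.state))`.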

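Since `student` is the currently active plugin, let's go ahead and prepare the model, optimizer, and scheduler.

# Placeholder: build your own student model, optimizer, scheduler, and dataloader here
student_model, optimizer, scheduler, train_dataloader = ...
student_model, optimizer, scheduler, train_dataloader = accelerator.prepare(student_model, optimizer, scheduler, train_dataloader)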

Now it's time to deal with the teacher model.

Teacher model

First, you need to tell the Accelerator that the `zero3_config.json` configuration should be used.

accelerator.state.select_deepspeed_plugin("teacher")

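This disables the `"student"` plugin and enables the `"teacher"` plugin instead. The stateful DeepSpeed config inside Transformers is updated, which changes which plugin configuration is used when `deepspeed.initialize()` is called. This lets you use the automatic `deepspeed.zero.Init` context manager integration that Transformers provides.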

from transformers import AutoModel

teacher_model = AutoModel.from_pretrained(...)
teacher_model = accelerator.prepare(teacher_model)

Otherwise, you should manually initialize the model with `deepspeed.zero.Init`.

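import deepspeed

# Pass the active plugin's DeepSpeed config dict by keyword; the first positional argument of `zero.Init` is a module
with deepspeed.zero.Init(config_dict_or_path=accelerator.deepspeed_plugin.deepspeed_config):
    model = MyModel(...)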
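
Because plugin selection is global state, it is easy to forget which plugin is active after setting up the teacher. One possible way to keep the switch scoped is a small context manager, sketched below. `use_deepspeed_plugin` and `FakeState` are hypothetical names introduced only for illustration. The only method assumed on the state object is `select_deepspeed_plugin(name)`, which the steps above already use, and whether you need to switch back at all depends on your workflow.

```python
from contextlib import contextmanager


@contextmanager
def use_deepspeed_plugin(state, name, restore_to):
    """Hypothetical helper: enable plugin `name`, then re-enable `restore_to` on exit."""
    state.select_deepspeed_plugin(name)
    try:
        yield
    finally:
        # Runs even if model loading or `prepare` raises inside the block.
        state.select_deepspeed_plugin(restore_to)


class FakeState:
    """Stand-in for `accelerator.state` so the sketch runs without Accelerate."""

    def __init__(self, active):
        self.active = active
        self.history = [active]

    def select_deepspeed_plugin(self, name):
        self.active = name
        self.history.append(name)


state = FakeState("student")
with use_deepspeed_plugin(state, "teacher", restore_to="student"):
    # Real code would load and prepare the teacher model here.
    inside = state.active
```

In real code, `state` would be `accelerator.state`, and the body of the `with` block would contain the `from_pretrained` and `prepare` calls for the teacher.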

Training

From here, your training loop can be whatever you like, as long as `teacher_model` is never trained.

import torch

teacher_model.eval()
student_model.train()
for batch in train_dataloader:
    with torch.no_grad():
        output_teacher = teacher_model(**batch)
    output_student = student_model(**batch)
    # Combine the losses or modify it in some way
    loss = output_teacher.loss + output_student.loss
    accelerator.backward(loss)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
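
The loop above simply adds the two losses as a placeholder. A commonly used alternative for distillation is to compare temperature-softened output distributions with a KL divergence. The sketch below shows that formula in plain Python for a single example so it runs without any framework. It is an illustration of the general technique, not something this guide prescribes, and the function names and the default temperature of 2.0 are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by `temperature`."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    # The temperature**2 factor keeps gradient magnitudes comparable across temperatures.
    return kl * temperature**2


matching = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
print(matching, mismatched)
```

In a real training loop the same computation would be done on batched tensors, with the teacher logits produced under `torch.no_grad()` as in the loop above.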

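Train multiple disjoint models

Training multiple models is a more complicated scenario. In its current state, we assume each model is **completely disjoint** from the other during training.

This scenario still requires two utils.DeepSpeedPlugin objects to be created. However, you also need a second Accelerator, since different `deepspeed` engines are called at different times. A single Accelerator can only carry one instance at a time.

Because state.AcceleratorState is a stateful object, it is already aware of both available utils.DeepSpeedPlugin objects. You can simply instantiate a second Accelerator with no extra arguments.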

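# Assumes `deepspeed_plugins` is a dict keyed by "first_model" and "second_model", one plugin per model
first_accelerator = Accelerator(deepspeed_plugins=deepspeed_plugins)
second_accelerator = Accelerator()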
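
The reason the second instance needs no arguments is that the state is shared between instances. The toy class below illustrates that shared-state idea in isolation. It is a simplified sketch of the pattern, not Accelerate's actual implementation, and `SharedState` and its `plugins` attribute are hypothetical names.

```python
class SharedState:
    """Toy shared-state class: every instance reads and writes the same attribute dict."""

    _shared_state = {}

    def __init__(self, plugins=None):
        # All instances point their attribute storage at one class-level dict.
        self.__dict__ = self._shared_state
        if plugins is not None:
            self.plugins = plugins


first = SharedState(plugins={"first_model": "zero2", "second_model": "zero3"})
second = SharedState()  # no arguments, yet it sees what `first` registered

print(sorted(second.plugins))
```

The two objects are distinct instances, but anything stored through one is visible through the other, which is why selecting a plugin through either accelerator's `state` affects both.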

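Call `select_deepspeed_plugin()` on either accelerator's `state` (for example `first_accelerator.state.select_deepspeed_plugin()`) to enable a particular plugin and disable the other, and then call `prepare`.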

# The plugin can be selected through `first_accelerator`, `second_accelerator`, or by calling `AcceleratorState().select_deepspeed_plugin(...)`
first_accelerator.state.select_deepspeed_plugin("first_model")
first_model = AutoModel.from_pretrained(...)
# For this example, `get_training_items` is a nonexistent function that gets the setup we need for training
first_optimizer, first_scheduler, train_dl, eval_dl = get_training_items(first_model)
first_model, first_optimizer, first_scheduler, train_dl, eval_dl = first_accelerator.prepare(
    first_model, first_optimizer, first_scheduler, train_dl, eval_dl
)

second_accelerator.state.select_deepspeed_plugin("second_model")
second_model = AutoModel.from_pretrained(...)
# For this example, `get_training_items` is a nonexistent function that gets the setup we need for training
second_optimizer, second_scheduler, _, _ = get_training_items(second_model)
second_model, second_optimizer, second_scheduler = second_accelerator.prepare(
    second_model, second_optimizer, second_scheduler
)

Now you are ready to train.

for batch in train_dl:
    # Each model is stepped with its own accelerator, optimizer, and scheduler
    outputs1 = first_model(**batch)
    first_accelerator.backward(outputs1.loss)
    first_optimizer.step()
    first_scheduler.step()
    first_optimizer.zero_grad()

    outputs2 = second_model(**batch)
    second_accelerator.backward(outputs2.loss)
    second_optimizer.step()
    second_scheduler.step()
    second_optimizer.zero_grad()

Resources

To see more examples, check out the related tests currently in Accelerate.