Optimizers

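Transformers offers two native optimizers, AdamW and AdaFactor, and integrates several more specialized optimizers. Install the library that provides the optimizer, then pass its name to the `optim` parameter of TrainingArguments.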

This guide shows how to use these optimizers with Trainer, starting from the TrainingArguments configuration shown below.

import torch
from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM, Trainer

args = TrainingArguments(
    output_dir="./test-optimizer",
    max_steps=1000,
    per_device_train_batch_size=4,
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-5,
    save_strategy="no",
    run_name="optimizer-name",
)
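Since every example in this guide changes only a few arguments, one way to avoid repetition is to keep the shared keyword arguments in a plain dict and overlay the optimizer-specific ones. The sketch below is an illustration, not part of Transformers: `BASE_KWARGS` and `with_optimizer` are hypothetical names, and the final `TrainingArguments(**kwargs)` call is left as a comment so the snippet runs without the library installed.

```python
# Shared settings copied from the base configuration above.
BASE_KWARGS = {
    "output_dir": "./test-optimizer",
    "max_steps": 1000,
    "per_device_train_batch_size": 4,
    "logging_strategy": "steps",
    "logging_steps": 1,
    "learning_rate": 2e-5,
    "save_strategy": "no",
}


def with_optimizer(optim, **overrides):
    """Return a new kwargs dict for one optimizer (hypothetical helper)."""
    kwargs = dict(BASE_KWARGS)            # copy so the base stays untouched
    kwargs.update(overrides)              # optimizer-specific overrides win
    kwargs["optim"] = optim
    kwargs.setdefault("run_name", optim)  # default the run name to the optimizer
    return kwargs


grok_kwargs = with_optimizer("grokadamw")
lomo_kwargs = with_optimizer("adalomo", learning_rate=2e-6, gradient_checkpointing=True)

# With Transformers installed, the dict can be unpacked directly:
# args = TrainingArguments(**grok_kwargs)
```

Keeping the base dict immutable in practice (always copying it) means one optimizer's overrides cannot leak into the next run.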

APOLLO

pip install apollo-torch

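Approximated Gradient Scaling for Memory Efficient LLM Optimization (APOLLO) is a memory-efficient optimizer that allows full-parameter learning for both pretraining and fine-tuning. It maintains AdamW-level performance with SGD-like memory efficiency. For extreme memory efficiency, you can use APOLLO-Mini, a rank 1 variant of APOLLO. APOLLO optimizers support: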

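Ultra-low rank efficiency. You can use a much lower rank than GaLoRE; rank 1 is sufficient.
Avoiding expensive SVD computations. APOLLO uses random projections instead, which avoids training stalls.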

Use the `optim_target_modules` parameter to specify which layers to train.

import torch
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./test-apollo",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="apollo_adamw",
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-5,
    save_strategy="no",
    run_name="apollo_adamw",
)
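To get a feel for which layers patterns such as `r".*.attn.*"` select, you can try them against module names with Python's `re` module. The sketch below assumes a pattern is compared against the full module name with `re.fullmatch`; the matching logic inside Trainer may differ in detail, and the module names are made up for illustration.

```python
import re

patterns = [r".*.attn.*", r".*.mlp.*"]

# Hypothetical module names; real names depend on the model architecture.
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.up_proj",
    "model.embed_tokens",
    "lm_head",
]


def is_targeted(name, patterns):
    """True if any pattern matches the whole module name."""
    return any(re.fullmatch(p, name) for p in patterns)


targeted = [name for name in module_names if is_targeted(name, patterns)]
print(targeted)
```

In these patterns the unescaped `.` matches any single character, so `.*.attn.*` also matches names containing `self_attn`.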

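For additional training options, use `optim_args` to define hyperparameters such as `rank` and `scale`. Refer to the table below for a complete list of available hyperparameters.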

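The `scale` parameter can be set to `n/r`, where `n` is the original space dimension and `r` is the low-rank space dimension. You can achieve a similar effect by adjusting the learning rate while keeping `scale` at its default value.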

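Parameter	Description	APOLLO	APOLLO-Mini
`rank`	rank of the auxiliary subspace used for gradient scaling	256	1
`scale_type`	how scaling factors are applied	`channel` (per-channel scaling)	`tensor` (per-tensor scaling)
`scale`	adjusts gradient updates to stabilize training	1.0	128
`update_proj_gap`	number of steps before the projection matrices are updated	200	200
`proj`	projection type	`random`	`random`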

The example below enables the APOLLO-Mini optimizer.

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./test-apollo_mini",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="apollo_adamw",
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
    optim_args="proj=random,rank=1,scale=128.0,scale_type=tensor,update_proj_gap=200",
)
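The `optim_args` value is a comma-separated list of `key=value` pairs, as in the example above. A small helper can assemble that string from a dict, which makes the presets from the table easier to compare and tweak. `format_optim_args` and the two preset dicts are hypothetical names for illustration; the values are taken from the table.

```python
# Presets copied from the hyperparameter table.
APOLLO = {"proj": "random", "rank": 256, "scale": 1.0,
          "scale_type": "channel", "update_proj_gap": 200}
APOLLO_MINI = {"proj": "random", "rank": 1, "scale": 128.0,
               "scale_type": "tensor", "update_proj_gap": 200}


def format_optim_args(options):
    """Join options into a 'key=value,key=value' string, sorted by key."""
    return ",".join(f"{key}={value}" for key, value in sorted(options.items()))


mini_args = format_optim_args(APOLLO_MINI)
print(mini_args)
# proj=random,rank=1,scale=128.0,scale_type=tensor,update_proj_gap=200

# Overriding a single hyperparameter while keeping the rest of a preset:
custom_args = format_optim_args({**APOLLO, "rank": 128})
```

The resulting string can be passed as `optim_args=mini_args` to TrainingArguments.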

GrokAdamW

pip install grokadamw

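GrokAdamW is an optimizer designed to help models that benefit from *grokking*, a term used to describe delayed generalization caused by slow-varying gradients. It is particularly useful for models that need more advanced optimization techniques to achieve better performance and stability.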

import torch
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./test-grokadamw",
    max_steps=1000,
    per_device_train_batch_size=4,
    optim="grokadamw",
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-5,
    save_strategy="no",
    run_name="grokadamw",
)

LOMO

pip install lomo-optim

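Low-Memory Optimization (LOMO) is a family of optimizers, LOMO and AdaLomo, designed for low-memory full-parameter fine-tuning of LLMs. Both LOMO optimizers fuse the gradient computation and the parameter update into a single step to reduce memory usage. AdaLomo builds on LOMO by adding an adaptive learning rate for each parameter, similar to the Adam optimizer.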
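The idea of fusing the gradient computation with the parameter update can be illustrated on a toy two-weight model in plain Python. This is only a sketch of the concept, not the `lomo-optim` implementation: the model `out = w2 * (w1 * x)`, the squared-error loss, and both function names are assumptions chosen for illustration. The fused version updates each weight as soon as its gradient is known, so no gradient has to be kept until the end of the backward pass.

```python
import math


def sgd_step_standard(w1, w2, x, y, lr):
    """Compute all gradients first, then update (both gradients are held)."""
    h = w1 * x
    out = w2 * h
    d_out = 2.0 * (out - y)      # derivative of (out - y)^2 w.r.t. out
    g2 = d_out * h               # gradient for w2
    g1 = d_out * w2 * x          # gradient for w1
    return w1 - lr * g1, w2 - lr * g2


def sgd_step_fused(w1, w2, x, y, lr):
    """Update each weight during the backward pass and drop its gradient."""
    h = w1 * x
    out = w2 * h
    d_out = 2.0 * (out - y)
    d_h = d_out * w2             # propagate through layer 2 with its old weight
    w2 = w2 - lr * (d_out * h)   # update w2 right away; its gradient is not stored
    w1 = w1 - lr * (d_h * x)     # then continue backward and update w1
    return w1, w2


standard = sgd_step_standard(0.5, -1.0, x=2.0, y=1.0, lr=0.1)
fused = sgd_step_fused(0.5, -1.0, x=2.0, y=1.0, lr=0.1)
same = all(math.isclose(a, b) for a, b in zip(standard, fused))
print(same)
```

Both functions return the same weights. The difference is that the fused version never holds the gradients of both weights at the same time, which is where the memory saving comes from when the weights are large tensors.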

AdaLomo without `grad_norm` is recommended for better performance and higher throughput.

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./test-lomo",
    max_steps=1000,
    per_device_train_batch_size=4,
    optim="adalomo",
    gradient_checkpointing=True,
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-6,
    save_strategy="no",
    run_name="adalomo",
)

Schedule Free Optimizer

pip install schedulefree

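The Schedule Free optimizer (SFO) replaces the base optimizer's momentum with a combination of averaging and interpolation. Unlike a traditional scheduler, SFO completely removes the need to anneal the learning rate.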
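To make the averaging and interpolation idea concrete, here is a minimal plain-Python sketch of a schedule-free SGD loop on a one-dimensional quadratic. It assumes the commonly described formulation with three sequences (`z` takes the gradient steps, `x` is a running average of `z`, and the gradient is evaluated at an interpolation `y` of the two) and uniform averaging weights; the `schedulefree` package may differ in details such as warmup and weighting. The objective and all names are chosen for illustration.

```python
def grad(w):
    """Gradient of the toy objective f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def schedule_free_sgd(steps, lr=0.1, beta=0.9, start=0.0):
    z = start  # base sequence: plain gradient steps with a constant lr
    x = start  # averaged sequence: the point used for evaluation
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x  # interpolation: where the gradient is taken
        z = z - lr * grad(y)             # constant step size, no schedule
        c = 1.0 / (t + 1)                # uniform averaging weight
        x = (1.0 - c) * x + c * z        # averaging takes the place of momentum
    return x


x_final = schedule_free_sgd(steps=1000)
print(x_final)
```

The learning rate stays constant for the whole run; the averaged sequence `x` settles near the minimizer at 3 without any annealing.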

SFO supports the RAdam (`schedule_free_radam`), AdamW (`schedule_free_adamw`), and SGD (`schedule_free_sgd`) optimizers. The RAdam variant doesn't require `warmup_steps` or `warmup_ratio`.

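By default, it is recommended to set `lr_scheduler_type` to `"constant"`. Other `lr_scheduler_type` values may also work, but combining SFO optimizers with other learning rate schedules could affect SFO's intended behavior and performance.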

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./test-schedulefree",
    max_steps=1000,
    per_device_train_batch_size=4,
    optim="schedule_free_radam",
    lr_scheduler_type="constant",
    gradient_checkpointing=True,
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-6,
    save_strategy="no",
    run_name="sfo",
)
