PPO Trainer
TRL supports training large language models (LLMs) with Proximal Policy Optimization (PPO).
Get started
To just run a PPO script and make sure the trainer works, you can run the following command, which trains a PPO model with a dummy reward model.
python examples/scripts/ppo/ppo.py \
    --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
    --dataset_train_split descriptiveness \
    --learning_rate 3e-6 \
    --num_ppo_epochs 1 \
    --num_mini_batches 1 \
    --output_dir models/minimal/ppo \
    --per_device_train_batch_size 64 \
    --gradient_accumulation_steps 1 \
    --total_episodes 10000 \
    --model_name_or_path EleutherAI/pythia-1b-deduped \
    --sft_model_path EleutherAI/pythia-1b-deduped \
    --reward_model_path EleutherAI/pythia-1b-deduped \
    --missing_eos_penalty 1.0
Explanation of the logged metrics
The logged metrics are as follows. Here is an example tracked run at Weights and Biases.
- `eps`: Tracks the number of episodes per second.
- `objective/kl`: The mean Kullback-Leibler (KL) divergence between the current policy and the reference policy.
- `objective/entropy`: The mean entropy of the policy, indicating the randomness of the actions chosen by the policy.
- `objective/non_score_reward`: The mean reward from non-score-related sources, basically `beta * kl.sum(1)`, where `beta` is the KL penalty coefficient and `kl` is the per-token KL divergence.
- `objective/rlhf_reward`: The mean RLHF reward, which is `score - non_score_reward`.
- `objective/scores`: The mean scores returned by the reward model/environment.
- `policy/approxkl_avg`: The mean approximate KL divergence between consecutive PPO policies. Note that this is not the same as `objective/kl`.
- `policy/clipfrac_avg`: The mean fraction of policy updates that are clipped, indicating how often the policy update is constrained to prevent large changes.
- `loss/policy_avg`: The average policy loss, indicating how well the policy is performing.
- `loss/value_avg`: The average value loss, indicating the difference between the predicted value and the actual reward.
- `val/clipfrac_avg`: The mean fraction of value function updates that are clipped, similar to `policy/clipfrac_avg` but for the value function.
- `policy/entropy_avg`: The average entropy of the policy during training, indicating how diverse the policy's actions are.
- `val/ratio`: The mean ratio of the current policy probability to the old policy probability, providing a measure of how much the policy has changed.
- `val/ratio_var`: The variance of `val/ratio`, indicating the variability in policy changes.
- `val/num_eos_tokens`: The number of end-of-sequence (EOS) tokens generated, which can indicate the number of complete responses.
- `lr`: The current learning rate used by the optimizer.
- `episode`: The current episode count in the training process.
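To make the reward bookkeeping concrete, here is a minimal sketch of how a per-token KL penalty combines with reward-model scores into the quantities above, following the formulas stated in the list; this is not the trainer's actual code, and the tensor shapes and values are illustrative assumptions.

import torch

beta = 0.05                   # KL penalty coefficient (kl_coef)
kl = torch.rand(4, 16)        # assumed per-token KL, shape (batch, response_length)
score = torch.randn(4)        # assumed reward-model score, one per sequence

non_score_reward = beta * kl.sum(1)       # logged (mean) as objective/non_score_reward
rlhf_reward = score - non_score_reward    # logged (mean) as objective/rlhf_reward
print(non_score_reward.mean().item(), rlhf_reward.mean().item())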
Cookbook
- Debugging TIP: `objective/rlhf_reward`: this is the ultimate objective of RLHF training. If training works as intended, this metric should keep going up.
- Debugging TIP: `val/ratio`: this number should float around 1.0, and it gets clipped by `--cliprange 0.2` in PPO's surrogate loss. If the ratio is too high, like 2.0 or 1000.0, or too small, like 0.1, it means the updates between consecutive policies are too drastic; you should try to understand why this is happening and fix it.
- Memory TIP: if you are running out of memory, you can try reducing `--per_device_train_batch_size` or increasing `--gradient_accumulation_steps` to reduce the memory footprint.
- Memory TIP: if you have multiple GPUs, you can also run training with DeepSpeed stage 3 to reduce the memory footprint: `accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml`.
- Usage TIP: we recommend using the "EOS trick" via `--missing_eos_penalty`, which subtracts a fixed scalar penalty from the score of completions that do not end with an EOS token. This can help the model learn to generate more coherent completions; a sketch of the idea follows this list.
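As a rough illustration of the EOS trick, the following sketch subtracts the penalty from completions that never emit the EOS token; this is not TRL's internal implementation, and the token ids and scores are made-up assumptions.

import torch

missing_eos_penalty = 1.0
eos_token_id = 0                                      # assumed EOS id
responses = torch.tensor([[5, 7, eos_token_id, 2],    # emits EOS (then padding)
                          [5, 7, 9, 11]])             # never emits EOS
scores = torch.tensor([3.2, 4.1])

contains_eos = (responses == eos_token_id).any(dim=1)
scores[~contains_eos] -= missing_eos_penalty          # subtract the fixed penalty
print(scores)  # tensor([3.2000, 3.1000])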
What is my model doing exactly?
To help you understand what your model is doing, we periodically log some sample completions from the model. Here is an example of a completion. In an example tracked run at Weights and Biases, it looks like the following, allowing you to see the model's responses at different stages of training. By default we generate `--num_sample_generations 10` samples during training, but you can customize the number of generations.
In the logs the sampled generations look like:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ query ┃ model response ┃ score ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ SUBREDDIT: r/AskReddit │ I'm in love with a friend, and │ 3.921875 │
│ │ I don't know how to get rid of │ │
│ TITLE: How do you get someone │ those feelings. I'm │ │
│ out of your head? │ desperate.<|endoftext|>[PAD][P… │ │
│ │ │ │
│ POST: Hi, │ │ │
│ I'm 22, and I have been with my │ │ │
│ girlfriend for 5 years now. We │ │ │
│ recently moved together. We've │ │ │
│ always loved each other │ │ │
│ intensely. │ │ │
│ │ │ │
│ Problem, I recently started to │ │ │
│ have feelings for an other │ │ │
│ person (a friend). This person │ │ │
│ has had a boyfriend for now 3 │ │ │
│ years, and has absolutely no │ │ │
│ ideas. Those feelings were so │ │ │
│ strong, it was hard to hide │ │ │
│ them. After 2 months of me │ │ │
│ being distant and really sad, │ │ │
│ my girlfriend forced me to say │ │ │
│ what was bothering me. I'm not │ │ │
│ a good liar, and now she knows. │ │ │
│ │ │ │
│ We decided to give us a week │ │ │
│ alone, I went to my parents. │ │ │
│ │ │ │
│ Now, I'm completely lost. I │ │ │
│ keep on thinking about this │ │ │
│ person, and I hate that. I │ │ │
│ would like for those feelings │ │ │
│ to go away, to leave me alone. │ │ │
│ But I can't. │ │ │
│ │ │ │
│ What do I do? It's been 3 │ │ │
│ months now, and I'm just │ │ │
│ desperate. │ │ │
│ │ │ │
│ TL;DR: │ │ │
├─────────────────────────────────┼─────────────────────────────────┼──────────┤
│ SUBREDDIT: r/pettyrevenge │ My mom woke me up with a loud │ 6.84375 │
│ │ TV. I blasted Gangnam Style on │ │
│ TITLE: So, my mom woke me up │ repeat, with the bass cranked │ │
│ with a loud TV. │ up as high as it could │ │
│ │ go.<|endoftext|>[PAD][PAD][PAD… │ │
│ POST: She was in her living │ │ │
│ room, watching TV. This was at │ │ │
│ about 8:30 in the morning, and │ │ │
│ she was exercising. She turned │ │ │
│ the TV up extra loud to hear it │ │ │
│ over her excercycle, and woke │ │ │
│ me up. I went in there asking │ │ │
│ for her to turn it down. She │ │ │
│ said she didn't have to; I │ │ │
│ explained that I always used │ │ │
│ headphones so she didn't have │ │ │
│ to deal with my noise and that │ │ │
│ she should give me a little │ │ │
│ more respect, given that I paid │ │ │
│ rent at the time. │ │ │
│ │ │ │
│ She disagreed. I went back to │ │ │
│ my room, rather pissed off at │ │ │
│ the lack of equality. I had no │ │ │
│ lock on my door; but I had a │ │ │
│ dresser right next to it, so I │ │ │
│ pulled one of the drawers out │ │ │
│ enough so that it caused the │ │ │
│ door to not be openable. Then, │ │ │
│ I turned my speakers up really │ │ │
│ loud and blasted Gangnam Style │ │ │
│ on repeat, with the bass │ │ │
│ cranked up as high as it could │ │ │
│ go. │ │ │
│ │ │ │
│ If you hate Gangnam Style for │ │ │
│ being overplayed, you will see │ │ │
│ why I chose that particular │ │ │
│ song. I personally don't mind │ │ │
│ it. But here's the thing about │ │ │
│ my bass; it vibrates the walls, │ │ │
│ making one hell of a lot of │ │ │
│ noise. Needless to say, my mom │ │ │
│ was not pleased and shut off │ │ │
│ the internet. But it was oh so │ │ │
│ worth it. │ │ │
│ │ │ │
│ TL;DR: │ │ │
└─────────────────────────────────┴─────────────────────────────────┴──────────┘
Implementation details
This PPO implementation is based on The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization.
Benchmark experiments
To validate the PPO implementation works, we ran experiments on a 1B parameter model. Here are the commands we used to run the experiments. We take the SFT / RM models directly from The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization.
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
examples/scripts/ppo/ppo_tldr.py \
--output_dir models/minimal/ppo_tldr \
--learning_rate 3e-6 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 4 \
--total_episodes 1000000 \
--model_name_or_path EleutherAI/pythia-1b-deduped \
--sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
--reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
--local_rollout_forward_batch_size 16 \
--missing_eos_penalty 1.0 \
--stop_token eos
Checkpoints and experiment tracking are available at the following links:
For evaluation, we load the checkpoints with vLLM and use GPT-4o mini as a judge model to evaluate the generated TL;DRs against the reference TL;DRs. For more information on how to use judges, see Judges.
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 33.00%
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path vwxyzjn/ppo_tldr --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 64.70%
The PPO checkpoint gets a 64.7% preferred rate vs. the 33.0% preference rate of the SFT checkpoint. This is a good sign that the PPO training worked as intended.
Metrics
# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
# to use it, change `?we=huggingface&wpn=trl` to your own project and `?tag=pr-1540` to your own tag
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=train/episode&ceik=output_dir&cen=sft_model_path&metrics=train/objective/rlhf_reward&metrics=train/objective/scores&metrics=train/objective/kl&metrics=train/objective/non_score_reward&metrics=train/objective/entropy&metrics=train/policy/approxkl_avg&metrics=train/policy/clipfrac_avg&metrics=train/loss/policy_avg&metrics=train/loss/value_avg&metrics=train/val/clipfrac_avg&metrics=train/policy/entropy_avg&metrics=train/val/ratio&metrics=train/val/ratio_var&metrics=train/val/num_eos_tokens&metrics=train/lr&metrics=train/eps' \
"cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr?tag=pr-1540" \
--env-ids models/minimal/ppo_tldr \
--pc.ncols 4 \
--pc.ncols-legend 1 \
--pc.xlabel "Episode" \
--output-filename benchmark/trl/pr-1540/ppo \
--scan-history
PPOTrainer
class trl.PPOTrainer
< source >( args: PPOConfig processing_class: typing.Union[transformers.tokenization_utils_base.PreTrainedTokenizerBase, transformers.image_processing_utils.BaseImageProcessor, transformers.feature_extraction_utils.FeatureExtractionMixin, transformers.processing_utils.ProcessorMixin, NoneType] model: Module ref_model: typing.Optional[torch.nn.modules.module.Module] reward_model: Module train_dataset: Dataset value_model: Module data_collator: typing.Optional[transformers.data.data_collator.DataCollatorWithPadding] = None eval_dataset: typing.Union[datasets.arrow_dataset.Dataset, dict[str, datasets.arrow_dataset.Dataset], NoneType] = None optimizers: tuple = (None, None) callbacks: typing.Optional[list[transformers.trainer_callback.TrainerCallback]] = None peft_config: typing.Optional[ForwardRef('PeftConfig')] = None )
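Below is a minimal usage sketch based on the signature above. The model choices, the assumption that the dataset exposes a `prompt` column, and the tokenization details are all illustrative; `examples/scripts/ppo/ppo.py` is the reference script.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import PPOConfig, PPOTrainer

model_id = "EleutherAI/pythia-160m"  # assumed small model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

policy = AutoModelForCausalLM.from_pretrained(model_id)
ref_policy = AutoModelForCausalLM.from_pretrained(model_id)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
value_model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)

# The trainer consumes tokenized prompts ("input_ids").
dataset = load_dataset("trl-internal-testing/descriptiveness-sentiment-trl-style", split="descriptiveness")
dataset = dataset.map(lambda row: tokenizer(row["prompt"]), remove_columns=dataset.column_names)

training_args = PPOConfig(output_dir="models/minimal/ppo", total_episodes=1000)
trainer = PPOTrainer(
    args=training_args,
    processing_class=tokenizer,
    model=policy,
    ref_model=ref_policy,
    reward_model=reward_model,
    value_model=value_model,
    train_dataset=dataset,
)
trainer.train()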
push_to_hub
< source >( commit_message: typing.Optional[str] = 'End of training' blocking: bool = True token: typing.Optional[str] = None revision: typing.Optional[str] = None **kwargs )
Upload `self.model` and `self.processing_class` to the 🤗 Hub repository `self.args.hub_model_id`.
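For example, after training (the commit message is illustrative):

trainer.push_to_hub(commit_message="End of PPO training")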
PPOConfig
class trl.PPOConfig
< source >( output_dir: typing.Optional[str] = None overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False eval_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: typing.Optional[int] = None per_gpu_eval_batch_size: typing.Optional[int] = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: typing.Optional[int] = None eval_delay: typing.Optional[float] = 0 torch_empty_cache_steps: typing.Optional[int] = None learning_rate: float = 5e-05 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: float = 1e-08 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: typing.Union[transformers.trainer_utils.SchedulerType, str] = 'linear' lr_scheduler_kwargs: typing.Union[dict[str, typing.Any], str, NoneType] = <factory> warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: str = 'passive' log_level_replica: str = 'warning' log_on_each_node: bool = True logging_dir: typing.Optional[str] = None logging_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps' logging_first_step: bool = False logging_steps: float = 10 logging_nan_inf_filter: bool = True save_strategy: typing.Union[transformers.trainer_utils.SaveStrategy, str] = 'steps' save_steps: float = 500 save_total_limit: typing.Optional[int] = None save_safetensors: typing.Optional[bool] = True save_on_each_node: bool = False save_only_model: bool = False restore_callback_states_from_checkpoint: bool = False no_cuda: bool = False use_cpu: bool = False use_mps_device: bool = False seed: int = 42 data_seed: typing.Optional[int] = None jit_mode_eval: bool = False use_ipex: bool = False bf16: typing.Optional[bool] = None fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'auto' bf16_full_eval: bool = False fp16_full_eval: bool = False tf32: typing.Optional[bool] = None local_rank: int = -1 ddp_backend: typing.Optional[str] = None tpu_num_cores: typing.Optional[int] = None tpu_metrics_debug: bool = False debug: typing.Union[str, list[transformers.debug_utils.DebugOption]] = '' dataloader_drop_last: bool = False eval_steps: typing.Optional[float] = None dataloader_num_workers: int = 0 dataloader_prefetch_factor: typing.Optional[int] = None past_index: int = -1 run_name: typing.Optional[str] = None disable_tqdm: typing.Optional[bool] = None remove_unused_columns: typing.Optional[bool] = True label_names: typing.Optional[list[str]] = None load_best_model_at_end: typing.Optional[bool] = False metric_for_best_model: typing.Optional[str] = None greater_is_better: typing.Optional[bool] = None ignore_data_skip: bool = False fsdp: typing.Union[list[transformers.trainer_utils.FSDPOption], str, NoneType] = '' fsdp_min_num_params: int = 0 fsdp_config: typing.Union[dict[str, typing.Any], str, NoneType] = None fsdp_transformer_layer_cls_to_wrap: typing.Optional[str] = None accelerator_config: typing.Union[dict, str, NoneType] = None deepspeed: typing.Union[dict, str, NoneType] = None label_smoothing_factor: float = 0.0 optim: typing.Union[transformers.training_args.OptimizerNames, str] = 'adamw_torch' optim_args: typing.Optional[str] = None adafactor: bool = False group_by_length: bool = False length_column_name: typing.Optional[str] = 'length' report_to: typing.Union[NoneType, str, list[str]] = None 
ddp_find_unused_parameters: typing.Optional[bool] = None ddp_bucket_cap_mb: typing.Optional[int] = None ddp_broadcast_buffers: typing.Optional[bool] = None dataloader_pin_memory: bool = True dataloader_persistent_workers: bool = False skip_memory_metrics: bool = True use_legacy_prediction_loop: bool = False push_to_hub: bool = False resume_from_checkpoint: typing.Optional[str] = None hub_model_id: typing.Optional[str] = None hub_strategy: typing.Union[transformers.trainer_utils.HubStrategy, str] = 'every_save' hub_token: typing.Optional[str] = None hub_private_repo: typing.Optional[bool] = None hub_always_push: bool = False hub_revision: typing.Optional[str] = None gradient_checkpointing: bool = False gradient_checkpointing_kwargs: typing.Union[dict[str, typing.Any], str, NoneType] = None include_inputs_for_metrics: bool = False include_for_metrics: list = <factory> eval_do_concat_batches: bool = True fp16_backend: str = 'auto' push_to_hub_model_id: typing.Optional[str] = None push_to_hub_organization: typing.Optional[str] = None push_to_hub_token: typing.Optional[str] = None mp_parameters: str = '' auto_find_batch_size: bool = False full_determinism: bool = False torchdynamo: typing.Optional[str] = None ray_scope: typing.Optional[str] = 'last' ddp_timeout: int = 1800 torch_compile: bool = False torch_compile_backend: typing.Optional[str] = None torch_compile_mode: typing.Optional[str] = None include_tokens_per_second: typing.Optional[bool] = False include_num_input_tokens_seen: typing.Optional[bool] = False neftune_noise_alpha: typing.Optional[float] = None optim_target_modules: typing.Union[NoneType, str, list[str]] = None batch_eval_metrics: bool = False eval_on_start: bool = False use_liger_kernel: typing.Optional[bool] = False liger_kernel_config: typing.Optional[dict[str, bool]] = None eval_use_gather_object: typing.Optional[bool] = False average_tokens_across_devices: typing.Optional[bool] = True dataset_num_proc: typing.Optional[int] = None num_mini_batches: int = 1 total_episodes: typing.Optional[int] = None local_rollout_forward_batch_size: int = 64 num_sample_generations: int = 10 response_length: int = 53 stop_token: typing.Optional[typing.Literal['eos']] = None stop_token_id: typing.Optional[int] = None temperature: float = 0.7 missing_eos_penalty: typing.Optional[float] = None sft_model_path: str = 'EleutherAI/pythia-160m' world_size: typing.Optional[int] = None num_total_batches: typing.Optional[int] = None micro_batch_size: typing.Optional[int] = None local_batch_size: typing.Optional[int] = None batch_size: typing.Optional[int] = None local_mini_batch_size: typing.Optional[int] = None mini_batch_size: typing.Optional[int] = None exp_name: str = 'ppo_config' reward_model_path: str = 'EleutherAI/pythia-160m' model_adapter_name: typing.Optional[str] = None ref_adapter_name: typing.Optional[str] = None num_ppo_epochs: int = 4 whiten_rewards: bool = False kl_coef: float = 0.05 kl_estimator: typing.Literal['k1', 'k3'] = 'k1' cliprange: float = 0.2 vf_coef: float = 0.1 cliprange_value: float = 0.2 gamma: float = 1.0 lam: float = 0.95 ds3_gather_for_generation: bool = True )
Parameters
- exp_name (`str`, optional, defaults to `os.path.basename(__file__)[:-3]`): Name of this experiment.
- reward_model_path (`str`, optional, defaults to `"EleutherAI/pythia-160m"`): Path to the reward model.
- model_adapter_name (`str` or `None`, optional, defaults to `None`): Name of the train target PEFT adapter, when using LoRA with multiple adapters.
- ref_adapter_name (`str` or `None`, optional, defaults to `None`): Name of the reference PEFT adapter, when using LoRA with multiple adapters.
- num_ppo_epochs (`int`, optional, defaults to `4`): Number of epochs to train.
- whiten_rewards (`bool`, optional, defaults to `False`): Whether to whiten the rewards.
- kl_coef (`float`, optional, defaults to `0.05`): KL coefficient.
- kl_estimator (`Literal["k1", "k3"]`, optional, defaults to `"k1"`): Which estimator of the approximate KL divergence to use. Defaults to "k1", a straightforward, unbiased estimator. Can be set to "k3", an unbiased estimator with lower variance that appears to be "a strictly better estimator". Cannot be set to "k2", as it is used for logging purposes. A sketch of both estimators follows this parameter list.
- cliprange (`float`, optional, defaults to `0.2`): Clip range.
- vf_coef (`float`, optional, defaults to `0.1`): Value function coefficient.
- cliprange_value (`float`, optional, defaults to `0.2`): Clip range for the value function.
- gamma (`float`, optional, defaults to `1.0`): Discount factor.
- lam (`float`, optional, defaults to `0.95`): Lambda value for GAE.
- ds3_gather_for_generation (`bool`, optional, defaults to `True`): This setting applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation, improving generation speed. However, disabling this option allows training models that exceed the VRAM capacity of a single GPU, albeit at the cost of slower generation.
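As referenced under `kl_estimator`, here is a minimal sketch of the k1 and k3 estimators, under the stated assumption that `logprob` and `ref_logprob` are per-token log-probabilities of the current and reference policies (the values are made up for illustration):

import torch

logprob = torch.tensor([-1.2, -0.7, -2.1])      # per-token log-probs, current policy
ref_logprob = torch.tensor([-1.0, -0.9, -1.8])  # per-token log-probs, reference policy

logr = ref_logprob - logprob        # log of the probability ratio
k1 = -logr                          # k1: straightforward, unbiased
k3 = (logr.exp() - 1) - logr        # k3: unbiased, lower variance
print(k1.mean().item(), k3.mean().item())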
Configuration class for the PPOTrainer.
This class includes only the parameters that are specific to PPO training. For a full list of training arguments, please refer to the TrainingArguments and OnPolicyConfig documentation. Note that default values in this class may differ from those in TrainingArguments.
Using HfArgumentParser we can turn this class into argparse arguments that can be specified on the command line.
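For instance, a minimal sketch (the script name and example flags are illustrative):

# ppo_config_cli.py
from transformers import HfArgumentParser
from trl import PPOConfig

parser = HfArgumentParser(PPOConfig)
(config,) = parser.parse_args_into_dataclasses()
print(config.learning_rate, config.num_ppo_epochs, config.kl_coef)

# Run as, e.g.: python ppo_config_cli.py --output_dir models/minimal/ppo --num_ppo_epochs 1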