RLOO Trainer
TRL supports training LLMs with REINFORCE Leave-One-Out (RLOO). The idea is that instead of using a value function, RLOO generates K completions for each prompt. For each completion, RLOO uses the mean score of the other K-1 completions as a baseline to calculate the advantage. RLOO also models the entire completion as a single action, whereas PPO models each token as an action. Note that REINFORCE / A2C is a special case of PPO when the number of PPO epochs is 1 and the number of mini-batches is 1, which is how we implement RLOO in TRL.
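To make the leave-one-out baseline concrete, here is a minimal sketch for a single prompt with K = 4 completions; the reward values are made up for illustration and are not from any real run.

import torch

# Hypothetical rewards for K = 4 completions of one prompt (illustrative values only)
rewards = torch.tensor([1.0, 2.0, 5.0, 8.0])
k = rewards.numel()

# Leave-one-out baseline: for each completion, average the rewards of the other K-1 completions
baseline = (rewards.sum() - rewards) / (k - 1)
advantages = rewards - baseline

print(baseline)    # tensor([5.0000, 4.6667, 3.6667, 2.6667])
print(advantages)  # tensor([-4.0000, -2.6667,  1.3333,  5.3333])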
References
- Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
- A2C is a special case of PPO
- Fine-Tuning Language Models from Human Preferences
- Learning to Summarize from Human Feedback
- The N Implementation Details of RLHF with PPO
- The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization
Get started
To just run a RLOO script to make sure the trainer can run, you can run the following command to train a RLOO model with a dummy reward model.
python examples/scripts/rloo/rloo.py \
    --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
    --dataset_train_split descriptiveness \
    --learning_rate 3e-6 \
    --output_dir models/minimal/rloo \
    --per_device_train_batch_size 64 \
    --gradient_accumulation_steps 1 \
    --total_episodes 10000 \
    --model_name_or_path EleutherAI/pythia-14m \
    --reward_model_path EleutherAI/pythia-14m \
    --missing_eos_penalty 1.0
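If you prefer setting things up in Python rather than via the example script, here is a minimal sketch of the same dummy setup, based on the RLOOTrainer signature documented below. The dataset handling is an assumption: the trainer is given prompts tokenized into an "input_ids" column, and the column names of the dummy dataset may differ from what is shown here.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import RLOOConfig, RLOOTrainer

model_id = "EleutherAI/pythia-14m"
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

policy = AutoModelForCausalLM.from_pretrained(model_id)
ref_policy = AutoModelForCausalLM.from_pretrained(model_id)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)  # dummy reward model

# Assumption: the dataset exposes a text "prompt" column that we tokenize into "input_ids"
dataset = load_dataset("trl-internal-testing/descriptiveness-sentiment-trl-style", split="descriptiveness")
dataset = dataset.map(lambda row: {"input_ids": tokenizer(row["prompt"])["input_ids"]}, remove_columns=dataset.column_names)

training_args = RLOOConfig(
    output_dir="models/minimal/rloo",
    per_device_train_batch_size=64,
    total_episodes=10000,
    missing_eos_penalty=1.0,
)
trainer = RLOOTrainer(
    config=training_args,
    processing_class=tokenizer,
    policy=policy,
    ref_policy=ref_policy,
    reward_model=reward_model,
    train_dataset=dataset,
)
trainer.train()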
Explanation of the logged metrics
The logged metrics are as follows. Here is an example tracked run at Weights and Biases.
- `eps`: Tracks the number of episodes per second.
- `objective/kl`: The mean Kullback-Leibler (KL) divergence between the current policy and the reference policy.
- `objective/entropy`: The mean entropy of the policy, indicating the randomness of the actions chosen by the policy.
- `objective/non_score_reward`: The mean reward from non-score-related sources, basically `beta * kl.sum(1)`, where `beta` is the KL penalty coefficient and `kl` is the per-token KL divergence (see the sketch after this list).
- `objective/rlhf_reward`: The mean RLHF reward, which is `score - non_score_reward`.
- `objective/scores`: The mean scores returned by the reward model / environment.
- `policy/approxkl_avg`: The mean approximate KL divergence between consecutive PPO policies. Note that this is not the same as `objective/kl`.
- `policy/clipfrac_avg`: The mean fraction of policy updates that are clipped, indicating how often the policy updates are constrained to prevent large changes.
- `loss/policy_avg`: The mean policy loss, indicating how well the policy is performing.
- `val/clipfrac_avg`: The mean fraction of value function updates that are clipped, similar to `policy/clipfrac_avg` but for the value function.
- `policy/entropy_avg`: The mean entropy of the policy during training, indicating how diverse the policy's actions are.
- `val/ratio`: The mean ratio of the current policy probability to the old policy probability, providing a measure of how much the policy has changed.
- `val/ratio_var`: The variance of `val/ratio`, indicating the variability in policy changes.
- `val/num_eos_tokens`: The number of end-of-sequence (EOS) tokens generated, which can indicate the number of complete responses.
- `lr`: The current learning rate used by the optimizer.
- `episode`: The current global step or episode count in the training process.
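As a rough illustration of how the reward-related metrics above fit together, here is a sketch using the definitions given in the list; the tensors and values are illustrative, not the trainer's internal variables.

import torch

beta = 0.05                                    # KL penalty coefficient (kl_coef)
kl = torch.tensor([[0.10, 0.20, 0.05, 0.15]])  # hypothetical per-token KL for one completion
score = torch.tensor([3.9])                    # hypothetical reward-model score

non_score_reward = beta * kl.sum(1)     # objective/non_score_reward
rlhf_reward = score - non_score_reward  # objective/rlhf_reward
print(non_score_reward.item(), rlhf_reward.item())  # 0.025, 3.875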
Cookbook
- Debugging TIP: `objective/rlhf_reward`: this is the ultimate objective of the RLHF training. If training works as intended, this metric should keep going up.
- Debugging TIP: `val/ratio`: this number should float around 1.0, and it gets clipped by `--cliprange 0.2` with PPO's surrogate loss. So if this `ratio` is too high, like 2.0 or 1000.0, or too small, like 0.1, it means the updates between consecutive policies are too drastic. You should try to understand why this is happening and try to fix it.
- Memory TIP: If you are running out of memory, you can try to reduce `--per_device_train_batch_size` or increase `--gradient_accumulation_steps` to reduce the memory footprint.
- Memory TIP: If you have multiple GPUs, you can also run training with DeepSpeed stage 3 to reduce the memory footprint: `accelerate launch --config_file examples/accelerate_configs/deepspeed_zero3.yaml`.
- Usage TIP: We recommend using the "EOS trick" via `--missing_eos_penalty`, which subtracts a fixed scalar penalty from the score of completions that do not end with an EOS token. This can help the model learn to generate more coherent completions; a short sketch of the idea follows this list.
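The following sketch illustrates the idea behind the EOS trick from the last tip; the variable names and values are illustrative, not the trainer's internals.

import torch

# Hypothetical reward-model scores and a flag for whether each completion ended with EOS
scores = torch.tensor([3.9, 6.8, 2.5])
ends_with_eos = torch.tensor([True, False, True])
missing_eos_penalty = 1.0

# Subtract a fixed scalar penalty from completions that never produced an EOS token
scores = torch.where(ends_with_eos, scores, scores - missing_eos_penalty)
print(scores)  # tensor([3.9000, 5.8000, 2.5000])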
What is my model doing exactly?
To help you understand what your model is doing, we periodically log some sample completions from the model. Here is an example of a completion. In an example tracked run at Weights and Biases, it looks like the following, allowing you to see the model's responses at different stages of training. By default we generate `--num_sample_generations 10` samples during training, but you can customize the number of generations.
In the logs the sampled generations look like:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ query ┃ model response ┃ score ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━┩
│ SUBREDDIT: r/AskReddit │ I'm in love with a friend, and │ 3.921875 │
│ │ I don't know how to get rid of │ │
│ TITLE: How do you get someone │ those feelings. I'm │ │
│ out of your head? │ desperate.<|endoftext|>[PAD][P… │ │
│ │ │ │
│ POST: Hi, │ │ │
│ I'm 22, and I have been with my │ │ │
│ girlfriend for 5 years now. We │ │ │
│ recently moved together. We've │ │ │
│ always loved each other │ │ │
│ intensely. │ │ │
│ │ │ │
│ Problem, I recently started to │ │ │
│ have feelings for an other │ │ │
│ person (a friend). This person │ │ │
│ has had a boyfriend for now 3 │ │ │
│ years, and has absolutely no │ │ │
│ ideas. Those feelings were so │ │ │
│ strong, it was hard to hide │ │ │
│ them. After 2 months of me │ │ │
│ being distant and really sad, │ │ │
│ my girlfriend forced me to say │ │ │
│ what was bothering me. I'm not │ │ │
│ a good liar, and now she knows. │ │ │
│ │ │ │
│ We decided to give us a week │ │ │
│ alone, I went to my parents. │ │ │
│ │ │ │
│ Now, I'm completely lost. I │ │ │
│ keep on thinking about this │ │ │
│ person, and I hate that. I │ │ │
│ would like for those feelings │ │ │
│ to go away, to leave me alone. │ │ │
│ But I can't. │ │ │
│ │ │ │
│ What do I do? It's been 3 │ │ │
│ months now, and I'm just │ │ │
│ desperate. │ │ │
│ │ │ │
│ TL;DR: │ │ │
├─────────────────────────────────┼─────────────────────────────────┼──────────┤
│ SUBREDDIT: r/pettyrevenge │ My mom woke me up with a loud │ 6.84375 │
│ │ TV. I blasted Gangnam Style on │ │
│ TITLE: So, my mom woke me up │ repeat, with the bass cranked │ │
│ with a loud TV. │ up as high as it could │ │
│ │ go.<|endoftext|>[PAD][PAD][PAD… │ │
│ POST: She was in her living │ │ │
│ room, watching TV. This was at │ │ │
│ about 8:30 in the morning, and │ │ │
│ she was exercising. She turned │ │ │
│ the TV up extra loud to hear it │ │ │
│ over her excercycle, and woke │ │ │
│ me up. I went in there asking │ │ │
│ for her to turn it down. She │ │ │
│ said she didn't have to; I │ │ │
│ explained that I always used │ │ │
│ headphones so she didn't have │ │ │
│ to deal with my noise and that │ │ │
│ she should give me a little │ │ │
│ more respect, given that I paid │ │ │
│ rent at the time. │ │ │
│ │ │ │
│ She disagreed. I went back to │ │ │
│ my room, rather pissed off at │ │ │
│ the lack of equality. I had no │ │ │
│ lock on my door; but I had a │ │ │
│ dresser right next to it, so I │ │ │
│ pulled one of the drawers out │ │ │
│ enough so that it caused the │ │ │
│ door to not be openable. Then, │ │ │
│ I turned my speakers up really │ │ │
│ loud and blasted Gangnam Style │ │ │
│ on repeat, with the bass │ │ │
│ cranked up as high as it could │ │ │
│ go. │ │ │
│ │ │ │
│ If you hate Gangnam Style for │ │ │
│ being overplayed, you will see │ │ │
│ why I chose that particular │ │ │
│ song. I personally don't mind │ │ │
│ it. But here's the thing about │ │ │
│ my bass; it vibrates the walls, │ │ │
│ making one hell of a lot of │ │ │
│ noise. Needless to say, my mom │ │ │
│ was not pleased and shut off │ │ │
│ the internet. But it was oh so │ │ │
│ worth it. │ │ │
│ │ │ │
│ TL;DR: │ │ │
└─────────────────────────────────┴─────────────────────────────────┴──────────┘
Implementation details
The bulk of RLOOTrainer is based on the PPO implementation, which is based on the paper The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization.
Below is a vectorized advantage calculation for RLOO:
import torch


def test_rloo_reward():
    local_batch_size = 3
    rloo_k = 4
    rlhf_reward = torch.tensor([
        1, 2, 3,   # first rlhf reward for three prompts
        2, 3, 4,   # second rlhf reward for three prompts
        5, 6, 7,   # third rlhf reward for three prompts
        8, 9, 10,  # fourth rlhf reward for three prompts
    ]).float()  # here we have 3 prompts which have 4 completions each

    baseline = (rlhf_reward.sum(0) - rlhf_reward) / (rloo_k - 1)
    # Loop-based implementation: for each completion, the baseline is the mean
    # reward of the other completions for the same prompt
    advantages = torch.zeros_like(rlhf_reward)
    for i in range(0, len(advantages), local_batch_size):
        other_response_rlhf_rewards = []
        for j in range(0, len(advantages), local_batch_size):
            if i != j:
                other_response_rlhf_rewards.append(rlhf_reward[j : j + local_batch_size])
        advantages[i : i + local_batch_size] = rlhf_reward[i : i + local_batch_size] - torch.stack(other_response_rlhf_rewards).mean(0)
    assert (1 - (2 + 5 + 8) / 3 - advantages[0].item()) < 1e-6  # First rlhf reward for the first prompt
    assert (6 - (3 + 2 + 9) / 3 - advantages[7].item()) < 1e-6  # Third rlhf reward for the second prompt

    # Vectorized implementation
    rlhf_reward = rlhf_reward.reshape(rloo_k, local_batch_size)
    baseline = (rlhf_reward.sum(0) - rlhf_reward) / (rloo_k - 1)
    vec_advantages = rlhf_reward - baseline
    torch.testing.assert_close(vec_advantages.flatten(), advantages)
Benchmark experiments
To validate the RLOO implementation, we ran experiments with a 1B parameter model. Here is the command we used to run the experiment. We take the SFT / RM models directly from The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization.
accelerate launch --config_file examples/accelerate_configs/deepspeed_zero2.yaml \
    examples/scripts/rloo/rloo_tldr.py \
    --output_dir models/minimal/rloo_tldr \
    --dataset_name trl-internal-testing/tldr-preference-sft-trl-style \
    --dataset_test_split validation \
    --num_ppo_epochs 2 \
    --num_mini_batches 2 \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 16 \
    --total_episodes 1000000 \
    --model_name_or_path EleutherAI/pythia-1b-deduped \
    --sft_model_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
    --reward_model_path cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
    --local_rollout_forward_batch_size 16 \
    --missing_eos_penalty 1.0 \
    --stop_token eos \
    --kl_coef 0.03
Checkpoints and experiment tracking are available at the following links:
To evaluate, we load the checkpoints with vLLM and use GPT-4o mini as a judge model to judge the generated TL;DRs against the reference TL;DRs. For more information on how to use judges, see Judges.
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 33.00%
$ python examples/scripts/evals/judge_tldr.py --model_name_or_path vwxyzjn/rloo_tldr --judge_model gpt-4o-mini --num_examples 1000
Model win rate: 51.20%
The RLOO checkpoint gets a 51.2% preferred rate vs. the 33.0% preference rate of the SFT checkpoint. This is a good sign that the RLOO training is working as intended.
Metrics
# pip install openrlbenchmark==0.2.1a5
# see https://github.com/openrlbenchmark/openrlbenchmark#get-started for documentation
# to use it, change `?we=huggingface&wpn=trl` to your own project and `?tag=pr-1540` to your own tag
python -m openrlbenchmark.rlops_multi_metrics \
--filters '?we=huggingface&wpn=trl&xaxis=train/episode&ceik=output_dir&cen=sft_model_path&metrics=train/objective/rlhf_reward&metrics=train/objective/scores&metrics=train/objective/kl&metrics=train/objective/non_score_reward&metrics=train/objective/entropy&metrics=train/policy/approxkl_avg&metrics=train/policy/clipfrac_avg&metrics=train/loss/policy_avg&metrics=train/policy/entropy_avg&metrics=train/val/ratio&metrics=train/val/ratio_var&metrics=train/val/num_eos_tokens&metrics=train/lr&metrics=train/eps' \
"cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr?tag=pr-1540" \
--env-ids models/minimal/rloo_tldr \
--pc.ncols 4 \
--pc.ncols-legend 1 \
--pc.xlabel "Episode" \
--output-filename benchmark/trl/pr-1540/rloo \
--scan-history
Reinforce++
The Reinforce++ report by Jian Hu suggests several optimization tricks to enhance the performance and stability of RLHF. They include:
- Clipping rewards: limiting reward values within a specific range to mitigate the impact of extreme rewards on model updates, thus preventing exploding gradients.
- Normalizing rewards: scaling rewards to have zero mean and unit standard deviation, which helps stabilize the training process.
- Normalizing advantages: scaling advantages to have zero mean and unit standard deviation, which likewise helps stabilize training.
- Using the token-level KL penalty defined in Eq. (1) of the report, as opposed to the sequence-level KL penalty (the default).
These options are available via the corresponding arguments of the RLOOConfig class, as shown in the sketch below.
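A minimal sketch of enabling these tricks through RLOOConfig (the output directory and values shown are illustrative; see the RLOOConfig reference below for the exact defaults):

from trl import RLOOConfig

training_args = RLOOConfig(
    output_dir="models/minimal/rloo_reinforce_pp",  # hypothetical output directory
    reward_clip_range=10.0,     # clip range applied to rewards
    normalize_reward=True,      # scale rewards to zero mean and unit standard deviation
    normalize_advantage=True,   # scale advantages to zero mean and unit standard deviation
    token_level_kl=True,        # token-level rather than sequence-level KL penalty
)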
RLOOTrainer
class trl.RLOOTrainer
< source >( config: RLOOConfig processing_class: typing.Union[transformers.tokenization_utils_base.PreTrainedTokenizerBase, transformers.image_processing_utils.BaseImageProcessor, transformers.feature_extraction_utils.FeatureExtractionMixin, transformers.processing_utils.ProcessorMixin, NoneType] policy: Module ref_policy: Module reward_model: typing.Union[torch.nn.modules.module.Module, typing.Callable[[list[str]], list[float]]] train_dataset: Dataset data_collator: typing.Optional[transformers.data.data_collator.DataCollatorWithPadding] = None eval_dataset: typing.Union[datasets.arrow_dataset.Dataset, dict[str, datasets.arrow_dataset.Dataset], NoneType] = None optimizers: tuple = (None, None) callbacks: typing.Optional[list[transformers.trainer_callback.TrainerCallback]] = None )
save_model
Will save the model, so you can reload it using `from_pretrained()`.
Will only save from the main process.
push_to_hub
< source >( commit_message: typing.Optional[str] = 'End of training' blocking: bool = True token: typing.Optional[str] = None revision: typing.Optional[str] = None **kwargs )
Parameters
- commit_message (`str`, optional, defaults to `"End of training"`) — Message to commit while pushing.
- blocking (`bool`, optional, defaults to `True`) — Whether the function should return only when the `git push` has finished.
- token (`str`, optional, defaults to `None`) — Token with write permission to overwrite Trainer's original args.
- revision (`str`, optional) — The git revision to commit from. Defaults to the head of the "main" branch.
- kwargs (`dict[str, Any]`, optional) — Additional keyword arguments passed along to `~Trainer.create_model_card`.
Upload `self.model` and `self.processing_class` to the 🤗 model hub on the repo `self.args.hub_model_id`.
RLOOConfig
class trl.RLOOConfig
< source >( output_dir: typing.Optional[str] = None overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False eval_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: typing.Optional[int] = None per_gpu_eval_batch_size: typing.Optional[int] = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: typing.Optional[int] = None eval_delay: typing.Optional[float] = 0 torch_empty_cache_steps: typing.Optional[int] = None learning_rate: float = 5e-05 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: float = 1e-08 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: typing.Union[transformers.trainer_utils.SchedulerType, str] = 'linear' lr_scheduler_kwargs: typing.Union[dict[str, typing.Any], str, NoneType] = <factory> warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: str = 'passive' log_level_replica: str = 'warning' log_on_each_node: bool = True logging_dir: typing.Optional[str] = None logging_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps' logging_first_step: bool = False logging_steps: float = 10 logging_nan_inf_filter: bool = True save_strategy: typing.Union[transformers.trainer_utils.SaveStrategy, str] = 'steps' save_steps: float = 500 save_total_limit: typing.Optional[int] = None save_safetensors: typing.Optional[bool] = True save_on_each_node: bool = False save_only_model: bool = False restore_callback_states_from_checkpoint: bool = False no_cuda: bool = False use_cpu: bool = False use_mps_device: bool = False seed: int = 42 data_seed: typing.Optional[int] = None jit_mode_eval: bool = False use_ipex: bool = False bf16: typing.Optional[bool] = None fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'auto' bf16_full_eval: bool = False fp16_full_eval: bool = False tf32: typing.Optional[bool] = None local_rank: int = -1 ddp_backend: typing.Optional[str] = None tpu_num_cores: typing.Optional[int] = None tpu_metrics_debug: bool = False debug: typing.Union[str, list[transformers.debug_utils.DebugOption]] = '' dataloader_drop_last: bool = False eval_steps: typing.Optional[float] = None dataloader_num_workers: int = 0 dataloader_prefetch_factor: typing.Optional[int] = None past_index: int = -1 run_name: typing.Optional[str] = None disable_tqdm: typing.Optional[bool] = None remove_unused_columns: typing.Optional[bool] = True label_names: typing.Optional[list[str]] = None load_best_model_at_end: typing.Optional[bool] = False metric_for_best_model: typing.Optional[str] = None greater_is_better: typing.Optional[bool] = None ignore_data_skip: bool = False fsdp: typing.Union[list[transformers.trainer_utils.FSDPOption], str, NoneType] = '' fsdp_min_num_params: int = 0 fsdp_config: typing.Union[dict[str, typing.Any], str, NoneType] = None fsdp_transformer_layer_cls_to_wrap: typing.Optional[str] = None accelerator_config: typing.Union[dict, str, NoneType] = None deepspeed: typing.Union[dict, str, NoneType] = None label_smoothing_factor: float = 0.0 optim: typing.Union[transformers.training_args.OptimizerNames, str] = 'adamw_torch' optim_args: typing.Optional[str] = None adafactor: bool = False group_by_length: bool = False length_column_name: typing.Optional[str] = 'length' report_to: typing.Union[NoneType, str, list[str]] = None 
ddp_find_unused_parameters: typing.Optional[bool] = None ddp_bucket_cap_mb: typing.Optional[int] = None ddp_broadcast_buffers: typing.Optional[bool] = None dataloader_pin_memory: bool = True dataloader_persistent_workers: bool = False skip_memory_metrics: bool = True use_legacy_prediction_loop: bool = False push_to_hub: bool = False resume_from_checkpoint: typing.Optional[str] = None hub_model_id: typing.Optional[str] = None hub_strategy: typing.Union[transformers.trainer_utils.HubStrategy, str] = 'every_save' hub_token: typing.Optional[str] = None hub_private_repo: typing.Optional[bool] = None hub_always_push: bool = False hub_revision: typing.Optional[str] = None gradient_checkpointing: bool = False gradient_checkpointing_kwargs: typing.Union[dict[str, typing.Any], str, NoneType] = None include_inputs_for_metrics: bool = False include_for_metrics: list = <factory> eval_do_concat_batches: bool = True fp16_backend: str = 'auto' push_to_hub_model_id: typing.Optional[str] = None push_to_hub_organization: typing.Optional[str] = None push_to_hub_token: typing.Optional[str] = None mp_parameters: str = '' auto_find_batch_size: bool = False full_determinism: bool = False torchdynamo: typing.Optional[str] = None ray_scope: typing.Optional[str] = 'last' ddp_timeout: int = 1800 torch_compile: bool = False torch_compile_backend: typing.Optional[str] = None torch_compile_mode: typing.Optional[str] = None include_tokens_per_second: typing.Optional[bool] = False include_num_input_tokens_seen: typing.Optional[bool] = False neftune_noise_alpha: typing.Optional[float] = None optim_target_modules: typing.Union[NoneType, str, list[str]] = None batch_eval_metrics: bool = False eval_on_start: bool = False use_liger_kernel: typing.Optional[bool] = False liger_kernel_config: typing.Optional[dict[str, bool]] = None eval_use_gather_object: typing.Optional[bool] = False average_tokens_across_devices: typing.Optional[bool] = True dataset_num_proc: typing.Optional[int] = None num_mini_batches: int = 1 total_episodes: typing.Optional[int] = None local_rollout_forward_batch_size: int = 64 num_sample_generations: int = 10 response_length: int = 53 stop_token: typing.Optional[typing.Literal['eos']] = None stop_token_id: typing.Optional[int] = None temperature: float = 0.7 missing_eos_penalty: typing.Optional[float] = None sft_model_path: str = 'EleutherAI/pythia-160m' world_size: typing.Optional[int] = None num_total_batches: typing.Optional[int] = None micro_batch_size: typing.Optional[int] = None local_batch_size: typing.Optional[int] = None batch_size: typing.Optional[int] = None local_mini_batch_size: typing.Optional[int] = None mini_batch_size: typing.Optional[int] = None exp_name: str = 'rloo_config' reward_model_path: str = 'EleutherAI/pythia-160m' num_ppo_epochs: int = 4 whiten_rewards: bool = False kl_coef: float = 0.05 cliprange: float = 0.2 rloo_k: int = 2 normalize_reward: bool = False reward_clip_range: float = 10.0 normalize_advantage: bool = False token_level_kl: bool = False ds3_gather_for_generation: bool = True )
Parameters
- exp_name (`str`, optional, defaults to `os.path.basename(__file__)[: -len(".py")]`) — Name of this experiment.
- reward_model_path (`str`, optional, defaults to `"EleutherAI/pythia-160m"`) — Path to the reward model.
- num_ppo_epochs (`int`, optional, defaults to `4`) — Number of epochs to train.
- whiten_rewards (`bool`, optional, defaults to `False`) — Whether to whiten the rewards.
- kl_coef (`float`, optional, defaults to `0.05`) — KL coefficient.
- cliprange (`float`, optional, defaults to `0.2`) — Clip range.
- rloo_k (`int`, optional, defaults to `2`) — REINFORCE Leave-One-Out (RLOO) number of online samples per prompt.
- normalize_reward (`bool`, optional, defaults to `False`) — Whether to normalize rewards.
- reward_clip_range (`float`, optional, defaults to `10.0`) — Clip range for rewards.
- normalize_advantage (`bool`, optional, defaults to `False`) — Whether to normalize advantages.
- token_level_kl (`bool`, optional, defaults to `False`) — Whether to use a token-level KL penalty or a sequence-level KL penalty.
- ds3_gather_for_generation (`bool`, optional, defaults to `True`) — This setting applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation, improving generation speed. However, disabling this option allows training models that exceed the VRAM capacity of a single GPU, albeit at the cost of slower generation.
Configuration class for the RLOOTrainer.
This class includes only the parameters that are specific to RLOO training. For a full list of training arguments, please refer to the TrainingArguments and OnPolicyConfig documentation. Note that default values in this class may differ from those in TrainingArguments.
Using HfArgumentParser we can turn this class into argparse arguments that can be specified on the command line.
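For example, a hypothetical training script could expose RLOOConfig on the command line like this (the script name is illustrative):

# rloo_args_demo.py -- hypothetical script name, for illustration only
from transformers import HfArgumentParser
from trl import RLOOConfig

parser = HfArgumentParser(RLOOConfig)
(training_args,) = parser.parse_args_into_dataclasses()
print(training_args.rloo_k, training_args.kl_coef)

# Then, e.g.: python rloo_args_demo.py --output_dir models/minimal/rloo --rloo_k 4 --kl_coef 0.03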