Reward Functions

This module contains a collection of useful reward functions, primarily intended for use with GRPOTrainer.

Format rewards

think_format_reward

trl.rewards.think_format_reward

( completions: list, **kwargs ) → list[float]

Parameters

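completions (list[list[dict[str, str]]]): List of completions to evaluate. Each completion must be a list containing a single message, that is, a dictionary with the key "content" whose value is the text of the completion.
**kwargs: Additional keyword arguments. The function does not use them, but they must appear in the signature so that it stays compatible with trainers such as GRPOTrainer.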

Returns

list[float]

A list of rewards, one per completion: 1.0 if the completion matches the expected format, 0.0 otherwise.

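This reward function checks whether the reasoning process is enclosed in "<think>" and "</think>" tags. It returns a reward of 1.0 if the format is correct and 0.0 otherwise.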

Example

>>> from trl.rewards import think_format_reward

>>> completions = [
...     [{"content": "<think>\nThis is my reasoning.\n</think>\nThis is my answer."}],
...     [{"content": "<think>\nThis is my reasoning.\nThis is my answer."}],
... ]
>>> think_format_reward(completions)
[1.0, 0.0]
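
To make the check concrete, the following is a minimal sketch of how a reward with this signature could be written using a regular expression. It does not depend on TRL. The name sketch_think_format_reward and the exact pattern are assumptions made for illustration and are not taken from the TRL source; the library's actual rules (for example, about text before the opening tag) may differ.

```python
import re

# Assumed pattern (not taken from the TRL source): the text must open with
# "<think>", contain no second "<think>", and close the block with "</think>".
_THINK_PATTERN = re.compile(r"^<think>(?!.*<think>)(.*?)</think>.*$", re.DOTALL)


def sketch_think_format_reward(completions, **kwargs):
    """Return 1.0 for each completion whose content matches the pattern, else 0.0."""
    # Each completion is a one-message list; the text lives under "content".
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if _THINK_PATTERN.match(text) else 0.0 for text in contents]


completions = [
    [{"content": "<think>\nThis is my reasoning.\n</think>\nThis is my answer."}],
    [{"content": "<think>\nThis is my reasoning.\nThis is my answer."}],
]
# Extra keyword arguments (here a hypothetical `prompts`) are accepted and ignored.
rewards = sketch_think_format_reward(completions, prompts=["q1", "q2"])
print(rewards)  # [1.0, 0.0]
```

Accepting **kwargs and ignoring it is what lets a trainer pass additional fields without the function needing to know about them.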
