Judges

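TRL Judges is an experimental API and may change at any time.

TRL provides judges that make it easy to compare two completions.

Make sure the required dependencies are installed by running: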

pip install trl[judges]

Using the provided judges

TRL provides several judges out of the box. For example, you can use HfPairwiseJudge to compare two completions using a pre-trained model from the Hugging Face model hub:

from trl import HfPairwiseJudge

judge = HfPairwiseJudge()
judge.judge(
    prompts=["What is the capital of France?", "What is the biggest planet in the solar system?"],
    completions=[["Paris", "Lyon"], ["Saturn", "Jupiter"]],
)  # Outputs: [0, 1]

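Define your own judge

To define your own judge, we provide several base classes that you can subclass. For rank-based judges, subclass BaseRankJudge and implement the BaseRankJudge.judge() method. For pairwise judges, subclass BasePairwiseJudge and implement the BasePairwiseJudge.judge() method. If you want to define a judge that does not fit into these categories, subclass BaseJudge and implement the BaseJudge.judge() method.

As an example, let's define a pairwise judge that prefers shorter completions: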

from trl import BasePairwiseJudge

class PrefersShorterJudge(BasePairwiseJudge):
    def judge(self, prompts, completions, shuffle_order=False):
        # Return 0 when the first completion is shorter, otherwise 1.
        return [0 if len(completion[0]) < len(completion[1]) else 1 for completion in completions]

You can then use this judge as follows:

judge = PrefersShorterJudge()
judge.judge(
    prompts=["What is the capital of France?", "What is the biggest planet in the solar system?"],
    completions=[["Paris", "The capital of France is Paris."], ["Jupiter is the biggest planet in the solar system.", "Jupiter"]],
)  # Outputs: [0, 1]

Provided judges

PairRMJudge

class trl.PairRMJudge

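( )

LLM judge based on the PairRM model from AllenAI.

This judge uses the PairRM model to rank pairs of completions for given prompts. It is designed for pairwise comparison of language model outputs. The PairRM model is loaded with the llm-blender library and runs on the default Accelerator device.

Attributes:

blender (llm_blender.Blender): An instance of the Blender class from llm-blender.

Example: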

>>> from trl import PairRMJudge
>>> pairrm_judge = PairRMJudge()
>>> prompts = ["Translate 'hello' to French", "What's the capital of Japan?"]
>>> completions = [["Bonjour", "Salut"], ["Kyoto", "Tokyo"]]
>>> results = pairrm_judge.judge(prompts, completions)
>>> print(results)  # [0, 1] (the first completion is preferred for the first prompt, the second completion for the second prompt)

This class requires the llm-blender library to be installed. Install it with: pip install llm-blender.

judge

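( prompts: list, completions: list, shuffle_order: bool = True, return_scores: bool = False, temperature: float = 1.0 ) → Union[list[int, float]]

Parameters

prompts (list[str]): List of prompts to judge.
completions (list[list[str]]): List of pairs of completions, one pair per prompt.
shuffle_order (bool, optional, defaults to True): Whether to shuffle the order of the completions to avoid positional bias.
return_scores (bool, optional, defaults to False): If True, return the probability score of the first completion instead of the rank (i.e. a *soft-judge*).
temperature (float, optional, defaults to 1.0): Temperature used to scale the logits when return_scores is True.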

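Returns

Union[list[int, float]]

If `return_scores` is `False`, returns a list of ranks (`0` or `1`), one per prompt, indicating which completion is preferred. If `return_scores` is `True`, returns the softmax probability of the first completion for each prompt.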
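To make the effect of `temperature` on the returned score concrete, here is a minimal sketch of a temperature-scaled softmax over two scores. The function name and the raw score values are hypothetical, and the sketch assumes the soft score is computed this way; the actual implementation may differ in its details.

```python
import math


def first_completion_probability(score_first, score_second, temperature=1.0):
    """Temperature-scaled softmax over two scores; returns P(first completion)."""
    scaled = [score_first / temperature, score_second / temperature]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    return exps[0] / sum(exps)


# Hypothetical raw scores: the first completion scores higher than the second.
sharp = first_completion_probability(2.0, 0.0, temperature=1.0)
smooth = first_completion_probability(2.0, 0.0, temperature=4.0)
print(sharp, smooth)
```

A higher temperature pulls the probability toward 0.5, while a lower one pushes it toward 0 or 1.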

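Raises

ValueError: If the number of completions per prompt is not exactly 2.

Judge pairs of completions for the given prompts using the PairRM model.

Note: Unlike llm-blender, ranks are 0-indexed (`0` means the first completion is preferred).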
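The `shuffle_order` option can be pictured with the following sketch: each pair is randomly flipped before being shown to the comparison function, and the answer is mapped back to the original order afterwards. The helper names `judge_with_shuffle` and `prefer_shorter` are hypothetical and stand in for a real model call; this illustrates the idea and is not the library's code.

```python
import random


def judge_with_shuffle(pairs, pick_better, rng=None):
    """Randomly flip each pair, judge it, then map the result back."""
    if rng is None:
        rng = random.Random()
    results = []
    for first, second in pairs:
        flip = rng.random() < 0.5
        shown = (second, first) if flip else (first, second)
        choice = pick_better(*shown)  # 0 or 1, relative to the shown order
        # Undo the flip so the index refers to the caller's original order.
        results.append(1 - choice if flip else choice)
    return results


def prefer_shorter(a, b):
    """Toy comparison (hypothetical): prefer the shorter string."""
    return 0 if len(a) <= len(b) else 1


pairs = [
    ("Paris", "The capital of France is Paris."),
    ("Jupiter is the biggest planet in the solar system.", "Jupiter"),
]
print(judge_with_shuffle(pairs, prefer_shorter, random.Random(0)))
```

Because the flip is undone before returning, a comparison with no positional bias gives the same result whatever the random flips were.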

HfPairwiseJudge

class trl.HfPairwiseJudge

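( model = 'meta-llama/Meta-Llama-3-70B-Instruct', token: typing.Optional[str] = None, system_prompt: typing.Optional[str] = None )

Parameters

model (str, optional, defaults to "meta-llama/Meta-Llama-3-70B-Instruct"): Model to use for the judge.
token (str, optional): Hugging Face API token to use for the huggingface_hub.InferenceClient.
system_prompt (str or None, optional, defaults to None): The system prompt to be used for the judge. If not provided, a default prompt is used. Note that the system prompt should contain the following placeholders: {prompt}, {response0}, and {response1}. Also, inference is called with max_tokens=1, so the system prompt should ask for a single-token response.

Pairwise judge based on the Hugging Face API with chat completion.

This judge is suited to evaluating the quality of chat models, where the completion is a response to a given prompt.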
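The placeholder requirement for `system_prompt` can be checked before a custom prompt is handed to a judge. The sketch below uses a made-up prompt wording and a hypothetical helper, `missing_placeholders`. It also assumes the placeholders are filled with `str.format`, which is an assumption about the internals rather than documented behaviour.

```python
# Hypothetical custom prompt: the wording is illustrative only.
CUSTOM_SYSTEM_PROMPT = (
    "You are comparing two answers to the same question.\n"
    "Question: {prompt}\n"
    "Answer 0: {response0}\n"
    "Answer 1: {response1}\n"
    "Reply with a single character, 0 or 1, naming the better answer."
)

REQUIRED_PLACEHOLDERS = ("{prompt}", "{response0}", "{response1}")


def missing_placeholders(template):
    """Return the required placeholders that do not appear in the template."""
    return [name for name in REQUIRED_PLACEHOLDERS if name not in template]


# Fill the template the way a judge might (assumed to use str.format).
filled = CUSTOM_SYSTEM_PROMPT.format(
    prompt="What is the capital of France?",
    response0="Paris",
    response1="Lyon",
)
print(missing_placeholders(CUSTOM_SYSTEM_PROMPT))
print(filled)
```

A prompt that passes this check can then be supplied through the `system_prompt` argument. Asking for a single character keeps the reply within the one-token limit.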

OpenAIPairwiseJudge

class trl.OpenAIPairwiseJudge

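( model = 'gpt-4-turbo-preview', system_prompt: typing.Optional[str] = None, max_requests: typing.Optional[int] = 1000 )

Parameters

model (str, optional, defaults to "gpt-4-turbo-preview"): Model to use for the judge.
system_prompt (str or None, optional, defaults to None): The system prompt to be used for the judge. If not provided, a default prompt is used. Note that the system prompt should contain the following placeholders: {prompt}, {response0}, and {response1}. Also, inference is called with max_tokens=1, so the system prompt should ask for a single-token response.
max_requests (int or None, optional, defaults to 1000): Maximum number of requests to make to the OpenAI API. If set to None, there is no limit.

Judge based on the OpenAI API.

This judge is suited to evaluating the quality of chat models, where the completion is a response to a given prompt.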

AllTrueJudge

class trl.AllTrueJudge

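( judges: list )

Parameters

judges (list[BaseBinaryJudge]): A list of BaseBinaryJudge instances whose decisions will be unified.

Unify the decisions of multiple BaseBinaryJudge instances.

Returns `1` only if all inner binary judges return `1`. If any judge returns `0`, it returns `0`. If any judge returns `-1`, indicating a failure in its process, this judge also returns `-1`.

Implements the Mixture of Judges as described in the CGPO paper.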
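The aggregation rule described above can be sketched as a small standalone function. The name `unify_binary_decisions` is hypothetical. The sketch also assumes that a failure (`-1`) takes precedence over a `0` from another judge, a case the description does not settle.

```python
def unify_binary_decisions(per_judge_labels):
    """Combine labels from several binary judges, one list of labels per judge.

    Every inner list must hold one label (1, 0, or -1) per prompt.
    """
    unified = []
    for labels in zip(*per_judge_labels):
        if -1 in labels:
            unified.append(-1)  # assumed: a failed judge invalidates the item
        elif all(label == 1 for label in labels):
            unified.append(1)  # every judge accepted the completion
        else:
            unified.append(0)  # at least one judge rejected it
    return unified


# Two judges, four prompts.
print(unify_binary_decisions([[1, 1, 0, 1], [1, 0, 0, -1]]))
```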

Base classes

BaseJudge

class trl.BaseJudge

( )

Base class for judges. Subclasses of this class should implement the judge method.

BaseBinaryJudge

class trl.BaseBinaryJudge

( )

Base class for binary judges.

judge

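( prompts: list, completions: list, gold_completions: typing.Optional[list[str]] = None, shuffle_order: bool = True ) → list[int]

Parameters

prompts (list[str]): List of prompts.
completions (list[str]): List of completions.
gold_completions (list[str], optional): List of gold completions, if they exist.
shuffle_order (bool): Whether to shuffle the order of the completions to avoid positional bias.

Returns

list[int]

A list of binary labels:

1 indicates that the completion satisfies the evaluated constraint.
0 indicates that the completion does not satisfy the evaluated constraint.

Judge the completion for a given prompt. Used to assess whether a completion satisfies a constraint.

This base class should be used to implement binary evaluations as described in section 4.1.4 of the CGPO paper. It is suited to assessing whether a prompt-completion pair satisfies a specific constraint.

Note: If the judge returns -1 for any prompt, it indicates that the inner process used to compute the preference has failed. For instance, this can happen if the underlying language model or rule-based constraint returned an invalid answer. In such cases, the caller should handle these invalid indices appropriately, possibly by implementing fallback logic or error handling.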
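As an illustration of the binary-judge contract and of handling `-1` results, here is a sketch of a rule-based judge together with a helper that separates valid labels from failures. `MaxWordsJudge` and `split_valid` are hypothetical names. The class is written as a plain Python class so that the example runs without TRL installed; in real use it would subclass BaseBinaryJudge.

```python
class MaxWordsJudge:
    """Hypothetical rule-based binary judge (would subclass BaseBinaryJudge)."""

    def __init__(self, max_words=5):
        self.max_words = max_words

    def judge(self, prompts, completions, gold_completions=None, shuffle_order=True):
        labels = []
        for completion in completions:
            if not isinstance(completion, str):
                labels.append(-1)  # the rule could not be evaluated
            elif len(completion.split()) <= self.max_words:
                labels.append(1)  # constraint satisfied
            else:
                labels.append(0)  # constraint violated
        return labels


def split_valid(labels):
    """Return (indices with a valid label, indices where the judge failed)."""
    valid = [i for i, label in enumerate(labels) if label != -1]
    failed = [i for i, label in enumerate(labels) if label == -1]
    return valid, failed


labels = MaxWordsJudge(max_words=3).judge(
    prompts=["Q1", "Q2", "Q3"],
    completions=["Paris", "The capital of France is Paris.", None],
)
valid, failed = split_valid(labels)
print(labels, valid, failed)
```

The caller can then retry, skip, or assign a default label to the indices in `failed`, depending on the application.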

BaseRankJudge

class trl.BaseRankJudge

( )

Base class for LLM ranking judges.

Example:

from trl import BaseRankJudge


class MyRankJudge(BaseRankJudge):
    def judge(self, prompts, completions, shuffle_order=True):
        return ...  # Your ranking logic here


judge = MyRankJudge()
judge.judge(
    prompts=["The capital of France is", "The capital of Germany is"],
    completions=[[" Paris", " Marseille", " Lyon"], [" Munich", " Berlin"]],
)  # With ranking logic implemented, this could return [[0, 1, 2], [1, 0]]
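One possible body for the ranking logic is sketched below as a standalone function, so it runs without TRL. It assumes that each inner list holds completion indices ordered from best to worst. The shortest-first criterion and the name `rank_by_length` are purely illustrative.

```python
def rank_by_length(completions):
    """Rank each group of completions, shortest first.

    Returns, for every prompt, the completion indices ordered from best to
    worst (assumed output convention). Ties keep their original order.
    """
    return [
        sorted(range(len(group)), key=lambda index: len(group[index]))
        for group in completions
    ]


ranks = rank_by_length(
    [
        ["Paris is the capital.", "Paris", "It is Paris"],
        ["Berlin", "Munich, maybe"],
    ]
)
print(ranks)
```

Inside a BaseRankJudge subclass, the `judge` method could return the result of such a function directly.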