Dataset formats and types

This guide provides an overview of the dataset formats and types supported by each trainer in TRL.

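Overview of the dataset formats and types

The *format* of a dataset refers to how the data is structured, typically categorized as either *standard* or *conversational*.
The *type* is associated with the specific task the dataset is designed for, such as *prompt-only* or *preference*. Each type is characterized by its columns, which vary according to the task, as shown in the table below.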

Type \ Format	Standard	Conversational
Language modeling	`{"text": "The sky is blue."}`	`{"messages": [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is blue."}]}`
Prompt-only	`{"prompt": "The sky is"}`	`{"prompt": [{"role": "user", "content": "What color is the sky?"}]}`
Prompt-completion	`{"prompt": "The sky is", "completion": " blue."}`	`{"prompt": [{"role": "user", "content": "What color is the sky?"}], "completion": [{"role": "assistant", "content": "It is blue."}]}`
Preference	`{"prompt": "The sky is", "chosen": " blue.", "rejected": " green."}` or, with implicit prompt: `{"chosen": "The sky is blue.", "rejected": "The sky is green."}`	`{"prompt": [{"role": "user", "content": "What color is the sky?"}], "chosen": [{"role": "assistant", "content": "It is blue."}], "rejected": [{"role": "assistant", "content": "It is green."}]}` or, with implicit prompt: `{"chosen": [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is blue."}], "rejected": [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is green."}]}`
Unpaired preference	`{"prompt": "The sky is", "completion": " blue.", "label": True}`	`{"prompt": [{"role": "user", "content": "What color is the sky?"}], "completion": [{"role": "assistant", "content": "It is green."}], "label": False}`
Stepwise supervision	`{"prompt": "Which number is larger, 9.8 or 9.11?", "completions": ["The fractional part of 9.8 is 0.8.", "The fractional part of 9.11 is 0.11.", "0.11 is greater than 0.8.", "Hence, 9.11 > 9.8."], "labels": [True, True, False, False]}`
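The table suggests that a dataset's type can usually be read off its column names, and its format off whether a column holds a list of messages. The following is a minimal sketch of that idea; `infer_type_and_format` is a hypothetical helper, not a TRL function, and it assumes exactly the column names shown in the table.

```python
def is_conversational_value(value):
    """True if the value looks like a list of chat messages (dicts with a "role" key)."""
    return isinstance(value, list) and len(value) > 0 and all(
        isinstance(message, dict) and "role" in message for message in value
    )


def infer_type_and_format(example):
    """Guess the (type, format) of a single example from its keys, following the table."""
    keys = set(example)
    # Order matters: more specific column sets are checked first.
    if {"completions", "labels"} <= keys:
        dataset_type = "stepwise supervision"
    elif {"chosen", "rejected"} <= keys:
        dataset_type = "preference" if "prompt" in keys else "preference (implicit prompt)"
    elif {"completion", "label"} <= keys:
        dataset_type = "unpaired preference"
    elif {"prompt", "completion"} <= keys:
        dataset_type = "prompt-completion"
    elif "prompt" in keys:
        dataset_type = "prompt-only"
    elif keys & {"text", "messages"}:
        dataset_type = "language modeling"
    else:
        raise ValueError(f"Unrecognized columns: {sorted(keys)}")

    # The format is conversational if any column holds a list of messages.
    conversational = any(is_conversational_value(v) for v in example.values())
    return dataset_type, "conversational" if conversational else "standard"


print(infer_type_and_format({"prompt": "The sky is", "completion": " blue."}))
# ('prompt-completion', 'standard')
```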

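Formats

Standard

The standard dataset format typically consists of plain text strings. The columns in the dataset vary depending on the task. This is the format expected by TRL trainers. Below are examples of standard dataset formats for different tasks: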

# Language modeling
language_modeling_example = {"text": "The sky is blue."}
# Preference
preference_example = {"prompt": "The sky is", "chosen": " blue.", "rejected": " green."}
# Unpaired preference
unpaired_preference_example = {"prompt": "The sky is", "completion": " blue.", "label": True}

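Conversational

Conversational datasets are used for tasks involving dialogues or chat interactions between users and assistants. Unlike standard dataset formats, these contain sequences of messages where each message has a `role` (e.g., `"user"` or `"assistant"`) and `content` (the message text).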

messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

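As with standard datasets, the columns in conversational datasets vary depending on the task. Below are examples of conversational dataset formats for different tasks: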

# Prompt-completion
prompt_completion_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}],
                             "completion": [{"role": "assistant", "content": "It is blue."}]}
# Preference
preference_example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "chosen": [{"role": "assistant", "content": "It is blue."}],
    "rejected": [{"role": "assistant", "content": "It is green."}],
}

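Conversational datasets are useful for training chat models, but must be converted into a standard format before being used with TRL trainers. This is typically done using chat templates specific to the model being used. For more information, refer to the Working with conversational datasets in TRL section.

Tool calling

Some chat templates support *tool calling*, which allows the model to interact with external functions, referred to as **tools**, during generation. This extends the conversational capabilities of the model by enabling it to output a `"tool_calls"` field instead of a standard `"content"` message whenever it decides to invoke a tool.

After the assistant initiates a tool call, the tool executes and returns its output. The assistant can then process this output and continue the conversation accordingly.

Here's a simple example of a tool-calling interaction: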

messages = [
    {"role": "user", "content": "Turn on the living room lights."},
    {"role": "assistant", "tool_calls": [
        {"type": "function", "function": {
            "name": "control_light",
            "arguments": {"room": "living room", "state": "on"}
        }}]
    },
    {"role": "tool", "name": "control_light", "content": "The lights in the living room are now on."},
    {"role": "assistant", "content": "Done!"}
]
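The execution step described above can be sketched as follows. The `TOOLS` registry and the `run_tool_calls` helper are hypothetical: in practice, your own application code runs the tool, not TRL or the chat template. The sketch assumes `arguments` is a dict, as in the example above.

```python
def control_light(room: str, state: str) -> str:
    """Toy tool matching the example above."""
    return f"The lights in {room} are now {state}."


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"control_light": control_light}


def run_tool_calls(messages):
    """Execute each tool call in the last assistant message and append the results
    as "tool" messages, so the assistant can continue the conversation."""
    new_messages = list(messages)
    for call in messages[-1].get("tool_calls", []):
        function = call["function"]
        output = TOOLS[function["name"]](**function["arguments"])
        new_messages.append({"role": "tool", "name": function["name"], "content": output})
    return new_messages


messages = [
    {"role": "user", "content": "Turn on the living room lights."},
    {"role": "assistant", "tool_calls": [
        {"type": "function", "function": {
            "name": "control_light",
            "arguments": {"room": "living room", "state": "on"},
        }}
    ]},
]
messages = run_tool_calls(messages)
print(messages[-1])
# {'role': 'tool', 'name': 'control_light', 'content': 'The lights in living room are now on.'}
```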

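When preparing datasets for Supervised Fine-Tuning (SFT) with tool calling, it is important that your dataset includes an additional column named `tools`. This column contains the list of tools available to the model, which is usually used by the chat template to construct the system prompt.

The tools must be specified in a codified JSON schema format. You can automatically generate this schema from Python function signatures using the `get_json_schema` utility: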

from transformers.utils import get_json_schema

def control_light(room: str, state: str) -> str:
    """
    Controls the lights in a room.

    Args:
        room: The name of the room.
        state: The desired state of the light ("on" or "off").

    Returns:
        str: A message indicating the new state of the lights.
    """
    return f"The lights in {room} are now {state}."

# Generate JSON schema
json_schema = get_json_schema(control_light)

The generated schema would look like this:

{
    "type": "function",
    "function": {
        "name": "control_light",
        "description": "Controls the lights in a room.",
        "parameters": {
            "type": "object",
            "properties": {
                "room": {"type": "string", "description": "The name of the room."},
                "state": {"type": "string", "description": 'The desired state of the light ("on" or "off").'},
            },
            "required": ["room", "state"],
        },
        "return": {"type": "string", "description": "str: A message indicating the new state of the lights."},
    },
}

A complete dataset entry for SFT might look like this:

{"messages": messages, "tools": [json_schema]}
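Because the chat template builds the system prompt from the `tools` column, every tool the assistant calls should appear there. A minimal consistency check could look like this; `check_tool_calls` is our own hypothetical helper (not part of TRL or `transformers`), and it assumes the schema layout shown above, with the tool name under `function.name` and required arguments under `function.parameters.required`.

```python
def check_tool_calls(entry):
    """Return a list of problems found in one SFT entry {"messages": ..., "tools": ...}."""
    schemas = {tool["function"]["name"]: tool["function"] for tool in entry.get("tools", [])}
    problems = []
    for message in entry["messages"]:
        for call in message.get("tool_calls", []):
            name = call["function"]["name"]
            if name not in schemas:
                problems.append(f"unknown tool: {name}")
                continue
            # Every argument the schema marks as required must be present in the call.
            required = set(schemas[name]["parameters"].get("required", []))
            missing = required - set(call["function"]["arguments"])
            if missing:
                problems.append(f"{name}: missing arguments {sorted(missing)}")
    return problems


json_schema = {
    "type": "function",
    "function": {
        "name": "control_light",
        "description": "Controls the lights in a room.",
        "parameters": {
            "type": "object",
            "properties": {"room": {"type": "string"}, "state": {"type": "string"}},
            "required": ["room", "state"],
        },
    },
}
messages = [
    {"role": "user", "content": "Turn on the living room lights."},
    {"role": "assistant", "tool_calls": [
        {"type": "function", "function": {"name": "control_light", "arguments": {"room": "living room", "state": "on"}}}
    ]},
    {"role": "tool", "name": "control_light", "content": "The lights in the living room are now on."},
    {"role": "assistant", "content": "Done!"},
]
print(check_tool_calls({"messages": messages, "tools": [json_schema]}))  # []
```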

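For more detailed information on tool calling, refer to the Tool Calling section in the `transformers` documentation and the blog post Tool Use, Unified.

Harmony

The Harmony response format was introduced with the OpenAI GPT OSS models. It extends the conversational format by adding richer structure for reasoning, function calls, and metadata about the model's behavior. Key features include:

- **Developer role**: provides high-level instructions (similar to a system prompt) and lists the available tools.
- **Channels**: separate different types of assistant output into distinct streams:
  - `analysis`: for internal reasoning, taken from the key `"thinking"`
  - `final`: for the user-facing answer, taken from the key `"content"`
  - `commentary`: for tool calls or meta notes
- **Reasoning effort**: signals how much thinking the model should show (e.g., `"low"`, `"medium"`, `"high"`).
- **Model identity**: explicitly defines the assistant's persona.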

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

messages = [
    {"role": "developer", "content": "Use a friendly tone."},
    {"role": "user", "content": "What is the meaning of life?"},
    {"role": "assistant", "thinking": "Deep reflection...", "content": "The final answer is..."},
]

print(
    tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        reasoning_effort="low",
        model_identity="You are HuggingGPT, a large language model trained by Hugging Face."
    )
)

This produces:

<|start|>system<|message|>You are HuggingGPT, a large language model trained by Hugging Face.
Knowledge cutoff: 2024-06
Current date: 2025-08-03

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>developer<|message|># Instructions

Use a friendly tone.<|end|><|start|>user<|message|>What is the meaning of life?<|end|><|start|>assistant<|channel|>analysis<|message|>Deep reflection...<|end|><|start|>assistant<|channel|>final<|message|>The final answer is...<|return|>
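To make the channel mapping concrete, the following hypothetical sketch reproduces only the assistant part of the output above: `"thinking"` becomes an `analysis` message and `"content"` becomes the `final` message, which ends with `<|return|>`. It is a simplified imitation of the rendered string, not the real GPT OSS chat template; it ignores the system and developer headers, tools, and the `commentary` channel.

```python
def render_assistant_harmony(message):
    """Simplified rendering of one final assistant turn into Harmony-style channels."""
    parts = []
    if message.get("thinking"):
        # Internal reasoning goes to the "analysis" channel.
        parts.append(
            f"<|start|>assistant<|channel|>analysis<|message|>{message['thinking']}<|end|>"
        )
    # The user-facing answer goes to the "final" channel; in the output above,
    # this last assistant message ends with <|return|> rather than <|end|>.
    parts.append(
        f"<|start|>assistant<|channel|>final<|message|>{message['content']}<|return|>"
    )
    return "".join(parts)


message = {"role": "assistant", "thinking": "Deep reflection...", "content": "The final answer is..."}
print(render_assistant_harmony(message))
```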

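For full details on message structure, supported fields, and advanced usage, see the Harmony documentation.

Types

Language modeling

A language modeling dataset consists of a column `"text"` (or `"messages"` for conversational datasets) containing a full sequence of text.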

# Standard format
language_modeling_example = {"text": "The sky is blue."}
# Conversational format
language_modeling_example = {"messages": [
    {"role": "user", "content": "What color is the sky?"},
    {"role": "assistant", "content": "It is blue."}
]}

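Prompt-only

In a prompt-only dataset, only the initial prompt (the question or partial sentence) is provided under the key `"prompt"`. Training typically involves generating a completion based on this prompt, with the model learning to continue or complete the given input.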

# Standard format
prompt_only_example = {"prompt": "The sky is"}
# Conversational format
prompt_only_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}

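For examples of prompt-only datasets, refer to the Prompt-only datasets collection.

While the prompt-only and language modeling types are similar, they differ in how the input is handled. In the prompt-only type, the prompt represents a partial input that the model is expected to complete or continue, while in the language modeling type, the input is treated as a complete sentence or sequence. TRL processes these two types differently. Below is an example showing the difference in the output of the `apply_chat_template` function for each type: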

from transformers import AutoTokenizer
from trl import apply_chat_template

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

# Example for prompt-only type
prompt_only_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
apply_chat_template(prompt_only_example, tokenizer)
# Output: {'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n'}

# Example for language modeling type
lm_example = {"messages": [{"role": "user", "content": "What color is the sky?"}]}
apply_chat_template(lm_example, tokenizer)
# Output: {'text': '<|user|>\nWhat color is the sky?<|end|>\n<|endoftext|>'}

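The prompt-only output includes a `'<|assistant|>\n'`, indicating the beginning of the assistant's turn and expecting the model to generate a completion.
In contrast, the language modeling output treats the input as a complete sequence and terminates it with `'<|endoftext|>'`, signaling the end of the text and not expecting any additional content.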
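The difference comes down to whether a generation prompt is appended after the last message. The sketch below uses `render`, a hypothetical hand-written template that only imitates the Phi-3 strings shown above; the `add_generation_prompt` flag is our own parameter here, and the real template lives in the tokenizer.

```python
def render(messages, add_generation_prompt):
    """Toy imitation of the Phi-3 chat format shown above."""
    text = "".join(f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages)
    if add_generation_prompt:
        # Prompt-only: open the assistant turn so the model continues from here.
        return text + "<|assistant|>\n"
    # Language modeling: the sequence is complete, so close it.
    return text + "<|endoftext|>"


messages = [{"role": "user", "content": "What color is the sky?"}]
print(repr(render(messages, add_generation_prompt=True)))
print(repr(render(messages, add_generation_prompt=False)))
```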

Prompt-completion

A prompt-completion dataset includes a `"prompt"` and a `"completion"`.

# Standard format
prompt_completion_example = {"prompt": "The sky is", "completion": " blue."}
# Conversational format
prompt_completion_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}],
                             "completion": [{"role": "assistant", "content": "It is blue."}]}

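For examples of prompt-completion datasets, refer to the Prompt-completion datasets collection.

Preference

A preference dataset is used for tasks where the model is trained to choose between two or more possible completions. This dataset includes a `"prompt"`, a `"chosen"` completion, and a `"rejected"` completion. The model is trained to select the `"chosen"` response over the `"rejected"` response. Some datasets may not include the `"prompt"` column, in which case the prompt is implicit and directly included in the `"chosen"` and `"rejected"` completions. We recommend using explicit prompts whenever possible.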

# Standard format
## Explicit prompt (recommended)
preference_example = {"prompt": "The sky is", "chosen": " blue.", "rejected": " green."}
## Implicit prompt
preference_example = {"chosen": "The sky is blue.", "rejected": "The sky is green."}

# Conversational format
## Explicit prompt (recommended)
preference_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}],
                      "chosen": [{"role": "assistant", "content": "It is blue."}],
                      "rejected": [{"role": "assistant", "content": "It is green."}]}
## Implicit prompt
preference_example = {"chosen": [{"role": "user", "content": "What color is the sky?"},
                                 {"role": "assistant", "content": "It is blue."}],
                      "rejected": [{"role": "user", "content": "What color is the sky?"},
                                   {"role": "assistant", "content": "It is green."}]}

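For examples of preference datasets, refer to the Preference datasets collection.

Some preference datasets can be found with the tag `dpo` on the Hugging Face Hub. You can also explore the librarian-bots' DPO Collections to identify preference datasets.

Unpaired preference

An unpaired preference dataset is similar to a preference dataset, but instead of having `"chosen"` and `"rejected"` completions for the same prompt, it includes a single `"completion"` and a `"label"` indicating whether the completion is preferred.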

# Standard format
unpaired_preference_example = {"prompt": "The sky is", "completion": " blue.", "label": True}
# Conversational format
unpaired_preference_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}],
                               "completion": [{"role": "assistant", "content": "It is blue."}],
                               "label": True}

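For examples of unpaired preference datasets, refer to the Unpaired preference datasets collection.

Stepwise supervision

A stepwise (or process) supervision dataset is similar to an unpaired preference dataset but includes multiple steps of completions, each with its own label. This structure is useful for tasks that need detailed, step-by-step labeling, such as reasoning tasks. By evaluating each step separately and providing targeted labels, this approach helps identify precisely where the reasoning is correct and where errors occur, allowing for targeted feedback on each part of the reasoning process.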

stepwise_example = {
    "prompt": "Which number is larger, 9.8 or 9.11?",
    "completions": ["The fractional part of 9.8 is 0.8, while the fractional part of 9.11 is 0.11.", "Since 0.11 is greater than 0.8, the number 9.11 is larger than 9.8."],
    "labels": [True, False]
}
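Since each step carries its own label, locating where the reasoning first goes wrong is a simple scan. The helper below, `first_error_step`, is hypothetical (not part of TRL); it returns the index of the first step labeled `False`, or `None` if every step is labeled correct.

```python
def first_error_step(example):
    """Index of the first incorrect step in a stepwise supervision example, or None."""
    for index, label in enumerate(example["labels"]):
        if not label:
            return index
    return None


stepwise_example = {
    "prompt": "Which number is larger, 9.8 or 9.11?",
    "completions": [
        "The fractional part of 9.8 is 0.8, while the fractional part of 9.11 is 0.11.",
        "Since 0.11 is greater than 0.8, the number 9.11 is larger than 9.8.",
    ],
    "labels": [True, False],
}
step = first_error_step(stepwise_example)
print(step, stepwise_example["completions"][step])
```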

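For examples of stepwise supervision datasets, refer to the Stepwise supervision datasets collection.

Which dataset type to use?

Choosing the right dataset type depends on the task you are working on and the specific requirements of the TRL trainer you are using. Below is a brief overview of the dataset types supported by each TRL trainer.

Trainer	Expected dataset type
BCOTrainer	Unpaired preference
CPOTrainer	Preference (explicit prompt recommended)
DPOTrainer	Preference (explicit prompt recommended)
GKDTrainer	Prompt-completion
GRPOTrainer	Prompt-only
IterativeSFTTrainer	Unpaired preference
KTOTrainer	Unpaired preference or Preference (explicit prompt recommended)
NashMDTrainer	Prompt-only
OnlineDPOTrainer	Prompt-only
ORPOTrainer	Preference (explicit prompt recommended)
PPOTrainer	Tokenized language modeling
PRMTrainer	Stepwise supervision
RewardTrainer	Preference (implicit prompt recommended)
SFTTrainer	Language modeling or Prompt-completion
XPOTrainer	Prompt-only

For now, TRL trainers only support standard dataset formats. If you have a conversational dataset, you must first convert it into a standard format. For more information on how to work with conversational datasets, refer to the Working with conversational datasets in TRL section.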

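Working with conversational datasets in TRL

Conversational datasets are increasingly common, especially for training chat models. However, some TRL trainers don't support conversational datasets in their raw format. (For more information, see issue #2071.) These datasets must first be converted into a standard format. Fortunately, TRL offers tools to easily handle this conversion, which are detailed below.

Converting a conversational dataset into a standard dataset

To convert a conversational dataset into a standard dataset, you need to *apply a chat template* to the dataset. A chat template is a predefined structure that typically includes placeholders for user and assistant messages. This template is provided by the tokenizer of the model you use.

For detailed instructions on using chat templating, refer to the Chat templating section in the `transformers` documentation.

In TRL, the method you use to convert the dataset varies depending on the task. Fortunately, TRL provides a helper function called apply_chat_template() to simplify this process. Here's an example of how to use it: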

from transformers import AutoTokenizer
from trl import apply_chat_template

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is blue."}]
}

apply_chat_template(example, tokenizer)
# Output:
# {'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n', 'completion': 'It is blue.<|end|>\n<|endoftext|>'}

Alternatively, you can use the map method to apply the template across an entire dataset:

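from datasets import Dataset
from transformers import AutoTokenizer
from trl import apply_chat_template

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
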
dataset_dict = {
    "prompt": [[{"role": "user", "content": "What color is the sky?"}],
               [{"role": "user", "content": "Where is the sun?"}]],
    "completion": [[{"role": "assistant", "content": "It is blue."}],
                   [{"role": "assistant", "content": "In the sky."}]]
}

dataset = Dataset.from_dict(dataset_dict)
dataset = dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer})
# Output:
# {'prompt': ['<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n',
#             '<|user|>\nWhere is the sun?<|end|>\n<|assistant|>\n'],
#  'completion': ['It is blue.<|end|>\n<|endoftext|>', 'In the sky.<|end|>\n<|endoftext|>']}

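We recommend using the apply_chat_template() function instead of calling `tokenizer.apply_chat_template` directly. Handling chat templates for non-language modeling datasets can be tricky and may result in errors, such as mistakenly placing a system prompt in the middle of a conversation. For additional examples, see #1930 (comment). The apply_chat_template() function is designed to handle these intricacies and ensure that chat templates are applied correctly for the various tasks.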
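To see why prompt-completion examples need this care, note that the completion string has to be exactly the part of the fully rendered conversation that follows the rendered prompt. The sketch below illustrates that idea with `render`, a hypothetical toy template imitating the Phi-3 strings shown earlier; `split_prompt_completion` is our own helper, and we only assume that apply_chat_template() works along similar lines.

```python
def render(messages, add_generation_prompt=False):
    """Toy imitation of the Phi-3 chat format shown above."""
    text = "".join(f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages)
    return text + ("<|assistant|>\n" if add_generation_prompt else "<|endoftext|>")


def split_prompt_completion(example):
    """Render the prompt and the full conversation, then keep the suffix as the completion."""
    prompt = render(example["prompt"], add_generation_prompt=True)
    full = render(example["prompt"] + example["completion"])
    # The rendered prompt must be a prefix of the full conversation; a template that
    # inserts text in the middle of the conversation would break this assumption.
    if not full.startswith(prompt):
        raise ValueError("template is not prefix-consistent")
    return {"prompt": prompt, "completion": full[len(prompt):]}


example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "completion": [{"role": "assistant", "content": "It is blue."}],
}
print(split_prompt_completion(example))
```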

It is important to note that chat templates are model-specific. For example, if you use the chat template from Qwen/Qwen2-0.5B-Instruct with the example above, you get a different output:

apply_chat_template(example, AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct"))
# Output:
# {'prompt': '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat color is the sky?<|im_end|>\n<|im_start|>assistant\n',
#  'completion': 'It is blue.<|im_end|>\n'}

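Always use the chat template associated with the model you're working with. Using the wrong template can lead to inaccurate or unexpected results.

Using any dataset with TRL: preprocessing and conversion

Many datasets come in formats tailored to specific tasks, which might not be directly compatible with TRL. To use such datasets with TRL, you may need to preprocess them and convert them into the required format.

To make this easier, we provide a set of example scripts that cover common dataset conversions.

Example: UltraFeedback dataset

Let's take the UltraFeedback dataset as an example. Here's a preview of the dataset:

As shown above, the dataset format does not match the expected structure. It's not in a conversational format, the column names differ, and the results pertain to different models (e.g., Bard, GPT-4) and aspects (e.g., "helpfulness", "honesty").

By using the provided conversion script `examples/datasets/ultrafeedback.py`, you can transform this dataset into an unpaired preference type and push it to the Hub: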

python examples/datasets/ultrafeedback.py --push_to_hub --repo_id trl-lib/ultrafeedback-gpt-3.5-turbo-helpfulness

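Once converted, the dataset will look like this:

Now, you can use this dataset with TRL!

By adapting the provided scripts or creating your own, you can convert any dataset into a format compatible with TRL.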

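Utilities for converting dataset types

This section provides example code to help you convert between different dataset types. While some conversions can be performed after applying the chat template (i.e., in the standard format), we recommend performing the conversion before applying the chat template to ensure it works consistently.

For simplicity, some of the examples below do not follow this recommendation and use the standard format. However, the conversions can be applied directly to the conversational format without modification.

From \ To	Language modeling	Prompt-completion	Prompt-only	Preference with implicit prompt	Preference	Unpaired preference	Stepwise supervision
Language modeling	N/A	N/A	N/A	N/A	N/A	N/A	N/A
Prompt-completion	🔗	N/A	🔗	N/A	N/A	N/A	N/A
Prompt-only	N/A	N/A	N/A	N/A	N/A	N/A	N/A
Preference with implicit prompt	🔗	🔗	🔗	N/A	🔗	🔗	N/A
Preference	🔗	🔗	🔗	🔗	N/A	🔗	N/A
Unpaired preference	🔗	🔗	🔗	N/A	N/A	N/A	N/A
Stepwise supervision	🔗	🔗	🔗	N/A	N/A	🔗	N/A

From prompt-completion to language modeling dataset

To convert a prompt-completion dataset into a language modeling dataset, concatenate the prompt and the completion.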

from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["The sky is", "The sun is"],
    "completion": [" blue.", " in the sky."],
})

def concat_prompt_completion(example):
    return {"text": example["prompt"] + example["completion"]}

dataset = dataset.map(concat_prompt_completion, remove_columns=["prompt", "completion"])

>>> dataset[0]
{'text': 'The sky is blue.'}

From prompt-completion to prompt-only dataset

To convert a prompt-completion dataset into a prompt-only dataset, remove the completion.

from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["The sky is", "The sun is"],
    "completion": [" blue.", " in the sky."],
})

dataset = dataset.remove_columns("completion")

>>> dataset[0]
{'prompt': 'The sky is'}

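From preference with implicit prompt to language modeling dataset

To convert a preference dataset with implicit prompt into a language modeling dataset, remove the rejected column and rename the column `"chosen"` to `"text"`.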

from datasets import Dataset

dataset = Dataset.from_dict({
    "chosen": ["The sky is blue.", "The sun is in the sky."],
    "rejected": ["The sky is green.", "The sun is in the sea."],
})

dataset = dataset.rename_column("chosen", "text").remove_columns("rejected")

>>> dataset[0]
{'text': 'The sky is blue.'}

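From preference with implicit prompt to prompt-completion dataset

To convert a preference dataset with implicit prompt into a prompt-completion dataset, extract the prompt with extract_prompt(), remove the rejected column, and rename the column `"chosen"` to `"completion"`.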

from datasets import Dataset
from trl import extract_prompt

dataset = Dataset.from_dict({
    "chosen": [
        [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is blue."}],
        [{"role": "user", "content": "Where is the sun?"}, {"role": "assistant", "content": "In the sky."}],
    ],
    "rejected": [
        [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is green."}],
        [{"role": "user", "content": "Where is the sun?"}, {"role": "assistant", "content": "In the sea."}],
    ],
})
dataset = dataset.map(extract_prompt).remove_columns("rejected").rename_column("chosen", "completion")

>>> dataset[0]
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}], 'completion': [{'role': 'assistant', 'content': 'It is blue.'}]}

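From preference with implicit prompt to prompt-only dataset

To convert a preference dataset with implicit prompt into a prompt-only dataset, extract the prompt with extract_prompt(), and remove the rejected and the chosen columns.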

from datasets import Dataset
from trl import extract_prompt

dataset = Dataset.from_dict({
    "chosen": [
        [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is blue."}],
        [{"role": "user", "content": "Where is the sun?"}, {"role": "assistant", "content": "In the sky."}],
    ],
    "rejected": [
        [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is green."}],
        [{"role": "user", "content": "Where is the sun?"}, {"role": "assistant", "content": "In the sea."}],
    ],
})
dataset = dataset.map(extract_prompt).remove_columns(["chosen", "rejected"])

>>> dataset[0]
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}]}

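From implicit to explicit prompt preference dataset

To convert a preference dataset with implicit prompt into a preference dataset with explicit prompt, extract the prompt with extract_prompt().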

from datasets import Dataset
from trl import extract_prompt

dataset = Dataset.from_dict({
    "chosen": [
        [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is blue."}],
        [{"role": "user", "content": "Where is the sun?"}, {"role": "assistant", "content": "In the sky."}],
    ],
    "rejected": [
        [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is green."}],
        [{"role": "user", "content": "Where is the sun?"}, {"role": "assistant", "content": "In the sea."}],
    ],
})

dataset = dataset.map(extract_prompt)

>>> dataset[0]
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'chosen': [{'role': 'assistant', 'content': 'It is blue.'}],
 'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}
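The idea behind extracting an implicit prompt is that the prompt is the part shared by `"chosen"` and `"rejected"`. The sketch below takes the longest common prefix of the two message lists; `split_common_prefix` is a hypothetical helper written for illustration, not TRL's implementation of extract_prompt(), and unlike a production version it does no special handling of whitespace when applied to plain strings.

```python
def split_common_prefix(example):
    """Split chosen/rejected message lists into a shared prompt and distinct completions."""
    chosen, rejected = example["chosen"], example["rejected"]
    n = 0
    # Advance while both lists contain the same message at position n.
    while n < min(len(chosen), len(rejected)) and chosen[n] == rejected[n]:
        n += 1
    return {"prompt": chosen[:n], "chosen": chosen[n:], "rejected": rejected[n:]}


example = {
    "chosen": [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is blue."}],
    "rejected": [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is green."}],
}
print(split_common_prefix(example))
```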

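From preference with implicit prompt to unpaired preference dataset

To convert a preference dataset with implicit prompt into an unpaired preference dataset, extract the prompt with extract_prompt(), and unpair the dataset with unpair_preference_dataset().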

from datasets import Dataset
from trl import extract_prompt, unpair_preference_dataset

dataset = Dataset.from_dict({
    "chosen": [
        [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is blue."}],
        [{"role": "user", "content": "Where is the sun?"}, {"role": "assistant", "content": "In the sky."}],
    ],
    "rejected": [
        [{"role": "user", "content": "What color is the sky?"}, {"role": "assistant", "content": "It is green."}],
        [{"role": "user", "content": "Where is the sun?"}, {"role": "assistant", "content": "In the sea."}],
    ],
})

dataset = dataset.map(extract_prompt)
dataset = unpair_preference_dataset(dataset)

>>> dataset[0]
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'completion': [{'role': 'assistant', 'content': 'It is blue.'}],
 'label': True}

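Keep in mind that the `"chosen"` and `"rejected"` completions in a preference dataset can both be good or both be bad. Before applying unpair_preference_dataset(), please ensure that all `"chosen"` completions can be labeled as good and all `"rejected"` completions as bad. This can be ensured by checking the absolute rating of each completion, e.g., from a reward model.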
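For example, if each completion has an absolute score (say, from a reward model), you could keep only the pairs where the chosen completion clears a threshold and the rejected one does not. The `chosen_score`/`rejected_score` columns, the `is_cleanly_separated` helper, and the threshold of 0.5 below are all hypothetical.

```python
def is_cleanly_separated(example, threshold=0.5):
    """True if the chosen completion is good and the rejected one is bad in absolute terms."""
    return example["chosen_score"] >= threshold and example["rejected_score"] < threshold


rows = [
    {"prompt": "The sky is", "chosen": " blue.", "rejected": " green.",
     "chosen_score": 0.9, "rejected_score": 0.1},
    # Both completions score low here, so labeling "chosen" as good would be wrong.
    {"prompt": "The sun is", "chosen": " cold.", "rejected": " in the sea.",
     "chosen_score": 0.3, "rejected_score": 0.2},
]
kept = [row for row in rows if is_cleanly_separated(row)]
print(len(kept))  # 1
```

With a `datasets.Dataset`, the same predicate could be passed to `dataset.filter(is_cleanly_separated)` before calling unpair_preference_dataset().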

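From preference to language modeling dataset

To convert a preference dataset into a language modeling dataset, remove the rejected column and concatenate the prompt and the chosen completion into the `"text"` column.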

from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["The sky is", "The sun is"],
    "chosen": [" blue.", " in the sky."],
    "rejected": [" green.", " in the sea."],
})

def concat_prompt_chosen(example):
    return {"text": example["prompt"] + example["chosen"]}

dataset = dataset.map(concat_prompt_chosen, remove_columns=["prompt", "chosen", "rejected"])

>>> dataset[0]
{'text': 'The sky is blue.'}

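From preference to prompt-completion dataset

To convert a preference dataset into a prompt-completion dataset, remove the rejected column and rename the column `"chosen"` to `"completion"`.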

from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["The sky is", "The sun is"],
    "chosen": [" blue.", " in the sky."],
    "rejected": [" green.", " in the sea."],
})

dataset = dataset.remove_columns("rejected").rename_column("chosen", "completion")

>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.'}

From preference to prompt-only dataset

To convert a preference dataset into a prompt-only dataset, remove the rejected and the chosen columns.

from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["The sky is", "The sun is"],
    "chosen": [" blue.", " in the sky."],
    "rejected": [" green.", " in the sea."],
})

dataset = dataset.remove_columns(["chosen", "rejected"])

>>> dataset[0]
{'prompt': 'The sky is'}

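From explicit to implicit prompt preference dataset

To convert a preference dataset with explicit prompt into a preference dataset with implicit prompt, concatenate the prompt to both the chosen and the rejected completions, and remove the prompt.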

from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": [
        [{"role": "user", "content": "What color is the sky?"}],
        [{"role": "user", "content": "Where is the sun?"}],
    ],
    "chosen": [
        [{"role": "assistant", "content": "It is blue."}],
        [{"role": "assistant", "content": "In the sky."}],
    ],
    "rejected": [
        [{"role": "assistant", "content": "It is green."}],
        [{"role": "assistant", "content": "In the sea."}],
    ],
})

def concat_prompt_to_completions(example):
    return {"chosen": example["prompt"] + example["chosen"], "rejected": example["prompt"] + example["rejected"]}

dataset = dataset.map(concat_prompt_to_completions, remove_columns="prompt")

>>> dataset[0]
{'chosen': [{'role': 'user', 'content': 'What color is the sky?'}, {'role': 'assistant', 'content': 'It is blue.'}],
 'rejected': [{'role': 'user', 'content': 'What color is the sky?'}, {'role': 'assistant', 'content': 'It is green.'}]}

From preference to unpaired preference dataset

To convert a preference dataset into an unpaired preference dataset, unpair the dataset with unpair_preference_dataset().

from datasets import Dataset
from trl import unpair_preference_dataset

dataset = Dataset.from_dict({
    "prompt": [
        [{"role": "user", "content": "What color is the sky?"}],
        [{"role": "user", "content": "Where is the sun?"}],
    ],
    "chosen": [
        [{"role": "assistant", "content": "It is blue."}],
        [{"role": "assistant", "content": "In the sky."}],
    ],
    "rejected": [
        [{"role": "assistant", "content": "It is green."}],
        [{"role": "assistant", "content": "In the sea."}],
    ],
})

dataset = unpair_preference_dataset(dataset)

>>> dataset[0]
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'completion': [{'role': 'assistant', 'content': 'It is blue.'}],
 'label': True}
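Unpairing turns each preference row into two unpaired rows that share the prompt: the chosen completion labeled `True` and the rejected completion labeled `False`. The plain-Python sketch below shows that transformation; `unpair_rows` is hypothetical (not TRL's code), and placing all chosen rows before all rejected rows is our own choice.

```python
def unpair_rows(rows):
    """Turn preference rows into unpaired preference rows."""
    chosen_rows = [
        {"prompt": row["prompt"], "completion": row["chosen"], "label": True} for row in rows
    ]
    rejected_rows = [
        {"prompt": row["prompt"], "completion": row["rejected"], "label": False} for row in rows
    ]
    return chosen_rows + rejected_rows


rows = [
    {"prompt": "The sky is", "chosen": " blue.", "rejected": " green."},
    {"prompt": "The sun is", "chosen": " in the sky.", "rejected": " in the sea."},
]
for row in unpair_rows(rows):
    print(row)
```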

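From unpaired preference to language modeling dataset

To convert an unpaired preference dataset into a language modeling dataset, concatenate the prompts with good completions into the `"text"` column, and remove the prompt, completion, and label columns.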

from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["The sky is", "The sun is", "The sky is", "The sun is"],
    "completion": [" blue.", " in the sky.", " green.", " in the sea."],
    "label": [True, True, False, False],
})

def concatenate_prompt_completion(example):
    return {"text": example["prompt"] + example["completion"]}

dataset = dataset.filter(lambda x: x["label"]).map(concatenate_prompt_completion).remove_columns(["prompt", "completion", "label"])

>>> dataset[0]
{'text': 'The sky is blue.'}

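From unpaired preference to prompt-completion dataset

To convert an unpaired preference dataset into a prompt-completion dataset, filter for good labels, then remove the label column.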

from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["The sky is", "The sun is", "The sky is", "The sun is"],
    "completion": [" blue.", " in the sky.", " green.", " in the sea."],
    "label": [True, True, False, False],
})

dataset = dataset.filter(lambda x: x["label"]).remove_columns(["label"])

>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.'}

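From unpaired preference to prompt-only dataset

To convert an unpaired preference dataset into a prompt-only dataset, remove the completion and label columns.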

from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["The sky is", "The sun is", "The sky is", "The sun is"],
    "completion": [" blue.", " in the sky.", " green.", " in the sea."],
    "label": [True, True, False, False],
})

dataset = dataset.remove_columns(["completion", "label"])

>>> dataset[0]
{'prompt': 'The sky is'}

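From stepwise supervision to language modeling dataset

To convert a stepwise supervision dataset into a language modeling dataset, concatenate the prompts with good completions into the `"text"` column.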

from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["Blue light", "Water"],
    "completions": [[" scatters more in the atmosphere,", " so the sky is green."],
                   [" forms a less dense structure in ice,", " which causes it to expand when it freezes."]],
    "labels": [[True, False], [True, True]],
})

def concatenate_prompt_completions(example):
    completion = "".join(example["completions"])
    return {"text": example["prompt"] + completion}

dataset = dataset.filter(lambda x: all(x["labels"])).map(concatenate_prompt_completions, remove_columns=["prompt", "completions", "labels"])

>>> dataset[0]
{'text': 'Water forms a less dense structure in ice, which causes it to expand when it freezes.'}

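From stepwise supervision to prompt-completion dataset

To convert a stepwise supervision dataset into a prompt-completion dataset, join the good completions and remove the labels.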

from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["Blue light", "Water"],
    "completions": [[" scatters more in the atmosphere,", " so the sky is green."],
                   [" forms a less dense structure in ice,", " which causes it to expand when it freezes."]],
    "labels": [[True, False], [True, True]],
})

def join_completions(example):
    completion = "".join(example["completions"])
    return {"completion": completion}

dataset = dataset.filter(lambda x: all(x["labels"])).map(join_completions, remove_columns=["completions", "labels"])

>>> dataset[0]
{'prompt': 'Water', 'completion': ' forms a less dense structure in ice, which causes it to expand when it freezes.'}

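From stepwise supervision to prompt-only dataset

To convert a stepwise supervision dataset into a prompt-only dataset, remove the completions and the labels.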

from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["Blue light", "Water"],
    "completions": [[" scatters more in the atmosphere,", " so the sky is green."],
                   [" forms a less dense structure in ice,", " which causes it to expand when it freezes."]],
    "labels": [[True, False], [True, True]],
})

dataset = dataset.remove_columns(["completions", "labels"])

>>> dataset[0]
{'prompt': 'Blue light'}

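From stepwise supervision to unpaired preference dataset

To convert a stepwise supervision dataset into an unpaired preference dataset, join the completions and merge the labels.

The method for merging the labels depends on the specific task. In this example, we use the logical AND operation. This means that if the step labels indicate the correctness of individual steps, the resulting label reflects the correctness of the entire sequence.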

from datasets import Dataset

dataset = Dataset.from_dict({
    "prompt": ["Blue light", "Water"],
    "completions": [[" scatters more in the atmosphere,", " so the sky is green."],
                   [" forms a less dense structure in ice,", " which causes it to expand when it freezes."]],
    "labels": [[True, False], [True, True]],
})

def merge_completions_and_labels(example):
    return {"prompt": example["prompt"], "completion": "".join(example["completions"]), "label": all(example["labels"])}

dataset = dataset.map(merge_completions_and_labels, remove_columns=["completions", "labels"])

>>> dataset[0]
{'prompt': 'Blue light', 'completion': ' scatters more in the atmosphere, so the sky is green.', 'label': False}
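Because the merging rule is task-dependent, it can help to isolate it in one small function. `merge_labels` below is hypothetical: the `"and"` rule is the one used above, while the `"last"` rule is only an illustrative alternative (for tasks where just the final step matters) and is not something TRL provides.

```python
def merge_labels(step_labels, rule="and"):
    """Collapse per-step labels into one sequence-level label.

    "and":  the sequence is correct only if every step is correct (used above).
    "last": illustrative alternative that only looks at the final step.
    """
    if rule == "and":
        return all(step_labels)
    if rule == "last":
        return bool(step_labels[-1])
    raise ValueError(f"unknown rule: {rule}")


labels = [[True, False], [True, True]]
print([merge_labels(step_labels) for step_labels in labels])  # [False, True]
```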

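Vision datasets

Some trainers also support fine-tuning vision-language models (VLMs) using image-text pairs. In this scenario, it's recommended to use a conversational format, as each model handles image placeholders in text differently.

A conversational vision dataset differs from a standard conversational dataset in two key ways:

- The dataset must contain the key `images` with the image data.
- The `"content"` field in messages must be a list of dictionaries, where each dictionary specifies the type of data: `"image"` or `"text"`.

Example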

# Textual dataset:
"content": "What color is the sky?"

# Vision dataset:
"content": [
    {"type": "image"}, 
    {"type": "text", "text": "What color is the sky in the image?"}
]
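A sanity check for a conversational vision example might verify both requirements. `check_vision_example` is our own hypothetical helper; it additionally assumes that each `{"type": "image"}` entry corresponds to one element of `images`, which you should confirm against your model's processor. The image file name is a placeholder, since a real dataset stores image objects.

```python
def iter_messages(example):
    """Yield every chat message found in any column of the example."""
    for value in example.values():
        if isinstance(value, list):
            for item in value:
                if isinstance(item, dict) and "role" in item:
                    yield item


def check_vision_example(example):
    """Validate the vision-specific structure and return the number of image placeholders."""
    if "images" not in example:
        raise ValueError('vision examples need an "images" key')
    n_image_slots = 0
    for message in iter_messages(example):
        content = message["content"]
        if not isinstance(content, list):
            raise ValueError("content must be a list of typed dicts")
        for part in content:
            if part.get("type") not in ("image", "text"):
                raise ValueError(f"unexpected content type: {part.get('type')}")
            if part["type"] == "image":
                n_image_slots += 1
    if n_image_slots != len(example["images"]):
        raise ValueError(f"{n_image_slots} image placeholders but {len(example['images'])} images")
    return n_image_slots


example = {
    "images": ["sky.png"],  # placeholder for real image data
    "prompt": [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What color is the sky in the image?"},
    ]}],
    "completion": [{"role": "assistant", "content": [{"type": "text", "text": "It is blue."}]}],
}
print(check_vision_example(example))  # 1
```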

An example of a conversational vision dataset is openbmb/RLAIF-V-Dataset. Below is an embedded view of the dataset's training data, which you can explore directly:
