Data Utilities

is_conversational

trl.is_conversational

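( example: dict ) → bool

Parameters

example (dict[str, Any]) — A single data entry of a dataset. The example can have different keys depending on the dataset type.

bool

True if the data is in a conversational format, False otherwise.

Check if the example is in a conversational format.

Examples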

>>> from trl import is_conversational

>>> example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
>>> is_conversational(example)
True

>>> example = {"prompt": "The sky is"}
>>> is_conversational(example)
False
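
To make the check concrete, here is a minimal pure-Python sketch of what "conversational format" means in the examples above. The function name looks_conversational and the key list MESSAGE_KEYS are hypothetical; this is an approximation for illustration, not TRL's actual implementation.

```python
from typing import Any

# Assumption: these are the keys that may hold a list of messages.
MESSAGE_KEYS = ("prompt", "chosen", "rejected", "completion", "messages")


def looks_conversational(example: dict[str, Any]) -> bool:
    """Hypothetical stand-in for is_conversational, for illustration only."""
    for key in MESSAGE_KEYS:
        value = example.get(key)
        # A conversational value is a non-empty list whose first item is a
        # message dict carrying both "role" and "content".
        if isinstance(value, list) and value:
            first = value[0]
            if isinstance(first, dict) and "role" in first and "content" in first:
                return True
    return False


print(looks_conversational({"prompt": [{"role": "user", "content": "What color is the sky?"}]}))
print(looks_conversational({"prompt": "The sky is"}))
```

A plain string under "prompt" is treated as the standard (non-conversational) format, which is why the second call reports False.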

apply_chat_template

trl.apply_chat_template

( example: dict tokenizer: PreTrainedTokenizerBase tools: typing.Optional[list[typing.Union[dict, typing.Callable]]] = None **template_kwargs )

Apply a chat template to a conversational example, along with the schema for the list of functions in tools.

For more details, see maybe_apply_chat_template().

maybe_apply_chat_template

trl.maybe_apply_chat_template

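( example: dict tokenizer: PreTrainedTokenizerBase tools: typing.Optional[list[typing.Union[dict, typing.Callable]]] = None **template_kwargs: typing.Any ) → dict[str, str]

Parameters

example (dict[str, list[dict[str, str]]]) — Dictionary representing a single data entry of a conversational dataset. Each data entry can have different keys depending on the dataset type. The supported dataset types are:
- Language modeling dataset: "messages".
- Prompt-only dataset: "prompt".
- Prompt-completion dataset: "prompt" and "completion".
- Preference dataset: "prompt", "chosen", and "rejected".
- Preference dataset with implicit prompt: "chosen" and "rejected".
- Unpaired preference dataset: "prompt", "completion", and "label".
For the keys "messages", "prompt", "chosen", "rejected", and "completion", the values are lists of messages, where each message is a dictionary with the keys "role" and "content".
tokenizer (PreTrainedTokenizerBase) — Tokenizer used to apply the chat template.
tools (list[Union[dict, Callable]] or None, optional, defaults to None) — A list of tools (callable functions) that will be accessible to the model. If the template does not support function calling, this argument has no effect.
**template_kwargs (Any, optional) — Additional keyword arguments to pass to the template renderer. They will be accessible by the chat template.

dict[str, str]

The formatted example with the chat template applied.

If the example is in a conversational format, apply a chat template to it.

Notes

This function does not alter the keys, except for language modeling datasets, where "messages" is replaced by "text".
For prompt-only data, if the last role is "user", the generation prompt is added to the prompt. Otherwise, if the last role is "assistant", the final message is continued.

Examples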

>>> from transformers import AutoTokenizer
>>> from trl import apply_chat_template

>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
>>> example = {
...     "prompt": [{"role": "user", "content": "What color is the sky?"}],
...     "completion": [{"role": "assistant", "content": "It is blue."}],
... }
>>> apply_chat_template(example, tokenizer)
{'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n', 'completion': 'It is blue.<|end|>\n<|endoftext|>'}
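
The two notes above (the "messages" to "text" rename, and the generation prompt versus continuing the final message) can be sketched without a tokenizer. The template below is a made-up toy format, and render and toy_apply_chat_template are hypothetical names; real chat templates differ per model, and this sketch only covers the "messages" and "prompt" keys.

```python
END = "<|end|>\n"  # toy end-of-turn marker (assumption, not a real model's template)


def render(messages, add_generation_prompt=False, continue_final_message=False):
    """Render messages with a toy template: <|role|>, newline, content, END."""
    text = "".join(f"<|{m['role']}|>\n{m['content']}{END}" for m in messages)
    if add_generation_prompt:
        text += "<|assistant|>\n"  # cue the model to start a new assistant turn
    if continue_final_message:
        text = text[: -len(END)]  # leave the last message open so it can be continued
    return text


def toy_apply_chat_template(example):
    out = {}
    if "messages" in example:
        # Language modeling data: the "messages" key is replaced by "text".
        out["text"] = render(example["messages"])
    if "prompt" in example:
        last_role = example["prompt"][-1]["role"]
        out["prompt"] = render(
            example["prompt"],
            add_generation_prompt=(last_role == "user"),
            continue_final_message=(last_role == "assistant"),
        )
    return out


print(toy_apply_chat_template({"prompt": [{"role": "user", "content": "Hi"}]}))
```

With a trailing user turn the rendered prompt ends in the assistant header; with a trailing assistant turn the end-of-turn marker is dropped so generation can pick up mid-message.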

maybe_convert_to_chatml

trl.maybe_convert_to_chatml

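( example: dict ) → dict[str, list]

Parameters

example (dict[str, list]) — A single data entry containing a list of messages.

dict[str, list]

Example reformatted to ChatML style.

Convert a conversational dataset with the fields from and value to ChatML format.

This function modifies conversational data to align with OpenAI's ChatML format:

- Replaces the key "from" with "role" in message dictionaries.
- Replaces the key "value" with "content" in message dictionaries.
- Renames "conversations" to "messages" for consistency with ChatML.

Examples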

>>> from trl import maybe_convert_to_chatml

>>> example = {
...     "conversations": [
...         {"from": "user", "value": "What color is the sky?"},
...         {"from": "assistant", "value": "It is blue."},
...     ]
... }
>>> maybe_convert_to_chatml(example)
{'messages': [{'role': 'user', 'content': 'What color is the sky?'},
              {'role': 'assistant', 'content': 'It is blue.'}]}
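
The renaming described above can be sketched in a few lines of plain Python. The function name to_chatml_sketch and the list of keys it scans are assumptions made for illustration; this is not the library's implementation.

```python
import copy


def to_chatml_sketch(example):
    """Hypothetical re-implementation of the key renaming, for illustration."""
    example = copy.deepcopy(example)  # avoid mutating the caller's data
    # Assumption: any of these keys may hold a list of message dicts.
    for key in ("prompt", "completion", "chosen", "rejected", "messages", "conversations"):
        if isinstance(example.get(key), list):
            for message in example[key]:
                if isinstance(message, dict):
                    if "from" in message:
                        message["role"] = message.pop("from")
                    if "value" in message:
                        message["content"] = message.pop("value")
    # Rename the top-level key for consistency with ChatML.
    if "conversations" in example:
        example["messages"] = example.pop("conversations")
    return example


example = {"conversations": [{"from": "user", "value": "What color is the sky?"}]}
print(to_chatml_sketch(example))
```

An example that already uses "role" and "content" passes through unchanged, which is what makes a "maybe" style conversion safe to apply to a whole dataset.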

extract_prompt

trl.extract_prompt

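( example: dict )

Extract the shared prompt from a preference data example, where the prompt is implicit within both the chosen and rejected completions.

For more details, see maybe_extract_prompt().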

maybe_extract_prompt

trl.maybe_extract_prompt

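( example: dict ) → dict[str, list]

Parameters

example (dict[str, list]) — A dictionary representing a single data entry in the preference dataset. It must contain the keys "chosen" and "rejected", where each value is either in conversational or standard (str) format.

dict[str, list]

A dictionary containing:

- "prompt": The longest common prefix between the "chosen" and "rejected" completions.
- "chosen": The remainder of the "chosen" completion, with the prompt removed.
- "rejected": The remainder of the "rejected" completion, with the prompt removed.

Extract the shared prompt from a preference data example, where the prompt is implicit within both the chosen and rejected completions.

If the example already contains a "prompt" key, the function returns the example as is. Otherwise, it identifies the longest common sequence (prefix) of conversation turns between the "chosen" and "rejected" completions and extracts it as the prompt. It then removes this prompt from the respective "chosen" and "rejected" completions.

Examples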

>>> from trl import extract_prompt

>>> example = {
...     "chosen": [
...         {"role": "user", "content": "What color is the sky?"},
...         {"role": "assistant", "content": "It is blue."},
...     ],
...     "rejected": [
...         {"role": "user", "content": "What color is the sky?"},
...         {"role": "assistant", "content": "It is green."},
...     ],
... }
>>> extract_prompt(example)
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'chosen': [{'role': 'assistant', 'content': 'It is blue.'}],
 'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}

Or, with the map method of datasets.Dataset:

>>> from trl import extract_prompt
>>> from datasets import Dataset

>>> dataset_dict = {
...     "chosen": [
...         [
...             {"role": "user", "content": "What color is the sky?"},
...             {"role": "assistant", "content": "It is blue."},
...         ],
...         [
...             {"role": "user", "content": "Where is the sun?"},
...             {"role": "assistant", "content": "In the sky."},
...         ],
...     ],
...     "rejected": [
...         [
...             {"role": "user", "content": "What color is the sky?"},
...             {"role": "assistant", "content": "It is green."},
...         ],
...         [
...             {"role": "user", "content": "Where is the sun?"},
...             {"role": "assistant", "content": "In the sea."},
...         ],
...     ],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = dataset.map(extract_prompt)
>>> dataset[0]
{'prompt': [{'role': 'user', 'content': 'What color is the sky?'}],
 'chosen': [{'role': 'assistant', 'content': 'It is blue.'}],
 'rejected': [{'role': 'assistant', 'content': 'It is green.'}]}
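
The prefix extraction described above can be sketched as follows for conversational (list-of-messages) data. The name extract_prompt_sketch is hypothetical, and the sketch deliberately ignores the finer points of the standard (str) format, so treat it as an illustration of the idea rather than the library's code.

```python
def extract_prompt_sketch(example):
    """Split off the longest common prefix of turns as the prompt (illustration only)."""
    if "prompt" in example:
        return example  # nothing to extract, return as is
    chosen, rejected = example["chosen"], example["rejected"]
    # Count how many leading turns are identical in both completions.
    n = 0
    for chosen_turn, rejected_turn in zip(chosen, rejected):
        if chosen_turn != rejected_turn:
            break
        n += 1
    return {"prompt": chosen[:n], "chosen": chosen[n:], "rejected": rejected[n:]}


example = {
    "chosen": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is blue."},
    ],
    "rejected": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is green."},
    ],
}
print(extract_prompt_sketch(example))
```

Because the comparison is turn by turn, a multi-turn shared history ends up entirely in "prompt", and only the diverging turns remain in "chosen" and "rejected".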

unpair_preference_dataset

trl.unpair_preference_dataset

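( dataset: ~DatasetType num_proc: typing.Optional[int] = None desc: typing.Optional[str] = None ) → Dataset

Parameters

dataset (Dataset or DatasetDict) — Preference dataset to unpair. The dataset must have the columns "chosen" and "rejected", and optionally "prompt".
num_proc (int or None, optional, defaults to None) — Number of processes to use for processing the dataset.
desc (str or None, optional, defaults to None) — Meaningful description to be displayed alongside the progress bar while mapping examples.

Dataset

The unpaired preference dataset.

Unpair a preference dataset.

Examples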

>>> from datasets import Dataset
>>> from trl import unpair_preference_dataset

>>> dataset_dict = {
...     "prompt": ["The sky is", "The sun is"],
...     "chosen": [" blue.", " in the sky."],
...     "rejected": [" green.", " in the sea."],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = unpair_preference_dataset(dataset)
>>> dataset
Dataset({
    features: ['prompt', 'completion', 'label'],
    num_rows: 4
})

>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.', 'label': True}
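
To show what unpairing does to the columns, here is a minimal sketch that works on a plain dictionary of columns rather than a datasets.Dataset. The helper name unpair_columns is hypothetical, and the row ordering (all chosen completions first, then all rejected ones) is an assumption that happens to agree with the example above.

```python
def unpair_columns(batch):
    """Turn paired chosen/rejected columns into completion/label columns (illustration only)."""
    n = len(batch["chosen"])
    out = {
        # Every chosen completion becomes a row labeled True,
        # every rejected completion a row labeled False.
        "completion": batch["chosen"] + batch["rejected"],
        "label": [True] * n + [False] * n,
    }
    if "prompt" in batch:
        # Each prompt is needed twice: once per completion.
        out["prompt"] = batch["prompt"] + batch["prompt"]
    return out


columns = {
    "prompt": ["The sky is", "The sun is"],
    "chosen": [" blue.", " in the sky."],
    "rejected": [" green.", " in the sea."],
}
unpaired = unpair_columns(columns)
print(len(unpaired["completion"]))
```

Two paired rows become four unpaired rows, which is why num_rows doubles in the example above.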

maybe_unpair_preference_dataset

trl.maybe_unpair_preference_dataset

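( dataset: ~DatasetType num_proc: typing.Optional[int] = None desc: typing.Optional[str] = None ) → Dataset or DatasetDict

Parameters

dataset (Dataset or DatasetDict) — Preference dataset to unpair. The dataset must have the columns "chosen" and "rejected", and optionally "prompt".
num_proc (int or None, optional, defaults to None) — Number of processes to use for processing the dataset.
desc (str or None, optional, defaults to None) — Meaningful description to be displayed alongside the progress bar while mapping examples.

Dataset or DatasetDict

The unpaired dataset if the preference dataset was paired, otherwise the original dataset.

Unpair a preference dataset if it is paired.

Examples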

>>> from datasets import Dataset
>>> from trl import maybe_unpair_preference_dataset

>>> dataset_dict = {
...     "prompt": ["The sky is", "The sun is"],
...     "chosen": [" blue.", " in the sky."],
...     "rejected": [" green.", " in the sea."],
... }
>>> dataset = Dataset.from_dict(dataset_dict)
>>> dataset = maybe_unpair_preference_dataset(dataset)
>>> dataset
Dataset({
    features: ['prompt', 'completion', 'label'],
    num_rows: 4
})

>>> dataset[0]
{'prompt': 'The sky is', 'completion': ' blue.', 'label': True}
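
The "maybe" variant adds a guard in front of the conversion. A minimal sketch on a plain dictionary of columns, assuming that "paired" simply means both a "chosen" and a "rejected" column are present (the helper name maybe_unpair_columns is hypothetical):

```python
def maybe_unpair_columns(batch):
    """Unpair only when both 'chosen' and 'rejected' columns exist (illustration only)."""
    if not {"chosen", "rejected"} <= set(batch):
        return batch  # already unpaired: hand the data back untouched
    n = len(batch["chosen"])
    out = {
        "completion": batch["chosen"] + batch["rejected"],
        "label": [True] * n + [False] * n,
    }
    if "prompt" in batch:
        out["prompt"] = batch["prompt"] * 2  # one copy per completion
    return out


paired = {"prompt": ["The sky is"], "chosen": [" blue."], "rejected": [" green."]}
once = maybe_unpair_columns(paired)
twice = maybe_unpair_columns(once)
print(once == twice)
```

Because of the guard, applying the function a second time is a no-op, so it is safe to call on datasets whose format is not known in advance.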

pack_dataset

trl.pack_dataset

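( dataset: ~DatasetType seq_length: int strategy: str = 'bfd' map_kwargs: typing.Optional[dict[str, typing.Any]] = None ) → Dataset or DatasetDict

Parameters

dataset (Dataset or DatasetDict) — Dataset to pack.
seq_length (int) — Target sequence length to pack to.
strategy (str, optional, defaults to "bfd") — Packing strategy to use. Can be either:
- "bfd" (Best Fit Decreasing): Slower but preserves sequence boundaries. Sequences are never cut in the middle.
- "wrapped": Faster but more aggressive. Ignores sequence boundaries and cuts sequences in the middle to completely fill each packed sequence.
map_kwargs (dict or None, optional, defaults to None) — Additional keyword arguments to pass to the dataset's map method when packing examples.

Dataset or DatasetDict

The dataset with packed sequences. The number of examples may decrease as sequences are combined.

Pack sequences in a dataset into chunks of size seq_length.

Examples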

>>> from datasets import Dataset
>>> from trl import pack_dataset

>>> examples = {
...     "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8], [9]],
...     "attention_mask": [[1, 1, 0], [1, 0], [1, 0, 0], [1]],
... }
>>> dataset = Dataset.from_dict(examples)
>>> packed_dataset = pack_dataset(dataset, seq_length=4, strategy="bfd")
>>> packed_dataset[:]
{'input_ids': [[1, 2, 3, 9], [6, 7, 8], [4, 5]],
 'attention_mask': [[1, 1, 0, 1], [1, 0, 0], [1, 0]]}
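
The difference between the two strategies is easier to see on plain lists of token ids. The sketch below is a textbook best-fit-decreasing packer and a simple concatenate-and-slice packer; pack_bfd and pack_wrapped are hypothetical names, the code assumes no input sequence is longer than seq_length, and keeping the trailing partial chunk in the wrapped variant is also an assumption. It is not the library's implementation.

```python
def pack_bfd(sequences, seq_length):
    """Best Fit Decreasing: longest first, each into the tightest bin that still fits."""
    bins = []  # each entry is [free_space, packed_tokens]
    for seq in sorted(sequences, key=len, reverse=True):
        fitting = [b for b in bins if b[0] >= len(seq)]
        if fitting:
            best = min(fitting, key=lambda b: b[0])  # least free space left
            best[1].extend(seq)
            best[0] -= len(seq)
        else:
            bins.append([seq_length - len(seq), list(seq)])
    return [tokens for _, tokens in bins]


def pack_wrapped(sequences, seq_length):
    """Concatenate everything, then slice into chunks, ignoring sequence boundaries."""
    flat = [tok for seq in sequences for tok in seq]
    return [flat[i:i + seq_length] for i in range(0, len(flat), seq_length)]


input_ids = [[1, 2, 3], [4, 5], [6, 7, 8], [9]]
print(pack_bfd(input_ids, seq_length=4))      # [[1, 2, 3, 9], [6, 7, 8], [4, 5]]
print(pack_wrapped(input_ids, seq_length=4))  # [[1, 2, 3, 4], [5, 6, 7, 8], [9]]
```

In the best-fit result every original sequence stays contiguous inside one packed row, while the wrapped result splits [4, 5] and [6, 7, 8] across rows in exchange for fuller chunks.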

truncate_dataset

trl.truncate_dataset

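( dataset: ~DatasetType max_length: int map_kwargs: typing.Optional[dict[str, typing.Any]] = None ) → Dataset or DatasetDict

Parameters

dataset (Dataset or DatasetDict) — Dataset to truncate.
max_length (int) — Maximum sequence length to truncate to.
map_kwargs (dict or None, optional, defaults to None) — Additional keyword arguments to pass to the dataset's map method when truncating examples.

Dataset or DatasetDict

The dataset with truncated sequences.

Truncate sequences in a dataset to a specified max_length.

Examples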

>>> from datasets import Dataset
>>> from trl import truncate_dataset

>>> examples = {
...     "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8]],
...     "attention_mask": [[0, 1, 1], [0, 0, 1, 1], [1]],
... }
>>> dataset = Dataset.from_dict(examples)
>>> truncated_dataset = truncate_dataset(dataset, max_length=2)
>>> truncated_dataset[:]
{'input_ids': [[1, 2], [4, 5], [8]],
 'attention_mask': [[0, 1], [0, 0], [1]]}
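
Truncation amounts to slicing every list-valued entry to its first max_length items. A minimal sketch on a plain dictionary of columns (the helper name truncate_columns is hypothetical, and leaving non-list values untouched is an assumption):

```python
def truncate_columns(batch, max_length):
    """Keep only the first max_length items of each list-valued row (illustration only)."""
    return {
        name: [row[:max_length] if isinstance(row, list) else row for row in column]
        for name, column in batch.items()
    }


examples = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8]],
    "attention_mask": [[0, 1, 1], [0, 0, 1, 1], [1]],
}
print(truncate_columns(examples, max_length=2))
```

Rows that are already shorter than max_length, such as [8], are returned unchanged, and the number of rows stays the same.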
