開源 AI 食譜文件
使用 distilabel 生成偏好資料集
並獲得增強的文件體驗
開始使用
使用 distilabel 生成偏好資料集
作者:David Berenstein 和 Sara Han Díaz
- 庫:argilla, hf-inference-endpoints
- 元件:LoadDataFromHub、TextGeneration、UltraFeedback、GroupColumns、FormatTextGenerationDPO、PreferenceToArgilla、InferenceEndpointsLLM
在本教程中,我們將使用 distilabel 為 DPO、ORPO 或 RLHF 生成一個合成偏好資料集。distilabel 是一個合成數據和 AI 反饋框架,專為需要基於已驗證研究論文的快速、可靠且可擴充套件流水線的工程師設計。請在此處檢視文件。
為了生成響應並對其進行評估,我們將使用與 distilabel 整合的無伺服器 HF 推理 API。這項服務是免費但有速率限制的,允許您透過簡單的 HTTP 請求測試和評估超過 15 萬個公共模型或您自己的私有模型,並在 Hugging Face 的共享基礎設施上進行快速推理。如果您需要更多計算能力,可以使用 Hugging Face 推理端點部署您自己的推理端點。
最後,為了進一步整理資料,我們將使用 Argilla,它使我們能夠就資料質量提供人工反饋。Argilla 是一個為 AI 工程師和領域專家設計的協作工具,他們需要為自己的專案構建高質量的資料集。請在此處檢視文件。
開始
安裝依賴
要完成本教程,您需要透過 pip 安裝 distilabel SDK 和一些第三方庫。我們將使用免費但有速率限制的 Hugging Face 無伺服器推理 API,因此需要將其作為 distilabel 的額外依賴項進行安裝。您可以透過執行以下命令來安裝它們:
!pip install "distilabel[hf-inference-endpoints]"
!pip install "transformers~=4.0" "torch~=2.0"
讓我們進行必要的匯入:
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
LoadDataFromHub,
GroupColumns,
FormatTextGenerationDPO,
PreferenceToArgilla,
)
from distilabel.steps.tasks import TextGeneration, UltraFeedback
您需要一個 HF_TOKEN
來使用 HF 推理端點。請登入以便在本筆記本中直接使用它。
import os
from huggingface_hub import login
login(token=os.getenv("HF_TOKEN"), add_to_git_credential=True)
(可選)部署 Argilla
您可以跳過此步驟,或用任何其他資料評估工具替代,但您的模型質量會因資料質量不佳而受損,因此我們確實建議您檢查您的資料。如果您已經部署了 Argilla,可以跳過此步驟。否則,您可以按照本指南快速部署 Argilla。
此外,您還需要將 Argilla 安裝為 distilabel 的額外依賴項。
!pip install "distilabel[argilla, hf-inference-endpoints]"
定義流水線
為了生成我們的偏好資料集,我們需要定義一個包含所有必要步驟的 Pipeline
。下面,我們將詳細介紹每個步驟。
載入資料集
我們將使用來自 Hugging Face Hub 的 argilla/10Kprompts-mini
資料集作為源資料。
- 元件:
LoadDataFromHub
- 輸入列:
instruction
和topic
,與載入的資料集中的列相同 - 輸出列:
instruction
和topic
load_dataset = LoadDataFromHub(
repo_id="argilla/10Kprompts-mini",
num_examples=1,
pipeline=Pipeline(name="showcase-pipeline"),
)
load_dataset.load()
next(load_dataset.process())
生成響應
我們需要為給定的指令生成響應。我們將使用兩個透過無伺服器推理 API 在 Hugging Face Hub 上可用的不同模型:meta-llama/Meta-Llama-3-8B-Instruct
和 mistralai/Mixtral-8x7B-Instruct-v0.1
。我們還將為每個模型指定生成引數。
- 元件:使用
InferenceEndpointsLLM
的TextGeneration
任務 - 輸入列:
instruction
- 輸出列:每個模型的
generation
、distilabel_metadata
、model_name
為了滿足您的用例並改善結果,您可以使用任何其他您選擇的 LLM。
>>> generate_responses = [
... TextGeneration(
... llm=InferenceEndpointsLLM(
... model_id="meta-llama/Meta-Llama-3-8B-Instruct",
... tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
... generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
... ),
... pipeline=Pipeline(name="showcase-pipeline"),
... ),
... TextGeneration(
... llm=InferenceEndpointsLLM(
... model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
... tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
... generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
... ),
... pipeline=Pipeline(name="showcase-pipeline"),
... ),
... ]
>>> for task in generate_responses:
... task.load()
... print(next(task.process([{"instruction": "Which are the top cities in Spain?"}])))
[{'instruction': 'Which are the top cities in Spain?', 'generation': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\n\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Família and Park Güell, designed by Antoni Gaudí.\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alcázar Palace, and lively flamenco music scene.\n5. **Málaga**: A coastal city in southern Spain, Málaga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\n9. **Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\n10. **San Sebastián**: A city in the Basque Country, San Sebastián is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebastián International Film Festival.\n\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.', 'distilabel_metadata': {'raw_output_text_generation_0': 'Spain is a country with a rich culture, history, and architecture, and it has many great cities to visit. Here are some of the top cities in Spain:\n\n1. **Madrid**: The capital city of Spain, known for its vibrant nightlife, museums, and historic landmarks like the Royal Palace and Prado Museum.\n2. **Barcelona**: The second-largest city in Spain, famous for its modernist architecture, beaches, and iconic landmarks like La Sagrada Família and Park Güell, designed by Antoni Gaudí.\n3. **Valencia**: Located on the Mediterranean coast, Valencia is known for its beautiful beaches, City of Arts and Sciences, and delicious local cuisine, such as paella.\n4. **Seville**: The capital of Andalusia, Seville is famous for its stunning cathedral, Royal Alcázar Palace, and lively flamenco music scene.\n5. **Málaga**: A coastal city in southern Spain, Málaga is known for its rich history, beautiful beaches, and being the birthplace of Pablo Picasso.\n6. **Zaragoza**: Located in the northeastern region of Aragon, Zaragoza is a city with a rich history, known for its Roman ruins, Gothic cathedral, and beautiful parks.\n7. **Granada**: A city in the Andalusian region, Granada is famous for its stunning Alhambra palace and generalife gardens, a UNESCO World Heritage Site.\n8. **Bilbao**: A city in the Basque Country, Bilbao is known for its modern architecture, including the Guggenheim Museum, and its rich cultural heritage.\n9. **Alicante**: A coastal city in the Valencia region, Alicante is famous for its beautiful beaches, historic castle, and lively nightlife.\n10. **San Sebastián**: A city in the Basque Country, San Sebastián is known for its stunning beaches, gastronomic scene, and cultural events like the San Sebastián International Film Festival.\n\nThese are just a few of the many great cities in Spain, each with its own unique character and attractions.'}, 'model_name': 'meta-llama/Meta-Llama-3-8B-Instruct'}] [{'instruction': 'Which are the top cities in Spain?', 'generation': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\n\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\n\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaudí, including the Sagrada Familia and Park Güell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\n\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\n\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alcázar, a beautiful palace made up of a series of rooms and courtyards.\n\n5. Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\n\n6. Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\n\n7. Málaga: A coastal city in Andalusia, Málaga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\n\nThese are just a few of the many wonderful cities in Spain.', 'distilabel_metadata': {'raw_output_text_generation_0': ' Here are some of the top cities in Spain based on various factors such as tourism, culture, history, and quality of life:\n\n1. Madrid: The capital and largest city in Spain, Madrid is known for its vibrant nightlife, world-class museums (such as the Prado Museum and Reina Sofia Museum), stunning parks (such as the Retiro Park), and delicious food.\n\n2. Barcelona: Famous for its unique architecture, Barcelona is home to several UNESCO World Heritage sites designed by Antoni Gaudí, including the Sagrada Familia and Park Güell. The city also boasts beautiful beaches, a lively arts scene, and delicious Catalan cuisine.\n\n3. Valencia: A coastal city located in the east of Spain, Valencia is known for its City of Arts and Sciences, a modern architectural complex that includes a planetarium, opera house, and museum of interactive science. The city is also famous for its paella, a traditional Spanish dish made with rice, vegetables, and seafood.\n\n4. Seville: The capital of Andalusia, Seville is famous for its flamenco dancing, stunning cathedral (the largest Gothic cathedral in the world), and the Alcázar, a beautiful palace made up of a series of rooms and courtyards.\n\n5. Granada: Located in the foothills of the Sierra Nevada mountains, Granada is known for its stunning Alhambra palace, a Moorish fortress that dates back to the 9th century. The city is also famous for its tapas, a traditional Spanish dish that is often served for free with drinks.\n\n6. Bilbao: A city in the Basque Country, Bilbao is famous for its modern architecture, including the Guggenheim Museum, a contemporary art museum designed by Frank Gehry. The city is also known for its pintxos, a type of Basque tapas that are served in bars and restaurants.\n\n7. Málaga: A coastal city in Andalusia, Málaga is known for its beautiful beaches, historic sites (including the Alcazaba and Gibralfaro castles), and the Picasso Museum, which is dedicated to the famous Spanish artist who was born in the city.\n\nThese are just a few of the many wonderful cities in Spain.'}, 'model_name': 'mistralai/Mixtral-8x7B-Instruct-v0.1'}]
分組響應
評估響應的任務需要一個生成列表作為輸入。然而,每個模型的響應都儲存在子集 text_generation_0
和 text_generation_1
的 generation 列中。我們將把這兩個列合併到一個單獨的列和 default
子集中。
- 元件:
GroupColumns
- 輸入列:來自
text_generation_0
和text_generation_1
的generation
和model_name
- 輸出列:
generations
和model_names
group_responses = GroupColumns(
columns=["generation", "model_name"],
output_columns=["generations", "model_names"],
pipeline=Pipeline(name="showcase-pipeline"),
)
next(
group_responses.process(
[
{
"generation": "Madrid",
"model_name": "meta-llama/Meta-Llama-3-8B-Instruct",
},
],
[
{
"generation": "Barcelona",
"model_name": "mistralai/Mixtral-8x7B-Instruct-v0.1",
}
],
)
)
評估響應
為了構建我們的偏好資料集,我們需要評估模型生成的響應。我們將使用 meta-llama/Meta-Llama-3-70B-Instruct
來完成此任務,應用 UltraFeedback
任務,該任務會根據不同維度(幫助性、誠實性、遵循指令、真實性)來評判響應。
- 元件:使用
InferenceEndpointsLLM
的UltraFeedback
任務 - 輸入列:
instruction
,generations
- 輸出列:
ratings
,rationales
,distilabel_metadata
,model_name
為了滿足您的用例並改善結果,您可以使用任何其他您選擇的 LLM。
evaluate_responses = UltraFeedback(
aspect="overall-rating",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
),
pipeline=Pipeline(name="showcase-pipeline"),
)
evaluate_responses.load()
next(
evaluate_responses.process(
[
{
"instruction": "What's the capital of Spain?",
"generations": ["Madrid", "Barcelona"],
}
]
)
)
轉換為偏好資料集
- 您可以自動將其轉換為包含
chosen
和rejected
列的偏好資料集。- 元件:
FormatTextGenerationDPO
步驟 - 輸入列:
instruction
、generations
、generation_models
、ratings
- 輸出列:
prompt
、prompt_id
、chosen
、chosen_model
、chosen_rating
、rejected
、rejected_model
、rejected_rating
- 元件:
format_dpo = FormatTextGenerationDPO(pipeline=Pipeline(name="showcase-pipeline"))
format_dpo.load()
next(
format_dpo.process(
[
{
"instruction": "What's the capital of Spain?",
"generations": ["Madrid", "Barcelona"],
"generation_models": [
"Meta-Llama-3-8B-Instruct",
"Mixtral-8x7B-Instruct-v0.1",
],
"ratings": [5, 1],
}
]
)
)
- 或者您可以使用 Argilla 手動標註資料,並將其轉換為偏好資料集。
- 元件:
PreferenceToArgilla
步驟 - 輸入列:
instruction
、generations
、generation_models
、ratings
- 輸出列:
instruction
、generations
、generation_models
、ratings
- 元件:
to_argilla = PreferenceToArgilla(
dataset_name="preference-dataset",
dataset_workspace="argilla",
api_url="https://[your-owner-name]-[your-space-name].hf.space",
api_key="[your-api-key]",
num_generations=2,
)
執行流水線
下面,您可以看到完整的流水線定義:
with Pipeline(name="generate-dataset") as pipeline:
load_dataset = LoadDataFromHub(repo_id="argilla/10Kprompts-mini")
generate_responses = [
TextGeneration(
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-8B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
)
),
TextGeneration(
llm=InferenceEndpointsLLM(
model_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
tokenizer_id="mistralai/Mixtral-8x7B-Instruct-v0.1",
generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
)
),
]
group_responses = GroupColumns(
columns=["generation", "model_name"],
output_columns=["generations", "model_names"],
)
evaluate_responses = UltraFeedback(
aspect="overall-rating",
llm=InferenceEndpointsLLM(
model_id="meta-llama/Meta-Llama-3-70B-Instruct",
tokenizer_id="meta-llama/Meta-Llama-3-70B-Instruct",
generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
),
)
format_dpo = FormatTextGenerationDPO()
to_argilla = PreferenceToArgilla(
dataset_name="preference-dataset",
dataset_workspace="argilla",
api_url="https://[your-owner-name]-[your-space-name].hf.space",
api_key="[your-api-key]",
num_generations=2,
)
for task in generate_responses:
load_dataset.connect(task)
task.connect(group_responses)
group_responses.connect(evaluate_responses)
evaluate_responses.connect(format_dpo, to_argilla)
現在讓我們執行流水線並生成偏好資料集。
distiset = pipeline.run()
讓我們檢查一下偏好資料集!如果您已將資料載入到 Argilla,您可以在 Argilla UI 中開始標註。
您可以將資料集推送到 Hub 以與社群共享,並嵌入它以瀏覽資料。
distiset.push_to_hub("[your-owner-name]/example-preference-dataset")
結論
在本教程中,我們展示了使用 distilabel 構建生成偏好資料集流水線的詳細步驟。您可以為自己的用例定製此流水線,並透過 Hugging Face Hub 與社群共享您的資料集,或使用它們來訓練 DPO 或 ORPO 模型。
我們使用一個包含提示的資料集,透過無伺服器 Hugging Face 推理 API 使用兩個不同的模型生成響應。接下來,我們使用第三個模型,遵循 UltraFeedback 標準來評估這些響應。最後,我們將資料轉換為偏好資料集,並使用 Argilla 進行進一步的整理。
< > 在 GitHub 上更新