Distilabel

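Distilabel is a framework for synthetic data and AI feedback, aimed at engineers who need fast, reliable, and scalable pipelines based on verified research papers.

Distilabel can be used to generate synthetic data and AI feedback for a wide variety of projects, including traditional predictive NLP (classification, extraction, etc.) and generative and large language model scenarios (instruction following, dialogue generation, judging, etc.). Distilabel's programmatic approach lets you build scalable pipelines for LLM data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies, and by using AI feedback to judge them.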

What do people build with distilabel?

The Argilla community has used distilabel to create a range of datasets and models.

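1M OpenHermesPreference is a dataset of about 1 million AI preferences derived from the teknium/OpenHermes-2.5 dataset. It shows how distilabel can be used to scale and accelerate dataset development.
The distilabeled Intel Orca DPO dataset was used to fine-tune the improved OpenHermes model. It was built by combining human curation in Argilla with AI feedback from distilabel, producing an improved version of the Intel Orca dataset. Models fine-tuned on it outperform models fine-tuned on the original dataset.
The haiku DPO data shows how anyone can create a synthetic dataset for a specific task which, after curation and evaluation, can be used to fine-tune a custom LLM.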

Prerequisites

First, log in with your Hugging Face account:

hf auth login

Make sure you have `distilabel` installed:

pip install -U "distilabel[vllm]"

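Distilabel pipelines

A distilabel pipeline can be built from any number of interconnected steps or tasks. The output of one step or task is fed as input to another. A series of steps can be chained together to build complex data processing and generation pipelines with LLMs. The input of each step is a batch of data: a list of dictionaries, where each dictionary represents a row of the dataset and the keys are the column names. To feed data from and to the Hugging Face Hub, distilabel defines a `Distiset` class as an abstraction of a `datasets.DatasetDict`.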
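To make the batch format concrete, here is a minimal plain-Python sketch. It does not use distilabel's own API: the functions `add_length`, `keep_columns`, and `run_chain` are hypothetical stand-ins that only illustrate how a batch (a list of row dictionaries) flows from one step to the next.

```python
from typing import Any, Callable, Dict, List

# A batch is a list of rows; each row maps column names to values.
Batch = List[Dict[str, Any]]


def add_length(batch: Batch) -> Batch:
    """Hypothetical step: add a `length` column with the instruction's character count."""
    return [{**row, "length": len(row["instruction"])} for row in batch]


def keep_columns(batch: Batch, columns: List[str]) -> Batch:
    """Hypothetical step: drop every column not listed in `columns`."""
    return [{name: row[name] for name in columns} for row in batch]


def run_chain(batch: Batch, steps: List[Callable[[Batch], Batch]]) -> Batch:
    """Feed the output of each step into the next one, in order."""
    for step in steps:
        batch = step(batch)
    return batch


batch = [
    {"instruction": "Write a haiku", "source": "demo"},
    {"instruction": "Hi", "source": "demo"},
]
result = run_chain(
    batch,
    [add_length, lambda rows: keep_columns(rows, ["instruction", "length"])],
)
print(result)
```

In a real pipeline the steps are distilabel `Step` or `Task` objects, but the data they exchange has the same list-of-dictionaries shape.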

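Distiset as dataset object

A pipeline in distilabel returns a special type of Hugging Face `datasets.DatasetDict` called a `Distiset`.

A pipeline can output multiple subsets in the Distiset, which is a dictionary-like object with one entry per subset. The Distiset can then be pushed seamlessly to the Hugging Face Hub, with all the subsets in the same repository.

Load data from the Hub to a Distiset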

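To show an example of loading data from the Hub, we will reproduce the Prometheus 2 paper using the PrometheusEval task implemented in distilabel. Prometheus 2 and the PrometheusEval task cover two kinds of evaluation: direct assessment, which rates the quality of a single isolated response to a given instruction (with or without a reference answer), and pairwise ranking, which rates one response against another for a given instruction (with or without a reference answer). We will run the task on a dataset loaded from the Hub, HuggingFaceH4/instruction-dataset, which was created by the Hugging Face H4 team.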
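The difference between the two evaluation modes shows up in the shape of the input rows. The sketch below is plain Python, not distilabel code. The helpers `required_columns` and `missing_columns` are hypothetical, and the `"relative"` mode name and the `generations` and `reference` column names are assumptions to check against the PrometheusEval documentation; only `"absolute"`, `instruction`, and `generation` appear in the example on this page.

```python
from typing import Any, Dict, List


def required_columns(mode: str, reference: bool = False) -> List[str]:
    """Hypothetical helper: input columns assumed for each evaluation mode."""
    if mode == "absolute":
        # Direct assessment: one response per instruction.
        columns = ["instruction", "generation"]
    elif mode == "relative":
        # Pairwise ranking: two responses per instruction (assumed column name).
        columns = ["instruction", "generations"]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    if reference:
        # Optional reference answer (assumed column name).
        columns.append("reference")
    return columns


def missing_columns(row: Dict[str, Any], mode: str, reference: bool = False) -> List[str]:
    """Return the assumed required columns that are absent from `row`."""
    return [name for name in required_columns(mode, reference) if name not in row]


absolute_row = {"instruction": "What is 2 + 2?", "generation": "4"}
relative_row = {"instruction": "What is 2 + 2?", "generations": ["4", "5"]}

print(missing_columns(absolute_row, "absolute"))
print(missing_columns(relative_row, "relative"))
print(missing_columns(absolute_row, "absolute", reference=True))
```

A check like this can be run on a few rows before launching a pipeline, so that a missing or misnamed column is caught before any model is loaded.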

from distilabel.llms import vLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import KeepColumns, LoadDataFromHub
from distilabel.steps.tasks import PrometheusEval

if __name__ == "__main__":
    with Pipeline(name="prometheus") as pipeline:
        # Load the test split and rename its columns to the names PrometheusEval expects.
        load_dataset = LoadDataFromHub(
            name="load_dataset",
            repo_id="HuggingFaceH4/instruction-dataset",
            split="test",
            output_mappings={"prompt": "instruction", "completion": "generation"},
        )

        # Rate each generation on its own (direct assessment) with the Prometheus 2 model.
        task = PrometheusEval(
            name="task",
            llm=vLLM(
                model="prometheus-eval/prometheus-7b-v2.0",
                chat_template="[INST] {{ messages[0]['content'] }}\n{{ messages[1]['content'] }}[/INST]",
            ),
            mode="absolute",
            rubric="factual-validity",
            reference=False,
            num_generations=1,
            group_generations=False,
        )

        # Keep only the inputs and the evaluation outputs.
        keep_columns = KeepColumns(
            name="keep_columns",
            columns=["instruction", "generation", "feedback", "result", "model_name"],
        )

        # Connect the steps: each step's output feeds the next one.
        load_dataset >> task >> keep_columns
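The `output_mappings` argument above renames the dataset's `prompt` and `completion` columns so that the next step finds `instruction` and `generation`. The following plain-Python sketch illustrates only the renaming effect. It is not distilabel's implementation, `apply_output_mappings` is a hypothetical helper, and the `meta` column is included just to show that unmapped columns pass through unchanged.

```python
from typing import Any, Dict, List


def apply_output_mappings(
    rows: List[Dict[str, Any]], mappings: Dict[str, str]
) -> List[Dict[str, Any]]:
    """Hypothetical helper: rename columns listed in `mappings`, keep the rest as-is."""
    return [
        {mappings.get(name, name): value for name, value in row.items()}
        for row in rows
    ]


rows = [{"prompt": "Name a prime number.", "completion": "7", "meta": "demo"}]
mappings = {"prompt": "instruction", "completion": "generation"}

renamed = apply_output_mappings(rows, mappings)
print(renamed)
```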

Then we call `pipeline.run` with the runtime parameters. This launches the pipeline and stores the resulting data in a `Distiset` object.

distiset = pipeline.run(
    parameters={
        task.name: {
            "llm": {
                "generation_kwargs": {
                    "max_new_tokens": 1024,
                    "temperature": 0.7,
                },
            },
        },
    },
)
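The runtime parameters are a nested dictionary keyed first by step name and then by argument. When several tasks share the same generation settings, the dictionary can be built programmatically. This is a minimal sketch: `build_runtime_parameters` is a hypothetical helper, not part of distilabel, and it assumes every listed task accepts `llm.generation_kwargs` as in the example above.

```python
from typing import Any, Dict, List


def build_runtime_parameters(
    task_names: List[str], generation_kwargs: Dict[str, Any]
) -> Dict[str, Any]:
    """Hypothetical helper: give every named task the same generation settings."""
    return {
        # Copy the kwargs so that tasks do not share one mutable dictionary.
        name: {"llm": {"generation_kwargs": dict(generation_kwargs)}}
        for name in task_names
    }


parameters = build_runtime_parameters(
    ["task"],
    {"max_new_tokens": 1024, "temperature": 0.7},
)
print(parameters)
```

The result has the same structure as the literal dictionary shown above and could be passed as `pipeline.run(parameters=parameters)`.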

Push a distilabel Distiset to the Hub

Push the `Distiset` to a Hugging Face repository, where each subset corresponds to a different configuration.

import os

distiset.push_to_hub(
    "my-org/my-dataset",
    commit_message="Initial commit",
    private=False,
    token=os.getenv("HF_TOKEN"),  # read the access token from the environment
)

📚 Resources
