Data annotation with Argilla Spaces

Authored by: Moritz Laurer

This notebook demonstrates a workflow for systematically evaluating LLM outputs and creating LLM training data. You can start by using this notebook to evaluate the zero-shot performance of your favorite LLM on your task without any fine-tuning. If you want to improve performance, you can then easily reuse this workflow to create training data.

Example use case: code generation. In this tutorial, we demonstrate how to create high-quality test and training data for code generation tasks. The same workflow can, however, be adapted to any other task that is relevant to your specific use case.

In this notebook, we:

  1. Download data for the example task.
  2. Prompt two LLMs to respond to these tasks. This produces "synthetic data" to speed up manual data creation.
  3. Create an Argilla annotation interface on HF Spaces to compare and evaluate the outputs of the two LLMs.
  4. Upload the example data and the zero-shot LLM responses to the Argilla annotation interface.
  5. Download the annotated data.

You can adapt this notebook to your needs, e.g. by using a different LLM and API provider in step (2) or by adjusting the annotation task in step (3).

Install required packages and connect to the HF Hub

!pip install argilla~=2.0.0
!pip install transformers~=4.40.0
!pip install datasets~=2.19.0
!pip install huggingface_hub~=0.23.2
# Login to the HF Hub. We recommend using this login method 
# to avoid the need to explicitly store your HF token in variables 
import huggingface_hub
!git config --global credential.helper store
huggingface_hub.login(add_to_git_credential=True)

Download example task data

First, we download an example dataset that contains code generation tasks for LLMs. We want to evaluate how well two different LLMs perform on these code generation tasks. We use instructions from the bigcode/self-oss-instruct-sc2-exec-filter-50k dataset, which was used to train the StarCoder2-Instruct model.

>>> from datasets import load_dataset

>>> # Small sample for faster testing
>>> dataset_codetask = load_dataset("bigcode/self-oss-instruct-sc2-exec-filter-50k", split="train[:3]")
>>> print("Dataset structure:\n", dataset_codetask, "\n")

>>> # We are only interested in the instructions/prompts provided in the dataset
>>> instructions_lst = dataset_codetask["instruction"]
>>> print("Example instructions:\n", instructions_lst[:2])
Dataset structure:
 Dataset({
    features: ['fingerprint', 'sha1', 'seed', 'response', 'concepts', 'prompt', 'instruction', 'id'],
    num_rows: 3
}) 

Example instructions:
 ['Write a Python function named `get_value` that takes a matrix (represented by a list of lists) and a tuple of indices, and returns the value at that index in the matrix. The function should handle index out of range errors by returning None.', 'Write a Python function `check_collision` that takes a list of `rectangles` as input and checks if there are any collisions between any two rectangles. A rectangle is represented as a tuple (x, y, w, h) where (x, y) is the top-left corner of the rectangle, `w` is the width, and `h` is the height.\n\nThe function should return True if any pair of rectangles collide, and False otherwise. Use an iterative approach and check for collisions based on the bounding box collision detection algorithm. If a collision is found, return True immediately without checking for more collisions.']

Prompt two LLMs on the example tasks

Formatting the instructions with a chat_template

Before we can send the instructions to an LLM API, we need to format them with the correct `chat_template` for each of the models we want to evaluate. This essentially entails wrapping some special tokens around the instructions. See the documentation for details on chat templates.

>>> # Apply correct chat formatting to instructions from the dataset
>>> from transformers import AutoTokenizer

>>> models_to_compare = ["mistralai/Mixtral-8x7B-Instruct-v0.1", "meta-llama/Meta-Llama-3-70B-Instruct"]


>>> def format_prompt(prompt, tokenizer):
...     messages = [{"role": "user", "content": prompt}]
...     messages_tokenized = tokenizer.apply_chat_template(
...         messages, tokenize=False, add_generation_prompt=True, return_tensors="pt"
...     )
...     return messages_tokenized


>>> prompts_formatted_dic = {}
>>> for model in models_to_compare:
...     tokenizer = AutoTokenizer.from_pretrained(model)

...     prompt_formatted = []
...     for instruction in instructions_lst:
...         prompt_formatted.append(format_prompt(instruction, tokenizer))

...     prompts_formatted_dic.update({model: prompt_formatted})


>>> print(
...     f"\nFirst prompt formatted for {models_to_compare[0]}:\n\n",
...     prompts_formatted_dic[models_to_compare[0]][0],
...     "\n\n",
... )
>>> print(
...     f"First prompt formatted for {models_to_compare[1]}:\n\n",
...     prompts_formatted_dic[models_to_compare[1]][0],
...     "\n\n",
... )
First prompt formatted for mistralai/Mixtral-8x7B-Instruct-v0.1:

 [INST] Write a Python function named `get_value` that takes a matrix (represented by a list of lists) and a tuple of indices, and returns the value at that index in the matrix. The function should handle index out of range errors by returning None. [/INST] 


First prompt formatted for meta-llama/Meta-Llama-3-70B-Instruct:

 <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Write a Python function named `get_value` that takes a matrix (represented by a list of lists) and a tuple of indices, and returns the value at that index in the matrix. The function should handle index out of range errors by returning None.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Send the instructions to the HF Inference API

Now we can send the instructions to the APIs of both LLMs to obtain outputs that we can evaluate. We first define some parameters for generating the responses correctly. Hugging Face's LLM APIs are powered by Text Generation Inference (TGI) containers. See the TGI OpenAPI specification here, and the documentation for explanations of the different Transformers generation parameters.

generation_params = dict(
    # we use low temperature and top_p to reduce creativity and increase likelihood of highly probable tokens
    temperature=0.2,
    top_p=0.60,
    top_k=None,
    repetition_penalty=1.0,
    do_sample=True,
    max_new_tokens=512 * 2,
    return_full_text=False,
    seed=42,
    # details=True,
    # stop=["<|END_OF_TURN_TOKEN|>"],
    # grammar={"type": "json"}
    max_time=None,
    stream=False,
    use_cache=False,
    wait_for_model=False,
)

Now we can make standard API requests to the Serverless Inference API (documentation). Note that the Serverless Inference API is primarily intended for testing and is rate-limited. For testing without rate limits, you can create your own API via HF Dedicated Endpoints (documentation). See also the corresponding recipes in our Open-Source AI Cookbook.

The code below will be updated once the Inference API recipe is complete.

>>> import requests
>>> from tqdm.auto import tqdm


>>> # Hint: use asynchronous API calls (and dedicated endpoints) to increase speed
>>> def query(payload=None, api_url=None):
...     response = requests.post(api_url, headers=headers, json=payload)
...     return response.json()


>>> headers = {"Authorization": f"Bearer {huggingface_hub.get_token()}"}

>>> output_dic = {}
>>> for model in models_to_compare:
...     # Create API urls for each model
...     # When using dedicated endpoints, you can reuse the same code and simply replace this URL
...     api_url = "https://api-inference.huggingface.co/models/" + model

...     # send requests to API
...     output_lst = []
...     for prompt in tqdm(prompts_formatted_dic[model]):
...         output = query(payload={"inputs": prompt, "parameters": {**generation_params}}, api_url=api_url)
...         output_lst.append(output[0]["generated_text"])

...     output_dic.update({model: output_lst})

>>> print(f"---First generation of {models_to_compare[0]}:\n{output_dic[models_to_compare[0]][0]}\n\n")
>>> print(f"---First generation of {models_to_compare[1]}:\n{output_dic[models_to_compare[1]][0]}")
---First generation of mistralai/Mixtral-8x7B-Instruct-v0.1:
Here's a Python function that meets your requirements:

```python
def get_value(matrix, indices):
    try:
        return matrix[indices[0]][indices[1]]
    except IndexError:
        return None
```

This function takes a matrix (represented by a list of lists) and a tuple of indices as input. It first tries to access the value at the given indices in the matrix. If the indices are out of range, it catches the `IndexError` exception and returns `None`.


---First generation of meta-llama/Meta-Llama-3-70B-Instruct:
Here is a Python function that does what you described:
```
def get_value(matrix, indices):
    try:
        row, col = indices
        return matrix[row][col]
    except IndexError:
        return None
```
Here's an explanation of how the function works:

1. The function takes two arguments: `matrix` (a list of lists) and `indices` (a tuple of two integers, representing the row and column indices).
2. The function tries to access the value at the specified indices using `matrix[row][col]`.
3. If the indices are out of range (i.e., `row` or `col` is greater than the length of the corresponding dimension of the matrix), an `IndexError` exception is raised.
4. The `except` block catches the `IndexError` exception and returns `None` instead of raising an error.

Here's an example usage of the function:
```
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

print(get_value(matrix, (0, 0)))  # prints 1
print(get_value(matrix, (1, 1)))  # prints 5
print(get_value(matrix, (3, 0)))  # prints None (out of range)
print(get_value(matrix, (0, 3)))  # prints None (out of range)
```
I hope this helps! Let me know if you have any questions.
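
The loop above sends the prompts sequentially, and the code comment already hints that asynchronous calls can speed this up considerably. Below is a minimal, hedged sketch (not a drop-in replacement for the notebook's code) of how the same generations could be requested concurrently with huggingface_hub's AsyncInferenceClient; the main parameters mirror generation_params, and error handling and rate-limit backoff are omitted.

# Minimal sketch: concurrent generation via AsyncInferenceClient (assumes the same
# generation parameters as above; add retries/backoff for the rate-limited serverless API)
import asyncio

import huggingface_hub
from huggingface_hub import AsyncInferenceClient


async def generate_all(model: str, prompts: list[str]) -> list[str]:
    client = AsyncInferenceClient(model=model, token=huggingface_hub.get_token())
    # Launch one text-generation call per prompt and collect the results together
    tasks = [
        client.text_generation(
            prompt,
            max_new_tokens=512 * 2,
            temperature=0.2,
            top_p=0.60,
            do_sample=True,
            seed=42,
            return_full_text=False,
        )
        for prompt in prompts
    ]
    return await asyncio.gather(*tasks)


# Example usage inside a notebook (which already runs an event loop):
# outputs = await generate_all(models_to_compare[0], prompts_formatted_dic[models_to_compare[0]])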

Store the LLM outputs in a dataset

We can now store the LLM outputs in a dataset together with the original instructions.

# create a HF dataset with the instructions and model outputs
from datasets import Dataset

dataset = Dataset.from_dict(
    {
        "instructions": instructions_lst,
        "response_model_1": output_dic[models_to_compare[0]],
        "response_model_2": output_dic[models_to_compare[1]],
    }
)

dataset

Create and configure your Argilla dataset

We use Argilla, a collaboration tool for AI engineers and domain experts who need to build high-quality datasets for their projects.

We run Argilla via an HF Space, which you can set up in just a few clicks without any local setup. You can create an HF Argilla Space by following these instructions. For further configuration of HF Argilla Spaces, also see the detailed documentation. If you prefer, you can also run Argilla locally via Argilla's Docker containers (see the Argilla docs).

Argilla login screen

Interacting with Argilla programmatically

Before we can tailor the dataset to our specific task and upload the data that will be shown in the UI, we first need to set up a few things.

Connect this notebook to Argilla: We can now connect this notebook to Argilla to programmatically configure your dataset and upload/download data.

# After starting the Argilla Space (or local docker container) you can connect to the Space with the code below.
import argilla as rg

client = rg.Argilla(
    api_url="https://username-spacename.hf.space",  # Locally: "https://:6900"
    api_key="your-apikey",  # You'll find it in the UI "My Settings > API key"
    # To use a private HF Argilla Space, also pass your HF token
    headers={"Authorization": f"Bearer {huggingface_hub.get_token()}"},
)
user = client.me
user

Write good annotator guidelines

Writing good guidelines for your human annotators is just as important (and difficult) as writing good training code. Good instructions should fulfill the following criteria:

  • Simple and clear: The guidelines should be simple and clear enough to be understood by people who know nothing about your task. Always ask at least one colleague to re-read the guidelines to make sure there are no ambiguities.
  • Reproducible and explicit: All the information required to complete the annotation task should be contained in the guidelines. A common mistake is to create informal interpretations of the guidelines during conversations with selected annotators. Future annotators will not have this information and are likely to do the task differently than intended if it is not made explicit in the guidelines.
  • Short and comprehensive: The guidelines should be as short as possible while containing all necessary information. Annotators tend not to read long guidelines properly, so try to keep them as short as possible while staying comprehensive.

Note that creating annotator guidelines is an iterative process. It is good practice to do a few dozen annotations yourself and refine the guidelines based on your learnings from the data before assigning the task to others. Versioning the guidelines can also help as the task evolves over time. See more tips in this blog post.

annotator_guidelines = """\
Your task is to evaluate the responses of two LLMs to code generation tasks. 

First, you need to score each response on a scale from 0 to 7. You add points to your final score based on the following criteria:
- Add up to +2 points, if the code is properly commented, with inline comments and doc strings for functions.
- Add up to +2 points, if the code contains a good example for testing. 
- Add up to +3 points, if the code runs and works correctly. Copy the code into an IDE and test it with at least two different inputs. Attribute one point if the code is overall correct, but has some issues. Attribute three points if the code is fully correct and robust against different scenarios. 
Your resulting final score can be any value between 0 to 7. 

If both responses have a final score of <= 4, select one response and correct it manually in the text field. 
The corrected response must fulfill all criteria from above. 
"""

rating_tooltip = """\
- Add up to +2 points, if the code is properly commented, with inline comments and doc strings for functions.
- Add up to +2 points, if the code contains a good example for testing. 
- Add up to +3 points, if the code runs and works correctly. Copy the code into an IDE and test it with at least two different inputs. Attribute one point if the code works mostly correctly, but has some issues. Attribute three points if the code is fully correct and robust against different scenarios. 
"""

Cumulative ratings vs. Likert scales: Note that the guidelines above ask annotators to do cumulative ratings by adding points for explicit criteria. An alternative approach is a "Likert scale", where annotators are asked to rate responses on a continuous scale, e.g. from 1 (very bad) over 3 (mediocre) to 5 (very good). We generally recommend cumulative ratings, because they force you and the annotators to make quality criteria explicit, whereas simply rating a response as "4" (good) is ambiguous and will be interpreted differently by different annotators.

Tailor your Argilla dataset to your specific task

Now we can create our own `code-llm` task with the fields, questions, and metadata required for annotation. For more information on configuring Argilla datasets, see the Argilla documentation.

dataset_argilla_name = "code-llm"
workspace_name = "argilla"
reuse_existing_dataset = False  # for easier iterative testing

# Configure your dataset settings
settings = rg.Settings(
    # The overall annotation guidelines, which human annotators can refer back to inside of the interface
    guidelines="my guidelines",
    fields=[
        rg.TextField(name="instruction", title="Instruction:", use_markdown=True, required=True),
        rg.TextField(
            name="generation_1",
            title="Response model 1:",
            use_markdown=True,
            required=True,
        ),
        rg.TextField(
            name="generation_2",
            title="Response model 2:",
            use_markdown=True,
            required=True,
        ),
    ],
    # These are the questions we ask annotators about the fields in the dataset
    questions=[
        rg.RatingQuestion(
            name="score_response_1",
            title="Your score for the response of model 1:",
            description="0=very bad, 7=very good",
            values=[0, 1, 2, 3, 4, 5, 6, 7],
            required=True,
        ),
        rg.RatingQuestion(
            name="score_response_2",
            title="Your score for the response of model 2:",
            description="0=very bad, 7=very good",
            values=[0, 1, 2, 3, 4, 5, 6, 7],
            required=True,
        ),
        rg.LabelQuestion(
            name="which_response_corrected",
            title="If both responses score below 4, select a response to correct:",
            description="Select the response you will correct in the text field below.",
            labels=["Response 1", "Response 2", "Combination of both", "Neither"],
            required=False,
        ),
        rg.TextQuestion(
            name="correction",
            title="Paste the selected response below and correct it manually:",
            description="Your corrected response must fulfill all criteria from the annotation guidelines.",
            use_markdown=True,
            required=False,
        ),
        rg.TextQuestion(
            name="comments",
            title="Annotator Comments",
            description="Add any additional comments here. E.g.: edge cases, issues with the interface etc.",
            use_markdown=True,
            required=False,
        ),
    ],
    metadata=[
        rg.TermsMetadataProperty(
            name="source-dataset",
            title="Original dataset source",
        ),
    ],
    allow_extra_metadata=False,
)

if reuse_existing_dataset:
    dataset_argilla = client.datasets(dataset_argilla_name, workspace=workspace_name)
else:
    dataset_argilla = rg.Dataset(
        name=dataset_argilla_name,
        settings=settings,
        workspace=workspace_name,
    )
    if client.datasets(dataset_argilla_name, workspace=workspace_name) is not None:
        client.datasets(dataset_argilla_name, workspace=workspace_name).delete()
    dataset_argilla = dataset_argilla.create()

dataset_argilla

After running the code above, you will see the new custom `code-llm` dataset in Argilla (along with any other datasets you may have created before).

Load the data into Argilla

At this point, the dataset is still empty. Let's load some data with the code below.

# Iterate over the samples in the dataset
records = [
    rg.Record(
        fields={
            "instruction": example["instructions"],
            "generation_1": example["response_model_1"],
            "generation_2": example["response_model_2"],
        },
        metadata={
            "source-dataset": "bigcode/self-oss-instruct-sc2-exec-filter-50k",
        },
        # Optional: add suggestions from an LLM-as-a-judge system
        # They will be indicated with a sparkle icon and shown as pre-filled responses
        # It will speed up manual annotation
        # suggestions=[
        #     rg.Suggestion(
        #         question_name="score_response_1",
        #         value=example["llm_judge_rating"],
        #         agent="llama-3-70b-instruct",
        #     ),
        # ],
    )
    for example in dataset
]

try:
    dataset_argilla.records.log(records)
except Exception as e:
    print("Exception:", e)

The Argilla annotation UI will look similar to this:

Argilla UI

Annotate

That's it, we have created our Argilla dataset and we can now start annotating in the UI! By default, records are marked complete once they have one annotation. Check out these guides on how to automatically distribute the annotation task and on annotating in Argilla.

Important: If you use Argilla in an HF Space, you need to enable persistent storage so that your data is stored safely and is not automatically deleted after a while. For production settings, make sure persistent storage is enabled before making any annotations to avoid data loss.

Download annotated data

After annotating, you can pull the data from Argilla and simply store and process it in any tabular format (see the documentation here). You can also download filtered versions of the dataset (documentation).

annotated_dataset = client.datasets(dataset_argilla_name, workspace=workspace_name)

hf_dataset = annotated_dataset.records.to_datasets()

# This HF dataset can then be formatted, stored and processed into any tabular data format
hf_dataset.to_pandas()
# Store the dataset locally
hf_dataset.to_csv("argilla-dataset-local.csv")  # Save as CSV
# hf_dataset.to_json("argilla-dataset-local.json")  # Save as JSON
# hf_dataset.save_to_disk("argilla-dataset-local")  # Save as a `datasets.Dataset` in the local filesystem
# hf_dataset.to_parquet()  # Save as Parquet

Next steps

That's it! You have created synthetic LLM data with the HF Inference API, set up a dataset in Argilla, uploaded the LLM data to Argilla, evaluated/corrected the data, and downloaded the annotated data in a simple tabular format for downstream use.

We have designed the pipeline and the interfaces specifically for two main use cases:

  1. Evaluation: You can now simply use the numeric scores in the `score_response_1` and `score_response_2` columns to calculate which model was better overall. You can also inspect responses with very low or very high ratings for a detailed error analysis. As you test or train different models, you can reuse this pipeline and track improvements across models (a minimal comparison sketch follows after this list).
  2. Training: Once you have annotated enough data, you can create a train/test split from the data and fine-tune your own model. You can either use highly rated response texts for supervised fine-tuning with the TRL SFTTrainer, or you can directly use the ratings for preference fine-tuning techniques like DPO with the TRL DPOTrainer. See the TRL documentation for the pros and cons of the different LLM fine-tuning techniques.
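
For the evaluation use case, the sketch below illustrates how such a comparison could look. It assumes the annotated data has been exported as above (`hf_dataset`) and that the ratings are available as plain integer columns named after the questions (`score_response_1`, `score_response_2`) alongside the generation fields (`generation_1`, `generation_2`); depending on your Argilla version and export settings, responses may instead be nested per annotator, in which case you would aggregate them first.

# Minimal sketch (column names are assumptions -- adapt them to your actual export schema)
import numpy as np

df = hf_dataset.to_pandas()

# 1. Evaluation: average score per model and how often model 1 outscores model 2
print(df[["score_response_1", "score_response_2"]].mean())
print("Model 1 win rate:", (df["score_response_1"] > df["score_response_2"]).mean())

# 2. Training: keep the better-rated response per task and filter for high quality (score >= 5)
model_1_better = df["score_response_1"] >= df["score_response_2"]
df["chosen_response"] = np.where(model_1_better, df["generation_1"], df["generation_2"])
df["chosen_score"] = np.where(model_1_better, df["score_response_1"], df["score_response_2"])
sft_df = df.loc[df["chosen_score"] >= 5, ["instruction", "chosen_response"]]
# sft_df could now be split into train/test sets and passed to the TRL SFTTrainer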

Adapt and improve: Many things can be improved to tailor this pipeline to your specific use case. For example, you can prompt an LLM to evaluate the outputs of the two LLMs with instructions very similar to the guidelines for human annotators (the "LLM-as-a-judge" approach). This can help further speed up your evaluation pipeline. See our LLM-as-a-judge recipe for an example implementation, and our overall Open-Source AI Cookbook for many other ideas.
