Using LLM-as-a-judge 🧑‍⚖️ for an automated and versatile evaluation

Authored by: Aymeric Roucher

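Evaluating Large Language Models (LLMs) is often a difficult task: given their broad capabilities, the tasks given to them often have to be judged against requirements that are very broad and loosely defined. For example, an assistant's answer to a question can be:

not grounded in context
repetitive, repetitive, repetitive
grammatically incorrect
excessively lengthy and characterized by an overabundance of words, leading to a situation where the discourse or written content becomes overly detailed and protracted
incoherent
...

The list of criteria goes on and on. And even if we had a limited list, each criterion would be hard to measure: designing a rule-based program to assess the outputs is extremely challenging. Traditional evaluation metrics based on the similarity between outputs and reference answers (e.g., ROUGE, BLEU) are also ineffective for these questions.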
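
To make the weakness of similarity metrics concrete, here is a minimal sketch. It assumes a simplified unigram-overlap F1 as a stand-in for ROUGE-style scoring (it is not the official ROUGE or BLEU implementation), and the three sentences are made up for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified unigram-overlap F1 (a ROUGE-1-like toy, not the official metric)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection: number of word occurrences shared by both texts
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "You can renew your passport online"
paraphrase = "Passport renewal is available on the web"  # correct, different words
wrong = "You can not renew your passport online"  # wrong, almost the same words

score_paraphrase = unigram_f1(paraphrase, reference)
score_wrong = unigram_f1(wrong, reference)
print(f"correct paraphrase: {score_paraphrase:.3f}")
print(f"wrong answer:       {score_wrong:.3f}")
```

The wrong answer scores far higher than the correct paraphrase, because word overlap measures surface similarity rather than whether the question was answered.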

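✅ A powerful solution to assess outputs in a human way, without requiring costly human time, is LLM-as-a-judge. This method was introduced in Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena, which I encourage you to read.

💡 The idea is simple: ask an LLM to do the grading for you. 🤖✓

But we'll see that it does not work well out of the box: you need to set it up carefully to get good results.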

!pip install huggingface_hub datasets pandas tqdm -q

import re
import pandas as pd
from tqdm.auto import tqdm
from datasets import load_dataset
from huggingface_hub import InferenceClient, notebook_login

tqdm.pandas()  # load tqdm's pandas support
pd.set_option("display.max_colwidth", None)

notebook_login()

repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

llm_client = InferenceClient(
    model=repo_id,
    timeout=120,
)

# Test your LLM client
llm_client.text_generation(prompt="How are you today?", max_new_tokens=20)

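1. Prepare the creation and evaluation of our LLM judge

Let's say you want to give an LLM a specific task, like answering open-ended questions.

The difficulty is that, as we discussed above, measuring the answer's quality is hard: for instance, an exact string match will flag too many correct but differently worded answers as wrong.

You could get human labellers to judge the outputs, but this is very time-consuming, and if you want to update the model or the questions, you have to do it all over again.

✅ In this case you can set up an LLM-as-a-judge.

But to use an LLM-as-a-judge, you first need to evaluate how reliably it rates your model outputs.

➡️ So the first step will be... to create a human evaluation dataset. But you only need human annotations for a few examples: around 30 should be enough to get a good idea of the performance. And you can reuse this dataset every time you want to test your LLM-as-a-judge.

In our case, we will use feedbackQA, which contains 2 human evaluations and scores for each question/answer couple: a sample of 30 examples is representative of what your small evaluation dataset could look like.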

ratings = load_dataset("McGill-NLP/feedbackQA")["train"]
ratings = pd.DataFrame(ratings)

ratings["review_1"] = ratings["feedback"].apply(lambda x: x["rating"][0])
ratings["explanation_1"] = ratings["feedback"].apply(lambda x: x["explanation"][0])
ratings["review_2"] = ratings["feedback"].apply(lambda x: x["rating"][1])
ratings["explanation_2"] = ratings["feedback"].apply(lambda x: x["explanation"][1])
ratings = ratings.drop(columns=["feedback"])

# Map scores to numeric values
conversion_dict = {"Excellent": 4, "Acceptable": 3, "Could be Improved": 2, "Bad": 1}
ratings["score_1"] = ratings["review_1"].map(conversion_dict)
ratings["score_2"] = ratings["review_2"].map(conversion_dict)

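Computing a baseline for performance is always a good idea: here it can be, for instance, the agreement between the two human raters, as measured by the Pearson correlation of the scores they give.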

>>> print("Correlation between 2 human raters:")
>>> print(f"{ratings['score_1'].corr(ratings['score_2'], method='pearson'):.3f}")

Correlation between 2 human raters:
0.563

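This correlation between the 2 human raters is not that good. If your human raters agree this poorly, it probably means the rating criteria are not clear enough.

This means that our "ground truth" contains noise: hence we cannot expect any algorithmic evaluation to come very close to it.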

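However, we could reduce this noise:

by taking the average score as our ground truth instead of any single score, we should even out some of the irregularities.
by only selecting the samples where the human reviewers are in agreement.

Here, we will choose the last option and only keep examples where the 2 human reviewers are in agreement.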

# Sample examples
ratings_where_raters_agree = ratings.loc[ratings["score_1"] == ratings["score_2"]]
examples = ratings_where_raters_agree.groupby("score_1").sample(7, random_state=1214)
examples["human_score"] = examples["score_1"]

# Visualize 1 sample for each score
display(examples.groupby("human_score").first())
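
The cell above implements the agreement filter. For comparison, here is a minimal sketch of both noise-reduction options side by side on a made-up five-row table; the `toy` DataFrame and its values are illustrative and not taken from feedbackQA.

```python
import pandas as pd

# Made-up scores from two raters, using the same column names as `ratings`
toy = pd.DataFrame({"score_1": [4, 3, 2, 1, 4], "score_2": [4, 2, 2, 3, 3]})

# Option 1: use the mean of the two raters as the ground truth
toy["mean_score"] = toy[["score_1", "score_2"]].mean(axis=1)

# Option 2 (the one used in this notebook): keep only rows where raters agree
agree = toy.loc[toy["score_1"] == toy["score_2"]]

print(toy["mean_score"].tolist())  # [4.0, 2.5, 2.0, 2.0, 3.5]
print(len(agree))  # 2
```

Averaging keeps every row but produces half-point labels, while filtering keeps clean integer labels at the cost of discarding the examples on which humans disagree.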

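2. Create our LLM judge

We build our LLM judge with a basic prompt, containing these elements:

task description
scale description: minimum, maximum, value types (float here)
explanation of the output format
the beginning of an answer, to take the LLM by the hand as far as we can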

JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer as a float on a scale of 0 to 10, where 0 means that the system_answer is not helpful at all, and 10 means that the answer completely and helpfully addresses the question.

Provide your feedback as follows:

Feedback:::
Total rating: (your rating, as a float between 0 and 10)

Now here are the question and answer.

Question: {question}
Answer: {answer}

Feedback:::
Total rating: """

examples["llm_judge"] = examples.progress_apply(
    lambda x: llm_client.text_generation(
        prompt=JUDGE_PROMPT.format(question=x["question"], answer=x["answer"]),
        max_new_tokens=1000,
    ),
    axis=1,
)

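from typing import Optional


def extract_judge_score(answer: str, split_str: str = "Total rating:") -> Optional[float]:
    """Return the first number found after `split_str`, or None if parsing fails."""
    try:
        # Only look at the text following the rating marker when it is present
        if split_str in answer:
            rating = answer.split(split_str)[1]
        else:
            rating = answer
        # Match integers and decimals, e.g. "7" or "7.5"
        digit_groups = [el.strip() for el in re.findall(r"\d+(?:\.\d+)?", rating)]
        return float(digit_groups[0])
    except Exception as e:
        print(e)
        return None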


examples["llm_judge_score"] = examples["llm_judge"].apply(extract_judge_score)
# Rescale the LLM score from the 0-10 range onto the 1-4 range used by the human scores
examples["llm_judge_score"] = (examples["llm_judge_score"] / 10) * 3 + 1
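
To see how this kind of parsing behaves on different judge outputs, here is a small self-contained sketch. `parse_rating` is a hypothetical helper that repeats the same idea as `extract_judge_score` so the cell runs on its own, and the sample strings are made up.

```python
import re
from typing import Optional


def parse_rating(answer: str, split_str: str = "Total rating:") -> Optional[float]:
    """Same parsing idea as extract_judge_score, repeated so this cell runs alone."""
    # Keep only the text after the marker when the marker is present
    tail = answer.split(split_str)[1] if split_str in answer else answer
    # Take the first integer or decimal number found in that text
    match = re.search(r"\d+(?:\.\d+)?", tail)
    return float(match.group()) if match else None


samples = [
    " 8.5",  # bare continuation of the prompt
    "Evaluation: clear. Total rating: 3",  # full template
    "Total rating: 3/4",  # the first number wins
    "I cannot rate this answer.",  # nothing to parse
]
print([parse_rating(s) for s in samples])  # [8.5, 3.0, 3.0, None]
```

The `None` case matters: unparsable outputs become missing values in the DataFrame, and pandas leaves them out when computing the correlation.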

>>> print("Correlation between LLM-as-a-judge and the human raters:")
>>> print(f"{examples['llm_judge_score'].corr(examples['human_score'], method='pearson'):.3f}")

Correlation between LLM-as-a-judge and the human raters:
0.567

This is not bad, given that the Pearson correlation between 2 random, independent variables would be 0!

But we can easily do better. 🔝

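3. Improve the LLM judge

As shown by Aparna Dhinakaran, LLMs are poor at evaluating outputs on a continuous range. Her article gives us a few best practices to build a better prompt:

⏳ Leave more time for thought by adding an Evaluation field before the final answer.
🔢 Use a small integer scale like 1-4 or 1-5 instead of a large float scale as we had previously.
👩‍🏫 Provide an indicative scale for guidance.
We even add a carrot to motivate the LLM!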

IMPROVED_JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer on a scale of 1 to 4, where 1 means that the system_answer is not helpful at all, and 4 means that the system_answer completely and helpfully addresses the user_question.

Here is the scale you should use to build your answer:
1: The system_answer is terrible: completely irrelevant to the question asked, or very partial
2: The system_answer is mostly not helpful: misses some key aspects of the question
3: The system_answer is mostly helpful: provides support, but still could be improved
4: The system_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question

Provide your feedback as follows:

Feedback:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 4)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and answer.

Question: {question}
Answer: {answer}

Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.
Feedback:::
Evaluation: """

examples["llm_judge_improved"] = examples.progress_apply(
    lambda x: llm_client.text_generation(
        prompt=IMPROVED_JUDGE_PROMPT.format(question=x["question"], answer=x["answer"]),
        max_new_tokens=500,
    ),
    axis=1,
)
examples["llm_judge_improved_score"] = examples["llm_judge_improved"].apply(extract_judge_score)

>>> print("Correlation between LLM-as-a-judge and the human raters:")
>>> print(f"{examples['llm_judge_improved_score'].corr(examples['human_score'], method='pearson'):.3f}")

Correlation between LLM-as-a-judge and the human raters:
0.843

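The correlation rose by almost 30 points, from 0.567 to 0.843, with only a few tweaks to the prompt (of which a couple of points are due to my shameless tip to the LLM, which I hereby declare not legally binding).

Quite impressive! 👏

Let's display a few errors of our LLM judge to analyse them: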

errors = pd.concat(
    [
        examples.loc[examples["llm_judge_improved_score"] > examples["human_score"]].head(1),
        examples.loc[examples["llm_judge_improved_score"] < examples["human_score"]].head(2),
    ]
)

display(
    errors[
        [
            "question",
            "answer",
            "human_score",
            "explanation_1",
            "llm_judge_improved_score",
            "llm_judge_improved",
        ]
    ]
)

The disagreements are minor: overall, we seem to have reached a good level of performance for our system!
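
Beyond reading individual errors, a cross-tabulation gives a quick overview of where judge and humans diverge. This is a minimal sketch on made-up scores: the `toy` DataFrame stands in for the `human_score` and `llm_judge_improved_score` columns of `examples`, and the numbers are not results from this notebook.

```python
import pandas as pd

# Made-up scores standing in for the human and judge columns of `examples`
toy = pd.DataFrame(
    {
        "human_score": [1, 1, 2, 2, 3, 3, 4, 4],
        "judge_score": [1, 2, 2, 2, 3, 4, 4, 3],
    }
)

# Rows: human score, columns: judge score. Off-diagonal cells are disagreements.
confusion = pd.crosstab(toy["human_score"], toy["judge_score"])
print(confusion)

# Share of exact matches and average size of the gap between the two scores
exact_agreement = (toy["human_score"] == toy["judge_score"]).mean()
mean_abs_gap = (toy["human_score"] - toy["judge_score"]).abs().mean()
print(f"exact agreement: {exact_agreement:.3f}, mean absolute gap: {mean_abs_gap:.3f}")
```

A table concentrated on or next to the diagonal suggests the judge is rarely off by more than one point, which is a useful complement to a single correlation number.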

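4. How do we take our LLM judge even further?

🎯 You will never reach 100%: Let's first note that our human ground truth certainly has some noise, so agreement/correlation will never go up to 100%, even with a perfect LLM judge.

🧭 Provide a reference: If you have access to a reference answer for each question, you should definitely give it to the judge LLM in its prompt to get better results!

▶️ Provide few-shot examples: Adding a few examples of questions and ground-truth evaluations to the prompt can improve the results. (I tried it here; it did not improve results in this case so I skipped it, but it could work for your dataset!)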
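
The two tips above, a reference answer and few-shot examples, could be combined in a prompt builder along these lines. This is a sketch only: `build_judge_prompt`, its wording, and the example question are hypothetical and were not used to produce the results in this notebook.

```python
from typing import Optional, Sequence, Tuple

Shot = Tuple[str, str, str, int]  # (question, answer, evaluation, rating)


def build_judge_prompt(
    question: str,
    answer: str,
    reference: Optional[str] = None,
    shots: Sequence[Shot] = (),
) -> str:
    """Assemble a judge prompt with optional few-shot examples and reference answer."""
    parts = [
        "Rate how well the system_answer addresses the user_question, on a scale of 1 to 4."
    ]
    # Few-shot demonstrations use the same layout the judge is expected to produce
    for q, a, evaluation, rating in shots:
        parts.append(
            f"Question: {q}\nAnswer: {a}\nFeedback:::\n"
            f"Evaluation: {evaluation}\nTotal rating: {rating}"
        )
    final = f"Question: {question}\n"
    if reference is not None:
        final += f"Reference answer: {reference}\n"
    # End on an open 'Evaluation:' field so the judge continues from there
    final += f"Answer: {answer}\nFeedback:::\nEvaluation: "
    parts.append(final)
    return "\n\n".join(parts)


prompt = build_judge_prompt(
    question="What is the capital of France?",
    answer="The capital of France is Paris.",
    reference="Paris",
    shots=[("What is 2 + 2?", "5", "The answer is incorrect.", 1)],
)
print(prompt)
```

If you use few-shot examples drawn from your human-labelled set, keep them out of the sample you evaluate the judge on, otherwise the measured correlation will be optimistic.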

➕ Additive scale: When the judgement can be split into atomic criteria, using an additive scale can further improve results: see below 👇

ADDITIVE_PROMPT = """
(...)
- Award 1 point if the answer is related to the question.
- Give 1 additional point if the answer is clear and precise.
- Provide 1 further point if the answer is true.
- One final point should be awarded if the answer provides additional resources to support the user.
...
"""

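Implement with structured generation

Using structured generation, you can configure the LLM judge to directly provide its output as JSON with the fields Evaluation and Total rating, which makes parsing easier: see our structured generation cookbook to learn more!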

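Conclusion

That's all for today, congrats for following along! 🥳

I'll have to leave you, some weirdos are banging on my door, claiming they have come on behalf of Mixtral to collect H100s. 🤔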
