RAG Evaluation
This notebook demonstrates how you can evaluate your RAG (Retrieval Augmented Generation) system by building a synthetic evaluation dataset and using LLM-as-a-judge to compute the accuracy of your system.
For an introduction to RAG, you can check this other guide!
RAG systems are complex: here is a RAG diagram, where we annotated in blue all the possibilities for system enhancement.

Implementing any of these improvements can bring a huge performance boost; but changing anything is useless if you cannot monitor the impact of your changes on the system's performance! So let's see how to evaluate our RAG system.
Evaluating RAG performance
Since there are so many moving parts with a large impact on performance, benchmarking a RAG system is crucial.
For our evaluation pipeline, we will need:
- An evaluation dataset with question-answer couples (QA couples)
- An evaluator to compute the accuracy of our system on the above evaluation dataset.
➡️ It turns out we can use LLMs to help us all along the way!
- The evaluation dataset will be synthetically generated by an LLM 🤖, and the questions will be filtered out by other LLMs 🤖.
- An LLM-as-a-judge agent 🤖 will then perform the evaluation on this synthetic dataset.
Let's dig in and start building our evaluation pipeline! First, we install the required model dependencies.
!pip install -q torch transformers langchain sentence-transformers tqdm openpyxl openai pandas datasets langchain-community ragatouille
%reload_ext autoreload
%autoreload 2
from tqdm.auto import tqdm
import pandas as pd
from typing import Optional, List, Tuple
import json
import datasets
pd.set_option("display.max_colwidth", None)
from huggingface_hub import notebook_login
notebook_login()
Load your knowledge base
ds = datasets.load_dataset("m-ric/huggingface_doc", split="train")
1. Build a synthetic dataset for evaluation
We first build a synthetic dataset of questions and associated contexts. The method is to get elements from our knowledge base and ask an LLM to generate questions based on these documents.
Then we set up other LLM agents to act as quality filters for the generated QA couples: each of them will filter for one specific flaw.
1.1. Prepare source documents
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document as LangchainDocument
langchain_docs = [LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]}) for doc in tqdm(ds)]
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    add_start_index=True,
    separators=["\n\n", "\n", ".", " ", ""],
)

docs_processed = []
for doc in langchain_docs:
    docs_processed += text_splitter.split_documents([doc])
1.2. Set up agents for question generation
We use Mixtral for QA couple generation because it has shown excellent performance on leaderboards such as the Chatbot Arena.
from huggingface_hub import InferenceClient
repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
llm_client = InferenceClient(
    model=repo_id,
    timeout=120,
)


def call_llm(inference_client: InferenceClient, prompt: str):
    response = inference_client.post(
        json={
            "inputs": prompt,
            "parameters": {"max_new_tokens": 1000},
            "task": "text-generation",
        },
    )
    return json.loads(response.decode())[0]["generated_text"]
call_llm(llm_client, "This is a test context")
QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".
Provide your answer as follows:
Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)
Now here is the context.
Context: {context}\n
Output:::"""
Now let's generate our QA couples. For this example, we generate only 10 QA couples; the rest will be loaded from the Hub.
But for your specific knowledge base, given that you want to get at least ~100 test samples, and accounting for the fact that we will filter out around half of them with our critique agents later on, you should generate many more, above 200 samples.
import random
N_GENERATIONS = 10 # We intentionally generate only 10 QA couples here for cost and time considerations
print(f"Generating {N_GENERATIONS} QA couples...")
outputs = []
for sampled_context in tqdm(random.sample(docs_processed, N_GENERATIONS)):
    # Generate QA couple
    output_QA_couple = call_llm(llm_client, QA_generation_prompt.format(context=sampled_context.page_content))
    try:
        question = output_QA_couple.split("Factoid question: ")[-1].split("Answer: ")[0]
        answer = output_QA_couple.split("Answer: ")[-1]
        assert len(answer) < 300, "Answer is too long"
        outputs.append(
            {
                "context": sampled_context.page_content,
                "question": question,
                "answer": answer,
                "source_doc": sampled_context.metadata["source"],
            }
        )
    except:
        continue
display(pd.DataFrame(outputs).head(1))
1.3. Set up critique agents
The questions generated by the previous agent can have many flaws: we should do a quality check before validating these questions.
We thus build critique agents that will rate each question on several criteria, given in this paper:
- Groundedness: can the question be answered from the given context?
- Relevance: is the question relevant to users? For instance, "What is the date when transformers 4.29.1 was released?" is not relevant for ML practitioners.
One last failure case we noticed is when a question is tailored to the specific setting it was generated from and does not make sense on its own, like "What is the name of the function used in this guide?". We also build a critique agent for this criterion:
- Stand-alone: is the question understandable free of any context, for someone with domain knowledge/Internet access? The opposite of this would be a question generated from a specific blog post, such as "What is the function used in this article?".
We systematically score each question with all these agents, and whenever the score is too low for any one agent, we eliminate the question from our eval dataset.
💡 When asking the agents to output a score, we first ask them to produce their rationale. This helps us verify the scores, but most importantly, asking for the rationale first gives the model more tokens to think and elaborate an answer before summarizing it into a single score token.
We now build and run these critique agents.
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.
Provide your answer as follows:
Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)
You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.
Now here are the question and context.
Question: {question}\n
Context: {context}\n
Answer::: """
question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.
Provide your answer as follows:
Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)
You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.
Now here is the question.
Question: {question}\n
Answer::: """
question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.
For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.
Provide your answer as follows:
Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)
You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.
Now here is the question.
Question: {question}\n
Answer::: """
print("Generating critique for each QA couple...")
for output in tqdm(outputs):
    evaluations = {
        "groundedness": call_llm(
            llm_client,
            question_groundedness_critique_prompt.format(context=output["context"], question=output["question"]),
        ),
        "relevance": call_llm(
            llm_client,
            question_relevance_critique_prompt.format(question=output["question"]),
        ),
        "standalone": call_llm(
            llm_client,
            question_standalone_critique_prompt.format(question=output["question"]),
        ),
    }
    try:
        for criterion, evaluation in evaluations.items():
            score, eval = (
                int(evaluation.split("Total rating: ")[-1].strip()),
                evaluation.split("Total rating: ")[-2].split("Evaluation: ")[1],
            )
            output.update(
                {
                    f"{criterion}_score": score,
                    f"{criterion}_eval": eval,
                }
            )
    except Exception as e:
        continue
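The split-based parsing above assumes the critique strictly follows the "Evaluation: ... Total rating: ..." template. If it proves too brittle in practice, a regex-based parser is one possible alternative (a minimal sketch, not part of the original pipeline; parse_critique is a hypothetical helper name):

import re


def parse_critique(evaluation: str):
    # Work on the text after the last "Evaluation:" header, so any echoed prompt template is ignored.
    tail = evaluation.rsplit("Evaluation:", 1)[-1]
    match = re.search(r"(.*?)Total rating:\s*([1-5])", tail, re.DOTALL)
    if match is None:
        return None, None  # let the caller skip critiques that cannot be parsed
    return int(match.group(2)), match.group(1).strip()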
Now let us filter out bad questions based on our critique agent scores.
>>> import pandas as pd
>>> pd.set_option("display.max_colwidth", None)
>>> generated_questions = pd.DataFrame.from_dict(outputs)
>>> print("Evaluation dataset before filtering:")
>>> display(
... generated_questions[
... [
... "question",
... "answer",
... "groundedness_score",
... "relevance_score",
... "standalone_score",
... ]
... ]
... )
>>> generated_questions = generated_questions.loc[
... (generated_questions["groundedness_score"] >= 4)
... & (generated_questions["relevance_score"] >= 4)
... & (generated_questions["standalone_score"] >= 4)
... ]
>>> print("============================================")
>>> print("Final evaluation dataset:")
>>> display(
... generated_questions[
... [
... "question",
... "answer",
... "groundedness_score",
... "relevance_score",
... "standalone_score",
... ]
... ]
... )
>>> eval_dataset = datasets.Dataset.from_pandas(generated_questions, split="train", preserve_index=False)
Evaluation dataset before filtering:
Now our synthetic evaluation dataset is complete! We can evaluate different RAG systems on this evaluation dataset.
We only generated a few QA couples here to reduce time and cost. But let's kick-start the next part by loading a pre-generated dataset:
eval_dataset = datasets.load_dataset("m-ric/huggingface_doc_qa_eval", split="train")
2. Build our RAG system
2.1. Preprocessing documents to build our vector database
- In this part, we split the documents from our knowledge base into smaller chunks: these will be the snippets picked by the retriever and then ingested by the reader LLM as supporting elements for its answer.
- The objective is to produce semantically relevant snippets: not too small to be sufficient for supporting an answer, and not so large that they dilute individual ideas.
Several options exist for text splitting:
- split every n words/characters, but this risks cutting paragraphs or even sentences in half;
- split after n words/characters, but only on sentence boundaries;
- recursive split, which tries to preserve more of the document structure by processing it tree-like: splitting first on the largest units (chapters), then recursively on smaller units (paragraphs, sentences).
To learn more about chunking, I recommend you read this great notebook by Greg Kamradt.
This space lets you visualize how different splitting options affect the chunks you get.
In the following, we use Langchain's RecursiveCharacterTextSplitter.
💡 To measure chunk length in our text splitter, our length function will not be the count of characters, but the count of tokens in the tokenized text: indeed, for subsequent embedders that process tokens, measuring length in tokens is more relevant and empirically performs better.
from langchain.docstore.document import Document as LangchainDocument
RAW_KNOWLEDGE_BASE = [
    LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]}) for doc in tqdm(ds)
]
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer
def split_documents(
    chunk_size: int,
    knowledge_base: List[LangchainDocument],
    tokenizer_name: str,
) -> List[LangchainDocument]:
    """
    Split documents into chunks of maximum size `chunk_size` tokens and return a list of documents.
    """
    text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        AutoTokenizer.from_pretrained(tokenizer_name),
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size / 10),
        add_start_index=True,
        strip_whitespace=True,
        separators=["\n\n", "\n", ".", " ", ""],
    )

    docs_processed = []
    for doc in knowledge_base:
        docs_processed += text_splitter.split_documents([doc])

    # Remove duplicate chunks
    unique_texts = {}
    docs_processed_unique = []
    for doc in docs_processed:
        if doc.page_content not in unique_texts:
            unique_texts[doc.page_content] = True
            docs_processed_unique.append(doc)

    return docs_processed_unique
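As an optional sanity check (a sketch added here for illustration, not part of the original notebook), you can split the knowledge base and confirm that chunk lengths, measured in tokens of the embedding model's tokenizer as discussed above, stay within the requested budget. EMBEDDING_MODEL_NAME is a name introduced here for convenience; it matches the embedding model used later.

# Optional sanity check: measure chunk lengths in tokens of the embedding model's tokenizer.
EMBEDDING_MODEL_NAME = "thenlper/gte-small"  # same embedding model as used in the next section

docs_512 = split_documents(512, RAW_KNOWLEDGE_BASE, tokenizer_name=EMBEDDING_MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL_NAME)
chunk_lengths = [len(tokenizer.encode(doc.page_content)) for doc in docs_512]
print(f"{len(docs_512)} chunks, longest chunk: {max(chunk_lengths)} tokens")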
2.2. Retriever - embeddings 🗂️
The retriever acts like an internal search engine: given the user query, it returns the most relevant documents from your knowledge base.
For the knowledge base, we use Langchain vector databases since they offer a convenient FAISS index and allow us to keep document metadata throughout the processing.
🛠️ Included options:
- Tuning the chunking method
- Size of the chunks
- Method: splitting on different separators, using semantic chunking...
- Changing the embedding model
from langchain.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.utils import DistanceStrategy
import os
def load_embeddings(
    langchain_docs: List[LangchainDocument],
    chunk_size: int,
    embedding_model_name: Optional[str] = "thenlper/gte-small",
) -> FAISS:
    """
    Creates a FAISS index from the given embedding model and documents. Loads the index directly if it already exists.

    Args:
        langchain_docs: list of documents
        chunk_size: size of the chunks to split the documents into
        embedding_model_name: name of the embedding model to use

    Returns:
        FAISS index
    """
    # Load embedding model
    embedding_model = HuggingFaceEmbeddings(
        model_name=embedding_model_name,
        multi_process=True,
        model_kwargs={"device": "cuda"},
        encode_kwargs={"normalize_embeddings": True},  # set True to compute cosine similarity
    )

    # Check if embeddings already exist on disk
    index_name = f"index_chunk:{chunk_size}_embeddings:{embedding_model_name.replace('/', '~')}"
    index_folder_path = f"./data/indexes/{index_name}/"
    if os.path.isdir(index_folder_path):
        return FAISS.load_local(
            index_folder_path,
            embedding_model,
            distance_strategy=DistanceStrategy.COSINE,
        )

    else:
        print("Index not found, generating it...")
        docs_processed = split_documents(
            chunk_size,
            langchain_docs,
            embedding_model_name,
        )
        knowledge_index = FAISS.from_documents(
            docs_processed, embedding_model, distance_strategy=DistanceStrategy.COSINE
        )
        knowledge_index.save_local(index_folder_path)
        return knowledge_index
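To see the retriever in action on its own before plugging in the reader, here is a small illustrative test (not part of the original flow; the query is arbitrary). It builds or loads the index with the same settings as the benchmark loop below, then fetches the closest chunk.

# Illustrative retriever test: build (or load) the index and retrieve the closest chunk for a sample query.
knowledge_index = load_embeddings(RAW_KNOWLEDGE_BASE, chunk_size=200, embedding_model_name="thenlper/gte-small")

sample_query = "How to create a pipeline object?"  # arbitrary example query
retrieved_docs = knowledge_index.similarity_search(query=sample_query, k=1)
print(retrieved_docs[0].page_content[:500])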
2.3. Reader - LLM 💬
In this part, the LLM reader reads the retrieved documents to formulate its answer.
🛠️ Here we tried the following options to improve results:
- Switching reranking on or off
- Changing the reader model
RAG_PROMPT_TEMPLATE = """
<|system|>
Using the information contained in the context,
give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If the answer cannot be deduced from the context, do not give an answer.</s>
<|user|>
Context:
{context}
---
Now here is the question you need to answer.
Question: {question}
</s>
<|assistant|>
"""
from langchain_community.llms import HuggingFaceHub
repo_id = "HuggingFaceH4/zephyr-7b-beta"
READER_MODEL_NAME = "zephyr-7b-beta"
HF_API_TOKEN = ""
READER_LLM = HuggingFaceHub(
    repo_id=repo_id,
    task="text-generation",
    huggingfacehub_api_token=HF_API_TOKEN,
    model_kwargs={
        "max_new_tokens": 512,
        "top_k": 30,
        "temperature": 0.1,
        "repetition_penalty": 1.03,
    },
)
from ragatouille import RAGPretrainedModel
from langchain_core.vectorstores import VectorStore
from langchain_core.language_models.llms import LLM
def answer_with_rag(
    question: str,
    llm: LLM,
    knowledge_index: VectorStore,
    reranker: Optional[RAGPretrainedModel] = None,
    num_retrieved_docs: int = 30,
    num_docs_final: int = 7,
) -> Tuple[str, List[LangchainDocument]]:
    """Answer a question using RAG with the given knowledge index."""
    # Gather documents with retriever
    relevant_docs = knowledge_index.similarity_search(query=question, k=num_retrieved_docs)
    relevant_docs = [doc.page_content for doc in relevant_docs]  # keep only the text

    # Optionally rerank results
    if reranker:
        relevant_docs = reranker.rerank(question, relevant_docs, k=num_docs_final)
        relevant_docs = [doc["content"] for doc in relevant_docs]

    relevant_docs = relevant_docs[:num_docs_final]

    # Build the final prompt
    context = "\nExtracted documents:\n"
    context += "".join([f"Document {str(i)}:::\n" + doc for i, doc in enumerate(relevant_docs)])

    final_prompt = RAG_PROMPT_TEMPLATE.format(question=question, context=context)

    # Generate an answer
    answer = llm(final_prompt)

    return answer, relevant_docs
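As a quick end-to-end smoke test (optional, not part of the original notebook; it assumes the knowledge_index built in the retriever test above), you can answer one arbitrary question before launching the full benchmark:

# Optional end-to-end smoke test of the RAG pipeline on a single arbitrary question.
test_question = "How to create a pipeline object?"  # example question for illustration
test_answer, test_sources = answer_with_rag(test_question, READER_LLM, knowledge_index)
print(test_answer)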
3. Benchmarking the RAG system
The RAG system and the evaluation dataset are now ready. The last step is to judge the RAG system's output on this evaluation dataset.
To this end, we set up a judge agent. ⚖️🤖
Out of the different RAG evaluation metrics, we choose to focus only on answer correctness, since it is the best end-to-end metric of our system's performance.
We use GPT-4 as a judge for its empirically good performance, but you could try other models such as kaist-ai/prometheus-13b-v1.0 or BAAI/JudgeLM-33B-v1.0.
💡 In the evaluation prompt, we give a detailed description of each metric on the 1-5 scale, as is done in Prometheus's prompt template: this helps the model ground its metric precisely. If instead you give the judge LLM a vague scale to work with, the outputs will not be consistent enough between different examples.
💡 Again, prompting the LLM to output a rationale before giving its final score gives it more tokens to help it formalize and elaborate its judgment.
from langchain_core.language_models import BaseChatModel
def run_rag_tests(
    eval_dataset: datasets.Dataset,
    llm,
    knowledge_index: VectorStore,
    output_file: str,
    reranker: Optional[RAGPretrainedModel] = None,
    verbose: Optional[bool] = True,
    test_settings: Optional[str] = None,  # To document the test settings used
):
    """Runs RAG tests on the given dataset and saves the results to the given output file."""
    try:  # load previous generations if they exist
        with open(output_file, "r") as f:
            outputs = json.load(f)
    except:
        outputs = []

    for example in tqdm(eval_dataset):
        question = example["question"]
        if question in [output["question"] for output in outputs]:
            continue

        answer, relevant_docs = answer_with_rag(question, llm, knowledge_index, reranker=reranker)
        if verbose:
            print("=======================================================")
            print(f"Question: {question}")
            print(f"Answer: {answer}")
            print(f'True answer: {example["answer"]}')
        result = {
            "question": question,
            "true_answer": example["answer"],
            "source_doc": example["source_doc"],
            "generated_answer": answer,
            "retrieved_docs": [doc for doc in relevant_docs],
        }
        if test_settings:
            result["test_settings"] = test_settings
        outputs.append(result)

        with open(output_file, "w") as f:
            json.dump(outputs, f)
EVALUATION_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 5}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.
###The instruction to evaluate:
{instruction}
###Response to evaluate:
{response}
###Reference Answer (Score 5):
{reference_answer}
###Score Rubrics:
[Is the response correct, accurate, and factual based on the reference answer?]
Score 1: The response is completely incorrect, inaccurate, and/or not factual.
Score 2: The response is mostly incorrect, inaccurate, and/or not factual.
Score 3: The response is somewhat correct, accurate, and/or factual.
Score 4: The response is mostly correct, accurate, and factual.
Score 5: The response is completely correct, accurate, and factual.
###Feedback:"""
from langchain.prompts.chat import (
    ChatPromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import SystemMessage


evaluation_prompt_template = ChatPromptTemplate.from_messages(
    [
        SystemMessage(content="You are a fair evaluator language model."),
        HumanMessagePromptTemplate.from_template(EVALUATION_PROMPT),
    ]
)
from langchain.chat_models import ChatOpenAI
OPENAI_API_KEY = ""
eval_chat_model = ChatOpenAI(model="gpt-4-1106-preview", temperature=0, openai_api_key=OPENAI_API_KEY)
evaluator_name = "GPT4"
def evaluate_answers(
    answer_path: str,
    eval_chat_model,
    evaluator_name: str,
    evaluation_prompt_template: ChatPromptTemplate,
) -> None:
    """Evaluates generated answers. Modifies the given answer file in place for better checkpointing."""
    answers = []
    if os.path.isfile(answer_path):  # load previous generations if they exist
        answers = json.load(open(answer_path, "r"))

    for experiment in tqdm(answers):
        if f"eval_score_{evaluator_name}" in experiment:
            continue

        eval_prompt = evaluation_prompt_template.format_messages(
            instruction=experiment["question"],
            response=experiment["generated_answer"],
            reference_answer=experiment["true_answer"],
        )
        eval_result = eval_chat_model.invoke(eval_prompt)
        feedback, score = [item.strip() for item in eval_result.content.split("[RESULT]")]
        experiment[f"eval_score_{evaluator_name}"] = score
        experiment[f"eval_feedback_{evaluator_name}"] = feedback

        with open(answer_path, "w") as f:
            json.dump(answers, f)
🚀 Let's run the tests and evaluate the answers! 👇
if not os.path.exists("./output"):
    os.mkdir("./output")

for chunk_size in [200]:  # Add other chunk sizes (in tokens) as needed
    for embeddings in ["thenlper/gte-small"]:  # Add other embeddings as needed
        for rerank in [True, False]:
            settings_name = f"chunk:{chunk_size}_embeddings:{embeddings.replace('/', '~')}_rerank:{rerank}_reader-model:{READER_MODEL_NAME}"
            output_file_name = f"./output/rag_{settings_name}.json"

            print(f"Running evaluation for {settings_name}:")

            print("Loading knowledge base embeddings...")
            knowledge_index = load_embeddings(
                RAW_KNOWLEDGE_BASE,
                chunk_size=chunk_size,
                embedding_model_name=embeddings,
            )

            print("Running RAG...")
            reranker = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0") if rerank else None
            run_rag_tests(
                eval_dataset=eval_dataset,
                llm=READER_LLM,
                knowledge_index=knowledge_index,
                output_file=output_file_name,
                reranker=reranker,
                verbose=False,
                test_settings=settings_name,
            )

            print("Running evaluation...")
            evaluate_answers(
                output_file_name,
                eval_chat_model,
                evaluator_name,
                evaluation_prompt_template,
            )
Inspect results
import glob
outputs = []
for file in glob.glob("./output/*.json"):
    output = pd.DataFrame(json.load(open(file, "r")))
    output["settings"] = file
    outputs.append(output)
result = pd.concat(outputs)
result["eval_score_GPT4"] = result["eval_score_GPT4"].apply(lambda x: int(x) if isinstance(x, str) else 1)
result["eval_score_GPT4"] = (result["eval_score_GPT4"] - 1) / 4
average_scores = result.groupby("settings")["eval_score_GPT4"].mean()
average_scores.sort_values()
Example results
Let us load the results that I obtained by tweaking the different options available in this notebook. For more detail on why these options could or could not work, see the notebook on advanced RAG.
As you can see in the graph below, some tweaks do not bring any improvement, while others bring huge performance boosts.
➡️ There is no single good recipe: you should try several different directions when tuning your RAG system.
import plotly.express as px
scores = datasets.load_dataset("m-ric/rag_scores_cookbook", split="train")
scores = pd.Series(scores["score"], index=scores["settings"])
fig = px.bar(
    scores,
    color=scores,
    labels={
        "value": "Accuracy",
        "settings": "Configuration",
    },
    color_continuous_scale="bluered",
)
fig.update_layout(
    width=1000,
    height=600,
    barmode="group",
    yaxis_range=[0, 100],
    title="<b>Accuracy of different RAG configurations</b>",
    xaxis_title="RAG settings",
    font=dict(size=15),
)
fig.layout.yaxis.ticksuffix = "%"
fig.update_coloraxes(showscale=False)
fig.update_traces(texttemplate="%{y:.1f}", textposition="outside")
fig.show()

As you can see, these tweaks had varying impacts on performance. In particular, tuning the chunk size is both easy and very impactful.
But this is only our case: your results could be very different. Now that you have a robust evaluation pipeline, you can start exploring other options! 🗺️