
Agentic RAG: turbocharge your RAG with query reformulation and self-query! 🚀

Authored by: Aymeric Roucher

This tutorial is advanced. You should first have a grasp of the concepts from this other cookbook!

Reminder: Retrieval-Augmented Generation (RAG) is "using an LLM to answer a user query, but basing the answer on information retrieved from a knowledge base". It has many advantages over using a vanilla or fine-tuned LLM: to name a few, it allows grounding the answer on true facts and reducing hallucinations, it allows providing the LLM with domain-specific knowledge, and it allows fine-grained control over access to information from the knowledge base.

But vanilla RAG has limitations, most importantly these two:

  • It performs only one retrieval step: if the results are bad, the generation will in turn be bad.
  • Semantic similarity is computed with the user query as a reference, which may be suboptimal: for instance, the user query will often be a question while the document containing the true answer is phrased in the affirmative, so its similarity score will be lower than that of other source documents phrased as questions, raising the risk of missing the relevant information.

But we can alleviate these problems by making a RAG agent: very simply, an agent armed with a retrieval tool!

This agent will: ✅ Formulate the query itself and ✅ Critique the results to re-retrieve if needed.

So it should naively recover some advanced RAG techniques!

  • Instead of directly using the user query as the reference in semantic search, the agent formulates itself a reference sentence that can be closer to the targeted documents, as in HyDE.
  • The agent can generate snippets and re-retrieve if needed, as in Self-Query.

Let's build this system. 🛠️

Run the lines below to install the required dependencies:

!pip install pandas langchain langchain-community sentence-transformers faiss-cpu smolagents --upgrade -q

Let's log in in order to call the HF Inference API:

from huggingface_hub import notebook_login

notebook_login()
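If you're running this outside a notebook, a minimal alternative (same huggingface_hub API) is to call login() directly, or to set the HF_TOKEN environment variable:

from huggingface_hub import login

login()  # prompts for a token interactively; you can also pass login(token=...)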

We first load a knowledge base on which we want to perform RAG: this dataset is a compilation of the documentation pages for many Hugging Face packages, stored as markdown.

import datasets

knowledge_base = datasets.load_dataset("m-ric/huggingface_doc", split="train")
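As an optional sanity check, you can peek at one record; the text and source fields are the ones the processing step below relies on:

# Inspect one record of the knowledge base (field names as used in the loading code above)
example = knowledge_base[0]
print(example["source"])      # origin of the page; split on "/" below to build metadata
print(example["text"][:200])  # beginning of the markdown content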

Now we prepare the knowledge base by processing the dataset and storing it in a vector database to be used by the retriever.

We use LangChain for its excellent vector database utilities. For the embedding model, we use thenlper/gte-small since it performed well in our RAG evaluation cookbook.

>>> from tqdm import tqdm
>>> from transformers import AutoTokenizer
>>> from langchain.docstore.document import Document
>>> from langchain.text_splitter import RecursiveCharacterTextSplitter
>>> from langchain.vectorstores import FAISS
>>> from langchain_community.embeddings import HuggingFaceEmbeddings
>>> from langchain_community.vectorstores.utils import DistanceStrategy

>>> source_docs = [
...     Document(page_content=doc["text"], metadata={"source": doc["source"].split("/")[1]}) for doc in knowledge_base
... ]

>>> text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
...     AutoTokenizer.from_pretrained("thenlper/gte-small"),
...     chunk_size=200,
...     chunk_overlap=20,
...     add_start_index=True,
...     strip_whitespace=True,
...     separators=["\n\n", "\n", ".", " ", ""],
... )

>>> # Split docs and keep only unique ones
>>> print("Splitting documents...")
>>> docs_processed = []
>>> unique_texts = {}
>>> for doc in tqdm(source_docs):
...     new_docs = text_splitter.split_documents([doc])
...     for new_doc in new_docs:
...         if new_doc.page_content not in unique_texts:
...             unique_texts[new_doc.page_content] = True
...             docs_processed.append(new_doc)

>>> print("Embedding documents... This should take a few minutes (5 minutes on MacBook with M1 Pro)")
>>> embedding_model = HuggingFaceEmbeddings(model_name="thenlper/gte-small")
>>> vectordb = FAISS.from_documents(
...     documents=docs_processed,
...     embedding=embedding_model,
...     distance_strategy=DistanceStrategy.COSINE,
... )
Splitting documents...
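Before wiring the vector store into an agent, a quick optional check confirms that similarity search returns sensible chunks:

# Retrieve the 2 nearest chunks for a test query and print a snippet of each
for i, doc in enumerate(vectordb.similarity_search("push a model to the Hub", k=2)):
    print(f"===== Result {i} (source: {doc.metadata['source']}) =====")
    print(doc.page_content[:200])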

Now the database is ready: let's build our agentic RAG system!

👉 We only need a RetrieverTool that our agent can leverage to retrieve information from the knowledge base.

Since we need to add the vectordb as an attribute of the tool, we cannot simply use the simple tool constructor with a @tool decorator: so we will follow the advanced setup highlighted in the advanced agents documentation.
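For contrast, here is a rough sketch (hypothetical, not used in this tutorial) of what the simple @tool form would look like: since the decorated function stands alone, the vector store would have to live in a global variable rather than a tool attribute.

from smolagents import tool


@tool
def retriever(query: str) -> str:
    """Retrieves documents from the knowledge base that are semantically closest to the query.

    Args:
        query: The query to perform, in affirmative form rather than as a question.
    """
    docs = vectordb.similarity_search(query, k=7)  # relies on a global vectordb
    return "\n".join(doc.page_content for doc in docs)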

from smolagents import Tool
from langchain_core.vectorstores import VectorStore


class RetrieverTool(Tool):
    name = "retriever"
    description = "Using semantic similarity, retrieves some documents from the knowledge base that have the closest embeddings to the input query."
    inputs = {
        "query": {
            "type": "string",
            "description": "The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.",
        }
    }
    output_type = "string"

    def __init__(self, vectordb: VectorStore, **kwargs):
        super().__init__(**kwargs)
        self.vectordb = vectordb

    def forward(self, query: str) -> str:
        assert isinstance(query, str), "Your search query must be a string"

        docs = self.vectordb.similarity_search(
            query,
            k=7,
        )

        return "\nRetrieved documents:\n" + "".join(
            [f"===== Document {str(i)} =====\n" + doc.page_content for i, doc in enumerate(docs)]
        )
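You can also call the tool directly before handing it to an agent, which is handy for debugging; note that the input description asks for the affirmative form, echoing the HyDE-style point above:

# smolagents tools are callable: instantiate and query the tool directly
retriever_tool = RetrieverTool(vectordb)
print(retriever_tool(query="push a model to the Hub, upload bf16 weights")[:500])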

Now it's straightforward to create an agent that leverages this tool!

The agent will need these arguments upon initialization:

  • tools: a list of tools the agent will be able to call.
  • model: the LLM that powers the agent.

Our model must be a callable that takes a list of messages as input and returns text. It also needs to accept a stop_sequences argument that indicates when to stop generating. For convenience, we directly use the InferenceClientModel class provided in the package to get an LLM engine that calls our Inference API.

And we use meta-llama/Llama-3.1-70B-Instruct, served for free on Hugging Face's Inference API!

Note: The Inference API hosts models based on various criteria, and deployed models may be updated or replaced without prior notice. Learn more about it here.

from smolagents import InferenceClientModel, ToolCallingAgent

model = InferenceClientModel("meta-llama/Llama-3.1-70B-Instruct")

retriever_tool = RetrieverTool(vectordb)
agent = ToolCallingAgent(tools=[retriever_tool], model=model)

Since we initialized the agent as a ToolCallingAgent, it has automatically been given a default system prompt that tells the LLM engine to process step-by-step and generate tool calls as JSON blobs (you can replace this prompt template with your own as needed).

Then when its .run() method is launched, the agent takes care of calling the LLM engine, parsing the tool-call JSON blobs, and executing the tool calls, all in a loop that ends only when the final answer is provided.
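For intuition, a tool call emitted by the LLM engine looks roughly like the following blob (illustrative shape only; the exact trace format depends on your smolagents version):

{"name": "retriever", "arguments": {"query": "push a model to the Hub"}}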

>>> agent_output = agent.run("How can I push a model to the Hub?")

>>> print("Final output:")
>>> print(agent_output)
Final output:
To push a model to the Hub, you can use the push_to_hub() method after training. You can also use the PushToHubCallback to upload checkpoints regularly during a longer training run. Additionally, you can push the model up to the hub using the api.upload_folder() method.

Agentic RAG vs. standard RAG

Does the agent setup make for a better RAG system? Well, let's compare it to a standard RAG system using an LLM judge!

We will use meta-llama/Llama-3.1-70B-Instruct for evaluation since it's one of the strongest open-source models we have tested for LLM judge use cases.

eval_dataset = datasets.load_dataset("m-ric/huggingface_doc_qa_eval", split="train")

Before running the test, let's make the agent less verbose.

import logging

agent.logger.setLevel(logging.WARNING)  # Let's reduce the agent's verbosity level

outputs_agentic_rag = []

for example in tqdm(eval_dataset):
    question = example["question"]

    enhanced_question = f"""Using the information contained in your knowledge base, which you can access with the 'retriever' tool,
give a comprehensive answer to the question below.
Respond only to the question asked, response should be concise and relevant to the question.
If you cannot find information, do not give up and try calling your retriever again with different arguments!
Make sure to have covered the question completely by calling the retriever tool several times with semantically different queries.
Your queries should not be questions but affirmative form sentences: e.g. rather than "How do I load a model from the Hub in bf16?", query should be "load a model from the Hub bf16 weights".

Question:
{question}"""
    answer = agent.run(enhanced_question)
    print("=======================================================")
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print(f'True answer: {example["answer"]}')

    results_agentic = {
        "question": question,
        "true_answer": example["answer"],
        "source_doc": example["source_doc"],
        "generated_answer": answer,
    }
    outputs_agentic_rag.append(results_agentic)
Now let's run a standard RAG pipeline for comparison: here the retriever is called exactly once with the user question, and a reader LLM answers from the retrieved context.

from huggingface_hub import InferenceClient

reader_llm = InferenceClient("Qwen/Qwen2.5-72B-Instruct")

outputs_standard_rag = []

for example in tqdm(eval_dataset):
    question = example["question"]
    context = retriever_tool(question)

    prompt = f"""Given the question and supporting documents below, give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
Provide the number of the source document when relevant.
If you cannot find information, do not give up and try calling your retriever again with different arguments!

Question:
{question}

{context}
"""
    messages = [{"role": "user", "content": prompt}]
    answer = reader_llm.chat_completion(messages).choices[0].message.content

    print("=======================================================")
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print(f'True answer: {example["answer"]}')

    results_standard = {
        "question": question,
        "true_answer": example["answer"],
        "source_doc": example["source_doc"],
        "generated_answer": answer,
    }
    outputs_standard_rag.append(results_standard)

The evaluation prompt follows some of the best practices shown in our llm_judge cookbook: it uses a small integer Likert scale, with clear criteria and a description for each score.

EVALUATION_PROMPT = """You are a fair evaluator language model.

You will be given an instruction, a response to evaluate, a reference answer that gets a score of 3, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 3. You should refer to the score rubric.
3. The output format should look as follows: \"Feedback: {{write a feedback for criteria}} [RESULT] {{an integer number between 1 and 3}}\"
4. Please do not generate any other opening, closing, and explanations. Be sure to include [RESULT] in your output.
5. Do not score conciseness: a correct answer that covers the question should receive max score, even if it contains additional useless information.

The instruction to evaluate:
{instruction}

Response to evaluate:
{response}

Reference Answer (Score 3):
{reference_answer}

Score Rubrics:
[Is the response complete, accurate, and factual based on the reference answer?]
Score 1: The response is completely incomplete, inaccurate, and/or not factual.
Score 2: The response is somewhat complete, accurate, and/or factual.
Score 3: The response is completely complete, accurate, and/or factual.

Feedback:"""
from huggingface_hub import InferenceClient

evaluation_client = InferenceClient("meta-llama/Llama-3.1-70B-Instruct")
import pandas as pd

results = {}
for system_type, outputs in [
    ("agentic", outputs_agentic_rag),
    ("standard", outputs_standard_rag),
]:
    for experiment in tqdm(outputs):
        eval_prompt = EVALUATION_PROMPT.format(
            instruction=experiment["question"],
            response=experiment["generated_answer"],
            reference_answer=experiment["true_answer"],
        )
        eval_result = evaluation_client.text_generation(eval_prompt, max_new_tokens=1000)
        try:
            feedback, score = [item.strip() for item in eval_result.split("[RESULT]")]
            experiment["eval_score_LLM_judge"] = score
            experiment["eval_feedback_LLM_judge"] = feedback
        except Exception:
            print(f"Parsing failed - output was: {eval_result}")

    results[system_type] = pd.DataFrame.from_dict(outputs)
    results[system_type] = results[system_type].loc[~results[system_type]["generated_answer"].str.contains("Error")]
>>> DEFAULT_SCORE = 2  # Give average score whenever scoring fails


>>> def fill_score(x):
...     try:
...         return int(x)
...     except:
...         return DEFAULT_SCORE


>>> for system_type, outputs in [
...     ("agentic", outputs_agentic_rag),
...     ("standard", outputs_standard_rag),
... ]:

...     results[system_type]["eval_score_LLM_judge_int"] = (
...         results[system_type]["eval_score_LLM_judge"].fillna(DEFAULT_SCORE).apply(fill_score)
...     )
...     results[system_type]["eval_score_LLM_judge_int"] = (results[system_type]["eval_score_LLM_judge_int"] - 1) / 2

...     print(
...         f"Average score for {system_type} RAG: {results[system_type]['eval_score_LLM_judge_int'].mean()*100:.1f}%"
...     )
Average score for agentic RAG: 86.9%
Average score for standard RAG: 73.1%
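If you want to dig into where the two systems diverge, a small optional sketch (assuming both result DataFrames kept the same set of questions) lines up the per-question scores:

# Join the two result tables on the question to compare scores side by side
comparison = results["agentic"][["question", "eval_score_LLM_judge_int"]].merge(
    results["standard"][["question", "eval_score_LLM_judge_int"]],
    on="question",
    suffixes=("_agentic", "_standard"),
)
print(comparison[comparison["eval_score_LLM_judge_int_agentic"] > comparison["eval_score_LLM_judge_int_standard"]].head())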

To recap: the agentic setup improves scores by 14 points compared to standard RAG (from 73.1% to 86.9%)!

That's a huge improvement, and with a very simple setup 🚀

(For a baseline, Llama-3-70B without the knowledge base got 36%.)
