Open-source LLMs as LangChain Agents

Published January 24, 2024

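TL;DR

Introduction

Table of Contents

What are agents?
Toy example of a ReAct agent's inner workings

Challenges of agent systems

Running agents with LangChain

Agents Showdown: how do open-source LLMs perform as general-purpose reasoning agents?
Evaluation

Models

Results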

TL;DR

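Open-source LLMs have reached a performance level that makes them suitable reasoning engines for powering agent workflows: Mixtral even surpasses GPT-3.5 on our benchmark, and its performance could easily be improved further with fine-tuning.

We have released the simplest agent library: smolagents! Check out the smolagents introduction blog post here.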

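Introduction

Large Language Models (LLMs) trained for causal language modeling can tackle a wide range of tasks, but they often struggle with basic tasks such as logic, calculation, and search. The worst case is when they perform poorly in a domain, such as math, yet still try to handle all the calculations themselves.

To overcome this weakness, one option among others is to integrate the LLM into a system where it can call tools: such a system is called an LLM agent.

In this post, we explain the inner workings of ReAct agents, then show how to build them using the ChatHuggingFace class recently integrated in LangChain. Finally, we benchmark several open-source LLMs against GPT-3.5 and GPT-4.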

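What are agents?

The definition of LLM agents is quite broad: LLM agents are all systems that use an LLM as their engine and can perform actions on their environment based on observations. They can use several iterations of the "Perception => Reflection => Action" cycle to complete their task, and they are often augmented with planning or knowledge management systems to improve their performance. You can find a good review of the agents landscape in Xi et al., 2023.

Today, we focus on ReAct agents. ReAct is an approach to building agents whose name combines the words "Reasoning" and "Acting". In the prompt, we describe the model and the tools it can use, and we ask it to think "step by step" (also called Chain-of-Thought behavior) to plan and execute its next actions until it reaches the final answer.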

[Figure: diagram of a ReAct agent]

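Toy example of a ReAct agent's inner workings

The diagram above looks very high-level, but under the hood the principle is quite simple.

Take a look at this notebook: we implement a barebones tool call example with the Transformers library.

The LLM is called in a loop with a prompt that contains, in essence: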

Here is a question: "{question}" 
You have access to these tools: {tools_descriptions}. 
You should first reflect with ‘Thought: {your_thoughts}’, then you either:
- call a tool with the proper JSON formatting,
- or print your final answer starting with the prefix ‘Final Answer:’

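Then you parse the LLM's output:

- if it contains the string 'Final Answer:', the loop ends and you print the answer,
- otherwise, the LLM should have output a tool call: you parse this output to get the tool name and arguments, then call that tool with those arguments. The output of this tool call is appended to the prompt, and you call the LLM again with this extended information, until it has enough information to provide a final answer to the question.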
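The parsing step described above can be sketched in a few lines. This is a minimal illustration rather than the notebook's actual code: the function name `parse_llm_output` and the shape of the returned dictionary are our own assumptions, and the JSON blob is located with a simple "first `{` to last `}`" heuristic.

```python
import json
import re

FINAL_ANSWER_PREFIX = "Final Answer:"


def parse_llm_output(text: str) -> dict:
    """Classify one LLM reply as either a final answer or a tool call.

    Hypothetical helper: the name and return format are illustrative only.
    """
    if FINAL_ANSWER_PREFIX in text:
        # Everything after the prefix is treated as the answer.
        answer = text.split(FINAL_ANSWER_PREFIX, 1)[1].strip()
        return {"type": "final", "answer": answer}

    # Otherwise we expect a JSON blob: take the span from the first "{"
    # to the last "}" in the reply.
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        raise ValueError("Reply contains neither a final answer nor a tool call")
    call = json.loads(match.group(0))
    return {"type": "tool", "name": call["action"], "args": call["action_input"]}
```

A real agent would also need to handle malformed JSON (for example by feeding the error back to the model), which this sketch simply lets raise.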

For instance, when answering the question "How many seconds are in 1:23:45?", the LLM's output can look like this:

Thought: I need to convert the time string into seconds.

Action:
{
    "action": "convert_time",
    "action_input": {
    "time": "1:23:45"
    }
}

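Since this output does not contain the string 'Final Answer:', it is calling a tool. We therefore parse the output and get the tool call parameters: call the tool convert_time with arguments {"time": "1:23:45"}. Running this tool call returns {'seconds': '5025'}.

So we append this whole block of information to the prompt.

The new prompt is now (a slightly more detailed version of):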

Here is a question: "How many seconds are in 1:23:45?"
You have access to these tools:
    - convert_time: converts a time given in hours:minutes:seconds into seconds.

You should first reflect with ‘Thought: {your_thoughts}’, then you either:
- call a tool with the proper JSON formatting,
- or print your final answer starting with the prefix ‘Final Answer:’

Thought: I need to convert the time string into seconds.

Action:
{
    "action": "convert_time",
    "action_input": {
    "time": "1:23:45"
    }
}
Observation: {'seconds': '5025'}

➡️ We call the LLM again with this new prompt. Given that it now has access to the tool call's result in the Observation, the LLM is most likely to output:

Thought: I now have the information needed to answer the question.
Final Answer: There are 5025 seconds in 1:23:45.

And the task is solved!
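To tie the walkthrough together, here is a self-contained sketch of the whole loop. It makes several assumptions that are not in the original notebook: `fake_llm` is a scripted stand-in for a real model (no LLM is called), `run_agent` and `PROMPT_TEMPLATE` are hypothetical names, and `convert_time` is our own guess at what such a tool could look like.

```python
import json
import re

PROMPT_TEMPLATE = (
    'Here is a question: "{question}"\n'
    "You have access to these tools:\n{tools_descriptions}\n\n"
    "You should first reflect with 'Thought: {{your_thoughts}}', then you either:\n"
    "- call a tool with the proper JSON formatting,\n"
    "- or print your final answer starting with the prefix 'Final Answer:'\n"
)


def convert_time(time: str) -> dict:
    """converts a time given in hours:minutes:seconds into seconds."""
    hours, minutes, seconds = (int(part) for part in time.split(":"))
    return {"seconds": str(hours * 3600 + minutes * 60 + seconds)}


def fake_llm(prompt: str) -> str:
    """Scripted stand-in for a model: answers once an Observation is present."""
    seen = re.search(r"Observation: \{'seconds': '(\d+)'\}", prompt)
    if seen:
        return (
            "Thought: I now have the information needed to answer the question.\n"
            f"Final Answer: There are {seen.group(1)} seconds in 1:23:45."
        )
    return (
        "Thought: I need to convert the time string into seconds.\n\n"
        'Action:\n{\n    "action": "convert_time",\n'
        '    "action_input": {"time": "1:23:45"}\n}'
    )


def run_agent(question, llm, tools, max_steps=5):
    """Call the LLM in a loop, executing tool calls until a final answer appears."""
    descriptions = "\n".join(f"    - {name}: {fn.__doc__}" for name, fn in tools.items())
    prompt = PROMPT_TEMPLATE.format(question=question, tools_descriptions=descriptions)
    for _ in range(max_steps):
        reply = llm(prompt)
        prompt += "\n" + reply
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        # No final answer: parse the JSON tool call and run the tool.
        call = json.loads(re.search(r"\{.*\}", reply, re.DOTALL).group(0))
        observation = tools[call["action"]](**call["action_input"])
        prompt += f"\nObservation: {observation}"
    raise RuntimeError("No final answer within max_steps")


answer = run_agent("How many seconds are in 1:23:45?", fake_llm, {"convert_time": convert_time})
print(answer)
```

The `max_steps` cap is a design choice of this sketch: without it, a model that never emits 'Final Answer:' would loop forever.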

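Challenges of agent systems

Generally, the difficult parts of running an agent system for the LLM engine are:

- From the supplied tools, choosing the one that will help advance toward the desired goal: for example, when asked "What is the smallest prime number greater than 30,000?", the agent could call the Search tool with the argument "What is the height of K2", but that won't help.
- Calling tools with a rigorous argument format: for instance, when trying to calculate the speed of a car that travelled 3 km in 10 minutes, you have to call the Calculator tool to divide distance by time. Even if your Calculator tool accepts calls in the JSON format {"tool": "Calculator", "args": "3km/10min"}, there are many pitfalls, for instance:
    - Misspelling the tool name: "calculator" or "Compute" wouldn't work
    - Giving the names of the arguments instead of their values: "args": "distance/time"
    - Non-standardized formatting: "args": "3km in 10minutes"
- Efficiently ingesting and using the information gathered in past observations, be it the initial context or the observations returned after using tools.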
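The formatting pitfalls listed above are easy to detect programmatically. The sketch below is purely illustrative: the tool registry `KNOWN_TOOLS`, the helper `check_tool_call`, and the `<number>km/<number>min` argument pattern are assumptions made up for this example, not part of any library.

```python
import re

# Assumed tool registry: names are matched exactly, including case.
KNOWN_TOOLS = {"Calculator", "Search"}

# Assumed argument format for the Calculator in this sketch: "<number>km/<number>min".
CALCULATOR_ARGS = re.compile(r"\d+(\.\d+)?km/\d+(\.\d+)?min")


def check_tool_call(call: dict) -> list[str]:
    """Return a list of problems found in a tool call (empty if it looks valid)."""
    problems = []
    tool = call.get("tool")
    if tool not in KNOWN_TOOLS:
        problems.append(f"unknown tool name: {tool!r}")
    args = str(call.get("args", ""))
    if tool == "Calculator" and not CALCULATOR_ARGS.fullmatch(args):
        problems.append(f"malformed args: {args!r}")
    return problems


for call in [
    {"tool": "Calculator", "args": "3km/10min"},         # valid
    {"tool": "calculator", "args": "3km/10min"},         # misspelled tool name
    {"tool": "Calculator", "args": "distance/time"},     # names instead of values
    {"tool": "Calculator", "args": "3km in 10minutes"},  # non-standardized format
]:
    print(call, "->", check_tool_call(call) or "ok")
```

Returning the list of problems, rather than raising, makes it possible to feed them back to the model as an observation so it can retry.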

So, what would a complete agent setup look like?

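Running agents with LangChain

We have just integrated a ChatHuggingFace wrapper that lets you create agents based on open-source models in 🦜🔗LangChain.

The code to create the ChatModel and give it tools is really simple; you can check it all in the LangChain documentation.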

from langchain_community.llms import HuggingFaceEndpoint
from langchain_community.chat_models.huggingface import ChatHuggingFace

llm = HuggingFaceEndpoint(repo_id="HuggingFaceH4/zephyr-7b-beta")

chat_model = ChatHuggingFace(llm=llm)

You can turn the chat_model into an agent by giving it a ReAct-style prompt and tools:

from langchain import hub
from langchain.agents import AgentExecutor, load_tools
from langchain.agents.format_scratchpad import format_log_to_str
from langchain.agents.output_parsers import (
    ReActJsonSingleInputOutputParser,
)
from langchain.tools.render import render_text_description
from langchain_community.utilities import SerpAPIWrapper

# setup tools
tools = load_tools(["serpapi", "llm-math"], llm=llm)

# setup ReAct style prompt
prompt = hub.pull("hwchase17/react-json")
prompt = prompt.partial(
    tools=render_text_description(tools),
    tool_names=", ".join([t.name for t in tools]),
)

# define the agent
chat_model_with_stop = chat_model.bind(stop=["\nObservation"])
agent = (
    {
        "input": lambda x: x["input"],
        "agent_scratchpad": lambda x: format_log_to_str(x["intermediate_steps"]),
    }
    | prompt
    | chat_model_with_stop
    | ReActJsonSingleInputOutputParser()
)

# instantiate AgentExecutor
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

agent_executor.invoke(
    {
        "input": "Who is the current holder of the speed skating world record on 500 meters? What is her current age raised to the 0.43 power?"
    }
)

The agent will then process the input:

Thought: To answer this question, I need to find age of the current speedskating world record holder.  I will use the search tool to find this information.
Action:
{
    "action": "search",
    "action_input": "speed skating world record holder 500m age"
}
Observation: ...
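One detail of the agent definition above is worth spelling out: the chat model is bound with stop=["\nObservation"]. As we understand it, the purpose is to cut generation right after the tool call, so that the Observation comes from the real tool rather than being invented by the model. In practice the truncation happens in the inference backend; the hypothetical helper `truncate_at_stop` below only emulates the effect on a made-up generation.

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Return `text` cut at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# Made-up raw generation: without a stop sequence, a model may keep writing
# and hallucinate the tool result itself.
raw_generation = (
    "Thought: I will use the search tool to find this information.\n"
    "Action:\n"
    '{"action": "search", "action_input": "speed skating world record holder 500m age"}\n'
    "Observation: (text invented by the model, not returned by a tool)\n"
    "Final Answer: (based on the invented observation)"
)

kept = truncate_at_stop(raw_generation, ["\nObservation"])
print(kept)
```

Only the Thought and the Action survive, which is exactly what the output parser needs in order to run the tool and append the real observation.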

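Agents Showdown: how do open-source LLMs perform as general-purpose reasoning agents?

You can find the code for this benchmark here.

Evaluation

We want to measure how open-source LLMs perform as general-purpose reasoning agents. We therefore select questions that require logic and the use of basic tools: a calculator and internet search. The final dataset is a combination of samples from 3 other datasets:

- To test internet search capability, we selected questions from HotpotQA: this is originally a retrieval dataset, but it can be used for general question answering with access to the internet. Some questions originally require combining information from various sources: in our setting, this means performing several steps of internet search to combine the results.
- For calculator usage, we added questions from GSM8K: this dataset tests grade-school math ability and is entirely solvable by correctly using the 4 operators (add, subtract, multiply, divide).
- We also picked questions from GAIA, a very difficult benchmark for general AI assistants. Questions in the original dataset can require many other tools, such as a code interpreter or a PDF reader: we hand-picked questions that require no tools other than search and calculator.

Evaluation was performed with GPT-4 as a judge, using a prompt based on the Prometheus prompt format and giving results on a 5-point Likert scale: see the exact prompt here.

Models

We evaluate a few strong open-source models.

These models are evaluated in LangChain's ReAct implementation. This means we prompt them to output their function calls in this format: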

{
  "action": $TOOL_NAME,
  "action_input": $INPUT
}

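For comparison, we also evaluated GPT-3.5 and GPT-4 on the same examples using LangChain's OpenAI-specific agent. Since this agent uses OpenAI's own function-calling template, on which these models were fine-tuned, the OpenAI models should be at their top performance.

Results

Since the open-source models were not specifically fine-tuned for function calling in the given output format, they are at a slight disadvantage compared to the OpenAI agents.

Despite this, some models perform really well! 💪

Here is an example of Mixtral-8x7B answering the question: "Which city has a larger population, Guiyang or Tacheng?"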

Thought: To answer this question, I need to find the current populations of both Guiyang and Tacheng. I will use the search tool to find this information.
Action:
{
    "action": "search",
    "action_input": "current population of Guiyang"
}
4.697 million (2016)

Thought: I have obtained the population of Guiyang, which is 4.697 million (as of 2016). Now, I will use the search tool again to find the population of Tacheng.
Action:
{
    "action": "search",
    "action_input": "current population of Tacheng"
}
Population: Xinjiang: Tacheng data was reported at 1,110.000 Persons in 2021. This records a decrease from the previous number of 1,138.638 Persons for 2020.
Population: Xinjiang: Tacheng data is updated yearly, averaging 1,023.000 Persons from Dec 2005 to 2021, with 17 observations.

I have obtained the population of Tacheng, which is approximately 1.11 million (as of 2021). Comparing the two populations, Guiyang has a larger population than Tacheng.

Thought: I now know the final answer
Final Answer: Guiyang has a larger population, which is approximately 4.697 million (as of 2016), compared to Tacheng's population of approximately 1.11 million (as of 2021).

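Here is a benchmark of the models on our evaluation dataset (the average scores, originally on a scale of 1-5, have been converted to a scale of 0-100% for readability):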

[Figure: benchmark of agents performance]
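The text only states that the averaged 1-5 scores were rescaled to 0-100%, not how. A natural reading is a linear mapping where 1 becomes 0% and 5 becomes 100%; the sketch below implements that assumption. The function name `likert_to_percent` and the example scores are made up for illustration and are not benchmark results.

```python
def likert_to_percent(scores, low=1, high=5):
    """Average Likert scores, then map [low, high] linearly onto [0, 100]."""
    mean_score = sum(scores) / len(scores)
    return 100 * (mean_score - low) / (high - low)


# Made-up judge scores for four questions (not real benchmark data).
example_scores = [5, 4, 3, 5]
print(likert_to_percent(example_scores))
```

Under this assumed mapping, a model scoring 3 out of 5 on every question would sit at 50%, not 60%, which matters when reading the chart.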

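As you can see, some open-source models do not perform well in powering agent workflows: while this was expected for the small Zephyr-7b, Llama2-70b performs surprisingly poorly.

👉 But Mixtral-8x7B performs really well: it even beats GPT-3.5! 🏆

And this is out-of-the-box performance: contrary to GPT-3.5, Mixtral was not fine-tuned for agent workflows (to our knowledge), which somewhat hinders its performance. For instance, on GAIA, 10% of questions fail because Mixtral tries to call a tool with incorrectly formatted arguments. With proper fine-tuning for function calling and task planning skills, Mixtral's score would likely be even higher.

➡️ We strongly recommend that open-source builders start fine-tuning Mixtral for agents, to surpass the next challenger: GPT-4! 🚀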

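Closing remarks

- The GAIA benchmark, although it is tried here on only a small subsample of questions and with a few tools, seems to be a very robust indicator of overall model performance for agent workflows, since it generally involves several reasoning steps and rigorous logic.
- Agent workflows allow LLMs to increase their performance: for instance, on GSM8K, GPT-4's technical report gives 92% accuracy for 5-shot CoT prompting, while giving it a calculator allows us to reach 95% in zero-shot. For Mixtral-8x7B, the LLM Leaderboard reports 57.6% with 5-shot, and we get 73% in zero-shot. (Keep in mind that we tested only 20 questions of GSM8K.)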
