呼叫許可：Transformers Agents 2.0 介紹

釋出於 2024 年 5 月 13 日

在 GitHub 上更新

贊

134

摘要

目錄

什麼是智慧體？

Transformers Agents 的方法
主要元素

示例用例
自校正檢索增強生成

使用簡單的多智慧體設定🤝實現高效網頁瀏覽

測試我們的智慧體
LLM 引擎基準測試

使用多模態智慧體攀登 GAIA 排行榜

結論

TL;DR

我們正在釋出 Transformers Agents 2.0！

⇒ 🎁 在我們現有的智慧體型別之上，我們引入了兩種新智慧體，它們**可以根據過去的觀察結果進行迭代以解決複雜任務**。

⇒ 💡 我們的目標是讓程式碼**清晰、模組化，並讓最終的提示和工具等通用屬性透明可見**。

⇒ 🤝 我們增加了**共享選項**，以促進社群智慧體的發展。

⇒ 💪 **極其高效能的新智慧體框架**，讓 Llama-3-70B-Instruct 智慧體在 GAIA 排行榜上超越了基於 GPT-4 的智慧體！

🚀 趕快嘗試一下，在 GAIA 排行榜上更上一層樓吧！

transformers.agents 現已升級為獨立庫 smolagents！這兩個庫的 API 非常相似，因此切換很容易。請檢視 smolagents 介紹部落格。

什麼是智慧體？

大型語言模型 (LLM) 可以處理各種任務，但它們在邏輯、計算和搜尋等特定任務上常常遇到困難。當在這些它們表現不佳的領域被提示時，它們經常無法生成正確的答案。

克服這一弱點的一種方法是建立一個**智慧體**，它只是一個由 LLM 驅動的程式。智慧體透過**工具**獲得能力，以幫助其執行操作。當智慧體需要特定技能來解決特定問題時，它會依賴其工具箱中合適的工具。

因此，當智慧體在解決問題時需要特定技能時，它只需依賴其工具箱中合適的工具即可。

實驗表明，智慧體框架通常表現非常好，在多項基準測試中取得了最先進的效能。例如，請檢視HumanEval 的頂級提交：它們都是智慧體系統。

Transformers Agents 的方法

構建智慧體工作流是複雜的，我們認為這些系統需要很高的清晰度和模組化。一年前我們釋出了 Transformers Agents，現在我們正在加倍努力實現我們的核心設計目標。

我們的框架力求

透過簡潔實現清晰：我們儘可能減少抽象。簡單的錯誤日誌和可訪問的屬性讓您可以輕鬆檢查正在發生的事情，並提供更高的清晰度。
模組化：我們傾向於提供構建塊，而不是完整、複雜的特徵集。您可以自由選擇最適合您專案的構建塊。
- 例如，由於任何智慧體系統都只是由 LLM 引擎驅動的載體，我們決定在概念上將兩者分離，這使您可以用任何底層 LLM 建立任何智慧體型別。

最重要的是，我們有**共享功能**，讓您可以站在巨人的肩膀上！

主要元素

Tool：這是讓您使用工具或實現新工具的類。它主要由一個可呼叫前向方法組成，該方法執行工具操作，以及一組幾個基本屬性：name、descriptions、inputs和output_type。這些屬性用於為工具動態生成使用手冊並將其插入到 LLM 的提示中。
Toolbox：它是一組提供給智慧體的工具，作為解決特定任務的資源。出於效能考慮，工具箱中的工具已經例項化並準備就緒。這是因為某些工具需要時間進行初始化，因此通常最好重用現有工具箱並只交換一個工具，而不是在每次智慧體初始化時從頭開始重新構建一組工具。
CodeAgent：一個非常簡單的智慧體，將其操作生成為單個 Python 程式碼塊。它無法根據先前的觀察進行迭代。
ReactAgent：ReAct 智慧體遵循思考 ⇒ 行動 ⇒ 觀察的迴圈，直到它們解決了任務。我們提供了兩類 ReactAgent
- ReactCodeAgent 將其動作生成為 python 程式碼塊。
- ReactJsonAgent 將其動作生成為 JSON 塊。

檢視文件以瞭解如何使用每個元件！

智慧體在底層是如何工作的？

本質上，智慧體的作用是“允許 LLM 使用工具”。智慧體有一個關鍵的agent.run()方法，它會：

以**特定提示**的形式向 LLM 提供工具使用資訊。這樣，LLM 就可以選擇要執行的工具來解決任務。
**解析** LLM 輸出中的工具呼叫（可以是程式碼、JSON 格式或任何其他格式）。
**執行**呼叫。
如果智慧體被設計為在先前輸出上進行迭代，它會**保留一個帶有先前工具呼叫和觀察的記憶體**。這個記憶體的粒度可以根據你希望它的長期性而或多或少地精細。

graph of agent workflows

有關智慧體的更多一般背景資訊，您可以閱讀 Lilian Weng 的這篇優秀部落格文章，或我們早期關於使用 LangChain 構建智慧體的部落格文章。

要更深入地瞭解我們的包，請檢視智慧體文件。

示例用例

為了能夠提前體驗此功能，請首先從其main分支安裝transformers

pip install "git+https://github.com/huggingface/transformers.git#egg=transformers[agents]"

Agents 2.0 將在 5 月中旬釋出的 v4.41.0 版本中釋出。

自校正檢索增強生成

快速定義：檢索增強生成 (RAG) 是指“使用 LLM 回答使用者查詢，但答案基於從知識庫中檢索到的資訊”。它比使用普通或微調的 LLM 有許多優點：舉幾個例子，它允許將答案基於真實事實並減少胡編亂造，它允許為 LLM 提供特定領域的知識，並且它允許對知識庫資訊的訪問進行細粒度控制。

假設我們要執行 RAG，並且一些引數必須動態生成。例如，根據使用者查詢，我們可能希望將搜尋限制在知識庫的特定子集，或者我們可能希望調整檢索文件的數量。困難在於：如何根據使用者查詢動態調整這些引數？

好吧，我們可以透過讓我們的智慧體訪問這些引數來做到這一點！

讓我們設定這個系統。

執行下面這行命令來安裝所需的依賴項

pip install langchain sentence-transformers faiss-cpu

我們首先載入一個知識庫，我們希望在該知識庫上執行 RAG：該資料集是許多huggingface包的文件頁面的編譯，以 markdown 格式儲存。

import datasets
knowledge_base = datasets.load_dataset("m-ric/huggingface_doc", split="train")

現在我們透過處理資料集並將其儲存到向量資料庫中來準備知識庫，以供檢索器使用。我們將使用 LangChain，因為它具有出色的向量資料庫實用程式

from langchain.docstore.document import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

source_docs = [
    Document(
        page_content=doc["text"], metadata={"source": doc["source"].split("/")[1]}
    ) for doc in knowledge_base
]

docs_processed = RecursiveCharacterTextSplitter(chunk_size=500).split_documents(source_docs)[:1000]

embedding_model = HuggingFaceEmbeddings("thenlper/gte-small")
vectordb = FAISS.from_documents(
    documents=docs_processed,
    embedding=embedding_model
)

現在資料庫已準備就緒，讓我們構建一個基於它回答使用者查詢的 RAG 系統！

我們希望我們的系統根據查詢只從最相關的資訊源中進行選擇。

我們的文件頁面來自以下來源

>>> all_sources = list(set([doc.metadata["source"] for doc in docs_processed]))
>>> print(all_sources)

['blog', 'optimum', 'datasets-server', 'datasets', 'transformers', 'course',
'gradio', 'diffusers', 'evaluate', 'deep-rl-class', 'peft',
'hf-endpoints-documentation', 'pytorch-image-models', 'hub-docs']

我們如何根據使用者查詢選擇相關來源？

👉 讓我們將 RAG 系統構建為一個智慧體，它可以自由選擇其來源！

我們建立一個檢索器工具，智慧體可以使用它呼叫其選擇的引數

import json
from transformers.agents import Tool
from langchain_core.vectorstores import VectorStore

class RetrieverTool(Tool):
    name = "retriever"
    description = "Retrieves some documents from the knowledge base that have the closest embeddings to the input query."
    inputs = {
        "query": {
            "type": "text",
            "description": "The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.",
        },
        "source": {
            "type": "text", 
            "description": ""
        },
    }
    output_type = "text"
    
    def __init__(self, vectordb: VectorStore, all_sources: str, **kwargs):
        super().__init__(**kwargs)
        self.vectordb = vectordb
        self.inputs["source"]["description"] = (
            f"The source of the documents to search, as a str representation of a list. Possible values in the list are: {all_sources}. If this argument is not provided, all sources will be searched."
          )

    def forward(self, query: str, source: str = None) -> str:
        assert isinstance(query, str), "Your search query must be a string"

        if source:
            if isinstance(source, str) and "[" not in str(source): # if the source is not representing a list
                source = [source]
            source = json.loads(str(source).replace("'", '"'))

        docs = self.vectordb.similarity_search(query, filter=({"source": source} if source else None), k=3)

        if len(docs) == 0:
            return "No documents found with this filtering. Try removing the source filter."
        return "Retrieved documents:\n\n" + "\n===Document===\n".join(
            [doc.page_content for doc in docs]
        )

現在，建立一個利用此工具的智慧體就很容易了！

智慧體在初始化時需要以下引數

tools：智慧體可以呼叫的工具列表。
llm_engine：為智慧體提供動力的 LLM。

我們的 llm_engine 必須是一個可呼叫物件，它以訊息列表為輸入並返回文字。它還需要接受一個 stop_sequences 引數，指示何時停止生成。為了方便起見，我們直接使用包中提供的 HfEngine 類來獲取一個呼叫我們 Inference API 的 LLM 引擎。

from transformers.agents import HfEngine, ReactJsonAgent

llm_engine = HfEngine("meta-llama/Meta-Llama-3-70B-Instruct")

agent = ReactJsonAgent(
    tools=[RetrieverTool(vectordb, all_sources)],
    llm_engine=llm_engine
)

agent_output = agent.run("Please show me a LORA finetuning script")

print("Final output:")
print(agent_output)

由於我們將智慧體初始化為ReactJsonAgent，它已自動獲得一個預設系統提示，該提示告訴LLM引擎逐步處理並將工具呼叫生成為JSON blob（您可以根據需要用您自己的提示模板替換此提示）。

然後，當其 .run() 方法啟動時，智慧體會負責呼叫 LLM 引擎、解析工具呼叫 JSON blob 並執行這些工具呼叫，所有這些都在一個迴圈中，只有當最終答案提供時才結束。

我們得到以下輸出

Calling tool: retriever with arguments: {'query': 'LORA finetuning script', 'source': "['transformers', 'datasets-server', 'datasets']"}
Calling tool: retriever with arguments: {'query': 'LORA finetuning script'}
Calling tool: retriever with arguments: {'query': 'LORA finetuning script example', 'source': "['transformers', 'datasets-server', 'datasets']"}
Calling tool: retriever with arguments: {'query': 'LORA finetuning script example'}
Calling tool: final_answer with arguments: {'answer': 'Here is an example of a LORA finetuning script: https://github.com/huggingface/diffusers/blob/dd9a5caf61f04d11c0fa9f3947b69ab0010c9a0f/examples/text_to_image/train_text_to_image_lora.py#L371'}

Final output:
Here is an example of a LORA finetuning script: https://github.com/huggingface/diffusers/blob/dd9a5caf61f04d11c0fa9f3947b69ab0010c9a0f/examples/text_to_image/train_text_to_image_lora.py#L371

我們可以看到自我糾正在起作用：智慧體首先嚐試限制來源，但由於缺乏相應的文件，它最終根本沒有限制來源。

我們可以透過檢查步驟 2 的日誌中的 LLM 輸出進行驗證：print(agent.logs[2]['llm_output'])

Thought: I'll try to retrieve some documents related to LORA finetuning scripts from the entire knowledge base, without any source filtering.

Action:
{
  "action": "retriever",
  "action_input": {"query": "LORA finetuning script"}
}

使用簡單的多智慧體設定🤝實現高效網頁瀏覽

在此示例中，我們希望構建一個智慧體並在 GAIA 基準測試（Mialon et al. 2023）上對其進行測試。GAIA 是一個極其困難的基準測試，大多數問題需要使用不同的工具進行多步推理。一個特別困難的要求是擁有一個強大的網頁瀏覽器，能夠導航到具有特定限制的頁面：使用網站的內部導航發現頁面，及時選擇特定文章……

網頁瀏覽需要深入子頁面並滾動瀏覽大量文字標記，這些標記對於更高層次的任務解決是不必要的。我們將網頁瀏覽子任務分配給一個專門的網頁瀏覽智慧體。我們為其提供了一些用於瀏覽網頁的工具和一個特定的提示（請檢視倉庫以查詢具體實現）。

定義這些工具超出了本篇帖子的範圍：但您可以檢視倉庫以查詢具體實現。

from transformers.agents import ReactJsonAgent, HfEngine

WEB_TOOLS = [
    SearchInformationTool(),
    NavigationalSearchTool(),
    VisitTool(),
    DownloadTool(),
    PageUpTool(),
    PageDownTool(),
    FinderTool(),
    FindNextTool(),
]

websurfer_llm_engine = HfEngine(
    model="CohereForAI/c4ai-command-r-plus"
)  # We choose Command-R+ for its high context length

websurfer_agent = ReactJsonAgent(
    tools=WEB_TOOLS,
    llm_engine=websurfer_llm_engine,
)

為了讓這個智慧體能夠被更高層次的任務解決智慧體呼叫，我們可以簡單地將其封裝在另一個工具中

class SearchTool(Tool):
    name = "ask_search_agent"
    description = "A search agent that will browse the internet to answer a question. Use it to gather informations, not for problem-solving."

    inputs = {
        "question": {
            "description": "Your question, as a natural language sentence. You are talking to an agent, so provide them with as much context as possible.",
            "type": "text",
        }
    }
    output_type = "text"

    def forward(self, question: str) -> str:
        return websurfer_agent.run(question)

然後我們用這個搜尋工具初始化任務解決智慧體

from transformers.agents import ReactCodeAgent

llm_engine = HfEngine(model="meta-llama/Meta-Llama-3-70B-Instruct")
react_agent_hf = ReactCodeAgent(
    tools=[SearchTool()],
    llm_engine=llm_engine,
)

讓我們用以下任務執行智慧體

使用由 Marisa Alviar-Agnew 和 Henry Agnew 根據 CK-12 許可證在 LibreText 的《入門化學材料》中編譯的（2023 年 8 月 21 日）密度測量資料。我有一加侖蜂蜜和一加侖蛋黃醬，溫度均為 25C。我每次從一加侖蜂蜜中取出“一杯”蜂蜜。我需要取出多少次才能讓蜂蜜的重量小於蛋黃醬？假設容器本身的重量相同。

Thought: I will use the 'ask_search_agent' tool to find the density of honey and mayonnaise at 25C.
==== Agent is executing the code below:
density_honey = ask_search_agent(question="What is the density of honey at 25C?")
print("Density of honey:", density_honey)
density_mayo = ask_search_agent(question="What is the density of mayonnaise at 25C?")
print("Density of mayo:", density_mayo)
===
Observation:
Density of honey: The density of honey is around 1.38-1.45kg/L at 20C. Although I couldn't find information specific to 25C, minor temperature differences are unlikely to affect the density that much, so it's likely to remain within this range.
Density of mayo: The density of mayonnaise at 25°C is 0.910 g/cm³.

===== New step =====
Thought: I will convert the density of mayonnaise from g/cm³ to kg/L and then calculate the initial weights of the honey and mayonnaise in a gallon. After that, I will calculate the weight of honey after removing one cup at a time until it weighs less than the mayonnaise.
==== Agent is executing the code below:
density_honey = 1.42 # taking the average of the range
density_mayo = 0.910 # converting g/cm³ to kg/L
density_mayo = density_mayo * 1000 / 1000 # conversion

gallon_to_liters = 3.785 # conversion factor
initial_honey_weight = density_honey * gallon_to_liters
initial_mayo_weight = density_mayo * gallon_to_liters

cup_to_liters = 0.236 # conversion factor
removed_honey_weight = cup_to_liters * density_honey
===
Observation:

===== New step =====
Thought: Now that I have the initial weights of honey and mayonnaise, I'll try to calculate the number of cups to remove from the honey to make it weigh less than the mayonnaise using a simple arithmetic operation.
==== Agent is executing the code below:
cups_removed = int((initial_honey_weight - initial_mayo_weight) / removed_honey_weight) + 1
print("Cups removed:", cups_removed)
final_answer(cups_removed)
===
>>> Final answer: 6

✅ 答案是**正確**的！

測試我們的智慧體

讓我們試用一下我們的智慧體框架，並用它來對不同的模型進行基準測試！

以下所有實驗程式碼都可以在這裡找到。

LLM 引擎基準測試

agents_reasoning_benchmark 是一個小型但強大的推理測試，用於評估智慧體的效能。此基準測試已在我們之前的部落格文章中更詳細地使用和解釋過。

核心思想是，您與智慧體一起使用的工具選擇會極大地改變某些任務的效能。因此，此基準測試將使用的工具集限制為計算器和基本搜尋工具。我們從幾個可以使用這兩種工具解決的資料集中挑選了問題

來自HotpotQA的 30 個問題（Yang et al., 2018），用於測試搜尋工具的使用。
來自GSM8K的 40 個問題（Cobbe et al., 2021），用於測試計算器工具的使用。
來自GAIA的 20 個問題（Mialon et al., 2023），用於測試兩種工具在解決難題時的使用。

這裡我們嘗試了 3 種不同的引擎：Mixtral-8x7B、Llama-3-70B-Instruct 和 GPT-4 Turbo。

benchmark of agent performances

結果如上所示——為了更精確，取兩次完整執行的平均值。我們還測試了Command-R+和Mixtral-8x22B，但為了清晰起見，未顯示它們。

⇒ Llama-3-70B-Instruct 在開源模型中處於領先地位：它與 GPT-4 不相上下，而且由於 Llama 3 強大的編碼效能，它在 ReactCodeAgent 中表現尤其出色！

💡 比較基於 JSON 和基於程式碼的 React 智慧體很有趣：對於 Mixtral-8x7B 等效能較低的 LLM 引擎，基於程式碼的智慧體表現不如 JSON，因為 LLM 引擎經常無法生成高質量的程式碼。但基於程式碼的版本在與更強大的模型作為引擎配合時表現出色：根據我們的經驗，基於程式碼的版本甚至在 Llama-3-70B-Instruct 上優於 JSON。因此，我們在下一個挑戰中使用基於程式碼的版本：在完整的 GAIA 基準測試中進行測試。

使用多模態智慧體攀登 GAIA 排行榜

GAIA（Mialon et al., 2023）是一個極其困難的基準測試：您可以在上面的agent_reasoning_benchmark中看到，即使我們挑選了可以用 2 個基本工具解決的任務，模型也無法達到 50% 以上的效能。

現在我們希望在完整資料集上獲得分數，我們不再挑選問題。因此，我們必須涵蓋所有模態，這促使我們使用這些特定工具

SearchTool：上面定義的網路瀏覽器。
TextInspectorTool：將文件作為文字檔案開啟並返回其內容。
SpeechToTextTool：將音訊檔案轉錄為文字。我們使用基於distil-whisper的預設工具。
VisualQATool：視覺分析影像。為此，我們使用了閃亮的新Idefics2-8b-chatty！

我們首先初始化這些工具（更多詳細資訊，請檢查倉庫中的程式碼）。

然後我們初始化我們的智慧體

from transformers.agents import ReactCodeAgent, HfEngine

TASK_SOLVING_TOOLBOX = [
    SearchTool(),
    VisualQATool(),
    SpeechToTextTool(),
    TextInspectorTool(),
]

react_agent_hf = ReactCodeAgent(
    tools=TASK_SOLVING_TOOLBOX,
    llm_engine=HfEngine(model="meta-llama/Meta-Llama-3-70B-Instruct"),
    memory_verbose=True,
)

在完成 165 個問題所需的時間之後，我們將結果提交到 GAIA 排行榜，然後……🥁🥁🥁

GAIA leaderboard

⇒ 我們的智慧體排名第四：它擊敗了許多基於 GPT-4 的智慧體，現在是開源類別中的衛冕者！

結論

我們將在未來幾個月內繼續改進此軟體包。我們已經確定了開發路線圖中幾個令人興奮的方向

更多智慧體共享選項：目前您可以從 Hub 推送或載入工具，我們也將實現推送/載入智慧體。
更好的工具，特別是影像處理工具。
長期記憶管理。
多智慧體協作。

👉 去試試 Transformers Agents 吧！我們期待收到您的反饋和想法。

讓我們用更多的開源模型填滿排行榜頂部！🚀

transformers.agents 現已升級為獨立庫 smolagents！這兩個庫的 API 非常相似，因此切換很容易。請檢視 smolagents 介紹部落格。

更多部落格文章

作為LangChain Agents的開源LLM

作者： 2024年1月24日 • 69

CodeAgents + Structure: 一種更好的執行操作的方式

作者： 2025年5月28日 • 71

社群

透過拖放到文字輸入框、貼上或點選此處上傳圖片、音訊和影片。

點選或貼上此處以上傳圖片

· 註冊或登入發表評論

贊

134