我們剛剛賦予了smolagents視覺能力

釋出於 2025年1月24日

在 GitHub 上更新

贊

109

Aymeric Roucher

m-ric

merve

Albert Villanova del Moral

albertvillanova

摘要

目錄

概覽

我們如何賦予smolagents視覺能力
如何建立具有視覺功能的網頁瀏覽智慧體

執行智慧體

下一步

你這假冒為善的人，先把你眼中的梁木取出來，然後你才能看得清楚，才能取出你弟兄眼中的刺。馬太福音 7, 3-5

TL;DR

我們已經為smolagents添加了視覺支援，這使得在智慧體管道中原生使用視覺語言模型成為可能。

概述

在智慧體世界中，許多能力都被視覺之牆所阻礙。一個常見的例子是網頁瀏覽：網頁具有豐富的視覺內容，僅僅提取文字永遠無法完全恢復，無論是物件的相對位置、透過顏色傳遞的資訊還是特定的圖示……在這種情況下，視覺是智慧體真正的超能力。所以我們剛剛將這項能力新增到了我們的smolagents中！

這帶來了什麼？一個能完全自主瀏覽網頁的智慧體瀏覽器！

以下是它看起來的樣子

我們如何賦予smolagents視覺能力

🤔 我們如何將影像傳遞給智慧體？傳遞影像有兩種方式

你可以在智慧體啟動時直接獲取影像。文件AI通常就是這種情況。
有時，影像需要動態新增。一個很好的例子是當網頁瀏覽器剛剛執行了一個動作，需要檢視其視口上的影響時。

1. 在智慧體啟動時一次性傳遞影像

對於需要一次性傳遞影像的情況，我們添加了在 `run` 方法中向智慧體傳遞影像列表的功能：`agent.run("描述這些影像：", images=[image_1, image_2])`。

這些影像輸入隨後與您希望完成的任務提示一起儲存在 `TaskStep` 的 `task_images` 屬性中。

當執行智慧體時，它們將被傳遞給模型。這在根據包含視覺元素的冗長 PDF 執行操作等情況下非常方便。

2. 在每個步驟中傳遞影像 ⇒ 使用回撥

如何將影像動態新增到智慧體的記憶體中？

為了弄清楚這一點，我們首先需要了解我們的智慧體是如何工作的。

`smolagents` 中的所有智慧體都基於單一的 `MultiStepAgent` 類，它是 ReAct 框架的抽象。在基本層面上，該類按照以下步驟迴圈執行操作，其中現有變數和知識被整合到智慧體日誌中：

初始化：系統提示儲存在 `SystemPromptStep` 中，使用者查詢記錄在 `TaskStep` 中。
ReAct 迴圈（迴圈）
1. 使用 `agent.write_inner_memory_from_logs()` 將代理日誌寫入一個可供 LLM 閱讀的聊天訊息列表。
2. 將這些訊息傳送到 `Model` 物件以獲取其完成。解析完成以獲取操作（`ToolCallingAgent` 的 JSON blob，`CodeAgent` 的程式碼片段）。
3. 執行動作並將結果記錄到記憶體中（一個 `ActionStep`）。
4. 在每個步驟結束時，執行 `agent.step_callbacks` 中定義的所有回撥函式。⇒ 這就是我們新增影像支援的地方：建立一個將影像記錄到記憶體中的回撥！

下圖詳細說明了這一過程

如您所見，對於動態檢索影像的用例（例如，網路瀏覽器代理），我們支援將影像新增到模型的 `ActionStep` 中，位於 `step_log.observation_images` 屬性中。

這可以透過回撥函式實現，該函式將在每個步驟結束時執行。

讓我們演示如何製作這樣一個回撥，並用它來構建一個網頁瀏覽器智慧體。👇👇

如何建立具有視覺功能的網頁瀏覽智慧體

我們將使用 helium。它提供了基於 `selenium` 的瀏覽器自動化功能：這將使我們的智慧體更容易操作網頁。

pip install "smolagents[all]" helium selenium python-dotenv

智慧體本身可以直接使用helium，因此不需要特定的工具：它可以直接使用helium執行操作，例如點選頁面上可見的名為“top 10”的按鈕。我們仍然需要製作一些工具來幫助智慧體瀏覽網頁：一個返回上一頁的工具，另一個關閉彈出視窗的工具，因為這些彈出視窗由於其關閉按鈕上沒有文字，所以對於`helium`來說很難抓取。

from io import BytesIO
from time import sleep

import helium
from dotenv import load_dotenv
from PIL import Image
from selenium import webdriver
from selenium.common.exceptions import ElementNotInteractableException, TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

from smolagents import CodeAgent, LiteLLMModel, OpenAIServerModel, TransformersModel, tool
from smolagents.agents import ActionStep


load_dotenv()
import os

@tool
def search_item_ctrl_f(text: str, nth_result: int = 1) -> str:
    """
    Searches for text on the current page via Ctrl + F and jumps to the nth occurrence.
    Args:
        text: The text to search for
        nth_result: Which occurrence to jump to (default: 1)
    """
    elements = driver.find_elements(By.XPATH, f"//*[contains(text(), '{text}')]")
    if nth_result > len(elements):
        raise Exception(f"Match n°{nth_result} not found (only {len(elements)} matches found)")
    result = f"Found {len(elements)} matches for '{text}'."
    elem = elements[nth_result - 1]
    driver.execute_script("arguments[0].scrollIntoView(true);", elem)
    result += f"Focused on element {nth_result} of {len(elements)}"
    return result

@tool
def go_back() -> None:
    """Goes back to previous page."""
    driver.back()

@tool
def close_popups() -> str:
    """
    Closes any visible modal or pop-up on the page. Use this to dismiss pop-up windows! This does not work on cookie consent banners.
    """
    # Common selectors for modal close buttons and overlay elements
    modal_selectors = [
        "button[class*='close']",
        "[class*='modal']",
        "[class*='modal'] button",
        "[class*='CloseButton']",
        "[aria-label*='close']",
        ".modal-close",
        ".close-modal",
        ".modal .close",
        ".modal-backdrop",
        ".modal-overlay",
        "[class*='overlay']"
    ]

    wait = WebDriverWait(driver, timeout=0.5)

    for selector in modal_selectors:
        try:
            elements = wait.until(
                EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
            )

            for element in elements:
                if element.is_displayed():
                    try:
                        # Try clicking with JavaScript as it's more reliable
                        driver.execute_script("arguments[0].click();", element)
                    except ElementNotInteractableException:
                        # If JavaScript click fails, try regular click
                        element.click()

        except TimeoutException:
            continue
        except Exception as e:
            print(f"Error handling selector {selector}: {str(e)}")
            continue
    return "Modals closed"

目前，智慧體沒有視覺輸入。因此，讓我們演示如何透過使用回撥函式，將其影像動態地饋送到其步驟日誌中。我們建立一個回撥函式 `save_screenshot`，它將在每個步驟結束時執行。

def save_screenshot(step_log: ActionStep, agent: CodeAgent) -> None:
    sleep(1.0)  # Let JavaScript animations happen before taking the screenshot
    driver = helium.get_driver()
    current_step = step_log.step_number
    if driver is not None:
        for step_logs in agent.logs:  # Remove previous screenshots from logs for lean processing
            if isinstance(step_log, ActionStep) and step_log.step_number <= current_step - 2:
                step_logs.observations_images = None
        png_bytes = driver.get_screenshot_as_png()
        image = Image.open(BytesIO(png_bytes))
        print(f"Captured a browser screenshot: {image.size} pixels")
        step_log.observations_images = [image.copy()]  # Create a copy to ensure it persists, important!

    # Update observations with current URL
    url_info = f"Current url: {driver.current_url}"
    step_log.observations = url_info if step_logs.observations is None else step_log.observations + "\n" + url_info
    return

這裡最重要的一行是我們新增影像到觀測影像中：`step_log.observations_images = [image.copy()]`。

此回撥接受 `step_log` 和 `agent` 本身作為引數。將 `agent` 作為輸入允許執行比僅僅修改最後日誌更深層的操作。

我們來建立一個模型。我們已經在所有模型中添加了對影像的支援。需要精確的一點是：在使用帶有 VLM 的 TransformersModel 時，為了使其正常工作，您需要在初始化時將 `flatten_messages_as_text` 設定為 `False`，例如

model = TransformersModel(model_id="HuggingFaceTB/SmolVLM-Instruct", device_map="auto", flatten_messages_as_text=False)

對於這個演示，我們使用Fireworks API中更大的Qwen2VL。

model = OpenAIServerModel(
    api_key=os.getenv("FIREWORKS_API_KEY"),
    api_base="https://api.fireworks.ai/inference/v1",
    model_id="accounts/fireworks/models/qwen2-vl-72b-instruct",
)

現在，讓我們繼續定義我們的代理。我們將 `verbosity_level` 設定為最高，以顯示 LLM 的完整輸出訊息，從而檢視其思考過程；並將 `max_steps` 增加到 20，以給代理更多步驟來探索網路。我們還為它提供了上面定義的 `save_screenshot` 回撥。

agent = CodeAgent(
    tools=[go_back, close_popups, search_item_ctrl_f],
    model=model,
    additional_authorized_imports=["helium"],
    step_callbacks = [save_screenshot],
    max_steps=20,
    verbosity_level=2
)

最後，我們向代理提供了一些關於使用 helium 的指導。

helium_instructions = """
You can use helium to access websites. Don't bother about the helium driver, it's already managed.
First you need to import everything from helium, then you can do other actions!
Code:
```py
from helium import *
go_to('github.com/trending')
```<end_code>

You can directly click clickable elements by inputting the text that appears on them.
Code:
```py
click("Top products")
```<end_code>

If it's a link:
Code:
```py
click(Link("Top products"))
```<end_code>

If you try to interact with an element and it's not found, you'll get a LookupError.
In general stop your action after each button click to see what happens on your screenshot.
Never try to login in a page.

To scroll up or down, use scroll_down or scroll_up with as an argument the number of pixels to scroll from.
Code:
```py
scroll_down(num_pixels=1200) # This will scroll one viewport down
```<end_code>

When you have pop-ups with a cross icon to close, don't try to click the close icon by finding its element or targeting an 'X' element (this most often fails).
Just use your built-in tool `close_popups` to close them:
Code:
```py
close_popups()
```<end_code>

You can use .exists() to check for the existence of an element. For example:
Code:
```py
if Text('Accept cookies?').exists():
    click('I accept')
```<end_code>

Proceed in several steps rather than trying to solve the task in one shot.
And at the end, only when you have your answer, return your final answer.
Code:
```py
final_answer("YOUR_ANSWER_HERE")
```<end_code>

If pages seem stuck on loading, you might have to wait, for instance `import time` and run `time.sleep(5.0)`. But don't overuse this!
To list elements on page, DO NOT try code-based element searches like 'contributors = find_all(S("ol > li"))': just look at the latest screenshot you have and read it visually, or use your tool search_item_ctrl_f.
Of course, you can act on buttons like a user would do when navigating.
After each code blob you write, you will be automatically provided with an updated screenshot of the browser and the current browser url.
But beware that the screenshot will only be taken at the end of the whole action, it won't see intermediate states.
Don't kill the browser.
"""

執行智慧體

現在一切就緒：讓我們執行我們的智慧體！

github_request = """
I'm trying to find how hard I have to work to get a repo in github.com/trending.
Can you navigate to the profile for the top author of the top trending repo, and give me their total number of commits over the last year?
"""

agent.run(github_request + helium_instructions)

然而，請注意，這項任務非常困難：根據您使用的 VLM，這可能不總是奏效。像 Qwen2VL-72B 或 GPT-4o 這樣強大的 VLM 通常更容易成功。

後續步驟

這將讓您瞥見支援視覺功能的 `CodeAgent` 的強大能力，但還有更多工作要做！

您可以在此處開始使用智慧體網頁瀏覽器。
在我們的公告部落格文章中閱讀更多關於 smolagents 的資訊。
閱讀 smolagents 文件。

我們期待看到您使用視覺語言模型和 smolagents 構建出什麼樣的作品！

更多部落格文章

Gemma 3n 在開源生態系統中完全可用！

由 2025年6月26日 • 114

ScreenSuite - 最全面的 GUI 智慧體評估套件！

由 2025年6月6日 • 53

社群

jkorstad

1月24日

注意：關於模態關閉選擇器，請注意模態框也稱為對話方塊元素，如果構建得健壯，它們應該具有 role="dialog" 屬性，這可以在識別這些彈出視窗時進行搜尋。

此外，任何對話方塊/模態視窗都應該可以透過 Esc 鍵關閉！

希望這有助於更廣泛地識別模態框/對話方塊和/或幫助更輕鬆地關閉它們！

上面相關程式碼塊以供參考 :)



@tool
	
def close_popups() -> str:
    """
    Closes any visible modal or pop-up on the page. Use this to dismiss pop-up windows! This does not work on cookie consent banners.
    """
    # Common selectors for modal close buttons and overlay elements
    modal_selectors = [
        "button[class*='close']",
        "[class*='modal']",
        "[class*='modal'] button",
        "[class*='CloseButton']",
        "[aria-label*='close']",
        ".modal-close",
        ".close-modal",
        ".modal .close",
        ".modal-backdrop",
        ".modal-overlay",
        "[class*='overlay']"
    ]

m-ric

文章作者 1月28日

哦，這是一個很好的觀點，非常感謝！這很有用！

earnliners

1月25日

如果你有能力做到，那麼
救命！救命！救命！ https://arxiv.org/auth/endorse?x=VMBF4S

nandughatge

1月27日

接受 cookie 彈窗。

@工具
def accept_cookie_popup() -> str
"""
接受任何可見的 cookie 同意橫幅。
"""
wait = WebDriverWait(driver, timeout=0.5)
elements = wait.until(EC.presence_of_all_elements_located((By.ID, "onetrust-accept-btn-handler")))
elements[0].click()

代理 = CodeAgent(
tools=[go_back, close_popups, search_item_ctrl_f, close_cookie_popup],
model=model,
additional_authorized_imports=["helium"],
step_callbacks=[save_screenshot],
max_steps=20,
verbosity_level=2,
)