Vision Agents with smolagents

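The examples in this section require access to a powerful VLM. We tested them with the GPT-4o API. The "Why use smolagents" section discusses alternative solutions supported by smolagents and Hugging Face; if you would like to explore other options, be sure to check that section.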

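Giving agents visual capabilities is essential for solving tasks that go beyond text processing. Many real-world challenges, such as web browsing or document understanding, require analyzing rich visual content. Fortunately, smolagents provides built-in support for vision-language models (VLMs), enabling agents to process and interpret images effectively.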

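In this example, imagine that Alfred, the butler at Wayne Manor, is tasked with verifying the identities of the guests attending the party. As you can imagine, Alfred may not be familiar with everyone who arrives. To help him, we can use an agent that verifies a guest's identity by searching for visual information about their appearance with a VLM. This will allow Alfred to make informed decisions about who can enter. Let's build this example!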

Providing Images at the Start of the Agent's Execution

You can follow the code in this notebook, which you can run using Google Colab.

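In this approach, images are passed to the agent at the start of the run and stored as task_images alongside the task prompt. The agent then processes these images throughout its execution.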

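Consider the case where Alfred wants to verify the identities of the superheroes attending the party. He already has a dataset of images from previous parties, labeled with the guests' names. Given a new visitor's image, the agent can compare it with the existing dataset and decide whether to let them in.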

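In this case, a guest is trying to enter, and Alfred suspects that the visitor might be The Joker impersonating Wonder Woman. Alfred needs to verify the visitor's identity to keep anyone unwanted from getting in.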

Let's build the example. First, the images are loaded. Here we use images from Wikipedia to keep the example minimal, but imagine the possible use cases!

from io import BytesIO

import requests
from PIL import Image

image_urls = [
    "https://upload.wikimedia.org/wikipedia/commons/e/e8/The_Joker_at_Wax_Museum_Plus.jpg",  # Joker image
    "https://upload.wikimedia.org/wikipedia/en/9/98/Joker_%28DC_Comics_character%29.jpg",  # Joker image
]

# Send a browser-like User-Agent; some servers reject requests without one
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"
}

images = []
for url in image_urls:
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # Fail early on HTTP errors instead of handing an error page to PIL
    image = Image.open(BytesIO(response.content)).convert("RGB")
    images.append(image)
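If a download goes wrong, the server may return an HTML error page rather than image data, and the image decoder then fails with an error that is hard to interpret. The sketch below is one possible pre-check and is not part of the original notebook: the helper name sniff_image_format is hypothetical, and it only recognizes the JPEG and PNG file signatures.

```python
from typing import Optional

# File signatures ("magic numbers") for the two formats this sketch recognizes.
_SIGNATURES = {
    "jpeg": b"\xff\xd8\xff",
    "png": b"\x89PNG\r\n\x1a\n",
}


def sniff_image_format(data: bytes) -> Optional[str]:
    """Return "jpeg" or "png" if data starts with that signature, else None."""
    for name, signature in _SIGNATURES.items():
        if data.startswith(signature):
            return name
    return None
```

Calling it on response.content before Image.open would let the loop skip or report a bad download instead of crashing inside the decoder.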

Now that we have the images, the agent will tell us whether a guest is actually a superhero (Wonder Woman) or a villain (The Joker).

from smolagents import CodeAgent, OpenAIServerModel

model = OpenAIServerModel(model_id="gpt-4o")

# Instantiate the agent
agent = CodeAgent(
    tools=[],
    model=model,
    max_steps=20,
    verbosity_level=2
)

response = agent.run(
    """
    Describe the costume and makeup that the comic character in these photos is wearing and return the description.
    Tell me if the guest is The Joker or Wonder Woman.
    """,
    images=images
)

In my run, the output was the following, although it may differ in yours, as we have already discussed:

    {
        'Costume and Makeup - First Image': (
            'Purple coat and a purple silk-like cravat or tie over a mustard-yellow shirt.',
            'White face paint with exaggerated features, dark eyebrows, blue eye makeup, red lips forming a wide smile.'
        ),
        'Costume and Makeup - Second Image': (
            'Dark suit with a flower on the lapel, holding a playing card.',
            'Pale skin, green hair, very red lips with an exaggerated grin.'
        ),
        'Character Identity': 'This character resembles known depictions of The Joker from comic book media.'
    }

In this case, the output reveals that the person is impersonating someone else, so we can prevent The Joker from entering the party!
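To turn such a result into a door decision, Alfred needs a small rule on top of the agent's answer. The following is a minimal sketch, assuming the agent returns a dictionary with a 'Character Identity' key as in the sample output; the agent's output format is not guaranteed, and the function name is_joker is hypothetical.

```python
def is_joker(result: dict) -> bool:
    """Return True if the 'Character Identity' entry mentions the Joker."""
    identity = str(result.get("Character Identity", ""))
    return "joker" in identity.lower()


# Sample value copied from the output shown in the text.
sample = {
    "Character Identity": "This character resembles known depictions of The Joker from comic book media."
}
print("Deny entry" if is_joker(sample) else "Allow entry")  # prints "Deny entry"
```

A plain substring check like this is fragile, so a real deployment would more likely ask the agent for a fixed, structured verdict.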

Providing Images with Dynamic Retrieval

You can follow the code in this Python file.

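The previous approach is valuable and has many potential use cases. However, when a guest is not in the database, we need other ways of identifying them. One possible solution is to retrieve images and information dynamically from external sources, for example by browsing the web for details.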

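In this approach, images are added to the agent's memory dynamically during execution. As we know, agents in smolagents are based on the MultiStepAgent class, which is an abstraction of the ReAct framework. This class operates in a structured loop in which various variables and pieces of knowledge are logged at different stages:

SystemPromptStep: Stores the system prompt.
TaskStep: Logs the user query and any provided input.
ActionStep: Captures logs from the agent's actions and their results.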

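This structured approach allows agents to incorporate visual information dynamically and adapt to evolving tasks. Below is the diagram we have already seen, illustrating the dynamic workflow and how the different steps fit into the agent lifecycle. While browsing, the agent can take screenshots and save them as observations_images in the ActionStep.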

[Figure: Dynamic image retrieval]
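The three step types described above can be pictured as plain records that accumulate in memory. The sketch below uses simplified stand-ins, not the real smolagents classes (which carry more fields); only the names task_images and observations_images come from the text, and count_images is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class SystemPromptStep:
    system_prompt: str


@dataclass
class TaskStep:
    task: str
    task_images: Optional[List[Any]] = None  # images supplied at the start of the run


@dataclass
class ActionStep:
    step_number: int
    observations: Optional[str] = None
    observations_images: Optional[List[Any]] = None  # e.g. screenshots taken while browsing


def count_images(steps: List[Any]) -> int:
    """Count every image currently held in the recorded steps."""
    total = 0
    for step in steps:
        images = getattr(step, "task_images", None) or getattr(step, "observations_images", None)
        total += len(images or [])
    return total


memory = [
    SystemPromptStep(system_prompt="You are a helpful agent."),
    TaskStep(task="Verify the guest.", task_images=["guest.jpg"]),
    ActionStep(step_number=1, observations="Opened the search page.", observations_images=["screenshot_1.png"]),
]
print(count_images(memory))  # prints 2
```

Images given up front live on the task record, while images gathered during the run live on the action records, which is what makes dynamic retrieval possible.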

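Now that we understand the need, let's build the complete example. In this case, Alfred wants full control over the guest verification process, so browsing for details becomes a viable solution. To complete this example, we need a new set of tools for the agent. We will also use Selenium and Helium, which are browser automation tools. This will let us build an agent that explores the web, searches for details about a potential guest, and retrieves verification information. Let's install the required tools: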

pip install "smolagents[all]" helium selenium python-dotenv
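The install list includes python-dotenv, which suggests that credentials are kept in a .env file. As a hedged sketch, the check below reports required settings that are missing before the agent is started. The variable name OPENAI_API_KEY is an assumption (the text does not name it), and missing_settings is a hypothetical helper.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_settings(required: Iterable[str], environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names in `required` that are unset or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# OPENAI_API_KEY is an assumed variable name; adjust it to your provider.
missing = missing_settings(["OPENAI_API_KEY"])
if missing:
    print("Set these variables before running the agent:", ", ".join(missing))
```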

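We will need a set of agent tools designed specifically for browsing, such as search_item_ctrl_f, go_back, and close_popups. These tools allow the agent to navigate the web the way a person would.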

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from smolagents import tool

# `driver` is assumed to be a Selenium WebDriver started earlier in the script,
# for example with: driver = helium.start_chrome(headless=False)


@tool
def search_item_ctrl_f(text: str, nth_result: int = 1) -> str:
    """
    Searches for text on the current page via Ctrl + F and jumps to the nth occurrence.
    Args:
        text: The text to search for
        nth_result: Which occurrence to jump to (default: 1)
    """
    # Note: a single quote inside `text` would break this XPath expression
    elements = driver.find_elements(By.XPATH, f"//*[contains(text(), '{text}')]")
    if nth_result > len(elements):
        raise Exception(f"Match n°{nth_result} not found (only {len(elements)} matches found)")
    result = f"Found {len(elements)} matches for '{text}'. "
    elem = elements[nth_result - 1]
    driver.execute_script("arguments[0].scrollIntoView(true);", elem)
    result += f"Focused on element {nth_result} of {len(elements)}"
    return result


@tool
def go_back() -> None:
    """Goes back to previous page."""
    driver.back()


@tool
def close_popups() -> str:
    """
    Closes any visible modal or pop-up on the page. Use this to dismiss pop-up windows! This does not work on cookie consent banners.
    """
    webdriver.ActionChains(driver).send_keys(Keys.ESCAPE).perform()
    return "Sent Escape to dismiss any open pop-up."  # Match the declared str return type
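The search tool above places the search text directly inside single quotes in an XPath expression, so a query such as Alfred's list would produce an invalid expression. The helper below is one possible way to quote arbitrary text as an XPath 1.0 string literal; the name xpath_literal is hypothetical and the helper is not part of the original code.

```python
def xpath_literal(text: str) -> str:
    """Quote text as an XPath 1.0 string literal, whatever quotes it contains."""
    if "'" not in text:
        return f"'{text}'"
    if '"' not in text:
        return f'"{text}"'
    # Both quote types are present: split on single quotes and glue with concat().
    parts = text.split("'")
    return "concat(" + ", \"'\", ".join(f"'{part}'" for part in parts) + ")"


print(xpath_literal("Wonder Woman"))   # prints 'Wonder Woman'
print(xpath_literal("Alfred's list"))  # prints "Alfred's list"
```

Inside the tool, the expression could then be built as f"//*[contains(text(), {xpath_literal(text)})]" in place of the hand-written quotes.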

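We also need a way to save screenshots, since they are an essential part of how our VLM agent completes the task. The function below captures a screenshot and stores it in step_log.observations_images, which lets the agent store and process images dynamically as it navigates.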

from io import BytesIO
from time import sleep

import helium
from PIL import Image
from smolagents import CodeAgent
from smolagents.agents import ActionStep


def save_screenshot(step_log: ActionStep, agent: CodeAgent) -> None:
    sleep(1.0)  # Let JavaScript animations happen before taking the screenshot
    driver = helium.get_driver()
    if driver is None:
        return  # No browser session yet, so there is nothing to capture
    current_step = step_log.step_number
    # Remove screenshots older than the previous step to keep the logs lean.
    # agent.logs follows the original text; newer smolagents releases expose these steps as agent.memory.steps.
    for previous_step in agent.logs:
        if isinstance(previous_step, ActionStep) and previous_step.step_number <= current_step - 2:
            previous_step.observations_images = None
    png_bytes = driver.get_screenshot_as_png()
    image = Image.open(BytesIO(png_bytes))
    print(f"Captured a browser screenshot: {image.size} pixels")
    step_log.observations_images = [image.copy()]  # Create a copy to ensure it persists, important!

    # Update observations with current URL
    url_info = f"Current url: {driver.current_url}"
    step_log.observations = url_info if step_log.observations is None else step_log.observations + "\n" + url_info

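This function is passed to the agent as a step callback, so it is triggered at the end of every step of the agent's execution. This allows the agent to capture and store screenshots dynamically throughout the process.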
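The retention rule used by the callback (keep the screenshots of the current and previous step, drop anything older) can be tried without a browser. The toy loop below is a sketch under stated assumptions: Step is a simplified stand-in for ActionStep, run_steps is a hypothetical stand-in for the agent loop, and the "screenshots" are just file names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    """Simplified stand-in for ActionStep; the real class has more fields."""
    step_number: int
    observations_images: Optional[List[str]] = None


def keep_recent_screenshots(current: Step, history: List[Step]) -> None:
    """Attach a new 'screenshot' and drop those older than the previous step."""
    for previous in history:
        if previous.step_number <= current.step_number - 2:
            previous.observations_images = None
    current.observations_images = [f"screenshot_{current.step_number}.png"]


def run_steps(n_steps: int, callbacks: List[Callable[[Step, List[Step]], None]]) -> List[Step]:
    """Toy loop: record each step, then fire every callback once the step ends."""
    history: List[Step] = []
    for number in range(1, n_steps + 1):
        step = Step(step_number=number)
        history.append(step)
        for callback in callbacks:
            callback(step, history)
    return history


history = run_steps(4, [keep_recent_screenshots])
print([step.observations_images is not None for step in history])  # prints [False, False, True, True]
```

Keeping only the two most recent screenshots bounds how much image data is sent to the model on each step, at the cost of losing older visual context.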

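Now we can create our vision agent for browsing the web, giving it the tools we created along with the DuckDuckGoSearchTool for exploring the web. This tool helps the agent retrieve the information it needs to verify a guest's identity based on visual cues.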

from smolagents import CodeAgent, OpenAIServerModel, DuckDuckGoSearchTool
model = OpenAIServerModel(model_id="gpt-4o")

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool(), go_back, close_popups, search_item_ctrl_f],
    model=model,
    additional_authorized_imports=["helium"],
    step_callbacks=[save_screenshot],
    max_steps=20,
    verbosity_level=2,
)

With that, Alfred is ready to check the guests' identities and make informed decisions about whether to let them into the party:

agent.run("""
I am Alfred, the butler of Wayne Manor, responsible for verifying the identity of guests at the party. A superhero has arrived at the entrance claiming to be Wonder Woman, but I need to confirm if she is who she says she is.

Please search for images of Wonder Woman and generate a detailed visual description based on those images. Additionally, navigate to Wikipedia to gather key details about her appearance. With this information, I can determine whether to grant her access to the event.
""" + helium_instructions)

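You can see that we include helium_instructions as part of the task. This special prompt is meant to control the agent's navigation, ensuring that it follows the correct steps while browsing the web.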

Let's see how this works in the video below:

This is the final output:

Final answer: Wonder Woman is typically depicted wearing a red and gold bustier, blue shorts or skirt with white stars, a golden tiara, silver bracelets, and a golden Lasso of Truth. She is Princess Diana of Themyscira, known as Diana Prince in the world of men.

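With all of that, we have successfully created an identity verifier for the party! Alfred now has the tools he needs to make sure only the right guests get through the door. Everything is set for a good time at Wayne Manor!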
