ScreenEnv: Deploy Your Full-Stack Desktop Agent
TL;DR: ScreenEnv is a powerful Python library for creating isolated Ubuntu desktop environments inside Docker containers, built for testing and deploying GUI agents (also known as computer-use agents). With built-in support for the Model Context Protocol (MCP), it has never been easier to deploy desktop agents that can see, click, and interact with real applications.
What is ScreenEnv?
Imagine you need to automate desktop tasks, test GUI applications, or build an AI agent that can interact with software. In the past, this meant complex virtual machine setups and fragile automation frameworks.
ScreenEnv changes that by providing a **sandboxed desktop environment** that runs inside a Docker container. Think of it as a complete virtual desktop session under your code's full control: not just clicking buttons and typing text, but managing the entire desktop experience, including launching applications, organizing windows, handling files, running terminal commands, and recording the whole session.
Why ScreenEnv?
- 🖥️ Full desktop control: complete mouse and keyboard automation, window management, application launching, file operations, terminal access, and screen recording
- 🤖 Dual integration modes: supports both the Model Context Protocol (MCP) for AI systems and a direct sandbox API, so it adapts to any agent or backend logic
- 🐳 Docker-native: no complex VM setup, just Docker. Environments are isolated, reproducible, and can be deployed anywhere in under 10 seconds. Both AMD64 and ARM64 architectures are supported.
🎯 One-Line Setup
```python
from screenenv import Sandbox

sandbox = Sandbox()  # That's it!
```
Two Integration Methods
ScreenEnv offers **two complementary ways** to integrate with your agents and backend systems, giving you the flexibility to pick whichever approach best fits your architecture.
Option 1: Direct Sandbox API
Perfect for custom agent frameworks, existing backends, or whenever you need fine-grained control.
```python
from io import BytesIO

from PIL import Image
from screenenv import Sandbox

# Direct programmatic control
sandbox = Sandbox(headless=False)
sandbox.launch("xfce4-terminal")
sandbox.write("echo 'Custom agent logic'")
screenshot_bytes = sandbox.screenshot()
image = Image.open(BytesIO(screenshot_bytes))
...
sandbox.close()
# If close() isn't called, you might need to shut down the container yourself.
```
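Since the sandbox is a real Docker container, it is worth guaranteeing teardown even when your agent logic raises. Here is a minimal sketch that uses only the calls shown above plus a plain `try/finally`; the `headless=True` flag and the output filename are just example choices, not requirements:

```python
from io import BytesIO

from PIL import Image
from screenenv import Sandbox

sandbox = Sandbox(headless=True)  # no visible window needed for CI-style runs
try:
    sandbox.launch("xfce4-terminal")
    sandbox.write("echo 'Custom agent logic'")
    screenshot_bytes = sandbox.screenshot()
    # Keep a trace of the step on disk for debugging or evaluation
    Image.open(BytesIO(screenshot_bytes)).save("step_01.png")
finally:
    sandbox.close()  # always stop and clean up the container
```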
Option 2: MCP Server Integration
Perfect for AI systems that support the Model Context Protocol.
```python
import asyncio
import base64
from io import BytesIO

from PIL import Image
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

from screenenv import MCPRemoteServer

# Start MCP server for AI integration
server = MCPRemoteServer(headless=False)
print(f"MCP Server URL: {server.server_url}")

# AI agents can now connect and control the desktop
async def mcp_session():
    async with streamablehttp_client(server.server_url) as (read_stream, write_stream, _):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            print(await session.list_tools())
            response = await session.call_tool("screenshot", {})
            image_bytes = base64.b64decode(response.content[0].data)
            image = Image.open(BytesIO(image_bytes))

asyncio.run(mcp_session())

server.close()
# If close() isn't called, you might need to shut down the container yourself.
```
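Before wiring an agent to the server, it can help to inspect exactly which tools it exposes and what arguments they expect. The sketch below uses only standard `mcp` client objects (`list_tools()` and the `name`/`description`/`inputSchema` fields of each tool); the helper name is illustrative, not part of screenenv:

```python
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client


async def list_desktop_tools(server_url: str) -> None:
    """Print every tool the MCP server exposes, with its argument schema."""
    async with streamablehttp_client(server_url) as (read_stream, write_stream, _):
        async with ClientSession(read_stream, write_stream) as session:
            await session.initialize()
            result = await session.list_tools()
            for t in result.tools:
                # Each MCP tool advertises a name, a description, and a JSON
                # schema describing the arguments call_tool() expects.
                print(f"{t.name}: {t.description}")
                print(f"  args: {t.inputSchema}")

# e.g. asyncio.run(list_desktop_tools(server.server_url))
```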
This dual approach means ScreenEnv adapts to your existing infrastructure rather than forcing you to change your agent architecture.
✨ Build a Desktop Agent with screenenv and smolagents
screenenv natively supports smolagents, so you can easily build your own custom desktop agent for automation. Here is how to create your own AI-powered desktop agent in just a few steps.
1. Choose Your Model
Pick the backend VLM you want to use to power the agent.
```python
import os

from smolagents import OpenAIServerModel

model = OpenAIServerModel(
    model_id="gpt-4.1",
    api_key=os.getenv("OPENAI_API_KEY"),
)

# Inference Endpoints
from smolagents import HfApiModel

model = HfApiModel(
    model_id="Qwen/Qwen2.5-VL-7B-Instruct",
    token=os.getenv("HF_TOKEN"),
    provider="nebius",
)

# Transformer models
from smolagents import TransformersModel

model = TransformersModel(
    model_id="Qwen/Qwen2.5-VL-7B-Instruct",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

# Other providers
from smolagents import LiteLLMModel

model = LiteLLMModel(model_id="anthropic/claude-sonnet-4-20250514")

# see smolagents to get the list of available model connectors
```
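The snippets above are alternatives: only one `model` assignment should remain in your script. If you want to switch backends without editing code, one option is a small selector keyed on an environment variable. This is just a sketch; the `MODEL_BACKEND` variable and `build_model` helper are illustrative, not part of smolagents or screenenv:

```python
import os

from smolagents import HfApiModel, LiteLLMModel, OpenAIServerModel


def build_model(backend: str | None = None):
    """Pick one of the backends shown above, based on MODEL_BACKEND."""
    backend = backend or os.getenv("MODEL_BACKEND", "openai")
    if backend == "openai":
        return OpenAIServerModel(model_id="gpt-4.1", api_key=os.getenv("OPENAI_API_KEY"))
    if backend == "hf":
        return HfApiModel(
            model_id="Qwen/Qwen2.5-VL-7B-Instruct",
            token=os.getenv("HF_TOKEN"),
            provider="nebius",
        )
    # Fall back to LiteLLM for any other provider
    return LiteLLMModel(model_id="anthropic/claude-sonnet-4-20250514")


model = build_model()
```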
2. Define Your Custom Desktop Agent
Subclass `DesktopAgentBase` and implement the `_setup_desktop_tools` method to build your own action space!
```python
from screenenv import DesktopAgentBase, Sandbox
from smolagents import Model, Tool, tool
from smolagents.monitoring import LogLevel
from typing import List


class CustomDesktopAgent(DesktopAgentBase):
    """Agent for desktop automation"""

    def __init__(
        self,
        model: Model,
        data_dir: str,
        desktop: Sandbox,
        tools: List[Tool] | None = None,
        max_steps: int = 200,
        verbosity_level: LogLevel = LogLevel.INFO,
        planning_interval: int | None = None,
        use_v1_prompt: bool = False,
        **kwargs,
    ):
        super().__init__(
            model=model,
            data_dir=data_dir,
            desktop=desktop,
            tools=tools,
            max_steps=max_steps,
            verbosity_level=verbosity_level,
            planning_interval=planning_interval,
            use_v1_prompt=use_v1_prompt,
            **kwargs,
        )

        # OPTIONAL: Add a custom prompt template - see
        # src/screenenv/desktop_agent/desktop_agent_base.py for more details
        # about the default prompt template.
        # self.prompt_templates["system_prompt"] = CUSTOM_PROMPT_TEMPLATE.replace(
        #     "<<resolution_x>>", str(self.width)
        # ).replace("<<resolution_y>>", str(self.height))
        # Important: Adjust the prompt based on your action space to improve results.

    def _setup_desktop_tools(self) -> None:
        """Define your custom tools here."""

        @tool
        def click(x: int, y: int) -> str:
            """
            Clicks at the specified coordinates.
            Args:
                x: The x-coordinate of the click
                y: The y-coordinate of the click
            """
            self.desktop.left_click(x, y)
            # self.click_coordinates = (x, y) to add the click coordinate to the observation screenshot
            return f"Clicked at ({x}, {y})"

        self.tools["click"] = click

        @tool
        def write(text: str) -> str:
            """
            Types the specified text at the current cursor position.
            Args:
                text: The text to type
            """
            self.desktop.write(text, delay_in_ms=10)
            return f"Typed text: '{text}'"

        self.tools["write"] = write

        @tool
        def press(key: str) -> str:
            """
            Presses a keyboard key or combination of keys.
            Args:
                key: The key to press (e.g. "enter", "space", "backspace", etc.) or a multiple keys string to press, for example "ctrl+a" or "ctrl+shift+a".
            """
            self.desktop.press(key)
            return f"Pressed key: {key}"

        self.tools["press"] = press

        @tool
        def open(file_or_url: str) -> str:
            """
            Directly opens a browser with the specified url or opens a file with the default application.
            Args:
                file_or_url: The URL or file to open
            """
            self.desktop.open(file_or_url)
            # Give it time to load
            self.logger.log(f"Opening: {file_or_url}")
            return f"Opened: {file_or_url}"

        self.tools["open"] = open

        @tool
        def launch_app(app_name: str) -> str:
            """
            Launches the specified application.
            Args:
                app_name: The name of the application to launch
            """
            self.desktop.launch(app_name)
            return f"Launched application: {app_name}"

        self.tools["launch_app"] = launch_app

        ...  # Continue implementing your own action space.
```
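Any plain Python function can join the action space the same way. As a sketch of what "continue implementing your own action space" might look like, here is one extra action that needs no additional screenenv API at all, a simple `wait` tool; the tool itself is an illustrative assumption, not a screenenv built-in:

```python
import time

from smolagents import tool


class MyDesktopAgent(CustomDesktopAgent):
    """Sketch: extends the agent above with one extra, purely local action."""

    def _setup_desktop_tools(self) -> None:
        super()._setup_desktop_tools()  # keep click/write/press/open/launch_app

        @tool
        def wait(seconds: float) -> str:
            """
            Pauses before the next action, e.g. while an app or page finishes loading.
            Args:
                seconds: How long to wait, in seconds
            """
            time.sleep(seconds)
            return f"Waited {seconds} seconds"

        self.tools["wait"] = wait
```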
3. Run the Agent on a Desktop Task
```python
from screenenv import Sandbox

# Define your sandbox environment
sandbox = Sandbox(headless=False, resolution=(1920, 1080))

# Create your agent
agent = CustomDesktopAgent(
    model=model,
    data_dir="data",
    desktop=sandbox,
)

# Run a task
task = "Open LibreOffice, write a report of approximately 300 words on the topic 'AI Agent Workflow in 2025', and save the document."
result = agent.run(task)
print(f"📄 Result: {result}")

sandbox.close()
```
If you hit a Docker permission-denied error, try running the agent with `sudo -E python -m test.py`, or add your user to the `docker` group (for example, `sudo usermod -aG docker $USER`, then log out and back in).
💡 For the full implementation, see the CustomDesktopAgent source code on GitHub.
Get Started Today
```bash
# Install ScreenEnv
pip install screenenv

# Try the examples
git clone git@github.com:huggingface/screenenv.git
cd screenenv
python -m examples.desktop_agent
# use 'sudo -E python -m examples.desktop_agent' if you're not in the 'docker' group
```
What's Next?
ScreenEnv aims to go beyond Linux and support **Android, macOS, and Windows**, enabling truly cross-platform GUI automation. This will let developers and researchers build agents that generalize across environments with minimal setup.
These advances pave the way for **reproducible sandboxed environments** that are ideal for benchmarking and evaluation.