Data analyst agent: get your data's insights in the blink of an eye ✨


Authored by: Aymeric Roucher

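This is an advanced tutorial. You should first be familiar with the material in this other cookbook!

In this notebook we will build a data analyst agent: a code agent equipped with data analysis libraries that can load and transform dataframes, extract insights from your data, and even plot the results!

Let's say I want to analyze the data from the Kaggle Titanic challenge in order to predict the survival of individual passengers. But before digging into it myself, I want an autonomous agent to prepare the analysis for me by extracting trends and plotting some figures to find insights.

Let's set up this system.

Run the line below to install the required dependencies: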

!pip install seaborn smolagents transformers -q -U

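We first create the agent. We use a CodeAgent (read the documentation to learn more about agent types), so we do not even need to give it any tools: it can run its own code directly.

We only need to let it use data science libraries by passing them in additional_authorized_imports: ["numpy", "pandas", "matplotlib.pyplot", "seaborn"]

In general, when passing libraries in additional_authorized_imports, make sure they are installed in your local environment, since the Python interpreter can only use libraries that are installed there.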
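The installation check described above can be sketched with the standard library alone. This is a minimal sketch, not part of smolagents: the helper name `missing_modules` is hypothetical, and it only checks that the top-level package of each entry can be found, not that the submodule imports cleanly.

```python
from importlib.util import find_spec


def missing_modules(authorized_imports):
    """Return the entries whose top-level package cannot be found.

    Hypothetical helper for illustration; it is not a smolagents function.
    """
    missing = []
    for name in authorized_imports:
        # "matplotlib.pyplot" -> "matplotlib": looking up the top-level name
        # avoids the ModuleNotFoundError that find_spec raises when the
        # parent package of a dotted name is absent.
        top_level = name.split(".")[0]
        if find_spec(top_level) is None:
            missing.append(name)
    return missing


authorized = ["numpy", "pandas", "matplotlib.pyplot", "seaborn"]
print("Not installed:", missing_modules(authorized))
```

Running this before creating the agent gives an early warning, so a missing library does not surface only when the agent tries to import it partway through a run.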

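⚙ Our agent will be powered by meta-llama/Llama-3.1-70B-Instruct through the InferenceClientModel class, which uses HF's Inference API: the Inference API lets you run open-source models quickly, easily, and for free!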

from smolagents import InferenceClientModel, CodeAgent
from huggingface_hub import login
import os

login(os.getenv("HUGGINGFACEHUB_API_TOKEN"))

model = InferenceClientModel("meta-llama/Llama-3.1-70B-Instruct")

agent = CodeAgent(
    tools=[],
    model=model,
    additional_authorized_imports=["numpy", "pandas", "matplotlib.pyplot", "seaborn"],
    max_steps=10,
)

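Data analysis 📊🤔

When we run the agent, we give it additional notes taken directly from the competition and pass them to the run method through the additional_args keyword argument.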

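import os

# Create the folder for the figures; exist_ok avoids an error when the cell is re-run
os.makedirs("./figures", exist_ok=True)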
>>> additional_notes = """
... ### Variable Notes
... pclass: A proxy for socio-economic status (SES)
... 1st = Upper
... 2nd = Middle
... 3rd = Lower
... age: Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
... sibsp: The dataset defines family relations in this way...
... Sibling = brother, sister, stepbrother, stepsister
... Spouse = husband, wife (mistresses and fiancés were ignored)
... parch: The dataset defines family relations in this way...
... Parent = mother, father
... Child = daughter, son, stepdaughter, stepson
... Some children travelled only with a nanny, therefore parch=0 for them.
... """

>>> analysis = agent.run(
...     """You are an expert data analyst.
... Please load the source file and analyze its content.
... According to the variables you have, begin by listing 3 interesting questions that could be asked on this data, for instance about specific correlations with survival rate.
... Then answer these questions one by one, by finding the relevant numbers.
... Meanwhile, plot some figures using matplotlib/seaborn and save them to the (already existing) folder './figures/': take care to clear each figure with plt.clf() before doing another plot.
...
... In your final answer: summarize these correlations and trends
... After each number derive real worlds insights, for instance: "Correlation between is_december and boredness is 1.3453, which suggest people are more bored in winter".
... Your final answer should have at least 3 numbered and detailed parts.
... """,
...     additional_args=dict(additional_notes=additional_notes, source_file="titanic/train.csv"),
... )
>>> print(analysis)
The analysis of the Titanic data reveals that socio-economic status and sex are significant factors in determining survival rates. Passengers with lower socio-economic status and males are less likely to survive. The age of a passenger has a minimal impact on their survival rate.

Impressive, isn't it? You could also give your agent a visualization tool so that it can reflect on its own plots!
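The code the agent wrote is not shown here, but a conclusion like the one in the analysis above rests on group-wise survival rates. The following is a minimal sketch of that computation using only the standard library. The column names `Survived`, `Pclass` and `Sex` are assumed from the standard Kaggle Titanic files, and `survival_rate_by` is a hypothetical helper; the agent itself would more likely use a pandas `groupby`.

```python
import csv
import os
from collections import defaultdict


def survival_rate_by(rows, column):
    """Return {group value: share of survivors} for the given column.

    `rows` is an iterable of dicts with a "Survived" field holding 0 or 1
    (assumed Kaggle column name).
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [survivors, total]
    for row in rows:
        group = row[column]
        counts[group][0] += int(row["Survived"])
        counts[group][1] += 1
    return {group: survived / total for group, (survived, total) in counts.items()}


path = "titanic/train.csv"  # same path as passed to the agent above
if os.path.exists(path):
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    for column in ("Pclass", "Sex"):
        print(column, survival_rate_by(rows, column))
```

Comparing these rates with the numbers quoted in the agent's final answer is a quick way to verify its claims before relying on them.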

Data scientist agent: run predictions 🛠️

👉 Now let's dig deeper: we will let our model make predictions on the data.

To do so, we also allow it to use sklearn in additional_authorized_imports.

agent = CodeAgent(
    tools=[],
    model=model,
    additional_authorized_imports=[
        "numpy",
        "pandas",
        "matplotlib.pyplot",
        "seaborn",
        "sklearn",
    ],
    max_steps=12,
)

output = agent.run(
    """You are an expert machine learning engineer.
Please train a ML model on "titanic/train.csv" to predict the survival for rows of "titanic/test.csv".
Output the results under './output.csv'.
Take care to import functions and modules before using them!
""",
    additional_args=dict(additional_notes=additional_notes + "\n" + analysis),
)

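The agent ran into a few errors along the way, but it still managed to solve the task in the end!

The test predictions output by the agent, once submitted to Kaggle, score 0.78229, which ranks #2824 out of 17,360 participants and is better than what I had painfully achieved when I first tried the challenge years ago.

Your results will vary, but either way I find it very impressive to achieve this with an agent in a few seconds.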
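Since results vary from run to run, it can help to sanity-check the file the agent wrote before submitting it. The sketch below assumes the usual Kaggle Titanic submission layout (a `PassengerId` column and a `Survived` column containing 0 or 1), which is not spelled out in this notebook. `check_submission` is a hypothetical helper, and the agent is not guaranteed to produce this exact layout.

```python
import csv
import os


def check_submission(path):
    """Return a list of problems found in a Kaggle-style submission file.

    Assumed layout: header "PassengerId,Survived", Survived in {0, 1}.
    """
    problems = []
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ["PassengerId", "Survived"]:
            problems.append(f"unexpected columns: {reader.fieldnames}")
            return problems
        # Data rows start on line 2, right after the header line.
        for line_number, row in enumerate(reader, start=2):
            if row["Survived"] not in {"0", "1"}:
                problems.append(f"line {line_number}: Survived={row['Survived']!r}")
    return problems


if os.path.exists("./output.csv"):
    print(check_submission("./output.csv") or "output.csv looks well-formed")
```

An empty list means the file matches the assumed layout; it says nothing about how accurate the predictions are.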

🚀 The above is only a first attempt at a data analyst agent: it can certainly be improved a lot to better fit your use case!