Information Extraction with Haystack and NuExtract

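Author: Stefano Fiorucci

In this notebook, we will see how to use language models to automatically extract information from textual data.

🎯 Goal: create an application that extracts specific information from a given text or URL, following a user-defined structure.

🧰 Stack

Haystack 🏗️: a customizable orchestration framework for building LLM applications. We will use Haystack to build the information extraction Pipeline.
NuExtract: a small language model, fine-tuned specifically for structured data extraction.

Install dependencies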

! pip install haystack-ai trafilatura transformers pyvis

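Components

Haystack has two main concepts: Components and Pipelines.

🧩 Components are building blocks that each perform a single task: file conversion, text generation, embedding creation, and so on.

➿ Pipelines let you define the flow of data through your LLM application by combining Components into directed graphs, which may contain cycles.

We will now introduce the components of our information extraction application. Afterwards, we will integrate them into a Pipeline.

LinkContentFetcher and HTMLToDocument: extract text from web pages

In our experiment, we will extract data from startup funding announcements found on the web.

To download the web pages and extract the text, we use two components:

LinkContentFetcher: fetches the content of some URLs and returns a list of content streams (as ByteStream objects).
HTMLToDocument: converts HTML sources into textual Documents.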

>>> from haystack.components.fetchers import LinkContentFetcher
>>> from haystack.components.converters import HTMLToDocument


>>> fetcher = LinkContentFetcher()

>>> streams = fetcher.run(urls=["https://example.com/"])["streams"]

>>> converter = HTMLToDocument()
>>> docs = converter.run(sources=streams)

>>> print(docs)

{'documents': [Document(id=65bb1ce4b6db2f154d3acfa145fa03363ef93f751fb8599dcec3aaf75aa325b9, content: 'This domain is for use in illustrative examples in documents. You may use this domain in literature ...', meta: {'content_type': 'text/html', 'url': 'https://example.com/'})]}

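HuggingFaceLocalGenerator: load and try the model

We use HuggingFaceLocalGenerator, a text generation component that loads a model hosted on Hugging Face using the Transformers library.

Haystack supports many other Generators, including HuggingFaceAPIGenerator (compatible with the Hugging Face APIs and TGI).

We load NuExtract, a model fine-tuned from `microsoft/Phi-3-mini-4k-instruct` to perform structured data extraction from text. The model has 3.8B parameters. Other variants are also available: `NuExtract-tiny` (0.5B) and `NuExtract-large` (7B).

The model is loaded in `bfloat16` precision so that it fits in Colab, with negligible performance loss compared to FP32, as suggested in the model card.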
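
As a rough back-of-the-envelope check on why half precision helps here, the sketch below estimates the memory needed for the weights alone of a 3.8B-parameter model. It ignores activations, the KV cache, and framework overhead, so real usage will be higher. The helper name `weights_gb` is made up for this illustration.

```python
# Rough estimate of the memory taken by model weights alone.
# Assumptions: 1 GB = 10**9 bytes; activations and KV cache are ignored.

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weights_gb(n_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in gigabytes."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


n_params = 3.8e9  # NuExtract size quoted above

for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: ~{weights_gb(n_params, dtype):.1f} GB")
```

Under these assumptions, `bfloat16` halves the weight footprint, from roughly 15 GB to roughly 7.6 GB.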

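A note on Flash Attention

At inference time, you will probably see a warning saying: "You are not running the flash-attention implementation".

GPUs in free environments such as Colab or Kaggle do not support it, so we decided not to use it in this notebook.

If your GPU architecture supports it (see the flash-attention project for details), you can install it and get a speed-up as follows:

pip install flash-attn --no-build-isolation

Then add `"attn_implementation": "flash_attention_2"` to `model_kwargs`.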
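
To show where that key would sit, here is a small sketch that adds it to a nested kwargs dictionary shaped like the `huggingface_pipeline_kwargs` used in the next cell. The helper `enable_flash_attention` is hypothetical: it only edits a dictionary and does not check whether flash-attn is installed or supported.

```python
import copy


def enable_flash_attention(pipeline_kwargs: dict) -> dict:
    """Return a copy of the kwargs with flash attention requested.

    Hypothetical helper: it does not verify GPU or library support.
    """
    updated = copy.deepcopy(pipeline_kwargs)
    updated.setdefault("model_kwargs", {})["attn_implementation"] = "flash_attention_2"
    return updated


# In the notebook the dtype is torch.bfloat16; a string stands in here
# so that the sketch runs without importing torch.
base_kwargs = {"model_kwargs": {"torch_dtype": "bfloat16 (placeholder)"}}
flash_kwargs = enable_flash_attention(base_kwargs)
print(flash_kwargs)
```

The resulting dictionary would then be passed as `huggingface_pipeline_kwargs` when creating the generator.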

from haystack.components.generators import HuggingFaceLocalGenerator
import torch

generator = HuggingFaceLocalGenerator(
    model="numind/NuExtract", huggingface_pipeline_kwargs={"model_kwargs": {"torch_dtype": torch.bfloat16}}
)

# effectively load the model (warm_up is automatically invoked when the generator is part of a Pipeline)
generator.warm_up()

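The model supports a specific prompt structure, which can be inferred from the model card.

Let's manually create a prompt to try the model. Later, we will see how to build the prompt dynamically from different inputs.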

>>> prompt = """<|input|>\n### Template:
... {
...     "Car": {
...         "Name": "",
...         "Manufacturer": "",
...         "Designers": [],
...         "Number of units produced": "",
...     }
... }
... ### Text:
... The Fiat Panda is a city car manufactured and marketed by Fiat since 1980, currently in its third generation. The first generation Panda, introduced in 1980, was a two-box, three-door hatchback designed by Giorgetto Giugiaro and Aldo Mantovani of Italdesign and was manufactured through 2003 — receiving an all-wheel drive variant in 1983. SEAT of Spain marketed a variation of the first generation Panda under license to Fiat, initially as the Panda and subsequently as the Marbella (1986–1998).

... The second-generation Panda, launched in 2003 as a 5-door hatchback, was designed by Giuliano Biasio of Bertone, and won the European Car of the Year in 2004. The third-generation Panda debuted at the Frankfurt Motor Show in September 2011, was designed at Fiat Centro Stilo under the direction of Roberto Giolito and remains in production in Italy at Pomigliano d'Arco.[1] The fourth-generation Panda is marketed as Grande Panda, to differentiate it with the third-generation that is sold alongside it. Developed under Stellantis, the Grande Panda is produced in Serbia.

... In 40 years, Panda production has reached over 7.8 million,[2] of those, approximately 4.5 million were the first generation.[3] In early 2020, its 23-year production was counted as the twenty-ninth most long-lived single generation car in history by Autocar.[4] During its initial design phase, Italdesign referred to the car as il Zero. Fiat later proposed the name Rustica. Ultimately, the Panda was named after Empanda, the Roman goddess and patroness of travelers.
... <|output|>
... """

>>> result = generator.run(prompt=prompt)
>>> print(result)

{'replies': ['{\n    "Car": {\n        "Name": "Fiat Panda",\n        "Manufacturer": "Fiat",\n        "Designers": [\n            "Giorgetto Giugiaro",\n            "Aldo Mantovani",\n            "Giuliano Biasio",\n            "Roberto Giolito"\n        ],\n        "Number of units produced": "over 7.8 million"\n    }\n}\n']}

Nice ✅

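PromptBuilder: create the prompt dynamically

The PromptBuilder is initialized with a Jinja2 prompt template and renders it by filling in the parameters passed as keyword arguments.

Our prompt template reproduces the structure shown in the model card.

In our experiments, we found that the indentation pattern is particularly important for good results. This probably stems from how the model was trained.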

from haystack.components.builders import PromptBuilder
from haystack import Document

prompt_template = """<|input|>
### Template:
{{ schema | tojson(indent=4) }}
{% for example in examples %}
### Example:
{{ example | tojson(indent=4) }}\n
{% endfor %}
### Text
{{documents[0].content}}
<|output|>
"""

prompt_builder = PromptBuilder(template=prompt_template)

>>> example_document = Document(content="The Fiat Panda is a city car...")

>>> example_schema = {
...     "Car": {
...         "Name": "",
...         "Manufacturer": "",
...         "Designers": [],
...         "Number of units produced": "",
...     }
... }

>>> prompt = prompt_builder.run(documents=[example_document], schema=example_schema)["prompt"]

>>> print(prompt)

<|input|>
### Template:
{
    "Car": {
        "Designers": [],
        "Manufacturer": "",
        "Name": "",
        "Number of units produced": ""
    }
}

### Text
The Fiat Panda is a city car...
<|output|>

Works well ✅
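
Note that the rendered template lists the keys alphabetically ("Designers" first), unlike the dictionary we passed in. This is because Jinja2's `tojson` filter sorts keys by default. The sketch below approximates the filter with the standard library's `json.dumps` to show what the 4-space indented schema looks like. It is only an approximation: `tojson` additionally escapes a few HTML-sensitive characters.

```python
import json

schema = {
    "Car": {
        "Name": "",
        "Manufacturer": "",
        "Designers": [],
        "Number of units produced": "",
    }
}

# sort_keys=True mirrors the alphabetical ordering seen in the rendered prompt.
rendered = json.dumps(schema, indent=4, sort_keys=True)
print(rendered)
```

Since the indentation pattern matters to the model, keeping `indent=4` consistent between the template and any examples you pass seems a sensible precaution.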

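OutputAdapter

You may have noticed that the extraction result is the first element of the `replies` list and is a JSON string.

We would like to have a dictionary for each source document. To perform this transformation in a pipeline, we can use the OutputAdapter.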

>>> import json
>>> from haystack.components.converters import OutputAdapter


>>> adapter = OutputAdapter(
...     template="""{{ replies[0]| replace("'",'"') | json_loads}}""",
...     output_type=dict,
...     custom_filters={"json_loads": json.loads},
... )

>>> print(adapter.run(**result))

{'output': {'Car': {'Name': 'Fiat Panda', 'Manufacturer': 'Fiat', 'Designers': ['Giorgetto Giugiaro', 'Aldo Mantovani', 'Giuliano Biasio', 'Roberto Giolito'], 'Number of units produced': 'over 7.8 million'}}}
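
The `replace("'", '"')` step in the template is a blunt workaround: it also rewrites apostrophes inside values (for example in "Pomigliano d'Arco") and can make otherwise valid JSON unparseable. As a hedged alternative, the following standard-library sketch tries strict parsing first and only falls back to the quote replacement if that fails. The function name `parse_reply` is hypothetical and not part of Haystack.

```python
import json


def parse_reply(reply: str) -> dict:
    """Parse a model reply into a dict (hypothetical helper).

    Strict JSON parsing is attempted first; the quote replacement used
    in the notebook is applied only as a fallback.
    """
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        return json.loads(reply.replace("'", '"'))


# A reply that is valid JSON but contains an apostrophe in a value.
reply = '{"Plant": {"Location": "Pomigliano d\'Arco"}}'
print(parse_reply(reply))
```

A function like this could presumably be registered through `custom_filters`, in the same way `json.loads` is registered above, and used in the template in place of the two chained filters.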

Information extraction Pipeline

Build the Pipeline

We can now create our Pipeline by adding and connecting the individual components.

from haystack import Pipeline

ie_pipe = Pipeline()
ie_pipe.add_component("fetcher", fetcher)
ie_pipe.add_component("converter", converter)
ie_pipe.add_component("prompt_builder", prompt_builder)
ie_pipe.add_component("generator", generator)
ie_pipe.add_component("adapter", adapter)

ie_pipe.connect("fetcher", "converter")
ie_pipe.connect("converter", "prompt_builder")
ie_pipe.connect("prompt_builder", "generator")
ie_pipe.connect("generator", "adapter")

# IN CASE YOU NEED TO RECREATE THE PIPELINE FROM SCRATCH, YOU CAN UNCOMMENT THIS CELL

# ie_pipe = Pipeline()
# ie_pipe.add_component("fetcher", LinkContentFetcher())
# ie_pipe.add_component("converter", HTMLToDocument())
# ie_pipe.add_component("prompt_builder", PromptBuilder(template=prompt_template))
# ie_pipe.add_component("generator", HuggingFaceLocalGenerator(model="numind/NuExtract",
#                                       huggingface_pipeline_kwargs={"model_kwargs": {"torch_dtype":torch.bfloat16}})
# )
# ie_pipe.add_component("adapter", OutputAdapter(template="""{{ replies[0]| replace("'",'"') | json_loads}}""",
#                                          output_type=dict,
#                                          custom_filters={"json_loads": json.loads}))

# ie_pipe.connect("fetcher", "converter")
# ie_pipe.connect("converter", "prompt_builder")
# ie_pipe.connect("prompt_builder", "generator")
# ie_pipe.connect("generator", "adapter")

Let's review our pipeline setup

>>> ie_pipe.show()

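Define the sources and the extraction schema

We select a list of URLs related to recent startup funding announcements.

In addition, we define a schema for the structured information we aim to extract.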

urls = [
    "https://techcrunch.com/2023/04/27/pinecone-drops-100m-investment-on-750m-valuation-as-vector-database-demand-grows/",
    "https://techcrunch.com/2023/04/27/replit-funding-100m-generative-ai/",
    "https://www.cnbc.com/2024/06/12/mistral-ai-raises-645-million-at-a-6-billion-valuation.html",
    "https://techcrunch.com/2024/01/23/qdrant-open-source-vector-database/",
    "https://www.intelcapital.com/anyscale-secures-100m-series-c-at-1b-valuation-to-radically-simplify-scaling-and-productionizing-ai-applications/",
    "https://techcrunch.com/2023/04/28/openai-funding-valuation-chatgpt/",
    "https://techcrunch.com/2024/03/27/amazon-doubles-down-on-anthropic-completing-its-planned-4b-investment/",
    "https://techcrunch.com/2024/01/22/voice-cloning-startup-elevenlabs-lands-80m-achieves-unicorn-status/",
    "https://techcrunch.com/2023/08/24/hugging-face-raises-235m-from-investors-including-salesforce-and-nvidia",
    "https://www.prnewswire.com/news-releases/ai21-completes-208-million-oversubscribed-series-c-round-301994393.html",
    "https://techcrunch.com/2023/03/15/adept-a-startup-training-ai-to-use-existing-software-and-apis-raises-350m/",
    "https://www.cnbc.com/2023/03/23/characterai-valued-at-1-billion-after-150-million-round-from-a16z.html",
]


schema = {
    "Funding": {
        "New funding": "",
        "Investors": [],
    },
    "Company": {"Name": "", "Activity": "", "Country": "", "Total valuation": "", "Total funding": ""},
}

Run the Pipeline!

We pass the required data to each component.

Note that most of them receive data from previously executed components.

from tqdm import tqdm

extracted_data = []

for url in tqdm(urls):
    result = ie_pipe.run({"fetcher": {"urls": [url]}, "prompt_builder": {"schema": schema}})

    extracted_data.append(result["adapter"]["output"])

Let's inspect some of the extracted data

extracted_data[:2]
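
Fetching pages and parsing model output can both fail (a URL may be unreachable, or a reply may not be valid JSON), and in the loop above a single failure stops the whole run. A minimal sketch of a more forgiving loop follows. `run_one` is a hypothetical callable standing in for a single `ie_pipe.run(...)` call, so the snippet runs without Haystack; `fake_run` and its URLs are invented for the demonstration.

```python
from typing import Any, Callable


def run_all(urls: list[str], run_one: Callable[[str], Any]) -> tuple[list, dict]:
    """Apply run_one to each URL, collecting results and errors separately."""
    results, errors = [], {}
    for url in urls:
        try:
            results.append(run_one(url))
        except Exception as exc:  # broad on purpose: keep going on any failure
            errors[url] = repr(exc)
    return results, errors


def fake_run(url: str) -> dict:
    """Stand-in for the pipeline call; fails for one made-up URL."""
    if "broken" in url:
        raise ValueError("could not parse reply")
    return {"Company": {"Name": url.rsplit("/", 1)[-1]}}


results, errors = run_all(
    ["https://example.com/acme", "https://example.com/broken"], fake_run
)
print(results)
print(errors)
```

In the notebook, `run_one` could wrap the real call, for instance `lambda url: ie_pipe.run({"fetcher": {"urls": [url]}, "prompt_builder": {"schema": schema}})["adapter"]["output"]`.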

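Data exploration and visualization

Let's explore the extracted data to assess its correctness and gain insights.

DataFrame

We first create a pandas DataFrame. For simplicity, we flatten the extracted data.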

def flatten_dict(d, parent_key=""):
    items = []
    for k, v in d.items():
        new_key = f"{parent_key} - {k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key).items())
        elif isinstance(v, list):
            items.append((new_key, ", ".join(v)))
        else:
            items.append((new_key, v))
    return dict(items)

import pandas as pd

df = pd.DataFrame([flatten_dict(el) for el in extracted_data])
df = df.sort_values(by="Company - Name")

df

[Figure: the resulting DataFrame]

Apart from some errors in "Company - Country", the extracted data looks good.
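
To make the column names concrete, here is what the flattening does to a single record. The record is hand-written for illustration (it is not model output), and the function is repeated so the snippet runs on its own. The only change is `map(str, ...)`, a small defensive assumption in case the model returns non-string list items.

```python
def flatten_dict(d, parent_key=""):
    """Flatten nested dicts; join list values into one comma-separated string."""
    items = []
    for k, v in d.items():
        new_key = f"{parent_key} - {k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key).items())
        elif isinstance(v, list):
            items.append((new_key, ", ".join(map(str, v))))
        else:
            items.append((new_key, v))
    return dict(items)


# Made-up record following the schema defined earlier.
record = {
    "Funding": {"New funding": "$10M", "Investors": ["Investor A", "Investor B"]},
    "Company": {"Name": "ExampleCo", "Country": "Italy"},
}

flat = flatten_dict(record)
for key, value in flat.items():
    print(f"{key}: {value}")
```

Each nested key becomes a column named "parent - child", which is where "Company - Name" and "Company - Country" come from.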

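Build a simple graph

To understand the relationships between companies and investors, we build a graph and visualize it.

First, we build the graph using NetworkX.

NetworkX is a Python package for easily creating and manipulating networks/graphs.

Our simple graph has companies and investors as nodes. We connect an investor and a company if they are mentioned in the same document.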

import networkx as nx

# Create a new graph
G = nx.Graph()

# Add nodes and edges
for el in extracted_data:
    company_name = el["Company"]["Name"]
    G.add_node(company_name, label=company_name, title="Company")

    investors = el["Funding"]["Investors"]
    for investor in investors:
        if not G.has_node(investor):
            G.add_node(investor, label=investor, title="Investor", color="red")
        G.add_edge(company_name, investor)

Next, we visualize the graph with Pyvis.

Pyvis is a Python package for interactive visualization of networks/graphs. It integrates well with NetworkX.

from pyvis.network import Network
from IPython.display import display, HTML


net = Network(notebook=True, cdn_resources="in_line")
net.from_nx(G)

net.show("simple_graph.html")
display(HTML("simple_graph.html"))

[Figure: graph visualization]

It looks like Andreessen Horowitz appears quite often in the selected funding announcements 😊
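
The same observation can be checked without the picture by counting, for each investor, how many companies it is linked to (its degree in the graph). The sketch below does this with the standard library on a tiny made-up sample shaped like `extracted_data`. The names are placeholders, not results from the notebook.

```python
from collections import Counter

# Made-up sample with the same shape as extracted_data.
sample_data = [
    {"Company": {"Name": "Alpha"}, "Funding": {"Investors": ["Fund X", "Fund Y"]}},
    {"Company": {"Name": "Beta"}, "Funding": {"Investors": ["Fund X"]}},
    {"Company": {"Name": "Gamma"}, "Funding": {"Investors": ["Fund X", "Fund Z"]}},
]

# Use a set per record so an investor listed twice in one document counts once,
# matching an undirected graph where duplicate edges collapse.
investor_counts = Counter(
    investor for el in sample_data for investor in set(el["Funding"]["Investors"])
)

for investor, n_companies in investor_counts.most_common():
    print(f"{investor}: {n_companies}")
```

On the real graph, `sorted(G.degree, key=lambda pair: pair[1], reverse=True)` should give a comparable ranking, with company nodes included.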

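Conclusion and ideas

In this notebook, we demonstrated how to set up an information extraction system using a small language model (NuExtract) and Haystack, a customizable orchestration framework for LLM applications.

How can we use the extracted data?

Some ideas:

The extracted data can be added to the original documents stored in a Document Store. This enables advanced search capabilities using metadata filtering.
Building on the previous idea, you can do RAG (Retrieval-Augmented Generation) and extract metadata from queries, as explained in this blog post.
Store the documents and the extracted data in a knowledge graph and perform Graph RAG (Neo4j-Haystack integration).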
