開源 AI 食譜文件

使用 Hugging Face 和 Milvus 構建 RAG

Hugging Face's logo
加入 Hugging Face 社群

並獲得增強的文件體驗

開始使用

Open In Colab

使用 Hugging Face 和 Milvus 構建 RAG

作者: Chen Zhang

Milvus 是一款流行的開源向量資料庫,透過高效能和可擴充套件的向量相似性搜尋為 AI 應用提供支援。在本教程中,我們將向您展示如何使用 Hugging Face 和 Milvus 構建一個 RAG(檢索增強生成)管道。

RAG 系統將檢索系統與大語言模型(LLM)相結合。該系統首先使用 Milvus 向量資料庫從語料庫中檢索相關文件,然後使用託管在 Hugging Face 上的 LLM 根據檢索到的文件生成答案。

準備工作

依賴和環境

! pip install --upgrade pymilvus sentence-transformers huggingface-hub langchain_community langchain-text-splitters pypdf tqdm

如果您正在使用 Google Colab,為了啟用依賴項,您可能需要 重啟執行時 (點選螢幕頂部的“執行時”選單,然後從下拉選單中選擇“重啟會話”)。

此外,我們建議您配置您的 Hugging Face 使用者訪問令牌,並將其設定在您的環境變數中,因為我們將使用 Hugging Face Hub 上的一個 LLM。如果您不設定令牌環境變數,您可能會遇到較低的請求限制。

import os

os.environ["HF_TOKEN"] = "hf_..."

準備資料

我們使用 AI 法案 PDF 作為我們 RAG 中的私有知識,這是一個針對 AI 的監管框架,不同的風險等級對應不同程度的監管。

%%bash

if [ ! -f "The-AI-Act.pdf" ]; then
    wget -q https://artificialintelligenceact.eu/wp-content/uploads/2021/08/The-AI-Act.pdf
fi

我們使用 LangChain 的 PyPDFLoader 從 PDF 中提取文字,然後將文字分割成更小的塊。預設情況下,我們將塊大小設定為 1000,重疊設定為 200,這意味著每個塊大約有 1000 個字元,兩個塊之間的重疊為 200 個字元。

>>> from langchain_community.document_loaders import PyPDFLoader

>>> loader = PyPDFLoader("The-AI-Act.pdf")
>>> docs = loader.load()
>>> print(len(docs))
108
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(docs)
text_lines = [chunk.page_content for chunk in chunks]

準備嵌入模型

定義一個函式來生成文字嵌入。我們以 BGE 嵌入模型 為例,但您可以使用任何嵌入模型,例如在 MTEB 排行榜上找到的模型。

from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("BAAI/bge-small-en-v1.5")


def emb_text(text):
    return embedding_model.encode([text], normalize_embeddings=True).tolist()[0]

生成一個測試嵌入並列印其維度和前幾個元素。

>>> test_embedding = emb_text("This is a test")
>>> embedding_dim = len(test_embedding)
>>> print(embedding_dim)
>>> print(test_embedding[:10])
384
[-0.07660683244466782, 0.025316666811704636, 0.012505513615906239, 0.004595153499394655, 0.025780051946640015, 0.03816710412502289, 0.08050819486379623, 0.003035430097952485, 0.02439221926033497, 0.0048803347162902355]

將資料載入到 Milvus

建立集合

from pymilvus import MilvusClient

milvus_client = MilvusClient(uri="./hf_milvus_demo.db")

collection_name = "rag_collection"

關於 MilvusClient 的引數:

  • uri 設定為本地檔案,例如 ./hf_milvus_demo.db,是最方便的方法,因為它會自動利用 Milvus Lite 將所有資料儲存在該檔案中。
  • 如果您有大量資料,比如超過一百萬個向量,您可以在 Docker 或 Kubernetes 上設定一個性能更強的 Milvus 伺服器。在這種設定中,請使用伺服器的 uri,例如 https://:19530,作為您的 uri
  • 如果您想使用 Zilliz Cloud(Milvus 的全託管雲服務),請調整 uritoken,它們對應 Zilliz Cloud 中的 公共端點和 Api 金鑰

檢查集合是否已存在,如果存在則刪除它。

if milvus_client.has_collection(collection_name):
    milvus_client.drop_collection(collection_name)

使用指定的引數建立一個新的集合。

如果我們不指定任何欄位資訊,Milvus 會自動建立一個預設的 id 欄位作為主鍵,以及一個 vector 欄位來儲存向量資料。一個保留的 JSON 欄位用於儲存未在 schema 中定義的欄位及其值。

milvus_client.create_collection(
    collection_name=collection_name,
    dimension=embedding_dim,
    metric_type="IP",  # Inner product distance
    consistency_level="Strong",  # Strong consistency level
)

插入資料

遍歷文字行,建立嵌入,然後將資料插入到 Milvus 中。

這裡有一個新欄位 text,它是在集合 schema 中未定義的欄位。它將被自動新增到保留的 JSON 動態欄位中,從高層來看可以像普通欄位一樣對待。

from tqdm import tqdm

data = []

for i, line in enumerate(tqdm(text_lines, desc="Creating embeddings")):
    data.append({"id": i, "vector": emb_text(line), "text": line})

insert_res = milvus_client.insert(collection_name=collection_name, data=data)
insert_res["insert_count"]

構建 RAG

為查詢檢索資料

讓我們指定一個關於語料庫的問題。

question = "What is the legal basis for the proposal?"

在集合中搜索問題,並檢索前 3 個語義匹配項。

search_res = milvus_client.search(
    collection_name=collection_name,
    data=[emb_text(question)],  # Use the `emb_text` function to convert the question to an embedding vector
    limit=3,  # Return top 3 results
    search_params={"metric_type": "IP", "params": {}},  # Inner product distance
    output_fields=["text"],  # Return the text field
)

讓我們看一下查詢的搜尋結果

>>> import json

>>> retrieved_lines_with_distances = [(res["entity"]["text"], res["distance"]) for res in search_res[0]]
>>> print(json.dumps(retrieved_lines_with_distances, indent=4))
[
    [
        "EN 6  EN 2. LEGAL  BASIS,  SUBSIDIARITY  AND  PROPORTIONALITY  \n2.1. Legal  basis  \nThe legal basis for the proposal is in the first place Article 114 of the Treaty on the \nFunctioning of the European Union (TFEU), which provides for the adoption of measures to \nensure the establishment and f unctioning of the internal market.  \nThis proposal constitutes a core part of the EU digital single market strategy. The primary \nobjective of this proposal is to ensure the proper functioning of the internal market by setting \nharmonised rules in particular on the development, placing on the Union market and the use \nof products and services making use of AI technologies or provided as stand -alone AI \nsystems. Some Member States are already considering national rules to ensure that AI is safe \nand is developed a nd used in compliance with fundamental rights obligations. This will likely \nlead to two main problems: i) a fragmentation of the internal market on essential elements",
        0.7412998080253601
    ],
    [
        "applications and prevent market fragmentation.  \nTo achieve those objectives, this proposal presents a balanced and proportionate horizontal \nregulatory approach to AI that is limited to the minimum necessary requirements to address \nthe risks and problems linked to AI, withou t unduly constraining or hindering technological \ndevelopment or otherwise disproportionately increasing the cost of placing AI solutions on \nthe market.  The proposal sets a robust and flexible legal framework. On the one hand, it is \ncomprehensive and future -proof in its fundamental regulatory choices, including the \nprinciple -based requirements that AI systems should comply with. On the other hand, it puts \nin place a proportionate regulatory system centred on a well -defined risk -based regulatory \napproach that  does not create unnecessary restrictions to trade, whereby legal intervention is \ntailored to those concrete situations where there is a justified cause for concern or where such",
        0.696428656578064
    ],
    [
        "approach that  does not create unnecessary restrictions to trade, whereby legal intervention is \ntailored to those concrete situations where there is a justified cause for concern or where such \nconcern can reasonably be anticipated in the near future. At the same time, t he legal \nframework includes flexible mechanisms that enable it to be dynamically adapted as the \ntechnology evolves and new concerning situations emerge.  \nThe proposal sets harmonised rules for the development, placement on the market and use of \nAI systems i n the Union following a proportionate risk -based approach. It proposes a single \nfuture -proof definition of AI. Certain particularly harmful AI practices are prohibited as \ncontravening Union values, while specific restrictions and safeguards are proposed in  relation \nto certain uses of remote biometric identification systems for the purpose of law enforcement. \nThe proposal lays down a solid risk methodology to define \u201chigh -risk\u201d AI systems that pose",
        0.6891457438468933
    ]
]

使用 LLM 獲取 RAG 響應

在為 LLM 組合提示詞之前,我們先將檢索到的文件列表扁平化為一個純字串。

context = "\n".join([line_with_distance[0] for line_with_distance in retrieved_lines_with_distances])

為語言模型定義提示詞。這個提示詞是根據從 Milvus 檢索到的文件組裝而成的。

PROMPT = """
Use the following pieces of information enclosed in <context> tags to provide an answer to the question enclosed in <question> tags.
<context>
{context}
</context>
<question>
{question}
</question>
"""

我們使用託管在 Hugging Face 推理伺服器上的 Mixtral-8x7B-Instruct-v0.1 來根據提示詞生成響應。

from huggingface_hub import InferenceClient

repo_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

llm_client = InferenceClient(model=repo_id, timeout=120)

最後,我們可以格式化提示詞並生成答案。

prompt = PROMPT.format(context=context, question=question)
>>> answer = llm_client.text_generation(
...     prompt,
...     max_new_tokens=1000,
... ).strip()
>>> print(answer)
The legal basis for the proposal is Article 114 of the Treaty on the Functioning of the European Union (TFEU), which provides for the adoption of measures to ensure the establishment and functioning of the internal market. The proposal aims to establish harmonized rules for the development, placing on the market, and use of AI systems in the Union following a proportionate risk-based approach.

恭喜!您已經成功使用 Hugging Face 和 Milvus 構建了一個 RAG 管道。

< > 在 GitHub 上更新

© . This site is unofficial and not affiliated with Hugging Face, Inc.