Open-Source AI Cookbook

Building a RAG System with Gemma, MongoDB, and Open-Source Models

Authored by: Richmond Alake

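Step 1: Installing Libraries

The shell commands below install the libraries used to work with open-source large language models (LLMs), embedding models, and the database. Together they reduce the development of a RAG system to a small amount of code.

PyMongo: A Python library for interacting with MongoDB. It provides functionality to connect to a cluster and to query the data stored in collections and documents.
Pandas: Provides data structures for efficient data processing and analysis in Python.
Hugging Face datasets: Provides access to audio, vision, and text datasets.
Hugging Face Accelerate: Abstracts away the complexity of writing code that uses hardware accelerators such as GPUs. This implementation uses Accelerate to run the Gemma model on GPU resources.
Hugging Face Transformers: Provides access to a large collection of pre-trained models.
Hugging Face Sentence Transformers: Provides access to sentence, text, and image embeddings.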

!pip install datasets pandas pymongo sentence_transformers
!pip install -U transformers
# Install below if using GPU
!pip install accelerate
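
Before moving on, it can help to confirm that the packages above are importable in the current environment. The following is a minimal sketch using only the standard library. The helper name check_installed is hypothetical, and the import names are assumed to match the pip package names listed above.

```python
import importlib.util

# Import names assumed to correspond to the packages installed above.
REQUIRED_MODULES = [
    "datasets",
    "pandas",
    "pymongo",
    "sentence_transformers",
    "transformers",
    "accelerate",  # only needed when running on a GPU
]


def check_installed(modules):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in check_installed(REQUIRED_MODULES).items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

A module reported as MISSING usually means the corresponding pip command did not run in the same environment as the notebook kernel.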

Step 2: Data Sourcing and Preparation

The data used in this tutorial comes from Hugging Face datasets, specifically the AIatMongoDB/embedded_movies dataset.

# Load Dataset
from datasets import load_dataset
import pandas as pd

# https://huggingface.co/datasets/AIatMongoDB/embedded_movies
dataset = load_dataset("AIatMongoDB/embedded_movies")

# Convert the dataset to a pandas dataframe
dataset_df = pd.DataFrame(dataset["train"])

dataset_df.head(5)

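The operations in the following code snippet focus on enforcing data integrity and quality.

The first step ensures that each data point's fullplot attribute is not empty, since this is the primary data used in the embedding process.
The second step removes the plot_embedding attribute from every data point, because it will be replaced by new embeddings created with a different embedding model, gte-large.

>>> # Data Preparation

>>> # Remove data points where the fullplot column is missing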
>>> dataset_df = dataset_df.dropna(subset=["fullplot"])
>>> print("\nNumber of missing values in each column after removal:")
>>> print(dataset_df.isnull().sum())

>>> # Remove the plot_embedding from each data point in the dataset as we are going to create new embeddings with an open source embedding model from Hugging Face
>>> dataset_df = dataset_df.drop(columns=["plot_embedding"])
>>> dataset_df.head(5)

Number of missing values in each column after removal:
num_mflix_comments      0
genres                  0
countries               0
directors              12
fullplot                0
writers                13
awards                  0
runtime                14
type                    0
rated                 279
metacritic            893
poster                 78
languages               1
imdb                    0
plot                    0
cast                    1
plot_embedding          1
title                   0
dtype: int64

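Step 3: Generating Embeddings

The steps in the code snippet are as follows:

Import the SentenceTransformer class to access the embedding models.
Load the gte-large embedding model by passing its name to the SentenceTransformer constructor.
Define the get_embedding function, which takes a text string as input and returns a list of floats representing the embedding. The function first checks whether the input text is empty after stripping whitespace. If it is, the function returns an empty list; otherwise it generates an embedding with the loaded model.
Generate an embedding for each movie plot by applying get_embedding to the "fullplot" column of the dataset_df DataFrame. The resulting embeddings are assigned to a new column named embedding.

Note: It is not necessary to chunk the text of the full plots, as the text length stays within a manageable range.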

from sentence_transformers import SentenceTransformer

# https://huggingface.co/thenlper/gte-large
embedding_model = SentenceTransformer("thenlper/gte-large")


def get_embedding(text: str) -> list[float]:
    if not text.strip():
        print("Attempted to get embedding for empty text.")
        return []

    embedding = embedding_model.encode(text)

    return embedding.tolist()


dataset_df["embedding"] = dataset_df["fullplot"].apply(get_embedding)

dataset_df.head()
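
The note above assumes that every plot is short enough to embed whole. One rough way to check that assumption is to count whitespace-separated words per text and flag the outliers. This is a sketch only: find_long_texts and the 300-word threshold are hypothetical, and a word count merely approximates the number of tokens the embedding model sees.

```python
def find_long_texts(texts, max_words=300):
    """Return (index, word_count) pairs for texts longer than max_words.

    max_words is an illustrative threshold, not a property of gte-large.
    """
    long_texts = []
    for index, text in enumerate(texts):
        word_count = len(text.split())
        if word_count > max_words:
            long_texts.append((index, word_count))
    return long_texts


# Tiny stand-in for dataset_df["fullplot"].tolist()
sample_plots = [
    "A short plot about two friends.",
    "word " * 400,  # an artificially long plot
]
print(find_long_texts(sample_plots))
```

If many texts are flagged, chunking them before embedding may be worth considering, since embedding models generally truncate input beyond their maximum sequence length.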

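Step 4: Database Setup and Connection

MongoDB acts as both an operational database and a vector database. It offers a single database solution that efficiently stores, queries, and retrieves vector embeddings, which simplifies database maintenance, management, and cost.

To create a new MongoDB database, set up a database cluster:

Go to the official MongoDB website and register for a free MongoDB Atlas account, or, for existing users, sign in to MongoDB Atlas.
Select the 'Database' option in the left-hand pane. This navigates to the Database Deployment page, which shows the deployment specification of any existing cluster. Create a new database cluster by clicking the "+Create" button.
Select all the applicable configurations for the database cluster. Once all configuration options are selected, click the "Create Cluster" button to deploy the newly created cluster. MongoDB also supports creating free clusters on the "Shared Tab".

Note: Don't forget to whitelist the IP of the Python host, or, when building a proof of concept, allow any IP with 0.0.0.0/0.
After the cluster is created and deployed, it becomes accessible on the 'Database Deployment' page.
Click the cluster's "Connect" button to view the options for setting up a connection to the cluster through various language drivers.
This tutorial only requires the cluster's URI (uniform resource identifier). Copy the URI into the Google Colab Secrets environment under the name MONGO_URI, or place it in a .env file or equivalent.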
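
Outside Google Colab, the URI is typically made available as an environment variable. The sketch below shows one way to read it using only the standard library. The helper name load_mongo_uri is hypothetical, and it assumes the value has already been exported under the name MONGO_URI (for example by whatever tool loads your .env file).

```python
import os


def load_mongo_uri(env_var="MONGO_URI"):
    """Read the cluster URI from an environment variable.

    Raises a clear error instead of returning None so that a missing
    secret is caught before any connection attempt.
    """
    uri = os.environ.get(env_var, "").strip()
    if not uri:
        raise RuntimeError(f"{env_var} is not set")
    return uri
```

In Colab, userdata.get("MONGO_URI") plays the same role. Either way, keeping the URI out of the notebook itself avoids committing credentials by accident.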

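4.1 Database and Collection Setup

Before moving forward, ensure the following prerequisites are met:

A database cluster is set up on MongoDB Atlas
The cluster URI has been obtained

For help with setting up a database cluster and obtaining the URI, refer to our guide on setting up a MongoDB cluster and getting your connection string.

Once the cluster is created, create a database and a collection within the MongoDB Atlas cluster by clicking + Create Database on the cluster overview page.

Here is a guide for creating a database and collection.

The database will be named movies.

The collection will be named movie_collection_2.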

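Step 5: Create a Vector Search Index

At this point, make sure that your vector index has been created via MongoDB Atlas. The query code in Step 7 refers to the index by the name vector_index.

This step is mandatory for conducting efficient and accurate vector-based searches over the vector embeddings stored in the documents of the movie_collection_2 collection.

A vector search index lets the database traverse documents efficiently and retrieve those whose embeddings match the query embedding based on vector similarity.

For more detail, see the MongoDB documentation on Vector Search Index.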

{
 "fields": [{
     "numDimensions": 1024,
     "path": "embedding",
     "similarity": "cosine",
     "type": "vector"
   }]
}

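The value 1024 in the numDimensions field corresponds to the dimension of the vectors generated by the gte-large embedding model. If you use the gte-base or gte-small embedding model instead, numDimensions in the vector search index must be set to 768 or 384 respectively.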
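
Because a mismatch between the model's output size and numDimensions is easy to introduce when swapping models, the relationship can be captured in a small helper. This is a sketch under assumptions: build_index_definition and check_embedding are hypothetical names, and the gte-base and gte-small model identifiers are assumed to follow the same thenlper/ prefix as the gte-large identifier used in this tutorial.

```python
import json

# Output dimensions for the three models named above.
# The gte-base and gte-small identifiers are assumed, not taken from the tutorial.
MODEL_DIMENSIONS = {
    "thenlper/gte-large": 1024,
    "thenlper/gte-base": 768,
    "thenlper/gte-small": 384,
}


def build_index_definition(model_name, path="embedding", similarity="cosine"):
    """Build a vector search index definition matching the model's output size."""
    return {
        "fields": [
            {
                "numDimensions": MODEL_DIMENSIONS[model_name],
                "path": path,
                "similarity": similarity,
                "type": "vector",
            }
        ]
    }


def check_embedding(embedding, definition):
    """Return True if the embedding length matches the index's numDimensions."""
    return len(embedding) == definition["fields"][0]["numDimensions"]


definition = build_index_definition("thenlper/gte-large")
print(json.dumps(definition, indent=2))
```

The printed JSON has the same shape as the definition shown above and can be pasted into the Atlas index editor. Running check_embedding on one stored embedding before creating the index is a cheap way to catch a dimension mismatch.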

Step 6: Establish Data Connection

The code snippet below uses PyMongo to create a MongoDB client object, which represents the connection to the cluster and gives access to its databases and collections.

>>> import pymongo
>>> from google.colab import userdata


>>> def get_mongo_client(mongo_uri):
...     """Establish connection to the MongoDB."""
...     try:
...         client = pymongo.MongoClient(mongo_uri)
...         # MongoClient connects lazily, so ping the server to confirm the connection
...         client.admin.command("ping")
...         print("Connection to MongoDB successful")
...         return client
...     except pymongo.errors.ConnectionFailure as e:
...         print(f"Connection failed: {e}")
...         return None


>>> mongo_uri = userdata.get("MONGO_URI")
>>> if not mongo_uri:
...     raise ValueError("MONGO_URI not set in Colab Secrets")

>>> mongo_client = get_mongo_client(mongo_uri)

>>> # Ingest data into MongoDB
>>> db = mongo_client["movies"]
>>> collection = db["movie_collection_2"]

Connection to MongoDB successful

# Delete any existing records in the collection
collection.delete_many({})

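Ingesting data from a pandas DataFrame into a MongoDB collection is straightforward: convert the DataFrame to a list of dictionaries, one per row, and pass those records to the collection's insert_many method.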

>>> documents = dataset_df.to_dict("records")
>>> collection.insert_many(documents)

>>> print("Data ingestion into MongoDB completed")

Data ingestion into MongoDB completed
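
The tutorial inserts all records in a single call. With larger collections, one option is to send the records in batches instead. The generator below is a hypothetical helper, not part of PyMongo, and the batch size of 500 is an arbitrary illustration. Each batch it yields could be passed to collection.insert_many.

```python
def iter_batches(records, batch_size=500):
    """Yield consecutive slices of records, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield records[start : start + batch_size]


# Stand-in for dataset_df.to_dict("records")
documents = [{"title": f"Movie {i}"} for i in range(1201)]

for batch in iter_batches(documents, batch_size=500):
    # In the tutorial's setting this would be: collection.insert_many(batch)
    print(len(batch))
```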

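Step 7: Perform Vector Search on User Queries

The next step implements a function that returns vector search results by generating a query embedding and defining a MongoDB aggregation pipeline.

The pipeline consists of the $vectorSearch and $project stages. It runs the query with the generated vector and formats the results to include only the required information, such as plot, title, and genres, along with a search score for each result.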

def vector_search(user_query, collection):
    """
    Perform a vector search in the MongoDB collection based on the user query.

    Args:
    user_query (str): The user's query string.
    collection (MongoCollection): The MongoDB collection to search.

    Returns:
    list: A list of matching documents.
    """

    # Generate embedding for the user query
    query_embedding = get_embedding(user_query)

    # get_embedding returns an empty list for empty input
    if not query_embedding:
        return []

    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",
                "queryVector": query_embedding,
                "path": "embedding",
                "numCandidates": 150,  # Number of candidate matches to consider
                "limit": 4,  # Return top 4 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "fullplot": 1,  # Include the fullplot field
                "title": 1,  # Include the title field
                "genres": 1,  # Include the genres field
                "score": {"$meta": "vectorSearchScore"},  # Include the search score
            }
        },
    ]

    # Execute the search
    results = collection.aggregate(pipeline)
    return list(results)
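
To make the ranking step concrete, the sketch below performs the same kind of search by brute force in memory: it scores every document against the query with cosine similarity and keeps the top results. It is an illustration only, with hypothetical function names and toy vectors. Atlas performs an approximate search over numCandidates candidates and reports a normalized score, so the values here will not match vectorSearchScore exactly.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_search(query_vector, documents, limit=4):
    """Rank documents by cosine similarity to the query and keep the top `limit`."""
    scored = [
        {"title": doc["title"], "score": cosine_similarity(query_vector, doc["embedding"])}
        for doc in documents
    ]
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored[:limit]


# Toy 3-dimensional embeddings; real gte-large vectors have 1024 dimensions.
toy_documents = [
    {"title": "A", "embedding": [1.0, 0.0, 0.0]},
    {"title": "B", "embedding": [0.0, 1.0, 0.0]},
    {"title": "C", "embedding": [0.9, 0.1, 0.0]},
]
results = brute_force_search([1.0, 0.0, 0.0], toy_documents, limit=2)
print([r["title"] for r in results])
```

Document B is orthogonal to the query and is dropped by the limit, which mirrors how the limit field in the pipeline caps the number of returned matches.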

Step 8: Handling User Queries and Loading Gemma

def get_search_result(query, collection):

    get_knowledge = vector_search(query, collection)

    search_result = ""
    for result in get_knowledge:
        search_result += f"Title: {result.get('title', 'N/A')}, Plot: {result.get('fullplot', 'N/A')}\n"

    return search_result

>>> # Conduct query with retrieval of sources
>>> query = "What is the best romantic movie to watch and why?"
>>> source_information = get_search_result(query, collection)
>>> combined_information = (
...     f"Query: {query}\nContinue to answer the query by using the Search Results:\n{source_information}."
... )

>>> print(combined_information)

Query: What is the best romantic movie to watch and why?
Continue to answer the query by using the Search Results:
Title: Shut Up and Kiss Me!, Plot: Ryan and Pete are 27-year old best friends in Miami, born on the same day and each searching for the perfect woman. Ryan is a rookie stockbroker living with his psychic Mom. Pete is a slick surfer dude yet to find commitment. Each meets the women of their dreams on the same day. Ryan knocks heads in an elevator with the gorgeous Jessica, passing out before getting her number. Pete falls for the insatiable Tiara, but Tiara's uncle is mob boss Vincent Bublione, charged with her protection. This high-energy romantic comedy asks to what extent will you go for true love?
Title: Pearl Harbor, Plot: Pearl Harbor is a classic tale of romance set during a war that complicates everything. It all starts when childhood friends Rafe and Danny become Army Air Corps pilots and meet Evelyn, a Navy nurse. Rafe falls head over heels and next thing you know Evelyn and Rafe are hooking up. Then Rafe volunteers to go fight in Britain and Evelyn and Danny get transferred to Pearl Harbor. While Rafe is off fighting everything gets completely whack and next thing you know everybody is in the middle of an air raid we now know as "Pearl Harbor."
Title: Titanic, Plot: The plot focuses on the romances of two couples upon the doomed ship's maiden voyage. Isabella Paradine (Catherine Zeta-Jones) is a wealthy woman mourning the loss of her aunt, who reignites a romance with former flame Wynn Park (Peter Gallagher). Meanwhile, a charming ne'er-do-well named Jamie Perse (Mike Doyle) steals a ticket for the ship, and falls for a sweet innocent Irish girl on board. But their romance is threatened by the villainous Simon Doonan (Tim Curry), who has discovered about the ticket and makes Jamie his unwilling accomplice, as well as having sinister plans for the girl.
Title: China Girl, Plot: A modern day Romeo & Juliet story is told in New York when an Italian boy and a Chinese girl become lovers, causing a tragic conflict between ethnic gangs.
.
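
The prompt assembly above can also be wrapped in a single function that handles the case where the search returns nothing. This is a sketch, assuming results shaped like the dictionaries returned by vector_search. The name build_prompt and the fallback to a bare query are illustrative choices, not part of the original notebook.

```python
def build_prompt(query, results):
    """Combine the user query and retrieved documents into one prompt string.

    `results` is a list of dicts with optional 'title' and 'fullplot' keys,
    the shape returned by vector_search in this tutorial.
    """
    if not results:
        # No context retrieved: fall back to the bare query.
        return f"Query: {query}"
    context = "\n".join(
        f"Title: {r.get('title', 'N/A')}, Plot: {r.get('fullplot', 'N/A')}" for r in results
    )
    return f"Query: {query}\nContinue to answer the query by using the Search Results:\n{context}"


sample = [{"title": "China Girl", "fullplot": "A modern day Romeo & Juliet story."}]
print(build_prompt("What is the best romantic movie to watch and why?", sample))
```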

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
# CPU Enabled uncomment below 👇🏽
# model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
# GPU Enabled use below 👇🏽
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", device_map="auto")

>>> # Move the input tensors to the same device as the model
>>> input_ids = tokenizer(combined_information, return_tensors="pt").to(model.device)
>>> response = model.generate(**input_ids, max_new_tokens=500)
>>> print(tokenizer.decode(response[0]))

Based on the search results, the best romantic movie to watch is **Shut Up and Kiss Me!** because it is a romantic comedy that explores the complexities of love and relationships. The movie is funny, heartwarming, and thought-provoking.
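
For decoder-only models such as Gemma, model.generate returns the prompt tokens followed by the newly generated ones, so decoding response[0] directly reproduces the prompt as well as the answer. The sketch below shows the slicing idea on plain Python lists with made-up token ids. The helper name strip_prompt_tokens is hypothetical.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Return only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


# Made-up token ids purely for illustration.
prompt_ids = [2, 101, 102, 103]
generated = prompt_ids + [201, 202, 1]  # prompt followed by new tokens

new_tokens = strip_prompt_tokens(generated, len(prompt_ids))
print(new_tokens)
```

With the variables used in this tutorial, the equivalent slice would be response[0][input_ids["input_ids"].shape[-1]:], applied before calling tokenizer.decode. Passing skip_special_tokens=True to tokenizer.decode additionally drops special marker tokens from the printed text.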

