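Semantic search with FAISS

In section 5, we created a dataset of GitHub issues and comments from the 🤗 Datasets repository. In this section, we'll use this information to build a search engine that can help us find answers to our most pressing questions about the library!

Using embeddings for semantic search

As we saw in Chapter 1, Transformer-based language models represent each token in a span of text as an embedding vector. It turns out that one can "pool" the individual token embeddings to create a vector representation for whole sentences, paragraphs, or (in some cases) documents. These embeddings can then be used to find similar documents in the corpus by computing the dot-product similarity (or some other similarity metric) between each pair of embeddings and returning the documents with the greatest overlap.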

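In this section, we'll use embeddings to develop a semantic search engine. These search engines offer several advantages over conventional approaches that match keywords in a query against the documents; most notably, they can retrieve documents that express the same idea as the query in different words.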
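As a minimal sketch of the idea above, the following toy example mean-pools made-up token embeddings into one vector per document and ranks the documents by their dot product with a query vector. The 3-dimensional vectors and the helper names `mean_pool` and `rank_by_dot_product` are illustrative assumptions, not part of any library.

```python
import numpy as np


def mean_pool(token_embeddings):
    """Average token vectors of shape (num_tokens, dim) into one (dim,) vector."""
    return np.asarray(token_embeddings, dtype=float).mean(axis=0)


def rank_by_dot_product(query, documents):
    """Return document indices sorted from highest to lowest dot product."""
    scores = np.asarray(documents) @ np.asarray(query)
    return np.argsort(-scores), scores


# Made-up token embeddings for three tiny "documents"
docs_tokens = [
    [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],  # mostly along axis 0
    [[0.0, 1.0, 0.0], [0.0, 0.9, 0.1]],  # mostly along axis 1
    [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9]],  # mostly along axis 2
]
doc_vectors = np.stack([mean_pool(tokens) for tokens in docs_tokens])

# A query pointing along axis 1 should match the second document best
query_vector = np.array([0.0, 1.0, 0.0])
order, scores = rank_by_dot_product(query_vector, doc_vectors)
print(order)
```

A real system replaces the made-up vectors with the output of a language model, but the ranking step stays the same.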

Loading and preparing the dataset

The first thing we need to do is download our dataset of GitHub issues, so let's use the load_dataset() function as usual:

from datasets import load_dataset

issues_dataset = load_dataset("lewtun/github-issues", split="train")
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 2855
})

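Here we've specified the default train split in load_dataset(), so it returns a Dataset instead of a DatasetDict. The first order of business is to filter out the pull requests, as these are rarely useful for answering user queries and would introduce noise into our search engine. As should be familiar by now, we can use the Dataset.filter() function to exclude these rows from our dataset. While we're at it, let's also filter out rows with no comments, since these provide no answers to user queries: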

issues_dataset = issues_dataset.filter(
    lambda x: (x["is_pull_request"] == False and len(x["comments"]) > 0)
)
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 771
})

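We can see that there are a lot of columns in our dataset, most of which we don't need to build our search engine. From a search perspective, the most informative columns are title, body, and comments, while html_url provides us with a link back to the source issue. Let's use the Dataset.remove_columns() function to drop the rest: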

columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 771
})
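The column selection above relies on a set operation, so here is a small sketch of why it works. The symmetric difference returns the elements that appear in exactly one of the two sets, which equals "all columns minus the kept ones" only when every kept name really exists in the dataset. The shortened column list and the misspelled name `bodyy` are made up for illustration.

```python
all_columns = ["url", "title", "body", "html_url", "comments", "state"]
keep = ["title", "body", "html_url", "comments"]

# When every kept column exists in the dataset, both operations agree
via_symmetric = set(keep).symmetric_difference(all_columns)
via_difference = set(all_columns) - set(keep)
print(sorted(via_symmetric))

# With a name that is not in the dataset, the symmetric difference also
# returns that unknown name, whereas a plain difference ignores it
keep_with_typo = ["title", "bodyy"]
with_typo = set(keep_with_typo).symmetric_difference(all_columns)
print(sorted(with_typo))
print(sorted(set(all_columns) - set(keep_with_typo)))
```

If the list of columns to keep might contain names that are absent from the dataset, a plain set difference is the more forgiving choice.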

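To create our embeddings we'll augment each comment with the issue's title and body, since these fields often include useful contextual information. Because our comments column is currently a list of comments for each issue, we need to "explode" the column so that each row consists of an (html_url, title, body, comment) tuple. In Pandas we can do this with the DataFrame.explode() function, which creates a new row for each element in a list-like column, while replicating all the other column values. To see this in action, let's first switch to the Pandas DataFrame format: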

issues_dataset.set_format("pandas")
df = issues_dataset[:]

If we inspect the first row in this DataFrame we can see there are four comments associated with this issue:

df["comments"][0].tolist()

['the bug code locate in ：\r\n    if data_args.task_name is not None:\r\n        # Downloading and loading a dataset from the hub.\r\n        datasets = load_dataset("glue", data_args.task_name, cache_dir=model_args.cache_dir)',
 'Hi @jinec,\r\n\r\nFrom time to time we get this kind of `ConnectionError` coming from the github.com website: https://raw.githubusercontent.com\r\n\r\nNormally, it should work if you wait a little and then retry.\r\n\r\nCould you please confirm if the problem persists?',
 'cannot connect，even by Web browser，please check that  there is some  problems。',
 'I can access https://raw.githubusercontent.com/huggingface/datasets/1.7.0/datasets/glue/glue.py without problem...']

When we explode df, we expect to get one row for each of these comments. Let's check if that's the case:

comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)

	html_url	title	comments	body
0	https://github.com/huggingface/datasets/issues/2787	ConnectionError: Couldn't reach https://raw.githubusercontent.com	the bug code locate in ：\r\n    if data_args.task_name is not None...	Hello,\r\nI am trying to run run_glue.py and it gives me this error...
1	https://github.com/huggingface/datasets/issues/2787	ConnectionError: Couldn't reach https://raw.githubusercontent.com	Hi @jinec,\r\n\r\nFrom time to time we get this kind of `ConnectionError` coming from the github.com website: https://raw.githubusercontent.com...	Hello,\r\nI am trying to run run_glue.py and it gives me this error...
2	https://github.com/huggingface/datasets/issues/2787	ConnectionError: Couldn't reach https://raw.githubusercontent.com	cannot connect，even by Web browser，please check that  there is some  problems。	Hello,\r\nI am trying to run run_glue.py and it gives me this error...
3	https://github.com/huggingface/datasets/issues/2787	ConnectionError: Couldn't reach https://raw.githubusercontent.com	I can access https://raw.githubusercontent.com/huggingface/datasets/1.7.0/datasets/glue/glue.py without problem...	Hello,\r\nI am trying to run run_glue.py and it gives me this error...

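Great, we can see the rows have been replicated, with the comments column containing the individual comments! Now that we're finished with Pandas, we can quickly switch back to a Dataset by loading the DataFrame in memory: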

from datasets import Dataset

comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2842
})

Okay, this has given us a few thousand comments to work with!
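To make the explode step concrete, here is a plain-Python sketch of the same idea applied to a list of row dictionaries. The helper name `explode` and the two toy issues are assumptions made for illustration. Unlike pandas, this sketch simply drops rows whose list is empty, which does not matter here because issues without comments were filtered out earlier.

```python
def explode(rows, column):
    """Yield one copy of each row per element of the list stored in `column`."""
    for row in rows:
        for item in row[column]:
            new_row = dict(row)  # copy the other column values
            new_row[column] = item  # replace the list with a single element
            yield new_row


issues = [
    {"title": "Issue A", "comments": ["first", "second"]},
    {"title": "Issue B", "comments": ["only one"]},
]
exploded = list(explode(issues, "comments"))
for row in exploded:
    print(row)
```

Two issues with three comments in total become three rows, each carrying its issue title.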

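✏️ Try it out! See if you can use Dataset.map() to explode the comments column of issues_dataset without resorting to Pandas. This is a little tricky; you might find the "Batch mapping" section of the 🤗 Datasets documentation useful for this task.

Now that we have one comment per row, let's create a new comment_length column that contains the number of words per comment: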

comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split())}
)

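We can use this new column to filter out short comments, which typically include things like "cc @lewtun" or "Thanks!" that are not relevant for our search engine. There's no precise number to select for the filter, but around 15 words seems like a good start: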

comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2098
})
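As a quick illustration of how the length filter behaves, the sketch below counts whitespace-separated tokens with str.split() and applies the same strict "greater than 15" threshold to three sample comments. The helper name `count_words` is an assumption; the long sample is taken from the comments shown earlier.

```python
def count_words(comment):
    """Number of whitespace-separated tokens, as computed by str.split()."""
    return len(comment.split())


sample_comments = [
    "cc @lewtun",
    "Thanks!",
    "Normally, it should work if you wait a little and then retry. "
    "Could you please confirm if the problem persists?",
]
lengths = [count_words(comment) for comment in sample_comments]
kept = [comment for comment in sample_comments if count_words(comment) > 15]
print(lengths)
print(len(kept))
```

Only the third comment survives the filter; a comment with exactly 15 words would be dropped because the comparison is strict.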

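Having cleaned up our dataset a bit, let's concatenate the issue title, description, and comments together in a new text column. As usual, we'll write a simple function that we can pass to Dataset.map():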

def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }


comments_dataset = comments_dataset.map(concatenate_text)

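We're finally ready to create some embeddings! Let's take a look.

Creating text embeddings

We saw in Chapter 2 that we can obtain token embeddings by using the AutoModel class. All we need to do is pick a suitable checkpoint to load the model from. Fortunately, there's a library called sentence-transformers that is dedicated to creating embeddings. As described in the library's documentation, our use case is an example of asymmetric semantic search because we have a short query whose answer we'd like to find in a longer document, like an issue comment. The handy model overview table in the documentation indicates that the multi-qa-mpnet-base-dot-v1 checkpoint has the best performance for semantic search, so we'll use that for our application. We'll also load the tokenizer using the same checkpoint: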

from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

To speed up the embedding process, it helps to place the model and inputs on a GPU device, so let's do that now:

import torch

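# Use the GPU when one is available; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")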
model.to(device)

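As we mentioned earlier, we'd like to represent each entry in our GitHub issues corpus as a single vector, so we need to "pool" or average our token embeddings in some way. One popular approach is to perform CLS pooling on our model's outputs, where we simply collect the last hidden state for the special [CLS] token. The following function does the trick for us: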

def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]
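To see what the indexing `[:, 0]` selects, here is a small NumPy sketch that applies CLS pooling to a fake hidden-state array and, for comparison, computes a masked mean over the tokens (the "average" alternative mentioned above). The array values, the attention mask, and the helper names `cls_pooling_np` and `mean_pooling_np` are assumptions made for illustration; a real model output is a PyTorch tensor rather than a NumPy array.

```python
import numpy as np

# Fake model output: batch of 2 sequences, 3 tokens each, hidden size 4
last_hidden_state = np.arange(24, dtype=float).reshape(2, 3, 4)
# 1 marks a real token, 0 marks padding (the second sequence has one pad token)
attention_mask = np.array([[1, 1, 1], [1, 1, 0]])


def cls_pooling_np(hidden):
    """Keep only the vector at position 0 of every sequence."""
    return hidden[:, 0]


def mean_pooling_np(hidden, mask):
    """Average the token vectors of each sequence, ignoring padded positions."""
    mask = mask[:, :, None].astype(float)
    return (hidden * mask).sum(axis=1) / mask.sum(axis=1)


cls_vectors = cls_pooling_np(last_hidden_state)
mean_vectors = mean_pooling_np(last_hidden_state, attention_mask)
print(cls_vectors.shape, mean_vectors.shape)
```

Both strategies reduce a (batch, tokens, hidden) array to one (hidden,) vector per sequence; they differ only in which token vectors contribute.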

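Next, we'll create a helper function that will tokenize a list of documents, place the tensors on the GPU, feed them to the model, and finally apply CLS pooling to the outputs: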

def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)

We can test that the function works by feeding it the first text entry in our corpus and inspecting the output shape:

embedding = get_embeddings(comments_dataset["text"][0])
embedding.shape

torch.Size([1, 768])

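Great, we've converted the first entry in our corpus into a 768-dimensional vector! We can use Dataset.map() to apply our get_embeddings() function to each row in our corpus, so let's create a new embeddings column as follows: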

embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

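Notice that we've converted the embeddings to NumPy arrays. That's because 🤗 Datasets requires this format when we try to index them with FAISS, which we'll do next.

Using FAISS for efficient similarity search

Now that we have a dataset of embeddings, we need some way to search over them. To do this, we'll use a special data structure in 🤗 Datasets called a FAISS index. FAISS (short for Facebook AI Similarity Search) is a library that provides efficient algorithms to quickly search and cluster embedding vectors.

The basic idea behind FAISS is to create a special data structure called an index that allows one to find which embeddings are similar to an input embedding. Creating a FAISS index in 🤗 Datasets is simple: we use the Dataset.add_faiss_index() function and specify which column of our dataset we'd like to index: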

embeddings_dataset.add_faiss_index(column="embeddings")
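Conceptually, the simplest kind of index compares the query against every stored vector. The NumPy sketch below illustrates that exhaustive search for two common metrics; it is not how FAISS or 🤗 Datasets is implemented, and the function name `flat_search` and the metric labels are assumptions. Which metric a given index uses depends on how it was created, so it is worth checking whether higher or lower scores mean a closer match before sorting results.

```python
import numpy as np


def flat_search(corpus, query, k, metric="inner_product"):
    """Exhaustively score every corpus vector against the query.

    Returns (scores, indices) for the k best matches. For "inner_product"
    larger scores are better; for "l2" (squared distance) smaller is better.
    """
    corpus = np.asarray(corpus, dtype=float)
    query = np.asarray(query, dtype=float)
    if metric == "inner_product":
        scores = corpus @ query
        order = np.argsort(-scores)[:k]  # largest first
    elif metric == "l2":
        scores = ((corpus - query) ** 2).sum(axis=1)
        order = np.argsort(scores)[:k]  # smallest first
    else:
        raise ValueError(f"unknown metric: {metric}")
    return scores[order], order


corpus = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
query = [1.0, 0.2]

ip_scores, ip_idx = flat_search(corpus, query, k=2, metric="inner_product")
l2_scores, l2_idx = flat_search(corpus, query, k=2, metric="l2")
print(ip_idx, l2_idx)
```

On this toy corpus the two metrics disagree: the long vector wins under the inner product, while the nearby short vector wins under squared L2 distance.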

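We can now perform queries on this index by doing a nearest neighbor lookup with the Dataset.get_nearest_examples() function. Let's test this out by first embedding a question as follows: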

question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

(1, 768)

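Just like with the documents, we now have a 768-dimensional vector representing the query, which we can compare against the whole corpus to find the most similar embeddings: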

scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

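The Dataset.get_nearest_examples() function returns a tuple of scores that rank the overlap between the query and the document, and a corresponding set of samples (here, the 5 best matches). Let's collect these in a pandas.DataFrame so we can easily sort them: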

import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)
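For readers who prefer to avoid Pandas for this step, the same collect-and-sort logic can be sketched in plain Python. The sketch assumes the samples come back as a dictionary mapping column names to lists, which is the layout pd.DataFrame.from_dict() consumes above; the names `toy_scores` and `toy_samples` and their values are made up for illustration.

```python
# Hypothetical stand-ins for the values returned by the search
toy_scores = [24.5, 25.5, 22.4]
toy_samples = {
    "title": ["Issue A", "Issue B", "Issue C"],
    "html_url": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ],
}

# Turn the column-oriented dict into one dict per row and attach the score
rows = [
    dict(zip(toy_samples.keys(), values)) for values in zip(*toy_samples.values())
]
for row, score in zip(rows, toy_scores):
    row["scores"] = score

# Highest score first, mirroring sort_values("scores", ascending=False)
rows.sort(key=lambda row: row["scores"], reverse=True)
print([row["title"] for row in rows])
```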

Now we can iterate over the first few rows to see how well our query matched the available comments:

for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

"""
COMMENT: Requiring online connection is a deal breaker in some cases unfortunately so it'd be great if offline mode is added similar to how `transformers` loads models offline fine.

@mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?
SCORE: 25.505046844482422
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: The local dataset builders (csv, text , json and pandas) are now part of the `datasets` package since #1726 :)
You can now use them offline
\`\`\`python
datasets = load_dataset("text", data_files=data_files)
\`\`\`

We'll do a new release soon
SCORE: 24.555509567260742
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: I opened a PR that allows to reload modules that have already been loaded once even if there's no internet.

Let me know if you know other ways that can make the offline mode experience better. I'd be happy to add them :)

I already note the "freeze" modules option, to prevent local modules updates. It would be a cool feature.

----------

> @mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?

Indeed `load_dataset` allows to load remote dataset script (squad, glue, etc.) but also you own local ones.
For example if you have a dataset script at `./my_dataset/my_dataset.py` then you can do
\`\`\`python
load_dataset("./my_dataset")
\`\`\`
and the dataset script will generate your dataset once and for all.

----------

About I'm looking into having `csv`, `json`, `text`, `pandas` dataset builders already included in the `datasets` package, so that they are available offline by default, as opposed to the other datasets that require the script to be downloaded.
cf #1724
SCORE: 24.14896583557129
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: > here is my way to load a dataset offline, but it **requires** an online machine
>
> 1. (online machine)
>
> ```
>
> import datasets
>
> data = datasets.load_dataset(...)
>
> data.save_to_disk(/YOUR/DATASET/DIR)
>
> ```
>
> 2. copy the dir from online to the offline machine
>
> 3. (offline machine)
>
> ```
>
> import datasets
>
> data = datasets.load_from_disk(/SAVED/DATA/DIR)
>
> ```
>
>
>
> HTH.


SCORE: 22.893993377685547
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: here is my way to load a dataset offline, but it **requires** an online machine
1. (online machine)
\`\`\`
import datasets
data = datasets.load_dataset(...)
data.save_to_disk(/YOUR/DATASET/DIR)
\`\`\`
2. copy the dir from online to the offline machine
3. (offline machine)
\`\`\`
import datasets
data = datasets.load_from_disk(/SAVED/DATA/DIR)
\`\`\`

HTH.
SCORE: 22.406635284423828
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================
"""

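Not bad! Our second hit seems to match the query.

✏️ Try it out! Create your own query and see whether you can find an answer in the retrieved documents. You might have to increase the k parameter in Dataset.get_nearest_examples() to broaden the search.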
