Semantic search with FAISS

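In Chapter 5, we created a dataset of GitHub issues and comments from the 🤗 Datasets repository. In this section, we'll use this information to build a search engine that can help us find answers to our most pressing questions about the library!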

Using embeddings for semantic search

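As we saw in Chapter 1, Transformer-based language models represent each token in a span of text as an embedding vector. It turns out that one can "pool" the individual embeddings to create a vector representation for whole sentences, paragraphs, or (in some cases) documents. These embeddings can then be used to find similar documents in the corpus by computing the dot-product similarity (or some other similarity metric) between each embedding and the query, and returning the documents with the greatest overlap.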
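As a minimal sketch of this idea, the following pure-Python snippet mean-pools made-up token embeddings into one vector per text and ranks two documents by their dot product with a query. The three-dimensional vectors, the helper names `mean_pool` and `dot`, and the document labels are all hypothetical and chosen for illustration; real models produce vectors with hundreds of dimensions.

```python
# Toy illustration only: the vectors below are invented, not model outputs.


def mean_pool(token_embeddings):
    """Average a list of token vectors into a single vector."""
    n_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[i] for tok in token_embeddings) / n_tokens for i in range(dim)]


def dot(u, v):
    """Dot-product similarity between two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical token embeddings for two documents and one query.
documents = {
    "doc_offline": [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    "doc_metrics": [[0.0, 2.0, 0.0], [0.0, 4.0, 1.0]],
}
query_tokens = [[2.0, 0.0, 1.0], [2.0, 0.0, 3.0]]

query_vec = mean_pool(query_tokens)
doc_vecs = {name: mean_pool(tokens) for name, tokens in documents.items()}

# Rank documents from most to least similar to the query.
ranking = sorted(
    doc_vecs, key=lambda name: dot(query_vec, doc_vecs[name]), reverse=True
)
print(ranking)
```

The document whose pooled vector points in the same direction as the query vector ends up first in `ranking`.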

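In this section, we'll use embeddings to develop a semantic search engine. These search engines offer several advantages over conventional approaches that are based on matching keywords in a query with the documents; for example, they can surface a relevant document even when it shares no exact words with the query.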

Loading and preparing the dataset

The first thing we need to do is download our dataset of GitHub issues, so let's use the load_dataset() function as usual:

from datasets import load_dataset

issues_dataset = load_dataset("lewtun/github-issues", split="train")
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 2855
})

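Here we've specified the default train split in load_dataset(), so it returns a Dataset instead of a DatasetDict. The first order of business is to filter out the pull requests, as these are rarely used for answering user queries and will introduce noise in our search engine. As should be familiar by now, we can use the Dataset.filter() function to exclude these rows from our dataset. While we're at it, let's also filter out rows with no comments, since these provide no answers to user queries: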

issues_dataset = issues_dataset.filter(
    lambda x: (x["is_pull_request"] == False and len(x["comments"]) > 0)
)
issues_dataset

Dataset({
    features: ['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'pull_request', 'body', 'performed_via_github_app', 'is_pull_request'],
    num_rows: 771
})

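We can see that there are a lot of columns in our dataset, most of which we don't need to build our search engine. From a search perspective, the most informative columns are title, body, and comments, while html_url provides us with a link back to the source issue. Let's use the Dataset.remove_columns() function to drop the rest: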

columns = issues_dataset.column_names
columns_to_keep = ["title", "body", "html_url", "comments"]
columns_to_remove = set(columns_to_keep).symmetric_difference(columns)
issues_dataset = issues_dataset.remove_columns(columns_to_remove)
issues_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 771
})
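The `symmetric_difference()` call works here because every name in `columns_to_keep` is also a dataset column, so the symmetric difference reduces to "all columns except the ones we keep". The sketch below uses a shortened, illustrative column list (not the full schema) to show this, next to a plain set difference that does not rely on that assumption. The `typo_keep` list is a hypothetical example of a misspelled column name.

```python
# Shortened, illustrative column list (not the full dataset schema).
columns = ["url", "html_url", "title", "user", "comments", "body", "state"]
columns_to_keep = ["title", "body", "html_url", "comments"]

# Elements that are in exactly one of the two collections.
via_symmetric_difference = set(columns_to_keep).symmetric_difference(columns)

# Elements of `columns` that are not in `columns_to_keep`.
via_difference = set(columns) - set(columns_to_keep)

print(sorted(via_symmetric_difference))
print(sorted(via_difference))

# Hypothetical typo: "bodyy" is not a real column name.
typo_keep = ["title", "bodyy"]
typo_in_symmetric = "bodyy" in set(typo_keep).symmetric_difference(columns)
typo_in_difference = "bodyy" in set(columns) - set(typo_keep)
print(typo_in_symmetric, typo_in_difference)
```

Both approaches give the same result when the names to keep are all valid. With a misspelled name, the symmetric difference contains a name that is not a real column, while the plain difference silently ignores it.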

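To create our embeddings, we'll augment each comment with the issue's title and body, since these fields often include useful contextual information. Because our comments column is currently a list of comments for each issue, we need to "explode" the column so that each row consists of an (html_url, title, body, comment) tuple. In Pandas we can do this with the DataFrame.explode() function, which creates a new row for each element in a list-like column, while replicating all the other column values. To see this in action, let's first switch to the Pandas DataFrame format: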

issues_dataset.set_format("pandas")
df = issues_dataset[:]

If we inspect the first row in this DataFrame, we can see there are four comments associated with this issue:

df["comments"][0].tolist()

['the bug code locate in ：\r\n    if data_args.task_name is not None:\r\n        # Downloading and loading a dataset from the hub.\r\n        datasets = load_dataset("glue", data_args.task_name, cache_dir=model_args.cache_dir)',
 'Hi @jinec,\r\n\r\nFrom time to time we get this kind of `ConnectionError` coming from the github.com website: https://raw.githubusercontent.com\r\n\r\nNormally, it should work if you wait a little and then retry.\r\n\r\nCould you please confirm if the problem persists?',
 'cannot connect，even by Web browser，please check that  there is some  problems。',
 'I can access https://raw.githubusercontent.com/huggingface/datasets/1.7.0/datasets/glue/glue.py without problem...']

When we explode df, we expect to get one row for each of these comments. Let's check if that's the case:

comments_df = df.explode("comments", ignore_index=True)
comments_df.head(4)

	html_url	title	comments	body
0	https://github.com/huggingface/datasets/issues/2787	ConnectionError: Couldn't reach https://raw.githubusercontent.com	the bug code locate in ：\r\n    if data_args.task_name is not None...	Hello,\r\nI am trying to run run_glue.py and it gives me this error...
1	https://github.com/huggingface/datasets/issues/2787	ConnectionError: Couldn't reach https://raw.githubusercontent.com	Hi @jinec,\r\n\r\nFrom time to time we get this kind of `ConnectionError` coming from the github.com website: https://raw.githubusercontent.com...	Hello,\r\nI am trying to run run_glue.py and it gives me this error...
2	https://github.com/huggingface/datasets/issues/2787	ConnectionError: Couldn't reach https://raw.githubusercontent.com	cannot connect，even by Web browser，please check that  there is some  problems。	Hello,\r\nI am trying to run run_glue.py and it gives me this error...
3	https://github.com/huggingface/datasets/issues/2787	ConnectionError: Couldn't reach https://raw.githubusercontent.com	I can access https://raw.githubusercontent.com/huggingface/datasets/1.7.0/datasets/glue/glue.py without problem...	Hello,\r\nI am trying to run run_glue.py and it gives me this error...

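Great, we can see the rows have been replicated, with the comments column containing the individual comments! Now that we're finished with Pandas, we can quickly switch back to a Dataset by loading the DataFrame in memory: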

from datasets import Dataset

comments_dataset = Dataset.from_pandas(comments_df)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body'],
    num_rows: 2842
})

Okay, this has given us a few thousand comments to work with!
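To make the row-replicating behavior of DataFrame.explode() concrete, here is a rough pure-Python sketch of the same idea on a list of dictionaries. The `explode` helper and the two toy issues are hypothetical. Unlike Pandas, this sketch simply drops rows whose list is empty, which does not matter here because issues without comments were already filtered out.

```python
def explode(rows, column):
    """Return one output row per element of the list stored in `column`."""
    exploded = []
    for row in rows:
        for item in row[column]:
            new_row = dict(row)     # copy all the other column values
            new_row[column] = item  # replace the list with a single element
            exploded.append(new_row)
    return exploded


# Invented toy data standing in for issues with a list of comments each.
issues = [
    {"title": "Issue A", "comments": ["first comment", "second comment"]},
    {"title": "Issue B", "comments": ["only comment"]},
]

comment_rows = explode(issues, "comments")
for row in comment_rows:
    print(row)
```

Two issues with three comments in total become three rows, each carrying its issue's title.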

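✏️ Try it out! See if you can use Dataset.map() to explode the comments column of issues_dataset without resorting to the use of Pandas. This is a little tricky; you might find the "Batch mapping" section of the 🤗 Datasets documentation useful for this task.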

Now that we have one comment per row, let's create a new comment_length column that contains the number of words per comment:

comments_dataset = comments_dataset.map(
    lambda x: {"comment_length": len(x["comments"].split())}
)

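We can use this new column to filter out short comments, which typically include things like "cc @lewtun" or "Thanks!" that are not relevant for our search engine. There's no precise number to select for the filter, but around 15 words seems like a good start: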

comments_dataset = comments_dataset.filter(lambda x: x["comment_length"] > 15)
comments_dataset

Dataset({
    features: ['html_url', 'title', 'comments', 'body', 'comment_length'],
    num_rows: 2098
})
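To see what the whitespace-based word count does to individual comments, here is a small standalone sketch. The three example comments are invented, and `MIN_WORDS` is a hypothetical name for the threshold of 15 used in the text.

```python
# Invented example comments; the first two are the kind we want to drop.
comments = [
    "cc @lewtun",
    "thanks!",
    "You can now load local csv, text, json and pandas files offline "
    "because the builders ship with the package.",
]
MIN_WORDS = 15  # same threshold as in the text; tune as needed

# str.split() with no arguments splits on any run of whitespace.
lengths = [len(comment.split()) for comment in comments]
kept = [comment for comment, n in zip(comments, lengths) if n > MIN_WORDS]

print(lengths)
print(len(kept))
```

Note that splitting on whitespace treats a URL or a token such as "csv," as a single word, so the count is only a rough proxy for how informative a comment is.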

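Having cleaned up our dataset a bit, let's concatenate the issue title, description, and comments together in a new text column. As usual, we'll write a simple function that we can pass to Dataset.map():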

def concatenate_text(examples):
    return {
        "text": examples["title"]
        + " \n "
        + examples["body"]
        + " \n "
        + examples["comments"]
    }


comments_dataset = comments_dataset.map(concatenate_text)

We're finally ready to create some embeddings! Let's take a look.

Creating text embeddings

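We saw in Chapter 2 that we can obtain token embeddings by using the AutoModel class. All we need to do is pick a suitable checkpoint to load the model from. Fortunately, there's a library called sentence-transformers that is dedicated to creating embeddings. As described in the library's documentation, our use case is an example of asymmetric semantic search because we have a short query whose answer we'd like to find in a longer document, like an issue comment. The handy model overview table in the documentation indicates that the multi-qa-mpnet-base-dot-v1 checkpoint has the best performance for semantic search, so we'll use that for our application. We'll also load the tokenizer using the same checkpoint: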

from transformers import AutoTokenizer, AutoModel

model_ckpt = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModel.from_pretrained(model_ckpt)

To speed up the embedding process, it helps to place the model and inputs on a GPU device, so let's do that now:

import torch

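# Use the GPU when one is available; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")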
model.to(device)

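As we mentioned earlier, we'd like to represent each entry in our GitHub issues corpus as a single vector, so we need to "pool" or average our token embeddings in some way. One popular approach is to perform CLS pooling on our model's outputs, where we simply collect the last hidden state for the special [CLS] token. The following function does the trick for us: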

def cls_pooling(model_output):
    return model_output.last_hidden_state[:, 0]
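The indexing `last_hidden_state[:, 0]` keeps, for every sequence in the batch, the vector at position 0, which is where the [CLS] token sits. The sketch below mimics this with nested lists standing in for a tensor of shape (batch, sequence length, hidden size). The numbers and the helper names `cls_pooling_lists` and `mean_pooling_lists` are hypothetical. Mean pooling over non-padded tokens is shown only for comparison; it is not the method used in this section.

```python
# Nested lists stand in for a tensor of shape (batch=2, seq_len=3, hidden=2).
last_hidden_state = [
    [[1.0, 2.0], [2.0, 2.0], [3.0, 5.0]],  # sequence 0: three real tokens
    [[0.5, 1.5], [2.5, 4.5], [9.0, 9.0]],  # sequence 1: last position is padding
]
attention_mask = [
    [1, 1, 1],
    [1, 1, 0],
]


def cls_pooling_lists(hidden):
    """Equivalent of hidden[:, 0]: keep the first token's vector per sequence."""
    return [sequence[0] for sequence in hidden]


def mean_pooling_lists(hidden, mask):
    """Average the token vectors of each sequence, ignoring padded positions."""
    pooled = []
    for sequence, seq_mask in zip(hidden, mask):
        n_real = sum(seq_mask)
        dim = len(sequence[0])
        pooled.append(
            [
                sum(tok[i] * m for tok, m in zip(sequence, seq_mask)) / n_real
                for i in range(dim)
            ]
        )
    return pooled


cls_vectors = cls_pooling_lists(last_hidden_state)
mean_vectors = mean_pooling_lists(last_hidden_state, attention_mask)
print(cls_vectors)
print(mean_vectors)
```

Either way, each sequence is reduced to a single vector of the hidden size, which is what the search index needs.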

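Next, we'll create a helper function that will tokenize a list of documents, place the tensors on the GPU, feed them to the model, and finally apply CLS pooling to the outputs: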

def get_embeddings(text_list):
    encoded_input = tokenizer(
        text_list, padding=True, truncation=True, return_tensors="pt"
    )
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
    model_output = model(**encoded_input)
    return cls_pooling(model_output)
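In the tokenizer call, `padding=True` pads every sequence to the longest one in the batch and `truncation=True` cuts sequences that exceed the model's maximum length, so the batch can be stacked into rectangular tensors. The following pure-Python sketch illustrates that bookkeeping on lists of token IDs. The `pad_and_truncate` helper, `PAD_ID`, `MAX_LENGTH`, and the IDs themselves are hypothetical, and the sketch ignores the special tokens that a real tokenizer preserves when truncating.

```python
PAD_ID = 0      # hypothetical padding id; real tokenizers define their own
MAX_LENGTH = 5  # hypothetical limit; real models allow much longer inputs


def pad_and_truncate(batch_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Truncate each sequence to max_length, then pad to the longest one."""
    truncated = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_and_truncate([[11, 12, 13], [21, 22, 23, 24, 25, 26, 27]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

After this step both sequences have the same length, and the attention mask records which positions hold real tokens.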

We can test that the function works by feeding it the first text entry in our corpus and inspecting the output shape:

embedding = get_embeddings(comments_dataset["text"][0])
embedding.shape

torch.Size([1, 768])

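Great, we've converted the first entry in our corpus into a 768-dimensional vector! We can use Dataset.map() to apply our get_embeddings() function to each row in our corpus, so let's create a new embeddings column as follows: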

embeddings_dataset = comments_dataset.map(
    lambda x: {"embeddings": get_embeddings(x["text"]).detach().cpu().numpy()[0]}
)

Notice that we've converted the embeddings to NumPy arrays. That's because 🤗 Datasets requires this format when we try to index them with FAISS, which we'll do next.

Using FAISS for efficient similarity search

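Now that we have a dataset of embeddings, we need some way to search over them. To do this, we'll use a special data structure in 🤗 Datasets called a FAISS index. FAISS (short for Facebook AI Similarity Search) is a library that provides efficient algorithms to quickly search and cluster embedding vectors.

The basic idea behind FAISS is to create a special data structure called an index that allows one to find which embeddings are similar to an input embedding. Creating a FAISS index in 🤗 Datasets is simple: we use the Dataset.add_faiss_index() function and specify which column of our dataset we'd like to index: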

embeddings_dataset.add_faiss_index(column="embeddings")
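Conceptually, the simplest kind of index stores every vector and compares the query against all of them. The sketch below is a brute-force stand-in written in plain Python, not a description of FAISS internals. The `ToyFlatIndex` class and its vectors are hypothetical, and it assumes an inner-product score where larger means more similar. A real index may be configured with a different metric, such as L2 distance, in which case smaller values mean closer.

```python
import heapq


class ToyFlatIndex:
    """Brute-force index: stores vectors and scans all of them at query time."""

    def __init__(self):
        self.vectors = []

    def add(self, vectors):
        """Append vectors to the index; their position serves as their id."""
        self.vectors.extend(vectors)

    def search(self, query, k):
        """Return the k highest inner-product scores and the matching ids."""
        scored = [
            (sum(q * x for q, x in zip(query, vec)), position)
            for position, vec in enumerate(self.vectors)
        ]
        top = heapq.nlargest(k, scored)
        return [score for score, _ in top], [position for _, position in top]


# Invented two-dimensional vectors standing in for document embeddings.
index = ToyFlatIndex()
index.add([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 0.5]])

top_scores, top_ids = index.search([1.0, 0.0], k=2)
print(top_scores, top_ids)
```

A full scan like this costs time proportional to the number of stored vectors for every query. Libraries such as FAISS exist to make this kind of lookup fast on large collections.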

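We can now perform queries on this index by doing a nearest neighbor lookup with the Dataset.get_nearest_examples() function. Let's test this out by first embedding a question as follows: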

question = "How can I load a dataset offline?"
question_embedding = get_embeddings([question]).cpu().detach().numpy()
question_embedding.shape

(1, 768)

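Just like with the documents, we now have a 768-dimensional vector representing the query, which we can compare against the whole corpus to find the most similar embeddings: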

scores, samples = embeddings_dataset.get_nearest_examples(
    "embeddings", question_embedding, k=5
)

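The Dataset.get_nearest_examples() function returns a tuple of scores that rank the overlap between the query and the document, and a corresponding set of samples (here, the 5 best matches). Let's collect these in a pandas.DataFrame so we can easily sort them: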

import pandas as pd

samples_df = pd.DataFrame.from_dict(samples)
samples_df["scores"] = scores
samples_df.sort_values("scores", ascending=False, inplace=True)
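The samples come back in a column-oriented form, with one list of values per column, which is why they load directly into a DataFrame. As a rough sketch of the same collect-and-sort step without Pandas, the snippet below uses invented stand-ins for the two return values; the real scores, titles, and URLs will differ. Whether descending or ascending order puts the best match first depends on whether the score is a similarity or a distance.

```python
# Invented stand-ins for the two return values; real results will differ.
scores = [0.2, 0.9, 0.5]
samples = {
    "title": ["Issue A", "Issue B", "Issue C"],
    "html_url": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ],
}

# Turn the column-oriented dict into one dict per row and attach the score.
rows = [
    {**{column: values[i] for column, values in samples.items()}, "scores": score}
    for i, score in enumerate(scores)
]

# Sort from highest to lowest score, mirroring ascending=False.
rows.sort(key=lambda row: row["scores"], reverse=True)

for row in rows:
    print(row["scores"], row["title"], row["html_url"])
```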

Now we can iterate over the first few rows to see how well our query matched the available comments:

for _, row in samples_df.iterrows():
    print(f"COMMENT: {row.comments}")
    print(f"SCORE: {row.scores}")
    print(f"TITLE: {row.title}")
    print(f"URL: {row.html_url}")
    print("=" * 50)
    print()

"""
COMMENT: Requiring online connection is a deal breaker in some cases unfortunately so it'd be great if offline mode is added similar to how `transformers` loads models offline fine.

@mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?
SCORE: 25.505046844482422
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: The local dataset builders (csv, text , json and pandas) are now part of the `datasets` package since #1726 :)
You can now use them offline
\`\`\`python
datasets = load_dataset("text", data_files=data_files)
\`\`\`

We'll do a new release soon
SCORE: 24.555509567260742
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: I opened a PR that allows to reload modules that have already been loaded once even if there's no internet.

Let me know if you know other ways that can make the offline mode experience better. I'd be happy to add them :)

I already note the "freeze" modules option, to prevent local modules updates. It would be a cool feature.

----------

> @mandubian's second bullet point suggests that there's a workaround allowing you to use your offline (custom?) dataset with `datasets`. Could you please elaborate on how that should look like?

Indeed `load_dataset` allows to load remote dataset script (squad, glue, etc.) but also you own local ones.
For example if you have a dataset script at `./my_dataset/my_dataset.py` then you can do
\`\`\`python
load_dataset("./my_dataset")
\`\`\`
and the dataset script will generate your dataset once and for all.

----------

About I'm looking into having `csv`, `json`, `text`, `pandas` dataset builders already included in the `datasets` package, so that they are available offline by default, as opposed to the other datasets that require the script to be downloaded.
cf #1724
SCORE: 24.14896583557129
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: > here is my way to load a dataset offline, but it **requires** an online machine
>
> 1. (online machine)
>
> ```
>
> import datasets
>
> data = datasets.load_dataset(...)
>
> data.save_to_disk(/YOUR/DATASET/DIR)
>
> ```
>
> 2. copy the dir from online to the offline machine
>
> 3. (offline machine)
>
> ```
>
> import datasets
>
> data = datasets.load_from_disk(/SAVED/DATA/DIR)
>
> ```
>
>
>
> HTH.


SCORE: 22.893993377685547
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================

COMMENT: here is my way to load a dataset offline, but it **requires** an online machine
1. (online machine)
\`\`\`
import datasets
data = datasets.load_dataset(...)
data.save_to_disk(/YOUR/DATASET/DIR)
\`\`\`
2. copy the dir from online to the offline machine
3. (offline machine)
\`\`\`
import datasets
data = datasets.load_from_disk(/SAVED/DATA/DIR)
\`\`\`

HTH.
SCORE: 22.406635284423828
TITLE: Discussion using datasets in offline mode
URL: https://github.com/huggingface/datasets/issues/824
==================================================
"""

Not bad! Our second hit seems to match the query.

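✏️ Try it out! Create your own query and see whether you can find an answer in the retrieved documents. You might have to increase the k parameter in Dataset.get_nearest_examples() to broaden the search.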
