Semantic reranking with Elasticsearch and Hugging Face

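In this notebook, we will learn how to implement semantic reranking in Elasticsearch by uploading a model from Hugging Face to an Elasticsearch cluster. We will use the `retriever` abstraction, a simpler Elasticsearch syntax for building queries and combining different search operations.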

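You will:

- Choose a Hugging Face cross-encoder model to perform semantic reranking.
- Upload the model to your Elasticsearch deployment using Eland, a Python client for machine learning with Elasticsearch.
- Create an inference endpoint to manage your `rerank` task.
- Query your data using the `text_similarity_reranker` retriever.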

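🧰 Requirements

For this example, you will need:

- An Elastic deployment on version 8.15.0 or later (for non-serverless deployments)
  - This example uses Elastic Cloud (available with a free trial).
  - See our other deployment options.
- Your deployment's Cloud ID and an API key. Learn more.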

Install and import packages

ℹ️ Installing `eland` will take a few minutes.

!pip install -qU elasticsearch
!pip install eland[pytorch]
from elasticsearch import Elasticsearch, helpers

Initialize the Elasticsearch Python client

First, you need to connect to your Elasticsearch instance.

>>> from getpass import getpass

>>> # https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#finding-your-cloud-id
>>> ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")

>>> # https://www.elastic.co/search-labs/tutorials/install-elasticsearch/elastic-cloud#creating-an-api-key
>>> ELASTIC_API_KEY = getpass("Elastic Api Key: ")

>>> # Create the client instance
>>> client = Elasticsearch(
...     # For local development
...     # hosts=["https://:9200"]
...     cloud_id=ELASTIC_CLOUD_ID,
...     api_key=ELASTIC_API_KEY,
... )

Elastic Cloud ID: ··········
Elastic Api Key: ··········

Test the connection

Run this test to confirm that the Python client has connected to your Elasticsearch instance.

print(client.info())
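
Since the requirements above ask for version 8.15.0 or later on non-serverless deployments, a quick check of the version string may save some debugging later. The following is a minimal sketch: the helper name `meets_minimum_version` is hypothetical, and it assumes the response from `client.info()` carries the version under `["version"]["number"]` in `major.minor.patch` form, possibly with a suffix such as `-SNAPSHOT`.

```python
def meets_minimum_version(version_number, minimum=(8, 15, 0)):
    """Return True if a 'major.minor.patch' string is at least `minimum`.

    Hypothetical helper for illustration; any suffix after a hyphen
    (for example '-SNAPSHOT') is ignored.
    """
    core = version_number.split("-", 1)[0]
    parts = tuple(int(p) for p in core.split(".")[:3])
    # Pad short versions such as "9.0" so the tuple comparison is well defined
    parts = parts + (0,) * (3 - len(parts))
    return parts >= minimum


# With a live cluster (assumes `client` from the cell above):
# info = client.info()
# print(meets_minimum_version(info["version"]["number"]))

print(meets_minimum_version("8.15.0"))
print(meets_minimum_version("8.14.3"))
```

Serverless projects are excluded from the version requirement in the list above, so this check is only meaningful for other deployment types.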

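This example uses a small dataset of movies, which the following code downloads from Hugging Face and indexes into the `movies` index.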

>>> from urllib.request import urlopen
>>> import json
>>> import time

>>> url = "https://huggingface.co/datasets/leemthompo/small-movies/raw/main/small-movies.json"
>>> response = urlopen(url)

>>> # Load the response data into a JSON object
>>> data_json = json.loads(response.read())

>>> # Prepare the documents to be indexed
>>> documents = []
>>> for doc in data_json:
...     documents.append(
...         {
...             "_index": "movies",
...             "_source": doc,
...         }
...     )

>>> # Use helpers.bulk to index
>>> helpers.bulk(client, documents)

>>> print("Done indexing documents into `movies` index!")
>>> time.sleep(3)

Done indexing documents into `movies` index!
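
`helpers.bulk` accepts any iterable of actions, so the list built above could also be produced lazily. The following is a minimal sketch of that variation; `generate_actions` is a hypothetical name, and the sample documents are made up for illustration rather than taken from the dataset.

```python
def generate_actions(docs, index_name="movies"):
    """Yield one bulk action per document (hypothetical helper)."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


# Made-up sample documents, for illustration only
sample_docs = [
    {"title": "Example A", "plot": "First placeholder plot."},
    {"title": "Example B", "plot": "Second placeholder plot."},
]
actions = list(generate_actions(sample_docs))
print(actions[0])

# With a live cluster (assumes `client`, `helpers`, and `data_json` from above):
# helpers.bulk(client, generate_actions(data_json))
```

For a dataset this small the difference is negligible; a generator mainly helps when the documents do not fit comfortably in memory.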

Upload a Hugging Face model using Eland

Now we will use Eland's `eland_import_hub_model` command to upload the model to Elasticsearch. For this example we have chosen the `cross-encoder/ms-marco-MiniLM-L-6-v2` text similarity model.

>>> !eland_import_hub_model \
...   --cloud-id $ELASTIC_CLOUD_ID \
...   --es-api-key $ELASTIC_API_KEY \
...   --hub-model-id cross-encoder/ms-marco-MiniLM-L-6-v2 \
...   --task-type text_similarity \
...   --clear-previous \
...   --start

2024-08-13 17:04:12,386 INFO : Establishing connection to Elasticsearch
2024-08-13 17:04:12,567 INFO : Connected to serverless cluster 'bd8c004c050e4654ad32fb86ab159889'
2024-08-13 17:04:12,568 INFO : Loading HuggingFace transformer tokenizer and model 'cross-encoder/ms-marco-MiniLM-L-6-v2'
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
tokenizer_config.json: 100% 316/316 [00:00<00:00, 1.81MB/s]
config.json: 100% 794/794 [00:00<00:00, 4.09MB/s]
vocab.txt: 100% 232k/232k [00:00<00:00, 2.37MB/s]
special_tokens_map.json: 100% 112/112 [00:00<00:00, 549kB/s]
pytorch_model.bin: 100% 90.9M/90.9M [00:00<00:00, 135MB/s]
STAGE:2024-08-13 17:04:15 1454:1454 ActivityProfilerController.cpp:312] Completed Stage: Warm Up
STAGE:2024-08-13 17:04:15 1454:1454 ActivityProfilerController.cpp:318] Completed Stage: Collection
STAGE:2024-08-13 17:04:15 1454:1454 ActivityProfilerController.cpp:322] Completed Stage: Post Processing
2024-08-13 17:04:18,789 INFO : Creating model with id 'cross-encoder__ms-marco-minilm-l-6-v2'
2024-08-13 17:04:21,123 INFO : Uploading model definition
100% 87/87 [00:55<00:00,  1.57 parts/s]
2024-08-13 17:05:16,416 INFO : Uploading model vocabulary
2024-08-13 17:05:16,987 INFO : Starting model deployment
2024-08-13 17:05:18,238 INFO : Model successfully imported with id 'cross-encoder__ms-marco-minilm-l-6-v2'

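Create an inference endpoint

Next, we will create an inference endpoint for the `rerank` task to deploy and manage our model and, if necessary, spin up the required ML resources behind the scenes.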

client.inference.put(
    task_type="rerank",
    inference_id="my-msmarco-minilm-model",
    inference_config={
        "service": "elasticsearch",
        "service_settings": {
            "model_id": "cross-encoder__ms-marco-minilm-l-6-v2",
            "num_allocations": 1,
            "num_threads": 1,
        },
    },
)

Run the following command to confirm that your inference endpoint is deployed.

client.inference.get()

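⚠️ When you deploy your model, you might need to sync your ML saved objects in the Kibana (or Serverless) UI. Go to **Trained Models** and select **Synchronize saved objects**.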

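Lexical queries

First, let's use a `standard` retriever to test some lexical (or full-text) searches. Then we will compare the improvement when we add semantic reranking.

Lexical match with a query_string query

Let's say we vaguely remember a famous movie about a killer who eats his victims. For the sake of argument, pretend we have momentarily forgotten the word "cannibal".

Let's run a `query_string` query to look for the phrase "flesh-eating bad guy" in the `plot` field of our Elasticsearch documents.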

>>> resp = client.search(
...     index="movies",
...     retriever={
...         "standard": {
...             "query": {
...                 "query_string": {
...                     "query": "flesh-eating bad guy",
...                     "default_field": "plot",
...                 }
...             }
...         }
...     },
... )

>>> if resp["hits"]["hits"]:
...     for hit in resp["hits"]["hits"]:
...         title = hit["_source"]["title"]
...         plot = hit["_source"]["plot"]
...         print(f"Title: {title}\nPlot: {plot}\n")
... else:
...     print("No search results found")

No search results found

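No results! Unfortunately, we don't have any near-exact matches for "flesh-eating bad guy". Because we don't have any more specific information about the exact wording in the Elasticsearch data, we need to broaden our search.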
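
The loop that prints each hit's title and plot recurs in every query in this notebook, so it could be factored into a small function that also handles the empty case. This is a sketch only; `format_hits` is a hypothetical name, and it assumes each hit's `_source` contains `title` and `plot`, as the documents here do.

```python
def format_hits(resp):
    """Return printable 'Title/Plot' blocks for a search response.

    Hypothetical helper; expects the usual resp["hits"]["hits"] layout
    with `title` and `plot` in each hit's `_source`.
    """
    hits = resp["hits"]["hits"]
    if not hits:
        return ["No search results found"]
    return [
        f"Title: {hit['_source']['title']}\nPlot: {hit['_source']['plot']}\n"
        for hit in hits
    ]


# A hand-written, empty response in the same shape as the one above
empty_resp = {"hits": {"hits": []}}
for block in format_hits(empty_resp):
    print(block)
```

Returning strings instead of printing directly keeps the helper easy to check without a running cluster.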

Simple multi_match query

This lexical query performs a standard keyword search for the term "crime" in the `plot` and `genre` fields of our Elasticsearch documents.

>>> resp = client.search(
...     index="movies",
...     retriever={"standard": {"query": {"multi_match": {"query": "crime", "fields": ["plot", "genre"]}}}},
... )

>>> for hit in resp["hits"]["hits"]:
...     title = hit["_source"]["title"]
...     plot = hit["_source"]["plot"]
...     print(f"Title: {title}\nPlot: {plot}\n")

Title: The Godfather
Plot: An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.

Title: Goodfellas
Plot: The story of Henry Hill and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.

Title: The Silence of the Lambs
Plot: A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.

Title: Pulp Fiction
Plot: The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.

Title: Se7en
Plot: Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly sins as his motives.

Title: The Departed
Plot: An undercover cop and a mole in the police attempt to identify each other while infiltrating an Irish gang in South Boston.

Title: The Usual Suspects
Plot: A sole survivor tells of the twisty events leading up to a horrific gun battle on a boat, which began when five criminals met at a seemingly random police lineup.

Title: The Dark Knight
Plot: When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.

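That's better! At least we have some results now. We broadened our search criteria to increase the chances of finding relevant results.

But these results are not very precise in the context of our original query, "flesh-eating bad guy". We can see that "The Silence of the Lambs" is returned third, in the middle of the result set for this generic `multi_match` query. Let's see if we can use our semantic reranking model to get closer to the searcher's original intent.

Semantic reranker

In the following `retriever` syntax, we wrap our standard query retriever in a `text_similarity_reranker`. This allows us to use the NLP model we deployed to Elasticsearch to rerank the results based on the phrase "flesh-eating bad guy".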

>>> resp = client.search(
...     index="movies",
...     retriever={
...         "text_similarity_reranker": {
...             "retriever": {"standard": {"query": {"multi_match": {"query": "crime", "fields": ["plot", "genre"]}}}},
...             "field": "plot",
...             "inference_id": "my-msmarco-minilm-model",
...             "inference_text": "flesh-eating bad guy",
...         }
...     },
... )

>>> for hit in resp["hits"]["hits"]:
...     title = hit["_source"]["title"]
...     plot = hit["_source"]["plot"]
...     print(f"Title: {title}\nPlot: {plot}\n")

Title: The Silence of the Lambs
Plot: A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.

Title: Pulp Fiction
Plot: The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption.

Title: Se7en
Plot: Two detectives, a rookie and a veteran, hunt a serial killer who uses the seven deadly sins as his motives.

Title: Goodfellas
Plot: The story of Henry Hill and his life in the mob, covering his relationship with his wife Karen Hill and his mob partners Jimmy Conway and Tommy DeVito in the Italian-American crime syndicate.

Title: The Dark Knight
Plot: When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.

Title: The Godfather
Plot: An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son.

Title: The Departed
Plot: An undercover cop and a mole in the police attempt to identify each other while infiltrating an Irish gang in South Boston.

Title: The Usual Suspects
Plot: A sole survivor tells of the twisty events leading up to a horrific gun battle on a boat, which began when five criminals met at a seemingly random police lineup.

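Success! "The Silence of the Lambs" is our top result. Semantic reranking helped us find the most relevant result by parsing a natural language query, overcoming the limitation of lexical search's reliance on exact matching.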
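
To make the effect of reranking concrete, the two printed result lists can be compared position by position. The sketch below copies the titles from the outputs above into plain Python lists; `rank_changes` is a hypothetical helper, not part of any Elasticsearch client.

```python
# Titles in the order printed by the two queries above
lexical_order = [
    "The Godfather",
    "Goodfellas",
    "The Silence of the Lambs",
    "Pulp Fiction",
    "Se7en",
    "The Departed",
    "The Usual Suspects",
    "The Dark Knight",
]
reranked_order = [
    "The Silence of the Lambs",
    "Pulp Fiction",
    "Se7en",
    "Goodfellas",
    "The Dark Knight",
    "The Godfather",
    "The Departed",
    "The Usual Suspects",
]


def rank_changes(before, after):
    """Map each title to (old_rank, new_rank), using 1-based ranks."""
    new_rank = {title: i for i, title in enumerate(after, start=1)}
    return {
        title: (i, new_rank[title])
        for i, title in enumerate(before, start=1)
        if title in new_rank
    }


changes = rank_changes(lexical_order, reranked_order)
for title, (old, new) in changes.items():
    print(f"{title}: {old} -> {new}")
```

Both lists contain the same eight titles: the reranker only reorders the candidates returned by the first-stage `multi_match` query, so a document that the lexical query misses cannot be recovered by reranking.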

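Semantic reranking enables semantic search in a few steps, without the need to generate and store embeddings. Being able to use open source models hosted on Hugging Face natively in your Elasticsearch cluster is great for prototyping, testing, and building search experiences.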
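
Because the reranker simply wraps another retriever, the nesting used in the reranking query above could be factored into a small function for trying other first-stage queries or inference texts. This is a sketch only: `rerank_retriever` is a hypothetical name, and its defaults reuse the field and inference endpoint ID from this notebook.

```python
def rerank_retriever(inner_retriever, inference_text, field="plot",
                     inference_id="my-msmarco-minilm-model"):
    """Wrap a retriever in a `text_similarity_reranker` (hypothetical helper)."""
    return {
        "text_similarity_reranker": {
            "retriever": inner_retriever,
            "field": field,
            "inference_id": inference_id,
            "inference_text": inference_text,
        }
    }


# The same first-stage query used earlier in this notebook
first_stage = {
    "standard": {
        "query": {"multi_match": {"query": "crime", "fields": ["plot", "genre"]}}
    }
}
retriever = rerank_retriever(first_stage, "flesh-eating bad guy")
print(retriever)

# With a live cluster (assumes `client` from above):
# resp = client.search(index="movies", retriever=retriever)
```

The function only builds a dictionary, so the resulting body can be inspected before it is sent to the cluster.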

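Learn more

- In this example we chose the `cross-encoder/ms-marco-MiniLM-L-6-v2` text similarity model. See the Elastic NLP model reference for a list of third-party text similarity models supported by Elasticsearch.
- Learn more about integrating Hugging Face with Elasticsearch.
- Check out Elastic's catalog of Python notebooks in the `elasticsearch-labs` repo.
- Learn more about retrievers and reranking in Elasticsearch.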