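How to deploy embedding models to Amazon SageMaker using the new Hugging Face Embedding DLC

This example shows how to deploy open embedding models, such as Snowflake/snowflake-arctic-embed-l, BAAI/bge-large-en-v1.5, or sentence-transformers/all-MiniLM-L6-v2, to Amazon SageMaker for inference using the new Hugging Face Embedding Inference Container. We will deploy Snowflake/snowflake-arctic-embed-m, one of the best open embedding models for retrieval and ranking on the MTEB Leaderboard.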

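This example covers:

Setting up the development environment
Retrieving the new Hugging Face Embedding container
Deploying Snowflake Arctic to Amazon SageMaker
Running inference and evaluating performance
Deleting the model and endpoint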

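What is the Hugging Face Embedding DLC?

The Hugging Face Embedding DLC is a new purpose-built inference container for easily deploying embedding models in a secure and managed environment. The DLC is powered by Text Embeddings Inference (TEI), a very fast and memory-efficient solution for deploying and serving embedding models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE, and E5. TEI implements many features, such as: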

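No model graph compilation step
Small Docker images and fast boot times
Token-based dynamic batching
Optimized transformers code for inference using Flash Attention, Candle, and cuBLASLt
Safetensors weight loading
Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)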
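
To make the idea of token-based dynamic batching more concrete, here is a minimal Python sketch. It is not TEI's actual scheduler (TEI is implemented in Rust and its internals are not described in this example); the function name batch_by_token_budget and the max_batch_tokens parameter are hypothetical and chosen only for illustration. The sketch groups pending requests so that the total token count of each batch stays within a fixed budget, rather than fixing the number of requests per batch.

```python
from typing import List


def batch_by_token_budget(token_counts: List[int], max_batch_tokens: int) -> List[List[int]]:
    """Group request indices into batches whose total token count fits a budget.

    Illustrative only: a greedy, in-order grouping, not TEI's real scheduler.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_tokens = 0
    for idx, n_tokens in enumerate(token_counts):
        if n_tokens > max_batch_tokens:
            raise ValueError(f"request {idx} has {n_tokens} tokens, above the budget")
        # Close the current batch when adding this request would exceed the budget.
        if current and current_tokens + n_tokens > max_batch_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += n_tokens
    if current:
        batches.append(current)
    return batches


# Token counts of six pending requests (made-up numbers).
pending = [256, 32, 128, 512, 64, 8]
batches = batch_by_token_budget(pending, max_batch_tokens=512)
print(batches)  # [[0, 1, 2], [3], [4, 5]]
```

With a token budget, one long request can fill a batch by itself while several short requests share one, which keeps the work per batch roughly constant.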

TEI supports the following model architectures:

BERT/CamemBERT, e.g. BAAI/bge-large-en-v1.5 or Snowflake/snowflake-arctic-embed-m
RoBERTa, e.g. sentence-transformers/all-roberta-large-v1
XLM-RoBERTa, e.g. sentence-transformers/paraphrase-xlm-r-multilingual-v1
NomicBert, e.g. nomic-ai/nomic-embed-text-v1.5
JinaBert, e.g. jinaai/jina-embeddings-v2-base-en

Let's get started!

1. Set up the development environment

We will use the sagemaker Python SDK to deploy Snowflake Arctic to Amazon SageMaker. Make sure you have an AWS account configured and the sagemaker Python SDK installed.

!pip install "sagemaker>=2.221.1" --upgrade --quiet

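If you are going to use SageMaker in a local environment, you need access to an IAM role with the permissions required for SageMaker. See the SageMaker documentation on IAM roles for more information.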

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

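2. Retrieve the new Hugging Face Embedding container

Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and pass it to our HuggingFaceModel model class as image_uri. To retrieve the new Hugging Face Embedding container in Amazon SageMaker, we can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK, which returns the URI of the desired Hugging Face Embedding container. Note that TEI ships in two different versions, one for CPU and one for GPU, so we create a helper function that retrieves the correct image URI based on the instance type.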

from sagemaker.huggingface import get_huggingface_llm_image_uri


# retrieve the image uri based on instance type
def get_image_uri(instance_type):
    key = (
        "huggingface-tei"
        if instance_type.startswith("ml.g") or instance_type.startswith("ml.p")
        else "huggingface-tei-cpu"
    )
    return get_huggingface_llm_image_uri(key, version="1.4.0")

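3. Deploy Snowflake Arctic to Amazon SageMaker

To deploy Snowflake/snowflake-arctic-embed-m to Amazon SageMaker, we create a HuggingFaceModel model class and define our endpoint configuration, including the HF_MODEL_ID, the instance_type, and so on. We will use the ml.c6i.2xlarge instance type, which has 4 Intel Ice Lake vCPUs and 8 GB of memory and costs around $0.204 per hour.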

import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
# CPU instance used in this walkthrough; switch to e.g. "ml.g5.xlarge" for a GPU endpoint
instance_type = "ml.c6i.2xlarge"

# Define Model and Endpoint configuration parameter
config = {
    "HF_MODEL_ID": "Snowflake/snowflake-arctic-embed-m",  # model_id from hf.co/models
}

# create HuggingFaceModel with the image uri
emb_model = HuggingFaceModel(role=role, image_uri=get_image_uri(instance_type), env=config)

After creating the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.c6i.2xlarge instance type.

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
emb = emb_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
)

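SageMaker will now create our endpoint and deploy the model to it. This can take around 5 minutes.

4. Run and evaluate inference performance

After our endpoint is deployed, we can run inference on it using the predict method of the predictor.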

data = {
    "inputs": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
}

res = emb.predict(data=data)


# print some results
print(f"length of embeddings: {len(res[0])}")
print(f"first 10 elements of embeddings: {res[0][:10]}")

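Awesome! We can now generate embeddings with our model. Let's test its performance.

We will send 3,900 requests to our endpoint using 10 concurrent threads and measure the endpoint's average latency and throughput. Each request carries an input of 256 tokens, for a total of roughly 1 million tokens. We chose 256 tokens as the input length to strike a balance between shorter and longer inputs.

Note: When the load test was run, the requests were sent from Europe and the endpoint was deployed in us-east-1. This adds network overhead.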
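
Since the model is meant for retrieval and ranking, a natural next step is to compare embeddings. The following is a minimal, dependency-free sketch of ranking documents by cosine similarity. The helper names cosine_similarity and rank_documents are hypothetical, and the tiny 3-dimensional vectors are made up for illustration; in practice the vectors would be the lists returned by emb.predict (for example res[0]).

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up vectors standing in for real embeddings.
query = [1.0, 0.0, 1.0]
docs = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
ranking = rank_documents(query, docs)
print([idx for idx, _ in ranking])  # [1, 2, 0]
```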

import threading
import time

number_of_threads = 10
number_of_requests = int(3900 // number_of_threads)
print(f"number of threads: {number_of_threads}")
print(f"number of requests per thread: {number_of_requests}")


def send_requests():
    for _ in range(number_of_requests):
        # input counted at https://huggingface.co/spaces/Xenova/the-tokenizer-playground to be about 256 tokens
        emb.predict(
            data={
                "inputs": "Hugging Face is a company and a popular platform in the field of natural language processing (NLP) and machine learning. They are known for their contributions to the development of state-of-the-art models for various NLP tasks and for providing a platform that facilitates the sharing and usage of pre-trained models. One of the key offerings from Hugging Face is the Transformers library, which is an open-source library for working with a variety of pre-trained transformer models, including those for text generation, translation, summarization, question answering, and more. The library is widely used in the research and development of NLP applications and is supported by a large and active community. Hugging Face also provides a model hub where users can discover, share, and download pre-trained models. Additionally, they offer tools and frameworks to make it easier for developers to integrate and use these models in their own projects. The company has played a significant role in advancing the field of NLP and making cutting-edge models more accessible to the broader community. Hugging Face also provides a model hub where users can discover, share, and download pre-trained models. Additionally, they offer tools and frameworks to make it easier for developers and ma"
            }
        )


# Create multiple threads
threads = [threading.Thread(target=send_requests) for _ in range(number_of_threads)]
# start all threads
start = time.time()
for t in threads:
    t.start()
# wait for all threads to finish
for t in threads:
    t.join()
print(f"total time: {round(time.time() - start)} seconds")
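
The script above only reports total wall-clock time. If you also want client-side latency per request, one possible variant is sketched below using only the standard library. The helper run_load_test and the stand-in fake_predict are hypothetical names; fake_predict simply sleeps so that the sketch runs without an endpoint. Against a real endpoint you would pass something like lambda: emb.predict(data=data) instead. Note that latencies measured this way include network time, unlike the ModelLatency metric in CloudWatch.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_load_test(send: Callable[[], object], total_requests: int, concurrency: int) -> Dict[str, float]:
    """Call `send` total_requests times on `concurrency` threads and summarize client-side latency."""

    def timed_call(_: int) -> float:
        t0 = time.perf_counter()
        send()
        return time.perf_counter() - t0

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies: List[float] = list(pool.map(timed_call, range(total_requests)))
    wall = time.perf_counter() - start

    return {
        "requests": float(len(latencies)),
        "wall_seconds": wall,
        "requests_per_second": len(latencies) / wall,
        "mean_latency_s": statistics.mean(latencies),
        # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile
        "p95_latency_s": statistics.quantiles(latencies, n=20)[-1],
    }


def fake_predict() -> None:
    """Stand-in for a real endpoint call; sleeps for about 5 ms."""
    time.sleep(0.005)


summary = run_load_test(fake_predict, total_requests=40, concurrency=10)
print(summary)
```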

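Sending 3,900 requests, or embedding 1 million tokens, takes around 841 seconds. This means we can run around 5 requests per second. Keep in mind that this includes the network latency from Europe to us-east-1. When we inspect the endpoint's latency through CloudWatch, we can see that our embedding model has a latency of 2 seconds at 10 concurrent requests. That is impressive for a small, older CPU instance that costs around $150 per month. You can deploy the model to a GPU instance to get faster inference times.

Note: We ran the same test on an ml.g5.xlarge with 1 NVIDIA A10G GPU. Embedding 1 million tokens takes around 30 seconds, which means we can run around 130 requests per second. The endpoint's latency is 4 ms at 10 concurrent requests. The ml.g5.xlarge costs around $1.408 per hour on Amazon SageMaker.

GPU instances are much faster than CPU instances, but they are also more expensive per hour. If you want to bulk-process embeddings, you can use a GPU instance. If you want to run a small endpoint at low cost, you can use a CPU instance. We plan to run a dedicated benchmark for the Hugging Face Embedding DLC in the future.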

print(
    f"https://console.aws.amazon.com/cloudwatch/home?region={sess.boto_region_name}#metricsV2:graph=~(metrics~(~(~'AWS*2fSageMaker~'ModelLatency~'EndpointName~'{emb.endpoint_name}~'VariantName~'AllTraffic))~view~'timeSeries~stacked~false~region~'{sess.boto_region_name}~start~'-PT5M~end~'P0D~stat~'Average~period~30);query=~'*7bAWS*2fSageMaker*2cEndpointName*2cVariantName*7d*20{emb.endpoint_name}"
)
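
The throughput and cost figures quoted above can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers stated in this example (3,900 requests of 256 tokens, 841 s at $0.204/hour on CPU, 30 s at $1.408/hour on GPU); the helper name summarize_run is hypothetical, and the cost per million tokens assumes the endpoint is kept fully busy, which a real workload may not achieve.

```python
from typing import Dict


def summarize_run(requests: int, tokens_per_request: int, seconds: float, price_per_hour: float) -> Dict[str, float]:
    """Derive throughput and cost figures from one load-test run."""
    total_tokens = requests * tokens_per_request
    run_cost = (seconds / 3600) * price_per_hour  # cost of the instance time used by the run
    return {
        "total_tokens": float(total_tokens),
        "requests_per_second": requests / seconds,
        "tokens_per_second": total_tokens / seconds,
        "cost_per_million_tokens": run_cost / total_tokens * 1_000_000,
    }


cpu = summarize_run(requests=3900, tokens_per_request=256, seconds=841, price_per_hour=0.204)
gpu = summarize_run(requests=3900, tokens_per_request=256, seconds=30, price_per_hour=1.408)

for name, run in (("cpu", cpu), ("gpu", gpu)):
    print(
        f"{name}: {run['requests_per_second']:.1f} req/s, "
        f"{run['tokens_per_second']:.0f} tokens/s, "
        f"${run['cost_per_million_tokens']:.4f} per 1M tokens"
    )
```

Under these numbers the GPU run works out cheaper per million tokens despite the higher hourly price, which is consistent with the advice to use a GPU instance for bulk embedding and a CPU instance for a small, lightly loaded endpoint.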

5. Delete model and endpoint

To clean up, we can delete the model and the endpoint.

emb.delete_model()
emb.delete_endpoint()

📍 Find the complete example on GitHub.