Fine-tune and deploy embedding models with Amazon SageMaker

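Embedding models are crucial for successful RAG applications, but they are often trained on general knowledge, which limits their effectiveness for company- or domain-specific use. Customizing embeddings for domain-specific data can significantly boost the retrieval performance of a RAG application. With the new releases of Sentence Transformers 3 and the Hugging Face Embedding Container, fine-tuning and deploying embedding models is easier than ever.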

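In this example, we show how to fine-tune and deploy a custom embedding model on Amazon SageMaker using the new Hugging Face Embedding Container. We will use the Sentence Transformers 3 library to fine-tune a model on a custom dataset and deploy it on Amazon SageMaker for inference. We will fine-tune BAAI/bge-base-en-v1.5 for a financial RAG application using a synthetic dataset built from the 2023_10 NVIDIA SEC Filing.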

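1. Set up the development environment
2. Create and prepare the dataset
3. Fine-tune the embedding model on Amazon SageMaker
4. Deploy and test the fine-tuned embedding model on Amazon SageMaker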

What's new in Sentence Transformers 3?

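Sentence Transformers v3 introduces a new trainer that makes it easier to fine-tune and train embedding models. The update includes enhanced components such as diverse datasets, updated loss functions, and a streamlined training process, improving the efficiency and flexibility of model development.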

What is the Hugging Face Embedding Container?

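The Hugging Face Embedding Container is a new purpose-built inference container for easily deploying embedding models in a secure and managed environment. The DLC (Deep Learning Container) is powered by Text Embeddings Inference (TEI), a blazing fast and memory-efficient solution for deploying and serving embedding models. TEI enables high-performance extraction for the most popular models, including FlagEmbedding, Ember, GTE, and E5. TEI also implements many additional features.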

Note: This blog was created and validated on ml.g5.xlarge for training and ml.c6i.2xlarge for inference.

1. Set up the development environment

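Our first step is to install the Hugging Face libraries we need on the client so that we can correctly prepare our dataset and start our training and evaluation jobs.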

!pip install transformers "datasets[s3]==2.18.0" "sagemaker>=2.190.0" "huggingface_hub[cli]" --upgrade --quiet

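If you are going to use SageMaker in a local environment, you need access to an IAM role with the permissions SageMaker requires. You can find out more about this in the AWS documentation.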

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Create and prepare the dataset

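An embedding dataset typically consists of text pairs (question, answer/context) or triplets that represent relationships or similarities between sentences. The dataset format you choose, or have available, also determines which loss functions you can use. Common formats for embedding datasets are: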

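Positive pair: text pairs of related sentences (query, context | query, answer), suitable for tasks such as similarity or semantic search. Example datasets: `sentence-transformers/sentence-compression`, `sentence-transformers/natural-questions`.
Triplets: text triplets consisting of (anchor, positive, negative). Example datasets: `sentence-transformers/quora-duplicates`, `nirantk/triplets`.
Pair with similarity score: sentence pairs with a similarity score indicating how related they are. Example datasets: `sentence-transformers/stsb`, `PhilipMay/stsb_multi_mt`.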

Learn more in the Sentence Transformers Dataset Overview.
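To make the three formats concrete, the records below sketch what a single row might look like in each case. The sentences and the score are invented for illustration, and the column names (`anchor`, `positive`, `negative`, `sentence1`, `sentence2`, `score`) are assumptions that follow common Sentence Transformers conventions rather than values taken from the datasets listed above.

```python
# Illustrative rows only: the texts and the score are invented for this sketch.

# Positive pair: a query and a text that answers it.
positive_pair = {
    "anchor": "What was the company's revenue in fiscal 2023?",
    "positive": "Revenue for fiscal year 2023 was reported in the annual filing.",
}

# Triplet: the same pair plus a text that should NOT match the anchor.
triplet = {
    **positive_pair,
    "negative": "The company's headquarters are located in California.",
}

# Pair with similarity score: two sentences and a relatedness label in [0, 1].
scored_pair = {
    "sentence1": "A man is playing a guitar.",
    "sentence2": "Someone is playing an instrument.",
    "score": 0.8,
}

for name, row in [("pair", positive_pair), ("triplet", triplet), ("scored", scored_pair)]:
    print(name, "->", list(row.keys()))
```

Which of these shapes a dataset has determines which loss functions are applicable, as noted above.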

We are going to use philschmid/finanical-rag-embedding-dataset, which contains 7,000 positive text pairs of questions and corresponding context from the 2023_10 NVIDIA SEC Filing.

The dataset has the following format:

{"question": "<question>", "context": "<relevant context to answer>"}
{"question": "<question>", "context": "<relevant context to answer>"}
{"question": "<question>", "context": "<relevant context to answer>"}

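We will use the file system integration to upload our dataset to S3. We are using sess.default_bucket(); adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 paths later in our training script.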

from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("philschmid/finanical-rag-embedding-dataset", split="train")
input_path = f"s3://{sess.default_bucket()}/datasets/rag-embedding"

# rename columns
dataset = dataset.rename_column("question", "anchor")
dataset = dataset.rename_column("context", "positive")

# Add an id column to the dataset
dataset = dataset.add_column("id", range(len(dataset)))

# split dataset into a 10% test set
dataset = dataset.train_test_split(test_size=0.1)

# save datasets to s3
dataset["train"].to_json(f"{input_path}/train/dataset.json", orient="records")
train_dataset_s3_path = f"{input_path}/train/dataset.json"
dataset["test"].to_json(f"{input_path}/test/dataset.json", orient="records")
test_dataset_s3_path = f"{input_path}/test/dataset.json"

print("Training data uploaded to:")
print(train_dataset_s3_path)
print(test_dataset_s3_path)
print(
    f"https://s3.console.aws.amazon.com/s3/buckets/{sess.default_bucket()}/?region={sess.boto_region_name}&prefix={input_path.split('/', 3)[-1]}/"
)
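The console link above relies on `input_path.split('/', 3)[-1]` to recover the key prefix from the S3 URI. The sketch below spells that step out with a hypothetical helper, `split_s3_uri`, and a placeholder bucket name; neither is part of the original example.

```python
def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri}")
    # 's3://bucket/a/b'.split('/', 3) -> ['s3:', '', 'bucket', 'a/b']
    parts = uri.split("/", 3)
    bucket = parts[2]
    prefix = parts[3] if len(parts) > 3 else ""
    return bucket, prefix


# "my-bucket" is a placeholder, not the real default bucket name.
bucket, prefix = split_s3_uri("s3://my-bucket/datasets/rag-embedding")
print(bucket)
print(prefix)
```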

3. Fine-tune the embedding model on Amazon SageMaker

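We are now ready to fine-tune our model. We will use the SentenceTransformerTrainer from sentence-transformers. Because it is a subclass of the Trainer from transformers, it makes supervised fine-tuning of open embedding models straightforward. We prepared a script, run_mnr.py, which loads the dataset from disk, prepares the model and tokenizer, and starts the training. The SentenceTransformerTrainer supports: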

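Integrated components: combines datasets, loss functions, and evaluators into a unified training framework.
Flexible data handling: supports various data formats and integrates easily with Hugging Face datasets.
Versatile loss functions: offers multiple loss functions for different training tasks.
Multi-dataset training: facilitates simultaneous training with multiple datasets and different loss functions.
Seamless integration: easy saving, loading, and sharing of models within the Hugging Face ecosystem.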
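The contents of run_mnr.py are not reproduced here, so the following is only a sketch of how such a script might wire these pieces together with the Sentence Transformers v3 API. The function name `build_trainer`, the list of Matryoshka dimensions, and the default output directory are assumptions made for illustration. The imports sit inside the function so that the module can be loaded without the training dependencies installed.

```python
# Assumed Matryoshka dimensions; the real script may use a different list.
MATRYOSHKA_DIMS = [768, 512, 256, 128, 64]


def build_trainer(model_id, train_file, output_dir="/opt/ml/model",
                  num_train_epochs=3, learning_rate=2e-5):
    """Return a SentenceTransformerTrainer for (anchor, positive) pairs."""
    # Imported lazily: these packages are only needed inside the training job.
    from datasets import load_dataset
    from sentence_transformers import (
        SentenceTransformer,
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )
    from sentence_transformers.losses import (
        MatryoshkaLoss,
        MultipleNegativesRankingLoss,
    )
    from sentence_transformers.training_args import BatchSamplers

    model = SentenceTransformer(model_id)

    # Keep only the two text columns; the loss treats them as (anchor, positive).
    train_dataset = load_dataset("json", data_files=train_file, split="train")
    train_dataset = train_dataset.select_columns(["anchor", "positive"])

    # Wrap the ranking loss so it is applied at every truncated dimension.
    inner_loss = MultipleNegativesRankingLoss(model)
    loss = MatryoshkaLoss(model, inner_loss, matryoshka_dims=MATRYOSHKA_DIMS)

    args = SentenceTransformerTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=num_train_epochs,
        learning_rate=learning_rate,
        # Duplicate texts in a batch would act as false negatives.
        batch_sampler=BatchSamplers.NO_DUPLICATES,
    )
    return SentenceTransformerTrainer(
        model=model, args=args, train_dataset=train_dataset, loss=loss
    )
```

Inside the training container, calling `.train()` on the returned trainer would start the fine-tuning run.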

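To create a SageMaker training job, we need a HuggingFace Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure. Amazon SageMaker takes care of starting and managing all the required EC2 instances for us, provides the correct Hugging Face container, uploads the provided scripts, and downloads the data from our S3 bucket into the container at /opt/ml/input/data. It then starts the training job by running our training script.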

Note: Make sure that you include the requirements.txt in the source_dir if you are using a custom training script. We recommend simply cloning the whole repository.

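We first define our training parameters. They are passed to our training script as CLI arguments. We will use the BAAI/bge-base-en-v1.5 model, which is pretrained on a large corpus of English text. We will use MultipleNegativesRankingLoss in combination with MatryoshkaLoss. This approach lets us benefit from the efficiency and flexibility of Matryoshka embeddings, so that different embedding dimensions can be used without a significant drop in performance. MultipleNegativesRankingLoss is a good choice when you only have positive pairs, because it treats the other samples in the batch as negatives, so each sample gets n-1 negatives for a batch of size n.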
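The idea behind the two losses can be illustrated without any deep learning library. The sketch below is a simplified pure-Python rendering, not the Sentence Transformers implementation: it scores every anchor against every positive in the batch, applies cross-entropy with the matching positive as the target, and then sums that loss over truncated copies of the embeddings. The toy vectors, the function names, and the scale factor of 20 are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and the other n-1 positives in the batch serve as negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, p) for p in positives]
        # log-sum-exp with max subtraction for numerical stability
        top = max(scores)
        log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_sum - scores[i]
    return total / len(anchors)


def matryoshka_loss(anchors, positives, dims):
    """Sum the ranking loss over embeddings truncated to each dimension."""
    return sum(
        in_batch_ranking_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d in dims
    )


# Toy 4-dimensional "embeddings" for a batch of n = 3 pairs.
anchors = [[1.0, 0.0, 0.0, 0.1], [0.0, 1.0, 0.0, 0.1], [0.0, 0.0, 1.0, 0.1]]
positives = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.9, 0.0, 0.0], [0.0, 0.1, 0.9, 0.0]]

aligned = in_batch_ranking_loss(anchors, positives)
shuffled = in_batch_ranking_loss(anchors, positives[1:] + positives[:1])
print(f"negatives per sample: {len(positives) - 1}")
print(f"aligned loss is lower than shuffled loss: {aligned < shuffled}")
print(f"matryoshka loss over dims [4, 3]: {matryoshka_loss(anchors, positives, [4, 3]):.4f}")
```

Because every truncated prefix has to rank the correct positive first, training pushes the most useful information into the leading dimensions of the embedding.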

from sagemaker.huggingface import HuggingFace

# define Training Job Name
job_name = "bge-base-exp1"

# define hyperparameters, which are passed into the training job
training_arguments = {
    "model_id": "BAAI/bge-base-en-v1.5",  # model id from the hub
    "train_dataset_path": "/opt/ml/input/data/train/",  # path inside the container where the training data is stored
    "test_dataset_path": "/opt/ml/input/data/test/",  # path inside the container where the test data is stored
    "num_train_epochs": 3,  # number of training epochs
    "learning_rate": 2e-5,  # learning rate
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point="run_mnr.py",  # train script
    source_dir="scripts",  # directory which includes all the files needed for training
    instance_type="ml.g5.xlarge",  # instance type used for the training job
    instance_count=1,  # the number of instances used for training
    max_run=2 * 24 * 60 * 60,  # maximum runtime in seconds (days * hours * minutes * seconds)
    base_job_name=job_name,  # prefix for the training job name
    role=role,  # IAM role used in the training job to access AWS resources, e.g. S3
    transformers_version="4.36.0",  # the transformers version used in the training job
    pytorch_version="2.1.0",  # the pytorch version used in the training job
    py_version="py310",  # the python version used in the training job
    hyperparameters=training_arguments,
    disable_output_compression=True,  # do not compress output, to save training time and cost
    environment={
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache",  # set env variable to cache models in /tmp
    },
)
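Each entry in the hyperparameters dictionary reaches the entry point as a `--key value` pair on the command line. A training script could read them with argparse along the following lines. This is a hypothetical parser written to mirror `training_arguments`; the actual argument handling in run_mnr.py is not shown in this example and may differ.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters that arrive as `--key value` CLI arguments."""
    parser = argparse.ArgumentParser(description="Hypothetical training script arguments")
    parser.add_argument("--model_id", type=str, required=True)
    parser.add_argument("--train_dataset_path", type=str, default="/opt/ml/input/data/train/")
    parser.add_argument("--test_dataset_path", type=str, default="/opt/ml/input/data/test/")
    parser.add_argument("--num_train_epochs", type=int, default=3)
    parser.add_argument("--learning_rate", type=float, default=2e-5)
    return parser.parse_args(argv)


# Simulate the command line built from the `training_arguments` dictionary.
args = parse_args([
    "--model_id", "BAAI/bge-base-en-v1.5",
    "--num_train_epochs", "3",
    "--learning_rate", "2e-5",
])
print(args.model_id, args.num_train_epochs, args.learning_rate)
```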

We can now start our training job with the .fit() method, passing our S3 paths to the training script.

# define a data input dictionary with our uploaded s3 uris
data = {
    "train": train_dataset_s3_path,
    "test": test_dataset_s3_path,
}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)
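The keys of the `data` dictionary become input channels, and each channel is materialized as a directory under /opt/ml/input/data inside the container, which is why the hyperparameters above point at /opt/ml/input/data/train/ and /opt/ml/input/data/test/. SageMaker training containers also expose these locations through `SM_CHANNEL_<NAME>` environment variables. The helper below, `channel_dir`, is a hypothetical convenience function sketched under that assumption.

```python
import os


def channel_dir(channel: str) -> str:
    """Return the container directory for a SageMaker input channel."""
    # SageMaker training containers expose SM_CHANNEL_<NAME>; fall back to
    # the conventional path when the variable is absent (e.g. local runs).
    default = f"/opt/ml/input/data/{channel}"
    return os.environ.get(f"SM_CHANNEL_{channel.upper()}", default)


for name in ("train", "test"):
    print(name, "->", channel_dir(name))
```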

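In our example, training BGE Base for 3 epochs with Flash Attention 2 (SDPA) on a dataset of 6.3k training samples and 700 evaluation samples took 645 seconds (about 11 minutes) on an ml.g5.xlarge (1.2575 $/h), which works out to roughly $0.23 of instance time at that rate.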
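The instance cost of a run can be estimated from its duration and the hourly rate. The short calculation below uses the duration and price quoted above; the function name `training_cost_usd` is made up for this sketch, and it assumes per-second billing with no other charges such as storage.

```python
def training_cost_usd(billable_seconds: float, hourly_rate_usd: float) -> float:
    """Cost of a job billed per second at the given hourly rate."""
    return billable_seconds / 3600 * hourly_rate_usd


# Figures quoted above: 645 s on an instance priced at 1.2575 $/h.
cost = training_cost_usd(645, 1.2575)
print(f"estimated cost: ${cost:.2f}")
```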

4. Deploy and test the fine-tuned embedding model on Amazon SageMaker

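We will use the Hugging Face Embedding Container, a purpose-built inference container for easily deploying embedding models in a secure and managed environment. The DLC is powered by Text Embeddings Inference (TEI), a blazing fast and memory-efficient solution for deploying and serving embedding models.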

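To retrieve the new Hugging Face Embedding Container in Amazon SageMaker, we can use the `get_huggingface_llm_image_uri` method provided by the SageMaker SDK. This method lets us retrieve the URI for the desired Hugging Face Embedding Container. Note that TEI comes in two different versions, one for CPU and one for GPU, so we create a helper function that retrieves the correct image URI based on the instance type.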

from sagemaker.huggingface import get_huggingface_llm_image_uri


# retrieve the image uri based on instance type
def get_image_uri(instance_type):
    key = (
        "huggingface-tei"
        if instance_type.startswith("ml.g") or instance_type.startswith("ml.p")
        else "huggingface-tei-cpu"
    )
    return get_huggingface_llm_image_uri(key, version="1.4.0")

We can now create a HuggingFaceModel using the container URI and the S3 path to our model. We also need to set our TEI configuration.

from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.c6i.2xlarge"

# create HuggingFaceModel with the image uri
emb_model = HuggingFaceModel(
    role=role,
    image_uri=get_image_uri(instance_type),
    model_data=huggingface_estimator.model_data,
    env={"HF_MODEL_ID": "/opt/ml/model"},  # Path to the model in the container
)

After creating the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model on the ml.c6i.2xlarge instance type.

# Deploy model to an endpoint
emb = emb_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
)

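SageMaker will now create our endpoint and deploy the model to it. This can take about 5 minutes. Once the endpoint is deployed, we can run inference on it using the predict method of the predictor.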

data = {
    "inputs": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
}

res = emb.predict(data=data)


# print some results
print(f"length of embeddings: {len(res[0])}")
print(f"first 10 elements of embeddings: {res[0][:10]}")

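We trained our model with Matryoshka loss, which means the semantic meaning is frontloaded into the first dimensions of the embedding. To use a different Matryoshka dimension, we need to truncate our embeddings manually. Below is an example of how to truncate the embeddings to 256 dimensions, one third of the original size. If we check our training logs, we can see that the NDCG metric is 0.823 for 768 dimensions and 0.818 for 256, which means we preserve more than 99% of the retrieval quality.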

data = {
    "inputs": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .",
}

res = emb.predict(data=data)

# truncate embeddings to matryoshka dimensions
dim = 256
res = res[0][0:dim]

# print some results
print(f"length of embeddings: {len(res)}")
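A truncated vector is generally no longer unit length, which matters if downstream similarity search uses a dot product rather than cosine similarity. One option is to rescale after truncation, as in the sketch below. The helper name `truncate_and_normalize` and the synthetic stand-in vector are assumptions for illustration; whether rescaling is needed depends on how the vector store compares embeddings.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale them to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero vector
    return [x / norm for x in head]


# Stand-in for one 768-dimensional vector returned by the endpoint.
full = [math.sin(i + 1) for i in range(768)]
small = truncate_and_normalize(full, 256)

print(len(small))
print(round(math.sqrt(sum(x * x for x in small)), 6))
```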

Awesome! 🚀 We can now generate embeddings and integrate the endpoint into our RAG application.
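In a RAG application, the retrieval step then comes down to comparing a query embedding with the embeddings of candidate passages. The following is a minimal sketch of that comparison using toy three-dimensional vectors in place of real endpoint responses; `rank_documents` is a hypothetical helper, and a production system would typically delegate this to a vector database.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_documents(query_embedding, doc_embeddings):
    """Return document indices sorted from most to least similar."""
    scores = [cosine(query_embedding, d) for d in doc_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional vectors standing in for endpoint responses.
query = [0.9, 0.1, 0.0]
docs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
print(rank_documents(query, docs))
```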

To clean up, we can delete the model and the endpoint.

emb.delete_model()
emb.delete_endpoint()

📍 The complete example is available on GitHub.
