Deploy models to Amazon SageMaker

Deploying a 🤗 Transformers model in SageMaker for inference is as easy as:

from sagemaker.huggingface import HuggingFaceModel

# create Hugging Face Model Class and deploy it as SageMaker endpoint
huggingface_model = HuggingFaceModel(...).deploy()

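This guide shows you how to deploy models with zero code using the Inference Toolkit. The Inference Toolkit builds on top of the pipeline feature from 🤗 Transformers. Learn how to:

Install and set up the Inference Toolkit.
Deploy a 🤗 Transformers model trained in SageMaker.
Deploy a 🤗 Transformers model from the Hugging Face [model Hub](https://huggingface.co/models).
Run a batch transform job using 🤗 Transformers and Amazon SageMaker.
Create a custom inference module.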

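Installation and setup

Before deploying a 🤗 Transformers model to SageMaker, you need to sign up for an AWS account. If you don't have an AWS account yet, learn more here.

Once you have an AWS account, get started in one of the environments described below.

To start locally, you need to set up an appropriate IAM role.

Upgrade to the latest sagemaker version.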

pip install sagemaker --upgrade

SageMaker environment

Set up your SageMaker environment as shown below:

import sagemaker
sess = sagemaker.Session()
role = sagemaker.get_execution_role()

Note: The execution role is only available when running a notebook within SageMaker. If you call get_execution_role in a notebook that is not running on SageMaker, expect a region error.

Local environment

Set up your local environment as shown below:

import sagemaker
import boto3

iam_client = boto3.client('iam')
role = iam_client.get_role(RoleName='role-name-of-your-iam-role-with-right-permissions')['Role']['Arn']
sess = sagemaker.Session()

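Deploy a 🤗 Transformers model trained in SageMaker

There are two ways to deploy a Hugging Face model that you trained in SageMaker:

Deploy it right after training finishes.
Deploy your saved model later from S3 with model_data.

📓 Open the deploy_transformer_model_from_s3.ipynb notebook for an example of how to deploy a model from S3 to SageMaker for inference.

Deploy after training

To deploy your model directly after training, make sure your training script saves all required files, including the tokenizer and the model.

If you use the Hugging Face Trainer, you can pass your tokenizer as an argument to the Trainer. It is saved automatically when you call trainer.save_model().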

from sagemaker.huggingface import HuggingFace

############ pseudo code start ############

# create Hugging Face Estimator for training
huggingface_estimator = HuggingFace(....)

# start the train job with our uploaded datasets as input
huggingface_estimator.fit(...)

############ pseudo code end ############

# deploy model to SageMaker Inference
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

# example request: you always need to define "inputs"
data = {
   "inputs": "Camera - You are awarded a SiPix Digital Camera! call 09061221066 fromm landline. Delivery within 28 days."
}

# request
predictor.predict(data)

After you run your request, you can delete the endpoint as shown:

# delete endpoint
predictor.delete_endpoint()

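Deploy with model_data

If you have already trained your model and want to deploy it later, use the model_data argument to specify the location of your tokenizer and model weights.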

from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data="s3://models/my-bert-model/model.tar.gz",  # path to your trained SageMaker model
   role=role,                                            # IAM role with permissions to create an endpoint
   transformers_version="4.26",                           # Transformers version used
   pytorch_version="1.13",                                # PyTorch version used
   py_version='py39',                                    # Python version used
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.m5.xlarge"
)

# example request: you always need to define "inputs"
data = {
   "inputs": "Camera - You are awarded a SiPix Digital Camera! call 09061221066 fromm landline. Delivery within 28 days."
}

# request
predictor.predict(data)

After you run your request, you can delete the endpoint again with:

# delete endpoint
predictor.delete_endpoint()

Create a model artifact for deployment

For later deployment, you can create a model.tar.gz file that contains all the required files, such as:

pytorch_model.bin
tf_model.h5
tokenizer.json
tokenizer_config.json

For example, your archive should look like this:

model.tar.gz/
|- pytorch_model.bin
|- vocab.txt
|- tokenizer_config.json
|- config.json
|- special_tokens_map.json
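
As a minimal sketch, an archive with this layout can also be built with Python's standard tarfile module. The helper name build_model_archive and the empty placeholder files are hypothetical and only illustrate the structure; the point is that the files are stored at the archive root rather than under a parent folder.

```python
import tarfile
import tempfile
from pathlib import Path


def build_model_archive(model_dir, archive_path):
    """Pack every file in model_dir into a gzip-compressed tar archive.

    Files are stored relative to model_dir, so they land at the archive
    root (config.json, not my-model/config.json).
    """
    model_dir = Path(model_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(model_dir.rglob("*")):
            if path.is_file():
                tar.add(path, arcname=path.relative_to(model_dir).as_posix())
    return archive_path


# Demo with empty placeholder files standing in for real model artifacts.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "my-model"
    model_dir.mkdir()
    for name in ["pytorch_model.bin", "vocab.txt", "tokenizer_config.json",
                 "config.json", "special_tokens_map.json"]:
        (model_dir / name).touch()

    archive = build_model_archive(model_dir, Path(tmp) / "model.tar.gz")
    with tarfile.open(archive) as tar:
        archive_members = sorted(tar.getnames())

print(archive_members)
```

Listing the archive members before uploading is a cheap way to confirm that nothing ended up nested one directory too deep.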

Create your own model.tar.gz from a model on the 🤗 Hub

Download a model:

git lfs install
git clone git@hf.co:{repository}

Create a tar file:

cd {repository}
tar zcvf model.tar.gz *

Upload model.tar.gz to S3:

aws s3 cp model.tar.gz s3://{my-s3-path}

Now you can provide the S3 URI to the model_data argument to deploy your model later.
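
Before passing the URI on, it can help to sanity-check it. The sketch below uses only the standard library; check_model_data_uri is a hypothetical helper, not part of the SageMaker SDK, and the example URI is the placeholder used elsewhere in this guide.

```python
from urllib.parse import urlparse


def check_model_data_uri(uri):
    """Split an S3 URI into (bucket, key) and check it names a .tar.gz file."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri}")
    key = parsed.path.lstrip("/")
    if not key.endswith(".tar.gz"):
        raise ValueError(f"expected a .tar.gz archive, got: {key!r}")
    return parsed.netloc, key


bucket, key = check_model_data_uri("s3://models/my-bert-model/model.tar.gz")
print(bucket, key)
```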

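Deploy a model from the 🤗 Hub

To deploy a model directly from the 🤗 Hub to SageMaker, define two environment variables when you create a HuggingFaceModel:

HF_MODEL_ID defines the model ID, which is automatically loaded from huggingface.co/models when you create a SageMaker endpoint. This environment variable gives you access to more than 10,000 models on the 🤗 Hub.
HF_TASK defines the task for the 🤗 Transformers pipeline. A complete list of tasks can be found here.

⚠️ Pipelines are not optimized for parallelism (multi-threading) and tend to consume a lot of RAM. For example, on a GPU-based instance, the pipeline runs on a single vCPU. When this vCPU becomes saturated with preprocessing of inference requests, it can create a bottleneck that prevents the GPU from being fully utilized for model inference. Learn more here.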

from sagemaker.huggingface.model import HuggingFaceModel

# Hub model configuration <https://huggingface.co/models>
hub = {
  'HF_MODEL_ID':'distilbert-base-uncased-distilled-squad', # model_id from hf.co/models
  'HF_TASK':'question-answering'                           # NLP task you want to use for predictions
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   env=hub,                                                # configuration for loading model from Hub
   role=role,                                              # IAM role with permissions to create an endpoint
   transformers_version="4.26",                             # Transformers version used
   pytorch_version="1.13",                                  # PyTorch version used
   py_version='py39',                                      # Python version used
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.m5.xlarge"
)

# example request: you always need to define "inputs"
data = {
    "inputs": {
        "question": "What is used for inference?",
        "context": "My Name is Philipp and I live in Nuremberg. This model is used with sagemaker for inference."
    }
}

# request
predictor.predict(data)

After you run your request, you can delete the endpoint again with:

# delete endpoint
predictor.delete_endpoint()

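📓 Open the deploy_transformer_model_from_hf_hub.ipynb notebook for an example of how to deploy a model from the 🤗 Hub to SageMaker for inference.

Run batch transform with 🤗 Transformers and SageMaker

After training a model, you can use SageMaker batch transform to perform inference with it. Batch transform accepts your inference data as an S3 URI; SageMaker then takes care of downloading the data, running the predictions, and uploading the results to S3. For more details about batch transform, take a look here.

⚠️ Because of the complex structure of textual data, the Hugging Face Inference DLC currently supports only the .jsonl format for batch transform.

Note: Make sure your inputs fit the max_length of the model during preprocessing.

If you trained a model using the Hugging Face Estimator, call the transformer() method to create a transform job for a model based on the training job (see here for more details):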

batch_job = huggingface_estimator.transformer(
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    strategy='SingleRecord')


batch_job.transform(
    data='s3://s3-uri-to-batch-data',
    content_type='application/json',    
    split_type='Line')

If you want to run your batch transform job later, or with a model from the 🤗 Hub, create a HuggingFaceModel instance and then call the transformer() method:

from sagemaker.huggingface.model import HuggingFaceModel

# Hub model configuration <https://huggingface.co/models>
hub = {
	'HF_MODEL_ID':'distilbert/distilbert-base-uncased-finetuned-sst-2-english',
	'HF_TASK':'text-classification'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   env=hub,                                                # configuration for loading model from Hub
   role=role,                                              # IAM role with permissions to create an endpoint
   transformers_version="4.26",                             # Transformers version used
   pytorch_version="1.13",                                  # PyTorch version used
   py_version='py39',                                      # Python version used
)

# create transformer to run a batch job
batch_job = huggingface_model.transformer(
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    strategy='SingleRecord'
)

# starts batch transform job and uses S3 data as input
batch_job.transform(
    data='s3://sagemaker-s3-demo-test/samples/input.jsonl',
    content_type='application/json',    
    split_type='Line'
)

The input.jsonl looks like this:

{"inputs":"this movie is terrible"}
{"inputs":"this movie is amazing"}
{"inputs":"SageMaker is pretty cool"}
{"inputs":"SageMaker is pretty cool"}
{"inputs":"this movie is terrible"}
{"inputs":"this movie is amazing"}
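
A minimal sketch of producing and checking such a file with the standard json module follows. The helper names write_jsonl and read_records are hypothetical; reading the file back line by line only approximates how split_type='Line' turns each line into one record.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(texts, path):
    """Write one {"inputs": ...} JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"inputs": text}) + "\n")


def read_records(path):
    """Parse the file line by line and check that every record has 'inputs'."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if "inputs" not in record:
                raise ValueError(f"line {line_number} has no 'inputs' key")
            records.append(record)
    return records


texts = ["this movie is terrible", "this movie is amazing", "SageMaker is pretty cool"]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "input.jsonl"
    write_jsonl(texts, path)
    records = read_records(path)

print(records[0])
```

Validating the file locally catches malformed lines before a batch job is started on a GPU instance.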

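📓 Open the sagemaker-notebook.ipynb notebook for an example of how to run a batch transform job for inference.

Deploy an LLM to SageMaker using TGI

If you are interested in a high-performance serving container for LLMs, you can use the Hugging Face TGI container. It uses the Text Generation Inference library. A list of compatible models can be found here.

First, make sure that a recent version of the SageMaker SDK is installed: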

pip install "sagemaker>=2.231.0"

Then, we import the SageMaker Python SDK and instantiate a sagemaker_session to find the current region and execution role.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import time

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

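Next we retrieve the LLM image URI. We use the helper function get_huggingface_llm_image_uri() to generate the appropriate image URI for Hugging Face Large Language Model (LLM) inference. The function takes a required parameter backend and several optional parameters. The backend specifies the type of backend to use for the model: "huggingface" refers to the Hugging Face TGI backend.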

image_uri = get_huggingface_llm_image_uri(
  backend="huggingface",
  region=region
)

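Now that we have the image URI, the next step is to configure the model object. We specify a unique name, the image_uri for the managed TGI container, and the execution role for the endpoint. Additionally, we specify a number of environment variables, including HF_MODEL_ID, which corresponds to the model from the Hugging Face Hub that will be deployed, and HF_TASK, which configures the inference task the model performs.

You should also define SM_NUM_GPUS, which specifies the tensor parallelism degree of the model. Tensor parallelism can be used to split the model across multiple GPUs, which is necessary when working with LLMs that are too big for a single GPU. To learn more about tensor parallelism with inference, see our previous blog post. Set SM_NUM_GPUS to the number of GPUs available on your selected instance type. For example, an ml.g4dn.12xlarge instance has 4 GPUs, so you would set SM_NUM_GPUS to 4 there; the code below deploys to ml.g5.2xlarge, which has 1 GPU, so it sets SM_NUM_GPUS to 1.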
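
One way to keep SM_NUM_GPUS in sync with the instance type is to derive it from a lookup table. This is only a sketch: GPUS_PER_INSTANCE and build_tgi_env are hypothetical names, the table is hand-maintained and covers just the two instance types mentioned in this guide, and the values are written as strings to match the example in this guide.

```python
# Hand-maintained lookup, for illustration only; check the AWS instance
# documentation for the instance type you actually use.
GPUS_PER_INSTANCE = {
    "ml.g5.2xlarge": 1,
    "ml.g4dn.12xlarge": 4,
}


def build_tgi_env(model_id, instance_type, hub_token=None):
    """Build an env dict with SM_NUM_GPUS matching the instance type."""
    if instance_type not in GPUS_PER_INSTANCE:
        raise KeyError(f"unknown GPU count for {instance_type}")
    env = {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": str(GPUS_PER_INSTANCE[instance_type]),
    }
    if hub_token is not None:
        env["HUGGING_FACE_HUB_TOKEN"] = hub_token
    return env


env = build_tgi_env("meta-llama/Llama-3.1-8B-Instruct", "ml.g5.2xlarge")
print(env)
```

Raising on an unknown instance type is deliberate: silently defaulting to one GPU would leave the remaining GPUs on a larger instance unused.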

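Note that you can optionally reduce the memory and computational footprint of the model by setting the HF_MODEL_QUANTIZE environment variable to true, but this lower weight precision could affect the quality of the output for some models.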

model_name = "llama-3-1-8b-instruct" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

hub = {
    'HF_MODEL_ID':'meta-llama/Llama-3.1-8B-Instruct',
    'SM_NUM_GPUS':'1',
    'HUGGING_FACE_HUB_TOKEN': '<REPLACE WITH YOUR TOKEN>',
}

assert hub['HUGGING_FACE_HUB_TOKEN'] != '<REPLACE WITH YOUR TOKEN>', "You have to provide a token."


model = HuggingFaceModel(
    name=model_name,
    env=hub,
    role=role,
    image_uri=image_uri
)

Next, we call the deploy method to deploy the model.

predictor = model.deploy(
  initial_instance_count=1,
  instance_type="ml.g5.2xlarge",
  endpoint_name=model_name
)

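Once the model is deployed, we can invoke it to generate text. We pass an input prompt and run the predict method to generate a text response from the LLM running in the TGI container.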

input_data = {
  "inputs": "The diamondback terrapin was the first reptile to",
  "parameters": {
    "do_sample": True,
    "max_new_tokens": 100,
    "temperature": 0.7,
    "watermark": True
  }
}

predictor.predict(input_data)

We receive the following auto-generated text response:

[{'generated_text': 'The diamondback terrapin was the first reptile to make the list, followed by the American alligator, the American crocodile, and the American box turtle. The polecat, a ferret-like animal, and the skunk rounded out the list, both having gained their slots because they have proven to be particularly dangerous to humans.\n\nCalifornians also seemed to appreciate the new list, judging by the comments left after the election.\n\n“This is fantastic,” one commenter declared.\n\n“California is a very'}]
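
The response above echoes the prompt at the start of generated_text. A small sketch for pulling out only the newly generated part, assuming the response has the list-of-dicts shape shown above (extract_completion is a hypothetical helper, and the response here is a shortened stand-in):

```python
def extract_completion(response, prompt):
    """Return the generated text with the echoed prompt removed, if present."""
    generated = response[0]["generated_text"]
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    return generated


prompt = "The diamondback terrapin was the first reptile to"
# Shortened stand-in for the value returned by predictor.predict(input_data).
response = [{"generated_text": prompt + " make the list, followed by the American alligator."}]

completion = extract_completion(response, prompt)
print(completion)
```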

Once we are done experimenting, we delete the endpoint and the model resources.

predictor.delete_model()
predictor.delete_endpoint()

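User-defined code and modules

The Hugging Face Inference Toolkit lets you override the default methods of the HuggingFaceHandlerService. You need to create a folder named code/ with an inference.py file in it. For more details on how to archive your model artifacts, see here. For example: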

model.tar.gz/
|- pytorch_model.bin
|- ....
|- code/
  |- inference.py
  |- requirements.txt

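The inference.py file contains your custom inference module, and the requirements.txt file contains additional dependencies that should be added. The custom module can override the following methods:

model_fn(model_dir) overrides the default method for loading a model. The return value model is used in predict for predictions. model_fn receives the argument model_dir, the path to your unzipped model.tar.gz.
transform_fn(model, data, content_type, accept_type) overrides the default transform function with your custom implementation. You need to implement your own preprocess, predict, and postprocess steps in transform_fn. This method cannot be combined with input_fn, predict_fn, or output_fn mentioned below.
input_fn(input_data, content_type) overrides the default method for preprocessing. The return value data is used in predict for predictions. The inputs are:
- input_data is the raw body of your request.
- content_type is the content type from the request header.
predict_fn(processed_data, model) overrides the default method for predictions. The return value predictions is used in postprocess. The input processed_data is the result from preprocess.
output_fn(prediction, accept) overrides the default method for postprocessing. The return value result is the response to your request (for example, JSON). The inputs are:
- prediction is the result from predict.
- accept is the return accept type from the HTTP request, for example application/json.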
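
To make the relationship between these hooks concrete, here is a conceptual sketch of how input_fn, predict_fn, and output_fn chain together for one request. It is not the toolkit's actual implementation: handle_request and toy_model are hypothetical, and the standard json module stands in for the toolkit's decoder and encoder.

```python
import json


def input_fn(input_data, content_type):
    """Decode the raw request body (only JSON in this sketch)."""
    if content_type != "application/json":
        raise ValueError(f"unsupported content type: {content_type}")
    return json.loads(input_data)


def predict_fn(processed_data, model):
    """Run the model on the decoded payload."""
    return model(processed_data["inputs"])


def output_fn(prediction, accept):
    """Encode the prediction for the response (only JSON in this sketch)."""
    if accept != "application/json":
        raise ValueError(f"unsupported accept type: {accept}")
    return json.dumps(prediction)


def handle_request(model, input_data, content_type, accept):
    """Chain the three steps; a transform_fn override replaces this whole chain."""
    processed_data = input_fn(input_data, content_type)
    prediction = predict_fn(processed_data, model)
    return output_fn(prediction, accept)


def toy_model(text):
    """Stand-in for the object returned by model_fn."""
    return [{"label": "POSITIVE" if "cool" in text else "NEGATIVE", "score": 1.0}]


body = json.dumps({"inputs": "SageMaker is pretty cool"})
response = handle_request(toy_model, body, "application/json", "application/json")
print(response)
```

Seen this way, it is clear why transform_fn cannot be combined with the other three: it takes over the entire decode, predict, encode path in one function.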

Here is an example of a custom inference module with model_fn, input_fn, predict_fn, and output_fn:

from sagemaker_huggingface_inference_toolkit import decoder_encoder

def model_fn(model_dir):
    # implement custom code to load the model
    loaded_model = ...
    
    return loaded_model 

def input_fn(input_data, content_type):
    # decode the input data  (e.g. JSON string -> dict)
    data = decoder_encoder.decode(input_data, content_type)
    return data

def predict_fn(data, model):
    # call your custom model with the data
    predictions = model(data, ...)
    return predictions

def output_fn(prediction, accept):
    # convert the model output to the desired output format (e.g. dict -> JSON string)
    response = decoder_encoder.encode(prediction, accept)
    return response

Customize your inference module with only model_fn and transform_fn:

from sagemaker_huggingface_inference_toolkit import decoder_encoder

def model_fn(model_dir):
    # implement custom code to load the model
    loaded_model = ...
    
    return loaded_model 

def transform_fn(model, input_data, content_type, accept):
    # decode the input data (e.g. JSON string -> dict)
    data = decoder_encoder.decode(input_data, content_type)

    # call your custom model with the data
    outputs = model(data, ...)

    # convert the model output to the desired output format (e.g. dict -> JSON string)
    response = decoder_encoder.encode(outputs, accept)

    return response
