Deploy models to Amazon SageMaker

Deploying a 🤗 Transformers model in SageMaker for inference is as easy as:

from sagemaker.huggingface import HuggingFaceModel

# create Hugging Face Model Class and deploy it as SageMaker endpoint
huggingface_model = HuggingFaceModel(...).deploy()

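This guide shows you how to deploy models with zero code using the Inference Toolkit. The Inference Toolkit builds on top of the pipeline feature from 🤗 Transformers. Learn how to:

Install and set up the Inference Toolkit.
Deploy a 🤗 Transformers model trained in SageMaker.
Deploy a 🤗 Transformers model from the Hugging Face [model Hub](https://huggingface.co/models).
Run a batch transform job using 🤗 Transformers and Amazon SageMaker.
Create a custom inference module.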

Installation and setup

Before you can deploy a 🤗 Transformers model to SageMaker, you need an AWS account. If you do not have an AWS account yet, you will need to create one first.

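Once you have an AWS account, get started from either a SageMaker environment or a local environment, as described below.

To work from a local environment, you need to set up an appropriate IAM role.

Upgrade to the latest sagemaker version: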

pip install sagemaker --upgrade

SageMaker environment

Set up your SageMaker environment as shown below:

import sagemaker
sess = sagemaker.Session()
role = sagemaker.get_execution_role()

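Note: The execution role is only available when you run a notebook within SageMaker. If you call get_execution_role in a notebook that is not running on SageMaker, expect a region error.

Local environment

Set up your local environment as shown below: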

import sagemaker
import boto3

iam_client = boto3.client('iam')
role = iam_client.get_role(RoleName='role-name-of-your-iam-role-with-right-permissions')['Role']['Arn']
sess = sagemaker.Session()

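Deploy a 🤗 Transformers model trained in SageMaker

There are two ways to deploy your Hugging Face model trained in SageMaker:

Deploy it right after your training finishes.
Deploy your saved model later from S3 using model_data.

📓 Open the deploy_transformer_model_from_s3.ipynb notebook for an example of how to deploy a model from S3 to SageMaker for inference.

Deploy after training

To deploy your model directly after training, make sure your training script saves all the required files, including the tokenizer and the model.

If you use the Hugging Face Trainer, you can pass your tokenizer as an argument to the Trainer. It is saved automatically when you call trainer.save_model().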

from sagemaker.huggingface import HuggingFace

############ pseudo code start ############

# create Hugging Face Estimator for training
huggingface_estimator = HuggingFace(...)

# start the train job with our uploaded datasets as input
huggingface_estimator.fit(...)

############ pseudo code end ############

# deploy model to SageMaker Inference
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.m5.xlarge")

# example request: you always need to define "inputs"
data = {
   "inputs": "Camera - You are awarded a SiPix Digital Camera! call 09061221066 from landline. Delivery within 28 days."
}

# request
predictor.predict(data)

After you run your request, you can delete the endpoint as shown:

# delete endpoint
predictor.delete_endpoint()

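Deploy with model_data

If you have already trained your model and want to deploy it at a later time, use the model_data argument to specify the location of your tokenizer and model weights.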

from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data="s3://models/my-bert-model/model.tar.gz",  # path to your trained SageMaker model
   role=role,                                            # IAM role with permissions to create an endpoint
   transformers_version="4.26",                           # Transformers version used
   pytorch_version="1.13",                                # PyTorch version used
   py_version='py39',                                    # Python version used
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.m5.xlarge"
)

# example request: you always need to define "inputs"
data = {
   "inputs": "Camera - You are awarded a SiPix Digital Camera! call 09061221066 from landline. Delivery within 28 days."
}

# request
predictor.predict(data)

After you run your request, you can delete the endpoint again with:

# delete endpoint
predictor.delete_endpoint()

Create a model artifact for deployment

For later deployment, you can create a model.tar.gz file that contains all the required files, such as:

pytorch_model.bin
tf_model.h5
tokenizer.json
tokenizer_config.json

For example, your file should look like this:

model.tar.gz/
|- pytorch_model.bin
|- vocab.txt
|- tokenizer_config.json
|- config.json
|- special_tokens_map.json
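
The archive layout above can also be produced from Python. The following is a minimal sketch that uses only the standard library tarfile module. The make_model_archive helper, the my-bert-model directory name, and the empty placeholder files are illustrative assumptions, not part of the SageMaker or Hugging Face tooling. Each file is added relative to the model directory, so the files sit at the root of the archive rather than inside a parent folder.

```python
import tarfile
import tempfile
from pathlib import Path


def make_model_archive(model_dir, archive_path):
    """Pack every file under model_dir at the root of a gzipped tar archive.

    Returns the member names of the resulting archive.
    """
    model_dir = Path(model_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(model_dir.rglob("*")):
            if path.is_file():
                # arcname is relative to model_dir, so no parent folder is added
                tar.add(path, arcname=path.relative_to(model_dir).as_posix())
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Demo with empty placeholder files standing in for real weights and tokenizer files.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "my-bert-model"  # hypothetical local model directory
    demo_dir.mkdir()
    for file_name in ["pytorch_model.bin", "vocab.txt", "tokenizer_config.json",
                      "config.json", "special_tokens_map.json"]:
        (demo_dir / file_name).touch()
    # The archive is written outside demo_dir so it does not include itself.
    members = make_model_archive(demo_dir, Path(tmp) / "model.tar.gz")

print(members)
```

Listing the members after writing is a cheap way to confirm that no file ended up nested under an extra top-level directory.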

Create your own model.tar.gz from a model on the 🤗 Hub

Download a model:

git lfs install
git clone git@hf.co:{repository}

Create a tar file:

cd {repository}
tar zcvf model.tar.gz *

Upload model.tar.gz to S3:

aws s3 cp model.tar.gz s3://{my-s3-path}

Now you can provide the S3 URI to the model_data argument to deploy your model later.

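Deploy a model from the 🤗 Hub

To deploy a model directly from the 🤗 Hub to SageMaker, define two environment variables when you create a HuggingFaceModel:

HF_MODEL_ID defines the model ID, which is automatically loaded from huggingface.co/models when you create a SageMaker endpoint. This environment variable gives you access to more than 10,000 models on the 🤗 Hub.
HF_TASK defines the task for the 🤗 Transformers pipeline. The complete list of tasks is in the 🤗 Transformers pipeline documentation.

⚠️ Pipelines are not optimized for parallelism (multi-threading) and tend to consume a lot of RAM. For example, on a GPU-based instance, the pipeline runs on a single vCPU. When this vCPU becomes saturated with preprocessing of inference requests, it can create a bottleneck that prevents the GPU from being fully used for model inference.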

from sagemaker.huggingface.model import HuggingFaceModel

# Hub model configuration <https://huggingface.co/models>
hub = {
  'HF_MODEL_ID':'distilbert-base-uncased-distilled-squad', # model_id from hf.co/models
  'HF_TASK':'question-answering'                           # NLP task you want to use for predictions
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   env=hub,                                                # configuration for loading model from Hub
   role=role,                                              # IAM role with permissions to create an endpoint
   transformers_version="4.26",                             # Transformers version used
   pytorch_version="1.13",                                  # PyTorch version used
   py_version='py39',                                      # Python version used
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.m5.xlarge"
)

# example request: you always need to define "inputs"
data = {
   "inputs": {
      "question": "What is used for inference?",
      "context": "My Name is Philipp and I live in Nuremberg. This model is used with sagemaker for inference."
   }
}

# request
predictor.predict(data)

After you run your request, you can delete the endpoint again with:

# delete endpoint
predictor.delete_endpoint()

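📓 Open the deploy_transformer_model_from_hf_hub.ipynb notebook for an example of how to deploy a model from the 🤗 Hub to SageMaker for inference.

Run batch transform with 🤗 Transformers and SageMaker

After training a model, you can use SageMaker batch transform to perform inference with it. Batch transform accepts your inference data as an S3 URI, and SageMaker then takes care of downloading the data, running the predictions, and uploading the results to S3. For more details, see the SageMaker batch transform documentation.

⚠️ Because of the complex structure of textual data, the Hugging Face Inference DLC currently supports only .jsonl for batch transform.

Note: Make sure your inputs fit the max_length of the model during preprocessing.

If you trained a model using the Hugging Face Estimator, call the transformer() method to create a transform job for a model based on the training job (see the SageMaker Python SDK documentation for details):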

batch_job = huggingface_estimator.transformer(
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    strategy='SingleRecord')


batch_job.transform(
    data='s3://s3-uri-to-batch-data',
    content_type='application/json',    
    split_type='Line')

If you want to run your batch transform job later or with a model from the 🤗 Hub, create a HuggingFaceModel instance and then call the transformer() method:

from sagemaker.huggingface.model import HuggingFaceModel

# Hub model configuration <https://huggingface.co/models>
hub = {
   'HF_MODEL_ID':'distilbert/distilbert-base-uncased-finetuned-sst-2-english',
   'HF_TASK':'text-classification'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   env=hub,                                                # configuration for loading model from Hub
   role=role,                                              # IAM role with permissions to create an endpoint
   transformers_version="4.26",                             # Transformers version used
   pytorch_version="1.13",                                  # PyTorch version used
   py_version='py39',                                      # Python version used
)

# create transformer to run a batch job
batch_job = huggingface_model.transformer(
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    strategy='SingleRecord'
)

# starts batch transform job and uses S3 data as input
batch_job.transform(
    data='s3://sagemaker-s3-demo-test/samples/input.jsonl',
    content_type='application/json',    
    split_type='Line'
)

The input.jsonl looks like this:

{"inputs":"this movie is terrible"}
{"inputs":"this movie is amazing"}
{"inputs":"SageMaker is pretty cool"}
{"inputs":"SageMaker is pretty cool"}
{"inputs":"this movie is terrible"}
{"inputs":"this movie is amazing"}
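
A file in this shape can be generated with the standard library. The sketch below is illustrative only: write_jsonl and read_jsonl are hypothetical helper names, and the sample texts are taken from the file shown above. Because json.dumps escapes embedded newlines, each record stays on exactly one line, which is what the Line split type relies on.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(texts, path):
    """Write one {"inputs": ...} JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"inputs": text}, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back as a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


texts = ["this movie is terrible", "this movie is amazing", "SageMaker is pretty cool"]

with tempfile.TemporaryDirectory() as tmp:
    jsonl_path = Path(tmp) / "input.jsonl"
    write_jsonl(texts, jsonl_path)
    records = read_jsonl(jsonl_path)

print(records)
```

The resulting file would still need to be uploaded to S3 before it can be passed as the data argument of transform().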

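📓 Open the sagemaker-notebook.ipynb notebook for an example of how to run a batch transform job for inference.

Deploy an LLM to SageMaker using TGI

If you are interested in using a high-performance serving container for LLMs, you can use the Hugging Face TGI container. This container uses the Text Generation Inference library. A list of compatible models is available in the Text Generation Inference documentation.

First, make sure that a recent version of the SageMaker SDK is installed: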

pip install "sagemaker>=2.231.0"

Then, we import the SageMaker Python SDK and instantiate a sagemaker_session to find the current region and execution role.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import time

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

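Next, we retrieve the LLM image URI. We use the helper function get_huggingface_llm_image_uri() to generate the appropriate image URI for Hugging Face Large Language Model (LLM) inference. The function takes a required parameter, backend, and several optional parameters. The backend specifies the type of backend to use for the model: "huggingface" refers to the Hugging Face TGI backend.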

image_uri = get_huggingface_llm_image_uri(
  backend="huggingface",
  region=region
)

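Now that we have the image URI, the next step is to configure the model object. We specify a unique name, the image_uri for the managed TGI container, and the execution role for the endpoint. We also specify a number of environment variables, including HF_MODEL_ID, which corresponds to the model from the Hugging Face Hub that will be deployed, and HF_TASK, which configures the inference task to be performed by the model. The example below sets HF_MODEL_ID, SM_NUM_GPUS, and a Hub access token.

You should also define SM_NUM_GPUS, which specifies the tensor parallelism degree of the model. Tensor parallelism can be used to split the model across multiple GPUs, which is necessary when working with LLMs that are too big for a single GPU. To learn more about tensor parallelism with inference, see our previous blog post. Here, you should set SM_NUM_GPUS to the number of GPUs available on your selected instance type. For example, this tutorial sets SM_NUM_GPUS to 1 because the selected instance type, ml.g5.2xlarge, has a single GPU. An instance type with four GPUs, such as ml.g4dn.12xlarge, would use 4 instead.

Note that you can optionally reduce the memory and computational footprint of the model by setting the HF_MODEL_QUANTIZE environment variable to true, but this lower weight precision could affect the quality of the output for some models.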
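
One way to keep SM_NUM_GPUS in sync with the chosen instance type is a small lookup. This is only a sketch: GPUS_PER_INSTANCE and build_llm_env are hypothetical names, the table lists just a few instance types, and you should confirm GPU counts against the SageMaker instance type documentation. The values are converted to strings because container environment variables are passed as strings.

```python
# GPU counts for a few instance types (an illustrative, incomplete table).
GPUS_PER_INSTANCE = {
    "ml.g5.2xlarge": 1,
    "ml.g5.12xlarge": 4,
    "ml.g4dn.12xlarge": 4,
}


def build_llm_env(model_id, instance_type, hub_token=None):
    """Return an env dict with SM_NUM_GPUS matching the instance type."""
    if instance_type not in GPUS_PER_INSTANCE:
        raise ValueError(f"Unknown GPU count for {instance_type}")
    env = {
        "HF_MODEL_ID": model_id,
        # Environment variable values are strings, not integers.
        "SM_NUM_GPUS": str(GPUS_PER_INSTANCE[instance_type]),
    }
    if hub_token:
        # Only needed for gated or private models.
        env["HUGGING_FACE_HUB_TOKEN"] = hub_token
    return env


env = build_llm_env("meta-llama/Llama-3.1-8B-Instruct", "ml.g5.2xlarge")
print(env)
```

The returned dict could then be passed as the env argument of HuggingFaceModel, with the same instance type reused in deploy().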

model_name = "llama-3-1-8b-instruct" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

hub = {
    'HF_MODEL_ID':'meta-llama/Llama-3.1-8B-Instruct',
    'SM_NUM_GPUS':'1',
    'HUGGING_FACE_HUB_TOKEN': '<REPLACE WITH YOUR TOKEN>',
}

assert hub['HUGGING_FACE_HUB_TOKEN'] != '<REPLACE WITH YOUR TOKEN>', "You have to provide a token."


model = HuggingFaceModel(
    name=model_name,
    env=hub,
    role=role,
    image_uri=image_uri
)
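
The unique name used above can be built a little more defensively. The sketch below assumes that SageMaker model and endpoint names are limited to 63 characters made of letters, digits, and hyphens; unique_name and MAX_NAME_LENGTH are hypothetical names introduced for illustration. It replaces unsupported characters, inserts a separator before the timestamp, and truncates the base so the result stays within the assumed limit.

```python
import re
import time

MAX_NAME_LENGTH = 63  # assumed SageMaker limit for model and endpoint names


def unique_name(base, now=None):
    """Build a timestamped name made of letters, digits, and hyphens."""
    if now is None:
        now = time.gmtime()
    suffix = time.strftime("%Y-%m-%d-%H-%M-%S", now)
    # Replace every run of unsupported characters (such as "/" or ".") with one hyphen.
    cleaned = re.sub(r"[^a-zA-Z0-9]+", "-", base).strip("-")
    # Leave room for the separator and the timestamp suffix.
    cleaned = cleaned[: MAX_NAME_LENGTH - len(suffix) - 1].rstrip("-")
    return f"{cleaned}-{suffix}" if cleaned else suffix


# A fixed timestamp (the Unix epoch) keeps this demo deterministic.
name = unique_name("meta-llama/Llama-3.1-8B-Instruct", time.gmtime(0))
print(name)
```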

Next, we call the deploy method to deploy the model.

predictor = model.deploy(
  initial_instance_count=1,
  instance_type="ml.g5.2xlarge",
  endpoint_name=model_name
)

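Once the model is deployed, we can invoke it to generate text. We pass an input prompt and run the predict method to generate a text response from the LLM running in the TGI container.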

input_data = {
  "inputs": "The diamondback terrapin was the first reptile to",
  "parameters": {
    "do_sample": True,
    "max_new_tokens": 100,
    "temperature": 0.7,
    "watermark": True
  }
}

predictor.predict(input_data)

We receive the following auto-generated text response:

[{'generated_text': 'The diamondback terrapin was the first reptile to make the list, followed by the American alligator, the American crocodile, and the American box turtle. The polecat, a ferret-like animal, and the skunk rounded out the list, both having gained their slots because they have proven to be particularly dangerous to humans.\n\nCalifornians also seemed to appreciate the new list, judging by the comments left after the election.\n\n“This is fantastic,” one commenter declared.\n\n“California is a very'}]
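
In the response above, generated_text begins with the original prompt. If you only want the continuation, a small helper can strip it. This is a sketch under that assumption: extract_completion is a hypothetical name, and the sample response is a shortened stand-in for a real endpoint reply.

```python
def extract_completion(response, prompt):
    """Return the generated text without the leading prompt, if present."""
    text = response[0]["generated_text"]
    if text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


prompt = "The diamondback terrapin was the first reptile to"
# Shortened stand-in for the list returned by predictor.predict(...)
sample_response = [
    {"generated_text": prompt + " make the list, followed by the American alligator."}
]

completion = extract_completion(sample_response, prompt)
print(completion)
```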

Once we are done experimenting, we delete the endpoint and the model resources.

predictor.delete_model()
predictor.delete_endpoint()

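User-defined code and modules

The Hugging Face Inference Toolkit allows users to override the default methods of the HuggingFaceHandlerService. You need to create a folder named code/ with an inference.py file in it. For details on how to archive your model artifacts, see the section on creating a model artifact for deployment above. For example: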

model.tar.gz/
|- pytorch_model.bin
|- ....
|- code/
  |- inference.py
  |- requirements.txt

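The inference.py file contains your custom inference module, and the requirements.txt file contains additional dependencies that should be installed. The custom module can override the following methods:

model_fn(model_dir) overrides the default method for loading a model. The return value model is used in predict for predictions. model_fn receives the argument model_dir, the path to your unzipped model.tar.gz.
transform_fn(model, data, content_type, accept_type) overrides the default transform function with your custom implementation. You need to implement your own preprocess, predict, and postprocess steps in transform_fn. This method cannot be combined with input_fn, predict_fn, or output_fn mentioned below.
input_fn(input_data, content_type) overrides the default method for preprocessing. The return value data is used in predict for predictions. The inputs are:
- input_data is the raw body of your request.
- content_type is the content type from the request header.
predict_fn(processed_data, model) overrides the default method for predictions. The return value predictions is used in postprocess. The input is processed_data, the result of preprocess.
output_fn(prediction, accept) overrides the default method for postprocessing. The return value result is the response to your request (for example, JSON). The inputs are:
- prediction is the result of predict.
- accept is the return accept type from the HTTP request, for example application/json.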
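
How these hooks relate to each other can be sketched in plain Python. The code below is not the toolkit's implementation: build_transform is a hypothetical helper, and the JSON-only defaults are assumptions standing in for the toolkit's own preprocessing and postprocessing. It illustrates the rule described above: a transform_fn replaces the whole chain, while input_fn, predict_fn, and output_fn each replace one step of it. Loading the model through model_fn is left out.

```python
import json


def build_transform(module_hooks):
    """Return the transform callable for a dict of user-defined hooks.

    module_hooks maps hook names (for example "input_fn") to callables.
    """
    split_hooks = {"input_fn", "predict_fn", "output_fn"}
    if "transform_fn" in module_hooks:
        if split_hooks.intersection(module_hooks):
            raise ValueError(
                "transform_fn cannot be combined with input_fn, predict_fn, or output_fn"
            )
        return module_hooks["transform_fn"]

    # Fall back to JSON-only defaults for any hook the user did not override.
    input_fn = module_hooks.get("input_fn", lambda body, content_type: json.loads(body))
    predict_fn = module_hooks.get("predict_fn", lambda data, model: model(data))
    output_fn = module_hooks.get("output_fn", lambda prediction, accept: json.dumps(prediction))

    def transform(model, input_data, content_type, accept):
        data = input_fn(input_data, content_type)
        prediction = predict_fn(data, model)
        return output_fn(prediction, accept)

    return transform


def toy_model(data):
    """Toy stand-in for a loaded model: reports the length of the input text."""
    return {"length": len(data["inputs"])}


transform = build_transform({"predict_fn": lambda data, model: model(data)})
result = transform(toy_model, '{"inputs": "hello"}', "application/json", "application/json")
print(result)
```

A harness like this can be handy for exercising an inference.py locally before packaging it into model.tar.gz.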

Here is an example of a custom inference module with model_fn, input_fn, predict_fn, and output_fn:

from sagemaker_huggingface_inference_toolkit import decoder_encoder

def model_fn(model_dir):
    # implement custom code to load the model
    loaded_model = ...
    
    return loaded_model 

def input_fn(input_data, content_type):
    # decode the input data  (e.g. JSON string -> dict)
    data = decoder_encoder.decode(input_data, content_type)
    return data

def predict_fn(data, model):
    # call your custom model with the data ("..." stands for any extra arguments)
    predictions = model(data, ...)
    return predictions

def output_fn(prediction, accept):
    # convert the model output to the desired output format (e.g. dict -> JSON string)
    response = decoder_encoder.encode(prediction, accept)
    return response

Customize your inference module with only model_fn and transform_fn:

from sagemaker_huggingface_inference_toolkit import decoder_encoder

def model_fn(model_dir):
    # implement custom code to load the model
    loaded_model = ...
    
    return loaded_model 

def transform_fn(model, input_data, content_type, accept):
    # decode the input data (e.g. JSON string -> dict)
    data = decoder_encoder.decode(input_data, content_type)

    # call your custom model with the data ("..." stands for any extra arguments)
    outputs = model(data, ...)

    # convert the model output to the desired output format (e.g. dict -> JSON string)
    response = decoder_encoder.encode(outputs, accept)

    return response
