Deploy Llama 3.3 70B on AWS Inferentia2

In this tutorial, you will learn how to deploy the meta-llama/Llama-3.3-70B-Instruct model on AWS Inferentia2 with Hugging Face Optimum on Amazon SageMaker. We will use the Hugging Face TGI Neuron container, a purpose-built inference container for easily deploying LLMs on AWS Inferentia2, powered by Text Generation Inference and Optimum Neuron.

We will cover how to:

1. Set up the development environment
2. Retrieve the latest Hugging Face TGI Neuron DLC
3. Deploy Llama 3.3 70B to Inferentia2
4. Clean up

Let's get started! 🚀

AWS Inferentia2 (Inf2) instances are EC2 instances purpose-built for deep learning (DL) inference workloads. The table below lists the instance sizes in the Inferentia2 family.

Instance size	Accelerators	Neuron cores	Accelerator memory (GB)	vCPUs	CPU memory (GB)	On-demand price (USD/hour)
inf2.xlarge	1	2	32	4	16	0.76
inf2.8xlarge	1	2	32	32	128	1.97
inf2.24xlarge	6	12	192	96	384	6.49
inf2.48xlarge	12	24	384	192	768	12.98

1. Set up the development environment

In this tutorial, we will use a Notebook instance in Amazon SageMaker with the Python 3 (ipykernel) kernel and the sagemaker Python SDK to deploy Llama 3.3 70B to a SageMaker inference endpoint.

Make sure you have the latest version of the SageMaker SDK installed.

!pip install sagemaker --upgrade --quiet

Then, instantiate the SageMaker role and session.

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Retrieve the latest Hugging Face TGI Neuron DLC

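The latest Hugging Face TGI Neuron DLC can be used to run inference on AWS Inferentia2. You can use the get_huggingface_llm_image_uri method of the sagemaker SDK to retrieve the appropriate Hugging Face TGI Neuron DLC URI based on your desired backend, session, region, and version. If the latest version has not yet been added to the SageMaker SDK, you can look up the container's latest version and reference its image URI directly.

At the time of writing, the latest version of the container had not yet been added to the SageMaker SDK, so we will not use get_huggingface_llm_image_uri and will build the image URI manually instead.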

# pulled from https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/image_uri_config/huggingface-llm-neuronx.json
account_id_dict = {
    "ap-northeast-1": "763104351884",
    "ap-south-1": "763104351884",
    "ap-south-2": "772153158452",
    "ap-southeast-1": "763104351884",
    "ap-southeast-2": "763104351884",
    "ap-southeast-4": "457447274322",
    "ap-southeast-5": "550225433462",
    "ap-southeast-7": "590183813437",
    "cn-north-1": "727897471807",
    "cn-northwest-1": "727897471807",
    "eu-central-1": "763104351884",
    "eu-central-2": "380420809688",
    "eu-south-2": "503227376785",
    "eu-west-1": "763104351884",
    "eu-west-3": "763104351884",
    "il-central-1": "780543022126",
    "mx-central-1": "637423239942",
    "sa-east-1": "763104351884",
    "us-east-1": "763104351884",
    "us-east-2": "763104351884",
    "us-gov-east-1": "446045086412",
    "us-gov-west-1": "442386744353",
    "us-west-2": "763104351884",
    "ca-west-1": "204538143572",
}

region = boto3.Session().region_name
llm_image = f"{account_id_dict[region]}.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.28-neuronx-py310-ubuntu22.04"
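
The lookup above raises a bare KeyError when the notebook runs in a region that is missing from the dictionary, or when no region is configured. As a minimal sketch, the same URI construction can be wrapped in a helper that fails with a clearer message. The function name `build_tgi_neuron_image_uri` and the two-entry example dictionary are illustrative assumptions, not part of the SageMaker SDK. The `.com.cn` suffix for China regions is an addition that the original one-liner does not include.

```python
# Hypothetical helper (not part of the SageMaker SDK): builds the ECR image URI
# with explicit error messages instead of a bare KeyError.
IMAGE_REPOSITORY = "huggingface-pytorch-tgi-inference"
IMAGE_TAG = "2.1.2-optimum0.0.28-neuronx-py310-ubuntu22.04"

# Two entries copied from the tutorial's dictionary, for illustration only.
EXAMPLE_ACCOUNT_IDS = {
    "us-east-1": "763104351884",
    "cn-north-1": "727897471807",
}


def build_tgi_neuron_image_uri(region, account_ids, tag=IMAGE_TAG):
    """Return the ECR URI of the TGI Neuron image for the given region."""
    if not region:
        raise ValueError("No AWS region is configured; pass one explicitly.")
    if region not in account_ids:
        supported = ", ".join(sorted(account_ids))
        raise ValueError(
            f"No image account registered for region {region!r}. "
            f"Known regions: {supported}"
        )
    # ECR endpoints in the China regions use the amazonaws.com.cn domain.
    domain = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"{account_ids[region]}.dkr.ecr.{region}.{domain}/{IMAGE_REPOSITORY}:{tag}"


print(build_tgi_neuron_image_uri("us-east-1", EXAMPLE_ACCOUNT_IDS))
```

In the notebook, the full `account_id_dict` defined above would be passed instead of the two-entry example.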

3. Deploy Llama 3.3 70B to Inferentia2

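At the time of writing, AWS Inferentia2 does not support dynamic shapes for inference, which means the sequence length and batch size must be specified ahead of time. To make it easier for customers to use the full power of Inferentia2, we created a Neuron model cache that contains pre-compiled configurations for the most popular LLMs, including Llama 3.3 70B.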

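This means we do not need to compile the model ourselves and can use the pre-compiled model from the cache instead. You can find the compiled/cached configurations on the Hugging Face Hub. If your desired configuration is not yet cached, you can compile it yourself using the Optimum CLI or open a request in the cache repository.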
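
To make the static-shape constraint concrete, here is a purely conceptual sketch of what a fixed batch size and sequence length imply: every request batch has to be padded to the compiled shape, and anything larger is rejected. This is an illustration in plain Python, not how TGI or the Neuron runtime is implemented. The function name and the toy sizes are assumptions.

```python
def pad_to_static_shape(batch, batch_size, sequence_length, pad_id=0):
    """Pad a list of token-id lists to exactly (batch_size, sequence_length)."""
    if len(batch) > batch_size:
        raise ValueError(f"Got {len(batch)} sequences, compiled batch size is {batch_size}")
    padded = []
    for tokens in batch:
        if len(tokens) > sequence_length:
            raise ValueError(
                f"Sequence of {len(tokens)} tokens exceeds compiled length {sequence_length}"
            )
        # Right-pad each sequence up to the compiled sequence length.
        padded.append(list(tokens) + [pad_id] * (sequence_length - len(tokens)))
    # Fill unused batch slots with all-padding rows.
    while len(padded) < batch_size:
        padded.append([pad_id] * sequence_length)
    return padded


# Toy sizes for readability; the tutorial uses batch size 4 and sequence length 4096.
static_batch = pad_to_static_shape([[11, 12, 13], [21, 22]], batch_size=4, sequence_length=8)
print(len(static_batch), len(static_batch[0]))
```

Because the shape is fixed at compile time, a different batch size or sequence length requires a different compiled artifact, which is why the cache stores one entry per configuration.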

Deploying Llama 3.3 70B to a SageMaker endpoint

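Before deploying the model to Amazon SageMaker, we must define the TGI Neuron endpoint configuration. We need to make sure the following additional parameters are defined: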

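HF_NUM_CORES: number of Neuron cores used for compilation.
HF_BATCH_SIZE: batch size used to compile the model.
HF_SEQUENCE_LENGTH: sequence length used to compile the model.
HF_AUTO_CAST_TYPE: auto cast type used to compile the model.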

We still need to define the traditional TGI parameters:

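HF_MODEL_ID: the Hugging Face model ID.
HF_TOKEN: the Hugging Face API token used to access gated models.
MAX_BATCH_SIZE: maximum batch size the model can process, equal to the batch size used for compilation.
MAX_INPUT_TOKENS: maximum input length the model can process.
MAX_TOTAL_TOKENS: maximum total number of tokens (input plus generated) the model can handle, equal to the sequence length used for compilation.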

Optionally, you can configure the endpoint to support chat templates:

MESSAGES_API_ENABLED: enables the Messages API.
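
The relationships between these parameters can be checked before a long deployment is started. The following is a minimal sketch of such a check. The function `check_tgi_neuron_config` is a hypothetical helper, not part of TGI or the SageMaker SDK, and the rules it encodes are the ones stated in the lists above plus two assumptions: that the input limit must be strictly smaller than the total token limit, and that environment values should be passed as strings.

```python
def check_tgi_neuron_config(env):
    """Return a list of human-readable problems found in the endpoint environment."""
    problems = []

    # Assumption: environment variable values should all be strings.
    for key, value in env.items():
        if not isinstance(value, str):
            problems.append(f"{key} should be a string, got {type(value).__name__}")

    max_input = int(env["MAX_INPUT_TOKENS"])
    max_total = int(env["MAX_TOTAL_TOKENS"])
    if max_input >= max_total:
        problems.append("MAX_INPUT_TOKENS must be smaller than MAX_TOTAL_TOKENS")

    # If the compile-time values are set explicitly, they should match the serving limits.
    if "HF_BATCH_SIZE" in env and int(env["HF_BATCH_SIZE"]) != int(env["MAX_BATCH_SIZE"]):
        problems.append("MAX_BATCH_SIZE must equal HF_BATCH_SIZE")
    if "HF_SEQUENCE_LENGTH" in env and int(env["HF_SEQUENCE_LENGTH"]) != max_total:
        problems.append("MAX_TOTAL_TOKENS must equal HF_SEQUENCE_LENGTH")
    return problems


example_env = {
    "HF_NUM_CORES": "24",
    "HF_BATCH_SIZE": "4",
    "HF_SEQUENCE_LENGTH": "4096",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "4",
    "MAX_INPUT_TOKENS": "4000",
    "MAX_TOTAL_TOKENS": "4096",
}
print(check_tgi_neuron_config(example_env))
```

An empty list means no inconsistency was found by these particular checks; it does not guarantee that a matching configuration exists in the cache.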

Choosing the right instance type

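Llama 3.3 70B is a large model and requires a lot of memory. We will use the inf2.48xlarge instance type, which has 192 vCPUs and 384 GB of accelerator memory. The inf2.48xlarge instance comes with 12 Inferentia2 accelerators that include 24 Neuron cores. The cached configurations for Llama 3.3 70B are listed on the Hugging Face Hub. In this example, we will use a batch size of 4 and a sequence length of 4096.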
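
A rough back-of-the-envelope calculation shows why the larger instance sizes are needed. The sketch below estimates the memory taken by the model weights alone. It assumes a round figure of 70 billion parameters, 2 bytes per parameter for bf16, and 1 GB = 10^9 bytes. The accelerator memory values are copied from the instance table above, and the helper names are illustrative.

```python
BYTES_PER_PARAM = {"bf16": 2, "fp16": 2, "fp32": 4}

# Accelerator memory in GB, copied from the Inferentia2 instance table.
ACCELERATOR_MEMORY_GB = {
    "inf2.xlarge": 32,
    "inf2.8xlarge": 32,
    "inf2.24xlarge": 192,
    "inf2.48xlarge": 384,
}


def weight_memory_gb(num_params, dtype="bf16"):
    """Memory needed for the weights only, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def instances_that_fit(required_gb):
    """Instance sizes whose total accelerator memory covers the requirement."""
    return [name for name, mem in ACCELERATOR_MEMORY_GB.items() if mem >= required_gb]


weights_gb = weight_memory_gb(70_000_000_000)  # assumed round parameter count
print(f"bf16 weights: about {weights_gb:.0f} GB")
print("Instances with enough accelerator memory for the weights:", instances_that_fit(weights_gb))
```

This is a lower bound: the KV cache for a batch of 4 sequences of 4096 tokens and the runtime buffers need additional memory. The configuration used in this tutorial is compiled for 24 Neuron cores, which in this family only the inf2.48xlarge provides.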

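Before deploying Llama 3.3 70B to Inferentia2, we need to make sure we have the necessary permissions to access the model. You can request access on the model's page on the Hugging Face Hub and create a user access token in your Hugging Face account settings.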

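After that, we can create the endpoint configuration and deploy the model to Amazon SageMaker. We will deploy the endpoint with the Messages API enabled so that it is compatible with the OpenAI Chat Completion API.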

from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.inf2.48xlarge"
health_check_timeout = 3600  # additional time to load the model
volume_size = 512  # size in GB of the EBS volume

# Define Model and Endpoint configuration parameters
config = {
    "HF_MODEL_ID": "meta-llama/Llama-3.3-70B-Instruct",
    "HF_NUM_CORES": "24",  # number of neuron cores
    "HF_AUTO_CAST_TYPE": "bf16",  # dtype of the model
    "MAX_BATCH_SIZE": "4",  # max batch size for the model
    "MAX_INPUT_TOKENS": "4000",  # max number of input tokens
    "MAX_TOTAL_TOKENS": "4096",  # max number of input plus generated tokens
    "MESSAGES_API_ENABLED": "true",  # Enable the messages API
    "HF_TOKEN": "<REPLACE WITH YOUR TOKEN>",
}

assert (
    config["HF_TOKEN"] != "<REPLACE WITH YOUR TOKEN>"
), "Please replace '<REPLACE WITH YOUR TOKEN>' with your Hugging Face Hub API token"


# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(role=role, image_uri=llm_image, env=config)

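After creating the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model on the ml.inf2.48xlarge instance type. TGI will automatically distribute and shard the model across all Inferentia devices.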

# deactivate warning since model is compiled
llm_model._is_compiled_model = True

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    volume_size=volume_size,
)

SageMaker will now create our endpoint and deploy the model to it. The deployment takes around 30 minutes.
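
By default, the deploy call blocks until the endpoint is ready. If the deployment is started without waiting, or is monitored from another process, the endpoint status can be polled instead. The following is a minimal sketch of such a loop built on the SageMaker `describe_endpoint` API. The function name, the polling interval, and the injectable `sleep` argument are illustrative assumptions.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll the endpoint status until it is InService, fails, or times out."""
    waited = 0
    while True:
        response = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = response.get("FailureReason", "unknown reason")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage in the notebook (requires AWS credentials, so it is left commented out):
# import boto3
# wait_for_endpoint(boto3.client("sagemaker"), llm.endpoint_name)
```

boto3 also ships a built-in `endpoint_in_service` waiter on the SageMaker client, which may be preferable when custom timeout handling is not needed.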

Once the endpoint is deployed, we can run inference on it using the predict method of the predictor.

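The endpoint supports the Messages API, which is compatible with the OpenAI Chat Completion API. The Messages API allows us to interact with the model in a conversational way. We can define the role and the content of each message. The role can be system, assistant, or user. The system role provides context to the model, and the user role is used to ask questions or provide input to the model.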

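Generation parameters are passed as top-level keys of the payload, alongside messages. See the chat completion documentation for the supported parameters.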

{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is deep learning?" }
  ]
}

# Prompt to generate
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is deep learning in one sentence?"},
]

# Generation arguments https://platform.openai.com/docs/api-reference/chat/create
parameters = {
    "max_tokens": 100,
}

Okay, let's test it.

chat = llm.predict({"messages": messages, **parameters})

print(chat["choices"][0]["message"]["content"].strip())
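
The same request can be sent from an application outside the notebook. As a minimal sketch, the helper below assembles and validates the payload before it is serialized. The function `build_chat_payload` and its role check are illustrative assumptions based on the three roles named above. The commented-out call shows how the payload could be sent with the boto3 `sagemaker-runtime` client, with the endpoint name left as a placeholder.

```python
import json

ALLOWED_ROLES = {"system", "assistant", "user"}


def build_chat_payload(messages, **generation_params):
    """Validate the messages and merge generation parameters at the top level."""
    for index, message in enumerate(messages):
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"Message {index} has unsupported role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError(f"Message {index} must have string content")
    return {"messages": list(messages), **generation_params}


payload = build_chat_payload(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is deep learning in one sentence?"},
    ],
    max_tokens=100,
)
body = json.dumps(payload)
print(body)

# Sending the request (requires AWS credentials and a deployed endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(
#     EndpointName="<your-endpoint-name>",
#     ContentType="application/json",
#     Body=body,
# )
# result = json.loads(response["Body"].read())
# print(result["choices"][0]["message"]["content"])
```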

4. Clean up

To clean up, we can delete the model and the endpoint.

llm.delete_model()
llm.delete_endpoint()
