Deploy SmolLM3 on Azure AI

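This example shows how to deploy SmolLM3 from the Hugging Face Collection on Azure AI Foundry Hub as an Azure ML Managed Online Endpoint, powered by Transformers with an OpenAI-compatible route. It also shows how to run inference with the OpenAI Python SDK across different scenarios and use cases.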


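TL;DR Transformers is the model-definition framework for state-of-the-art machine learning models in text, computer vision, audio, video, and multimodal domains, for both inference and training. Azure AI Foundry provides a unified platform for enterprise AI operations, model builders, and application development. Azure Machine Learning is a cloud service for accelerating and managing the machine learning (ML) project lifecycle.

This example specifically deploys HuggingFaceTB/SmolLM3-3B from the Hugging Face Hub (or see it on AzureML or on Azure AI Foundry) as an Azure ML Managed Online Endpoint on Azure AI Foundry Hub.

SmolLM3 is a 3B-parameter language model designed to push the limits of small models. It supports dual-mode reasoning, 6 languages, and long context. SmolLM3 is a fully open model that offers strong performance at the 3B-4B scale.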

Small LLM win-rate on benchmarks per model size

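The model is a decoder-only transformer using GQA and NoPE (with a 3:1 ratio). It was pretrained on 11.2T tokens with a staged curriculum of web, code, math, and reasoning data. Post-training included mid-training on 140B reasoning tokens, followed by supervised fine-tuning and alignment via Anchored Preference Optimization (APO).

Instruct model optimized for **hybrid reasoning**
**Fully open model**: open weights plus full training details, including the public data mixture and training configs
**Long context:** trained on 64k context, with support for up to **128k tokens** using YaRN extrapolation
**Multilingual**: 6 natively supported languages (English, French, Spanish, German, Italian, and Portuguese)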

SmolLM3 3B on the Hugging Face Hub

SmolLM3 3B on Azure AI Foundry

For more information, make sure to check our model card on the Hugging Face Hub.

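Pre-requisites

To run the following example, you need to meet the prerequisites below. Alternatively, you can read more about them in the Azure Machine Learning Tutorial: Create resources you need to get started.

An Azure account with an active subscription.
The Azure CLI installed and logged in.
The Azure Machine Learning extension for the Azure CLI.
An Azure Resource Group.
A project based on an Azure AI Foundry Hub.

For more information, follow the steps in Configure Microsoft Azure for Azure AI.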

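Setup and installation

In this example, the Azure Machine Learning SDK for Python is used to create the endpoint and the deployment, and to invoke the deployed API. You also need to install azure-identity to authenticate with your Azure credentials from Python.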

%pip install azure-ai-ml azure-identity --upgrade --quiet

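More information at Azure Machine Learning SDK for Python.

Then, for convenience, set the following environment variables, since the Azure ML client in this example reads them. Make sure to update the values to match your Microsoft Azure account and resources.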

%env LOCATION eastus
%env SUBSCRIPTION_ID <YOUR_SUBSCRIPTION_ID>
%env RESOURCE_GROUP <YOUR_RESOURCE_GROUP>
%env AI_FOUNDRY_HUB_PROJECT <YOUR_AI_FOUNDRY_HUB_PROJECT>
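
Because the values above are placeholders until you edit them, a quick check before creating the client can save a failed call later. The following is a minimal sketch, not part of the original example: the helper name `find_unset_vars` is hypothetical, and it only assumes the four variable names used in the `%env` lines above.

```python
import os

# Variable names taken from the %env lines above.
REQUIRED_VARS = ("LOCATION", "SUBSCRIPTION_ID", "RESOURCE_GROUP", "AI_FOUNDRY_HUB_PROJECT")


def find_unset_vars(environ, names=REQUIRED_VARS):
    """Return the names that are missing, empty, or still a <PLACEHOLDER> value."""
    unset = []
    for name in names:
        value = environ.get(name, "").strip()
        if not value or (value.startswith("<") and value.endswith(">")):
            unset.append(name)
    return unset


# In the notebook, find_unset_vars(os.environ) should return an empty list
# once every variable has been set to a real value.
```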

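Finally, you also need to define the endpoint and deployment names, as they are used throughout the example too.

Note that endpoint names must be globally unique per region: even if no endpoint with a given name is running under your subscription, you cannot use that name if another Azure customer has already reserved it. Adding a timestamp or a custom identifier is recommended to avoid an HTTP 400 validation error when deploying an endpoint whose name is already locked or reserved. Endpoint names must also be between 3 and 32 characters long.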

import os
from uuid import uuid4

os.environ["ENDPOINT_NAME"] = f"smollm3-endpoint-{str(uuid4())[:8]}"
os.environ["DEPLOYMENT_NAME"] = f"smollm3-deployment-{str(uuid4())[:8]}"
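
If you change the prefixes above, the 3 to 32 character limit is easy to exceed. As a rough sketch (the helper `make_endpoint_name` is hypothetical and checks only the length rule stated in this example, not any other naming rule Azure may enforce), the name can be generated and validated in one step:

```python
from uuid import uuid4


def make_endpoint_name(prefix, suffix_len=8):
    """Append a random hex suffix to `prefix` and enforce the 3-32 character limit."""
    name = f"{prefix}-{uuid4().hex[:suffix_len]}"
    if not 3 <= len(name) <= 32:
        raise ValueError(f"endpoint name {name!r} has {len(name)} characters; expected 3 to 32")
    return name


# Example: make_endpoint_name("smollm3-endpoint") yields a 25-character name.
```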

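Authenticate to Azure ML

First, you need to authenticate to the Azure AI Foundry Hub via Azure ML with the Azure ML Python SDK, which is then used to deploy HuggingFaceTB/SmolLM3-3B as an Azure ML Managed Online Endpoint in your Azure AI Foundry Hub.

In standard Azure ML deployments you create the MLClient with the Azure ML Workspace as the workspace_name, whereas for Azure AI Foundry you provide the Azure AI Foundry Hub name as the workspace_name instead, which deploys the endpoint under Azure AI Foundry too.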

import os
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id=os.getenv("SUBSCRIPTION_ID"),
    resource_group_name=os.getenv("RESOURCE_GROUP"),
    workspace_name=os.getenv("AI_FOUNDRY_HUB_PROJECT"),
)

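Create and deploy the Azure AI endpoint

Before creating the Managed Online Endpoint, you need to build the model URI, formatted as azureml://registries/HuggingFace/models/<MODEL_ID>/labels/latest, where MODEL_ID is not the Hugging Face Hub ID but the model's name on Azure, as follows: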

model_id = "HuggingFaceTB/SmolLM3-3B"

model_uri = (
    f"azureml://registries/HuggingFace/models/{model_id.replace('/', '-').replace('_', '-').lower()}/labels/latest"
)
model_uri

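To check whether a model from the Hugging Face Hub is available on Azure, read about it in Supported Models. If it is not, you can always request its addition to the Hugging Face Collection on Azure.

Then you need to create the ManagedOnlineEndpoint via the Azure ML Python SDK as follows.

Every model in the Hugging Face Collection is powered by an efficient inference backend, and each can run on a range of instance types (as listed in Supported Hardware). Since models and inference engines require GPU-accelerated instances, you may need to request a quota increase as described in Manage and increase quotas and limits for resources with Azure Machine Learning.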

from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment

endpoint = ManagedOnlineEndpoint(name=os.getenv("ENDPOINT_NAME"))

deployment = ManagedOnlineDeployment(
    name=os.getenv("DEPLOYMENT_NAME"),
    endpoint_name=os.getenv("ENDPOINT_NAME"),
    model=model_uri,
    instance_type="Standard_NC40ads_H100_v5",
    instance_count=1,
)

client.begin_create_or_update(endpoint).wait()

Azure AI Endpoint from Azure AI Foundry

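In Azure AI Foundry, the endpoint is only listed under the "My assets -> Models + endpoints" tab once the deployment is created, unlike in Azure ML, where the endpoint is shown even if it contains no active or in-progress deployments.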

client.online_deployments.begin_create_or_update(deployment).wait()

Azure AI Deployment from Azure AI Foundry

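Note that although the Azure AI endpoint creation is relatively fast, the deployment takes longer because it needs to allocate resources on Azure. Expect around 10-15 minutes, though it may take longer depending on instance provisioning and availability.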
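
The `.wait()` call above already blocks until the operation finishes. If you would rather check progress yourself with an explicit timeout, a generic polling loop is one option. This is only a sketch: `wait_for_state` is a hypothetical helper, and the state strings it looks for ("Succeeded", "Failed", "Canceled") are assumptions about what the deployment's `provisioning_state` reports.

```python
import time


def wait_for_state(get_state, done=("Succeeded",), failed=("Failed", "Canceled"),
                   interval_s=30.0, timeout_s=3600.0, sleep=time.sleep):
    """Poll `get_state()` until it returns a state in `done`; raise on failure or timeout."""
    waited = 0.0
    while True:
        state = get_state()
        if state in done:
            return state
        if state in failed:
            raise RuntimeError(f"deployment ended in state {state!r}")
        if waited >= timeout_s:
            raise TimeoutError(f"still {state!r} after {timeout_s} seconds")
        sleep(interval_s)
        waited += interval_s


# Hypothetical wiring with the client from this example (state names are assumed):
# wait_for_state(lambda: client.online_deployments.get(
#     name=os.getenv("DEPLOYMENT_NAME"), endpoint_name=os.getenv("ENDPOINT_NAME")
# ).provisioning_state)
```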

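Once deployed, you can use Azure AI Foundry or Azure ML Studio to inspect the endpoint details, the real-time logs, and how to consume the endpoint, and even try the monitoring feature (still in preview). Find more information at Azure ML Managed Online Endpoints.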

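Send requests to the Azure AI endpoint

Finally, now that the Azure AI endpoint is deployed, you can send requests to it. Since the model's task is `text-generation` (also known as `chat-completion`), you can use the OpenAI SDK with the OpenAI-compatible route and send requests to the scoring URI, i.e. `/v1/chat/completions`.

Note that only some of the options are listed below. You can send requests to the deployed endpoint in any way you like, as long as the HTTP request sets the `azureml-model-deployment` header (to the name of the Azure AI deployment, not the endpoint) and carries the authentication token or key required by the endpoint. You can then send HTTP requests to every route the backend engine exposes, not only the scoring route.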
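
To make the header requirement concrete, here is a minimal sketch of a raw HTTP request built with Python's standard library instead of an SDK. The function name `build_chat_request` is hypothetical; it assumes you pass in the endpoint's base URL ending in `/v1`, the endpoint key, and the deployment name, and that the key is sent as a `Bearer` token in the `Authorization` header.

```python
import json
import urllib.request


def build_chat_request(api_url, api_key, deployment_name, messages, max_tokens=128):
    """Build (but do not send) a POST to the OpenAI-compatible chat completions route."""
    payload = {
        "model": "HuggingFaceTB/SmolLM3-3B",
        "messages": messages,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{api_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: key-based auth is passed as a Bearer token.
            "Authorization": f"Bearer {api_key}",
            # Must be the deployment name, not the endpoint name.
            "azureml-model-deployment": deployment_name,
        },
    )


# To send the request once the endpoint is live:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```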

%pip install openai --upgrade --quiet

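To use the OpenAI Python SDK with Azure ML Managed Online Endpoints, you first need to retrieve:

api_url with the /v1 route (which contains the v1/chat/completions endpoint that the OpenAI Python SDK sends requests to)
api_key, which is the API key in Azure AI or the primary key in Azure ML (unless a dedicated Azure ML token is used instead)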

from urllib.parse import urlsplit

api_key = client.online_endpoints.get_keys(os.getenv("ENDPOINT_NAME")).primary_key

url_parts = urlsplit(client.online_endpoints.get(os.getenv("ENDPOINT_NAME")).scoring_uri)
api_url = f"{url_parts.scheme}://{url_parts.netloc}/v1"

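Alternatively, you can build the API URL manually as follows, since URIs are globally unique per region, meaning that only one endpoint with a given name exists within the same region.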

api_url = f"https://{os.getenv('ENDPOINT_NAME')}.{os.getenv('LOCATION')}.inference.ml.azure.com/v1"

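Or retrieve it directly from either Azure AI Foundry or Azure ML Studio.

Then you can use the OpenAI Python SDK as usual, making sure to include the extra header azureml-model-deployment with the Azure AI / ML deployment name.

With the OpenAI Python SDK, the header can be set either through the `extra_headers` parameter on each call to `chat.completions.create`, or through the `default_headers` parameter when instantiating the `OpenAI` client (the recommended approach, since the header is required on every request and only has to be set once).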

import os
from openai import OpenAI

openai_client = OpenAI(
    base_url=api_url,
    api_key=api_key,
    default_headers={"azureml-model-deployment": os.getenv("DEPLOYMENT_NAME")},
)

Chat completions

completion = openai_client.chat.completions.create(
    model="HuggingFaceTB/SmolLM3-3B",
    messages=[
        {
            "role": "system",
            "content": "You are an assistant that responds like a pirate.",
        },
        {
            "role": "user",
            "content": "Give me a brief explanation of gravity in simple terms.",
        },
    ],
    max_tokens=128,
)
print(completion)
# ChatCompletion(id='chatcmpl-74f6852e28', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content="<think>\nOkay, the user wants a simple explanation of gravity. Let me start by recalling what I know. Gravity is the force that pulls objects towards each other. But how to explain that simply?\n\nMaybe start with a common example, like how you fall when you jump. That's gravity pulling you down. But wait, I should mention that it's not just on Earth. The moon orbits the Earth because of gravity too. But how to make that easy to understand?\n\nI need to avoid technical terms. Maybe use metaphors. Like comparing gravity to a magnet, but not exactly. Or think of it as a stretchy rope pulling", refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1753178803, model='HuggingFaceTB/SmolLM3-3B', object='chat.completion', service_tier='default', system_fingerprint='1a28be5c-df18-4e97-822f-118bf57374c8', usage=CompletionUsage(completion_tokens=128, prompt_tokens=66, total_tokens=194, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0), reasoning_tokens=0))
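
In the response above, the message content opens with a `<think>` block and is cut off by `max_tokens=128` (`finish_reason='length'`) before the closing tag. If your application only wants the final answer, one way to separate the two is sketched below. The helper `split_reasoning` is hypothetical and assumes the `<think>...</think>` layout seen in the outputs of this example.

```python
def split_reasoning(content):
    """Split message content into (reasoning, answer) around a <think>...</think> block."""
    open_tag, close_tag = "<think>", "</think>"
    text = content.lstrip()
    if not text.startswith(open_tag):
        # No reasoning trace: the whole content is the answer.
        return "", content.strip()
    body = text[len(open_tag):]
    reasoning, closed, answer = body.partition(close_tag)
    # A missing closing tag means the output was truncated mid-reasoning,
    # so there is no answer yet.
    return reasoning.strip(), (answer.strip() if closed else "")


# Usage with the response above:
# reasoning, answer = split_reasoning(completion.choices[0].message.content)
```

An empty answer combined with `finish_reason='length'` is a hint to raise `max_tokens` or to disable thinking.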

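Extended thinking mode

By default, `SmolLM3-3B` has extended thinking enabled, which is why the example above produces output with a reasoning trace.

To enable or disable it, provide `/think` or `/no_think`, respectively, in the system prompt.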

completion = openai_client.chat.completions.create(
    model="HuggingFaceTB/SmolLM3-3B",
    messages=[
        {
            "role": "system",
            "content": "/no_think You are an assistant that responds like a pirate.",
        },
        {
            "role": "user",
            "content": "Give me a brief explanation of gravity in simple terms.",
        },
    ],
    max_tokens=128,
)
print(completion)
# ChatCompletion(id='chatcmpl-776e84a272', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content="Arr matey! Ye be askin' about gravity, the mighty force that keeps us swabbin' the decks and not floatin' off into the vast blue yonder. Gravity be the pull o' the Earth, a mighty force that keeps us grounded and keeps the stars in their place. It's like a giant invisible hand that pulls us towards the center o' the Earth, makin' sure we don't float off into space. It's what makes the apples fall from the tree and the moon orbit 'round the Earth. So, gravity be the force that keeps us all tied to this fine planet we call home.", refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1753178805, model='HuggingFaceTB/SmolLM3-3B', object='chat.completion', service_tier='default', system_fingerprint='d644cb1c-84d6-49ae-b790-ac6011851042', usage=CompletionUsage(completion_tokens=128, prompt_tokens=72, total_tokens=200, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0), reasoning_tokens=0))
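
If you toggle the mode per request, prepending the flag by hand gets repetitive. The following sketch wraps it in a small helper; `with_thinking` is a hypothetical name, and it assumes the flag is only recognized at the start of the system prompt, as in the example above. It does not strip a flag that is already present.

```python
def with_thinking(messages, enabled):
    """Return a copy of `messages` whose system prompt starts with /think or /no_think."""
    flag = "/think" if enabled else "/no_think"
    updated = [dict(m) for m in messages]  # copy so the caller's list is untouched
    for message in updated:
        if message["role"] == "system":
            message["content"] = f"{flag} {message['content']}"
            return updated
    # No system message yet: add one that only carries the flag.
    return [{"role": "system", "content": flag}] + updated


# Usage:
# messages = with_thinking(messages, enabled=False)
# completion = openai_client.chat.completions.create(model=..., messages=messages)
```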

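Multilingual capabilities

As mentioned earlier, `SmolLM3-3B` has been trained to natively support 6 languages: English, French, Spanish, German, Italian, and Portuguese. You can therefore take advantage of its multilingual capabilities by sending requests in any of those languages.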

completion = openai_client.chat.completions.create(
    model="HuggingFaceTB/SmolLM3-3B",
    messages=[
        {
            "role": "system",
            "content": "/no_think You are an expert translator.",
        },
        {
            "role": "user",
            "content": "Translate the following English sentence into both Spanish and German: 'The brown cat sat on the mat.'",
        },
    ],
    max_tokens=128,
)
print(completion)
# ChatCompletion(id='chatcmpl-da6188629f', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="The translation of the English sentence 'The brown cat sat on the mat.' into Spanish is: 'El gato marrón se sentó en el tapete.'\n\nThe translation of the English sentence 'The brown cat sat on the mat.' into German is: 'Der braune Katze saß auf dem Teppich.'", refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1753178807, model='HuggingFaceTB/SmolLM3-3B', object='chat.completion', service_tier='default', system_fingerprint='054f8a76-4e8c-4a2f-90eb-31f0e802916c', usage=CompletionUsage(completion_tokens=68, prompt_tokens=77, total_tokens=145, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0), reasoning_tokens=0))

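Agentic use cases and tool calling

SmolLM3-3B is capable of tool calling, meaning that you can provide one or more tools for the LLM to use.

To prevent the `tool_call` from being incomplete, you may need to unset `max_completion_tokens` (formerly `max_tokens`) or set it to a value large enough that the model does not stop generating because of the length limit before the `tool_call` is complete.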

completion = openai_client.chat.completions.create(
    model="HuggingFaceTB/SmolLM3-3B",
    messages=[{"role": "user", "content": "What is the weather like in New York?"}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather in a given location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "unit": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The unit of temperature",
                        },
                    },
                    "required": ["location"],
                },
            },
        }
    ],
    tool_choice="auto",
    max_completion_tokens=256,
)
print(completion)
# ChatCompletion(id='chatcmpl-c36090e6b5', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content='<think>I need to retrieve the current weather information for New York, so I\'ll use the get_weather function with the location set to \'New York\' and the unit set to \'fahrenheit\'.</think>\n<tool_call>{"name": "get_weather", "arguments": {"location": "New York", "unit": "fahrenheit"}}</tool_call>', refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call-5d5eb71a', function=Function(arguments='{"location": "New York", "unit": "fahrenheit"}', name='get_weather'), type='function')]))], created=1753178808, model='HuggingFaceTB/SmolLM3-3B', object='chat.completion', service_tier='default', system_fingerprint='5e58b305-773c-40b6-900b-fe5b177aeab9', usage=CompletionUsage(completion_tokens=68, prompt_tokens=442, total_tokens=510, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0), reasoning_tokens=0))
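
The response above only asks for a tool to be run; executing it is up to your code. Below is a minimal sketch of that step. The `get_weather` implementation is a hypothetical stand-in that returns canned data, and `run_tool_call` is an illustrative helper that wraps the result in the `tool` message format used by the OpenAI chat API.

```python
import json


def get_weather(location, unit="celsius"):
    """Hypothetical stand-in for a real weather lookup; returns canned data."""
    return {"location": location, "unit": unit, "temperature": 72 if unit == "fahrenheit" else 22}


# Maps the tool names declared in `tools=[...]` to local Python functions.
TOOLS = {"get_weather": get_weather}


def run_tool_call(call_id, name, arguments_json):
    """Run the named local function and wrap its result as a `tool` message."""
    result = TOOLS[name](**json.loads(arguments_json))
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}


# Usage with the response above:
# call = completion.choices[0].message.tool_calls[0]
# tool_message = run_tool_call(call.id, call.function.name, call.function.arguments)
```

The resulting message can then be appended to the conversation, after the assistant message that requested the call, and sent in a follow-up request so the model can phrase the final answer.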

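Release resources

Once you are done using the Azure AI endpoint / deployment, you can delete the resources as follows. This stops the charges for the instance the model runs on, along with all attached costs.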

client.online_endpoints.begin_delete(name=os.getenv("ENDPOINT_NAME")).result()

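Conclusion

In this example, you learned how to create and configure your Azure account for Azure ML and Azure AI Foundry, how to create a Managed Online Endpoint running an open model from the Hugging Face Collection in the Azure ML / Azure AI Foundry model catalog, how to send inference requests for a variety of use cases with the OpenAI SDK, and finally how to stop and release the resources.

If you have any doubts, issues, or questions about this example, feel free to open an issue and we will do our best to help!

📍 Find the complete example on GitHub here!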
