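Inference Endpoints (dedicated)

Have you ever wanted to create your own machine learning API? In this tutorial, we will do exactly that with HF Dedicated Inference Endpoints. Inference Endpoints let you pick any of the hundreds of thousands of models on the HF Hub and create your own API on a deployment platform you control and on hardware you choose.

Serverless Inference APIs are great for initial testing, but they are limited to a preconfigured selection of popular models and are rate limited, because the serverless API's hardware is shared by many users at the same time. With a Dedicated Inference Endpoint, you can customize the deployment of your model, and the hardware is dedicated to you.

In this tutorial, we will:

Create an Inference Endpoint through a simple UI and send standard HTTP requests to the endpoint
Create and manage different Inference Endpoints programmatically with the huggingface_hub library
Cover three use cases: text generation with an LLM, image generation with Stable Diffusion, and reasoning over images with Idefics2.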

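Installation and login

If you don't have an HF account, you can create one on the HF website. If you work in a larger team, you can also create an HF organization and manage all your models, datasets, and endpoints through it. Dedicated Inference Endpoints are a paid service, so you need to add a credit card to the billing settings of your personal HF account or of your HF organization.

You can then create a user access token. A token with read or write permissions works for this guide, but we encourage the use of fine-grained tokens for better security. For this notebook, you need a fine-grained token with User Permissions > Inference > Make calls to Inference Endpoints & Manage Inference Endpoints and Repository permissions > google/gemma-1.1-2b-it & HuggingFaceM4/idefics2-8b-chatty.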

!pip install huggingface_hub~=0.23.3
!pip install transformers~=4.41.2

# Login to the HF Hub. We recommend using this login method
# to avoid the need for explicitly storing your HF token in variables
import huggingface_hub

huggingface_hub.interpreter_login()

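Creating your first endpoint

With the initial setup done, we can now create our first endpoint. Navigate to https://ui.endpoints.huggingface.co/ and click + New next to "Dedicated Endpoints". You will then see the interface for creating a new endpoint, with the following options (see the image below):

Model Repository: Here you can insert the identifier of any model on the HF Hub. For this initial demonstration, we use google/gemma-1.1-2b-it, a small generative LLM (2.5B parameters).
Endpoint Name: The endpoint name is generated automatically from the model identifier, but you are free to change it. Valid endpoint names may only contain lower-case characters, numbers, or hyphens ("-") and must be between 4 and 32 characters long.
Instance Configuration: Here you can choose from a wide range of CPUs or GPUs from all major cloud platforms. You can also adjust the region, for example if you need to host your endpoint in the EU.
Automatic Scale-to-Zero: You can configure your endpoint to scale to zero GPUs/CPUs after a certain amount of time. Scaled-to-zero endpoints are no longer billed. Note that restarting the endpoint requires the model to be reloaded into memory (and potentially re-downloaded), which can take several minutes for large models.
Endpoint Security Level: The standard security level is Protected, which requires an authorized HF token to access the endpoint. Public endpoints are accessible by anyone without token authentication. Private endpoints are only available through an intra-region secured AWS or Azure PrivateLink connection.
Advanced configuration: Here you can select some advanced options, such as the Docker container type. As Gemma is compatible with Text Generation Inference (TGI) containers, the system automatically selects TGI as the container type along with other good default values.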
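
The naming rule for endpoints listed above can be checked locally before an endpoint is created. The following is a minimal sketch that encodes only the rule as stated in this guide (lower-case characters, numbers or hyphens, 4 to 32 characters); the helper name is hypothetical, and the service may enforce additional constraints that are not covered here.

```python
import re

# Rule quoted in this guide: lower-case letters, digits or hyphens, 4 to 32 characters.
_ENDPOINT_NAME_PATTERN = re.compile(r"[a-z0-9-]{4,32}")


def is_valid_endpoint_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rule described in this guide."""
    return _ENDPOINT_NAME_PATTERN.fullmatch(name) is not None


print(is_valid_endpoint_name("gemma-1-1-2b-it-001"))  # True
print(is_valid_endpoint_name("Gemma_1.1"))  # False: upper case, underscore and dot
```

Deriving a name from a model identifier such as google/gemma-1.1-2b-it therefore requires dropping the organization prefix and replacing dots with hyphens, as the auto-generated names in this guide do.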

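For this guide, select the options shown in the image below and click Create Endpoint.

After roughly one minute, your endpoint will be created and you will see a page similar to the image below.

On the endpoint's Overview page, you will find the URL for querying the endpoint, a Playground for testing the model, and additional tabs for Analytics, Usage & Cost, Logs, and Settings.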

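Creating and managing endpoints programmatically

When moving into production, you don't always want to start, stop, and modify your endpoints manually. The huggingface_hub library provides good functionality for managing your endpoints programmatically. See the huggingface_hub documentation on Inference Endpoints for a guide and for details on all functions. Here are some key functions: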

# list all your inference endpoints
huggingface_hub.list_inference_endpoints()

# get an existing endpoint and check its status
endpoint = huggingface_hub.get_inference_endpoint(
    name="gemma-1-1-2b-it-yci",  # the name of the endpoint
    namespace="MoritzLaurer",  # your user name or organization name
)
print(endpoint)

# Pause endpoint to stop billing
endpoint.pause()

# Resume and wait until the endpoint is ready
# endpoint.resume()
# endpoint.wait()

# Update the endpoint to a different GPU
# You can find the correct arguments for different hardware types in this table: https://huggingface.co/docs/inference-endpoints/pricing#gpu-instances
# endpoint.update(
#    instance_size="x1",
#    instance_type="nvidia-a100",  # nvidia-a10g
# )

You can also create an Inference Endpoint programmatically. Let's recreate the same gemma LLM endpoint that we created through the UI.

from huggingface_hub import create_inference_endpoint


model_id = "google/gemma-1.1-2b-it"
endpoint_name = "gemma-1-1-2b-it-001"  # Valid Endpoint names must only contain lower-case characters, numbers or hyphens ("-") and are between 4 to 32 characters long.
namespace = "MoritzLaurer"  # your user or organization name


# check if endpoint with this name already exists from previous tests
available_endpoints_names = [endpoint.name for endpoint in huggingface_hub.list_inference_endpoints()]
endpoint_exists = endpoint_name in available_endpoints_names
print("Does the endpoint already exist?", endpoint_exists)


# create new endpoint
if not endpoint_exists:
    endpoint = create_inference_endpoint(
        endpoint_name,
        repository=model_id,
        namespace=namespace,
        framework="pytorch",
        task="text-generation",
        # see the available hardware options here: https://huggingface.co/docs/inference-endpoints/pricing#pricing
        accelerator="gpu",
        vendor="aws",
        region="us-east-1",
        instance_size="x1",
        instance_type="nvidia-a10g",
        min_replica=0,
        max_replica=1,
        type="protected",
        # since the LLM is compatible with TGI, we specify that we want to use the latest TGI image
        custom_image={
            "health_route": "/health",
            "env": {"MODEL_ID": "/repository"},
            "url": "ghcr.io/huggingface/text-generation-inference:latest",
        },
    )
    print("Waiting for endpoint to be created")
    endpoint.wait()
    print("Endpoint ready")

# if endpoint with this name already exists, get and resume existing endpoint
else:
    endpoint = huggingface_hub.get_inference_endpoint(name=endpoint_name, namespace=namespace)
    if endpoint.status in ["paused", "scaledToZero"]:
        print("Resuming endpoint")
        endpoint.resume()
    print("Waiting for endpoint to start")
    endpoint.wait()
    print("Endpoint ready")

# access the endpoint url for API calls
print(endpoint.url)
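
The resume-and-wait branch above is repeated for every endpoint in this guide, so it could be factored into a small helper. The sketch below relies only on the status values, resume() and wait() as they are used above; ensure_endpoint_ready and FakeEndpoint are hypothetical names, and FakeEndpoint is a stand-in that exists only so the helper can be exercised without network access.

```python
def ensure_endpoint_ready(endpoint):
    """Resume the endpoint if it is paused or scaled to zero, then block until it is ready."""
    if endpoint.status in ("paused", "scaledToZero"):
        endpoint.resume()
    endpoint.wait()
    return endpoint


class FakeEndpoint:
    """Hypothetical stand-in that records calls instead of talking to the service."""

    def __init__(self, status):
        self.status = status
        self.calls = []

    def resume(self):
        self.calls.append("resume")

    def wait(self):
        self.calls.append("wait")
        self.status = "running"  # placeholder status value for this demo


demo = ensure_endpoint_ready(FakeEndpoint("paused"))
print(demo.calls)  # ['resume', 'wait']
```

With a real endpoint object returned by get_inference_endpoint, the same helper would be called as ensure_endpoint_ready(endpoint).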

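Querying your endpoint

Now let's query this endpoint like any other LLM API. First, copy the endpoint URL from the interface (or use endpoint.url) and assign it to API_URL below. We then use the standardized messages format for the text inputs, i.e. a list of dictionaries with user and assistant messages, which you might know from other LLM API services. Next, we need to apply the chat template to the messages. LLMs such as Gemma, Llama-3, etc. have been trained to expect this format (see details in the docs). Applying the chat template is crucial for most recent generative LLMs: without it, the model's output quality degrades silently, without any error being thrown.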

>>> import requests
>>> from transformers import AutoTokenizer

>>> # paste your endpoint URL here or reuse endpoint.url if you created the endpoint programmatically
>>> API_URL = endpoint.url  # or paste link like "https://dz07884a53qjqb98.us-east-1.aws.endpoints.huggingface.cloud"
>>> HEADERS = {"Authorization": f"Bearer {huggingface_hub.get_token()}"}


>>> # function for standard http requests
>>> def query(payload=None, api_url=None):
...     response = requests.post(api_url, headers=HEADERS, json=payload)
...     return response.json()


>>> # define conversation input in messages format
>>> # you can also provide multiple turns between user and assistant
>>> messages = [
...     {"role": "user", "content": "Please write a short poem about open source for me."},
...     # {"role": "assistant", "content": "I am not in the mood."},
...     # {"role": "user", "content": "Can you please do this for me?"},
... ]

>>> # apply the chat template for the respective model
>>> model_id = "google/gemma-1.1-2b-it"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> messages_with_template = tokenizer.apply_chat_template(messages, tokenize=False)
>>> print("Your text input looks like this, after the chat template has been applied:\n")
>>> print(messages_with_template)

Your text input looks like this, after the chat template has been applied:

<bos><start_of_turn>user
Please write a short poem about open source for me.<end_of_turn>

>>> # send standard http request to endpoint
>>> output = query(
...     payload={
...         "inputs": messages_with_template,
...         "parameters": {"temperature": 0.2, "max_new_tokens": 100, "seed": 42, "return_full_text": False},
...     },
...     api_url=API_URL,
... )

>>> print("The output from your API/Endpoint call:\n")
>>> print(output)

The output from your API/Endpoint call:

[{'generated_text': "Free to use, free to share,\nA collaborative code, a community's care.\n\nCode transparent, bugs readily found,\nContributions welcome, stories unbound.\nOpen source, a gift to all,\nBuilding the future, one line at a call.\n\nSo join the movement, embrace the light,\nOpen source, shining ever so bright."}]

That's it, you have made your first request to your endpoint (your own API!).
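
The response printed above is a list containing one dictionary with a generated_text key. A small helper can unwrap it and fail loudly when the response has a different shape. This is a minimal sketch: extract_generated_text is a hypothetical name, and the sketch makes no assumption about what an error response looks like beyond "not the shape shown above".

```python
def extract_generated_text(response_json):
    """Return the generated string from a response shaped like the output above."""
    if (
        isinstance(response_json, list)
        and response_json
        and isinstance(response_json[0], dict)
        and "generated_text" in response_json[0]
    ):
        return response_json[0]["generated_text"]
    # Anything else (for example a dict describing a failure) is surfaced to the caller.
    raise ValueError(f"Unexpected response from endpoint: {response_json!r}")


sample = [{"generated_text": "Free to use, free to share,"}]
print(extract_generated_text(sample))  # Free to use, free to share,
```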

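If you want the endpoint to handle the chat template automatically, and your LLM runs on a TGI container, you can also use the messages API by appending the /v1/chat/completions path to the URL. With the /v1/chat/completions path, the TGI container running on the endpoint applies the chat template automatically and is fully compatible with OpenAI's API structure for easier interoperability. See the TGI Swagger UI for all available parameters. Note that the parameters accepted by the default / path and by the /v1/chat/completions path are slightly different. Here is the slightly modified code for using the messages API: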

>>> API_URL_CHAT = API_URL + "/v1/chat/completions"

>>> output = query(
...     payload={
...         "messages": messages,
...         "model": "tgi",
...         # the OpenAI-compatible route expects generation parameters at the top level of the payload
...         "temperature": 0.2,
...         "max_tokens": 100,
...         "seed": 42,
...     },
...     api_url=API_URL_CHAT,
... )

>>> print("The output from your API/Endpoint call with the OpenAI-compatible messages API route:\n")
>>> print(output)

The output from your API/Endpoint call with the OpenAI-compatible messages API route:

{'id': '', 'object': 'text_completion', 'created': 1718283608, 'model': '/repository', 'system_fingerprint': '2.0.5-dev0-sha-90184df', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': '**Open Source**\n\nA license for the mind,\nTo share, distribute, and bind,\nIdeas freely given birth,\nFor the good of all to sort.\n\nCode transparent, eyes open wide,\nA permission for the wise,\nTo learn, to build, to use at will,\nA future bright, we help fill.\n\nFrom servers vast to candles low,\nOpen source, a guiding key,\nFor progress made, knowledge shared,\nA future brimming with'}, 'logprobs': None, 'finish_reason': 'length'}], 'usage': {'prompt_tokens': 20, 'completion_tokens': 100, 'total_tokens': 120}}
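
The chat completion response above nests the assistant text inside choices and reports token counts under usage. The sketch below pulls out the fields that are usually of interest; summarize_chat_completion is a hypothetical helper, and the sample dictionary is a shortened copy of the response printed above.

```python
def summarize_chat_completion(response_json):
    """Pull the assistant text, finish reason and token usage out of a chat completion dict."""
    choice = response_json["choices"][0]
    return {
        "content": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": response_json["usage"]["total_tokens"],
    }


# Shortened copy of the response printed above (content truncated for brevity).
sample = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "**Open Source**\n\nA license for the mind,"},
            "finish_reason": "length",
        }
    ],
    "usage": {"prompt_tokens": 20, "completion_tokens": 100, "total_tokens": 120},
}

summary = summarize_chat_completion(sample)
print(summary["finish_reason"])  # length
print(summary["total_tokens"])  # 120
```

A finish_reason of length indicates that generation stopped at the token limit, which is why the poem above ends mid-sentence.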

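Simplifying endpoint usage with the InferenceClient

You can also use the InferenceClient to easily send requests to your endpoint. The client is a convenient utility available in the huggingface_hub Python library that lets you easily call both Dedicated Inference Endpoints and the Serverless Inference API. See the docs for details.

This is the most succinct way of sending a request to your endpoint: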

from huggingface_hub import InferenceClient

client = InferenceClient()

output = client.chat_completion(
    messages,  # the chat template is applied automatically, if your endpoint uses a TGI container
    model=API_URL,
    temperature=0.2,
    max_tokens=100,
    seed=42,
)

print("The output from your API/Endpoint call with the InferenceClient:\n")
print(output)

# pause the endpoint to stop billing
# endpoint.pause()
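
As noted earlier in this guide, an endpoint that has scaled to zero must reload the model before it can serve traffic, which can take minutes. Requests sent during that window may fail or time out (the exact failure mode is an assumption here), so client code often wraps calls in a retry loop. The following is a generic sketch with exponential backoff; call_with_retries and flaky_request are hypothetical names, and the flaky function only simulates an endpoint that is not ready yet.

```python
import time


def call_with_retries(send, max_attempts=5, initial_delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call `send()` until it succeeds, waiting longer after each failure."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except Exception:  # in real code, narrow this to your HTTP client's error type
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(delay)
            delay *= backoff


attempts = []


def flaky_request():
    """Hypothetical request that fails twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("endpoint not ready")
    return "ok"


# sleep is replaced by a no-op so that the demo finishes immediately
print(call_with_retries(flaky_request, sleep=lambda seconds: None))  # ok
print(len(attempts))  # 3
```

In practice, send would be a zero-argument function that wraps the actual request, for example lambda: query(payload, API_URL).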

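Creating endpoints for a wide variety of models

Following the same process, you can create endpoints for any model on the HF Hub. Let's illustrate some other use cases.

Image generation with Stable Diffusion

We can create an image generation endpoint with almost exactly the same code as for the LLM. The only difference is that we do not use the TGI container in this case, as TGI is only designed for LLMs (and vision LMs).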

>>> !pip install Pillow  # for image processing

Collecting Pillow
  Downloading pillow-10.3.0-cp39-cp39-manylinux_2_28_x86_64.whl (4.5 MB)
Installing collected packages: Pillow
Successfully installed Pillow-10.3.0

>>> from huggingface_hub import create_inference_endpoint

>>> model_id = "stabilityai/stable-diffusion-xl-base-1.0"
>>> endpoint_name = "stable-diffusion-xl-base-1-0-001"  # Valid Endpoint names must only contain lower-case characters, numbers or hyphens ("-") and are between 4 to 32 characters long.
>>> namespace = "MoritzLaurer"  # your user or organization name
>>> task = "text-to-image"

>>> # check if endpoint with this name already exists from previous tests
>>> available_endpoints_names = [endpoint.name for endpoint in huggingface_hub.list_inference_endpoints()]
>>> endpoint_exists = endpoint_name in available_endpoints_names
>>> print("Does the endpoint already exist?", endpoint_exists)


>>> # create new endpoint
>>> if not endpoint_exists:
...     endpoint = create_inference_endpoint(
...         endpoint_name,
...         repository=model_id,
...         namespace=namespace,
...         framework="pytorch",
...         task=task,
...         # see the available hardware options here: https://huggingface.co/docs/inference-endpoints/pricing#pricing
...         accelerator="gpu",
...         vendor="aws",
...         region="us-east-1",
...         instance_size="x1",
...         instance_type="nvidia-a100",
...         min_replica=0,
...         max_replica=1,
...         type="protected",
...     )
...     print("Waiting for endpoint to be created")
...     endpoint.wait()
...     print("Endpoint ready")
... # if endpoint with this name already exists, get existing endpoint
... else:
...     endpoint = huggingface_hub.get_inference_endpoint(name=endpoint_name, namespace=namespace)
...     if endpoint.status in ["paused", "scaledToZero"]:
...         print("Resuming endpoint")
...         endpoint.resume()
...     print("Waiting for endpoint to start")
...     endpoint.wait()
...     print("Endpoint ready")

Does the endpoint already exist? True
Waiting for endpoint to start
Endpoint ready

>>> prompt = "A whimsical illustration of a fashionably dressed llama proudly holding a worn, vintage cookbook, with a warm cup of tea and a few freshly baked treats scattered around, set against a cozy background of rustic wood and blooming flowers."

>>> image = client.text_to_image(
...     prompt=prompt,
...     model=endpoint.url,  # "stabilityai/stable-diffusion-xl-base-1.0",
...     guidance_scale=8,
... )

>>> print("PROMPT: ", prompt)
>>> display(image.resize((image.width // 2, image.height // 2)))

PROMPT:  A whimsical illustration of a fashionably dressed llama proudly holding a worn, vintage cookbook, with a warm cup of tea and a few freshly baked treats scattered around, set against a cozy background of rustic wood and blooming flowers.

We pause the endpoint again to stop billing.

endpoint.pause()

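Vision Language Models: Reasoning over text and images

Now let's create an endpoint for a vision language model (VLM). VLMs are very similar to LLMs, except that they can take both text and images as input simultaneously. Their output is autoregressively generated text, just like for a standard LLM. VLMs can tackle many tasks, from visual question answering to document understanding. For this example, we use Idefics2, a powerful 8B parameter VLM.

We first need to convert the PIL image we generated with Stable Diffusion into a base64-encoded string so that we can send it to the model over the network.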

import base64
from io import BytesIO


def pil_image_to_base64(image):
    buffered = BytesIO()
    image.save(buffered, format="JPEG")
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return img_str


image_b64 = pil_image_to_base64(image)

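Because VLMs and LLMs are so similar, we can again use almost the same messages format and chat template, with only some additional code for including the image in the prompt. See the Idefics2 model card for specific details on prompt formatting.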

from transformers import AutoProcessor

# load the processor
model_id_vlm = "HuggingFaceM4/idefics2-8b-chatty"
processor = AutoProcessor.from_pretrained(model_id_vlm)

# define the user messages
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image"
            },  # the image is placed here in the prompt. You can add multiple images throughout the conversation.
            {"type": "text", "text": "Write a short limerick about this image."},
        ],
    },
]

# apply the chat template to the messages
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# the chat template places a special "<image>" token at the position where the image should go
# here we replace the "<image>" token with the base64 encoded image string in the prompt
# to be able to send the image via an API request
image_input = f"data:image/jpeg;base64,{image_b64}"
image_input = f"![]({image_input})"
prompt = prompt.replace("<image>", image_input)
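
The replacement step above handles a single image. If a conversation contains several images, each placeholder has to receive its own image, in order. The sketch below generalizes the same idea; embed_images_in_prompt is a hypothetical helper, and the demo prompt and base64 strings are made-up placeholders rather than real Idefics2 prompt text or image data.

```python
def embed_images_in_prompt(prompt, images_b64, mime_type="image/jpeg", placeholder="<image>"):
    """Replace each placeholder, in order, with a markdown image carrying a base64 data URI."""
    if prompt.count(placeholder) != len(images_b64):
        raise ValueError("Number of placeholders and images must match")
    for image_b64 in images_b64:
        data_uri = f"data:{mime_type};base64,{image_b64}"
        # count=1 replaces only the first remaining placeholder
        prompt = prompt.replace(placeholder, f"![]({data_uri})", 1)
    return prompt


demo_prompt = "User:<image>Describe it.<image>Compare them."
print(embed_images_in_prompt(demo_prompt, ["QUJD", "REVG"]))
# User:![](data:image/jpeg;base64,QUJD)Describe it.![](data:image/jpeg;base64,REVG)Compare them.
```

Because the base64 alphabet contains neither "<" nor ">", an inserted image string can never be mistaken for a remaining placeholder.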

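For VLMs, an image represents a certain number of tokens. For Idefics2, for example, one image represents 64 tokens at low resolution and 5*64=320 tokens at high resolution. High resolution is the default in TGI (see do_image_splitting in the model card for details). This means that one image consumes 320 tokens.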
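
The token figures above make it easy to estimate how much of the input budget is left for text. The sketch below uses the 64 and 5*64 values quoted above and the MAX_INPUT_LENGTH of 1024 that this guide sets in the endpoint configuration further down; the function names are hypothetical, and the estimate ignores any special tokens the chat template adds around an image.

```python
TOKENS_PER_IMAGE_LOW_RES = 64  # figure quoted in this guide for Idefics2
HIGH_RES_MULTIPLIER = 5  # high resolution costs 5 * 64 = 320 tokens per image


def image_token_cost(num_images, high_resolution=True):
    """Tokens consumed by the images alone."""
    multiplier = HIGH_RES_MULTIPLIER if high_resolution else 1
    return num_images * multiplier * TOKENS_PER_IMAGE_LOW_RES


def remaining_text_tokens(max_input_length, num_images, high_resolution=True):
    """Input tokens left for text once the images are accounted for."""
    return max_input_length - image_token_cost(num_images, high_resolution)


print(image_token_cost(1))  # 320
print(remaining_text_tokens(1024, 1))  # 704
print(remaining_text_tokens(1024, 3))  # 64
```

With an input limit of 1024 tokens, three high-resolution images already leave almost no room for text, which is worth keeping in mind when choosing the container limits.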

Several VLMs like Idefics2 are also supported by TGI (see the list of supported models), so we use a TGI container again when creating the endpoint.

>>> from huggingface_hub import create_inference_endpoint

>>> endpoint_name = "idefics2-8b-chatty-001"
>>> namespace = "MoritzLaurer"
>>> task = "text-generation"

>>> # check if endpoint with this name already exists from previous tests
>>> available_endpoints_names = [endpoint.name for endpoint in huggingface_hub.list_inference_endpoints()]
>>> endpoint_exists = endpoint_name in available_endpoints_names
>>> print("Does the endpoint already exist?", endpoint_exists)


>>> if endpoint_exists:
...     endpoint = huggingface_hub.get_inference_endpoint(name=endpoint_name, namespace=namespace)
...     if endpoint.status in ["paused", "scaledToZero"]:
...         print("Resuming endpoint")
...         endpoint.resume()
...     print("Waiting for endpoint to start")
...     endpoint.wait()
...     print("Endpoint ready")
... else:
...     endpoint = create_inference_endpoint(
...         endpoint_name,
...         repository=model_id_vlm,
...         namespace=namespace,
...         framework="pytorch",
...         task=task,
...         accelerator="gpu",
...         vendor="aws",
...         region="us-east-1",
...         type="protected",
...         instance_size="x1",
...         instance_type="nvidia-a100",
...         min_replica=0,
...         max_replica=1,
...         custom_image={
...             "health_route": "/health",
...             "env": {
...                 "MAX_BATCH_PREFILL_TOKENS": "2048",
...                 "MAX_INPUT_LENGTH": "1024",
...                 "MAX_TOTAL_TOKENS": "1536",
...                 "MODEL_ID": "/repository",
...             },
...             "url": "ghcr.io/huggingface/text-generation-inference:latest",
...         },
...     )

...     print("Waiting for endpoint to be created")
...     endpoint.wait()
...     print("Endpoint ready")

Does the endpoint already exist? False
Waiting for endpoint to be created
Endpoint ready

>>> # pass the endpoint URL so that the request goes to the dedicated endpoint created above
>>> output = client.text_generation(prompt, model=endpoint.url, max_new_tokens=200, seed=42)

>>> print(output)

In a quaint little café, there lived a llama,
With glasses on his face, he was quite a charm.
He'd sit at the table,
With a book and a mable,
And sip from a cup of warm tea.

endpoint.pause()

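Additional information

When creating several endpoints, you will probably get an error message that your GPU quota has been reached. Don't hesitate to send a message to the email address in the error message and we will most likely increase your GPU quota.
What is the difference between paused and scaled-to-zero endpoints? Scaled-to-zero endpoints can be flexibly woken up and scaled up by user requests, while paused endpoints need to be unpaused manually by the creator of the endpoint. Moreover, scaled-to-zero endpoints count towards your GPU quota (with the maximum number of replicas they could scale up to), while paused endpoints do not. A simple way of freeing up GPU quota is therefore to pause some endpoints.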
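
The quota rule above can be turned into a quick back-of-the-envelope calculation. The sketch below is illustrative only: the inventory is a hypothetical list of dictionaries (not an object returned by huggingface_hub), the gpus_per_replica field is an assumed way of describing the instance size, and counting running endpoints with their maximum replicas is an assumption that mirrors the rule stated for scaled-to-zero endpoints.

```python
def gpu_quota_in_use(endpoints):
    """Count the GPUs that occupy quota under the rule described above."""
    total = 0
    for endpoint in endpoints:
        if endpoint["status"] == "paused":
            continue  # paused endpoints do not count towards the quota
        # scaled-to-zero endpoints count with their maximum number of replicas;
        # treating running endpoints the same way is an assumption of this sketch
        total += endpoint["max_replica"] * endpoint["gpus_per_replica"]
    return total


# hypothetical inventory; the names mirror the endpoints created in this guide
inventory = [
    {"name": "gemma-1-1-2b-it-001", "status": "paused", "max_replica": 1, "gpus_per_replica": 1},
    {"name": "stable-diffusion-xl-base-1-0-001", "status": "scaledToZero", "max_replica": 1, "gpus_per_replica": 1},
    {"name": "idefics2-8b-chatty-001", "status": "running", "max_replica": 2, "gpus_per_replica": 1},
]

print(gpu_quota_in_use(inventory))  # 3
```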

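Conclusion and next steps

That's it, you have created three different endpoints (your own APIs!) for text-to-text, text-to-image, and image-to-text generation, and the same is possible for many other models and tasks.

We encourage you to read the Dedicated Inference Endpoint docs to learn more. If you are using generative LLMs and VLMs, we also recommend reading the TGI docs, as the most popular LLMs/VLMs are also supported by TGI, which makes your endpoint significantly more efficient.

You can, for example, use JSON mode or function calling with open-source models via TGI Guidance (see also the recipe on RAG with structured generation for an example).

When moving your endpoints into production, you will want to make several additional improvements to make your setup more efficient. When using TGI, you should send batches of requests to the endpoint with asynchronous function calls to fully utilize the endpoint's hardware, and you can adjust several container parameters to optimize latency and throughput for your use case. We will cover these optimizations in another recipe.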
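
To give a rough idea of what sending batches of requests with asynchronous function calls can look like, here is a minimal sketch using only the standard library. It is not the setup from the follow-up recipe: fake_endpoint_call is a hypothetical stand-in that simulates latency instead of contacting an endpoint, and run_batch and max_concurrency are illustrative names. In practice, the stand-in would be replaced by an async HTTP call, for example through the AsyncInferenceClient in huggingface_hub.

```python
import asyncio


async def fake_endpoint_call(prompt):
    """Hypothetical stand-in for an async HTTP request to the endpoint."""
    await asyncio.sleep(0.01)  # simulate network and generation latency
    return f"response to: {prompt}"


async def run_batch(prompts, send=fake_endpoint_call, max_concurrency=8):
    """Send all prompts concurrently, with at most `max_concurrency` in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_send(prompt):
        async with semaphore:
            return await send(prompt)

    # gather returns the results in the same order as the inputs
    return await asyncio.gather(*(bounded_send(p) for p in prompts))


results = asyncio.run(run_batch([f"prompt {i}" for i in range(20)]))
print(len(results))  # 20
print(results[0])  # response to: prompt 0
```

Inside a notebook, where an event loop is already running, use await run_batch(...) instead of asyncio.run(...). The concurrency limit keeps the number of in-flight requests bounded so that the endpoint is kept busy without being flooded.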


Open-Source AI Cookbook