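Run Inference on servers

Inference is the process of using a trained model to make predictions on new data. Because this process can be compute-intensive, running it on a dedicated or external service can be an attractive option. The `huggingface_hub` library provides a unified interface to run inference across multiple services for models hosted on the Hugging Face Hub:

Inference Providers: streamlined, unified access to hundreds of machine learning models, powered by our serverless inference partners. This new approach builds on our previous Serverless Inference API, offering more models, improved performance, and greater reliability thanks to world-class providers. Refer to the documentation for a list of supported providers.
Inference Endpoints: a product to easily deploy models to production. Inference is run by Hugging Face in a dedicated, fully managed infrastructure on a cloud provider of your choice.
Local endpoints: you can also run inference with local inference servers such as llama.cpp, Ollama, vLLM, LiteLLM, or Text Generation Inference (TGI) by connecting the client to these local endpoints.

These services can all be called from the InferenceClient object. It acts as a replacement for the legacy InferenceApi client, adding specific support for tasks and third-party providers. Learn how to migrate to the new client in the Legacy InferenceAPI client section.

InferenceClient is a Python client making HTTP calls to our APIs. If you want to make the HTTP calls directly using your preferred tool (curl, postman, etc.), please refer to the Inference Providers documentation or to the Inference Endpoints documentation pages.

For web development, a JS client has been released. If you are interested in game development, you might have a look at our C# project.

Getting started

Let's get started with a text-to-image task: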

>>> from huggingface_hub import InferenceClient

# Example with an external provider (e.g. replicate)
>>> replicate_client = InferenceClient(
...     provider="replicate",
...     api_key="my_replicate_api_key",
... )
>>> replicate_image = replicate_client.text_to_image(
...     "A flying car crossing a futuristic cityscape.",
...     model="black-forest-labs/FLUX.1-schnell",
... )
>>> replicate_image.save("flying_car.png")

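In the example above, we initialized an InferenceClient with a third-party provider, Replicate. When using a provider, you must specify the model you want to use. The model ID must be the ID of the model on the Hugging Face Hub, not the ID of the model from the third-party provider. In our example, we generated an image from a text prompt. The returned value is a `PIL.Image` object that can be saved to a file. For more details, check out the text_to_image() documentation.

Let's now see an example using the chat_completion() API. This task uses an LLM to generate a response from a list of messages: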

>>> from huggingface_hub import InferenceClient
>>> messages = [
...     {
...         "role": "user",
...         "content": "What is the capital of France?",
...     }
... ]
>>> client = InferenceClient(
...     provider="together",
...     model="meta-llama/Meta-Llama-3-8B-Instruct",
...     api_key="my_together_api_key",
... )
>>> client.chat_completion(messages, max_tokens=100)
ChatCompletionOutput(
    choices=[
        ChatCompletionOutputComplete(
            finish_reason="eos_token",
            index=0,
            message=ChatCompletionOutputMessage(
                role="assistant", content="The capital of France is Paris.", name=None, tool_calls=None
            ),
            logprobs=None,
        )
    ],
    created=1719907176,
    id="",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    object="text_completion",
    system_fingerprint="2.0.4-sha-f426a33",
    usage=ChatCompletionOutputUsage(completion_tokens=8, prompt_tokens=17, total_tokens=25),
)

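In the example above, we used a third-party provider (Together AI) and specified which model we want to use (`"meta-llama/Meta-Llama-3-8B-Instruct"`). We then gave a list of messages to complete (here, a single question) and passed an additional parameter to the API (`max_tokens=100`). The output is a `ChatCompletionOutput` object that follows the OpenAI specification. The generated content can be accessed with `output.choices[0].message.content`. For more details, check out the chat_completion() documentation.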
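To make the `output.choices[0].message.content` access pattern concrete, here is a minimal sketch that pulls the generated text out of a chat completion result. The `fake_output` object is a hand-built stand-in (an assumption for illustration) that mimics only the attributes used here, so the snippet runs without calling any provider. `first_message_text` is a hypothetical helper, not part of `huggingface_hub`.

```python
from types import SimpleNamespace


def first_message_text(output) -> str:
    """Return the text of the first choice, or "" if there is none."""
    if not output.choices:
        return ""
    content = output.choices[0].message.content
    return content or ""


# Stand-in mimicking the shape of a ChatCompletionOutput (illustration only).
fake_output = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(
                role="assistant",
                content="The capital of France is Paris.",
            )
        )
    ]
)

print(first_message_text(fake_output))
```

With a real client, the same helper could presumably be applied to the value returned by `client.chat_completion(...)`.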

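The API is designed to be simple. Not all parameters and options are available or described for the end user. Check out this page if you are interested in learning more about all the parameters available for each task.

Using a specific provider

If you want to use a specific provider, you can specify it when initializing the client. The default value is "auto", which selects the first provider available for the model, following the order the user set in https://huggingface.co/settings/inference-providers. Refer to the Supported providers and tasks section for a list of supported providers.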

>>> from huggingface_hub import InferenceClient
>>> client = InferenceClient(provider="replicate", api_key="my_replicate_api_key")
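The "auto" behavior can be pictured as a first-match search over the user's provider ordering. The sketch below is a conceptual illustration only, not the library's actual implementation: `pick_provider`, `user_order`, and `providers_for_model` are hypothetical names, and the sample data is made up.

```python
def pick_provider(user_order, providers_for_model):
    """Return the first provider in the user's ordering that serves the model."""
    for provider in user_order:
        if provider in providers_for_model:
            return provider
    raise ValueError("No provider in the user's ordering serves this model.")


# Hypothetical data: the user's preferred order and the providers serving a model.
user_order = ["cerebras", "together", "replicate"]
providers_for_model = {"replicate", "together"}

print(pick_provider(user_order, providers_for_model))  # together
```

Passing an explicit `provider` to the client skips this selection step entirely.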

Using a specific model

What if you want to use a specific model? You can specify it either as a parameter or directly at the instance level:

>>> from huggingface_hub import InferenceClient
# Initialize client for a specific model
>>> client = InferenceClient(provider="together", model="meta-llama/Llama-3.1-8B-Instruct")
>>> client.chat_completion(...)
# Or use a generic client but pass your model as an argument
>>> client = InferenceClient(provider="together")
>>> client.chat_completion(..., model="meta-llama/Llama-3.1-8B-Instruct")

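When using the "hf-inference" provider, each task comes with a recommended model from the 1M+ models available on the Hub. However, this recommendation can change over time, so it is best to set a model explicitly once you have decided which one to use. For third-party providers, you must always specify a model that is compatible with that provider.

Visit the Models page on the Hub to explore models available through Inference Providers.

Using Inference Endpoints

The examples we saw above use Inference Providers. While these are very useful for prototyping and testing things quickly, once you are ready to deploy your model to production you will need dedicated infrastructure. That is where Inference Endpoints come into play. They allow you to deploy any model and expose it as a private API. Once deployed, you get a URL that you can connect to using exactly the same code as before, changing only the `model` parameter: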

>>> from huggingface_hub import InferenceClient
>>> client = InferenceClient(model="https://uu149rez6gw9ehej.eu-west-1.aws.endpoints.huggingface.cloud/deepfloyd-if")
# or
>>> client = InferenceClient()
>>> client.text_to_image(..., model="https://uu149rez6gw9ehej.eu-west-1.aws.endpoints.huggingface.cloud/deepfloyd-if")

Note that you cannot specify both a URL and a provider: they are mutually exclusive. URLs are used to connect directly to deployed endpoints.
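If client settings come from configuration, it can help to check the URL/provider rule before building the client. The following is a minimal sketch under that assumption. `client_kwargs` is a hypothetical helper, and the URL check (an `http://` or `https://` prefix) is a simplification chosen for this example.

```python
from typing import Optional


def client_kwargs(model: Optional[str] = None, provider: Optional[str] = None) -> dict:
    """Build keyword arguments for a client, rejecting URL + provider combinations."""
    is_url = model is not None and model.startswith(("http://", "https://"))
    if is_url and provider is not None:
        raise ValueError("Pass either an endpoint URL or a provider, not both.")
    kwargs = {}
    if model is not None:
        kwargs["model"] = model
    if provider is not None:
        kwargs["provider"] = provider
    return kwargs


print(client_kwargs(model="meta-llama/Llama-3.1-8B-Instruct", provider="together"))
```

The resulting dictionary could then be unpacked into the constructor, for example `InferenceClient(**client_kwargs(...))`.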

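Using local endpoints

You can use InferenceClient to run chat completion with local inference servers (llama.cpp, vLLM, LiteLLM server, TGI, mlx, etc.) running on your own machine. The server's API should be compatible with the OpenAI API.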

>>> from huggingface_hub import InferenceClient
# Assumes the local server listens on localhost, port 8080
>>> client = InferenceClient(model="http://localhost:8080")

>>> response = client.chat.completions.create(
...     messages=[
...         {"role": "user", "content": "What is the capital of France?"}
...     ],
...     max_tokens=100
... )
>>> print(response.choices[0].message.content)

Similarly to the OpenAI Python client, InferenceClient can be used to run chat completion inference with any endpoint compatible with the OpenAI REST API.
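To show what "OpenAI-compatible" means at the HTTP level, the sketch below builds (but does not send) the kind of request such a server accepts. It assumes a server at `http://localhost:8080` exposing the standard `/v1/chat/completions` route. `build_chat_request` is a hypothetical helper, and some servers may also require a `model` field in the payload.

```python
import json
import urllib.request


def build_chat_request(base_url: str, messages: list, max_tokens: int = 100) -> urllib.request.Request:
    """Prepare a POST request for an OpenAI-compatible chat completions route."""
    payload = {"messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8080",  # assumed address of the local server
    [{"role": "user", "content": "What is the capital of France?"}],
)
print(request.get_method(), request.full_url)

# Sending is left out so the sketch runs without a server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```

InferenceClient takes care of this plumbing for you, which is why pointing `model` at the server's base URL is enough.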

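Authentication

Authentication can be done in two ways:

Routed through Hugging Face: use Hugging Face as a proxy to access third-party providers. Calls are routed through Hugging Face's infrastructure using our provider keys, and usage is billed directly to your Hugging Face account.

You can authenticate using a User Access Token. You can provide your Hugging Face token directly using the `api_key` parameter: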

>>> client = InferenceClient(
...     provider="replicate",
...     api_key="hf_****"  # Your HF token
... )

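If you do **not** pass an `api_key`, the client will attempt to find and use a token stored locally on your machine. This typically happens if you have previously logged in. See the Authentication Guide for more details on login.

>>> client = InferenceClient(
...     provider="replicate",  # No api_key: the locally stored HF token is used
... )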

Direct access to the provider: use your own API key to interact directly with the provider's service:

>>> client = InferenceClient(
...     provider="replicate",
...     api_key="r8_****"  # Your Replicate API key
... )

For more details, refer to the Inference Providers pricing documentation.
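The two authentication modes can be summarized with a small decision helper. This is a simplified sketch, not the client's real logic: `describe_auth` is a hypothetical function, and telling the modes apart by the `hf_` prefix is an assumption based on the token formats shown in the examples (`hf_****` for Hugging Face, `r8_****` for Replicate).

```python
import os
from typing import Optional


def describe_auth(api_key: Optional[str] = None) -> str:
    """Describe how a request would be authenticated for a given key."""
    # Fall back to the HF_TOKEN environment variable when no key is passed.
    key = api_key or os.environ.get("HF_TOKEN")
    if key is None:
        return "no explicit key: rely on a locally saved token"
    if key.startswith("hf_"):
        return "routed through Hugging Face"
    return "direct call to the provider"


print(describe_auth("hf_example"))  # routed through Hugging Face
print(describe_auth("r8_example"))  # direct call to the provider
```

The practical difference is billing: routed calls are charged to the Hugging Face account, while direct calls are charged by the provider.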

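Supported providers and tasks

InferenceClient's goal is to provide the easiest interface to run inference on Hugging Face models, on any provider. It has a simple API that supports the most common tasks. The table below shows which providers support which tasks:

Task	Black Forest Labs	Cerebras	Cohere	fal-ai	Featherless AI	Fireworks AI	Groq	HF Inference	Hyperbolic	Nebius AI Studio	Novita AI	Replicate	Sambanova	Together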
audio_classification()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
audio_to_audio()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
automatic_speech_recognition()	❌	❌	❌	✅	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
chat_completion()	❌	✅	✅	❌	✅	✅	✅	✅	✅	✅	✅	❌	✅	✅
document_question_answering()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
feature_extraction()	❌	❌	❌	❌	❌	❌	❌	✅	❌	✅	❌	❌	✅	❌
fill_mask()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
image_classification()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
image_segmentation()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
image_to_image()	❌	❌	❌	✅	❌	❌	❌	✅	❌	❌	❌	✅	❌	❌
image_to_video()	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌	❌	❌	❌	❌
image_to_text()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
object_detection()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
question_answering()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
sentence_similarity()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
summarization()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
table_question_answering()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
text_classification()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
text_generation()	❌	❌	❌	❌	✅	❌	❌	✅	✅	✅	✅	❌	❌	✅
text_to_image()	✅	❌	❌	✅	❌	❌	❌	✅	✅	✅	❌	✅	❌	✅
text_to_speech()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	✅	❌	❌
text_to_video()	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌
tabular_classification()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
tabular_regression()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
token_classification()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
translation()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
visual_question_answering()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
zero_shot_image_classification()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌
zero_shot_classification()	❌	❌	❌	❌	❌	❌	❌	✅	❌	❌	❌	❌	❌	❌

Check out the Tasks page to learn more about each task.
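When choosing a provider programmatically, a table like the one above can be encoded as a lookup. The sketch below copies four rows from the table by hand. `SUPPORT`, `providers_for`, and `supports` are hypothetical names, and the data is a snapshot that may drift from the live list of supported providers.

```python
# Hand-copied subset of the support table above (snapshot, illustration only).
SUPPORT = {
    "automatic_speech_recognition": {"fal-ai", "hf-inference"},
    "image_to_video": {"fal-ai"},
    "text_to_speech": {"hf-inference", "replicate"},
    "text_to_video": {"fal-ai", "replicate"},
}


def providers_for(task: str) -> list:
    """Return the providers listed for a task, sorted alphabetically."""
    return sorted(SUPPORT.get(task, set()))


def supports(provider: str, task: str) -> bool:
    """Tell whether the snapshot lists the provider for the task."""
    return provider in SUPPORT.get(task, set())


print(providers_for("text_to_video"))  # ['fal-ai', 'replicate']
print(supports("replicate", "image_to_video"))  # False
```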

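OpenAI compatibility

The `chat_completion` task follows OpenAI's Python client syntax. What does that mean for you? It means that if you are used to working with `OpenAI`'s APIs, you can switch to `huggingface_hub.InferenceClient` and work with open-source models by updating just two lines of code!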

- from openai import OpenAI
+ from huggingface_hub import InferenceClient

- client = OpenAI(
+ client = InferenceClient(
    base_url=...,
    api_key=...,
)


output = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)

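And that's it! The only required changes are to replace `from openai import OpenAI` with `from huggingface_hub import InferenceClient` and `client = OpenAI(...)` with `client = InferenceClient(...)`. You can choose any LLM from the Hugging Face Hub by passing its model ID as the `model` parameter. Here is a list of supported models. For authentication, you should pass a valid User Access Token as `api_key` or authenticate using `huggingface_hub` (see the authentication guide).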
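When streaming, some chunks may carry no text (for example, `delta.content` can be `None`), so printing each chunk directly can emit stray `None` values. Below is a minimal sketch of collecting a stream into one string. `collect_stream` and `_chunk` are hypothetical helpers, and `fake_stream` is a hand-built stand-in for the iterator returned with `stream=True`.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Concatenate the text pieces of a streamed chat completion."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # skip chunks without any choice
        piece = chunk.choices[0].delta.content
        if piece:  # ignore None and empty strings
            parts.append(piece)
    return "".join(parts)


def _chunk(text):
    """Build a stand-in stream chunk exposing choices[0].delta.content."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


fake_stream = [_chunk("1, "), _chunk(None), _chunk("2, "), _chunk("3")]
print(collect_stream(fake_stream))  # 1, 2, 3
```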

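All input parameters and output formats are strictly the same. In particular, you can pass `stream=True` to receive tokens as they are generated. You can also use AsyncInferenceClient to run inference using `asyncio`: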

import asyncio
- from openai import AsyncOpenAI
+ from huggingface_hub import AsyncInferenceClient

- client = AsyncOpenAI()
+ client = AsyncInferenceClient()

async def main():
    stream = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": "Say this is a test"}],
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")

asyncio.run(main())

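You might wonder why you should use InferenceClient instead of OpenAI's client. There are a few reasons:

InferenceClient is configured for Hugging Face services. You don't need to provide a `base_url` to run models with Inference Providers. You also don't need to provide a `token` or `api_key` if your machine is already correctly logged in.
InferenceClient is tailored for both the Text-Generation-Inference (TGI) and `transformers` frameworks, so it stays in step with their latest updates.
InferenceClient is integrated with our Inference Endpoints service, making it easier to launch an Inference Endpoint, check its status, and run inference on it. Check out the Inference Endpoints guide for more details.

`InferenceClient.chat.completions.create` is simply an alias for `InferenceClient.chat_completion`. Check out the package reference of chat_completion() for more details. The `base_url` and `api_key` parameters used when instantiating the client are also aliases for `model` and `token`. These aliases were defined to reduce friction when switching from `OpenAI` to `InferenceClient`.

Function calling

Function calling allows LLMs to interact with external tools, such as defined functions or APIs. This enables users to easily build applications tailored to specific use cases and real-world tasks. `InferenceClient` implements the same tool calling interface as the OpenAI Chat Completions API. Here is a simple example of tool calling using Nebius as the inference provider: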

from huggingface_hub import InferenceClient

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current temperature for a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and country e.g. Paris, France"
                    }
                },
                "required": ["location"],
            },
        }
    }
]

client = InferenceClient(provider="nebius")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "What's the weather like the next 3 days in London, UK?"
        }
    ],
    tools=tools,
    tool_choice="auto",
)

print(response.choices[0].message.tool_calls[0].function.arguments)

Please refer to the providers' documentation to verify which models they support for function/tool calling.
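The model only proposes a tool call; running the tool is up to your code. The sketch below shows one way to dispatch a proposed call to a local function. `get_weather` here is a hypothetical local implementation returning canned data, `TOOL_REGISTRY` and `run_tool_call` are illustrative names, and `fake_call` is a stand-in for an entry of `response.choices[0].message.tool_calls`. The arguments are decoded from JSON when they arrive as a string.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> str:
    """Hypothetical local tool; returns canned data instead of a real forecast."""
    return f"Sunny in {location}"


# Map tool names advertised to the model onto local callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call) -> str:
    """Execute the local function matching a tool call proposed by the model."""
    func = TOOL_REGISTRY[tool_call.function.name]
    arguments = tool_call.function.arguments
    if isinstance(arguments, str):
        arguments = json.loads(arguments)  # arguments may arrive JSON-encoded
    return func(**arguments)


fake_call = SimpleNamespace(
    function=SimpleNamespace(name="get_weather", arguments='{"location": "London, UK"}')
)
print(run_tool_call(fake_call))  # Sunny in London, UK
```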

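Structured outputs and JSON mode

InferenceClient supports JSON mode for syntactically valid JSON responses and structured outputs for schema-enforced responses. JSON mode provides machine-readable data without a strict structure, while structured outputs guarantee both valid JSON and adherence to a predefined schema for reliable downstream processing.

We follow the OpenAI API specification for both JSON mode and structured outputs. You can enable them via the `response_format` parameter. Here is an example of structured outputs using Cerebras as the inference provider: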

from huggingface_hub import InferenceClient

json_schema = {
    "name": "book",
    "schema": {
        "properties": {
            "name": {
                "title": "Name",
                "type": "string",
            },
            "authors": {
                "items": {"type": "string"},
                "title": "Authors",
                "type": "array",
            },
        },
        "required": ["name", "authors"],
        "title": "Book",
        "type": "object",
    },
    "strict": True,
}

client = InferenceClient(provider="cerebras")


completion = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[
        {"role": "system", "content": "Extract the books information."},
        {"role": "user", "content": "I recently read 'The Great Gatsby' by F. Scott Fitzgerald."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": json_schema,
    },
)

print(completion.choices[0].message)

Please refer to the providers' documentation to verify which models they support for structured outputs and JSON mode.
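The message content of a structured-output response is still a JSON string, so downstream code typically decodes it and may want a defensive check. Here is a minimal sketch for the `book` schema above. `parse_book` is a hypothetical helper, and `sample_content` is a made-up example of what `completion.choices[0].message.content` might contain.

```python
import json


def parse_book(content: str) -> dict:
    """Decode a JSON string and check it against the 'book' schema fields."""
    book = json.loads(content)
    missing = [key for key in ("name", "authors") if key not in book]
    if missing:
        raise ValueError(f"Missing required fields: {missing}")
    if not isinstance(book["name"], str):
        raise TypeError("'name' must be a string")
    authors = book["authors"]
    if not isinstance(authors, list) or not all(isinstance(a, str) for a in authors):
        raise TypeError("'authors' must be a list of strings")
    return book


# Made-up response content, for illustration only.
sample_content = '{"name": "The Great Gatsby", "authors": ["F. Scott Fitzgerald"]}'
book = parse_book(sample_content)
print(book["name"], "by", ", ".join(book["authors"]))
```

With plain JSON mode, where no schema is enforced, a check of this kind matters more than with structured outputs.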

Async client

An async version of the client is also provided, based on `asyncio` and `aiohttp`. You can either install `aiohttp` directly or use the `[inference]` extra:

pip install --upgrade huggingface_hub[inference]
# or
# pip install aiohttp

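After installation, all async API endpoints are available via AsyncInferenceClient. Its initialization and APIs are strictly the same as those of the sync-only version.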

# Code must be run in an asyncio concurrent context.
# $ python -m asyncio
>>> from huggingface_hub import AsyncInferenceClient
>>> client = AsyncInferenceClient()

>>> image = await client.text_to_image("An astronaut riding a horse on the moon.")
>>> image.save("astronaut.png")

>>> async for token in await client.text_generation("The Huggingface Hub is", stream=True):
...     print(token, end="")
 a platform for sharing and discussing ML-related content.

For more information about the `asyncio` module, please refer to the official documentation.
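A common reason to reach for the async client is to run several requests concurrently. The sketch below shows the pattern with `asyncio.gather`. `FakeAsyncClient` is a stand-in written for this example so that it runs offline, and it only imitates the idea of an awaitable `text_generation` method. `generate_all` is a hypothetical helper.

```python
import asyncio


class FakeAsyncClient:
    """Offline stand-in for an async client (illustration only)."""

    async def text_generation(self, prompt: str) -> str:
        await asyncio.sleep(0.01)  # simulate network latency
        return prompt.upper()


async def generate_all(client, prompts):
    """Run one generation per prompt concurrently, keeping the input order."""
    return await asyncio.gather(*(client.text_generation(p) for p in prompts))


results = asyncio.run(generate_all(FakeAsyncClient(), ["first", "second"]))
print(results)  # ['FIRST', 'SECOND']
```

`asyncio.gather` returns results in the order of the awaitables passed in, regardless of which request finishes first.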

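MCP Client

The `huggingface_hub` library now includes an experimental MCPClient, designed to give Large Language Models (LLMs) the ability to interact with external tools via the Model Context Protocol (MCP). This client extends AsyncInferenceClient to seamlessly integrate tool usage.

The MCPClient connects to MCP servers (either local `stdio` scripts or remote `http`/`sse` services) that expose tools. It feeds these tools to an LLM (via AsyncInferenceClient). If the LLM decides to use a tool, MCPClient manages the execution request to the MCP server and relays the tool's output back to the LLM, often streaming the results in real time.

In the following example, we use the Qwen/Qwen2.5-72B-Instruct model via the Nebius inference provider. We then add a remote MCP server, in this case an SSE server that makes the Flux image generation tool available to the LLM.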

import os

from huggingface_hub import ChatCompletionInputMessage, ChatCompletionStreamOutput, MCPClient


async def main():
    async with MCPClient(
        provider="nebius",
        model="Qwen/Qwen2.5-72B-Instruct",
        api_key=os.environ["HF_TOKEN"],
    ) as client:
        await client.add_mcp_server(type="sse", url="https://evalstate-flux1-schnell.hf.space/gradio_api/mcp/sse")

        messages = [
            {
                "role": "user",
                "content": "Generate a picture of a cat on the moon",
            }
        ]

        async for chunk in client.process_single_turn_with_tools(messages):
            # Log messages
            if isinstance(chunk, ChatCompletionStreamOutput):
                delta = chunk.choices[0].delta
                if delta.content:
                    print(delta.content, end="")

            # Or tool calls
            elif isinstance(chunk, ChatCompletionInputMessage):
                print(
                    f"\nCalled tool '{chunk.name}'. Result: '{chunk.content if len(chunk.content) < 1000 else chunk.content[:1000] + '...'}'"
                )


if __name__ == "__main__":
    import asyncio

    asyncio.run(main())

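For even simpler development, we offer a higher-level Agent class. This "Tiny Agent" simplifies creating conversational Agents by managing the chat loop and state, essentially acting as a wrapper around MCPClient. It is designed to be a simple while loop built right on top of an MCPClient. You can run these Agents directly from the command line: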

# install latest version of huggingface_hub with the mcp extra
pip install -U huggingface_hub[mcp]
# Run an agent that uses the Flux image generation tool
tiny-agents run julien-c/flux-schnell-generator

When launched, the Agent will load, list the tools it has discovered from its connected MCP servers, and then it is ready for your prompts!
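To illustrate the idea of an agent as a simple loop around a model and its tools, here is a conceptual sketch. It is not the Agent class's actual code: `run_agent_turn`, `fake_model`, and `fake_tool` are hypothetical, the message dictionaries use a simplified shape invented for this example, and the loop is bounded by `max_steps` to avoid running forever.

```python
def run_agent_turn(ask_model, call_tool, messages, max_steps=5):
    """Alternate between the model and its tools until the model gives a final answer."""
    for _ in range(max_steps):
        reply = ask_model(messages)
        messages.append(reply)
        tool_call = reply.get("tool_call")
        if tool_call is None:
            return reply["content"]  # no tool requested: this is the final answer
        result = call_tool(tool_call["name"], tool_call["arguments"])
        messages.append({"role": "tool", "name": tool_call["name"], "content": result})
    raise RuntimeError("Agent did not finish within max_steps")


def fake_model(messages):
    """Scripted stand-in: request a tool once, then answer using its result."""
    if messages[-1]["role"] == "tool":
        return {"role": "assistant", "content": f"Done: {messages[-1]['content']}", "tool_call": None}
    call = {"name": "generate_image", "arguments": {"prompt": "a cat on the moon"}}
    return {"role": "assistant", "content": "", "tool_call": call}


def fake_tool(name, arguments):
    """Stand-in for a tool exposed by an MCP server."""
    return f"{name} ran with prompt '{arguments['prompt']}'"


history = [{"role": "user", "content": "Generate a picture of a cat on the moon"}]
answer = run_agent_turn(fake_model, fake_tool, history)
print(answer)
```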

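Advanced tips

In the sections above, we saw the main aspects of InferenceClient. Let's dive into some more advanced tips.

Billing

As an HF user, you get monthly credits to run inference through various providers on the Hub. The amount of credits you get depends on your type of account (Free, PRO, or Enterprise Hub). You are charged for every inference request, according to the provider's pricing table. By default, requests are billed to your personal account. However, you can have requests billed to an organization you are part of by passing the organization's name as the `bill_to` argument of `InferenceClient`. For this to work, your organization must be subscribed to Enterprise Hub. For more details about billing, check out this guide.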

>>> from huggingface_hub import InferenceClient
>>> client = InferenceClient(provider="fal-ai", bill_to="openai")
>>> image = client.text_to_image(
...     "A majestic lion in a fantasy forest",
...     model="black-forest-labs/FLUX.1-schnell",
... )
>>> image.save("lion.png")

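Note that it is not possible to charge another user or an organization you are not part of. If you want to grant someone else some credits, you must create a joint organization with them.

Timeout

Inference calls can take a significant amount of time. By default, InferenceClient waits "indefinitely" until the inference completes. If you want more control in your workflow, you can set the `timeout` parameter to a specific value in seconds. If the timeout delay expires, an InferenceTimeoutError is raised, which you can catch in your code: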

>>> from huggingface_hub import InferenceClient, InferenceTimeoutError
>>> client = InferenceClient(timeout=30)
>>> try:
...     client.text_to_image(...)
... except InferenceTimeoutError:
...     print("Inference timed out after 30s.")
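Beyond catching the error once, a workflow may want to retry a timed-out call a few times. The sketch below shows one such pattern. `call_with_retries` and `flaky_call` are hypothetical, and the built-in `TimeoutError` is used as the default so the example runs offline. With the real client, you would presumably pass `timeout_error=InferenceTimeoutError`.

```python
import time


def call_with_retries(func, retries=3, delay=0.0, timeout_error=TimeoutError):
    """Call func(), retrying when it raises the given timeout exception."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except timeout_error:
            if attempt == retries:
                raise  # out of attempts: let the caller handle the timeout
            time.sleep(delay)


attempts = {"count": 0}


def flaky_call():
    """Stand-in for an inference call that times out twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = call_with_retries(flaky_call, retries=3)
print(result, "after", attempts["count"], "attempts")
```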

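Binary inputs

Some tasks require binary inputs, for example when dealing with images or audio files. In this case, InferenceClient tries to be as permissive as possible and accepts different types:

raw `bytes`
a file-like object, opened as binary (`with open("audio.flac", "rb") as f: ...`)
a path (`str` or `Path`) pointing to a local file
a URL (`str`) pointing to a remote file (e.g. `https://...`). In this case, the file is downloaded locally before being sent to the API.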

>>> from huggingface_hub import InferenceClient
>>> client = InferenceClient()
>>> client.image_classification("https://upload.wikimedia.org/wikipedia/commons/thumb/4/43/Cute_dog.jpg/320px-Cute_dog.jpg")
[{'score': 0.9779096841812134, 'label': 'Blenheim spaniel'}, ...]
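The permissive handling of binary inputs described above can be pictured as a normalization step that turns every accepted type into raw bytes. The following is a simplified sketch, not the client's actual code. `to_bytes` is a hypothetical helper, and the URL case is deliberately left unimplemented to keep the example offline.

```python
import io
import tempfile
from pathlib import Path
from typing import BinaryIO, Union


def to_bytes(data: Union[bytes, BinaryIO, str, Path]) -> bytes:
    """Normalize bytes, a binary file object, or a local path into raw bytes."""
    if isinstance(data, bytes):
        return data
    if isinstance(data, (str, Path)):
        if isinstance(data, str) and data.startswith(("http://", "https://")):
            raise NotImplementedError("Downloading URLs is left out of this sketch.")
        return Path(data).read_bytes()
    return data.read()  # file-like object opened in binary mode


# Exercise the local-path branch with a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "audio.flac"
    path.write_bytes(b"fake-bytes")
    from_path = to_bytes(path)

print(to_bytes(b"abc"), to_bytes(io.BytesIO(b"xyz")), from_path)
```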
