Consuming Text Generation Inference

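There are many ways to consume a Text Generation Inference (TGI) server in your applications. After launching the server, you can send a POST request to the Messages API route /v1/chat/completions to get results from the server. If you want TGI to return a stream of tokens, pass "stream": true in the call.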

For more information on the API, consult the OpenAPI documentation of text-generation-inference available here.

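You can make the requests using any tool of your choice, such as curl, Python, or TypeScript. For an end-to-end experience, we have open-sourced ChatUI, a chat interface for open-access models.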

curl

After a successful server launch, you can query the model using the /v1/chat/completions route to get responses that comply with the OpenAI Chat Completion spec.

curl localhost:8080/v1/chat/completions \
    -X POST \
    -d '{
  "model": "tgi",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is deep learning?"
    }
  ],
  "stream": true,
  "max_tokens": 20
}' \
    -H 'Content-Type: application/json'
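
With "stream": true, the response arrives as a sequence of events rather than a single JSON object. The following is a minimal sketch of how a client might extract the text from such a stream. It assumes the OpenAI-style server-sent event format ("data: " lines carrying JSON chunks, terminated by "data: [DONE]"); the helper name iter_stream_text and the sample lines are hypothetical and only illustrate the shape of the data.

```python
import json


def iter_stream_text(lines):
    """Yield the text deltas carried by server-sent event lines.

    Assumes OpenAI-style streaming: each event is a line starting with
    "data:" followed by a JSON chunk, and "data: [DONE]" ends the stream.
    """
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        # Skip blank separator lines and anything that is not a data event.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        # Assumed end-of-stream sentinel.
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


# Hypothetical sample of what a streamed response body could look like.
sample = [
    b'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": "Deep"}}]}',
    b"",
    b'data: {"choices": [{"index": 0, "delta": {"content": " learning"}}]}',
    b"data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

In practice the lines would come from iterating over the HTTP response body instead of a hard-coded list.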

For non-chat use cases, you can also use the /generate and /generate_stream routes.

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{
  "inputs":"What is Deep Learning?",
  "parameters":{
    "max_new_tokens":20
  }
}' \
    -H 'Content-Type: application/json'
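
The same /generate call can be prepared from Python with only the standard library. This is a rough sketch that mirrors the curl command above; the helper name build_generate_request is hypothetical, and the base URL assumes a server listening on 127.0.0.1:8080 as in the curl example. The request is built but not sent, since sending requires a running server.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=20, base_url="http://127.0.0.1:8080"):
    """Build (but do not send) a POST request for the /generate route."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("What is Deep Learning?")
print(request.get_method(), request.full_url)

# Sending needs a live TGI server, so it is left commented out.
# The response is assumed to be a JSON object with a "generated_text" field.
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["generated_text"])
```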

Python

Inference Client

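huggingface_hub is a Python library for interacting with the Hugging Face Hub, including its endpoints. It provides a high-level class, huggingface_hub.InferenceClient, which makes it easy to call TGI's Messages API. InferenceClient also takes care of parameter validation and provides a simple-to-use interface.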

Install the huggingface_hub package via pip.

pip install huggingface_hub

You can now use InferenceClient the same way you would use the OpenAI client in Python:

from huggingface_hub import InferenceClient

client = InferenceClient(
    # Assumes TGI is served locally on port 8080, as in the curl examples.
    base_url="http://localhost:8080/v1/",
)

output = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Count to 10"},
    ],
    stream=True,
    max_tokens=1024,
)

for chunk in output:
    print(chunk.choices[0].delta.content)
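
Printing each chunk as it arrives produces one fragment per line, and some chunks may carry no text at all. As an illustration, the sketch below joins the streamed fragments into a single string and skips empty deltas. The helper collect_stream and the fake_chunk stand-in are hypothetical; the stand-in only mimics the chunk.choices[0].delta.content shape used in the loop above so the sketch can run without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        # A delta may carry no text (for example, a final chunk), so skip it.
        if text:
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in object with the same attribute layout as a streamed chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


print(collect_stream([fake_chunk("1, 2"), fake_chunk(None), fake_chunk(", 3")]))
```

With a live server, collect_stream(output) would take the place of the print loop.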

You can find more details about OpenAI compatibility here.

There is also an async version of the client, AsyncInferenceClient, based on asyncio and aiohttp. You can find its documentation here.
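
A streaming call through the async client could look roughly like the sketch below. It assumes AsyncInferenceClient exposes the same chat.completions.create interface as the synchronous client and that awaiting it with stream=True returns an async iterator of chunks. The function name stream_chat is hypothetical, and the client is passed in as a parameter so the function stays independent of how it is constructed.

```python
import asyncio


async def stream_chat(client, prompt, max_tokens=128):
    """Stream a chat completion from an async client and return the full text."""
    stream = await client.chat.completions.create(
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=max_tokens,
    )
    parts = []
    async for chunk in stream:
        # Some chunks may carry no text; treat them as empty strings.
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)


# Usage against a local TGI server (left commented out, since it needs a live server):
# from huggingface_hub import AsyncInferenceClient
# client = AsyncInferenceClient(base_url="http://localhost:8080/v1/")
# print(asyncio.run(stream_chat(client, "Count to 10")))
```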

OpenAI Client

You can directly use the OpenAI Python or JS clients to interact with TGI.

Install the OpenAI Python package via pip.

pip install openai

from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(
    # Assumes TGI is served locally on port 8080, as in the curl examples.
    base_url="http://localhost:8080/v1/",
    api_key="-"
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant." },
        {"role": "user", "content": "What is deep learning?"}
    ],
    stream=True
)

# iterate and print stream
for message in chat_completion:
    print(message)

UI

Gradio

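Gradio is a Python library that helps you build web applications for your machine learning models with a few lines of code. It has a ChatInterface wrapper that helps create neat UIs for chatbots. Let's take a look at how to create a chatbot with streaming mode using TGI and Gradio. First, install Gradio and the Hub Python library.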

pip install huggingface-hub gradio

Assuming you are serving your model on port 8080, we will query it through InferenceClient.

import gradio as gr
from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://127.0.0.1:8080")

def inference(message, history):
    partial_message = ""
    output = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": message},
        ],
        stream=True,
        max_tokens=1024,
    )

    for chunk in output:
        # A chunk may carry no text (content is None), so fall back to "".
        partial_message += chunk.choices[0].delta.content or ""
        yield partial_message

gr.ChatInterface(
    inference,
    type="messages",
    description="This is the demo for Gradio UI consuming TGI endpoint.",
    title="Gradio 🤝 TGI",
    examples=["Are tomatoes vegetables?"],
).queue().launch()
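
The inference function above sends only the latest user message, so the model does not see earlier turns. One possible way to include them is sketched below. It assumes that, with type="messages", Gradio passes history as a list of dictionaries with "role" and "content" keys; the helper name build_messages is hypothetical.

```python
def build_messages(message, history, system_prompt="You are a helpful assistant."):
    """Combine the system prompt, earlier turns, and the new user message."""
    messages = [{"role": "system", "content": system_prompt}]
    for turn in history:
        # Copy only role and content; any extra keys on a turn are dropped.
        messages.append({"role": turn["role"], "content": turn["content"]})
    messages.append({"role": "user", "content": message})
    return messages


# Hypothetical conversation history in the assumed format.
history = [
    {"role": "user", "content": "Are tomatoes vegetables?"},
    {"role": "assistant", "content": "Botanically, they are fruits."},
]
for entry in build_messages("And cucumbers?", history):
    print(entry["role"], "->", entry["content"])
```

Inside inference, the result of build_messages(message, history) could then be passed as the messages argument.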

You can check out the UI and try the demo directly here 👇

You can read more about how to customize a ChatInterface here.

ChatUI

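ChatUI is an open-source interface built for consuming LLMs. It offers many customization options, such as web search with the SERP API. ChatUI can automatically consume the TGI server and even provides an option to switch between different TGI endpoints. You can try it out at Hugging Chat, or use the ChatUI Docker Space to deploy your own Hugging Chat to Spaces.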

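To serve both ChatUI and TGI in the same environment, add your own endpoints to the MODELS variable in the .env.local file inside the chat-ui repository. Provide the endpoints pointing to where TGI is served.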

{
// rest of the model config here
"endpoints": [{"url": "https://HOST:PORT/generate_stream"}]
}
