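Quick Tour

Set up

The easiest way to get started with TEI is to use one of the official Docker containers (see Supported models and hardware to choose the right container).

Hence, Docker needs to be installed by following its installation instructions.

TEI supports inference on both GPU and CPU. If you plan to use a GPU, check the supported hardware table to confirm that your hardware is supported. Next, install the NVIDIA Container Toolkit. The NVIDIA drivers on your device need to be compatible with CUDA version 12.2 or higher.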

Deploy

Next, deploy a model. Say you want to use Qwen/Qwen3-Embedding-0.6B. Here is how you can do it:

model=Qwen/Qwen3-Embedding-0.6B
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model

We also recommend sharing a volume with the Docker container (volume=$PWD/data) to avoid downloading the weights on every run.
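The container may need some time to download the weights and load the model before it accepts requests. As a minimal sketch, the Python below polls the server until it answers. It assumes the server exposes a GET /health route on the mapped port (127.0.0.1:8080 from the command above); `http_ok` and `wait_until_ready` are hypothetical helper names, not part of TEI.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or an error status: not ready yet.
        return False


def wait_until_ready(
    base_url: str = "http://127.0.0.1:8080",
    timeout_s: float = 300.0,
    interval_s: float = 2.0,
    probe: Callable[[str], bool] = http_ok,
) -> bool:
    """Poll `<base_url>/health` (assumed route) until it succeeds or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(f"{base_url}/health"):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` right after `docker run` returns True once the server responds, or False if the timeout elapses first. The `probe` parameter exists only so the polling loop can be exercised without a running server.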

Inference

Inference can be performed in three ways: with cURL, or via the InferenceClient or OpenAI Python SDKs.

cURL

To send a POST request to the TEI endpoint using cURL, run the following command:

curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
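For readers who prefer not to install an SDK, the same request can be sketched with Python's standard library alone. The route (/embed), the JSON body, and the header mirror the cURL command above; the helper names `build_embed_request` and `embed` are hypothetical, and the response is assumed to be a JSON list with one embedding per input.

```python
import json
import urllib.request


def build_embed_request(texts, base_url="http://127.0.0.1:8080"):
    """Build (but do not send) a POST request for the /embed route."""
    body = json.dumps({"inputs": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, base_url="http://127.0.0.1:8080", timeout_s=30.0):
    """Send the request and return the parsed JSON response."""
    req = build_embed_request(texts, base_url)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the container from the Deploy step running, `embed("What is Deep Learning?")` sends the same payload as the cURL command. Using `json.dumps` also avoids hand-escaping quotes inside the input text.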

Python

To run inference using Python, you can use either the huggingface_hub Python SDK (recommended) or the openai Python SDK.

huggingface_hub

You can install it via pip with pip install --upgrade --quiet huggingface_hub, and then run:

from huggingface_hub import InferenceClient

client = InferenceClient()

embedding = client.feature_extraction("What is deep learning?",
                                      model="http://127.0.0.1:8080/embed")
print(len(embedding[0]))

OpenAI

You can install it via pip with pip install --upgrade openai, and then run:

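from openai import OpenAI

# The SDK appends /embeddings itself, so base_url stops at /v1.
# The client requires an API key; a placeholder is used here, assuming
# the server was started without one.
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="-")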

response = client.embeddings.create(
  model="tei",
  input="What is deep learning?"
)

print(response)

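Re-rankers and sequence classification

TEI also supports re-ranker and classic sequence classification models.

Re-rankers

Re-rankers, also called cross-encoders, are sequence classification models with a single class that score the similarity between a query and a text. See this blog post by the LlamaIndex team to understand how you can use re-ranker models in your RAG pipeline to improve downstream performance.

Say you want to use BAAI/bge-reranker-large. First, you can deploy it like so: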

model=BAAI/bge-reranker-large
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model

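Once the model is deployed, you can use the rerank endpoint to rank the similarity between a query and a list of texts. With cURL this can be done like so: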

curl 127.0.0.1:8080/rerank \
    -X POST \
    -d '{"query":"What is Deep Learning?", "texts": ["Deep Learning is not...", "Deep learning is..."], "raw_scores": false}' \
    -H 'Content-Type: application/json'
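Once the scores come back, they usually need to be mapped onto the original texts. The sketch below assumes the rerank response is a JSON list of objects with an `index` into the submitted texts and a `score`; that shape is an assumption, so verify it against your server's actual output. `order_by_score` is a hypothetical helper, and the sample scores are made up for illustration.

```python
from typing import Dict, List, Tuple


def order_by_score(texts: List[str], results: List[Dict]) -> List[Tuple[str, float]]:
    """Pair each text with its score and sort from most to least relevant.

    `results` is assumed to look like [{"index": 1, "score": 0.98}, ...],
    where `index` points back into `texts`.
    """
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(texts[r["index"]], r["score"]) for r in ranked]


# Illustrative values only: these scores are invented, not real model output.
texts = ["Deep Learning is not...", "Deep learning is..."]
sample_results = [{"index": 0, "score": 0.12}, {"index": 1, "score": 0.98}]
ranked = order_by_score(texts, sample_results)
```

Sorting on the client side keeps the helper correct whether or not the server already returns results in descending order.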

Sequence classification models

You can also use classic sequence classification models such as SamLowe/roberta-base-go_emotions:

model=SamLowe/roberta-base-go_emotions
volume=$PWD/data

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id $model

Once the model is deployed, you can use the predict endpoint to get the emotions most associated with an input:

curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":"I like you."}' \
    -H 'Content-Type: application/json'
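To pick the single most likely emotion from the response, a small helper is enough. This sketch assumes that the predict response for one input is a JSON list of objects with `label` and `score` keys; that shape is an assumption to check against your server. `top_label` is a hypothetical name, and the sample values are invented.

```python
from typing import Dict, List, Tuple


def top_label(predictions: List[Dict]) -> Tuple[str, float]:
    """Return the (label, score) pair with the highest score."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"], best["score"]


# Illustrative values only: not real model output.
sample = [
    {"label": "neutral", "score": 0.05},
    {"label": "love", "score": 0.91},
]
label, score = top_label(sample)
```

`max` raises a ValueError on an empty list, so callers that may receive no predictions should handle that case explicitly.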

Batching

You can send multiple inputs in a batch. For example, for embeddings:

curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":["Today is a nice day", "I like you"]}' \
    -H 'Content-Type: application/json'

And for sequence classification:

curl 127.0.0.1:8080/predict \
    -X POST \
    -d '{"inputs":[["I like you."], ["I hate pineapples"]]}' \
    -H 'Content-Type: application/json'
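When there are more inputs than fit comfortably in one request, they can be split into several batches on the client. The sketch below builds one /embed payload per batch. The batch size of 32 is an assumed placeholder, not a documented limit from this page, and `chunked` and `embed_payloads` are hypothetical helpers.

```python
import json
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_payloads(texts: List[str], max_batch: int = 32) -> List[str]:
    """Return one JSON body per batch, in the same format as the cURL example."""
    return [json.dumps({"inputs": batch}) for batch in chunked(texts, max_batch)]
```

Each returned string can be sent as the `-d` body of a separate request; concatenating the responses in order preserves the original input order.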

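Air gapped deployment

To deploy Text Embeddings Inference in an air-gapped environment, first download the weights and then mount them inside the container using a volume.

For example: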

# (Optional) create a `models` directory
mkdir models
cd models

# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5

# Set the models directory as the volume path
volume=$PWD

# Mount the models directory inside the container with a volume and set the model ID
docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.8 --model-id /data/gte-base-en-v1.5
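As an alternative to git-lfs, the weights could be fetched on a connected machine with `snapshot_download` from huggingface_hub and then copied to the air-gapped host. The following is a sketch under that assumption. `download_weights` and `docker_command` are hypothetical helpers, and the argument list simply mirrors the docker run command above.

```python
from pathlib import Path
from typing import List


def download_weights(repo_id: str, models_dir: Path) -> Path:
    """Download a model snapshot into models_dir/<model name> (needs network access)."""
    # Imported lazily so the rest of this sketch works without huggingface_hub.
    from huggingface_hub import snapshot_download

    target = models_dir / repo_id.split("/")[-1]
    snapshot_download(repo_id=repo_id, local_dir=target)
    return target


def docker_command(
    models_dir: Path,
    model_name: str,
    image: str = "ghcr.io/huggingface/text-embeddings-inference:1.8",
) -> List[str]:
    """Build the docker run argument list that mounts models_dir at /data."""
    return [
        "docker", "run", "--gpus", "all", "-p", "8080:80",
        "-v", f"{models_dir}:/data",
        "--pull", "always",
        image,
        "--model-id", f"/data/{model_name}",
    ]
```

For example, `subprocess.run(docker_command(Path.cwd(), "gte-base-en-v1.5"), check=True)` would start the container with the current directory mounted as the models volume.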
