在 Google Cloud TPU 例項上部署文字生成推理伺服器 (TGI)

文字生成推理 (TGI) 可以在 TPU 上部署大型語言模型 (LLM)，Optimum TPU 提供專門最佳化的 TGI 執行時，充分利用 TPU 硬體的優勢。

TGI 還提供與 OpenAI 相容的 API，使其易於與各種工具整合。

有關支援模型的列表，請檢視支援的模型頁面。

在 Cloud TPU 例項上部署 TGI

本指南假設您已有一個正在執行的 Cloud TPU 例項。如果沒有，請參閱我們的部署指南。

您有兩種部署 TGI 的選擇

使用我們預構建的 TGI 映象（推薦）
手動構建映象以獲取最新功能

選項 1：使用預構建映象

optimum-tpu 映象可在 ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi 獲取。有關最新的 TGI 映象，請檢視 optimum-tpu 容器文件。有關如何從預構建映象啟動 TGI 容器的教程，請參閱服務教程。以下是如何部署它：

docker run -p 8080:80 \
        --shm-size 16GB \
        --privileged \
        --net host \
        -e LOG_LEVEL=text_generation_router=debug \
        -v ~/hf_data:/data \
        -e HF_TOKEN=<your_hf_token_here> \
        ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
        --model-id google/gemma-2b-it \
        --max-input-length 512 \
        --max-total-tokens 1024 \
        --max-batch-prefill-tokens 512 \
        --max-batch-total-tokens 1024

您需要替換使用您可以在 [此處](https://huggingface.co/settings/tokens) 獲取的 HuggingFace 訪問令牌

如果您已經透過 `huggingface-cli login` 登入，則可以設定 HF_TOKEN=$(cat ~/.cache/huggingface/token) 以便更方便。

您還可以使用 GCP 提供的映象，如optimum-tpu 容器頁面中所述

選項 2：手動構建映象

為了獲取最新功能（optimum-tpu 的 main 分支）或進行自定義修改，請自行構建映象

克隆倉庫

git clone https://github.com/huggingface/optimum-tpu.git

構建映象

make tpu-tgi

執行容器

HF_TOKEN=<your_hf_token_here>
MODEL_ID=google/gemma-2b-it

sudo docker run --net=host \
                --privileged \
                -v $(pwd)/data:/data \
                -e HF_TOKEN=${HF_TOKEN} \
                huggingface/optimum-tpu:latest \
                --model-id ${MODEL_ID} \
                --max-concurrent-requests 4 \
                --max-input-length 32 \
                --max-total-tokens 64 \
                --max-batch-size 1

對服務執行請求

您可以使用 /generate 或 /generate_stream 路由查詢模型

curl localhost/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'

curl localhost/generate_stream \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'