Optimized Inference Deployment
In this section we will explore advanced frameworks for optimizing LLM deployments: Text Generation Inference (TGI), vLLM, and llama.cpp. These applications are primarily used in production environments to serve LLMs to users. This section focuses on how to deploy these frameworks in production rather than how to use them for single-machine inference.
We will cover how these tools maximize inference efficiency and simplify production deployments of large language models.
Framework Selection Guide
TGI, vLLM, and llama.cpp serve similar purposes, but each has distinct characteristics that make it better suited to different use cases. Let's look at the key differences between them, focusing on performance and integration.
Memory Management and Performance
TGI is designed to be stable and predictable in production, using fixed sequence lengths to keep memory usage consistent. TGI manages memory with Flash Attention 2 and continuous batching, which lets it handle attention computation very efficiently and keep the GPU busy by constantly feeding it work. The system can move parts of the model between CPU and GPU when needed, which helps it handle larger models.

Flash Attention's key innovation is how it manages memory transfers between high-bandwidth memory (HBM) and the faster SRAM cache. Traditional attention mechanisms repeatedly transfer data between HBM and SRAM, creating bottlenecks by leaving the GPU idle. Flash Attention loads data into SRAM once and performs all computations there, minimizing expensive memory transfers.
While the benefits are most pronounced during training, Flash Attention's lower VRAM usage and improved efficiency make it valuable for inference as well, enabling faster and more scalable LLM serving.
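To make the idea concrete, here is a minimal NumPy sketch of block-wise attention with an online softmax, the core trick Flash Attention relies on. It only illustrates the mathematics of processing K/V in tiles; the real kernel fuses these steps on the GPU so that each tile is read from HBM into SRAM exactly once, and the names below are illustrative rather than taken from any framework.
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Block-wise attention with an online softmax (single head, illustrative only)."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per query row
    for start in range(0, K.shape[0], block_size):
        Kb, Vb = K[start:start + block_size], V[start:start + block_size]
        scores = (Q @ Kb.T) * scale                      # scores for this K/V tile only
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)           # rescale what was accumulated so far
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against standard (full-matrix) attention
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 128, 32))
scores = (Q @ K.T) / np.sqrt(32)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
reference = (weights / weights.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), reference)
The result matches full attention, but each K/V tile only needs to be visited once.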
vLLM takes a different approach by using PagedAttention. Just as a computer manages its memory in pages, vLLM splits the model's memory into smaller blocks. This clever scheme lets it handle requests of different sizes more flexibly without wasting memory. It is particularly good at sharing memory across requests and reducing memory fragmentation, which makes the whole system more efficient.
vLLM's key innovation lies in how it manages the KV cache:
- Memory paging: instead of treating the KV cache as one large block, it is divided into fixed-size "pages" (similar to virtual memory in operating systems).
- Non-contiguous storage: pages do not need to be stored contiguously in GPU memory, allowing more flexible memory allocation.
- Page table management: a page table tracks which pages belong to which sequence, enabling efficient lookup and access.
- Memory sharing: for operations such as parallel sampling, the pages holding the prompt's KV cache can be shared across multiple sequences.
Compared to traditional approaches, PagedAttention can improve throughput by up to 24x, which makes it a game changer for production LLM deployments. If you want to dig deeper into how PagedAttention works, you can read the guide in the vLLM documentation.
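As a rough illustration of the bookkeeping this implies, here is a toy Python sketch of a paged KV-cache block table. The class and method names are invented for this example and are not vLLM's internal API; the real implementation also handles copy-on-write, preemption, and the GPU-side block lookups.
BLOCK_SIZE = 16  # tokens per KV-cache page

class PagedKVCache:
    """Toy block table: maps sequences to the physical pages holding their KV cache."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical page ids
        self.table = {}                      # seq_id -> ordered list of page ids
        self.length = {}                     # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Account for one generated token, grabbing a new page only when needed."""
        n = self.length.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:              # current page is full (or sequence just started)
            self.table.setdefault(seq_id, []).append(self.free.pop())
        self.length[seq_id] = n + 1

    def fork(self, src_id, new_id):
        """Share the prompt's pages across sequences, e.g. for parallel sampling."""
        self.table[new_id] = list(self.table[src_id])
        self.length[new_id] = self.length[src_id]

    def release(self, seq_id):
        """Return a finished sequence's pages to the free pool."""
        self.free.extend(self.table.pop(seq_id, []))
        self.length.pop(seq_id, None)

cache = PagedKVCache(num_blocks=1024)
for _ in range(40):                  # a 40-token sequence occupies ceil(40 / 16) = 3 pages
    cache.append_token("seq-0")
print(len(cache.table["seq-0"]))     # -> 3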
llama.cpp is a highly optimized C/C++ implementation originally designed to run LLaMA models on consumer hardware. It focuses on CPU efficiency with optional GPU acceleration, making it ideal for resource-constrained environments. llama.cpp uses quantization techniques to reduce model size and memory requirements while maintaining good performance. It implements optimized kernels for various CPU architectures and supports basic KV cache management for efficient token generation.
Key quantization features in llama.cpp include:
- Multi-level quantization: supports 8-bit, 4-bit, 3-bit, and even 2-bit quantization
- GGML/GGUF format: uses custom tensor formats optimized for quantized inference
- Mixed precision: different quantization levels can be applied to different parts of the model
- Hardware-specific optimizations: includes optimized code paths for various CPU architectures (AVX2, AVX-512, NEON)
This approach makes it possible to run billion-parameter models on consumer hardware with limited memory, which makes it well suited to local deployments and edge devices.
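As a back-of-the-envelope illustration of why quantization matters, here is the weight-memory arithmetic for a 1.7B-parameter model at different bit-widths (weights only; the KV cache, activations, and quantization metadata add overhead on top, so treat these as lower bounds):
# Approximate weight memory for a 1.7B-parameter model at different bit-widths
params = 1.7e9
for bits in (16, 8, 4, 3, 2):
    gib = params * bits / 8 / 2**30
    print(f"{bits:>2}-bit: ~{gib:.2f} GiB")
# 16-bit: ~3.17 GiB ... 4-bit: ~0.79 GiB ... 2-bit: ~0.40 GiB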
Deployment and Integration
Let's move on to the deployment and integration differences between the frameworks.
TGI excels at enterprise-grade deployments thanks to its production-ready feature set. It ships with built-in Kubernetes support and everything you need to run in production, such as monitoring through Prometheus and Grafana, automatic scaling, and comprehensive safety features. The system also includes enterprise-grade logging and safeguards such as content filtering and rate limiting to keep your deployment secure and stable.
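For example, TGI exposes its metrics over HTTP so that Prometheus can scrape them. A minimal sketch of checking them by hand with the requests package, assuming a TGI server like the ones started later in this section is listening on localhost:8080 and exposes Prometheus metrics under /metrics with a tgi_ prefix:
import requests

# Fetch the Prometheus metrics endpoint and print request-related counters
metrics = requests.get("http://localhost:8080/metrics", timeout=5).text
for line in metrics.splitlines():
    if line.startswith("tgi_request"):
        print(line)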
vLLM takes a more flexible, developer-focused approach to deployment. It is built with Python at its core and can act as a drop-in replacement for the OpenAI API in existing applications. The framework focuses on delivering raw performance and can be customized to your specific needs. It works particularly well with Ray for cluster management, making it a great choice when you need both high performance and adaptability.
llama.cpp prioritizes simplicity and portability. Its server implementation is lightweight and runs on a wide range of hardware, from powerful servers to consumer laptops and even some high-end mobile devices. With minimal dependencies and a simple C/C++ core, it is easy to deploy in environments where installing a Python framework would be difficult. The server provides an OpenAI-compatible API while having a much smaller resource footprint than the other solutions.
Getting Started
Let's explore how to deploy LLMs with these frameworks, starting with installation and basic setup.
Installation and Basic Setup
TGI is easy to install and use, with deep integration into the Hugging Face ecosystem.
First, launch the TGI server with Docker:
docker run --gpus all \
    --shm-size 1g \
    -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id HuggingFaceTB/SmolLM2-360M-Instruct
Then interact with it using Hugging Face's InferenceClient:
from huggingface_hub import InferenceClient
# Initialize client pointing to TGI endpoint
client = InferenceClient(
model="https://:8080", # URL to the TGI server
)
# Text generation
response = client.text_generation(
"Tell me a story",
max_new_tokens=100,
temperature=0.7,
top_p=0.95,
details=True,
stop_sequences=[],
)
print(response.generated_text)
# For chat format
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
Alternatively, you can use the OpenAI client:
from openai import OpenAI
# Initialize client pointing to TGI endpoint
client = OpenAI(
base_url="https://:8080/v1", # Make sure to include /v1
api_key="not-needed", # TGI doesn't require an API key by default
)
# Chat completion
response = client.chat.completions.create(
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
llama.cpp is easy to install and use, requires minimal dependencies, and supports both CPU and GPU inference.
First, install and build llama.cpp:
# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build the project
make
# Download the SmolLM2-1.7B-Instruct-GGUF model
curl -L -O https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF/resolve/main/smollm2-1.7b-instruct.Q4_K_M.gguf
Then, start the server (with OpenAI API compatibility):
# Start the server
./server \
-m smollm2-1.7b-instruct.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 4096 \
--n-gpu-layers 0 # Set to a higher number to use GPU
Interact with the server using Hugging Face's InferenceClient:
from huggingface_hub import InferenceClient
# Initialize client pointing to llama.cpp server
client = InferenceClient(
model="https://:8080/v1", # URL to the llama.cpp server
token="sk-no-key-required", # llama.cpp server requires this placeholder
)
# Text generation
response = client.text_generation(
"Tell me a story",
max_new_tokens=100,
temperature=0.7,
top_p=0.95,
details=True,
)
print(response.generated_text)
# For chat format
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
Alternatively, you can use the OpenAI client:
from openai import OpenAI
# Initialize client pointing to llama.cpp server
client = OpenAI(
base_url="https://:8080/v1",
api_key="sk-no-key-required", # llama.cpp server requires this placeholder
)
# Chat completion
response = client.chat.completions.create(
model="smollm2-1.7b-instruct", # Model identifier can be anything as server only loads one model
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
vLLM is easy to install and use, offers OpenAI API compatibility, and provides a native Python interface.
First, launch the vLLM OpenAI-compatible server:
python -m vllm.entrypoints.openai.api_server \
    --model HuggingFaceTB/SmolLM2-360M-Instruct \
    --host 0.0.0.0 \
    --port 8000
Then interact with it using Hugging Face's InferenceClient:
from huggingface_hub import InferenceClient
# Initialize client pointing to vLLM endpoint
client = InferenceClient(
model="https://:8000/v1", # URL to the vLLM server
)
# Text generation
response = client.text_generation(
"Tell me a story",
max_new_tokens=100,
temperature=0.7,
top_p=0.95,
details=True,
)
print(response.generated_text)
# For chat format
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
Alternatively, you can use the OpenAI client:
from openai import OpenAI
# Initialize client pointing to vLLM endpoint
client = OpenAI(
base_url="https://:8000/v1",
api_key="not-needed", # vLLM doesn't require an API key by default
)
# Chat completion
response = client.chat.completions.create(
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
Basic Text Generation
Let's look at examples of text generation with these frameworks.
First, deploy TGI with advanced parameters:
docker run --gpus all \
    --shm-size 1g \
    -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id HuggingFaceTB/SmolLM2-360M-Instruct \
    --max-total-tokens 4096 \
    --max-input-length 3072 \
    --max-batch-total-tokens 8192 \
    --waiting-served-ratio 1.2
Use the InferenceClient for flexible text generation:
from huggingface_hub import InferenceClient
client = InferenceClient(model="http://localhost:8080")
# Advanced parameters example
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8,
max_tokens=200,
top_p=0.95,
)
print(response.choices[0].message.content)
# Raw text generation
response = client.text_generation(
"Write a creative story about space exploration",
max_new_tokens=200,
temperature=0.8,
top_p=0.95,
repetition_penalty=1.1,
do_sample=True,
details=True,
)
print(response.generated_text)
Or use the OpenAI client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
# Advanced parameters example
response = client.chat.completions.create(
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8, # Higher for more creativity
)
print(response.choices[0].message.content)
For llama.cpp, you can set advanced parameters when launching the server:
# -c: context size, --threads: CPU threads, --batch-size: prompt-evaluation batch size,
# --n-gpu-layers: layers offloaded to the GPU (0 = CPU only)
./server \
    -m smollm2-1.7b-instruct.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 4096 \
    --threads 8 \
    --batch-size 512 \
    --n-gpu-layers 0
Use the InferenceClient:
from huggingface_hub import InferenceClient
client = InferenceClient(model="http://localhost:8080/v1", token="sk-no-key-required")
# Advanced parameters example
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8,
max_tokens=200,
top_p=0.95,
)
print(response.choices[0].message.content)
# For direct text generation
response = client.text_generation(
"Write a creative story about space exploration",
max_new_tokens=200,
temperature=0.8,
top_p=0.95,
repetition_penalty=1.1,
details=True,
)
print(response.generated_text)
Or use the OpenAI client for generation with control over the sampling parameters:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")
# Advanced parameters example
response = client.chat.completions.create(
model="smollm2-1.7b-instruct",
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8, # Higher for more creativity
top_p=0.95, # Nucleus sampling probability
frequency_penalty=0.5, # Reduce repetition of frequent tokens
presence_penalty=0.5, # Reduce repetition by penalizing tokens already present
max_tokens=200, # Maximum generation length
)
print(response.choices[0].message.content)
You can also use llama.cpp's native library for even more control:
# Using llama-cpp-python package for direct model access
from llama_cpp import Llama
# Load the model
llm = Llama(
model_path="smollm2-1.7b-instruct.Q4_K_M.gguf",
n_ctx=4096, # Context window size
n_threads=8, # CPU threads
n_gpu_layers=0, # GPU layers (0 = CPU only)
)
# Format prompt according to the model's expected format
prompt = """<|im_start|>system
You are a creative storyteller.
<|im_end|>
<|im_start|>user
Write a creative story
<|im_end|>
<|im_start|>assistant
"""
# Generate response with precise parameter control
output = llm(
prompt,
max_tokens=200,
temperature=0.8,
top_p=0.95,
frequency_penalty=0.5,
presence_penalty=0.5,
stop=["<|im_end|>"],
)
print(output["choices"][0]["text"])
For advanced usage of vLLM, you can use the InferenceClient:
from huggingface_hub import InferenceClient
client = InferenceClient(model="http://localhost:8000/v1")
# Advanced parameters example
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8,
max_tokens=200,
top_p=0.95,
)
print(response.choices[0].message.content)
# For direct text generation
response = client.text_generation(
"Write a creative story about space exploration",
max_new_tokens=200,
temperature=0.8,
top_p=0.95,
details=True,
)
print(response.generated_text)
You can also use the OpenAI client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
# Advanced parameters example
response = client.chat.completions.create(
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8,
top_p=0.95,
max_tokens=200,
)
print(response.choices[0].message.content)
vLLM also provides a native Python interface with fine-grained control:
from vllm import LLM, SamplingParams
# Initialize the model with advanced parameters
llm = LLM(
model="HuggingFaceTB/SmolLM2-360M-Instruct",
gpu_memory_utilization=0.85,
max_num_batched_tokens=8192,
max_num_seqs=256,
block_size=16,
)
# Configure sampling parameters
sampling_params = SamplingParams(
temperature=0.8, # Higher for more creativity
top_p=0.95, # Consider top 95% probability mass
max_tokens=100, # Maximum length
presence_penalty=1.1, # Reduce repetition
frequency_penalty=1.1, # Reduce repetition
stop=["\n\n", "###"], # Stop sequences
)
# Generate text
prompt = "Write a creative story"
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)
# For chat-style interactions
chat_prompt = [
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
]
# Apply the model's chat template using the tokenizer from transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")
formatted_prompt = tokenizer.apply_chat_template(
    chat_prompt, tokenize=False, add_generation_prompt=True
)
outputs = llm.generate(formatted_prompt, sampling_params)
print(outputs[0].outputs[0].text)
Advanced Generation Control
Token Selection and Sampling
Text generation involves choosing the next token at each step. This selection process can be controlled through several parameters (a short code sketch after the list shows how they interact):
- Raw logits: the model's initial output scores for each token (before softmax)
- Temperature: controls the randomness of the selection (higher = more creative)
- Top-p (nucleus) sampling: filters to the top tokens that make up X% of the probability mass
- Top-k filtering: limits the selection to the k most likely tokens
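The sketch below shows, in plain NumPy, how these knobs reshape raw logits into the distribution the next token is drawn from. It is illustrative only; the serving frameworks implement this inside their sampling kernels, and the exact order of operations can differ between them.
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95, seed=None):
    """Toy next-token sampler: temperature, then top-k, then top-p (nucleus)."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(logits, dtype=np.float64) / temperature  # temperature scaling
    # Top-k: drop everything outside the k highest-scoring tokens
    if top_k < len(logits):
        kth_value = np.sort(logits)[-top_k]
        logits = np.where(logits >= kth_value, logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p: keep the smallest set of tokens whose cumulative mass reaches top_p
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    nucleus = np.zeros_like(probs)
    nucleus[order[:cutoff]] = probs[order[:cutoff]]
    nucleus /= nucleus.sum()
    return int(rng.choice(len(probs), p=nucleus))

# Tiny 6-token "vocabulary": lower temperature or smaller top_k/top_p -> more deterministic
print(sample_next_token([2.0, 1.5, 0.3, -1.0, -2.0, -5.0], temperature=0.8, top_k=4, seed=0))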
Here's how to configure these parameters in each framework:
# Via the huggingface_hub InferenceClient (e.g. connected to the TGI server above)
client.text_generation(
"Write a creative story",
temperature=0.8, # Higher for more creativity
top_p=0.95, # Consider top 95% probability mass
top_k=50, # Consider top 50 tokens
max_new_tokens=100, # Maximum length
repetition_penalty=1.1, # Reduce repetition
)
# Via OpenAI API compatibility
response = client.completions.create(
model="smollm2-1.7b-instruct", # Model name (can be any string for llama.cpp server)
prompt="Write a creative story",
temperature=0.8, # Higher for more creativity
top_p=0.95, # Consider top 95% probability mass
frequency_penalty=1.1, # Reduce repetition
presence_penalty=0.1, # Reduce repetition
max_tokens=100, # Maximum length
)
# Via llama-cpp-python direct access
output = llm(
"Write a creative story",
temperature=0.8,
top_p=0.95,
top_k=50,
max_tokens=100,
repeat_penalty=1.1,
)
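# Via vLLM SamplingParams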
params = SamplingParams(
temperature=0.8, # Higher for more creativity
top_p=0.95, # Consider top 95% probability mass
top_k=50, # Consider top 50 tokens
max_tokens=100, # Maximum length
presence_penalty=0.1, # Reduce repetition
)
llm.generate("Write a creative story", sampling_params=params)
Controlling Repetition
These frameworks all provide ways to prevent repetitive text generation:
# Via OpenAI API
response = client.completions.create(
model="smollm2-1.7b-instruct",
prompt="Write a varied text",
frequency_penalty=1.1, # Penalize frequent tokens
presence_penalty=0.8, # Penalize tokens already present
)
# Via direct library
output = llm(
"Write a varied text",
repeat_penalty=1.1, # Penalize repeated tokens
frequency_penalty=0.5, # Additional frequency penalty
presence_penalty=0.5, # Additional presence penalty
)
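# Via vLLM SamplingParams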
params = SamplingParams(
presence_penalty=0.1, # Penalize token presence
frequency_penalty=0.1, # Penalize token frequency
)
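Conceptually, a presence penalty subtracts a flat amount from the logit of any token that has already appeared, while a frequency penalty grows with how often the token appeared. A toy sketch of this OpenAI-style formulation, with made-up logits:
from collections import Counter

def apply_penalties(logits, generated_ids, presence_penalty=0.5, frequency_penalty=0.5):
    """OpenAI-style penalties: a flat hit for appearing at all, plus a per-occurrence hit."""
    adjusted = list(logits)
    for token_id, count in Counter(generated_ids).items():
        adjusted[token_id] -= presence_penalty           # token has appeared at least once
        adjusted[token_id] -= frequency_penalty * count  # grows with each repetition
    return adjusted

# Token 0 appeared twice, token 2 once; their logits are pushed down before sampling
print(apply_penalties([2.0, 1.0, 0.5], generated_ids=[0, 0, 2]))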
Length Control and Stop Sequences
You can control the generation length and specify when generation should stop:
# Via OpenAI API
response = client.completions.create(
model="smollm2-1.7b-instruct",
prompt="Generate a short paragraph",
max_tokens=100,
stop=["\n\n", "###"],
)
# Via direct library
output = llm("Generate a short paragraph", max_tokens=100, stop=["\n\n", "###"])
params = SamplingParams(
max_tokens=100,
min_tokens=10,
stop=["###", "\n\n"],
ignore_eos=False,
skip_special_tokens=True,
)
Memory Management
All of these frameworks implement advanced memory management techniques for efficient inference. TGI combines Flash Attention 2 and continuous batching with server-level limits on batch and input size:
# Docker deployment with memory optimization
docker run --gpus all -p 8080:80 \
--shm-size 1g \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id HuggingFaceTB/SmolLM2-1.7B-Instruct \
--max-batch-total-tokens 8192 \
--max-input-length 4096
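These limits matter because, beyond the weights, serving memory is dominated by the KV cache, which grows linearly with sequence length and with the number of concurrent sequences. A back-of-the-envelope estimate (the architecture numbers below are placeholders, not SmolLM2's actual configuration):
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """2x (keys and values) * layers * KV heads * head dim * tokens * bytes per value."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Placeholder architecture: 24 layers, 8 KV heads of dim 64, fp16 cache, 4K-token sequence
size = kv_cache_bytes(num_layers=24, num_kv_heads=8, head_dim=64, seq_len=4096)
print(f"~{size / 2**20:.0f} MiB of KV cache per 4K-token sequence")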
llama.cpp uses quantization and optimized memory layouts:
# Server with memory optimizations:
#   -c: context size, --threads: CPU threads, --n-gpu-layers: layers kept on the GPU,
#   --mlock: lock memory to prevent swapping, --cont-batching: enable continuous batching
./server \
    -m smollm2-1.7b-instruct.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 2048 \
    --threads 4 \
    --n-gpu-layers 32 \
    --mlock \
    --cont-batching
For models that are too large for your GPU, you can offload part of the model to the CPU:
# Keep the first 20 layers on the GPU and run the remaining layers on the CPU with more threads
./server \
    -m smollm2-1.7b-instruct.Q4_K_M.gguf \
    --n-gpu-layers 20 \
    --threads 8
vLLM uses PagedAttention for optimized memory management:
from vllm import LLM

# Memory-related settings are passed directly to the LLM constructor
llm = LLM(
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    gpu_memory_utilization=0.85,  # Fraction of GPU memory the engine may use
    max_num_batched_tokens=8192,  # Token budget per scheduling step
    block_size=16,                # Tokens per PagedAttention block
)
Resources
- Text Generation Inference Documentation
- TGI GitHub Repository
- vLLM Documentation
- vLLM GitHub Repository
- PagedAttention Paper
- llama.cpp GitHub Repository
- llama-cpp-python Repository