Optimized Inference Deployment
In this section we will explore advanced frameworks for optimizing LLM deployments: Text Generation Inference (TGI), vLLM, and llama.cpp. These applications are primarily used in production environments to serve LLMs to users. This section focuses on how to deploy these frameworks in production rather than how to use them for single-machine inference.
We will cover how these tools maximize inference efficiency and simplify production deployments of large language models.
Framework Selection Guide
TGI, vLLM, and llama.cpp serve similar purposes, but each has distinct characteristics that make it better suited to different use cases. Let's look at the key differences between them, focusing on performance and integration.
Memory Management and Performance
TGI is designed to be stable and predictable in production, using fixed sequence lengths to keep memory usage consistent. TGI manages memory with Flash Attention 2 and continuous batching, which lets it handle attention computation very efficiently and keep the GPU busy by constantly feeding it work. The system can move parts of the model between CPU and GPU when needed, which helps it handle larger models.

Flash Attention's key innovation is how it manages memory transfers between high-bandwidth memory (HBM) and the faster SRAM cache. Traditional attention mechanisms repeatedly transfer data between HBM and SRAM, creating bottlenecks by leaving the GPU idle. Flash Attention loads data into SRAM once and performs all computations there, minimizing expensive memory transfers.
While the benefits are most pronounced during training, Flash Attention's lower VRAM usage and improved efficiency make it valuable for inference as well, enabling faster and more scalable LLM serving.
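To make the idea concrete, here is a minimal NumPy sketch of block-wise attention with an online softmax, the core trick Flash Attention relies on. It only illustrates the mathematics of processing K/V in tiles; the real kernel fuses these steps on the GPU so that each tile is read from HBM into SRAM exactly once, and the names below are illustrative rather than taken from any framework.
import numpy as np

def tiled_attention(Q, K, V, block_size=64):
    """Block-wise attention with an online softmax (single head, illustrative only)."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    row_max = np.full(n, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(n)           # running softmax denominator per query row
    for start in range(0, K.shape[0], block_size):
        Kb, Vb = K[start:start + block_size], V[start:start + block_size]
        scores = (Q @ Kb.T) * scale                      # scores for this K/V tile only
        new_max = np.maximum(row_max, scores.max(axis=1))
        correction = np.exp(row_max - new_max)           # rescale what was accumulated so far
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against standard (full-matrix) attention
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 128, 32))
scores = (Q @ K.T) / np.sqrt(32)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
reference = (weights / weights.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), reference)
The result matches full attention, but each K/V tile only needs to be visited once.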
vLLM takes a different approach by using PagedAttention. Just as a computer manages its memory in pages, vLLM splits the model's memory into smaller blocks. This clever scheme lets it handle requests of different sizes more flexibly without wasting memory. It is particularly good at sharing memory across requests and reducing memory fragmentation, which makes the whole system more efficient.
vLLM's key innovation lies in how it manages the KV cache:
- Memory paging: instead of treating the KV cache as one large block, it is divided into fixed-size "pages" (similar to virtual memory in operating systems).
- Non-contiguous storage: pages do not need to be stored contiguously in GPU memory, allowing more flexible memory allocation.
- Page table management: a page table tracks which pages belong to which sequence, enabling efficient lookup and access.
- Memory sharing: for operations such as parallel sampling, the pages holding the prompt's KV cache can be shared across multiple sequences.
Compared to traditional approaches, PagedAttention can improve throughput by up to 24x, which makes it a game changer for production LLM deployments. If you want to dig deeper into how PagedAttention works, you can read the guide in the vLLM documentation.
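As a rough illustration of the bookkeeping this implies, here is a toy Python sketch of a paged KV-cache block table. The class and method names are invented for this example and are not vLLM's internal API; the real implementation also handles copy-on-write, preemption, and the GPU-side block lookups.
BLOCK_SIZE = 16  # tokens per KV-cache page

class PagedKVCache:
    """Toy block table: maps sequences to the physical pages holding their KV cache."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))  # pool of physical page ids
        self.table = {}                      # seq_id -> ordered list of page ids
        self.length = {}                     # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Account for one generated token, grabbing a new page only when needed."""
        n = self.length.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:              # current page is full (or sequence just started)
            self.table.setdefault(seq_id, []).append(self.free.pop())
        self.length[seq_id] = n + 1

    def fork(self, src_id, new_id):
        """Share the prompt's pages across sequences, e.g. for parallel sampling."""
        self.table[new_id] = list(self.table[src_id])
        self.length[new_id] = self.length[src_id]

    def release(self, seq_id):
        """Return a finished sequence's pages to the free pool."""
        self.free.extend(self.table.pop(seq_id, []))
        self.length.pop(seq_id, None)

cache = PagedKVCache(num_blocks=1024)
for _ in range(40):                  # a 40-token sequence occupies ceil(40 / 16) = 3 pages
    cache.append_token("seq-0")
print(len(cache.table["seq-0"]))     # -> 3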
llama.cpp is a highly optimized C/C++ implementation originally designed to run LLaMA models on consumer hardware. It focuses on CPU efficiency with optional GPU acceleration, making it ideal for resource-constrained environments. llama.cpp uses quantization techniques to reduce model size and memory requirements while maintaining good performance. It implements optimized kernels for various CPU architectures and supports basic KV cache management for efficient token generation.
Key quantization features in llama.cpp include:
- Multi-level quantization: supports 8-bit, 4-bit, 3-bit, and even 2-bit quantization
- GGML/GGUF format: uses custom tensor formats optimized for quantized inference
- Mixed precision: different quantization levels can be applied to different parts of the model
- Hardware-specific optimizations: includes optimized code paths for various CPU architectures (AVX2, AVX-512, NEON)
This approach makes it possible to run billion-parameter models on consumer hardware with limited memory, which makes it well suited to local deployments and edge devices.
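As a back-of-the-envelope illustration of why quantization matters, here is the weight-memory arithmetic for a 1.7B-parameter model at different bit-widths (weights only; the KV cache, activations, and quantization metadata add overhead on top, so treat these as lower bounds):
# Approximate weight memory for a 1.7B-parameter model at different bit-widths
params = 1.7e9
for bits in (16, 8, 4, 3, 2):
    gib = params * bits / 8 / 2**30
    print(f"{bits:>2}-bit: ~{gib:.2f} GiB")
# 16-bit: ~3.17 GiB ... 4-bit: ~0.79 GiB ... 2-bit: ~0.40 GiB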
Deployment and Integration
Let's move on to the deployment and integration differences between the frameworks.
TGI excels at enterprise-grade deployments thanks to its production-ready feature set. It ships with built-in Kubernetes support and everything you need to run in production, such as monitoring through Prometheus and Grafana, automatic scaling, and comprehensive safety features. The system also includes enterprise-grade logging and safeguards such as content filtering and rate limiting to keep your deployment secure and stable.
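For example, TGI exposes its metrics over HTTP so that Prometheus can scrape them. A minimal sketch of checking them by hand with the requests package, assuming a TGI server like the ones started later in this section is listening on localhost:8080 and exposes Prometheus metrics under /metrics with a tgi_ prefix:
import requests

# Fetch the Prometheus metrics endpoint and print request-related counters
metrics = requests.get("http://localhost:8080/metrics", timeout=5).text
for line in metrics.splitlines():
    if line.startswith("tgi_request"):
        print(line)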
vLLM takes a more flexible, developer-focused approach to deployment. It is built with Python at its core and can act as a drop-in replacement for the OpenAI API in existing applications. The framework focuses on delivering raw performance and can be customized to your specific needs. It works particularly well with Ray for cluster management, making it a great choice when you need both high performance and adaptability.
llama.cpp prioritizes simplicity and portability. Its server implementation is lightweight and runs on a wide range of hardware, from powerful servers to consumer laptops and even some high-end mobile devices. With minimal dependencies and a simple C/C++ core, it is easy to deploy in environments where installing a Python framework would be difficult. The server provides an OpenAI-compatible API while having a much smaller resource footprint than the other solutions.
Getting Started
Let's explore how to deploy LLMs with these frameworks, starting with installation and basic setup.
Installation and Basic Setup
TGI is easy to install and use, with deep integration into the Hugging Face ecosystem.
First, launch the TGI server with Docker:
docker run --gpus all \
    --shm-size 1g \
    -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id HuggingFaceTB/SmolLM2-360M-Instruct
Then interact with it using Hugging Face's InferenceClient:
from huggingface_hub import InferenceClient
# Initialize client pointing to TGI endpoint
client = InferenceClient(
model="https://:8080", # URL to the TGI server
)
# Text generation
response = client.text_generation(
"Tell me a story",
max_new_tokens=100,
temperature=0.7,
top_p=0.95,
details=True,
stop_sequences=[],
)
print(response.generated_text)
# For chat format
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
Alternatively, you can use the OpenAI client:
from openai import OpenAI
# Initialize client pointing to TGI endpoint
client = OpenAI(
base_url="https://:8080/v1", # Make sure to include /v1
api_key="not-needed", # TGI doesn't require an API key by default
)
# Chat completion
response = client.chat.completions.create(
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
llama.cpp is easy to install and use, requires minimal dependencies, and supports both CPU and GPU inference.
First, install and build llama.cpp:
# Clone the repository
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build the project
make
# Download the SmolLM2-1.7B-Instruct-GGUF model
curl -L -O https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct-GGUF/resolve/main/smollm2-1.7b-instruct.Q4_K_M.gguf
Then, start the server (with OpenAI API compatibility):
# Start the server
./server \
-m smollm2-1.7b-instruct.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 4096 \
--n-gpu-layers 0 # Set to a higher number to use GPU
Interact with the server using Hugging Face's InferenceClient:
from huggingface_hub import InferenceClient
# Initialize client pointing to llama.cpp server
client = InferenceClient(
model="https://:8080/v1", # URL to the llama.cpp server
token="sk-no-key-required", # llama.cpp server requires this placeholder
)
# Text generation
response = client.text_generation(
"Tell me a story",
max_new_tokens=100,
temperature=0.7,
top_p=0.95,
details=True,
)
print(response.generated_text)
# For chat format
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
Alternatively, you can use the OpenAI client:
from openai import OpenAI
# Initialize client pointing to llama.cpp server
client = OpenAI(
base_url="https://:8080/v1",
api_key="sk-no-key-required", # llama.cpp server requires this placeholder
)
# Chat completion
response = client.chat.completions.create(
model="smollm2-1.7b-instruct", # Model identifier can be anything as server only loads one model
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
vLLM is easy to install and use, offers OpenAI API compatibility, and provides a native Python interface.
First, launch the vLLM OpenAI-compatible server:
python -m vllm.entrypoints.openai.api_server \
    --model HuggingFaceTB/SmolLM2-360M-Instruct \
    --host 0.0.0.0 \
    --port 8000
Then interact with it using Hugging Face's InferenceClient:
from huggingface_hub import InferenceClient
# Initialize client pointing to vLLM endpoint
client = InferenceClient(
model="https://:8000/v1", # URL to the vLLM server
)
# Text generation
response = client.text_generation(
"Tell me a story",
max_new_tokens=100,
temperature=0.7,
top_p=0.95,
details=True,
)
print(response.generated_text)
# For chat format
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
Alternatively, you can use the OpenAI client:
from openai import OpenAI
# Initialize client pointing to vLLM endpoint
client = OpenAI(
base_url="https://:8000/v1",
api_key="not-needed", # vLLM doesn't require an API key by default
)
# Chat completion
response = client.chat.completions.create(
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a story"},
],
max_tokens=100,
temperature=0.7,
top_p=0.95,
)
print(response.choices[0].message.content)
Basic Text Generation
Let's look at examples of text generation with these frameworks.
First, deploy TGI with advanced parameters:
docker run --gpus all \
    --shm-size 1g \
    -p 8080:80 \
    -v ~/.cache/huggingface:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id HuggingFaceTB/SmolLM2-360M-Instruct \
    --max-total-tokens 4096 \
    --max-input-length 3072 \
    --max-batch-total-tokens 8192 \
    --waiting-served-ratio 1.2
Use the InferenceClient for flexible text generation:
from huggingface_hub import InferenceClient
client = InferenceClient(model="http://localhost:8080")
# Advanced parameters example
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8,
max_tokens=200,
top_p=0.95,
)
print(response.choices[0].message.content)
# Raw text generation
response = client.text_generation(
"Write a creative story about space exploration",
max_new_tokens=200,
temperature=0.8,
top_p=0.95,
repetition_penalty=1.1,
do_sample=True,
details=True,
)
print(response.generated_text)
Or use the OpenAI client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
# Advanced parameters example
response = client.chat.completions.create(
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8, # Higher for more creativity
)
print(response.choices[0].message.content)
For llama.cpp, you can set advanced parameters when launching the server:
# -c: context size, --threads: CPU threads, --batch-size: prompt-evaluation batch size,
# --n-gpu-layers: layers offloaded to the GPU (0 = CPU only)
./server \
    -m smollm2-1.7b-instruct.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 4096 \
    --threads 8 \
    --batch-size 512 \
    --n-gpu-layers 0
Use the InferenceClient:
from huggingface_hub import InferenceClient
client = InferenceClient(model="http://localhost:8080/v1", token="sk-no-key-required")
# Advanced parameters example
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8,
max_tokens=200,
top_p=0.95,
)
print(response.choices[0].message.content)
# For direct text generation
response = client.text_generation(
"Write a creative story about space exploration",
max_new_tokens=200,
temperature=0.8,
top_p=0.95,
repetition_penalty=1.1,
details=True,
)
print(response.generated_text)
Or use the OpenAI client for generation with control over the sampling parameters:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")
# Advanced parameters example
response = client.chat.completions.create(
model="smollm2-1.7b-instruct",
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8, # Higher for more creativity
top_p=0.95, # Nucleus sampling probability
frequency_penalty=0.5, # Reduce repetition of frequent tokens
presence_penalty=0.5, # Reduce repetition by penalizing tokens already present
max_tokens=200, # Maximum generation length
)
print(response.choices[0].message.content)
You can also use llama.cpp's native library for even more control:
# Using llama-cpp-python package for direct model access
from llama_cpp import Llama
# Load the model
llm = Llama(
model_path="smollm2-1.7b-instruct.Q4_K_M.gguf",
n_ctx=4096, # Context window size
n_threads=8, # CPU threads
n_gpu_layers=0, # GPU layers (0 = CPU only)
)
# Format prompt according to the model's expected format
prompt = """<|im_start|>system
You are a creative storyteller.
<|im_end|>
<|im_start|>user
Write a creative story
<|im_end|>
<|im_start|>assistant
"""
# Generate response with precise parameter control
output = llm(
prompt,
max_tokens=200,
temperature=0.8,
top_p=0.95,
frequency_penalty=0.5,
presence_penalty=0.5,
stop=["<|im_end|>"],
)
print(output["choices"][0]["text"])
For advanced usage of vLLM, you can use the InferenceClient:
from huggingface_hub import InferenceClient
client = InferenceClient(model="http://localhost:8000/v1")
# Advanced parameters example
response = client.chat_completion(
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8,
max_tokens=200,
top_p=0.95,
)
print(response.choices[0].message.content)
# For direct text generation
response = client.text_generation(
"Write a creative story about space exploration",
max_new_tokens=200,
temperature=0.8,
top_p=0.95,
details=True,
)
print(response.generated_text)
You can also use the OpenAI client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
# Advanced parameters example
response = client.chat.completions.create(
model="HuggingFaceTB/SmolLM2-360M-Instruct",
messages=[
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
],
temperature=0.8,
top_p=0.95,
max_tokens=200,
)
print(response.choices[0].message.content)
vLLM also provides a native Python interface with fine-grained control:
from vllm import LLM, SamplingParams
# Initialize the model with advanced parameters
llm = LLM(
model="HuggingFaceTB/SmolLM2-360M-Instruct",
gpu_memory_utilization=0.85,
max_num_batched_tokens=8192,
max_num_seqs=256,
block_size=16,
)
# Configure sampling parameters
sampling_params = SamplingParams(
temperature=0.8, # Higher for more creativity
top_p=0.95, # Consider top 95% probability mass
max_tokens=100, # Maximum length
presence_penalty=1.1, # Reduce repetition
frequency_penalty=1.1, # Reduce repetition
stop=["\n\n", "###"], # Stop sequences
)
# Generate text
prompt = "Write a creative story"
outputs = llm.generate(prompt, sampling_params)
print(outputs[0].outputs[0].text)
# For chat-style interactions
chat_prompt = [
{"role": "system", "content": "You are a creative storyteller."},
{"role": "user", "content": "Write a creative story"},
]
# Apply the model's chat template using the tokenizer from transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")
formatted_prompt = tokenizer.apply_chat_template(
    chat_prompt, tokenize=False, add_generation_prompt=True
)
outputs = llm.generate(formatted_prompt, sampling_params)
print(outputs[0].outputs[0].text)
Advanced Generation Control
Token Selection and Sampling
Text generation involves choosing the next token at each step. This selection process can be controlled through several parameters (a short code sketch after the list shows how they interact):
- Raw logits: the model's initial output scores for each token (before softmax)
- Temperature: controls the randomness of the selection (higher = more creative)
- Top-p (nucleus) sampling: filters to the top tokens that make up X% of the probability mass
- Top-k filtering: limits the selection to the k most likely tokens
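The sketch below shows, in plain NumPy, how these knobs reshape raw logits into the distribution the next token is drawn from. It is illustrative only; the serving frameworks implement this inside their sampling kernels, and the exact order of operations can differ between them.
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.95, seed=None):
    """Toy next-token sampler: temperature, then top-k, then top-p (nucleus)."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(logits, dtype=np.float64) / temperature  # temperature scaling
    # Top-k: drop everything outside the k highest-scoring tokens
    if top_k < len(logits):
        kth_value = np.sort(logits)[-top_k]
        logits = np.where(logits >= kth_value, logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p: keep the smallest set of tokens whose cumulative mass reaches top_p
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    nucleus = np.zeros_like(probs)
    nucleus[order[:cutoff]] = probs[order[:cutoff]]
    nucleus /= nucleus.sum()
    return int(rng.choice(len(probs), p=nucleus))

# Tiny 6-token "vocabulary": lower temperature or smaller top_k/top_p -> more deterministic
print(sample_next_token([2.0, 1.5, 0.3, -1.0, -2.0, -5.0], temperature=0.8, top_k=4, seed=0))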
Here's how to configure these parameters in each framework:
# Via the huggingface_hub InferenceClient (e.g. connected to the TGI server above)
client.text_generation(
"Write a creative story",
temperature=0.8, # Higher for more creativity
top_p=0.95, # Consider top 95% probability mass
top_k=50, # Consider top 50 tokens
max_new_tokens=100, # Maximum length
repetition_penalty=1.1, # Reduce repetition
)
# Via OpenAI API compatibility
response = client.completions.create(
model="smollm2-1.7b-instruct", # Model name (can be any string for llama.cpp server)
prompt="Write a creative story",
temperature=0.8, # Higher for more creativity
top_p=0.95, # Consider top 95% probability mass
frequency_penalty=1.1, # Reduce repetition
presence_penalty=0.1, # Reduce repetition
max_tokens=100, # Maximum length
)
# Via llama-cpp-python direct access
output = llm(
"Write a creative story",
temperature=0.8,
top_p=0.95,
top_k=50,
max_tokens=100,
repeat_penalty=1.1,
)
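# Via vLLM SamplingParams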
params = SamplingParams(
temperature=0.8, # Higher for more creativity
top_p=0.95, # Consider top 95% probability mass
top_k=50, # Consider top 50 tokens
max_tokens=100, # Maximum length
presence_penalty=0.1, # Reduce repetition
)
llm.generate("Write a creative story", sampling_params=params)
Controlling Repetition
These frameworks all provide ways to prevent repetitive text generation:
# Via OpenAI API
response = client.completions.create(
model="smollm2-1.7b-instruct",
prompt="Write a varied text",
frequency_penalty=1.1, # Penalize frequent tokens
presence_penalty=0.8, # Penalize tokens already present
)
# Via direct library
output = llm(
"Write a varied text",
repeat_penalty=1.1, # Penalize repeated tokens
frequency_penalty=0.5, # Additional frequency penalty
presence_penalty=0.5, # Additional presence penalty
)
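# Via vLLM SamplingParams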
params = SamplingParams(
presence_penalty=0.1, # Penalize token presence
frequency_penalty=0.1, # Penalize token frequency
)
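Conceptually, a presence penalty subtracts a flat amount from the logit of any token that has already appeared, while a frequency penalty grows with how often the token appeared. A toy sketch of this OpenAI-style formulation, with made-up logits:
from collections import Counter

def apply_penalties(logits, generated_ids, presence_penalty=0.5, frequency_penalty=0.5):
    """OpenAI-style penalties: a flat hit for appearing at all, plus a per-occurrence hit."""
    adjusted = list(logits)
    for token_id, count in Counter(generated_ids).items():
        adjusted[token_id] -= presence_penalty           # token has appeared at least once
        adjusted[token_id] -= frequency_penalty * count  # grows with each repetition
    return adjusted

# Token 0 appeared twice, token 2 once; their logits are pushed down before sampling
print(apply_penalties([2.0, 1.0, 0.5], generated_ids=[0, 0, 2]))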
Length Control and Stop Sequences
You can control the generation length and specify when generation should stop:
# Via OpenAI API
response = client.completions.create(
model="smollm2-1.7b-instruct",
prompt="Generate a short paragraph",
max_tokens=100,
stop=["\n\n", "###"],
)
# Via direct library
output = llm("Generate a short paragraph", max_tokens=100, stop=["\n\n", "###"])
params = SamplingParams(
max_tokens=100,
min_tokens=10,
stop=["###", "\n\n"],
ignore_eos=False,
skip_special_tokens=True,
)
Memory Management
All of these frameworks implement advanced memory management techniques for efficient inference. TGI combines Flash Attention 2 and continuous batching with server-level limits on batch and input size:
# Docker deployment with memory optimization
docker run --gpus all -p 8080:80 \
--shm-size 1g \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id HuggingFaceTB/SmolLM2-1.7B-Instruct \
--max-batch-total-tokens 8192 \
--max-input-length 4096
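These limits matter because, beyond the weights, serving memory is dominated by the KV cache, which grows linearly with sequence length and with the number of concurrent sequences. A back-of-the-envelope estimate (the architecture numbers below are placeholders, not SmolLM2's actual configuration):
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """2x (keys and values) * layers * KV heads * head dim * tokens * bytes per value."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Placeholder architecture: 24 layers, 8 KV heads of dim 64, fp16 cache, 4K-token sequence
size = kv_cache_bytes(num_layers=24, num_kv_heads=8, head_dim=64, seq_len=4096)
print(f"~{size / 2**20:.0f} MiB of KV cache per 4K-token sequence")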
llama.cpp uses quantization and optimized memory layouts:
# Server with memory optimizations:
#   -c: context size, --threads: CPU threads, --n-gpu-layers: layers kept on the GPU,
#   --mlock: lock memory to prevent swapping, --cont-batching: enable continuous batching
./server \
    -m smollm2-1.7b-instruct.Q4_K_M.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 2048 \
    --threads 4 \
    --n-gpu-layers 32 \
    --mlock \
    --cont-batching
For models that are too large for your GPU, you can offload part of the model to the CPU:
# Keep the first 20 layers on the GPU and run the remaining layers on the CPU with more threads
./server \
    -m smollm2-1.7b-instruct.Q4_K_M.gguf \
    --n-gpu-layers 20 \
    --threads 8
vLLM uses PagedAttention for optimized memory management:
from vllm import LLM

# Memory-related settings are passed directly to the LLM constructor
llm = LLM(
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
    gpu_memory_utilization=0.85,  # Fraction of GPU memory the engine may use
    max_num_batched_tokens=8192,  # Token budget per scheduling step
    block_size=16,                # Tokens per PagedAttention block
)
Resources
- Text Generation Inference Documentation
- TGI GitHub Repository
- vLLM Documentation
- vLLM GitHub Repository
- PagedAttention Paper
- llama.cpp GitHub Repository
- llama-cpp-python Repository