SGLang 與 Transformers 後端整合

釋出於 2025 年 6 月 23 日

在 GitHub 上更新

贊

訪客

Hugging Face transformers 庫是使用最先進模型的標準工具——從試驗前沿研究到在自定義資料上進行微調。其簡單性、靈活性和龐大的模型庫使其成為快速開發的強大工具。

但是，當您準備從 Notebooks 轉向生產環境時，推理效能就變得至關重要。這時 SGLang 就派上用場了。

SGLang 專為高吞吐量、低延遲的推理而設計，現在它與 transformers 實現了無縫整合，可將其作為後端。這意味著您可以將 transformers 的靈活性與 SGLang 的原始效能結合起來。

讓我們深入瞭解此整合所實現的功能以及如何使用它。

TL;DR (太長不看)

SGLang 現在支援將 Hugging Face transformers 作為後端，讓您可以開箱即用地執行任何與 transformers 相容的模型，並實現高效能推理。

import sglang as sgl

llm = sgl.Engine("meta-llama/Llama-3.2-1B-Instruct", impl="transformers")
print(llm.generate(["The capital of France is"], {"max_new_tokens": 20})[0])

無需原生支援——SGLang 在需要時會自動回退到 Transformers，或者您可以顯式地設定 impl="transformers"。

Transformers 和 SGLang

讓我們用一個簡單的文字生成示例，使用 meta-llama/Llama-3.2-1B-Instruct 來比較這兩種方法。

Transformers

transformers 庫非常適合實驗、小規模任務和訓練，但它並未針對高併發或低延遲場景進行最佳化。

from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
generate_kwargs = {
    "top_p": 0.95,
    "top_k": 20,
    "temperature": 0.8,
    "max_new_tokens": 256
}
result = pipe("The future of AI is", **generate_kwargs)
print(result[0]["generated_text"])

SGLang

SGLang 採用了不同的路線，透過 RadixAttention (一種記憶體高效的注意力機制) 等特性來優先考慮效率。使用 SGLang 進行推理的速度明顯更快，資源效率更高，尤其是在高負載下。以下是在 SGLang 中使用離線引擎完成相同任務的程式碼：

import sglang as sgl

if __name__ == '__main__':
    llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct")
    prompts = ["The future of AI is"]
    sampling_params =  {
        "top_p": 0.95,
        "top_k": 20,
        "temperature": 0.8,
        "max_new_tokens": 256
    }
    outputs = llm.generate(prompts, sampling_params)
    print(outputs[0])

或者，您也可以啟動一個伺服器併發送請求

python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 30000

response = requests.post(
    "https://:30000/generate",
    json={
        "text": "The future of AI is",
        "sampling_params": {
            "top_p": 0.95,
            "top_k": 20,
            "temperature": 0.8,
            "max_new_tokens": 256
        },
    },
)
print(response.json())

請注意，SGLang 還提供與 OpenAI 相容的 API，使其成為外部服務的直接替代品。

SGLang 中的 Transformers 後端

藉助新的 transformers 後端整合，SGLang 現在可以自動回退到使用其原生不支援的 transformers 模型。這在實踐中意味著：

可以立即訪問新增到 transformers 的新模型
支援來自 Hugging Face Hub 的自定義模型
更少的工程開銷

這在不犧牲 transformers 生態系統的簡單性和多功能性的情況下，解鎖了更快的推理和最佳化的部署 (例如啟用 RadixAttention)。

用法

llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct", impl="transformers")

請注意，指定 impl 引數是可選的。如果模型不是 SGLang 原生支援的，它會自動切換到 transformers 實現。

任何在 Hugging Face Hub 上使用 trust_remote_code=True 且能與 transformers 正常工作並正確實現注意力機制的模型，都與 SGLang 相容。您可以在官方文件中找到具體要求。如果您的自定義模型滿足這些標準，您只需在載入時設定 trust_remote_code=True 即可。

llm = sgl.Engine(model_path="new-custom-transformers-model", impl="transformers", trust_remote_code=True)

示例

Kyutai 團隊的 Helium 模型尚未得到 SGLang 的原生支援。這正是 transformers 後端的優勢所在，它可以在無需等待原生支援的情況下實現最佳化的推理。

python3 -m sglang.launch_server \
  --model-path kyutai/helium-1-preview-2b \
  --impl transformers \
  --host 0.0.0.0 \
  --port 30000

response = requests.post(
    "https://:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "top_p": 0.95,
            "top_k": 20,
            "temperature": 0.8,
            "max_new_tokens": 256
        },
    },
)
print(response.json())

下一步計劃

我們正在積極致力於以下幾個關鍵領域以增強此整合：

效能改進：目前 transformer 模型的效能落後於原生整合。我們的主要目標是最佳化並縮小這一差距。
LoRA 支援
VLM 整合：我們還在努力增加對視覺語言模型 (VLM) 的支援，以擴大功能範圍和用例。

更多部落格文章

Accelerate ND-Parallel：高效多 GPU 訓練指南

作者 2025 年 8 月 8 日 • 32

Transformers 庫：標準化模型定義

作者 2025 年 5 月 15 日 • 116

社群

透過拖放到文字輸入框、貼上或點選此處上傳圖片、音訊和影片。

點選或貼上此處以上傳圖片

· 註冊或登入以發表評論

贊