Open-Source AI Cookbook

Migrating from OpenAI to Open LLMs Using TGI's Messages API

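Authored by: Andrew Reed

This notebook demonstrates how to move from OpenAI models to open LLMs without refactoring any existing code.

Text Generation Inference (TGI) now offers a Messages API, which makes it directly compatible with the OpenAI Chat Completion API. Any existing script that uses OpenAI models (through the OpenAI client library or third-party tools such as LangChain or LlamaIndex) can therefore be pointed at any open LLM running on a TGI endpoint.

This lets you quickly test open models and benefit from the advantages they offer, for example:

Complete control and transparency over models and data
No more worrying about rate limits
The ability to fully customize systems to your specific needs

In this notebook, we'll show you how to:

Create an Inference Endpoint to deploy a model with TGI
Query the Inference Endpoint with the OpenAI client libraries
Integrate the endpoint with LangChain and LlamaIndex workflows

Let's dive in!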

Setup

First we need to install dependencies and set an HF API key.

!pip install --upgrade -q huggingface_hub langchain langchain-community langchainhub langchain-openai llama-index chromadb bs4 sentence_transformers torch torchvision torchaudio llama-index-llms-openai-like llama-index-embeddings-huggingface

import os
import getpass

# enter API key
os.environ["HF_TOKEN"] = HF_API_KEY = getpass.getpass()

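1. Create an Inference Endpoint

To get started, let's deploy Nous-Hermes-2-Mixtral-8x7B-DPO, a fine-tuned Mixtral model, to Inference Endpoints using TGI.

We can deploy the model in just a few clicks from the UI, or use the huggingface_hub Python library to create and manage Inference Endpoints programmatically.

Here we use the Hub library, specifying an endpoint name, the model repository, and the text-generation task. In this example we use the protected type, so access to the deployed model requires a valid Hugging Face token. We also need to configure the hardware requirements: vendor, region, accelerator, instance type, and instance size. The available resource options can be listed through an API call, and recommended configurations for select models are available in the catalog.

Note: You may need to request a quota upgrade by sending an email to api-enterprise@huggingface.co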

>>> from huggingface_hub import create_inference_endpoint

>>> endpoint = create_inference_endpoint(
...     "nous-hermes-2-mixtral-8x7b-demo",
...     repository="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
...     framework="pytorch",
...     task="text-generation",
...     accelerator="gpu",
...     vendor="aws",
...     region="us-east-1",
...     type="protected",
...     instance_type="p4de",
...     instance_size="2xlarge",
...     custom_image={
...         "health_route": "/health",
...         "env": {
...             "MAX_INPUT_LENGTH": "4096",
...             "MAX_BATCH_PREFILL_TOKENS": "4096",
...             "MAX_TOTAL_TOKENS": "32000",
...             "MAX_BATCH_TOTAL_TOKENS": "1024000",
...             "MODEL_ID": "/repository",
...         },
...         "url": "ghcr.io/huggingface/text-generation-inference:sha-1734540",  # must be >= 1.4.0
...     },
... )

>>> endpoint.wait()
>>> print(endpoint.status)

running

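The deployment takes a few minutes to spin up. We can use the .wait() utility to block the running thread until the endpoint reaches its final "running" state. Once it is running, we can confirm its status and try it out in the UI Playground.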

[Image: Inference Endpoints UI overview]

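Great, we now have a working endpoint!

Note: When deploying with huggingface_hub, your endpoint will scale to zero after 15 minutes of idle time by default, to optimize cost during periods of inactivity. See the Hub Python library documentation for all the functionality available to manage the endpoint lifecycle.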
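The blocking behavior of .wait() described above can be pictured as a simple polling loop. The sketch below is not the huggingface_hub implementation; it is a minimal, library-free illustration in which fetch_status, the status strings, and the default timings are all assumptions chosen for the example.

```python
import time


def wait_until(fetch_status, target="running", failed=("failed",),
               timeout=600.0, refresh_every=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns `target`.

    fetch_status is a hypothetical zero-argument callable that returns the
    current status as a string. Raises RuntimeError on a failed status and
    TimeoutError once `timeout` seconds have elapsed.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status == target:
            return status
        if status in failed:
            raise RuntimeError(f"endpoint entered status {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        sleep(refresh_every)


# Simulated status sequence standing in for a real endpoint.
statuses = iter(["pending", "initializing", "running"])
print(wait_until(lambda: next(statuses), sleep=lambda seconds: None))
```

Injecting sleep and clock keeps the loop easy to exercise without waiting; with a real endpoint, the .wait() call shown earlier is the simpler choice.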

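2. Query the Inference Endpoint with OpenAI Client Libraries

As mentioned above, since our model is hosted with TGI it now supports a Messages API, which means we can query it directly with the familiar OpenAI client libraries.

With the Python client

The example below shows how to make this transition using the OpenAI Python library. Simply replace <ENDPOINT_URL> with your endpoint URL (be sure to include the v1/ suffix) and populate the <HF_API_KEY> field with a valid Hugging Face user token. The <ENDPOINT_URL> can be found in the Inference Endpoints UI, or read from the endpoint object we created above with endpoint.url. In the code below, these two values are held in the BASE_URL and HF_API_KEY variables.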
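Because the v1/ suffix is easy to forget or to duplicate, a small helper can normalize the URL before it is handed to a client. This is a minimal sketch, not part of the notebook: the function name messages_api_base and the example hostname are hypothetical.

```python
def messages_api_base(endpoint_url: str) -> str:
    """Return endpoint_url ending in exactly one 'v1/' path segment."""
    base = endpoint_url.rstrip("/")
    if not base.endswith("/v1"):
        base += "/v1"
    return base + "/"


# Hypothetical endpoint URL, with and without the suffix already present.
print(messages_api_base("https://example.endpoints.huggingface.cloud"))
print(messages_api_base("https://example.endpoints.huggingface.cloud/v1/"))
```

Both calls print the same URL, ending in /v1/. Plain string handling is used here rather than os.path.join, which is meant for filesystem paths and inserts a backslash on Windows.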

We can then use the client as usual, passing a list of messages to stream responses from our Inference Endpoint.

>>> from openai import OpenAI

>>> BASE_URL = endpoint.url

>>> # init the client but point it to TGI
>>> client = OpenAI(
...     base_url=f"{BASE_URL}/v1/",  # build the URL as a string; os.path.join is for file paths
...     api_key=HF_API_KEY,
... )
>>> chat_completion = client.chat.completions.create(
...     model="tgi",
...     messages=[
...         {"role": "system", "content": "You are a helpful assistant."},
...         {"role": "user", "content": "Why is open-source software important?"},
...     ],
...     stream=True,
...     max_tokens=500,
... )

>>> # iterate and print the stream, skipping chunks that carry no content
>>> for message in chat_completion:
...     print(message.choices[0].delta.content or "", end="")

Open-source software is important due to a number of reasons, including:

1. Collaboration: The collaborative nature of open-source software allows developers from around the world to work together, share their ideas and improve the code. This often results in faster progress and better software.

2. Transparency: With open-source software, the code is publicly available, making it easy to see exactly how the software functions, and allowing users to determine if there are any security vulnerabilities.

3. Customization: Being able to access the code also allows users to customize the software to better suit their needs. This makes open-source software incredibly versatile, as users can tweak it to suit their specific use case.

4. Quality: Open-source software is often developed by large communities of dedicated developers, who work together to improve the software. This results in a higher level of quality than might be found in proprietary software.

5. Cost: Open-source software is often provided free of charge, which makes it accessible to a wider range of users. This can be especially important for organizations with limited budgets for software.

6. Shared Benefit: By sharing the code of open-source software, everyone can benefit from the hard work of the developers. This contributes to the overall advancement of technology, as users and developers work together to improve and build upon the software.

In summary, open-source software provides a collaborative platform that leads to high-quality, customizable, and transparent software, all available at little or no cost, benefiting both individuals and the technology community as a whole.<|im_end|>

Behind the scenes, TGI's Messages API automatically converts the list of messages into the model's required instruction format using its chat template.
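To make that conversion concrete, here is a rough sketch of what a ChatML-style template does to a list of messages. It assumes the deployed model expects ChatML markers (the <|im_end|> at the end of the streamed output above points that way); the real template ships with the model and is applied server-side, so the function to_chatml below is purely illustrative.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render OpenAI-style messages as a single ChatML-style prompt string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = to_chatml(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Why is open-source software important?"},
    ]
)
print(prompt)
```

With the Messages API none of this needs to be done on the client; the sketch only shows the kind of string the model ultimately receives.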

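Note: At this time, some OpenAI features, such as function calling, are not compatible with TGI. The Messages API currently supports the following chat completion parameters: stream, max_tokens, frequency_penalty, logprobs, seed, temperature, and top_p.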
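When porting an existing script, it can help to check which optional request parameters fall outside that list before sending them. The sketch below is one way to do so; the names follow the list in the note, with the token limit written as max_tokens as in this notebook's requests, and the helper split_params is hypothetical.

```python
# Optional chat completion parameters named in the note above.
SUPPORTED_PARAMS = {
    "stream",
    "max_tokens",
    "frequency_penalty",
    "logprobs",
    "seed",
    "temperature",
    "top_p",
}


def split_params(params):
    """Split optional request parameters into (supported, unsupported names)."""
    supported = {k: v for k, v in params.items() if k in SUPPORTED_PARAMS}
    unsupported = sorted(k for k in params if k not in SUPPORTED_PARAMS)
    return supported, unsupported


kept, dropped = split_params(
    {"temperature": 0.2, "max_tokens": 500, "stream": True, "tools": []}
)
print(kept)     # parameters that can be forwarded
print(dropped)  # parameters to remove or handle differently
```

The required model and messages fields are not part of this check; it only covers the optional parameters.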

With the JavaScript client

Here is the same streaming example as above, this time using the OpenAI JavaScript/TypeScript library.

import OpenAI from "openai";

const openai = new OpenAI({
  baseURL: "<ENDPOINT_URL>" + "/v1/", // replace with your endpoint url
  apiKey: "<HF_API_TOKEN>", // replace with your token
});

async function main() {
  const stream = await openai.chat.completions.create({
    model: "tgi",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "Why is open-source software important?" },
    ],
    stream: true,
    max_tokens: 500,
  });
  for await (const chunk of stream) {
    process.stdout.write(chunk.choices[0]?.delta?.content || "");
  }
}

main();

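3. Integrate with LangChain and LlamaIndex

Now, let's see how to use this newly created endpoint with popular RAG frameworks such as LangChain and LlamaIndex.

How to use with LangChain

To use it in LangChain, simply create an instance of ChatOpenAI and pass your <ENDPOINT_URL> and <HF_API_TOKEN> (the BASE_URL and HF_API_KEY variables defined earlier) as follows: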

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model_name="tgi",
    openai_api_key=HF_API_KEY,
    openai_api_base=f"{BASE_URL}/v1/",  # build the URL as a string; os.path.join is for file paths
)
llm.invoke("Why is open-source software important?")

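We can reuse the same ChatOpenAI class that we would have used with the OpenAI models. This allows all previous code to work with our endpoint by changing just one line of code.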
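One way to keep that change in a single place is to build the constructor arguments from a flag. This is a minimal sketch under stated assumptions: the helper chat_openai_kwargs is hypothetical, the keyword names mirror the ChatOpenAI call above, and "gpt-3.5-turbo" is only a placeholder for whatever OpenAI model a script used before.

```python
def chat_openai_kwargs(use_tgi, endpoint_url="", hf_token=""):
    """Return keyword arguments for ChatOpenAI for either backend."""
    if use_tgi:
        return {
            "model_name": "tgi",
            "openai_api_key": hf_token,
            "openai_api_base": endpoint_url.rstrip("/") + "/v1/",
        }
    # Placeholder OpenAI configuration; the API key is read from the environment.
    return {"model_name": "gpt-3.5-turbo"}


# Hypothetical endpoint URL and token, for illustration only.
print(chat_openai_kwargs(True, "https://example.endpoints.huggingface.cloud", "hf_xxx"))
```

In the notebook this would be used as ChatOpenAI(**chat_openai_kwargs(True, BASE_URL, HF_API_KEY)), leaving the rest of the chain untouched.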

Now let's use our Mixtral model in a simple RAG pipeline to answer a question about the contents of an HF blog post.

from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables import RunnableParallel
from langchain_community.embeddings import HuggingFaceEmbeddings

# Load, chunk and index the contents of the blog
loader = WebBaseLoader(
    web_paths=("https://huggingface.co/blog/open-source-llms-as-agents",),
)
docs = loader.load()

# declare an HF embedding model
hf_embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-large-en-v1.5")

text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vectorstore = Chroma.from_documents(documents=splits, embedding=hf_embeddings)

# Retrieve and generate using the relevant snippets of the blog
retriever = vectorstore.as_retriever()
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain_from_docs = (
    RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"]))) | prompt | llm | StrOutputParser()
)

rag_chain_with_source = RunnableParallel({"context": retriever, "question": RunnablePassthrough()}).assign(
    answer=rag_chain_from_docs
)

rag_chain_with_source.invoke("According to this article which open-source model is the best for an agent behaviour?")
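To make the chunk_size=512 and chunk_overlap=200 settings used above more concrete, the sketch below splits text with a plain sliding window measured in characters. It is a deliberate simplification: RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and sentences, which this hypothetical split_with_overlap function does not attempt.

```python
def split_with_overlap(text, chunk_size=512, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1000))
chunks = split_with_overlap(sample)
print(len(chunks), [len(chunk) for chunk in chunks])
```

Each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of indexing some text twice.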

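How to use with LlamaIndex

Similarly, you can also use a TGI endpoint in LlamaIndex. We'll use the OpenAILike class and instantiate it with a few additional arguments (namely is_local, is_function_calling_model, is_chat_model, and context_window).

Note: The context_window argument should match the value previously set for MAX_TOTAL_TOKENS on your endpoint.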

from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    model="tgi",
    api_key=HF_API_KEY,
    api_base=BASE_URL + "/v1/",
    is_chat_model=True,
    is_local=False,
    is_function_calling_model=False,
    context_window=4096,
)

llm.complete("Why is open-source software important?")

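We can now use it in a similar RAG pipeline. Keep in mind that the MAX_INPUT_LENGTH you chose earlier for your Inference Endpoint directly limits the number of retrieved chunks (similarity_top_k) the model can process.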

from llama_index.core import Settings, VectorStoreIndex, download_loader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.query_engine import CitationQueryEngine

SimpleWebPageReader = download_loader("SimpleWebPageReader")

documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["https://huggingface.co/blog/open-source-llms-as-agents"]
)

# Load embedding model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-large-en-v1.5")

# Pass LLM to pipeline so the query engine uses our TGI endpoint
Settings.llm = llm

index = VectorStoreIndex.from_documents(documents, embed_model=embed_model, show_progress=True)

# Query the index
query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=2,
)
response = query_engine.query("According to this article which open-source model is the best for an agent behaviour?")

response.response
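The link between MAX_INPUT_LENGTH and similarity_top_k mentioned above comes down to simple arithmetic. The sketch below gives a rough upper bound for the case where all retrieved chunks are packed into one prompt; the chunk size of 512 tokens and the 500-token allowance for instructions and the question are illustrative assumptions, not values taken from the notebook.

```python
def max_top_k(max_input_length, chunk_tokens, prompt_overhead_tokens):
    """Rough upper bound on how many retrieved chunks fit in one prompt."""
    budget = max_input_length - prompt_overhead_tokens
    if budget <= 0 or chunk_tokens <= 0:
        return 0
    return budget // chunk_tokens


# MAX_INPUT_LENGTH=4096 as configured on the endpoint; other values are assumed.
print(max_top_k(max_input_length=4096, chunk_tokens=512, prompt_overhead_tokens=500))
# -> 7
```

Actual token counts depend on the tokenizer and on how the query engine builds its prompts, so it is worth staying well below this bound, as the similarity_top_k=2 setting above does.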

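Wrap up

After you are done with your endpoint, you can either pause or delete it. This step can be completed via the UI, or programmatically as shown below.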

# pause our running endpoint
endpoint.pause()

# optionally delete
# endpoint.delete()
