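Serverless Inference API

Hugging Face provides a Serverless Inference API that lets users quickly test and evaluate thousands of publicly accessible (or your own privately permissioned) machine learning models for free, using simple API calls.

In this notebook recipe, we'll demonstrate several different ways to query the Serverless Inference API while exploring various tasks, including:

Generating text with open LLMs
Creating images with Stable Diffusion
Reasoning over images with VLMs
Generating speech from text

The goal is to help you get started by covering the basics!

Because we offer the Serverless Inference API for free, regular Hugging Face users are rate limited (roughly a few hundred requests per hour). For higher rate limits, you can upgrade to a PRO account for just $9 per month. For high-volume, production inference workloads, however, check out our Dedicated Inference Endpoints solution.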

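Let's get started

To begin using the Serverless Inference API, you'll need a Hugging Face Hub profile: sign up if you don't have one, or log in if you do.

Next, you'll need to create a User Access Token. A token with read or write permissions will work. However, we strongly recommend using fine-grained tokens.

For this notebook, you'll need a fine-grained token with the "Inference > Make calls to the serverless Inference API" user permission, along with read access to the meta-llama/Meta-Llama-3-8B-Instruct and HuggingFaceM4/idefics2-8b-chatty repos, because we need to download their tokenizers to run this notebook.

With those steps out of the way, we can install the required packages and authenticate to the Hub with our User Access Token.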

%pip install -U huggingface_hub transformers

import os
from huggingface_hub import interpreter_login, whoami, get_token

# running this will prompt you to enter your Hugging Face credentials
interpreter_login()

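Above, we used interpreter_login() to log in to the Hub programmatically. As an alternative, we could use other methods such as notebook_login() from the Hub Python library or the login command from the Hugging Face CLI tool.

Now, let's verify that we're properly logged in by calling whoami(), which prints the active username and the organizations your profile belongs to.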

whoami()

Querying the Serverless Inference API

The Serverless Inference API exposes models on the Hub through a simple API:

https://api-inference.huggingface.co/models/<MODEL_ID>

where <MODEL_ID> corresponds to the name of the model repo on the Hub.

For example, codellama/CodeLlama-7b-hf becomes https://api-inference.huggingface.co/models/codellama/CodeLlama-7b-hf
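The mapping from repo name to endpoint can be sketched as a small helper. This is only an illustration of the URL pattern described above; the names API_BASE and build_api_url are hypothetical and not part of any library.

```python
# Base URL taken from the pattern shown above.
API_BASE = "https://api-inference.huggingface.co/models"


def build_api_url(model_id: str) -> str:
    """Return the Serverless Inference API URL for a Hub model repo name."""
    model_id = model_id.strip("/")
    if not model_id:
        raise ValueError("model_id must be a non-empty repo name")
    return f"{API_BASE}/{model_id}"


print(build_api_url("codellama/CodeLlama-7b-hf"))
# https://api-inference.huggingface.co/models/codellama/CodeLlama-7b-hf
```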

With an HTTP request

We can easily call this API with a simple POST request using the requests library.

>>> import requests

>>> API_URL = "https://api-inference.huggingface.co/models/codellama/CodeLlama-7b-hf"
>>> HEADERS = {"Authorization": f"Bearer {get_token()}"}


>>> def query(payload):
...     response = requests.post(API_URL, headers=HEADERS, json=payload)
...     return response.json()


>>> print(
...     query(
...         payload={
...             "inputs": "A HTTP POST request is used to ",
...             "parameters": {"temperature": 0.8, "max_new_tokens": 50, "seed": 42},
...         }
...     )
... )

[{'generated_text': 'A HTTP POST request is used to send data to a web server.\n\n# Example\n```javascript\npost("localhost:3000", {foo: "bar"})\n  .then(console.log => console.log(\'success\'))\n```\n\n'}]

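Nice! The API responded with a continuation of our input prompt. But you might be wondering... how did the API know what to do with the payload? And how do I, as a user, know which parameters can be passed for a given model?

Behind the scenes, the Inference API dynamically loads the requested model onto shared compute infrastructure to serve predictions. When the model is loaded, the Serverless Inference API uses the pipeline_tag specified in the model card to determine the appropriate inference task. You can consult the corresponding task or pipeline documentation to find the allowed arguments.

If the requested model is not already loaded into memory at the time of the request (which depends on recent requests for that model), the Serverless Inference API will initially return a 503 response before it can successfully respond with a prediction. Wait a few moments to give the model time to spin up, then try again. You can also check which models are loaded and available at any given time using InferenceClient().list_deployed_models().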
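The wait-and-retry advice can be sketched as a small loop. This is a minimal illustration, not an official client feature: query_with_retry, its parameters, and the linearly growing wait are all assumptions made for this example. It accepts any zero-argument callable that returns a response object with status_code and json(), which is the shape requests.post() returns.

```python
import time


def query_with_retry(send, max_attempts=5, wait_seconds=2.0, sleep=time.sleep):
    """Call send() until it stops returning HTTP 503 or attempts run out.

    `send` is a zero-argument callable returning an object with a
    `status_code` attribute and a `json()` method (hypothetical helper).
    """
    for attempt in range(1, max_attempts + 1):
        response = send()
        if response.status_code != 503:
            # Any non-503 response is returned as parsed JSON, errors included.
            return response.json()
        if attempt < max_attempts:
            # Wait a little longer after each failed attempt.
            sleep(wait_seconds * attempt)
    raise RuntimeError(f"model still loading after {max_attempts} attempts")
```

With the query example above, this could be called as query_with_retry(lambda: requests.post(API_URL, headers=HEADERS, json=payload)).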

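With the huggingface_hub Python library

To send requests from Python, you can take advantage of InferenceClient, a convenient utility in the huggingface_hub Python library that makes it easy to call the Serverless Inference API.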

>>> from huggingface_hub import InferenceClient

>>> client = InferenceClient()
>>> response = client.text_generation(
...     prompt="A HTTP POST request is used to ",
...     model="codellama/CodeLlama-7b-hf",
...     temperature=0.8,
...     max_new_tokens=50,
...     seed=42,
...     return_full_text=True,
... )
>>> print(response)

A HTTP POST request is used to send data to a web server.

# Example
```javascript
post("localhost:3000", {foo: "bar"})
  .then(console.log => console.log('success'))
```

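Notice that with InferenceClient, we specify only the model ID and pass arguments directly to the text_generation() method. We can easily inspect the function signature to learn more about how to use the task and which parameters it allows.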

# uncomment the following line to see the function signature
# help(client.text_generation)

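In addition to Python, you can also use JavaScript to integrate inference calls into your JS or Node apps. Take a look at huggingface.js to get started.

Applications

Now that we know how the Serverless Inference API works, let's take it for a spin and learn a few tricks along the way.

1. Generating Text with Open LLMs

Text generation is a very common use case. However, interacting with open LLMs has some subtleties that are important to understand in order to avoid silent performance degradation. When it comes to text generation, the underlying language model may come in a couple of different forms: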

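Base models: plain, pretrained language models such as codellama/CodeLlama-7b-hf or meta-llama/Meta-Llama-3-8B. These models are good at continuing generation from a provided prompt (as we saw in the example above). However, they have not been fine-tuned for conversational use such as answering questions.
Instruction-tuned models: models trained in a multi-task manner to follow a broad range of instructions, such as "Write me a recipe for chocolate cake". Models like meta-llama/Meta-Llama-3-8B-Instruct or mistralai/Mistral-7B-Instruct-v0.3 are trained this way. Instruction-tuned models respond to instructions better than base models. Often, these models are also fine-tuned for multi-turn chat dialogs, which makes them great for conversational use cases.

Understanding these subtle differences is important because they affect how we should query a particular model. Instruct models are trained with model-specific chat templates, so you need to pay close attention to the format the model expects and replicate it in your queries.

For example, meta-llama/Meta-Llama-3-8B-Instruct uses the following prompt structure to delineate system, user, and assistant dialog turns: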

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_msg_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|>
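To make that structure concrete, here is a hand-written sketch that assembles the same layout from a list of role/content messages. The function name format_llama3_prompt is hypothetical, and the sketch only mirrors the structure shown above; it is not the model's official template and may differ from it in details such as whitespace handling.

```python
def format_llama3_prompt(messages, add_generation_prompt=True):
    """Assemble a Llama-3-style prompt string from role/content messages.

    Illustrative only: mirrors the structure shown above by hand.
    """
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn: a role header, a blank line, the content, an end-of-turn token.
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its answer.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


example = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(format_llama3_prompt(example))
```

Writing this by hand is error-prone, since a single missing token silently changes what the model sees.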

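The special tokens and prompt format vary from model to model. To make sure we're using the correct format, we can rely on the model's chat template via its tokenizer, as shown below.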

>>> from transformers import AutoTokenizer

>>> # define the system and user messages
>>> system_input = "You are an expert prompt engineer with artistic flair."
>>> user_input = "Write a concise prompt for a fun image containing a llama and a cookbook. Only return the prompt."
>>> messages = [
...     {"role": "system", "content": system_input},
...     {"role": "user", "content": user_input},
... ]

>>> # load the model and tokenizer
>>> model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)

>>> # apply the chat template to the messages
>>> prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
>>> print(f"\nPROMPT:\n-----\n\n{prompt}")

PROMPT:
-----

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an expert prompt engineer with artistic flair.<|eot_id|><|start_header_id|>user<|end_header_id|>

Write a concise prompt for a fun image containing a llama and a cookbook. Only return the prompt.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

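Notice how the apply_chat_template() method took the familiar list of messages and converted it into the properly formatted string that our model expects. We can pass this formatted string to the InferenceClient's text_generation method.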

>>> llm_response = client.text_generation(prompt, model=model_id, max_new_tokens=250, seed=42)
>>> print(llm_response)

"A whimsical illustration of a llama proudly holding a cookbook, with a sassy expression and a sprinkle of flour on its nose, surrounded by a colorful kitchen backdrop with utensils and ingredients scattered about, as if the llama is about to whip up a culinary masterpiece."

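Querying an LLM without adhering to the model's prompt template will not produce any outright errors! However, it will result in poor-quality output. Take a look at what happens when we pass the same system and user input without formatting it according to the chat template.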

>>> out = client.text_generation(system_input + " " + user_input, model=model_id, max_new_tokens=250, seed=42)
>>> print(out)

Do not write the... 1 answer below »

You are an expert prompt engineer with artistic flair. Write a concise prompt for a fun image containing a llama and a cookbook. Only return the prompt. Do not write the image description.

A llama is sitting at a kitchen table, surrounded by cookbooks and utensils, with a cookbook open in front of it. The llama is wearing a chef's hat and holding a spatula. The cookbook is titled "Llama's Favorite Recipes" and has a llama on the cover. The llama is surrounded by a warm, golden light, and the kitchen is filled with the aroma of freshly baked bread. The llama is smiling and looking directly at the viewer, as if inviting them to join in the cooking fun. The image should be colorful, whimsical, and full of texture and detail. The llama should be the main focus of the image, and the cookbook should be prominently displayed. The background should be a warm, earthy color, such as terracotta or sienna. The overall mood of the image should be playful, inviting, and joyful. 1 answer below »

You are an expert prompt engineer with artistic flair. Write a concise prompt for a fun image containing a llama and a

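Yikes! The LLM hallucinated a nonsensical intro, unexpectedly repeated the prompt, and failed to remain concise. To simplify the prompting process and ensure the proper chat template is used, InferenceClient also offers a chat_completion method that abstracts away the chat template details. This lets you simply pass a list of messages: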

>>> for token in client.chat_completion(messages, model=model_id, max_tokens=250, stream=True, seed=42):
...     print(token.choices[0].delta.content)

"A
 whims
ical
 illustration
 of
 a
 fashion
ably
 dressed
 llama
 proudly
 holding
 a
 worn
,
 vintage
 cookbook
,
 with
 a
 warm
 cup
 of
 tea
 and
 a
 few
 freshly
 baked
 treats
 scattered
 around
,
 set
 against
 a
 cozy
 background
 of
 rustic
 wood
 and
 blo
oming
 flowers
."

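Streaming

In the example above, we also set stream=True to enable streaming text from the endpoint, which is why the output arrives one token at a time. To learn more about functionality like this and about best practices for querying LLMs, we recommend reading the supporting documentation.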

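2. Creating Images with Stable Diffusion

The Serverless Inference API can be used for many different tasks. Here, we'll use it to generate images with Stable Diffusion.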

>>> image = client.text_to_image(
...     prompt=llm_response,
...     model="stabilityai/stable-diffusion-xl-base-1.0",
...     guidance_scale=8,
...     seed=42,
... )

>>> display(image.resize((image.width // 2, image.height // 2)))
>>> print("PROMPT: ", llm_response)

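Caching

By default, responses to InferenceClient requests are cached. This means that if you query the API with the same payload multiple times, the result returned by the API will be exactly the same. Take a look: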

>>> image = client.text_to_image(
...     prompt=llm_response,
...     model="stabilityai/stable-diffusion-xl-base-1.0",
...     guidance_scale=8,
...     seed=42,
... )

>>> display(image.resize((image.width // 2, image.height // 2)))
>>> print("PROMPT: ", llm_response)

To force a fresh response each time, we can set an HTTP header so that the cache is bypassed and a new generation is run: x-use-cache: 0.

>>> # turn caching off
>>> client.headers["x-use-cache"] = "0"

>>> # generate a new image with the same prompt
>>> image = client.text_to_image(
...     prompt=llm_response,
...     model="stabilityai/stable-diffusion-xl-base-1.0",
...     guidance_scale=8,
...     seed=42,
... )

>>> display(image.resize((image.width // 2, image.height // 2)))
>>> print("PROMPT: ", llm_response)
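The same header also applies when calling the API with plain HTTP requests, as in the requests example earlier. The helper below is a hypothetical sketch that builds the headers dictionary; build_headers and the placeholder token are assumptions made for illustration.

```python
def build_headers(token, use_cache=True):
    """Build request headers, optionally asking the API to skip its cache."""
    headers = {"Authorization": f"Bearer {token}"}
    if not use_cache:
        # Same header as set on the client above: "0" disables cached results.
        headers["x-use-cache"] = "0"
    return headers


# "hf_xxx" is a placeholder, not a real token.
print(build_headers("hf_xxx", use_cache=False))
# {'Authorization': 'Bearer hf_xxx', 'x-use-cache': '0'}
```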

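3. Reasoning Over Images with Idefics2

Vision language models (VLMs) can take both text and images as input and produce text as output. This allows them to tackle many tasks, from visual question answering to image captioning. Let's use the Serverless Inference API to query Idefics2, a powerful 8B-parameter VLM, and have it write a poem about our newly generated image.

We first need to convert our PIL image to a base64-encoded string so that we can send it to the model over the network.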

import base64
from io import BytesIO


def pil_image_to_base64(image):
    buffered = BytesIO()
    image.save(buffered, format="JPEG")
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return img_str


image_b64 = pil_image_to_base64(image)

Next, we need to properly format our text + image prompt using a chat template. See the Idefics2 model card for specific details on prompt formatting.

from transformers import AutoProcessor

# load the processor
vlm_model_id = "HuggingFaceM4/idefics2-8b-chatty"
processor = AutoProcessor.from_pretrained(vlm_model_id)

# define the user messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Write a short limerick about this image."},
        ],
    },
]

# apply the chat template to the messages
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# add the base64 encoded image to the prompt
image_input = f"data:image/jpeg;base64,{image_b64}"
image_input = f"![]({image_input})"
prompt = prompt.replace("<image>", image_input)
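To see what the embedded image string looks like, and to confirm that nothing is lost in the encoding, here is a small round-trip sketch using only the standard library. Both helper names are hypothetical, and the sample bytes stand in for a real JPEG so the example runs without an image.

```python
import base64


def to_markdown_data_uri(raw_bytes, mime_type="image/jpeg"):
    """Wrap raw bytes as a base64 data URI inside markdown image syntax."""
    b64 = base64.b64encode(raw_bytes).decode("utf-8")
    return f"![](data:{mime_type};base64,{b64})"


def from_markdown_data_uri(markdown_image):
    """Recover the original bytes from a string built by to_markdown_data_uri."""
    data_uri = markdown_image[len("![]("):-1]  # drop the "![](" prefix and ")" suffix
    _, b64 = data_uri.split(",", 1)  # the payload follows the first comma
    return base64.b64decode(b64)


sample = b"\xff\xd8\xff\xe0 stand-in bytes, not a real image"
wrapped = to_markdown_data_uri(sample)
print(wrapped[:40])
print(from_markdown_data_uri(wrapped) == sample)
```

Base64 output is roughly a third larger than the raw bytes, so large images produce long prompts.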

Finally, we call the Serverless Inference API to get a prediction. In our case, a fun limerick about our generated image!

>>> limerick = client.text_generation(prompt, model=vlm_model_id, max_new_tokens=200, seed=42)
>>> print(limerick)

In the heart of a kitchen, so bright and so clean,
Lived a llama named Lulu, quite the culinary queen.
With a book in her hand, she'd read and she'd cook,
Her recipes were magic, her skills were so nook.
In her world, there was no room for defeat,
For Lulu, the kitchen was where she'd meet.

4. Generating Speech from Text

To finish up, let's use Bark, a transformer-based text-to-audio model, to generate an audible voiceover for our poem.

tts_model_id = "suno/bark"
speech_out = client.text_to_speech(text=limerick, model=tts_model_id)

>>> from IPython.display import Audio

>>> display(Audio(speech_out, rate=24000))
>>> print(limerick)

In the heart of a kitchen, so bright and so clean,
Lived a llama named Lulu, quite the culinary queen.
With a book in her hand, she'd read and she'd cook,
Her recipes were magic, her skills were so nook.
In her world, there was no room for defeat,
For Lulu, the kitchen was where she'd meet.

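Next Steps

That's it! In this notebook, we learned how to use the Serverless Inference API to query a variety of powerful transformer models. We've only scratched the surface of what you can do, and we recommend checking out the docs to learn more about what's possible.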
