Vision Language Model Inference in TGI

Vision language models (VLMs) are models that take both image and text inputs and generate text.

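VLMs are trained on a combination of image and text data and can handle a wide range of tasks, such as image captioning, visual question answering, and visual dialog.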

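What sets VLMs apart from other text and image models is their ability to handle long context and to generate text that stays coherent and relevant to the image, even after multiple turns or, in some cases, multiple images.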

Below are a few common use cases for vision language models:

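Image captioning: given an image, generate a caption that describes it.
Visual question answering (VQA): given an image and a question about it, generate an answer to the question.
Multimodal dialog: generate a response to multiple turns of images and conversation.
Image information retrieval: given an image, retrieve information from it.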

How to Use a Vision Language Model?

Hugging Face Hub Python Library

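To run inference on a vision language model from Python, you can use the huggingface_hub library. The InferenceClient class provides a simple way to interact with the Inference API. Images can be passed as URLs or as Base64-encoded strings, and InferenceClient detects the image format automatically.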

from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://127.0.0.1:3000")
image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
prompt = f"![]({image})What is this a picture of?\n\n"
for token in client.text_generation(prompt, max_new_tokens=16, stream=True):
    print(token)

# This is a picture of an anthropomorphic rabbit in a space suit.

from huggingface_hub import InferenceClient
import base64

client = InferenceClient(base_url="http://127.0.0.1:3000")

# read image from local file
image_path = "rabbit.png"
with open(image_path, "rb") as f:
    image = base64.b64encode(f.read()).decode("utf-8")

image = f"data:image/png;base64,{image}"
prompt = f"![]({image})What is this a picture of?\n\n"

for token in client.text_generation(prompt, max_new_tokens=16, stream=True):
    print(token)

# This is a picture of an anthropomorphic rabbit in a space suit.
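
The local-file example above hard-codes the image/png MIME type. As a minimal sketch, the same steps can be wrapped in two small helpers that guess the MIME type from the file name. The names `to_data_uri` and `build_prompt` are hypothetical and are not part of huggingface_hub.

```python
import base64
import mimetypes


def to_data_uri(image_bytes: bytes, filename: str) -> str:
    """Encode raw image bytes as a data URI (hypothetical helper)."""
    # Guess the MIME type from the file extension; fall back to PNG if
    # the extension is unknown or does not map to an image type.
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        mime_type = "image/png"
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def build_prompt(image_ref: str, question: str) -> str:
    """Wrap an image URL or data URI in the Markdown image syntax used above."""
    return f"![]({image_ref}){question}\n\n"
```

The resulting string can be passed to client.text_generation exactly like the prompt in the examples above.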

Alternatively, use the chat_completion endpoint:

from huggingface_hub import InferenceClient

client = InferenceClient(base_url="http://127.0.0.1:3000")

chat = client.chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Whats in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
                    },
                },
            ],
        },
    ],
    seed=42,
    max_tokens=100,
)

print(chat)
# ChatCompletionOutput(choices=[ChatCompletionOutputComplete(finish_reason='length', index=0, message=ChatCompletionOutputMessage(role='assistant', content=" The image you've provided features an anthropomorphic rabbit in spacesuit attire. This rabbit is depicted with human-like posture and movement, standing on a rocky terrain with a vast, reddish-brown landscape in the background. The spacesuit is detailed with mission patches, circuitry, and a helmet that covers the rabbit's face and ear, with an illuminated red light on the chest area.\n\nThe artwork style is that of a", name=None, tool_calls=None), logprobs=None)], created=1714589614, id='', model='llava-hf/llava-v1.6-mistral-7b-hf', object='text_completion', system_fingerprint='2.0.2-native', usage=ChatCompletionOutputUsage(completion_tokens=100, prompt_tokens=2943, total_tokens=3043))
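
The nested message structure above is easy to get wrong when written by hand. A minimal sketch of a builder for it follows. `build_vision_messages` is a hypothetical helper name, and passing a Base64 data URI as `image_ref` instead of an HTTP(S) URL is an assumption that should be checked against your TGI version.

```python
from typing import Any


def build_vision_messages(question: str, image_ref: str) -> list[dict[str, Any]]:
    """Build a single-turn user message with one text part and one image part.

    Hypothetical helper: `image_ref` is an HTTP(S) URL as in the example
    above; using a data URI here is an untested assumption.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_ref}},
            ],
        }
    ]
```

The result can be passed as the messages argument of client.chat_completion. Judging by the printed output above, the generated text itself is available as chat.choices[0].message.content.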

Or use OpenAI's client library:

from openai import OpenAI

# init the client but point it to TGI
client = OpenAI(base_url="http://localhost:3000/v1", api_key="-")

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Whats in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png"
                    },
                },
            ],
        },
    ],
    stream=False,
)

print(chat_completion)
# ChatCompletion(id='', choices=[Choice(finish_reason='eos_token', index=0, logprobs=None, message=ChatCompletionMessage(content=' The image depicts an anthropomorphic rabbit dressed in a space suit with gear that resembles NASA attire. The setting appears to be a solar eclipse with dramatic mountain peaks and a partial celestial body in the sky. The artwork is detailed and vivid, with a warm color palette and a sense of an adventurous bunny exploring or preparing for a journey beyond Earth. ', role='assistant', function_call=None, tool_calls=None))], created=1714589732, model='llava-hf/llava-v1.6-mistral-7b-hf', object='text_completion', system_fingerprint='2.0.2-native', usage=CompletionUsage(completion_tokens=84, prompt_tokens=2943, total_tokens=3027))

Inference Through Sending cURL Requests

To use the generate_stream endpoint with curl, add the -N flag. This flag disables curl's default buffering and shows data as it arrives from the server.

curl -N 127.0.0.1:3000/generate_stream \
    -X POST \
    -d '{"inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n","parameters":{"max_new_tokens":16, "seed": 42}}' \
    -H 'Content-Type: application/json'

# ...
# data:{"index":16,"token":{"id":28723,"text":".","logprob":-0.6196289,"special":false},"generated_text":"This is a picture of an anthropomorphic rabbit in a space suit.","details":null}
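
Each streamed event arrives as a line starting with `data:` followed by a JSON object, as in the output above. A minimal sketch of how a Python client could decode such lines is shown below. The function names are hypothetical, and the code assumes that intermediate events have the same shape as the final one, with generated_text set to null until the last event.

```python
import json
from typing import Iterable, Iterator


def iter_stream_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of every `data:` line."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything else
        yield json.loads(line[len("data:"):])


def collect_generated_text(lines: Iterable[str]) -> str:
    """Return the final generated_text, or the joined tokens if it is absent."""
    pieces = []
    final_text = None
    for event in iter_stream_events(lines):
        token = event["token"]
        if not token["special"]:
            pieces.append(token["text"])
        if event.get("generated_text") is not None:
            final_text = event["generated_text"]
    return final_text if final_text is not None else "".join(pieces)
```

Any iterable of text lines works as input, for example the decoded lines of a streaming HTTP response.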

Inference Through JavaScript

First, we need to install the @huggingface/inference library.

npm install @huggingface/inference

Whether you use Inference Providers (our serverless API) or Inference Endpoints, you can call InferenceClient.

We can create an InferenceClient by providing our endpoint URL and a Hugging Face access token.

import { InferenceClient } from "@huggingface/inference";

const client = new InferenceClient('hf_YOUR_TOKEN', { endpointUrl: 'https://YOUR_ENDPOINT.endpoints.huggingface.cloud' });

const prompt =
  "![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n";

const stream = client.textGenerationStream({
  inputs: prompt,
  parameters: { max_new_tokens: 16, seed: 42 },
});
for await (const r of stream) {
  // yield the generated token
  process.stdout.write(r.token.text);
}

// This is a picture of an anthropomorphic rabbit in a space suit.

Combining Vision Language Models with Other Features

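VLMs in TGI have several advantages. One is that they can be used together with other features to accomplish more complex tasks. For example, you can combine a VLM with guided generation to produce specific JSON data from an image.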

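For instance, we can extract information from the rabbit image and generate a JSON object that contains the location, the activity, the number of animals seen, and the animals seen. The result would look like this: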

{
  "activity": "Standing",
  "animals": ["Rabbit"],
  "animals_seen": 1,
  "location": "Rocky surface with mountains in the background and a red light on the rabbit's chest"
}

All we need to do is provide a JSON schema to the VLM, and it will generate the JSON object for us.

curl localhost:3000/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
    "inputs":"![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/rabbit.png)What is this a picture of?\n\n",
    "parameters": {
        "max_new_tokens": 100,
        "seed": 42,
        "grammar": {
            "type": "json",
            "value": {
                "properties": {
                    "location": {
                        "type": "string"
                    },
                    "activity": {
                        "type": "string"
                    },
                    "animals_seen": {
                        "type": "integer",
                        "minimum": 1,
                        "maximum": 5
                    },
                    "animals": {
                        "type": "array",
                        "items": {
                            "type": "string"
                        }
                    }
                },
                "required": ["location", "activity", "animals_seen", "animals"]
            }
        }
    }
}'

# {
#   "generated_text": "{ \"activity\": \"Standing\", \"animals\": [ \"Rabbit\" ], \"animals_seen\": 1, \"location\": \"Rocky surface with mountains in the background and a red light on the rabbit's chest\" }"
# }
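
For readers working in Python, the request body above can be assembled from a plain dictionary, and the returned generated_text can be decoded and checked before use. The sketch below uses hypothetical helper names (`build_payload`, `parse_scene`), and its checks are hand-written for this one schema. It is not a full JSON Schema validator.

```python
import json

# Same schema as the "value" field of the grammar in the request above.
SCHEMA = {
    "properties": {
        "location": {"type": "string"},
        "activity": {"type": "string"},
        "animals_seen": {"type": "integer", "minimum": 1, "maximum": 5},
        "animals": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["location", "activity", "animals_seen", "animals"],
}


def build_payload(prompt: str, schema: dict, max_new_tokens: int = 100, seed: int = 42) -> dict:
    """Assemble the JSON body sent to the generate endpoint (hypothetical helper)."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "seed": seed,
            "grammar": {"type": "json", "value": schema},
        },
    }


def parse_scene(generated_text: str) -> dict:
    """Decode the model output and run a few checks specific to SCHEMA."""
    data = json.loads(generated_text)  # raises ValueError on malformed JSON
    missing = [key for key in SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    count = data["animals_seen"]
    if not isinstance(count, int) or not 1 <= count <= 5:
        raise ValueError("animals_seen must be an integer between 1 and 5")
    animals = data["animals"]
    if not isinstance(animals, list) or not all(isinstance(a, str) for a in animals):
        raise ValueError("animals must be a list of strings")
    return data
```

The payload serializes with json.dumps and can be posted with any HTTP client. Decoding defensively is a cautious choice: if generation stops at max_new_tokens before the object is closed, json.loads will raise instead of returning partial data.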

Want to learn more about how vision language models work? Check out the awesome blog post on the topic.