Multimodal templates

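Multimodal models expect chat templates similar to those of text-only models. The template takes a `messages` list in which each message is a dictionary with a `role` and a `content` key.

Multimodal templates are included in the Processor class and require an additional `type` key that specifies whether each piece of content is an image, video, or text.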
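
To make the expected structure concrete, here is a minimal sketch of a checker for the message layout described above. The helper name `check_messages` and the set of allowed types are assumptions made for illustration; they are not part of Transformers, and a real processor may accept more than this.

```python
# Hypothetical helper, not a Transformers API: checks the message layout
# described above (role + content, with a "type" key on each content item).
ALLOWED_TYPES = {"image", "video", "text"}  # assumed from the types named in this guide


def check_messages(messages):
    """Return a list of problems found; an empty list means the layout looks fine."""
    problems = []
    for i, message in enumerate(messages):
        if "role" not in message:
            problems.append(f"message {i}: missing 'role'")
        content = message.get("content")
        if content is None:
            problems.append(f"message {i}: missing 'content'")
            continue
        if isinstance(content, str):
            continue  # a plain string is the text-only form
        for j, item in enumerate(content):
            if item.get("type") not in ALLOWED_TYPES:
                problems.append(f"message {i}, item {j}: unexpected type {item.get('type')!r}")
    return problems


messages = [
    {"role": "user", "content": [{"type": "text", "text": "What are these?"}]},
    {"role": "user", "content": [{"kind": "image"}]},  # wrong key on purpose
]
print(check_messages(messages))
```

Returning a list of problems instead of raising on the first one makes it easier to fix a long chat history in one pass.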

This guide shows how to format chat templates for multimodal models, along with some best practices for configuring the template.

ImageTextToTextPipeline

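ImageTextToTextPipeline is a high-level image and text generation class with a "chat mode". Chat mode is enabled when a conversational model is detected and the chat prompt is properly formatted.

Start by building a chat history with the following two roles.

`system` describes how the model should behave and respond when you are chatting with it. Not all chat models support this role.
`user` is where you enter your first message to the model.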

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What are these?"},
        ],
    },
]

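Create an ImageTextToTextPipeline and pass the chat to it. For large models, setting `device_map="auto"` helps load the model more quickly and automatically places it on the fastest device available. Changing the data type to `torch.bfloat16` also helps save memory. The example below loads a small model directly on a CUDA device in `torch.float16`.

ImageTextToTextPipeline accepts chats in the OpenAI format to make inference easier and more accessible.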

import torch
from transformers import pipeline

# Name the instance `pipe` so it does not shadow the imported `pipeline` factory.
pipe = pipeline("image-text-to-text", model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf", device="cuda", torch_dtype=torch.float16)
pipe(text=messages, max_new_tokens=50, return_full_text=False)
[{'input_text': [{'role': 'system',
    'content': [{'type': 'text',
      'text': 'You are a friendly chatbot who always responds in the style of a pirate'}]},
   {'role': 'user',
    'content': [{'type': 'image',
      'url': 'http://images.cocodataset.org/val2017/000000039769.jpg'},
     {'type': 'text', 'text': 'What are these?'}]}],
  'generated_text': 'The image shows two cats lying on a pink surface, which appears to be a cushion or a soft blanket. The cat on the left has a striped coat, typical of tabby cats, and is lying on its side with its head resting on the'}]

Image inputs

For multimodal models that accept images, such as LLaVA, include the following in `content`.

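The content `"type"` can be `"image"` or `"text"`.
For images, provide a link to the image (`"url"`), a file path (`"path"`), or `"base64"`. Images are automatically loaded, processed, and prepared into pixel values as inputs to the model.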
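
As a minimal sketch of the `"base64"` option, the snippet below encodes raw image bytes with the standard library and places the result in a content entry. The helper name `image_entry_from_bytes` is hypothetical, and the assumption that the processor expects the plain base64 text of the file under the `"base64"` key should be checked against your Transformers version.

```python
import base64


def image_entry_from_bytes(image_bytes):
    """Build an image content entry from raw bytes (hypothetical helper)."""
    # Assumption: the value under "base64" is the plain base64 text of the file.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image", "base64": encoded}


# Stand-in bytes for illustration; in practice, read the bytes of a real image file.
fake_image = b"\x89PNG\r\n\x1a\n"
entry = image_entry_from_bytes(fake_image)

message = {
    "role": "user",
    "content": [entry, {"type": "text", "text": "What are these?"}],
}
print(entry["type"], len(entry["base64"]))
```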

from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model = LlavaOnevisionForConditionalGeneration.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "What are these?"},
        ],
    },
]

Pass `messages` to apply_chat_template() to tokenize the input content and return the `input_ids` and `pixel_values`.

processed_chat = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
print(processed_chat.keys())

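These inputs are now ready to be used in generate().

Video inputs

Some vision models also support video inputs. The message format is very similar to the format for image inputs.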

The content `"type"` should be `"video"` to indicate that the content is a video.
For videos, provide a link to the video (`"url"`) or a file path (`"path"`). Videos loaded from a URL can only be decoded with PyAV or Decord.
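
The choice between `"url"` and `"path"` can be sketched with a small helper that inspects the source string. The helper name `video_entry` and the local file name are hypothetical, and treating only `http` and `https` sources as URLs is an assumption made for this illustration.

```python
from urllib.parse import urlparse


def video_entry(source):
    """Build a video content entry, picking "url" or "path" (hypothetical helper)."""
    scheme = urlparse(source).scheme
    # Assumption: anything that is not an http(s) link is treated as a file path.
    key = "url" if scheme in ("http", "https") else "path"
    return {"type": "video", key: source}


remote = video_entry("https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4")
local = video_entry("clips/demo.mp4")  # hypothetical local file

message = {
    "role": "user",
    "content": [remote, {"type": "text", "text": "What do you see in this video?"}],
}
print(sorted(remote), sorted(local))
```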

from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
model = LlavaOnevisionForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "url": "https://test-videos.co.uk/vids/bigbuckbunny/mp4/h264/720/Big_Buck_Bunny_720_10s_10MB.mp4"},
            {"type": "text", "text": "What do you see in this video?"},
        ],
    },
]

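Pass `messages` to apply_chat_template() to tokenize the input content. apply_chat_template() also takes a few extra parameters that control the sampling process.

The `video_load_backend` parameter refers to the specific framework used to load the video. It supports PyAV, Decord, OpenCV, and torchvision.

The examples below use Decord as the backend because it is a bit faster than PyAV.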

Sampling variants: fixed number of frames, fps, list of image frames.
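
As a rough illustration of what sampling a fixed number of frames means compared with sampling at a target fps, the following sketch computes frame indices in plain Python. It does not call Transformers or any video backend. The function names and the rounding rule are assumptions made for illustration, and the library's own sampling may differ.

```python
def uniform_frame_indices(total_frames, num_frames):
    """Pick `num_frames` indices spread evenly over a clip of `total_frames` frames."""
    if num_frames >= total_frames:
        return list(range(total_frames))  # clip is shorter than the request: keep everything
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


def fps_frame_indices(total_frames, source_fps, target_fps):
    """Pick indices so that roughly `target_fps` frames are kept per second."""
    stride = source_fps / target_fps  # number of source frames per kept frame
    count = int(total_frames / stride)
    return [int(i * stride) for i in range(count)]


print(uniform_frame_indices(10, 4))       # [0, 3, 6, 9]
print(fps_frame_indices(300, 30.0, 1.0))  # every 30th frame of a 10-second clip
```

A fixed frame count keeps the input size constant regardless of clip length, while an fps target keeps the temporal spacing constant and lets the number of frames grow with the clip.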

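Template configuration

You can create a custom chat template with Jinja and set it with apply_chat_template(). Refer to the template writing guide for more details.

For example, to let a template handle a *list of content* from multiple modalities while still supporting plain strings for text-only inference, specify how to handle `content['type']` when it is an image or text, as in the Llama 3.2 Vision Instruct template below.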

{% for message in messages %}
{% if loop.index0 == 0 %}{{ bos_token }}{% endif %}
{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' }}
{% if message['content'] is string %}
{{ message['content'] }}
{% else %}
{% for content in message['content'] %}
{% if content['type'] == 'image' %}
{{ '<|image|>' }}
{% elif content['type'] == 'text' %}
{{ content['text'] }}
{% endif %}
{% endfor %}
{% endif %}
{{ '<|eot_id|>' }}
{% endfor %}
{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}
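
To see the branching of the template above without a Jinja environment, here is a plain-Python mirror of its logic. This is only a sketch: the function name `render_chat` is hypothetical, the default `bos_token` value is an assumption, and the whitespace that the Jinja source would emit between blocks is ignored. Transformers renders the actual Jinja template, not this function.

```python
def render_chat(messages, bos_token="<|begin_of_text|>", add_generation_prompt=False):
    """Plain-Python mirror of the template's branching (hypothetical helper)."""
    parts = []
    for index, message in enumerate(messages):
        if index == 0:
            parts.append(bos_token)  # emitted once, before the first message
        parts.append("<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n")
        content = message["content"]
        if isinstance(content, str):
            parts.append(content)  # text-only form: content is a plain string
        else:
            for item in content:  # multimodal form: content is a list of typed items
                if item["type"] == "image":
                    parts.append("<|image|>")
                elif item["type"] == "text":
                    parts.append(item["text"])
        parts.append("<|eot_id|>")
    if add_generation_prompt:
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


text_only = [{"role": "user", "content": "Hi"}]
multimodal = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What are these?"}]}
]
print(render_chat(text_only, bos_token="<s>"))
print(render_chat(multimodal, bos_token="<s>", add_generation_prompt=True))
```

Both call styles go through the same function, which is the point of the `is string` check in the template: text-only callers do not need to wrap their content in a list.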
