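Image-text-to-text

Image-text-to-text models, also known as vision language models (VLMs), are language models that take image input. These models can tackle a range of tasks, from visual question answering to image segmentation. This task shares many similarities with image-to-text, and the two overlap in use cases such as image captioning. Image-to-text models only take image inputs and usually accomplish one specific task, whereas VLMs take open-ended text and image inputs and are more general-purpose models.

In this guide, we give a brief overview of VLMs and show how to run inference with them using Transformers.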

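To begin with, there are several types of VLMs:

base models used for fine-tuning
chat fine-tuned models for conversation
instruction fine-tuned models

This guide focuses on inference with an instruction-tuned model.

Let's start by installing the dependencies.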

pip install -q transformers accelerate flash_attn

Let's initialize the model and the processor.

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

device = torch.device("cuda")
model = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to(device)

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")

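This model has a chat template that formats a conversation into the prompt layout the model expects. The model can also accept multiple images as input within a single conversation or message. We will now prepare the inputs.

The image inputs look like the following.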

from PIL import Image
import requests

img_urls =["https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png",
           "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"]
images = [Image.open(requests.get(img_urls[0], stream=True).raw),
          Image.open(requests.get(img_urls[1], stream=True).raw)]

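Below is an example of the chat template. We can feed in the earlier conversation turns together with the latest message by appending it to the end of the template.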

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "In this image we can see two cats on the nets."},
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "And how about this image?"},
        ]
    },
]
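Each `{"type": "image"}` entry acts as a placeholder that is matched, in order, with one of the images passed to the processor, so the two counts should agree. The check below is a minimal sketch of that idea in plain Python; `count_image_placeholders`, `example_messages`, and `example_images` are hypothetical names introduced for illustration and are not part of Transformers.

```python
def count_image_placeholders(messages):
    """Count {"type": "image"} entries across all turns of a chat."""
    return sum(
        1
        for message in messages
        for part in message["content"]
        if part.get("type") == "image"
    )


# Same shape as the conversation above, with the assistant text shortened.
example_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ],
    },
    {"role": "assistant", "content": [{"type": "text", "text": "Two cats."}]},
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "And how about this image?"},
        ],
    },
]

# Stand-ins for the two PIL images; only the count matters here.
example_images = ["cats.png", "bee.jpg"]

n_placeholders = count_image_placeholders(example_messages)
if n_placeholders != len(example_images):
    raise ValueError(
        f"{n_placeholders} image placeholders but {len(example_images)} images"
    )
print(n_placeholders)  # 2
```

Running a check like this before calling the processor can surface a mismatch early, rather than as a shape or indexing error deeper in preprocessing.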

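We will now call the processor's `apply_chat_template()` method to render the conversation as a prompt string, then preprocess that prompt together with the image inputs.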

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[images[0], images[1]], return_tensors="pt").to(device)

We can now pass the preprocessed inputs to the model.

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)
## ['User: What do we see in this image? \nAssistant: In this image we can see two cats on the nets. \nUser: And how about this image? \nAssistant: In this image we can see flowers, plants and insect.']
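The decoded text contains the whole conversation, not just the new reply. If only the latest answer is needed, one simple option is to split on the final `Assistant:` marker. The sketch below assumes the `User:` / `Assistant:` layout shown in the output above; `last_assistant_reply` is a hypothetical helper, not a Transformers function.

```python
def last_assistant_reply(decoded_text, marker="Assistant:"):
    """Return the text after the final occurrence of `marker`, stripped."""
    _, found, reply = decoded_text.rpartition(marker)
    # If the marker is absent, fall back to the full text.
    return reply.strip() if found else decoded_text.strip()


decoded = (
    "User: What do we see in this image? \n"
    "Assistant: In this image we can see two cats on the nets. \n"
    "User: And how about this image? \n"
    "Assistant: In this image we can see flowers, plants and insect."
)

print(last_assistant_reply(decoded))
# In this image we can see flowers, plants and insect.
```

An alternative that does not depend on the text layout is to slice the prompt tokens off `generated_ids` (using the length of `inputs["input_ids"]`) before decoding.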

Pipeline

The fastest way to get started is to use the Pipeline API. Specify the `"image-text-to-text"` task and the model you want to use.

from transformers import pipeline
pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

The example below uses a chat template to format the text input.

messages = [
     {
         "role": "user",
         "content": [
             {
                 "type": "image",
                 "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
             },
             {"type": "text", "text": "Describe this image."},
         ],
     },
     {
         "role": "assistant",
         "content": [
             {"type": "text", "text": "There's a pink flower"},
         ],
     },
 ]

Pass the chat-template-formatted text and image to the Pipeline, and set `return_full_text=False` to remove the input from the generated output.

outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
outputs[0]["generated_text"]
#  with a yellow center in the foreground. The flower is surrounded by red and white flowers with green stems
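Judging from the output above, the final assistant message appears to act as a prefix that the model continues, and with `return_full_text=False` only the continuation is returned. A small sketch of stitching the two back together, using the strings from this example (the variable names are illustrative):

```python
# Text of the last assistant turn in `messages`, used as a prefix.
prefill = "There's a pink flower"

# Continuation copied from the pipeline output shown above.
continuation = (
    " with a yellow center in the foreground. The flower is surrounded"
    " by red and white flowers with green stems"
)

# The continuation already starts with a space, so plain concatenation works.
full_reply = prefill + continuation
print(full_reply)
```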

If you prefer, you can also load the images separately and pass them to the pipeline like so:

pipe = pipeline("image-text-to-text", model="HuggingFaceTB/SmolVLM-256M-Instruct")

img_urls = [
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
]
images = [
    Image.open(requests.get(img_urls[0], stream=True).raw),
    Image.open(requests.get(img_urls[1], stream=True).raw),
]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "What do you see in these images?"},
        ],
    }
]
outputs = pipe(text=messages, images=images, max_new_tokens=50, return_full_text=False)
outputs[0]["generated_text"]
" In the first image, there are two cats sitting on a plant. In the second image, there are flowers with a pinkish hue."

The images will still be included in the `"input_text"` field of the output:

outputs[0]['input_text']
"""
[{'role': 'user',
  'content': [{'type': 'image',
    'image': <PIL.PngImagePlugin.PngImageFile image mode=RGBA size=622x412>},
   {'type': 'image',
    'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=5184x3456>},
   {'type': 'text', 'text': 'What do you see in these images?'}]}]
"""

Streaming

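We can use text streaming for a better generation experience. Transformers supports streaming with the TextStreamer or TextIteratorStreamer classes. We will use TextIteratorStreamer with IDEFICS2-8B.

Assume we have an application that keeps the chat history and takes in new user input. We will preprocess the inputs as usual and initialize TextIteratorStreamer so that generation can run in a separate thread. This lets you stream the generated text tokens in real time. Generation arguments are passed to `model.generate()` along with the streamer, while extra keyword arguments given to TextIteratorStreamer are forwarded to the tokenizer's decode call.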

import time
from transformers import TextIteratorStreamer
from threading import Thread

def model_inference(
    user_prompt,
    chat_history,
    max_new_tokens,
    images
):
    user_prompt = {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": user_prompt},
        ]
    }
    chat_history.append(user_prompt)
    streamer = TextIteratorStreamer(
        processor.tokenizer,
        skip_prompt=True,
        timeout=5.0,
    )

    generation_args = {
        "max_new_tokens": max_new_tokens,
        "streamer": streamer,
        "do_sample": False
    }

    # add_generation_prompt=True makes model generate bot response
    prompt = processor.apply_chat_template(chat_history, add_generation_prompt=True)
    inputs = processor(
        text=prompt,
        images=images,
        return_tensors="pt",
    ).to(device)
    generation_args.update(inputs)

    thread = Thread(
        target=model.generate,
        kwargs=generation_args,
    )
    thread.start()

    acc_text = ""
    for text_token in streamer:
        time.sleep(0.04)
        acc_text += text_token
        if acc_text.endswith("<end_of_utterance>"):
            acc_text = acc_text[:-18]
        yield acc_text

    thread.join()

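Now let's call the `model_inference` function we created and stream the values. Note that `messages[:2]` refers to the first two turns of the three-turn conversation defined earlier for IDEFICS2; if you ran the Pipeline examples in between, `messages` was overwritten and needs to be redefined first.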

generator = model_inference(
    user_prompt="And what is in this image?",
    chat_history=messages[:2],
    max_new_tokens=100,
    images=images
)

for value in generator:
    print(value)

# In
# In this
# In this image ...
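The streaming function above follows a producer/consumer pattern: a worker thread produces tokens while the main thread iterates over them as they arrive. The sketch below reproduces that pattern with only the standard library so it can run without a model. `fake_generate` is a hypothetical stand-in for `model.generate`, and the queue plus sentinel is a rough approximation of what a streamer does, not its actual implementation.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of generation


def fake_generate(token_queue, tokens):
    """Stand-in for model.generate: push tokens, then a stop signal."""
    for token in tokens:
        token_queue.put(token)
    token_queue.put(_SENTINEL)


def stream_text(tokens, timeout=5.0):
    """Yield the accumulated text each time a token arrives from the worker."""
    token_queue = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(token_queue, tokens))
    worker.start()

    acc_text = ""
    while True:
        # Blocks until a token is available; raises queue.Empty on timeout.
        item = token_queue.get(timeout=timeout)
        if item is _SENTINEL:
            break
        acc_text += item
        yield acc_text

    worker.join()


chunks = list(stream_text(["In", " this", " image", " ..."]))
for chunk in chunks:
    print(chunk)
# In
# In this
# In this image
# In this image ...
```

The timeout plays the same role as `timeout=5.0` in the streamer above: if the producer stalls, the consumer raises instead of waiting forever.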

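Fit models on smaller hardware

VLMs are often large and need to be optimized to fit on smaller hardware. Transformers supports many model quantization libraries; here we only show int8 quantization with Quanto. int8 quantization can reduce weight memory by up to 75% relative to 32-bit weights (if all weights are quantized). However, this is no free lunch: because 8-bit is not a CUDA-native precision, the weights are quantized back and forth on the fly, which adds latency.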
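To make the memory figure concrete, here is a back-of-the-envelope estimate of weight memory at different precisions. It assumes roughly 8 billion parameters (as the name idefics2-8b suggests) and counts weights only, ignoring activations, the KV cache, and any quantization metadata; `weight_memory_gb` is a hypothetical helper.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Approximate weight memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


num_params = 8e9  # assumption: about 8 billion parameters

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(num_params, dtype):.0f} GB")
# float32: 32 GB
# bfloat16: 16 GB
# int8: 8 GB

saving_vs_fp32 = 1 - BYTES_PER_PARAM["int8"] / BYTES_PER_PARAM["float32"]
saving_vs_bf16 = 1 - BYTES_PER_PARAM["int8"] / BYTES_PER_PARAM["bfloat16"]
print(f"{saving_vs_fp32:.0%} vs float32, {saving_vs_bf16:.0%} vs bfloat16")
# 75% vs float32, 50% vs bfloat16
```

The 75% figure therefore corresponds to a float32 baseline; compared with a model already loaded in bfloat16, the saving on weights is closer to 50%.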

First, install the dependencies.

pip install -U quanto bitsandbytes

To quantize a model while loading it, we first need to create a QuantoConfig. Then load the model as usual, but pass `quantization_config` during model initialization.

from transformers import AutoModelForImageTextToText, QuantoConfig

model_id = "HuggingFaceM4/idefics2-8b"
quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="cuda", quantization_config=quantization_config
)

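And that's it: we can use the model in the same way, with no further changes.

Further reading

Here are some more resources for the image-text-to-text task.

The image-text-to-text task page covers model types, use cases, datasets, and more.
Vision Language Models Explained is a blog post that covers everything about vision language models and supervised fine-tuning using TRL.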
