Structured Generation from Images or Documents Using Vision Language Models

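We will use the SmolVLM-Instruct model from HuggingFaceTB to extract structured information from documents. We will run the VLM with the Hugging Face Transformers library and the Outlines library, which enables structured generation by constraining token sampling probabilities.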

This approach is based on an Outlines tutorial.
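To make "constraining token sampling probabilities" concrete, here is a minimal toy sketch. It is not how Outlines is implemented internally; the function name `constrained_probs`, the token vocabulary, and the scores are all made up for illustration. The idea it shows is that tokens which would break the required structure have their scores masked out before the softmax, so they can never be sampled.

```python
import math


def constrained_probs(logits, allowed):
    """Softmax over `logits` after masking every token not in `allowed`."""
    if not any(token in allowed for token in logits):
        raise ValueError("no allowed token is present in the vocabulary")
    # Disallowed tokens get a score of -inf, which becomes probability 0.
    masked = {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }
    peak = max(masked.values())  # subtract the max for numerical stability
    weights = {token: math.exp(score - peak) for token, score in masked.items()}
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}


# Hypothetical next-token scores after the model has emitted '{"quality": "'.
# Unconstrained, the model would prefer "maybe", which is not a valid tag.
logits = {"good": 1.0, "okay": 0.5, "bad": 0.2, "maybe": 2.0, "}": 1.5}
probs = constrained_probs(logits, allowed={"good", "okay", "bad"})
best = max(probs, key=probs.get)
print(best, probs)
```

In a real library the set of allowed tokens is recomputed at every generation step from the target schema, rather than being passed in by hand as it is here.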

Dependencies and imports

First, let's install the necessary libraries.

%pip install accelerate outlines transformers torch flash-attn datasets sentencepiece

Next, import the necessary libraries.

import outlines
import torch

from datasets import load_dataset
from outlines.models.transformers_vision import transformers_vision
from transformers import AutoModelForImageTextToText, AutoProcessor
from pydantic import BaseModel

Initializing the model

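We will initialize the model from HuggingFaceTB/SmolVLM-Instruct. Outlines expects us to pass in a model class and a processor class, so we make this example more general by writing a function that returns both. Alternatively, you can look up the model and tokenizer configuration in the Hub repo files and import those classes directly.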

model_name = "HuggingFaceTB/SmolVLM-Instruct"


def get_model_and_processor_class(model_name: str):
    model = AutoModelForImageTextToText.from_pretrained(model_name)
    processor = AutoProcessor.from_pretrained(model_name)
    classes = model.__class__, processor.__class__
    del model, processor
    return classes


model_class, processor_class = get_model_and_processor_class(model_name)

if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

model = transformers_vision(
    model_name,
    model_class=model_class,
    device=device,
    model_kwargs={"torch_dtype": torch.bfloat16, "device_map": "auto"},
    processor_kwargs={"device": device},
    processor_class=processor_class,
)

Structured generation

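Now we will define the output structure for our model. We will use openbmb/RLAIF-V-Dataset, which contains a set of images along with questions and their chosen and rejected responses. This is a decent dataset, but we want to create additional image-to-text data on top of the images to obtain our own structured dataset, and potentially fine-tune our model on it. We will use the model to generate a caption, a question, and a simple quality tag for each image.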

class ImageData(BaseModel):
    quality: str
    description: str
    question: str


structured_generator = outlines.generate.json(model, ImageData)
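Because `quality` is declared as a plain `str`, the schema itself does not restrict it to the three tags named in the prompt. One possible tightening, sketched below, is to declare the field as a `Literal`. The class name `StrictImageData` and the sample JSON strings are hypothetical, and the sketch assumes Pydantic v2 (`model_json_schema`, `model_validate_json`). Whether the generator then restricts sampling to exactly these strings depends on the JSON-schema support of the Outlines version in use.

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class StrictImageData(BaseModel):
    # Only these three tags are valid; anything else fails validation.
    quality: Literal["good", "okay", "bad"]
    description: str
    question: str


# The JSON schema is what a structured generator would be built from.
schema = StrictImageData.model_json_schema()
print(schema["properties"]["quality"])

# A well-formed response parses into a typed object.
record = StrictImageData.model_validate_json(
    '{"quality": "good", "description": "A dog on a beach.", '
    '"question": "What animal is shown?"}'
)

# A response with an unknown tag is rejected.
try:
    StrictImageData.model_validate_json(
        '{"quality": "great", "description": "A dog.", "question": "What is it?"}'
    )
    rejected = False
except ValidationError:
    rejected = True
```

Under these assumptions, passing `StrictImageData` instead of `ImageData` to `outlines.generate.json` would be the only change needed in the cell above.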

Now, let's come up with an extraction prompt.

prompt = """
You are an image analysis assistant.

Provide a quality tag, a description and a question.

The quality can either be "good", "okay" or "bad".
The question should be concise and objective.

Return your response as a valid JSON object.
""".strip()

Let's load the image dataset.

dataset = load_dataset("openbmb/RLAIF-V-Dataset", split="train[:10]")
dataset

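Now, let's define a function that extracts structured information from an image. We format the prompt with the `apply_chat_template` method and then pass it to the model together with the image.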

def extract(row):
    messages = [
        {
            "role": "user",
            "content": [{"type": "image"}, {"type": "text", "text": prompt}],
        },
    ]

    formatted_prompt = model.processor.apply_chat_template(messages, add_generation_prompt=True)

    result = structured_generator(formatted_prompt, [row["image"]])
    row["synthetic_question"] = result.question
    row["synthetic_description"] = result.description
    row["synthetic_quality"] = result.quality
    return row


dataset = dataset.map(extract)
dataset
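On larger slices of the dataset, a single failed generation would abort the whole `map` call. The sketch below shows one way to guard against that; the helper `with_fallback`, the `FALLBACK` values, the `extraction_error` column, and the stand-in `fake_extract` function are all hypothetical and not part of the original notebook. Every row gets the same set of keys, so the resulting columns stay consistent.

```python
def with_fallback(extract_fn, fallback):
    """Wrap `extract_fn` so that a failing row yields `fallback` values."""

    def wrapped(row):
        try:
            out = dict(extract_fn(row))
            out["extraction_error"] = ""  # empty string marks success
        except Exception as error:
            # Keep the original columns and record why extraction failed.
            out = {**row, **fallback, "extraction_error": str(error)}
        return out

    return wrapped


FALLBACK = {
    "synthetic_question": "",
    "synthetic_description": "",
    "synthetic_quality": "",
}


# Stand-in for the real extract() so the sketch runs without a model.
def fake_extract(row):
    if row["image"] is None:
        raise ValueError("missing image")
    return {
        **row,
        "synthetic_question": "What is shown?",
        "synthetic_description": "A placeholder description.",
        "synthetic_quality": "okay",
    }


safe_extract = with_fallback(fake_extract, FALLBACK)
ok_row = safe_extract({"image": "placeholder-image"})
failed_row = safe_extract({"image": None})
```

With the real function, the call would become `dataset.map(with_fallback(extract, FALLBACK))`, and rows with a non-empty `extraction_error` can be filtered out afterwards.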

Now, let's push the new dataset to the Hub.

dataset.push_to_hub(
    "davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset", split="train"
)

The results are not perfect, but they are a good starting point for exploring different models and prompts!

Conclusion

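We have seen how to use a vision language model to extract structured information from images. The same approach applies to documents: for example, convert a PDF into images with `pdf2image` and run the extraction on each page image.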

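from pdf2image import convert_from_path

pdf_path = "path/to/your/pdf/file.pdf"
pages = convert_from_path(pdf_path)  # one PIL image per PDF page
# Reuse extract() from above, which expects a dict with an "image" key.
page_results = [extract({"image": page}) for page in pages]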

Next steps

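Check out the Outlines library for more information on how to use it, and explore its different methods and parameters.
Explore extraction for your own use case with your own model.
Try a different method of extracting structured information from documents.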
