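Fine-tuning Granite Vision 3.1 2B with TRL

Author: Eli Schwartz

Adapted from a notebook by Sergio Paniego

This tutorial walks you through fine-tuning IBM's Granite Vision 3.1 2B model, a lightweight yet capable model obtained by fine-tuning a Granite language model on combined image and text modalities. We'll use the Hugging Face ecosystem, in particular the Transformer Reinforcement Learning (TRL) library. By the end of this step-by-step guide, you'll be able to fine-tune Granite Vision for a specific task on a consumer GPU.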

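🌟 Model & Dataset Overview

In this notebook, we'll fine-tune and evaluate the Granite Vision model on the Geoperception dataset, which contains tasks the model was not originally trained on. Granite Vision is a high-performing, memory-efficient model, which makes it a good candidate for fine-tuning on new tasks. The Geoperception dataset provides images of geometric diagrams collected from high school textbooks, paired with question-answer pairs.

This notebook was tested on an A100 GPU.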

1. Install Dependencies

Let's start by installing the libraries needed for fine-tuning! 🚀

!pip install -q git+https://github.com/huggingface/transformers.git
!pip install  -U -q trl datasets bitsandbytes peft accelerate
# Tested with transformers==4.49.0.dev0, trl==0.14.0, datasets==3.2.0, bitsandbytes==0.45.2, peft==0.14.0, accelerate==1.3.0

>>> !pip install -q flash-attn --no-build-isolation

>>> try:
...     import flash_attn
...     print("FlashAttention is installed")
...     USE_FLASH_ATTENTION = True
... except ImportError:
...     print("FlashAttention is not installed")
...     USE_FLASH_ATTENTION = False

FlashAttention is not installed

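2. Load the Dataset 📁

We'll load the Geoperception dataset, which provides images of geometric diagrams collected from popular high school textbooks, paired with question-answer pairs.

We'll use the original system prompt that was used during the model's training.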

system_message = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions."

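For educational purposes, we'll train and evaluate only on the line length comparison task, identified by the value "LineComparison" in the dataset's "predicate" field.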

from datasets import load_dataset

dataset_id = "euclid-multimodal/Geoperception"
dataset = load_dataset(dataset_id)
dataset_LineComparison = dataset["train"].filter(lambda x: x["predicate"] == "LineComparison")
train_test = dataset_LineComparison.train_test_split(test_size=0.5, seed=42)

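Let's look at the structure of the dataset. It includes an image, a question, an answer, and the "predicate" we used to filter the dataset.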

train_test
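Before filtering on a single task, it can help to see how many examples each "predicate" value has. The following is a minimal sketch on toy data; the helper name `count_labels` and the label "OtherTask" are placeholders introduced here for illustration, not names from the dataset.

```python
from collections import Counter


def count_labels(labels):
    """Count how many examples carry each task label."""
    return Counter(labels)


# Toy labels standing in for the dataset's "predicate" column.
# "OtherTask" is a placeholder, not a claim about the real dataset.
toy_labels = ["LineComparison", "LineComparison", "OtherTask"]
label_counts = count_labels(toy_labels)
print(label_counts.most_common())
```

With the real dataset, something like `count_labels(dataset["train"]["predicate"])` should give the per-task counts, assuming the column can be read as a plain sequence of strings.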

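We'll format the dataset into a chatbot structure, where each interaction consists of a system message, the image, the user's query, and the answer.

💡 For more tips on running inference with this model, check out the model card.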

def format_data(sample):
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_message}],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": sample["image"],
                },
                {
                    "type": "text",
                    "text": sample["question"],
                },
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": sample["answer"]}],
        },
    ]

Now, let's format the data using the chatbot structure. This sets up the interactions for the model.

train_dataset = [format_data(x) for x in train_test["train"]]
test_dataset = [format_data(x) for x in train_test["test"]]

train_dataset[200]
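Since the rest of the notebook indexes into this structure by position (for example `sample[1]["content"][0]["image"]`), a quick structural check can catch formatting mistakes early. The sketch below is an illustration only: `validate_chat_sample` is a hypothetical helper, and the toy sample uses a string in place of a real image.

```python
EXPECTED_ROLES = ["system", "user", "assistant"]


def validate_chat_sample(messages):
    """Return a list of problems found in one formatted sample (empty if none)."""
    problems = []
    roles = [message.get("role") for message in messages]
    if roles != EXPECTED_ROLES:
        # Stop early: the positional checks below assume the expected layout.
        return [f"unexpected roles: {roles}"]
    user_types = [part.get("type") for part in messages[1]["content"]]
    if user_types != ["image", "text"]:
        problems.append(f"unexpected user content types: {user_types}")
    for message in messages:
        for part in message["content"]:
            if part.get("type") == "text" and not part.get("text"):
                problems.append(f"empty text in {message['role']} turn")
    return problems


# Toy sample mirroring the output of format_data, with a placeholder image.
toy_sample = [
    {"role": "system", "content": [{"type": "text", "text": "You are helpful."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "<image placeholder>"},
            {"type": "text", "text": "Which line is longer, AB or CD?"},
        ],
    },
    {"role": "assistant", "content": [{"type": "text", "text": "AB"}]},
]
good_report = validate_chat_sample(toy_sample)
bad_report = validate_chat_sample(toy_sample[:2])  # assistant turn missing
print(good_report, bad_report)
```

Running a check like this over `train_dataset` and `test_dataset` before training would flag samples with a missing answer or a reordered content list.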

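3. Load the Model and Check Performance! 🤔

Now that the dataset is loaded, it's time to load IBM's Granite Vision model, a 2B-parameter vision language model (VLM) that offers state-of-the-art (SOTA) performance while remaining efficient in memory usage.

For a broader comparison of state-of-the-art VLMs, explore the WildVision Arena and the OpenVLM Leaderboard, which list the best-performing models across a range of benchmarks.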

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "ibm-granite/granite-vision-3.1-2b-preview"

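Next, we'll load the model and its processor (which bundles the tokenizer and the image processor) to prepare for inference.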

model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if USE_FLASH_ATTENTION else None,
)

processor = AutoProcessor.from_pretrained(model_id)

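To evaluate the model's performance, we'll use a sample from the dataset. First, let's inspect the internal structure of this sample to understand how the data is organized.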

test_idx = 20
sample = test_dataset[test_idx]
sample

Now, let's look at the image that corresponds to the sample. Can you answer the query based on the visual information?

>>> sample[1]["content"][0]["image"]

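Let's create a function that takes the model, the processor, and a sample as inputs and generates the model's answer. This streamlines inference and makes it easy to evaluate the VLM's performance.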

def generate_text_from_sample(model, processor, sample, max_new_tokens=100, device="cuda"):
    # Prepare the text input by applying the chat template
    text_input = processor.apply_chat_template(
        sample[:2], add_generation_prompt=True  # Use the sample without the assistant response
    )

    image_inputs = []
    image = sample[1]["content"][0]["image"]
    if image.mode != "RGB":
        image = image.convert("RGB")
    image_inputs.append([image])

    # Prepare the inputs for the model
    model_inputs = processor(
        # text=[text_input],
        text=text_input,
        images=image_inputs,
        return_tensors="pt",
    ).to(
        device
    )  # Move inputs to the specified device

    # Generate text with the model
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)

    # Trim the generated ids to remove the input ids
    trimmed_generated_ids = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)]

    # Decode the output text
    output_text = processor.batch_decode(
        trimmed_generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    return output_text[0]  # Return the first decoded output text
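The trimming step in the function above relies on the generated sequence starting with the prompt tokens, so slicing off `len(in_ids)` leaves only the newly generated tokens. A minimal sketch of that step on plain Python lists, with made-up token ids and a hypothetical helper name:

```python
def trim_prompt_tokens(input_ids_batch, generated_ids_batch):
    """Drop the leading prompt tokens from each generated sequence."""
    return [
        generated[len(prompt):]
        for prompt, generated in zip(input_ids_batch, generated_ids_batch)
    ]


# Made-up token ids: the generated sequence repeats the prompt, then continues.
toy_prompts = [[101, 7, 8, 9]]
toy_generated = [[101, 7, 8, 9, 42, 43, 2]]
new_tokens = trim_prompt_tokens(toy_prompts, toy_generated)
print(new_tokens)
```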

output = generate_text_from_sample(model, processor, sample)
output

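It seems the model cannot compare the lengths of lines that are not explicitly specified. To improve its performance, we can fine-tune it on more relevant data so that it better understands the context and gives more accurate responses.

Remove the Model and Clean the GPU

Before moving on to training in the next section, let's clear the current variables and clean the GPU to free up resources.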

>>> import gc
>>> import time


>>> def clear_memory():
...     # Delete variables if they exist in the current global scope
...     if "inputs" in globals():
...         del globals()["inputs"]
...     if "model" in globals():
...         del globals()["model"]
...     if "processor" in globals():
...         del globals()["processor"]
...     if "trainer" in globals():
...         del globals()["trainer"]
...     if "peft_model" in globals():
...         del globals()["peft_model"]
...     if "bnb_config" in globals():
...         del globals()["bnb_config"]
...     time.sleep(2)

...     # Garbage collection and clearing CUDA memory
...     gc.collect()
...     time.sleep(2)
...     torch.cuda.empty_cache()
...     torch.cuda.synchronize()
...     time.sleep(2)
...     gc.collect()
...     time.sleep(2)

...     print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
...     print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")


>>> clear_memory()

GPU allocated memory: 0.01 GB
GPU reserved memory: 0.02 GB

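4. Fine-Tune the Model with TRL

4.1 Load the Quantized Model for Training ⚙️

Next, we'll load the quantized model using bitsandbytes. If you want to learn more about quantization, check out this blog post or this one.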

from transformers import BitsAndBytesConfig

USE_QLORA = True
USE_LORA = True

if USE_QLORA:
    # BitsAndBytesConfig int-4 config
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        llm_int8_skip_modules=["vision_tower", "lm_head"],  # Skip problematic modules
        llm_int8_enable_fp32_cpu_offload=True,
    )
else:
    bnb_config = None

# Load the model and its processor
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    _attn_implementation="flash_attention_2" if USE_FLASH_ATTENTION else None,
)
processor = AutoProcessor.from_pretrained(model_id)
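To get a rough feel for what 4-bit loading saves, the weight memory can be estimated from the parameter count and the bits per parameter. This is a back-of-the-envelope sketch only: it assumes a nominal 2 billion parameters, and it ignores the modules skipped from quantization, the quantization constants, activations, and optimizer state.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory needed to store the weights, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 2_000_000_000  # assumption: the nominal "2B" in the model name

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)  # 16 bits per weight
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)    # 4 bits per weight
print(f"bf16 weights:  {bf16_gib:.2f} GiB")
print(f"4-bit weights: {nf4_gib:.2f} GiB")
```

The real footprint will be higher than the 4-bit figure, since `vision_tower` and `lm_head` are kept out of quantization in the configuration above.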

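4.2 Set Up QLoRA and SFTConfig 🚀

Next, we'll configure QLoRA for our training setup. QLoRA enables memory-efficient fine-tuning of large models. Whereas standard LoRA trains low-rank adapter matrices on top of a frozen base model kept in 16-bit precision, QLoRA additionally stores that frozen base model in 4-bit quantized form (as loaded above), which lowers memory usage further while the adapters themselves are still trained in higher precision.

To improve efficiency further, a QLoRA setup can also use a paged optimizer or an 8-bit optimizer, which reduces the optimizer's memory footprint.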

if USE_LORA:
    from peft import LoraConfig, get_peft_model

    # Configure LoRA
    peft_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules=[name for name, _ in model.named_modules() if "language_model" in name and "_proj" in name],
        use_dora=True,
        init_lora_weights="gaussian",
    )

    # Apply PEFT model adaptation
    # model = get_peft_model(model, peft_config)
    model.add_adapter(peft_config)
    model.enable_adapters()
    model = get_peft_model(model, peft_config)

    # Print trainable parameters
    model.print_trainable_parameters()

else:
    peft_config = None
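To see why a rank-8 adapter is cheap to train, it helps to count parameters: a LoRA adapter on a linear layer with `in_features` inputs and `out_features` outputs adds two matrices of shapes `r × in_features` and `out_features × r`. The sketch below uses a hypothetical 2048 × 2048 projection (not a dimension taken from Granite Vision) and ignores the extra per-layer magnitude parameter that `use_dora=True` adds.

```python
def lora_param_count(in_features, out_features, r):
    """Trainable parameters added by one LoRA adapter (matrices A and B)."""
    return r * (in_features + out_features)


def lora_scaling(lora_alpha, r):
    """Default LoRA scaling factor applied to the low-rank update."""
    return lora_alpha / r


# Hypothetical layer size, chosen only for illustration.
IN_FEATURES, OUT_FEATURES = 2048, 2048

full_params = IN_FEATURES * OUT_FEATURES
adapter_params = lora_param_count(IN_FEATURES, OUT_FEATURES, r=8)
scaling = lora_scaling(lora_alpha=8, r=8)
print(f"full layer: {full_params}, adapter: {adapter_params}")
print(f"trainable fraction: {adapter_params / full_params:.2%}, scaling: {scaling}")
```

For the actual model, `model.print_trainable_parameters()` in the cell above reports the real totals.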

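We'll use Supervised Fine-Tuning (SFT) to improve the model's performance on our specific task. To do so, we define the training arguments with the SFTConfig class from the TRL library. SFT uses labeled data, here the question-answer pairs, to teach the model to produce more accurate responses for the task, improving its ability to understand and answer visual queries.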

from trl import SFTConfig

# Configure training arguments using SFTConfig
training_args = SFTConfig(
    output_dir="./checkpoints/geoperception",
    num_train_epochs=1,
    # max_steps=30,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    warmup_steps=10,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=10,
    save_strategy="steps",
    save_steps=20,
    save_total_limit=1,
    optim="adamw_torch_fused",
    bf16=True,
    push_to_hub=False,
    report_to="none",
    remove_unused_columns=False,
    gradient_checkpointing=True,
    dataset_text_field="",
    dataset_kwargs={"skip_prepare_dataset": True},
)
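The batch-related settings above combine into an effective batch size, which in turn determines roughly how many optimizer steps one epoch takes. A small sketch of that arithmetic follows; the example count of 500 is hypothetical, and the exact rounding of steps per epoch may differ between library versions.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples that contribute to a single optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def approx_optimizer_steps(num_examples, batch_size, epochs=1):
    """Rough number of optimizer steps (rounding is an approximation)."""
    return math.ceil(num_examples / batch_size) * epochs


eff_batch = effective_batch_size(per_device_batch=8, grad_accum_steps=2)
approx_steps = approx_optimizer_steps(num_examples=500, batch_size=eff_batch)
print(f"effective batch size: {eff_batch}, approx. steps per epoch: {approx_steps}")
```

Comparing the step estimate with `warmup_steps=10`, `logging_steps=10`, and `save_steps=20` gives a sense of how often logging and checkpointing will fire during a run.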

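4.3 Train the Model 🏃

To make sure the data is structured correctly for the model during training, we need to define a collator function. It formats and batches the dataset inputs, and masks the labels so that the loss is computed only on the assistant's answer.

👉 For more details, check out the official TRL example scripts.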

def collate_fn(examples):
    texts = [processor.apply_chat_template(example, tokenize=False) for example in examples]

    image_inputs = []
    for example in examples:
        image = example[1]["content"][0]["image"]
        if image.mode != "RGB":
            image = image.convert("RGB")
        image_inputs.append([image])

    batch = processor(text=texts, images=image_inputs, return_tensors="pt", padding=True)

    labels = batch["input_ids"].clone()
    assistant_tokens = processor.tokenizer("<|assistant|>", return_tensors="pt")["input_ids"][0]
    eos_token = processor.tokenizer("<|end_of_text|>", return_tensors="pt")["input_ids"][0]

    for i in range(batch["input_ids"].shape[0]):
        apply_loss = False
        for j in range(batch["input_ids"].shape[1]):
            if not apply_loss:
                labels[i][j] = -100
            if (j >= len(assistant_tokens) + 1) and torch.all(
                batch["input_ids"][i][j + 1 - len(assistant_tokens) : j + 1] == assistant_tokens
            ):
                apply_loss = True
            if batch["input_ids"][i][j] == eos_token:
                apply_loss = False

    batch["labels"] = labels

    return batch
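The label-masking loop in `collate_fn` is easier to follow on a toy sequence. The sketch below reimplements the same idea on plain Python lists with made-up token ids: everything up to and including the assistant marker is set to -100 (ignored by the loss), the answer tokens and the end-of-text token keep their ids, and anything after end-of-text (such as padding) is masked again. It is a simplified illustration, not a drop-in replacement for the tensor version.

```python
IGNORE_INDEX = -100  # label value that the loss ignores


def mask_non_assistant_tokens(input_ids, assistant_marker, eos_id):
    """Keep labels only for tokens after the assistant marker, through EOS."""
    labels = list(input_ids)
    marker_len = len(assistant_marker)
    apply_loss = False
    for j, token in enumerate(input_ids):
        if not apply_loss:
            labels[j] = IGNORE_INDEX
        # Turn the loss on once the window ending at j matches the marker.
        window = input_ids[j + 1 - marker_len : j + 1]
        if j + 1 >= marker_len and window == assistant_marker:
            apply_loss = True
        # Turn it off again after end-of-text (the EOS label itself is kept).
        if token == eos_id:
            apply_loss = False
    return labels


# Made-up ids: prompt [5, 6, 7], marker [90, 91], answer [11, 12], EOS 2, padding 0.
toy_ids = [5, 6, 7, 90, 91, 11, 12, 2, 0, 0]
toy_labels = mask_non_assistant_tokens(toy_ids, assistant_marker=[90, 91], eos_id=2)
print(toy_labels)
```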

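Now, we'll define the SFTTrainer, a wrapper around the transformers.Trainer class that inherits its attributes and methods. When given a PeftConfig object, this trainer simplifies fine-tuning by properly initializing the PeftModel. Using SFTTrainer lets us manage the training workflow efficiently and keeps the fine-tuning of our vision language model smooth.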

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=collate_fn,
    peft_config=peft_config,
    processing_class=processor.tokenizer,
)

Time to train the model! 🎉

trainer.train()

Let's save the results 💾

trainer.save_model(training_args.output_dir)

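5. Test the Fine-Tuned Model 🔍

Now that our vision language model (VLM) has been fine-tuned, it's time to evaluate its performance! In this section, we'll test the model on an example from the held-out Geoperception test split to see how well it answers questions about geometric diagrams. Let's dive into the results! 🚀

Let's clean up the GPU memory to ensure optimal performance 🧹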

>>> clear_memory()

GPU allocated memory: 0.02 GB
GPU reserved memory: 0.19 GB

We'll reload the model using the same pipeline as before, this time from the checkpoint directory saved during training.

model = AutoModelForVision2Seq.from_pretrained(
    training_args.output_dir,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if USE_FLASH_ATTENTION else None,
)

processor = AutoProcessor.from_pretrained(model_id)

If LoRA adapters were used, we'll load them on top of the model.

if USE_LORA:
    from peft import PeftModel

    model = PeftModel.from_pretrained(model, training_args.output_dir)

Let's evaluate the model on an unseen sample.

test_idx = 20
sample = test_dataset[test_idx]
sample[1:]

>>> sample[1]["content"][0]["image"]

output = generate_text_from_sample(model, processor, sample)
output

🎉✨ The model has successfully learned to respond to the queries specified in the dataset. We've achieved our goal! 🎉✨
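A single sample is only anecdotal evidence, so a natural next step is to score the model over the whole test split. The following is a minimal sketch of an exact-match metric; the helper names and the normalization rules (lowercasing, trimming whitespace and a trailing period) are assumptions made here, and exact match may be too strict if the model phrases its answers differently from the references.

```python
def normalize_answer(text):
    """Lowercase, trim, drop a trailing period, and collapse whitespace."""
    return " ".join(text.strip().lower().rstrip(".").split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(
        normalize_answer(pred) == normalize_answer(ref)
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)


# Toy predictions and references, not real model outputs.
toy_predictions = ["AB", "cd.", "EF"]
toy_references = ["ab", "CD", "GH"]
toy_accuracy = exact_match_accuracy(toy_predictions, toy_references)
print(f"exact-match accuracy: {toy_accuracy:.2f}")
```

In this notebook, the predictions could be collected with `generate_text_from_sample(model, processor, s)` for each `s` in `test_dataset`, and the references read from `s[2]["content"][0]["text"]`.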
