Fine-tuning SmolVLM with TRL on a consumer GPU

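Authored by: Sergio Paniego

In this recipe, we'll demonstrate how to fine-tune a Vision Language Model (VLM) within the Hugging Face ecosystem, using the Transformer Reinforcement Learning library (TRL). This step-by-step guide will enable you to customize a VLM for a specific task, even on a consumer GPU.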

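🌟 Model & Dataset Overview

In this notebook, we will fine-tune the SmolVLM model on the ChartQA dataset. SmolVLM is a performant, memory-efficient model, which makes it a good fit for this task. The ChartQA dataset contains images of various chart types paired with question-answer pairs, offering a valuable resource for improving the model's visual question answering (VQA) capabilities. These skills matter in practical applications such as data analysis, business intelligence, and educational tools.

💡 Note: The instruct model we are fine-tuning has already been trained on this dataset, so it is familiar with the data. Even so, the exercise remains a valuable way to learn fine-tuning techniques. For the full list of datasets used to train this model, check out this document.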

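📖 Additional Resources

Expand your knowledge of Vision Language Models and related tools with these resources:

Multimodal Recipes in the Cookbook: Explore practical recipes for multimodal models, including Retrieval-Augmented Generation (RAG) pipelines and fine-tuning. There is already a recipe for fine-tuning a VLM with TRL; refer to it for more details.
TRL Community Tutorials: A collection of tutorials that deepen your understanding of TRL and its applications.

With these resources, you'll be equipped to explore VLMs in more depth and push their limits!

This notebook was tested on an L4 GPU.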

[Figure: Smol VLMs comparison]

1. Install Dependencies

Let's start by installing the essential libraries we'll need for fine-tuning! 🚀

!pip install  -U -q transformers trl datasets bitsandbytes peft accelerate
# Tested with transformers==4.53.0.dev0, trl==0.20.0.dev0, datasets==3.6.0, bitsandbytes==0.46.0, peft==0.15.2, accelerate==1.8.1

!pip install -q flash-attn --no-build-isolation

Authenticate with your Hugging Face account to save and share your model directly from this notebook 🗝️.

from huggingface_hub import notebook_login

notebook_login()

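2. Load Dataset 📁

We'll load the HuggingFaceM4/ChartQA dataset, which provides chart images along with corresponding questions and answers, making it well suited to fine-tuning visual question answering models.

We'll create a system message that makes the VLM act as a chart analysis expert, giving concise answers about chart images.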

system_message = """You are a Vision Language Model specialized in interpreting visual data from chart images.
Your task is to analyze the provided chart image and respond to queries with concise answers, usually a single word, number, or short phrase.
The charts include a variety of types (e.g., line charts, bar charts) and contain colors, labels, and text.
Focus on delivering accurate, succinct answers based on the visual information. Avoid additional explanation unless absolutely necessary."""

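We'll format the dataset into a chatbot structure, where each interaction contains the system message, the image, the user query, and the answer.

💡 For more tips on using this model, check out the model card.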

def format_data(sample):
    return [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_message}],
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": sample["image"],
                },
                {
                    "type": "text",
                    "text": sample["query"],
                },
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": sample["label"][0]}],
        },
    ]
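
As a quick sanity check, the structure that `format_data` produces can be verified with a small helper. The sketch below is illustrative only: `check_conversation` and the example conversation are hypothetical, and a plain string stands in for the PIL image that the real dataset provides.

```python
def check_conversation(conversation):
    """Return True if a conversation follows the system/user/assistant layout."""
    roles = [turn["role"] for turn in conversation]
    if roles != ["system", "user", "assistant"]:
        return False
    # The user turn must carry the image first, then the question text
    user_types = [part["type"] for part in conversation[1]["content"]]
    if user_types != ["image", "text"]:
        return False
    # The assistant turn holds exactly one text answer
    answer_parts = conversation[2]["content"]
    return len(answer_parts) == 1 and answer_parts[0]["type"] == "text"


# Hypothetical conversation, shaped like the output of format_data
example = [
    {"role": "system", "content": [{"type": "text", "text": "You are a chart expert."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "<PIL image placeholder>"},
            {"type": "text", "text": "What is the highest value?"},
        ],
    },
    {"role": "assistant", "content": [{"type": "text", "text": "42"}]},
]

print(check_conversation(example))
```

A check like this can be run over a few formatted samples before training to catch a malformed record early.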

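For educational purposes, we'll load only 10% of each split in the dataset. In a real-world scenario, you would typically load the entire dataset.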

from datasets import load_dataset

dataset_id = "HuggingFaceM4/ChartQA"
train_dataset, eval_dataset, test_dataset = load_dataset(dataset_id, split=["train[:10%]", "val[:10%]", "test[:10%]"])

Let's take a look at the structure of the dataset. It includes an image, a query, a label (the answer), and a fourth feature that we'll discard.

train_dataset

Now, let's format the data using the chatbot structure. This sets up the interactions for the model.

train_dataset = [format_data(sample) for sample in train_dataset]
eval_dataset = [format_data(sample) for sample in eval_dataset]
test_dataset = [format_data(sample) for sample in test_dataset]

train_dataset[200]

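3. Load Model and Check Performance! 🤔

Now that the dataset is loaded, it's time to load HuggingFaceTB/SmolVLM-Instruct, a 2B-parameter Vision Language Model (VLM) that offers state-of-the-art (SOTA) performance for its memory footprint.

For a broader comparison of state-of-the-art VLMs, explore the WildVision Arena and the OpenVLM Leaderboard, where you can find the best-performing models across a range of benchmarks.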

import torch
from transformers import Idefics3ForConditionalGeneration, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-Instruct"

Next, we'll load the model and its processor to prepare for inference.

model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
)

processor = AutoProcessor.from_pretrained(model_id)

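To evaluate the model's performance, we'll use a sample from the dataset. First, let's inspect the internal structure of this sample to understand how the data is organized.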

train_dataset[1]

We'll use the sample without the system message to assess the VLM's raw understanding. This is the input we will use:

train_dataset[1][1:2]
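
Because each formatted sample is a plain list of turns, ordinary list slicing selects which turns the model sees. The sketch below uses a hypothetical sample in which short strings stand in for the real content lists: `[1:2]` keeps only the user turn, while `[:2]` keeps the system and user turns and hides the answer.

```python
# Hypothetical formatted sample: strings stand in for the real content lists
sample = [
    {"role": "system", "content": "system prompt"},
    {"role": "user", "content": "image + question"},
    {"role": "assistant", "content": "answer"},
]

user_only = sample[1:2]      # what the model is prompted with in this section
without_answer = sample[:2]  # system + user turns, answer hidden

print([turn["role"] for turn in user_only])       # ['user']
print([turn["role"] for turn in without_answer])  # ['system', 'user']
```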

Now, let's take a look at the chart that corresponds to the sample. Can you answer the query from the visual information?

>>> train_dataset[1][1]["content"][0]["image"]

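Let's create a function that takes the model, the processor, and a sample as inputs and generates the model's answer. This streamlines inference and makes it easy to evaluate the VLM's performance.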

def generate_text_from_sample(model, processor, sample, max_new_tokens=1024, device="cuda"):
    # Prepare the text input by applying the chat template
    text_input = processor.apply_chat_template(
        sample[1:2], add_generation_prompt=True  # Use the sample without the system message
    )

    image_inputs = []
    image = sample[1]["content"][0]["image"]
    if image.mode != "RGB":
        image = image.convert("RGB")
    image_inputs.append([image])

    # Prepare the inputs for the model
    model_inputs = processor(
        # text=[text_input],
        text=text_input,
        images=image_inputs,
        return_tensors="pt",
    ).to(
        device
    )  # Move inputs to the specified device

    # Generate text with the model
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)

    # Trim the generated ids to remove the input ids
    trimmed_generated_ids = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)]

    # Decode the output text
    output_text = processor.batch_decode(
        trimmed_generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    return output_text[0]  # Return the first decoded output text

output = generate_text_from_sample(model, processor, train_dataset[1])
output
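
The trimming step in `generate_text_from_sample` relies on the generated sequence starting with the prompt tokens, so slicing off `len(in_ids)` tokens leaves only the new ones. A minimal sketch with made-up token ids (the helper name `trim_prompt_tokens` is hypothetical):

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence."""
    return [out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)]


# Made-up token ids: the first three tokens of the output repeat the prompt
input_ids = [[101, 7, 8]]
generated_ids = [[101, 7, 8, 55, 56, 2]]

print(trim_prompt_tokens(input_ids, generated_ids))  # [[55, 56, 2]]
```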

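The model appears to read its answer from the wrong line of the chart, so the response is incorrect. To improve its performance, we can fine-tune the model on more relevant data so that it understands the context better and gives more accurate responses.

Remove Model and Clean GPU

Before training the model in the next section, let's clear the current variables and clean the GPU to free up resources.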

>>> import gc
>>> import time


>>> def clear_memory():
...     # Delete variables if they exist in the current global scope
...     for name in ("inputs", "model", "processor", "trainer", "peft_model", "bnb_config"):
...         if name in globals():
...             del globals()[name]
...     time.sleep(2)
...
...     # Garbage collection and clearing CUDA memory
...     gc.collect()
...     time.sleep(2)
...     torch.cuda.empty_cache()
...     torch.cuda.synchronize()
...     time.sleep(2)
...     gc.collect()
...     time.sleep(2)
...
...     print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
...     print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")


>>> clear_memory()

GPU allocated memory: 0.01 GB
GPU reserved memory: 0.06 GB

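4. Fine-Tune the Model using TRL

4.1 Load the Quantized Model for Training ⚙️

Next, we'll load the quantized model using bitsandbytes. If you want to learn more about quantization, check out this blog post or this one.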

from transformers import BitsAndBytesConfig

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the 4-bit quantized model and its processor
model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    _attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id)

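4.2 Set Up QLoRA and SFTConfig 🚀

Next, we'll configure QLoRA for our training setup. QLoRA enables efficient fine-tuning of large models by reducing their memory footprint. Whereas standard LoRA trains small low-rank adapter matrices on top of a frozen base model kept in its original precision, QLoRA also stores the frozen base weights in 4-bit precision (as loaded in the previous step) and trains the LoRA adapters on top of them, which lowers memory usage further.

To improve efficiency further, a QLoRA setup can also use a paged optimizer or an 8-bit optimizer. Both reduce the memory taken by optimizer state, which helps when fine-tuning on a GPU with limited memory.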
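
To get a feel for the savings from 4-bit storage, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is a rough sketch, not a measurement: it uses the total parameter count that PEFT reports for this model in this section, and it ignores quantization constants, layers kept in higher precision, activations, gradients, and optimizer state. The helper `weight_memory_gib` is hypothetical.

```python
GIB = 1024**3


def weight_memory_gib(num_params, bits_per_param):
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


num_params = 2_257_542_128  # total parameter count reported by PEFT for this model

bf16 = weight_memory_gib(num_params, 16)
nf4 = weight_memory_gib(num_params, 4)

print(f"bf16 weights:  ~{bf16:.2f} GiB")
print(f"4-bit weights: ~{nf4:.2f} GiB")
```

Under these simplifying assumptions the weights shrink from roughly 4.2 GiB to roughly 1.05 GiB, a 4x reduction. The real footprint during training is higher because of the items ignored above.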

>>> from peft import LoraConfig, get_peft_model

>>> # Configure LoRA
>>> peft_config = LoraConfig(
...     r=8,
...     lora_alpha=8,
...     lora_dropout=0.1,
...     target_modules=["down_proj", "o_proj", "k_proj", "q_proj", "gate_proj", "up_proj", "v_proj"],
...     use_dora=True,
...     init_lora_weights="gaussian",
... )

>>> # Apply PEFT model adaptation
>>> peft_model = get_peft_model(model, peft_config)

>>> # Print trainable parameters
>>> peft_model.print_trainable_parameters()

trainable params: 11,269,248 || all params: 2,257,542,128 || trainable%: 0.4992
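
The small trainable fraction follows from how LoRA is parameterized: for a linear layer with `d_in` inputs and `d_out` outputs, a rank-`r` adapter adds two matrices with `r * (d_in + d_out)` parameters in total. The sketch below works this out for a hypothetical 2048 x 2048 layer (the layer size is an assumption for illustration, not taken from SmolVLM). With `use_dora=True`, PEFT additionally learns a magnitude vector per adapted layer, so the real count is slightly higher than this LoRA-only estimate.

```python
def lora_extra_params(d_in, d_out, r):
    """Parameters added by LoRA to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


r, lora_alpha = 8, 8       # values from the LoraConfig above
scaling = lora_alpha / r   # standard LoRA scaling applied to the update B @ A

# Hypothetical layer size, chosen only for illustration
d_in, d_out = 2048, 2048
frozen = d_in * d_out
extra = lora_extra_params(d_in, d_out, r)

print(scaling, extra, f"{100 * extra / frozen:.2f}% of the frozen layer")
```

For this hypothetical layer the adapter adds 32,768 parameters, under 1% of the 4,194,304 frozen weights, and the scaling factor `lora_alpha / r` is 1.0.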

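We'll use Supervised Fine-Tuning (SFT) to improve the model's performance on this specific task. To do so, we'll define the training arguments with the SFTConfig class from the TRL library. SFT uses labeled data to adapt the model to the task, helping it respond to visual queries more accurately.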

from trl import SFTConfig

# Configure training arguments using SFTConfig
training_args = SFTConfig(
    output_dir="smolvlm-instruct-trl-sft-ChartQA",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=50,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=25,
    save_strategy="steps",
    save_steps=25,
    save_total_limit=1,
    optim="adamw_torch_fused",
    bf16=True,
    push_to_hub=True,
    report_to="tensorboard",
    remove_unused_columns=False,
    gradient_checkpointing=True,
    dataset_text_field="",
    dataset_kwargs={"skip_prepare_dataset": True},
)
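
Several of these settings interact. The sketch below shows how the batch size, gradient accumulation, warmup, and checkpoint interval relate over one epoch. The number of GPUs and the dataset size are assumptions chosen for round numbers, not values from this notebook, and the step count is approximate because the exact rounding of a final partial batch depends on the Trainer version.

```python
# Values copied from the SFTConfig above
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
warmup_steps = 50
save_steps = 25

# Assumptions for illustration only
num_devices = 1            # a single GPU
num_train_samples = 2_000  # hypothetical dataset size

# Samples consumed per optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Optimizer updates in one epoch
optimizer_steps = num_train_samples // effective_batch_size
warmup_fraction = warmup_steps / optimizer_steps
checkpoints_written = optimizer_steps // save_steps

print(effective_batch_size, optimizer_steps, f"{warmup_fraction:.0%}", checkpoints_written)
```

With these assumed numbers, each optimizer update sees 16 samples and the epoch takes 125 updates, of which the first 50 (40%) are warmup. Five checkpoints are written, and `save_total_limit=1` keeps only the most recent one on disk.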

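4.3 Training the Model 🏃

To ensure the data is structured correctly for the model during training, we need to define a collator function. This function handles the formatting and batching of the dataset inputs.

👉 For more details, check out the official TRL example scripts.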

image_token_id = processor.tokenizer.additional_special_tokens_ids[
    processor.tokenizer.additional_special_tokens.index("<image>")
]


def collate_fn(examples):
    texts = [processor.apply_chat_template(example, tokenize=False) for example in examples]

    image_inputs = []
    for example in examples:
        image = example[1]["content"][0]["image"]
        if image.mode != "RGB":
            image = image.convert("RGB")
        image_inputs.append([image])

    batch = processor(text=texts, images=image_inputs, return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100  # Mask padding tokens in labels
    labels[labels == image_token_id] = -100  # Mask image token IDs in labels

    batch["labels"] = labels

    return batch
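
The label masking in `collate_fn` can be illustrated without tensors. In the sketch below, the token ids and the helper `mask_labels` are hypothetical. Positions holding padding or image tokens are replaced with -100, the label value that PyTorch's cross-entropy loss ignores by default, so they do not contribute to the loss.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def mask_labels(input_ids, pad_token_id, image_token_id):
    """Copy input ids into labels, masking padding and image placeholder tokens."""
    return [
        [IGNORE_INDEX if tok in (pad_token_id, image_token_id) else tok for tok in seq]
        for seq in input_ids
    ]


# Made-up ids: 0 is padding, 9 is the image placeholder token
pad_id, image_id = 0, 9
batch_input_ids = [[9, 9, 15, 16, 17], [9, 9, 21, 0, 0]]

print(mask_labels(batch_input_ids, pad_id, image_id))
# [[-100, -100, 15, 16, 17], [-100, -100, 21, -100, -100]]
```

Note that only padding and image tokens are masked, so with this collator the loss is also computed over the system and user text, not just the assistant's answer.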

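Now we will define the SFTTrainer, a wrapper around the transformers.Trainer class that inherits its attributes and methods. When given a PeftConfig object, this class simplifies fine-tuning by properly initializing the PeftModel. Using SFTTrainer lets us manage the training workflow for our Vision Language Model in one place.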

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collate_fn,
    peft_config=peft_config,
    processing_class=processor.tokenizer,
)

Time to train the model! 🎉

trainer.train()

Let's save the results 💾

trainer.save_model(training_args.output_dir)

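5. Testing the Fine-Tuned Model 🔍

Now that our Vision Language Model (VLM) is fine-tuned, it's time to evaluate its performance! In this section, we'll test the model on examples from the ChartQA dataset to see how accurately it answers questions about chart images. Let's dig into the results! 🚀

Let's clean up the GPU memory first 🧹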

>>> clear_memory()

GPU allocated memory: 16.34 GB
GPU reserved memory: 18.69 GB

We'll reload the base model using the same steps as before.

model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
)

processor = AutoProcessor.from_pretrained(model_id)

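We will attach the trained adapter to the pretrained model. The adapter contains the adjustments learned during fine-tuning, allowing the base model to use the new knowledge while its core parameters stay unchanged. Integrating the adapter extends the model's capabilities without altering its original structure.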

adapter_path = "sergiopaniego/smolvlm-instruct-trl-sft-ChartQA"
model.load_adapter(adapter_path)

Let's evaluate the model on an unseen sample.

test_dataset[20][:2]

>>> test_dataset[20][1]["content"][0]["image"]

output = generate_text_from_sample(model, processor, test_dataset[20])
output

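The model has learned to respond to queries in the style specified by the dataset. We've achieved our goal! 🎉✨

💻 I've developed an example application to test the model, which you can find here. You can easily compare it with another Space featuring the pre-trained model, available here.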
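
To go beyond a single example, predictions can be scored over many test samples. ChartQA is commonly evaluated with a relaxed accuracy that tolerates small numeric deviations. The sketch below is one possible implementation under that assumption, not the official scorer: the helper `relaxed_match`, the 5% tolerance default, and the prediction/target pairs are all illustrative.

```python
def relaxed_match(prediction, target, tolerance=0.05):
    """Exact match for text; numeric answers may deviate by up to `tolerance` (relative)."""

    def to_float(text):
        try:
            return float(text.strip().rstrip("%").replace(",", ""))
        except ValueError:
            return None

    pred_num, target_num = to_float(prediction), to_float(target)
    if pred_num is not None and target_num is not None:
        if target_num == 0:
            return pred_num == 0
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    # Non-numeric answers: case-insensitive exact match
    return prediction.strip().lower() == target.strip().lower()


# Made-up (prediction, target) pairs for illustration
pairs = [("42", "42"), ("10.2", "10"), ("12", "10"), ("Blue", "blue")]
accuracy = sum(relaxed_match(p, t) for p, t in pairs) / len(pairs)

print(accuracy)  # 0.75
```

In practice, the predictions would come from calling `generate_text_from_sample` on each test sample, and the targets from the assistant turn of the same sample.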

from IPython.display import IFrame

IFrame(src="https://sergiopaniego-smolvlm-trl-sft-chartqa.hf.space", width=1000, height=800)

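6. Continuing the Learning Journey 🧑‍🎓️

To further develop your skills with multimodal models, I recommend checking out the resources shared at the beginning of this notebook, or revisiting the section of the same name in Fine-Tuning a Vision Language Model (Qwen2-VL-7B) with the Hugging Face Ecosystem (TRL).

These resources will help you deepen your knowledge and expertise in multimodal learning.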
