開源 AI 食譜文件

在消費級 GPU 上使用 TRL 和直接偏好最佳化 (DPO) 微調 SmolVLM

開源 AI 食譜

加入 Hugging Face 社群

並獲得增強的文件體驗

在模型、資料集和 Spaces 上進行協作

透過加速推理獲得更快的示例

切換文件主題

開始使用

在消費級 GPU 上使用 TRL 和直接偏好最佳化 (DPO) 微調 SmolVLM

作者: Sergio Paniego

在本指南中，我們將指導您如何使用**Transformer 強化學習 (TRL)** 庫，透過**直接偏好最佳化 (DPO)** 來微調一個**小巧的 🤏 視覺語言模型 (VLM)**，以展示即使在消費級 GPU 上，您也可以根據特定需求定製 VLM。

我們將使用**偏好資料集**來微調 SmolVLM，以幫助模型與期望的輸出保持一致。SmolVLM 是一款效能高、記憶體效率高的模型，是完成此項任務的理想選擇。如果您對語言或視覺語言模型的**偏好最佳化**還不熟悉，可以檢視這篇部落格進行深入瞭解。

我們將使用的資料集是 HuggingFaceH4/rlaif-v_formatted，其中包含成對的**`提示 + 影像`**，以及每對的**`選擇`**和**`拒絕`**答案。此次微調過程的目標是使模型始終偏好資料集中的**選擇答案**，從而減少幻覺。

本 Notebook 已在 **NVIDIA L4 GPU** 上測試透過。

1. 安裝依賴

讓我們先安裝微調所需的基本庫吧！🚀

!pip install  -U -q transformers trl datasets bitsandbytes peft accelerate
# Tested with transformers==4.46.3, trl==0.12.2, datasets==3.2.0, bitsandbytes==0.45.0, peft==0.14.0, accelerate==1.2.0

!pip install -q flash-attn --no-build-isolation

使用您的 Hugging Face 賬戶進行身份驗證，以便直接從本 Notebook 儲存和分享您的模型 🗝️。

from huggingface_hub import notebook_login

notebook_login()

2. 載入資料集 📁

我們將使用 HuggingFaceH4/rlaif-v_formatted 資料集，其中提供了成對的**`提示 + 影像`**，以及每對的**`選擇`**和**`拒絕`**答案。這種結構化格式非常適合使用**直接偏好最佳化 (DPO)** 進行模型訓練。

該資料集已經為此任務預先格式化。如果您使用自定義資料集，則需要將其預處理成相同的格式。

在此示例中，我們將使用資料集的一個子集來演示該過程。然而，在實際場景中，您應使用完整的資料集以獲得更好的效能。

from datasets import load_dataset

dataset_id = "HuggingFaceH4/rlaif-v_formatted"
train_dataset, test_dataset = load_dataset(dataset_id, split=["train[:6%]", "test[:1%]"])

我們將確保所有影像都格式化為 RGB

from PIL import Image


def ensure_rgb(example):
    # Convert the image to RGB if it's not already
    image = example["images"][0]
    if isinstance(image, Image.Image):
        if image.mode != "RGB":
            image = image.convert("RGB")
        example["images"] = [image]
    return example


# Apply the transformation to the dataset
train_dataset = train_dataset.map(ensure_rgb, num_proc=32)
test_dataset = test_dataset.map(ensure_rgb, num_proc=32)

讓我們瀏覽一個數據集中的示例，以便更好地瞭解其結構和我們正在處理的資料型別。

train_dataset[20]

>>> train_dataset[20]["images"][0]

3. 使用 TRL 微調模型

3.1 載入量化模型以進行訓練 ⚙️

首先，讓我們使用 bitsandbytes 載入 SmolVLM-Instruct 模型的量化版本，並載入處理器。我們將使用 SmolVLM-Instruct。

import torch
from transformers import Idefics3ForConditionalGeneration, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-Instruct"

from transformers import BitsAndBytesConfig

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    _attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id)

3.2 設定 QLoRA 和 DPOConfig 🚀

在這一步中，我們將為我們的訓練設定配置 QLoRA。QLoRA 是一種強大的微調技術，旨在減少記憶體佔用，使得即使在有限的硬體上也能高效地微調大型模型。

QLoRA 在傳統的 **LoRA** (Low-Rank Adaptation) 基礎上，引入了介面卡權重的量化。這一增強顯著降低了記憶體使用量並加快了訓練速度，使其成為資源受限環境的理想選擇。

>>> from peft import LoraConfig, get_peft_model

>>> # Configure LoRA
>>> peft_config = LoraConfig(
...     r=8,
...     lora_alpha=8,
...     lora_dropout=0.1,
...     target_modules=["down_proj", "o_proj", "k_proj", "q_proj", "gate_proj", "up_proj", "v_proj"],
...     use_dora=True,
...     init_lora_weights="gaussian",
... )

>>> # Apply PEFT model adaptation
>>> peft_model = get_peft_model(model, peft_config)

>>> # Print trainable parameters
>>> peft_model.print_trainable_parameters()

trainable params: 11,269,248 || all params: 2,257,542,128 || trainable%: 0.4992

接下來，我們將使用 `DPOConfig` 配置訓練選項。

from trl import DPOConfig

training_args = DPOConfig(
    output_dir="smolvlm-instruct-trl-dpo-rlaif-v",
    bf16=True,
    gradient_checkpointing=True,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=32,
    num_train_epochs=5,
    dataset_num_proc=8,  # tokenization will use 8 processes
    dataloader_num_workers=8,  # data loading will use 8 workers
    logging_steps=10,
    report_to="tensorboard",
    push_to_hub=True,
    save_strategy="steps",
    save_steps=10,
    save_total_limit=1,
    eval_steps=10,  # Steps interval for evaluation
    eval_strategy="steps",
)

我們將使用 TRL 庫中的 DPOTrainer 類為**直接偏好最佳化 (DPO)** 定義訓練引數。

**DPO** 使用帶標籤的偏好資料來引導模型生成符合偏好的響應。TRL 的 DPOTrainer 會在訓練前**對資料集進行分詞**並將其儲存到磁碟。這個過程可能會消耗大量磁碟空間，具體取決於用於訓練的資料量。請做好相應規劃以避免儲存空間不足。

這一步可能需要一些時間，所以請放鬆並享受這個過程！😄

from trl import DPOTrainer

trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    peft_config=peft_config,
    processing_class=processor,
)

開始訓練模型！🎉

trainer.train()

讓我們儲存結果 💾

trainer.save_model(training_args.output_dir)

4. 測試微調後的模型 🔍

在微調完我們的視覺語言模型 (VLM) 後，是時候評估它的效能了！在本節中，我們將使用 HuggingFaceH4/rlaif-v_formatted 資料集中的示例來測試模型。讓我們深入瞭解結果，評估模型與偏好響應的一致性如何！🚀

在開始之前，讓我們清理一下 GPU 記憶體，以確保流暢和最佳的效能。🧹

>>> import gc
>>> import time


>>> def clear_memory():
...     # Delete variables if they exist in the current global scope
...     if "inputs" in globals():
...         del globals()["inputs"]
...     if "model" in globals():
...         del globals()["model"]
...     if "processor" in globals():
...         del globals()["processor"]
...     if "trainer" in globals():
...         del globals()["trainer"]
...     if "peft_model" in globals():
...         del globals()["peft_model"]
...     if "bnb_config" in globals():
...         del globals()["bnb_config"]
...     time.sleep(2)

...     # Garbage collection and clearing CUDA memory
...     gc.collect()
...     time.sleep(2)
...     torch.cuda.empty_cache()
...     torch.cuda.synchronize()
...     time.sleep(2)
...     gc.collect()
...     time.sleep(2)

...     print(f"GPU allocated memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
...     print(f"GPU reserved memory: {torch.cuda.memory_reserved() / 1024**3:.2f} GB")


>>> clear_memory()

GPU allocated memory: 1.64 GB
GPU reserved memory: 2.01 GB

我們將使用與之前相同的流程重新載入基礎模型。

model = Idefics3ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
)

processor = AutoProcessor.from_pretrained(model_id)

我們將訓練好的介面卡附加到預訓練模型上。該介面卡包含了訓練期間進行的微調調整，使基礎模型能夠在保持其核心引數不變的情況下利用新知識。透過整合介面卡，我們在不改變其原始結構的情況下增強了模型的能力。

adapter_path = "sergiopaniego/smolvlm-instruct-trl-dpo-rlaif-v"
model.load_adapter(adapter_path)

讓我們在一個未見過的樣本上評估模型。

test_dataset[20]

>>> test_dataset[20]["images"][0]

讓我們建立一個通用函式，可以用不同的樣本呼叫，以簡化測試過程。這個函式將使我們能夠高效地評估模型在多個示例上的效能，而無需為每個示例重寫程式碼。透過使用這個可重用的函式，我們可以快速評估模型在各種輸入下的表現。

def generate_text_from_sample(model, processor, sample, max_new_tokens=1024, device="cuda"):
    # Prepare the text input by applying the chat template
    text_input = processor.apply_chat_template(sample["prompt"], add_generation_prompt=True)

    image_inputs = []
    image = sample["images"][0]
    if image.mode != "RGB":
        image = image.convert("RGB")
    image_inputs.append([image])

    # Prepare the inputs for the model
    model_inputs = processor(
        text=text_input,
        images=image_inputs,
        return_tensors="pt",
    ).to(
        device
    )  # Move inputs to the specified device

    # Generate text with the model
    generated_ids = model.generate(**model_inputs, max_new_tokens=max_new_tokens)

    # Trim the generated ids to remove the input ids
    trimmed_generated_ids = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(model_inputs.input_ids, generated_ids)]

    # Decode the output text
    output_text = processor.batch_decode(
        trimmed_generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    return output_text[0]  # Return the first decoded output text

現在，我們準備好呼叫函式並評估模型了！🚀

output = generate_text_from_sample(model, processor, test_dataset[20])
output

該模型現在能夠根據提供的影像和提示生成響應。對於這樣的任務，將您的模型的效能與基準進行比較是很有用的，以瞭解它改進了多少，以及它與其他選項的對比情況。有關此比較的更多資訊和詳細資訊，請檢視這篇文章。

💻 我開發了一個用於測試模型的示例應用程式，您可以在這裡找到它。

由於這裡我們只用資料集的一個子集進行了一個示例訓練，對於 Space，我使用了官方的 Hugging Face DPO 微調模型。您可以輕鬆地將其與另一個展示預訓練模型的 Space 進行比較，該 Space 可在這裡找到。

from IPython.display import IFrame

IFrame(src="https://sergiopaniego-smolvlm-trl-dpo-rlaif-v.hf.space", width=1000, height=800)

5. 繼續學習之旅 🧑‍🎓️

透過這些資源擴充套件您對視覺語言模型及相關工具的知識。

Cookbook 中的多模態指南： 發現多模態模型的實用指南，包括檢索增強生成 (RAG) 流程和微調。我們已經發布了一篇關於使用 SFT 和 TRL 微調 smol VLM 的指南，它與本指南完美互補——請查閱以獲取更多細節。
TRL 社群教程： 探索豐富的教程集，深入瞭解 TRL 的複雜性及其在實際應用中的使用。

您也可以重新訪問使用 Hugging Face 生態系統 (TRL) 微調視覺語言模型 (Qwen2-VL-7B) 中的“繼續學習之旅”部分。

這些資源將幫助您深化在多模態學習領域的知識和專業技能。

< > 在 GitHub 上更新

←Smol 多模態 RAG，在 Colab 免費版 GPU 上使用 ColSmolVLM 和 SmolVLM 進行構建使用視覺語言模型從影像或文件中進行結構化生成→