Preference Optimization for Vision Language Models with TRL

Published July 10, 2024

Quentin Gallouédec

qgallouedec

Shengyi Costa Huang

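Training models to understand and predict human preferences can be extremely complex. Traditional supervised fine-tuning usually requires assigning specific labels to data, which is not cost-efficient for nuanced tasks. Preference optimization is an alternative approach that can simplify this process and yield more accurate results. By comparing and ranking candidate answers rather than assigning fixed labels, preference optimization lets models capture the subtleties of human judgment more effectively.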

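Preference optimization is widely used for fine-tuning language models, but it can also be applied to vision language models (VLMs). We are excited to announce that the **TRL library now supports direct preference optimization (DPO) for VLMs**. This article will guide you through the process of training VLMs using TRL and DPO.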

Preference dataset

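Preference optimization requires data that captures user preferences. In the binary choice setting, each example consists of a prompt and two candidate answers: one chosen and one rejected. The model's task is to learn to predict the chosen answer rather than the rejected one. For example, you need samples like the following: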

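❔ Question: How many families?

❌ Rejected: The image does not provide any information about families.
✅ Chosen: The image shows a Union Organization table setup with 18,000 families.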

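Note that the chosen message is not necessarily correct. For example, the chosen response that says "18,000 families" is still wrong, but it is less wrong than the rejected response.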

For this blog post, we'll use the openbmb/RLAIF-V-Dataset, which contains more than 83,000 annotated rows. Let's take a closer look at the dataset:

>>> from datasets import load_dataset
>>> dataset = load_dataset("openbmb/RLAIF-V-Dataset", split="train[:1%]")
>>> sample = dataset[1]
>>> sample["image"].show()
>>> sample["question"]
'how many families?'
>>> sample["rejected"]
'The image does not provide any information about families.'
>>> sample["chosen"]
'The image shows a Union Organization table setup with 18,000 families.'

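Our model requires both text and images as input, so the first step is to format the dataset to fit this requirement. The data should be structured to simulate a conversation between a user and an assistant. The user provides a prompt that includes an image and a question, while the assistant responds with an answer. Here's how this formatting is done: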

from datasets import features
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)

def format(example):
    # Prepare the input for the chat template
    prompt = [
        {
            "role": "user",
            "content": [{"type": "image"}, {"type": "text", "text": example["question"]}],
        },
    ]
    chosen = [
        {
            "role": "assistant",
            "content": [{"type": "text", "text": example["chosen"]}],
        },
    ]
    rejected = [
        {
            "role": "assistant",
            "content": [{"type": "text", "text": example["rejected"]}],
        },
    ]
    # Apply the chat template
    prompt = processor.apply_chat_template(prompt, tokenize=False)
    chosen = processor.apply_chat_template(chosen, tokenize=False)
    rejected = processor.apply_chat_template(rejected, tokenize=False)
    # Resize the image to ensure it fits within the maximum allowable
    # size of the processor to prevent OOM errors.
    max_size = processor.image_processor.size["longest_edge"]
    example["image"].thumbnail((max_size, max_size))
    return {"images": [example["image"]], "prompt": prompt, "chosen": chosen, "rejected": rejected}

# Apply the formatting function to the dataset,
# remove columns to end up with only "images", "prompt", "chosen", "rejected" columns
dataset = dataset.map(format, remove_columns=dataset.column_names)

# Make sure that the images are decoded, it prevents from storing bytes.
# More info here https://github.com/huggingface/blog/pull/2148#discussion_r1667400478
f = dataset.features
f["images"] = features.Sequence(features.Image(decode=True))  # to avoid bytes
dataset = dataset.cast(f)

Our dataset is now formatted. Let's look at an example:

>>> dataset[1]
{'images': [<PIL.JpegImagePlugin.JpegImageFile image mode=L size=980x812 at 0x154505570>],
 'prompt': 'User:<image>how many families?<end_of_utterance>\n',
 'rejected': 'Assistant: The image does not provide any information about families.<end_of_utterance>\n',
 'chosen': 'Assistant: The image shows a Union Organization table setup with 18,000 families.<end_of_utterance>\n'}

Warm up your GPUs, the dataset is ready for training!

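Training

For the sake of the example, we'll train the Idefics2-8b model, but note that the DPO implementation in TRL supports other models such as Llava 1.5 and PaliGemma. For more information, see the section Fine-tuning Llava 1.5, PaliGemma, and others. Before looking into the training process, we'll first make sure everything fits into memory.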

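How much memory do I need?

I have a GPU with 80GB of VRAM. Is that enough to train my Idefics2-8b model? Here are the calculation steps to get a rough estimate of the memory needed.

Let $N$ be the number of parameters and $P$ the precision, expressed in bytes per parameter. The following components must fit together in memory:

Model to train: $N \times P$
Reference model: the reference model is the same as the model to train, so it also requires $N \times P$
Gradients: we train the whole model, and every parameter requires a gradient, so this requires $N \times P$
Optimizer states: we use AdamW, which requires two states per parameter, so this requires $2 \times N \times P$

Idefics2-8b has 8 billion parameters, and we use float32 precision, which requires 4 bytes per float. So the total memory required is: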

Component	Calculation	Memory
Model to train	$8 \times 10^9 \times 4$	32 GB
Reference model	$8 \times 10^9 \times 4$	32 GB
Gradients	$8 \times 10^9 \times 4$	32 GB
Optimizer states	$2 \times 8 \times 10^9 \times 4$	64 GB
Total		160 GB

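This is far beyond my GPU's memory capacity. Fortunately, by applying techniques such as quantization and LoRA, we can significantly reduce the memory requirements and make training feasible. Let's see how to do this.

Quantization

Quantization is a technique that reduces the precision of model weights and activations. Switching from float32 to bfloat16 halves the storage requirement per parameter, from 4 bytes to 2 bytes. This saves memory and speeds up computation, with minimal impact on model quality. To load the model in bfloat16 precision: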

import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16)

You can also apply bfloat16 precision to the optimizer by setting bf16=True in the training arguments:

from transformers import TrainingArguments

training_args = TrainingArguments(..., bf16=True)

LoRA

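LoRA is a method that reduces the number of trainable parameters by learning pairs of rank-decomposition matrices while keeping the original weights frozen. This significantly reduces the storage requirements for LLMs adapted to specific tasks. LoRA is integrated in PEFT, and you can set it up in no time: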

  from transformers import AutoModelForVision2Seq
+ from peft import get_peft_model, LoraConfig

  model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")
+ peft_config = LoraConfig(target_modules="all-linear")
+ model = get_peft_model(model, peft_config)

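PEFT acts like a wrapper (called an adapter) around the model. The adapter is trained while the inner model stays frozen. How much does LoRA reduce the number of trainable parameters?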

>>> model.print_trainable_parameters()
trainable params: 55,348,736 || all params: 8,458,116,848 || trainable%: 0.6543860411799315

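It reduces the number of trainable parameters from 8 billion to about 55 million, a huge difference that significantly reduces the memory needed for gradients and optimizer states.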
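To get a feel for where the reduction comes from, here is a minimal sketch of the parameter count for a single linear layer. The layer size (4096 x 4096) and the rank of 8 are illustrative assumptions, not values read from the Idefics2 configuration, and `lora_param_count` is a hypothetical helper rather than a PEFT function. The last lines simply recompute the trainable percentage from the two counts printed above.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one linear layer of shape (d_out, d_in).

    LoRA learns two small matrices, A (rank x d_in) and B (d_out x rank),
    while the original d_out x d_in weight stays frozen.
    """
    return rank * d_in + d_out * rank


# Illustrative layer size and rank (assumptions, not taken from the model).
d_in, d_out, rank = 4096, 4096, 8
full = d_in * d_out
lora = lora_param_count(d_in, d_out, rank)
print(f"full layer: {full:,} params, LoRA: {lora:,} params ({100 * lora / full:.2f}%)")

# Recompute the percentage from the counts reported by print_trainable_parameters().
trainable, total = 55_348_736, 8_458_116_848
trainable_pct = 100 * trainable / total
print(f"trainable%: {trainable_pct:.4f}")
```

For a square layer, the LoRA count grows linearly with the layer width while the full weight grows quadratically, which is why the trainable fraction stays below one percent.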

New memory requirements after quantization and LoRA

Now that we have reduced the memory requirements, let's recalculate the memory needed:

Component	Calculation	Memory
Model to train	$8 \mathrm{G} \times 2$	16 GB
Reference model	$8 \mathrm{G} \times 2$	16 GB
Gradients	$55 \mathrm{M} \times 2$	0.1 GB
Optimizer states	$2 \times 55 \mathrm{M} \times 2$	0.2 GB
Total		32.3 GB
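Both estimates in this post (float32 full fine-tuning and bfloat16 with LoRA) follow the same arithmetic, which can be sketched in a few lines of Python. This is a rough illustration only: `dpo_memory_gb` is a hypothetical helper, 1 GB is taken as 10^9 bytes, and activations are not counted.

```python
def dpo_memory_gb(num_params, num_trainable, bytes_per_param):
    """Rough DPO memory estimate in GB (1 GB = 1e9 bytes), ignoring activations.

    num_params: parameters of the model (and of the reference model).
    num_trainable: parameters that receive gradients and AdamW states.
    """
    gb = 1e9
    breakdown = {
        "model": num_params * bytes_per_param / gb,
        "reference_model": num_params * bytes_per_param / gb,
        "gradients": num_trainable * bytes_per_param / gb,
        # AdamW keeps two states per trainable parameter.
        "optimizer_states": 2 * num_trainable * bytes_per_param / gb,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Full fine-tuning in float32: every parameter is trainable.
full_fp32 = dpo_memory_gb(num_params=8e9, num_trainable=8e9, bytes_per_param=4)
# bfloat16 weights with LoRA: only about 55M parameters are trainable.
lora_bf16 = dpo_memory_gb(num_params=8e9, num_trainable=55e6, bytes_per_param=2)

print(f"float32, full fine-tuning: {full_fp32['total']:.1f} GB")
print(f"bfloat16 + LoRA: {lora_bf16['total']:.1f} GB")
```

Plugging in a different parameter count or precision gives a quick first check of whether a model is likely to fit before launching a run.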

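This time, we need about 32 GB of memory to fine-tune our Idefics2-8b model, which is much more reasonable and fits within my GPU!

For more information on optimizing memory usage with LoRA and QLoRA, refer to the PEFT documentation or Google's LoRA and QLoRA recommendations for LLMs.

What about the batch size?

Our memory calculation isn't exact because it doesn't account for activations. Activations are the intermediate outputs of the network layers, and their memory requirements depend on the model structure and the batch size. Precisely calculating the memory needed for activations is challenging, so we'll rely on empirical observations.

To choose an appropriate training batch size (per_device_train_batch_size), start with your desired batch size (e.g., 64). This will likely result in an out-of-memory (OOM) error. If it does, halve the batch size and double the gradient accumulation steps (gradient_accumulation_steps) to keep the same effective batch size. Repeat this process until the training fits in your GPU's memory. In our case, we end up with a batch size of 2 and 32 gradient accumulation steps.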
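The halving procedure can be written down as a small generator that lists the settings to try, in order. This is only a sketch: `batch_size_schedule` is a hypothetical helper, it assumes a single device and a power-of-two effective batch size, and detecting the OOM error itself is left to you.

```python
def batch_size_schedule(effective_batch_size):
    """Yield (per_device_train_batch_size, gradient_accumulation_steps) pairs.

    Each step halves the per-device batch size and doubles the accumulation
    steps, so their product always equals the effective batch size.
    Assumes effective_batch_size is a power of two and a single device.
    """
    batch_size, accumulation_steps = effective_batch_size, 1
    while batch_size >= 1:
        yield batch_size, accumulation_steps
        batch_size //= 2
        accumulation_steps *= 2


schedule = list(batch_size_schedule(64))
for batch_size, accumulation_steps in schedule:
    print(f"per_device_train_batch_size={batch_size}, "
          f"gradient_accumulation_steps={accumulation_steps}")
```

You would try these settings from top to bottom and stop at the first one that trains without running out of memory.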

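An additional optimization is to use gradient checkpointing (gradient_checkpointing) to reduce the memory needed for activations. This technique trades compute for memory by recomputing parts of the network during the backward pass. It can be enabled by setting gradient_checkpointing=True in the training arguments.

Putting it all together: the complete training script

Now that we've set up the model, the dataset, and the training parameters, we're ready to train. Here's how to put everything together in a single script, including some additional elements to speed up processing, such as dataset_num_proc and dataloader_num_workers: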

# dpo_idefics2-8b.py
from datasets import features, load_dataset
from transformers import AutoModelForVision2Seq, AutoProcessor
import torch
from trl import DPOConfig, DPOTrainer
from peft import LoraConfig


def main():
    # Load the model and processor
    model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16)
    processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)

    # Load the dataset
    dataset = load_dataset("openbmb/RLAIF-V-Dataset", split="train")

    def format(example):
        # Prepare the input for the chat template
        prompt = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": example["question"]}]}]
        chosen = [{"role": "assistant", "content": [{"type": "text", "text": example["chosen"]}]}]
        rejected = [{"role": "assistant", "content": [{"type": "text", "text": example["rejected"]}]}]
        # Apply the chat template
        prompt = processor.apply_chat_template(prompt, tokenize=False)
        chosen = processor.apply_chat_template(chosen, tokenize=False)
        rejected = processor.apply_chat_template(rejected, tokenize=False)
        # Resize the image to ensure it fits within the maximum allowable
        # size of the processor to prevent OOM errors.
        max_size = processor.image_processor.size["longest_edge"] // 2
        example["image"].thumbnail((max_size, max_size))
        return {"images": [example["image"]], "prompt": prompt, "chosen": chosen, "rejected": rejected}

    # Apply the formatting function to the dataset
    dataset = dataset.map(format, remove_columns=dataset.column_names, num_proc=32)

    # Make sure that the images are decoded, it prevents from storing bytes.
    # More info here https://github.com/huggingface/blog/pull/2148#discussion_r1667400478
    f = dataset.features
    f["images"] = features.Sequence(features.Image(decode=True))
    dataset = dataset.cast(f)

    # Train the model
    training_args = DPOConfig(
        output_dir="idefics2-8b-dpo",
        bf16=True,
        gradient_checkpointing=True,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=32,
        num_train_epochs=1,
        dataset_num_proc=32,  # tokenization will use 32 processes
        dataloader_num_workers=32,  # data loading will use 32 workers
        logging_steps=10,
    )
    trainer = DPOTrainer(
        model,
        ref_model=None,  # not needed when using peft
        args=training_args,
        train_dataset=dataset,
        tokenizer=processor,
        peft_config=LoraConfig(target_modules="all-linear"),
    )

    trainer.train()


if __name__ == "__main__":
    main()

Let's run it and wait... 🚀

accelerate launch dpo_idefics2-8b.py

Results

A few hours later, the training is complete. Let's take a look at the training curves:

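In DPO, we focus on several metrics to assess the quality of the training:

Accuracy: this metric indicates the percentage of training samples for which the model is more likely to output the chosen answer than the rejected one. We can see that the accuracy increases, which is a positive sign.
Rewards: rewards are related to the probability of an answer being chosen. For more details, refer to Section 5 of the DPO paper. We expect the reward for the chosen answer to be higher than the reward for the rejected answer. To verify this, we look at the reward margin, which is the difference between the rewards of the chosen and rejected answers. The increasing reward margin observed here is also a good sign.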
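To make these metrics concrete, here is a minimal sketch of how they can be derived from sequence log-probabilities, following the implicit reward defined in the DPO paper (beta times the log-ratio between the trained model and the reference model). The function `dpo_metrics` is a hypothetical helper, beta=0.1 is an assumed value, and the log-probabilities are made-up numbers; this is not TRL's actual implementation.

```python
import math


def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Compute DPO rewards, margins, accuracy, and loss for a batch.

    Each argument is a list with one sequence log-probability per example.
    """
    # Implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)).
    chosen_rewards = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rejected_rewards = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    # Reward margin: chosen reward minus rejected reward.
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]
    # Accuracy: fraction of examples where the chosen reward is higher.
    accuracy = sum(m > 0 for m in margins) / len(margins)
    # DPO loss: -log(sigmoid(margin)) = log(1 + exp(-margin)), batch average.
    loss = sum(math.log1p(math.exp(-m)) for m in margins) / len(margins)
    return {"margins": margins, "accuracy": accuracy, "loss": loss}


# Made-up log-probabilities for a batch of two examples.
metrics = dpo_metrics(
    policy_chosen=[-10.0, -12.0],
    policy_rejected=[-14.0, -11.0],
    ref_chosen=[-11.0, -12.0],
    ref_rejected=[-12.0, -12.0],
)
print(metrics)
```

In this toy batch, the first example has a positive margin and the second a negative one, so the accuracy is 0.5. During a healthy training run, both the average margin and the accuracy should rise.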

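Evaluation

Inference

With the model trained, the next step is to evaluate it on some examples. This gives us a sense of how well the model has learned and how useful its predictions are. Below is a script that generates the model's response for a given image and question, which you can run on your own test examples: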

from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b").to("cuda")
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
model.load_adapter("HuggingFaceH4/idefics2-8b-dpo-rlaif-v-v0.3")  # <-- Load the adapter we've just trained

# Process
user_message = ...
image_path = ...
data = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": user_message}]}]
prompts = processor.apply_chat_template(data, add_generation_prompt=True)  # add_generation_prompt=True to end the prompt with "Assistant:"
images = [Image.open(image_path)]
inputs = processor(prompts, images, return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
response_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response_text)

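The openbmb/RLAIF-V-Dataset is designed to reduce hallucinations. But does the fine-tuning actually reduce hallucinations? To find out, we can use the AMBER benchmark, a dataset specifically created to evaluate hallucinations in VLMs. We report the results for Idefics2 and Idefics2+DPO on the discriminative task and compare them with other models for reference.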

Model	Accuracy	F1
GPT-4o	88.8	91.6
Idefics2+DPO	85.9	89.4
Idefics2	85.8	89.1
GPT-4v	83.4	87.4
MiniGemini	82.6	87.6
LLaVA-NeXT	81.4	85.4
QWEN-VL	81.9	86.4
LURE	73.5	77.7
OPERA	75.2	78.3
Less-is-more	72.4	75.8
VCD	71.8	74.9

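Overall, the fine-tuned model seems to hallucinate slightly less: accuracy rises from 85.8 to 85.9 and F1 from 89.1 to 89.4. The training appears to have been successful!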
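For readers who want to score their own yes/no predictions in the same way, here is a minimal sketch of how accuracy and F1 are computed for a discriminative task. The helper `accuracy_and_f1` is hypothetical, the predictions and labels are made up, and treating "yes" as the positive class is a choice made for illustration; the benchmark's own convention may differ.

```python
def accuracy_and_f1(predictions, labels, positive="yes"):
    """Compute accuracy and F1 for binary yes/no answers."""
    pairs = list(zip(predictions, labels))
    true_pos = sum(p == positive and y == positive for p, y in pairs)
    false_pos = sum(p == positive and y != positive for p, y in pairs)
    false_neg = sum(p != positive and y == positive for p, y in pairs)
    accuracy = sum(p == y for p, y in pairs) / len(pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Made-up model answers and ground-truth labels.
predictions = ["yes", "no", "yes", "no", "yes"]
labels = ["yes", "no", "no", "no", "yes"]
accuracy, f1 = accuracy_and_f1(predictions, labels)
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```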

Here are some cherry-picked examples that illustrate the model's behavior:

Question	Idefics2	Idefics2+DPO
Are there two ships in this image?	Yes	No
Is the ground uneven in this image?	No	Yes
Is there one shovel in this image?	Yes	No

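Try it yourself and see how the model performs on your own examples!

Fine-tuning Llava 1.5, PaliGemma, and others

At the time of writing, the DPO implementation in TRL supports Idefics2, Llava 1.5, and PaliGemma, with ongoing efforts to add support for more models. The easiest way to fine-tune these models is to use the example script provided in the TRL repository. For example, to fine-tune PaliGemma, you can use the following command: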

accelerate launch examples/scripts/dpo_visual.py \
    --dataset_name HuggingFaceH4/rlaif-v_formatted \
    --model_name_or_path google/paligemma-3b-pt-224 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 32 \
    --dataset_num_proc 32 \
    --output_dir dpo_paligemma_rlaif-v \
    --bf16 \
    --torch_dtype bfloat16 \
    --gradient_checkpointing \
    --use_peft \
    --lora_target_modules=all-linear

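You can find a detailed walkthrough of PaliGemma fine-tuning in the smol-vision project.

🚀🚀 Now you have everything you need to fine-tune your own VLMs with DPO. Share your findings, models, and datasets with the community!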
