Post-training a VLM for reasoning with GRPO using TRL

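Authored by: Sergio Paniego

🚨 WARNING: This notebook is resource-intensive and requires substantial compute. If you run it in Colab, it will use an A100 GPU.

In this guide, we demonstrate how to post-train a Vision Language Model (VLM) with GRPO to add reasoning capabilities, using the Hugging Face ecosystem and in particular the Transformer Reinforcement Learning library (trl).

We'll fine-tune Qwen2.5-VL-3B-Instruct on a subset of the lmms-lab/multimodal-open-r1-8k-verified dataset. Each sample pairs an image and a problem description with its solution and the thinking process that leads to that solution. We'll use this data format together with the GRPO reward functions to teach the model how to reason its way to the solution.

1. Install dependencies

Let's start by installing the essential libraries needed for fine-tuning. We install trl from source because, at the time of writing, the VLM GRPO trainer has not yet been included in an official release.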

!pip install -U -q git+https://github.com/huggingface/trl.git peft math_verify qwen-vl-utils[decord]

Authenticate with your Hugging Face 🤗 account so that you can save and share the trained model.

from huggingface_hub import login

login()

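2. Load dataset 📁

In this guide we use lmms-lab/multimodal-open-r1-8k-verified. The dataset contains 8k multimodal RL training examples focused on math reasoning. The data was created with GPT4o, and each example includes the fields image, problem, solution, original_question, and original_answer. It was created in this project.

Since we want the model to learn to reason over images, we use image and problem as input and solution as output.

For this educational resource, we use only 5% of the dataset and split it into train and test sets to make training faster. In a real training run, we would use the full dataset.

Let's load the dataset and split it.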

from datasets import load_dataset

dataset_id = "lmms-lab/multimodal-open-r1-8k-verified"
dataset = load_dataset(dataset_id, split="train[:5%]")

split_dataset = dataset.train_test_split(test_size=0.2, seed=42)

train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

Let's check the structure of the dataset.

>>> print(train_dataset)

Dataset({
    features: ['image', 'problem', 'solution', 'original_question', 'original_answer'],
    num_rows: 307
})

Let's look at one sample:

print(train_dataset[0])

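In addition to the problem and image columns, we include a custom system prompt that tells the model how we want it to generate its output.

The system prompt is taken from DeepSeek R1. For more details, refer to this previous guide.

We convert each dataset sample into a conversation that contains the system prompt, one image, and the problem description, since this is the format the GRPO trainer expects.

We also set padding_side="left" so that the completions generated during training are concatenated directly after the prompt. GRPO needs this in order to compute token-level probabilities correctly over the completions sampled for each prompt.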

from transformers import AutoProcessor

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"
processor = AutoProcessor.from_pretrained(model_id, use_fast=True, padding_side="left")

SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
    "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>"
)


def make_conversation(example):
    conversation = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": example["problem"]},
            ],
        },
    ]
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    return {
        "prompt": prompt,
        "image": example["image"],
    }


train_dataset = train_dataset.map(make_conversation)

Let's take a look at a converted example.

>>> print(train_dataset[0]["prompt"])

<|im_start|>system
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer><|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Based on the image, determine the constant term after combining all the polynomial expressions representing the side lengths of the triangle. Choose the correct answer from the options provided.

Choices:
A. 3
B. 5
C. 8
D. 13<|im_end|>
<|im_start|>assistant
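The effect of padding_side="left" can be illustrated without any model. The following is a minimal sketch using plain Python lists in place of token IDs; the pad_batch helper and the PAD marker are hypothetical names introduced only for this illustration and are not part of transformers or trl.

```python
PAD = "<pad>"  # hypothetical padding marker, standing in for the real pad token


def pad_batch(prompts, side="left"):
    """Pad every token list in the batch to the same length on the given side."""
    width = max(len(p) for p in prompts)
    padded = []
    for p in prompts:
        filler = [PAD] * (width - len(p))
        padded.append(filler + p if side == "left" else p + filler)
    return padded


prompts = [["What", "is", "2+2", "?"], ["Solve", "x"]]
left = pad_batch(prompts, side="left")
right = pad_batch(prompts, side="right")

# Left padding: every prompt ends at the last position, so tokens generated
# for the batch are appended directly after real prompt tokens.
print(left[1])   # ['<pad>', '<pad>', 'Solve', 'x']

# Right padding: the shorter prompt ends in padding, so a completion appended
# to the batch would be separated from its prompt by pad tokens.
print(right[1])  # ['Solve', 'x', '<pad>', '<pad>']
```

With left padding, the last column of the batch always holds a real prompt token, which is what batched generation relies on.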

We'll remove the columns that we don't need for training.

train_dataset = train_dataset.remove_columns(["problem", "original_question", "original_answer"])

We can check that the columns are now gone.

>>> print(train_dataset)

Dataset({
    features: ['image', 'solution', 'prompt'],
    num_rows: 307
})

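3. Post-training the VLM using GRPO

The diagram below highlights the main differences between PPO (Proximal Policy Optimization) and GRPO (Group Relative Policy Optimization), specifically the removal of the value model in GRPO. For more detail on the key differences, you can refer to this further explanation.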
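Without a value model, GRPO needs another baseline for each completion. A common formulation scores each completion relative to the other completions sampled for the same prompt. The snippet below is a minimal sketch of that idea, not trl's implementation: the function name, the eps value, and the use of the population standard deviation are assumptions, and implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions against the group.

    rewards: one scalar reward per completion sampled for the same prompt.
    eps: small constant (assumed value) that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt: one clearly better, one clearly worse.
advantages = group_relative_advantages([2.0, 0.0, 1.0, 1.0])
print([round(a, 3) for a in advantages])

# If every completion in the group earns the same reward, all advantages are
# zero and the group carries no learning signal.
print(group_relative_advantages([1.0, 1.0]))  # [0.0, 0.0]
```

The second call shows why the number of generations per prompt matters: with very small groups, ties are frequent and many prompts contribute nothing to the update.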

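To implement the training pipeline, we use trl, Hugging Face's reinforcement learning library, which provides a streamlined interface and built-in support for key training algorithms. In our case, we use the GRPOConfig and GRPOTrainer classes. A crucial step in this process is defining custom reward functions that guide the model's behavior and help it align with our specific objectives.

But first, let's load the model. In this case, we use Qwen/Qwen2.5-VL-3B-Instruct, a powerful VLM developed by Qwen. For better results, it would be important to consider models with more parameters.

Other examples of VLM projects that include reasoning capabilities are:

3.1 Loading the baseline model

Let's first load the baseline model. As mentioned above, it is Qwen/Qwen2.5-VL-3B-Instruct.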

import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

3.2 Configuring LoRA

We'll use LoRA to train the model, so let's configure it first.

>>> from peft import LoraConfig, get_peft_model

>>> lora_config = LoraConfig(
...     task_type="CAUSAL_LM",
...     r=8,
...     lora_alpha=32,
...     lora_dropout=0.1,
...     target_modules=["q_proj", "v_proj"],
... )

>>> model = get_peft_model(model, lora_config)

>>> model.print_trainable_parameters()

trainable params: 1,843,200 || all params: 3,756,466,176 || trainable%: 0.0491
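The trainable parameter count above can be reproduced by hand. The sketch below assumes the language-model dimensions of Qwen2.5-VL-3B-Instruct (hidden size 2048, 36 decoder layers, 2 key/value heads of dimension 128); these values are not stated in this guide and should be checked against the model's config.json. It also assumes that only the language model's attention layers expose modules named q_proj and v_proj, so the vision tower receives no adapters.

```python
# Assumed language-model dimensions (verify against the model's config.json).
HIDDEN_SIZE = 2048
NUM_LAYERS = 36
NUM_KV_HEADS = 2
HEAD_DIM = 128
RANK = 8  # r in LoraConfig


def lora_params(in_features, out_features, rank):
    """LoRA adds two matrices per adapted linear layer: A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


q_proj = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
v_proj = lora_params(HIDDEN_SIZE, NUM_KV_HEADS * HEAD_DIM, RANK)
trainable = NUM_LAYERS * (q_proj + v_proj)

print(f"{trainable:,}")                          # 1,843,200
print(f"{100 * trainable / 3_756_466_176:.4f}")  # 0.0491
```

Under these assumptions the result matches the output of print_trainable_parameters(), which suggests that the adapters sit only on the query and value projections of the decoder.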

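3.3 Loading reward functions

For the reward component of the system, we can use either pretrained reward models or reward functions defined directly in code. For training, the DeepSeek-R1 authors used an accuracy-based reward that evaluates whether the response is correct, together with a format-based reward that ensures the model places its reasoning process between <think> </think> tags. You can find more details here. We can simply define and implement these reward functions as ordinary Python functions.

In this case, we use the following reward functions, taken directly from the Open R1 implementation:

Format enforcement: Ensures that the generation follows a specific format, using <think> </think> <answer> </answer> tags for reasoning.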

import re


def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$"
    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completions]
    rewards = [1.0 if match else 0.0 for match in matches]
    return rewards
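To see what the format reward accepts, it helps to call it on a few hand-written completions. The snippet below repeats the function so that it runs on its own; the sample strings are made up for illustration.

```python
import re


def format_reward(completions, **kwargs):
    """Return 1.0 for completions that follow the <think>/<answer> layout, else 0.0."""
    pattern = r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$"
    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completions]
    return [1.0 if match else 0.0 for match in matches]


samples = [
    # Tags on their own lines: matches the pattern.
    "<think>\n2 + 2 = 4\n</think>\n<answer>\n4\n</answer>",
    # Same tags without newlines: rejected by this pattern.
    "<think>2 + 2 = 4</think><answer>4</answer>",
    # No tags at all.
    "The answer is 4.",
]
print(format_reward(samples))  # [1.0, 0.0, 0.0]
```

Note that the pattern requires a newline after each opening tag and before each closing tag, so a completion that copies the single-line layout shown in the system prompt would not earn this reward.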

Solution accuracy: Verifies whether the solution to the problem is correct by comparing it with the solution column in the dataset.

from math_verify import LatexExtractionConfig, parse, verify
from latex2sympy2_extended import NormalizationConfig
from typing import Optional


def accuracy_reward(completions: list[str], solution: list[str], **kwargs) -> list[Optional[float]]:
    """Reward function that checks if the completion matches the ground truth.
    - If both gold and prediction are parseable → use math verification.
    - If not parseable → compare as normalized text.
    """
    rewards = []

    for completion, sol in zip(completions, solution):
        try:
            gold_parsed = parse(sol, extraction_mode="first_match")
        except Exception as e:
            gold_parsed = []

        if len(gold_parsed) != 0:
            # Try parsing predicted answer too
            try:
                answer_parsed = parse(
                    completion,
                    extraction_config=[
                        LatexExtractionConfig(
                            normalization_config=NormalizationConfig(
                                nits=False,
                                malformed_operators=False,
                                basic_latex=True,
                                boxed="all",
                                units=True,
                            ),
                            boxed_match_priority=0,
                            try_extract_without_anchor=False,
                        )
                    ],
                    extraction_mode="first_match",
                )
                reward = float(verify(gold_parsed, answer_parsed))
            except Exception as e:
                print(f"verify failed: {e}, answer: {completion}, gold: {sol}")
                reward = None
        else:
            # fallback to text match
            reward = float(completion.strip().lower() == sol.strip().lower())

        rewards.append(reward)

    return rewards

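3.4 Configuring GRPO training parameters

Next, let's configure the training parameters for GRPO. We recommend experimenting with max_completion_length, num_generations, and max_prompt_length to find the combination that trains best.

The parameter values have been chosen to fit the hardware limits of a Google Colab session. To observe the full potential of reward improvements, especially in the second objective function, and to further improve the model's reasoning in real-world scenarios, a more ambitious setup is needed: a larger model, more generations per prompt, and a high-quality, diverse dataset.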

from trl import GRPOConfig

# Configure training arguments using GRPOConfig
training_args = GRPOConfig(
    output_dir="Qwen2.5-VL-3B-Instruct-Thinking",
    learning_rate=1e-5,
    remove_unused_columns=False,  # to access the solution column in accuracy_reward
    num_train_epochs=1,
    bf16=True,
    # Parameters that control the data preprocessing
    per_device_train_batch_size=2,
    max_completion_length=1024,  # default: 256
    num_generations=2,  # default: 8
    max_prompt_length=2048,
    # Parameters related to reporting and saving
    report_to=["tensorboard"],
    logging_steps=10,
    push_to_hub=True,
    save_strategy="steps",
    save_steps=10,
)
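The batch size and num_generations interact, because every prompt is expanded into num_generations completions. The sketch below shows the arithmetic under the assumption that the total number of completions per optimization step must be divisible by num_generations; the exact rule depends on the trl version, and prompts_per_step is a hypothetical helper, not a trl function.

```python
def prompts_per_step(per_device_batch_size, num_generations, num_devices=1, grad_accum_steps=1):
    """Number of distinct prompts covered by one optimization step.

    Assumes each batch slot holds one completion and that completions of the
    same prompt are grouped together (an assumption about the trainer).
    """
    total_completions = per_device_batch_size * num_devices * grad_accum_steps
    if total_completions % num_generations != 0:
        raise ValueError(
            f"{total_completions} completions per step cannot be split into "
            f"groups of {num_generations}"
        )
    return total_completions // num_generations


# Settings from this guide on a single GPU: one prompt per step, two completions each.
print(prompts_per_step(per_device_batch_size=2, num_generations=2))  # 1
```

Raising num_generations therefore requires a larger batch size or more accumulation steps, which is one reason the values here are kept small for Colab.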

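3.5 Training the model 🏃

Now, let's configure the trainer and start training the model!

In this case, we pass the two reward functions we defined earlier to the trainer, along with the model, the training arguments, and the dataset.

Below is a diagram of the training procedure we'll be reproducing, taken from the Open-R1 project.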

from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,
    processing_class=processor,
    reward_funcs=[format_reward, accuracy_reward],
    args=training_args,
    train_dataset=train_dataset,
)
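When several reward functions are passed, their scores have to be merged into one reward per completion. The sketch below assumes a weighted sum in which a None score (as accuracy_reward returns when verification fails) is simply skipped; this is an illustration of the idea, not the trainer's actual code, and combine_rewards is a hypothetical helper.

```python
def combine_rewards(per_function_rewards, weights=None):
    """Merge the scores of several reward functions into one reward per completion.

    per_function_rewards: one list of scores per reward function, all the same length.
    weights: optional weight per reward function (equal weights assumed by default).
    """
    if weights is None:
        weights = [1.0] * len(per_function_rewards)
    totals = []
    for scores in zip(*per_function_rewards):
        total = 0.0
        for weight, score in zip(weights, scores):
            if score is None:  # this function abstained for this completion
                continue
            total += weight * score
        totals.append(total)
    return totals


format_scores = [1.0, 0.0, 1.0]
accuracy_scores = [1.0, None, 0.0]
print(combine_rewards([format_scores, accuracy_scores]))  # [2.0, 0.0, 1.0]
```

Under this scheme a completion that is both well-formatted and correct earns 2.0, which gives a useful reference point when reading the reward curves.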

Time to train the model!

trainer.train()

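We can review the training metrics directly in TensorBoard on the [model page](https://huggingface.co/sergiopaniego/Qwen2.5-VL-3B-Instruct-Thinking/tensorboard). While the loss curve might look a bit odd, the reward results tell a clearer story: the model steadily improves, earning more reward over time.

Now, let's save the results to our account 💾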

trainer.save_model(training_args.output_dir)
trainer.push_to_hub(dataset_name=dataset_id)

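4. Check the model performance

Now that the model is trained, we can check its performance for a qualitative evaluation.

We recommend restarting your session to free the resources used for training. If you do, re-run the cells that define SYSTEM_PROMPT and load test_dataset, since the code below relies on both.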

trained_model_id = "sergiopaniego/Qwen2.5-VL-3B-Instruct-Thinking"

For that, we'll use the test subset of our dataset. First, let's load the trained model and its processor.

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

trained_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    trained_model_id,
    torch_dtype="auto",
    device_map="auto",
)
trained_processor = AutoProcessor.from_pretrained(trained_model_id, use_fast=True, padding_side="left")

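We'll create a helper function for generating responses. It makes it easier to send a problem and an image and retrieve the model's response, which should include the reasoning trace and the final answer.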

import time
import torch
from qwen_vl_utils import process_vision_info


def generate_with_reasoning(problem, image):
    # Conversation setting for sending to the model
    conversation = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": problem},
            ],
        },
    ]
    prompt = trained_processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

    # Process images using the process_vision_info from qwen_vl_utils
    image_inputs, video_inputs = process_vision_info(conversation)

    inputs = trained_processor(
        text=[prompt],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to(trained_model.device)

    # Generate text without gradients
    start_time = time.time()
    with torch.no_grad():
        output_ids = trained_model.generate(**inputs, max_new_tokens=500)
    end_time = time.time()

    # Decode and extract model response
    generated_text = trained_processor.decode(output_ids[0], skip_special_tokens=True)

    # Get inference time
    inference_duration = end_time - start_time

    # Get number of generated tokens
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output_ids.shape[1] - num_input_tokens

    return generated_text, inference_duration, num_generated_tokens

Let's check it!

>>> generated_text, inference_duration, num_generated_tokens = generate_with_reasoning(
...     test_dataset[0]["problem"], test_dataset[0]["image"]
... )
>>> print(generated_text)

system
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>
user
Based on the image, determine the sine value of angle AOB if it measures 120 degrees. Choose the correct answer from the options provided.

Choices:
A. $\frac{\sqrt{3}}{2}$
B. $\frac{1}{2}$
C. $-\frac{\sqrt{3}}{2}$
D. $\sqrt{2}$
assistant
<think>
In a circle, the sine of an angle is equal to the ratio of the length of the side opposite the angle to the hypotenuse. In this case, since angle AOB is 120 degrees, we can use the properties of a 30-60-90 triangle to find the sine value. The sine of 120 degrees is equivalent to the sine of 60 degrees because 180 - 120 = 60. The sine of 60 degrees is $\frac{\sqrt{3}}{2}$. Therefore, the sine of angle AOB is $\frac{\sqrt{3}}{2}$.
</think>
<answer>
$\frac{\sqrt{3}}{2}$
</answer>

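The answer appears to follow the constraints we added during training through the reward functions. We can see that the model generates output of the form <think>reasoning</think><answer>solution</answer>. Let's check the actual solution to see whether the model is correct.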

test_dataset[0]["solution"]
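To compare the prediction with the reference programmatically, the final answer can be pulled out of the decoded text. The following is a minimal sketch; extract_answer is a hypothetical helper, and the sample string is a shortened stand-in for real model output. Because the decoded text also contains the system prompt, which itself includes an <answer> example, the helper keeps the last match rather than the first.

```python
import re

ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def extract_answer(text):
    """Return the content of the last <answer>...</answer> block, or None if absent."""
    matches = ANSWER_RE.findall(text)
    return matches[-1] if matches else None


decoded = (
    "system\n... i.e., <think> reasoning process here </think><answer> answer here </answer>\n"
    "assistant\n<think>\nsin(120) = sin(60)\n</think>\n<answer>\nA\n</answer>"
)
print(extract_answer(decoded))           # A
print(extract_answer("no tags in here"))  # None
```

The extracted string can then be compared with the solution column, for example after stripping whitespace and lowercasing both sides.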

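It looks like the model has incorporated some reasoning capabilities! Let's also check the inference time and the number of generated tokens for further insight into the model's behavior.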

>>> print(f"Inference time: {inference_duration:.2f} seconds")
>>> print(f"Generated tokens: {num_generated_tokens}")

Inference time: 11.03 seconds
Generated tokens: 163
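These two numbers can be combined into a throughput figure, which is easier to compare across runs and hardware. A minimal sketch, with tokens_per_second as a hypothetical helper name:

```python
def tokens_per_second(num_tokens, seconds):
    """Average generation throughput for a single call."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_tokens / seconds


# Values taken from the output above.
print(f"{tokens_per_second(163, 11.03):.2f} tokens/s")  # 14.78 tokens/s
```

This is a single-sample measurement that includes prompt processing, so it should be read as a rough indication rather than a benchmark.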

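5. Continuing your learning journey 🧑‍🎓

The learning journey does not stop here!

If you're eager to learn more about GRPO, reasoning, or VLMs, we can recommend some materials.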
