Post-training an LLM for reasoning with GRPO in TRL
Authored by: Sergio Paniego
In this notebook, we'll guide you through post-training a Large Language Model (LLM) using Group Relative Policy Optimization (GRPO), a method introduced in the DeepSeekMath paper. GRPO is particularly effective for scaling test-time compute toward extended reasoning, making it an ideal approach for solving complex tasks such as mathematical problem-solving.
GRPO is a reinforcement learning (RL) post-training technique that was integrated into the training pipeline of DeepSeek-R1. It appears to share similarities with the training procedures used in the latest OpenAI o1 and o3 models, although the exact correspondence has not been confirmed. Unlike earlier techniques that relied on search heuristics, GRPO relies exclusively on RL for post-training, strengthening the model's ability to handle complex and nuanced tasks.
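At its core, GRPO samples a group of completions for every prompt, scores each one with a reward function, and turns those rewards into advantages by normalizing them against the group itself. A minimal sketch of this group-relative advantage, following the formulation in the DeepSeekMath paper, is:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}$$

where $r_i$ is the reward of the $i$-th completion in a group of size $G$. Because the baseline comes from the group itself, no separate value (critic) model is required.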
The GRPO technique is available through the TRL library. At the time of writing, the Hugging Face Science team is working on reproducing the full DeepSeek-R1 training procedure, which you can explore in their Open-R1 project. I highly recommend checking it out for a deeper dive into the overall process.
In this notebook, we'll focus specifically on post-training with GRPO, although the final section lists additional resources on DeepSeek-R1 and its training procedure.
Below is a diagram illustrating how this training process works.
1. Install dependencies
Let's start by installing the essential libraries we'll need for fine-tuning! 🚀
!pip install -U -q trl peft math_verify
# Tested with transformers==4.47.1, trl==0.14.0, datasets==3.2.0, peft==0.14.0, accelerate==1.2.1, math_verify==0.3.3
Authenticate with your Hugging Face account to save and share your model directly from this notebook 🗝️.
from huggingface_hub import notebook_login
notebook_login()
2. Load the dataset 📁
These models excel at tasks that require complex reasoning. A prime example is mathematical problem-solving, which often demands multi-step reasoning to arrive at a correct solution.
For this project, we'll use the AI-MO/NuminaMath-TIR dataset. This is a reasoning-focused dataset containing mathematical problems, their solutions, and detailed reasoning steps that explain how to move from the problem statement to the final solution.
from datasets import load_dataset
dataset_id = "AI-MO/NuminaMath-TIR"
train_dataset, test_dataset = load_dataset(dataset_id, split=["train[:5%]", "test[:5%]"])
Let's check the structure of the dataset:
>>> print(train_dataset)
Dataset({
    features: ['problem', 'solution', 'messages'],
    num_rows: 3622
})
Let's take a look at one sample:
>>> print(train_dataset[0])
{'problem': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$? Express your answer as a common fraction.', 'solution': "To determine the coefficient of in the expansion of, we can use the binomial theorem.\n\nThe binomial theorem states:\n\\[\n(a + b)^n = \\sum_{k=0}^{n} \\binom{n}{k} a^{n-k} b^k\n\\]\n\nIn this case,,, and.\n\nWe are interested in the term that contains. In the general term of the binomial expansion:\n\\[\n\\binom{8}{k} \\left(\\frac{3}{5}x\\right)^{8-k} \\left(-\\frac{y}{2}\\right)^k\n\\]\n\nTo get, we need, thus.\n\nSubstituting into the expression:\n\\[\n\\binom{8}{6} \\left(\\frac{3}{5}x\\right)^{8-6} \\left(-\\frac{y}{2}\\right)^6 = \\binom{8}{6} \\left(\\frac{3}{5}x\\right)^2 \\left(-\\frac{y}{2}\\right)^6\n\\]\n\nNow, we will compute each part of this expression.\n\n1. Calculate the binomial coefficient.\n2. Compute.\n3. Compute.\n4. Combine everything together to get the coefficient of.\n\nLet's compute these in Python.\n```python\nfrom math import comb\n\n# Given values\nn = 8\nk = 6\n\n# Calculate the binomial coefficient\nbinom_coeff = comb(n, k)\n\n# Compute (3/5)^2\na_term = (3/5)**2\n\n# Compute (-1/2)^6\nb_term = (-1/2)**6\n\n# Combine terms to get the coefficient of x^2y^6\ncoefficient = binom_coeff * a_term * b_term\nprint(coefficient)\n```\n```output\n0.1575\n```\nThe coefficient of in the expansion of is. To express this as a common fraction, we recognize that:\n\n\\[ 0.1575 = \\frac{1575}{10000} = \\frac{63}{400} \\]\n\nThus, the coefficient can be expressed as:\n\n\\[\n\\boxed{\\frac{63}{400}}\n\\]", 'messages': [{'content': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$? Express your answer as a common fraction.', 'role': 'user'}, {'content': "To determine the coefficient of in the expansion of, we can use the binomial theorem.\n\nThe binomial theorem states:\n\\[\n(a + b)^n = \\sum_{k=0}^{n} \\binom{n}{k} a^{n-k} b^k\n\\]\n\nIn this case,,, and.\n\nWe are interested in the term that contains. In the general term of the binomial expansion:\n\\[\n\\binom{8}{k} \\left(\\frac{3}{5}x\\right)^{8-k} \\left(-\\frac{y}{2}\\right)^k\n\\]\n\nTo get, we need, thus.\n\nSubstituting into the expression:\n\\[\n\\binom{8}{6} \\left(\\frac{3}{5}x\\right)^{8-6} \\left(-\\frac{y}{2}\\right)^6 = \\binom{8}{6} \\left(\\frac{3}{5}x\\right)^2 \\left(-\\frac{y}{2}\\right)^6\n\\]\n\nNow, we will compute each part of this expression.\n\n1. Calculate the binomial coefficient.\n2. Compute.\n3. Compute.\n4. Combine everything together to get the coefficient of.\n\nLet's compute these in Python.\n```python\nfrom math import comb\n\n# Given values\nn = 8\nk = 6\n\n# Calculate the binomial coefficient\nbinom_coeff = comb(n, k)\n\n# Compute (3/5)^2\na_term = (3/5)**2\n\n# Compute (-1/2)^6\nb_term = (-1/2)**6\n\n# Combine terms to get the coefficient of x^2y^6\ncoefficient = binom_coeff * a_term * b_term\nprint(coefficient)\n```\n```output\n0.1575\n```\nThe coefficient of in the expansion of is. To express this as a common fraction, we recognize that:\n\n\\[ 0.1575 = \\frac{1575}{10000} = \\frac{63}{400} \\]\n\nThus, the coefficient can be expressed as:\n\n\\[\n\\boxed{\\frac{63}{400}}\n\\]", 'role': 'assistant'}]}
During the DeepSeek-R1 training process, a specific system prompt was used to generate a conversational pipeline that includes reasoning steps. We'll adapt our dataset to follow this approach, where the model is prompted to think through the problem first and then present its answer.
The system prompt used is:
A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user
with the answer. The reasoning process and answer are enclosed within <think> </think> and
<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>. User: prompt. Assistant:
We'll modify our dataset to follow this conversational format, prompting the LLM to generate both the reasoning steps and the final answer.
SYSTEM_PROMPT = (
"A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
"first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
"process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
"<think> reasoning process here </think><answer> answer here </answer>"
)
def make_conversation(example):
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }
train_dataset = train_dataset.map(make_conversation)
test_dataset = test_dataset.map(make_conversation)
Let's take a look at an example:
>>> print(train_dataset[0]["prompt"])
[{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>', 'role': 'system'}, {'content': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$? Express your answer as a common fraction.', 'role': 'user'}]
We'll remove the `messages` and `problem` columns, since we only need the custom `prompt` column and `solution` to verify the generated answer.
>>> train_dataset = train_dataset.remove_columns(["messages", "problem"])
>>> print(train_dataset)
Dataset({
    features: ['solution', 'prompt'],
    num_rows: 3622
})
3. Post-train the base model using GRPO
The diagram below highlights the main differences between PPO (Proximal Policy Optimization) and GRPO (Group Relative Policy Optimization), in particular the removal of the value model in GRPO. For more details on the key differences, you can refer to the full explanation here.
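To make the removal of the value model concrete, the snippet below is a minimal, illustrative sketch (not TRL's actual implementation) of how GRPO derives advantages purely from the rewards of a group of sampled completions, using the group-relative formula shown earlier; the `eps` term is an assumption added here for numerical stability.

import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Normalize each reward against the statistics of its own group.

    `rewards` has shape (num_prompts, num_generations): one row per prompt,
    one column per sampled completion. The group mean serves as the baseline,
    so no separate value (critic) model is needed.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)


# Example: one prompt with 4 sampled completions scored by a reward function
print(group_relative_advantages(torch.tensor([[1.0, 0.0, 1.0, 0.0]])))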
3.1 Load the baseline model
To begin, we'll load Qwen/Qwen2-0.5B-Instruct as the baseline model (the `Policy Model` in the diagram above). With only 0.5 billion parameters, it is lightweight and fits within the available resources. However, for better results, a larger alternative should be considered.
import torch
from transformers import AutoModelForCausalLM
model_id = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
3.2 Configure LoRA
Next, we'll configure LoRA for model training. This technique allows us to fine-tune the model efficiently with a reduced number of trainable parameters, enabling faster and more resource-efficient training.
>>> from peft import LoraConfig, get_peft_model
>>> lora_config = LoraConfig(
... task_type="CAUSAL_LM",
... r=8,
... lora_alpha=32,
... lora_dropout=0.1,
... target_modules=["q_proj", "v_proj"],
... )
>>> model = get_peft_model(model, lora_config)
>>> model.print_trainable_parameters()
trainable params: 540,672 || all params: 494,573,440 || trainable%: 0.1093
3.3 Load the reward functions
For the reward component of the system, we can use either a pretrained reward model or a reward function defined directly in code. For training, the DeepSeek-R1 authors used an accuracy-based reward model that evaluates whether the response is correct, along with a format-based reward that ensures the model places its reasoning process between `<think> </think>` tags. You can find more details here. We can simply define and implement these reward functions as plain Python functions.
In this case, we will use the following reward functions:
- Format enforcement: Ensures that the generation follows a specific format, using `<think> </think> <answer> </answer>` tags for the reasoning.
import re
def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content) for content in completion_contents]
    rewards_list = [1.0 if match else 0.0 for match in matches]
    return rewards_list
- Solution accuracy: Verifies whether the solution to the problem is correct.
from math_verify import LatexExtractionConfig, parse, verify
def accuracy_reward(completions, **kwargs):
    """Reward function that checks if the completion is the same as the ground truth."""
    solutions = kwargs["solution"]
    completion_contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, solution in zip(completion_contents, solutions):
        # Parse both the ground-truth solution and the model completion as LaTeX
        gold_parsed = parse(solution, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
        answer_parsed = parse(content, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
        if len(gold_parsed) != 0:
            try:
                rewards.append(float(verify(answer_parsed, gold_parsed)))
            except Exception:
                rewards.append(0.0)
        else:
            # If the ground truth cannot be parsed, do not penalize the model
            rewards.append(1.0)
    return rewards
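Before wiring these functions into the trainer, it can be helpful to sanity-check them on a hand-written example. The snippet below is purely illustrative: the completion and the solution are made up, and they follow the `completions` / `solution` structure the functions above expect.

# Made-up completion/solution pair, only for checking the reward functions
sample_completions = [[{"content": "<think>2 + 2 = 4</think><answer>\\boxed{4}</answer>"}]]
sample_solutions = ["The result is \\(\\boxed{4}\\)."]

print(format_reward(completions=sample_completions))  # expected: [1.0]
print(accuracy_reward(completions=sample_completions, solution=sample_solutions))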
3.4 Configure the GRPO training parameters
Next, let's configure the training parameters for GRPO. We recommend experimenting with the `max_completion_length`, `num_generations`, and `max_prompt_length` parameters (refer to the diagram at the beginning for details about each of them).
To keep things simple, we'll train for just one epoch and reduce `max_completion_length`, `num_generations`, and `max_prompt_length` from their default values.
from trl import GRPOConfig
# Configure training arguments using GRPOConfig
training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO-test",
    learning_rate=1e-5,
    remove_unused_columns=False,  # to access the solution column in accuracy_reward
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    bf16=True,
    # Parameters that control the data preprocessing
    max_completion_length=64,  # default: 256
    num_generations=4,  # default: 8
    max_prompt_length=128,  # default: 512
    # Parameters related to reporting and saving
    report_to=["tensorboard"],
    logging_steps=10,
    push_to_hub=True,
    save_strategy="steps",
    save_steps=10,
)
3.5 Train the model 🏃
Now, let's configure the trainer and start training the model!
In this case, we pass the two reward functions we defined earlier to the trainer.
Below is a diagram of the training procedure we'll reproduce, which comes from the Open-R1 project.
from trl import GRPOTrainer
trainer = GRPOTrainer(
    model=model, reward_funcs=[format_reward, accuracy_reward], args=training_args, train_dataset=train_dataset
)
It's time to train the model! 🎉
trainer.train()
Let's save the results 💾
trainer.save_model(training_args.output_dir)
trainer.push_to_hub(dataset_name=dataset_id)
Below, you can check the TensorBoard results for the training run. They look promising!
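If you are running this notebook yourself, you can inspect those logs with TensorBoard. A minimal sketch, assuming the Trainer wrote its event files under the `output_dir` configured above (it logs to a `runs/` subfolder by default, which TensorBoard picks up recursively):

# Load the TensorBoard notebook extension and point it at the training output directory
%load_ext tensorboard
%tensorboard --logdir Qwen2-0.5B-GRPO-test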
4. Check the model performance
We've kept things simple so far, but now let's check whether the model has already learned to reason. We'll load the saved model and run an evaluation on a test sample.
from transformers import AutoTokenizer
model_id = "sergiopaniego/Qwen2-0.5B-GRPO"
trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
trained_tokenizer = AutoTokenizer.from_pretrained(model_id)
Let's check one sample from the test set!
>>> print(test_dataset["prompt"][0])
[{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>', 'role': 'system'}, {'content': "In 1988, a person's age was equal to the sum of the digits of their birth year. How old was this person?", 'role': 'user'}]
We'll create a function to interact with the model. Besides generating the answer, we'll also measure the inference duration and count the number of generated tokens. This will give us insight into how much the model reasoned during generation.
import time
def generate_with_reasoning(prompt):
    # Build the prompt from the dataset
    prompt = " ".join(entry["content"] for entry in prompt)

    # Tokenize and move to the same device as the model
    inputs = trained_tokenizer(prompt, return_tensors="pt").to(trained_model.device)

    # Generate text without gradients
    start_time = time.time()
    with torch.no_grad():
        output_ids = trained_model.generate(**inputs, max_length=500)
    end_time = time.time()

    # Decode and extract model response
    generated_text = trained_tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Get inference time
    inference_duration = end_time - start_time

    # Get number of generated tokens
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output_ids.shape[1] - num_input_tokens

    return generated_text, inference_duration, num_generated_tokens
Let's generate the answer for that test sample!
>>> prompt = test_dataset["prompt"][0]
>>> generated_text, inference_duration, num_generated_tokens = generate_with_reasoning(prompt)
>>> print(generated_text)
A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer> In 1988, a person's age was equal to the sum of the digits of their birth year. How old was this person?The reasoning process is that if the sum of the digits of the birth year is equal to the person's age, then the person must have been born in a given year. The answer is: 1988
The model is already able to generate the expected `<think>` and `<answer>` tags, even though the solution itself is incorrect.
Considering the inference time and the number of generated tokens, this approach shows potential advantages.
>>> print(f"Inference time: {inference_duration:.2f} seconds")
>>> print(f"Generated tokens: {num_generated_tokens}")
Inference time: 2.09 seconds
Generated tokens: 55
Let's look at just the generated response to better visualize this behavior:
>>> prompt_text = " ".join(entry["content"] for entry in prompt)
>>> response_text = generated_text[len(prompt_text) :].strip()
>>> print(response_text)
The reasoning process is that if the sum of the digits of the birth year is equal to the person's age, then the person must have been born in a given year. The answer is: 1988
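As a quick, optional check, we can reuse the `format_reward` function defined earlier to score this response; a value of 1.0 means the response is wrapped in the expected `<think> ... </think><answer> ... </answer>` structure.

# format_reward expects a list of conversations, each a list of messages
print(format_reward(completions=[[{"content": response_text}]]))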
We observe that the model shows some reasoning capabilities, although they are limited. This can be attributed to several factors: the use of a small model, a limited subset of the dataset, and a short training duration, all chosen to keep the process simple and practical in a notebook environment.
In addition, the complexity of the dataset plays a role. Simplifying the problems might yield better results, as shown here.
Despite these limitations, the technique shows great promise. The release of DeepSeek-R1 and the adoption of this training approach could lead to major breakthroughs in the coming months!
5. Continue your learning journey 🧑‍🎓
As you've seen, this is just the beginning of exploring the GRPO trainer and the DeepSeek-R1 models. If you're eager to dive deeper, be sure to explore the resources linked throughout this notebook, as well as these additional materials:
- DeepSeek-R1 repository
- DeepSeek-R1 paper
- Open reproduction of DeepSeek-R1 (Open-R1)
- GRPO TRL trainer
- Phil Schmid's DeepSeek-R1 blog post
- Phil Schmid's mini DeepSeek-R1 blog post
- The Illustrated DeepSeek-R1
- The LM Book's DeepSeek-R1 article
Happy learning and experimenting! 🚀