Hands-on exercise: fine-tune a model with GRPO

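Now that you've seen the theory, let's put it into practice! In this exercise, you'll fine-tune a model with GRPO.

This exercise was written by LLM fine-tuning expert @mlabonne.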

Install dependencies

First, let's install the dependencies for this exercise.

!pip install -qqq datasets==3.2.0 transformers==4.47.1 trl==0.14.0 peft==0.14.0 accelerate==1.2.1 bitsandbytes==0.45.2 wandb==0.19.7 --progress-bar off
!pip install -qqq flash-attn --no-build-isolation --progress-bar off

Now we'll import the necessary libraries.

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import GRPOConfig, GRPOTrainer

Import and log in to Weights & Biases

Weights & Biases is a tool for logging and monitoring experiments. We'll use it to log our fine-tuning process.

import wandb

wandb.login()

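You can complete this exercise without logging in to Weights & Biases, but logging in is recommended so you can track your experiments and interpret the results.

Load the dataset

Now, let's load the dataset. In this case, we'll use the mlabonne/smoltldr dataset, which contains a list of short stories.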

dataset = load_dataset("mlabonne/smoltldr")
print(dataset)

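Load the model

Now, let's load the model.

In this exercise, we'll use the SmolLM2-135M model. Note that the model_id in the code below points to the HuggingFaceTB/SmolLM-135M-Instruct checkpoint.

This is a small 135M-parameter model that runs on limited hardware. That makes it a good fit for learning, though it is not the most capable model. If you have access to more powerful hardware, you can try fine-tuning a larger model such as SmolLM2-1.7B.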

model_id = "HuggingFaceTB/SmolLM-135M-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

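Load LoRA

Next, let's load the LoRA configuration. We'll use LoRA to reduce the number of trainable parameters, which in turn reduces the memory footprint needed to fine-tune the model.

If you're not familiar with LoRA, you can read more about it in Chapter 11.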

# Load LoRA
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
)
model = get_peft_model(model, lora_config)
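# print_trainable_parameters() prints its summary itself and returns None,
# so it should not be wrapped in print()
model.print_trainable_parameters()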

Total trainable parameters: 135M
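To make the parameter savings concrete, here is a minimal sketch of the arithmetic behind a single LoRA adapter. It assumes the standard LoRA formulation, in which a frozen weight of shape d_out x d_in receives a low-rank update built from two matrices of rank r, scaled by lora_alpha / r. The 576 x 576 layer shape and the helper name lora_added_params are illustrative assumptions, not values read from the model.

```python
def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Hypothetical layer shape, chosen only for illustration.
d_in, d_out = 576, 576
r, lora_alpha = 16, 32  # same values as in the LoraConfig above

full = d_in * d_out                        # size of the frozen weight matrix
added = lora_added_params(d_in, d_out, r)  # trainable adapter parameters
scaling = lora_alpha / r                   # factor applied to the low-rank update

print(f"full weight matrix: {full:,} parameters")
print(f"LoRA adapter:       {added:,} parameters ({added / full:.1%} of the layer)")
print(f"update scaling:     {scaling}")
```

Because the adapter size grows with r * (d_in + d_out) rather than d_in * d_out, the relative savings get larger for wider layers.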

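Define the reward function

As mentioned in the previous section, GRPO can use any reward function to improve the model. Here we'll use a simple reward function that encourages the model to generate text about 50 tokens long. Note that, as written, the function measures each completion with Python's len, so for plain-string completions the length is counted in characters rather than tokens.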

# Reward function
ideal_length = 50


def reward_len(completions, **kwargs):
    return [-abs(ideal_length - len(completion)) for completion in completions]
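As a quick sanity check, the reward function can be called directly on a few hand-made strings. This sketch repeats the definition so it runs on its own; the sample completions are invented for illustration and assume that completions arrive as plain strings.

```python
ideal_length = 50


def reward_len(completions, **kwargs):
    # Reward is the negative distance from the ideal length, so 0 is the maximum.
    return [-abs(ideal_length - len(completion)) for completion in completions]


# Made-up completions of 50, 40, and 70 characters.
samples = ["x" * 50, "x" * 40, "x" * 70]
rewards = reward_len(samples)
print(rewards)  # [0, -10, -20]
```

Completions that are too short and too long are penalized symmetrically, and only a completion of exactly the ideal length reaches the maximum reward of 0.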

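Define the training arguments

Now, let's define the training arguments. We'll use the GRPOConfig class to define them in the usual transformers style.

If this is your first time defining training arguments, see the TrainingArguments class for more information, or Chapter 2 for a detailed introduction.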

# Training arguments
training_args = GRPOConfig(
    output_dir="GRPO",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    max_prompt_length=512,
    max_completion_length=96,
    num_generations=8,
    optim="adamw_8bit",
    num_train_epochs=1,
    bf16=True,
    report_to=["wandb"],
    remove_unused_columns=False,
    logging_steps=1,
)

Now we can initialize the trainer with the model, dataset, and training arguments, and start training.

# Trainer
trainer = GRPOTrainer(
    model=model,
    reward_funcs=[reward_len],
    args=training_args,
    train_dataset=dataset["train"],
)

# Train model
wandb.init(project="GRPO")
trainer.train()

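Training takes about 1 hour on a single A10G GPU, which is available on Google Colab or through Hugging Face Spaces.

Push the model to the Hub during training

If we set the push_to_hub argument to True and the hub_model_id argument to a valid model name, the model will be pushed to the Hugging Face Hub during training. This is useful if you want to start testing the model right away!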
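As a minimal sketch, the two Hub-related settings can be collected in a dictionary and merged into the training configuration. push_to_hub and hub_model_id are arguments of transformers' TrainingArguments, which GRPOConfig inherits from; the repository name below is a placeholder, not a real repository.

```python
# Extra keyword arguments for the training configuration.
hub_kwargs = {
    "push_to_hub": True,
    "hub_model_id": "your-username/SmolGRPO-135M",  # hypothetical repository name
}

# They would be merged into the configuration defined earlier, for example:
# training_args = GRPOConfig(output_dir="GRPO", **hub_kwargs)
print(hub_kwargs)
```

Pushing requires being logged in to the Hugging Face Hub with a token that has write access.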

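Interpret the training results

GRPOTrainer logs the reward from your reward function, the loss, and a range of other metrics.

We'll focus on the reward from the reward function and the loss.

As you can see, the reward approaches 0 as the model learns. Because the reward is the negative distance from the ideal length, 0 is its maximum, so this is a good sign that the model is learning to generate text of the right length.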

Figure: Reward from reward function

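You may notice that the loss starts at zero and then increases during training, which may seem counterintuitive. This behavior is expected in GRPO and follows directly from the algorithm's mathematical formulation. The loss in GRPO is proportional to the KL divergence (the cap relative to the original policy). As training progresses, the model learns to generate text that better matches the reward function, so it deviates more from its initial policy. This growing deviation shows up as a rising loss value, which actually indicates that the model is successfully adapting to optimize the reward function.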

Figure: Loss
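The behavior of the loss can be illustrated with a small numerical sketch. It assumes two ingredients commonly described for GRPO: advantages obtained by normalizing rewards within each group of completions, and a per-token KL estimate of the form exp(d) - d - 1, where d is the reference log-probability minus the policy log-probability. The reward values, the helper names, and the use of the population standard deviation are illustrative assumptions, not taken from a real training run or from the trainer's source.

```python
import math
import statistics


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std, chosen here for simplicity
    return [(r - mean) / (std + eps) for r in rewards]


def kl_estimate(logp, ref_logp):
    """Non-negative KL estimate; exactly 0 when policy and reference agree."""
    diff = ref_logp - logp
    return math.exp(diff) - diff - 1.0


# Made-up rewards for one group of four completions.
rewards = [-30, -12, -5, -48]
advantages = group_advantages(rewards)

# The normalized advantages sum to (numerically) zero within the group.
print(sum(advantages))

# At the start the policy equals the reference, so the KL term is zero.
print(kl_estimate(logp=-1.0, ref_logp=-1.0))

# Once the policy drifts away from the reference, the KL term becomes positive.
print(kl_estimate(logp=-1.5, ref_logp=-1.0))
```

Under these assumptions the advantage-weighted term averages out to zero in value, so the reported loss starts near zero and then tracks the growing KL term as the policy moves away from the reference model.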

Save and publish the model

Let's share the model with the community!

merged_model = trainer.model.merge_and_unload()
merged_model.push_to_hub(
    "SmolGRPO-135M", private=False, tags=["GRPO", "Reasoning-Course"]
)

Generate text

🎉 You've successfully fine-tuned a model with GRPO! Now, let's generate some text with it.

First, we'll define a really long document!

prompt = """
# A long document about the Cat

The cat (Felis catus), also referred to as the domestic cat or house cat, is a small 
domesticated carnivorous mammal. It is the only domesticated species of the family Felidae.
Advances in archaeology and genetics have shown that the domestication of the cat occurred
in the Near East around 7500 BC. It is commonly kept as a pet and farm cat, but also ranges
freely as a feral cat avoiding human contact. It is valued by humans for companionship and
its ability to kill vermin. Its retractable claws are adapted to killing small prey species
such as mice and rats. It has a strong, flexible body, quick reflexes, and sharp teeth,
and its night vision and sense of smell are well developed. It is a social species,
but a solitary hunter and a crepuscular predator. Cat communication includes
vocalizations—including meowing, purring, trilling, hissing, growling, and grunting—as
well as body language. It can hear sounds too faint or too high in frequency for human ears,
such as those made by small mammals. It secretes and perceives pheromones.
"""

messages = [
    {"role": "user", "content": prompt},
]

Now we can generate text with the model.

# Generate text
from transformers import pipeline

generator = pipeline("text-generation", model="SmolGRPO-135M")

## Or use the model and tokenizer we defined earlier
# generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

generate_kwargs = {
    "max_new_tokens": 256,
    "do_sample": True,
    "temperature": 0.5,
    "min_p": 0.1,
}

# Unpack the dictionary so each setting is passed as its own generation argument
generated_text = generator(messages, **generate_kwargs)

print(generated_text)
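Printing the raw result shows the whole conversation. For chat-style input, the text-generation pipeline typically returns a list with one dictionary whose generated_text entry holds the messages, with the model's reply last. The sketch below assumes that structure and uses a mocked output, so the reply text is invented for illustration and last_reply is a hypothetical helper.

```python
# Mocked pipeline output, for illustration only.
generated_text = [
    {
        "generated_text": [
            {"role": "user", "content": "# A long document about the Cat ..."},
            {"role": "assistant", "content": "The cat is a small domesticated mammal."},
        ]
    }
]


def last_reply(output):
    """Return the content of the final message in the first generated conversation."""
    return output[0]["generated_text"][-1]["content"]


reply = last_reply(generated_text)
print(reply)
print(len(reply))  # length in characters, comparable to ideal_length in reward_len
```

Comparing the length of the reply with ideal_length gives a quick check of whether the fine-tuned model has picked up the length preference.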

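Conclusion

In this chapter, we learned how to fine-tune a model with GRPO. We also learned how to interpret the training results and generate text with the model.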
