🚀 使用 LoRA 微調 Qwen3 8B

本教程展示瞭如何使用 optimum-neuron 在 AWS Trainium 加速器上微調 Qwen3 模型。

本教程基於 Qwen3 微調示例指令碼。

1. 🛠️ 設定 AWS 環境

我們將使用一個擁有 16 個 Trainium 加速器（32 個 Neuron 核心）的 `trn1.32xlarge` 例項以及 Hugging Face Neuron 深度學習 AMI。

Hugging Face AMI 包含了所有預裝的必需庫

datasets, transformers, optimum-neuron
Neuron SDK 包
無需額外環境設定

要建立您的例項，請遵循此處的指南。

2. 📊 載入並準備資料集

我們將使用 simple recipes 資料集來微調我們的模型以生成食譜。

{
    'recipes': "- Preheat oven to 350 degrees\n- Butter two 9x5' loaf pans\n- Cream the sugar and the butter until light and whipped\n- Add the bananas, eggs, lemon juice, orange rind\n- Beat until blended uniformly\n- Be patient, and beat until the banana lumps are gone\n- Sift the dry ingredients together\n- Fold lightly and thoroughly into the banana mixture\n- Pour the batter into prepared loaf pans\n- Bake for 45 to 55 minutes, until the loaves are firm in the middle and the edges begin to pull away from the pans\n- Cool the loaves on racks for 30 minutes before removing from the pans\n- Freezes well",
    'names': 'Beat this banana bread'
}

我們使用 `datasets` 庫中的 `load_dataset()` 方法來載入資料集。

from random import randrange

from datasets import load_dataset


# Load dataset from the hub
dataset_id = "tengomucho/simple_recipes"
recipes = load_dataset(dataset_id, split="train")

dataset_size = len(recipes)
print(f"dataset size: {dataset_size}")
print(recipes[randrange(dataset_size)])
# dataset size: 20000

為了調整我們的模型，我們需要將結構化的示例轉換為帶有給定上下文的引用集合，因此我們定義了分詞函式，以便可以將其對映到資料集上。

資料集應以輸入-輸出對的形式構建，其中每個輸入是提示，輸出是模型的預期響應。我們將利用模型的分詞器聊天模板並對資料集進行預處理，以便輸入到訓練器中。

# Preprocesses the dataset
def preprocess_dataset_with_eos(eos_token):
    def preprocess_function(examples):
        recipes = examples["recipes"]
        names = examples["names"]

        chats = []
        for recipe, name in zip(recipes, names):
            # Append the EOS token to the response
            recipe += eos_token

            chat = [
                {"role": "user", "content": f"How can I make {name}?"},
                {"role": "assistant", "content": recipe},
            ]

            chats.append(chat)
        return {"messages": chats}

    dataset = recipes.map(preprocess_function, batched=True, remove_columns=recipes.column_names)
    return dataset

# Structures the dataset into prompt-expected output pairs.
def formatting_function(examples):
    return tokenizer.apply_chat_template(examples["messages"], tokenize=False, add_generation_prompt=False)

注意：這些函式引用了 `eos_token` 和 `tokenizer`，它們在用於執行本教程的 Python 指令碼中已經明確定義。

3. 🎯 使用 NeuronSFTTrainer 和 PEFT 微調 Qwen3

對於標準的 PyTorch 微調，您通常會使用帶 LoRA 介面卡的 PEFT 和 `SFTTrainer`。

在 AWS Trainium 上，`optimum-neuron` 提供了 `NeuronSFTTrainer` 作為直接替代品。

在 Trainium 上進行分散式訓練： 由於 Qwen3 無法在單個加速器上執行，我們使用分散式訓練技術

資料並行 (DDP)
張量並行
流水線並行

模型載入和 LoRA 配置與其他加速器類似。

將所有部分組合在一起，並假設資料集已經載入，我們可以編寫以下程式碼在 AWS Trainium 上微調 Qwen3

model_id = "Qwen/Qwen3-8B"

# Define the training arguments
output_dir = "qwen3-finetuned-recipes"
training_args = NeuronTrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    do_train=True,
    max_steps=-1,  # -1 means train until the end of the dataset
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=5e-4,
    bf16=True,  
    tensor_parallel_size=8,
    logging_steps=2,
    lr_scheduler_type="cosine",
    overwrite_output_dir=True,
)

# Load the model with the NeuronModelForCausalLM class.
# It will load the model with a custom modeling speficically designed for AWS Trainium.
trn_config = training_args.trn_config
dtype = torch.bfloat16 if training_args.bf16 else torch.float32
model = NeuronModelForCausalLM.from_pretrained(
    model_id,
    trn_config,
    torch_dtype=dtype,
    # Use FlashAttention2 for better performance and to be able to use larger sequence lengths.
    use_flash_attention_2=True,
)

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=[
        "embed_tokens",
        "q_proj",
        "v_proj",
        "o_proj",
        "k_proj",
        "up_proj",
        "down_proj",
        "gate_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

# Converting the NeuronTrainingArguments to a dictionary to feed them to the NeuronSFTConfig.
args = training_args.to_dict()

sft_config = NeuronSFTConfig(
    max_seq_length=4096,
    packing=True,
    **args,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = preprocess_dataset_with_eos(tokenizer.eos_token)

 def formatting_function(examples):
     return tokenizer.apply_chat_template(examples["messages"], tokenize=False, add_generation_prompt=False)

 # The NeuronSFTTrainer will use `formatting_function` to format the dataset and `lora_config` to apply LoRA on the
 # model.
 trainer = NeuronSFTTrainer(
     args=sft_config,
     model=model,
     peft_config=lora_config,
     tokenizer=tokenizer,
     train_dataset=dataset,
     formatting_func=formatting_function,
 )
 trainer.train()

📝 完整指令碼可用： 上述所有步驟都已整合到一個即用型指令碼中 finetune_qwen3.py。

要啟動訓練，只需在您的 AWS Trainium 例項中執行以下命令

# Flags for Neuron compilation
export NEURON_CC_FLAGS="--model-type transformer --retry_failed_compilation"
export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3 # Async Runtime
export MALLOC_ARENA_MAX=64 # Host OOM mitigation

# Variables for training
PROCESSES_PER_NODE=32
NUM_EPOCHS=3
TP_DEGREE=8
BS=1
GRADIENT_ACCUMULATION_STEPS=8
LOGGING_STEPS=2
MODEL_NAME="Qwen/Qwen3-8B" # Change this to the desired model name
OUTPUT_DIR="$(echo $MODEL_NAME | cut -d'/' -f2)-finetuned"
DISTRIBUTED_ARGS="--nproc_per_node $PROCESSES_PER_NODE"
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )

if [ "$NEURON_EXTRACT_GRAPHS_ONLY" = "1" ]; then
    MAX_STEPS=5
else
    MAX_STEPS=-1
fi

torchrun --nproc_per_node $PROCESSES_PER_NODE finetune_qwen3.py \
  --model_id $MODEL_NAME \
  --num_train_epochs $NUM_EPOCHS \
  --do_train \
  --max_steps $MAX_STEPS \
  --per_device_train_batch_size $BS \
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
  --learning_rate 8e-4 \
  --bf16 \
  --tensor_parallel_size $TP_DEGREE \
  --zero_1 \
  --async_save \
  --logging_steps $LOGGING_STEPS \
  --output_dir $OUTPUT_DIR \
  --lr_scheduler_type "cosine" \
  --overwrite_output_dir

🔧 單命令執行： 完整的 bash 訓練指令碼 finetune_qwen3.sh 可用

./finetune_qwen3.sh

4. 🔄 整合並測試微調後的模型

在分散式訓練期間，Optimum Neuron 會分別儲存模型分片。這些分片在使用前需要進行整合。

使用 Optimum CLI 進行整合

optimum-cli neuron consolidate Qwen3-8B-finetuned Qwen3-8B-finetuned/adapter_default

這將建立一個 `adapter_model.safetensors` 檔案，即我們在上一步中訓練的 LoRA 介面卡權重。我們現在可以重新載入模型並將其合併，以便載入進行評估

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig


MODEL_NAME = "Qwen/Qwen3-8B"
ADAPTER_PATH = "Qwen3-8B-finetuned/adapter_default"
MERGED_MODEL_PATH = "Qwen3-8B-recipes"

# Load base model
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Load adapter configuration and model
adapter_config = PeftConfig.from_pretrained(ADAPTER_PATH)
finetuned_model = PeftModel.from_pretrained(model, ADAPTER_PATH, config=adapter_config)

print("Saving tokenizer")
tokenizer.save_pretrained(MERGED_MODEL_PATH)
print("Saving model")
finetuned_model = finetuned_model.merge_and_unload()
finetuned_model.save_pretrained(MERGED_MODEL_PATH)

完成此步驟後，就可以用新的提示來測試模型了。

您已成功從 Qwen3 建立了一個微調模型！

5. 🤗 推送到 Hugging Face Hub

透過將您微調的模型上傳到 Hugging Face Hub，與社群分享。

第 1 步：身份驗證

huggingface-cli login

第 2 步：上傳您的模型

from transformers import AutoModelForCausalLM, AutoTokenizer

MERGED_MODEL_PATH = "Qwen3-8B-recipes"
HUB_MODEL_NAME = "your-username/qwen3-8b-recipes"

# Load and push tokenizer
tokenizer = AutoTokenizer.from_pretrained(MERGED_MODEL_PATH)
tokenizer.push_to_hub(HUB_MODEL_NAME)

# Load and push model
model = AutoModelForCausalLM.from_pretrained(MERGED_MODEL_PATH)
model.push_to_hub(HUB_MODEL_NAME)