Make LLM Fine-tuning 2x faster with Unsloth and 🤗 TRL

Published January 10, 2024

Daniel (Unsloth)

danielhanchen

guest

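Pulling your hair out because LLM fine-tuning is taking forever? In this post, we introduce a lightweight tool developed by the community to make LLM fine-tuning go super fast!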

Before diving into Unsloth, it may help to read our QLoRA blog post, or to be familiar with LLM fine-tuning using the 🤗 PEFT library.

Unsloth - 2x faster, -40% memory usage, 0% accuracy degradation

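Unsloth is a lightweight library for faster LLM fine-tuning that is fully compatible with the Hugging Face ecosystem (Hub, transformers, PEFT, TRL). The library is actively developed by the Unsloth team (Daniel and Michael) and the open source community. It supports most NVIDIA GPUs, from the GTX 1070 all the way up to H100s, and can be used with the entire trainer suite from the TRL library (SFTTrainer, DPOTrainer, PPOTrainer). At the time of writing, Unsloth supports the Llama (CodeLlama, Yi, etc.) and Mistral architectures.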

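Unsloth works by overwriting some parts of the modeling code with optimized operations. By manually deriving the backpropagation steps and rewriting all PyTorch modules into Triton kernels, Unsloth can both reduce memory usage and make fine-tuning faster. Crucially, accuracy degradation is 0% with respect to normal QLoRA, because no approximations are made in the optimized code.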
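To make "manually deriving the backpropagation steps" a little more concrete, here is a minimal pure-Python sketch. It is not Unsloth's Triton code. It assumes a LoRA-style layer Y = XW + s(XA)B with W frozen, writes the gradients for A and B by hand, and checks them against finite differences. Every function name below is hypothetical and chosen only for illustration.

```python
import random

def matmul(p, q):
    # Plain nested-list matrix product: (n x k) @ (k x m) -> (n x m)
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def scale(m, s):
    return [[s * v for v in row] for row in m]

def add(p, q):
    return [[a + b for a, b in zip(rp, rq)] for rp, rq in zip(p, q)]

def lora_forward(x, w, a, b, s):
    # Y = X W + s (X A) B, where W is frozen and A, B are the trainable adapters
    return add(matmul(x, w), scale(matmul(matmul(x, a), b), s))

def lora_backward(x, a, b, s, grad_y):
    # Hand-derived gradients for the trainable matrices only:
    #   dL/dA = s * X^T (dY B^T)
    #   dL/dB = s * (X A)^T dY
    grad_a = scale(matmul(transpose(x), matmul(grad_y, transpose(b))), s)
    grad_b = scale(matmul(transpose(matmul(x, a)), grad_y), s)
    return grad_a, grad_b

def loss(x, w, a, b, s, grad_y):
    # Scalar loss chosen so that its gradient with respect to Y is exactly grad_y
    y = lora_forward(x, w, a, b, s)
    return sum(yv * gv for ry, rg in zip(y, grad_y) for yv, gv in zip(ry, rg))

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def max_gradient_error(seed=0, eps=1e-6):
    # Compare the hand-derived gradients with central finite differences
    rng = random.Random(seed)
    n, d_in, r, d_out, s = 3, 4, 2, 5, 0.5
    x, w = rand_matrix(n, d_in, rng), rand_matrix(d_in, d_out, rng)
    a, b = rand_matrix(d_in, r, rng), rand_matrix(r, d_out, rng)
    grad_y = rand_matrix(n, d_out, rng)
    grad_a, grad_b = lora_backward(x, a, b, s, grad_y)
    worst = 0.0
    for mat, grad in ((a, grad_a), (b, grad_b)):
        for i in range(len(mat)):
            for j in range(len(mat[0])):
                original = mat[i][j]
                mat[i][j] = original + eps
                up = loss(x, w, a, b, s, grad_y)
                mat[i][j] = original - eps
                down = loss(x, w, a, b, s, grad_y)
                mat[i][j] = original
                numeric = (up - down) / (2 * eps)
                worst = max(worst, abs(numeric - grad[i][j]))
    return worst

if __name__ == "__main__":
    print("largest gap between manual and numeric gradients:", max_gradient_error())
```

The point of the exercise is that a hand-written backward pass computes the same gradients as the generic one, so nothing is approximated; any savings come from how the computation is organized.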

Benchmarking

1 A100 40GB	Dataset	🤗 Hugging Face	🤗 + Flash Attention 2	🦥 Unsloth	🦥 VRAM reduction
Code Llama 34b	Slim Orca	1x	1.01x	1.94x	-22.7%
Llama-2 7b	Slim Orca	1x	0.96x	1.87x	-39.3%
Mistral 7b	Slim Orca	1x	1.17x	1.88x	-65.9%
Tiny Llama 1.1b	Alpaca	1x	1.55x	2.74x	-57.8%
DPO with Zephyr	Ultra Chat	1x	1.24x	1.88x	-11.6%

Free Colab T4	Dataset	🤗 Hugging Face	🤗 + PyTorch 2.1.1	🦥 Unsloth	🦥 VRAM reduction
Llama-2 7b	OASST	1x	1.19x	1.95x	-43.3%
Mistral 7b	Alpaca	1x	1.07x	1.56x	-13.7%
Tiny Llama 1.1b	Alpaca	1x	2.06x	3.87x	-73.8%
DPO with Zephyr	Ultra Chat	1x	1.09x	1.55x	-18.6%

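Unsloth was benchmarked across 59 runs using 4 datasets on Tesla T4 and A100 Google Colab instances. QLoRA was applied to all linear layers (attention and MLP) with a rank of 16, and gradient checkpointing was on. Tested against the latest Transformers version (4.36), which has SDPA natively integrated if you have PyTorch 2.1.1, Unsloth is up to 2.7x faster and uses up to 74% less memory. We also tested Unsloth on a free Google Colab instance (low RAM, 1 T4 GPU, PyTorch 2.1.0 CUDA 12.1). All 59 Jupyter notebooks are provided for full reproducibility, and more details are available in Unsloth's benchmarking details.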
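A speedup factor and a VRAM percentage can be easier to reason about as "time saved" and "memory left". The small sketch below does that conversion for one row of the tables above. It assumes the "x" columns are ratios of training time against the plain Hugging Face baseline, which is the natural reading but is an assumption here; the helper names are hypothetical.

```python
def time_saved_fraction(speedup):
    # A run that is `speedup` times faster takes 1/speedup of the baseline time
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup

def memory_remaining_fraction(vram_reduction_percent):
    # The tables report reductions as negative percentages, e.g. -39.3
    return 1.0 + vram_reduction_percent / 100.0

if __name__ == "__main__":
    # Llama-2 7b on one A100 40GB, values taken from the table above
    speedup, vram_change = 1.87, -39.3
    print(f"time saved: {time_saved_fraction(speedup):.1%}")
    print(f"memory still used: {memory_remaining_fraction(vram_change):.1%}")
```

Read this way, a speedup close to 2x means a run finishes in roughly half the baseline time.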

How do I use Unsloth?

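Just load your model with FastLanguageModel.from_pretrained! Currently, Unsloth supports Llama and Mistral type architectures (Yi, Deepseek, TinyLlama, Llamafied Qwen). Please open a GitHub issue if you want support for others! Also, on the latest Transformers main branch, you can now load pre-quantized 4-bit models directly! This makes downloading models 4x faster, and reduces memory fragmentation by around 500MB, which allows you to fit larger batches! We have a few pre-quantized models for your convenience, including unsloth/llama-2-7b-bnb-4bit, unsloth/llama-2-13b-bnb-4bit, unsloth/mistral-7b-bnb-4bit and unsloth/codellama-34b-bnb-4bit.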
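The "4x faster download" figure lines up with simple arithmetic on weight sizes, sketched below. This is a back-of-the-envelope estimate only: it assumes the baseline checkpoint is stored in 16 bits per parameter, uses a round 7 billion parameters, and ignores quantization constants and any layers that stay unquantized. The helper name is hypothetical.

```python
def approx_weight_gigabytes(num_params, bits_per_param):
    # bits -> bytes -> gigabytes (10^9 bytes)
    return num_params * bits_per_param / 8 / 1e9

if __name__ == "__main__":
    num_params = 7e9  # assumed round figure for a "7b" model
    size_16bit = approx_weight_gigabytes(num_params, 16)
    size_4bit = approx_weight_gigabytes(num_params, 4)
    print(f"16-bit weights: ~{size_16bit:.1f} GB")
    print(f"4-bit weights:  ~{size_4bit:.1f} GB")
    print(f"ratio: {size_16bit / size_4bit:.1f}x")
```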

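You will need to provide your intended maximum sequence length to from_pretrained. Unsloth internally performs RoPE scaling, so larger maximum sequence lengths are automatically supported. Otherwise the API is pretty much the same as transformers' from_pretrained, except that FastLanguageModel.from_pretrained also returns the model tokenizer for convenience.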

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit", # Supports Llama, Mistral - replace this!
    max_seq_length = 2048, # Supports RoPE Scaling internally, so choose any!
    load_in_4bit = True,
)
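For readers unfamiliar with RoPE scaling, the sketch below shows one common variant, linear position interpolation, in plain Python. The post does not say which variant Unsloth applies internally, so treat this as an illustration of the general idea under that assumption rather than a description of the library. The function names, the 4096-token trained context, and the default base of 10000 are all illustrative choices.

```python
def linear_scale_factor(max_seq_length, trained_context):
    # Only stretch positions when the requested length exceeds the trained context
    return max(1.0, max_seq_length / trained_context)

def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    # Rotary embeddings rotate each pair of dimensions by position * inverse frequency.
    # Linear scaling divides the position by `scale`, squeezing long sequences
    # back into the position range seen during pre-training.
    scaled_position = position / scale
    return [scaled_position / base ** (2 * i / head_dim)
            for i in range(head_dim // 2)]

if __name__ == "__main__":
    scale = linear_scale_factor(max_seq_length=8192, trained_context=4096)
    # With a scale of 2, position 8190 gets the same angles as unscaled position 4095
    print(scale)
    print(rope_angles(8190, head_dim=8, scale=scale) == rope_angles(4095, head_dim=8))
```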

Once the model has been loaded, use FastLanguageModel.get_peft_model to attach adapters in order to perform QLoRA fine-tuning.

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
)
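To get a feel for how small the adapters are, the following sketch counts the LoRA parameters that r = 16 adds across the seven target modules listed above. The layer shapes and the layer count are assumptions taken from the published Mistral 7B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); they are not read from a loaded model, and the helper names are hypothetical.

```python
# (in_features, out_features) per target module, assumed from the Mistral 7B config
MISTRAL_7B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # assumed number of decoder layers

def lora_params(in_features, out_features, r):
    # LoRA adds A (in_features x r) and B (r x out_features) next to each frozen weight
    return r * (in_features + out_features)

def total_lora_params(shapes, num_layers, r):
    per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
    return num_layers * per_layer

if __name__ == "__main__":
    total = total_lora_params(MISTRAL_7B_SHAPES, NUM_LAYERS, r=16)
    print(f"trainable LoRA parameters: {total:,}")
```

Under these assumptions the adapters hold about 42 million parameters, well under 1% of a model with roughly 7 billion weights.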

Once adapters are attached, you can use the model directly within any class from the HF ecosystem, such as the SFTTrainer from TRL!

Unsloth + TRL integration

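To use Unsloth with the TRL library, simply pass the Unsloth model into SFTTrainer or DPOTrainer! The trained model is fully compatible with the Hugging Face ecosystem, so you can push the final model to the Hub and use transformers for inference out of the box!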

import torch

from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

from unsloth import FastLanguageModel

max_seq_length = 2048 # Supports RoPE Scaling internally, so choose any!
# Get dataset
dataset = load_dataset("imdb", split="train")

# Load the model (a pre-quantized 4-bit Mistral 7b here)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit", # Supports Llama, Mistral - replace this!
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = True,
)

# Do model patching and add fast LoRA weights
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = True,
    random_state = 3407,
    max_seq_length = max_seq_length,
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    args = TrainingArguments(
      per_device_train_batch_size = 2,
      gradient_accumulation_steps = 4,
      warmup_steps = 10,
      max_steps = 60,
      fp16 = not torch.cuda.is_bf16_supported(),
      bf16 = torch.cuda.is_bf16_supported(),
      logging_steps = 1,
      output_dir = "outputs",
      optim = "adamw_8bit",
      seed = 3407,
    ),
)
trainer.train()
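The TrainingArguments above fix how much data this short run actually sees. A quick sketch of the arithmetic, assuming a single GPU (the free Colab setup) and that every optimizer step consumes a full batch; the helper names are hypothetical:

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    # Sequences contributing to each optimizer step
    return per_device_batch_size * gradient_accumulation_steps * num_devices

def sequences_seen(max_steps, per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    # Total training sequences consumed over the whole run
    return max_steps * effective_batch_size(
        per_device_batch_size, gradient_accumulation_steps, num_devices
    )

if __name__ == "__main__":
    # Values copied from the TrainingArguments in the example above
    print(effective_batch_size(2, 4))                  # sequences per optimizer step
    print(sequences_seen(60, 2, 4))                    # sequences over max_steps = 60
```

With these settings each optimizer step averages gradients over 8 sequences, and the 60-step run touches 480 sequences in total, so it is a smoke test rather than a full fine-tune.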

Reproducible notebooks

Below we share fully reproducible Jupyter notebooks for anyone who wants to try out Unsloth with SFTTrainer on a free-tier Google Colab instance.

Llama 7b free Tesla T4 Colab example

Mistral 7b free Tesla T4 Colab example

CodeLlama 34b A100 Colab example

Zephyr DPO replication T4 Colab example
