Fine-tune Mixtral 8x7B with AutoTrain

Community Article, published April 1, 2024

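In this blog post, I will show you how to fine-tune Mixtral 8x7B on your own dataset using AutoTrain. The amount of coding involved is minimal: we will write zero lines of code!

Since Mixtral is quite a large model, fine-tuning it requires substantial hardware. In this post, we will use Hugging Face's latest offering: Train on DGX Cloud. Note, however, that you can follow the same process to train on your own hardware (or with another cloud provider). Steps for training on local/custom hardware are also provided in this post.

NOTE: Train on DGX Cloud is available only to Enterprise users.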

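To fine-tune mixtral-8x7b-instruct on your custom dataset, you can click here and then click the "Train" button. You will be shown a few options, from which you need to select "NVIDIA DGX cloud".

Once done, an AutoTrain Space will be created for you, where you can upload your data, select parameters, and start training.

If you are running locally, all you need to do is install AutoTrain and start the app: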

$ pip install -U autotrain-advanced
$ export HF_TOKEN=your_huggingface_write_token
$ autotrain app --host 127.0.0.1 --port 8080

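Once done, open 127.0.0.1:8080 in your browser and you are ready to fine-tune locally.

If running on DGX Cloud, you will see the option to choose 8xH100 in the hardware dropdown. If running locally, this dropdown is disabled.

As you can see, the AutoTrain UI offers many options for different types of tasks, datasets, and parameters. You can train almost any kind of model on your own dataset with AutoTrain 💥 If you are an advanced user and want to tune more parameters, just click Full under the training parameters!

The more parameters there are, the more confusing it gets for end users, and offering more than what is needed only complicates things. So today we stick to the basic parameters. 99% of the time, the basic parameters are all you need to adjust to get a model that performs very well in the end 😉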

Today, we use the no_robots dataset from the Hugging Face H4 team. You can view the dataset here. Here is what one sample from the dataset looks like:

[ { "content": "Please summarize the goals for scientists in this text:\n\nWithin three days, the intertwined cup nest of grasses was complete, featuring a canopy of overhanging grasses to conceal it. And decades later, it served as Rinkert’s portal to the past inside the California Academy of Sciences. Information gleaned from such nests, woven long ago from species in plant communities called transitional habitat, could help restore the shoreline in the future. Transitional habitat has nearly disappeared from the San Francisco Bay, and scientists need a clearer picture of its original species composition—which was never properly documented. With that insight, conservation research groups like the San Francisco Bay Bird Observatory can help guide best practices when restoring the native habitat that has long served as critical refuge for imperiled birds and animals as adjacent marshes flood more with rising sea levels. “We can’t ask restoration ecologists to plant nonnative species or to just take their best guess and throw things out there,” says Rinkert.", "role": "user" }, { "content": "Scientists are studying nests hoping to learn about transitional habitats that could help restore the shoreline of San Francisco Bay.", "role": "assistant" } ]

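This dataset is pretty much the standard for SFT training. If you want to train your own conversational bot on your custom dataset, this is the format to follow! 🤗 Thanks to the H4 team!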
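If you are preparing your own data in this format, a quick sanity check before uploading can catch malformed conversations early. The following is a minimal sketch, not part of AutoTrain: the function name `validate_conversation` and the set of allowed roles are assumptions made for illustration, and the sample is a shortened stand-in for the record shown above.

```python
# Assumption: these are the roles your chat data uses; adjust as needed.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_conversation(messages):
    """Return a list of problems found in one conversation (empty list = looks fine)."""
    if not isinstance(messages, list) or not messages:
        return ["conversation must be a non-empty list of messages"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = msg.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {role!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i} has empty or non-string content")
    return problems


# Shortened, illustrative conversation in the same shape as the sample above.
sample = [
    {"content": "Please summarize the goals for scientists in this text: (text omitted)", "role": "user"},
    {"content": "Scientists are studying nests to learn about transitional habitats.", "role": "assistant"},
]
# Deliberately broken conversation: unknown role and empty content.
bad = [{"content": "", "role": "bot"}]

print(validate_conversation(sample))
print(validate_conversation(bad))
```

Running a check like this over every row of a custom dataset is cheap compared with discovering a formatting problem partway through a training run.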

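We now have the dataset and the AutoTrain UI up and running. All that remains is to point the UI to the dataset, adjust the parameters, and click the "Start" button. Here is a view of the UI right before clicking "Start".

We chose a Hugging Face Hub dataset and changed the following:

Dataset name: HuggingFaceH4/no_robots
Train split: train_sft, which is how the split is named in this particular dataset.
Column mapping: {"text": "messages"}. This maps AutoTrain's text column to the column that holds the text in the dataset, which in our case is messages.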

For the parameters, the following settings work well:

{
  "block_size": 1024,
  "model_max_length": 2048,
  "mixed_precision": "bf16",
  "lr": 0.00003,
  "epochs": 3,
  "batch_size": 2,
  "gradient_accumulation": 4,
  "optimizer": "adamw_bnb_8bit",
  "scheduler": "linear",
  "chat_template": "zephyr",
  "target_modules": "all-linear",
  "peft": false
}
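To get a feel for what batch_size and gradient_accumulation imply together, the arithmetic can be sketched as follows. This is an illustration under stated assumptions, not something AutoTrain reports: it assumes batch_size is per GPU, that all 8 GPUs each process their own batches (data parallelism), and that block_size is an upper bound on the tokens per sequence.

```python
# Values taken from the settings above.
params = {"block_size": 1024, "batch_size": 2, "gradient_accumulation": 4}

# Assumption: 8xH100, with every GPU processing its own per-device batch.
num_gpus = 8


def effective_batch_size(batch_size, gradient_accumulation, num_gpus):
    """Sequences contributing to a single optimizer step."""
    return batch_size * gradient_accumulation * num_gpus


sequences_per_step = effective_batch_size(
    params["batch_size"], params["gradient_accumulation"], num_gpus
)
# Upper bound, assuming no sequence is longer than block_size tokens.
max_tokens_per_step = sequences_per_step * params["block_size"]

print(sequences_per_step)   # 64
print(max_tokens_per_step)  # 65536
```

If you train on fewer GPUs, raising gradient_accumulation by the same factor keeps the effective batch size unchanged under these assumptions.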

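These settings use the adamw_bnb_8bit optimizer and the zephyr chat template. Depending on your dataset, you can use the zephyr, chatml, or tokenizer chat template. Alternatively, you can set it to none and format the data however you like before uploading it to AutoTrain: the possibilities are endless.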
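To make the role of a chat template concrete, here is a rough sketch of how a Zephyr-style template turns a list of messages into a single training string. It is an approximation for illustration only: the function name `render_zephyr_style` is hypothetical, the end-of-sequence token is assumed to be `</s>`, and the exact tags and whitespace used in practice come from the template itself, not from this code.

```python
# Assumption: the tokenizer's end-of-sequence token is "</s>".
EOS_TOKEN = "</s>"


def render_zephyr_style(messages, eos_token=EOS_TOKEN):
    """Approximate a Zephyr-style layout: a role tag, the content, then EOS, per turn."""
    parts = []
    for msg in messages:
        parts.append(f"<|{msg['role']}|>\n{msg['content']}{eos_token}\n")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]

print(render_zephyr_style(conversation))
```

If you set the chat template to none, a function along these lines is where you would apply your own formatting before uploading the data.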

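Note that for this particular model we did not use quantization, and PEFT is disabled 💥

Once done, click the "Start" button, then grab a coffee and relax.

When I tried this with the same parameters and dataset on 8xH100, training took about 45 minutes (3 epochs), and the model was pushed to the Hub as a private model for me to try immediately 🚀 If you would like to take a look at the trained model, you can find it here.

Great! We have fine-tuned the mixtral 8x7b-instruct model on our own custom dataset, and the model is ready to be deployed with Hugging Face's Inference Endpoints.

Bonus: if you prefer the command line interface, here is the command to run: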

autotrain llm \
--train \
--trainer sft \
--model mistralai/Mixtral-8x7B-Instruct-v0.1 \
--data-path HuggingFaceH4/no_robots \
--train-split train_sft \
--text-column messages \
--chat-template zephyr \
--mixed-precision bf16 \
--lr 2e-5 \
--optimizer adamw_bnb_8bit \
--scheduler linear \
--batch-size 2 \
--epochs 3 \
--gradient-accumulation 4 \
--block-size 1024 \
--max-length 2048 \
--padding right \
--project-name autotrain-xva0j-mixtral8x7b \
--username abhishek \
--push-to-hub
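If you would rather launch the same command from a script, one option is to assemble the argument list in Python. The sketch below only builds and prints the command; it uses the flags from the command above, while the helper name `build_autotrain_command` and the project name and username values are placeholders introduced for illustration.

```python
import shlex

# Flags copied from the command above; None marks a flag that takes no value.
options = [
    ("train", None),
    ("trainer", "sft"),
    ("model", "mistralai/Mixtral-8x7B-Instruct-v0.1"),
    ("data-path", "HuggingFaceH4/no_robots"),
    ("train-split", "train_sft"),
    ("text-column", "messages"),
    ("chat-template", "zephyr"),
    ("mixed-precision", "bf16"),
    ("lr", "2e-5"),
    ("optimizer", "adamw_bnb_8bit"),
    ("scheduler", "linear"),
    ("batch-size", 2),
    ("epochs", 3),
    ("gradient-accumulation", 4),
    ("block-size", 1024),
    ("max-length", 2048),
    ("padding", "right"),
    ("project-name", "my-mixtral-finetune"),  # placeholder project name
    ("username", "your-hf-username"),         # placeholder Hub username
    ("push-to-hub", None),
]


def build_autotrain_command(options):
    """Turn (flag, value) pairs into an argument list for `autotrain llm`."""
    args = ["autotrain", "llm"]
    for name, value in options:
        args.append(f"--{name}")
        if value is not None:
            args.append(str(value))
    return args


args = build_autotrain_command(options)
# Print a copy-pasteable shell command (shlex.join needs Python 3.8+).
print(shlex.join(args))
```

Passing `args` to `subprocess.run(args, check=True)` would start the run, provided autotrain-advanced is installed and HF_TOKEN is set as shown earlier.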

If you have questions, reach out at autotrain@hf.co or on Twitter: @abhi1thakur
