Llama 3.1 - 405B, 70B & 8B with multilinguality and long context

Published July 23, 2024

Philipp Schmid

philschmid

Omar Sanseviero

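Llama 3.1 is out! Today we welcome the newest members of the Llama family to Hugging Face. We are excited to collaborate with Meta to ensure the best possible integration in the Hugging Face ecosystem. Eight open-weight models (3 base models and 5 fine-tuned ones) are available on the Hub.

Llama 3.1 comes in three sizes: 8B for efficient deployment and development on consumer-size GPUs, 70B for large-scale AI-native applications, and 405B for synthetic data, LLM-as-a-judge, or distillation. All three sizes come in base and instruction-tuned variants.

In addition to the six generative models, Meta released two new models: Llama Guard 3 and Prompt Guard. Prompt Guard is a small classifier that detects prompt injections and jailbreaks. Llama Guard 3 is a safeguard model that can classify LLM inputs and generations.

Among the features and integrations being released, we have:

Models on the Hub
Hugging Face Transformers and TGI integration
Hugging Chat integration for Meta Llama 3.1 405B Instruct
Inference and deployment integration with Inference Endpoints, Google Cloud, Amazon SageMaker, and DELL Enterprise Hub
Quantization for FP8, AWQ, and GPTQ for easier inference
Fine-tuning Llama 3.1 8B on a single GPU with 🤗 TRL
Synthetic data generation with Llama 3.1 70B and 405B using Distilabel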

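What's new with Llama 3.1?
How much memory does Llama 3.1 need?
- Inference Memory Requirements
- Training Memory Requirements
Llama 3.1 evaluation
Using Hugging Face Transformers
How to prompt Llama 3.1
- Built-in Tool calling
- Custom Tool calling
Demo
Llama 3.1 405B quantization with FP8, AWQ, and GPTQ
Inference Integrations
- Hugging Face Inference API
- Hugging Face Inference Endpoints
Hugging Face Partner Integrations
Fine-tuning with Hugging Face TRL
Synthetic data generation with distilabel
Additional Resources
Acknowledgments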

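What's new with Llama 3.1?

Why is Llama 3.1 so exciting? On top of the features its predecessor offers, Llama 3.1 has some key new features:

A large context length of 128K tokens (vs. the original 8K)
Multilingual capabilities
Tool usage capabilities
A very large dense model of 405 billion parameters
A more permissive license

Let's dive into these!

The Llama 3.1 release introduces six new open LLM models based on the Llama 3 architecture. They come in three sizes: 8B, 70B, and 405B parameters, each with base (pre-trained) and instruct-tuned versions. All the variants support a context length of 128K tokens and 8 languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. Llama 3.1 continues to use Grouped-Query Attention (GQA), an efficient representation that should help with longer contexts.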

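Meta-Llama-3.1-8B: Base 8B model
Meta-Llama-3.1-8B-Instruct: Instruct fine-tuned version of the base 8B model
Meta-Llama-3.1-70B: Base 70B model
Meta-Llama-3.1-70B-Instruct: Instruct fine-tuned version of the base 70B model
Meta-Llama-3.1-405B: Base 405B model
Meta-Llama-3.1-405B-Instruct: Instruct fine-tuned version of the base 405B model

In addition to these 6 language models, Llama Guard 3 and Prompt Guard were released.

Llama Guard 3 is the latest iteration in the Llama Guard family, fine-tuned on Llama 3.1 8B. It is built for production use cases, with a 128k context length and multilingual capabilities. Llama Guard 3 can classify LLM inputs (prompts) and responses to detect content that would be considered unsafe in a risk taxonomy.
Prompt Guard, on the other hand, is a small 279M-parameter BERT-based classifier that can detect prompt injection and jailbreaking. It was trained on a large corpus of attacks, and further fine-tuning on application-specific data is suggested.

New in Llama 3.1 compared to Llama 3 is that the instruct models are fine-tuned on tool calling for agentic use cases. There are two built-in tools (search, mathematical reasoning with Wolfram Alpha) that can be expanded with custom JSON functions.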

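The Llama 3.1 models were trained on over 15 trillion tokens on a custom-built GPU cluster, with a total of 39.3M GPU hours (1.46M for 8B, 7.0M for 70B, 30.84M for 405B). We don't know the exact details of the training dataset mix, and we can only guess that it has a more diverse curation for multilingualism. Llama 3.1 Instruct has been optimized for instruction following and was trained on publicly available instruction datasets, as well as over 25M synthetically generated examples, with supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF). Meta developed LLM-based classifiers to filter and curate high-quality prompts and responses during the creation of the data mix.

Regarding the licensing terms, Llama 3.1 comes with a very similar license with one key difference: **it allows model outputs to be used to improve other LLMs**. This means that synthetic data generation and distillation are allowed, even with different models! This is especially important for the 405B model, as discussed later. The license allows for redistribution, fine-tuning, and creation of derivative work, and it still requires derived models to include "Llama" at the beginning of their name, and any derivative works or services must mention "Built with Llama". For full details, please make sure to read the official license.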

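How much memory does Llama 3.1 need?

Llama 3.1 brings exciting advancements. However, running it requires careful consideration of your hardware resources. We broke down the memory requirements for both training and inference across the three model sizes.

Inference Memory Requirements

For inference, the memory requirements depend on the model size and the precision of the weights. The following table shows the approximate memory needed for different configurations:

Model Size	FP16	FP8	INT4
8B	16 GB	8 GB	4 GB
70B	140 GB	70 GB	35 GB
405B	810 GB	405 GB	203 GB

Note: The numbers quoted above indicate the GPU VRAM required just to load the model checkpoint. They don't include the space torch reserves for kernels or CUDA graphs.

As an example, an H100 node (8x H100) has about 640 GB of VRAM, so the 405B model would need to be run in a multi-node setup or at a lower precision (e.g. FP8), which would be the recommended approach.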
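The figures in the table follow from multiplying the parameter count by the bytes each parameter occupies. The snippet below is a rough sketch of that arithmetic and of the 8x H100 check; the helper names and the 80 GB-per-GPU default are illustrative assumptions, and the result covers weights only.

```python
# Approximate memory needed just to hold the weights: parameters x bytes per parameter.
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def weights_fit_on_node(params_billion: float, precision: str,
                        gpus: int = 8, vram_per_gpu_gb: float = 80.0) -> bool:
    """True if the weights alone fit in the node's combined VRAM (assumed 8 x 80 GB)."""
    return weight_memory_gb(params_billion, precision) <= gpus * vram_per_gpu_gb


for size in (8, 70, 405):
    row = ", ".join(f"{p}: {weight_memory_gb(size, p):g} GB" for p in BYTES_PER_PARAM)
    print(f"{size}B -> {row}")

print("405B FP16 on one node:", weights_fit_on_node(405, "FP16"))
print("405B FP8 on one node:", weights_fit_on_node(405, "FP8"))
```

Even when the weights fit, some VRAM has to remain free for activations and the KV cache, so this check is a lower bound rather than a deployment guarantee.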

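Keep in mind that lower precision (e.g., INT4) may result in some loss of accuracy but can significantly reduce memory requirements and increase inference speed. In addition to the model weights, you will also need to keep the KV cache in memory. It contains the keys and values of all the tokens in the model's context so that they don't need to be recomputed when generating a new token. It becomes a significant factor especially when making use of the long context length. In FP16, the KV cache memory requirements are:

Model Size	1k tokens	16k tokens	128k tokens
8B	0.125 GB	1.95 GB	15.62 GB
70B	0.313 GB	4.88 GB	39.06 GB
405B	0.984 GB	15.38 GB	123.05 GB

For the small model in particular, the cache uses as much memory as the weights when approaching the maximum context length.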
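As a rough sketch of where these numbers come from: the cache stores one key vector and one value vector per layer, per KV head, per token. The layer counts, KV-head counts, and head dimension used below are assumptions taken from the published model configurations, not from this post, and the result is expressed in GiB (2^30 bytes), which is what reproduces the 8B and 70B rows.

```python
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 num_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB for a single sequence (bytes_per_value=2 means FP16)."""
    # Factor 2: one key vector and one value vector per layer, KV head, and token.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * num_tokens / 2**30


# Assumed 8B configuration: 32 layers, 8 KV heads (GQA), head dimension 128.
print(f"8B, 1k tokens:    {kv_cache_gib(32, 8, 128, 1_024):.3f} GiB")    # 0.125
print(f"8B, 128k tokens:  {kv_cache_gib(32, 8, 128, 128_000):.3f} GiB")  # 15.625

# Assumed 70B configuration: 80 layers, 8 KV heads, head dimension 128.
print(f"70B, 128k tokens: {kv_cache_gib(80, 8, 128, 128_000):.4f} GiB")  # 39.0625
```

Because GQA keeps only 8 KV heads, the cache grows with the number of layers rather than with the full number of attention heads, which is why the 70B cache is only 2.5x the 8B one.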

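Training Memory Requirements

The following table outlines the approximate memory requirements for training Llama 3.1 models using different techniques:

Model Size	Full Fine-tuning	LoRA	Q-LoRA
8B	60 GB	16 GB	6 GB
70B	500 GB	160 GB	48 GB
405B	3.25 TB	950 GB	250 GB

Note: These are estimated values and may vary based on specific implementation details and optimizations.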

Llama 3.1 evaluation

Note: We are currently evaluating Llama 3.1 individually on the new Open LLM Leaderboard 2 and will update this section later today. Below is an excerpt from the official evaluation from Meta.

Category	Benchmark	# Shots	Metric	Llama 3 8B	Llama 3.1 8B	Llama 3 70B	Llama 3.1 70B	Llama 3.1 405B
General	MMLU	5	macro_avg/acc_char	66.7	66.7	79.5	79.3	85.2
	MMLU PRO (CoT)	5	macro_avg/acc_char	36.2	37.1	55.0	53.8	61.6
	AGIEval English	3-5	average/acc_char	47.1	47.8	63.0	64.6	71.6
	CommonSenseQA	7	acc_char	72.6	75.0	83.8	84.1	85.8
	Winogrande	5	acc_char	-	60.5	-	83.3	86.7
	BIG-Bench Hard (CoT)	3	average/em	61.1	64.2	81.3	81.6	85.9
	ARC-Challenge	25	acc_char	79.4	79.7	93.1	92.9	96.1
Knowledge reasoning	TriviaQA-Wiki	5	em	78.5	77.6	89.7	89.8	91.8
	SQuAD	1	em	76.4	77.0	85.6	81.8	89.3
Reading comprehension	QuAC (F1)	1	f1	44.4	44.9	51.1	51.1	53.6
	BoolQ	0	acc_char	75.7	75.0	79.0	79.4	80.0
	DROP (F1)	3	f1	58.4	59.5	79.7	79.6	84.8

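Using Hugging Face Transformers

Llama 3.1 requires a minor modeling update to handle RoPE scaling effectively. With Transformers release 4.43.2, you can use the new Llama 3.1 models and leverage all the tools within the Hugging Face ecosystem. Make sure to use the latest transformers release: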

pip install "transformers>=4.43.2" --upgrade

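A couple of details:

The snippet below loads the model in bfloat16. This is the type used by the original checkpoint published by Meta, so it's the recommended way to run to ensure the best precision or to conduct evaluations.
Assistant responses may end with the special token `<|eot_id|>`, but we must also stop generation if the regular EOS token is found. We can stop generation early by providing a list of terminators in the `eos_token_id` parameter.
We used the default sampling parameters (`temperature` and `top_p`) taken from the original Meta codebase. We haven't had time to conduct extensive tests yet, so feel free to explore!

The following snippet shows how to use `meta-llama/Meta-Llama-3.1-8B-Instruct`. It requires about 16 GB of VRAM, which fits many consumer GPUs. The same snippet works for `meta-llama/Meta-Llama-3.1-70B-Instruct` (which needs 140 GB of VRAM) and `meta-llama/Meta-Llama-3.1-405B-Instruct` (which needs 810 GB of VRAM), making them very interesting models for production use cases. Memory consumption can be further reduced by loading in 8-bit or 4-bit mode.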

from transformers import pipeline
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
    do_sample=False,
)
assistant_response = outputs[0]["generated_text"][-1]["content"]
print(assistant_response)
# Arrrr, me hearty! Yer lookin' fer a bit o' information about meself, eh? Alright then, matey! I be a language-generatin' swashbuckler, a digital buccaneer with a penchant fer spinnin' words into gold doubloons o' knowledge! Me name be... (dramatic pause)...Assistant! Aye, that be me name, and I be here to help ye navigate the seven seas o' questions and find the hidden treasure o' answers! So hoist the sails and set course fer adventure, me hearty! What be yer first question?

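You can also automatically quantize the model, loading it in 8-bit or even 4-bit mode with bitsandbytes. Loading the large 70B version in 4-bit takes about 34 GB of memory. This is how you'd load the generation pipeline in 4-bit: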

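# Reuses the imports and model_id from the previous snippet; requires bitsandbytes.
# Assign to `pipe` so the imported `pipeline` function is not shadowed.
pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "quantization_config": {"load_in_4bit": True}
    },
)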

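For more details about using the models with `transformers`, please check the model cards.

Note: Transformers takes care of all the pesky prompt template issues. If you want to know more about prompting, check out the next section.

How to prompt Llama 3.1

The base models have no prompt format. Like other base models, they can be used to continue an input sequence with a plausible continuation or for zero-shot/few-shot inference. They are also a great foundation for fine-tuning on your own use cases.

The Instruct versions support a conversational format with 4 roles:

**system:** Sets the context for the conversation. It allows including rules, guidelines, or necessary information that help to respond effectively. It's also used to enable tool use when appropriate.
**user:** User inputs, commands, and questions for the model.
**assistant:** The assistant's response, based on the context provided in the `system` and `user` prompts.
**ipython:** A new role introduced in Llama 3.1. This role is used for the output of a tool call when it is sent back to the LLM.

The Instruct versions use the following conversation structure for simple conversations: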

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ system_prompt }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ user_msg_1 }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{{ model_answer_1 }}<|eot_id|>
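To make the layout concrete, here is a minimal hand-rolled sketch that renders a list of messages into the structure above. The function name `build_llama31_prompt` is hypothetical; in practice the chat template shipped with the tokenizer should be preferred, and it may add details this sketch leaves out.

```python
def build_llama31_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in the Llama 3.1 chat layout."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model writes the next reply.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


prompt = build_llama31_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
])
print(prompt)
```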

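Llama 3.1 Instruct models now support tool calling, including three built-in tools (brave_search, wolfram_alpha, and code_interpreter) and custom tool calling via JSON function calling. The built-in tools use Python syntax. The ability to output Python code for function calling is part of the code interpreter tool, which must be enabled in the system prompt using the `Environment` keyword, as shown below.

Built-in Tool calling

Built-in tool calling example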

<|begin_of_text|><|start_header_id|>system<|end_header_id|>


Environment: ipython
Tools: brave_search, wolfram_alpha

Cutting Knowledge Date: 01 March 2023
Today's Date: 13 July 2024


You are a helpful Assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Weather in Menlo Park, California<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The response from the model at this point would include Python code to call one of the supported tools (`brave_search` in this case):

<|python_tag|>brave_search.call(query="current weather in Menlo Park, California")<|eom_id|>

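The response from executing the call is then sent back to the model to retrieve the final response. For brevity, the following would be appended to the message shown in the previous snippet: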

<|python_tag|>brave_search.call(query="Menlo Park California weather")<|eom_id|><|start_header_id|>ipython<|end_header_id|>

{"query": "Menlo Park California weather", "top_k": [{"title": "10-Day Weather Forecast for West Menlo Park, CA - The Weather Channel | weather.com", "url": "https://weather.com/weather/tenday/l/West+Menlo+Park+CA?canonicalCityId=b2375713aa1943aad7d1a13a85e1c0adad13c1b10563b2bbaad70734dc61cf11", "description": "Be prepared with the most accurate 10-day forecast for West <strong>Menlo</strong> <strong>Park</strong>, CA with highs, lows, chance of precipitation from The <strong>Weather</strong> Channel and <strong>Weather</strong>.com", "type": "search_result"},....}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The final response from the LLM would then be:

The current weather in Menlo Park, California is mostly sunny with a high of 77°F and a low of 56°F.<|eot_id|>
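On the application side, the tool call has to be extracted from the model output before it can be executed. The snippet below is one possible sketch: the regular expression and the helper name `parse_builtin_tool_call` are assumptions for illustration, and the arguments are parsed as Python literals rather than evaluated as code.

```python
import ast
import re

# Matches e.g. <|python_tag|>brave_search.call(query="...")<|eom_id|>
TOOL_CALL = re.compile(r"<\|python_tag\|>(\w+)\.call\((.*)\)<\|eom_id\|>", re.DOTALL)


def parse_builtin_tool_call(model_output):
    """Return (tool_name, kwargs) for a built-in tool call, or None for plain text."""
    match = TOOL_CALL.search(model_output)
    if match is None:
        return None
    tool_name, raw_args = match.groups()
    # Parse the keyword arguments as literals instead of executing model-written code.
    call = ast.parse(f"f({raw_args})", mode="eval").body
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return tool_name, kwargs


output = '<|python_tag|>brave_search.call(query="current weather in Menlo Park, California")<|eom_id|>'
print(parse_builtin_tool_call(output))
```

The returned tool name and arguments can then be routed to your own search or Wolfram Alpha client, and the result sent back to the model in an `ipython` turn.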

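Custom Tool calling

Llama 3.1 Instruct supports custom function calls from a single user message. The following prompts provide an example of how custom functions can be called from the output of the model. In custom function calling, the model outputs `<|eot_id|>` instead of `<|eom_id|>`. The system prompt needs to be adjusted to inform the model how to deal with function call outputs.

Custom Tool Calling JSON Functions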

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant with tool calling capabilities. When you receive a tool call response, use the output to format an answer to the original user question.<|eot_id|><|start_header_id|>user<|end_header_id|>

Given the following functions, please respond with a JSON for a function call with its proper arguments that best answers the given prompt.

Respond in the format {"name": function name, "parameters": dictionary of argument name and its value}. Do not use variables.

{
    "type": "function",
    "function": {
    "name": "get_current_conditions",
    "description": "Get the current weather conditions for a specific location",
    "parameters": {
        "type": "object",
        "properties": {
        "location": {
            "type": "string",
            "description": "The city and state, e.g., San Francisco, CA"
        },
        "unit": {
            "type": "string",
            "enum": ["Celsius", "Fahrenheit"],
            "description": "The temperature unit to use. Infer this from the user's location."
        }
        },
        "required": ["location", "unit"]
    }
    }
}

Question: what is the weather like in Menlo Park?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{"name": "get_current_conditions", "parameters": {"location": "Menlo Park, CA", "unit": "Fahrenheit"}}<|eot_id|><|start_header_id|>ipython<|end_header_id|>

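When we retrieve the output from the selected tool, we pass it back to the model using the same `<|python_tag|>` delimiter. `<|python_tag|>` does not imply the use of Python. It is only meant to signal the beginning of the output from any tool.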

<|python_tag|>{
    "tool_call_id": "get_current_conditions",
    "output": "Clouds giving way to sun Hi: 76° Tonight: Mainly clear early, then areas of low clouds forming Lo: 56°"
}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The weather in Menlo Park is currently cloudy with a high of 76° and a low of 56°, with clear skies expected tonight.<|eot_id|>

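This format has to be reproduced exactly for effective use. The chat template available in transformers makes it straightforward to format the prompt correctly.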
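The round trip for a custom tool can be sketched as follows: parse the JSON the model emitted, run the matching function, and format the `ipython` turn that carries the result back. The `get_current_conditions` implementation and the `TOOLS` registry are hypothetical stand-ins; a real application would call an actual weather service and validate the arguments.

```python
import json


def get_current_conditions(location, unit):
    """Hypothetical stand-in for a real weather lookup."""
    return f"Clouds giving way to sun in {location} (unit: {unit})"


# Registry of the functions the model is allowed to call.
TOOLS = {"get_current_conditions": get_current_conditions}


def run_custom_tool_call(model_output):
    """Parse the model's JSON function call, run it, and format the ipython turn."""
    call = json.loads(model_output.replace("<|eot_id|>", "").strip())
    result = TOOLS[call["name"]](**call["parameters"])
    payload = json.dumps({"tool_call_id": call["name"], "output": result})
    return (
        "<|start_header_id|>ipython<|end_header_id|>\n\n"
        f"<|python_tag|>{payload}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


model_output = (
    '{"name": "get_current_conditions", '
    '"parameters": {"location": "Menlo Park, CA", "unit": "Fahrenheit"}}<|eot_id|>'
)
print(run_custom_tool_call(model_output))
```

Looking the function up in an explicit registry, instead of evaluating whatever name the model produces, keeps the set of callable tools under the application's control.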

Demo

You can experiment with the three Instruct models in the following demos:

Hugging Chat with Llama 3.1 405B https://huggingface.co/chat/models/meta-llama/Meta-Llama-3.1-405b-instruct/
Hugging Chat with Llama 3.1 70B https://huggingface.co/chat/models/meta-llama/Meta-Llama-3.1-70b-instruct/
Gradio-powered Space with a Llama 3.1 8B demo https://huggingface.co/spaces/ysharma/Chat_with_Meta_llama3_1_8b

The whole stack is open-source. Hugging Chat is powered by chat-ui and text-generation-inference.

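Llama 3.1 405B quantization with FP8, AWQ, and GPTQ

Meta created an official FP8 quantized version of Llama 3.1 405B with minimal accuracy degradation. To achieve this, FP8 quantization is only applied to the major linear operators of the model, such as the gate and up/down projections of the FFNs (covering 75% of the inference FLOPs). We worked together to ensure that this FP8 quantized checkpoint is compatible across the community (transformers, TGI, vLLM).

Additionally, we created AWQ and GPTQ quantized variants in INT4 with AutoAWQ and AutoGPTQ respectively. For AWQ, all the linear layers were quantized using the GEMM kernels, performing zero-point quantization down to 4 bits with a group size of 128; for GPTQ, the same settings were used with the GPTQ kernels instead. We ensured that the INT4 checkpoints are compatible with transformers and TGI, including Marlin kernel support to speed up inference in TGI for the GPTQ quants.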
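To illustrate what "zero-point quantization down to 4 bits with a group size of 128" means, here is a simplified round-to-nearest sketch in plain Python. It is not the AWQ or GPTQ algorithm, both of which add calibration steps that are not shown; it only demonstrates the storage format, in which each group of 128 weights shares one scale and one zero point.

```python
import math


def quantize_group(values, bits=4):
    """Zero-point (asymmetric) round-to-nearest quantization of one group of weights."""
    qmax = 2**bits - 1                      # 15 for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0         # fall back to 1.0 for a constant group
    zero_point = round(-lo / scale)         # integer code that represents 0.0
    codes = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


def quantize(weights, group_size=128):
    """Quantize a flat list of weights group by group."""
    return [quantize_group(weights[i:i + group_size])
            for i in range(0, len(weights), group_size)]


weights = [math.sin(i) for i in range(256)]  # toy data standing in for a weight row
groups = quantize(weights)
restored = [w for g in groups for w in dequantize_group(*g)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(groups)} groups, max reconstruction error {max_error:.4f}")
```

Smaller groups track the local weight range more closely at the cost of storing more scales and zero points, which is the trade-off behind choosing a group size such as 128.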

Available quantized weights for Llama 3.1 405B:

meta-llama/Meta-Llama-3.1-405B-Base-FP8: Official FP8 quantized weights, can be run on 8xH100
meta-llama/Meta-Llama-3.1-405B-Instruct-FP8: Official FP8 quantized weights, can be run on 8xH100
hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4: Hugging Face quantized weights, can run on 8xA100 80GB, 8xH100 80GB, and 8xA100 40GB (with a reduced KV cache and without CUDA graphs)
hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4: Hugging Face quantized weights, can run on 8xA100 80GB, 8xH100 80GB, and 8xA100 40GB (with a reduced KV cache and without CUDA graphs)
hugging-quants/Meta-Llama-3.1-405B-BNB-NF4: Hugging Face quantized weights, suitable for QLoRA fine-tuning
hugging-quants/Meta-Llama-3.1-405B-Instruct-BNB-NF4: Hugging Face quantized weights, suitable for inference on 8xA100 and 4xH100

The Hugging Quants organization also contains quantized checkpoints for the 70B and 8B versions.

Inference Integrations

Hugging Face Inference API

Hugging Face PRO users now have access to exclusive API endpoints hosting Llama 3.1 8B Instruct, Llama 3.1 70B Instruct, and Llama 3.1 405B Instruct AWQ, powered by text-generation-inference. All versions support the Messages API, so they are compatible with OpenAI client libraries, including LangChain and LlamaIndex.

Note: Update to the latest `huggingface_hub` version with `pip install "huggingface_hub>=0.24.1"`.


from huggingface_hub import InferenceClient

# Initialize the client, pointing it to one of the available models
client = InferenceClient()

chat_completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    messages=[
        {"role": "system", "content": "You are a helpful and honest programming assistant."},
        {"role": "user", "content": "Is Rust better than Python?"},
    ],
    stream=True,
    max_tokens=500
)

# iterate and print stream
for message in chat_completion:
    print(message.choices[0].delta.content, end="")

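For more details about the use of the Messages API, please check this post.

Hugging Face Inference Endpoints

You can deploy Llama 3.1 on Hugging Face's Inference Endpoints, which uses Text Generation Inference as the backend. Text Generation Inference is a production-ready inference container developed by Hugging Face with support for FP8, continuous batching, token streaming, and tensor parallelism for fast inference on multiple GPUs. To deploy Llama 3.1, go to the model page and click on the Deploy -> Inference Endpoints widget:

Meta-Llama-3.1-8B-Instruct is recommended on 1x NVIDIA A10G or L4 GPUs
Meta-Llama-3.1-70B-Instruct is recommended on 4x NVIDIA A100, or as an AWQ/GPTQ quantized model on 2x A100
Meta-Llama-3.1-405B-Instruct-FP8 is recommended on 8x NVIDIA H100 in FP8, or as an AWQ/GPTQ quantized model on 8x A100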

from huggingface_hub import InferenceClient

# Initialize the client, pointing it to one of the available models
client = InferenceClient(
    base_url="<ENDPOINT_URL>",
)

# Create a chat completion
chat_completion = client.chat.completions.create(
    model="ENDPOINT",
    messages=[
        {"role": "system", "content": "You are a helpful and honest programming assistant."},
        {"role": "user", "content": "Is Rust better than Python?"},
    ],
    stream=True,
    max_tokens=500
)

# iterate and print stream
for message in chat_completion:
    print(message.choices[0].delta.content, end="")

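Hugging Face Partner Integrations

Note: We are currently working with our partners at AWS, Google Cloud, Microsoft Azure, and DELL on adding Llama 3.1 8B, 70B, and 405B to Amazon SageMaker, Google Kubernetes Engine, Vertex AI Model Catalog, Azure AI Studio, and DELL Enterprise Hub. We will update this section as soon as the containers are available. You can subscribe to Hugging Squad for email updates.

Fine-tuning with Hugging Face TRL

In this section, we'll look at the tools available in the Hugging Face ecosystem to efficiently train Llama 3.1 on consumer-size GPUs. Below is an example command to fine-tune Llama 3.1 8B on OpenAssistant's chat dataset. We use 4-bit quantization and QLoRA to conserve memory, targeting the linear layers of all the attention blocks.

Fine-Tuning Example with Hugging Face TRL

First, install the nightly version of 🤗 TRL and clone the repo to access the training script: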

pip install "transformers>=4.43.2" --upgrade
pip install --upgrade bitsandbytes
pip install --upgrade peft
pip install git+https://github.com/huggingface/trl
git clone https://github.com/huggingface/trl
cd trl

Then you can run the script:

python \
    examples/scripts/sft.py \
    --model_name meta-llama/Meta-Llama-3.1-8B \
    --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
    --dataset_text_field="text" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-4 \
    --report_to "none" \
    --bf16 \
    --max_seq_length 1024 \
    --lora_r 16 --lora_alpha 32 \
    --lora_target_modules q_proj k_proj v_proj o_proj \
    --load_in_4bit \
    --use_peft \
    --attn_implementation "flash_attention_2" \
    --logging_steps=10 \
    --gradient_checkpointing \
    --output_dir llama31
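To get a feel for how small the trained adapter is with `--lora_r 16` on the four attention projections, the following sketch counts the LoRA parameters. The layer shapes (hidden size 4096, key/value dimension 1024, 32 layers) are assumptions based on the published 8B configuration and are not stated in this post.

```python
def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter adds two low-rank matrices: (in x r) and (r x out)."""
    return rank * (in_features + out_features)


# Assumed Llama 3.1 8B attention shapes (not taken from this post).
hidden, kv_dim, num_layers, rank = 4096, 1024, 32, 16
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),   # GQA: keys use fewer heads than queries
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
}

per_layer = sum(lora_params(i, o, rank) for i, o in shapes.values())
total = per_layer * num_layers
print(f"{total:,} trainable LoRA parameters")  # 13,631,488
```

Under these assumptions only about 13.6M parameters receive gradients, a small fraction of the 8B frozen weights, which is a large part of why Q-LoRA fits on a single GPU.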

If you have more GPUs to spare, you can run training with DeepSpeed and ZeRO Stage 3:

accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/sft.py \
    --model_name meta-llama/Meta-Llama-3.1-8B \
    --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
    --dataset_text_field="text" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-5 \
    --report_to wandb \
    --bf16 \
    --max_seq_length 1024 \
    --attn_implementation eager \
    --logging_steps=10 \
    --gradient_checkpointing \
    --output_dir models/llama

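Synthetic data generation with distilabel

A big change in Llama 3.1's license is that it allows using model outputs to improve other LLMs, which means you can generate synthetic datasets with Llama 3.1 models and use them to fine-tune smaller, more specialized models.

Let's look at an example of how to generate a preference dataset with distilabel, an open-source framework for synthetic data generation. This dataset can be used to fine-tune models with the preference optimization methods offered by TRL, such as DPO or KTO.

First, install the latest `distilabel` release, including the `hf-inference-endpoints` extra, with `pip` as follows:

pip install "distilabel[hf-inference-endpoints]" --upgrade

Then define a pipeline that:

loads a dataset with instructions from the Hugging Face Hub.
generates a response with Llama 3.1 70B Instruct and Llama 3.1 405B Instruct via Hugging Face Inference Endpoints.
finally, uses Llama 3.1 405B Instruct as a judge to rate the responses using UltraFeedback prompts. From these ratings, chosen and rejected responses can be selected and used to fine-tune a model with preference optimization methods.

See the code below to define the pipeline, or run it yourself using this Colab notebook and explore the generated dataset on the Hub.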

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromHub, CombineColumns
from distilabel.steps.tasks import TextGeneration, UltraFeedback

llama70B = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-70B-Instruct"
)
llama405B = InferenceEndpointsLLM(
    model_id="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8"
)

with Pipeline(name="synthetic-data-with-llama3") as pipeline:
    # load dataset with prompts
    load_dataset = LoadDataFromHub(
        repo_id="argilla/10Kprompts-mini"
    )
    # generate two responses for each prompt
    generate = [
        TextGeneration(llm=llama70B),
        TextGeneration(llm=llama405B)
    ]
    # combine responses into one column
    combine = CombineColumns(
        columns=["generation", "model_name"],
        output_columns=["generations", "model_names"]
    )
    # rate responses with 405B LLM-as-a-judge
    rate = UltraFeedback(aspect="overall-rating", llm=llama405B)
    # define the pipeline
    load_dataset >> generate >> combine >> rate

if __name__ == "__main__":
    distiset = pipeline.run()
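Once the judge has produced ratings, turning them into preference pairs is a small post-processing step. The sketch below assumes each row exposes `instruction`, `generations`, and `ratings` fields; the function name and the exact row layout are illustrative assumptions, so adapt them to the columns of your generated dataset.

```python
def to_preference_pair(row):
    """Pick the best- and worst-rated generations of one row as chosen/rejected."""
    rated = [(rating, text)
             for rating, text in zip(row["ratings"], row["generations"])
             if rating is not None]          # skip generations the judge failed to rate
    if len(rated) < 2:
        return None
    rated.sort(key=lambda pair: pair[0])
    (low, rejected), (high, chosen) = rated[0], rated[-1]
    if low == high:
        return None                          # a tie carries no preference signal
    return {"prompt": row["instruction"], "chosen": chosen, "rejected": rejected}


row = {
    "instruction": "Explain GQA in one sentence.",
    "generations": ["Answer from the 70B model", "Answer from the 405B model"],
    "ratings": [3, 5],
}
pair = to_preference_pair(row)
print(pair)
```

Dropping ties and unrated rows keeps only pairs with a clear preference, which is the kind of data methods such as DPO expect.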

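What's next? Besides the example above, `distilabel` comes with exciting approaches for synthetic data generation with LLMs in a wide range of scenarios and topics. It includes implementations from the current SOTA literature for tasks like evaluating outputs with LLM-as-a-judge methods, evolving instructions, data filtering, and defining custom components.

Additional Resources

Acknowledgments

Releasing such models with support and evaluations in the ecosystem would not be possible without the contributions of thousands of community members who have contributed to transformers, tgi, vllm, pytorch, LM Eval Harness, and many other projects. This release couldn't have happened without the support of Clémentine and Nathan for LLM evaluations; Nicolas, Olivier Dehaene, and Daniël de Kok for Text Generation Inference support; Arthur, Matthew Carrigan, Zachary Mueller, Joao, Joshua Lochner, and Lysandre for integrating Llama 3.1 into `transformers`; Matthew Douglas for quantization support; Gabriel Martín Blázquez for `distilabel` support; Merve Noyan and Aymeric Roucher for review; hysts and Yuvi for demos; Ellie for testing out fine-tuning; Brigitte Tousignant and Florent Daudens for communication; and Nathan and Victor for making Llama 3.1 available in Hugging Chat.

Thank you to the Meta team for releasing Llama 3.1 and making it available to the open-source AI community!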
