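Contributing Custom Models for Training

This guide explains how to add a custom model implementation to the `optimum/neuron/models/training/` directory. Custom models are needed to support distributed training features on AWS Trainium devices, such as tensor parallelism, pipeline parallelism, and sequence parallelism.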

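Architecture Components

1. NeuronModelMixin

The `NeuronModelMixin` class provides the core functionality:

`from_pretrained()`: loads regular Transformers weights into the custom implementation
`save_pretrained()`: saves sharded checkpoints with consolidation metadata
Pipeline parallelism support through the `PIPELINE_*` attributes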

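2. Weight Transformation Specs

Transformation specs convert weights between the following formats:

Original Transformers format → custom parallel format (during loading)
Custom parallel format → original Transformers format (during checkpoint consolidation)

Key transformation spec types:

`FusedLinearsSpec`: handles fused linear layers (e.g., `gate_up_proj`)
`GQAQKVColumnParallelLinearSpec`: handles grouped query attention projections when the tensor parallel size is larger than the number of key-value heads

For the complete API documentation of all transformation specs and utility functions, see the Model Weight Transformation Specs API reference.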
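
The two conversion directions can be pictured as a fuse/unfuse round trip. The sketch below is only a conceptual illustration in plain Python, not the `FusedLinearsSpec` implementation: matrices are lists of rows laid out as `[out_features, in_features]` (the `torch.nn.Linear.weight` convention, assumed here), and `fuse_linears` and `unfuse_linears` are hypothetical helper names.

```python
# Conceptual sketch only. Matrices are lists of rows with shape
# [out_features, in_features]; the helper names are hypothetical.

def fuse_linears(weights):
    """Concatenate several [out_i, in] matrices along the output (row) dimension."""
    fused = []
    for weight in weights:
        fused.extend(row[:] for row in weight)
    return fused


def unfuse_linears(fused, original_dims):
    """Split a fused matrix back into pieces whose row counts are original_dims."""
    if sum(original_dims) != len(fused):
        raise ValueError("original_dims must add up to the fused output dimension")
    pieces, start = [], 0
    for dim in original_dims:
        pieces.append([row[:] for row in fused[start:start + dim]])
        start += dim
    return pieces


# Toy weights: intermediate_size=3, hidden_size=2
gate_proj = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
up_proj = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]

# Loading direction: original format -> fused format
gate_up_proj = fuse_linears([gate_proj, up_proj])

# Consolidation direction: fused format -> original format
restored_gate, restored_up = unfuse_linears(gate_up_proj, [3, 3])
```

The `[3, 3]` argument plays the role that `original_dims` plays in a spec: without it, the fused matrix alone does not say where one projection ends and the next begins.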

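3. Parallel Layers

Use these parallel layers from `neuronx_distributed`:

`ColumnParallelLinear`: splits the weight matrix along the output dimension
`RowParallelLinear`: splits the weight matrix along the input dimension
`ParallelEmbedding`: splits the embedding table across ranks
`GQAQKVColumnParallelLinear`: specialized for grouped query attention projections when the tensor parallel size is larger than the number of key-value heads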
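
To make the column/row distinction concrete, the following sketch computes the weight shape each rank would hold. It is a back-of-the-envelope illustration that assumes weights are stored as `[out_features, in_features]` and that the split dimension divides evenly by the tensor parallel size; the function names are hypothetical and do not come from `neuronx_distributed`.

```python
def column_parallel_shape(in_features, out_features, tp_size):
    """Per-rank weight shape when the output dimension is split across ranks."""
    if out_features % tp_size != 0:
        raise ValueError("out_features must be divisible by tp_size")
    return (out_features // tp_size, in_features)


def row_parallel_shape(in_features, out_features, tp_size):
    """Per-rank weight shape when the input dimension is split across ranks."""
    if in_features % tp_size != 0:
        raise ValueError("in_features must be divisible by tp_size")
    return (out_features, in_features // tp_size)


# Example sizes (assumed for illustration): hidden_size=4096, intermediate_size=11008
hidden_size, intermediate_size, tp_size = 4096, 11008, 8

# A fused gate/up projection is column-parallel with 2 * intermediate_size outputs.
gate_up_shape = column_parallel_shape(hidden_size, 2 * intermediate_size, tp_size)

# The down projection is row-parallel and consumes the already-split activations.
down_shape = row_parallel_shape(intermediate_size, hidden_size, tp_size)
```

Pairing a column-parallel layer with a row-parallel layer in this way means the intermediate activations can stay split between the two layers.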

Implementation Steps

Step 1: Create the Model Structure

Create a new directory: `optimum/neuron/models/training/your_model/`

__init__.py

from .modeling_your_model import YourModelForCausalLM, YourModel

__all__ = ["YourModelForCausalLM", "YourModel"]

Step 2: Implement the Model Building Blocks

modeling_your_model.py

Imports and Dependencies

import torch
from torch import nn
from neuronx_distributed.parallel_layers.layers import (
    ColumnParallelLinear,
    RowParallelLinear,
    ParallelEmbedding,
)
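from neuronx_distributed.modules.qkv_linear import GQAQKVColumnParallelLinear
# Used by the attention variants below to read the tensor parallel size
from neuronx_distributed.parallel_layers.parallel_state import get_tensor_model_parallel_size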
from transformers import PreTrainedModel
from transformers.models.your_model import YourModelConfig

from ..config import TrainingNeuronConfig
from ..modeling_utils import NeuronModelMixin
from ..transformations_utils import (
    CustomModule,
    FusedLinearsSpec,
    GQAQKVColumnParallelLinearSpec,
    ModelWeightTransformationSpecs,
)

Embedding Layer

class YourModelEmbeddings(nn.Module):
    def __init__(self, config, trn_config):
        super().__init__()
        self.embed_tokens = ParallelEmbedding(
            config.vocab_size,
            config.hidden_size,
            dtype=config.torch_dtype,
            sequence_parallel_enabled=trn_config.sequence_parallel_enabled,
        )

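MLP Layer with Fused Linears

Important: any module that has transformation specs must inherit from `CustomModule` so that weight transformations are handled correctly, and the specs must be defined in the `self.specs` attribute.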

class YourModelMLP(nn.Module, CustomModule):
    def __init__(self, config, trn_config):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        
        # Fused gate and up projections
        self.gate_up_proj = ColumnParallelLinear(
            self.hidden_size,
            2 * self.intermediate_size,
            stride=2,  # Important for proper sharding
            bias=False,
            gather_output=False,
            sequence_parallel_enabled=trn_config.sequence_parallel_enabled,
            dtype=config.torch_dtype,
        )
        
        self.down_proj = RowParallelLinear(
            self.intermediate_size,
            self.hidden_size,
            bias=False,
            input_is_parallel=True,
            sequence_parallel_enabled=trn_config.sequence_parallel_enabled,
            dtype=config.torch_dtype,
        )
        
        # Define transformation specs
        self.specs = ModelWeightTransformationSpecs()
        self.specs.add_spec(
            FusedLinearsSpec(
                fused_linear_name="gate_up_proj",
                linear_names=["gate_proj", "up_proj"],
                bias=False,
                fuse_axis="column",  # Fuse along output dimension
                original_dims=[self.intermediate_size, self.intermediate_size],
            )
        )
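
The `stride=2` argument matters because the fused weight stacks the gate rows on top of the up rows. The sketch below illustrates the idea with a Megatron-style strided partition, which is assumed here to be what `stride` controls; it is not the `neuronx_distributed` implementation, and `strided_shard` is a hypothetical helper.

```python
def strided_shard(rows, tp_size, stride, rank):
    """Cut `rows` into tp_size * stride equal chunks and give `rank` every tp_size-th chunk."""
    num_chunks = tp_size * stride
    if len(rows) % num_chunks != 0:
        raise ValueError("row count must be divisible by tp_size * stride")
    chunk_size = len(rows) // num_chunks
    chunks = [rows[i * chunk_size:(i + 1) * chunk_size] for i in range(num_chunks)]
    owned = []
    for chunk in chunks[rank::tp_size]:
        owned.extend(chunk)
    return owned


# Rows of a fused gate_up_proj weight, labeled by the projection they came from.
fused_rows = ["gate0", "gate1", "gate2", "gate3", "up0", "up1", "up2", "up3"]

# With stride=2, each rank gets a slice of gate AND the matching slice of up.
rank0 = strided_shard(fused_rows, tp_size=2, stride=2, rank=0)
rank1 = strided_shard(fused_rows, tp_size=2, stride=2, rank=1)

# With stride=1, rank 0 would receive only gate rows, which breaks the MLP.
naive_rank0 = strided_shard(fused_rows, tp_size=2, stride=1, rank=0)
```

With the strided layout, every rank can compute its share of the gated activation locally, because the gate and up slices it holds refer to the same intermediate units.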

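Attention Layer

The attention layer implementation depends on the model architecture and the tensor parallel configuration. There are three main variants:

1. Separate Q, K, V Projections (Default)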

class YourModelAttention(nn.Module, CustomModule):
    def __init__(self, config, trn_config, layer_idx):
        super().__init__()
        self.config = config
        self.num_heads = config.num_attention_heads
        self.num_key_value_heads = config.num_key_value_heads
        self.head_dim = config.hidden_size // self.num_heads
        
        # Separate projections for Q, K, V
        self.q_proj = ColumnParallelLinear(
            config.hidden_size,
            self.num_heads * self.head_dim,
            bias=False,
            gather_output=False,
            sequence_parallel_enabled=trn_config.sequence_parallel_enabled,
            dtype=config.torch_dtype,
        )
        self.k_proj = ColumnParallelLinear(
            config.hidden_size,
            self.num_key_value_heads * self.head_dim,
            bias=False,
            gather_output=False,
            sequence_parallel_enabled=trn_config.sequence_parallel_enabled,
            dtype=config.torch_dtype,
        )
        self.v_proj = ColumnParallelLinear(
            config.hidden_size,
            self.num_key_value_heads * self.head_dim,
            bias=False,
            gather_output=False,
            sequence_parallel_enabled=trn_config.sequence_parallel_enabled,
            dtype=config.torch_dtype,
        )
        
        self.o_proj = RowParallelLinear(
            self.num_heads * self.head_dim,
            config.hidden_size,
            bias=False,
            input_is_parallel=True,
            sequence_parallel_enabled=trn_config.sequence_parallel_enabled,
            dtype=config.torch_dtype,
        )
        
        # No transformation specs needed - regular parallel layers
        self.specs = ModelWeightTransformationSpecs()

2. Fused QKV Projection (Multi-Head Attention)

class YourModelAttention(nn.Module, CustomModule):
    def __init__(self, config, trn_config, layer_idx):
        super().__init__()
        # ... (same setup as above)
        
        tp_size = get_tensor_model_parallel_size()
        
        # Only use fused QKV when num_heads == num_key_value_heads (no GQA)
        if trn_config.fuse_qkv and self.num_heads == self.num_key_value_heads:
            self.qkv_proj = ColumnParallelLinear(
                config.hidden_size,
                3 * self.num_heads * self.head_dim,  # Q + K + V
                stride=3,  # Important for proper sharding
                bias=False,
                gather_output=False,
                sequence_parallel_enabled=trn_config.sequence_parallel_enabled,
                dtype=config.torch_dtype,
            )
            
            # Define transformation specs for fused QKV
            self.specs = ModelWeightTransformationSpecs()
            self.specs.add_spec(
                FusedLinearsSpec(
                    fused_linear_name="qkv_proj",
                    linear_names=["q_proj", "k_proj", "v_proj"],
                    bias=False,
                    fuse_axis="column",
                    original_dims=[self.num_heads * self.head_dim] * 3,
                )
            )
            self.split_size = self.num_heads * self.head_dim // tp_size
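
The `split_size` value is what lets the forward pass separate the fused projection output back into queries, keys, and values on each rank. The sketch below shows the arithmetic on a flat Python list, assuming the local output is laid out as `[q | k | v]` with three equally sized parts; `split_fused_qkv` is a hypothetical helper, and the head counts are example values.

```python
def split_fused_qkv(local_output, split_size):
    """Split one rank's fused QKV output into its query, key, and value parts."""
    if len(local_output) != 3 * split_size:
        raise ValueError("expected exactly three parts of split_size elements")
    query = local_output[:split_size]
    key = local_output[split_size:2 * split_size]
    value = local_output[2 * split_size:]
    return query, key, value


# Example values (assumed): 8 heads of dimension 4, sharded over 2 ranks.
num_heads, head_dim, tp_size = 8, 4, 2
split_size = num_heads * head_dim // tp_size  # features per projection per rank

# Stand-in for the last dimension of one rank's qkv_proj output.
local_output = list(range(3 * split_size))
query, key, value = split_fused_qkv(local_output, split_size)
```

On real tensors the same split would be applied along the last dimension.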

3. GQA QKV Projection (for Challenging TP Configurations)

class YourModelAttention(nn.Module, CustomModule):
    def __init__(self, config, trn_config, layer_idx):
        super().__init__()
        # ... (same setup as above)
        
        tp_size = get_tensor_model_parallel_size()
        
        # Use GQA QKV when KV heads can't be evenly distributed across TP ranks
        # This happens when: num_key_value_heads < tp_size or num_key_value_heads % tp_size != 0
        self.qkv_linear = (self.num_key_value_heads < tp_size) or (self.num_key_value_heads % tp_size != 0)
        
        if self.qkv_linear:
            # Calculate KV size multiplier to ensure even distribution
            if trn_config.kv_size_multiplier is None:
                self.kv_size_multiplier = trn_config.auto_kv_size_multiplier(self.num_key_value_heads)
            else:
                self.kv_size_multiplier = trn_config.kv_size_multiplier
                
            self.qkv_proj = GQAQKVColumnParallelLinear(
                config.hidden_size,
                [self.num_heads * self.head_dim, self.num_key_value_heads * self.head_dim],
                bias=False,
                gather_output=False,
                sequence_parallel_enabled=trn_config.sequence_parallel_enabled,
                kv_size_multiplier=self.kv_size_multiplier,
                fuse_qkv=trn_config.fuse_qkv,
                dtype=config.torch_dtype,
            )
            
            # Define transformation specs for GQA QKV
            self.specs = ModelWeightTransformationSpecs()
            self.specs.add_spec(
                GQAQKVColumnParallelLinearSpec(
                    gqa_qkv_projection_name="qkv_proj",
                    query_projection_name="q_proj",
                    key_projection_name="k_proj", 
                    value_projection_name="v_proj",
                    output_projection_name="o_proj",
                    num_attention_heads=self.num_heads,
                    num_key_value_heads=self.num_key_value_heads,
                    kv_size_multiplier=self.kv_size_multiplier,
                    q_output_size_per_partition=self.qkv_proj.q_output_size_per_partition,
                    kv_output_size_per_partition=self.qkv_proj.kv_output_size_per_partition,
                    fuse_qkv=trn_config.fuse_qkv,
                )
            )
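
The KV size multiplier replicates key-value heads so that they can be spread evenly over the tensor parallel ranks. The sketch below shows one plausible way to pick such a multiplier: the smallest value that makes the replicated head count divisible by the TP size. This is an assumption made for illustration, not the logic of `trn_config.auto_kv_size_multiplier`, and `smallest_kv_size_multiplier` is a hypothetical name.

```python
import math


def smallest_kv_size_multiplier(num_key_value_heads, tp_size):
    """Smallest m such that num_key_value_heads * m is divisible by tp_size."""
    if num_key_value_heads <= 0 or tp_size <= 0:
        raise ValueError("head count and tp_size must be positive")
    return tp_size // math.gcd(num_key_value_heads, tp_size)


# 8 KV heads on 32 ranks: replicate each head 4 times, one copy per rank.
multiplier_8_on_32 = smallest_kv_size_multiplier(8, 32)

# 8 KV heads on 8 ranks: already evenly distributed, no replication needed.
multiplier_8_on_8 = smallest_kv_size_multiplier(8, 8)

# 6 KV heads on 4 ranks: 12 replicated heads, 3 per rank.
multiplier_6_on_4 = smallest_kv_size_multiplier(6, 4)
```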

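When to Use Each Variant

Separate Q, K, V: the default approach; works with every configuration but may be less efficient
Fused QKV: use when `num_heads == num_key_value_heads` (no grouped query attention) and `fuse_qkv=True`
GQA QKV: required for challenging tensor parallel configurations that use grouped query attention and where the KV heads cannot be evenly distributed across TP ranks

The choice usually comes down to: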

tp_size = get_tensor_model_parallel_size()
use_gqa_qkv = (num_key_value_heads < tp_size) or (num_key_value_heads % tp_size != 0)
use_fused_qkv = trn_config.fuse_qkv and (num_heads == num_key_value_heads) and not use_gqa_qkv
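
The same decision can be wrapped in a small function and tried on a few configurations. This is a sketch that mirrors the conditions above in plain Python; `choose_qkv_variant` and the returned labels are hypothetical, and the head counts are example values.

```python
def choose_qkv_variant(num_heads, num_key_value_heads, tp_size, fuse_qkv):
    """Return which attention projection variant the conditions above select."""
    use_gqa_qkv = (num_key_value_heads < tp_size) or (num_key_value_heads % tp_size != 0)
    if use_gqa_qkv:
        return "gqa_qkv"
    if fuse_qkv and num_heads == num_key_value_heads:
        return "fused_qkv"
    return "separate_qkv"


# 8 KV heads cannot be spread over 32 ranks: GQA QKV is needed.
variant_a = choose_qkv_variant(num_heads=32, num_key_value_heads=8, tp_size=32, fuse_qkv=True)

# Plain multi-head attention with fusing enabled: fused QKV.
variant_b = choose_qkv_variant(num_heads=32, num_key_value_heads=32, tp_size=8, fuse_qkv=True)

# GQA whose KV heads divide evenly across ranks: separate projections suffice.
variant_c = choose_qkv_variant(num_heads=32, num_key_value_heads=8, tp_size=8, fuse_qkv=True)
```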

Step 3: Implement the Main Model Classes

Base Model

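# NeuronModelMixin is mixed in on the concrete classes below; also listing it here
# would prevent Python from building a consistent MRO for those classes.
class YourPreTrainedModel(PreTrainedModel):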
    config_class = YourModelConfig
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _no_split_modules = ["YourModelDecoderLayer"]
    _skip_keys_device_placement = "past_key_values"
    _supports_flash_attn_2 = True
    _supports_cache_class = True
    _supports_quantized_cache = True
    _supports_static_cache = True


class YourModel(NeuronModelMixin, YourPreTrainedModel):
    def __init__(self, config: YourModelConfig, trn_config: TrainingNeuronConfig):
        YourPreTrainedModel.__init__(self, config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size

        self.trn_config = trn_config
        
        self.embed_tokens = ParallelEmbedding(...)
        self.layers = nn.ModuleList([
            YourModelDecoderLayer(config, trn_config, layer_idx)
            for layer_idx in range(config.num_hidden_layers)
        ])
        self.norm = YourModelRMSNorm(...)
        
        self.post_init()

CausalLM Model

class YourModelForCausalLM(NeuronModelMixin, YourPreTrainedModel):
    _tied_weights_keys = ["lm_head.weight"]
    
    # Pipeline parallelism support
    SUPPORTS_PIPELINE_PARALLELISM = True
    PIPELINE_TRANSFORMER_LAYER_CLS = YourModelDecoderLayer
    PIPELINE_INPUT_NAMES = ["input_ids", "attention_mask"]
    
    def __init__(self, config, trn_config):
        super().__init__(config)
        self.trn_config = trn_config
        self.model = YourModel(config, trn_config)
        self.vocab_size = config.vocab_size
        
        self.lm_head = ColumnParallelLinear(
            config.hidden_size,
            config.vocab_size,
            bias=False,
            gather_output=False,
            dtype=config.torch_dtype,
        )
        
        self.post_init()

Step 4: Register the Model

Update `optimum/neuron/models/training/__init__.py`

from .your_model import YourModelForCausalLM, YourModel

__all__ = [..., "YourModelForCausalLM", "YourModel"]

Update `optimum/neuron/models/training/auto_models.py`

from .your_model.modeling_your_model import YourModelForCausalLM, YourModel

# Register the base model (without head)
register_neuron_model_for_training("your_model", "model")(YourModel)

# Register the CausalLM model
register_neuron_model_for_training("your_model", "text-generation")(YourModelForCausalLM)

Here, `"your_model"` corresponds to the `model_type` attribute of the model's configuration class.
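
The registration calls follow a decorator-factory pattern: the function takes a model type and a task, and returns a callable that is then applied to the class. The sketch below shows how such a registry can work in general. It is a toy illustration under that assumption, not the internals of `register_neuron_model_for_training`; `register_model`, `lookup_model`, and the toy classes are hypothetical.

```python
# Toy registry keyed by (model_type, task). All names here are hypothetical.
_MODEL_REGISTRY = {}


def register_model(model_type, task):
    """Return a decorator that records a class under (model_type, task)."""
    def decorator(model_cls):
        _MODEL_REGISTRY[(model_type, task)] = model_cls
        return model_cls  # return the class unchanged so it can still be used normally
    return decorator


def lookup_model(model_type, task):
    """Find the class registered for a config's model_type and a task."""
    try:
        return _MODEL_REGISTRY[(model_type, task)]
    except KeyError:
        raise ValueError(f"no model registered for {model_type!r} and task {task!r}")


class ToyModel:
    pass


class ToyModelForCausalLM:
    pass


# Same call shape as the registration lines above.
register_model("your_model", "model")(ToyModel)
register_model("your_model", "text-generation")(ToyModelForCausalLM)

resolved = lookup_model("your_model", "text-generation")
```

Keying the lookup on `model_type` is what allows an auto class to pick the right implementation from a configuration alone.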

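Best Practices

1. Parallel Layer Configuration

Use `gather_output=False` for intermediate layers
Set `input_is_parallel=True` for layers that receive parallel input
Configure `sequence_parallel_enabled` consistently across all layers
Use appropriate `stride` values so that weights are sharded correctly

2. Weight Transformation Specs

Always define specs for modules that use fused or parallel layers
Use the `CustomModule` mixin for any module that has transformation specs
Make sure the parameter names in the specs match the actual module structure
Test both regular and LoRA weight transformations

3. Pipeline Parallelism

Set `SUPPORTS_PIPELINE_PARALLELISM = True` for models that support it
Define `PIPELINE_TRANSFORMER_LAYER_CLS` as your decoder layer class
List all input names in `PIPELINE_INPUT_NAMES`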
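
A quick way to catch a forgotten pipeline attribute is to check the class before training. The helper below is a hypothetical sketch, not part of Optimum Neuron; it only assumes the three attribute names listed above and uses toy classes in place of real models.

```python
REQUIRED_PIPELINE_ATTRIBUTES = (
    "SUPPORTS_PIPELINE_PARALLELISM",
    "PIPELINE_TRANSFORMER_LAYER_CLS",
    "PIPELINE_INPUT_NAMES",
)


def missing_pipeline_attributes(model_cls):
    """Return the pipeline attributes that are absent or empty on model_cls."""
    missing = []
    for name in REQUIRED_PIPELINE_ATTRIBUTES:
        # A missing attribute, None, False, or an empty list all count as "not set".
        if not getattr(model_cls, name, None):
            missing.append(name)
    return missing


class ToyDecoderLayer:
    pass


class CompleteModel:
    SUPPORTS_PIPELINE_PARALLELISM = True
    PIPELINE_TRANSFORMER_LAYER_CLS = ToyDecoderLayer
    PIPELINE_INPUT_NAMES = ["input_ids", "attention_mask"]


class IncompleteModel:
    SUPPORTS_PIPELINE_PARALLELISM = True


complete_report = missing_pipeline_attributes(CompleteModel)
incomplete_report = missing_pipeline_attributes(IncompleteModel)
```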

4. Flash Attention Support

Set `_supports_flash_attn_2 = True` if your model supports it
Implement both the eager and the flash attention paths
Use the appropriate attention function dispatch

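Testing Your Implementation

The training tests in `tests/training/` provide a comprehensive testing framework for validating numerical correctness, distributed training scenarios, and checkpoint compatibility. Most of these tests are not meant to be implemented per custom model; they validate the core functionality of the Optimum Neuron training infrastructure. With that in mind, here is what you need to implement for your custom model:

1. Custom Modeling Validation

The `test_custom_modeling.py` file verifies that your custom implementation produces the same outputs as the original Transformers model.

Update `tests/training/test_custom_modeling.py`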

CUSTOM_MODELINGS_TO_TEST = [
    # ... existing models ...
    ("YourModelForCausalLM", "your-org/your-model-name"),
]

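Important: for the custom modeling validation tests, use small or tiny models to keep CI efficient. A test model should have:

A small vocabulary (e.g., 1000-8000 tokens)
Few layers (e.g., 2-4 layers)
A small hidden dimension (e.g., 128-512)
A minimal number of attention heads (e.g., 4-8 heads)

Examples of test models suitable for custom modeling validation:

`"michaelbenayoun/llama-2-tiny-4kv-heads-4layers-random"` - 4 layers, 4 KV heads
`"michaelbenayoun/granite-tiny-4kv-heads-4layers-random"` - tiny Granite model
`"michaelbenayoun/qwen3-tiny-4kv-heads-4layers-random"` - tiny Qwen3 model

Key test your model must pass: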

def test_custom_modeling_matches_original()  # Output matching

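Numerical correctness: ensures the custom model's outputs match the Transformers outputs exactly
Parallelization support: tests the various QKV implementations (regular, fused, GQA)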
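
The core of an output-matching check is an element-wise comparison between the reference and custom outputs. The sketch below shows that comparison on plain lists of floats, assuming a small tolerance is acceptable for floating-point values. The tolerances and the `outputs_match` helper are assumptions chosen for illustration; they are not the values or the code used by `test_custom_modeling.py`.

```python
import math


def outputs_match(reference, candidate, rel_tol=1e-5, abs_tol=1e-6):
    """Return True when both sequences have equal length and element-wise close values."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(ref, cand, rel_tol=rel_tol, abs_tol=abs_tol)
        for ref, cand in zip(reference, candidate)
    )


# Stand-ins for logits from the original model and from the custom implementation.
reference_logits = [0.1, 0.2, -1.5]
close_logits = [0.1, 0.2000001, -1.5]
wrong_logits = [0.1, 0.3, -1.5]

close_result = outputs_match(reference_logits, close_logits)
wrong_result = outputs_match(reference_logits, wrong_logits)
```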

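2. End-to-End Training Validation

The `test_overfit.py` file validates training convergence. To include your model in the end-to-end training validation, you must add it to the parameterized test cases.

Update `tests/training/test_overfit.py`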

@pytest.mark.parametrize(
    "model_class_name,model_name_or_path,learning_rate,warmup_ratio,training_kwargs,use_flash_attention_2,max_expected_loss,max_length,num_steps",
    [
        # ... existing models ...
        [
            "YourModelForCausalLM",
            "your-org/your-model-name",
            1e-4,
            0.03,
            {},
            True,
            0.5,
            2048,
            50,
        ],
    ],
    ids=[
        # ... existing model IDs ...
        "your-org/your-model-name",
    ],
)

This test validates:

Convergence: ensures the model can overfit a simple dataset

Your model will be tested by:

def test_overfit_custom_modeling_causal_lm()       # Basic training (your model included)
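
The idea behind an overfit test is that a correct training loop should drive the loss on a tiny dataset below a chosen threshold within a fixed number of steps. The toy example below shows that pattern with a one-parameter linear model in plain Python. It is only an illustration of the check, not the logic of `test_overfit.py`; the dataset, learning rate, and `overfit_linear` helper are assumptions.

```python
def overfit_linear(samples, learning_rate, num_steps):
    """Fit y = weight * x by gradient descent and record the mean squared error per step."""
    weight = 0.0
    losses = []
    for _ in range(num_steps):
        # Gradient of the mean squared error with respect to the weight
        gradient = sum(2 * (weight * x - y) * x for x, y in samples) / len(samples)
        weight -= learning_rate * gradient
        loss = sum((weight * x - y) ** 2 for x, y in samples) / len(samples)
        losses.append(loss)
    return weight, losses


# Tiny dataset generated by y = 2x, so a perfect fit exists.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
max_expected_loss = 0.5

weight, losses = overfit_linear(samples, learning_rate=0.05, num_steps=50)
converged = losses[-1] <= max_expected_loss
```

The `max_expected_loss` and `num_steps` names match the test parameters above, which serve the same role: a loss ceiling and a step budget.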

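3. Auto Model Loading

The `test_modeling_auto.py` file verifies that your model can be loaded with the `NeuronModel` and `NeuronModelForCausalLM` auto classes. To include your model in these tests, you must add it to the test cases.

Update `tests/training/test_modeling_auto.py`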

@pytest.mark.parametrize("from_pretrained", [False, True], ids=["from_config", "from_pretrained"])
@distributed_test(world_size=1)
@is_trainium_test
def test_auto_model_with_supported_architecture(from_pretrained):
    trn_config = TrainingNeuronConfig()
    kwargs = {"torch_dtype": torch.bfloat16}
    for model_name_or_path in [
        "michaelbenayoun/llama-2-tiny-4kv-heads-4layers-random",
        "michaelbenayoun/granite-tiny-4kv-heads-4layers-random", 
        "michaelbenayoun/qwen3-tiny-4kv-heads-4layers-random",
        "your-org/your-model-name",  # Add your model here
    ]:
        # ... rest of test logic

@pytest.mark.parametrize("from_pretrained", [False, True], ids=["from_config", "from_pretrained"])
@distributed_test(world_size=1)
@is_trainium_test
def test_auto_model_for_causal_lm_with_supported_architecture(from_pretrained):
    trn_config = TrainingNeuronConfig()
    kwargs = {"torch_dtype": torch.bfloat16}
    for model_name_or_path in [
        "michaelbenayoun/llama-2-tiny-4kv-heads-4layers-random",
        "michaelbenayoun/granite-tiny-4kv-heads-4layers-random",
        "michaelbenayoun/qwen3-tiny-4kv-heads-4layers-random", 
        "your-org/your-model-name",  # Add your model here
    ]:
        # ... rest of test logic

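This test validates:

Auto model loading: tests that `NeuronModel.from_pretrained()` and `NeuronModel.from_config()` work correctly
Auto CausalLM loading: tests that `NeuronModelForCausalLM.from_pretrained()` and `NeuronModelForCausalLM.from_config()` work correctly

4. Running the Tests

The tests require an AWS Trainium instance. To run specific test categories: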

# Run all custom modeling tests
pytest tests/training/test_custom_modeling.py -v

# Run specific model tests
pytest tests/training/test_custom_modeling.py -v -k "your_model"

# Run end-to-end training validation
pytest tests/training/test_overfit.py -v

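5. Testing Requirements

Your implementation must:

Pass the numerical correctness tests against the original Transformers implementation
Support the parallelization strategies (at least DP and TP; PP is recommended)
Handle the various QKV implementations (regular, fused, GQA)
Support checkpoint consolidation for distributed training
Support LoRA training (if applicable)
Demonstrate convergence through the overfitting tests

The testing framework ensures that your custom model stays compatible with the existing Optimum Neuron training infrastructure while providing the expected performance and correctness guarantees.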

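Common Issues

Weight shape mismatch: make sure the transformation specs handle tensor shapes correctly
Pipeline parallelism errors: check that all required attributes are set
Memory issues: consider gradient checkpointing and activation recomputation
Attention compatibility: verify that the attention implementation is compatible with your model architecture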

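Additional Resources

This guide provides the foundation for implementing custom models. For complete examples and advanced patterns, refer to these existing implementations:

LLaMA: `optimum/neuron/models/training/llama/modeling_llama.py` - complete implementation covering regular, fused, and GQA attention, plus a fused MLP
Qwen3: `optimum/neuron/models/training/qwen3/modeling_qwen3.py` - shows how to adapt the Llama implementation to Qwen3's `q_norm` and `k_norm` layers

Key files to study:

`optimum/neuron/models/training/modeling_utils.py` - the `NeuronModelMixin` base class
`optimum/neuron/models/training/transformations_utils.py` - weight transformation specs
`optimum/neuron/models/training/config.py` - `TrainingNeuronConfig` for parallelism settings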
