Adding BetterTransformer support for new architectures

Do you want to add a new model to `Better Transformer`, the fast path of the PyTorch Transformer API? Check out this guide!

Models that should be supported

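In theory, any model with a Transformer encoder layer similar to the classic encoder described in the "Attention Is All You Need" paper should be supported. More specifically, a model whose encoder block contains a MultiHead-Attention module (with pre- or post-attention layer norm) should be convertible to its `BetterTransformer` equivalent. The conditions can be summarized as follows: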

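Uses the classic multi-head attention module (for example, DeBERTa cannot be supported)
Uses either the `gelu` or the `relu` activation function
Has an even number of attention heads
Does not use any attention bias (for example, `T5` uses an attention bias and therefore cannot be supported)
Uses the same `eps` for the first and second layer norms of each layer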
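The conditions above can be written down as a small checklist. The sketch below is illustrative only: `EncoderSpec` and `unmet_conditions` are hypothetical names introduced here, not part of `optimum` or `transformers`, and the example values are made up to exercise each rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EncoderSpec:
    """Hypothetical summary of one encoder layer (not a library class)."""

    attention_type: str  # "classic" for standard multi-head attention
    activation: str  # name of the feed-forward activation
    num_heads: int
    uses_attention_bias: bool
    norm1_eps: float
    norm2_eps: float


def unmet_conditions(spec: EncoderSpec) -> List[str]:
    """Return the conditions from the list above that the layer does not meet."""
    problems = []
    if spec.attention_type != "classic":
        problems.append("non-classic attention module")
    if spec.activation not in ("gelu", "relu"):
        problems.append("activation is not gelu or relu")
    if spec.num_heads % 2 != 0:
        problems.append("odd number of attention heads")
    if spec.uses_attention_bias:
        problems.append("uses an attention bias")
    if spec.norm1_eps != spec.norm2_eps:
        problems.append("layer norm eps values differ")
    return problems


# Illustrative values only; read the real ones from the model's config.
bert_like = EncoderSpec("classic", "gelu", 12, False, 1e-12, 1e-12)
biased = EncoderSpec("classic", "relu", 8, True, 1e-6, 1e-6)

print(unmet_conditions(bert_like))  # []
print(unmet_conditions(biased))  # ['uses an attention bias']
```

An empty list means that none of the listed conditions rules the layer out.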

How to convert a model to the BetterTransformer format?

Step 1: Identifying the source layer to change

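First, go to `optimum/bettertransformer/__init__.py`, where you will see the dictionary `BetterTransformerManager.MODEL_MAPPING`. It should contain the mapping between a model type and a `Tuple[str, BetterTransformerBaseLayer]`, made up of the name of the `nn.Module` that can be converted to its `BetterTransformer` equivalent and the corresponding `BetterTransformer` layer class.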

Let us go through it step by step for `Bert`. First, we need to identify the layers that need to be replaced:

>>> from transformers import AutoModel

>>> model = AutoModel.from_pretrained("bert-base-uncased")
>>> print(model)
...
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
      (11): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (intermediate): BertIntermediate(
          (dense): Linear(in_features=768, out_features=3072, bias=True)
          (intermediate_act_fn): GELUActivation()
        )
        (output): BertOutput(
          (dense): Linear(in_features=3072, out_features=768, bias=True)
          (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
  )
  (pooler): BertPooler(
    (dense): Linear(in_features=768, out_features=768, bias=True)
    (activation): Tanh()
  )
)

You can clearly see that the layers that need to be replaced are the `BertLayer` modules, since they contain the whole encoder layer.
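Instead of reading the printed model by eye, the candidate layers can also be located programmatically. The following is a minimal sketch on a toy model so that it runs without downloading weights; `ToyLayer`, `ToyEncoder` and `find_modules_by_class_name` are hypothetical names used only for illustration.

```python
from typing import List

import torch.nn as nn


class ToyLayer(nn.Module):
    """Stand-in for an encoder layer such as BertLayer (illustration only)."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.dense = nn.Linear(dim, dim)


class ToyEncoder(nn.Module):
    """Stand-in for an encoder that stacks several layers."""

    def __init__(self, num_layers: int = 3):
        super().__init__()
        self.layer = nn.ModuleList([ToyLayer() for _ in range(num_layers)])


def find_modules_by_class_name(model: nn.Module, class_name: str) -> List[str]:
    """Return the dotted names of all submodules whose class is called `class_name`."""
    return [
        name
        for name, module in model.named_modules()
        if module.__class__.__name__ == class_name
    ]


layer_names = find_modules_by_class_name(ToyEncoder(), "ToyLayer")
print(layer_names)  # ['layer.0', 'layer.1', 'layer.2']
```

Calling the same helper on the loaded `Bert` model with `"BertLayer"` should list one entry per encoder layer.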

Step 2: Building the xxxLayerBetterTransformer module

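Check that the identified module is not already copied from another module: inspect the source code in `transformers` and check that the class definition is not marked with a `# Copied from ...` comment. If it is not, create a class in `bettertransformer/models/encoder_model.py`. Start with these lines: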

import torch
import torch.nn as nn

from ..base import BetterTransformerBaseLayer


class BertLayerBetterTransformer(BetterTransformerBaseLayer):
    def __init__(self, bert_layer, config):
        ...

Now make sure to fill in all the necessary attributes. The list of attributes is:

in_proj_weight
in_proj_bias
out_proj_weight
out_proj_bias
linear1_weight
linear1_bias
linear2_weight
linear2_bias
norm1_eps
norm1_weight
norm1_bias
norm2_weight
norm2_bias
num_heads
embed_dim

Note that these attributes correspond to all the components needed to run a Transformer encoder module; see Figure 1 of the "Attention Is All You Need" paper.

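Sometimes the `query`, `key` and `value` layers need to be "contiguified"; check the `modeling_encoder.py` file to understand more.

Once you have filled in all these attributes, also make sure to add these lines: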

self.is_last_layer = False
self.validate_bettertransformer()
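One plausible reading of "contiguifying" the `query`, `key` and `value` layers is concatenating their weights and biases into the single `in_proj_weight` and `in_proj_bias` tensors, in query/key/value order. That interpretation is an assumption, not something this guide spells out. The sketch below uses stand-alone `nn.Linear` layers rather than a real `BertLayer` and checks that the fused projection reproduces the three separate ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embed_dim = 8

# Separate projections, standing in for the query/key/value layers of an attention block.
query = nn.Linear(embed_dim, embed_dim)
key = nn.Linear(embed_dim, embed_dim)
value = nn.Linear(embed_dim, embed_dim)

# Stack them along the output dimension, in query/key/value order (assumed layout).
in_proj_weight = nn.Parameter(torch.cat([query.weight, key.weight, value.weight]))
in_proj_bias = nn.Parameter(torch.cat([query.bias, key.bias, value.bias]))

x = torch.randn(2, 5, embed_dim)
q, k, v = F.linear(x, in_proj_weight, in_proj_bias).chunk(3, dim=-1)

# The fused projection should match the three separate projections.
fused_matches = (
    torch.allclose(q, query(x), atol=1e-5)
    and torch.allclose(k, key(x), atol=1e-5)
    and torch.allclose(v, value(x), atol=1e-5)
)
print(tuple(in_proj_weight.shape), fused_matches)
```

The fused weight has shape `(3 * embed_dim, embed_dim)`, so one matrix multiplication yields all three projections at once.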

Step 3: Building the forward pass

First, start with the line `super().forward_checker()`. This is required so that the parent class can run all the safety checks beforehand.

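After the first forward pass, the hidden states need to be *nested* using the attention mask. Once they are nested, the attention mask is no longer needed and can therefore be set to `None`. This is how the forward pass is built for `Bert`. These lines should stay largely the same across models, although the shape of the attention mask sometimes differs from one model to another.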

super().forward_checker()

if hidden_states.is_nested:
    attention_mask = None

if attention_mask is not None:
    # attention mask comes in with values 0 and -inf. we convert to torch.nn.TransformerEncoder style bool mask
    # 0->false->keep this token -inf->true->mask this token
    attention_mask = attention_mask.bool()
    attention_mask = torch.reshape(attention_mask, (attention_mask.shape[0], attention_mask.shape[-1]))
    seqlen = attention_mask.shape[1]
    lengths = torch.sum(~attention_mask, 1)
    if not all([l == seqlen for l in lengths]):
        hidden_states = torch._nested_tensor_from_mask(hidden_states, ~attention_mask)
    attention_mask = None
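To see what the mask handling above computes, here is a small worked example on concrete values. It assumes an additive mask of shape `(batch, 1, 1, seq_len)` with `0` for kept tokens and `-inf` for padding, as described in the code comment. The variable names `padding_mask` and `needs_nesting` are introduced here for readability.

```python
import torch

# Two sequences padded to length 4; their real lengths are 4 and 2.
neg_inf = float("-inf")
attention_mask = torch.tensor(
    [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, neg_inf, neg_inf]]
).reshape(2, 1, 1, 4)

# Same steps as in the snippet above.
padding_mask = attention_mask.bool()  # nonzero (-inf) -> True -> masked token
padding_mask = torch.reshape(
    padding_mask, (padding_mask.shape[0], padding_mask.shape[-1])
)
seqlen = padding_mask.shape[1]
lengths = torch.sum(~padding_mask, 1)  # number of real tokens per sequence

# Nesting is only worthwhile when at least one sequence is actually padded.
needs_nesting = not all(l == seqlen for l in lengths)

print(lengths.tolist(), needs_nesting)  # [4, 2] True
```

If every sequence already has the full length, the hidden states are left as a regular tensor and only the mask is dropped.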

Once the hidden states are nested, call `torch._transformer_encoder_layer_fwd` with the right arguments, as follows:

hidden_states = torch._transformer_encoder_layer_fwd(
    hidden_states,
    self.embed_dim,
    self.num_heads,
    self.in_proj_weight,
    self.in_proj_bias,
    self.out_proj_weight,
    self.out_proj_bias,
    self.use_gelu,
    self.norm_first,
    self.norm1_eps,
    self.norm1_weight,
    self.norm1_bias,
    self.norm2_weight,
    self.norm2_bias,
    self.linear1_weight,
    self.linear1_bias,
    self.linear2_weight,
    self.linear2_bias,
    attention_mask,
)

In the last layer, it is important to "un-nest" the hidden states so that subsequent modules can process them. This is done with these lines:

if hidden_states.is_nested and self.is_last_layer:
    hidden_states = hidden_states.to_padded_tensor(0.0)
return (hidden_states,)

Also, make sure to return a `tuple` to follow the convention of `transformers`.
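The nesting and un-nesting round trip can be tried in isolation. The sketch below uses the public `torch.nested` API rather than the private `torch._nested_tensor_from_mask` helper used in the layer code, so it illustrates the idea and is not a copy of the actual implementation. It assumes a PyTorch version that ships `torch.nested`.

```python
import torch

# Padded hidden states of shape (batch=2, seq_len=4, hidden=3).
hidden_states = torch.arange(2 * 4 * 3, dtype=torch.float32).reshape(2, 4, 3)
lengths = [4, 2]  # real (unpadded) length of each sequence

# Nest: keep only the real tokens of each sequence.
nested = torch.nested.nested_tensor(
    [hidden_states[i, :n] for i, n in enumerate(lengths)]
)

# Un-nest: convert back to a regular tensor, filling padded positions with 0.0.
padded = torch.nested.to_padded_tensor(nested, 0.0)

print(nested.is_nested, tuple(padded.shape))  # True (2, 4, 3)
```

After un-nesting, the padded positions hold `0.0` rather than their original values, which is why only the last layer should perform this step.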

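The best way to reproduce this on your own model is to take inspiration from the provided modeling scripts. Of course, we will be happy to help you convert your model if you open an issue or a pull request on `optimum`!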

Step 4: Sanity check!

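As a last step, make sure to update the `BetterTransformerManager.MODEL_MAPPING` dictionary in `optimum/bettertransformer/__init__.py` with the correct names; you should then be ready to convert your model. For example, for Bert, that would be: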

MODEL_MAPPING = {
  ...
  "bert": ("BertLayer", BertLayerBetterTransformer),
  ...
}

Try it out with the conversion method presented in the tutorials section!
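To build intuition for how a mapping of this shape can drive a conversion, here is a minimal sketch that swaps layers by class name on a toy model. It is not the `optimum` implementation: `ToyLayer`, `ToyLayerFast`, `TOY_MAPPING` and `replace_layers` are hypothetical names, and the real conversion performs additional checks.

```python
import torch.nn as nn


class ToyLayer(nn.Module):
    """Stand-in for a source layer such as BertLayer."""

    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(4, 4)


class ToyLayerFast(nn.Module):
    """Stand-in for the converted layer, built from the original one."""

    def __init__(self, original: ToyLayer):
        super().__init__()
        self.weight = original.dense.weight
        self.bias = original.dense.bias


# Hypothetical mapping with the same (source class name, target class) shape.
TOY_MAPPING = {"toy": ("ToyLayer", ToyLayerFast)}


def replace_layers(module: nn.Module, source_name: str, target_cls) -> int:
    """Recursively swap children whose class is `source_name`; return how many were swapped."""
    replaced = 0
    for child_name, child in list(module.named_children()):
        if child.__class__.__name__ == source_name:
            setattr(module, child_name, target_cls(child))
            replaced += 1
        else:
            replaced += replace_layers(child, source_name, target_cls)
    return replaced


model = nn.Sequential(ToyLayer(), nn.Sequential(ToyLayer(), nn.ReLU()))
source_name, target_cls = TOY_MAPPING["toy"]
num_replaced = replace_layers(model, source_name, target_cls)
print(num_replaced)  # 2
```

A useful sanity check after any such swap is to confirm that no module of the source class is left in the model.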
