Sparse Mixture of Experts Language Model from Scratch: Extending makeMoE with Expert Capacity

Community article, published March 18, 2024

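My previous blog post walked through an end-to-end implementation of makeMoE, a sparse mixture of experts language model inspired by Andrej Karpathy's makemore and nanoGPT, and it drew wide interest from the community (https://huggingface.co/blog/AviSoori1x/makemoe-from-scratch). Recently, x.ai open-sourced Grok-1, another sparse MoE LLM, which prompted me to enhance makeMoE with a feature I originally left out: expert capacity.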

The GitHub repository linked here contains the end-to-end implementation, including expert capacity: https://github.com/AviSoori1x/makeMoE

Why is expert capacity so important?

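Pretraining a sparse mixture of experts language model, or any large language model, typically involves multiple GPUs and often many machines. How training is parallelized across these hardware resources is critical for balancing the computational load. However, if certain experts or a group of experts become overly favored (reflecting a preference for exploitation over exploration), this can not only cause model performance problems but also unbalance the computational load across the cluster.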

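The Switch Transformer implementation sidesteps this problem with expert capacity. Expert capacity caps the number of tokens each expert is responsible for processing during training or inference. It is defined in terms of the number of tokens in the batch and the number of available experts, and is usually adjusted by a capacity factor. This factor gives flexibility in allocation: it provides a buffer to accommodate variation in the data distribution and ensures that no single expert becomes a bottleneck from overload. This matters a great deal because hardware failures are common when training these large models for weeks or even months.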

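Expert capacity is typically computed as follows:

Expert capacity = (tokens per batch / number of experts) × capacity factor, where:

Tokens per batch is the total number of tokens in the batch to be processed. Number of experts is the total number of experts in the MoE layer available to process the data. Capacity factor is a multiplier that adjusts the base capacity (tokens per batch divided by the number of experts). A capacity factor greater than 1 gives each expert a buffer beyond its even share, to accommodate imbalance in how tokens are distributed. Typical values range from 1 to 1.25.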
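To make the formula concrete, here is a minimal sketch in plain Python. The function name `expert_capacity` and the batch shape (4 sequences of 8 tokens, 8 experts) are illustrative assumptions, not values from makeMoE.

```python
def expert_capacity(tokens_per_batch: int, num_experts: int,
                    capacity_factor: float = 1.0) -> int:
    """Expert capacity = (tokens per batch / number of experts) * capacity factor."""
    if num_experts <= 0:
        raise ValueError("num_experts must be positive")
    # int() truncates toward zero, like the int(...) cast in the SparseMoE code.
    return int((tokens_per_batch / num_experts) * capacity_factor)


# Illustrative numbers only: 4 sequences of 8 tokens routed across 8 experts.
tokens = 4 * 8
even_share = expert_capacity(tokens, num_experts=8, capacity_factor=1.0)
with_buffer = expert_capacity(tokens, num_experts=8, capacity_factor=1.25)
print(even_share, with_buffer)
```

With a capacity factor of 1.0 each expert can take exactly its even share of 4 tokens. Raising the factor to 1.25 leaves room for one extra token per expert before any are cut off.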

The code block below makes minor changes to implement a simple version of expert capacity.

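import torch
import torch.nn as nn

# NoisyTopkRouter and Expert are assumed to be defined as in the earlier makeMoE post.
class SparseMoE(nn.Module):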
    def __init__(self, n_embed, num_experts, top_k, capacity_factor=1.0):
        super(SparseMoE, self).__init__()
        self.router = NoisyTopkRouter(n_embed, num_experts, top_k)
        self.experts = nn.ModuleList([Expert(n_embed) for _ in range(num_experts)])
        self.top_k = top_k
        self.capacity_factor = capacity_factor
        self.num_experts = num_experts
    
    def forward(self, x):
        # Assuming x has shape [batch_size, seq_len, n_embd]
        batch_size, seq_len, _ = x.shape
        gating_output, indices = self.router(x)
        final_output = torch.zeros_like(x)

        # Flatten the batch and sequence dimensions to treat each token independently
        flat_x = x.view(-1, x.size(-1))  # Now shape [batch_size * seq_len, n_embd]
        flat_gating_output = gating_output.view(-1, gating_output.size(-1))

        tokens_per_batch = batch_size * seq_len * self.top_k
        expert_capacity = int((tokens_per_batch / self.num_experts) * self.capacity_factor)

        updates = torch.zeros_like(flat_x)

        for i, expert in enumerate(self.experts):
            expert_mask = (indices == i).any(dim=-1)
            flat_mask = expert_mask.view(-1)
            selected_indices = torch.nonzero(flat_mask).squeeze(-1)

            limited_indices = selected_indices[:expert_capacity] if selected_indices.numel() > expert_capacity else selected_indices
            if limited_indices.numel() > 0:
                expert_input = flat_x[limited_indices]
                expert_output = expert(expert_input)

                gating_scores = flat_gating_output[limited_indices, i].unsqueeze(1)
                weighted_output = expert_output * gating_scores

                updates.index_add_(0, limited_indices, weighted_output)

        # Reshape updates to match the original dimensions of x
        final_output += updates.view(batch_size, seq_len, -1)

        return final_output

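As is common in this kind of implementation, a fair amount of tensor shape manipulation is needed to keep the shapes aligned, but the most important parts of the implementation are only a few lines of code. Let's zoom in on those.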

First, let's look at the expert capacity calculation.

expert_capacity = int((tokens_per_batch / self.num_experts) * self.capacity_factor)

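This is quite straightforward. It is computed inside the forward pass to handle the case where dynamic batch sizes are used. Note that `tokens_per_batch` here is `batch_size * seq_len * self.top_k`, since each token is routed to `top_k` experts.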

The next important lines of code are:

limited_indices = selected_indices[:expert_capacity] if selected_indices.numel() > expert_capacity else selected_indices
if limited_indices.numel() > 0:
    # Remaining logic to process and accumulate weighted expert outputs for the selected tokens.

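The `selected_indices` tensor identifies the tokens designated for processing by the i-th expert. If the number of tokens assigned to that expert exceeds its capacity, the tensor is truncated to the expert's maximum capacity. Otherwise, it is used as is in the subsequent computation.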
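The truncation rule can be sketched without PyTorch. The helper `assign_with_capacity` and the `routing` list below are hypothetical stand-ins for the router's `indices` output. The point to notice is that slicing keeps the earliest token positions in flattened order and drops the rest.

```python
def assign_with_capacity(top_k_indices, num_experts, capacity):
    """Return (kept, dropped): per-expert lists of token positions.

    top_k_indices[t] is the list of expert ids chosen for token t
    (a hypothetical stand-in for the router's `indices` output).
    """
    kept = {e: [] for e in range(num_experts)}
    dropped = {e: [] for e in range(num_experts)}
    for expert in range(num_experts):
        # Token positions routed to this expert, in flattened order.
        selected = [t for t, chosen in enumerate(top_k_indices) if expert in chosen]
        # Same rule as selected_indices[:expert_capacity]: keep the first `capacity`.
        kept[expert] = selected[:capacity]
        dropped[expert] = selected[capacity:]
    return kept, dropped


# Six tokens, top-2 routing over three experts; expert 0 is heavily favored.
# Capacity 4 corresponds to int((6 * 2 / 3) * 1.0).
routing = [[0, 1], [0, 2], [0, 1], [0, 2], [0, 1], [1, 2]]
kept, dropped = assign_with_capacity(routing, num_experts=3, capacity=4)
print(kept[0], dropped[0])
```

Here expert 0 is chosen by five tokens but may only process four, so token 4 is dropped for that expert even though expert 1 still processes it.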

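That computation passes each selected token through the expert to get its output, then applies the corresponding gating value to produce a weighted output. The weighted outputs are accumulated step by step into the final output tensor, which forms the overall output of the MoE layer.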
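As a rough illustration of this accumulation step, the sketch below uses scalars in place of embedding vectors and plain Python in place of `index_add_`. The toy experts, gating weights, and the `combine_expert_outputs` helper are all hypothetical.

```python
def combine_expert_outputs(token_values, experts, gates, kept):
    """Accumulate gate-weighted expert outputs per token.

    token_values: one scalar per token (stand-in for an embedding vector)
    experts:      list of callables, one per expert
    gates:        gates[t][e] is the gating weight of expert e for token t
    kept:         kept[e] lists the token positions expert e actually processes
    """
    updates = [0.0] * len(token_values)
    for e, expert in enumerate(experts):
        for t in kept[e]:
            # Plays the role of updates.index_add_(0, limited_indices, weighted_output).
            updates[t] += gates[t][e] * expert(token_values[t])
    return updates


# Hypothetical toy experts and gating weights, chosen so the arithmetic is exact.
experts = [lambda v: v * 2.0, lambda v: v + 1.0]
token_values = [1.0, 2.0, 3.0]
gates = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]
kept = {0: [0, 1], 1: [0, 1]}  # token 2 exceeded expert 0's capacity and was dropped
out = combine_expert_outputs(token_values, experts, gates, kept)
print(out)
```

Token 2 was routed only to expert 0 and fell past that expert's capacity, so it receives a zero update from this layer. This is the cost of the simple truncation scheme.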

The Jupyter Notebook containing the implementation is here: https://github.com/AviSoori1x/makeMoE/blob/main/makeMoE_from_Scratch_with_Expert_Capacity.ipynb

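This approach to managing expert capacity is fairly basic. More advanced strategies are explored in the literature, such as those in Google's Switch Transformer paper, available here: https://arxiv.org/abs/2101.03961. Although the approach presented here simplifies capacity handling, it offers an intuitive introduction to the concept and makes makeMoE more complete!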
