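Multi-GPU debugging

Distributed training can be tricky because you have to ensure you're using the correct CUDA version across your system. You may encounter communication issues between GPUs, and there may be underflow or overflow problems in your model.

This guide covers how to debug these issues, especially as they relate to DeepSpeed and PyTorch.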

DeepSpeed CUDA

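DeepSpeed compiles CUDA C++, which can be a source of errors when building PyTorch extensions that require CUDA. These errors depend on how CUDA is installed on your system. This section focuses on PyTorch built with *CUDA 10.2*.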

pip install deepspeed

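For any other installation issues, please open an issue with the DeepSpeed team.

Non-identical toolkits

PyTorch comes with its own CUDA toolkit, but to use DeepSpeed with PyTorch, you need to have an identical version of CUDA installed system-wide. For example, if you installed PyTorch with `cudatoolkit==10.2` in your Python environment, then you'll also need to have CUDA 10.2 installed system-wide.

The exact location can vary from system to system, but `/usr/local/cuda-10.2` is the most common location on many Unix systems. When CUDA is correctly set up and added to your `PATH` environment variable, you can find the installation location with the following command.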

which nvcc
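
The same lookup can be done from Python, which may be convenient inside a setup or diagnostic script. This is only a sketch: `cuda_root_from_nvcc` and `find_cuda_root` are hypothetical helper names, not part of DeepSpeed or PyTorch, and the code assumes the conventional `<root>/bin/nvcc` layout.

```python
import shutil
from pathlib import PurePosixPath
from typing import Optional


def cuda_root_from_nvcc(nvcc_path: str) -> str:
    """Return the toolkit root for an nvcc path laid out as <root>/bin/nvcc."""
    return str(PurePosixPath(nvcc_path).parent.parent)


def find_cuda_root() -> Optional[str]:
    """Look up nvcc on PATH (like `which nvcc`) and derive the toolkit root."""
    nvcc = shutil.which("nvcc")
    return cuda_root_from_nvcc(nvcc) if nvcc else None
```

`find_cuda_root()` returns `None` when `nvcc` is not on `PATH`, which is itself a useful hint that the toolkit is missing or not configured.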

Multiple toolkits

You may also have more than one CUDA toolkit installed on your system.

/usr/local/cuda-10.2
/usr/local/cuda-11.0

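Typically, package installers set the paths to whichever version was installed last. If the package build fails because it can't find the right CUDA version (even though it is installed), you need to configure the `PATH` and `LD_LIBRARY_PATH` environment variables to point to the correct path.

Take a look at the contents of these environment variables first.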

echo $PATH
echo $LD_LIBRARY_PATH

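`PATH` lists the locations of executables and `LD_LIBRARY_PATH` lists where to look for shared libraries. Earlier entries take priority over later ones, and `:` separates multiple entries. To make the build find a specific CUDA toolkit, insert its path at the front of the list. The commands below prepend to the existing values rather than overwrite them.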

# adjust the version and full path if needed
export PATH=/usr/local/cuda-10.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64:$LD_LIBRARY_PATH
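
To make the prepend semantics concrete, here is a minimal Python sketch of what the two `export` lines do to a `:`-separated search path. `prepend_path` is a hypothetical helper introduced for illustration. Unlike the shell one-liner, it also drops empty entries and an earlier copy of the same directory, which is a design choice of this sketch and not something the build requires.

```python
def prepend_path(new_entry: str, current: str) -> str:
    """Put new_entry first in a ':'-separated search path, keeping other entries."""
    # Drop empty entries and any existing copy of new_entry to avoid duplicates.
    entries = [e for e in current.split(":") if e and e != new_entry]
    return ":".join([new_entry] + entries)


if __name__ == "__main__":
    print(prepend_path("/usr/local/cuda-10.2/bin", "/usr/bin:/bin"))
```

Dropping empty entries matters mostly for `LD_LIBRARY_PATH`: if the variable was unset, the shell version leaves a trailing `:`, and the dynamic loader treats an empty entry as the current directory.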

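You should also check that the directories you assign actually exist. The `lib64` subdirectory contains various CUDA `.so` objects (such as `libcudart.so`). It is unlikely that your system names them differently, but check the actual names and adjust the paths accordingly.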
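
One way to run that check is sketched below. `find_cudart` is a hypothetical helper, and the `lib64` and `libcudart.so*` names are simply the ones mentioned above; adjust them if your installation differs.

```python
from pathlib import Path
from typing import List


def find_cudart(cuda_root: str) -> List[str]:
    """List libcudart shared objects under <cuda_root>/lib64, or [] if absent."""
    lib_dir = Path(cuda_root) / "lib64"
    if not lib_dir.is_dir():
        return []
    return sorted(p.name for p in lib_dir.glob("libcudart.so*"))


if __name__ == "__main__":
    # Example root taken from the text above; change it to match your system.
    print(find_cudart("/usr/local/cuda-10.2"))
```

An empty list means either the directory does not exist or the runtime library has a different name, and in both cases the `LD_LIBRARY_PATH` entry needs another look.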

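Older versions

Sometimes, older CUDA versions refuse to build with newer compilers, for example when you have `gcc-9` but CUDA wants `gcc-7`. Usually, installing the latest CUDA toolkit enables support for the newer compiler.

You could also install an older version of the compiler alongside the one you're currently using (it may even be installed already but not used by default, so the build system can't see it). To resolve this, create symlinks that make the older compiler visible to the build system.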

# adjust the path to your system
sudo ln -s /usr/bin/gcc-7  /usr/local/cuda-10.2/bin/gcc
sudo ln -s /usr/bin/g++-7  /usr/local/cuda-10.2/bin/g++

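Prebuild

If you're still having issues installing DeepSpeed, or if you're building DeepSpeed at run time, try prebuilding the DeepSpeed modules before installing them. Run the commands below to build DeepSpeed locally.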

git clone https://github.com/deepspeedai/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
--global-option="build_ext" --global-option="-j8" --no-cache -v \
--disable-pip-version-check 2>&1 | tee build.log

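Add the `DS_BUILD_AIO=1` parameter to the build command to use NVMe offload. Make sure the libaio-dev package is installed on your system.

Next, specify your GPU's architecture by editing the `TORCH_CUDA_ARCH_LIST` variable (NVIDIA publishes a complete list of its GPUs and their corresponding architectures). To check which architectures your PyTorch build supports, run the following command.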

python -c "import torch; print(torch.cuda.get_arch_list())"

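To find the architecture of your GPUs, query their compute capability, for example with `torch.cuda.get_device_capability()`, which returns a `(major, minor)` pair. If all GPUs are the same model, querying one of them is enough. To query a specific GPU, select it first, for example with the `CUDA_VISIBLE_DEVICES` environment variable.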

If you get `8, 6`, set `TORCH_CUDA_ARCH_LIST="8.6"`. For multiple GPUs with different architectures, list them all, as in `TORCH_CUDA_ARCH_LIST="6.1;8.6"`.
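
The formatting rule can be sketched as a small function that turns `(major, minor)` capability pairs into a `TORCH_CUDA_ARCH_LIST` value. `arch_list` is a hypothetical helper, and sorting and de-duplicating the entries is a choice made here for readability, not a documented requirement.

```python
from typing import Iterable, Tuple


def arch_list(capabilities: Iterable[Tuple[int, int]]) -> str:
    """Format (major, minor) compute capabilities as a TORCH_CUDA_ARCH_LIST value."""
    unique = sorted(set(capabilities))
    return ";".join(f"{major}.{minor}" for major, minor in unique)


if __name__ == "__main__":
    # On a GPU machine the pairs could come from PyTorch, for example:
    #   [torch.cuda.get_device_capability(i) for i in range(torch.cuda.device_count())]
    print(arch_list([(8, 6), (6, 1)]))
```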

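It is also possible to leave `TORCH_CUDA_ARCH_LIST` unset, in which case the build program automatically queries the architecture of the GPUs on the build machine. However, that may not match the GPUs on the target machine, which is why it is better to specify the correct architecture explicitly.

To train on multiple machines with the same setup, build a binary wheel as shown below.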

git clone https://github.com/deepspeedai/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \
python setup.py build_ext -j8 bdist_wheel

This command generates a binary wheel that looks something like `dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl`. Install this wheel locally or on another machine.

pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl

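Communication

Distributed training involves communication between processes and/or nodes, which can be a source of errors.

Download the script below to diagnose network issues, and then run it to test GPU communication. The example command below tests how two GPUs communicate. Adjust the `--nproc_per_node` and `--nnodes` parameters to match your system.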

wget https://raw.githubusercontent.com/huggingface/transformers/main/scripts/distributed/torch-distributed-gpu-test.py
python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py

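The script prints an `OK` status if both GPUs are able to communicate and allocate memory. Take a closer look at the diagnostic script for more details and for instructions on running it in a SLURM environment.

Add the `NCCL_DEBUG=INFO` environment variable to report more NCCL-related debugging information.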

NCCL_DEBUG=INFO python -m torch.distributed.run --nproc_per_node 2 --nnodes 1 torch-distributed-gpu-test.py

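Underflow and overflow detection

Activations or weights that are `inf` or `nan`, or a `loss=NaN`, usually indicate an underflow or overflow problem. To detect these issues, enable the `DebugUnderflowOverflow` module through the `debug` argument of `TrainingArguments`, or import the module and add it to your own training loop or another trainer class.

Trainer: pass `debug="underflow_overflow"` to `TrainingArguments`.

PyTorch training loop: import `DebugUnderflowOverflow` from `transformers.debug_utils` and instantiate it with your model, as in `DebugUnderflowOverflow(model)`.

The `DebugUnderflowOverflow` module inserts hooks into the model to test the input and output variables, and the corresponding model weights, after each forward call. If `inf` or `nan` is detected in at least one element of the activations or weights, the module prints a report like the one shown below.

The example below is from fp16 mixed precision training with google/mt5-small.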

Detected inf/nan during batch_number=0
Last 21 forward frames:
abs min  abs max  metadata
                  encoder.block.1.layer.1.DenseReluDense.dropout Dropout
0.00e+00 2.57e+02 input[0]
0.00e+00 2.85e+02 output
[...]
                  encoder.block.2.layer.0 T5LayerSelfAttention
6.78e-04 3.15e+03 input[0]
2.65e-04 3.42e+03 output[0]
             None output[1]
2.25e-01 1.00e+04 output[2]
                  encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output
                  encoder.block.2.layer.1.DenseReluDense.wi_0 Linear
2.17e-07 4.50e+00 weight
1.79e-06 4.65e+00 input[0]
2.68e-06 3.70e+01 output
                  encoder.block.2.layer.1.DenseReluDense.wi_1 Linear
8.08e-07 2.66e+01 weight
1.79e-06 4.65e+00 input[0]
1.27e-04 2.37e+02 output
                  encoder.block.2.layer.1.DenseReluDense.dropout Dropout
0.00e+00 8.76e+03 input[0]
0.00e+00 9.74e+03 output
                  encoder.block.2.layer.1.DenseReluDense.wo Linear
1.01e-06 6.44e+00 weight
0.00e+00 9.74e+03 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00      inf output

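At the beginning of the report, you can see which batch number the error occurred in. In this case, it occurred in the first batch (`batch_number=0`).

Each frame describes the module it is reporting on. For example, the frame below inspects `encoder.block.2.layer.1.layer_norm`, the layer norm in `layer.1` of `block.2` of the encoder. The forward call is to `T5LayerNorm`.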

                  encoder.block.2.layer.1.layer_norm T5LayerNorm
8.69e-02 4.18e-01 weight
2.65e-04 3.42e+03 input[0]
1.79e-06 4.65e+00 output

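The last frame reports on the `Dropout.forward` function. Its name, `encoder.block.2.layer.1.dropout`, shows that it is the `dropout` attribute of `encoder.block.2.layer.1`, applied to the output of `DenseReluDense`. The overflow (`inf`) therefore occurred in `layer.1` of `block.2` of the encoder, during the first batch. The largest absolute input element was 6.27e+04.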

                  encoder.block.2.layer.1.DenseReluDense T5DenseGatedGeluDense
1.79e-06 4.65e+00 input[0]
3.18e-04 6.27e+04 output
                  encoder.block.2.layer.1.dropout Dropout
3.18e-04 6.27e+04 input[0]
0.00e+00      inf output
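
For long reports it can help to locate the first non-finite value programmatically. The sketch below parses report text in the layout shown above (an indented `module class` header followed by `abs min  abs max  label` rows). `first_nonfinite` is a hypothetical helper, not part of `transformers`, and it assumes the report keeps this exact column layout.

```python
import math
from typing import Optional, Tuple


def first_nonfinite(report: str) -> Optional[Tuple[str, str]]:
    """Return (module name, tensor label) of the first inf/nan abs max in a report."""
    module = ""
    for line in report.splitlines():
        tokens = line.split()
        if len(tokens) == 2 and line[0].isspace() and tokens[0] != "None":
            # Frame header: "<module name> <class name>", indented past the value columns.
            module = tokens[0]
        elif len(tokens) == 3:
            try:
                abs_max = float(tokens[1])
            except ValueError:
                continue  # not a value row
            if not math.isfinite(abs_max):
                return module, tokens[2]
    return None
```

Applied to the two frames above, it would point at the `output` of `encoder.block.2.layer.1.dropout`.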

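The output activations of `T5DenseGatedGeluDense.forward` have an absolute maximum of 6.27e+04, which is close to the largest finite fp16 value of 65504 (about 6.55e+04). In the next step, `Dropout` zeroes some elements and rescales the remaining ones by `1 / (1 - p)`, where `p` is the dropout probability. This pushes the absolute maximum above the fp16 limit and causes the overflow.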
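
The arithmetic can be checked with a few lines of plain Python. The sketch assumes a dropout probability of `p = 0.1`, which is not stated in the report but is consistent with the input/output ratios of the `Dropout` frames (for example 2.57e+02 in, 2.85e+02 out). `after_dropout` is a hypothetical helper name.

```python
FP16_MAX = 65504.0  # largest finite float16 value


def after_dropout(abs_max: float, p: float) -> float:
    """Largest magnitude a surviving element can reach after dropout rescaling."""
    return abs_max / (1.0 - p)


if __name__ == "__main__":
    scaled = after_dropout(6.27e4, 0.1)  # p = 0.1 is an assumption, see above
    print(f"{scaled:.3e} exceeds fp16 max: {scaled > FP16_MAX}")
```

The input value itself still fits in fp16. It is only the rescaling step that takes it past the limit.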

Now that you know where the error is happening, you can investigate the modeling code in modeling_t5.py.

class T5DenseGatedGeluDense(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
        self.dropout = nn.Dropout(config.dropout_rate)
        self.gelu_act = ACT2FN["gelu_new"]

    def forward(self, hidden_states):
        hidden_gelu = self.gelu_act(self.wi_0(hidden_states))
        hidden_linear = self.wi_1(hidden_states)
        hidden_states = hidden_gelu * hidden_linear
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.wo(hidden_states)
        return hidden_states

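One solution is to go back a few steps, to before the values started growing too large, and switch to fp32 so the numbers don't overflow when multiplied or summed. Another potential solution is to temporarily disable mixed precision training (`amp`).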

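import torch

# Assumes the original forward body has been moved into self._forward.
def forward(self, hidden_states):
    if torch.is_autocast_enabled():
        # Turn autocast off for this module even when it is active elsewhere.
        with torch.cuda.amp.autocast(enabled=False):
            return self._forward(hidden_states)
    else:
        return self._forward(hidden_states)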

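The report only shows the inputs and outputs of full frames, so you may also want to analyze the intermediate values of a `forward` function. Add the `detect_overflow` function after the calls whose results you want to check, to track `inf` or `nan` values in the intermediate `forwarded_states`.

from transformers.debug_utils import detect_overflow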

class T5LayerFF(nn.Module):
    [...]

    def forward(self, hidden_states):
        forwarded_states = self.layer_norm(hidden_states)
        detect_overflow(forwarded_states, "after layer_norm")
        forwarded_states = self.DenseReluDense(forwarded_states)
        detect_overflow(forwarded_states, "after DenseReluDense")
        return hidden_states + self.dropout(forwarded_states)

Finally, you can configure the number of frames printed by `DebugUnderflowOverflow`.

from transformers.debug_utils import DebugUnderflowOverflow

debug_overflow = DebugUnderflowOverflow(model, max_frames_to_save=100)

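Batch tracing

`DebugUnderflowOverflow` can also trace the absolute minimum and maximum values in each batch with underflow and overflow detection disabled. This is useful for identifying where errors are occurring in the model.

The example below shows how to trace the minimum and maximum values in batches 1 and 3 (batches are zero-indexed).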

debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3])

                  *** Starting batch number=1 ***
abs min  abs max  metadata
                  shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.47e+04 input[0]
5.36e-05 7.92e+02 output
[...]
                  decoder.dropout Dropout
1.60e-07 2.27e+01 input[0]
0.00e+00 2.52e+01 output
                  decoder T5Stack
     not a tensor output
                  lm_head Linear
1.01e-06 7.92e+02 weight
0.00e+00 1.11e+00 input[0]
6.06e-02 8.39e+01 output
                   T5ForConditionalGeneration
     not a tensor output

                  *** Starting batch number=3 ***
abs min  abs max  metadata
                  shared Embedding
1.01e-06 7.92e+02 weight
0.00e+00 2.78e+04 input[0]
5.36e-05 7.92e+02 output
[...]

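In this mode, `DebugUnderflowOverflow` reports on a large number of frames, which makes debugging easier. Once you know where a problem is occurring, say batch 150, you can focus the trace on batches 149 and 150 and compare where the numbers start to diverge.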
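
Comparing two traced batches by hand is tedious, so a small helper can do the first pass. This is a sketch under stated assumptions: you have already collected, for each batch, a mapping from module name to its `abs max` in report order, and the factor-of-ten threshold is arbitrary. `first_divergence` is a hypothetical name.

```python
from typing import Dict, Optional


def first_divergence(
    before: Dict[str, float], after: Dict[str, float], ratio: float = 10.0
) -> Optional[str]:
    """Return the first module whose abs max grew by more than `ratio` between batches."""
    for name, old in before.items():  # dicts keep insertion (report) order
        new = after.get(name)
        if new is None or old == 0.0:
            continue  # module missing in the later batch, or nothing to compare against
        if new / old > ratio:
            return name
    return None
```

The returned module is only a starting point for inspection, since large values may be legitimate in some layers.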

It is also possible to abort the trace after a certain batch number, for example, batch 3.

debug_overflow = DebugUnderflowOverflow(model, trace_batch_nums=[1, 3], abort_after_batch_num=3)