Training on Intel CPU

How training optimization on CPU works

Accelerate fully supports Intel CPUs; you only need to enable it through the config.

Scenario 1: Acceleration of non-distributed CPU training

Run accelerate config on your machine:

$ accelerate config
-----------------------------------------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
No distributed training
Do you want to run your training on CPU only (even if a GPU / Apple Silicon device is available)? [yes/NO]:yes
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16

This will generate a config file that will be used automatically to set the default options correctly when you run

accelerate launch my_script.py --args_to_my_script
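
For reference, here is a minimal sketch of what such a script might look like, with a hypothetical model and random data standing in for a real workload (the actual NLP example used below lives at examples/nlp_example.py):

# my_script.py -- a minimal, hypothetical Accelerate training script.
import torch
from torch.utils.data import DataLoader, TensorDataset

from accelerate import Accelerator


def main():
    # Picks up device placement and mixed precision (e.g. bf16 on CPU)
    # from the config generated by `accelerate config`.
    accelerator = Accelerator()

    model = torch.nn.Linear(128, 2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
    dataloader = DataLoader(dataset, batch_size=32)

    # prepare() wraps the objects for the configured device (CPU here).
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

    model.train()
    for inputs, targets in dataloader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        accelerator.backward(loss)  # use this instead of loss.backward()
        optimizer.step()


if __name__ == "__main__":
    main()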

For example, here is how you would run the NLP example examples/nlp_example.py (from the root of the repo) with the default_config.yaml file generated by accelerate config:

compute_environment: LOCAL_MACHINE
distributed_type: 'NO'
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true

accelerate launch examples/nlp_example.py

[!CAUTION] accelerator.prepare can currently only handle simultaneously preparing multiple models (with no optimizer), or a single model together with its optimizer, for training. Any other attempt (e.g., two model-optimizer pairs) will raise a verbose error. To work around this limitation, consider calling accelerator.prepare separately for each model-optimizer pair, as sketched below.
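
A minimal sketch of that workaround, with two hypothetical model-optimizer pairs:

import torch

from accelerate import Accelerator

accelerator = Accelerator()

# Two hypothetical model-optimizer pairs.
model_a, model_b = torch.nn.Linear(8, 8), torch.nn.Linear(8, 8)
optimizer_a = torch.optim.SGD(model_a.parameters(), lr=1e-3)
optimizer_b = torch.optim.SGD(model_b.parameters(), lr=1e-3)

# Passing both pairs to a single prepare() call would raise an error;
# prepare each pair separately instead.
model_a, optimizer_a = accelerator.prepare(model_a, optimizer_a)
model_b, optimizer_b = accelerator.prepare(model_b, optimizer_b)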

Scenario 2: Acceleration of distributed CPU training

We use Intel oneCCL for communication, combined with the Intel® MPI library to deliver flexible, efficient, scalable cluster messaging on Intel® architecture. You can refer to the installation guide here.
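
For background, importing the oneCCL bindings for PyTorch is what registers the ccl backend with torch.distributed, which then carries the collectives during training. A minimal illustration of that mechanism, assuming oneccl_bindings_for_pytorch is installed (Accelerate and the launcher normally handle this, and set the environment variables, for you):

import os

import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401 -- the import registers the "ccl" backend

# RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT are normally set by
# `accelerate launch` or mpirun; the defaults below are for a 1-process demo.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="ccl",
    rank=int(os.environ.get("RANK", 0)),
    world_size=int(os.environ.get("WORLD_SIZE", 1)),
)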

Run accelerate config on your machine (node0):

$ accelerate config
-----------------------------------------------------------------------------------------------------------------------------------------------------------
In which compute environment are you running?
This machine
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Which type of machine are you using?
multi-CPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 4
-----------------------------------------------------------------------------------------------------------------------------------------------------------
What is the rank of this machine?
0
What is the IP address of the machine that will host the main process? 36.112.23.24
What is the port you will use to communicate with the main process? 29500
Are all the machines on the same local network? Answer `no` if nodes are on the cloud and/or on different network hosts [YES/no]: yes
Do you want accelerate to launch mpirun? [yes/NO]: yes
Please enter the path to the hostfile to use with mpirun [~/hostfile]: ~/hostfile
Enter the number of oneCCL worker threads [1]: 1
Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
How many processes should be used for distributed training? [1]:16
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
bf16

For example, here is how you would run the NLP example examples/nlp_example.py (from the root of the repo) with IPEX enabled for distributed CPU training.

default_config.yaml, generated by accelerate config:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_CPU
downcast_bf16: 'no'
machine_rank: 0
main_process_ip: 36.112.23.24
main_process_port: 29500
main_training_function: main
mixed_precision: bf16
mpirun_config:
  mpirun_ccl: '1'
  mpirun_hostfile: /home/user/hostfile
num_machines: 4
num_processes: 16
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: true
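
As a side note on what enabling IPEX amounts to: under the hood it corresponds to applying Intel® Extension for PyTorch's ipex.optimize to the model and optimizer before training. A standalone sketch, assuming intel_extension_for_pytorch is installed and with a hypothetical model:

import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# ipex.optimize applies CPU-specific operator and weight-layout optimizations;
# dtype=torch.bfloat16 additionally prepares the pair for BF16 training.
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)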

Set the following environment variables and use Intel MPI to launch the training.

In node0, you need to create a configuration file that contains the IP addresses of each node (for example, hostfile) and pass that configuration file path as an argument.

If you chose to let Accelerate launch mpirun, ensure that the location of your hostfile matches the path in the config.

$ cat hostfile
xxx.xxx.xxx.xxx #node0 ip
xxx.xxx.xxx.xxx #node1 ip
xxx.xxx.xxx.xxx #node2 ip
xxx.xxx.xxx.xxx #node3 ip

Before executing the accelerate launch command, you need to source the setvars.sh of the oneCCL bindings to set up your Intel MPI environment correctly. Note that both the python script and the environment need to be available on all of the machines being used for multi-CPU training.

oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh

accelerate launch examples/nlp_example.py

You can also launch distributed training directly with the mpirun command. Run the following command on node0, and 16DDP (DDP with 16 processes, 4 per node as set by -ppn 4) will be enabled on node0, node1, node2, and node3 with BF16 mixed precision. When using this launcher option, the python script, the python environment, and the accelerate config file all need to be available on all of the machines used for multi-CPU training.

oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh
export CCL_WORKER_COUNT=1
export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip
export CCL_ATL_TRANSPORT=ofi
mpirun -f hostfile -n 16 -ppn 4 accelerate launch examples/nlp_example.py