Debugging multi-node, multi-GPU NCCL + IB

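Communication speed is critical for multi-node, multi-GPU training and inference. Before starting a training or inference job, you can run NCCL Tests to confirm that the multi-node, multi-GPU environment and its communication paths are working correctly.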

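Installing NCCL

NCCL can be downloaded directly from the NVIDIA website (registration and login required). The NCCL version must match the CUDA version; if you depend on CUDA 12.4, download the build listed in row 3 of the figure below.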

image-iSSu.png

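It is recommended to install from the “OS-agnostic” offline archive: you only need to extract it and set the NCCL_HOME and LD_LIBRARY_PATH environment variables. This is flexible, and on machines shared by several users it keeps one user's environment from interfering with another's.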

image-QaHL.png

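# Extract the downloaded archive
tar -xf nccl_2.27.7-1+cuda12.4_x86_64.txz

# Set environment variables (PATH is optional and usually not needed)
export NCCL_HOME="$PWD/nccl_2.27.7-1+cuda12.4_x86_64"
export PATH="$NCCL_HOME/bin:$PATH"
# Append the old value only if it is non-empty, to avoid a trailing ':' in the search path
export LD_LIBRARY_PATH="$NCCL_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"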
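
As a quick sanity check that the extracted archive is the version you expect, the header shipped with NCCL can be inspected. The sketch below assumes the header defines NCCL_MAJOR, NCCL_MINOR and NCCL_PATCH as plain integers; nccl_header_version is a helper name made up for this example, not part of NCCL.

```shell
# Print the NCCL version recorded in a given nccl.h header.
# Assumption: the header contains "#define NCCL_MAJOR <n>" style lines.
nccl_header_version() {
  awk '$1 == "#define" && $2 == "NCCL_MAJOR" { major = $3 }
       $1 == "#define" && $2 == "NCCL_MINOR" { minor = $3 }
       $1 == "#define" && $2 == "NCCL_PATCH" { patch = $3 }
       END {
         # Fail if the header did not contain a major version at all
         if (major == "") exit 1
         print major "." minor "." patch
       }' "$1"
}
```

Calling nccl_header_version "$NCCL_HOME/include/nccl.h" should report the same version as the archive name if the download and NCCL_HOME are consistent.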

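Installing Open MPI

The versions shipped by the default package managers of most distributions, such as CentOS, tend to be old, so it is recommended to download a recent release from the official site and build it from source. I tested with 4.1.8. (The latest release at the time was 5.0.8, but multi-node NCCL Tests hung with 5.0.8 and ran normally after rolling back to 4.1.8.)

Building and installing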

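# Download and extract
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.8.tar.gz
tar -zxf openmpi-4.1.8.tar.gz

# Build and install (writing to /usr/local typically requires root)
cd openmpi-4.1.8
./configure --prefix=/usr/local/openmpi-4.1.8
make && make install

# Set environment variables
export MPI_HOME=/usr/local/openmpi-4.1.8
export PATH="$MPI_HOME/bin:$PATH"
# Append the old value only if it is non-empty, to avoid a trailing ':' in the search path
export LD_LIBRARY_PATH="$MPI_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"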
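
Because an older distribution-packaged Open MPI may also be installed, it can be worth confirming that the mpirun found on PATH is the one just built. This is a minimal sketch; mpirun_under_prefix is a hypothetical helper name, and it only compares paths, it does not inspect the binary.

```shell
# Check that the first mpirun on PATH lives under <prefix>/bin.
mpirun_under_prefix() {
  local prefix="$1" found
  found="$(command -v mpirun)" || { echo "mpirun not found on PATH" >&2; return 1; }
  case "$found" in
    "$prefix"/bin/mpirun) echo "ok: $found" ;;
    *) echo "unexpected mpirun: $found" >&2; return 1 ;;
  esac
}
```

A typical call would be mpirun_under_prefix "$MPI_HOME"; a non-zero exit status suggests that PATH needs to be fixed before running the tests.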

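Testing

The machines under test need passwordless SSH login configured between each other; plenty of tutorials cover this, so it is not repeated here. Run the test command below, replacing the network interface (bond1) with the one used in your environment:

# Assume two machines under test, with IPs 10.1.1.1 and 10.1.1.2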
mpirun \
  -v \
  --allow-run-as-root \
  --prefix "$MPI_HOME" \
  --mca btl_tcp_if_include bond1 \
  --mca oob_tcp_if_include bond1 \
  -np 2 \
  --host 10.1.1.1:1,10.1.1.2:1 \
  bash -c 'echo "Hello from process $OMPI_COMM_WORLD_RANK of $OMPI_COMM_WORLD_SIZE on $(hostname)"'

A successful run prints:

Hello from process 0 of 2 on host1
Hello from process 1 of 2 on host2

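If the test hangs with no output, check that passwordless SSH and the network interface are configured correctly. strace can help with debugging (paste the context around the stuck syscall into GPT and ask about it):

strace -f -e trace=network,process,execve -- mpirun ... # the original mpirun command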
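
The passwordless SSH requirement can also be checked directly, without going through mpirun. The sketch below relies on the standard OpenSSH options BatchMode and ConnectTimeout so that ssh fails instead of prompting or waiting indefinitely. check_ssh_hosts and the SSH_BIN override are names invented for this example; SSH_BIN exists only so the function can be exercised without a real cluster.

```shell
# Try a non-interactive SSH login to each host given as an argument.
# BatchMode=yes makes ssh fail rather than ask for a password.
check_ssh_hosts() {
  local ssh_bin="${SSH_BIN:-ssh}" host rc=0
  for host in "$@"; do
    if "$ssh_bin" -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
      echo "$host: ok"
    else
      echo "$host: FAILED"
      rc=1
    fi
  done
  return "$rc"
}
```

For the two machines used in this article the call would be check_ssh_hosts 10.1.1.1 10.1.1.2, run from each node in turn, since mpirun needs the login to work from the node that launches the job.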

Installing NCCL Tests

NCCL Tests is NVIDIA's official tool for testing NCCL; a single command is enough to run a test.

Downloading and building

# Download NCCL Tests
git clone https://github.com/NVIDIA/nccl-tests.git

# Build with MPI support
cd nccl-tests
make MPI=1 MPI_HOME="$MPI_HOME" NCCL_HOME="$NCCL_HOME"
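
Before involving a second machine, a single-node run is a useful baseline. The following is a sketch assuming one process drives all local GPUs through the -g option, with a smaller maximum size (128M) chosen here only to keep the run short. run_single_node and the NCCL_TESTS_BIN override are hypothetical names used for illustration.

```shell
# Run all_reduce_perf on the local machine only, using <ngpus> GPUs in one process.
# NCCL_TESTS_BIN is an override hook for this example; by default the freshly built binary is used.
run_single_node() {
  local ngpus="${1:-8}"
  "${NCCL_TESTS_BIN:-./build/all_reduce_perf}" -b 8 -e 128M -f 2 -g "$ngpus"
}
```

Run it from the nccl-tests directory, for example run_single_node 8 on an 8-GPU machine. If this already fails or hangs, the problem is unlikely to be in the inter-node network.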

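Multi-node test

Use the command below as a reference (two nodes, 16 GPUs), replacing the network interface (bond1) and the NCCL_* variables to match your environment. If single-node tests pass but the multi-node test hangs, try a different Open MPI version first.

# Assume two machines under test, with IPs 10.1.1.1 and 10.1.1.2, each with 8 GPUs
# Each machine starts 8 test processes (-np 16 --host x:8,y:8), and each process uses 1 GPU (-g 1)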
mpirun \
  -v \
  --allow-run-as-root \
  --prefix "$MPI_HOME" \
  --mca btl_tcp_if_include bond1 \
  --mca oob_tcp_if_include bond1 \
  -x NCCL_SOCKET_IFNAME=bond1 \
  -x NCCL_IB_DISABLE=0 \
  -x NCCL_IB_GID_INDEX=3 \
  -x NCCL_IB_HCA=mlx5_ \
  -x NCCL_NET_GDR_LEVEL=2 \
  -x NCCL_DEBUG=INFO \
  -x LD_LIBRARY_PATH="$LD_LIBRARY_PATH" \
  -np 16 \
  --host 10.1.1.1:8,10.1.1.2:8 \
  ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -n 50
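
The values passed to NCCL_IB_HCA and NCCL_IB_GID_INDEX in the command above depend on the hardware and fabric; the GID index matters mainly on RoCE networks, and the value 3 used here will not be right everywhere. As a rough aid for choosing them, the sketch below lists the populated GID table entries of each RDMA device. It assumes the kernel exposes them under /sys/class/infiniband/<dev>/ports/<port>/gids and gid_attrs/types; list_gids is a helper name made up for this example, and its optional argument exists only so it can be pointed at a test directory.

```shell
# List non-empty GID table entries for every RDMA device under a sysfs root.
list_gids() {
  local root="${1:-/sys/class/infiniband}" f dev port idx gid type
  for f in "$root"/*/ports/*/gids/*; do
    [ -r "$f" ] || continue
    gid="$(cat "$f" 2>/dev/null)" || continue
    # Skip unused table slots, which read as an all-zero GID
    if [ "$gid" = "0000:0000:0000:0000:0000:0000:0000:0000" ]; then
      continue
    fi
    idx="${f##*/}"            # last path component is the GID index
    port="${f%/gids/*}"       # .../<dev>/ports/<port>
    dev="${port%/ports/*}"
    dev="${dev##*/}"
    port="${port##*/}"
    type="$(cat "$root/$dev/ports/$port/gid_attrs/types/$idx" 2>/dev/null || echo unknown)"
    echo "$dev port=$port index=$idx type=$type gid=$gid"
  done
}
```

The device names in the first column are candidates for NCCL_IB_HCA, and the index of the entry whose type and address match the intended network is a candidate for NCCL_IB_GID_INDEX. Where installed, tools such as ibv_devinfo report similar information.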

With IB enabled, the test results are as follows:

# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 50 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 466205 on  host1 device  0 [0000:03:00] NVIDIA H20
#  Rank  1 Group  0 Pid 466206 on  host1 device  1 [0000:16:00] NVIDIA H20
#  Rank  2 Group  0 Pid 466207 on  host1 device  2 [0000:1c:00] NVIDIA H20
#  Rank  3 Group  0 Pid 466208 on  host1 device  3 [0000:2e:00] NVIDIA H20
#  Rank  4 Group  0 Pid 466209 on  host1 device  4 [0000:84:00] NVIDIA H20
#  Rank  5 Group  0 Pid 466210 on  host1 device  5 [0000:9c:00] NVIDIA H20
#  Rank  6 Group  0 Pid 466211 on  host1 device  6 [0000:b6:00] NVIDIA H20
#  Rank  7 Group  0 Pid 466212 on  host1 device  7 [0000:bb:00] NVIDIA H20
#  Rank  8 Group  0 Pid 254776 on  host2 device  0 [0000:03:00] NVIDIA H20
#  Rank  9 Group  0 Pid 254777 on  host2 device  1 [0000:16:00] NVIDIA H20
#  Rank 10 Group  0 Pid 254778 on  host2 device  2 [0000:1c:00] NVIDIA H20
#  Rank 11 Group  0 Pid 254779 on  host2 device  3 [0000:2e:00] NVIDIA H20
#  Rank 12 Group  0 Pid 254780 on  host2 device  4 [0000:84:00] NVIDIA H20
#  Rank 13 Group  0 Pid 254781 on  host2 device  5 [0000:9c:00] NVIDIA H20
#  Rank 14 Group  0 Pid 254782 on  host2 device  6 [0000:b6:00] NVIDIA H20
#  Rank 15 Group  0 Pid 254783 on  host2 device  7 [0000:bb:00] NVIDIA H20
#
#                                                              out-of-place                       in-place      
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)   
           8             2     float     sum      -1    43.58    0.00    0.00      0    31.56    0.00    0.00      0
          16             4     float     sum      -1    33.34    0.00    0.00      0    31.35    0.00    0.00      0
          32             8     float     sum      -1    31.78    0.00    0.00      0    31.48    0.00    0.00      0
          64            16     float     sum      -1    31.91    0.00    0.00      0    31.77    0.00    0.00      0
         128            32     float     sum      -1    32.31    0.00    0.01      0    32.20    0.00    0.01      0
         256            64     float     sum      -1    47.91    0.01    0.01      0    32.67    0.01    0.01      0
         512           128     float     sum      -1    34.27    0.01    0.03      0    33.07    0.02    0.03      0
        1024           256     float     sum      -1    33.99    0.03    0.06      0    33.69    0.03    0.06      0
        2048           512     float     sum      -1    35.83    0.06    0.11      0    35.76    0.06    0.11      0
        4096          1024     float     sum      -1    37.71    0.11    0.20      0    38.12    0.11    0.20      0
        8192          2048     float     sum      -1    41.52    0.20    0.37      0    41.51    0.20    0.37      0
       16384          4096     float     sum      -1    41.25    0.40    0.74      0    39.98    0.41    0.77      0
       32768          8192     float     sum      -1    42.00    0.78    1.46      0    42.04    0.78    1.46      0
       65536         16384     float     sum      -1    43.99    1.49    2.79      0    43.43    1.51    2.83      0
      131072         32768     float     sum      -1    72.05    1.82    3.41      0    66.09    1.98    3.72      0
      262144         65536     float     sum      -1    76.37    3.43    6.44      0    71.50    3.67    6.87      0
      524288        131072     float     sum      -1    92.71    5.66   10.60      0    77.82    6.74   12.63      0
     1048576        262144     float     sum      -1    82.32   12.74   23.88      0    80.67   13.00   24.37      0
     2097152        524288     float     sum      -1    94.86   22.11   41.45      0    89.47   23.44   43.95      0
     4194304       1048576     float     sum      -1    110.7   37.89   71.05      0    109.5   38.29   71.79      0
     8388608       2097152     float     sum      -1    157.2   53.38  100.09      0    155.0   54.13  101.50      0
    16777216       4194304     float     sum      -1    202.2   82.96  155.55      0    204.1   82.21  154.14      0
    33554432       8388608     float     sum      -1    271.8  123.45  231.48      0    276.5  121.37  227.56      0
    67108864      16777216     float     sum      -1    469.7  142.87  267.89      0    493.3  136.05  255.09      0
   134217728      33554432     float     sum      -1    752.5  178.35  334.42      0    754.2  177.95  333.66      0
   268435456      67108864     float     sum      -1   1307.0  205.39  385.11      0   1293.1  207.59  389.24      0
   536870912     134217728     float     sum      -1   2340.5  229.38  430.09      0   2358.9  227.60  426.74      0
  1073741824     268435456     float     sum      -1   4417.6  243.06  455.74      0   4428.6  242.46  454.60      0
  2147483648     536870912     float     sum      -1   8572.4  250.51  469.71      0   8560.3  250.86  470.37      0
  4294967296    1073741824     float     sum      -1    16906  254.05  476.35      0    16857  254.79  477.73      0
  8589934592    2147483648     float     sum      -1    33561  255.95  479.91      0    33569  255.89  479.79      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 127.235 
#
# Collective test concluded: all_reduce_perf
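
To read the algbw and busbw columns, it helps to recompute one row by hand. In nccl-tests, algbw is the message size divided by the elapsed time, and for all_reduce busbw is algbw multiplied by 2(n-1)/n, where n is the number of ranks. The sketch below applies this to a single row; allreduce_bw is a helper name chosen for this example.

```shell
# Recompute algbw and busbw (both in GB/s) for one all_reduce row.
# Arguments: message size in bytes, time in microseconds, number of ranks.
allreduce_bw() {
  awk -v size="$1" -v us="$2" -v n="$3" 'BEGIN {
    algbw = size / us / 1000          # bytes per microsecond is MB/s; divide by 1000 for GB/s
    busbw = algbw * 2 * (n - 1) / n   # all_reduce correction factor
    printf "%.2f %.2f\n", algbw, busbw
  }'
}
```

For the last out-of-place row above, allreduce_bw 8589934592 33561 16 prints 255.95 479.91, matching the table.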

For comparison, with IB communication disabled (-x NCCL_IB_DISABLE=1), the results are as follows:

# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 50 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 460146 on  host1 device  0 [0000:03:00] NVIDIA H20
#  Rank  1 Group  0 Pid 460147 on  host1 device  1 [0000:16:00] NVIDIA H20
#  Rank  2 Group  0 Pid 460148 on  host1 device  2 [0000:1c:00] NVIDIA H20
#  Rank  3 Group  0 Pid 460149 on  host1 device  3 [0000:2e:00] NVIDIA H20
#  Rank  4 Group  0 Pid 460150 on  host1 device  4 [0000:84:00] NVIDIA H20
#  Rank  5 Group  0 Pid 460151 on  host1 device  5 [0000:9c:00] NVIDIA H20
#  Rank  6 Group  0 Pid 460152 on  host1 device  6 [0000:b6:00] NVIDIA H20
#  Rank  7 Group  0 Pid 460153 on  host1 device  7 [0000:bb:00] NVIDIA H20
#  Rank  8 Group  0 Pid 250661 on  host2 device  0 [0000:03:00] NVIDIA H20
#  Rank  9 Group  0 Pid 250662 on  host2 device  1 [0000:16:00] NVIDIA H20
#  Rank 10 Group  0 Pid 250663 on  host2 device  2 [0000:1c:00] NVIDIA H20
#  Rank 11 Group  0 Pid 250664 on  host2 device  3 [0000:2e:00] NVIDIA H20
#  Rank 12 Group  0 Pid 250665 on  host2 device  4 [0000:84:00] NVIDIA H20
#  Rank 13 Group  0 Pid 250666 on  host2 device  5 [0000:9c:00] NVIDIA H20
#  Rank 14 Group  0 Pid 250667 on  host2 device  6 [0000:b6:00] NVIDIA H20
#  Rank 15 Group  0 Pid 250668 on  host2 device  7 [0000:bb:00] NVIDIA H20
#
#                                                              out-of-place                       in-place  
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)   
           8             2     float     sum      -1    66.33    0.00    0.00      0    71.02    0.00    0.00      0
          16             4     float     sum      -1    72.15    0.00    0.00      0    65.33    0.00    0.00      0
          32             8     float     sum      -1    65.17    0.00    0.00      0    67.41    0.00    0.00      0
          64            16     float     sum      -1    67.93    0.00    0.00      0    69.15    0.00    0.00      0
         128            32     float     sum      -1    106.8    0.00    0.00      0    105.2    0.00    0.00      0
         256            64     float     sum      -1    122.9    0.00    0.00      0    110.0    0.00    0.00      0
         512           128     float     sum      -1    147.3    0.00    0.01      0    148.4    0.00    0.01      0
        1024           256     float     sum      -1    127.8    0.01    0.02      0    127.8    0.01    0.02      0
        2048           512     float     sum      -1    142.7    0.01    0.03      0    143.6    0.01    0.03      0
        4096          1024     float     sum      -1    150.0    0.03    0.05      0    184.8    0.02    0.04      0
        8192          2048     float     sum      -1    158.4    0.05    0.10      0    143.1    0.06    0.11      0
       16384          4096     float     sum      -1    181.4    0.09    0.17      0    209.2    0.08    0.15      0
       32768          8192     float     sum      -1    194.2    0.17    0.32      0    221.7    0.15    0.28      0
       65536         16384     float     sum      -1    236.4    0.28    0.52      0    237.4    0.28    0.52      0
      131072         32768     float     sum      -1    261.4    0.50    0.94      0    260.6    0.50    0.94      0
      262144         65536     float     sum      -1    664.0    0.39    0.74      0    661.0    0.40    0.74      0
      524288        131072     float     sum      -1   1004.3    0.52    0.98      0   1013.0    0.52    0.97      0
     1048576        262144     float     sum      -1    673.6    1.56    2.92      0    659.8    1.59    2.98      0
     2097152        524288     float     sum      -1    953.6    2.20    4.12      0    948.7    2.21    4.14      0
     4194304       1048576     float     sum      -1   1609.0    2.61    4.89      0   1639.6    2.56    4.80      0
     8388608       2097152     float     sum      -1   2919.6    2.87    5.39      0   2890.5    2.90    5.44      0
    16777216       4194304     float     sum      -1   5400.9    3.11    5.82      0   5419.8    3.10    5.80      0
    33554432       8388608     float     sum      -1    10051    3.34    6.26      0    10056    3.34    6.26      0
    67108864      16777216     float     sum      -1    19200    3.50    6.55      0    19300    3.48    6.52      0
   134217728      33554432     float     sum      -1    35827    3.75    7.02      0    35784    3.75    7.03      0
   268435456      67108864     float     sum      -1    63918    4.20    7.87      0    63684    4.22    7.90      0
   536870912     134217728     float     sum      -1   115283    4.66    8.73      0   115968    4.63    8.68      0
  1073741824     268435456     float     sum      -1   242936    4.42    8.29      0   240135    4.47    8.38      0
  2147483648     536870912     float     sum      -1   481477    4.46    8.36      0   482167    4.45    8.35      0
  4294967296    1073741824     float     sum      -1   936890    4.58    8.60      0   969305    4.43    8.31      0
  8589934592    2147483648     float     sum      -1  1790395    4.80    9.00      0  1782046    4.82    9.04      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.14748 
#
# Collective test concluded: all_reduce_perf
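
If both runs are saved to files, the summary lines can be compared with a few lines of shell. This is a sketch that assumes each log contains exactly one "# Avg bus bandwidth" line in the format shown above; avg_busbw and busbw_ratio are helper names invented for this example.

```shell
# Extract the "Avg bus bandwidth" value from a saved nccl-tests log.
avg_busbw() {
  awk -F':' '/^# Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}

# Print how many times higher the average bus bandwidth of the first log is than the second.
busbw_ratio() {
  awk -v a="$(avg_busbw "$1")" -v b="$(avg_busbw "$2")" 'BEGIN { printf "%.1f\n", a / b }'
}
```

With the two outputs above (127.235 and 3.14748), the ratio is about 40. The average includes the small, latency-bound messages; at the largest message size the gap is wider, 479.91 GB/s against 9.00 GB/s, roughly 53 times.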