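Communication speed is critical for multi-node, multi-GPU training and inference. Before training or serving, running NCCL Tests is a good way to confirm that the multi-node, multi-GPU environment and its communication paths work correctly.
Installing NCCL
NCCL can be downloaded directly from the NVIDIA website (registration and login required). The NCCL version must match the CUDA version: if you depend on CUDA 12.4, download the build listed in row 3 of the figure below.
We recommend installing from the "OS-agnostic" offline archive. You only need to extract it and set the NCCL_HOME and LD_LIBRARY_PATH environment variables. This is flexible, and on machines shared by several users it keeps one person's environment from affecting another's.
# Extract after downloading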
tar -xf nccl_2.27.7-1+cuda12.4_x86_64.txz
# Set environment variables (PATH is optional and usually not needed)
export NCCL_HOME="$PWD/nccl_2.27.7-1+cuda12.4_x86_64"
export PATH="$NCCL_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$LD_LIBRARY_PATH"
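To confirm that the archive really unpacked where NCCL_HOME points, a quick check along these lines may help. It is only a sketch: the function name check_nccl_home is made up for illustration, and it assumes the usual archive layout with lib/libnccl.so* and include/nccl.h (which defines the NCCL_MAJOR, NCCL_MINOR and NCCL_PATCH macros).

```shell
# Hypothetical helper: verify that a directory looks like an unpacked NCCL archive
# and print the version recorded in its header.
check_nccl_home() {
  local home="$1"
  # The shared library must exist for LD_LIBRARY_PATH to be of any use.
  if ! ls "$home"/lib/libnccl.so* >/dev/null 2>&1; then
    echo "libnccl.so not found under $home/lib" >&2
    return 1
  fi
  if [ ! -r "$home/include/nccl.h" ]; then
    echo "nccl.h not found under $home/include" >&2
    return 1
  fi
  # Read the version macros from the header and join them, e.g. "2.27.7".
  awk '$1 == "#define" && $2 == "NCCL_MAJOR" { major = $3 }
       $1 == "#define" && $2 == "NCCL_MINOR" { minor = $3 }
       $1 == "#define" && $2 == "NCCL_PATCH" { patch = $3 }
       END { print major "." minor "." patch }' "$home/include/nccl.h"
}

# Usage: check_nccl_home "$NCCL_HOME"
```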
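Installing Open MPI
The versions shipped by the default package managers of most distributions, such as CentOS, are fairly old, so we recommend downloading a recent release from the official site and building it manually. The author tested with 4.1.8. The latest release at the time was 5.0.8, but multi-node NCCL Tests hung with 5.0.8 and ran normally after rolling back to 4.1.8.
Build and install
# Download and extract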
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.8.tar.gz
tar -zxf openmpi-4.1.8.tar.gz
# Build and install
cd openmpi-4.1.8
./configure --prefix=/usr/local/openmpi-4.1.8
make && make install
# Set environment variables
export MPI_HOME=/usr/local/openmpi-4.1.8
export PATH="$MPI_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$MPI_HOME/lib:$LD_LIBRARY_PATH"
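Because a distribution package may also provide an older mpirun, it can be worth confirming that the freshly built one is the first found on PATH. A minimal sketch follows; the helper name check_mpirun is hypothetical.

```shell
# Hypothetical helper: confirm that the mpirun found on PATH lives under the given prefix.
check_mpirun() {
  local prefix="$1" found
  found="$(command -v mpirun)" || { echo "mpirun not found on PATH" >&2; return 1; }
  if [ "$found" != "$prefix/bin/mpirun" ]; then
    echo "mpirun resolves to $found, expected $prefix/bin/mpirun" >&2
    return 1
  fi
  echo "$found"
}

# Usage: check_mpirun "$MPI_HOME" && mpirun --version
```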
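Testing
The machines under test need passwordless SSH login to one another. Online tutorials cover this, so it is not repeated here. Run the test command below, replacing the communication NIC (bond1) to match your environment:
# Assume two machines under test, with IPs 10.1.1.1 and 10.1.1.2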
mpirun \
-v \
--allow-run-as-root \
--prefix "$MPI_HOME" \
--mca btl_tcp_if_include bond1 \
--mca oob_tcp_if_include bond1 \
-np 2 \
--host 10.1.1.1:1,10.1.1.2:1 \
bash -c 'echo "Hello from process $OMPI_COMM_WORLD_RANK of $OMPI_COMM_WORLD_SIZE on $(hostname)"'
A successful test prints:
Hello from process 0 of 2 on host1
Hello from process 1 of 2 on host2
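If the test hangs with no output, check that passwordless SSH and the communication NIC are configured correctly. strace can help with debugging (copy the context around the stalled syscall into GPT and ask about it):
strace -f -e trace=network,process,execve -- mpirun ... # the original mpirun command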
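Since the SSH setup is not covered here, one quick way to rule it out before reaching for strace is to attempt a non-interactive login to each host. The sketch below relies on ssh's BatchMode option, which makes ssh fail instead of prompting for a password. The helper name check_ssh and the 5-second timeout are illustrative choices.

```shell
# Hypothetical helper: report which hosts accept passwordless SSH.
check_ssh() {
  local host failed=0
  for host in "$@"; do
    # BatchMode=yes disables password prompts, so a missing key fails fast.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
      echo "ok:   $host"
    else
      echo "FAIL: $host"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (worth running on each machine, since the login must work in both directions):
# check_ssh 10.1.1.1 10.1.1.2
```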
Installing the NCCL Tests tool
NCCL Tests is NVIDIA's official NCCL test suite. A single command is enough to run a test.
Download and build
# Download NCCL Tests
git clone https://github.com/NVIDIA/nccl-tests.git
# Build
cd nccl-tests
make MPI=1 MPI_HOME="$MPI_HOME" NCCL_HOME="$NCCL_HOME"
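Before involving a second machine, a run restricted to the local node can help separate intra-node problems from network problems. The sketch below reuses the one-process-per-GPU style of the commands in this article. The helper name single_node_test, the default of 8 GPUs and the smaller 128M upper size are assumptions for illustration.

```shell
# Hypothetical helper: run all_reduce_perf on the local machine only,
# one process per GPU (-g 1), to rule out intra-node problems first.
single_node_test() {
  local ngpus="${1:-8}"   # 8 GPUs per machine, as in the setup described here
  mpirun --allow-run-as-root -np "$ngpus" \
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
}

# Usage, from the nccl-tests directory: single_node_test 8
```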
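Multi-node test
Use the test command below as a reference (2 nodes, 16 GPUs), replacing the communication NIC (bond1) and the NCCL_XXX variables to match your environment. If single-node tests pass but the multi-node test hangs, try a different Open MPI version first.
# Assume two machines under test, with IPs 10.1.1.1 and 10.1.1.2, each with 8 GPUs
# Start 8 test processes per machine (-np 16 --host x:8,y:8), each using 1 GPU (-g 1)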
mpirun \
-v \
--allow-run-as-root \
--prefix "$MPI_HOME" \
--mca btl_tcp_if_include bond1 \
--mca oob_tcp_if_include bond1 \
-x NCCL_SOCKET_IFNAME=bond1 \
-x NCCL_IB_DISABLE=0 \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_IB_HCA=mlx5_ \
-x NCCL_NET_GDR_LEVEL=2 \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH="$LD_LIBRARY_PATH" \
-np 16 \
--host 10.1.1.1:8,10.1.1.2:8 \
./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -n 50
The results are as follows:
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 50 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 466205 on host1 device 0 [0000:03:00] NVIDIA H20
# Rank 1 Group 0 Pid 466206 on host1 device 1 [0000:16:00] NVIDIA H20
# Rank 2 Group 0 Pid 466207 on host1 device 2 [0000:1c:00] NVIDIA H20
# Rank 3 Group 0 Pid 466208 on host1 device 3 [0000:2e:00] NVIDIA H20
# Rank 4 Group 0 Pid 466209 on host1 device 4 [0000:84:00] NVIDIA H20
# Rank 5 Group 0 Pid 466210 on host1 device 5 [0000:9c:00] NVIDIA H20
# Rank 6 Group 0 Pid 466211 on host1 device 6 [0000:b6:00] NVIDIA H20
# Rank 7 Group 0 Pid 466212 on host1 device 7 [0000:bb:00] NVIDIA H20
# Rank 8 Group 0 Pid 254776 on host2 device 0 [0000:03:00] NVIDIA H20
# Rank 9 Group 0 Pid 254777 on host2 device 1 [0000:16:00] NVIDIA H20
# Rank 10 Group 0 Pid 254778 on host2 device 2 [0000:1c:00] NVIDIA H20
# Rank 11 Group 0 Pid 254779 on host2 device 3 [0000:2e:00] NVIDIA H20
# Rank 12 Group 0 Pid 254780 on host2 device 4 [0000:84:00] NVIDIA H20
# Rank 13 Group 0 Pid 254781 on host2 device 5 [0000:9c:00] NVIDIA H20
# Rank 14 Group 0 Pid 254782 on host2 device 6 [0000:b6:00] NVIDIA H20
# Rank 15 Group 0 Pid 254783 on host2 device 7 [0000:bb:00] NVIDIA H20
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 43.58 0.00 0.00 0 31.56 0.00 0.00 0
16 4 float sum -1 33.34 0.00 0.00 0 31.35 0.00 0.00 0
32 8 float sum -1 31.78 0.00 0.00 0 31.48 0.00 0.00 0
64 16 float sum -1 31.91 0.00 0.00 0 31.77 0.00 0.00 0
128 32 float sum -1 32.31 0.00 0.01 0 32.20 0.00 0.01 0
256 64 float sum -1 47.91 0.01 0.01 0 32.67 0.01 0.01 0
512 128 float sum -1 34.27 0.01 0.03 0 33.07 0.02 0.03 0
1024 256 float sum -1 33.99 0.03 0.06 0 33.69 0.03 0.06 0
2048 512 float sum -1 35.83 0.06 0.11 0 35.76 0.06 0.11 0
4096 1024 float sum -1 37.71 0.11 0.20 0 38.12 0.11 0.20 0
8192 2048 float sum -1 41.52 0.20 0.37 0 41.51 0.20 0.37 0
16384 4096 float sum -1 41.25 0.40 0.74 0 39.98 0.41 0.77 0
32768 8192 float sum -1 42.00 0.78 1.46 0 42.04 0.78 1.46 0
65536 16384 float sum -1 43.99 1.49 2.79 0 43.43 1.51 2.83 0
131072 32768 float sum -1 72.05 1.82 3.41 0 66.09 1.98 3.72 0
262144 65536 float sum -1 76.37 3.43 6.44 0 71.50 3.67 6.87 0
524288 131072 float sum -1 92.71 5.66 10.60 0 77.82 6.74 12.63 0
1048576 262144 float sum -1 82.32 12.74 23.88 0 80.67 13.00 24.37 0
2097152 524288 float sum -1 94.86 22.11 41.45 0 89.47 23.44 43.95 0
4194304 1048576 float sum -1 110.7 37.89 71.05 0 109.5 38.29 71.79 0
8388608 2097152 float sum -1 157.2 53.38 100.09 0 155.0 54.13 101.50 0
16777216 4194304 float sum -1 202.2 82.96 155.55 0 204.1 82.21 154.14 0
33554432 8388608 float sum -1 271.8 123.45 231.48 0 276.5 121.37 227.56 0
67108864 16777216 float sum -1 469.7 142.87 267.89 0 493.3 136.05 255.09 0
134217728 33554432 float sum -1 752.5 178.35 334.42 0 754.2 177.95 333.66 0
268435456 67108864 float sum -1 1307.0 205.39 385.11 0 1293.1 207.59 389.24 0
536870912 134217728 float sum -1 2340.5 229.38 430.09 0 2358.9 227.60 426.74 0
1073741824 268435456 float sum -1 4417.6 243.06 455.74 0 4428.6 242.46 454.60 0
2147483648 536870912 float sum -1 8572.4 250.51 469.71 0 8560.3 250.86 470.37 0
4294967296 1073741824 float sum -1 16906 254.05 476.35 0 16857 254.79 477.73 0
8589934592 2147483648 float sum -1 33561 255.95 479.91 0 33569 255.89 479.79 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 127.235
#
# Collective test concluded: all_reduce_perf
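A note on reading the table: algbw is the message size divided by the time, and for all-reduce the nccl-tests performance notes define busbw = algbw × 2(n−1)/n, where n is the number of ranks (16 here). The last out-of-place row can be checked with a little awk. The function names below are made up for illustration, and other rows may differ in the last digit because the table's inputs are already rounded.

```shell
# algbw in GB/s from message size (bytes) and time (microseconds).
# bytes / us gives MB/s (10^6 bytes per second); dividing by 1000 gives GB/s.
algbw() {
  awk -v bytes="$1" -v us="$2" 'BEGIN { printf "%.2f\n", bytes / us / 1000 }'
}

# All-reduce busbw in GB/s from algbw and the number of ranks n.
busbw() {
  awk -v algbw="$1" -v n="$2" 'BEGIN { printf "%.2f\n", algbw * 2 * (n - 1) / n }'
}

algbw 8589934592 33561   # prints 255.95, matching the algbw column
busbw 255.95 16          # prints 479.91, matching the busbw column
```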
For comparison, with IB communication disabled (-x NCCL_IB_DISABLE=1), the results are as follows:
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 50 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 460146 on host1 device 0 [0000:03:00] NVIDIA H20
# Rank 1 Group 0 Pid 460147 on host1 device 1 [0000:16:00] NVIDIA H20
# Rank 2 Group 0 Pid 460148 on host1 device 2 [0000:1c:00] NVIDIA H20
# Rank 3 Group 0 Pid 460149 on host1 device 3 [0000:2e:00] NVIDIA H20
# Rank 4 Group 0 Pid 460150 on host1 device 4 [0000:84:00] NVIDIA H20
# Rank 5 Group 0 Pid 460151 on host1 device 5 [0000:9c:00] NVIDIA H20
# Rank 6 Group 0 Pid 460152 on host1 device 6 [0000:b6:00] NVIDIA H20
# Rank 7 Group 0 Pid 460153 on host1 device 7 [0000:bb:00] NVIDIA H20
# Rank 8 Group 0 Pid 250661 on host2 device 0 [0000:03:00] NVIDIA H20
# Rank 9 Group 0 Pid 250662 on host2 device 1 [0000:16:00] NVIDIA H20
# Rank 10 Group 0 Pid 250663 on host2 device 2 [0000:1c:00] NVIDIA H20
# Rank 11 Group 0 Pid 250664 on host2 device 3 [0000:2e:00] NVIDIA H20
# Rank 12 Group 0 Pid 250665 on host2 device 4 [0000:84:00] NVIDIA H20
# Rank 13 Group 0 Pid 250666 on host2 device 5 [0000:9c:00] NVIDIA H20
# Rank 14 Group 0 Pid 250667 on host2 device 6 [0000:b6:00] NVIDIA H20
# Rank 15 Group 0 Pid 250668 on host2 device 7 [0000:bb:00] NVIDIA H20
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 66.33 0.00 0.00 0 71.02 0.00 0.00 0
16 4 float sum -1 72.15 0.00 0.00 0 65.33 0.00 0.00 0
32 8 float sum -1 65.17 0.00 0.00 0 67.41 0.00 0.00 0
64 16 float sum -1 67.93 0.00 0.00 0 69.15 0.00 0.00 0
128 32 float sum -1 106.8 0.00 0.00 0 105.2 0.00 0.00 0
256 64 float sum -1 122.9 0.00 0.00 0 110.0 0.00 0.00 0
512 128 float sum -1 147.3 0.00 0.01 0 148.4 0.00 0.01 0
1024 256 float sum -1 127.8 0.01 0.02 0 127.8 0.01 0.02 0
2048 512 float sum -1 142.7 0.01 0.03 0 143.6 0.01 0.03 0
4096 1024 float sum -1 150.0 0.03 0.05 0 184.8 0.02 0.04 0
8192 2048 float sum -1 158.4 0.05 0.10 0 143.1 0.06 0.11 0
16384 4096 float sum -1 181.4 0.09 0.17 0 209.2 0.08 0.15 0
32768 8192 float sum -1 194.2 0.17 0.32 0 221.7 0.15 0.28 0
65536 16384 float sum -1 236.4 0.28 0.52 0 237.4 0.28 0.52 0
131072 32768 float sum -1 261.4 0.50 0.94 0 260.6 0.50 0.94 0
262144 65536 float sum -1 664.0 0.39 0.74 0 661.0 0.40 0.74 0
524288 131072 float sum -1 1004.3 0.52 0.98 0 1013.0 0.52 0.97 0
1048576 262144 float sum -1 673.6 1.56 2.92 0 659.8 1.59 2.98 0
2097152 524288 float sum -1 953.6 2.20 4.12 0 948.7 2.21 4.14 0
4194304 1048576 float sum -1 1609.0 2.61 4.89 0 1639.6 2.56 4.80 0
8388608 2097152 float sum -1 2919.6 2.87 5.39 0 2890.5 2.90 5.44 0
16777216 4194304 float sum -1 5400.9 3.11 5.82 0 5419.8 3.10 5.80 0
33554432 8388608 float sum -1 10051 3.34 6.26 0 10056 3.34 6.26 0
67108864 16777216 float sum -1 19200 3.50 6.55 0 19300 3.48 6.52 0
134217728 33554432 float sum -1 35827 3.75 7.02 0 35784 3.75 7.03 0
268435456 67108864 float sum -1 63918 4.20 7.87 0 63684 4.22 7.90 0
536870912 134217728 float sum -1 115283 4.66 8.73 0 115968 4.63 8.68 0
1073741824 268435456 float sum -1 242936 4.42 8.29 0 240135 4.47 8.38 0
2147483648 536870912 float sum -1 481477 4.46 8.36 0 482167 4.45 8.35 0
4294967296 1073741824 float sum -1 936890 4.58 8.60 0 969305 4.43 8.31 0
8589934592 2147483648 float sum -1 1790395 4.80 9.00 0 1782046 4.82 9.04 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 3.14748
#
# Collective test concluded: all_reduce_perf
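Putting the two runs side by side, the average bus bandwidth drops from 127.235 GB/s with IB enabled to 3.14748 GB/s with it disabled. A small sketch for pulling that summary line out of saved output and computing the ratio follows. The helper names and the log file names are assumptions, since the commands above do not save their output to files.

```shell
# Hypothetical helper: extract the "Avg bus bandwidth" value from saved nccl-tests output.
avg_busbw() {
  awk -F':' '/Avg bus bandwidth/ { gsub(/ /, "", $2); print $2 }' "$1"
}

# Ratio between two averages, to one decimal place.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Usage with hypothetical log files:
# ratio "$(avg_busbw ib.log)" "$(avg_busbw tcp.log)"
ratio 127.235 3.14748   # prints 40.4
```

On this pair of machines, then, the IB path delivers roughly 40 times the average bus bandwidth of the run with IB disabled.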