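Communication speed is critical for multi-node, multi-GPU training and inference. Before starting a training or inference job, you can run NCCL Tests to confirm that the multi-node, multi-GPU environment and its communication paths work correctly.
Installing NCCL
NCCL can be downloaded directly from the NVIDIA website (registration and login required). The NCCL version must match the CUDA version: if you depend on CUDA 12.4, download the build listed in the third row of the figure below.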

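The recommended approach is to install from the offline "OS-agnostic" archive: extract it and set the NCCL_HOME and LD_LIBRARY_PATH environment variables. This is flexible, and on machines shared by several users it keeps one user's setup from affecting another's.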

# Extract the archive after downloading
tar -xf nccl_2.27.7-1+cuda12.4_x86_64.txz
# Set environment variables (PATH is optional and usually not needed)
export NCCL_HOME="$PWD/nccl_2.27.7-1+cuda12.4_x86_64"
export PATH="$NCCL_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$LD_LIBRARY_PATH"
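As a quick sanity check after extracting, the sketch below verifies that the directory NCCL_HOME points to contains the header and shared library that later builds link against. The function name check_nccl_home is hypothetical, and the include/nccl.h and lib/libnccl.so* layout is an assumption about the OS-agnostic archive; adjust the paths if your archive is organized differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm an extracted NCCL directory looks usable.
# Assumes the layout <dir>/include/nccl.h and <dir>/lib/libnccl.so*.
check_nccl_home() {
    local dir="$1" lib
    if [ ! -f "$dir/include/nccl.h" ]; then
        echo "missing header: $dir/include/nccl.h" >&2
        return 1
    fi
    # Accept libnccl.so or a versioned name such as libnccl.so.2
    for lib in "$dir"/lib/libnccl.so*; do
        if [ -e "$lib" ]; then
            echo "NCCL files found in $dir"
            return 0
        fi
    done
    echo "missing library under: $dir/lib" >&2
    return 1
}
```

Calling check_nccl_home "$NCCL_HOME" right after the export lines above should print a single confirmation line; a non-zero exit status usually means NCCL_HOME points at the wrong directory.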
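Installing Open MPI
The versions shipped by the default package managers of most distributions, such as CentOS, tend to be old, so downloading a recent release from the official website and building it by hand is recommended. The version tested here is 4.1.8 (the latest release at the time of testing was 5.0.8, but multi-node NCCL Tests hung with 5.0.8 and worked again after falling back to 4.1.8).
Build and install
# Download and extract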
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.8.tar.gz
tar -zxf openmpi-4.1.8.tar.gz
# Build and install
cd openmpi-4.1.8
./configure --prefix=/usr/local/openmpi-4.1.8 --enable-orterun-prefix-by-default
make && make install
# Set environment variables
export MPI_HOME=/usr/local/openmpi-4.1.8
export PATH="$MPI_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$MPI_HOME/lib:$LD_LIBRARY_PATH"
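Testing
The machines under test need passwordless SSH login configured between each other; online tutorials cover this, so it is not repeated here. Run the test command below, replacing the network interface (bond1) with the one used in your environment:
# Assume two machines under test, with IPs 10.1.1.1 and 10.1.1.2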
mpirun \
-v \
--allow-run-as-root \
--prefix "$MPI_HOME" \
--mca btl_tcp_if_include bond1 \
--mca oob_tcp_if_include bond1 \
-np 2 \
--host 10.1.1.1:1,10.1.1.2:1 \
bash -c 'echo "Hello from process $OMPI_COMM_WORLD_RANK of $OMPI_COMM_WORLD_SIZE on $(hostname)"'
A successful run prints:
Hello from process 0 of 2 on host1
Hello from process 1 of 2 on host2
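If the test hangs without producing output, check that passwordless SSH and the network interface are configured correctly. strace can help with debugging (pasting the context around the blocking syscall into an LLM such as GPT is usually enough to get a pointer):
strace -f -e trace=network,process,execve -- mpirun ... # the original mpirun command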
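Before reaching for strace, it can be worth confirming that non-interactive SSH works for every host. The following is a minimal sketch, not part of the original procedure: check_ssh_hosts and the SSH_BIN override are hypothetical names introduced here, while BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
#!/usr/bin/env bash
# Hypothetical preflight: verify non-interactive SSH to every test host.
# SSH_BIN is an assumption added so the check can use another ssh binary.
SSH_BIN="${SSH_BIN:-ssh}"

check_ssh_hosts() {
    local host failed=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if "$SSH_BIN" -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
            echo "ok: $host"
        else
            echo "FAILED: $host" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

Running check_ssh_hosts 10.1.1.1 10.1.1.2 on each machine under test should print one "ok" line per host; any "FAILED" line points at a host that still asks for a password or is unreachable.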
Installing the NCCL Tests tool
NCCL Tests is NVIDIA's official tool for testing NCCL; a single command is enough to run a test.
Download and build
# Download NCCL Tests
git clone https://github.com/NVIDIA/nccl-tests.git
# Build
cd nccl-tests
make MPI=1 MPI_HOME="$MPI_HOME" NCCL_HOME="$NCCL_HOME"
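Multi-node test
Use the command below as a reference (2 nodes, 16 GPUs), replacing the network interface (bond1) and the NCCL_XXX variables to match your environment. If single-node tests pass but multi-node tests hang, try a different Open MPI version first.
# Assume two machines under test, with IPs 10.1.1.1 and 10.1.1.2, each with 8 GPUs
# Start 8 test processes per machine (-np 16 --host x:8,y:8), each using 1 GPU (-g 1)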
mpirun \
-v \
--allow-run-as-root \
--prefix "$MPI_HOME" \
--mca pml ob1 \
--mca btl ^openib \
--mca btl_tcp_if_include bond1 \
--mca oob_tcp_if_include bond1 \
-x FI_PROVIDER="^psm3" \
-x NCCL_SOCKET_IFNAME=bond1 \
-x NCCL_IB_DISABLE=0 \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_IB_HCA=mlx5_ \
-x NCCL_NET_GDR_LEVEL=2 \
-x NCCL_DEBUG=INFO \
-x LD_LIBRARY_PATH="$LD_LIBRARY_PATH" \
-np 16 \
--host 10.1.1.1:8,10.1.1.2:8 \
./build/all_reduce_perf -b 8 -e 32G -f 2 -g 1 -n 50
The output is as follows:
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 50 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 466205 on host1 device 0 [0000:03:00] NVIDIA H20
# Rank 1 Group 0 Pid 466206 on host1 device 1 [0000:16:00] NVIDIA H20
# Rank 2 Group 0 Pid 466207 on host1 device 2 [0000:1c:00] NVIDIA H20
# Rank 3 Group 0 Pid 466208 on host1 device 3 [0000:2e:00] NVIDIA H20
# Rank 4 Group 0 Pid 466209 on host1 device 4 [0000:84:00] NVIDIA H20
# Rank 5 Group 0 Pid 466210 on host1 device 5 [0000:9c:00] NVIDIA H20
# Rank 6 Group 0 Pid 466211 on host1 device 6 [0000:b6:00] NVIDIA H20
# Rank 7 Group 0 Pid 466212 on host1 device 7 [0000:bb:00] NVIDIA H20
# Rank 8 Group 0 Pid 254776 on host2 device 0 [0000:03:00] NVIDIA H20
# Rank 9 Group 0 Pid 254777 on host2 device 1 [0000:16:00] NVIDIA H20
# Rank 10 Group 0 Pid 254778 on host2 device 2 [0000:1c:00] NVIDIA H20
# Rank 11 Group 0 Pid 254779 on host2 device 3 [0000:2e:00] NVIDIA H20
# Rank 12 Group 0 Pid 254780 on host2 device 4 [0000:84:00] NVIDIA H20
# Rank 13 Group 0 Pid 254781 on host2 device 5 [0000:9c:00] NVIDIA H20
# Rank 14 Group 0 Pid 254782 on host2 device 6 [0000:b6:00] NVIDIA H20
# Rank 15 Group 0 Pid 254783 on host2 device 7 [0000:bb:00] NVIDIA H20
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 43.58 0.00 0.00 0 31.56 0.00 0.00 0
16 4 float sum -1 33.34 0.00 0.00 0 31.35 0.00 0.00 0
32 8 float sum -1 31.78 0.00 0.00 0 31.48 0.00 0.00 0
64 16 float sum -1 31.91 0.00 0.00 0 31.77 0.00 0.00 0
128 32 float sum -1 32.31 0.00 0.01 0 32.20 0.00 0.01 0
256 64 float sum -1 47.91 0.01 0.01 0 32.67 0.01 0.01 0
512 128 float sum -1 34.27 0.01 0.03 0 33.07 0.02 0.03 0
1024 256 float sum -1 33.99 0.03 0.06 0 33.69 0.03 0.06 0
2048 512 float sum -1 35.83 0.06 0.11 0 35.76 0.06 0.11 0
4096 1024 float sum -1 37.71 0.11 0.20 0 38.12 0.11 0.20 0
8192 2048 float sum -1 41.52 0.20 0.37 0 41.51 0.20 0.37 0
16384 4096 float sum -1 41.25 0.40 0.74 0 39.98 0.41 0.77 0
32768 8192 float sum -1 42.00 0.78 1.46 0 42.04 0.78 1.46 0
65536 16384 float sum -1 43.99 1.49 2.79 0 43.43 1.51 2.83 0
131072 32768 float sum -1 72.05 1.82 3.41 0 66.09 1.98 3.72 0
262144 65536 float sum -1 76.37 3.43 6.44 0 71.50 3.67 6.87 0
524288 131072 float sum -1 92.71 5.66 10.60 0 77.82 6.74 12.63 0
1048576 262144 float sum -1 82.32 12.74 23.88 0 80.67 13.00 24.37 0
2097152 524288 float sum -1 94.86 22.11 41.45 0 89.47 23.44 43.95 0
4194304 1048576 float sum -1 110.7 37.89 71.05 0 109.5 38.29 71.79 0
8388608 2097152 float sum -1 157.2 53.38 100.09 0 155.0 54.13 101.50 0
16777216 4194304 float sum -1 202.2 82.96 155.55 0 204.1 82.21 154.14 0
33554432 8388608 float sum -1 271.8 123.45 231.48 0 276.5 121.37 227.56 0
67108864 16777216 float sum -1 469.7 142.87 267.89 0 493.3 136.05 255.09 0
134217728 33554432 float sum -1 752.5 178.35 334.42 0 754.2 177.95 333.66 0
268435456 67108864 float sum -1 1307.0 205.39 385.11 0 1293.1 207.59 389.24 0
536870912 134217728 float sum -1 2340.5 229.38 430.09 0 2358.9 227.60 426.74 0
1073741824 268435456 float sum -1 4417.6 243.06 455.74 0 4428.6 242.46 454.60 0
2147483648 536870912 float sum -1 8572.4 250.51 469.71 0 8560.3 250.86 470.37 0
4294967296 1073741824 float sum -1 16906 254.05 476.35 0 16857 254.79 477.73 0
8589934592 2147483648 float sum -1 33561 255.95 479.91 0 33569 255.89 479.79 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 127.235
#
# Collective test concluded: all_reduce_perf
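The busbw column in the table above is derived from algbw. For all-reduce, the NCCL Tests performance notes give busbw = algbw × 2(n−1)/n, where n is the number of ranks. The sketch below reproduces that calculation with awk; the function name allreduce_busbw is hypothetical and only the all-reduce factor is covered.

```shell
#!/usr/bin/env bash
# Hypothetical helper: all-reduce bus bandwidth from algorithm bandwidth.
# Applies the all-reduce factor 2*(n-1)/n, where n is the number of ranks.
allreduce_busbw() {
    local algbw="$1" nranks="$2"
    awk -v a="$algbw" -v n="$nranks" \
        'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
}

# Last row of the 16-rank run above: algbw 255.95 GB/s
allreduce_busbw 255.95 16   # prints 479.91
```

With 16 ranks the factor is 1.875, which is why busbw is a little under twice algbw in every row.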
For comparison, with InfiniBand (IB) communication disabled (-x NCCL_IB_DISABLE=1), the output is as follows:
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 50 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 460146 on host1 device 0 [0000:03:00] NVIDIA H20
# Rank 1 Group 0 Pid 460147 on host1 device 1 [0000:16:00] NVIDIA H20
# Rank 2 Group 0 Pid 460148 on host1 device 2 [0000:1c:00] NVIDIA H20
# Rank 3 Group 0 Pid 460149 on host1 device 3 [0000:2e:00] NVIDIA H20
# Rank 4 Group 0 Pid 460150 on host1 device 4 [0000:84:00] NVIDIA H20
# Rank 5 Group 0 Pid 460151 on host1 device 5 [0000:9c:00] NVIDIA H20
# Rank 6 Group 0 Pid 460152 on host1 device 6 [0000:b6:00] NVIDIA H20
# Rank 7 Group 0 Pid 460153 on host1 device 7 [0000:bb:00] NVIDIA H20
# Rank 8 Group 0 Pid 250661 on host2 device 0 [0000:03:00] NVIDIA H20
# Rank 9 Group 0 Pid 250662 on host2 device 1 [0000:16:00] NVIDIA H20
# Rank 10 Group 0 Pid 250663 on host2 device 2 [0000:1c:00] NVIDIA H20
# Rank 11 Group 0 Pid 250664 on host2 device 3 [0000:2e:00] NVIDIA H20
# Rank 12 Group 0 Pid 250665 on host2 device 4 [0000:84:00] NVIDIA H20
# Rank 13 Group 0 Pid 250666 on host2 device 5 [0000:9c:00] NVIDIA H20
# Rank 14 Group 0 Pid 250667 on host2 device 6 [0000:b6:00] NVIDIA H20
# Rank 15 Group 0 Pid 250668 on host2 device 7 [0000:bb:00] NVIDIA H20
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 66.33 0.00 0.00 0 71.02 0.00 0.00 0
16 4 float sum -1 72.15 0.00 0.00 0 65.33 0.00 0.00 0
32 8 float sum -1 65.17 0.00 0.00 0 67.41 0.00 0.00 0
64 16 float sum -1 67.93 0.00 0.00 0 69.15 0.00 0.00 0
128 32 float sum -1 106.8 0.00 0.00 0 105.2 0.00 0.00 0
256 64 float sum -1 122.9 0.00 0.00 0 110.0 0.00 0.00 0
512 128 float sum -1 147.3 0.00 0.01 0 148.4 0.00 0.01 0
1024 256 float sum -1 127.8 0.01 0.02 0 127.8 0.01 0.02 0
2048 512 float sum -1 142.7 0.01 0.03 0 143.6 0.01 0.03 0
4096 1024 float sum -1 150.0 0.03 0.05 0 184.8 0.02 0.04 0
8192 2048 float sum -1 158.4 0.05 0.10 0 143.1 0.06 0.11 0
16384 4096 float sum -1 181.4 0.09 0.17 0 209.2 0.08 0.15 0
32768 8192 float sum -1 194.2 0.17 0.32 0 221.7 0.15 0.28 0
65536 16384 float sum -1 236.4 0.28 0.52 0 237.4 0.28 0.52 0
131072 32768 float sum -1 261.4 0.50 0.94 0 260.6 0.50 0.94 0
262144 65536 float sum -1 664.0 0.39 0.74 0 661.0 0.40 0.74 0
524288 131072 float sum -1 1004.3 0.52 0.98 0 1013.0 0.52 0.97 0
1048576 262144 float sum -1 673.6 1.56 2.92 0 659.8 1.59 2.98 0
2097152 524288 float sum -1 953.6 2.20 4.12 0 948.7 2.21 4.14 0
4194304 1048576 float sum -1 1609.0 2.61 4.89 0 1639.6 2.56 4.80 0
8388608 2097152 float sum -1 2919.6 2.87 5.39 0 2890.5 2.90 5.44 0
16777216 4194304 float sum -1 5400.9 3.11 5.82 0 5419.8 3.10 5.80 0
33554432 8388608 float sum -1 10051 3.34 6.26 0 10056 3.34 6.26 0
67108864 16777216 float sum -1 19200 3.50 6.55 0 19300 3.48 6.52 0
134217728 33554432 float sum -1 35827 3.75 7.02 0 35784 3.75 7.03 0
268435456 67108864 float sum -1 63918 4.20 7.87 0 63684 4.22 7.90 0
536870912 134217728 float sum -1 115283 4.66 8.73 0 115968 4.63 8.68 0
1073741824 268435456 float sum -1 242936 4.42 8.29 0 240135 4.47 8.38 0
2147483648 536870912 float sum -1 481477 4.46 8.36 0 482167 4.45 8.35 0
4294967296 1073741824 float sum -1 936890 4.58 8.60 0 969305 4.43 8.31 0
8589934592 2147483648 float sum -1 1790395 4.80 9.00 0 1782046 4.82 9.04 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 3.14748
#
# Collective test concluded: all_reduce_perf
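When the same mpirun command is scaled to more nodes, only the -np and --host arguments change. A minimal sketch for generating them from a host list follows; build_host_arg and total_ranks are hypothetical helper names, and the sketch assumes every node has the same number of GPUs.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: derive mpirun's --host value and -np count from a
# GPUs-per-node number followed by a list of node addresses.
build_host_arg() {
    local gpus_per_node="$1"; shift
    local host list=""
    for host in "$@"; do
        # Append "host:slots", comma-separated after the first entry.
        list="${list:+$list,}$host:$gpus_per_node"
    done
    echo "$list"
}

total_ranks() {
    local gpus_per_node="$1"; shift
    echo $(( gpus_per_node * $# ))
}

# Two 8-GPU nodes, as in the multi-node command above
build_host_arg 8 10.1.1.1 10.1.1.2   # prints 10.1.1.1:8,10.1.1.2:8
total_ranks 8 10.1.1.1 10.1.1.2      # prints 16
```

The two values can then be substituted into the command, for example -np "$(total_ranks 8 "${HOSTS[@]}")" --host "$(build_host_arg 8 "${HOSTS[@]}")", where HOSTS is an array you define yourself.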
Test results
Test command: all_reduce_perf -b 8 -e 32G -f 2 -g 1 -n 50
8 GPUs × 2 nodes = 16 GPUs
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 47.81 0.00 0.00 0 33.66 0.00 0.00 0
16 4 float sum -1 33.66 0.00 0.00 0 34.94 0.00 0.00 0
32 8 float sum -1 33.87 0.00 0.00 0 33.90 0.00 0.00 0
64 16 float sum -1 34.20 0.00 0.00 0 34.07 0.00 0.00 0
128 32 float sum -1 34.56 0.00 0.01 0 34.47 0.00 0.01 0
256 64 float sum -1 48.72 0.01 0.01 0 34.87 0.01 0.01 0
512 128 float sum -1 37.57 0.01 0.03 0 35.44 0.01 0.03 0
1024 256 float sum -1 37.54 0.03 0.05 0 36.09 0.03 0.05 0
2048 512 float sum -1 44.73 0.05 0.09 0 37.92 0.05 0.10 0
4096 1024 float sum -1 41.03 0.10 0.19 0 41.49 0.10 0.19 0
8192 2048 float sum -1 41.92 0.20 0.37 0 42.84 0.19 0.36 0
16384 4096 float sum -1 42.85 0.38 0.72 0 41.66 0.39 0.74 0
32768 8192 float sum -1 42.62 0.77 1.44 0 44.69 0.73 1.37 0
65536 16384 float sum -1 44.70 1.47 2.75 0 43.29 1.51 2.84 0
131072 32768 float sum -1 65.38 2.00 3.76 0 62.48 2.10 3.93 0
262144 65536 float sum -1 71.63 3.66 6.86 0 67.78 3.87 7.25 0
524288 131072 float sum -1 91.63 5.72 10.73 0 81.45 6.44 12.07 0
1048576 262144 float sum -1 89.24 11.75 22.03 0 87.16 12.03 22.56 0
2097152 524288 float sum -1 90.43 23.19 43.48 0 93.56 22.42 42.03 0
4194304 1048576 float sum -1 114.5 36.64 68.70 0 122.8 34.15 64.04 0
8388608 2097152 float sum -1 157.7 53.20 99.76 0 157.0 53.45 100.21 0
16777216 4194304 float sum -1 209.9 79.94 149.88 0 203.6 82.42 154.54 0
33554432 8388608 float sum -1 277.1 121.10 227.07 0 278.6 120.44 225.83 0
67108864 16777216 float sum -1 474.3 141.50 265.32 0 481.6 139.36 261.30 0
134217728 33554432 float sum -1 759.4 176.75 331.41 0 761.5 176.26 330.48 0
268435456 67108864 float sum -1 1305.4 205.64 385.57 0 1307.9 205.23 384.81 0
536870912 134217728 float sum -1 2345.8 228.87 429.13 0 2346.2 228.83 429.05 0
1073741824 268435456 float sum -1 4407.4 243.62 456.79 0 4413.8 243.27 456.13 0
2147483648 536870912 float sum -1 8519.2 252.08 472.64 0 8520.7 252.03 472.56 0
4294967296 1073741824 float sum -1 16750 256.41 480.78 0 16758 256.30 480.56 0
8589934592 2147483648 float sum -1 33238 258.44 484.57 0 33233 258.47 484.64 0
17179869184 4294967296 float sum -1 66258 259.29 486.16 0 66241 259.35 486.29 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 138.348
8 GPUs × 3 nodes = 24 GPUs
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 58.18 0.00 0.00 0 50.66 0.00 0.00 0
16 4 float sum -1 50.31 0.00 0.00 0 50.36 0.00 0.00 0
32 8 float sum -1 50.78 0.00 0.00 0 50.74 0.00 0.00 0
64 16 float sum -1 51.31 0.00 0.00 0 51.28 0.00 0.00 0
128 32 float sum -1 53.22 0.00 0.00 0 52.38 0.00 0.00 0
256 64 float sum -1 66.36 0.00 0.01 0 53.08 0.00 0.01 0
512 128 float sum -1 58.45 0.01 0.02 0 53.74 0.01 0.02 0
1024 256 float sum -1 55.23 0.02 0.04 0 55.12 0.02 0.04 0
2048 512 float sum -1 57.81 0.04 0.07 0 57.63 0.04 0.07 0
4096 1024 float sum -1 60.83 0.07 0.13 0 60.60 0.07 0.13 0
8192 2048 float sum -1 66.97 0.12 0.23 0 66.25 0.12 0.24 0
16384 4096 float sum -1 67.94 0.24 0.46 0 66.97 0.24 0.47 0
32768 8192 float sum -1 68.14 0.48 0.92 0 67.56 0.49 0.93 0
65536 16384 float sum -1 68.15 0.96 1.84 0 67.55 0.97 1.86 0
131072 32768 float sum -1 84.55 1.55 2.97 0 84.64 1.55 2.97 0
262144 65536 float sum -1 225.0 1.17 2.23 0 135.0 1.94 3.72 0
524288 131072 float sum -1 139.1 3.77 7.22 0 140.4 3.74 7.16 0
1048576 262144 float sum -1 140.1 7.48 14.34 0 142.3 7.37 14.12 0
2097152 524288 float sum -1 150.1 13.97 26.77 0 147.1 14.26 27.33 0
4194304 1048576 float sum -1 169.5 24.74 47.42 0 172.9 24.26 46.50 0
8388608 2097152 float sum -1 227.8 36.82 70.58 0 225.4 37.21 71.33 0
16777216 4194304 float sum -1 281.2 59.67 114.36 0 281.9 59.52 114.08 0
33554432 8388608 float sum -1 391.1 85.80 164.45 0 388.9 86.28 165.37 0
67108864 16777216 float sum -1 659.9 101.69 194.91 0 660.1 101.67 194.87 0
134217728 33554432 float sum -1 1294.0 103.72 198.80 0 1283.5 104.57 200.43 0
268435456 67108864 float sum -1 2636.9 101.80 195.12 0 2628.6 102.12 195.73 0
536870912 134217728 float sum -1 5264.3 101.98 195.47 0 5169.7 103.85 199.04 0
1073741824 268435456 float sum -1 10279 104.46 200.22 0 10283 104.42 200.14 0
2147483648 536870912 float sum -1 19358 110.94 212.63 0 19444 110.45 211.69 0
4294967296 1073741824 float sum -1 38604 111.26 213.24 0 38587 111.31 213.34 0
8589934592 2147483648 float sum -1 76478 112.32 215.28 0 76501 112.29 215.21 0
17179869184 4294967296 float sum -1 153032 112.26 215.17 0 152877 112.38 215.39 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 71.8299
8 GPUs × 4 nodes = 32 GPUs
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 82.36 0.00 0.00 0 90.30 0.00 0.00 0
16 4 float sum -1 51.65 0.00 0.00 0 50.85 0.00 0.00 0
32 8 float sum -1 52.48 0.00 0.00 0 50.93 0.00 0.00 0
64 16 float sum -1 51.57 0.00 0.00 0 51.28 0.00 0.00 0
128 32 float sum -1 51.87 0.00 0.00 0 51.92 0.00 0.00 0
256 64 float sum -1 66.55 0.00 0.01 0 52.95 0.00 0.01 0
512 128 float sum -1 57.73 0.01 0.02 0 53.17 0.01 0.02 0
1024 256 float sum -1 54.62 0.02 0.04 0 54.34 0.02 0.04 0
2048 512 float sum -1 57.70 0.04 0.07 0 57.63 0.04 0.07 0
4096 1024 float sum -1 61.01 0.07 0.13 0 61.61 0.07 0.13 0
8192 2048 float sum -1 69.46 0.12 0.23 0 68.15 0.12 0.23 0
16384 4096 float sum -1 70.04 0.23 0.45 0 67.51 0.24 0.47 0
32768 8192 float sum -1 70.88 0.46 0.90 0 71.45 0.46 0.89 0
65536 16384 float sum -1 69.86 0.94 1.82 0 67.91 0.96 1.87 0
131072 32768 float sum -1 109.4 1.20 2.32 0 88.77 1.48 2.86 0
262144 65536 float sum -1 111.0 2.36 4.58 0 105.7 2.48 4.81 0
524288 131072 float sum -1 107.7 4.87 9.43 0 105.1 4.99 9.66 0
1048576 262144 float sum -1 258.7 4.05 7.85 0 163.3 6.42 12.44 0
2097152 524288 float sum -1 165.5 12.68 24.56 0 162.0 12.95 25.09 0
4194304 1048576 float sum -1 190.2 22.05 42.72 0 194.9 21.52 41.70 0
8388608 2097152 float sum -1 256.5 32.71 63.38 0 262.8 31.92 61.84 0
16777216 4194304 float sum -1 310.2 54.09 104.80 0 336.7 49.82 96.53 0
33554432 8388608 float sum -1 778.9 43.08 83.47 0 421.3 79.64 154.31 0
67108864 16777216 float sum -1 661.3 101.48 196.63 0 660.2 101.65 196.95 0
134217728 33554432 float sum -1 1299.4 103.29 200.13 0 1311.4 102.35 198.30 0
268435456 67108864 float sum -1 2675.8 100.32 194.37 0 2702.6 99.33 192.44 0
536870912 134217728 float sum -1 5070.6 105.88 205.14 0 5134.5 104.56 202.59 0
1073741824 268435456 float sum -1 9667.0 111.07 215.20 0 9674.4 110.99 215.04 0
2147483648 536870912 float sum -1 19314 111.19 215.43 0 19306 111.24 215.52 0
4294967296 1073741824 float sum -1 38352 111.99 216.97 0 38462 111.67 216.35 0
8589934592 2147483648 float sum -1 77082 111.44 215.91 0 77157 111.33 215.70 0
17179869184 4294967296 float sum -1 153558 111.88 216.77 0 154232 111.39 215.82 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 70.3908
8 GPUs × 6 nodes = 48 GPUs
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 89.42 0.00 0.00 0 95.72 0.00 0.00 0
16 4 float sum -1 68.24 0.00 0.00 0 69.17 0.00 0.00 0
32 8 float sum -1 69.06 0.00 0.00 0 68.48 0.00 0.00 0
64 16 float sum -1 69.11 0.00 0.00 0 68.82 0.00 0.00 0
128 32 float sum -1 69.23 0.00 0.00 0 69.87 0.00 0.00 0
256 64 float sum -1 85.12 0.00 0.01 0 70.90 0.00 0.01 0
512 128 float sum -1 76.13 0.01 0.01 0 71.36 0.01 0.01 0
1024 256 float sum -1 72.96 0.01 0.03 0 73.21 0.01 0.03 0
2048 512 float sum -1 77.26 0.03 0.05 0 77.37 0.03 0.05 0
4096 1024 float sum -1 82.41 0.05 0.10 0 82.67 0.05 0.10 0
8192 2048 float sum -1 91.21 0.09 0.18 0 91.36 0.09 0.18 0
16384 4096 float sum -1 91.03 0.18 0.35 0 89.97 0.18 0.36 0
32768 8192 float sum -1 92.10 0.36 0.70 0 91.97 0.36 0.70 0
65536 16384 float sum -1 92.95 0.71 1.38 0 90.82 0.72 1.41 0
131072 32768 float sum -1 123.4 1.06 2.08 0 114.1 1.15 2.25 0
262144 65536 float sum -1 142.6 1.84 3.60 0 129.0 2.03 3.98 0
524288 131072 float sum -1 131.8 3.98 7.79 0 130.6 4.01 7.86 0
1048576 262144 float sum -1 333.7 3.14 6.15 0 205.7 5.10 9.98 0
2097152 524288 float sum -1 215.6 9.72 19.04 0 215.7 9.72 19.04 0
4194304 1048576 float sum -1 248.1 16.90 33.11 0 247.0 16.98 33.25 0
8388608 2097152 float sum -1 323.1 25.96 50.84 0 327.8 25.59 50.11 0
16777216 4194304 float sum -1 391.9 42.81 83.84 0 395.1 42.47 83.17 0
33554432 8388608 float sum -1 500.3 67.07 131.35 0 509.0 65.93 129.10 0
67108864 16777216 float sum -1 1110.0 60.46 118.40 0 758.7 88.46 173.23 0
134217728 33554432 float sum -1 1472.9 91.12 178.45 0 1459.9 91.94 180.05 0
268435456 67108864 float sum -1 2773.6 96.78 189.53 0 2777.4 96.65 189.27 0
536870912 134217728 float sum -1 6015.0 89.26 174.79 0 6069.4 88.46 173.23 0
1073741824 268435456 float sum -1 11307 94.97 185.98 0 11355 94.56 185.18 0
2147483648 536870912 float sum -1 22736 94.45 184.97 0 22737 94.45 184.96 0
4294967296 1073741824 float sum -1 44570 96.36 188.71 0 44592 96.32 188.62 0
8589934592 2147483648 float sum -1 88928 96.59 189.16 0 88866 96.66 189.30 0
17179869184 4294967296 float sum -1 176888 97.12 190.20 0 176773 97.19 190.32 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 61.5088
8 GPUs × 8 nodes = 64 GPUs
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 112.8 0.00 0.00 0 71.52 0.00 0.00 0
16 4 float sum -1 85.27 0.00 0.00 0 105.0 0.00 0.00 0
32 8 float sum -1 72.34 0.00 0.00 0 70.80 0.00 0.00 0
64 16 float sum -1 72.54 0.00 0.00 0 74.74 0.00 0.00 0
128 32 float sum -1 73.15 0.00 0.00 0 72.09 0.00 0.00 0
256 64 float sum -1 88.33 0.00 0.01 0 75.13 0.00 0.01 0
512 128 float sum -1 78.82 0.01 0.01 0 77.59 0.01 0.01 0
1024 256 float sum -1 76.75 0.01 0.03 0 76.48 0.01 0.03 0
2048 512 float sum -1 81.94 0.02 0.05 0 82.43 0.02 0.05 0
4096 1024 float sum -1 86.95 0.05 0.09 0 84.84 0.05 0.10 0
8192 2048 float sum -1 91.47 0.09 0.18 0 91.45 0.09 0.18 0
16384 4096 float sum -1 91.38 0.18 0.35 0 90.78 0.18 0.36 0
32768 8192 float sum -1 92.38 0.35 0.70 0 90.76 0.36 0.71 0
65536 16384 float sum -1 94.18 0.70 1.37 0 90.96 0.72 1.42 0
131072 32768 float sum -1 104.5 1.25 2.47 0 105.9 1.24 2.44 0
262144 65536 float sum -1 121.0 2.17 4.26 0 120.8 2.17 4.27 0
524288 131072 float sum -1 126.3 4.15 8.17 0 127.4 4.11 8.10 0
1048576 262144 float sum -1 136.8 7.66 15.09 0 137.4 7.63 15.02 0
2097152 524288 float sum -1 331.5 6.33 12.46 0 220.2 9.52 18.75 0
4194304 1048576 float sum -1 248.7 16.86 33.20 0 251.0 16.71 32.90 0
8388608 2097152 float sum -1 322.0 26.05 51.28 0 336.2 24.95 49.12 0
16777216 4194304 float sum -1 382.8 43.82 86.28 0 395.1 42.47 83.61 0
33554432 8388608 float sum -1 519.7 64.56 127.10 0 535.6 62.64 123.33 0
67108864 16777216 float sum -1 817.0 82.14 161.71 0 855.0 78.49 154.53 0
134217728 33554432 float sum -1 1766.9 75.96 149.55 0 1549.4 86.62 170.54 0
268435456 67108864 float sum -1 2856.1 93.99 185.04 0 2846.1 94.32 185.69 0
536870912 134217728 float sum -1 6037.2 88.93 175.08 0 5964.4 90.01 177.21 0
1073741824 268435456 float sum -1 11352 94.58 186.21 0 11316 94.89 186.81 0
2147483648 536870912 float sum -1 22493 95.47 187.96 0 22326 96.19 189.37 0
4294967296 1073741824 float sum -1 44340 96.86 190.70 0 44363 96.81 190.60 0
8589934592 2147483648 float sum -1 88655 96.89 190.75 0 88763 96.77 190.52 0
17179869184 4294967296 float sum -1 177385 96.85 190.67 0 177137 96.99 190.94 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 61.5215
8 GPUs × 16 nodes = 128 GPUs
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 96.01 0.00 0.00 0 83.47 0.00 0.00 0
16 4 float sum -1 83.85 0.00 0.00 0 85.04 0.00 0.00 0
32 8 float sum -1 86.48 0.00 0.00 0 85.99 0.00 0.00 0
64 16 float sum -1 85.68 0.00 0.00 0 86.09 0.00 0.00 0
128 32 float sum -1 94.56 0.00 0.00 0 85.30 0.00 0.00 0
256 64 float sum -1 99.09 0.00 0.01 0 88.01 0.00 0.01 0
512 128 float sum -1 92.27 0.01 0.01 0 87.80 0.01 0.01 0
1024 256 float sum -1 89.44 0.01 0.02 0 89.88 0.01 0.02 0
2048 512 float sum -1 96.80 0.02 0.04 0 94.19 0.02 0.04 0
4096 1024 float sum -1 100.3 0.04 0.08 0 99.64 0.04 0.08 0
8192 2048 float sum -1 107.3 0.08 0.15 0 105.7 0.08 0.15 0
16384 4096 float sum -1 107.7 0.15 0.30 0 109.9 0.15 0.30 0
32768 8192 float sum -1 109.3 0.30 0.59 0 107.0 0.31 0.61 0
65536 16384 float sum -1 113.0 0.58 1.15 0 117.6 0.56 1.11 0
131072 32768 float sum -1 145.2 0.90 1.79 0 132.8 0.99 1.96 0
262144 65536 float sum -1 154.0 1.70 3.38 0 148.4 1.77 3.51 0
524288 131072 float sum -1 154.1 3.40 6.75 0 153.9 3.41 6.76 0
1048576 262144 float sum -1 165.0 6.36 12.61 0 167.6 6.26 12.41 0
2097152 524288 float sum -1 372.5 5.63 11.17 0 272.0 7.71 15.30 0
4194304 1048576 float sum -1 310.1 13.53 26.84 0 312.7 13.41 26.61 0
8388608 2097152 float sum -1 398.2 21.07 41.81 0 402.4 20.85 41.37 0
16777216 4194304 float sum -1 488.8 34.32 68.11 0 471.2 35.61 70.65 0
33554432 8388608 float sum -1 604.3 55.53 110.19 0 632.0 53.09 105.35 0
67108864 16777216 float sum -1 943.3 71.14 141.18 0 982.9 68.28 135.49 0
134217728 33554432 float sum -1 1587.5 84.54 167.77 0 1601.4 83.81 166.32 0
268435456 67108864 float sum -1 3130.8 85.74 170.14 0 2956.4 90.80 180.18 0
536870912 134217728 float sum -1 5802.4 92.53 183.61 0 5643.7 95.13 188.77 0
1073741824 268435456 float sum -1 11184 96.00 190.51 0 11268 95.29 189.10 0
2147483648 536870912 float sum -1 23058 93.13 184.81 0 23122 92.88 184.30 0
4294967296 1073741824 float sum -1 44618 96.26 191.02 0 44649 96.19 190.88 0
8589934592 2147483648 float sum -1 89769 95.69 189.88 0 89610 95.86 190.22 0
17179869184 4294967296 float sum -1 179747 95.58 189.66 0 179740 95.58 189.67 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 59.2936
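To compare runs like the ones above without reading every row, the peak out-of-place busbw can be pulled out of a saved log. The following is a rough sketch under stated assumptions: peak_busbw is a hypothetical name, and the parser assumes the 13-column data rows shown above, with out-of-place busbw in column 8 and comment lines starting with "#".

```shell
#!/usr/bin/env bash
# Hypothetical helper: largest out-of-place busbw in an nccl-tests log.
# Assumes 13-column data rows (column 8 = out-of-place busbw) and skips
# lines that start with "#". Reads the named files, or stdin if none.
peak_busbw() {
    awk '!/^[[:space:]]*#/ && NF == 13 { if ($8 + 0 > max) max = $8 + 0 }
         END { printf "%.2f\n", max }' "$@"
}
```

For example, after saving a run with mpirun ... | tee run.log (run.log is an arbitrary file name), peak_busbw run.log prints a single number in GB/s.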