Debugging NCCL + IB on Multi-Node, Multi-GPU Systems

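Communication speed is critical for multi-node, multi-GPU training and inference. Before running either workload, you can run NCCL Tests to confirm that the multi-node environment and the communication between GPUs work properly.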

Installing NCCL

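NCCL can be downloaded directly from the NVIDIA website (registration and login required). The NCCL version must match the CUDA version; if you depend on CUDA 12.4, download the entry in row 3 of the figure below.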

image-iSSu.png

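Installing from the offline "OS-agnostic" archive is recommended: just extract it and set the NCCL_HOME and LD_LIBRARY_PATH environment variables. This is flexible, and on machines shared by several users it keeps people from interfering with each other's environments.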

image-QaHL.png

# Extract after downloading
tar -xf nccl_2.27.7-1+cuda12.4_x86_64.txz

# Set environment variables (PATH is optional and usually not needed)
export NCCL_HOME="$PWD/nccl_2.27.7-1+cuda12.4_x86_64"
export PATH="$NCCL_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$NCCL_HOME/lib:$LD_LIBRARY_PATH"
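
Before moving on, it can be worth confirming that the extracted directory has the layout the later build step expects. The sketch below assumes the archive unpacks into include/ (containing nccl.h) and lib/ (containing libnccl.so*); the function name check_nccl_home is made up for this example.

```shell
# Hypothetical helper: verify that a directory looks like an unpacked NCCL archive.
# Assumption: the archive layout is include/nccl.h plus lib/libnccl.so*.
check_nccl_home() {
  local dir="$1" lib
  if [ ! -f "$dir/include/nccl.h" ]; then
    echo "missing header: $dir/include/nccl.h" >&2
    return 1
  fi
  # Accept libnccl.so, libnccl.so.2, libnccl.so.2.x.y, ...
  for lib in "$dir"/lib/libnccl.so*; do
    if [ -e "$lib" ]; then
      echo "NCCL found: $lib"
      return 0
    fi
  done
  echo "missing library: $dir/lib/libnccl.so*" >&2
  return 1
}
```

It would be called as check_nccl_home "$NCCL_HOME" after the exports above; a non-zero exit status suggests that NCCL_HOME points at the wrong directory.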

Installing Open MPI

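The versions shipped by the default package managers of most distributions, such as CentOS, are fairly old, so it is better to download a recent release from the official website and build it by hand. The author tested version 4.1.8. The latest release at the time of testing was 5.0.8, but multi-node NCCL Tests hung with 5.0.8 and ran normally after falling back to 4.1.8.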

Build and install

# Download and extract
wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-4.1.8.tar.gz
tar -zxf openmpi-4.1.8.tar.gz

# Build and install
cd openmpi-4.1.8
./configure --prefix=/usr/local/openmpi-4.1.8 --enable-orterun-prefix-by-default
make && make install

# Set environment variables
export MPI_HOME=/usr/local/openmpi-4.1.8
export PATH="$MPI_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$MPI_HOME/lib:$LD_LIBRARY_PATH"
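
Since the Open MPI series turned out to matter in the author's test, a quick check of which mpirun is actually first on PATH may save time later. This is only a sketch: it assumes the first line of `mpirun --version` has the form "mpirun (Open MPI) 4.1.8", and the function names ompi_major and warn_if_ompi5 are hypothetical.

```shell
# Hypothetical helper: extract the major version from the first line of
# `mpirun --version`, assumed to look like "mpirun (Open MPI) 4.1.8".
ompi_major() {
  printf '%s\n' "$1" | sed -n '1s/.*(Open MPI) \([0-9][0-9]*\)\..*/\1/p'
}

# Warn when the major version is 5, the series that hung in the author's test.
warn_if_ompi5() {
  if [ "$(ompi_major "$1")" = "5" ]; then
    echo "warning: Open MPI 5.x detected; 4.1.8 was the version that worked here"
  else
    echo "ok"
  fi
}
```

A typical call would be warn_if_ompi5 "$(mpirun --version)", run on every node so that a stray system-wide installation is noticed early.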

Testing

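The machines under test need passwordless SSH login configured between each other; tutorials for this are easy to find online, so it is not covered here. Run the test command below, replacing the communication interface (bond1) to match your setup: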

# Assume two machines under test, with IPs 10.1.1.1 and 10.1.1.2
mpirun \
  -v \
  --allow-run-as-root \
  --prefix "$MPI_HOME" \
  --mca btl_tcp_if_include bond1 \
  --mca oob_tcp_if_include bond1 \
  -np 2 \
  --host 10.1.1.1:1,10.1.1.2:1 \
  bash -c 'echo "Hello from process $OMPI_COMM_WORLD_RANK of $OMPI_COMM_WORLD_SIZE on $(hostname)"'

A successful run prints:

Hello from process 0 of 2 on host1
Hello from process 1 of 2 on host2

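If the test hangs with no output, check that passwordless SSH login and the communication interface are configured correctly. strace can help with debugging (for example, copy the context around the stuck syscall into GPT and ask about it):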

strace -f -e trace=network,process,execve -- mpirun ... # the original mpirun command
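
One way to rule out the SSH side before reaching for strace is to probe each host non-interactively. The sketch below relies on the OpenSSH option BatchMode=yes, which makes ssh fail instead of prompting for a password; the function name check_ssh_hosts and the SSH_PROBE override variable are hypothetical and only exist in this example.

```shell
# Hypothetical helper: report which hosts accept non-interactive SSH logins.
# BatchMode=yes makes ssh fail instead of prompting for a password.
# SSH_PROBE can be overridden (for example with a stub) to try the logic offline.
SSH_PROBE="${SSH_PROBE:-ssh -o BatchMode=yes -o ConnectTimeout=5}"

check_ssh_hosts() {
  local failed=0 host
  for host in "$@"; do
    # Intentionally unquoted so that SSH_PROBE splits into command + options.
    if $SSH_PROBE "$host" true </dev/null >/dev/null 2>&1; then
      echo "$host: ok"
    else
      echo "$host: FAILED"
      failed=1
    fi
  done
  return "$failed"
}
```

Running check_ssh_hosts 10.1.1.1 10.1.1.2 from each machine in turn shows whether every pair can log in without a password. Any FAILED line points at SSH rather than at the network interface settings.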

Installing the NCCL Tests Tool

NCCL Tests is NVIDIA's official NCCL testing tool; a single command is enough to run a test.

Download and build

# Download NCCL Tests
git clone https://github.com/NVIDIA/nccl-tests.git

# Build
cd nccl-tests
make MPI=1 MPI_HOME="$MPI_HOME" NCCL_HOME="$NCCL_HOME"

Multi-node test

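Use the test command below as a reference (2 nodes, 16 GPUs), replacing the communication interface (bond1) and the NCCL_XXX variables to match your setup. If the single-node test works but the multi-node test hangs, try a different Open MPI version first.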

# Assume two machines under test, with IPs 10.1.1.1 and 10.1.1.2, each with 8 GPUs
# Launch 8 test processes per machine (-np 16 --host x:8,y:8), each using 1 GPU (-g 1)
mpirun \
  -v \
  --allow-run-as-root \
  --prefix "$MPI_HOME" \
  --mca pml ob1 \
  --mca btl ^openib \
  --mca btl_tcp_if_include bond1 \
  --mca oob_tcp_if_include bond1 \
  -x FI_PROVIDER="^psm3" \
  -x NCCL_SOCKET_IFNAME=bond1 \
  -x NCCL_IB_DISABLE=0 \
  -x NCCL_IB_GID_INDEX=3 \
  -x NCCL_IB_HCA=mlx5_ \
  -x NCCL_NET_GDR_LEVEL=2 \
  -x NCCL_DEBUG=INFO \
  -x LD_LIBRARY_PATH="$LD_LIBRARY_PATH" \
  -np 16 \
  --host 10.1.1.1:8,10.1.1.2:8 \
  ./build/all_reduce_perf -b 8 -e 32G -f 2 -g 1 -n 50

The test results are as follows:

# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 50 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 466205 on  host1 device  0 [0000:03:00] NVIDIA H20
#  Rank  1 Group  0 Pid 466206 on  host1 device  1 [0000:16:00] NVIDIA H20
#  Rank  2 Group  0 Pid 466207 on  host1 device  2 [0000:1c:00] NVIDIA H20
#  Rank  3 Group  0 Pid 466208 on  host1 device  3 [0000:2e:00] NVIDIA H20
#  Rank  4 Group  0 Pid 466209 on  host1 device  4 [0000:84:00] NVIDIA H20
#  Rank  5 Group  0 Pid 466210 on  host1 device  5 [0000:9c:00] NVIDIA H20
#  Rank  6 Group  0 Pid 466211 on  host1 device  6 [0000:b6:00] NVIDIA H20
#  Rank  7 Group  0 Pid 466212 on  host1 device  7 [0000:bb:00] NVIDIA H20
#  Rank  8 Group  0 Pid 254776 on  host2 device  0 [0000:03:00] NVIDIA H20
#  Rank  9 Group  0 Pid 254777 on  host2 device  1 [0000:16:00] NVIDIA H20
#  Rank 10 Group  0 Pid 254778 on  host2 device  2 [0000:1c:00] NVIDIA H20
#  Rank 11 Group  0 Pid 254779 on  host2 device  3 [0000:2e:00] NVIDIA H20
#  Rank 12 Group  0 Pid 254780 on  host2 device  4 [0000:84:00] NVIDIA H20
#  Rank 13 Group  0 Pid 254781 on  host2 device  5 [0000:9c:00] NVIDIA H20
#  Rank 14 Group  0 Pid 254782 on  host2 device  6 [0000:b6:00] NVIDIA H20
#  Rank 15 Group  0 Pid 254783 on  host2 device  7 [0000:bb:00] NVIDIA H20
#
#                                                              out-of-place                       in-place  
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)   
           8             2     float     sum      -1    43.58    0.00    0.00      0    31.56    0.00    0.00      0
          16             4     float     sum      -1    33.34    0.00    0.00      0    31.35    0.00    0.00      0
          32             8     float     sum      -1    31.78    0.00    0.00      0    31.48    0.00    0.00      0
          64            16     float     sum      -1    31.91    0.00    0.00      0    31.77    0.00    0.00      0
         128            32     float     sum      -1    32.31    0.00    0.01      0    32.20    0.00    0.01      0
         256            64     float     sum      -1    47.91    0.01    0.01      0    32.67    0.01    0.01      0
         512           128     float     sum      -1    34.27    0.01    0.03      0    33.07    0.02    0.03      0
        1024           256     float     sum      -1    33.99    0.03    0.06      0    33.69    0.03    0.06      0
        2048           512     float     sum      -1    35.83    0.06    0.11      0    35.76    0.06    0.11      0
        4096          1024     float     sum      -1    37.71    0.11    0.20      0    38.12    0.11    0.20      0
        8192          2048     float     sum      -1    41.52    0.20    0.37      0    41.51    0.20    0.37      0
       16384          4096     float     sum      -1    41.25    0.40    0.74      0    39.98    0.41    0.77      0
       32768          8192     float     sum      -1    42.00    0.78    1.46      0    42.04    0.78    1.46      0
       65536         16384     float     sum      -1    43.99    1.49    2.79      0    43.43    1.51    2.83      0
      131072         32768     float     sum      -1    72.05    1.82    3.41      0    66.09    1.98    3.72      0
      262144         65536     float     sum      -1    76.37    3.43    6.44      0    71.50    3.67    6.87      0
      524288        131072     float     sum      -1    92.71    5.66   10.60      0    77.82    6.74   12.63      0
     1048576        262144     float     sum      -1    82.32   12.74   23.88      0    80.67   13.00   24.37      0
     2097152        524288     float     sum      -1    94.86   22.11   41.45      0    89.47   23.44   43.95      0
     4194304       1048576     float     sum      -1    110.7   37.89   71.05      0    109.5   38.29   71.79      0
     8388608       2097152     float     sum      -1    157.2   53.38  100.09      0    155.0   54.13  101.50      0
    16777216       4194304     float     sum      -1    202.2   82.96  155.55      0    204.1   82.21  154.14      0
    33554432       8388608     float     sum      -1    271.8  123.45  231.48      0    276.5  121.37  227.56      0
    67108864      16777216     float     sum      -1    469.7  142.87  267.89      0    493.3  136.05  255.09      0
   134217728      33554432     float     sum      -1    752.5  178.35  334.42      0    754.2  177.95  333.66      0
   268435456      67108864     float     sum      -1   1307.0  205.39  385.11      0   1293.1  207.59  389.24      0
   536870912     134217728     float     sum      -1   2340.5  229.38  430.09      0   2358.9  227.60  426.74      0
  1073741824     268435456     float     sum      -1   4417.6  243.06  455.74      0   4428.6  242.46  454.60      0
  2147483648     536870912     float     sum      -1   8572.4  250.51  469.71      0   8560.3  250.86  470.37      0
  4294967296    1073741824     float     sum      -1    16906  254.05  476.35      0    16857  254.79  477.73      0
  8589934592    2147483648     float     sum      -1    33561  255.95  479.91      0    33569  255.89  479.79      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 127.235 
#
# Collective test concluded: all_reduce_perf
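
When reading this table, it may help to know how the two bandwidth columns relate. In nccl-tests, algbw is the message size divided by the elapsed time (in units of 10^9 bytes per second), and for all-reduce busbw is algbw multiplied by 2(n-1)/n, where n is the number of ranks. The sketch below recomputes both for the last row (n = 16); the function names are made up for this example.

```shell
# Algorithm bandwidth in GB/s: bytes / microseconds / 1000.
allreduce_algbw() {
  awk -v bytes="$1" -v us="$2" 'BEGIN { printf "%.2f\n", bytes / us / 1000 }'
}

# All-reduce bus bandwidth: busbw = algbw * 2 * (n - 1) / n, n = number of ranks.
allreduce_busbw() {
  awk -v algbw="$1" -v n="$2" 'BEGIN { printf "%.2f\n", algbw * 2 * (n - 1) / n }'
}

# Last out-of-place row above: 8589934592 bytes in 33561 us across 16 ranks.
# allreduce_algbw 8589934592 33561   -> 255.95
# allreduce_busbw 255.95 16          -> 479.91
```

Because of the 2(n-1)/n factor, busbw is the figure to compare against link speed across different rank counts, while algbw shrinks naturally as more ranks join.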

For comparison, with IB communication disabled (-x NCCL_IB_DISABLE=1), the test results are as follows:

# Collective test starting: all_reduce_perf
# nThread 1 nGpus 1 minBytes 8 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 50 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 460146 on  host1 device  0 [0000:03:00] NVIDIA H20
#  Rank  1 Group  0 Pid 460147 on  host1 device  1 [0000:16:00] NVIDIA H20
#  Rank  2 Group  0 Pid 460148 on  host1 device  2 [0000:1c:00] NVIDIA H20
#  Rank  3 Group  0 Pid 460149 on  host1 device  3 [0000:2e:00] NVIDIA H20
#  Rank  4 Group  0 Pid 460150 on  host1 device  4 [0000:84:00] NVIDIA H20
#  Rank  5 Group  0 Pid 460151 on  host1 device  5 [0000:9c:00] NVIDIA H20
#  Rank  6 Group  0 Pid 460152 on  host1 device  6 [0000:b6:00] NVIDIA H20
#  Rank  7 Group  0 Pid 460153 on  host1 device  7 [0000:bb:00] NVIDIA H20
#  Rank  8 Group  0 Pid 250661 on  host2 device  0 [0000:03:00] NVIDIA H20
#  Rank  9 Group  0 Pid 250662 on  host2 device  1 [0000:16:00] NVIDIA H20
#  Rank 10 Group  0 Pid 250663 on  host2 device  2 [0000:1c:00] NVIDIA H20
#  Rank 11 Group  0 Pid 250664 on  host2 device  3 [0000:2e:00] NVIDIA H20
#  Rank 12 Group  0 Pid 250665 on  host2 device  4 [0000:84:00] NVIDIA H20
#  Rank 13 Group  0 Pid 250666 on  host2 device  5 [0000:9c:00] NVIDIA H20
#  Rank 14 Group  0 Pid 250667 on  host2 device  6 [0000:b6:00] NVIDIA H20
#  Rank 15 Group  0 Pid 250668 on  host2 device  7 [0000:bb:00] NVIDIA H20
#
#                                                              out-of-place                       in-place  
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)   
           8             2     float     sum      -1    66.33    0.00    0.00      0    71.02    0.00    0.00      0
          16             4     float     sum      -1    72.15    0.00    0.00      0    65.33    0.00    0.00      0
          32             8     float     sum      -1    65.17    0.00    0.00      0    67.41    0.00    0.00      0
          64            16     float     sum      -1    67.93    0.00    0.00      0    69.15    0.00    0.00      0
         128            32     float     sum      -1    106.8    0.00    0.00      0    105.2    0.00    0.00      0
         256            64     float     sum      -1    122.9    0.00    0.00      0    110.0    0.00    0.00      0
         512           128     float     sum      -1    147.3    0.00    0.01      0    148.4    0.00    0.01      0
        1024           256     float     sum      -1    127.8    0.01    0.02      0    127.8    0.01    0.02      0
        2048           512     float     sum      -1    142.7    0.01    0.03      0    143.6    0.01    0.03      0
        4096          1024     float     sum      -1    150.0    0.03    0.05      0    184.8    0.02    0.04      0
        8192          2048     float     sum      -1    158.4    0.05    0.10      0    143.1    0.06    0.11      0
       16384          4096     float     sum      -1    181.4    0.09    0.17      0    209.2    0.08    0.15      0
       32768          8192     float     sum      -1    194.2    0.17    0.32      0    221.7    0.15    0.28      0
       65536         16384     float     sum      -1    236.4    0.28    0.52      0    237.4    0.28    0.52      0
      131072         32768     float     sum      -1    261.4    0.50    0.94      0    260.6    0.50    0.94      0
      262144         65536     float     sum      -1    664.0    0.39    0.74      0    661.0    0.40    0.74      0
      524288        131072     float     sum      -1   1004.3    0.52    0.98      0   1013.0    0.52    0.97      0
     1048576        262144     float     sum      -1    673.6    1.56    2.92      0    659.8    1.59    2.98      0
     2097152        524288     float     sum      -1    953.6    2.20    4.12      0    948.7    2.21    4.14      0
     4194304       1048576     float     sum      -1   1609.0    2.61    4.89      0   1639.6    2.56    4.80      0
     8388608       2097152     float     sum      -1   2919.6    2.87    5.39      0   2890.5    2.90    5.44      0
    16777216       4194304     float     sum      -1   5400.9    3.11    5.82      0   5419.8    3.10    5.80      0
    33554432       8388608     float     sum      -1    10051    3.34    6.26      0    10056    3.34    6.26      0
    67108864      16777216     float     sum      -1    19200    3.50    6.55      0    19300    3.48    6.52      0
   134217728      33554432     float     sum      -1    35827    3.75    7.02      0    35784    3.75    7.03      0
   268435456      67108864     float     sum      -1    63918    4.20    7.87      0    63684    4.22    7.90      0
   536870912     134217728     float     sum      -1   115283    4.66    8.73      0   115968    4.63    8.68      0
  1073741824     268435456     float     sum      -1   242936    4.42    8.29      0   240135    4.47    8.38      0
  2147483648     536870912     float     sum      -1   481477    4.46    8.36      0   482167    4.45    8.35      0
  4294967296    1073741824     float     sum      -1   936890    4.58    8.60      0   969305    4.43    8.31      0
  8589934592    2147483648     float     sum      -1  1790395    4.80    9.00      0  1782046    4.82    9.04      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.14748 
#
# Collective test concluded: all_reduce_perf
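
To compare two runs like these without reading the tables line by line, the summary line can be pulled out of saved logs. This is a minimal sketch that assumes each run's output was redirected to a file; the function names and any log file names are hypothetical.

```shell
# Hypothetical helper: print the "Avg bus bandwidth" value from a saved nccl-tests log.
avg_busbw() {
  awk -F: '/^# Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}

# Ratio of two averages, for example IB enabled vs. IB disabled.
busbw_ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# With the two averages reported above:
# busbw_ratio 127.235 3.14748   -> 40.4
```

For the two runs above this gives an average bus bandwidth roughly 40 times higher with IB enabled, which is a simple indication that traffic is really going over IB rather than falling back to TCP sockets.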

Test Results

Test command: all_reduce_perf -b 8 -e 32G -f 2 -g 1 -n 50
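
The runs in this section reuse the same mpirun invocation at different scales; only -np and --host change with the node count. A small helper along these lines could generate those two arguments. The function name mpi_host_args is hypothetical, and any address beyond 10.1.1.1 and 10.1.1.2 is a placeholder rather than a real machine from the test.

```shell
# Hypothetical helper: print the -np and --host arguments for mpirun,
# given the number of GPUs per node followed by the node addresses.
mpi_host_args() {
  local slots="$1" hosts="" count=0 ip
  shift
  for ip in "$@"; do
    # Append "ip:slots", comma-separated after the first entry.
    hosts="${hosts:+$hosts,}$ip:$slots"
    count=$((count + 1))
  done
  echo "-np $((count * slots)) --host $hosts"
}

# mpi_host_args 8 10.1.1.1 10.1.1.2
#   -> -np 16 --host 10.1.1.1:8,10.1.1.2:8
```

The output can be spliced into the earlier command in place of the hard-coded -np and --host lines, so that scaling from 2 to 8 nodes only means extending the address list.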

8 GPUs × 2 nodes = 16 GPUs

#                                                              out-of-place                       in-place  
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)   
           8             2     float     sum      -1    47.81    0.00    0.00      0    33.66    0.00    0.00      0
          16             4     float     sum      -1    33.66    0.00    0.00      0    34.94    0.00    0.00      0
          32             8     float     sum      -1    33.87    0.00    0.00      0    33.90    0.00    0.00      0
          64            16     float     sum      -1    34.20    0.00    0.00      0    34.07    0.00    0.00      0
         128            32     float     sum      -1    34.56    0.00    0.01      0    34.47    0.00    0.01      0
         256            64     float     sum      -1    48.72    0.01    0.01      0    34.87    0.01    0.01      0
         512           128     float     sum      -1    37.57    0.01    0.03      0    35.44    0.01    0.03      0
        1024           256     float     sum      -1    37.54    0.03    0.05      0    36.09    0.03    0.05      0
        2048           512     float     sum      -1    44.73    0.05    0.09      0    37.92    0.05    0.10      0
        4096          1024     float     sum      -1    41.03    0.10    0.19      0    41.49    0.10    0.19      0
        8192          2048     float     sum      -1    41.92    0.20    0.37      0    42.84    0.19    0.36      0
       16384          4096     float     sum      -1    42.85    0.38    0.72      0    41.66    0.39    0.74      0
       32768          8192     float     sum      -1    42.62    0.77    1.44      0    44.69    0.73    1.37      0
       65536         16384     float     sum      -1    44.70    1.47    2.75      0    43.29    1.51    2.84      0
      131072         32768     float     sum      -1    65.38    2.00    3.76      0    62.48    2.10    3.93      0
      262144         65536     float     sum      -1    71.63    3.66    6.86      0    67.78    3.87    7.25      0
      524288        131072     float     sum      -1    91.63    5.72   10.73      0    81.45    6.44   12.07      0
     1048576        262144     float     sum      -1    89.24   11.75   22.03      0    87.16   12.03   22.56      0
     2097152        524288     float     sum      -1    90.43   23.19   43.48      0    93.56   22.42   42.03      0
     4194304       1048576     float     sum      -1    114.5   36.64   68.70      0    122.8   34.15   64.04      0
     8388608       2097152     float     sum      -1    157.7   53.20   99.76      0    157.0   53.45  100.21      0
    16777216       4194304     float     sum      -1    209.9   79.94  149.88      0    203.6   82.42  154.54      0
    33554432       8388608     float     sum      -1    277.1  121.10  227.07      0    278.6  120.44  225.83      0
    67108864      16777216     float     sum      -1    474.3  141.50  265.32      0    481.6  139.36  261.30      0
   134217728      33554432     float     sum      -1    759.4  176.75  331.41      0    761.5  176.26  330.48      0
   268435456      67108864     float     sum      -1   1305.4  205.64  385.57      0   1307.9  205.23  384.81      0
   536870912     134217728     float     sum      -1   2345.8  228.87  429.13      0   2346.2  228.83  429.05      0
  1073741824     268435456     float     sum      -1   4407.4  243.62  456.79      0   4413.8  243.27  456.13      0
  2147483648     536870912     float     sum      -1   8519.2  252.08  472.64      0   8520.7  252.03  472.56      0
  4294967296    1073741824     float     sum      -1    16750  256.41  480.78      0    16758  256.30  480.56      0
  8589934592    2147483648     float     sum      -1    33238  258.44  484.57      0    33233  258.47  484.64      0
 17179869184    4294967296     float     sum      -1    66258  259.29  486.16      0    66241  259.35  486.29      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 138.348

8 GPUs × 3 nodes = 24 GPUs

#                                                              out-of-place                       in-place      
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)   
           8             2     float     sum      -1    58.18    0.00    0.00      0    50.66    0.00    0.00      0
          16             4     float     sum      -1    50.31    0.00    0.00      0    50.36    0.00    0.00      0
          32             8     float     sum      -1    50.78    0.00    0.00      0    50.74    0.00    0.00      0
          64            16     float     sum      -1    51.31    0.00    0.00      0    51.28    0.00    0.00      0
         128            32     float     sum      -1    53.22    0.00    0.00      0    52.38    0.00    0.00      0
         256            64     float     sum      -1    66.36    0.00    0.01      0    53.08    0.00    0.01      0
         512           128     float     sum      -1    58.45    0.01    0.02      0    53.74    0.01    0.02      0
        1024           256     float     sum      -1    55.23    0.02    0.04      0    55.12    0.02    0.04      0
        2048           512     float     sum      -1    57.81    0.04    0.07      0    57.63    0.04    0.07      0
        4096          1024     float     sum      -1    60.83    0.07    0.13      0    60.60    0.07    0.13      0
        8192          2048     float     sum      -1    66.97    0.12    0.23      0    66.25    0.12    0.24      0
       16384          4096     float     sum      -1    67.94    0.24    0.46      0    66.97    0.24    0.47      0
       32768          8192     float     sum      -1    68.14    0.48    0.92      0    67.56    0.49    0.93      0
       65536         16384     float     sum      -1    68.15    0.96    1.84      0    67.55    0.97    1.86      0
      131072         32768     float     sum      -1    84.55    1.55    2.97      0    84.64    1.55    2.97      0
      262144         65536     float     sum      -1    225.0    1.17    2.23      0    135.0    1.94    3.72      0
      524288        131072     float     sum      -1    139.1    3.77    7.22      0    140.4    3.74    7.16      0
     1048576        262144     float     sum      -1    140.1    7.48   14.34      0    142.3    7.37   14.12      0
     2097152        524288     float     sum      -1    150.1   13.97   26.77      0    147.1   14.26   27.33      0
     4194304       1048576     float     sum      -1    169.5   24.74   47.42      0    172.9   24.26   46.50      0
     8388608       2097152     float     sum      -1    227.8   36.82   70.58      0    225.4   37.21   71.33      0
    16777216       4194304     float     sum      -1    281.2   59.67  114.36      0    281.9   59.52  114.08      0
    33554432       8388608     float     sum      -1    391.1   85.80  164.45      0    388.9   86.28  165.37      0
    67108864      16777216     float     sum      -1    659.9  101.69  194.91      0    660.1  101.67  194.87      0
   134217728      33554432     float     sum      -1   1294.0  103.72  198.80      0   1283.5  104.57  200.43      0
   268435456      67108864     float     sum      -1   2636.9  101.80  195.12      0   2628.6  102.12  195.73      0
   536870912     134217728     float     sum      -1   5264.3  101.98  195.47      0   5169.7  103.85  199.04      0
  1073741824     268435456     float     sum      -1    10279  104.46  200.22      0    10283  104.42  200.14      0
  2147483648     536870912     float     sum      -1    19358  110.94  212.63      0    19444  110.45  211.69      0
  4294967296    1073741824     float     sum      -1    38604  111.26  213.24      0    38587  111.31  213.34      0
  8589934592    2147483648     float     sum      -1    76478  112.32  215.28      0    76501  112.29  215.21      0
 17179869184    4294967296     float     sum      -1   153032  112.26  215.17      0   152877  112.38  215.39      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 71.8299

8 GPUs × 4 nodes = 32 GPUs

#                                                              out-of-place                       in-place      
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)   
           8             2     float     sum      -1    82.36    0.00    0.00      0    90.30    0.00    0.00      0
          16             4     float     sum      -1    51.65    0.00    0.00      0    50.85    0.00    0.00      0
          32             8     float     sum      -1    52.48    0.00    0.00      0    50.93    0.00    0.00      0
          64            16     float     sum      -1    51.57    0.00    0.00      0    51.28    0.00    0.00      0
         128            32     float     sum      -1    51.87    0.00    0.00      0    51.92    0.00    0.00      0
         256            64     float     sum      -1    66.55    0.00    0.01      0    52.95    0.00    0.01      0
         512           128     float     sum      -1    57.73    0.01    0.02      0    53.17    0.01    0.02      0
        1024           256     float     sum      -1    54.62    0.02    0.04      0    54.34    0.02    0.04      0
        2048           512     float     sum      -1    57.70    0.04    0.07      0    57.63    0.04    0.07      0
        4096          1024     float     sum      -1    61.01    0.07    0.13      0    61.61    0.07    0.13      0
        8192          2048     float     sum      -1    69.46    0.12    0.23      0    68.15    0.12    0.23      0
       16384          4096     float     sum      -1    70.04    0.23    0.45      0    67.51    0.24    0.47      0
       32768          8192     float     sum      -1    70.88    0.46    0.90      0    71.45    0.46    0.89      0
       65536         16384     float     sum      -1    69.86    0.94    1.82      0    67.91    0.96    1.87      0
      131072         32768     float     sum      -1    109.4    1.20    2.32      0    88.77    1.48    2.86      0
      262144         65536     float     sum      -1    111.0    2.36    4.58      0    105.7    2.48    4.81      0
      524288        131072     float     sum      -1    107.7    4.87    9.43      0    105.1    4.99    9.66      0
     1048576        262144     float     sum      -1    258.7    4.05    7.85      0    163.3    6.42   12.44      0
     2097152        524288     float     sum      -1    165.5   12.68   24.56      0    162.0   12.95   25.09      0
     4194304       1048576     float     sum      -1    190.2   22.05   42.72      0    194.9   21.52   41.70      0
     8388608       2097152     float     sum      -1    256.5   32.71   63.38      0    262.8   31.92   61.84      0
    16777216       4194304     float     sum      -1    310.2   54.09  104.80      0    336.7   49.82   96.53      0
    33554432       8388608     float     sum      -1    778.9   43.08   83.47      0    421.3   79.64  154.31      0
    67108864      16777216     float     sum      -1    661.3  101.48  196.63      0    660.2  101.65  196.95      0
   134217728      33554432     float     sum      -1   1299.4  103.29  200.13      0   1311.4  102.35  198.30      0
   268435456      67108864     float     sum      -1   2675.8  100.32  194.37      0   2702.6   99.33  192.44      0
   536870912     134217728     float     sum      -1   5070.6  105.88  205.14      0   5134.5  104.56  202.59      0
  1073741824     268435456     float     sum      -1   9667.0  111.07  215.20      0   9674.4  110.99  215.04      0
  2147483648     536870912     float     sum      -1    19314  111.19  215.43      0    19306  111.24  215.52      0
  4294967296    1073741824     float     sum      -1    38352  111.99  216.97      0    38462  111.67  216.35      0
  8589934592    2147483648     float     sum      -1    77082  111.44  215.91      0    77157  111.33  215.70      0
 17179869184    4294967296     float     sum      -1   153558  111.88  216.77      0   154232  111.39  215.82      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 70.3908

8 GPUs × 6 nodes = 48 GPUs

#                                                              out-of-place                       in-place      
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)   
           8             2     float     sum      -1    89.42    0.00    0.00      0    95.72    0.00    0.00      0
          16             4     float     sum      -1    68.24    0.00    0.00      0    69.17    0.00    0.00      0
          32             8     float     sum      -1    69.06    0.00    0.00      0    68.48    0.00    0.00      0
          64            16     float     sum      -1    69.11    0.00    0.00      0    68.82    0.00    0.00      0
         128            32     float     sum      -1    69.23    0.00    0.00      0    69.87    0.00    0.00      0
         256            64     float     sum      -1    85.12    0.00    0.01      0    70.90    0.00    0.01      0
         512           128     float     sum      -1    76.13    0.01    0.01      0    71.36    0.01    0.01      0
        1024           256     float     sum      -1    72.96    0.01    0.03      0    73.21    0.01    0.03      0
        2048           512     float     sum      -1    77.26    0.03    0.05      0    77.37    0.03    0.05      0
        4096          1024     float     sum      -1    82.41    0.05    0.10      0    82.67    0.05    0.10      0
        8192          2048     float     sum      -1    91.21    0.09    0.18      0    91.36    0.09    0.18      0
       16384          4096     float     sum      -1    91.03    0.18    0.35      0    89.97    0.18    0.36      0
       32768          8192     float     sum      -1    92.10    0.36    0.70      0    91.97    0.36    0.70      0
       65536         16384     float     sum      -1    92.95    0.71    1.38      0    90.82    0.72    1.41      0
      131072         32768     float     sum      -1    123.4    1.06    2.08      0    114.1    1.15    2.25      0
      262144         65536     float     sum      -1    142.6    1.84    3.60      0    129.0    2.03    3.98      0
      524288        131072     float     sum      -1    131.8    3.98    7.79      0    130.6    4.01    7.86      0
     1048576        262144     float     sum      -1    333.7    3.14    6.15      0    205.7    5.10    9.98      0
     2097152        524288     float     sum      -1    215.6    9.72   19.04      0    215.7    9.72   19.04      0
     4194304       1048576     float     sum      -1    248.1   16.90   33.11      0    247.0   16.98   33.25      0
     8388608       2097152     float     sum      -1    323.1   25.96   50.84      0    327.8   25.59   50.11      0
    16777216       4194304     float     sum      -1    391.9   42.81   83.84      0    395.1   42.47   83.17      0
    33554432       8388608     float     sum      -1    500.3   67.07  131.35      0    509.0   65.93  129.10      0
    67108864      16777216     float     sum      -1   1110.0   60.46  118.40      0    758.7   88.46  173.23      0
   134217728      33554432     float     sum      -1   1472.9   91.12  178.45      0   1459.9   91.94  180.05      0
   268435456      67108864     float     sum      -1   2773.6   96.78  189.53      0   2777.4   96.65  189.27      0
   536870912     134217728     float     sum      -1   6015.0   89.26  174.79      0   6069.4   88.46  173.23      0
  1073741824     268435456     float     sum      -1    11307   94.97  185.98      0    11355   94.56  185.18      0
  2147483648     536870912     float     sum      -1    22736   94.45  184.97      0    22737   94.45  184.96      0
  4294967296    1073741824     float     sum      -1    44570   96.36  188.71      0    44592   96.32  188.62      0
  8589934592    2147483648     float     sum      -1    88928   96.59  189.16      0    88866   96.66  189.30      0
 17179869184    4294967296     float     sum      -1   176888   97.12  190.20      0   176773   97.19  190.32      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 61.5088

8 GPUs × 8 nodes = 64 GPUs

#                                                              out-of-place                       in-place      
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)   
           8             2     float     sum      -1    112.8    0.00    0.00      0    71.52    0.00    0.00      0
          16             4     float     sum      -1    85.27    0.00    0.00      0    105.0    0.00    0.00      0
          32             8     float     sum      -1    72.34    0.00    0.00      0    70.80    0.00    0.00      0
          64            16     float     sum      -1    72.54    0.00    0.00      0    74.74    0.00    0.00      0
         128            32     float     sum      -1    73.15    0.00    0.00      0    72.09    0.00    0.00      0
         256            64     float     sum      -1    88.33    0.00    0.01      0    75.13    0.00    0.01      0
         512           128     float     sum      -1    78.82    0.01    0.01      0    77.59    0.01    0.01      0
        1024           256     float     sum      -1    76.75    0.01    0.03      0    76.48    0.01    0.03      0
        2048           512     float     sum      -1    81.94    0.02    0.05      0    82.43    0.02    0.05      0
        4096          1024     float     sum      -1    86.95    0.05    0.09      0    84.84    0.05    0.10      0
        8192          2048     float     sum      -1    91.47    0.09    0.18      0    91.45    0.09    0.18      0
       16384          4096     float     sum      -1    91.38    0.18    0.35      0    90.78    0.18    0.36      0
       32768          8192     float     sum      -1    92.38    0.35    0.70      0    90.76    0.36    0.71      0
       65536         16384     float     sum      -1    94.18    0.70    1.37      0    90.96    0.72    1.42      0
      131072         32768     float     sum      -1    104.5    1.25    2.47      0    105.9    1.24    2.44      0
      262144         65536     float     sum      -1    121.0    2.17    4.26      0    120.8    2.17    4.27      0
      524288        131072     float     sum      -1    126.3    4.15    8.17      0    127.4    4.11    8.10      0
     1048576        262144     float     sum      -1    136.8    7.66   15.09      0    137.4    7.63   15.02      0
     2097152        524288     float     sum      -1    331.5    6.33   12.46      0    220.2    9.52   18.75      0
     4194304       1048576     float     sum      -1    248.7   16.86   33.20      0    251.0   16.71   32.90      0
     8388608       2097152     float     sum      -1    322.0   26.05   51.28      0    336.2   24.95   49.12      0
    16777216       4194304     float     sum      -1    382.8   43.82   86.28      0    395.1   42.47   83.61      0
    33554432       8388608     float     sum      -1    519.7   64.56  127.10      0    535.6   62.64  123.33      0
    67108864      16777216     float     sum      -1    817.0   82.14  161.71      0    855.0   78.49  154.53      0
   134217728      33554432     float     sum      -1   1766.9   75.96  149.55      0   1549.4   86.62  170.54      0
   268435456      67108864     float     sum      -1   2856.1   93.99  185.04      0   2846.1   94.32  185.69      0
   536870912     134217728     float     sum      -1   6037.2   88.93  175.08      0   5964.4   90.01  177.21      0
  1073741824     268435456     float     sum      -1    11352   94.58  186.21      0    11316   94.89  186.81      0
  2147483648     536870912     float     sum      -1    22493   95.47  187.96      0    22326   96.19  189.37      0
  4294967296    1073741824     float     sum      -1    44340   96.86  190.70      0    44363   96.81  190.60      0
  8589934592    2147483648     float     sum      -1    88655   96.89  190.75      0    88763   96.77  190.52      0
 17179869184    4294967296     float     sum      -1   177385   96.85  190.67      0   177137   96.99  190.94      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 61.5215
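
The busbw column can be cross-checked against algbw by hand. This is a minimal sketch, assuming the tables here come from an all-reduce test (the redop and root columns suggest all_reduce_perf). For all-reduce, nccl-tests derives bus bandwidth as busbw = algbw × 2 × (n − 1) / n, where n is the total number of ranks. The helper name allreduce_busbw is hypothetical and not part of nccl-tests.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nccl-tests): convert an all-reduce
# algbw value (GB/s) into busbw (GB/s) for a given number of ranks.
# Formula: busbw = algbw * 2 * (n - 1) / n
allreduce_busbw() {
  local algbw="$1" ranks="$2"
  awk -v a="$algbw" -v n="$ranks" 'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
}

# 16 GiB row of the 64-GPU run: algbw 96.85 GB/s
allreduce_busbw 96.85 64     # prints 190.67

# 16 GiB row of the 128-GPU run: algbw 95.58 GB/s
allreduce_busbw 95.58 128    # prints 189.67
```

The results agree with the logged busbw values to within rounding. The last digit can differ because the log prints algbw rounded to two decimals, while the tool presumably computes busbw from the unrounded value.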

8 GPUs × 16 nodes = 128 GPUs

#                                                              out-of-place                       in-place      
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)   
           8             2     float     sum      -1    96.01    0.00    0.00      0    83.47    0.00    0.00      0
          16             4     float     sum      -1    83.85    0.00    0.00      0    85.04    0.00    0.00      0
          32             8     float     sum      -1    86.48    0.00    0.00      0    85.99    0.00    0.00      0
          64            16     float     sum      -1    85.68    0.00    0.00      0    86.09    0.00    0.00      0
         128            32     float     sum      -1    94.56    0.00    0.00      0    85.30    0.00    0.00      0
         256            64     float     sum      -1    99.09    0.00    0.01      0    88.01    0.00    0.01      0
         512           128     float     sum      -1    92.27    0.01    0.01      0    87.80    0.01    0.01      0
        1024           256     float     sum      -1    89.44    0.01    0.02      0    89.88    0.01    0.02      0
        2048           512     float     sum      -1    96.80    0.02    0.04      0    94.19    0.02    0.04      0
        4096          1024     float     sum      -1    100.3    0.04    0.08      0    99.64    0.04    0.08      0
        8192          2048     float     sum      -1    107.3    0.08    0.15      0    105.7    0.08    0.15      0
       16384          4096     float     sum      -1    107.7    0.15    0.30      0    109.9    0.15    0.30      0
       32768          8192     float     sum      -1    109.3    0.30    0.59      0    107.0    0.31    0.61      0
       65536         16384     float     sum      -1    113.0    0.58    1.15      0    117.6    0.56    1.11      0
      131072         32768     float     sum      -1    145.2    0.90    1.79      0    132.8    0.99    1.96      0
      262144         65536     float     sum      -1    154.0    1.70    3.38      0    148.4    1.77    3.51      0
      524288        131072     float     sum      -1    154.1    3.40    6.75      0    153.9    3.41    6.76      0
     1048576        262144     float     sum      -1    165.0    6.36   12.61      0    167.6    6.26   12.41      0
     2097152        524288     float     sum      -1    372.5    5.63   11.17      0    272.0    7.71   15.30      0
     4194304       1048576     float     sum      -1    310.1   13.53   26.84      0    312.7   13.41   26.61      0
     8388608       2097152     float     sum      -1    398.2   21.07   41.81      0    402.4   20.85   41.37      0
    16777216       4194304     float     sum      -1    488.8   34.32   68.11      0    471.2   35.61   70.65      0
    33554432       8388608     float     sum      -1    604.3   55.53  110.19      0    632.0   53.09  105.35      0
    67108864      16777216     float     sum      -1    943.3   71.14  141.18      0    982.9   68.28  135.49      0
   134217728      33554432     float     sum      -1   1587.5   84.54  167.77      0   1601.4   83.81  166.32      0
   268435456      67108864     float     sum      -1   3130.8   85.74  170.14      0   2956.4   90.80  180.18      0
   536870912     134217728     float     sum      -1   5802.4   92.53  183.61      0   5643.7   95.13  188.77      0
  1073741824     268435456     float     sum      -1    11184   96.00  190.51      0    11268   95.29  189.10      0
  2147483648     536870912     float     sum      -1    23058   93.13  184.81      0    23122   92.88  184.30      0
  4294967296    1073741824     float     sum      -1    44618   96.26  191.02      0    44649   96.19  190.88      0
  8589934592    2147483648     float     sum      -1    89769   95.69  189.88      0    89610   95.86  190.22      0
 17179869184    4294967296     float     sum      -1   179747   95.58  189.66      0   179740   95.58  189.67      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 59.2936
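
Note that "Avg bus bandwidth" is averaged over every message size in the sweep, including the tiny latency-bound ones, so a value around 60 GB/s understates the large-message throughput, which plateaus near 190 GB/s in the runs above. When comparing runs, the peak busbw is often the more useful number. The following is a rough sketch for extracting it from a saved log. The function name peak_busbw is hypothetical, and the script assumes the 13-column row layout shown in these tables.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the peak out-of-place and in-place busbw
# found in an nccl-tests log (read from files or from stdin).
peak_busbw() {
  awk '
    # Data rows start with a numeric size and have exactly 13 fields;
    # header and summary lines start with "#" and are skipped.
    $1 ~ /^[0-9]+$/ && NF == 13 {
      if ($8 + 0 > oop) oop = $8 + 0    # column 8: out-of-place busbw (GB/s)
      if ($12 + 0 > ip) ip = $12 + 0    # column 12: in-place busbw (GB/s)
    }
    END { printf "out-of-place %.2f in-place %.2f\n", oop, ip }
  ' "$@"
}

# Demo on two rows copied from the 128-GPU run above.
peak_busbw <<'EOF'
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
  4294967296    1073741824     float     sum      -1    44618   96.26  191.02      0    44649   96.19  190.88      0
 17179869184    4294967296     float     sum      -1   179747   95.58  189.66      0   179740   95.58  189.67      0
# Avg bus bandwidth    : 59.2936
EOF
# prints: out-of-place 191.02 in-place 190.88
```

For a real run, save the mpirun output to a file and pass that file as the argument, for example `peak_busbw nccl.log` (the file name is only an illustration).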