NCCL Single-Node vs Multi-Node Performance Testing
Test environment: 2 servers, each with 8x A100 GPUs.
1. Intra-node topology
Check the connectivity between the 8 GPUs inside one node:
/home/tsj # nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X
As the matrix shows, any two GPUs are connected as NV12, i.e., the path between them uses 12 NVLink links. This should be because the A100 platform introduces NVSwitch.
On the earlier V100 platform there was no NVSwitch, so the number of NVLinks between two GPUs depended on which pair it was: as in the figure above, GPU2 and GPU3 share 2 links while GPU0 and GPU1 have only 1.
With NVSwitch on the A100, any two GPUs communicate over 12 NVLink links.
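A quick cross-check is to list the NVLink links per GPU directly. On an A100-SXM4 node each GPU should report 12 active links at 25 GB/s each (the exact output wording depends on the driver version):
nvidia-smi nvlink --status     # append "-i 0" to show only GPU 0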
The figure in that article is fairly close to the actual internal wiring of the server, so it is reproduced here. Since our machines are not DGX servers, some details are bound to differ.
2. Intra-node GPU-to-GPU bandwidth
The test tool is the p2pBandwidthLatencyTest sample that ships with CUDA; it has to be built before use:
cd /home/tsj/cuda/cuda-11.1/samples/1_Utilities/p2pBandwidthLatencyTest
CUDA_PATH=/home/tsj/cuda/cuda-11.1 make all
Run it:
./p2pBandwidthLatencyTest
The unidirectional GPU-to-GPU bandwidth:
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1568.78 273.84 274.48 234.52 275.28 273.07 274.89 273.65
1 274.44 1584.69 273.64 233.82 275.44 275.29 274.64 273.26
2 272.95 276.63 1583.08 233.70 275.08 274.69 275.17 274.63
3 234.42 274.31 270.86 1553.18 274.29 273.95 275.01 270.95
4 234.41 274.32 271.64 235.24 1583.08 274.48 275.00 274.48
5 233.04 274.80 272.98 235.55 273.40 1589.52 274.63 273.07
6 236.24 275.47 273.63 235.42 274.15 274.62 1586.29 275.57
7 235.66 235.33 274.10 233.30 275.30 275.41 274.13 1597.65
And the bidirectional GPU-to-GPU bandwidth:
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 1558.60 401.42 399.88 399.67 400.62 401.62 401.49 400.79
1 399.88 1564.06 401.32 399.68 401.92 402.34 402.03 402.34
2 398.76 399.98 1564.06 401.73 402.03 402.44 402.75 403.25
3 402.38 398.56 398.66 1556.27 401.30 401.51 401.61 401.10
4 403.00 404.22 404.10 403.31 1608.34 492.03 490.80 491.84
5 403.58 404.04 404.66 403.13 495.82 1606.68 494.51 490.18
6 404.39 403.65 404.85 402.76 496.98 496.75 1609.17 496.74
7 404.27 404.58 404.09 403.79 496.57 496.80 495.44 1606.68
Roughly speaking, the unidirectional bandwidth between two GPUs is about 270 GB/s (theoretical peak: 25 GB/s x 12 links = 300 GB/s), i.e., about 90% of peak.
The bidirectional bandwidth is about 400 GB/s for most pairs (close to 490 GB/s among GPUs 4-7), versus a theoretical 300 x 2 = 600 GB/s, i.e., only about 66% of peak.
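If the measured numbers were much lower than this, one thing worth checking is whether peer-to-peer access is actually enabled between all GPU pairs. On newer driver versions nvidia-smi can print the P2P capability matrices directly; this is an assumption about the driver in use, so check nvidia-smi topo --help first:
nvidia-smi topo -p2p r     # P2P read capability matrix ("OK" means peer access works)
nvidia-smi topo -p2p w     # P2P write capability matrix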
3. Intra-node NCCL bandwidth
According to 《如何理解Nvidia英伟达的Multi-GPU多卡通信框架NCCL》, intra-node NCCL always uses the Ring algorithm, never Tree. Let's run both anyway and see. (Single node, so no mpirun needed.)
The test tool is nccl-tests.
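It is built from source; a minimal build sketch, assuming the same CUDA path as above (the MPI=1 variant is only needed for the multi-node runs in section 6):
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make CUDA_HOME=/home/tsj/cuda/cuda-11.1                  # add NCCL_HOME=... if NCCL is installed in a non-default location
# make MPI=1 MPI_HOME=/path/to/openmpi CUDA_HOME=...     # MPI-enabled build, needed for the 2-node runs later
# the binaries (all_reduce_perf, etc.) end up under ./build/
Run all_reduce_perf on the 8 local GPUs: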
root@tsjsdbd:~# /root/all_reduce_perf -b 1M -e 2048M -f 2 -g 8
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
134217728 33554432 float sum -1 1335.1 100.53 175.93 0 1331.4 100.81 176.42 0
268435456 67108864 float sum -1 2376.9 112.94 197.64 0 2381.0 112.74 197.30 0
536870912 134217728 float sum -1 4684.3 114.61 200.57 0 4535.7 118.36 207.14 0
1073741824 268435456 float sum -1 8152.2 131.71 230.50 0 8148.3 131.78 230.61 0
2147483648 536870912 float sum -1 16182 132.71 232.24 0 16184 132.69 232.21 0
Only about 132 GB/s algbw (about 232 GB/s busbw). Not sure why it drops this much compared with the raw P2P numbers.
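For reference, the two bandwidth columns are tied together by a fixed factor: nccl-tests defines algbw = size / time, and for all_reduce it reports busbw = algbw * 2*(n-1)/n, where n is the number of ranks. With n = 8 the factor is 1.75, which is exactly the relation between the two columns above:
awk 'BEGIN { algbw = 132.71; n = 8; printf "%.2f GB/s\n", algbw * 2 * (n - 1) / n }'     # -> 232.24 GB/s, matching the busbw column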
Forcing the Tree algorithm gives even lower bandwidth:
root@tsjsdbd:~# NCCL_ALGO=Tree /root/all_reduce_perf -b 1M -e 2048M -f 2 -g 8
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
134217728 33554432 float sum -1 1962.4 68.39 119.69 0 1961.0 68.44 119.78 0
268435456 67108864 float sum -1 3003.9 89.36 156.38 0 2964.9 90.54 158.44 0
536870912 134217728 float sum -1 5631.7 95.33 166.83 0 5134.6 104.56 182.98 0
1073741824 268435456 float sum -1 10381 103.43 181.00 0 10375 103.49 181.11 0
2147483648 536870912 float sum -1 20055 107.08 187.39 0 20051 107.10 187.42 0
Only about 107 GB/s now.
4. Inter-node topology
Each server has 8 RoCE NICs:
/home/tsj # ibdev2netdev
mlx5_0 port 1 ==> enp80s0f0 (Up)
mlx5_1 port 1 ==> enp80s0f1 (Up)
mlx5_2 port 1 ==> enp106s0f0 (Up)
mlx5_3 port 1 ==> enp106s0f1 (Up)
mlx5_4 port 1 ==> enp137s0f0 (Up)
mlx5_5 port 1 ==> enp137s0f1 (Up)
mlx5_6 port 1 ==> enp234s0f0 (Up)
mlx5_7 port 1 ==> enp234s0f1 (Up)
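To confirm the per-port line rate, the usual InfiniBand/Ethernet utilities can be used (assuming ibstat and ethtool are installed on the host):
ibstat mlx5_0 | grep Rate          # expect "Rate: 100"
ethtool enp80s0f0 | grep Speed     # expect "Speed: 100000Mb/s"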
All eight NICs connect into the switch network.
5. Inter-node RDMA bandwidth
Start the server side on node 2:
ib_write_bw -a -d mlx5_0 --report_gbits
Start the client on node 1:
ib_write_bw -a -F -d mlx5_0 29.28.195.228 --report_gbits
The result:
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
CQ Moderation : 100
Mtu : 4096[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x00a6 PSN 0xe1223a RKey 0x183cdc VAddr 0x007f4a2cb39000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:29:28:201:21
remote address: LID 0000 QPN 0x00a6 PSN 0xe98532 RKey 0x183cdc VAddr 0x007f1c6fbde000
GID: 00:00:00:00:00:00:00:00:00:00:255:255:29:28:195:228
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
262144 5000 97.88 97.85 0.046659
524288 5000 97.84 97.80 0.023318
1048576 5000 97.77 97.77 0.011655
2097152 5000 97.84 97.80 0.005829
4194304 5000 97.87 97.84 0.002916
8388608 5000 97.84 97.84 0.001458
---------------------------------------------------------------------------------------
A single NIC delivers only about 100 Gb/s (note the lowercase b: bits).
Across 8 NICs the aggregate is 100 Gb/s x 8 = 800 Gb/s, i.e., about 100 GB/s (uppercase B: bytes).
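The test above only exercises one NIC at a time. A rough way to check the aggregate is to run one ib_write_bw server/client pair per device, each on its own TCP port; a sketch, assuming both nodes name their devices the same way (18515 is the perftest default port, the loop just offsets from it):
# Node 2 (server): one listener per RoCE NIC
for i in $(seq 0 7); do
  ib_write_bw -d mlx5_$i -p $((18515 + i)) --report_gbits &
done
wait
# Node 1 (client): one sender per NIC, targeting the matching port
for i in $(seq 0 7); do
  ib_write_bw -F -d mlx5_$i -p $((18515 + i)) --report_gbits 29.28.195.228 &
done
wait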
6. Inter-node NCCL bandwidth
Run on node 1 (the "-p 38888" is there because passwordless SSH on these machines listens on port 38888):
mpirun --allow-run-as-root \
       --mca pml ob1 --mca btl tcp,self -mca btl_tcp_if_include enp218s0 \
       -mca plm_rsh_args "-p 38888" \
       --host 192.168.0.37,192.168.0.130 \
       -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=128 \
       -x NCCL_ALGO=RING -x NCCL_IB_HCA=mlx5 -x NCCL_IB_TIMEOUT=18 \
       -x NCCL_SOCKET_IFNAME=enp218s0 -x LD_LIBRARY_PATH \
       /root/all_reduce_perf -b 8 -e 1024M -f 2 -g 8
The result:
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
134217728 33554432 float sum -1 3383.9 39.66 74.37 0 3642.6 36.85 69.09 0
268435456 67108864 float sum -1 6516.7 41.19 77.23 0 6813.2 39.40 73.87 0
536870912 134217728 float sum -1 12394 43.32 81.22 0 13157 40.81 76.51 0
1073741824 268435456 float sum -1 25629 41.90 78.55 0 24413 43.98 82.47 0
The bandwidth is only about 41 GB/s (algbw).
After switching to NCCL_ALGO=Tree:
mpirun --allow-run-as-root \
       --mca pml ob1 --mca btl tcp,self -mca btl_tcp_if_include enp218s0 \
       -mca plm_rsh_args "-p 38888" \
       --host 192.168.0.37,192.168.0.130 \
       -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_TC=128 \
       -x NCCL_ALGO=Tree -x NCCL_IB_HCA=mlx5 -x NCCL_IB_TIMEOUT=18 \
       -x NCCL_SOCKET_IFNAME=enp218s0 -x LD_LIBRARY_PATH \
       /root/all_reduce_perf -b 8 -e 1024M -f 2 -g 8
the bandwidth improves:
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
134217728 33554432 float sum -1 2841.8 47.23 88.55 0 2860.9 46.91 87.96 0
268435456 67108864 float sum -1 4334.4 61.93 116.12 0 4241.4 63.29 118.67 0
536870912 134217728 float sum -1 8277.2 64.86 121.62 0 8473.8 63.36 118.79 0
1073741824 268435456 float sum -1 15289 70.23 131.68 0 15974 67.22 126.03 0
It reaches about 70 GB/s (algbw). Compared with the theoretical 100 GB/s across the 8 NICs, NCCL is getting roughly 70% of the line rate.
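To confirm that NCCL really spreads traffic across all eight RoCE NICs (rather than silently falling back to one device or to plain TCP), the NCCL_DEBUG=INFO output requested above can be filtered. A sketch, assuming the mpirun output was saved to a hypothetical file allreduce_2node.log (the exact log wording differs slightly between NCCL versions):
grep -E "NET/IB|NET/Socket" allreduce_2node.log
# Expected on each node, one entry per device:
#   NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_1:1/RoCE ... [7]mlx5_7:1/RoCE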