3. Performance Testing
3.1 nccl-test (single node)
- Each node runs nccl-test standalone. The logs are returned to each node's log interface in the monitoring dashboard.
- Replace $manager_ip with the actual IP address.
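Because the pdsh command is wrapped in double quotes, the local shell expands $manager_ip before pdsh runs, so setting the variable once works as well as editing the command. The following is a minimal sketch of that approach; the helper name is_ipv4 and the address 192.0.2.10 are illustrative assumptions, not part of the podsys scripts.

```shell
# Sketch: validate an IPv4 address before assigning it to manager_ip.
# is_ipv4 is a hypothetical helper, not something shipped with podsys.
is_ipv4() {
    # Accept exactly four dot-separated decimal fields, each in 0-255.
    echo "$1" | awk -F. '
        NF != 4 { exit 1 }
        { for (i = 1; i <= 4; i++) if (($i !~ /^[0-9]+$/) || ($i > 255)) exit 1 }
    '
}

# 192.0.2.10 is a documentation placeholder; substitute the real address.
candidate_ip="192.0.2.10"
if is_ipv4 "$candidate_ip"; then
    manager_ip="$candidate_ip"
    echo "manager_ip=${manager_ip}"
else
    echo "invalid IPv4 address: ${candidate_ip}" >&2
fi
```

With manager_ip set in the current shell, the pdsh command can be run unchanged.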
shell
pdsh -l root -R ssh -w ^hosts.txt "/podsys/scripts/run_nccl_test_single.sh $manager_ip"
3.2 nvbandwidth (single node)
shell
pdsh -l root -R ssh -w ^hosts.txt "/podsys/scripts/run_nvbandwidth.sh $manager_ip"
3.3 nccl-test (Rack-level)
After completing the rack-level installation, configure passwordless SSH login between the root users on all nodes. For details, refer to section 4.1.3, "Configure passwordless SSH between root users on all nodes".
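Before launching a multi-node run, it can help to confirm that key-based root SSH works for every host. The sketch below is one way to check this; the function name check_passwordless_ssh is hypothetical, and it assumes hosts.txt holds one hostname per line.

```shell
# Sketch: list hosts that reject non-interactive root SSH.
# Assumes one hostname per line; blank lines and '#' comments are skipped.
check_passwordless_ssh() {
    cps_failed=0
    while read -r cps_host cps_rest || [ -n "$cps_host" ]; do
        case "$cps_host" in ''|'#'*) continue ;; esac
        # BatchMode=yes makes ssh fail rather than prompt for a password;
        # -n keeps ssh from consuming the rest of the hosts file on stdin.
        if ! ssh -n -o BatchMode=yes -o ConnectTimeout=5 "root@${cps_host}" true 2>/dev/null; then
            echo "FAILED: ${cps_host}"
            cps_failed=1
        fi
    done < "$1"
    return "$cps_failed"
}
```

Calling `check_passwordless_ssh hosts.txt` prints one FAILED line per host that could not be reached without a password and returns non-zero if any host failed.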
- From the management node, SSH into one of the compute nodes, for example node01.
- Prepare the hosts.txt file. You may use the hosts.txt generated on the management node.
- Configure imex using the hosts.txt file.
shell
pdsh -l root -R ssh -w ^hosts.txt "systemctl restart nvidia-imex.service"
- SSH into node01.
- Run the following command on node01:
shell
sudo su
- Export Path
shell
export LD_LIBRARY_PATH=/podsys/build/ompi418/lib:/usr/local/cuda/lib64:/podsys/build/nccl_2.28.9-1+cuda13.0_aarch64/lib
- Run nccl-test (all_reduce 72 GPUs)
shell
/podsys/build/ompi418/bin/mpirun --allow-run-as-root -np 72 -N 4 \
-hostfile /podsys/hosts.txt \
-x NCCL_DEBUG=WARN \
-x NCCL_NVLS_ENABLE=1 \
-x NCCL_SHM_DISABLE=1 \
-x UCX_NET_DEVICES=enP5p9s0 \
-x LD_LIBRARY_PATH /podsys/build/nccl-tests/build/all_reduce_perf \
-b 8 -e 32G -f 2 -g 1
- Run nccl-test (alltoall 72 GPUs)
shell
/podsys/build/ompi418/bin/mpirun --allow-run-as-root -np 72 -N 4 \
-hostfile /podsys/hosts.txt \
-x NCCL_DEBUG=WARN \
-x NCCL_NVLS_ENABLE=1 \
-x NCCL_SHM_DISABLE=1 \
-x UCX_NET_DEVICES=enP5p9s0 \
-x LD_LIBRARY_PATH /podsys/build/nccl-tests/build/alltoall_perf \
-b 8 -e 32G -f 2 -g 1
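Both mpirun commands launch 72 ranks at 4 per node (-np 72 -N 4), which implies that the hostfile lists 18 nodes. The sketch below checks that arithmetic against a hostfile before a run; the function names count_hosts and expected_ranks are hypothetical, and the hostfile is assumed to hold one host per line.

```shell
# Sketch: compute the rank count implied by a hostfile and a per-node count.
# Assumes one host per line; blank lines and '#' comments are ignored.
count_hosts() {
    awk 'NF && $1 !~ /^#/ { n++ } END { print n + 0 }' "$1"
}

expected_ranks() {
    # $1 = hostfile, $2 = processes per node (the -N value passed to mpirun)
    echo $(( $(count_hosts "$1") * $2 ))
}
```

Running `expected_ranks /podsys/hosts.txt 4` should print 72 when the hostfile lists 18 nodes. Any other value suggests that -np needs adjusting or that hosts.txt is incomplete.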