If your training cluster is running AllReduce above 2ms on InfiniBand NDR, you are losing time on every single gradient synchronization step. Not occasionally. Every step. Across every GPU. For the entire duration of your training run.
This is not a framework problem. It is not a model architecture problem. It is not solved by tuning batch size, learning rate, or gradient accumulation steps. It lives in the communication layer — and most teams never look there.
Do you have this problem?
Run this diagnostic before reading further. If you answer yes to two or more of these, you have an AllReduce configuration problem, not a hardware problem:
- GPU utilization sits at 60–75% during distributed training instead of above 80%
- Step time variance exceeds 10% between runs with identical batch sizes
- nccl-tests shows AllReduce above 2ms on a 16-node InfiniBand cluster
- You've increased batch size but throughput (samples/sec) didn't improve proportionally
- Nsight Systems shows idle GPU gaps between the end of the backward pass and the start of the next step
- You are running NCCL without explicitly setting NCCL_IB_GID_INDEX or NCCL_SOCKET_IFNAME
If you have not run nccl-tests on your cluster in the last 30 days, you do not have a baseline. You cannot diagnose what you have not measured.
Root cause: it is almost never the hardware
The AllReduce operation synchronizes gradients across all GPUs after each backward pass. On a 16-node A100 cluster with InfiniBand NDR 400Gbps, a correctly configured AllReduce on a 7B parameter model should complete in under 1ms. When I see 4ms, the cluster has the hardware to do better — it's just not using it correctly.
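That sub-millisecond figure can be sanity-checked against the bandwidth term of a ring AllReduce. The sketch below assumes PyTorch DDP's default ~25MB gradient buckets and one 400Gbps NDR link per GPU; it ignores per-hop latency and protocol overhead, so treat it as a lower bound, not a prediction.

```python
def ring_allreduce_time(size_bytes, n_gpus, link_gbps):
    """Bandwidth-only lower bound for ring AllReduce: each GPU sends and
    receives 2*(n-1)/n * size_bytes over its slowest link. Ignores per-hop
    latency and protocol overhead."""
    link_bytes_per_s = link_gbps / 8 * 1e9
    return 2 * (n_gpus - 1) / n_gpus * size_bytes / link_bytes_per_s

# One ~25MB DDP gradient bucket across 128 GPUs on a 400Gbps NDR link
t = ring_allreduce_time(25e6, 128, 400)
print(f"{t * 1e3:.2f} ms")  # ≈1 ms at full line rate
```

At 25MB per bucket the bound lands right around 1ms, which is why a healthy cluster should sit under the 2ms line with room to spare.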
There are three places the problem almost always lives:
1. NCCL is not using InfiniBand correctly
By default, NCCL will attempt to auto-detect your network interface. On clusters with multiple NICs — which is most production HPC clusters — this auto-detection picks the wrong interface. It falls back to TCP over Ethernet instead of using GPUDirect RDMA over InfiniBand. The result: your 400Gbps InfiniBand fabric is idle while NCCL sends gradients through your 10Gbps management network.
```
NCCL_DEBUG=INFO torchrun --nproc_per_node=8 your_script.py 2>&1 | grep -E "NET|IB|SOCKET"

# You should see:
# NCCL INFO NET/IB : Using [0]mlx5_0:1/IB
# NOT:
# NCCL INFO NET/Socket : Using [0]eth0
```
```
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=ib0
export NCCL_IB_HCA=mlx5_0,mlx5_1   # match your HCA names
export NCCL_NET_GDR_LEVEL=5        # allow GPUDirect RDMA at maximum NIC-GPU distance
```
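If you save the NCCL_DEBUG=INFO output to a file, the transport check can be automated across nodes. This is a hypothetical helper, not part of NCCL, and the exact log wording varies between NCCL versions, so adjust the patterns to what your version prints.

```python
import re

def nccl_transport(log_text):
    """Classify which network transport NCCL selected from NCCL_DEBUG=INFO
    output. Returns 'ib', 'socket', or 'unknown'. Illustrative parser;
    log wording differs across NCCL versions."""
    if re.search(r"NET/IB\s*:\s*Using", log_text):
        return "ib"
    if re.search(r"NET/Socket\s*:\s*Using", log_text):
        return "socket"
    return "unknown"

log = "node0:1234:1234 [0] NCCL INFO NET/Socket : Using [0]eth0\n"
print(nccl_transport(log))  # 'socket' means you are on the TCP fallback path
```

Run it against the log from every rank; a single node on the socket path is enough to drag the whole collective down.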
2. GPUDirect RDMA is disabled or misconfigured
GPUDirect RDMA allows the GPU to communicate directly with the InfiniBand HCA without going through CPU or system memory. When it works, it removes 40–60% of the latency from each AllReduce operation. When it's misconfigured, NCCL silently falls back to a CPU-mediated path with no warning.
The most common misconfiguration: nvidia_peermem is not loaded on the host. This module enables GPU-to-NIC direct memory access. Without it, GPUDirect RDMA cannot function regardless of your NCCL environment variables.
```
# Check the module is loaded
lsmod | grep nvidia_peermem

# Verify in NCCL logs — you should see:
# NCCL INFO GDR enabled for peer access
# NOT:
# NCCL INFO GDR disabled (no peer access)
```
3. SHARP in-network aggregation is not enabled
Modern InfiniBand switches support SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) — in-network computing that offloads AllReduce computation to the switch fabric itself. On a 16-node cluster, SHARP can reduce AllReduce latency by 30–50% by performing gradient aggregation in the network rather than at the endpoints.
Most teams do not enable it. The switch supports it. The HCA supports it. NCCL supports it. Nobody turned it on.
```
# SHARP reaches NCCL through the CollNet plugin
# (nccl-rdma-sharp-plugins, shipped with HPC-X)
export NCCL_COLLNET_ENABLE=1
export SHARP_COLL_LOG_LEVEL=3   # verify SHARP is actually active in the logs
```
How to measure the impact
Before applying any fix, establish a baseline. Run nccl-tests with your actual message sizes — not the default. A 7B parameter model with BF16 gradients produces an AllReduce message of approximately 14GB. Test at that size.
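The arithmetic behind that message size is worth writing down once. Note that DDP buckets gradients into smaller messages (~25MB each by default), so individual AllReduce calls are smaller; 14GB is the aggregate gradient volume per step.

```python
# Gradient volume for a 7B-parameter model with BF16 (2-byte) gradients
params = 7e9
bytes_per_grad = 2
total_gb = params * bytes_per_grad / 1e9
print(f"{total_gb:.0f} GB")  # 14 GB

# DDP buckets gradients (~25MB each by default), so a single AllReduce
# call carries far less than this; 14GB is the per-step total.
```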
```
# Install nccl-tests
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests && make MPI=1 CUDA_HOME=/usr/local/cuda

# Run AllReduce from DDP bucket size up to the full gradient
mpirun -np 128 --hostfile hostfile \
  ./build/all_reduce_perf \
  -b 32M -e 14G -f 2 -g 1 \
  -w 5 -n 20

# Key metrics to record:
# - algbw (algorithmic bandwidth) — target: >300 GB/s at the largest sizes
# - busbw (bus bandwidth) — for AllReduce, busbw = 2(n-1)/n × algbw, so expect it above algbw
# - time (latency) — the <2ms target applies at bucket-sized messages, not at the full 14GB
```
Run this before and after applying the NCCL environment variables above. On a properly configured cluster, I typically see algorithmic bandwidth increase from 40–80 GB/s to 280–350 GB/s. That translates directly to reduced step time.
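To translate a bandwidth improvement into step time, divide the gradient volume by the measured algbw. The numbers below are the midpoints of the before/after ranges quoted above, used purely for illustration rather than as measurements:

```python
def sync_time_s(msg_gb, algbw_gb_s):
    """Per-step gradient sync time implied by a measured algorithmic bandwidth."""
    return msg_gb / algbw_gb_s

# Midpoints of the before/after algbw ranges (illustrative, not measured)
before = sync_time_s(14, 60)    # 40-80 GB/s range
after = sync_time_s(14, 315)    # 280-350 GB/s range
print(f"{before:.3f}s -> {after:.3f}s ({before - after:.3f}s saved per step)")
```

Nearly two tenths of a second saved per step, every step, for the rest of the run.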
What to do this week
- Run nccl-tests at your production message size and record algbw and busbw
- Check NCCL_DEBUG=INFO output — confirm InfiniBand is being used, not Ethernet
- Verify nvidia_peermem is loaded on all host nodes
- Apply the environment variable fixes above and re-run nccl-tests
- If algbw is still below 200 GB/s after these fixes, the problem is deeper — likely in the ring topology or HCA binding. That requires profiling with Nsight Systems.
The fixes above take under two hours to apply and verify. If your cluster is running AllReduce above 2ms today, there is no reason it should still be running above 2ms tomorrow.
If you run nccl-tests and algbw is already above 300 GB/s but your training utilization is still below 80%, the bottleneck has moved. It is no longer in AllReduce — it's in compute-communication overlap, or in memory bandwidth, or in kernel launch overhead. Those require Nsight Systems to diagnose. That's the next article.
Every training run you execute with a misconfigured NCCL stack is wasting GPU time you are already paying for. On a 16-node cluster running 24/7, this is not a performance issue. It is a cost leak — and it compounds with every job you run.
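To put a number on the leak, the back-of-envelope version looks like this. The $/GPU-hour rate and idle fraction are assumptions for illustration, not quotes; substitute your own figures.

```python
# Back-of-envelope cost of idle GPU time on a 16-node cluster.
gpus = 16 * 8          # 16 nodes x 8 GPUs
rate = 2.0             # assumed $/GPU-hour for an A100 (varies widely)
hours_per_year = 24 * 365
idle_fraction = 0.20   # e.g. utilization stuck near 75% instead of 95%
waste = gpus * rate * hours_per_year * idle_fraction
print(f"${waste:,.0f}/year")  # roughly $450K at these assumptions
```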
This is fixable — and measurable.
If your cluster is showing these symptoms, the right move is to look at your Nsight trace together. I profile, find the exact bottleneck, and fix it — documented so your team can maintain it. Most clusters I work with recover $200K–$600K/year in efficiency within the first 30 days.
I work with 3 teams at a time. One slot is currently open.