If your training cluster is running AllReduce above 2ms on InfiniBand NDR, you are losing time on every single gradient synchronization step. Not occasionally. Every step. Across every GPU. For the entire duration of your training run.
This is not a framework problem. It is not a model architecture problem. It is not solved by tuning batch size, learning rate, or gradient accumulation steps. It lives in the communication layer — and most teams never look there.
Do you have this problem?
Run this diagnostic before reading further. If you answer yes to two or more of these, you have an AllReduce configuration problem, not a hardware problem:
- GPU utilization sits at 60–75% during distributed training instead of above 80%
- Step time variance exceeds 10% between runs with identical batch sizes
- nccl-tests shows AllReduce above 2ms on a 16-node InfiniBand cluster
- You've increased batch size but throughput (samples/sec) didn't improve proportionally
- Nsight Systems shows idle GPU gaps between the end of the backward pass and the start of the next step
- You are running NCCL without explicitly setting NCCL_IB_GID_INDEX or NCCL_SOCKET_IFNAME
If you have not run nccl-tests on your cluster in the last 30 days, you do not have a baseline. You cannot diagnose what you have not measured.
Root cause: it is almost never the hardware
The AllReduce operation synchronizes gradients across all GPUs after each backward pass. On a 16-node A100 cluster with InfiniBand NDR 400Gbps, a correctly configured AllReduce on a 7B parameter model should complete in under 1ms. When I see 4ms, the cluster has the hardware to do better — it's just not using it correctly.
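That sub-millisecond figure can be sanity-checked against the bandwidth term of a ring AllReduce. The sketch below assumes PyTorch DDP's default ~25MB gradient buckets and one 400Gbps NDR link per GPU; it ignores per-hop latency and protocol overhead, so treat it as a lower bound, not a prediction.

```python
def ring_allreduce_time(size_bytes, n_gpus, link_gbps):
    """Bandwidth-only lower bound for ring AllReduce: each GPU sends and
    receives 2*(n-1)/n * size_bytes over its slowest link. Ignores per-hop
    latency and protocol overhead."""
    link_bytes_per_s = link_gbps / 8 * 1e9
    return 2 * (n_gpus - 1) / n_gpus * size_bytes / link_bytes_per_s

# One ~25MB DDP gradient bucket across 128 GPUs on a 400Gbps NDR link
t = ring_allreduce_time(25e6, 128, 400)
print(f"{t * 1e3:.2f} ms")  # ≈1 ms at full line rate
```

At 25MB per bucket the bound lands right around 1ms, which is why a healthy cluster should sit under the 2ms line with room to spare.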
There are three places the problem almost always lives:
1. NCCL is not using InfiniBand correctly
By default, NCCL will attempt to auto-detect your network interface. On clusters with multiple NICs — which is most production HPC clusters — this auto-detection picks the wrong interface. It falls back to TCP over Ethernet instead of using GPUDirect RDMA over InfiniBand. The result: your 400Gbps InfiniBand fabric is idle while NCCL sends gradients through your 10Gbps management network.
```
NCCL_DEBUG=INFO torchrun --nproc_per_node=8 your_script.py 2>&1 | grep -E "NET|IB|SOCKET"

# You should see:
# NCCL INFO NET/IB : Using [0]mlx5_0:1/IB
# NOT:
# NCCL INFO NET/Socket : Using [0]eth0
```
```
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=ib0
export NCCL_IB_HCA=mlx5_0,mlx5_1   # match your HCA names
export NCCL_NET_GDR_LEVEL=5        # allow GPUDirect RDMA at maximum NIC-GPU distance
```
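If you save the NCCL_DEBUG=INFO output to a file, the transport check can be automated across nodes. This is a hypothetical helper, not part of NCCL, and the exact log wording varies between NCCL versions, so adjust the patterns to what your version prints.

```python
import re

def nccl_transport(log_text):
    """Classify which network transport NCCL selected from NCCL_DEBUG=INFO
    output. Returns 'ib', 'socket', or 'unknown'. Illustrative parser;
    log wording differs across NCCL versions."""
    if re.search(r"NET/IB\s*:\s*Using", log_text):
        return "ib"
    if re.search(r"NET/Socket\s*:\s*Using", log_text):
        return "socket"
    return "unknown"

log = "node0:1234:1234 [0] NCCL INFO NET/Socket : Using [0]eth0\n"
print(nccl_transport(log))  # 'socket' means you are on the TCP fallback path
```

Run it against the log from every rank; a single node on the socket path is enough to drag the whole collective down.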
2. GPUDirect RDMA is disabled or misconfigured
GPUDirect RDMA allows the GPU to communicate directly with the InfiniBand HCA without going through CPU or system memory. When it works, it removes 40–60% of the latency from each AllReduce operation. When it's misconfigured, NCCL silently falls back to a CPU-mediated path with no warning.
The most common misconfiguration: nvidia_peermem is not loaded on the host. This module enables GPU-to-NIC direct memory access. Without it, GPUDirect RDMA cannot function regardless of your NCCL environment variables.
```
# Check the module is loaded
lsmod | grep nvidia_peermem

# Verify in NCCL logs — you should see:
# NCCL INFO GDR enabled for peer access
# NOT:
# NCCL INFO GDR disabled (no peer access)
```
3. SHARP in-network aggregation is not enabled
Modern InfiniBand switches support SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) — in-network computing that offloads AllReduce computation to the switch fabric itself. On a 16-node cluster, SHARP can reduce AllReduce latency by 30–50% by performing gradient aggregation in the network rather than at the endpoints.
Most teams do not enable it. The switch supports it. The HCA supports it. NCCL supports it. Nobody turned it on.
```
# SHARP reaches NCCL through the CollNet plugin
# (nccl-rdma-sharp-plugins, shipped with HPC-X)
export NCCL_COLLNET_ENABLE=1
export SHARP_COLL_LOG_LEVEL=3   # verify SHARP is actually active in the logs
```
How to measure the impact
Before applying any fix, establish a baseline. Run nccl-tests with your actual message sizes — not the default. A 7B parameter model with BF16 gradients produces an AllReduce message of approximately 14GB. Test at that size.
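The arithmetic behind that message size is worth writing down once. Note that DDP buckets gradients into smaller messages (~25MB each by default), so individual AllReduce calls are smaller; 14GB is the aggregate gradient volume per step.

```python
# Gradient volume for a 7B-parameter model with BF16 (2-byte) gradients
params = 7e9
bytes_per_grad = 2
total_gb = params * bytes_per_grad / 1e9
print(f"{total_gb:.0f} GB")  # 14 GB

# DDP buckets gradients (~25MB each by default), so a single AllReduce
# call carries far less than this; 14GB is the per-step total.
```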
```
# Install nccl-tests
git clone https://github.com/NVIDIA/nccl-tests
cd nccl-tests && make MPI=1 CUDA_HOME=/usr/local/cuda

# Run AllReduce from DDP bucket size up to the full gradient
mpirun -np 128 --hostfile hostfile \
  ./build/all_reduce_perf \
  -b 32M -e 14G -f 2 -g 1 \
  -w 5 -n 20

# Key metrics to record:
# - algbw (algorithmic bandwidth) — target: >300 GB/s at the largest sizes
# - busbw (bus bandwidth) — for AllReduce, busbw = 2(n-1)/n × algbw, so expect it above algbw
# - time (latency) — the <2ms target applies at bucket-sized messages, not at the full 14GB
```
Run this before and after applying the NCCL environment variables above. On a properly configured cluster, I typically see algorithmic bandwidth increase from 40–80 GB/s to 280–350 GB/s. That translates directly to reduced step time.
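To translate a bandwidth improvement into step time, divide the gradient volume by the measured algbw. The numbers below are the midpoints of the before/after ranges quoted above, used purely for illustration rather than as measurements:

```python
def sync_time_s(msg_gb, algbw_gb_s):
    """Per-step gradient sync time implied by a measured algorithmic bandwidth."""
    return msg_gb / algbw_gb_s

# Midpoints of the before/after algbw ranges (illustrative, not measured)
before = sync_time_s(14, 60)    # 40-80 GB/s range
after = sync_time_s(14, 315)    # 280-350 GB/s range
print(f"{before:.3f}s -> {after:.3f}s ({before - after:.3f}s saved per step)")
```

Nearly two tenths of a second saved per step, every step, for the rest of the run.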
What to do this week
- Run nccl-tests at your production message size and record algbw and busbw
- Check NCCL_DEBUG=INFO output — confirm InfiniBand is being used, not Ethernet
- Verify nvidia_peermem is loaded on all host nodes
- Apply the environment variable fixes above and re-run nccl-tests
- If algbw is still below 200 GB/s after these fixes, the problem is deeper — likely in the ring topology or HCA binding. That requires profiling with Nsight Systems.
The fixes above take under two hours to apply and verify. If your cluster is running AllReduce above 2ms today, there is no reason it should still be running above 2ms tomorrow.
If you run nccl-tests and algbw is already above 300 GB/s but your training utilization is still below 80%, the bottleneck has moved. It is no longer in AllReduce — it's in compute-communication overlap, or in memory bandwidth, or in kernel launch overhead. Those require Nsight Systems to diagnose. That's the next article.
Every training run you execute with a misconfigured NCCL stack is wasting GPU time you are already paying for. On a 16-node cluster running 24/7, this is not a performance issue. It is a cost leak — and it compounds with every job you run.
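To put a number on the leak, the back-of-envelope version looks like this. The $/GPU-hour rate and idle fraction are assumptions for illustration, not quotes; substitute your own figures.

```python
# Back-of-envelope cost of idle GPU time on a 16-node cluster.
gpus = 16 * 8          # 16 nodes x 8 GPUs
rate = 2.0             # assumed $/GPU-hour for an A100 (varies widely)
hours_per_year = 24 * 365
idle_fraction = 0.20   # e.g. utilization stuck near 75% instead of 95%
waste = gpus * rate * hours_per_year * idle_fraction
print(f"${waste:,.0f}/year")  # roughly $450K at these assumptions
```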
This is fixable — and measurable.
If your cluster is showing these symptoms, the right move is to look at your Nsight trace together. I profile, find the exact bottleneck, and fix it — documented so your team can maintain it. Most clusters I work with recover $200K–$600K/year in efficiency within the first 30 days.
I work with 3 teams at a time. One slot is currently open.