A 16-node A100 cluster, 128 GPUs in all, was training a 1.3B-parameter GPT-style model. On paper, that is serious compute. In practice, model FLOPs utilization (MFU) sat at 7%, meaning 93% of available compute was wasted on every training step.
The GPU utilization% metric showed 61% — which looked acceptable. The team assumed slowness was a hardware limitation. The cluster had been running this way for months.
It wasn't a hardware problem. It was a systems problem. Here is exactly what we found and how we fixed it.
## The diagnosis: GPUs were waiting, not computing
The first step was breaking down where time actually goes in a training step. Using Nsight Systems, we profiled a representative training run and decomposed each step into compute time, communication time, and idle time.
For every second of useful computation, the cluster was spending approximately 10 seconds in collective communication and idle waiting. The cluster was not compute-bound. It was communication-bound.
GPU utilization% was actively misleading. It counts a GPU as "utilized" even when it is blocked on a collective communication operation. MFU told the real story: 7% of theoretical FLOPs were being used productively.
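MFU is easy to compute from parameter count and token throughput. A back-of-envelope sketch in Python — the ~6 FLOPs-per-parameter-per-token approximation and the 312 TFLOP/s A100 bf16 peak are standard figures, but the token throughputs below are illustrative examples, not measurements from this cluster:

```python
A100_PEAK_FLOPS = 312e12  # A100 bf16 tensor-core peak, per GPU

def mfu(n_params: float, tokens_per_sec: float, n_gpus: int,
        peak_flops: float = A100_PEAK_FLOPS) -> float:
    """Model FLOPs utilization: achieved training FLOPs / theoretical peak.

    Uses the common ~6 FLOPs per parameter per token approximation
    (forward + backward) for a dense transformer.
    """
    achieved = 6 * n_params * tokens_per_sec
    return achieved / (n_gpus * peak_flops)

# Illustrative throughputs for a 1.3B model on 128 A100s:
print(f"{mfu(1.3e9, 360_000, 128):.1%}")    # communication-bound regime, ~7%
print(f"{mfu(1.3e9, 2_050_000, 128):.1%}")  # healthy regime, ~40%
```

Unlike GPU utilization%, this number cannot be inflated by GPUs that are "busy" waiting on collectives: only tokens actually processed count.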
## Root causes: three fixable configuration errors
Three specific issues were responsible for the entire performance gap. None required hardware changes. All were configuration-level interventions. Of these, NCCL misconfiguration and AllReduce overhead accounted for the majority of lost time. Batch size amplified both.
The cluster was running with default NCCL settings. For a 16-node topology, the default ring-based AllReduce was suboptimal. NCCL socket buffer sizes were not matched to the network fabric, causing fragmented communication patterns and increased per-operation latency. Every gradient synchronisation step was 3–4× slower than it needed to be.
The micro-batch size left GPUs underloaded between communication rounds. Each GPU was processing too few tokens per step to keep its compute units saturated. GPUs were finishing compute work quickly — then sitting idle waiting for the next AllReduce to complete. Short compute window, long wait window.
The combination of suboptimal NCCL settings and small batch size meant that the ratio of communication time to compute time was severely skewed. Communication and compute were not overlapping — the backward pass completed before gradients began synchronising, creating a sequential bottleneck that compounded across all 128 GPUs.
## The intervention: three changes, measured individually
Three targeted changes. No hardware upgrades. No architectural changes to the model. The fixes were not applied simultaneously: each was tested and measured on its own, so the impact of every intervention could be isolated.
### Fix 01 — NCCL tuning
Reconfigured NCCL for the specific 16-node InfiniBand topology. Set socket thread count and buffer sizes to match the network fabric. Enabled GPU Direct RDMA across nodes to reduce CPU involvement in inter-node gradient transfers.
```bash
export NCCL_SOCKET_NTHREADS=4
export NCCL_NSOCKS_PERTHREAD=4
export NCCL_BUFFSIZE=4194304
export NCCL_NET_GDR_LEVEL=5
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
```
### Fix 02 — Batch size increase
Increased micro-batch size using gradient accumulation to maintain the effective batch size while keeping GPUs compute-saturated between communication rounds. Tokens processed per GPU per step increased from ~2K to ~8K — giving AllReduce operations more useful work to amortise over.
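The batch-size arithmetic can be sketched as follows. The ~2K → ~8K tokens-per-GPU figure comes from the audit; the accumulation-step counts are hypothetical values chosen to illustrate how the effective batch size is preserved:

```python
def effective_batch_tokens(micro_batch_tokens: int, accum_steps: int,
                           n_gpus: int) -> int:
    """Tokens contributing to one optimizer step under gradient accumulation."""
    return micro_batch_tokens * accum_steps * n_gpus

# Before: ~2K tokens per GPU per micro-step, many small micro-steps.
before = effective_batch_tokens(2_000, accum_steps=16, n_gpus=128)

# After: ~8K tokens per GPU per micro-step, 4x fewer accumulation rounds,
# so each AllReduce amortises over 4x more compute.
after = effective_batch_tokens(8_000, accum_steps=4, n_gpus=128)

assert before == after  # effective batch size is preserved
```

The optimizer sees the same effective batch either way; what changes is how long each GPU stays compute-saturated between synchronisation points.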
### Fix 03 — Communication / compute overlap
Enabled asynchronous gradient communication to overlap AllReduce operations with the backward pass. This hid a significant portion of communication latency behind compute that was already happening — eliminating the sequential bottleneck entirely.
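A toy timing model makes the effect concrete. Frameworks such as PyTorch DDP achieve this overlap by bucketing gradients and launching AllReduce on completed buckets while the backward pass is still computing earlier layers. The numbers below are illustrative, not measurements from this cluster:

```python
def step_time(t_compute: float, t_comm: float, overlap: float) -> float:
    """Wall time per step when a fraction `overlap` of the AllReduce time
    is launched asynchronously behind the backward pass.

    At most `t_compute` seconds of communication can be hidden.
    """
    hidden = min(t_comm * overlap, t_compute)
    return t_compute + (t_comm - hidden)

sequential = step_time(t_compute=1.0, t_comm=3.0, overlap=0.0)  # 4.0 s
overlapped = step_time(t_compute=1.0, t_comm=3.0, overlap=0.9)  # 3.0 s
print(sequential, overlapped)
```

The model also shows the limit: when communication dwarfs compute, overlap alone cannot hide it all. That is why the NCCL tuning and batch-size fixes had to land first — they shrink `t_comm` and grow `t_compute` until overlap can hide most of what remains.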
One change. Measured. Then the next. This is how you know what actually moved the number — and how you replicate the result on a different cluster.
## The result
| Metric | Before | After | Change |
|---|---|---|---|
| MFU | 7% | 40% | +33pp |
| Training throughput | Baseline | 4× baseline | +300% |
| Training time (full run) | 7 days | 18 hours | −89% |
| Hardware | 128× A100 | 128× A100 | Unchanged |
| Annual savings | — | $480K–$600K | Per cluster |
The cluster did not change. The model did not change. The dataset did not change. Three configuration interventions moved MFU from 7% to 40% and compressed a 7-day training run into 18 hours. Before the changes, the cluster was effectively delivering less than a fifth of the throughput it proved capable of.
## What this means for your cluster
If your cluster is running below 40% MFU, you likely have at least one of these three issues. The signs are recognisable. If you're running a multi-node GPU cluster, this is not an edge case. This is the default state.
- GPU utilization% looks acceptable but training is slower than expected
- Scaling to more GPUs does not improve throughput proportionally
- Step times are inconsistent or dominated by communication phases
- Your team attributes slowness to model size or hardware limitations

If those signs look familiar, run through this checklist:

- Is communication >50% of your step time in Nsight Systems?
- Does scaling beyond 8 GPUs degrade per-GPU efficiency?
- Are GPUs idle between AllReduce operations?
- Did your team tune NCCL settings for your specific network topology?
Most AI teams have a systems problem, not a hardware problem. Every 10 percentage points of MFU you recover is six figures in annual savings on a production cluster. Let's look at your cluster together →
This is fixable — and measurable.
If your cluster is showing these symptoms, the right move is to profile it together. I identify the exact bottleneck — NCCL, batch configuration, communication overhead — and fix it with documented results your team can maintain. Most clusters I work with recover $200K–$600K/year in efficiency within the first 30 days.
GPU Cluster Health Check: $750 introductory rate · 3–5 day remote audit · written report with specific fixes.