The cost you're not measuring
$1,200–$1,500/hr
What a 128-GPU A100 cluster burning at 7% MFU wastes every hour at on-demand pricing. This cluster ran that way for months before anyone looked at the communication layer.

A 16-node A100 cluster — 128 GPUs — was training a 1.3B-parameter GPT-style model. On paper, this was serious compute. In practice, MFU (Model FLOPs Utilization) sat at 7%. That means 93% of available compute was wasted on every training step.

The GPU utilization% metric showed 61% — which looked acceptable. The team assumed slowness was a hardware limitation. The cluster had been running this way for months.

It wasn't a hardware problem. It was a systems problem. Here is exactly what we found and how we fixed it.

The diagnosis: GPUs were waiting, not computing

The first step was breaking down where time actually goes in a training step. Using Nsight Systems, we profiled a representative training run and decomposed each step into compute time, communication time, and idle time.

Step-time breakdown (before):
  • Communication: 72%
  • Idle / waiting: 21%
  • Actual compute: 7%

For every second of useful computation, the cluster was spending roughly 13 seconds in collective communication and idle waiting (93% vs 7% of step time). The cluster was not compute-bound. It was communication-bound.

GPU utilization% was actively misleading. It counts a GPU as "utilized" even when it is blocked on a collective communication operation. MFU told the real story: 7% of theoretical FLOPs were being used productively.
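MFU is straightforward to compute from numbers a training job already logs. A minimal sketch, assuming the standard ~6·N·D FLOPs-per-token approximation for dense transformers and the A100's 312 TFLOP/s BF16 peak; the step time and token counts below are illustrative placeholders, not this cluster's actual logs:

```python
def mfu(n_params: float, tokens_per_step: float, step_time_s: float,
        n_gpus: int, peak_flops: float = 312e12) -> float:
    """Model FLOPs Utilization: useful FLOPs achieved / theoretical peak.

    Uses the common ~6 FLOPs per parameter per token approximation
    for a dense transformer (forward + backward pass combined).
    """
    useful_flops = 6 * n_params * tokens_per_step
    peak_available = step_time_s * n_gpus * peak_flops
    return useful_flops / peak_available

# Illustrative numbers (NOT the case study's logs): a 1.3B model pushing
# 256K tokens per step across 128 A100s with a 0.7 s step time.
print(f"{mfu(1.3e9, 262_144, 0.7, 128):.1%}")  # → 7.3%
```

The key point: MFU divides by *peak* FLOPs, so a GPU blocked on an AllReduce drags it down, whereas GPU utilization% happily counts that time as busy.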

Root causes: three fixable configuration errors

Three specific issues were responsible for the entire performance gap. None required hardware changes. All were configuration-level interventions. Of these, NCCL misconfiguration and AllReduce overhead accounted for the majority of lost time. Batch size amplified both.

// Root cause 01
NCCL misconfiguration

The cluster was running with default NCCL settings. For a 16-node topology, the default ring-based AllReduce was suboptimal. NCCL socket buffer sizes were not matched to the network fabric, causing fragmented communication patterns and increased per-operation latency. Every gradient synchronisation step was 3–4× slower than it needed to be.
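The latency sensitivity is visible in the standard first-order cost model for ring AllReduce: each of the 2·(n−1) ring steps pays the per-step latency, so fragmenting one large gradient tensor into many small operations multiplies the latency term while the bandwidth term stays fixed. A sketch with placeholder link figures (not measurements from this cluster):

```python
def ring_allreduce_seconds(size_bytes, n_ranks, link_bytes_per_s, per_step_latency_s):
    """First-order ring AllReduce cost model: 2*(n-1)/n of the data crosses
    each link, and each of the 2*(n-1) steps pays the per-step latency."""
    bandwidth_term = 2 * (n_ranks - 1) / n_ranks * size_bytes / link_bytes_per_s
    latency_term = 2 * (n_ranks - 1) * per_step_latency_s
    return bandwidth_term + latency_term

# One 1 GB gradient AllReduce vs. the same bytes fragmented into 256 small
# ops, on a hypothetical 12 GB/s link with 20 us per-step latency, 128 ranks.
one_op = ring_allreduce_seconds(1e9, 128, 12e9, 20e-6)
fragmented = 256 * ring_allreduce_seconds(1e9 / 256, 128, 12e9, 20e-6)
# Same bytes moved, but the latency term is paid 256 times over.
```

With these placeholder numbers the fragmented pattern is several times slower despite moving identical data, which is exactly the failure mode mismatched buffer sizes produce.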

// Root cause 02
Small batch size — GPU starvation

The micro-batch size left GPUs underloaded between communication rounds. Each GPU was processing too few tokens per step to keep its compute units saturated. GPUs were finishing compute work quickly — then sitting idle waiting for the next AllReduce to complete. Short compute window, long wait window.

// Root cause 03
AllReduce communication overhead

The combination of suboptimal NCCL settings and small batch size meant that the ratio of communication time to compute time was severely skewed. Communication and compute were not overlapping — the backward pass completed before gradients began synchronising, creating a sequential bottleneck that compounded across all 128 GPUs.

The intervention: three changes, measured individually

Three targeted changes. No hardware upgrades. No architectural changes to the model. The fixes were not applied simultaneously — each was applied and measured independently to isolate its impact.

Fix 01 — NCCL tuning

Reconfigured NCCL for the specific 16-node InfiniBand topology. Set socket thread count and buffer sizes to match the network fabric. Enabled GPU Direct RDMA across nodes to reduce CPU involvement in inter-node gradient transfers.

NCCL environment variables — applied per training node
export NCCL_SOCKET_NTHREADS=4    # CPU threads per network socket connection
export NCCL_NSOCKS_PERTHREAD=4   # sockets opened per thread
export NCCL_BUFFSIZE=4194304     # 4 MiB communication buffer per channel
export NCCL_NET_GDR_LEVEL=5      # allow GPU Direct RDMA at any topology distance
export NCCL_IB_DISABLE=0         # keep the InfiniBand transport enabled
export NCCL_IB_GID_INDEX=3       # GID index for RoCE v2 / IB addressing
export NCCL_DEBUG=INFO           # optional: log transport/ring selection to verify settings took effect

Fix 02 — Batch size increase

Increased micro-batch size using gradient accumulation to maintain the effective batch size while keeping GPUs compute-saturated between communication rounds. Tokens processed per GPU per step increased from ~2K to ~8K — giving AllReduce operations more useful work to amortise over.
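The relationship between micro-batch size, accumulation steps, and synchronisation frequency is simple arithmetic. A sketch (the 4M-token effective batch is an assumption for illustration; the source states only the ~2K → ~8K per-GPU change):

```python
def accumulation_steps(effective_batch_tokens, micro_batch_tokens_per_gpu, n_gpus):
    """Micro-batches each GPU accumulates before one AllReduce + optimizer step.
    Effective batch = micro batch * accumulation steps * GPU count."""
    return effective_batch_tokens // (micro_batch_tokens_per_gpu * n_gpus)

EFFECTIVE = 4 * 1024 * 1024  # hypothetical 4M-token global batch

before = accumulation_steps(EFFECTIVE, 2048, 128)   # ~2K tokens per GPU
after = accumulation_steps(EFFECTIVE, 8192, 128)    # ~8K tokens per GPU
print(before, after)  # → 16 4
```

Same effective batch, 4× fewer gradient synchronisations per optimizer step, and each AllReduce amortised over 4× more compute.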

Fix 03 — Communication / compute overlap

Enabled asynchronous gradient communication to overlap AllReduce operations with the backward pass. This hid most of the communication latency behind compute that was already happening, removing the sequential bottleneck from the critical path.
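The effect of overlap can be captured with a first-order timing model. Sequential execution pays compute plus communication; ideal bucketed overlap (PyTorch-DDP-style gradient buckets) exposes only the communication time that exceeds the backward pass. A sketch with illustrative durations:

```python
def step_seconds(backward_s: float, allreduce_s: float, overlapped: bool) -> float:
    """First-order step-time model for gradient synchronisation."""
    if overlapped:
        # AllReduce of early gradient buckets runs while later layers are
        # still in backward; only the excess communication stays exposed.
        return max(backward_s, allreduce_s)
    # Sequential: backward completes fully, then gradients synchronise.
    return backward_s + allreduce_s

# Illustrative: 1.0 s of backward compute vs 3.0 s of communication.
print(step_seconds(1.0, 3.0, overlapped=False))  # → 4.0
print(step_seconds(1.0, 3.0, overlapped=True))   # → 3.0
```

Note the model also shows why overlap alone was not enough here: when communication dominates (3.0 s vs 1.0 s), overlap only hides the smaller term, which is why the NCCL and batch-size fixes had to shrink the communication side first.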

Rule applied throughout

One change. Measured. Then the next. This is how you know what actually moved the number — and how you replicate the result on a different cluster.

The result

Metric                     | Before     | After        | Change
MFU                        | 7%         | 40%          | +33 pp
Training throughput        | Baseline   | 4× baseline  | +300%
Training time (full run)   | 7 days     | 18 hours     | −89%
Hardware                   | 128× A100  | 128× A100    | Unchanged
Annual savings             | —          | $480K–$600K  | Per cluster

The cluster did not change. The model did not change. The dataset did not change. Three configuration interventions moved MFU from 7% to 40% and compressed a 7-day training run into 18 hours. Effectively, the cluster had been delivering less than one-fifth of the capacity it later demonstrated.

// Production result — 16-node A100 cluster · 1.3B GPT-style model · NCCL + batch optimisation · 2025
MFU: 7% → 40%
Throughput: 4× baseline
Training time: 7d → 18h
Annual savings: $480K+
At $3/hr per A100 on-demand, recovering 33 percentage points of MFU across 128 GPUs running continuously translates to $480K–$600K in annual savings — from a single cluster, with zero additional hardware spend. The cluster hardware did not change. Only the communication configuration and batch sizing changed.
Step-time breakdown (after):
  • Communication: 22%
  • Idle / waiting: 8%
  • Actual compute: 70%

What this means for your cluster

If your cluster is running below 40% MFU, you likely have at least one of these three issues. The signs are recognisable. If you're running a multi-node GPU cluster, this is not an edge case. This is the default state.

Symptoms — check your cluster now
  • GPU utilization% looks acceptable but training is slower than expected
  • Scaling to more GPUs does not improve throughput proportionally
  • Step times are inconsistent or dominated by communication phases
  • Your team attributes slowness to model size or hardware limitations
// Quick self-check
  • Is communication >50% of your step time in Nsight Systems?
  • Does scaling beyond 8 GPUs degrade per-GPU efficiency?
  • Are GPUs idle between AllReduce operations?
  • Did your team tune NCCL settings for your specific network topology?
If yes to any of these — you are leaving significant compute on the table. The fixes take days, not months.

Most AI teams have a systems problem, not a hardware problem. Every 10% of MFU you recover is six figures in annual savings on a production cluster. Let's look at your cluster together →

This is fixable — and measurable.

If your cluster is showing these symptoms, the right move is to profile it together. I identify the exact bottleneck — NCCL, batch configuration, communication overhead — and fix it with documented results your team can maintain. Most clusters I work with recover $200K–$600K/year in efficiency within the first 30 days.

GPU Cluster Health Check: $750 introductory rate · 3–5 day remote audit · written report with specific fixes.