The Setup
On April 21, 2026, I ran a full suite of multi-node MPI bandwidth and latency benchmarks on Google Cloud Platform. Two compute nodes, us-central1-a, OSU Micro-Benchmarks v7.3 — the same benchmark suite used to validate production HPC cluster interconnects.
The goal: characterize bandwidth and latency across message sizes — the same sizes NCCL uses internally for AllReduce gradient synchronization during LLM training.
Key Findings
Full Benchmark Data
| Message Size | Latency (μs) | Bandwidth (MB/s) | Notes |
|---|---|---|---|
| 1B | 20.73 | 0.38 | Baseline floor |
| 1KB | 21.86 | 460.94 | Flat latency zone |
| 4KB | 26.60 | 1,304.12 | Peak efficiency zone |
| 8KB | 31.12 | 965.40 ↓26% | Buffer fragmentation dip |
| 64KB | 158.26 7× JUMP | 1,058.45 | MTU boundary cliff |
| 1MB | 466.37 | 1,105.01 | AllReduce-relevant range |
| 4MB | 1821.39 | 1,530.02 | Peak bandwidth |
Anomaly 1: The 64KB Latency Cliff
At 4KB, latency sits at 26.60μs. Cross the 64KB boundary: 158.26μs — a 7× jump at a single message size boundary.
This is a MTU boundary effect. When a message exceeds the Maximum Transmission Unit of the network fabric, the packet is fragmented. Each fragment is a separate network transaction. Latency compounds multiplicatively.
NCCL's AllReduce splits gradient tensors into chunks and transmits them across the ring or tree topology. If NCCL buffer size isn't explicitly aligned to the fabric MTU — InfiniBand, RoCE, or otherwise — every AllReduce chunk crosses this cliff. The overhead is silent. No errors. Just slow training.
Anomaly 2: The 8KB Bandwidth Dip
At 4KB, bandwidth peaks at 1,304 MB/s. At 8KB it drops to 965 MB/s — a 26% reduction despite the larger message size.
At 8KB the kernel can't assemble DMA transfer buffers cleanly — a well-documented pattern in high-speed network drivers. In production NCCL environments this manifests as step-time variance that looks like hardware instability but is entirely configuration-driven.
Why These Patterns Mirror Production Clusters
These anomalies — the MTU cliff and buffer fragmentation dip — are not GCP-specific quirks. They appear in production GPU clusters with InfiniBand, RoCE, and 400G/800G Ethernet fabrics. Message sizes shift slightly, but the mechanism is identical.
The reason most teams never find them: training logs show step time. Step time is a composite of compute, communication, and I/O. A 30% communication overhead looks like 10% slower training — below the threshold that triggers investigation.
Your cluster isn't slow. Your NCCL configuration is. Buffer sizes, MTU alignment, SHARP enablement, ring vs. tree topology — configuration choices that compound silently across every training step, every day, every month.
What This Costs at Scale
The Fix
Three configuration changes. None require new hardware.
Exact values depend on your fabric. InfiniBand HDR has different MTU characteristics than RoCE v2 or 400G Ethernet. The diagnostic approach is always the same: benchmark first, identify the cliffs, align the buffers.
Methodology & Reproducibility
All benchmark code, raw data, and cluster configuration is published on GitHub. The validation workload uses a 1.3B GPT-style model in JAX with bfloat16 across 8 accelerators, with explicit separation of compute time, AllReduce/communication time, input pipeline time, and XLA compile time per training step.
MFU-equivalent is defined as achieved FLOPs / theoretical FLOPs after normalizing for architecture-specific compilation and interconnect behavior. No production cluster runs begin until instrumentation is validated with all seven metrics logging cleanly.
Full benchmark data + methodology
Raw OSU output · GCP cluster config · Reproducible scripts · Open source
Frequently Asked Questions
What is the MTU cliff in GPU clusters?
The MTU cliff is a latency spike when network messages exceed the fabric's Maximum Transmission Unit. In GPU clusters, this causes packet fragmentation on every AllReduce, silently increasing communication latency up to 7× at the boundary message size — without triggering any error or alert.
How much does NCCL misconfiguration cost in production?
On a 128-GPU A100 cluster at $25K/month, NCCL misconfiguration from MTU misalignment typically wastes $180,000+ annually in compute — with zero new hardware required to fix it. The fix is a configuration change that takes one afternoon.
How do you fix NCCL MTU misalignment?
Set NCCL_BUFFSIZE to 4MB to align with peak bandwidth message size, tune NCCL_MIN_NCHANNELS to avoid the 8KB fragmentation boundary, and enable SHARP for in-network AllReduce if your fabric supports it. Full configuration in the benchmark repo at github.com/sankarbaseone/nydux-gpu-benchmarks.
What benchmark tools were used for these GPU cluster tests?
OSU Micro-Benchmarks v7.3 (osu_bw and osu_latency) on Google Cloud Platform us-central1-a, two compute nodes. The same benchmark suite is used to validate production HPC interconnects. Raw data and reproduction scripts are available at github.com/sankarbaseone/nydux-gpu-benchmarks.
Does Your Cluster Have These Patterns?
Most production GPU clusters do. The MTU cliff and buffer fragmentation patterns are the default state — not exceptions. A full diagnostic identifies exactly where your compute is going and what it costs.
GPU Cluster Diagnostic · $5,000 · Full findings in 5 days · Identifies $300K+ of wasted GPU capacity