Performance Pathologies
The performance conversation gets much better once you stop saying “the GPUs are underutilized” and start naming the stage that is stalling.
Step Time Decomposition
flowchart LR
A[Batch wait] --> B[H2D copy]
B --> C[Forward]
C --> D[Backward]
D --> E[Gradient sync]
E --> F[Optimizer step]
F --> G[Checkpoint / logging side work]
First Questions To Ask
| Question | Why |
|---|---|
| Is step time stable or bursty? | Bursty often means I/O or background contention (see the sketch after this table). |
| Is the slowdown rank-local or global? | Local issues suggest hardware, data, or placement skew. |
| Does the gap appear before backward or during gradient sync? | Separates compute inefficiency from communication bottlenecks. |
| Did the gap appear after a memory-saving change? | Activation checkpointing and sharding can trade memory for latency. |
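The first two questions usually need nothing fancier than per-step wall-clock logging on every rank. A minimal sketch, assuming each rank keeps its own list of recent step durations (the `step_times` input and the percentile choice are illustrative, not from any particular framework):

```python
import statistics


def summarize_step_times(step_times, rank):
    """Report median and tail latency for one rank's recent steps."""
    p50 = statistics.median(step_times)
    p95 = statistics.quantiles(step_times, n=20)[-1]  # 19 cut points; the last is p95
    print(f"rank={rank} p50={p50:.3f}s p95={p95:.3f}s tail ratio={p95 / p50:.2f}")
    return p50, p95
```

A tail ratio near 1 on every rank points at a uniformly slow stage; a ratio that blows up on a single rank points at rank-local hardware, data, or placement skew.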
Instrumented Training Step
```python
def timed_train_step(model, optimizer, batch, timer, scaler=None):
    with timer("h2d"):
        batch = move_to_device(batch)

    with timer("forward"):
        outputs = model(batch["inputs"])
        loss = compute_loss(outputs, batch["targets"])

    with timer("backward"):
        if scaler:
            scaler.scale(loss).backward()
        else:
            loss.backward()

    with timer("optimizer"):
        if scaler:
            scaler.step(optimizer)
            scaler.update()
        else:
            optimizer.step()
        optimizer.zero_grad(set_to_none=True)

    return loss
```
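The `timer` argument above is assumed rather than defined. One plausible shape is a context manager that accumulates wall-clock seconds per stage; note that CUDA kernels launch asynchronously, so wall-clock numbers only attribute GPU time to the right stage if you synchronize, which itself perturbs the step. A sketch under those assumptions (`StageTimer` is an invented name):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

import torch


class StageTimer:
    """Accumulate wall-clock seconds per named stage of a training step."""

    def __init__(self, synchronize=False):
        self.totals = defaultdict(float)
        # Synchronizing attributes GPU time to the right stage,
        # at the cost of serializing host and device.
        self.synchronize = synchronize

    @contextmanager
    def __call__(self, name):
        if self.synchronize and torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        try:
            yield
        finally:
            if self.synchronize and torch.cuda.is_available():
                torch.cuda.synchronize()
            self.totals[name] += time.perf_counter() - start
```

Used as `timer = StageTimer(synchronize=True)` followed by `timed_train_step(model, optimizer, batch, timer)`; for production-grade attribution you would reach for the PyTorch profiler rather than hand-rolled timers.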
Current PyTorch Notes
The current official AMP docs and activation-checkpoint docs add useful precision:
- with autocast, you should not manually call `half()` or `bfloat16()` on your model or inputs just to “do AMP right”
- autocast should wrap the forward pass and loss computation; backward under autocast is not the recommended pattern
- activation checkpointing still fundamentally trades compute for memory
- preserving RNG state across activation-checkpoint recomputation improves determinism but can cost performance
That is useful interview material because it turns a vague “use mixed precision and checkpointing” answer into an operational one.
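A sketch of what those notes look like in code, assuming a recent PyTorch; `model.blocks`, `model.head`, `compute_loss`, and the float16 choice are illustrative stand-ins rather than anything from the docs:

```python
import torch
from torch.utils.checkpoint import checkpoint

scaler = torch.amp.GradScaler("cuda")  # float16 needs loss scaling; bfloat16 usually does not


def forward_and_loss(model, batch):
    # autocast wraps forward + loss only; model and inputs stay float32,
    # with no manual .half() / .bfloat16() calls anywhere.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        hidden = batch["inputs"]
        for block in model.blocks:  # model.blocks / model.head are illustrative
            # recompute this block's activations during backward instead of storing them;
            # use_reentrant=False is the currently recommended variant, and
            # preserve_rng_state=True keeps dropout deterministic at some extra cost
            hidden = checkpoint(block, hidden,
                                use_reentrant=False, preserve_rng_state=True)
        loss = compute_loss(model.head(hidden), batch["targets"])
    return loss


loss = forward_and_loss(model, batch)
scaler.scale(loss).backward()           # backward runs outside autocast
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad(set_to_none=True)
```

With bfloat16 autocast, the scale/step/update dance usually disappears because bfloat16 keeps float32's exponent range.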
Common Bottleneck Patterns
flowchart TD
A[Throughput drop] --> B{Main symptom}
B --> C[GPU idle before kernels]
B --> D[Long backward tail]
B --> E[Spiky step latency]
B --> F[OOM after scaling]
C --> G[Input pipeline or H2D issue]
D --> H[All-reduce / bucketization / topology issue]
E --> I[Checkpointing, storage, or noisy neighbor]
F --> J[Activation, optimizer, or fragmentation issue]
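For the “GPU idle before kernels” branch, the usual first levers are pinned host memory and non-blocking copies so H2D overlaps with compute. A minimal sketch, assuming a standard map-style `train_dataset`; the worker and prefetch numbers are illustrative starting points, and this is only one plausible shape for the `move_to_device` helper used earlier:

```python
import torch
from torch.utils.data import DataLoader

loader = DataLoader(
    train_dataset,
    batch_size=32,
    num_workers=4,            # hides decode/augment cost, but can oversubscribe CPUs
    pin_memory=True,          # page-locked host buffers enable async H2D copies
    prefetch_factor=2,        # batches pre-loaded per worker
    persistent_workers=True,
)


def move_to_device(batch, device="cuda"):
    # non_blocking only overlaps with compute when the source tensor is pinned
    return {k: v.to(device, non_blocking=True) for k, v in batch.items()}
```

If batch wait still dominates after this, the problem is usually upstream of the loader: decode cost, shuffling over slow storage, or too few CPU cores per GPU.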
Tuning Levers With Honest Tradeoffs
| Lever | Upside | Risk |
|---|---|---|
| larger batch size | better device occupancy | optimization behavior changes, memory pressure increases |
| mixed precision | more throughput, lower memory | numerical edge cases, scaler handling |
| more loader workers | better CPU parallelism | oversubscription and context-switch overhead |
| gradient accumulation | emulate larger global batch (sketched after this table) | longer optimizer feedback loop |
| activation checkpointing | memory relief | extra recompute increases latency |
| NCCL/env tuning | better comm efficiency | cluster-specific and hard to generalize live |
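Gradient accumulation is the lever interviewers most often ask to see in code. A minimal sketch, assuming the loss is averaged over micro-batches so the accumulated gradient matches a single large-batch step (`accum_steps` and the loop structure are illustrative):

```python
accum_steps = 8  # effective global batch = micro-batch size * accum_steps * data-parallel world size

optimizer.zero_grad(set_to_none=True)
for i, batch in enumerate(loader):
    outputs = model(batch["inputs"])
    # divide so the accumulated gradient matches a single large-batch step
    loss = compute_loss(outputs, batch["targets"]) / accum_steps
    loss.backward()
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```

Under DistributedDataParallel, wrapping the non-final micro-batches in `model.no_sync()` avoids paying an all-reduce per micro-batch.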
The Staff Angle
A staff-level answer connects performance to platform economics:
- cost per successful training hour (see the toy calculation after this list)
- storage bandwidth consumed by checkpoints
- cluster fragmentation caused by rigid topology requirements
- debugging burden introduced by aggressive optimization
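The first bullet is the easiest to make concrete in the room. A toy calculation, with every number invented purely for illustration:

```python
# all numbers below are invented for illustration
gpu_count = 512
dollars_per_gpu_hour = 2.50
wall_clock_hours = 100.0
hours_lost = 18.0  # restarts, re-warmup after failures, checkpoint restores

cluster_dollars = gpu_count * dollars_per_gpu_hour * wall_clock_hours
cost_per_successful_hour = cluster_dollars / (wall_clock_hours - hours_lost)
print(f"${cost_per_successful_hour:,.0f} per successful training hour")
```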