Skip to content

Staff-Level Hard Questions

Use this page for pressure testing. Read the question, answer out loud, then compare your answer to the expected shape.

mindmap
root((Hard Questions))
  Correctness
    replay
    sharding
    optimizer state
  Performance
    all-reduce
    dataloader
    memory
  Reliability
    restart scope
    checkpoint atomicity
    deadlocks
  Staff judgment
    cost
    simplification
    rollout risk
The hard questions usually probe one of these four categories.

1. “How do you know resume is correct?”

Section titled “1. “How do you know resume is correct?””

Strong answer shape:

  • define what “correct” means first
  • list the state that must survive
  • explain how you validate continuity after restore
  • mention that exact replay may differ for some stochastic pipelines unless RNG and sample order are preserved

Strong answer shape:

  • DDP is the simplest correct baseline
  • if the model fits comfortably, FSDP adds complexity without immediate value
  • if memory is the dominant constraint, I would migrate to FSDP and plan for sharded checkpoint semantics

3. “What breaks when world size changes on resume?”

Section titled “3. “What breaks when world size changes on resume?””

Good answer:

  • effective global batch changes
  • optimizer dynamics may change
  • sampler partitioning changes
  • checkpoint state may no longer map cleanly in sharded formats
  • monitoring baselines need reinterpretation

This is a strong place to say:

“I would prefer to resume with the same topology unless the system is explicitly designed for elasticity.”

4. “How would you debug a training job that is slow only every 20 minutes?”

Section titled “4. “How would you debug a training job that is slow only every 20 minutes?””

Reasonable path:

  1. correlate spikes with checkpoint saves
  2. check storage latency and background artifact uploads
  3. inspect node-level contention and noisy neighbors
  4. compare rank-level step breakdowns to see if the stall is systemic or localized

5. “What is the most dangerous silent failure here?”

Section titled “5. “What is the most dangerous silent failure here?””

Best answer:

“Silent data duplication or omission across ranks, because the job can appear healthy, the loss can still decrease, and the resulting model quality degradation is hard to attribute after the fact.”

6. “Why not just make the training loop asynchronous?”

Section titled “6. “Why not just make the training loop asynchronous?””

Because once gradients and optimizer progress are no longer synchronized, the semantics change significantly. You are no longer talking about the same training algorithm or failure model.

Page-worthy signalWhy
no forward progress for a bounded windowindicates hang or severe stall
checkpoint age exceeds thresholdrecovery point is drifting
repeated restart looplikely structural fault rather than transient blip
sample throughput collapseusually user-visible cost inefficiency

8. “How do you decide checkpoint frequency?”

Section titled “8. “How do you decide checkpoint frequency?””

Answer using economics:

  • more frequent saves lower redo work
  • they increase I/O overhead and storage spend
  • the right point depends on failure rate, checkpoint cost, and tolerated recovery point objective

9. “When would you redesign the pipeline instead of tuning it?“

Section titled “9. “When would you redesign the pipeline instead of tuning it?“”
flowchart TD
  A[Persistent pain] --> B{Is bottleneck fundamental?}
  B -->|No| C[Tune parameters]
  B -->|Yes| D[Redesign boundary]
  D --> E[data format]
  D --> F[checkpoint contract]
  D --> G[parallelism strategy]
  D --> H[cluster placement model]
Staff engineers know when the current architecture is the bottleneck, not just the current config.

10. “What would you deliberately not build in the interview?”

Section titled “10. “What would you deliberately not build in the interview?””

A clean answer:

  • full scheduler integration
  • cloud-specific auth and secrets flow
  • production dashboard wiring
  • every exotic parallelism mode

Then add:

“I would preserve interfaces for those pieces so the notebook still demonstrates system boundaries.”