Staff-Level Hard Questions

Use this page for pressure testing. Read the question, answer out loud, then compare your answer to the expected shape.

Question Taxonomy

mindmap
root((Hard Questions))
  Correctness
    replay
    sharding
    optimizer state
  Performance
    all-reduce
    dataloader
    memory
  Reliability
    restart scope
    checkpoint atomicity
    deadlocks
  Staff judgment
    cost
    simplification
    rollout risk

The hard questions usually probe one of these four categories.

1. “How do you know resume is correct?”

Strong answer shape:

define what “correct” means first
list the state that must survive
explain how you validate continuity after restore
mention that exact replay may differ for some stochastic pipelines unless RNG and sample order are preserved

2. “Why DDP instead of FSDP?”

Strong answer shape:

DDP is the simplest correct baseline
if the model fits comfortably, FSDP adds complexity without immediate value
if memory is the dominant constraint, I would migrate to FSDP and plan for sharded checkpoint semantics

3. “What breaks when world size changes on resume?”

Good answer:

effective global batch changes
optimizer dynamics may change
sampler partitioning changes
checkpoint state may no longer map cleanly in sharded formats
monitoring baselines need reinterpretation

This is a strong place to say:

“I would prefer to resume with the same topology unless the system is explicitly designed for elasticity.”

4. “How would you debug a training job that is slow only every 20 minutes?”

Reasonable path:

correlate spikes with checkpoint saves
check storage latency and background artifact uploads
inspect node-level contention and noisy neighbors
compare rank-level step breakdowns to see if the stall is systemic or localized

5. “What is the most dangerous silent failure here?”

Best answer:

“Silent data duplication or omission across ranks, because the job can appear healthy, the loss can still decrease, and the resulting model quality degradation is hard to attribute after the fact.”

6. “Why not just make the training loop asynchronous?”

Because once gradients and optimizer progress are no longer synchronized, the semantics change significantly. You are no longer talking about the same training algorithm or failure model.

7. “What metrics would you page on?”

Page-worthy signal	Why
no forward progress for a bounded window	indicates hang or severe stall
checkpoint age exceeds threshold	recovery point is drifting
repeated restart loop	likely structural fault rather than transient blip
sample throughput collapse	usually user-visible cost inefficiency

8. “How do you decide checkpoint frequency?”

Answer using economics:

more frequent saves lower redo work
they increase I/O overhead and storage spend
the right point depends on failure rate, checkpoint cost, and tolerated recovery point objective

9. “When would you redesign the pipeline instead of tuning it?“

flowchart TD
  A[Persistent pain] --> B{Is bottleneck fundamental?}
  B -->|No| C[Tune parameters]
  B -->|Yes| D[Redesign boundary]
  D --> E[data format]
  D --> F[checkpoint contract]
  D --> G[parallelism strategy]
  D --> H[cluster placement model]

Staff engineers know when the current architecture is the bottleneck, not just the current config.

10. “What would you deliberately not build in the interview?”

A clean answer:

full scheduler integration
cloud-specific auth and secrets flow
production dashboard wiring
every exotic parallelism mode

Then add:

“I would preserve interfaces for those pieces so the notebook still demonstrates system boundaries.”