Staff-Level Hard Questions
Use this page for pressure testing. Read the question, answer out loud, then compare your answer to the expected shape.
Question Taxonomy
Section titled “Question Taxonomy”
mindmap
root((Hard Questions))
Correctness
replay
sharding
optimizer state
Performance
all-reduce
dataloader
memory
Reliability
restart scope
checkpoint atomicity
deadlocks
Staff judgment
cost
simplification
rollout risk
1. “How do you know resume is correct?”
Section titled “1. “How do you know resume is correct?””Strong answer shape:
- define what “correct” means first
- list the state that must survive
- explain how you validate continuity after restore
- mention that exact replay may differ for some stochastic pipelines unless RNG and sample order are preserved
2. “Why DDP instead of FSDP?”
Section titled “2. “Why DDP instead of FSDP?””Strong answer shape:
- DDP is the simplest correct baseline
- if the model fits comfortably, FSDP adds complexity without immediate value
- if memory is the dominant constraint, I would migrate to FSDP and plan for sharded checkpoint semantics
3. “What breaks when world size changes on resume?”
Section titled “3. “What breaks when world size changes on resume?””Good answer:
- effective global batch changes
- optimizer dynamics may change
- sampler partitioning changes
- checkpoint state may no longer map cleanly in sharded formats
- monitoring baselines need reinterpretation
This is a strong place to say:
“I would prefer to resume with the same topology unless the system is explicitly designed for elasticity.”
4. “How would you debug a training job that is slow only every 20 minutes?”
Section titled “4. “How would you debug a training job that is slow only every 20 minutes?””Reasonable path:
- correlate spikes with checkpoint saves
- check storage latency and background artifact uploads
- inspect node-level contention and noisy neighbors
- compare rank-level step breakdowns to see if the stall is systemic or localized
5. “What is the most dangerous silent failure here?”
Section titled “5. “What is the most dangerous silent failure here?””Best answer:
“Silent data duplication or omission across ranks, because the job can appear healthy, the loss can still decrease, and the resulting model quality degradation is hard to attribute after the fact.”
6. “Why not just make the training loop asynchronous?”
Section titled “6. “Why not just make the training loop asynchronous?””Because once gradients and optimizer progress are no longer synchronized, the semantics change significantly. You are no longer talking about the same training algorithm or failure model.
7. “What metrics would you page on?”
Section titled “7. “What metrics would you page on?””| Page-worthy signal | Why |
|---|---|
| no forward progress for a bounded window | indicates hang or severe stall |
| checkpoint age exceeds threshold | recovery point is drifting |
| repeated restart loop | likely structural fault rather than transient blip |
| sample throughput collapse | usually user-visible cost inefficiency |
8. “How do you decide checkpoint frequency?”
Section titled “8. “How do you decide checkpoint frequency?””Answer using economics:
- more frequent saves lower redo work
- they increase I/O overhead and storage spend
- the right point depends on failure rate, checkpoint cost, and tolerated recovery point objective
9. “When would you redesign the pipeline instead of tuning it?“
Section titled “9. “When would you redesign the pipeline instead of tuning it?“”
flowchart TD
A[Persistent pain] --> B{Is bottleneck fundamental?}
B -->|No| C[Tune parameters]
B -->|Yes| D[Redesign boundary]
D --> E[data format]
D --> F[checkpoint contract]
D --> G[parallelism strategy]
D --> H[cluster placement model]
10. “What would you deliberately not build in the interview?”
Section titled “10. “What would you deliberately not build in the interview?””A clean answer:
- full scheduler integration
- cloud-specific auth and secrets flow
- production dashboard wiring
- every exotic parallelism mode
Then add:
“I would preserve interfaces for those pieces so the notebook still demonstrates system boundaries.”