Interview Cheatsheet
Default Baseline
Section titled “Default Baseline”- one process per GPU
- DDP before FSDP unless memory forces otherwise
- deterministic distributed sampler
- rank-0 concise logs, all-rank structured errors
- full-job restart from latest good checkpoint
Must-Save Checkpoint State
Section titled “Must-Save Checkpoint State”- model
- optimizer
- scheduler
- AMP scaler
- sampler progress
- RNG state
- step / epoch / config metadata
Metrics To Mention Fast
Section titled “Metrics To Mention Fast”- step time by phase
- samples/sec
- data-loader wait
- all-reduce time
- checkpoint latency
- restart count
Config And Reproducibility Defaults
Section titled “Config And Reproducibility Defaults”- version config, code, and dataset manifest together
- save featurization or transform version, not just model hyperparameters
- state the effective global batch explicitly
- checkpoint sampler or stream cursor state, not only weights
- promise statistical reproducibility by default; reserve exact replay for controlled cases
Evaluation Defaults
Section titled “Evaluation Defaults”- keep train and validation manifests independently versioned
- say which split policy is honest for the domain: scaffold for molecules, clean holdout for images
- separate system-health metrics from model-quality metrics
- keep one cheap smoke-eval and one more representative holdout
- never answer “we just watch loss”
flowchart LR A[Execution model] --> B[Checkpoint contract] A --> C[Data partitioning] B --> D[Reproducibility story] C --> D D --> E[Metrics layers] E --> F[Evaluation policy] F --> G[Promotion decision]
High-Signal Sentences
Section titled “High-Signal Sentences”- “I’m optimizing for the smallest correct distributed baseline first.”
- “Data partitioning and resume semantics are correctness problems, not just performance details.”
- “I would only introduce a more complex parallelism axis when the current bottleneck is explicit.”
- “Checkpoint frequency is an RPO and cost decision.”
- “In a notebook I’ll preserve production boundaries even if I mock the backing systems.”
Biotech Addendum
Section titled “Biotech Addendum”Additional defaults for roles involving protein, molecular, or genomic data.
Biotech data defaults
- scaffold split over random split; time split for prospective validation
- missing bioassay labels are not negatives—mask them out of the loss
- per-task AUROC, BEDROC, and EF@1% over accuracy and raw loss
- sparse matrix row-slicing in
__getitem__for genomic datasets
Biotech model defaults
- bfloat16 over float16 for long-sequence stability
use_reentrant=Falsein activation checkpointing for transformer blocks- layer normalization only in equivariant networks; batch norm breaks equivariance
- sequence length curriculum from short to long to avoid early OOM events
Biotech high-signal sentences
- “Scaffold split is the correctness bar; random split is the optimism bar.”
- “Missing labels in a bioassay database are not negatives—treating them as negatives is a silent training error.”
- “Equivariant network training imposes preprocessing contracts: frame canonicalization must precede data sharding.”
- “Wet lab hit rate is the only metric the business actually cares about; computational metrics exist to let us move faster while we wait for it.”
- “Active learning turns the dataset contract from a precondition into an ongoing invariant managed by the orchestrator.”