
Colab Exercises

These drills are meant to be typed, explained, and defended. Treat them like mock interview reps.

Drill 1: Build The Happy Path In 15 Minutes


Goal:

  • config dataclass
  • toy dataset
  • model
  • single-process training loop
  • metrics printout

What to prove:

  • you can create structure before distribution
  • you keep code readable under time pressure
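
Compressed into one file, the happy path might look like the following minimal sketch. The TrainConfig fields, the toy dataset, and the tiny model are illustrative placeholders, not requirements of the drill.

# Minimal single-process happy path; every name here is a placeholder choice.
from dataclasses import dataclass

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


@dataclass
class TrainConfig:
    batch_size: int = 32
    epochs: int = 2
    lr: float = 1e-3


def build_toy_data(n: int = 1024, dim: int = 16) -> TensorDataset:
    x = torch.randn(n, dim)
    y = (x.sum(dim=1, keepdim=True) > 0).float()  # simple separable target
    return TensorDataset(x, y)


def train(cfg: TrainConfig) -> None:
    dataset = build_toy_data()
    loader = DataLoader(dataset, batch_size=cfg.batch_size, shuffle=True)
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=cfg.lr)
    loss_fn = nn.BCEWithLogitsLoss()

    for epoch in range(cfg.epochs):
        running = 0.0
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch={epoch} mean_loss={running / len(loader):.4f}")  # metrics printout


if __name__ == "__main__":
    train(TrainConfig())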

flowchart LR
  A[Single-process trainer] --> B[Initialize process group]
  B --> C[Set device from local rank]
  C --> D[Use distributed sampler]
  D --> E[Wrap with DDP]
  E --> F[Rank-aware metrics and checkpointing]

The jump from single-process to DDP should feel incremental, not like a rewrite.

Success criteria:

  • rank, local_rank, world_size are explicit
  • sampler is deterministic
  • non-rank-0 logging is controlled
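
A minimal sketch of that incremental conversion, assuming a torchrun launch so LOCAL_RANK and the process-group settings come from the environment; cfg and train_one_epoch are placeholders for whatever the single-process version already defined.

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def setup_distributed():
    # Launched with torchrun, so rank and world size come from the environment.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = dist.get_world_size()  # explicit, useful for global batch math
    torch.cuda.set_device(local_rank)   # pin this process to its GPU
    return rank, local_rank, world_size


def train_ddp(model, dataset, cfg):
    rank, local_rank, world_size = setup_distributed()
    sampler = DistributedSampler(dataset, shuffle=True, seed=cfg.seed)  # deterministic shards
    loader = DataLoader(dataset, batch_size=cfg.batch_size, sampler=sampler)
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    for epoch in range(cfg.epochs):
        sampler.set_epoch(epoch)  # reshuffle consistently across ranks
        stats = train_one_epoch(model, loader)  # placeholder: same loop body as before
        if rank == 0:  # non-rank-0 logging is suppressed
            print(f"epoch={epoch} {stats}")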

Implement or pseudocode:

from types import SimpleNamespace

import torch


def restore_if_present(model, optimizer, sampler, cfg):
    # latest_checkpoint_path and unwrap are assumed helpers: the first returns the
    # newest checkpoint file (or None), the second strips the DDP wrapper.
    path = latest_checkpoint_path(cfg.checkpoint_dir)
    if path is None:
        return SimpleNamespace(epoch=0, step=0)  # fresh run
    state = torch.load(path, map_location="cpu")  # load on CPU, let the trainer move tensors
    unwrap(model).load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    sampler.load_state_dict(state["sampler"])  # assumes a stateful sampler; restores sample order
    return SimpleNamespace(epoch=state["epoch"], step=state["step"])

Then explain:

  • what happens if the topology (world size or GPU count) changed since the checkpoint was written
  • how to verify a checkpoint is complete before trusting it
  • why sampler state matters for resuming without replaying or skipping data
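
For contrast, a hedged sketch of the save side. An atomic rename plus a small manifest is one common way to make completeness checkable; unwrap is the same assumed helper as in the restore function, and the sampler is again assumed to expose state_dict.

import json
import os

import torch


def save_checkpoint(model, optimizer, sampler, epoch, step, cfg):
    # Call only on rank 0 in a DDP job.
    # Write to a temporary file, then rename: readers never see a half-written file.
    os.makedirs(cfg.checkpoint_dir, exist_ok=True)
    final_path = os.path.join(cfg.checkpoint_dir, f"step_{step}.pt")
    tmp_path = final_path + ".tmp"
    state = {
        "model": unwrap(model).state_dict(),  # save the module, not the DDP wrapper
        "optimizer": optimizer.state_dict(),
        "sampler": sampler.state_dict(),      # assumed stateful sampler
        "epoch": epoch,
        "step": step,
    }
    torch.save(state, tmp_path)
    os.replace(tmp_path, final_path)  # atomic rename on POSIX filesystems

    # Publish a tiny manifest last, so "manifest exists" implies "checkpoint is complete".
    manifest = {"latest": final_path, "epoch": epoch, "step": step}
    with open(os.path.join(cfg.checkpoint_dir, "manifest.json"), "w") as f:
        json.dump(manifest, f)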

Track per-phase step time (a timing sketch follows this list):

  • loader wait
  • forward
  • backward
  • optimizer
  • checkpoint save
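
One rough way to make those phases observable in a notebook is a synchronizing timer; the phase names simply mirror the list above.

import time
from collections import defaultdict
from contextlib import contextmanager

import torch

phase_totals = defaultdict(float)  # phase name -> accumulated seconds


@contextmanager
def timed(phase):
    # Synchronize so GPU work launched earlier isn't billed to the wrong phase.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    try:
        yield
    finally:
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        phase_totals[phase] += time.perf_counter() - start

# In the loop, wrap each phase, for example:
#   with timed("forward"):   loss = loss_fn(model(x), y)
#   with timed("backward"):  loss.backward()
#   with timed("optimizer"): optimizer.step()
# Loader wait is the time spent pulling the next batch from the DataLoader iterator,
# so time the next() call the same way.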

Then answer:

  1. Which phase is most likely to scale poorly first?
  2. Which phase is most likely to have long-tail spikes?
  3. Which phase is easiest to make observable in a notebook?

flowchart TD
  A[Notebook prototype] --> B[Containerized trainer]
  B --> C[Job launcher]
  C --> D[Artifact + config service]
  D --> E[Metrics / logging stack]
  E --> F[Policy layer for retries and cost]

A strong closing move is to show how the notebook becomes a service without rewriting core training logic.
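
A hedged sketch of that move: a thin CLI entrypoint that loads the same config and calls the same training function the notebook used; TrainConfig and train refer back to the earlier sketch and are placeholders.

# Hypothetical container entrypoint, e.g. python -m trainer.main --config config.json.
import argparse
import json


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True, help="Path to a JSON config file")
    args = parser.parse_args()
    with open(args.config) as f:
        cfg = TrainConfig(**json.load(f))  # same dataclass the notebook used
    train(cfg)  # same training function the notebook called interactively


if __name__ == "__main__":
    main()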

| Prompt | What a good answer should emphasize |
| --- | --- |
| "The loss drops, but throughput is terrible." | Step-time decomposition and loader vs sync diagnosis |
| "The job resumes, but metrics look strange." | Batch semantics, LR continuity, sampler replay |
| "One GPU has lower utilization than the others." | Rank-local skew, data imbalance, hardware or placement issue |
| "Checkpointing makes the job stall." | RPO vs I/O overhead, async artifact handling, manifest publication |