
Colab Exercises

These drills are meant to be typed, explained, and defended. Treat them like mock interview reps.

Drill 1: Build The Happy Path In 15 Minutes


Goal:

  • config dataclass
  • toy dataset
  • model
  • single-process training loop
  • metrics printout

What to prove:

  • you can create structure before distribution
  • you keep code readable under time pressure
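
Compressed into one file, the happy path might look like the following minimal sketch. The TrainConfig fields, the toy dataset, and the tiny model are illustrative placeholders, not requirements of the drill.

# Minimal single-process happy path; every name here is a placeholder choice.
from dataclasses import dataclass

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


@dataclass
class TrainConfig:
    batch_size: int = 32
    epochs: int = 2
    lr: float = 1e-3


def build_toy_data(n: int = 1024, dim: int = 16) -> TensorDataset:
    x = torch.randn(n, dim)
    y = (x.sum(dim=1, keepdim=True) > 0).float()  # simple separable target
    return TensorDataset(x, y)


def train(cfg: TrainConfig) -> None:
    dataset = build_toy_data()
    loader = DataLoader(dataset, batch_size=cfg.batch_size, shuffle=True)
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=cfg.lr)
    loss_fn = nn.BCEWithLogitsLoss()

    for epoch in range(cfg.epochs):
        running = 0.0
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch={epoch} mean_loss={running / len(loader):.4f}")  # metrics printout


if __name__ == "__main__":
    train(TrainConfig())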

flowchart LR
  A[Single-process trainer] --> B[Initialize process group]
  B --> C[Set device from local rank]
  C --> D[Use distributed sampler]
  D --> E[Wrap with DDP]
  E --> F[Rank-aware metrics and checkpointing]

The jump from single-process to DDP should feel incremental, not like a rewrite.

Success criteria:

  • rank, local_rank, world_size are explicit
  • sampler is deterministic
  • non-rank-0 logging is controlled
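
A minimal sketch of that incremental conversion, assuming a torchrun launch so LOCAL_RANK and the process-group settings come from the environment; cfg and train_one_epoch are placeholders for whatever the single-process version already defined.

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def setup_distributed():
    # Launched with torchrun, so rank and world size come from the environment.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = dist.get_world_size()  # explicit, useful for global batch math
    torch.cuda.set_device(local_rank)   # pin this process to its GPU
    return rank, local_rank, world_size


def train_ddp(model, dataset, cfg):
    rank, local_rank, world_size = setup_distributed()
    sampler = DistributedSampler(dataset, shuffle=True, seed=cfg.seed)  # deterministic shards
    loader = DataLoader(dataset, batch_size=cfg.batch_size, sampler=sampler)
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    for epoch in range(cfg.epochs):
        sampler.set_epoch(epoch)  # reshuffle consistently across ranks
        stats = train_one_epoch(model, loader)  # placeholder: same loop body as before
        if rank == 0:  # non-rank-0 logging is suppressed
            print(f"epoch={epoch} {stats}")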

Implement or pseudocode:

from types import SimpleNamespace

import torch


def restore_if_present(model, optimizer, sampler, cfg):
    # latest_checkpoint_path and unwrap are assumed helpers: the first returns the
    # newest checkpoint file (or None), the second strips the DDP wrapper.
    path = latest_checkpoint_path(cfg.checkpoint_dir)
    if path is None:
        return SimpleNamespace(epoch=0, step=0)  # fresh run
    state = torch.load(path, map_location="cpu")  # load on CPU, let the trainer move tensors
    unwrap(model).load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    sampler.load_state_dict(state["sampler"])  # assumes a stateful sampler; restores sample order
    return SimpleNamespace(epoch=state["epoch"], step=state["step"])

Then explain:

  • what happens if the topology (world size or GPU count) changed since the checkpoint was written
  • how to verify a checkpoint is complete before trusting it
  • why sampler state matters for resuming without replaying or skipping data
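
For contrast, a hedged sketch of the save side. An atomic rename plus a small manifest is one common way to make completeness checkable; unwrap is the same assumed helper as in the restore function, and the sampler is again assumed to expose state_dict.

import json
import os

import torch


def save_checkpoint(model, optimizer, sampler, epoch, step, cfg):
    # Call only on rank 0 in a DDP job.
    # Write to a temporary file, then rename: readers never see a half-written file.
    os.makedirs(cfg.checkpoint_dir, exist_ok=True)
    final_path = os.path.join(cfg.checkpoint_dir, f"step_{step}.pt")
    tmp_path = final_path + ".tmp"
    state = {
        "model": unwrap(model).state_dict(),  # save the module, not the DDP wrapper
        "optimizer": optimizer.state_dict(),
        "sampler": sampler.state_dict(),      # assumed stateful sampler
        "epoch": epoch,
        "step": step,
    }
    torch.save(state, tmp_path)
    os.replace(tmp_path, final_path)  # atomic rename on POSIX filesystems

    # Publish a tiny manifest last, so "manifest exists" implies "checkpoint is complete".
    manifest = {"latest": final_path, "epoch": epoch, "step": step}
    with open(os.path.join(cfg.checkpoint_dir, "manifest.json"), "w") as f:
        json.dump(manifest, f)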

Track per-phase step time (a timing sketch follows this list):

  • loader wait
  • forward
  • backward
  • optimizer
  • checkpoint save
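
One rough way to make those phases observable in a notebook is a synchronizing timer; the phase names simply mirror the list above.

import time
from collections import defaultdict
from contextlib import contextmanager

import torch

phase_totals = defaultdict(float)  # phase name -> accumulated seconds


@contextmanager
def timed(phase):
    # Synchronize so GPU work launched earlier isn't billed to the wrong phase.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    try:
        yield
    finally:
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        phase_totals[phase] += time.perf_counter() - start

# In the loop, wrap each phase, for example:
#   with timed("forward"):   loss = loss_fn(model(x), y)
#   with timed("backward"):  loss.backward()
#   with timed("optimizer"): optimizer.step()
# Loader wait is the time spent pulling the next batch from the DataLoader iterator,
# so time the next() call the same way.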

Then answer:

  1. Which phase is most likely to scale poorly first?
  2. Which phase is most likely to have long-tail spikes?
  3. Which phase is easiest to make observable in a notebook?

flowchart TD
  A[Notebook prototype] --> B[Containerized trainer]
  B --> C[Job launcher]
  C --> D[Artifact + config service]
  D --> E[Metrics / logging stack]
  E --> F[Policy layer for retries and cost]

A strong closing move is to show how the notebook becomes a service without rewriting core training logic.
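
A hedged sketch of that move: a thin CLI entrypoint that loads the same config and calls the same training function the notebook used; TrainConfig and train refer back to the earlier sketch and are placeholders.

# Hypothetical container entrypoint, e.g. python -m trainer.main --config config.json.
import argparse
import json


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True, help="Path to a JSON config file")
    args = parser.parse_args()
    with open(args.config) as f:
        cfg = TrainConfig(**json.load(f))  # same dataclass the notebook used
    train(cfg)  # same training function the notebook called interactively


if __name__ == "__main__":
    main()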

| Prompt | What a good answer should emphasize |
| --- | --- |
| "The loss drops, but throughput is terrible." | Step-time decomposition and loader vs sync diagnosis |
| "The job resumes, but metrics look strange." | Batch semantics, LR continuity, sampler replay |
| "One GPU has lower utilization than the others." | Rank-local skew, data imbalance, hardware or placement issue |
| "Checkpointing makes the job stall." | RPO vs I/O overhead, async artifact handling, manifest publication |