
Notebook Walkthrough

This is the page to rehearse directly before the interview.

Build a compact notebook with these cells:

  1. config and assumptions
  2. dataset + sampler abstraction
  3. model + optimizer factory
  4. distributed init / rank wiring
  5. training step
  6. checkpoint adapter
  7. metrics emission
  8. launch / resume wrapper
The cells chain into one another in this order:

flowchart TD
  A[Config cell] --> B[Dataset + sampler cell]
  B --> C[Model / optimizer cell]
  C --> D[Distributed init cell]
  D --> E[Training loop cell]
  E --> F[Checkpoint + metrics cell]
  F --> G[Launch / main cell]
The notebook should feel layered, not like one giant procedural cell.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    backend: str = "nccl"
    world_size: int = 2
    rank: int = 0
    local_rank: int = 0
    micro_batch_size: int = 8
    grad_accum_steps: int = 2
    max_epochs: int = 3
    checkpoint_dir: str = "/tmp/checkpoints"
    seed: int = 17

    @property
    def global_batch_size(self) -> int:
        return self.micro_batch_size * self.grad_accum_steps * self.world_size

Say this sentence when you introduce it:

“I like to put effective batch semantics directly on the config because it prevents hidden training behavior as the topology changes.”
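A quick sanity check of that property, using only the defaults from the dataclass above:

cfg = TrainConfig()
assert cfg.global_batch_size == 8 * 2 * 2  # micro * accum * world_size = 32

# Doubling the replica count doubles the effective batch, visibly:
assert TrainConfig(world_size=4).global_batch_size == 64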

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader

def run(cfg: TrainConfig):
    setup_seed(cfg.seed)
    maybe_init_dist(cfg)

    dataset = ToyDataset(size=10_000, width=256)
    sampler = ResumeAwareDistributedSampler(
        dataset,
        num_replicas=cfg.world_size,
        rank=cfg.rank,
        seed=cfg.seed,
    )
    loader = DataLoader(
        dataset,
        batch_size=cfg.micro_batch_size,
        sampler=sampler,
        num_workers=2,
        pin_memory=torch.cuda.is_available(),
        drop_last=True,
    )

    model = TinyNet(width=256).to(device_for(cfg))
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    if dist.is_initialized():
        model = DDP(model, device_ids=[cfg.local_rank], output_device=cfg.local_rank)

    # Restore before the loop so both the epoch counter and the sampler offset are honored.
    state = restore_if_present(model, optimizer, sampler, cfg)
    for epoch in range(state.epoch, cfg.max_epochs):
        sampler.set_epoch(epoch)
        train_epoch(model, optimizer, loader, sampler, cfg, start_step=state.step)
        save_checkpoint(model, optimizer, sampler, epoch, cfg)
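run() leans on helpers defined in earlier notebook cells. The bodies below are a minimal sketch, assuming a toy regression task and plain gradient accumulation; ResumeAwareDistributedSampler gets its own cell, and its checkpoint contract appears further down:

import random

import torch
from torch import nn

def setup_seed(seed: int) -> None:
    # Seed every RNG the trainer touches so ranks stay comparable.
    random.seed(seed)
    torch.manual_seed(seed)

class ToyDataset(torch.utils.data.Dataset):
    # Synthetic regression: x in R^width, y = sum(x) plus a little noise.
    def __init__(self, size: int, width: int):
        g = torch.Generator().manual_seed(0)
        self.x = torch.randn(size, width, generator=g)
        self.y = self.x.sum(dim=1, keepdim=True) + 0.01 * torch.randn(size, 1, generator=g)

    def __len__(self) -> int:
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

class TinyNet(nn.Module):
    def __init__(self, width: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(width, width), nn.ReLU(), nn.Linear(width, 1))

    def forward(self, x):
        return self.net(x)

def train_epoch(model, optimizer, loader, sampler, cfg, start_step: int = 0):
    # The resume-aware sampler already skips consumed samples;
    # start_step only keeps the step counter aligned with the checkpoint.
    model.train()
    device = next(model.parameters()).device
    optimizer.zero_grad(set_to_none=True)
    for step, (x, y) in enumerate(loader, start=start_step):
        x, y = x.to(device), y.to(device)
        loss = nn.functional.mse_loss(model(x), y)
        # Scale so the accumulated gradient matches one large-batch step.
        (loss / cfg.grad_accum_steps).backward()
        if (step + 1) % cfg.grad_accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)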

The current torchrun docs still make the launch model very explicit:

  • torchrun spawns one or more processes per node
  • for GPU training, each distributed process operates on a single GPU
  • rank wiring is exposed through environment variables such as RANK, LOCAL_RANK, and WORLD_SIZE (the legacy launcher passed --local-rank=<rank> on the command line instead)

That gives you a clean explanation for why your notebook code separates rank from local_rank.
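A minimal sketch of the rank wiring run() calls, assuming torchrun-style environment variables; maybe_init_dist and device_for are the helper names from the cell above, and these bodies are one reasonable implementation, not the only one:

import os

import torch
import torch.distributed as dist

def maybe_init_dist(cfg: TrainConfig) -> None:
    # torchrun sets these; a plain `python notebook.py` run skips init cleanly.
    if "RANK" not in os.environ:
        return
    cfg.rank = int(os.environ["RANK"])
    cfg.local_rank = int(os.environ["LOCAL_RANK"])
    cfg.world_size = int(os.environ["WORLD_SIZE"])
    # NCCL needs a GPU; fall back to gloo so the same cell works on CPU boxes.
    backend = cfg.backend if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend)
    if torch.cuda.is_available():
        torch.cuda.set_device(cfg.local_rank)

def device_for(cfg: TrainConfig) -> torch.device:
    if torch.cuda.is_available():
        return torch.device("cuda", cfg.local_rank)
    return torch.device("cpu")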

When to say what:

  • After sampler creation: “This is where distributed correctness lives; if this is wrong, the rest of the trainer can look healthy while learning on bad data.”
  • After the DDP wrap: “I’m using DDP as the baseline because it gives me synchronized gradient semantics with the least interview-time complexity.”
  • Before the checkpoint code: “I want recovery state to include optimizer and sampler progress, not just weights.”
  • Before the metrics code: “I’m exposing enough observability to tell whether the trainer is compute-bound, input-bound, or synchronization-bound.”
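The checkpoint line deserves real code behind it. A minimal save_checkpoint / restore_if_present sketch; it assumes the sampler exposes state_dict / load_state_dict for its resume cursor (a contract this walkthrough defines, not a stock PyTorch one) and that rank 0 is the only writer:

import os
from dataclasses import dataclass

import torch
import torch.distributed as dist

@dataclass
class ResumeState:
    epoch: int = 0
    step: int = 0

def save_checkpoint(model, optimizer, sampler, epoch, cfg) -> None:
    if dist.is_initialized() and dist.get_rank() != 0:
        return  # one writer is enough; weights match across ranks under DDP
    os.makedirs(cfg.checkpoint_dir, exist_ok=True)
    raw = model.module if hasattr(model, "module") else model  # unwrap DDP
    torch.save(
        {
            "model": raw.state_dict(),
            "optimizer": optimizer.state_dict(),
            "sampler": sampler.state_dict(),  # assumed resume-cursor contract
            "epoch": epoch,
        },
        os.path.join(cfg.checkpoint_dir, "latest.pt"),
    )

def restore_if_present(model, optimizer, sampler, cfg) -> ResumeState:
    path = os.path.join(cfg.checkpoint_dir, "latest.pt")
    if not os.path.exists(path):
        return ResumeState()
    ckpt = torch.load(path, map_location="cpu")
    raw = model.module if hasattr(model, "module") else model
    raw.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    sampler.load_state_dict(ckpt["sampler"])
    # Checkpoints land at epoch boundaries, so resume at the next epoch.
    return ResumeState(epoch=ckpt["epoch"] + 1, step=0)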
Pieces you can safely stub rather than implement live:

  • multi-node environment bootstrapping
  • object-store client details
  • scheduler-specific job spec
  • vendor-specific metrics exporters

If you choose to pseudocode them, preserve their interfaces.
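For example, a stub that keeps its interface intact (the name and signature here are hypothetical, chosen only to show the shape):

def upload_artifact(local_path: str, remote_uri: str) -> None:
    # Stub: the real version would call the object-store client.
    # Preserving the signature means no trainer code changes when it lands.
    raise NotImplementedError("object-store upload elided for the interview")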

If the core notebook is stable, use the remaining time to show how you would extend it safely:

flowchart LR
  A[Baseline DDP] --> B[Add mixed precision]
  B --> C[Add timed phases]
  C --> D[Discuss FSDP migration path]
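If you reach the first extension box, a mixed-precision sketch with torch.amp; it reuses the train_epoch shape from earlier and keeps the GradScaler disabled on CPU so the cell still runs anywhere (torch.amp.GradScaler needs a recent PyTorch):

import torch
from torch import nn

def train_epoch_amp(model, optimizer, loader, cfg):
    device = next(model.parameters()).device
    use_amp = device.type == "cuda"
    # GradScaler is a no-op when disabled, so the control flow stays identical.
    scaler = torch.amp.GradScaler(enabled=use_amp)
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for step, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        with torch.autocast(device_type=device.type, enabled=use_amp):
            loss = nn.functional.mse_loss(model(x), y)
        scaler.scale(loss / cfg.grad_accum_steps).backward()
        if (step + 1) % cfg.grad_accum_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)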