# Parallelism Playbook
Candidates often list parallelism strategies. Senior candidates explain when each strategy becomes the least bad option.
## Start With DDP

```mermaid
flowchart LR
    A[Full model on rank 0] --> B[Forward]
    C[Full model on rank 1] --> D[Forward]
    B --> E[Backward]
    D --> F[Backward]
    E --> G[All-reduce gradients]
    F --> G
    G --> H[Optimizer step on every rank]
```
DDP is usually the right interview default because:
- it matches common production practice
- it keeps failure discussion legible
- it isolates the first-order network cost to gradient synchronization
- it gives you a clean path to sampler design and global batch math (sketched below)
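A minimal sketch of that last bullet, assuming a `torchrun` launch and hypothetical batch numbers. The point to say out loud: DDP synchronizes gradients, but data partitioning and global batch arithmetic stay in your hands.

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group("nccl")  # assumes torchrun set the env vars

dataset = TensorDataset(torch.randn(10_000, 128))    # hypothetical toy data
sampler = DistributedSampler(dataset, shuffle=True)  # DDP never partitions data for you
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# Global batch math: per-rank batch * world size * grad-accumulation steps.
grad_accum_steps = 4
global_batch = 32 * dist.get_world_size() * grad_accum_steps

for epoch in range(3):
    sampler.set_epoch(epoch)  # otherwise every epoch replays the same shuffle order
    for (batch,) in loader:
        pass  # forward/backward/step would go here
```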
## When DDP Stops Being Enough

| Symptom | Interpretation | Next move |
|---|---|---|
| model does not fit in device memory | parameter + optimizer state footprint dominates | consider FSDP or ZeRO-style sharding |
| all-reduce dominates step time | communication is the bottleneck | tune bucket sizing, overlap, topology awareness, or reduce model/data split pressure |
| activation memory spikes | forward graph is too large for local device budget | activation checkpointing, sequence parallelism, or pipeline partitioning |
| one stage idles while another computes | work is unevenly partitioned | rebalance stages or simplify topology |
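For the activation-memory row, the cheapest first move is usually `torch.utils.checkpoint`: recompute activations in backward instead of storing them. A minimal sketch, assuming a hypothetical MLP block:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    """Wraps a submodule so its activations are recomputed during backward."""

    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # use_reentrant=False is the recommended modern checkpointing path.
        return checkpoint(self.block, x, use_reentrant=False)

# Hypothetical usage: trade recompute time for activation memory.
layer = CheckpointedBlock(
    nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
)
out = layer(torch.randn(8, 512, requires_grad=True))
```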
## DDP vs FSDP

```python
from dataclasses import dataclass
from typing import Optional

import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision, ShardingStrategy
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

@dataclass
class TrainConfig:
    # Assumed shape of the config the wrapper below references.
    parallelism: str
    local_rank: int
    mixed_precision_policy: Optional[MixedPrecision] = None

def wrap_model(model: nn.Module, cfg: TrainConfig) -> nn.Module:
    if cfg.parallelism == "ddp":
        return torch.nn.parallel.DistributedDataParallel(
            model,
            device_ids=[cfg.local_rank],
            output_device=cfg.local_rank,
            gradient_as_bucket_view=True,  # avoids an extra gradient copy per bucket
        )
    if cfg.parallelism == "fsdp":
        return FSDP(
            model,
            auto_wrap_policy=size_based_auto_wrap_policy,
            mixed_precision=cfg.mixed_precision_policy,
            sharding_strategy=ShardingStrategy.FULL_SHARD,
        )
    raise ValueError(f"Unsupported mode: {cfg.parallelism}")
```

### What to say out loud

- DDP replicates parameters on every rank and is easier to debug.
- FSDP shards parameters across data-parallel workers; the current docs describe it exactly that way.
- FSDP reduces memory pressure but shifts complexity into wrap policy, state-dict handling (sketched below), checkpoint formats, and performance tuning.
- In a live interview, choosing DDP first is usually more correct than prematurely optimizing into a harder failure model.
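To make the state-dict point concrete, here is a hedged sketch of gathering a full checkpoint from an FSDP-wrapped model. The function name is hypothetical; `FSDP.state_dict_type` with `FULL_STATE_DICT` is the documented pattern for materializing a full state dict from shards:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)

def save_full_checkpoint(model: FSDP, path: str) -> None:
    # Gather sharded parameters into a full state dict on rank 0,
    # offloading to CPU so the gather does not OOM the device.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state = model.state_dict()
    if dist.get_rank() == 0:
        torch.save(state, path)
```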
## Hybrid Parallelism

```mermaid
flowchart TD
    A[Node 0] --> B[Tensor Parallel Group 0]
    A --> C[Tensor Parallel Group 1]
    D[Node 1] --> E[Tensor Parallel Group 0]
    D --> F[Tensor Parallel Group 1]
    B --> G[Pipeline Stage 0]
    C --> G
    E --> H[Pipeline Stage 1]
    F --> H
    G --> I[Data Parallel Replica 0]
    H --> J[Data Parallel Replica 1]
```
This is where staff-level language matters:
- Tensor parallelism trades communication for larger layer capacity.
- Pipeline parallelism trades bubble overhead and scheduling complexity for model fit.
- Data parallelism trades replicated state for implementation simplicity.
The wrong answer is “use all of them for large models.” The right answer is “introduce only the extra axis needed to eliminate the current bottleneck.”
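If the interviewer pushes on mechanics, PyTorch's device-mesh API gives you precise vocabulary for the axes. A sketch assuming a hypothetical 16-GPU job; the axis sizes are illustrative:

```python
from torch.distributed.device_mesh import init_device_mesh

# Hypothetical 16-GPU job: 2-way data parallel x 2-way pipeline x 4-way tensor.
# Each axis should exist only because it removes a specific bottleneck.
mesh = init_device_mesh(
    "cuda",
    mesh_shape=(2, 2, 4),
    mesh_dim_names=("dp", "pp", "tp"),
)
dp_group = mesh["dp"].get_group()  # gradient all-reduce lives here
tp_group = mesh["tp"].get_group()  # intra-layer collectives live here
```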
## Current PyTorch Notes

The current FSDP docs still position FSDP as a sharding wrapper for data-parallel workers, while the current DDP docs still emphasize that DDP itself does not partition input data. Together, that leads to a clean interview distinction:
- DDP: replication + gradient sync
- FSDP: sharding + more state-management complexity
- sampler / loader: still your responsibility either way
## Choosing A Strategy

| Scenario | Best first choice | Why |
|---|---|---|
| medium model, commodity cluster | DDP | minimal operational complexity |
| model barely exceeds device memory | DDP + activation checkpointing | cheapest complexity increase |
| model substantially exceeds device memory | FSDP | memory savings without full hybrid topology |
| enormous model, dedicated infra | FSDP + tensor/pipeline parallelism | necessary but operationally heavier |
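If it helps to rehearse, the table collapses into a hypothetical decision helper. The names and flags are illustrative, not a real API; real decisions also weigh interconnect, team expertise, and checkpoint requirements:

```python
def choose_parallelism(
    fits_in_memory: bool,
    barely_exceeds_memory: bool,
    has_dedicated_infra: bool,
) -> str:
    """Hypothetical helper mirroring the table above."""
    if fits_in_memory:
        return "ddp"
    if barely_exceeds_memory:
        return "ddp + activation checkpointing"
    if not has_dedicated_infra:
        return "fsdp"
    return "fsdp + tensor/pipeline parallelism"
```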
## Trick Question: “Why Not Just Increase Batch Size?”

Because increasing batch size is not a generic scaling fix.
- It may change optimization behavior.
- It may increase activation memory.
- It may mask data-loader starvation without fixing it.
- It may raise communication payloads if gradient accumulation is not used carefully (see the `no_sync()` sketch below).
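A sketch of that last point, assuming a `torchrun` launch and a hypothetical toy model. Gradient accumulation with DDP's `no_sync()` raises the effective batch while paying the all-reduce once per effective batch instead of once per microbatch:

```python
from contextlib import nullcontext

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")  # assumes a torchrun launch on GPUs
device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
model = DDP(nn.Linear(64, 1).to(device), device_ids=[device.index])
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
ACCUM_STEPS = 4

for step in range(100):
    x = torch.randn(32, 64, device=device)  # stand-in for a real loader
    is_boundary = (step + 1) % ACCUM_STEPS == 0
    # no_sync() skips the gradient all-reduce on non-boundary microbatches,
    # so communication happens once per effective batch, not once per step.
    with (nullcontext() if is_boundary else model.no_sync()):
        loss = model(x).pow(2).mean() / ACCUM_STEPS
        loss.backward()
    if is_boundary:
        optimizer.step()
        optimizer.zero_grad()
```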