
Observability and Debugging

The interview signal here is not whether you know a specific observability vendor. It is whether you know what must be observable in a distributed trainer.

flowchart LR
  A[Trainer ranks] --> B[Structured logs]
  A --> C[Metrics]
  A --> D[Spans / traces]
  B --> E[Central log store]
  C --> F[Time-series backend]
  D --> G[Trace backend]
  E --> H[Alerts + dashboards]
  F --> H
  G --> H
A production-ready answer distinguishes raw telemetry from the systems that aggregate and alert on it.

At a minimum, a trainer should emit the following per-step and job-level metrics; a minimal emission sketch follows the list:
  • global step
  • loss
  • learning rate
  • gradient norm
  • skipped-step count for AMP / overflow cases
  • step time by phase
  • samples/sec
  • all-reduce time
  • data-loader wait time
  • checkpoint save latency
  • GPU memory allocated and reserved
  • restart count
  • checkpoint age
  • failed collective count
  • per-rank heartbeat freshness
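One way to emit these is a single structured record per step that a log shipper can forward to a time-series backend. Below is a minimal sketch; the emit_metrics helper and the field values are illustrative, not part of any specific library.

import json
import time

def emit_metrics(rank: int, step: int, **metrics) -> None:
    # Hypothetical helper: one JSON line per step keeps the record
    # machine-parseable for downstream aggregation and alerting.
    record = {"ts": time.time(), "rank": rank, "step": step, **metrics}
    print(json.dumps(record, sort_keys=True))

# Illustrative call site inside the training loop (values are placeholders).
emit_metrics(
    rank=0,
    step=1200,
    loss=2.31,
    lr=3e-4,
    grad_norm=1.7,
    skipped_steps=0,
    samples_per_sec=4096.0,
    allreduce_ms=18.5,
    loader_wait_ms=2.1,
)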
Log type                      Where it should come from
concise progress logs         rank 0
structured error logs         every rank
environment summary           rank 0 and launch layer
communicator diagnostics      all affected ranks, throttled
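A sketch of wiring this up with Python's standard logging module, assuming the launcher exposes the rank via the RANK environment variable; the level choices are illustrative.

import logging
import os

def configure_logging() -> logging.Logger:
    rank = int(os.environ.get("RANK", "0"))
    # Rank 0 emits concise progress at INFO; other ranks stay quiet
    # unless something goes wrong, so errors still surface per rank.
    level = logging.INFO if rank == 0 else logging.WARNING
    logging.basicConfig(
        level=level,
        format=f"%(asctime)s rank={rank} %(levelname)s %(message)s",
    )
    return logging.getLogger("trainer")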
flowchart TD
  A[Job appears stuck] --> B{Are ranks alive?}
  B -->|No| C[Process crash / OOM / node issue]
  B -->|Yes| D{Progress metric moving?}
  D -->|No| E[Deadlock or blocked I/O]
  D -->|Yes| F[Slow path, not hang]
  E --> G[Check last collective, loader wait, checkpoint write]
Hangs are easier to debug when you separate liveness from progress.
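One way to make that separation concrete is a per-rank watchdog thread: the fact that it keeps logging proves liveness, and comparing the global step across checks proves (or disproves) progress. This is a sketch with made-up names; get_step is a hypothetical hook into the training loop.

import threading
import time

def start_watchdog(rank: int, get_step, interval_s: float = 60.0) -> threading.Thread:
    # get_step: zero-argument callable returning the last completed global step.
    def _run() -> None:
        last_step = get_step()
        while True:
            time.sleep(interval_s)
            step = get_step()
            # Liveness: this line keeps appearing while the rank is alive.
            # Progress: flag when the step counter has not moved since the last check.
            if step == last_step:
                print(f"rank={rank} watchdog: alive but no progress at step {step}")
            last_step = step

    thread = threading.Thread(target=_run, daemon=True, name=f"watchdog-rank{rank}")
    thread.start()
    return thread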
Two small building blocks recur in almost every answer: a phase timer for step-time breakdowns and a structured per-rank event logger.

import json
import time
from contextlib import contextmanager

@contextmanager
def phase(timer_store: dict[str, list[float]], name: str):
    # Record the wall-clock duration of a named phase (forward, backward, all-reduce, ...).
    started = time.perf_counter()
    try:
        yield
    finally:
        timer_store.setdefault(name, []).append(time.perf_counter() - started)

def log_rank_event(rank: int, event: str, **fields) -> None:
    # Emit one structured JSON line per event so every rank's logs stay machine-parseable.
    payload = {"rank": rank, "event": event, **fields}
    print(json.dumps(payload, sort_keys=True))
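A sketch of how those two pieces might fit together in a loop; the phase names, sleep calls, and reporting cadence are placeholders standing in for real work.

timers: dict[str, list[float]] = {}
rank = 0  # placeholder; a real launcher would provide this

for step in range(3):  # stand-in for the real training loop
    with phase(timers, "data_wait"):
        time.sleep(0.01)  # stands in for loader wait
    with phase(timers, "fwd_bwd"):
        time.sleep(0.02)  # stands in for forward/backward/optimizer
    log_rank_event(
        rank=rank,
        event="phase_timings",
        step=step,
        mean_s={k: sum(v) / len(v) for k, v in timers.items()},
    )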

The hardest class of failure to catch is the one where the job still runs but its semantics drift.

Examples:

  • rank 3 is skipping batches after a loader exception
  • one rank restored stale optimizer state
  • world size changed but effective batch math was not updated
  • sampler shuffle seeds differ across ranks

Your defense is invariant checking (a minimal sketch follows the list):

  • assert batch counts match expected step counts
  • log checkpoint metadata on resume
  • compare sampler state across ranks when debugging
  • track global batch and accumulation config in emitted metadata
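A sketch of what two of these invariants can look like in code, assuming torch.distributed is already initialized; the helper names and the metadata shape are illustrative.

import torch.distributed as dist

def check_batch_math(global_batch: int, micro_batch: int, grad_accum: int) -> None:
    # Invariant: effective batch math must still hold after any world-size change.
    expected = micro_batch * grad_accum * dist.get_world_size()
    if global_batch != expected:
        raise RuntimeError(
            f"global batch {global_batch} != micro_batch * grad_accum * world_size = {expected}"
        )

def check_sampler_state(rank: int, sampler_epoch: int, sampler_seed: int) -> None:
    # Invariant: every rank should agree on the sampler epoch and shuffle seed.
    states = [None] * dist.get_world_size()
    dist.all_gather_object(states, {"epoch": sampler_epoch, "seed": sampler_seed})
    if rank == 0 and len({(s["epoch"], s["seed"]) for s in states}) != 1:
        raise RuntimeError(f"sampler state differs across ranks: {states}")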