
Observability and Debugging

The interview signal here is not whether you know a specific observability vendor. It is whether you know what must be observable in a distributed trainer.

flowchart LR
  A[Trainer ranks] --> B[Structured logs]
  A --> C[Metrics]
  A --> D[Spans / traces]
  B --> E[Central log store]
  C --> F[Time-series backend]
  D --> G[Trace backend]
  E --> H[Alerts + dashboards]
  F --> H
  G --> H
A production-ready answer distinguishes raw telemetry from the systems that aggregate and alert on it.

At a minimum, a trainer should emit the following per-step and job-level metrics; a minimal emission sketch follows the list:
  • global step
  • loss
  • learning rate
  • gradient norm
  • skipped-step count for AMP / overflow cases
  • step time by phase
  • samples/sec
  • all-reduce time
  • data-loader wait time
  • checkpoint save latency
  • GPU memory allocated and reserved
  • restart count
  • checkpoint age
  • failed collective count
  • per-rank heartbeat freshness
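One way to emit these is a single structured record per step that a log shipper can forward to a time-series backend. Below is a minimal sketch; the emit_metrics helper and the field values are illustrative, not part of any specific library.

import json
import time

def emit_metrics(rank: int, step: int, **metrics) -> None:
    # Hypothetical helper: one JSON line per step keeps the record
    # machine-parseable for downstream aggregation and alerting.
    record = {"ts": time.time(), "rank": rank, "step": step, **metrics}
    print(json.dumps(record, sort_keys=True))

# Illustrative call site inside the training loop (values are placeholders).
emit_metrics(
    rank=0,
    step=1200,
    loss=2.31,
    lr=3e-4,
    grad_norm=1.7,
    skipped_steps=0,
    samples_per_sec=4096.0,
    allreduce_ms=18.5,
    loader_wait_ms=2.1,
)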
Log type                      Where it should come from
concise progress logs         rank 0
structured error logs         every rank
environment summary           rank 0 and launch layer
communicator diagnostics      all affected ranks, throttled
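A sketch of wiring this up with Python's standard logging module, assuming the launcher exposes the rank via the RANK environment variable; the level choices are illustrative.

import logging
import os

def configure_logging() -> logging.Logger:
    rank = int(os.environ.get("RANK", "0"))
    # Rank 0 emits concise progress at INFO; other ranks stay quiet
    # unless something goes wrong, so errors still surface per rank.
    level = logging.INFO if rank == 0 else logging.WARNING
    logging.basicConfig(
        level=level,
        format=f"%(asctime)s rank={rank} %(levelname)s %(message)s",
    )
    return logging.getLogger("trainer")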
flowchart TD
  A[Job appears stuck] --> B{Are ranks alive?}
  B -->|No| C[Process crash / OOM / node issue]
  B -->|Yes| D{Progress metric moving?}
  D -->|No| E[Deadlock or blocked I/O]
  D -->|Yes| F[Slow path, not hang]
  E --> G[Check last collective, loader wait, checkpoint write]
Hangs are easier to debug when you separate liveness from progress.
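One way to make that separation concrete is a per-rank watchdog thread: the fact that it keeps logging proves liveness, and comparing the global step across checks proves (or disproves) progress. This is a sketch with made-up names; get_step is a hypothetical hook into the training loop.

import threading
import time

def start_watchdog(rank: int, get_step, interval_s: float = 60.0) -> threading.Thread:
    # get_step: zero-argument callable returning the last completed global step.
    def _run() -> None:
        last_step = get_step()
        while True:
            time.sleep(interval_s)
            step = get_step()
            # Liveness: this line keeps appearing while the rank is alive.
            # Progress: flag when the step counter has not moved since the last check.
            if step == last_step:
                print(f"rank={rank} watchdog: alive but no progress at step {step}")
            last_step = step

    thread = threading.Thread(target=_run, daemon=True, name=f"watchdog-rank{rank}")
    thread.start()
    return thread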
Two small building blocks recur in almost every answer: a phase timer for step-time breakdowns and a structured per-rank event logger.

import json
import time
from contextlib import contextmanager

@contextmanager
def phase(timer_store: dict[str, list[float]], name: str):
    # Record the wall-clock duration of a named phase (forward, backward, all-reduce, ...).
    started = time.perf_counter()
    try:
        yield
    finally:
        timer_store.setdefault(name, []).append(time.perf_counter() - started)

def log_rank_event(rank: int, event: str, **fields) -> None:
    # Emit one structured JSON line per event so every rank's logs stay machine-parseable.
    payload = {"rank": rank, "event": event, **fields}
    print(json.dumps(payload, sort_keys=True))
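A sketch of how those two pieces might fit together in a loop; the phase names, sleep calls, and reporting cadence are placeholders standing in for real work.

timers: dict[str, list[float]] = {}
rank = 0  # placeholder; a real launcher would provide this

for step in range(3):  # stand-in for the real training loop
    with phase(timers, "data_wait"):
        time.sleep(0.01)  # stands in for loader wait
    with phase(timers, "fwd_bwd"):
        time.sleep(0.02)  # stands in for forward/backward/optimizer
    log_rank_event(
        rank=rank,
        event="phase_timings",
        step=step,
        mean_s={k: sum(v) / len(v) for k, v in timers.items()},
    )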

The hardest class of failure to catch is the one where the job still runs but its semantics drift.

Examples:

  • rank 3 is skipping batches after a loader exception
  • one rank restored stale optimizer state
  • world size changed but effective batch math was not updated
  • sampler shuffle seeds differ across ranks

Your defense is invariant checking (a minimal sketch follows the list):

  • assert batch counts match expected step counts
  • log checkpoint metadata on resume
  • compare sampler state across ranks when debugging
  • track global batch and accumulation config in emitted metadata
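A sketch of what two of these invariants can look like in code, assuming torch.distributed is already initialized; the helper names and the metadata shape are illustrative.

import torch.distributed as dist

def check_batch_math(global_batch: int, micro_batch: int, grad_accum: int) -> None:
    # Invariant: effective batch math must still hold after any world-size change.
    expected = micro_batch * grad_accum * dist.get_world_size()
    if global_batch != expected:
        raise RuntimeError(
            f"global batch {global_batch} != micro_batch * grad_accum * world_size = {expected}"
        )

def check_sampler_state(rank: int, sampler_epoch: int, sampler_seed: int) -> None:
    # Invariant: every rank should agree on the sampler epoch and shuffle seed.
    states = [None] * dist.get_world_size()
    dist.all_gather_object(states, {"epoch": sampler_epoch, "seed": sampler_seed})
    if rank == 0 and len({(s["epoch"], s["seed"]) for s in states}) != 1:
        raise RuntimeError(f"sampler state differs across ranks: {states}")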