
Molecular and Genomic Data Pipelines

Generic image or tabular datasets rarely expose the data pipeline complexity that biological datasets introduce. Molecules are graphs with variable topology. Protein sequences span two orders of magnitude in length. Genomic matrices are large and sparse. Each changes the pipeline design in ways that surface as correctness failures, not just performance problems.

| Data type | Representation | Training challenge |
| --- | --- | --- |
| Small molecules | SMILES, InChI, molecular graphs | Variable atom count, chirality, canonicalization |
| Protein sequences | Amino acid strings | Length variation from 10 to 35,000+ residues |
| Multiple sequence alignments | N sequences × L positions | Variable N and L; large tensors with gap tokens |
| 3D protein structures | Coordinates, torsion angles | SE(3) invariance; coordinate frame sensitivity |
| scRNA-seq | Sparse cell × gene count matrices | 90–99% zeros; high gene dimensionality |
| Bioassay labels | IC50, % inhibition, binary hit | Severe class imbalance; censored measurements |

```mermaid
flowchart LR
  A[SMILES / FASTA store] --> B[Canonicalization + validity filter]
  B --> C[Featurization: graph or token]
  C --> D[Scaffold-aware split]
  D --> E[Rank-aware sampler]
  E --> F[Variable-length collation]
  F --> G[Trainer step]
```

Canonicalization and validity filtering happen before any split or sampler, because invalid inputs silently corrupt graph-based forward passes.
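
A minimal version of that canonicalization-and-filter pass with RDKit might look like the following sketch; the example inputs are illustrative:

```python
from rdkit import Chem


def canonicalize(smiles: str) -> str | None:
    """Return canonical SMILES, or None if RDKit cannot parse the input."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, canonical=True)


# Filter before any split or sampler so downstream stages never see invalid molecules.
raw = ["CCO", "c1ccccc1O", "not_a_smiles"]
clean = [canonical for canonical in (canonicalize(s) for s in raw) if canonical is not None]
```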

Random train/test splits in drug discovery produce optimistic evaluation. Molecules that share a chemical scaffold appear in both splits, leaking structural information from test into train.

A Murcko scaffold split groups molecules by their ring framework and assigns entire scaffold groups to a single partition:

```python
import random
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def scaffold_split(
    smiles_list: list[str],
    train_frac: float = 0.8,
    seed: int = 0,
) -> tuple[list[int], list[int]]:
    # Group molecule indices by their Bemis-Murcko scaffold SMILES.
    scaffold_to_indices: dict[str, list[int]] = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # drop unparseable SMILES rather than crash downstream
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol, includeChirality=False)
        scaffold_to_indices[scaffold].append(idx)

    # Shuffle whole scaffold groups, then fill the train partition group by group.
    rng = random.Random(seed)
    scaffold_sets = list(scaffold_to_indices.values())
    rng.shuffle(scaffold_sets)

    train_indices: list[int] = []
    test_indices: list[int] = []
    cutoff = int(len(smiles_list) * train_frac)
    for group in scaffold_sets:
        if len(train_indices) < cutoff:
            train_indices.extend(group)
        else:
            test_indices.extend(group)
    return train_indices, test_indices
```

The same logic applies when sharding across distributed ranks: the scaffold group, not the individual molecule, is the unit of partition assignment. Splitting a scaffold group across train and evaluation ranks defeats the purpose of the split.
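
A sketch of that assignment, assuming scaffold_sets is the same list of index groups built inside scaffold_split and that rank and world_size follow the usual torch.distributed conventions:

```python
def shard_scaffold_groups(
    scaffold_sets: list[list[int]],
    world_size: int,
    rank: int,
) -> list[int]:
    # Round-robin whole scaffold groups across ranks; no group is ever split.
    local_indices: list[int] = []
    for group_id, group in enumerate(scaffold_sets):
        if group_id % world_size == rank:
            local_indices.extend(group)
    return local_indices
```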

On ChEMBL-scale datasets, switching from a scaffold split to a random split typically inflates AUROC by 5–15 points. That gap measures scaffold memorization, not activity generalization.

Standard DataLoader collation fails on sequences or graphs of unequal length in the same batch.

| Strategy | When to use | Tradeoff |
| --- | --- | --- |
| Padding + attention mask | Fixed-depth transformers, protein encoders | Wasted compute on pad tokens; occupancy drops for long-tail batches |
| Dynamic length bucketing | Wide length distributions | Reduces padding waste; complicates sampler and resume tracking |
| Graph-level batching | Molecular GNNs | Requires specialized collation; batch vector tracks graph membership |

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_sequences(
    batch: list[dict[str, torch.Tensor]],
) -> dict[str, torch.Tensor]:
    # Pad every sequence in the batch to the length of the longest one.
    input_ids = pad_sequence(
        [item["input_ids"] for item in batch],
        batch_first=True,
        padding_value=0,
    )
    # True for real tokens, False for padding; consumed by attention and loss masking.
    attention_mask = pad_sequence(
        [torch.ones(len(item["input_ids"]), dtype=torch.bool) for item in batch],
        batch_first=True,
        padding_value=False,
    )
    labels = torch.stack([item["label"] for item in batch])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```

The attention mask is not cosmetic. Padded positions contribute to loss unless explicitly masked. In a distributed trainer, every rank must apply identical masking logic or gradient semantics diverge silently.
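
Dynamic length bucketing from the table above reduces how much padding the collate function has to add. A minimal sketch, assuming per-example lengths are precomputed and ignoring distributed sharding and resume state:

```python
import torch
from torch.utils.data import Sampler


class LengthBucketSampler(Sampler[list[int]]):
    """Yield batches of indices whose sequences have similar lengths."""

    def __init__(self, lengths: list[int], batch_size: int, seed: int = 0) -> None:
        self.lengths = lengths
        self.batch_size = batch_size
        self.seed = seed

    def __iter__(self):
        # Sort indices by length, cut into contiguous batches, then shuffle batch order
        # so the model does not see lengths monotonically within an epoch.
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        batches = [
            order[i : i + self.batch_size]
            for i in range(0, len(order), self.batch_size)
        ]
        g = torch.Generator().manual_seed(self.seed)
        for batch_idx in torch.randperm(len(batches), generator=g).tolist():
            yield batches[batch_idx]

    def __len__(self) -> int:
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size
```

Used as the batch_sampler argument of DataLoader, it replaces batch_size and shuffle; the table's caveat is that this iteration order becomes extra state to reproduce on resume.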

For GNNs on molecules, torch_geometric represents a batch as a single large disconnected graph with a batch vector that maps each node back to its source molecule:

```python
from rdkit import Chem
import torch
from torch_geometric.data import Data


def smiles_to_graph(smiles: str, label: float) -> Data | None:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Minimal per-atom features: atomic number, degree, aromaticity flag.
    node_features = torch.tensor(
        [
            [atom.GetAtomicNum(), atom.GetDegree(), int(atom.GetIsAromatic())]
            for atom in mol.GetAtoms()
        ],
        dtype=torch.float,
    )
    edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]
    if not edges:
        # Single-atom molecules have no bonds; keep an empty edge_index of the right shape.
        edge_index = torch.zeros((2, 0), dtype=torch.long)
    else:
        # Store each bond in both directions so message passing treats the graph as undirected.
        src, dst = zip(*edges)
        edge_index = torch.tensor([src + dst, dst + src], dtype=torch.long)
    return Data(x=node_features, edge_index=edge_index, y=torch.tensor([label]))
```

The batch vector produced by Batch.from_data_list() enables graph-level readout (global mean pool, global add pool) to produce one embedding per molecule rather than one per atom. Without it, pooling averages across all atoms in the concatenated batch, producing meaningless embeddings.
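
A short illustration of that readout, reusing smiles_to_graph from above on toy SMILES, with raw node features standing in for GNN-layer outputs:

```python
from torch_geometric.data import Batch
from torch_geometric.nn import global_mean_pool

graphs = [
    g for g in (smiles_to_graph(s, 0.0) for s in ["CCO", "c1ccccc1O"]) if g is not None
]
batch = Batch.from_data_list(graphs)

# batch.batch maps each node to its source graph, so pooling yields one row per molecule.
per_molecule = global_mean_pool(batch.x, batch.batch)  # shape: [num_graphs, num_node_features]
```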

Single-cell RNA-seq count matrices are 90–99% zeros. Loading them as dense tensors materializes gigabytes of zeros before any computation.

```mermaid
flowchart TD
  A[Raw count matrix: cells × genes] --> B{Storage format}
  B --> C[AnnData h5ad: CSR on disk]
  B --> D[Dense in-memory: prohibitive at atlas scale]
  C --> E[Row-by-row slice in Dataset.__getitem__]
  E --> F[to_dense on batch only]
  F --> G[Normalize and log-transform]
  G --> H[Trainer step]
```

Row-slicing a sparse matrix one observation at a time keeps loader worker memory bounded even for million-cell atlases.

```python
import scipy.sparse
import torch
from torch.utils.data import Dataset


class SingleCellDataset(Dataset):
    def __init__(
        self,
        counts: scipy.sparse.csr_matrix,
        labels: torch.Tensor,
        target_sum: float = 1e4,
    ) -> None:
        self.counts = counts          # stays sparse; never densified as a whole
        self.labels = labels
        self.target_sum = target_sum

    def __len__(self) -> int:
        return self.counts.shape[0]

    def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
        # Densify a single row only, then normalize to target_sum counts and log-transform.
        row = torch.tensor(self.counts[idx].toarray().ravel(), dtype=torch.float32)
        row = torch.log1p(row / (row.sum() + 1e-6) * self.target_sum)
        return {"expression": row, "label": self.labels[idx]}
```

High-throughput screening data routinely produces 100:1 to 1000:1 negative-to-active ratios.

| Approach | Mechanism | Resume complication |
| --- | --- | --- |
| Loss reweighting (pos_weight) | Penalize false negatives more | None; stateless |
| WeightedRandomSampler | Oversample actives per epoch | Must save sampler RNG state on top of consumed count |
| Focal loss | Down-weight easy negatives | Extra hyperparameter; may degrade calibration |

Weighted sampling changes the epoch definition, which makes resume correctness harder. If the checkpoint captures a consumed count from a DistributedSampler but not the internal state of a WeightedRandomSampler, resume silently restarts sampling from a different distribution.
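
If oversampling is used anyway, the sampler's RNG must be checkpointed explicitly. A sketch of what that contract looks like, where labels and batches_consumed are stand-ins for the trainer's own tensors and bookkeeping:

```python
import torch
from torch.utils.data import WeightedRandomSampler

weights = labels.float() * 99.0 + 1.0  # actives get weight 100, inactives weight 1
generator = torch.Generator().manual_seed(0)
sampler = WeightedRandomSampler(weights, num_samples=len(weights), generator=generator)

# Checkpoint: persist the sampler RNG state alongside the consumed-batch count.
state = {"sampler_rng": generator.get_state(), "batches_consumed": batches_consumed}

# Resume: restore the RNG state before rebuilding the DataLoader around the sampler.
generator.set_state(state["sampler_rng"])
```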

The strong interview sentence:

“I prefer loss reweighting over oversampling in distributed training because it does not change the sampler state contract. With oversampling I need to checkpoint the sampler’s internal RNG on top of the consumed-batch count, which adds a correctness surface that does not exist with stateless loss weighting.”
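
The stateless option in that answer is a one-line change to the loss. A minimal sketch, assuming a roughly 100:1 negative-to-active ratio and stand-in tensors for model outputs:

```python
import torch

# pos_weight ≈ negatives / positives; with a 100:1 screen, misclassified actives cost ~100x more.
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([100.0]))

logits = torch.randn(8, 1)                    # stand-in for model outputs
targets = torch.randint(0, 2, (8, 1)).float()
loss = criterion(logits, targets)
```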

Raw loss and accuracy are almost uninformative for drug discovery models.

| Metric | Definition | Why it matters |
| --- | --- | --- |
| AUROC | Area under the ROC curve | Threshold-free; standard for binary classification on imbalanced data |
| BEDROC | Boltzmann-enhanced discrimination ROC | Emphasizes early enrichment; reflects virtual screening economics |
| EF@1% | (actives in top 1%) / (expected by chance) | Standard KPI for hit identification campaigns |
| Scaffold generalization gap | Train-scaffold AUROC minus test-scaffold AUROC | Quantifies leakage; >0.1 suggests scaffold overfitting |
| Precision@K | Fraction of true actives in top K predictions | Operationally relevant when wet lab capacity is fixed at K compounds |
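
EF@1% is straightforward to compute directly from ranked predictions; a sketch, assuming binary labels and higher scores meaning more likely active:

```python
import numpy as np


def enrichment_factor(y_true: np.ndarray, y_score: np.ndarray, fraction: float = 0.01) -> float:
    """Actives recovered in the top `fraction` of the ranking, relative to chance."""
    n = len(y_true)
    top_k = max(1, int(n * fraction))
    order = np.argsort(-y_score)                 # descending by predicted score
    hits_in_top = float(y_true[order][:top_k].sum())
    expected_by_chance = float(y_true.sum()) * fraction
    return hits_in_top / expected_by_chance if expected_by_chance > 0 else 0.0
```
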
| Decision | Why you choose it | What it costs |
| --- | --- | --- |
| Scaffold split over random | Realistic generalization estimate | Fewer training molecules; noisier split variance |
| Dynamic batching over fixed padding | Better GPU occupancy | Complicates sampler resume tracking |
| Sparse loading for genomics | Memory-safe at atlas scale | More complex collation; harder to pin memory |
| Loss reweighting over oversampling | Simpler resume semantics | May produce poorly calibrated probabilities |
| AUROC over accuracy | Appropriate for imbalanced labels | Less interpretable to non-ML stakeholders |