
Molecular and Genomic Data Pipelines

Generic image or tabular datasets rarely expose the data pipeline complexity that biological datasets introduce. Molecules are graphs with variable topology. Protein sequences span two orders of magnitude in length. Genomic matrices are large and sparse. Each changes the pipeline design in ways that surface as correctness failures, not just performance problems.

| Data type | Representation | Training challenge |
| --- | --- | --- |
| Small molecules | SMILES, InChI, molecular graphs | Variable atom count, chirality, canonicalization |
| Protein sequences | Amino acid strings | Length variation from 10 to 35,000+ residues |
| Multiple sequence alignments | N sequences × L positions | Variable N and L; large tensors with gap tokens |
| 3D protein structures | Coordinates, torsion angles | SE(3) invariance; coordinate frame sensitivity |
| scRNA-seq | Sparse cell × gene count matrices | 90–99% zeros; high gene dimensionality |
| Bioassay labels | IC50, % inhibition, binary hit | Severe class imbalance; censored measurements |

```mermaid
flowchart LR
  A[SMILES / FASTA store] --> B[Canonicalization + validity filter]
  B --> C[Featurization: graph or token]
  C --> D[Scaffold-aware split]
  D --> E[Rank-aware sampler]
  E --> F[Variable-length collation]
  F --> G[Trainer step]
```

Canonicalization and validity filtering happen before any split or sampler, because invalid inputs silently corrupt graph-based forward passes.
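
A minimal version of that canonicalization-and-filter pass with RDKit might look like the following sketch; the example inputs are illustrative:

```python
from rdkit import Chem


def canonicalize(smiles: str) -> str | None:
    """Return canonical SMILES, or None if RDKit cannot parse the input."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, canonical=True)


# Filter before any split or sampler so downstream stages never see invalid molecules.
raw = ["CCO", "c1ccccc1O", "not_a_smiles"]
clean = [canonical for canonical in (canonicalize(s) for s in raw) if canonical is not None]
```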

Random train/test splits in drug discovery produce optimistic evaluation. Molecules that share a chemical scaffold appear in both splits, leaking structural information from test into train.

A Murcko scaffold split groups molecules by their ring framework and assigns entire scaffold groups to a single partition:

```python
import random
from collections import defaultdict

from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold


def scaffold_split(
    smiles_list: list[str],
    train_frac: float = 0.8,
    seed: int = 0,
) -> tuple[list[int], list[int]]:
    # Group molecule indices by their Bemis-Murcko scaffold SMILES.
    scaffold_to_indices: dict[str, list[int]] = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # drop unparseable SMILES rather than crash downstream
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol, includeChirality=False)
        scaffold_to_indices[scaffold].append(idx)

    # Shuffle whole scaffold groups, then fill the train partition group by group.
    rng = random.Random(seed)
    scaffold_sets = list(scaffold_to_indices.values())
    rng.shuffle(scaffold_sets)

    train_indices: list[int] = []
    test_indices: list[int] = []
    cutoff = int(len(smiles_list) * train_frac)
    for group in scaffold_sets:
        if len(train_indices) < cutoff:
            train_indices.extend(group)
        else:
            test_indices.extend(group)
    return train_indices, test_indices
```

The same logic applies when sharding across distributed ranks: the scaffold group, not the individual molecule, is the unit of partition assignment. Splitting a scaffold group across train and evaluation ranks defeats the purpose of the split.
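
A sketch of that assignment, assuming scaffold_sets is the same list of index groups built inside scaffold_split and that rank and world_size follow the usual torch.distributed conventions:

```python
def shard_scaffold_groups(
    scaffold_sets: list[list[int]],
    world_size: int,
    rank: int,
) -> list[int]:
    # Round-robin whole scaffold groups across ranks; no group is ever split.
    local_indices: list[int] = []
    for group_id, group in enumerate(scaffold_sets):
        if group_id % world_size == rank:
            local_indices.extend(group)
    return local_indices
```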

On ChEMBL-scale datasets, switching from a scaffold split to a random split typically inflates AUROC by 5–15 points. That gap measures scaffold memorization, not activity generalization.

Standard DataLoader collation fails on sequences or graphs of unequal length in the same batch.

| Strategy | When to use | Tradeoff |
| --- | --- | --- |
| Padding + attention mask | Fixed-depth transformers, protein encoders | Wasted compute on pad tokens; occupancy drops for long-tail batches |
| Dynamic length bucketing | Wide length distributions | Reduces padding waste; complicates sampler and resume tracking |
| Graph-level batching | Molecular GNNs | Requires specialized collation; batch vector tracks graph membership |

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_sequences(
    batch: list[dict[str, torch.Tensor]],
) -> dict[str, torch.Tensor]:
    # Pad every sequence in the batch to the length of the longest one.
    input_ids = pad_sequence(
        [item["input_ids"] for item in batch],
        batch_first=True,
        padding_value=0,
    )
    # True for real tokens, False for padding; consumed by attention and loss masking.
    attention_mask = pad_sequence(
        [torch.ones(len(item["input_ids"]), dtype=torch.bool) for item in batch],
        batch_first=True,
        padding_value=False,
    )
    labels = torch.stack([item["label"] for item in batch])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
```

The attention mask is not cosmetic. Padded positions contribute to loss unless explicitly masked. In a distributed trainer, every rank must apply identical masking logic or gradient semantics diverge silently.
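
Dynamic length bucketing from the table above reduces how much padding the collate function has to add. A minimal sketch, assuming per-example lengths are precomputed and ignoring distributed sharding and resume state:

```python
import torch
from torch.utils.data import Sampler


class LengthBucketSampler(Sampler[list[int]]):
    """Yield batches of indices whose sequences have similar lengths."""

    def __init__(self, lengths: list[int], batch_size: int, seed: int = 0) -> None:
        self.lengths = lengths
        self.batch_size = batch_size
        self.seed = seed

    def __iter__(self):
        # Sort indices by length, cut into contiguous batches, then shuffle batch order
        # so the model does not see lengths monotonically within an epoch.
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        batches = [
            order[i : i + self.batch_size]
            for i in range(0, len(order), self.batch_size)
        ]
        g = torch.Generator().manual_seed(self.seed)
        for batch_idx in torch.randperm(len(batches), generator=g).tolist():
            yield batches[batch_idx]

    def __len__(self) -> int:
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size
```

Used as the batch_sampler argument of DataLoader, it replaces batch_size and shuffle; the table's caveat is that this iteration order becomes extra state to reproduce on resume.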

For GNNs on molecules, torch_geometric represents a batch as a single large disconnected graph with a batch vector that maps each node back to its source molecule:

```python
from rdkit import Chem
import torch
from torch_geometric.data import Data


def smiles_to_graph(smiles: str, label: float) -> Data | None:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Minimal per-atom features: atomic number, degree, aromaticity flag.
    node_features = torch.tensor(
        [
            [atom.GetAtomicNum(), atom.GetDegree(), int(atom.GetIsAromatic())]
            for atom in mol.GetAtoms()
        ],
        dtype=torch.float,
    )
    edges = [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in mol.GetBonds()]
    if not edges:
        # Single-atom molecules have no bonds; keep an empty edge_index of the right shape.
        edge_index = torch.zeros((2, 0), dtype=torch.long)
    else:
        # Store each bond in both directions so message passing treats the graph as undirected.
        src, dst = zip(*edges)
        edge_index = torch.tensor([src + dst, dst + src], dtype=torch.long)
    return Data(x=node_features, edge_index=edge_index, y=torch.tensor([label]))
```

The batch vector produced by Batch.from_data_list() enables graph-level readout (global mean pool, global add pool) to produce one embedding per molecule rather than one per atom. Without it, pooling averages across all atoms in the concatenated batch, producing meaningless embeddings.
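
A short illustration of that readout, reusing smiles_to_graph from above on toy SMILES, with raw node features standing in for GNN-layer outputs:

```python
from torch_geometric.data import Batch
from torch_geometric.nn import global_mean_pool

graphs = [
    g for g in (smiles_to_graph(s, 0.0) for s in ["CCO", "c1ccccc1O"]) if g is not None
]
batch = Batch.from_data_list(graphs)

# batch.batch maps each node to its source graph, so pooling yields one row per molecule.
per_molecule = global_mean_pool(batch.x, batch.batch)  # shape: [num_graphs, num_node_features]
```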

Single-cell RNA-seq count matrices are 90–99% zeros. Loading them as dense tensors materializes gigabytes of zeros before any computation.

```mermaid
flowchart TD
  A[Raw count matrix: cells × genes] --> B{Storage format}
  B --> C[AnnData h5ad: CSR on disk]
  B --> D[Dense in-memory: prohibitive at atlas scale]
  C --> E[Row-by-row slice in Dataset.__getitem__]
  E --> F[to_dense on batch only]
  F --> G[Normalize and log-transform]
  G --> H[Trainer step]
```

Row-slicing a sparse matrix one observation at a time keeps loader worker memory bounded even for million-cell atlases.

```python
import scipy.sparse
import torch
from torch.utils.data import Dataset


class SingleCellDataset(Dataset):
    def __init__(
        self,
        counts: scipy.sparse.csr_matrix,
        labels: torch.Tensor,
        target_sum: float = 1e4,
    ) -> None:
        self.counts = counts          # stays sparse; never densified as a whole
        self.labels = labels
        self.target_sum = target_sum

    def __len__(self) -> int:
        return self.counts.shape[0]

    def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
        # Densify a single row only, then normalize to target_sum counts and log-transform.
        row = torch.tensor(self.counts[idx].toarray().ravel(), dtype=torch.float32)
        row = torch.log1p(row / (row.sum() + 1e-6) * self.target_sum)
        return {"expression": row, "label": self.labels[idx]}
```

High-throughput screening data routinely produces 100:1 to 1000:1 negative-to-active ratios.

| Approach | Mechanism | Resume complication |
| --- | --- | --- |
| Loss reweighting (pos_weight) | Penalize false negatives more | None; stateless |
| WeightedRandomSampler | Oversample actives per epoch | Must save sampler RNG state on top of consumed count |
| Focal loss | Down-weight easy negatives | Extra hyperparameter; may degrade calibration |

Weighted sampling changes the epoch definition, which makes resume correctness harder. If the checkpoint captures a consumed count from a DistributedSampler but not the internal state of a WeightedRandomSampler, resume silently restarts sampling from a different distribution.
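
If oversampling is used anyway, the sampler's RNG must be checkpointed explicitly. A sketch of what that contract looks like, where labels and batches_consumed are stand-ins for the trainer's own tensors and bookkeeping:

```python
import torch
from torch.utils.data import WeightedRandomSampler

weights = labels.float() * 99.0 + 1.0  # actives get weight 100, inactives weight 1
generator = torch.Generator().manual_seed(0)
sampler = WeightedRandomSampler(weights, num_samples=len(weights), generator=generator)

# Checkpoint: persist the sampler RNG state alongside the consumed-batch count.
state = {"sampler_rng": generator.get_state(), "batches_consumed": batches_consumed}

# Resume: restore the RNG state before rebuilding the DataLoader around the sampler.
generator.set_state(state["sampler_rng"])
```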

The strong interview sentence:

“I prefer loss reweighting over oversampling in distributed training because it does not change the sampler state contract. With oversampling I need to checkpoint the sampler’s internal RNG on top of the consumed-batch count, which adds a correctness surface that does not exist with stateless loss weighting.”
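
The stateless option in that answer is a one-line change to the loss. A minimal sketch, assuming a roughly 100:1 negative-to-active ratio and stand-in tensors for model outputs:

```python
import torch

# pos_weight ≈ negatives / positives; with a 100:1 screen, misclassified actives cost ~100x more.
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([100.0]))

logits = torch.randn(8, 1)                    # stand-in for model outputs
targets = torch.randint(0, 2, (8, 1)).float()
loss = criterion(logits, targets)
```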

Raw loss and accuracy are almost uninformative for drug discovery models.

| Metric | Definition | Why it matters |
| --- | --- | --- |
| AUROC | Area under the ROC curve | Threshold-free; standard for binary classification on imbalanced data |
| BEDROC | Boltzmann-enhanced discrimination ROC | Emphasizes early enrichment; reflects virtual screening economics |
| EF@1% | (actives in top 1%) / (expected by chance) | Standard KPI for hit identification campaigns |
| Scaffold generalization gap | Train-scaffold AUROC minus test-scaffold AUROC | Quantifies leakage; >0.1 suggests scaffold overfitting |
| Precision@K | Fraction of true actives in top K predictions | Operationally relevant when wet lab capacity is fixed at K compounds |
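
EF@1% is straightforward to compute directly from ranked predictions; a sketch, assuming binary labels and higher scores meaning more likely active:

```python
import numpy as np


def enrichment_factor(y_true: np.ndarray, y_score: np.ndarray, fraction: float = 0.01) -> float:
    """Actives recovered in the top `fraction` of the ranking, relative to chance."""
    n = len(y_true)
    top_k = max(1, int(n * fraction))
    order = np.argsort(-y_score)                 # descending by predicted score
    hits_in_top = float(y_true[order][:top_k].sum())
    expected_by_chance = float(y_true.sum()) * fraction
    return hits_in_top / expected_by_chance if expected_by_chance > 0 else 0.0
```
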
| Decision | Why you choose it | What it costs |
| --- | --- | --- |
| Scaffold split over random | Realistic generalization estimate | Fewer training molecules; noisier split variance |
| Dynamic batching over fixed padding | Better GPU occupancy | Complicates sampler resume tracking |
| Sparse loading for genomics | Memory-safe at atlas scale | More complex collation; harder to pin memory |
| Loss reweighting over oversampling | Simpler resume semantics | May produce poorly calibrated probabilities |
| AUROC over accuracy | Appropriate for imbalanced labels | Less interpretable to non-ML stakeholders |