Preprocessing¶
The preprocessing pipeline converts simulated tree sequences into training data: SFS feature tensors, log-TMRCA target vectors, and metadata.
Pipeline overview¶
.trees files
│
├──→ genotype matrix extraction
├──→ biallelic filtering
├──→ (optional) accessibility mask application
├──→ SFS computation (xor/xnor channels)
├──→ windowed TMRCA computation (exact span-weighted averages)
├──→ (optional) tree topology feature extraction
│
└──→ output per simulation:
X.npy (P, 2, W, N) float16
y.npy (P, W) float16 log-TMRCA
pairs.npy (P, 2) int32
meta.json { mutation_rate, num_samples, ... }
CLI usage¶
# Basic preprocessing
fastcxt-preprocess --base-dir ./sims/anogam --out-subdir processed
# With accessibility mask (for real data with missing regions)
fastcxt-preprocess --base-dir ./sims/anogam \
--accessibility-mask masks/ag1000g_accessible.npz \
--out-subdir processed
# With tree topology features
fastcxt-preprocess --base-dir ./sims/anogam \
--extract-trees \
--out-subdir processed
# Variable sample sizes (recommended): point --base-dir at the parent
# directory containing per-size subdirectories. --max-samples pads tree
# features to a consistent dimension so all sizes can be batched.
fastcxt-preprocess --base-dir ./sims \
--extract-trees --max-samples 200 \
--out-subdir processed
# Customize pair sampling
fastcxt-preprocess --base-dir ./sims/anogam \
--num-pairs 500 \
--global-seed 42 \
--out-subdir processed
Accessibility masks¶
For species with high missing-data rates (e.g. Anopheles gambiae from Ag1000G), accessibility masks ensure the SFS is computed only over callable regions:
from fastcxt.preprocess import apply_accessibility_mask
import numpy as np
mask = np.load("ag1000g_accessible_2L.npz")["is_accessible"]
gm_filtered, pos_filtered = apply_accessibility_mask(gm, positions, mask, seq_len)
Output layout¶
processed/
├── train/
│ ├── n10/ # scenario = subdirectory name
│ │ ├── ts_00000000_i0/
│ │ │ ├── X.npy # (P, 2, W, n_samples) float16
│ │ │ ├── y.npy # (P, W) float16
│ │ │ ├── pairs.npy # (P, 2) int32
│ │ │ └── meta.json
│ │ └── ts_00000001_i1/
│ │ └── ...
│ ├── n50/
│ │ └── ...
│ └── n200/
│ └── ...
└── test/
└── ...
When preprocessing tree sequences with different sample sizes, point
--base-dir at the parent directory. Each subdirectory name becomes the
scenario label in the output. Use --max-samples to ensure tree topology
features have a consistent dimension across all sample sizes (the SFS
dimension is handled automatically by zero-padding in the model).