SNP2P Dataset
Overview
This page summarizes the dataset and data-collation utilities used by SNP2P
training. The datasets build genotype indices that align with the
SNPTreeParser catalogs and emit dictionaries that the model consumes via
the collators. Use these classes to load genotype sources (TSV or PLINK),
attach covariates/phenotypes, and optionally enable block/chunk processing.
Usage and examples
Example: load a PLINK dataset
from src.utils.tree import SNPTreeParser
from src.utils.data.dataset import PLINKDataset
tree_parser = SNPTreeParser(ontology="ontology.tsv", snp2gene="snp2gene.tsv")
dataset = PLINKDataset(
    tree_parser=tree_parser,
    bfile="data/geno/plink_prefix",
    cov="data/covariates.tsv",
    pheno="data/phenotypes.tsv",
    cov_ids=("AGE", "SEX"),
    pheno_ids=("BMI",),
)
Example: create a collated batch
from torch.utils.data import DataLoader
from src.utils.data.dataset import SNP2PCollator
collator = SNP2PCollator(tree_parser=tree_parser, input_format="indices")
loader = DataLoader(dataset, batch_size=8, shuffle=True, collate_fn=collator)
batch = next(iter(loader))
# batch["genotype"]["snp"], batch["covariates"], batch["phenotype"]
API documentation
- class GenotypeDataset
Base dataset for SNP2P training that prepares covariates and phenotype targets.
- Parameters:
tree_parser (SNPTreeParser) – Parsed SNP ontology and masks.
cov (str) – Path to the covariates TSV.
pheno (str, optional) – Path to the phenotype TSV.
cov_mean_dict (dict, optional) – Optional covariate mean overrides.
cov_std_dict (dict, optional) – Optional covariate standard deviation overrides.
cov_ids (tuple, optional) – Subset of covariate column names to load.
pheno_ids (tuple, optional) – Subset of phenotype column names to load.
bt (tuple, optional) – Binary phenotype IDs.
qt (tuple, optional) – Quantitative phenotype IDs.
dynamic_phenotype_sampling (bool, optional) – Whether phenotype sampling changes per batch.
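The cov_mean_dict and cov_std_dict parameters suggest that covariates are standardized, with the dictionaries overriding statistics computed from the data. The source does not show the actual computation, so the following is only a minimal sketch under that assumption. The helper name standardize_covariates and the list-of-dicts input are hypothetical and are not part of SNP2P.

```python
from statistics import mean, pstdev


def standardize_covariates(rows, cov_ids, cov_mean_dict=None, cov_std_dict=None):
    """Z-score the selected covariate columns of a list of row dicts.

    Hypothetical helper: values given in cov_mean_dict / cov_std_dict take
    precedence over statistics computed from the rows themselves.
    """
    cov_mean_dict = cov_mean_dict or {}
    cov_std_dict = cov_std_dict or {}
    stats = {}
    for col in cov_ids:
        values = [float(row[col]) for row in rows]
        mu = cov_mean_dict.get(col, mean(values))
        sigma = cov_std_dict.get(col, pstdev(values))
        # Guard against constant columns to avoid dividing by zero.
        stats[col] = (mu, sigma if sigma > 0 else 1.0)
    return [
        {col: (float(row[col]) - stats[col][0]) / stats[col][1] for col in cov_ids}
        for row in rows
    ]


rows = [{"AGE": 40, "SEX": 0}, {"AGE": 60, "SEX": 1}]
scaled = standardize_covariates(
    rows, ("AGE",), cov_mean_dict={"AGE": 50.0}, cov_std_dict={"AGE": 10.0}
)
```

Passing training-set statistics as overrides when building a validation or test dataset keeps all splits on the same scale, which is the usual reason for exposing such parameters.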
- class TSVDataset
Loads genotype data from a TSV and returns SNP, gene, and system indices.
- Parameters:
tree_parser (SNPTreeParser) – Parsed SNP ontology and masks.
genotype_path (str) – Path to genotype TSV.
cov (str) – Path to covariates TSV.
pheno (str, optional) – Path to the phenotype TSV.
cov_mean_dict (dict, optional) – Optional covariate mean overrides.
cov_std_dict (dict, optional) – Optional covariate standard deviation overrides.
flip (bool, optional) – Whether to flip reference/alternate allele encodings.
input_format (str, optional) – Input format (indices by default).
cov_ids (tuple, optional) – Subset of covariate column names to load.
pheno_ids (tuple, optional) – Subset of phenotype column names to load.
bt (tuple, optional) – Binary phenotype IDs.
qt (tuple, optional) – Quantitative phenotype IDs.
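To illustrate what the flip option might do, the sketch below assumes genotypes are stored as alternate-allele counts (0, 1, 2) with None for missing calls. Under that assumption, swapping the reference and alternate alleles amounts to replacing each count g with 2 - g. The function name flip_genotypes is hypothetical, and the encoding actually used by the datasets is not stated in this page.

```python
def flip_genotypes(dosages):
    """Swap reference/alternate allele counts: 0 <-> 2, 1 stays 1.

    Missing calls (None) are passed through unchanged.
    """
    return [None if g is None else 2 - g for g in dosages]


original = [0, 1, 2, None]
flipped = flip_genotypes(original)
```

Flipping twice returns the original encoding, which is a quick sanity check when aligning allele coding between two genotype sources.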
- class PLINKDataset
Loads genotype data from PLINK binaries and aligns covariates/phenotypes.
- Parameters:
tree_parser (SNPTreeParser) – Parsed SNP ontology and masks.
bfile (str) – PLINK file prefix.
cov (str, optional) – Path to the covariates TSV.
pheno (str, optional) – Path to the phenotype TSV.
cov_mean_dict (dict, optional) – Optional covariate mean overrides.
cov_std_dict (dict, optional) – Optional covariate standard deviation overrides.
flip (bool, optional) – Whether to flip reference/alternate allele encodings.
block (bool, optional) – Whether to include block indices in outputs.
input_format (str, optional) – Input format (indices by default).
cov_ids (tuple, optional) – Subset of covariate column names to load.
pheno_ids (tuple, optional) – Subset of phenotype column names to load.
bt (tuple, optional) – Binary phenotype IDs.
qt (tuple, optional) – Quantitative phenotype IDs.
- summary()
Print a short dataset summary.
- sample_population(n=100)
Subsample individuals for quick experiments.
- Parameters:
n (int, optional) – Number of individuals to keep.
- sample_phenotypes(n, seed=None)
Sample a subset of phenotypes and update the dataset ranges.
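Both subsampling methods above reduce the dataset to a random subset. The sketch below shows one way such a reproducible subsample could be drawn over a list of individual IDs. It is a standalone illustration: the function body, the seed argument on the population sampler, and the order-preserving behavior are all assumptions and do not describe the PLINKDataset implementation.

```python
import random


def sample_population(iids, n=100, seed=None):
    """Return a reproducible subset of at most n individual IDs, in original order."""
    if n >= len(iids):
        return list(iids)
    rng = random.Random(seed)
    # Sample positions rather than IDs so the original ordering can be kept.
    keep = set(rng.sample(range(len(iids)), n))
    return [iid for i, iid in enumerate(iids) if i in keep]


iids = [f"IID{i}" for i in range(1000)]
subset = sample_population(iids, n=100, seed=0)
```

Keeping the original order is convenient when the IDs must stay aligned with rows of a genotype matrix or covariate table.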
- class EmbeddingDataset
PLINK-backed dataset that augments samples with pretrained SNP embeddings.
- Parameters:
tree_parser (SNPTreeParser) – Parsed SNP ontology and masks.
bfile (str) – PLINK file prefix.
embedding (str) – Directory with per-sample embedding tensors.
iid2ind (dict) – Mapping from IID to embedding index.
cov (str, optional) – Path to the covariates TSV.
pheno (str, optional) – Path to the phenotype TSV.
cov_mean_dict (dict, optional) – Optional covariate mean overrides.
cov_std_dict (dict, optional) – Optional covariate standard deviation overrides.
cov_ids (tuple, optional) – Subset of covariate column names to load.
pheno_ids (tuple, optional) – Subset of phenotype column names to load.
bt (tuple, optional) – Binary phenotype IDs.
qt (tuple, optional) – Quantitative phenotype IDs.
- class SNPTokenizer
Simple SNP tokenizer for masked language modeling over SNP blocks.
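As a rough illustration of masked language modeling over a SNP block, the sketch below treats allele counts 0/1/2 as tokens, replaces a random fraction with a mask token, and records the original value as the label only at masked positions. The token ids, the masking probability, and the function name mask_snp_tokens are assumptions made for this example; the SNPTokenizer vocabulary and interface are not documented here. The value -100 is used for unmasked labels because it is the default ignore index of PyTorch's cross-entropy loss.

```python
import random

MASK_TOKEN = 3        # hypothetical id; 0, 1, 2 are taken to be allele counts
IGNORE_INDEX = -100   # positions with this label contribute no loss


def mask_snp_tokens(tokens, mask_prob=0.15, seed=None):
    """Return (inputs, labels) for masked language modeling over one SNP block."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)   # hide the genotype from the model
            labels.append(tok)          # and ask the model to recover it
        else:
            inputs.append(tok)
            labels.append(IGNORE_INDEX)
    return inputs, labels


block = [0, 1, 2, 0, 0, 1, 2, 2, 1, 0]
inputs, labels = mask_snp_tokens(block, mask_prob=0.3, seed=1)
```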
- class BlockDataset
Dataset that returns SNP indices for block-level pretraining.
- Parameters:
- class BlockQueryDataset
Dataset that assembles block-level genotypes from multiple block sources.
- Parameters:
tree_parser (SNPTreeParser) – Parsed SNP ontology and masks.
bfile (str) – PLINK file prefix.
blocks (dict) – Mapping of block identifiers to BlockDataset instances.
cov (str, optional) – Path to the covariates TSV.
pheno (str, optional) – Path to the phenotype TSV.
cov_mean_dict (dict, optional) – Optional covariate mean overrides.
cov_std_dict (dict, optional) – Optional covariate standard deviation overrides.
cov_ids (tuple, optional) – Subset of covariate column names to load.
pheno_ids (tuple, optional) – Subset of phenotype column names to load.
bt (tuple, optional) – Binary phenotype IDs.
qt (tuple, optional) – Quantitative phenotype IDs.
flip (bool, optional) – Whether to flip reference/alternate allele encodings.
- class SNP2PCollator
Collator that assembles batched SNP2P inputs and labels.
- Parameters:
tree_parser (SNPTreeParser) – Parsed SNP ontology and masks.
input_format (str, optional) – Input format (indices, embedding, or block).
pheno_ids (tuple, optional) – Phenotype IDs used for ordering labels.
mlm (bool, optional) – Whether to apply SNP masked language modeling.
mlm_collator_dict (dict, optional) – Per-block MLM collators for block input.
- class ChunkSNP2PCollator
Collator that breaks SNP2P inputs into chunks for memory efficiency.
- Parameters:
tree_parser (SNPTreeParser) – Parsed SNP ontology and masks.
chunker (object) – Chunker with create_chunks output.
input_format (str, optional) – Input format (indices, embedding, or block).
pheno_ids (tuple, optional) – Phenotype IDs used for ordering labels.
mlm (bool, optional) – Whether to apply SNP masked language modeling.
mlm_collator_dict (dict, optional) – Per-block MLM collators for block input.
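The idea of chunking inputs for memory efficiency can be sketched as splitting a long index list into fixed-size pieces that are processed one at a time. The function split_into_chunks below is a hypothetical stand-in written for this page; it does not reproduce the chunker object or the shape of its create_chunks output, neither of which is specified here.

```python
def split_into_chunks(indices, chunk_size):
    """Split a list of indices into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [indices[i:i + chunk_size] for i in range(0, len(indices), chunk_size)]


snp_indices = list(range(10))
chunks = split_into_chunks(snp_indices, chunk_size=4)
# The last chunk is shorter when the length is not a multiple of chunk_size.
```

Smaller chunks lower peak memory per forward pass at the cost of more passes, so the chunk size is typically tuned to the available device memory.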
- class DynamicPhenotypeBatchSampler
Batch sampler that randomly samples phenotypes per batch.
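One way to picture per-batch phenotype sampling is a generator that yields, for each batch, both the sample indices and a freshly drawn subset of phenotype IDs. The sketch below is illustrative only: the function name, its arguments, and the choice to draw a fixed number of phenotypes per batch are assumptions, not the interface of DynamicPhenotypeBatchSampler.

```python
import random


def dynamic_phenotype_batches(n_samples, pheno_ids, batch_size, n_phenotypes, seed=None):
    """Yield (sample_indices, phenotype_subset) pairs, one per batch."""
    rng = random.Random(seed)
    order = list(range(n_samples))
    rng.shuffle(order)
    for start in range(0, n_samples, batch_size):
        batch_indices = order[start:start + batch_size]
        # Draw a new phenotype subset for every batch.
        batch_phenos = rng.sample(list(pheno_ids), n_phenotypes)
        yield batch_indices, batch_phenos


pheno_ids = ("BMI", "HEIGHT", "LDL")
batches = list(
    dynamic_phenotype_batches(10, pheno_ids, batch_size=4, n_phenotypes=2, seed=0)
)
```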
- class CohortSampler
Weighted sampler for continuous phenotypes using skew-normal weights.
- Parameters:
dataset (Dataset) – Dataset containing cov_df.
n_samples (int, optional) – Optional number of samples to draw.
phenotype_col (str, optional) – Column name for the phenotype.
z_weight (float, optional) – Weight multiplier for skew-normal density.
sex_col (int or str, optional) – Column name or index for the sex covariate.
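The source does not say how the skew-normal density is turned into sampling weights, so the sketch below is one possible reading rather than a description of CohortSampler. It standardizes the phenotype, evaluates the standard skew-normal density 2·φ(z)·Φ(αz), and assumes each weight is 1 + z_weight × density, so that z_weight = 0 falls back to uniform sampling. The shape parameter alpha, the weight formula, and the function names are all assumptions, and the sex covariate is ignored here.

```python
import math
import random


def skew_normal_pdf(z, alpha):
    """Standard skew-normal density: 2 * phi(z) * Phi(alpha * z)."""
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    big_phi = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))
    return 2.0 * phi * big_phi


def cohort_weights(phenotypes, alpha=2.0, z_weight=1.0):
    """Assumed weighting: uniform base weight plus a scaled skew-normal density."""
    mu = sum(phenotypes) / len(phenotypes)
    var = sum((p - mu) ** 2 for p in phenotypes) / len(phenotypes)
    sd = math.sqrt(var) or 1.0  # fall back to 1.0 for a constant phenotype
    return [1.0 + z_weight * skew_normal_pdf((p - mu) / sd, alpha) for p in phenotypes]


phenotypes = [18.5, 22.0, 24.3, 27.9, 31.2, 35.0]
weights = cohort_weights(phenotypes)
rng = random.Random(0)
# Draw four sample indices with replacement, proportional to the weights.
drawn = rng.choices(range(len(phenotypes)), weights=weights, k=4)
```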
- class BinaryCohortSampler
Weighted sampler for binary phenotypes.
- Parameters:
dataset (Dataset) – Dataset containing cov_df.
phenotype_col (str, optional) – Column name for the phenotype.
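A common way to weight samples for a binary phenotype is inverse class frequency, so that cases and controls are drawn about equally often. Whether BinaryCohortSampler uses exactly this scheme is not stated in this page; the sketch below assumes it for illustration, and the function name binary_cohort_weights is hypothetical.

```python
from collections import Counter


def binary_cohort_weights(labels):
    """Weight each sample by the inverse frequency of its class."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


labels = [0, 0, 0, 1]  # three controls, one case
weights = binary_cohort_weights(labels)
# Each class now carries the same total weight.
```

Weights of this kind can be passed to a weighted sampler such as torch.utils.data.WeightedRandomSampler, which draws indices in proportion to them.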
- class DistributedCohortSampler
Distributed version of CohortSampler.
- Parameters:
dataset (Dataset) – Dataset containing cov_df.
num_replicas (int, optional) – Number of distributed replicas.
rank (int, optional) – Rank of the current replica.
shuffle (bool, optional) – Whether to shuffle indices.
seed (int, optional) – Random seed.
phenotype_col (str, optional) – Column name for the phenotype.
z_weight (float, optional) – Weight multiplier for skew-normal density.
sex_col (int or str, optional) – Column name or index for the sex covariate.
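The num_replicas, rank, shuffle, and seed parameters follow the usual distributed-sampler pattern: every replica builds the same seeded permutation and then takes its own strided slice. The sketch below shows that pattern in isolation, in the style of PyTorch's DistributedSampler. It omits the phenotype-based weighting, and the function name shard_indices and the epoch argument are assumptions made for the example.

```python
import random


def shard_indices(n_samples, num_replicas, rank, shuffle=True, seed=0, epoch=0):
    """Return the sample indices assigned to one replica."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    indices = list(range(n_samples))
    if shuffle:
        # All replicas use the same seed, so they agree on the permutation.
        random.Random(seed + epoch).shuffle(indices)
    per_replica = -(-n_samples // num_replicas)  # ceiling division
    total = per_replica * num_replicas
    # Pad by repeating from the start so every replica gets the same count.
    repeats = -(-total // n_samples)
    indices = (indices * repeats)[:total]
    return indices[rank:total:num_replicas]


shards = [shard_indices(10, num_replicas=4, rank=r, seed=42) for r in range(4)]
```

Because padding repeats a few indices, the shards together cover every sample at least once while keeping the per-replica batch count identical, which avoids replicas waiting on each other at the end of an epoch.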
- class DistributedBinaryCohortSampler
Distributed version of BinaryCohortSampler.
- Parameters:
dataset (Dataset) – Dataset containing cov_df.
num_replicas (int, optional) – Number of distributed replicas.
rank (int, optional) – Rank of the current replica.
shuffle (bool, optional) – Whether to shuffle indices.
seed (int, optional) – Random seed.
phenotype_col (str, optional) – Column name for the phenotype.