SNP2P Dataset

Overview

This page documents the dataset, collator, and sampler classes used for SNP2P training. The dataset classes build genotype indices aligned with the SNPTreeParser catalogs and return per-sample dictionaries, which the collators batch for the model. Use these classes to load genotypes from TSV or PLINK files, attach covariates and phenotypes, and optionally enable block or chunk processing.

Usage and examples

Example: load a PLINK dataset

from src.utils.tree import SNPTreeParser
from src.utils.data.dataset import PLINKDataset

tree_parser = SNPTreeParser(ontology="ontology.tsv", snp2gene="snp2gene.tsv")
dataset = PLINKDataset(
    tree_parser=tree_parser,
    bfile="data/geno/plink_prefix",
    cov="data/covariates.tsv",
    pheno="data/phenotypes.tsv",
    cov_ids=("AGE", "SEX"),
    pheno_ids=("BMI",),
)

Example: create a collated batch

from torch.utils.data import DataLoader
from src.utils.data.dataset import SNP2PCollator

collator = SNP2PCollator(tree_parser=tree_parser, input_format="indices")
loader = DataLoader(dataset, batch_size=8, shuffle=True, collate_fn=collator)
batch = next(iter(loader))
# The batch is a nested dict, e.g. batch["genotype"]["snp"], batch["covariates"], batch["phenotype"]

API documentation

class GenotypeDataset

Base dataset for SNP2P training that prepares covariates and phenotype targets.

Parameters:

tree_parser (SNPTreeParser) – Parsed SNP ontology and masks.
cov (str) – Path to the covariates TSV.
pheno (str, optional) – Path to the phenotype TSV.
cov_mean_dict (dict, optional) – Covariate mean overrides.
cov_std_dict (dict, optional) – Covariate standard deviation overrides.
cov_ids (tuple, optional) – Subset of covariate column names to load.
pheno_ids (tuple, optional) – Subset of phenotype column names to load.
bt (tuple, optional) – Binary phenotype IDs.
qt (tuple, optional) – Quantitative phenotype IDs.
dynamic_phenotype_sampling (bool, optional) – Whether phenotype sampling changes per batch.
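
The cov_mean_dict and cov_std_dict overrides are easiest to picture as replacement statistics for covariate scaling. The following is a minimal sketch, assuming the dataset z-scores each covariate and that an override replaces the statistic computed from the loaded table. Neither assumption is confirmed on this page, and standardize_covariates is a hypothetical helper that works on plain dictionaries rather than tensors.

```python
from statistics import mean, pstdev


def standardize_covariates(rows, cov_ids, cov_mean_dict=None, cov_std_dict=None):
    """Z-score each covariate, preferring caller-supplied statistics (hypothetical helper)."""
    cov_mean_dict = cov_mean_dict or {}
    cov_std_dict = cov_std_dict or {}
    stats = {}
    for name in cov_ids:
        values = [row[name] for row in rows]
        mu = cov_mean_dict.get(name, mean(values))
        sigma = cov_std_dict.get(name, pstdev(values))
        # Guard against constant columns to avoid division by zero.
        stats[name] = (mu, sigma if sigma > 0 else 1.0)
    return [
        {name: (row[name] - stats[name][0]) / stats[name][1] for name in cov_ids}
        for row in rows
    ]


rows = [{"AGE": 40.0, "SEX": 0.0}, {"AGE": 60.0, "SEX": 1.0}]
scaled = standardize_covariates(
    rows,
    cov_ids=("AGE", "SEX"),
    cov_mean_dict={"AGE": 50.0},  # override: AGE uses these statistics
    cov_std_dict={"AGE": 10.0},   # SEX falls back to statistics of `rows`
)
```

Under these assumptions, a typical reason to pass overrides would be to scale a validation or test cohort with the statistics of the training cohort.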

__getitem__(index)

Returns a dictionary with covariate tensors and phenotype targets for the sample.

Parameters: index (int) – Sample index.
Returns: Sample payload with covariates/phenotype tensors.
Return type: dict

class TSVDataset

Loads genotype data from a TSV and returns SNP, gene, and system indices.

Parameters:

tree_parser (SNPTreeParser) – Parsed SNP ontology and masks.
genotype_path (str) – Path to genotype TSV.
cov (str) – Path to covariates TSV.
pheno (str, optional) – Path to the phenotype TSV.
cov_mean_dict (dict, optional) – Covariate mean overrides.
cov_std_dict (dict, optional) – Covariate standard deviation overrides.
flip (bool, optional) – Whether to flip reference/alternate allele encodings.
input_format (str, optional) – Input format (indices by default).
cov_ids (tuple, optional) – Subset of covariate column names to load.
pheno_ids (tuple, optional) – Subset of phenotype column names to load.
bt (tuple, optional) – Binary phenotype IDs.
qt (tuple, optional) – Quantitative phenotype IDs.
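
The flip option concerns which allele is counted. As an illustration only, assuming genotypes are stored as 0/1/2 dosages of one allele and that missing calls use a sentinel value (both are assumptions, as is the helper name flip_dosages), flipping amounts to counting the other allele:

```python
MISSING = -1  # hypothetical sentinel for a missing genotype call


def flip_dosages(dosages, missing=MISSING):
    """Swap the counted allele: 0 <-> 2, 1 stays 1, missing calls are kept."""
    return [g if g == missing else 2 - g for g in dosages]


flipped = flip_dosages([0, 1, 2, MISSING])
```

Applying the function twice returns the original dosages, which is a quick sanity check for any flip implementation.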

class PLINKDataset

Loads genotype data from PLINK binaries and aligns covariates/phenotypes.

Parameters:

tree_parser (SNPTreeParser) – Parsed SNP ontology and masks.
bfile (str) – PLINK file prefix.
cov (str, optional) – Path to the covariates TSV.
pheno (str, optional) – Path to the phenotype TSV.
cov_mean_dict (dict, optional) – Covariate mean overrides.
cov_std_dict (dict, optional) – Covariate standard deviation overrides.
flip (bool, optional) – Whether to flip reference/alternate allele encodings.
block (bool, optional) – Whether to include block indices in outputs.
input_format (str, optional) – Input format (indices by default).
cov_ids (tuple, optional) – Subset of covariate column names to load.
pheno_ids (tuple, optional) – Subset of phenotype column names to load.
bt (tuple, optional) – Binary phenotype IDs.
qt (tuple, optional) – Quantitative phenotype IDs.

summary(): Print a short dataset summary.

sample_population(n=100)

Subsample individuals for quick experiments.

Parameters: n (int, optional) – Number of individuals to keep.

sample_phenotypes(n, seed=None)

Sample a subset of phenotypes and update the dataset ranges.

Parameters:

n (int) – Number of phenotypes to sample.
seed (int, optional) – Random seed.
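
The effect of the seed argument can be sketched with the standard library. This is not the dataset's implementation: sample_phenotype_ids is a hypothetical helper, the phenotype names other than BMI are made up, and the step that updates the dataset ranges is not modeled.

```python
import random


def sample_phenotype_ids(pheno_ids, n, seed=None):
    """Pick n phenotype IDs reproducibly, keeping their original order (hypothetical helper)."""
    rng = random.Random(seed)  # a private generator leaves global random state untouched
    chosen = set(rng.sample(list(pheno_ids), n))
    return tuple(p for p in pheno_ids if p in chosen)


all_ids = ("BMI", "HEIGHT", "LDL", "HDL")  # illustrative names
picked = sample_phenotype_ids(all_ids, n=2, seed=7)
```

With a fixed seed the same subset is returned on every call, while seed=None draws a different subset each time.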

select_phenotypes(phenotypes)

Restrict the dataset to specific phenotype names.

Parameters: phenotypes (list) – Phenotype IDs to keep.

class EmbeddingDataset

PLINK-backed dataset that augments samples with pretrained SNP embeddings.

Parameters:

tree_parser (SNPTreeParser) – Parsed SNP ontology and masks.
bfile (str) – PLINK file prefix.
embedding (str) – Directory with per-sample embedding tensors.
iid2ind (dict) – Mapping from IID to embedding index.
cov (str, optional) – Path to the covariates TSV.
pheno (str, optional) – Path to the phenotype TSV.
cov_mean_dict (dict, optional) – Covariate mean overrides.
cov_std_dict (dict, optional) – Covariate standard deviation overrides.
cov_ids (tuple, optional) – Subset of covariate column names to load.
pheno_ids (tuple, optional) – Subset of phenotype column names to load.
bt (tuple, optional) – Binary phenotype IDs.
qt (tuple, optional) – Quantitative phenotype IDs.

class SNPTokenizer

Simple SNP tokenizer for masked language modeling over SNP blocks.

Parameters:

vocab (dict) – Mapping from token string to integer ID.
max_len (int, optional) – Maximum sequence length.
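
To make the roles of vocab and max_len concrete, here is a minimal sketch of a vocabulary-based tokenizer. It is a stand-in, not SNPTokenizer itself: the class name, the encode method, the special tokens, and the truncation behavior are all assumptions made for illustration.

```python
class TinySNPTokenizer:
    """Hypothetical stand-in that maps token strings to integer IDs."""

    def __init__(self, vocab, max_len=None, unk_token="[UNK]"):
        self.vocab = dict(vocab)
        self.max_len = max_len
        self.unk_id = self.vocab[unk_token]

    def encode(self, tokens):
        # Unknown tokens fall back to the [UNK] ID.
        ids = [self.vocab.get(tok, self.unk_id) for tok in tokens]
        # Truncate only when a maximum length was given.
        return ids if self.max_len is None else ids[: self.max_len]


# Assumed vocabulary: special tokens plus one token per genotype value.
vocab = {"[PAD]": 0, "[UNK]": 1, "[MASK]": 2, "0": 3, "1": 4, "2": 5}
tokenizer = TinySNPTokenizer(vocab, max_len=4)
ids = tokenizer.encode(["0", "2", "1", "?", "0"])
```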

class BlockDataset

Dataset that returns SNP indices for block-level pretraining.

Parameters:

bfile (str) – PLINK file prefix.
flip (bool, optional) – Whether to flip reference/alternate allele encodings.

get_individual_block_genotype(iid)

Return SNP indices for a given individual.

Parameters: iid (str) – Individual ID.
Returns: SNP indices for the individual.
Return type: torch.Tensor

class BlockQueryDataset

Dataset that assembles block-level genotypes from multiple block sources.

Parameters:

tree_parser (SNPTreeParser) – Parsed SNP ontology and masks.
bfile (str) – PLINK file prefix.
blocks (dict) – Mapping of block identifiers to BlockDataset instances.
cov (str, optional) – Path to the covariates TSV.
pheno (str, optional) – Path to the phenotype TSV.
cov_mean_dict (dict, optional) – Covariate mean overrides.
cov_std_dict (dict, optional) – Covariate standard deviation overrides.
cov_ids (tuple, optional) – Subset of covariate column names to load.
pheno_ids (tuple, optional) – Subset of phenotype column names to load.
bt (tuple, optional) – Binary phenotype IDs.
qt (tuple, optional) – Quantitative phenotype IDs.
flip (bool, optional) – Whether to flip reference/alternate allele encodings.

class SNP2PCollator

Collator that assembles batched SNP2P inputs and labels.

Parameters:

tree_parser (SNPTreeParser) – Parsed SNP ontology and masks.
input_format (str, optional) – Input format (indices, embedding, or block).
pheno_ids (tuple, optional) – Phenotype IDs used for ordering labels.
mlm (bool, optional) – Whether to apply SNP masked language modeling.
mlm_collator_dict (dict, optional) – Per-block MLM collators for block input.

class ChunkSNP2PCollator

Collator that breaks SNP2P inputs into chunks for memory efficiency.

Parameters:

tree_parser (SNPTreeParser) – Parsed SNP ontology and masks.
chunker (object) – Chunker with create_chunks output.
input_format (str, optional) – Input format (indices, embedding, or block).
pheno_ids (tuple, optional) – Phenotype IDs used for ordering labels.
mlm (bool, optional) – Whether to apply SNP masked language modeling.
mlm_collator_dict (dict, optional) – Per-block MLM collators for block input.
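
The memory argument for chunking is that each chunk can be processed separately instead of holding one very long SNP sequence at once. The sketch below shows only the simplest form of that idea, fixed-size splitting of an index list. The project's chunker and its create_chunks output are not described on this page, so split_into_chunks is a hypothetical helper and may differ from the real chunking rule.

```python
def split_into_chunks(indices, chunk_size):
    """Split a list of SNP indices into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [indices[i:i + chunk_size] for i in range(0, len(indices), chunk_size)]


chunks = split_into_chunks(list(range(7)), chunk_size=3)
```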

class DynamicPhenotypeBatchSampler

Batch sampler that randomly samples phenotypes per batch.

Parameters:

dataset (Dataset) – Dataset supporting sample_phenotypes and phenotype ranges.
batch_size (int) – Batch size.
drop_last (bool, optional) – Whether to drop the last incomplete batch.
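
Per-batch phenotype sampling can be pictured as a batch sampler that yields a fresh phenotype subset alongside each batch of sample indices. The class below is a hypothetical stand-in written with the standard library. It does not call sample_phenotypes on a dataset, the n_pheno and seed arguments are assumptions, and the phenotype names other than BMI are made up.

```python
import random


class TinyDynamicPhenotypeBatchSampler:
    """Hypothetical stand-in: yields (sample_indices, phenotype_ids) per batch."""

    def __init__(self, n_samples, pheno_ids, batch_size, n_pheno,
                 drop_last=False, seed=None):
        self.n_samples = n_samples
        self.pheno_ids = list(pheno_ids)
        self.batch_size = batch_size
        self.n_pheno = n_pheno
        self.drop_last = drop_last
        self.rng = random.Random(seed)

    def __iter__(self):
        order = list(range(self.n_samples))
        self.rng.shuffle(order)
        for start in range(0, self.n_samples, self.batch_size):
            batch = order[start:start + self.batch_size]
            if len(batch) < self.batch_size and self.drop_last:
                break  # skip the final, incomplete batch
            # Draw a new phenotype subset for this batch.
            phenos = self.rng.sample(self.pheno_ids, self.n_pheno)
            yield batch, phenos

    def __len__(self):
        full, rest = divmod(self.n_samples, self.batch_size)
        return full if self.drop_last or rest == 0 else full + 1


sampler = TinyDynamicPhenotypeBatchSampler(
    n_samples=10, pheno_ids=("BMI", "HEIGHT", "LDL"), batch_size=4,
    n_pheno=2, drop_last=True, seed=0,
)
batches = list(sampler)
```

With 10 samples and a batch size of 4, drop_last=True yields two full batches and discards the remaining two samples, while drop_last=False would yield a third, shorter batch.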

class CohortSampler

Weighted sampler for continuous phenotypes using skew-normal weights.

Parameters:

dataset (Dataset) – Dataset containing cov_df.
n_samples (int, optional) – Number of samples to draw.
phenotype_col (str, optional) – Column name for the phenotype.
z_weight (float, optional) – Weight multiplier for skew-normal density.
sex_col (int or str, optional) – Column name or index for the sex covariate.
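
For reference, the skew-normal density itself has a closed form: f(x) = (2 / scale) * phi(z) * Phi(alpha * z) with z = (x - loc) / scale, where phi and Phi are the standard normal density and cumulative distribution. The sketch below evaluates that formula with the standard library. How CohortSampler turns the density into sampling weights, and how z_weight and sex_col enter, is not described on this page and is not modeled here. The function and parameter names are illustrative.

```python
import math


def skew_normal_pdf(x, loc=0.0, scale=1.0, alpha=0.0):
    """Skew-normal density with location loc, scale, and shape alpha."""
    z = (x - loc) / scale
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)     # standard normal pdf
    big_phi = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))  # standard normal cdf at alpha*z
    return 2.0 / scale * phi * big_phi


symmetric = skew_normal_pdf(0.0)           # alpha=0 reduces to the normal density
right_tail = skew_normal_pdf(1.0, alpha=3.0)
left_tail = skew_normal_pdf(-1.0, alpha=3.0)
```

A positive alpha puts more mass to the right of loc, so right_tail exceeds left_tail in this example.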

class BinaryCohortSampler

Weighted sampler for binary phenotypes.

Parameters:

dataset (Dataset) – Dataset containing cov_df.
phenotype_col (str, optional) – Column name for the phenotype.
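
This page does not say how the weights for binary phenotypes are computed. One common choice for weighted sampling with imbalanced labels is inverse class frequency, sketched below as an assumption rather than as the sampler's actual rule. inverse_frequency_weights is a hypothetical helper.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each sample by 1 / (size of its class), so classes get equal total weight."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


labels = [0, 0, 0, 1]  # three controls, one case
weights = inverse_frequency_weights(labels)
```

With these weights, cases and controls are equally likely to be drawn overall even though controls are three times as frequent.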

class DistributedCohortSampler

Distributed version of CohortSampler.

Parameters:

dataset (Dataset) – Dataset containing cov_df.
num_replicas (int, optional) – Number of distributed replicas.
rank (int, optional) – Rank of the current replica.
shuffle (bool, optional) – Whether to shuffle indices.
seed (int, optional) – Random seed.
phenotype_col (str, optional) – Column name for the phenotype.
z_weight (float, optional) – Weight multiplier for skew-normal density.
sex_col (int or str, optional) – Column name or index for the sex covariate.
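
The num_replicas, rank, shuffle, and seed arguments match those of torch.utils.data.DistributedSampler, which shuffles with a shared seed, pads the index list, and gives each rank a strided slice. Whether DistributedCohortSampler partitions indices the same way is an assumption. The sketch below shows only that partitioning step with the standard library and leaves out the cohort weighting. shard_indices is a hypothetical helper.

```python
import random


def shard_indices(n, num_replicas, rank, shuffle=True, seed=0):
    """Return the sample indices assigned to one rank (requires n > 0)."""
    indices = list(range(n))
    if shuffle:
        # Every rank must use the same seed so all ranks agree on the order.
        random.Random(seed).shuffle(indices)
    per_replica = -(-n // num_replicas)  # ceiling division
    total = per_replica * num_replicas
    # Pad by repeating indices so each rank receives the same number of samples.
    padded = (indices * (total // n + 1))[:total]
    return padded[rank:total:num_replicas]


shards = [shard_indices(10, num_replicas=4, rank=r, seed=0) for r in range(4)]
```

Because of the padding, a few indices appear on more than one rank whenever the dataset size is not a multiple of num_replicas.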

class DistributedBinaryCohortSampler

Distributed version of BinaryCohortSampler.

Parameters:

dataset (Dataset) – Dataset containing cov_df.
num_replicas (int, optional) – Number of distributed replicas.
rank (int, optional) – Rank of the current replica.
shuffle (bool, optional) – Whether to shuffle indices.
seed (int, optional) – Random seed.
phenotype_col (str, optional) – Column name for the phenotype.