SNP2P Dataset ============= Overview -------- This page summarizes the dataset and data-collation utilities used by SNP2P training. The datasets build genotype indices that align with the ``SNPTreeParser`` catalogs and emit dictionaries that the model consumes via the collators. Use these classes to load genotype sources (TSV or PLINK), attach covariates/phenotypes, and optionally enable block/chunk processing. Usage and examples ------------------ Example: load a PLINK dataset ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from src.utils.tree import SNPTreeParser from src.utils.data.dataset import PLINKDataset tree_parser = SNPTreeParser(ontology="ontology.tsv", snp2gene="snp2gene.tsv") dataset = PLINKDataset( tree_parser=tree_parser, bfile="data/geno/plink_prefix", cov="data/covariates.tsv", pheno="data/phenotypes.tsv", cov_ids=("AGE", "SEX"), pheno_ids=("BMI",), ) Example: create a collated batch ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from torch.utils.data import DataLoader from src.utils.data.dataset import SNP2PCollator collator = SNP2PCollator(tree_parser=tree_parser, input_format="indices") loader = DataLoader(dataset, batch_size=8, shuffle=True, collate_fn=collator) batch = next(iter(loader)) # batch["genotype"]["snp"], batch["covariates"], batch["phenotype"] API documentation ----------------- .. class:: GenotypeDataset Base dataset for SNP2P training that prepares covariates and phenotype targets. :param tree_parser: Parsed SNP ontology and masks. :type tree_parser: SNPTreeParser :param cov: Path to the covariates TSV. :type cov: str :param pheno: Optional phenotype TSV. :type pheno: str, optional :param cov_mean_dict: Optional covariate mean overrides. :type cov_mean_dict: dict, optional :param cov_std_dict: Optional covariate standard deviation overrides. :type cov_std_dict: dict, optional :param cov_ids: Subset of covariate column names to load. :type cov_ids: tuple, optional :param pheno_ids: Subset of phenotype column names to load. :type pheno_ids: tuple, optional :param bt: Binary phenotype IDs. :type bt: tuple, optional :param qt: Quantitative phenotype IDs. :type qt: tuple, optional :param dynamic_phenotype_sampling: Whether phenotype sampling changes per batch. :type dynamic_phenotype_sampling: bool, optional .. method:: __getitem__(index) Returns a dictionary with covariate tensors and phenotype targets for the sample. :param index: Sample index. :type index: int :return: Sample payload with covariates/phenotype tensors. :rtype: dict .. class:: TSVDataset Loads genotype data from a TSV and returns SNP, gene, and system indices. :param tree_parser: Parsed SNP ontology and masks. :type tree_parser: SNPTreeParser :param genotype_path: Path to genotype TSV. :type genotype_path: str :param cov: Path to covariates TSV. :type cov: str :param pheno: Optional phenotype TSV. :type pheno: str, optional :param cov_mean_dict: Optional covariate mean overrides. :type cov_mean_dict: dict, optional :param cov_std_dict: Optional covariate standard deviation overrides. :type cov_std_dict: dict, optional :param flip: Whether to flip reference/alternate allele encodings. :type flip: bool, optional :param input_format: Input format (``indices`` by default). :type input_format: str, optional :param cov_ids: Subset of covariate column names to load. :type cov_ids: tuple, optional :param pheno_ids: Subset of phenotype column names to load. :type pheno_ids: tuple, optional :param bt: Binary phenotype IDs. :type bt: tuple, optional :param qt: Quantitative phenotype IDs. :type qt: tuple, optional .. class:: PLINKDataset Loads genotype data from PLINK binaries and aligns covariates/phenotypes. :param tree_parser: Parsed SNP ontology and masks. :type tree_parser: SNPTreeParser :param bfile: PLINK file prefix. :type bfile: str :param cov: Path to covariates TSV (optional). :type cov: str, optional :param pheno: Path to phenotype TSV (optional). :type pheno: str, optional :param cov_mean_dict: Optional covariate mean overrides. :type cov_mean_dict: dict, optional :param cov_std_dict: Optional covariate standard deviation overrides. :type cov_std_dict: dict, optional :param flip: Whether to flip reference/alternate allele encodings. :type flip: bool, optional :param block: Whether to include block indices in outputs. :type block: bool, optional :param input_format: Input format (``indices`` by default). :type input_format: str, optional :param cov_ids: Subset of covariate column names to load. :type cov_ids: tuple, optional :param pheno_ids: Subset of phenotype column names to load. :type pheno_ids: tuple, optional :param bt: Binary phenotype IDs. :type bt: tuple, optional :param qt: Quantitative phenotype IDs. :type qt: tuple, optional .. method:: summary() Print a short dataset summary. .. method:: sample_population(n=100) Subsample individuals for quick experiments. :param n: Number of individuals to keep. :type n: int, optional .. method:: sample_phenotypes(n, seed=None) Sample a subset of phenotypes and update the dataset ranges. :param n: Number of phenotypes to sample. :type n: int :param seed: Optional random seed. :type seed: int, optional .. method:: select_phenotypes(phenotypes) Restrict the dataset to specific phenotype names. :param phenotypes: Phenotype IDs to keep. :type phenotypes: list .. class:: EmbeddingDataset PLINK-backed dataset that augments samples with pretrained SNP embeddings. :param tree_parser: Parsed SNP ontology and masks. :type tree_parser: SNPTreeParser :param bfile: PLINK file prefix. :type bfile: str :param embedding: Directory with per-sample embedding tensors. :type embedding: str :param iid2ind: Mapping from IID to embedding index. :type iid2ind: dict :param cov: Path to covariates TSV (optional). :type cov: str, optional :param pheno: Path to phenotype TSV (optional). :type pheno: str, optional :param cov_mean_dict: Optional covariate mean overrides. :type cov_mean_dict: dict, optional :param cov_std_dict: Optional covariate standard deviation overrides. :type cov_std_dict: dict, optional :param cov_ids: Subset of covariate column names to load. :type cov_ids: tuple, optional :param pheno_ids: Subset of phenotype column names to load. :type pheno_ids: tuple, optional :param bt: Binary phenotype IDs. :type bt: tuple, optional :param qt: Quantitative phenotype IDs. :type qt: tuple, optional .. class:: SNPTokenizer Simple SNP tokenizer for masked language modeling over SNP blocks. :param vocab: Mapping from token string to integer ID. :type vocab: dict :param max_len: Optional maximum sequence length. :type max_len: int, optional .. class:: BlockDataset Dataset that returns SNP indices for block-level pretraining. :param bfile: PLINK file prefix. :type bfile: str :param flip: Whether to flip reference/alternate allele encodings. :type flip: bool, optional .. method:: get_individual_block_genotype(iid) Return SNP indices for a given individual. :param iid: Individual ID. :type iid: str :return: SNP indices for the individual. :rtype: torch.Tensor .. class:: BlockQueryDataset Dataset that assembles block-level genotypes from multiple block sources. :param tree_parser: Parsed SNP ontology and masks. :type tree_parser: SNPTreeParser :param bfile: PLINK file prefix. :type bfile: str :param blocks: Mapping of block identifiers to :class:`BlockDataset` instances. :type blocks: dict :param cov: Path to covariates TSV (optional). :type cov: str, optional :param pheno: Path to phenotype TSV (optional). :type pheno: str, optional :param cov_mean_dict: Optional covariate mean overrides. :type cov_mean_dict: dict, optional :param cov_std_dict: Optional covariate standard deviation overrides. :type cov_std_dict: dict, optional :param cov_ids: Subset of covariate column names to load. :type cov_ids: tuple, optional :param pheno_ids: Subset of phenotype column names to load. :type pheno_ids: tuple, optional :param bt: Binary phenotype IDs. :type bt: tuple, optional :param qt: Quantitative phenotype IDs. :type qt: tuple, optional :param flip: Whether to flip reference/alternate allele encodings. :type flip: bool, optional .. class:: SNP2PCollator Collator that assembles batched SNP2P inputs and labels. :param tree_parser: Parsed SNP ontology and masks. :type tree_parser: SNPTreeParser :param input_format: Input format (``indices``, ``embedding``, or ``block``). :type input_format: str, optional :param pheno_ids: Phenotype IDs used for ordering labels. :type pheno_ids: tuple, optional :param mlm: Whether to apply SNP masked language modeling. :type mlm: bool, optional :param mlm_collator_dict: Per-block MLM collators for block input. :type mlm_collator_dict: dict, optional .. class:: ChunkSNP2PCollator Collator that breaks SNP2P inputs into chunks for memory efficiency. :param tree_parser: Parsed SNP ontology and masks. :type tree_parser: SNPTreeParser :param chunker: Chunker with ``create_chunks`` output. :type chunker: object :param input_format: Input format (``indices``, ``embedding``, or ``block``). :type input_format: str, optional :param pheno_ids: Phenotype IDs used for ordering labels. :type pheno_ids: tuple, optional :param mlm: Whether to apply SNP masked language modeling. :type mlm: bool, optional :param mlm_collator_dict: Per-block MLM collators for block input. :type mlm_collator_dict: dict, optional .. class:: DynamicPhenotypeBatchSampler Batch sampler that randomly samples phenotypes per batch. :param dataset: Dataset supporting ``sample_phenotypes`` and phenotype ranges. :type dataset: Dataset :param batch_size: Batch size. :type batch_size: int :param drop_last: Whether to drop the last incomplete batch. :type drop_last: bool, optional .. class:: CohortSampler Weighted sampler for continuous phenotypes using skew-normal weights. :param dataset: Dataset containing ``cov_df``. :type dataset: Dataset :param n_samples: Optional number of samples to draw. :type n_samples: int, optional :param phenotype_col: Column name for the phenotype. :type phenotype_col: str, optional :param z_weight: Weight multiplier for skew-normal density. :type z_weight: float, optional :param sex_col: Column name or index for the sex covariate. :type sex_col: int or str, optional .. class:: BinaryCohortSampler Weighted sampler for binary phenotypes. :param dataset: Dataset containing ``cov_df``. :type dataset: Dataset :param phenotype_col: Column name for the phenotype. :type phenotype_col: str, optional .. class:: DistributedCohortSampler Distributed version of :class:`CohortSampler`. :param dataset: Dataset containing ``cov_df``. :type dataset: Dataset :param num_replicas: Number of distributed replicas. :type num_replicas: int, optional :param rank: Rank of the current replica. :type rank: int, optional :param shuffle: Whether to shuffle indices. :type shuffle: bool, optional :param seed: Random seed. :type seed: int, optional :param phenotype_col: Column name for the phenotype. :type phenotype_col: str, optional :param z_weight: Weight multiplier for skew-normal density. :type z_weight: float, optional :param sex_col: Column name or index for the sex covariate. :type sex_col: int or str, optional .. class:: DistributedBinaryCohortSampler Distributed version of :class:`BinaryCohortSampler`. :param dataset: Dataset containing ``cov_df``. :type dataset: Dataset :param num_replicas: Number of distributed replicas. :type num_replicas: int, optional :param rank: Rank of the current replica. :type rank: int, optional :param shuffle: Whether to shuffle indices. :type shuffle: bool, optional :param seed: Random seed. :type seed: int, optional :param phenotype_col: Column name for the phenotype. :type phenotype_col: str, optional