Epistasis Discovery
Overview
The epistasis discovery utilities combine model attention scores with genotype data to find significant SNP-SNP interactions within biological systems. The workflow filters candidate SNPs, tests pairwise interactions, and validates effects with regression models.
Usage and examples
Example: run discovery for a system
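import pandas as pd

from g2pt.tree import SNPTreeParser
from src.utils.analysis.epistasis import EpistasisFinder

# Parse the ontology and SNP-to-gene mapping that define each system.
tree_parser = SNPTreeParser(
    ontology="ontology.tsv",
    snp2gene="snp2gene.tsv",
)

# Attention scores exported from a trained model.
attention_results = pd.read_csv("outputs/attention_scores.csv")

finder = EpistasisFinder(
    tree_parser=tree_parser,
    attention_results=attention_results,
    tsv_path="data",  # directory containing genotypes.tsv (see __init__ below)
    cov="data/covariates.tsv",
    pheno="data/phenotypes.tsv",
)

# Returns the epistatic pairs and the SNP inheritance models that were determined.
pairs, snp_inheritance = finder.search_epistasis_on_system("immune_system")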
API documentation
- class EpistasisFinder
Finds and analyzes epistatic interactions between SNPs within biological systems.
This class integrates attention scores from a trained model with genotype data to perform a multi-stage statistical analysis. It first identifies candidate SNPs based on their relevance to a system (using attention scores), then filters SNP pairs by physical distance, tests for pairwise interaction using Fisher’s Exact Test, and finally validates significant pairs using a regression model to confirm the statistical interaction effect.
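To make the final "interaction effect" concrete, here is a minimal sketch, not the class's actual implementation. It assumes two SNPs coded as 0/1 carrier indicators, a continuous phenotype, and no covariates. Under those assumptions the model phenotype ~ a + b + a*b is saturated, so its interaction coefficient equals a difference of differences of the four cell means. The function name `interaction_coefficient` and the toy data are hypothetical.

```python
from statistics import mean


def interaction_coefficient(snp_a, snp_b, phenotype):
    """Interaction term of phenotype ~ a + b + a*b for 0/1 carrier codes.

    With two binary predictors the model is saturated, so the fitted
    coefficient equals a difference of differences of the four cell means.
    Requires at least one sample in each of the four carrier combinations.
    """
    cells = {(0, 0): [], (0, 1): [], (1, 0): [], (1, 1): []}
    for a, b, y in zip(snp_a, snp_b, phenotype):
        cells[(a, b)].append(y)
    m = {key: mean(values) for key, values in cells.items()}
    return (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)])


# Toy data (hypothetical): carrying both SNPs raises the phenotype far more
# than the sum of the two single-SNP effects.
snp_a = [0, 0, 0, 0, 1, 1, 1, 1]
snp_b = [0, 0, 1, 1, 0, 0, 1, 1]
epistatic = [1.0, 1.2, 1.5, 1.7, 1.4, 1.6, 3.9, 4.1]
additive = [1.0, 1.2, 1.5, 1.7, 1.4, 1.6, 1.9, 2.1]

effect_epistatic = interaction_coefficient(snp_a, snp_b, epistatic)
effect_additive = interaction_coefficient(snp_a, snp_b, additive)
```

A coefficient near zero, as in the additive toy data, means the joint effect is explained by the two single-SNP effects. A real analysis would also estimate a standard error and a p-value for this term and adjust for covariates.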
- Parameters:
tree_parser (SNPTreeParser) – An instance of the parser containing SNP, gene, and system relationships.
genotype (pd.DataFrame) – A DataFrame of genotype data, with samples as rows and SNPs as columns.
cov_df (pd.DataFrame) – A DataFrame containing covariate and phenotype data.
attention_results (pd.DataFrame) – A DataFrame of attention scores for each sample and system.
- __init__(tree_parser, attention_results, tsv_path, cov, pheno=None, flip=False)
Initializes the EpistasisFinder and loads all necessary data from TSV files.
- Parameters:
tree_parser (SNPTreeParser) – An initialized SNPTreeParser object.
attention_results (str or pd.DataFrame) – Path to a CSV file or a DataFrame of attention scores.
tsv_path (str) – Path to the directory containing genotypes.tsv and snp2gene.tsv.
cov (str) – Path to a tab-separated covariate file.
pheno (str, optional) – Path to a tab-separated phenotype file.
flip (bool, optional) – If True, swaps reference and alternate alleles.
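How `flip` transforms the genotype matrix is not spelled out here. As a minimal sketch, assuming genotypes are coded as alternate-allele counts (0, 1, 2), swapping reference and alternate alleles amounts to replacing each value g with 2 - g. The helper name `flip_alleles` is hypothetical and not part of the documented API.

```python
def flip_alleles(genotypes):
    """Swap reference and alternate alleles for additive 0/1/2 dosage codes.

    Assumption: each value counts copies of one allele, so the count of the
    other allele is 2 minus that value. Missing calls (None) pass through.
    """
    return [None if g is None else 2 - g for g in genotypes]


sample_dosages = [0, 1, 2, None, 2]
flipped = flip_alleles(sample_dosages)
```

Under this coding, flipping twice restores the original dosages, which is a quick sanity check for any allele-swapping step.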
- search_epistasis_on_system(system, sex=0, quantile=0.9, fisher=True, return_significant_only=True, check_inheritance=True, verbose=0, snp_inheritance_dict={}, binary=False, target='PHENOTYPE')
Searches for epistatic interactions for a given biological system.
This method executes a multi-step pipeline:
1. Determines the optimal inheritance model for each SNP (optional).
2. Filters SNPs using a Chi-Square test based on attention scores to identify those prevalent in a high-risk cohort.
3. Generates all pairs of the filtered SNPs.
4. Filters out SNP pairs that are physically close on the chromosome.
5. Performs Fisher’s Exact Test on the distant pairs to find statistically significant co-occurrences (optional).
6. Uses a regression model to test for a statistical interaction effect for the remaining pairs, correcting for multiple testing.
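The pair-generation, distance-filtering, and Fisher’s Exact Test steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: the SNP positions, the 500 kb threshold, the choice to keep pairs on different chromosomes, the one-sided test, and the names `distant_pairs` and `fisher_exact_greater` are all hypothetical.

```python
from itertools import combinations
from math import comb

# Hypothetical SNP positions: name -> (chromosome, base-pair position).
snp_positions = {
    "rs1": ("1", 1_000_000),
    "rs2": ("1", 1_050_000),
    "rs3": ("1", 9_000_000),
    "rs4": ("2", 1_020_000),
}
MIN_DISTANCE = 500_000  # assumed threshold, not taken from the library


def distant_pairs(positions, min_distance):
    """Generate all SNP pairs and drop pairs that are close on one chromosome."""
    kept = []
    for a, b in combinations(sorted(positions), 2):
        (chr_a, pos_a), (chr_b, pos_b) = positions[a], positions[b]
        if chr_a == chr_b and abs(pos_a - pos_b) < min_distance:
            continue  # physically close: likely linked, so skip the pair
        kept.append((a, b))
    return kept


def fisher_exact_greater(table):
    """One-sided Fisher's exact test on a 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, i.e. the probability of
    seeing at least this many co-carriers by chance given the margins.
    """
    (a, b), (c, d) = table
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return favourable / comb(n, col1)


pairs_to_test = distant_pairs(snp_positions, MIN_DISTANCE)

# Rows: carrier / non-carrier of SNP A; columns: carrier / non-carrier of SNP B.
co_occurrence = [[8, 2], [1, 9]]
p_value = fisher_exact_greater(co_occurrence)
```

In this toy input only the rs1/rs2 pair is removed, because the two SNPs sit 50 kb apart on the same chromosome. In practice a tested routine such as `scipy.stats.fisher_exact` would usually replace the hand-written test.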
- Parameters:
system (str) – The name of the system (e.g., GO term) to analyze.
sex (int, optional) – Which samples to include by sex: 0 or 1 to restrict the analysis to one sex, or 2 to include all samples.
quantile (float, optional) – The attention score quantile to define the high-risk cohort.
fisher (bool, optional) – Whether to perform the Fisher’s Exact Test step.
return_significant_only (bool, optional) – If True, returns only the pairs that are statistically significant after all tests. If False, returns results for all tested pairs.
check_inheritance (bool, optional) – If True, determines the best-fit inheritance model for each SNP before testing.
verbose (int, optional) – Verbosity level (0 or 1).
snp_inheritance_dict (dict, optional) – A pre-computed dictionary of SNP inheritance models.
binary (bool, optional) – Whether the target phenotype is binary (for logistic regression) or continuous (for linear regression).
target (str, optional) – The name of the phenotype column in the covariate data.
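To illustrate how `quantile` could feed the Chi-Square filter, the sketch below marks samples whose attention score is at or above the chosen quantile as the high-risk cohort, then tests whether carriers of one SNP are over-represented in it. The cutoff rule (at or above), the interpolation method, the uncorrected 2x2 statistic, the toy data, and the helper names are assumptions rather than documented behavior.

```python
from math import erfc, floor, sqrt


def quantile_cutoff(scores, q):
    """Linearly interpolated quantile of a list of scores (0 <= q <= 1)."""
    ordered = sorted(scores)
    pos = q * (len(ordered) - 1)
    lo = floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and p-value, 1 dof."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # an empty margin carries no evidence either way
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 dof.
    return chi2, erfc(sqrt(chi2 / 2))


# Toy data (hypothetical): one attention score and one carrier flag per sample.
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.40, 0.80, 0.85, 0.90, 0.95]
carrier = [0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1]

cutoff = quantile_cutoff(scores, 0.75)
high_risk = [s >= cutoff for s in scores]

# 2x2 counts: carrier status (rows) by cohort membership (columns).
a = sum(1 for c_, h in zip(carrier, high_risk) if c_ and h)
b = sum(1 for c_, h in zip(carrier, high_risk) if c_ and not h)
c = sum(1 for c_, h in zip(carrier, high_risk) if not c_ and h)
d = sum(1 for c_, h in zip(carrier, high_risk) if not c_ and not h)

chi2, p_value = chi_square_2x2(a, b, c, d)
```

Raising `quantile` shrinks the high-risk cohort, which makes the filter stricter but leaves fewer samples per cell. With counts as small as in this toy table, an exact test would be more appropriate than the Chi-Square approximation.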
- Returns:
A tuple containing a list of significant epistatic pairs and an updated dictionary of determined SNP inheritance models.
- Return type: