Epistasis Simulation

Overview

The epistasis simulation utilities generate synthetic genotype/phenotype datasets with both additive and pairwise interaction effects. The simulation is designed for benchmarking discovery workflows and includes options for LD structure, heritability budgets, and biasing epistatic pairs toward non-significant SNPs. A helper function can also build a hierarchical SNP→Gene→System ontology that is aligned with the simulated causal structure.

For a full walkthrough, see the notebook: Epistasis_simulation.ipynb.

Key features

  • Additive + epistatic signal control: Allocate separate heritability budgets for additive and pairwise interaction effects.

  • LD-aware genotypes: Optionally simulate LD blocks and correlated SNPs.

  • Ontology-aligned mappings: Generate SNP→Gene and Gene→System mappings that align with the simulated causal pairs, with tunable system coherence.

  • Export-ready artifacts: Write out genotypes.tsv, simulation.pheno, simulation.cov, ontology.tsv, and snp2gene.tsv for downstream pipelines.

Workflow overview

  1. Configure simulation hyperparameters.

  2. Simulate genotypes (G) and phenotype (y).

  3. Build an ontology and mapping files aligned to the simulation.

  4. Export data, ontology, and causal metadata to disk.

  5. Train a model, predict attention, and evaluate epistasis retrieval.

Usage

Configure the simulation

output_dir = "./epistasis_simulation_samples/"
seed = 42
n_samples = 10000
n_pairs = 50
h2_epistatic = 0.1
ontology_coherence = 0.5
n_additive = 100

Simulate genotypes and phenotypes

The main simulation returns a dictionary with genotype matrix G, phenotype y, and causal SNP indices and pairs.

from src.utils.analysis.epistasis_simulation import simulate_epistasis

sim = simulate_epistasis(
    n_samples=2000,
    n_snps=50000,
    n_additive=200,
    n_pairs=75,
    h2_additive=0.4,
    h2_epistatic=0.2,
    n_ld_blocks=200,
    ld_rho=0.8,
)

print(sim["G"].shape)      # (samples, snps)
print(len(sim["pair_idx"]))

Build a hierarchical ontology aligned to the simulation

build_hierarchical_ontology maps simulated SNPs to genes and systems and builds a multi-level system hierarchy. The ontology_coherence parameter controls whether epistatic pairs are placed in the same, related, or distant systems.

from src.utils.analysis.epistasis_simulation import build_hierarchical_ontology

snp_df, gene_df, system_df = build_hierarchical_ontology(
    sim,
    n_genes=800,
    n_systems=60,
    ontology_coherence=0.5,
    overlap_prob=0.25,
    n_causal_systems=20,
)

snp_df.to_csv("snp2gene.tsv", sep="\t")
gene_df.to_csv("gene2system.tsv", sep="\t", index=False)
system_df.to_csv("system_hierarchy.tsv", sep="\t", index=False)

Export genotypes, phenotypes, and covariates

The model training and evaluation utilities expect TSV files for genotypes, phenotypes, and covariates. The notebook uses IID/FID pairs to keep identifiers consistent across files.

import pandas as pd
import numpy as np

iids = [f"sample_{i}" for i in range(sim["y"].shape[0])]

genotypes_df = pd.DataFrame(sim["G"], columns=snp_df.index.values)
genotypes_df.index = iids
genotypes_df.index.name = "IID"
genotypes_df.to_csv(f"{output_dir}/genotypes.tsv", sep="\t")

pheno_df = pd.DataFrame({"FID": iids, "IID": iids, "phenotype": sim["y"]})
pheno_df.to_csv(f"{output_dir}/simulation.pheno", index=False, sep="\t")

cov_df = pd.DataFrame(
    {
        "FID": iids,
        "IID": iids,
        "SEX": np.random.randint(2, size=sim["y"].shape[0]),
        "AGE": np.random.randint(40, 70, size=sim["y"].shape[0]),
    }
)
cov_df.to_csv(f"{output_dir}/simulation.cov", index=False, sep="\t")

Export ontology and causal metadata

The ontology file combines gene-to-system and system-to-supersystem edges into a single parent, child, interaction format expected by the training and analysis pipelines. The causal metadata file is used later for evaluation.

import json

gene_df = gene_df.rename(columns={"gene_id": "child", "system_id": "parent"})
gene_df["interaction"] = "gene"
system_df = system_df.rename(columns={"system_id": "child", "supersystem_id": "parent"})
system_df["interaction"] = "default"

ontology_df = pd.concat(
    [
        gene_df[["parent", "child", "interaction"]],
        system_df[["parent", "child", "interaction"]],
    ]
)
ontology_df.to_csv(f"{output_dir}/ontology.tsv", index=False, sep="\t", header=False)
snp_df.to_csv(f"{output_dir}/snp2gene.tsv", index=True, sep="\t")

causal_info = {
    "epistatic_pairs": [list(map(int, p)) for p in sim["pair_idx"]],
    "additive_snps": list(map(int, sim["additive_idx"])),
}
with open(f"{output_dir}/causal_info.json", "w") as f:
    json.dump(causal_info, f, indent=2)

Train, predict attention, and evaluate retrieval

Once the artifacts are saved, you can train a model, compute attention scores, and run epistasis retrieval evaluation as in the notebook.

python train_snp2p_model.py \
  --train-tsv "epistasis_simulation_samples/genotypes.tsv" \
  --train-pheno "epistasis_simulation_samples/simulation.pheno" \
  --train-cov "epistasis_simulation_samples/simulation.cov" \
  --onto "epistasis_simulation_samples/ontology.tsv" \
  --snp2gene "epistasis_simulation_samples/snp2gene.tsv" \
  --out "epistasis_simulation_samples/output_model.txt" \
  --epochs 51 \
  --batch-size 64 \
  --lr 1e-4 \
  --qt "phenotype" \
  --jobs 4 \
  --cuda 0 \
  --sys2env --env2sys --sys2gene \
  --sys2pheno --gene2pheno \
  --val-step 50 \
  --use_hierarchical_transformer

python predict_attention.py \
  --model "epistasis_simulation_samples/output_model.txt.50" \
  --tsv "epistasis_simulation_samples/genotypes.tsv" \
  --pheno "epistasis_simulation_samples/simulation.pheno" \
  --cov "epistasis_simulation_samples/simulation.cov" \
  --onto "epistasis_simulation_samples/ontology.tsv" \
  --snp2gene "epistasis_simulation_samples/snp2gene.tsv" \
  --out "epistasis_simulation_samples/output_model.txt.50" \
  --batch-size 256 \
  --cuda 0
from src.utils.analysis.epistasis_retrieval_evaluation import (
    EvaluationConfig,
    EpistasisRetrievalEvaluator,
)

config = EvaluationConfig(
    causal_info="epistasis_simulation_samples/causal_info.json",
    attention_results="epistasis_simulation_samples/output_model.txt.50.phenotype.head_sum.csv",
    system_importance="epistasis_simulation_samples/output_model.txt.50.phenotype.head_sum.sys_importance.csv",
    tsv="epistasis_simulation_samples/genotypes.tsv",
    pheno="epistasis_simulation_samples/simulation.pheno",
    cov="epistasis_simulation_samples/simulation.cov",
    onto="epistasis_simulation_samples/ontology.tsv",
    snp2gene="epistasis_simulation_samples/snp2gene.tsv",
    top_n_systems=5,
    snp_threshold=50,
    num_workers=1,
    executor_type="threads",
    quantiles=[0.9],
    output_prefix="epistasis_simulation_samples/simulation_output.50",
)

evaluator = EpistasisRetrievalEvaluator(config)
evaluator.evaluate()

Optional: statistical sanity checks

The notebook includes quick checks to verify that additive and interaction effects are detectable in the synthetic dataset. For example, you can fit a linear model per SNP for additive signals and evaluate interaction terms for known epistatic pairs.

import statsmodels.api as sm

results = []
for snp in genotypes.columns:
    df_snp = df_full[["phenotype", "SEX", "AGE", snp]].dropna()
    X = sm.add_constant(df_snp[["SEX", "AGE", snp]])
    model = sm.OLS(df_snp["phenotype"], X).fit()
    results.append((snp, model.pvalues[snp]))

Outputs

  • simulate_epistasis returns a dictionary with the genotype matrix, phenotype, causal SNP indices, epistatic pairs, effect sizes, and LD block assignments.

  • build_hierarchical_ontology returns three DataFrames suitable for use with tree utilities and downstream analyses: SNP-to-gene, gene-to-system, and the system hierarchy.

  • Exported artifacts on disk typically include:

    • genotypes.tsv: Genotype dosages with IID index.

    • simulation.pheno: Phenotype values with FID/IID.

    • simulation.cov: Covariates (e.g., SEX, AGE) with FID/IID.

    • ontology.tsv: Combined system and gene edges for the parser.

    • snp2gene.tsv: SNP-to-gene mapping.

    • causal_info.json: Ground-truth additive SNPs and epistatic pairs.