# Epistasis Simulation

## Overview

The epistasis simulation utilities generate synthetic genotype/phenotype datasets
with both additive and pairwise interaction effects. The simulation is designed
for benchmarking discovery workflows and includes options for LD structure,
heritability budgets, and biasing epistatic pairs toward non-significant SNPs.
A helper function can also build a hierarchical SNP→Gene→System ontology that
is aligned with the simulated causal structure.

For a full walkthrough, see the notebook: [`Epistasis_simulation.ipynb`](../../Epistasis_simulation.ipynb).

## Key features

- **Additive + epistatic signal control**: Allocate separate heritability budgets
  for additive and pairwise interaction effects.
- **LD-aware genotypes**: Optionally simulate LD blocks and correlated SNPs.
- **Ontology-aligned mappings**: Generate SNP→Gene and Gene→System mappings that
  align with the simulated causal pairs, with tunable system coherence.
- **Export-ready artifacts**: Write out `genotypes.tsv`, `simulation.pheno`,
  `simulation.cov`, `ontology.tsv`, and `snp2gene.tsv` for downstream pipelines.

## Workflow overview

1. Configure simulation hyperparameters.
2. Simulate genotypes (`G`) and phenotype (`y`).
3. Build an ontology and mapping files aligned to the simulation.
4. Export data, ontology, and causal metadata to disk.
5. Train a model, predict attention, and evaluate epistasis retrieval.

## Usage

### Configure the simulation

```python
output_dir = "./epistasis_simulation_samples/"
seed = 42
n_samples = 10000
n_pairs = 50
h2_epistatic = 0.1
ontology_coherence = 0.5
n_additive = 100
```

### Simulate genotypes and phenotypes

The main simulation returns a dictionary with genotype matrix `G`, phenotype `y`,
and causal SNP indices and pairs.

```python
from src.utils.analysis.epistasis_simulation import simulate_epistasis

sim = simulate_epistasis(
    n_samples=2000,
    n_snps=50000,
    n_additive=200,
    n_pairs=75,
    h2_additive=0.4,
    h2_epistatic=0.2,
    n_ld_blocks=200,
    ld_rho=0.8,
)

print(sim["G"].shape)      # (samples, snps)
print(len(sim["pair_idx"]))
```

### Build a hierarchical ontology aligned to the simulation

`build_hierarchical_ontology` maps simulated SNPs to genes and systems and
builds a multi-level system hierarchy. The `ontology_coherence` parameter
controls whether epistatic pairs are placed in the same, related, or distant
systems.

```python
from src.utils.analysis.epistasis_simulation import build_hierarchical_ontology

snp_df, gene_df, system_df = build_hierarchical_ontology(
    sim,
    n_genes=800,
    n_systems=60,
    ontology_coherence=0.5,
    overlap_prob=0.25,
    n_causal_systems=20,
)

snp_df.to_csv("snp2gene.tsv", sep="\t")
gene_df.to_csv("gene2system.tsv", sep="\t", index=False)
system_df.to_csv("system_hierarchy.tsv", sep="\t", index=False)
```

### Export genotypes, phenotypes, and covariates

The model training and evaluation utilities expect TSV files for genotypes,
phenotypes, and covariates. The notebook uses IID/FID pairs to keep identifiers
consistent across files.

```python
import pandas as pd
import numpy as np

iids = [f"sample_{i}" for i in range(sim["y"].shape[0])]

genotypes_df = pd.DataFrame(sim["G"], columns=snp_df.index.values)
genotypes_df.index = iids
genotypes_df.index.name = "IID"
genotypes_df.to_csv(f"{output_dir}/genotypes.tsv", sep="\t")

pheno_df = pd.DataFrame({"FID": iids, "IID": iids, "phenotype": sim["y"]})
pheno_df.to_csv(f"{output_dir}/simulation.pheno", index=False, sep="\t")

cov_df = pd.DataFrame(
    {
        "FID": iids,
        "IID": iids,
        "SEX": np.random.randint(2, size=sim["y"].shape[0]),
        "AGE": np.random.randint(40, 70, size=sim["y"].shape[0]),
    }
)
cov_df.to_csv(f"{output_dir}/simulation.cov", index=False, sep="\t")
```

### Export ontology and causal metadata

The ontology file combines gene-to-system and system-to-supersystem edges into
a single `parent`, `child`, `interaction` format expected by the training and
analysis pipelines. The causal metadata file is used later for evaluation.

```python
import json

gene_df = gene_df.rename(columns={"gene_id": "child", "system_id": "parent"})
gene_df["interaction"] = "gene"
system_df = system_df.rename(columns={"system_id": "child", "supersystem_id": "parent"})
system_df["interaction"] = "default"

ontology_df = pd.concat(
    [
        gene_df[["parent", "child", "interaction"]],
        system_df[["parent", "child", "interaction"]],
    ]
)
ontology_df.to_csv(f"{output_dir}/ontology.tsv", index=False, sep="\t", header=False)
snp_df.to_csv(f"{output_dir}/snp2gene.tsv", index=True, sep="\t")

causal_info = {
    "epistatic_pairs": [list(map(int, p)) for p in sim["pair_idx"]],
    "additive_snps": list(map(int, sim["additive_idx"])),
}
with open(f"{output_dir}/causal_info.json", "w") as f:
    json.dump(causal_info, f, indent=2)
```

### Train, predict attention, and evaluate retrieval

Once the artifacts are saved, you can train a model, compute attention scores,
and run epistasis retrieval evaluation as in the notebook.

```bash
python train_snp2p_model.py \
  --train-tsv "epistasis_simulation_samples/genotypes.tsv" \
  --train-pheno "epistasis_simulation_samples/simulation.pheno" \
  --train-cov "epistasis_simulation_samples/simulation.cov" \
  --onto "epistasis_simulation_samples/ontology.tsv" \
  --snp2gene "epistasis_simulation_samples/snp2gene.tsv" \
  --out "epistasis_simulation_samples/output_model.txt" \
  --epochs 51 \
  --batch-size 64 \
  --lr 1e-4 \
  --qt "phenotype" \
  --jobs 4 \
  --cuda 0 \
  --sys2env --env2sys --sys2gene \
  --sys2pheno --gene2pheno \
  --val-step 50 \
  --use_hierarchical_transformer

python predict_attention.py \
  --model "epistasis_simulation_samples/output_model.txt.50" \
  --tsv "epistasis_simulation_samples/genotypes.tsv" \
  --pheno "epistasis_simulation_samples/simulation.pheno" \
  --cov "epistasis_simulation_samples/simulation.cov" \
  --onto "epistasis_simulation_samples/ontology.tsv" \
  --snp2gene "epistasis_simulation_samples/snp2gene.tsv" \
  --out "epistasis_simulation_samples/output_model.txt.50" \
  --batch-size 256 \
  --cuda 0
```

```python
from src.utils.analysis.epistasis_retrieval_evaluation import (
    EvaluationConfig,
    EpistasisRetrievalEvaluator,
)

config = EvaluationConfig(
    causal_info="epistasis_simulation_samples/causal_info.json",
    attention_results="epistasis_simulation_samples/output_model.txt.50.phenotype.head_sum.csv",
    system_importance="epistasis_simulation_samples/output_model.txt.50.phenotype.head_sum.sys_importance.csv",
    tsv="epistasis_simulation_samples/genotypes.tsv",
    pheno="epistasis_simulation_samples/simulation.pheno",
    cov="epistasis_simulation_samples/simulation.cov",
    onto="epistasis_simulation_samples/ontology.tsv",
    snp2gene="epistasis_simulation_samples/snp2gene.tsv",
    top_n_systems=5,
    snp_threshold=50,
    num_workers=1,
    executor_type="threads",
    quantiles=[0.9],
    output_prefix="epistasis_simulation_samples/simulation_output.50",
)

evaluator = EpistasisRetrievalEvaluator(config)
evaluator.evaluate()
```

### Optional: statistical sanity checks

The notebook includes quick checks to verify that additive and interaction
effects are detectable in the synthetic dataset. For example, you can fit a
linear model per SNP for additive signals and evaluate interaction terms for
known epistatic pairs.

```python
import statsmodels.api as sm

results = []
for snp in genotypes.columns:
    df_snp = df_full[["phenotype", "SEX", "AGE", snp]].dropna()
    X = sm.add_constant(df_snp[["SEX", "AGE", snp]])
    model = sm.OLS(df_snp["phenotype"], X).fit()
    results.append((snp, model.pvalues[snp]))
```

## Outputs

- `simulate_epistasis` returns a dictionary with the genotype matrix, phenotype,
  causal SNP indices, epistatic pairs, effect sizes, and LD block assignments.
- `build_hierarchical_ontology` returns three DataFrames suitable for use with
  tree utilities and downstream analyses: SNP-to-gene, gene-to-system, and the
  system hierarchy.
- Exported artifacts on disk typically include:
  - `genotypes.tsv`: Genotype dosages with IID index.
  - `simulation.pheno`: Phenotype values with FID/IID.
  - `simulation.cov`: Covariates (e.g., SEX, AGE) with FID/IID.
  - `ontology.tsv`: Combined system and gene edges for the parser.
  - `snp2gene.tsv`: SNP-to-gene mapping.
  - `causal_info.json`: Ground-truth additive SNPs and epistatic pairs.