Tree

Overview

The Tree utilities parse hierarchical ontologies of systems (terms) and genes, and optionally link SNPs to genes. They are used throughout the project to build attention masks and define structured inputs for epistasis analysis and SNP2P datasets.

Usage and examples

TreeParser

The ontology input describes parent/child relationships between systems (terms) and genes. It can be supplied as a pandas DataFrame or as a path to a tabular file with parent, child, and interaction columns, where interaction names the relationship type (in the example below, is_a for a system-to-system edge and gene for a system-to-gene annotation). Once loaded, you can inspect the structure or collapse small terms.

import pandas as pd
from g2pt.tree import TreeParser

# Minimal parent/child ontology with interaction types.
ontology_df = pd.DataFrame(
    {
        "parent": ["immune_system", "immune_system", "adaptive_immunity"],
        "child": ["innate_immunity", "adaptive_immunity", "IL7R"],
        "interaction": ["is_a", "is_a", "gene"],
    }
)

tree = TreeParser(ontology_df)
tree.summary()
collapsed_tree = tree.collapse(min_term_size=2)
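
To make the table layout concrete, the following is a minimal sketch in plain Python of how the three columns can be read. It is not g2pt code: split_ontology is a hypothetical helper, and it assumes that rows whose interaction is gene annotate a gene to a system while every other row is a system-to-system edge.

```python
from collections import defaultdict

# The toy ontology from the example above, as (parent, child, interaction) rows.
rows = [
    ("immune_system", "innate_immunity", "is_a"),
    ("immune_system", "adaptive_immunity", "is_a"),
    ("adaptive_immunity", "IL7R", "gene"),
]


def split_ontology(rows, gene_label="gene"):
    """Separate system->system edges from system->gene annotations.

    Hypothetical helper for illustration; not part of g2pt.
    """
    sys2sys = defaultdict(list)   # parent system -> child systems
    sys2gene = defaultdict(list)  # system -> directly annotated genes
    for parent, child, interaction in rows:
        if interaction == gene_label:
            sys2gene[parent].append(child)
        else:
            # Assumption: any non-gene interaction links two systems.
            sys2sys[parent].append(child)
    return dict(sys2sys), dict(sys2gene)


sys2sys, sys2gene = split_ontology(rows)
print(sys2sys)
print(sys2gene)
```

Here immune_system ends up with two child systems, and IL7R is attached directly to adaptive_immunity.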

SNPTreeParser

The SNP mapping table must include at least snp and gene columns, plus a chr column if you plan to use by_chr=True. Both the ontology and the SNP mapping can be passed as file paths or as pandas DataFrames.

from g2pt.tree import SNPTreeParser

tree_parser = SNPTreeParser(
    ontology="ontology.tsv",
    snp2gene="snp2gene.tsv",
    by_chr=True,
)
tree_parser.summary()
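
As an illustration of the expected columns, the sketch below builds a small SNP mapping in plain Python and groups it by gene and by chromosome. The SNP identifiers and the group_snps helper are made up for this example, and the per-chromosome grouping is only a guess at the kind of lookup by_chr=True relies on, not a description of what SNPTreeParser does internally.

```python
from collections import defaultdict

# Hypothetical SNP-to-gene rows using the snp, gene, and chr columns.
snp2gene_rows = [
    {"snp": "rs0001", "gene": "IL7R", "chr": 5},
    {"snp": "rs0002", "gene": "IL7R", "chr": 5},
    {"snp": "rs0003", "gene": "CD4", "chr": 12},
]


def group_snps(rows):
    """Group SNP ids by gene and by chromosome (illustrative helper)."""
    gene2snps = defaultdict(list)
    chr2snps = defaultdict(list)
    for row in rows:
        gene2snps[row["gene"]].append(row["snp"])
        chr2snps[row["chr"]].append(row["snp"])
    return dict(gene2snps), dict(chr2snps)


gene2snps, chr2snps = group_snps(snp2gene_rows)
print(gene2snps)
print(chr2snps)
```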

API documentation

class TreeParser

Parses and represents a hierarchical ontology of systems and genes.

This class loads an ontology from a file or DataFrame, builds a graph representation, and provides methods for manipulating and analyzing the ontology.

__init__(ontology, dense_attention=False, sys_annot_file=None)

Initializes the TreeParser.

Parameters:
  • ontology (pandas.DataFrame or str) – A pandas DataFrame or path to a file containing the ontology.

  • dense_attention (bool, optional) – Whether to use dense attention.

  • sys_annot_file (str, optional) – Path to a file containing system annotations.

from_obo(obo_path, dense_attention=False)

Create a TreeParser instance from an OBO file.

Parameters:
  • obo_path (str) – Path to the OBO file.

  • dense_attention (bool, optional) – Whether to use dense attention.
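
For readers unfamiliar with OBO, the sketch below shows how [Term] stanzas with id and is_a tags map onto the parent, child, and interaction columns used elsewhere on this page. It is an assumption-laden illustration, not the from_obo implementation: obo_to_rows is a hypothetical helper, the EX: identifiers are invented, and other relationship tags and trailing modifiers are ignored.

```python
def obo_to_rows(obo_text):
    """Turn [Term] stanzas into (parent, child, "is_a") rows.

    Hypothetical helper; only the id and is_a tags are read.
    """
    rows = []
    term_id = None
    in_term = False
    for raw in obo_text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop a trailing "! comment"
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are of interest here.
            in_term = line == "[Term]"
            term_id = None
        elif in_term and line.startswith("id:"):
            term_id = line[len("id:"):].strip()
        elif in_term and term_id and line.startswith("is_a:"):
            parent = line[len("is_a:"):].strip()
            rows.append((parent, term_id, "is_a"))
    return rows


# Made-up OBO snippet; the EX: identifiers are not real ontology terms.
obo_text = """format-version: 1.2

[Term]
id: EX:0000001
name: immune system

[Term]
id: EX:0000002
name: adaptive immunity
is_a: EX:0000001 ! immune system

[Typedef]
id: part_of
"""

obo_rows = obo_to_rows(obo_text)
print(obo_rows)
```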

init_ontology(ontology_df, inplace=True, verbose=True)

Initializes the ontology from a DataFrame.

Parameters:
  • ontology_df (pandas.DataFrame) – A pandas DataFrame containing the ontology.

  • inplace (bool, optional) – Whether to modify the object in place.

  • verbose (bool, optional) – Whether to print progress messages.

build_mask(ordered_query, ordered_key, query2key_dict, interaction_value=0, mask_value=-10**4)

Builds a mask for attention.

Parameters:
  • ordered_query (list) – A list of query items.

  • ordered_key (list) – A list of key items.

  • query2key_dict (dict) – A dictionary mapping query items to key items.

  • interaction_value (int, optional) – The value to use for interactions.

  • mask_value (int, optional) – The value to use for non-interactions.

Returns:

A tuple containing the query-to-index mapping, the index-to-query mapping, the key-to-index mapping, the index-to-key mapping, and the mask.

Return type:

tuple
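
The default values (0 for interactions, -10**4 elsewhere) are consistent with an additive attention mask, where the mask is added to attention scores before the softmax so that unconnected pairs are suppressed. The sketch below reproduces that behavior in plain Python under stated assumptions: build_mask_sketch is a hypothetical stand-in, query2key_dict values are taken to be iterables of keys, the mask has one row per query and one column per key, and a nested list is used where the library may return an array or tensor.

```python
def build_mask_sketch(ordered_query, ordered_key, query2key_dict,
                      interaction_value=0, mask_value=-10**4):
    """Illustrative stand-in for build_mask; not the g2pt implementation."""
    query2ind = {q: i for i, q in enumerate(ordered_query)}
    ind2query = {i: q for q, i in query2ind.items()}
    key2ind = {k: i for i, k in enumerate(ordered_key)}
    ind2key = {i: k for k, i in key2ind.items()}

    # Start fully masked, then open every connected (query, key) pair.
    mask = [[mask_value] * len(ordered_key) for _ in ordered_query]
    for query, keys in query2key_dict.items():
        for key in keys:
            # Assumption: pairs naming items outside the ordered lists are skipped.
            if query in query2ind and key in key2ind:
                mask[query2ind[query]][key2ind[key]] = interaction_value
    return query2ind, ind2query, key2ind, ind2key, mask


systems = ["adaptive_immunity", "innate_immunity"]
genes = ["IL7R", "TLR4"]
sys2gene = {"adaptive_immunity": ["IL7R"], "innate_immunity": ["TLR4"]}

query2ind, ind2query, key2ind, ind2key, mask = build_mask_sketch(
    systems, genes, sys2gene
)
for row in mask:
    print(row)
```

Each system row is open (0) only in the column of its own gene; every other entry holds the mask value.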

summary(system=True, gene=True)

Print a summary of the systems and genes in the ontology.

Parameters:
  • system (bool, optional) – Whether to include system summary.

  • gene (bool, optional) – Whether to include gene summary.

collapse(to_keep=None, min_term_size=2, verbose=True, inplace=False)

Collapses the ontology by removing small terms.

Parameters:
  • to_keep (list, optional) – A list of terms to keep, even if they are small.

  • min_term_size (int, optional) – The minimum number of genes a term must have to be kept.

  • verbose (bool, optional) – Whether to print progress messages.

  • inplace (bool, optional) – Whether to modify the object in place.
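
One plausible reading of collapsing is sketched below in plain Python: a term's size is the number of distinct genes annotated to it or to any of its descendants, and a term below min_term_size is removed with its genes and surviving child terms reattached to its parent. These rules, the collapse_rows helper, and the handling of to_keep are assumptions made for illustration; the actual collapse method may count or rewire differently.

```python
def collapse_rows(rows, min_term_size=2, to_keep=None, gene_label="gene"):
    """Drop small systems and lift their contents to the parent (sketch only)."""
    to_keep = set(to_keep or [])
    edges = {}  # parent -> list of (child, interaction), in input order
    for parent, child, interaction in rows:
        edges.setdefault(parent, []).append((child, interaction))

    cache = {}

    def genes_below(term):
        # Genes annotated to the term or to any descendant (assumes no cycles).
        if term not in cache:
            genes = set()
            for child, interaction in edges.get(term, ()):
                if interaction == gene_label:
                    genes.add(child)
                else:
                    genes |= genes_below(child)
            cache[term] = genes
        return cache[term]

    def is_small(term):
        return term not in to_keep and len(genes_below(term)) < min_term_size

    def expanded_edges(term):
        # The term's edges, with each small child system replaced by its contents.
        out = []
        for child, interaction in edges.get(term, ()):
            if interaction != gene_label and is_small(child):
                out.extend(expanded_edges(child))
            else:
                out.append((child, interaction))
        return out

    collapsed = []
    for parent in edges:
        if is_small(parent):
            continue  # its contents were already lifted into its parent
        for child, interaction in expanded_edges(parent):
            row = (parent, child, interaction)
            if row not in collapsed:
                collapsed.append(row)
    return collapsed


rows = [
    ("immune_system", "adaptive_immunity", "is_a"),
    ("immune_system", "innate_immunity", "is_a"),
    ("adaptive_immunity", "IL7R", "gene"),
    ("adaptive_immunity", "CD4", "gene"),
    ("innate_immunity", "TLR4", "gene"),
]

collapsed = collapse_rows(rows, min_term_size=2)
for row in collapsed:
    print(row)
```

In this toy input innate_immunity has a single gene, so it is removed and TLR4 is attached to immune_system. Passing to_keep=["innate_immunity"] leaves the rows unchanged. Note that in this sketch a small root term is dropped together with any genes attached directly to it.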

class SNPTreeParser

Parses SNP→gene mappings alongside the system ontology.

This class extends TreeParser by wiring SNPs into the ontology so downstream datasets can emit SNP, gene, and system indices. Provide the same parent/child ontology used for TreeParser plus a SNP→gene mapping table.

__init__(ontology, snp2gene, dense_attention=False, sys_annot_file=None, by_chr=False, multiple_phenotypes=False, block_bias=False)

Initializes the SNPTreeParser.

Parameters:
  • ontology (str or pandas.DataFrame) – Path to a file, or a pandas DataFrame, containing the parent–child ontology.

  • snp2gene (str or pandas.DataFrame) – Path to a file, or a pandas DataFrame, containing the SNP→gene mapping.

  • dense_attention (bool, optional) – Whether to use dense attention.

  • sys_annot_file (str, optional) – Path to a file containing system annotations.

  • by_chr (bool, optional) – Whether to process by chromosome.

  • multiple_phenotypes (bool, optional) – Whether to handle multiple phenotypes.

  • block_bias (bool, optional) – Whether to use block bias.

init_ontology_with_snp(ontology_df, snp2gene, inplace=True, multiple_phenotypes=False, verbose=True)

Extends TreeParser.init_ontology by also loading the SNP→gene table (snp2gene) and linking each SNP to its gene in the ontology.

Parameters:
  • ontology_df (pandas.DataFrame) – A pandas DataFrame containing the ontology.

  • snp2gene (str or pandas.DataFrame) – Path to a file, or a pandas DataFrame, containing the SNP→gene mapping.

  • inplace (bool, optional) – Whether to modify the object in place.

  • multiple_phenotypes (bool, optional) – Whether to handle multiple phenotypes.

  • verbose (bool, optional) – Whether to print progress messages.
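
To show what linking SNPs into the ontology can look like, the sketch below chains SNP→gene→system in plain Python and assigns integer indices to each entity type. Everything here is an assumption made for illustration: build_indices is a hypothetical helper, the SNP identifiers are invented, indices follow sorted order, and SNPs whose gene is absent from the ontology are dropped. The real init_ontology_with_snp may order, filter, or store these mappings differently.

```python
def build_indices(ontology_rows, snp2gene_rows, gene_label="gene"):
    """Index systems, genes, and SNPs and chain SNP -> gene -> system (sketch)."""
    gene2sys = {}
    systems = set()
    for parent, child, interaction in ontology_rows:
        systems.add(parent)
        if interaction == gene_label:
            gene2sys.setdefault(child, []).append(parent)
        else:
            systems.add(child)

    # Assumption: SNPs mapped to genes outside the ontology are ignored.
    kept = [(snp, gene) for snp, gene in snp2gene_rows if gene in gene2sys]

    sys2ind = {s: i for i, s in enumerate(sorted(systems))}
    gene2ind = {g: i for i, g in enumerate(sorted(gene2sys))}
    snp2ind = {s: i for i, s in enumerate(sorted({snp for snp, _ in kept}))}

    # Each SNP reaches the systems that its gene is directly annotated to.
    snp2sys = {}
    for snp, gene in kept:
        snp2sys.setdefault(snp, set()).update(gene2sys[gene])
    return sys2ind, gene2ind, snp2ind, snp2sys


ontology_rows = [
    ("immune_system", "innate_immunity", "is_a"),
    ("immune_system", "adaptive_immunity", "is_a"),
    ("adaptive_immunity", "IL7R", "gene"),
]
# Hypothetical (snp, gene) pairs; CD4 is not in the toy ontology.
snp2gene_rows = [("rs0001", "IL7R"), ("rs0002", "IL7R"), ("rs0003", "CD4")]

sys2ind, gene2ind, snp2ind, snp2sys = build_indices(ontology_rows, snp2gene_rows)
print(sys2ind)
print(gene2ind)
print(snp2ind)
print(snp2sys)
```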