TreeParser Documentation

The TreeParser class, located in src/utils/tree/tree.py, is a comprehensive tool for parsing, manipulating, and analyzing hierarchical biological ontologies, such as the Gene Ontology (GO). It is designed to handle complex relationships between systems (e.g., GO terms) and genes, providing a suite of methods for structural analysis, data transformation, and ontology simplification.

The class uses pandas for data manipulation and networkx to represent the ontology as a directed acyclic graph (DAG), enabling powerful graph-based operations.

Initialization

The TreeParser is initialized with a path to an ontology file and an optional system annotation file.

from src.utils.tree import TreeParser

# Path to the ontology file (tab-separated: parent, child, interaction_type)
ontology_file = 'path/to/ontology.tsv'
# Optional: Path to system annotations (tab-separated: system_id, description)
annotation_file = 'path/to/annotations.tsv'

# Initialize the parser
tree = TreeParser(ontology_file, sys_annot_file=annotation_file)

Ontology file format

The ontology file is a tab-separated table with three columns:

  1. parent: parent system/term ID.

  2. child: child system/term ID or gene ID.

  3. interaction_type: relationship type (e.g., is_a, part_of, gene).

Use interaction_type="gene" for system → gene edges; all other interaction types are treated as system → system relationships. This distinction drives mask construction and hierarchy traversal.

During initialization (init_ontology), the class performs several key setup steps:

  1. Loads Data: Reads the ontology and annotations into pandas DataFrames.

  2. Builds Graph: Constructs a networkx.DiGraph representing the system hierarchy.

  3. Creates Mappings: Builds dictionaries to map systems and genes to unique integer indices (e.g., sys2ind, gene2ind) and vice-versa.

  4. Propagates Genes: Calculates sys2gene_full, a dictionary where each system is mapped to a list of all genes contained within it and all its descendant systems. This provides a complete view of gene membership up the hierarchy.

  5. Calculates Heights: Computes the height of each node in the system graph, which is used in structural similarity calculations. The height is the length of the longest path from a node to a leaf.


Ontology Simplification (Collapsing)

A key feature of TreeParser is its ability to simplify the ontology by collapsing, or removing, terms that are redundant or uninformative. This is crucial for reducing complexity and focusing on the most relevant parts of the ontology. When a term is collapsed, its children are re-parented to its parents, preserving the overall hierarchy.

There are three methods for collapsing terms, each with a different strategy.

1. collapse()

This is a general-purpose collapsing method that removes terms based on two main criteria:

  • Redundancy: Systems that contain the exact same set of genes (after gene propagation) are considered redundant. The method keeps one representative term (the first one alphabetically) and collapses the others.

  • Size: Systems with fewer genes than a specified min_term_size (default is 2) are considered too small to be informative and are collapsed.

Algorithm:

  1. The method first identifies groups of redundant terms by hashing their full gene sets.

  2. It then identifies small terms by checking the size of their full gene sets.

  3. A final list of terms to collapse is compiled from both criteria.

  4. The graph is rewired by connecting the parents and children of each collapsed node.

  5. Genes from collapsed nodes are re-assigned to their parent nodes.

  6. The ontology is rebuilt from the modified graph.

2. collapse_by_gene_similarity()

This method collapses terms based on the similarity of their gene sets, measured by the Jaccard index. It is useful for grouping together systems that have a significant overlap in their member genes, even if they are not identical.

Algorithm:

  1. Pairwise Similarity: The method calculates the Jaccard similarity for every pair of systems based on their sys2gene_full gene sets. The Jaccard index is computed as:

    J(A, B) = |A ∩ B| / |A ∪ B|
    
  2. Clustering: The resulting similarity matrix is converted into a distance matrix (1 - similarity). This matrix is then used with sklearn.cluster.AgglomerativeClustering to group systems. The clustering algorithm merges systems into clusters until no two systems in a cluster have a similarity below the given similarity_threshold.

  3. Representative Selection: For each cluster with more than one term, a single representative is chosen. The representative is the term with the most genes. Ties are broken alphabetically.

  4. Collapsing: All non-representative terms in each cluster are collapsed. The graph rewiring and ontology reconstruction process is the same as in the collapse method.

3. collapse_by_structural_similarity()

This method uses the structure of the ontology graph itself to determine similarity, rather than gene sets. It is based on the concept of the Most Informative Common Ancestor (MICA) and uses a variation of Lin’s similarity metric.

Algorithm:

  1. Pairwise Similarity: The structural similarity between two terms is calculated based on the height of their MICA. The height of a node is the length of the longest path to a leaf node. The similarity formula is:

    Sim(A, B) = (2 * Height(MICA(A, B))) / (Height(A) + Height(B))
    

    This value will be high if two terms share a “close” common ancestor relative to their own distance from the leaves.

  2. Clustering: Unlike the gene similarity method, this approach uses a simpler, iterative clustering logic. It finds all terms with a similarity greater than the similarity_threshold to form clusters.

  3. Representative Selection: For each cluster, the term with the greatest height is chosen as the representative. This prioritizes terms that are higher up in the ontology. Ties are broken alphabetically.

  4. Collapsing: All other terms in the cluster are collapsed, and the ontology is rebuilt.


Other Notable Methods

  • _get_annotated_name(term): A helper to return a formatted string of a term’s name and its annotation if available, used for verbose logging.

  • rename_systems_with_llm(...): An advanced feature that uses a Large Language Model (OpenAI’s GPT or Google’s Gemini) to suggest new, more informative names for systems based on their gene content and a target phenotype. This can greatly improve the interpretability of a simplified ontology.

Usage Example

# Initialize the parser
tree = TreeParser('data/ontology.tsv', sys_annot_file='data/annotations.tsv')

print(f"Original number of systems: {tree.n_systems}")

# Collapse using gene similarity
tree.collapse_by_gene_similarity(similarity_threshold=0.8, verbose=True, inplace=True)

print(f"Number of systems after collapsing: {tree.n_systems}")

# Save the simplified ontology
tree.save_ontology('output/collapsed_ontology.tsv')

SNPTreeParser

The SNPTreeParser class, located in src/utils/tree/snp_tree.py, extends TreeParser to integrate Single Nucleotide Polymorphism (SNP) data. It is designed to build a unified hierarchy that connects SNPs to genes and genes to systems, making it a powerful tool for genomic analyses that require an ontology-based approach (e.g., gene set enrichment analysis for GWAS results).

Initialization

In addition to the ontology and sys_annot_file arguments from TreeParser, SNPTreeParser requires a snp2gene file.

from src.utils.tree import SNPTreeParser

# Path to the ontology file
ontology_file = 'path/to/ontology.tsv'
# Path to the SNP-to-gene mapping file (tab-separated: snp, gene, chr, pos)
snp2gene_file = 'path/to/snp2gene.tsv'

SNP-to-gene mapping format

The SNP mapping file should be a tab-separated table with at least:

  1. snp: SNP identifier (string or integer).

  2. gene: gene identifier.

Optional columns:

  • chr: chromosome identifier (required for chromosome filtering and block construction).

  • pos: base-pair position used for distance-based filtering in epistasis analysis.

These mappings populate snp2ind, snp2gene, and snp2chr/snp2pos lookups, and enable system-level SNP aggregation (sys2snp).

Usage example

from src.utils.tree import SNPTreeParser

tree_parser = SNPTreeParser(
    ontology=ontology_file,
    snp2gene=snp2gene_file,
    by_chr=True,
)

print(tree_parser.n_systems, tree_parser.n_genes, tree_parser.n_snps)
tree_parser.summary()

Downstream usage

The SNP-aware parser is used by datasets and analysis utilities. For example, PLINKDataset expects a SNPTreeParser so it can align SNP indices with hierarchical masks, and epistasis discovery uses sys2snp to restrict testing to biologically relevant subsets.

The init_ontology_with_snp method orchestrates the initialization:

  1. Initializes Parent: It first calls the parent TreeParser.init_ontology method to build the gene-system hierarchy.

  2. Loads SNP Data: It reads the snp2gene mapping file.

  3. Creates SNP Mappings: It builds dictionaries to map SNPs to unique integer indices (snp2ind) and vice-versa.

  4. Builds Masks: It creates a snp2gene_mask, a matrix that represents the connections between all SNPs and all genes in the ontology.

Usage Example

# Initialize the SNP-aware parser
snp_tree = SNPTreeParser(
    'data/ontology.tsv',
    'data/snp2gene.tsv',
    sys_annot_file='data/annotations.tsv'
)

print(f"Initialized with {snp_tree.n_snps} SNPs.")

# Get all SNPs associated with a specific system
target_system = 'GO:0008150' # biological_process
associated_snps = snp_tree.sys2snp.get(target_system, [])

print(f"Found {len(associated_snps)} SNPs linked to '{target_system}'.")

# Retain only a specific list of SNPs and rebuild the ontology
snps_to_keep = ['rs123', 'rs456']
snp_tree.retain_snps(snps_to_keep, inplace=True)

print(f"Ontology now contains {snp_tree.n_snps} SNPs and {snp_tree.n_genes} genes.")