Pathway enrichment analysis (PEA), also known as functional enrichment analysis or overrepresentation analysis, is a widely used computational biology method. PEA identifies biological pathways that are significantly overrepresented in a given set of genes or other biomolecular entities (e.g., microRNAs or metabolites). It compares the genes of interest against reference databases of known pathways to determine whether a pathway contributes more members to the input list than would be expected by chance. The goal is to gain a deeper understanding of the biological processes associated with a particular biological condition or experimental context.
In essence, PEA compares the number of input genes observed in a given pathway with the number expected from a background set. Methods for pathway enrichment analysis fall into three main categories:
Overrepresentation-based Methods
Overrepresentation-based methods start from a gene list of interest, such as a set of differentially expressed genes. They test whether any pathway contains more genes from this list than would be expected by chance. This requires comparing the gene list against a predefined background gene set, typically with a hypergeometric or Fisher's exact test, to identify the pathways most relevant to the genes under investigation.
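As an illustration of the overrepresentation test described above, the following minimal sketch computes an upper-tail hypergeometric p-value with the Python standard library. The function name `ora_p_value` and the toy gene identifiers are assumptions made for this example; they do not come from any particular PEA tool.

```python
from math import comb


def ora_p_value(gene_list, pathway, background):
    """Return (overlap, p) where p = P(X >= overlap) under a hypergeometric model.

    Hypothetical helper for illustration. Genes outside the background
    are ignored so that all counts refer to the same universe.
    """
    universe = set(background)
    genes = set(gene_list) & universe      # genes of interest
    members = set(pathway) & universe      # pathway genes in the universe
    N, K, n = len(universe), len(members), len(genes)
    k = len(genes & members)               # observed overlap
    # Probability of drawing at least k pathway genes in n draws without replacement
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return k, tail / comb(N, n)


# Toy data (made up): 20 background genes, a 5-gene pathway, a 4-gene list
background = [f"g{i}" for i in range(20)]
pathway = ["g0", "g1", "g2", "g3", "g4"]
gene_list = ["g0", "g1", "g2", "g10"]
overlap, p = ora_p_value(gene_list, pathway, background)
print(overlap, p)
```

In practice this test is repeated for every pathway in the reference database, which is why the resulting p-values need correction for multiple testing.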
Rank-based Methods
Rank-based methods use the quantitative signal in omics datasets, such as gene expression levels, rather than a fixed cutoff. They first rank all measured genes by the signal detected in the omics study, for example transcript abundance. They then test whether genes annotated to the same pathway tend to cluster at the top or bottom of the ranked list. Because they use the full ordering of genes, rank-based methods can detect coordinated behavior across the genes of a pathway.
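The clustering idea behind rank-based methods can be sketched with an unweighted running-sum statistic, similar in spirit to a Kolmogorov-Smirnov test. This is a simplified illustration, not the exact statistic of any specific tool; the function name `enrichment_score` is hypothetical.

```python
def enrichment_score(ranked_genes, pathway):
    """Signed maximum deviation of an unweighted running sum.

    Walk down the ranked list: step up for pathway members, step down
    for all other genes. A large positive score means the pathway genes
    cluster at the top; a large negative score means the bottom.
    """
    members = set(pathway)
    hits = [gene in members for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0  # score is undefined without both hits and misses
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranked list (made up), most up-regulated gene first
ranked = ["a", "b", "c", "d", "e", "f"]
print(enrichment_score(ranked, {"a", "b"}))  # pathway genes at the top
print(enrichment_score(ranked, {"e", "f"}))  # pathway genes at the bottom
```

Significance is usually assessed by comparing the observed score with scores obtained after permuting gene labels or sample labels, a step omitted here for brevity.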
Topology-based Methods
Topology-based methods go beyond gene lists and rankings because pathway regulation depends on more than membership alone. These approaches account for additional information that influences pathway activity. By integrating scores that reflect the position of each gene within a pathway and the interactions between genes, they aim for a more complete picture of the underlying biological mechanisms. Such tests let researchers examine the relationships between genes and their roles within pathways.
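To make the idea of position-aware scoring concrete, the sketch below weights each gene's change by its number of interaction partners within the pathway, so that a perturbed hub counts more than a perturbed peripheral gene. This weighting scheme is an assumption chosen for simplicity; it does not reproduce any published topology-based method, and `topology_weighted_score` is a hypothetical name.

```python
def topology_weighted_score(edges, gene_scores):
    """Degree-weighted mean of absolute gene-level scores.

    edges       -- iterable of (gene_a, gene_b) interactions in the pathway
    gene_scores -- dict of gene -> change (e.g. log fold change); genes
                   missing from the dict are treated as unchanged (0.0)
    """
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    total_weight = sum(degree.values())
    if total_weight == 0:
        return 0.0  # no interactions, nothing to weight
    weighted = sum(d * abs(gene_scores.get(g, 0.0)) for g, d in degree.items())
    return weighted / total_weight


# Toy star-shaped pathway (made up): one hub connected to three leaf genes
edges = [("hub", "x"), ("hub", "y"), ("hub", "z")]
print(topology_weighted_score(edges, {"hub": 2.0}))  # change at the hub
print(topology_weighted_score(edges, {"x": 2.0}))    # same change at a leaf
```

The same change yields a higher pathway score at the hub than at a leaf, which is the kind of distinction that list-based and rank-based methods cannot make.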
Fig. 1. Overview of three types of methods for pathway enrichment analysis. (Zhao K, et al., 2023)
Deciphering Disease Mechanisms and Biomarkers
PEA is widely used to study the molecular basis of diseases, including cancer, neurodegenerative disorders, and metabolic syndrome. By analyzing gene expression profiles derived from patient samples, researchers can identify perturbed pathways linked to specific diseases, which may open new avenues for targeted therapies. The perturbed pathways can also point to potential biomarkers for early diagnosis and prognostic assessment.
Drug Target Identification and Drug Repurposing
In drug discovery, PEA helps guide researchers toward potential drug targets. By examining the effect of drug candidates or active compounds on gene expression, scientists can identify the pathways influenced by these interventions. The degree of pathway enrichment then helps prioritize promising molecular targets for therapeutic intervention.
Uncovering Biological Signatures in Single-Cell Analysis
With the advent of single-cell RNA sequencing (scRNA-seq), researchers can now explore cellular heterogeneity with unprecedented resolution. PEA helps characterize the functional features of different cell types within complex tissues. By analyzing scRNA-seq data and applying PEA to cell-specific gene sets, researchers can discover the unique pathways that drive cellular function and behavior. This knowledge is invaluable for understanding developmental processes, tissue regeneration, and immune responses at single-cell resolution.
Integrated Analysis of Multi-omics Datasets
The integration of multi-omics datasets such as genomics, transcriptomics, epigenomics and metabolomics provides a holistic view of cellular processes. PEA can be extended to integrate these disparate datasets to identify pathways that span multiple layers of biological regulation. For example, by integrating genomic variation data from genome-wide association studies (GWAS) with transcriptome profiles, researchers can reveal functional pathways that link genetic variation to complex traits and diseases.
Fig. 2. General workflow for interpreting various types of omics data with pathway enrichment analysis and documenting results. (Zhao K, et al., 2023)
(1) Data preprocessing and gene set selection
PEA begins with data collection and preprocessing. Input data range from gene expression profiles to genomic variants and epigenetic modifications, and each type requires tailored handling to ensure an accurate and meaningful analysis. Appropriate preprocessing normalizes the data, removes artifacts, and reduces the impact of technical confounders.
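Once the data are preprocessed, the input for PEA is typically either a thresholded gene list or a full ranking. The sketch below shows both, assuming a differential expression table is already available as a dictionary of gene to (log2 fold change, adjusted p-value). The function names and the cutoff values are illustrative assumptions, not recommendations.

```python
def select_genes(results, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Gene list for overrepresentation analysis: large and significant changes."""
    return sorted(
        gene
        for gene, (lfc, padj) in results.items()
        if abs(lfc) >= lfc_cutoff and padj <= padj_cutoff
    )


def rank_genes(results):
    """Full ranking for rank-based analysis: highest log2 fold change first."""
    return sorted(results, key=lambda gene: results[gene][0], reverse=True)


# Toy differential expression results (made-up values)
results = {
    "g1": (2.1, 0.001),
    "g2": (-1.5, 0.01),
    "g3": (0.1, 0.9),
    "g4": (1.8, 0.2),
}
print(select_genes(results))
print(rank_genes(results))
```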
(2) Background selection and normalization
A robust PEA depends on a carefully chosen background gene set that represents the entire pool of genes under investigation, for example all genes measured in the experiment. Against this background, researchers can assess the relative enrichment or depletion of pathways more accurately. In addition, data should be standardized to ensure comparability across datasets and platforms, reconciling technical differences while preserving biological variance.
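One practical consequence of choosing a background is that both the gene list and the pathway definitions should be restricted to it before any counting is done. The following sketch illustrates this step; `restrict_to_background` is a hypothetical helper, and the gene and pathway names are placeholders.

```python
def restrict_to_background(gene_list, pathways, background):
    """Limit the gene list and every pathway to genes in the background.

    Returns (kept, dropped, trimmed_pathways). Pathways with no genes
    left in the background are removed because they cannot be tested.
    """
    universe = set(background)
    kept = [g for g in gene_list if g in universe]
    dropped = [g for g in gene_list if g not in universe]
    trimmed = {}
    for name, members in pathways.items():
        remaining = set(members) & universe
        if remaining:
            trimmed[name] = remaining
    return kept, dropped, trimmed


# Toy example (made up): g9 and pathway P2 fall outside the measured genes
kept, dropped, trimmed = restrict_to_background(
    gene_list=["g1", "g2", "g9"],
    pathways={"P1": ["g1", "g2", "g3"], "P2": ["g9", "g10"]},
    background=["g1", "g2", "g3", "g4"],
)
print(kept, dropped, trimmed)
```

Reporting the dropped genes is useful because a large number of them often signals an identifier mismatch between the data and the pathway database.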
(3) Overrepresentation or ranking-based analysis
PEA offers two main analytical approaches to suit different research goals: overrepresentation-based analysis and ranking-based analysis. The former tests whether particular pathways are statistically overrepresented in a gene list relative to the frequencies expected from the designated background. The latter evaluates whether genes in the same pathway tend to cluster together in a ranked gene list. Both approaches help researchers identify biologically relevant pathways that underlie the observed phenotype.
(4) Topology-based analysis (optional)
For deeper insight, researchers can additionally apply topology-based methods. This extra layer of analysis incorporates gene-gene interactions and network topology, highlighting pathways that are not only overrepresented but also show coordinated changes in gene expression and connectivity within the network.
(5) Multiple testing correction
Because many pathways are tested at once, researchers must apply multiple testing correction to guard against false positives. The conservative Bonferroni correction or the less stringent false discovery rate (FDR) correction is commonly used so that only statistically significant enrichments are retained and spurious associations are discarded.
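The two corrections named above can be sketched in a few lines. The FDR variant shown here is the Benjamini-Hochberg procedure, chosen as a common example since the text does not specify which FDR method is meant; the function names are illustrative.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy pathway p-values (made up)
p_values = [0.01, 0.04, 0.03, 0.20]
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

With these toy values, Bonferroni leaves only the first pathway below 0.05, while the Benjamini-Hochberg adjustment is less severe for the remaining ones, which reflects the usual trade-off between the two.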
(6) Interpretation and visualization
Researchers use a range of visualization tools to interpret the results. Pathway maps display the interconnected genes of a pathway as colored nodes. Heat maps show patterns of gene expression across different conditions. Enrichment maps summarize the enriched pathways and how they relate to one another, giving an overview of the enrichment results and their implications.
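As a small illustration of how an enrichment map can be assembled, the sketch below connects two pathways whenever their gene sets overlap strongly, using the Jaccard index as the similarity measure. The 0.25 threshold, the function name, and the toy pathways are assumptions for this example; drawing the resulting network is left to a plotting or network tool.

```python
from itertools import combinations


def enrichment_map_edges(pathways, min_jaccard=0.25):
    """Return (pathway_a, pathway_b, jaccard) for sufficiently similar pairs."""
    edges = []
    for a, b in combinations(sorted(pathways), 2):
        genes_a, genes_b = set(pathways[a]), set(pathways[b])
        union = genes_a | genes_b
        if not union:
            continue  # two empty pathways have no defined similarity
        jaccard = len(genes_a & genes_b) / len(union)
        if jaccard >= min_jaccard:
            edges.append((a, b, jaccard))
    return edges


# Toy enriched pathways (made up): A and B share genes, C is unrelated
pathways = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g3", "g4", "g5", "g6"},
    "C": {"g7", "g8"},
}
print(enrichment_map_edges(pathways))
```

Grouping overlapping pathways in this way helps reveal when several significant results reflect the same underlying set of genes.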
References