Gene Set Analysis (GSA), also known as pathway or functional enrichment analysis, is a computational method used to analyze high-throughput genomic data in the context of predefined gene sets or functional categories. It aims to determine whether specific biological pathways, gene sets, or functional categories are significantly associated with the observed experimental or phenotypic changes.
GSA is based on the principle that genes rarely act alone but rather function in coordinated networks and pathways to carry out biological processes. By analyzing gene expression data or other genomic measurements in the context of gene sets, GSA provides a higher-level interpretation of the underlying biological mechanisms or functions affected by experimental conditions, genetic variations, or disease states.
The primary goal of GSA is to identify gene sets that show statistically significant enrichment or depletion of genes of interest within a given dataset. The analysis helps researchers gain insights into the biological processes, molecular pathways, or functional categories that are differentially regulated or associated with a particular condition or phenotype.
Gene set analysis approaches can be broadly categorized into two main types: overrepresentation analysis (ORA) and functional enrichment analysis (FEA). Here's an outline of these two approaches and their subcategories:
Overrepresentation Analysis (ORA)
ORA determines whether a particular gene set or functional category is overrepresented or enriched in a given gene list compared to what would be expected by chance.
The most common subcategories of ORA include:
a. Hypergeometric test: This statistical test calculates the probability of observing an overlap at least as large as the one seen between the input gene list and a gene set of interest, given the total number of genes in the genome or a reference background.
b. Fisher's exact test: It assesses the significance of gene set enrichment from the 2x2 contingency table of gene list membership versus gene set membership. Its one-sided form gives the same p-value as the hypergeometric test, so the two names are often used interchangeably in ORA tools.
c. Binomial test: It evaluates the probability of observing a certain number of genes from the input gene list within a specific gene set, assuming a binomial distribution.
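The hypergeometric calculation behind ORA can be sketched in a few lines. The following is a minimal illustration using only the Python standard library, not the implementation of any particular tool; the function names (`hypergeom_upper_tail`, `ora_p_value`) and the toy gene identifiers are assumptions made for this example.

```python
from math import comb


def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric: M genes in the background,
    n of them in the gene set, N genes drawn (the input gene list)."""
    upper = min(n, N)
    favourable = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return favourable / comb(M, N)


def ora_p_value(gene_list, gene_set, background):
    """Return (overlap size, one-sided enrichment p-value).

    Genes outside the background are ignored, because they could never
    have been selected in the experiment.
    """
    background = set(background)
    gene_list = set(gene_list) & background
    gene_set = set(gene_set) & background
    overlap = len(gene_list & gene_set)
    p = hypergeom_upper_tail(overlap, len(background), len(gene_set), len(gene_list))
    return overlap, p


# Toy example (hypothetical gene identifiers): 20 background genes,
# a 5-gene set, and a 4-gene input list sharing 3 genes with the set.
background = [f"g{i}" for i in range(1, 21)]
gene_set = ["g1", "g2", "g3", "g4", "g5"]
gene_list = ["g1", "g2", "g3", "g10"]

overlap, p = ora_p_value(gene_list, gene_set, background)
print(overlap, round(p, 4))  # overlap = 3, p = 155/4845, about 0.032
```

The choice of background matters as much as the test itself: restricting it to genes that were actually measured usually gives more conservative p-values than using the whole genome.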
Functional Enrichment Analysis (FEA)
FEA evaluates the overall functional enrichment of a gene set by considering the collective behavior of the genes within the set.
The main subcategories of FEA include:
a. Gene Set Enrichment Analysis (GSEA): GSEA ranks all measured genes by their differential expression or activity and assesses each predefined gene set by testing whether its members cluster toward the top or bottom of the ranked list rather than being spread evenly across it.
b. Functional Class Scoring (FCS): FCS methods compute a gene-level statistic (for example, a t-statistic or fold change) for every measured gene and then aggregate these statistics across the members of each gene set into a set-level score. The set-level score is then tested for significance, so no arbitrary cutoff for "genes of interest" is needed.
c. Global Test: The global test evaluates the collective behavior of a gene set by modeling the relationship between gene expression levels and phenotypic traits. It assesses whether the gene set as a whole is associated with the phenotype of interest.
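d. Self-contained methods: These methods evaluate the statistical significance of differential expression or activity using only the genes within a gene set, without comparing them to genes outside the set (a reference background). Examples include the Wilcoxon rank-sum test or t-test applied to the set's genes across sample groups, and set-level tests in limma such as ROAST.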
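To make the GSEA idea of "analyzing the distribution of ranked genes" concrete, here is a minimal sketch of a weighted running-sum enrichment score in the style of the published GSEA statistic. It is an illustration under stated assumptions, not the Broad Institute implementation: the function name `enrichment_score`, the `weight` parameter, and the toy genes are chosen for this example, and significance testing is left out.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes: gene identifiers sorted by decreasing ranking metric.
    scores:       the ranking metric for each gene, in the same order.
    Walking down the list, the running sum goes up at genes in the set
    (in proportion to |score| ** weight) and down at genes outside it.
    The score is the largest deviation of the running sum from zero.
    """
    gene_set = set(gene_set)
    hits = [g in gene_set for g in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_weights = [abs(s) ** weight if h else 0.0 for s, h in zip(scores, hits)]
    total_hit_weight = sum(hit_weights)
    if total_hit_weight == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes "
                         "and at least one member with a non-zero score")

    running, es = 0.0, 0.0
    for is_hit, w in zip(hits, hit_weights):
        running += w / total_hit_weight if is_hit else -1.0 / n_miss
        if abs(running) > abs(es):
            es = running
    return es


# Toy ranked list (hypothetical genes), already sorted by score.
ranked_genes = ["a", "b", "c", "d", "e", "f"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]

es_top = enrichment_score(ranked_genes, scores, {"a", "b"})     # set at the top
es_bottom = enrichment_score(ranked_genes, scores, {"e", "f"})  # set at the bottom
print(es_top, es_bottom)  # positive for the first set, negative for the second
```

A positive score indicates that the set is concentrated among the top-ranked genes and a negative score that it is concentrated at the bottom; the score on its own says nothing about significance until it is compared with a null distribution.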
Classification of gene set analysis approaches and tools available for RNA-seq data analysis. (Das et al., 2020)
Data preprocessing
Input data: Gene expression data (e.g., microarray or RNA sequencing data) or summary statistics (e.g., effect sizes, p-values) from differential expression analysis.
Choice: Normalize the gene expression data to correct for technical variations, such as batch effects or platform-specific biases. Consider using methods like quantile normalization, log transformation, or variance stabilizing transformation. If using summary statistics, ensure they are properly formatted and filtered as per the requirements of downstream analysis.
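As an illustration of two of the transformations named above, the sketch below applies a log2 transform and a basic quantile normalization to a small expression matrix. It is a simplified, standard-library-only example: the function names and the toy matrix are assumptions, ties are broken by row order rather than averaged, and established packages handle many details that are omitted here.

```python
from math import log2


def log2_transform(matrix, pseudocount=1.0):
    """Log-transform a genes x samples matrix; the pseudocount avoids log2(0)."""
    return [[log2(value + pseudocount) for value in row] for row in matrix]


def quantile_normalize(matrix):
    """Force every sample (column) to share the same distribution of values.

    Each value is replaced by the mean, across samples, of the values that
    hold the same rank within their own sample. Ties are broken by row order.
    """
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]
    sorted_columns = [sorted(col) for col in columns]
    rank_means = [sum(col[r] for col in sorted_columns) / n_samples
                  for r in range(n_genes)]

    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, gene_index in enumerate(order):
            normalized[gene_index][j] = rank_means[rank]
    return normalized


# Toy matrix: 4 genes (rows) x 3 samples (columns).
raw = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]
normalized = quantile_normalize(raw)
logged = log2_transform(raw)
```

After quantile normalization every sample contains exactly the same set of values, which removes sample-wide shifts but can also erase genuine global differences, so it is worth checking that assumption against the experimental design.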
Gene set selection
Input: Collection of predefined gene sets or functional categories from biological databases (e.g., Gene Ontology, KEGG pathways).
Choice: Select the appropriate gene set database or create custom gene sets based on specific biological knowledge or hypotheses. Consider the biological relevance and coverage of the gene sets to the research question at hand.
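Coverage can be checked directly before running any test. The sketch below restricts each gene set to the genes that were actually measured and drops sets that end up too small or too large. The function name, the toy set names, and the size thresholds are assumptions for illustration; suitable limits depend on the method and the database.

```python
def filter_gene_sets(gene_sets, measured_genes, min_size=15, max_size=500):
    """Keep only gene sets with enough measured members.

    gene_sets:      dict mapping set name -> iterable of gene identifiers.
    measured_genes: identifiers present in the expression data.
    Returns a dict mapping each retained set name to its measured members.
    """
    measured = set(measured_genes)
    kept = {}
    for name, genes in gene_sets.items():
        present = set(genes) & measured
        if min_size <= len(present) <= max_size:
            kept[name] = present
    return kept


# Toy example with hypothetical set names and a small size threshold.
gene_sets = {
    "SET_A": ["g1", "g2", "g3", "g99"],  # 3 of 4 members measured
    "SET_B": ["g4", "g98"],              # only 1 member measured
}
measured_genes = ["g1", "g2", "g3", "g4", "g5"]

kept = filter_gene_sets(gene_sets, measured_genes, min_size=2, max_size=10)
print(sorted(kept))  # only SET_A passes the size filter
```

Very small sets give unstable statistics and very large sets tend to be too generic to interpret, which is why most tools apply some size filter of this kind.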
Gene ranking
Input: Preprocessed gene expression data or summary statistics.
Choice: Rank genes based on differential expression, fold change, t-statistics, effect sizes, or other relevant metrics. Choose an appropriate metric that reflects the biological differences between conditions and is relevant to the analysis goals.
Methods: For gene expression data, consider methods such as a t-test, limma, DESeq2, or edgeR, which produce per-gene test statistics and fold changes. For summary statistics, use the reported effect sizes (e.g., log-fold change) or derive them from the available data.
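As a rough illustration of the ranking step, the sketch below computes a Welch t-statistic per gene and sorts genes by it. This is a plain two-sample statistic written with the Python standard library; it is not a substitute for limma, DESeq2, or edgeR, which model variance far more carefully. The function names, the sample layout, and the expression values are assumptions for this example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(case, control):
    """Welch two-sample t-statistic (unequal variances) for one gene."""
    standard_error = sqrt(variance(case) / len(case) + variance(control) / len(control))
    if standard_error == 0:
        raise ValueError("no variation in either group; t-statistic undefined")
    return (mean(case) - mean(control)) / standard_error


def rank_genes(expression, case_columns, control_columns):
    """Return (gene, t) pairs sorted from most up- to most down-regulated.

    expression: dict mapping gene -> list of log-scale values, one per sample.
    """
    stats = {}
    for gene, values in expression.items():
        case = [values[i] for i in case_columns]
        control = [values[i] for i in control_columns]
        stats[gene] = welch_t(case, control)
    return sorted(stats.items(), key=lambda item: item[1], reverse=True)


# Toy log-scale data: samples 0-2 are cases, samples 3-5 are controls.
expression = {
    "geneB": [5.0, 6.0, 7.0, 5.0, 6.0, 7.0],   # unchanged
    "geneC": [2.0, 3.0, 4.0, 6.0, 7.0, 8.0],   # lower in cases
    "geneA": [8.0, 9.0, 10.0, 5.0, 6.0, 7.0],  # higher in cases
}
ranked = rank_genes(expression, case_columns=[0, 1, 2], control_columns=[3, 4, 5])
print([gene for gene, _ in ranked])  # ['geneA', 'geneB', 'geneC']
```

Ranking by a signed statistic keeps up- and down-regulated genes at opposite ends of the list, which is what rank-based methods such as GSEA expect.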
Gene set enrichment analysis
Input: Ranked gene list and gene sets of interest.
Choice: Select the appropriate enrichment analysis method.
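Methods: Use ORA (hypergeometric, Fisher's exact, or binomial test) when the input is a thresholded list of genes of interest; use GSEA or another functional class scoring method when the full ranked list is available; use the global test or another self-contained method to test a gene set's association with the phenotype directly.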
Statistical assessment
Input: Enrichment scores or gene set statistics.
Choice: Determine the statistical significance of gene set enrichment.
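Methods: Compute p-values from the test's null distribution or from permutations (of sample labels or gene labels), then correct for testing many gene sets at once, for example by controlling the false discovery rate with the Benjamini-Hochberg procedure.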
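Two pieces of arithmetic come up in most significance assessments: an empirical p-value computed from permutation scores, and a multiple-testing adjustment across gene sets. The sketch below shows one common form of each, the add-one empirical p-value and the Benjamini-Hochberg adjustment. The function names and the numbers are assumptions for illustration, and how the null scores are generated is left to the chosen method.

```python
def empirical_p_value(observed, null_scores):
    """One-sided permutation p-value with the usual add-one correction,
    so that the estimate can never be exactly zero."""
    as_extreme = sum(1 for score in null_scores if score >= observed)
    return (1 + as_extreme) / (1 + len(null_scores))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Toy example: an observed score against four permutation scores,
# and raw p-values for four gene sets.
p_perm = empirical_p_value(2.0, [0.5, 1.0, 2.5, 3.0])
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(p_perm)    # (1 + 2) / (1 + 4) = 0.6
print(adjusted)  # roughly [0.04, 0.0533, 0.0533, 0.2]
```

Gene sets whose adjusted p-value falls below the chosen false discovery rate threshold are reported as significant; the threshold itself is a study-level decision.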
Tools and databases offer different features, functionalities, and gene set collections. The choice of tool or database depends on the specific research question, data type, and user preferences. It is often beneficial to explore multiple resources to ensure comprehensive and reliable gene set analysis.
Gene Set Enrichment Analysis (GSEA)
GSEA is a popular software package developed by the Broad Institute. It provides a comprehensive platform for GSA and offers various statistical methods, visualization tools, and gene set databases. GSEA can be downloaded and installed on local machines for customized analysis.
Enrichr
Enrichr is a web-based tool that integrates numerous gene set databases, including Gene Ontology, KEGG pathways, Reactome, and many others. Users submit a gene list and receive enrichment results across these libraries, together with interactive visualizations of the enriched gene sets.
DAVID (Database for Annotation, Visualization, and Integrated Discovery)
DAVID is a web-based tool that provides functional annotation and enrichment analysis for gene lists. It integrates multiple annotation sources and offers statistical methods for gene set enrichment analysis. DAVID also groups related annotation terms through clustering analysis and provides views for exploring the results.
MSigDB (Molecular Signatures Database)
MSigDB is a curated collection of gene sets and signatures derived from various sources, such as pathway databases, scientific literature, and high-throughput datasets. It is maintained by the Broad Institute and provides a comprehensive resource for GSA. MSigDB can be used in conjunction with GSEA or other tools supporting MSigDB format.
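MSigDB gene sets are commonly distributed as GMT files, a tab-separated text format in which each line holds a set name, a description field, and then the member genes. The sketch below parses that layout from a list of lines. The function name and the toy records (set names, descriptions, genes) are made up for illustration and are not taken from MSigDB.

```python
def parse_gmt(lines):
    """Parse GMT-formatted lines into a dict of set name -> set of genes.

    Each line is: name <TAB> description <TAB> gene1 <TAB> gene2 ...
    Lines with fewer than three fields are skipped.
    """
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue
        name, _description, *genes = fields
        gene_sets[name] = {gene for gene in genes if gene}
    return gene_sets


# Hypothetical GMT content; in practice the lines would be read from a file,
# for example with: open(path, encoding="utf-8")
gmt_lines = [
    "EXAMPLE_SET_ONE\texample description\tGENE1\tGENE2\tGENE3\n",
    "EXAMPLE_SET_TWO\texample description\tGENE2\tGENE4\n",
    "\n",
]
gene_sets = parse_gmt(gmt_lines)
print(sorted(gene_sets))  # ['EXAMPLE_SET_ONE', 'EXAMPLE_SET_TWO']
```

The resulting dictionary can be passed to any of the enrichment methods described earlier, provided the gene identifiers match those used in the expression data.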
Panther
Panther (Protein ANalysis THrough Evolutionary Relationships) is a comprehensive resource for functional annotation and GSA. It integrates gene set databases, pathway information, and protein family classifications. Panther offers various statistical methods for GSA and provides interactive visualization tools.
Reactome
Reactome is a knowledgebase of biological pathways and processes. It provides curated gene sets and pathway information, which can be used for GSA. Reactome offers web-based tools for GSA, including pathway enrichment analysis and pathway visualization.
Reference: