Gene Set Analysis

Article Overview

01 What is Gene Set Analysis (GSA)?
02 Classification of gene set analysis approaches
03 Step-By-Step Guide: Gene Set Analysis
04 Tools and databases for gene set analysis

What is Gene Set Analysis (GSA)?

Gene Set Analysis (GSA), also known as pathway or functional enrichment analysis, is a computational method used to analyze high-throughput genomic data in the context of predefined gene sets or functional categories. It aims to determine whether specific biological pathways, gene sets, or functional categories are significantly associated with the observed experimental or phenotypic changes.

GSA is based on the principle that genes rarely act alone but rather function in coordinated networks and pathways to carry out biological processes. By analyzing gene expression data or other genomic measurements in the context of gene sets, GSA provides a higher-level interpretation of the underlying biological mechanisms or functions affected by experimental conditions, genetic variations, or disease states.

The primary goal of GSA is to identify gene sets that show statistically significant enrichment or depletion of genes of interest within a given dataset. The analysis helps researchers gain insights into the biological processes, molecular pathways, or functional categories that are differentially regulated or associated with a particular condition or phenotype.

Classification of gene set analysis approaches

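Gene set analysis approaches can be broadly categorized into two main types: overrepresentation analysis (ORA) and functional enrichment analysis (FEA). ORA works from a thresholded list of genes of interest, whereas the FEA methods listed here use gene-level measurements or statistics without first applying a significance cutoff. The two approaches and their subcategories are outlined below: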

Overrepresentation Analysis (ORA)

ORA determines whether a particular gene set or functional category is overrepresented or enriched in a given gene list compared to what would be expected by chance.


The most common subcategories of ORA include:
a. Hypergeometric test: This statistical test calculates the probability of observing the overlap between the input gene list and a gene set of interest, considering the total number of genes in the genome or a reference background.
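b. Fisher's exact test: Closely related to the hypergeometric test, it assesses the significance of gene set enrichment from a 2x2 contingency table of genes inside and outside the input list versus inside and outside the gene set. Its one-sided form gives the same p-value as the hypergeometric test.
c. Binomial test: It evaluates the probability of observing a certain number of genes from the input gene list within a specific gene set, assuming a binomial distribution. This treats genes as drawn with replacement, which is a reasonable approximation when the input list is small relative to the background.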
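As a rough illustration of the ORA tests listed above, the following sketch computes a one-sided hypergeometric p-value and its binomial approximation using only the Python standard library. The function names and the example counts are hypothetical and chosen for illustration; in practice a statistics library would normally be used.

```python
from math import comb

def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement from a
    background of N genes, K of which belong to the gene set."""
    total = comb(N, n)
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), i.e. sampling with replacement."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical counts: 20,000 background genes, a 100-gene set,
# a 200-gene input list, and 8 input genes that fall in the set.
N, K, n, k = 20000, 100, 200, 8
p_hyper = hypergeom_upper_tail(k, N, K, n)
p_binom = binom_upper_tail(k, n, K / N)
print(f"hypergeometric p = {p_hyper:.3g}, binomial p = {p_binom:.3g}")
```

With these counts only one overlapping gene is expected by chance, so both tests return small p-values, and the two values are close because the input list is small relative to the background.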

Functional Enrichment Analysis (FEA)

FEA evaluates the overall functional enrichment of a gene set by considering the collective behavior of the genes within the set.

The main subcategories of FEA include:
a. Gene Set Enrichment Analysis (GSEA): GSEA ranks genes based on their differential expression or activity and assesses the enrichment of predefined gene sets by analyzing where the members of each set fall in the ranked list.
b. Functional Class Scoring (FCS): FCS methods compute a statistic for each gene that reflects its association with the condition or phenotype, then aggregate the statistics of the genes in a set into a single set-level score. The set-level scores are then used to assess the overall enrichment of each gene set.
c. Global Test: The global test evaluates the collective behavior of a gene set by modeling the relationship between gene expression levels and phenotypic traits. It assesses whether the gene set as a whole is associated with the phenotype of interest.
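d. Self-contained methods: These methods test whether the genes in a set are associated with the phenotype using only the genes in that set, without comparing them against genes outside the set (the comparison that competitive methods make). Gene-level statistics from a t-test or Wilcoxon rank-sum test can serve as inputs, and limma's ROAST procedure is one example of a self-contained test.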
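To make the idea of a set-level score concrete, here is a minimal sketch in the spirit of functional class scoring: gene-level statistics are averaged within a set, and the result is compared with randomly drawn gene sets of the same size. This is an illustrative simplification rather than any specific published method; the function names, the simulated statistics, and the choice of the mean as the aggregate are all assumptions.

```python
import random
from statistics import mean

def set_score(gene_stats, gene_set):
    """Average the gene-level statistics of the set members that were measured."""
    return mean(gene_stats[g] for g in gene_set if g in gene_stats)

def competitive_p_value(gene_stats, gene_set, n_perm=2000, seed=0):
    """Compare |set score| with scores of random gene sets of the same size."""
    rng = random.Random(seed)
    members = [g for g in gene_set if g in gene_stats]
    observed = abs(mean(gene_stats[g] for g in members))
    genes = list(gene_stats)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(genes, len(members))
        if abs(mean(gene_stats[g] for g in sample)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Simulated gene-level statistics (hypothetical): 200 genes with no signal,
# except for a 10-gene toy set whose statistics are shifted upward.
rng = random.Random(1)
gene_stats = {f"GENE{i}": rng.gauss(0, 1) for i in range(200)}
toy_set = [f"GENE{i}" for i in range(10)]
for g in toy_set:
    gene_stats[g] += 3.0

p = competitive_p_value(gene_stats, toy_set)
print(f"set score = {set_score(gene_stats, toy_set):.2f}, p = {p:.4f}")
```

Drawing random gene sets gives a competitive null. A self-contained variant would instead permute sample labels and recompute the gene-level statistics.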

Classification of gene set analysis approaches and tools available for RNA-seq data analysis. (Das et al., 2020)

Step-By-Step Guide: Gene Set Analysis

Data preprocessing

Input data: Gene expression data (e.g., microarray or RNA sequencing data) or summary statistics (e.g., effect sizes, p-values) from differential expression analysis.

Choice: Normalize the gene expression data to correct for technical variations, such as batch effects or platform-specific biases. Consider using methods like quantile normalization, log transformation, or variance stabilizing transformation. If using summary statistics, ensure they are properly formatted and filtered as per the requirements of downstream analysis.
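The two transformations named above can be sketched in a few lines. The example below applies a log2 transform with a pseudocount and then quantile normalization to a small gene-by-sample matrix. The function names and the toy counts are hypothetical, and ties are broken by row order, which is a simplification compared with established implementations.

```python
from math import log2

def log_transform(matrix, pseudocount=1.0):
    """log2(x + pseudocount) for every entry (rows = genes, columns = samples)."""
    return [[log2(v + pseudocount) for v in row] for row in matrix]

def quantile_normalize(matrix):
    """Force every sample (column) to share the same distribution of values."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean across samples at each rank.
    rank_means = [sum(sc[i] for sc in sorted_cols) / n_samples for i in range(n_genes)]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        order = sorted(range(n_genes), key=lambda i: col[i])
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result

# Toy count matrix: 4 genes x 3 samples (hypothetical values).
counts = [[5, 4, 3], [2, 1, 4], [3, 4, 6], [4, 2, 8]]
normalized = quantile_normalize(log_transform(counts))
for row in normalized:
    print([round(v, 3) for v in row])
```

After normalization every column contains the same set of values, so differences between samples in overall intensity no longer drive the gene ranking.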

Gene set selection

Input: Collection of predefined gene sets or functional categories from biological databases (e.g., Gene Ontology, KEGG pathways).

Choice: Select the appropriate gene set database or create custom gene sets based on specific biological knowledge or hypotheses. Consider the biological relevance and coverage of the gene sets to the research question at hand.
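Gene set collections are often distributed as tab-delimited GMT files, where each line holds a set name, a description, and then the member genes. The sketch below parses that layout and keeps only sets with enough genes measured in the dataset. The function names, size limits, and toy set names are hypothetical.

```python
def parse_gmt(lines):
    """Parse GMT-style lines: name <tab> description <tab> gene1 <tab> gene2 ..."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets

def filter_gene_sets(gene_sets, measured_genes, min_size=2, max_size=500):
    """Restrict each set to measured genes and drop sets outside the size limits."""
    kept = {}
    for name, genes in gene_sets.items():
        overlap = genes & measured_genes
        if min_size <= len(overlap) <= max_size:
            kept[name] = overlap
    return kept

# Toy input standing in for the contents of a GMT file.
toy_lines = [
    "TOY_SET_A\texample set\tGENE1\tGENE2\tGENE3\n",
    "TOY_SET_B\texample set\tGENE4\tGENE9\n",
]
measured = {"GENE1", "GENE2", "GENE3", "GENE4", "GENE5"}
usable = filter_gene_sets(parse_gmt(toy_lines), measured)
print(usable)
```

Filtering on the overlap with measured genes, rather than on the nominal set size, reflects the coverage consideration mentioned above: a set with few measured members carries little information for the dataset at hand.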

Gene ranking

Input: Preprocessed gene expression data or summary statistics.

Choice: Rank genes based on differential expression, fold change, t-statistics, effect sizes, or other relevant metrics. Choose an appropriate metric that reflects the biological differences between conditions and is relevant to the analysis goals.

Methods: For gene expression data, consider methods such as the t-test, limma, DESeq2, or edgeR. For summary statistics, calculate effect sizes (e.g., log-fold change) based on the available data.
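As a simple stand-in for the dedicated packages named above, the following sketch ranks genes by a Welch t-statistic computed with the standard library. The function names, gene names, and expression values are hypothetical. Packages such as limma, DESeq2, or edgeR model variance more carefully and would usually be preferred for real data.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch t-statistic for two groups of expression values."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def rank_genes(expression, case_idx, control_idx):
    """Return (gene, t) pairs sorted from most up- to most down-regulated."""
    scores = {}
    for gene, values in expression.items():
        case = [values[i] for i in case_idx]
        control = [values[i] for i in control_idx]
        scores[gene] = welch_t(case, control)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy log-scale expression: samples 0-2 are cases, samples 3-5 are controls.
expression = {
    "GENE_UP": [8.1, 8.3, 7.9, 5.0, 5.2, 4.9],
    "GENE_FLAT": [6.0, 6.1, 5.9, 6.0, 6.2, 5.8],
    "GENE_DOWN": [3.0, 3.2, 2.9, 6.1, 6.0, 6.3],
}
ranked = rank_genes(expression, case_idx=[0, 1, 2], control_idx=[3, 4, 5])
for gene, t in ranked:
    print(f"{gene}\t{t:.2f}")
```

A signed statistic such as this keeps up- and down-regulated genes at opposite ends of the ranking, which is what rank-based enrichment methods expect.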

Gene set enrichment analysis

Input: Ranked gene list and gene sets of interest.

Choice: Select the appropriate enrichment analysis method.

Methods:

GSEA (Gene Set Enrichment Analysis): Utilize GSEA software or packages to perform GSEA. This method evaluates the enrichment of gene sets by analyzing where the members of each predefined gene set fall in the ranked gene list. It takes into account the collective behavior of genes within a set.
ORA (Overrepresentation Analysis): Apply ORA methods such as hypergeometric test, Fisher's exact test, or binomial test. These methods determine whether a particular gene set is overrepresented or enriched in the gene list compared to what would be expected by chance. ORA methods are based on the overlap between the input gene list and the gene set of interest, so they take a thresholded gene list (for example, genes passing a significance cutoff) rather than the full ranking.
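The rank-based idea behind GSEA can be sketched as a weighted running sum: walking down the ranked list, the sum increases at genes in the set and decreases at genes outside it, and the enrichment score is the largest deviation from zero. The code below is a simplified sketch of that idea, not the GSEA software itself; the function name, the weight parameter default, and the toy ranking are assumptions.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Running-sum enrichment score for a list of (gene, score) pairs
    sorted from highest to lowest score."""
    in_set = [gene in gene_set for gene, _ in ranked]
    n_hits = sum(in_set)
    n_miss = len(ranked) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    # Hits are weighted by the magnitude of their ranking score.
    hit_weights = [abs(s) ** weight if hit else 0.0
                   for (_, s), hit in zip(ranked, in_set)]
    total_hit = sum(hit_weights)
    if total_hit == 0:  # all hit scores are zero: fall back to equal weights
        hit_weights = [1.0 if hit else 0.0 for hit in in_set]
        total_hit = float(n_hits)
    running, best = 0.0, 0.0
    for w, hit in zip(hit_weights, in_set):
        running += w / total_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Toy ranking (hypothetical scores), highest first.
ranked = [("G1", 3.0), ("G2", 2.5), ("G3", 2.0), ("G4", 0.5),
          ("G5", -0.5), ("G6", -1.0), ("G7", -2.0), ("G8", -3.0)]
print(enrichment_score(ranked, {"G1", "G2", "G3"}))  # set at the top: positive
print(enrichment_score(ranked, {"G7", "G8"}))        # set at the bottom: negative
```

A set concentrated at the top of the ranking yields a positive score and a set concentrated at the bottom yields a negative one. The significance of the score still has to be assessed against a null distribution.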

Statistical assessment

Input: Enrichment scores or gene set statistics.

Choice: Determine the statistical significance of gene set enrichment.

Methods:

For GSEA: Perform permutation testing by randomly permuting sample labels to generate null distributions and assess the significance of enrichment scores. Correct for multiple testing using methods like false discovery rate (FDR) or family-wise error rate (FWER) corrections.
For ORA: Apply statistical tests appropriate for the selected method (e.g., hypergeometric test, Fisher's exact test, binomial test). Adjust p-values for multiple testing using methods like FDR correction.
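The two statistical steps above can be sketched as follows: an empirical p-value computed from a permutation null distribution, and a Benjamini-Hochberg adjustment to control the FDR across gene sets. The function names and the example values are hypothetical, and the empirical p-value shown is one-sided, for positive scores only.

```python
def empirical_p_value(observed, null_scores):
    """One-sided p-value: share of null scores at least as large as the
    observed score, with an add-one correction."""
    hits = sum(1 for s in null_scores if s >= observed)
    return (hits + 1) / (len(null_scores) + 1)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-set p-values.
names = ["SET_A", "SET_B", "SET_C", "SET_D"]
raw = [0.001, 0.02, 0.03, 0.5]
for name, p, q in zip(names, raw, benjamini_hochberg(raw)):
    print(f"{name}\tp = {p}\tadjusted = {q:.3g}")
```

Adjusted values are compared against the chosen FDR threshold in place of the raw p-values, which matters because hundreds or thousands of gene sets are typically tested at once.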

Tools and databases for gene set analysis

Tools and databases offer different features, functionalities, and gene set collections. The choice of tool or database depends on the specific research question, data type, and user preferences. It is often beneficial to explore multiple resources to ensure comprehensive and reliable gene set analysis.

Gene Set Enrichment Analysis (GSEA)

GSEA is a popular software package developed by the Broad Institute. It provides a comprehensive platform for GSA and offers various statistical methods, visualization tools, and gene set databases. GSEA can be downloaded and installed on local machines for customized analysis.

Enrichr

Enrichr is a web-based tool that integrates numerous gene set databases, including Gene Ontology, KEGG pathways, Reactome, and many others. It allows users to perform GSA on a submitted gene list using multiple statistical methods and provides interactive visualization of enriched gene sets.

DAVID (Database for Annotation, Visualization, and Integrated Discovery)

DAVID is a web-based tool that provides functional annotation and enrichment analysis for gene sets. It integrates multiple gene set databases and offers statistical methods for gene set enrichment analysis. DAVID also enables users to annotate gene lists, cluster related annotations, and visualize results.

MSigDB (Molecular Signatures Database)

MSigDB is a curated collection of gene sets and signatures derived from various sources, such as pathway databases, scientific literature, and high-throughput datasets. It is maintained by the Broad Institute and provides a comprehensive resource for GSA. MSigDB can be used in conjunction with GSEA or other tools supporting MSigDB format.

Panther

Panther (Protein ANalysis THrough Evolutionary Relationships) is a comprehensive resource for functional annotation and GSA. It integrates gene set databases, pathway information, and protein family classifications. Panther offers various statistical methods for GSA and provides interactive visualization tools.

Reactome

Reactome is a knowledgebase of biological pathways and processes. It provides curated gene sets and pathway information, which can be used for GSA. Reactome offers web-based tools for GSA, including pathway enrichment analysis and pathway visualization.

Reference:

Das, Samarendra, Craig J. McClain, and Shesh N. Rai. "Fifteen years of gene set analysis for high-throughput genomic data: a review of statistical approaches and future challenges." Entropy 22.4 (2020): 427.
