Overrepresentation analysis (ORA) is a core form of enrichment analysis. It asks whether a predefined gene set appears in a dataset, such as a list of genes of interest, more often than would be expected by chance. Significance is usually assessed with one of two tests: the hypergeometric test or Fisher's exact test. Both give the probability of observing an overlap at least as large as the one seen, taking into account the size of the gene set, the number of genes in the dataset, and the background from which those genes were drawn.
For background information, see our article Enrichment Analysis.
Over-representation analysis (Zhao et al., 2023)
Overrepresentation analysis rests on the idea that functionally related genes tend to change together, in expression or in other measured characteristics, when they play a significant role in the biological process under study. By evaluating the statistical significance of enrichment, it helps identify gene sets or pathways that are substantially associated with the experimental conditions.
Overrepresentation analysis proceeds in several steps. First, a collection of predefined gene sets or pathways is selected, typically from resources such as Gene Ontology (GO) or the Kyoto Encyclopedia of Genes and Genomes (KEGG). Second, each gene set is compared with the genes in the dataset under study to find the overlap. Third, a statistical test is applied to assess whether the observed overlap is larger than expected by chance. Finally, significance is evaluated against an appropriate statistical threshold, and the results are interpreted in the context of the biological question at hand.
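The comparison step can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the function name `overlap_summary`, the gene identifiers, and the two toy gene sets are hypothetical and do not come from any particular tool. The sketch restricts everything to the background, counts the overlap for each gene set, and reports the overlap expected by chance. The significance test itself is a separate step.

```python
def overlap_summary(gene_list, gene_sets, background):
    """Count observed and expected overlap for each gene set.

    gene_list  : genes of interest (hypothetical input)
    gene_sets  : dict mapping a set name to its member genes
    background : all genes that could have ended up in gene_list
    """
    background = set(background)
    hits = set(gene_list) & background          # ignore genes outside the background
    rows = []
    for name, members in gene_sets.items():
        members = set(members) & background     # ignore unmeasurable set members
        overlap = hits & members
        # Overlap expected if the list were a random draw from the background.
        expected = len(hits) * len(members) / len(background)
        rows.append({
            "gene_set": name,
            "set_size": len(members),
            "overlap": len(overlap),
            "expected": expected,
            "fold_enrichment": len(overlap) / expected if expected else float("nan"),
        })
    return rows


# Toy data, made up for illustration only.
background = [f"g{i}" for i in range(1, 11)]    # 10 background genes
gene_list = ["g1", "g2", "g3", "g4"]            # 4 genes of interest
gene_sets = {"setA": ["g1", "g2", "g9"], "setB": ["g5", "g6"]}

summary = overlap_summary(gene_list, gene_sets, background)
for row in summary:
    print(row)
```

Restricting both the gene list and the gene sets to the same background matters because the statistical tests assume every counted gene could, in principle, have been selected.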
Two tests are commonly used to assess the significance of enrichment in overrepresentation analysis: the hypergeometric test and Fisher's exact test.
The hypergeometric distribution gives the probability of obtaining a specific number of successes (the genes shared by the gene list and the gene set) in a fixed number of draws without replacement (the size of the gene list of interest) from a finite population (the background genes) that contains a known number of possible successes (the gene-set members present in the background). In overrepresentation analysis, the p-value is the upper-tail probability of this distribution, that is, the probability of observing an overlap at least as large as the one seen if the list had been drawn at random. If the p-value falls below a predefined significance threshold (e.g., p < 0.05), the enrichment is considered statistically significant. Because the distribution is symmetric in the roles of the gene list and the gene set, treating the gene set as the draws gives the same result.
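As a minimal sketch of this calculation, the upper-tail probability can be computed with the Python standard library alone. The function name `hypergeom_pvalue` and its parameter names are illustrative choices, not part of any library. In practice, a statistics package such as SciPy (`scipy.stats.hypergeom`) would normally be used instead.

```python
from math import comb


def hypergeom_pvalue(overlap, set_size, list_size, background_size):
    """Return P(X >= overlap) under the hypergeometric distribution.

    list_size genes are drawn without replacement from background_size
    genes, of which set_size belong to the gene set.
    """
    max_overlap = min(set_size, list_size)
    total = comb(background_size, list_size)    # all possible gene lists
    tail = sum(
        comb(set_size, i) * comb(background_size - set_size, list_size - i)
        for i in range(overlap, max_overlap + 1)
    )
    return tail / total


# Small check by hand: 10 background genes, 3 in the set, 4 drawn, 2 shared.
# P(X >= 2) = (C(3,2)*C(7,2) + C(3,3)*C(7,1)) / C(10,4) = 70 / 210 = 1/3
p_small = hypergeom_pvalue(overlap=2, set_size=3, list_size=4, background_size=10)
print(p_small)
```

The sum starts at the observed overlap rather than just above it, so the observed value itself is included in the tail.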
Fisher's exact test
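Fisher's exact test is the other test commonly used in overrepresentation analysis. It is applied to a 2x2 contingency table that cross-classifies the background genes by whether they are in the dataset's gene list and whether they belong to the gene set. With the table margins held fixed, the probability of each possible table follows the hypergeometric distribution, so the one-sided Fisher's exact test yields the same p-value as the hypergeometric test; it is a different framing of the same calculation rather than a fallback for cases where the hypergeometric assumptions fail. Because it is exact, it remains valid for small counts, where approximations such as the chi-squared test become unreliable. The p-value is obtained by summing the probabilities of all tables at least as extreme as the observed one under the null hypothesis of no enrichment, and if it is below the significance threshold, the enrichment is considered significant.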
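The contingency-table view can be illustrated with a short standard-library sketch. The helper names `contingency_table` and `fisher_exact_greater` are hypothetical, and only the one-sided ("greater") alternative relevant to overrepresentation is shown. A library routine such as `scipy.stats.fisher_exact` would usually be preferred in real analyses.

```python
from math import comb


def contingency_table(overlap, set_size, list_size, background_size):
    """Build the 2x2 table: rows = in list / not in list,
    columns = in gene set / not in gene set."""
    a = overlap                                  # in list, in set
    b = list_size - overlap                      # in list, not in set
    c = set_size - overlap                       # not in list, in set
    d = background_size - list_size - c          # in neither
    return [[a, b], [c, d]]


def fisher_exact_greater(table):
    """One-sided Fisher's exact p-value: probability of a top-left cell
    at least as large as observed, with all margins held fixed."""
    (a, b), (c, d) = table
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)
    upper = min(row1, col1)                      # largest possible top-left cell
    return sum(
        comb(col1, i) * comb(total - col1, row1 - i)
        for i in range(a, upper + 1)
    ) / denom


# Toy counts: 10 background genes, 4 in the list, 3 in the set, 2 shared.
table = contingency_table(overlap=2, set_size=3, list_size=4, background_size=10)
p_fisher = fisher_exact_greater(table)
print(table, p_fisher)
```

Each term in the sum is a hypergeometric probability, which is why the one-sided Fisher's exact test and the hypergeometric test agree on the same counts.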
Determination of significant enrichment
To determine the significance of enrichment, the threshold is applied to adjusted rather than raw p-values. Testing many gene sets at once inflates the number of false positives, so multiple testing correction is used to control it. Bonferroni correction controls the probability of any false positive across all tests and is the stricter option, while false discovery rate (FDR) correction, for example the Benjamini-Hochberg procedure, controls the expected proportion of false positives among the gene sets called significant. Adjusted p-values are never smaller than the raw p-values and take into account the number of gene sets being tested.
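Both corrections are simple enough to sketch directly. The function names and the example p-values below are made up for illustration, and the FDR method shown is the Benjamini-Hochberg procedure, which is one common choice rather than the only one.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over this rank and all larger ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four gene sets.
raw = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
print(bonf)
print(bh)
```

On the same input, the Benjamini-Hochberg values are never larger than the Bonferroni values, which reflects the difference in what each method controls.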
Numerous software tools and packages are available for conducting overrepresentation analysis. They handle data input, gene set selection, statistical testing, and presentation of the results, either through a web interface or as a programming library. Examples of popular tools include DAVID (Database for Annotation, Visualization, and Integrated Discovery), Enrichr, g:Profiler, and clusterProfiler. These tools often incorporate various databases, including GO, KEGG, and other resources, enabling researchers to explore and analyze enrichment patterns in their datasets.
|Tools and Software
|A web-based tool that integrates diverse biological resources, such as GO and KEGG, for overrepresentation analysis. It supports input of gene lists, offers statistical tests like the hypergeometric distribution, and provides visualization options like enrichment charts.
|An online tool that allows the analysis of gene sets using various databases, including GO and KEGG. It supports multiple statistical tests, generates interactive enrichment plots, and provides links to external resources for further exploration.
|A versatile web-based tool that offers comprehensive functional profiling of gene sets. It integrates databases like GO and KEGG, performs statistical tests, and provides visualization options, including bar plots and enrichment maps.
|An R package for overrepresentation analysis and functional annotation of gene clusters. It supports various statistical tests, multiple gene ID conversion methods, and integrates databases like GO and KEGG. Visualization options include enrichment plots.
|An online enrichment analysis tool that facilitates functional interpretation of large-scale genomic data. It includes a collection of gene sets from multiple databases, performs statistical tests, and offers interactive visualization options like enriched term charts.
These tools and software platforms significantly simplify the process of overrepresentation analysis, allowing researchers to perform comprehensive enrichment analysis and gain valuable biological insights from their high-throughput datasets. Researchers can choose the tool that best suits their specific requirements in terms of data format, statistical tests, visualization options, and integration with relevant databases.