Differential Expression Analysis

Cellular function depends largely on which genes are active. Identifying genes that are differentially expressed between conditions is therefore often necessary for understanding the molecular mechanisms underlying phenotypic differences between diseased and normal tissues, or between untreated and treated samples. RNA-seq provides a quantitative readout of the transcriptional state of cells. CD Genomics provides a targeted RNA sequencing analysis service to help you characterize specific transcripts of interest.

The next step in the RNA-seq workflow is differential expression analysis. Differential expression analysis identifies which genes are expressed at different levels under different conditions and plays a critical role in understanding the transcriptional variation that underpins unique biological states.

Differential Expression Analysis Workflow

Count Data and Normalization

Differential expression analysis typically begins with count data, which represents the number of sequence reads mapped to each gene. The higher the count, the higher the level of gene expression in the sample. However, prior to analysis, normalization is required to account for potential confounding factors, such as:

Sequencing depth: refers to the total number of reads obtained for each sample
Gene length: affects the number of reads mapped to a gene
RNA composition: can be affected by a number of factors, including contamination, highly differentially expressed genes, or different numbers of genes expressed between samples.

Different scenarios and data types use different normalization methods, some of which are:

CPM (counts per million mapped reads)
TPM (transcripts per million)
RPKM/FPKM (reads/fragments per kilobase of exon per million reads/fragments mapped)
Median of ratios (used by DESeq2)
Trimmed mean of M-values, TMM (used by edgeR)
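The median-of-ratios idea can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the DESeq2 implementation: the function name `size_factors` and the toy count matrix are hypothetical, and genes with a zero count in any sample are simply skipped when building the per-gene reference.

```python
import math
from statistics import median

def size_factors(counts):
    """Estimate one size factor per sample.

    counts: list of genes, each a list of raw counts (one per sample).
    """
    n_samples = len(counts[0])
    # Keep genes with nonzero counts in every sample so the log is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the geometric mean per gene acts as a pseudo-reference sample.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Ratio of each gene to its reference, on the log scale.
        log_ratios = [math.log(row[j]) - ref
                      for row, ref in zip(usable, log_geo_means)]
        # The median ratio is robust to a few strongly changing genes.
        factors.append(math.exp(median(log_ratios)))
    return factors

# Toy data (hypothetical): sample 2 was sequenced twice as deeply as sample 1.
counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],   # skipped when estimating size factors
]
sf = size_factors(counts)
normalized = [[c / f for c, f in zip(row, sf)] for row in counts]
```

Dividing each sample's counts by its size factor puts the samples on a common scale; in this toy example the first three genes end up with equal normalized values in both samples.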

Note that RPKM/FPKM is not recommended for between-sample comparisons: the sum of RPKM/FPKM-normalized values differs from sample to sample, so the same value can represent a different share of the transcriptome in each sample, which makes it difficult to compare a gene's normalized value across samples.
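The difference between these length-aware measures can be checked with a few lines of arithmetic. The following is a minimal sketch using the standard definitions of CPM, RPKM, and TPM; the gene lengths and counts are made-up values chosen only for illustration.

```python
def cpm(counts):
    """Counts per million: corrects for sequencing depth only."""
    total = sum(counts)
    return [c / total * 1e6 for c in counts]

def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: depth first, then gene length."""
    total = sum(counts)
    return [c / (length / 1e3) / (total / 1e6)
            for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: gene length first, then rescale to 1e6."""
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]

# Hypothetical gene lengths (bp) and raw counts for two samples.
lengths = [1000, 2000, 500]
sample_a = [100, 200, 50]
sample_b = [300, 100, 20]

rpkm_sum_a = sum(rpkm(sample_a, lengths))
rpkm_sum_b = sum(rpkm(sample_b, lengths))
tpm_sum_a = sum(tpm(sample_a, lengths))
tpm_sum_b = sum(tpm(sample_b, lengths))
```

Here `tpm_sum_a` and `tpm_sum_b` are both one million by construction, whereas `rpkm_sum_a` and `rpkm_sum_b` differ, which is the property that makes RPKM/FPKM values hard to compare between samples.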

Quality Control in Differential Expression Analysis

Quality control (QC) is an important step in the differential expression analysis workflow. It helps ensure that the data is reliable and can provide meaningful biological insights.

Sample-Level Quality Control

Sample-level QC uses techniques such as principal component analysis (PCA) and hierarchical clustering to assess the overall similarity between samples: whether replicates group together, whether the experimental condition is a major source of variation, and whether any sample is an outlier. Gene-level quality control, in turn, removes genes that are unlikely to be detected as differentially expressed, which helps to increase the robustness of the study.

Principal Component Analysis (PCA):
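PCA is a dimensionality reduction technique that finds the directions of greatest variation in a dataset and assigns them to principal components, ordered so that the first component explains the most variation, the second the next most, and so on. Plotting samples along the first few components shows how much variation each one explains and reveals patterns in the data, such as samples grouping by condition or batch.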
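To make the idea concrete, the sketch below computes only the first principal component of a tiny expression matrix using power iteration on the sample-by-sample Gram matrix. It is an illustration under stated assumptions, not a recommended implementation: the function name `pc1_scores`, the log2(x + 1) transform, and the toy data are all hypothetical choices, and a real analysis would normally rely on an established PCA routine.

```python
import math

def pc1_scores(expr, n_iter=200):
    """Return (PC1 score per sample, fraction of variance explained by PC1).

    expr: list of samples, each a list of normalized expression values.
    """
    n, p = len(expr), len(expr[0])
    # Log-transform to dampen the influence of highly expressed genes.
    logged = [[math.log2(v + 1) for v in row] for row in expr]
    # Center each gene across samples.
    means = [sum(logged[i][g] for i in range(n)) / n for g in range(p)]
    centered = [[logged[i][g] - means[g] for g in range(p)] for i in range(n)]
    # Gram matrix of samples; its top eigenvector gives the PC1 scores.
    gram = [[sum(a * b for a, b in zip(centered[i], centered[j]))
             for j in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
    for _ in range(n_iter):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigval = sum(v[i] * sum(gram[i][j] * v[j] for j in range(n))
                 for i in range(n))
    scores = [x * math.sqrt(eigval) for x in v]
    explained = eigval / sum(gram[i][i] for i in range(n))
    return scores, explained

# Toy data (hypothetical): two control and two treated samples, four genes.
expr = [
    [100, 50, 10, 200],   # control 1
    [110, 55, 12, 190],   # control 2
    [20, 300, 11, 205],   # treated 1
    [25, 280, 9, 195],    # treated 2
]
scores, explained = pc1_scores(expr)
```

In this toy example the two control samples receive PC1 scores of one sign and the two treated samples scores of the opposite sign, which is the kind of separation a PCA plot is used to check.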

Fig. 1. PCA plots. (Jiménez-Jacinto V, et al. 2019)

Hierarchical Clustering Heatmap:
This method helps to identify strong patterns and potential outliers in a dataset. The heatmap shows the pairwise correlation of gene expression between all samples in the dataset, with samples ordered by hierarchical clustering. It indicates which samples are most similar to each other based on normalized gene expression values.
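The values shown in such a heatmap are pairwise correlations between samples. The sketch below computes a Pearson correlation matrix for a few samples and reports each sample's most similar partner; it does not perform the clustering or draw the heatmap. The sample names and log-scale expression values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(samples):
    """samples: dict mapping sample name -> expression values per gene."""
    names = list(samples)
    return {a: {b: pearson(samples[a], samples[b]) for b in names}
            for a in names}

# Hypothetical log-scale normalized expression for four genes.
samples = {
    "ctrl_1": [6.7, 5.7, 3.5, 7.6],
    "ctrl_2": [6.8, 5.8, 3.7, 7.6],
    "trt_1": [4.4, 8.2, 3.6, 7.7],
    "trt_2": [4.7, 8.1, 3.3, 7.6],
}
corr = correlation_matrix(samples)

# For each sample, the other sample it correlates with most strongly.
closest = {
    a: max((b for b in samples if b != a), key=lambda b: corr[a][b])
    for a in samples
}
```

A replicate whose closest partner belongs to a different condition, or whose correlations with every other sample are low, would be worth inspecting as a potential outlier.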

Gene-level QC

In addition to sample-level QC, gene-level QC is equally important. This involves excluding genes with zero counts in all samples, genes with extreme count outliers, and genes with low average normalized counts. Because these genes have little chance of being called significant, removing them reduces the multiple-testing burden and improves the ability to detect truly differentially expressed genes.
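A minimal sketch of such a filter is shown below. It covers only the zero-count and low-mean criteria; outlier detection is omitted. The function name `filter_genes`, the threshold of 5, and the gene names are hypothetical, and a suitable cutoff depends on the dataset and the tool used.

```python
def filter_genes(counts, min_mean=5.0):
    """Drop genes that are all-zero or have a low mean normalized count.

    counts: dict mapping gene name -> normalized counts per sample.
    min_mean: hypothetical cutoff on the mean count across samples.
    """
    kept = {}
    for gene, row in counts.items():
        if sum(row) == 0:
            continue  # never detected in any sample
        if sum(row) / len(row) < min_mean:
            continue  # too lowly expressed to test reliably
        kept[gene] = row
    return kept

# Toy data (hypothetical gene names and values).
counts = {
    "geneA": [120, 130, 40, 35],
    "geneB": [0, 0, 0, 0],
    "geneC": [1, 0, 2, 1],
    "geneD": [8, 12, 9, 10],
}
kept = filter_genes(counts)
```

With these values, `geneB` is removed for having no counts and `geneC` for its low mean, leaving `geneA` and `geneD` for testing.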

Actual Differential Expression Analysis

This step reveals which genes are expressed at different levels under different conditions, providing insight into the biological processes affected by those conditions. Tools such as DESeq2 or edgeR can be used for this purpose, each with its own strengths and approaches. It is worth noting that DESeq2 automatically filters out certain classes of genes, whereas tools such as edgeR require an explicit pre-filtering step.

Fig. 2. Overview of the RNA-Seq differential expression analysis pipeline. (Costa-Silva J, et al. 2017)

Differential Expression Analysis Data Interpretation

Interpreting the results of differential expression analysis is a critical step that requires careful attention. Results for differentially expressed genes (DEGs) are usually reported as a log2 fold change and an adjusted p-value. The log2 fold change indicates the magnitude and direction of the expression difference between conditions (for example, a log2 fold change of 1 corresponds to a doubling), while the adjusted p-value indicates the statistical significance of the difference after correction for multiple testing.
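Both quantities can be illustrated with simple arithmetic. The sketch below computes a log2 ratio of group means with a pseudocount and applies the Benjamini-Hochberg procedure as one common way of adjusting p-values; the source does not state which adjustment is used, so this choice is an assumption. The function names, the pseudocount of 1, and the example numbers are hypothetical, and dedicated tools estimate fold changes and p-values with statistical models rather than this direct calculation.

```python
import math

def log2_fold_change(mean_treated, mean_control, pseudocount=1.0):
    """log2 ratio of group means; the pseudocount avoids division by zero."""
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical example: a gene with mean 399 in treated and 99 in control.
lfc = log2_fold_change(399, 99)

# Hypothetical raw p-values for four genes.
raw_p = [0.001, 0.04, 0.03, 0.5]
adj_p = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in adj_p]
```

Note that the second and third genes have raw p-values below 0.05 but adjusted values above it, which shows why significance calls should be based on the adjusted p-value.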

It is also important to keep in mind the potential impact of biological and technical variation on DEG interpretation. In addition, differentially expressed genes should be considered in the context of the biological system or disease state under study. Utilizing additional bioinformatic resources, such as gene ontology (GO) enrichment analysis or pathway analysis, can provide greater insight into the biological significance of identified DEGs.

References:

Jiménez-Jacinto V, Sanchez-Flores A, Vega-Alvarado L. Integrative differential expression analysis for multiple experiments (IDEAMEX): a web server tool for integrated RNA-Seq data analysis. Frontiers in Genetics, 2019, 10: 279.
Costa-Silva J, Domingues D, Lopes F M. RNA-Seq differential expression analysis: An extended review and a software tool. PLoS ONE, 2017, 12(12): e0190152.