Cellular function depends largely on which genes are active. Discovering which genes are differentially expressed between conditions is therefore often a necessary part of understanding the molecular mechanisms underlying phenotypic differences between diseased and normal tissues, or between untreated and treated samples. RNA-seq provides a quantitative readout of the transcriptional status of cells. CD Genomics provides targeted RNA sequencing analysis services to help you study specific transcripts at genome scale.
The next step in the RNA-seq workflow is differential expression analysis. Differential expression analysis identifies which genes are expressed at different levels under different conditions and plays a critical role in understanding the transcriptional variation that underpins unique biological states.
Count Data and Normalization
Differential expression analysis typically begins with count data: the number of sequence reads mapped to each gene in each sample. In general, the higher the count, the higher the expression level of that gene in the sample. Raw counts are not directly comparable, however, so normalization is required before analysis to account for confounding factors such as sequencing depth, gene length, and RNA composition.
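Different scenarios and data types call for different normalization methods, including counts per million (CPM), transcripts per million (TPM), RPKM/FPKM, DESeq2's median of ratios, and edgeR's trimmed mean of M-values (TMM).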
Note that RPKM/FPKM is not recommended for comparisons between samples. The total of the RPKM/FPKM-normalized counts differs from sample to sample, so the normalized value of a given gene cannot be compared directly across samples.
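The point about RPKM/FPKM totals can be illustrated with a small sketch. The counts and gene lengths below are made-up toy values, and the function names are hypothetical. Transcripts per million (TPM), a related length-normalized measure, is included only for contrast.

```python
# Toy raw counts for three genes in two samples (made-up numbers).
# Both samples have the same sequencing depth (1,000 reads).
gene_lengths_kb = {"geneA": 2.0, "geneB": 1.0, "geneC": 4.0}
counts = {
    "sample1": {"geneA": 100, "geneB": 200, "geneC": 700},
    "sample2": {"geneA": 300, "geneB": 100, "geneC": 600},
}


def rpkm(sample_counts, lengths_kb):
    """Reads per kilobase per million: scale by depth, then by gene length."""
    per_million = sum(sample_counts.values()) / 1e6
    return {g: c / per_million / lengths_kb[g] for g, c in sample_counts.items()}


def tpm(sample_counts, lengths_kb):
    """Transcripts per million: scale by gene length, then rescale to sum to 1e6."""
    rate = {g: c / lengths_kb[g] for g, c in sample_counts.items()}
    total = sum(rate.values())
    return {g: r / total * 1e6 for g, r in rate.items()}


# Sum the normalized values within each sample.
rpkm_totals = {s: sum(rpkm(c, gene_lengths_kb).values()) for s, c in counts.items()}
tpm_totals = {s: sum(tpm(c, gene_lengths_kb).values()) for s, c in counts.items()}

print("RPKM totals:", rpkm_totals)  # the per-sample totals differ
print("TPM totals:", tpm_totals)    # each sample sums to (about) one million
```

Because the RPKM total differs between the two toy samples, the same RPKM value represents a different share of each sample, which is why direct between-sample comparison is discouraged.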
Quality Control in Differential Expression Analysis
Quality control (QC) is an important step in the differential expression analysis workflow. It helps ensure that the data is reliable and can provide meaningful biological insights.
Sample-Level Quality Control
Sample-level QC uses techniques such as principal component analysis (PCA) and hierarchical clustering to assess the overall similarity between samples, for example whether replicates group together and whether any sample is an outlier. Gene-level QC then removes genes that are unlikely to be detected as differentially expressed, which helps to increase the robustness of the study.
Fig. 1. PCA plots. (Jiménez-Jacinto V, et al. 2019)
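As a rough illustration of what sample-level QC looks at, the sketch below computes pairwise Pearson correlations between samples on log-transformed counts. A correlation or distance matrix of this kind is a common starting point for hierarchical clustering. It is not a PCA implementation, and the sample names and count values are invented for the example.

```python
import math

# Toy normalized counts: rows are genes, columns are samples (made-up values).
samples = ["ctrl_1", "ctrl_2", "treated_1", "treated_2"]
normalized_counts = [
    [100, 110, 400, 380],
    [50, 45, 10, 12],
    [200, 210, 190, 205],
    [5, 8, 60, 55],
    [300, 280, 150, 160],
]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# log2(count + 1) dampens the influence of very highly expressed genes.
log_columns = [
    [math.log2(row[j] + 1) for row in normalized_counts]
    for j in range(len(samples))
]

# Correlation for every pair of samples, keyed by (sample, sample).
corr = {
    (a, b): pearson(log_columns[i], log_columns[j])
    for i, a in enumerate(samples)
    for j, b in enumerate(samples)
}

for a in samples:
    print(a, [round(corr[(a, b)], 2) for b in samples])
```

In this toy data the replicates of a condition correlate more strongly with each other than with samples from the other condition. A sample that correlates poorly with its own replicates would be a candidate outlier worth inspecting.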
Gene-level QC
Gene-level QC is as important as sample-level QC. It involves filtering out genes with zero counts in all samples, genes with extreme count outliers, and genes with low mean normalized counts across all samples. This filtering increases the power to detect truly differentially expressed genes.
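A minimal sketch of two of these filters is shown below: dropping genes with zero counts in every sample and genes with a low mean normalized count. The threshold `MIN_MEAN_COUNT`, the function name, and the data are hypothetical choices for illustration. Outlier detection is not covered here.

```python
# Toy normalized counts per gene across four samples (made-up values).
normalized_counts = {
    "geneA": [120.0, 98.5, 143.2, 110.7],
    "geneB": [0.0, 0.0, 0.0, 0.0],   # never detected
    "geneC": [1.2, 0.0, 2.1, 0.8],   # low in every sample
    "geneD": [15.3, 22.8, 9.9, 18.4],
}

# Hypothetical cutoff; a suitable value depends on the data set.
MIN_MEAN_COUNT = 5.0


def passes_filter(values, min_mean=MIN_MEAN_COUNT):
    """Keep a gene only if it is detected and its mean count reaches min_mean."""
    if all(v == 0 for v in values):
        return False
    return sum(values) / len(values) >= min_mean


kept = {gene: vals for gene, vals in normalized_counts.items() if passes_filter(vals)}
print(sorted(kept))
```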
Actual Differential Expression Analysis
Differential expression testing reveals which genes are expressed at different levels under different conditions, providing insight into the biological processes affected by those conditions. Tools such as DESeq2 or edgeR can be used for this purpose, each with its own strengths and approaches. Note that DESeq2 automatically filters out certain classes of genes, whereas tools such as edgeR require an explicit pre-filtering step.
Fig. 2. Overview of the RNA-Seq differential expression analysis pipeline. (Costa-Silva J, et al. 2017)
Interpreting the results of differential expression analysis is a critical step that requires careful attention. The result for each differentially expressed gene (DEG) is usually reported as a log2 fold change and an adjusted p-value. The log2 fold change indicates the magnitude and direction of the expression difference between conditions, while the adjusted p-value indicates the statistical significance of that difference after correction for multiple testing.
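To make the two quantities concrete, the sketch below computes a log2 fold change (the log2 ratio of expression between conditions) from two mean expression values, and adjusts a list of raw p-values with the Benjamini-Hochberg procedure, a widely used multiple-testing correction. This is a simplified illustration, not how DESeq2 or edgeR estimate fold changes, which come from a fitted statistical model. The pseudocount and all input values are assumptions.

```python
import math


def log2_fold_change(mean_treated, mean_control, pseudocount=1.0):
    """log2 ratio of treated to control; the pseudocount avoids division by zero."""
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# A fourfold increase corresponds to a log2 fold change of 2.
lfc = log2_fold_change(mean_treated=399.0, mean_control=99.0)
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(lfc, [round(p, 4) for p in padj])
```

A positive log2 fold change means higher expression in the treated condition and a negative value means lower expression. A change of 1 corresponds to a doubling and -1 to a halving.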
It is also important to keep in mind the potential impact of biological and technical variation on the interpretation of DEGs. In addition, differentially expressed genes should be considered in the context of the biological system or disease state under study. Additional bioinformatic analyses, such as gene ontology (GO) enrichment analysis or pathway analysis, can provide greater insight into the biological significance of the identified DEGs.
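One common way to test whether a GO term is enriched among the DEGs is an over-representation test based on the hypergeometric distribution. The sketch below shows that calculation under simplifying assumptions: a single term, a fixed background gene set, and no multiple-testing correction across terms. The function name and the numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(n_background, n_in_term, n_degs, n_overlap):
    """Probability of seeing at least n_overlap term genes among n_degs genes
    drawn without replacement from a background of n_background genes,
    of which n_in_term are annotated to the term."""
    upper = min(n_in_term, n_degs)
    total = comb(n_background, n_degs)
    tail = sum(
        comb(n_in_term, k) * comb(n_background - n_in_term, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Toy example: 20 background genes, 5 annotated to the term,
# 4 DEGs, 3 of which carry the annotation.
p = enrichment_p_value(n_background=20, n_in_term=5, n_degs=4, n_overlap=3)
print(round(p, 4))
```

A small p-value suggests the term is over-represented among the DEGs relative to the background. In practice many terms are tested at once, so the resulting p-values would themselves need adjusting.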
References: