Normalization in gene expression refers to the process of adjusting gene expression data to account for technical variability introduced during sample preparation, such as differences in RNA quantity, quality, and amplification efficiency. Normalization is important to ensure that gene expression data accurately reflect biological differences between samples, rather than technical artifacts.
RNA-seq is a method that measures gene expression using next-generation sequencing (NGS) technology. The raw number of reads (often referred to as the Count) mapped to a reference gene is not, on its own, a statistically sound measure of gene expression, so a normalization method is needed before values can be compared across genes or samples. Understanding the concepts of Counts, RPKM, FPKM, TPM and RPM makes the normalization of data easier to follow.
The most direct way to quantify gene expression is to count how many reads are mapped to each gene; in transcriptome sequencing these values are called Counts. Raw counts are not suitable for direct expression comparisons between samples, for reasons such as different amounts of starting RNA in each sequenced sample, different library sizes, and different amounts of sequencing data. There are various algorithms for correcting counts, and several common ones are described below.
The raw read count matrix holds absolute values. Absolute values sit on different scales (because of gene length and sequencing depth) and are not directly comparable with each other. The aim of these normalization methods is to transform the count matrix into relative values, removing the effect of technical bias so that the subsequent differential analysis is statistically meaningful.
RPM (Reads Per Million mapped reads) multiplies each read count by 10^6 and divides it by the total number of mapped reads. This normalizes the effect of sequencing depth, but does not take into account the effect of transcript length. RPM is suitable for sequencing methods where the number of reads generated is not affected by gene length, such as miRNA sequencing, where miRNAs are typically 20-24 bases in length.
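As a minimal sketch of the RPM calculation described above, the following Python snippet scales a set of read counts by the total number of mapped reads. The function name `rpm` and the miRNA names and counts are hypothetical and chosen only for illustration.

```python
def rpm(counts):
    """Scale raw read counts to reads per million mapped reads (RPM)."""
    total_mapped = sum(counts.values())
    return {name: count * 1e6 / total_mapped for name, count in counts.items()}


# Hypothetical miRNA read counts for one sample (illustrative values only).
counts = {"miR-A": 500, "miR-B": 1500, "miR-C": 8000}
rpm_values = rpm(counts)
```

Because gene length is not used anywhere, the RPM values of a sample always add up to one million.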
FPKM stands for Fragments Per Kilobase of exon model per Million mapped fragments. It is a correction method that divides the fragment count of each gene by the gene length (in kilobases) and then by the amount of sequencing data (in practice, the total number of successfully mapped fragments, in millions) to obtain the relative expression level of each gene.
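Written as a formula, the description above can be expressed as follows. The symbols are notation introduced here for illustration: X_i is the number of fragments mapped to gene i, L_i is the length of gene i in bases, and N is the total number of mapped fragments in the sample.

```latex
\mathrm{FPKM}_i
  = \frac{X_i}{\left(\dfrac{L_i}{10^{3}}\right)\left(\dfrac{N}{10^{6}}\right)}
  = \frac{X_i \times 10^{9}}{L_i \, N}
```

The factor 10^9 is simply the product of the kilobase (10^3) and per-million (10^6) scaling factors.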
RPKM (Reads Per Kilobase per Million mapped reads) is the number of reads per kilobase of gene length per million mapped reads. It is calculated as the number of reads mapped to a gene, divided by the total number of reads mapped to the genome (in millions) and by the length of the RNA (in kb). It is used in RNA-seq to indicate the level or abundance of gene expression.
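RPKM can be calculated in 3 steps: (1) divide the total number of mapped reads in the sample by 1,000,000 to obtain a "per million" scaling factor; (2) divide the read count of each gene by this scaling factor to obtain RPM; (3) divide the RPM value by the length of the gene in kilobases to obtain RPKM.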
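A minimal sketch of this three-step calculation (per-million scaling factor, then RPM, then division by gene length in kilobases) might look like the following. The function name `rpkm`, the gene names, the counts, and the lengths are hypothetical values used only for illustration.

```python
def rpkm(counts, lengths_bp):
    """Compute RPKM from raw read counts and gene lengths given in bases."""
    # Step 1: "per million" scaling factor from the total mapped reads.
    per_million = sum(counts.values()) / 1e6
    result = {}
    for gene, count in counts.items():
        # Step 2: normalize for sequencing depth (this is RPM).
        reads_per_million = count / per_million
        # Step 3: normalize for gene length, expressed in kilobases.
        result[gene] = reads_per_million / (lengths_bp[gene] / 1000)
    return result


# Hypothetical read counts and gene lengths (illustrative values only).
counts = {"geneA": 1000, "geneB": 2000, "geneC": 7000}
lengths_bp = {"geneA": 2000, "geneB": 4000, "geneC": 1000}
rpkm_values = rpkm(counts, lengths_bp)
```

In this toy data geneB has twice as many reads as geneA but is also twice as long, so the two genes end up with the same RPKM.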
The principles of FPKM and RPKM are similar; the difference is that FPKM counts DNA fragments, whereas RPKM counts reads. A fragment is a broader unit than a read, so FPKM applies to both single-end and paired-end data. For example, in Illumina paired-end RNA-seq, a pair of reads corresponds to one DNA fragment.
With FPKM and RPKM, we can compare the relative expression of gene A and gene B in the same sample, or, with the caveats discussed below, the relative expression of the same gene in different samples. Gene expression values are usually reported as RPKM or FPKM.
TPM is becoming more and more popular. TPM stands for Transcripts Per Million.
So, what are the differences between TPM and RPKM/FPKM? TPM is very similar to RPKM/FPKM; the only difference is the order of calculation. When calculating TPM, the gene length is normalized first, followed by the normalization of the sequencing depth. As a result, the TPM of a gene can be thought of as that gene's share of the sample's total RPKM/FPKM, scaled to one million. The different order of normalization can have a significant impact on the results. When using TPM, the sum of the TPMs is the same for each sample. With FPKM and RPKM, in contrast, the cumulative sum can be different for each sample, so FPKM or RPKM values cannot be directly compared between samples. TPM therefore improves on the inaccuracy of the RPKM/FPKM method for quantification across samples.
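The effect of the calculation order can be illustrated with a small sketch that computes both measures for two samples. The function names `rpkm` and `tpm`, the gene names, the counts, and the lengths are all hypothetical and serve only as an illustration.

```python
def rpkm(counts, lengths_bp):
    """RPKM order: sequencing depth first, then gene length."""
    per_million = sum(counts.values()) / 1e6
    return {g: (c / per_million) / (lengths_bp[g] / 1000)
            for g, c in counts.items()}


def tpm(counts, lengths_bp):
    """TPM order: gene length first, then sequencing depth."""
    # Reads per kilobase (RPK) for every gene.
    rpk = {g: c / (lengths_bp[g] / 1000) for g, c in counts.items()}
    # The "per million" factor is computed from the RPK values, not raw counts.
    per_million = sum(rpk.values()) / 1e6
    return {g: value / per_million for g, value in rpk.items()}


# Hypothetical gene lengths and read counts for two samples.
lengths_bp = {"geneA": 2000, "geneB": 4000, "geneC": 1000}
sample_1 = {"geneA": 1000, "geneB": 2000, "geneC": 7000}
sample_2 = {"geneA": 3000, "geneB": 3000, "geneC": 500}

tpm_1, tpm_2 = tpm(sample_1, lengths_bp), tpm(sample_2, lengths_bp)
rpkm_1, rpkm_2 = rpkm(sample_1, lengths_bp), rpkm(sample_2, lengths_bp)

tpm_sum_1, tpm_sum_2 = sum(tpm_1.values()), sum(tpm_2.values())
rpkm_sum_1, rpkm_sum_2 = sum(rpkm_1.values()), sum(rpkm_2.values())
```

In this toy data both TPM sums equal one million (up to floating-point rounding), whereas the two RPKM sums differ, which is why a given RPKM value does not represent the same proportion of the transcript pool in each sample.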
The choice of the appropriate normalization method for RNA-seq data depends on the research question and experimental design. Common normalization methods include TPM, RPKM and FPKM, which we mentioned above.
TPM normalization is often used when comparing the expression of different genes in a sample. It takes into account differences in transcript length and sequencing depth, and, because TPM values sum to the same total in every sample, it makes values more comparable across samples.
RPKM and FPKM normalization have also often been used when comparing gene expression levels between different samples. Both methods take into account differences in gene length and sequencing depth. However, RPKM normalizes by the total number of reads mapped, while FPKM normalizes by the total number of fragments mapped (i.e. in paired-end data, a pair of reads counts as one fragment). Because the sum of RPKM or FPKM values can differ between samples, such cross-sample comparisons should be interpreted with care.
In general, TPM, RPKM and FPKM can all be used to compare gene expression levels within a sample, while TPM is usually preferred over RPKM/FPKM when comparing gene expression levels between samples, because TPM values sum to the same total in every sample. However, there is no one-size-fits-all solution, and researchers should carefully evaluate normalization methods based on their specific research questions and experimental design. It is also important to remember that the field of RNA-seq analysis continues to evolve, and new normalization methods may be developed in the future.