To fully understand an organism's complexity and the diversity of cell types that can arise from a single genome, and to compare the gene complements of different evolutionary groups, we need to be able to observe and record changes in gene expression in a cell or tissue. The transcriptome is a cell's entire collection of transcripts (RNA molecules), comprising both protein-coding and non-coding RNAs. It also includes all alternative splice variants, polyadenylated transcripts, and RNA-edited transcripts. Taken together, these represent the genes that are actively expressed in a specific tissue.
Microarrays, which use printed or manufactured probes corresponding to mRNAs, were formerly the primary tool for assessing gene expression. While these technologies are reliable and provide a sophisticated framework for data analysis, they require a fully annotated genome in order to design probes. Microarrays are also limited by inaccurate hybridization of sequences to probes, which is difficult to model and account for. They remain extremely effective for measuring and comparing gene expression in model organisms. However, in the absence of high-quality annotation and adequate arrays, DNA sequencing is the most effective way to decipher the transcriptome. With the development of next-generation sequencing (NGS) technologies and improved extraction methods that reliably collect RNA from smaller volumes of tissue or even single cells, it has become possible to catalog and evaluate gene expression in a much wider range of species.
Figure 1. An overview of the two transcriptome assembly pipelines. (Moreton, 2016)
Reference-Based Transcriptome Assembly Method
When a model organism with a sequenced genome is available as a reference for the target transcriptome, reference-based transcriptome assembly is commonly employed. In this approach, the transcriptome is reconstructed by mapping reads to previously identified sequences. Once the short reads are aligned to the reference genome, their overlapping portions are assembled into transcripts. The reference-based strategy is highly sensitive when a good-quality reference is available, and it has become the standard method for many RNA sequencing (RNA-seq) investigations. However, the quality of a reference-based assembly depends on accurate read alignment, which is complicated by factors such as alternative splicing and sequencing errors.
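The step of assembling overlapping aligned reads can be illustrated with a deliberately simplified sketch. It assumes each read has already been aligned and is represented only by a hypothetical (start, end) interval on a single reference sequence, and it merges overlapping intervals into contiguous covered regions. Real reference-based assemblers also have to handle spliced alignments, strand, and multiple isoforms, none of which is modeled here; the function name and the coordinates are illustrative only.

```python
def merge_aligned_reads(alignments):
    """Merge overlapping half-open (start, end) intervals into covered regions.

    `alignments` is a hypothetical list of read positions on one reference
    sequence; it stands in for the output of a real read aligner.
    """
    merged = []
    for start, end in sorted(alignments):
        if merged and start < merged[-1][1]:
            # This read overlaps the current region, so extend the region.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # No overlap with the previous region: start a new one.
            merged.append([start, end])
    return [tuple(region) for region in merged]


# Illustrative coordinates only (not taken from any real data set).
reads = [(100, 150), (120, 170), (160, 210), (500, 550), (530, 580)]
regions = merge_aligned_reads(reads)
print(regions)  # [(100, 210), (500, 580)]
```

Sorting by start position first means each read only needs to be compared with the most recently built region, which keeps the merge a single pass over the sorted alignments.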
Compared with de novo transcriptome assembly, reference-based transcriptome assembly requires substantially less computing power. Furthermore, artifacts and sequencing contamination are not a serious concern, because they can frequently be rectified during read alignment to the genome. On the other hand, the quality of the results depends heavily on the reference genome used.
De novo Transcriptome Assembly Method
Constructing de Bruijn graphs is a common approach among de novo transcriptome assemblers. In this method, all subsequences of length k, referred to as "k-mers," are extracted from the reads. Each unique k-mer becomes a node in the de Bruijn graph, and edges connect k-mers that overlap immediately: when shifting a k-mer by one base yields another k-mer (so that the two overlap by k-1 bases), an edge is drawn between the corresponding nodes. Wherever possible, a linear chain of k-mer nodes (consecutive nodes joined by a single unique edge) is collapsed into a single node. The paths of the graph can then be traversed to build transcript variants.
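The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular assembler: the function names (`build_graph`, `compact`, `transcripts`), the toy reads, and the choice of k = 4 are all assumptions made for the example. It ignores sequencing errors, reverse complements, and k-mer coverage, and the path traversal assumes the graph has no cycles.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every length-k substring of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_graph(reads, k):
    """Map each unique k-mer to the set of k-mers it overlaps by k-1 bases."""
    nodes = {kmer for read in reads for kmer in kmers(read, k)}
    by_prefix = defaultdict(set)
    for node in nodes:
        by_prefix[node[:-1]].add(node)
    # Edge u -> v when the last k-1 bases of u equal the first k-1 bases of v.
    return {node: set(by_prefix.get(node[1:], ())) for node in nodes}


def compact(edges):
    """Collapse linear chains of nodes into single sequences."""
    preds = defaultdict(set)
    for u, succs in edges.items():
        for v in succs:
            preds[v].add(u)

    def mergeable(u, v):
        # u -> v is the only edge leaving u and the only edge entering v.
        return u != v and len(edges[u]) == 1 and len(preds[v]) == 1

    def starts_chain(node):
        return not any(mergeable(p, node) for p in preds[node])

    visited, contigs = set(), []
    # Visit chain starts first; leftover nodes belong to isolated cycles.
    for node in sorted(edges, key=lambda n: (not starts_chain(n), n)):
        if node in visited:
            continue
        seq, cur = node, node
        visited.add(cur)
        while True:
            nxt = next(iter(edges[cur]), None)
            if nxt is None or nxt in visited or not mergeable(cur, nxt):
                break
            seq += nxt[-1]  # each step along the chain adds one new base
            visited.add(nxt)
            cur = nxt
        contigs.append(seq)
    return sorted(contigs)


def transcripts(edges):
    """Spell out every source-to-sink path (assumes an acyclic graph)."""
    has_pred = {v for succs in edges.values() for v in succs}
    results = []

    def walk(node, seq):
        if not edges[node]:
            results.append(seq)
        for nxt in sorted(edges[node]):
            walk(nxt, seq + nxt[-1])

    for source in sorted(set(edges) - has_pred):
        walk(source, source)
    return results


# Two toy reads that share a prefix and then diverge, like two splice variants.
reads = ["ACGTTGCA", "ACGTTCCA"]
graph = build_graph(reads, k=4)
print(compact(graph))      # ['ACGTT', 'GTTCCA', 'GTTGCA']
print(transcripts(graph))  # ['ACGTTCCA', 'ACGTTGCA']
```

In this toy case the shared prefix collapses into one node and each divergent suffix into another, so the compacted graph has a single branch point, and the two source-to-sink paths recover the two original sequences.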