Next-generation sequencing (NGS) is a high-throughput method that generates enormous amounts of data. It surpasses traditional Sanger sequencing by producing billions of sequence reads from millions of clusters simultaneously, base by base, through repeated cycles of nucleotide incorporation, fluorescence signal reading, and dye cleavage. This explosion in sequence volume has made biological data analysis more arduous; hence, much of the effort is devoted to determining the alignment and origin of sequence reads. Through this, DNA sequences can be accurately identified, characterized, quantified, and differentiated.
For example, on the Illumina platform, the raw output of NGS machines is the .bcl format, which contains each base call and its quality score for every cycle. Each base call is recorded in real time as the machine makes it. The .bcl format, however, is not useful outside the sequencing machine. Hence, the raw output must be organized to separate reads assigned to different indices in a process called demultiplexing. During demultiplexing, the raw .bcl output is converted to the universally used FASTQ format, which records a name assigned to each read, its location on the flow cell, its index, the base calls, and the quality scores.
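A FASTQ record is simply four lines of text per read. The sketch below, with an invented read name and sequence, shows how the fields described above map onto those four lines:

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) tuples from FASTQ-formatted lines."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()        # base calls
        next(it)                      # the '+' separator line
        qual = next(it).strip()       # Phred quality scores, one char per base
        yield header.strip().lstrip("@"), seq, qual

# A hypothetical record: '@' header carries the read name, flow-cell
# coordinates, and index; the sequence and quality lines are equal length.
record = [
    "@SEQ_1 1:N:0:ATCACG",
    "GATTTGGGGTTCAA",
    "+",
    "!''*((((***+))",
]
name, seq, qual = next(parse_fastq(record))
```

Each quality character encodes the sequencer's confidence in the corresponding base call, which downstream tools use for filtering and variant calling.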
After demultiplexing, downstream analyses such as alignment of the FASTQ data are performed. The most common approach is alignment against a reference genome. To name a few, Bowtie2, BWA, Maq, NovoAlign, and Stampy are among the most prominent mapping programs; they can map billions of reads to large reference genomes extremely quickly. Computational strategies such as indexing and the Burrows-Wheeler transform are implemented to analyze and map the reads efficiently. An index of a large DNA sequence allows short sequences embedded in it to be located quickly. In this procedure, a read is partitioned into four portions of equal length, called 'seeds'. Whether a seed aligns depends on its similarity to the reference genome: if the read carries SNPs, only some of the seeds will align, whereas otherwise all of them align perfectly. In this way, candidate locations in the reference genome can be narrowed down while allowing a maximum of two mismatches. The Burrows-Wheeler transform serves a similar purpose to indexing but requires far less memory, roughly 2 gigabytes compared with the roughly 50 gigabytes a conventional index needs. It is the most complicated approach, but its algorithm is faster and more memory-efficient.
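The Burrows-Wheeler transform itself is a reversible rearrangement of a text that groups similar characters together, which is what makes the compact indexes used by BWA and Bowtie2 possible. A minimal, naive Python sketch (real aligners use far more efficient constructions; the sequence here is an invented example):

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations ('$' marks the end)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)  # last column of the matrix

def inverse_bwt(transformed):
    """Recover the original text, demonstrating the transform is lossless."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        # Prepend the transformed column and re-sort to rebuild the rotations.
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    row = next(r for r in table if r.endswith("$"))
    return row.rstrip("$")

original = bwt("GATTACA")            # returns "ACTGA$TA"
recovered = inverse_bwt(original)    # returns "GATTACA"
```

Because no information is lost, an aligner can search the transformed (and compressed) index directly instead of scanning the full reference genome.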
De novo assembly is performed when the sample being sequenced has no reference genome. In de novo assembly, reads are compared against each other to find overlapping sequences, which are merged into contigs, larger contiguous sequences, with the goal of covering the organism's entire genome. A vital element in de novo experiments is paired-end reads, which link two contigs separated by a homopolymer stretch or another region that is difficult to sequence. Additionally, mate-pair reads link contigs over larger distances, for example across long microsatellites that cannot be assembled directly.
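The core overlap-and-merge step can be sketched in a few lines. This is a deliberately simplified illustration with invented reads; real assemblers use overlap graphs or de Bruijn graphs and must handle sequencing errors:

```python
def longest_overlap(a, b, min_len=3):
    """Length of the longest suffix of read a that is a prefix of read b,
    requiring at least min_len bases to avoid spurious overlaps."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge_reads(a, b, min_len=3):
    """Merge two overlapping reads into a single contig, or None if no overlap."""
    k = longest_overlap(a, b, min_len)
    return a + b[k:] if k else None

# Two hypothetical reads sharing the 4-base overlap "CCTG":
contig = merge_reads("ATTAGACCTG", "CCTGCCGGAA")  # "ATTAGACCTGCCGGAA"
```

Repeating this merge across millions of reads, guided by paired-end and mate-pair links, is what gradually extends contigs toward full genome coverage.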
After alignment, a SAM or BAM file is generated; these are the universal formats for mapped sequence reads. They contain the sequence, quality scores, the genomic locus to which each read maps, the aligner's confidence in the mapping, mate-read information, and a CIGAR string summarizing the alignment operations (matches, insertions, and deletions). Variant calling then follows, which determines the presence of SNPs, indels, and de novo SNVs by comparing the mapped data against the reference.
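As an illustration of how the CIGAR string encodes an alignment, a short sketch (the CIGAR value is an invented example; real pipelines would use a library such as pysam rather than hand-parsing):

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs.
    M = match/mismatch, I = insertion, D = deletion, S/H = clipping."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def aligned_ref_span(cigar):
    """Reference bases consumed by the alignment: M, D, N, =, X consume the
    reference; insertions and clipped bases (I, S, H, P) do not."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")

ops = parse_cigar("8M2I4M1D3M")   # [(8,'M'), (2,'I'), (4,'M'), (1,'D'), (3,'M')]
span = aligned_ref_span("8M2I4M1D3M")  # 16 reference bases
```

Variant callers rely on exactly this bookkeeping to place each read's insertions and deletions at the correct reference coordinates before testing for SNPs and indels.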
The bioinformatics analysis department of CD Genomics provides novel solutions for data-driven innovation aimed at discovering the hidden potential in biological data, tapping new insights related to life science research, and predicting new prospects.