Sequencing technologies such as next-generation sequencing (NGS) have been useful tools for elucidating the relationships between genetic variation and changes in function or condition within a population. Regions of interest, identified within the genome, epigenome, transcriptome, or metagenome, are amplified using PCR, thereby creating sequencing libraries that contain only the small, targeted subset of the sample. This greatly increases throughput and decreases sequencing costs; however, specific conditions must be satisfied in order to produce good-quality sequencing data.
Amplicon sequencing enables variant identification and characterization in specific genomic regions through deep sequencing of PCR products (amplicons) with the use of oligonucleotide probes. It allows discoveries in complex samples, such as multi-species 16S/18S/ITS amplicon sequencing, and the identification of rare somatic mutations commonly observed in tumors. With this method, hard-to-sequence regions, such as those rich in GC nucleotides, can be covered with high depth and efficiency. This can be applied in extensive approaches such as whole-genome sequencing.
In applications such as whole-genome sequencing, the sequencing library consists of many different DNA fragments. This creates clusters with diverse sequences and a roughly equal representation of the four nucleotides at each sequencing cycle. Amplicon libraries, on the other hand, are created from the same short segment of sequence and therefore lack this diversity, hence the associated term, low-diversity library. The lack of diversity in amplicon sequencing can affect how data are collected and processed during analysis. Since the clusters have almost the same base sequence, unintended merged clusters and poor-quality base calls can occur because the raw signal is inaccurate during cluster identification.
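The contrast between a diverse library and a low-diversity library can be made concrete by counting bases per cycle. The following is a minimal sketch, not an instrument algorithm: the function names, the toy read sequences, and the read counts are all assumptions made for illustration. It reports the largest single-base fraction at each cycle, where 0.25 means perfectly balanced and 1.0 means every read shows the same base.

```python
from collections import Counter
import random

BASES = "ACGT"

def per_cycle_base_fractions(reads):
    """For each cycle (read position), return the fraction of reads showing each base."""
    if not reads:
        return []
    n_cycles = min(len(read) for read in reads)
    fractions = []
    for cycle in range(n_cycles):
        counts = Counter(read[cycle] for read in reads)
        total = sum(counts[b] for b in BASES)  # ignore non-ACGT symbols such as N
        fractions.append({b: (counts[b] / total if total else 0.0) for b in BASES})
    return fractions

def max_base_fraction(fractions):
    """Largest single-base fraction per cycle (0.25 = balanced, 1.0 = no diversity)."""
    return [max(cycle.values()) for cycle in fractions]

# Hypothetical amplicon-like library: every read is the same short sequence.
amplicon_reads = ["ACGTTGCA"] * 1000

# Hypothetical diverse library: random fragments (seeded so the run is reproducible).
rng = random.Random(0)
diverse_reads = ["".join(rng.choice(BASES) for _ in range(8)) for _ in range(1000)]

amplicon_skew = max_base_fraction(per_cycle_base_fractions(amplicon_reads))
diverse_skew = max_base_fraction(per_cycle_base_fractions(diverse_reads))

print("amplicon:", amplicon_skew)
print("diverse: ", [round(s, 2) for s in diverse_skew])
```

In the amplicon-like case every cycle is dominated by a single base, which is the situation that makes neighboring clusters hard to tell apart.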
At the beginning of the sequencing process, during template generation, clusters are mapped on the flow cell. However, clusters that produce the same base call and lie too close to each other may be mistaken for one large cluster, creating an erroneous cluster map. As a control measure, multi-cycle detection is used to detect overlapping individual clusters. When different clusters are inaccurately grouped into one, ambiguous base calls are generated, which makes the cluster a "noisy signal" that is unusable after quality filtering. In addition to unintended clustering, spectral overlap, or crosstalk, can also occur. This happens when emission spectra overlap, so that a signal in one base channel affects the signal in another base channel. A matrix correction is applied to manage crosstalk, thereby producing a purer signal.
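The idea behind a matrix correction can be sketched as a small linear-algebra problem: if the observed channel intensities are a known mixture of the true dye signals, the true signals are recovered by solving that linear system. The sketch below is illustrative only. The crosstalk values are hypothetical, the helper names are invented for this example, and real instruments estimate the matrix from the run data rather than using fixed numbers.

```python
CHANNELS = "ACGT"

# Hypothetical crosstalk matrix: entry [i][j] is the fraction of dye j's
# emission that is recorded in channel i. Values are made up for illustration.
CROSSTALK = [
    [1.00, 0.40, 0.00, 0.00],
    [0.20, 1.00, 0.00, 0.00],
    [0.00, 0.00, 1.00, 0.30],
    [0.00, 0.00, 0.15, 1.00],
]

def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are not modified.
    aug = [[float(v) for v in row] + [float(b)] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("crosstalk matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x

def apply_crosstalk(true_signal):
    """Simulate what the detector records for a given true dye signal."""
    return [sum(CROSSTALK[i][j] * true_signal[j] for j in range(4)) for i in range(4)]

def correct_crosstalk(observed):
    """Recover the true dye signal from the observed channel intensities."""
    return solve_linear_system(CROSSTALK, observed)

def call_base(intensities):
    """Call the base whose channel has the highest intensity."""
    return CHANNELS[max(range(4), key=lambda i: intensities[i])]

true_signal = [0.0, 100.0, 0.0, 0.0]      # a pure "C" cluster
observed = apply_crosstalk(true_signal)   # the A channel picks up part of the C signal
corrected = correct_crosstalk(observed)

print("observed: ", observed)
print("corrected:", [round(v, 6) for v in corrected], "->", call_base(corrected))
```

Solving the system rather than inverting the matrix explicitly keeps the example dependency-free; with a numerical library the same step would typically be a single solve call.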
Unoptimized sequencing runs can result in poor-quality data and low output. In response, we can spike in a balanced library, such as PhiX, or reduce cluster density. A 5% to 15% spike-in adds diversity, which allows better cluster separation and purer base calls. A larger percentage of spike-in is required for 16S applications. Additionally, a 30% reduction in cluster density allows more accurate template generation from low-diversity libraries. As an analogy for density reduction, it is easier to count and characterize uniformly colored yellow mangoes in a basket than in a truck. Implementing both spike-in and density reduction results in a more accurate base representation, more clusters passing filter, and better Q-scores, which can be reviewed in the Sequencing Analysis Viewer (SAV).
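The arithmetic behind these two adjustments can be sketched in a few lines. This is a simplified model under stated assumptions: the spike-in is treated as perfectly balanced (25% of each base at every cycle), the amplicon cycle is taken as the worst case where every cluster shows the same base, and the starting density of 1000 is a hypothetical figure in arbitrary units. The function names are invented for this example.

```python
BASES = "ACGT"

def mixed_base_fractions(library_fractions, spike_in_fraction):
    """Per-base fractions at one cycle after mixing in a balanced spike-in.

    Assumes the spike-in contributes exactly 25% of each base.
    """
    if not 0.0 <= spike_in_fraction <= 1.0:
        raise ValueError("spike_in_fraction must be between 0 and 1")
    return {
        b: (1.0 - spike_in_fraction) * library_fractions.get(b, 0.0)
        + spike_in_fraction * 0.25
        for b in BASES
    }

def reduced_cluster_density(usual_density, reduction=0.30):
    """Target cluster density after lowering the usual density by `reduction`."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return usual_density * (1.0 - reduction)

# Worst-case amplicon cycle: every cluster shows "A".
amplicon_cycle = {"A": 1.0}

for pct in (5, 10, 15):
    mixed = mixed_base_fractions(amplicon_cycle, pct / 100)
    print(f"{pct:>2}% spike-in:", {b: round(f, 4) for b, f in mixed.items()})

mixed_10 = mixed_base_fractions(amplicon_cycle, 0.10)

# Hypothetical usual density of 1000 (arbitrary units), reduced by 30%.
target_density = reduced_cluster_density(1000.0)
print("target density:", target_density)
```

Even at 15% spike-in the dominant base still accounts for most of the signal at a fully uniform cycle, which is consistent with pairing the spike-in with a lower cluster density rather than relying on either measure alone.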
The bioinformatics analysis department of CD Genomics provides novel solutions for data-driven innovation aimed at discovering the hidden potential in biological data, tapping new insights related to life science research, and predicting new prospects.