Target Genomic Sequencing: Introduction and Bioinformatics Analysis Pipeline

Introduction to Target Sequencing

Targeted sequencing (TS), also known as targeted resequencing, is a technique for sequencing only selected regions of a genome rather than the whole genome of a specimen. To restrict sequencing to specific or clinically significant regions, a target enrichment step is required before sequencing: during DNA preparation, the target sequences are either directly amplified (amplicon or multiplex PCR-based) or captured (hybrid capture-based), and the enriched material is then sequenced on a DNA sequencer.

Quality Control and Data Pre-processing

QC and Alignment

The first step in most NGS pipelines is to evaluate the quality of the sequenced reads, commonly with FastQC. FastQC summarizes the base quality scores at each read position, giving researchers a quick overview of read quality and showing whether trimming is necessary, particularly at the 3′ end, where base quality is often lower.
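The per-position quality summary described above can be illustrated with a minimal Python sketch. It is not FastQC itself: the function names, the Phred+33 encoding, and the Q20 threshold are assumptions chosen for illustration.

```python
from statistics import mean

def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]

def mean_quality_by_position(quality_lines):
    """Mean Phred score at each read position across all reads."""
    scores = [phred_scores(q) for q in quality_lines]
    longest = max(len(s) for s in scores)
    return [mean(s[i] for s in scores if i < len(s)) for i in range(longest)]

def suggested_trim_length(per_position_means, min_quality=20):
    """Read length to keep after dropping trailing (3' end) positions below min_quality."""
    keep = len(per_position_means)
    while keep > 0 and per_position_means[keep - 1] < min_quality:
        keep -= 1
    return keep

# Three toy quality strings whose last two positions are of low quality.
quals = ["IIIIIIII5#", "IIIIIII?5#", "IIIIIIII+#"]
means = mean_quality_by_position(quals)
print(suggested_trim_length(means))
```

In this toy input the last two positions average below Q20, so the sketch suggests keeping the first 8 bases of each read.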

Next, the raw or trimmed reads are aligned to the reference genome to produce a Sequence Alignment Map (SAM) or Binary Alignment Map (BAM) file for each specimen. The Burrows-Wheeler Aligner (BWA) and Bowtie2 are two common aligners.

Assessment of Off-target Reads

To ensure the quality of TS data, several QC steps should always be performed. Because the panel design concentrates on regions of interest, most reads produced by TS are expected to come from the targeted regions; some off-target reads are nevertheless normal. Following alignment, tools such as bedtools and the GATK coverage module can be used to determine the percentage of reads that cover targeted regions. A high percentage of off-target reads could indicate that the TS experiment failed or that the targeted regions contain too many repetitive sequences.
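The on-target percentage can be sketched in a few lines of Python. This is an illustration of the calculation only, not a replacement for bedtools or GATK. It assumes reads and targets are given as BED-style half-open intervals, and the function names are hypothetical.

```python
def overlaps(read_start, read_end, target_start, target_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return read_start < target_end and target_start < read_end

def on_target_fraction(reads, targets):
    """Fraction of reads overlapping any target; both are lists of (chrom, start, end)."""
    if not reads:
        return 0.0
    by_chrom = {}
    for chrom, start, end in targets:
        by_chrom.setdefault(chrom, []).append((start, end))
    hits = sum(
        any(overlaps(s, e, ts, te) for ts, te in by_chrom.get(c, []))
        for c, s, e in reads
    )
    return hits / len(reads)

targets = [("chr1", 100, 200), ("chr2", 500, 600)]
reads = [
    ("chr1", 90, 140),   # overlaps the chr1 target
    ("chr1", 200, 250),  # starts exactly where the target ends: off-target
    ("chr2", 550, 650),  # overlaps the chr2 target
    ("chr3", 10, 60),    # no target on this chromosome
]
print(on_target_fraction(reads, targets))
```

The linear scan over targets is adequate for a small panel; for large panels an interval tree or sorted-interval search would be the more usual choice.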

Marking and Removal of PCR Duplicates

PCR duplicates are sequence reads that map to the same genomic coordinates and typically arise during the PCR steps of library preparation. The duplication rate for low-quality or fragmented DNA, such as formalin-fixed paraffin-embedded (FFPE) samples and circulating tumor DNA (ctDNA), is much higher, reaching 50–60% in some cases, whereas the rate for fresh-frozen (FF) DNA is usually less than 20%. These PCR duplicates must be identified and removed before any downstream analysis, because including them leads to an overestimation of coverage in targeted regions and, more importantly, to incorrect estimates of mutation frequency.
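As a rough illustration of coordinate-based duplicate marking, the sketch below groups reads by chromosome, start position, and strand, and keeps one read per group. It is a simplification: production tools also consider mate positions and clipping, and the record fields and the "keep the highest mapping quality" rule used here are assumptions for the example.

```python
from collections import defaultdict

def mark_duplicates(reads):
    """Return the indices of reads flagged as duplicates.

    reads: list of dicts with keys chrom, pos, strand, mapq (hypothetical record layout).
    Reads sharing (chrom, pos, strand) form one group; the read with the highest
    mapq is kept and the rest are flagged.
    """
    groups = defaultdict(list)
    for i, read in enumerate(reads):
        groups[(read["chrom"], read["pos"], read["strand"])].append(i)
    duplicates = set()
    for indices in groups.values():
        best = max(indices, key=lambda i: reads[i]["mapq"])
        duplicates.update(i for i in indices if i != best)
    return duplicates

reads = [
    {"chrom": "chr1", "pos": 1000, "strand": "+", "mapq": 60},
    {"chrom": "chr1", "pos": 1000, "strand": "+", "mapq": 42},
    {"chrom": "chr1", "pos": 1000, "strand": "+", "mapq": 60},
    {"chrom": "chr1", "pos": 1000, "strand": "-", "mapq": 60},
]
dups = mark_duplicates(reads)
duplication_rate = len(dups) / len(reads)
print(sorted(dups), duplication_rate)
```

Without duplicate removal, position 1000 on the forward strand would appear to have three independent observations instead of one, which is the coverage overestimation described above.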

Realignment, Base Score Recalibration, and Estimation of Sequencing Coverage

Next, the filtered alignments are processed further to improve alignment quality, for example by local realignment around indels and GATK-based base quality score recalibration. The local realignment step improves the alignment of bases around known and suspected indel locations, which reduces false-positive calls. Base quality score recalibration adjusts the base quality scores of all sequenced reads, using known polymorphisms as a reference. During variant calling, base and mapping quality scores are used to filter reads, so the fine-tuning that happens during this step is critical to ensuring that only high-confidence variants are called.
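The idea behind base quality score recalibration can be sketched as follows: mismatches to the reference are counted per reported quality score, positions at known polymorphic sites are skipped, and an empirical quality is derived from the observed mismatch rate. This is a heavily simplified illustration, not the GATK algorithm, which models additional covariates. The input layout and the +1/+2 smoothing are assumptions.

```python
import math
from collections import defaultdict

def empirical_qualities(observations, known_sites):
    """Map each reported Phred score to an empirical Phred score.

    observations: iterable of (position, reported_q, is_mismatch) per sequenced base.
    known_sites: set of positions with known polymorphisms, excluded so that real
    variation is not counted as sequencing error.
    """
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [mismatches, total]
    for position, reported_q, is_mismatch in observations:
        if position in known_sites:
            continue
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1
    # Smoothed mismatch rate (m + 1) / (n + 2) avoids log(0) when no errors are seen.
    return {
        q: round(-10 * math.log10((m + 1) / (n + 2)))
        for q, (m, n) in counts.items()
    }

# 998 bases reported as Q30, 9 of which mismatch, plus one mismatch at a known SNP.
observations = [(i, 30, i < 9) for i in range(998)] + [(5000, 30, True)]
print(empirical_qualities(observations, known_sites={5000}))
```

In this toy data the bases reported as Q30 behave like Q20 bases, so a recalibration step would lower their scores accordingly.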

Variant Calling

Once all TS pre-processing steps have been completed, the high-quality alignment data are ready for variant calling. Variant calling is the process of identifying base pair variants by comparing aligned reads with a reference genome or with matched normal DNA sequences.
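A toy single-position caller shows the basic comparison of aligned bases with the reference. Real variant callers use statistical models and quality scores; the thresholds and the function name below are assumptions for illustration only.

```python
from collections import Counter

def call_variant(ref_base, pileup_bases, min_depth=10, min_alt_reads=3, min_vaf=0.05):
    """Call the most frequent non-reference base at one position, or return None.

    pileup_bases: string of bases observed in the aligned reads at this position.
    """
    depth = len(pileup_bases)
    if depth < min_depth:
        return None
    alt_counts = Counter(b for b in pileup_bases if b != ref_base)
    if not alt_counts:
        return None
    alt_base, alt_reads = alt_counts.most_common(1)[0]
    vaf = alt_reads / depth  # variant allele frequency
    if alt_reads >= min_alt_reads and vaf >= min_vaf:
        return {"ref": ref_base, "alt": alt_base, "depth": depth, "vaf": vaf}
    return None

print(call_variant("A", "A" * 15 + "G" * 5))   # 5 of 20 reads support G
print(call_variant("A", "A" * 19 + "G"))       # a single alt read is not called
```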

Annotation and Further Filtration of Variants

After calling variants, the next step is to annotate them in terms of genes, codons, and amino acid positions, and to classify them as nonsense, missense, exonic deletion, or synonymous variants. This allows a better understanding of their functional effects on the genes they affect. In many TS studies, only non-silent exonic or splicing mutations are selected for further analysis, so that attention is focused on functional coding variants. However, these criteria may differ depending on the regions of interest or the aims of the research, for example when studying variants in promoter regions or UTRs.
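The classification of single-codon substitutions can be sketched with the standard genetic code. This is a minimal illustration that assumes the reference and altered codons are already known and in frame. The function name and the "stop_lost" label are assumptions, and indels and splice variants are not handled.

```python
BASES = "TCAG"
# Standard genetic code, ordered by first, second, and third base in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_coding_change(ref_codon, alt_codon):
    """Classify a codon substitution by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"

print(classify_coding_change("GAA", "GAG"))  # Glu -> Glu
print(classify_coding_change("GAA", "GTA"))  # Glu -> Val
print(classify_coding_change("GAA", "TAA"))  # Glu -> stop
```

A filter that keeps only non-silent variants, as described above, would then drop every call classified as "synonymous".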

Estimation of Background Error Rate

Standard NGS analysis has a sensitivity limit at a variant allele frequency (VAF) of around 1%. However, some studies need to detect variants with much lower VAFs, for example to identify very small subclones or minimal residual disease (MRD). Higher sequencing depth is usually necessary to accomplish this, and a rigorous method is needed to distinguish genuine calls from background sequencing artifacts and noise at VAFs < 1%.
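One common way to separate genuine low-VAF calls from noise is to ask how likely the observed number of alternate reads would be if they arose from background error alone. The sketch below uses a binomial model for this. The background error rate of 0.1% and the function name are hypothetical; in practice the rate would be estimated per position, for example from control samples.

```python
from math import comb

def binomial_tail(alt_reads, depth, error_rate):
    """P(X >= alt_reads) for X ~ Binomial(depth, error_rate).

    Computed as 1 minus the lower tail, so very small values are only accurate
    to roughly floating-point precision.
    """
    below = sum(
        comb(depth, k) * error_rate**k * (1 - error_rate) ** (depth - k)
        for k in range(alt_reads)
    )
    return max(0.0, 1.0 - below)

depth = 5000
background_error = 0.001  # assumed background error rate, not a measured value

# 6 alt reads (VAF 0.12%) is consistent with noise; 25 alt reads (VAF 0.5%) is not.
print(binomial_tail(6, depth, background_error))
print(binomial_tail(25, depth, background_error))
```

With about 5 error reads expected at this depth, observing 6 is unremarkable, whereas 25 is very unlikely under the noise model. This is why deep sequencing combined with an explicit error model allows calls below 1% VAF.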

* For Research Use Only. Not for use in diagnostic procedures.