Whole Exome Data Analysis: Methods and Commonly Used Tools

Introduction to Whole Exome Sequencing

Next-generation sequencing (NGS) experiments continue to change how genetic variation is identified and used. Although it is now possible to sequence the whole genome of patients or research samples, doing so at the depth required to profile mutations adequately is often still too expensive. Instead, researchers and clinicians frequently restrict sequencing to a smaller genomic target, which allows a deeper examination of the selected regions. Whole-exome sequencing (WES) uses a set of oligonucleotide hybridization probes that target known exon sequences to perform capture-based enrichment of the protein-coding regions of genomic DNA. The most typical application of WES is mutational assessment, that is, the identification of single-nucleotide variants (SNVs) and small insertions and deletions (indels). Two other applications are the profiling of copy-number variations (CNVs) and the detection of structural variants (SVs, often defined as genomic rearrangements larger than 50 bp).

Methods in Whole Exome Data Analysis

A large number of algorithms and software tools have been developed for identifying variants in NGS data. Cancer specimens frequently show characteristics that can confound a standard germline variant caller, such as contamination with normal (non-aberrant) cells, copy-number events, and tumor heterogeneity and sub-clonality, so somatic variant identification is usually done with algorithms and tools designed for the task. Furthermore, the availability of a matched normal specimen allows a different kind of comparison: rather than simply comparing a test specimen against the reference genome, variants can be compared against both the reference and the matched normal. Strategies that analyze the tumor and matched normal specimens independently and then apply a subtraction-based approach were more commonly used in the past; more advanced algorithms that analyze both matched specimens simultaneously using joint probabilistic models are now more common. In addition to identifying the existence of a variant at a particular locus (with an associated likelihood or set of scores), such joint models can classify a variant in a cancer specimen as germline or somatic using a second measure of likelihood. Somatic-specific algorithms can also relax certain restrictions of germline detection techniques, like ploidy assumptions or the assumption that a heterozygous variant will be visible at a minimum allele fraction, resulting in increased sensitivity.
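To make the subtraction-based approach concrete, the following is a minimal Python sketch, not the method of any particular caller. It assumes the tumor and matched normal specimens have already been called independently and that each call is reduced to a hypothetical `Variant` tuple; the names `Variant` and `subtract_germline` and the coordinates are illustrative only.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not a VCF parser)."""
    chrom: str
    pos: int   # 1-based position
    ref: str
    alt: str


def subtract_germline(tumor_calls, normal_calls):
    """Label each tumor call by simple set subtraction.

    A tumor variant that is also called in the matched normal is
    labeled 'germline'; one seen only in the tumor is labeled 'somatic'.
    """
    normal_set = set(normal_calls)
    return {
        v: ("germline" if v in normal_set else "somatic")
        for v in tumor_calls
    }


# Made-up example calls, for illustration only.
tumor = [Variant("chr1", 1000, "A", "G"), Variant("chr2", 2000, "C", "T")]
normal = [Variant("chr1", 1000, "A", "G")]

labels = subtract_germline(tumor, normal)
for variant, label in labels.items():
    print(variant.chrom, variant.pos, variant.ref, variant.alt, label)
```

A sketch like this ignores the confidence of each call, which is one reason joint analysis of both specimens can do better: a germline variant that is simply missed in a low-coverage normal would be mislabeled as somatic here.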

Software Used in Whole Exome Data Analysis

Scientists can use a variety of software tools to assess the quality of their sequencing data and identify variants of interest. Tools like FastQC make it simple to run quality control checks on the data in a FASTQ file. Poor-quality data is frequently removed from the FASTQ file; Trimmomatic, for example, is a tool that can perform this function.
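As an illustration of what removing poor-quality data can mean in practice, here is a simplified Python sketch of sliding-window quality trimming. It is only loosely in the spirit of what trimming tools do and is not the implementation used by Trimmomatic. It assumes Phred+33 quality encoding, and the function names, window size, and threshold are assumptions chosen for the example.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def sliding_window_trim(seq: str, qual: str, window: int = 4,
                        min_mean_q: float = 20.0) -> tuple[str, str]:
    """Scan from the 5' end and cut the read at the start of the first
    window whose mean quality falls below min_mean_q."""
    scores = phred_scores(qual)
    if not scores:
        return seq, qual
    # Reads shorter than the window are evaluated as a single window.
    window = min(window, len(scores))
    for start in range(len(scores) - window + 1):
        win = scores[start:start + window]
        if sum(win) / window < min_mean_q:
            return seq[:start], qual[:start]
    return seq, qual


# Made-up read: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
trimmed = sliding_window_trim("ACGTACGTAC", "IIIIII####")
print(trimmed)
```

In this example the read is cut at the first window dominated by low-quality bases, so the low-quality tail is discarded while the high-quality 5' portion is kept.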

Bowtie, the Burrows-Wheeler Aligner (BWA), and Geneious are just a few of the software tools that allow scientists to align sequenced reads to the regions of the reference genome from which they came. The Broad Institute's Integrative Genomics Viewer (IGV) is a common choice for visualizing alignment data. Another common visualization tool for sequence data is the UCSC Genome Browser. IGV and the UCSC Genome Browser are similar in functionality, so selecting between the two is often a matter of personal preference.

Open-source software packages like the Genome Analysis Toolkit (GATK) and the Burrows-Wheeler Aligner (BWA) enable researchers to evaluate metrics from their sequencing experiments and clean sequences as needed for quality assurance. Picard's alignment summary metrics are a popular way to evaluate data quality. To call genomic variants, tools like GATK and SAMtools then evaluate the alignments and read depth and call single-nucleotide polymorphisms or indels where they occur in the sequenced specimen.
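The role that read depth plays in variant calling can be shown with a deliberately naive Python sketch. It reports a site as variant when enough reads cover it and a large enough fraction of them carry a non-reference base. This is a simple threshold rule for illustration, not the algorithm used by GATK or SAMtools, which rely on more sophisticated statistical models; the function name, input format, and thresholds are all assumptions.

```python
from collections import Counter


def call_site(ref_base, pileup_bases, min_depth=10, min_alt_fraction=0.2):
    """Naive single-site caller based on depth and allele fraction.

    pileup_bases is a string of the bases observed at one position,
    one character per read. Returns (alt_base, depth, alt_fraction),
    or None when the site is too shallow or shows no convincing
    alternate allele.
    """
    depth = len(pileup_bases)
    if depth < min_depth:
        return None  # too few reads to make a call
    counts = Counter(base.upper() for base in pileup_bases)
    alt_counts = {b: n for b, n in counts.items() if b != ref_base.upper()}
    if not alt_counts:
        return None  # every read matches the reference
    alt_base, alt_reads = max(alt_counts.items(), key=lambda kv: kv[1])
    alt_fraction = alt_reads / depth
    if alt_fraction < min_alt_fraction:
        return None  # likely sequencing noise under this simple rule
    return alt_base, depth, alt_fraction


# Made-up pileups at a site whose reference base is 'A'.
het_like = call_site("A", "AAAAAAGGGGGG")   # 6 of 12 reads show G
noise_like = call_site("A", "AAAAAAAAAAAG")  # 1 of 12 reads shows G
shallow = call_site("A", "AGG")              # only 3 reads
print(het_like, noise_like, shallow)
```

The fixed minimum allele fraction in this sketch is the kind of assumption that suits germline heterozygous variants but can miss low-fraction somatic variants in a heterogeneous tumor.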

Scientists can use these software tools to ensure that sequencing quality meets the criteria for the kind of experiment being conducted and that capture performance is adequate. These benchmarks offer a strong foundation for determining whether variants are detectable in the sequenced data. However, problems in the NGS workflow can result in poor read depth or ineffective capture performance. Context-specific biases and non-uniformity, for example, can affect capture quality, particularly in highly repetitive, GC-rich, and non-unique genome regions. Alternatively, probes may be poorly designed, resulting in a reduction in on-target rate, sequencing depth, and overall capture uniformity.
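Two of the capture metrics mentioned above, on-target rate and coverage uniformity, can be sketched in a few lines of Python. The definitions used here are assumptions made for illustration: on-target rate is taken as the fraction of aligned reads overlapping any target interval, and uniformity as the fraction of target bases covered at 20% of the mean depth or more. Intervals are hypothetical `(chrom, start, end)` tuples with half-open coordinates, and the overlap check is a simple scan rather than an indexed lookup.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def on_target_rate(reads, targets):
    """Fraction of aligned reads that overlap at least one target."""
    if not reads:
        return 0.0
    on_target = sum(any(overlaps(r, t) for t in targets) for r in reads)
    return on_target / len(reads)


def uniformity(depths, fraction_of_mean=0.2):
    """Fraction of target bases with depth >= fraction_of_mean * mean depth."""
    if not depths:
        return 0.0
    mean_depth = sum(depths) / len(depths)
    covered = sum(d >= fraction_of_mean * mean_depth for d in depths)
    return covered / len(depths)


# Made-up data: one target region and four aligned reads.
targets = [("chr1", 100, 200)]
reads = [("chr1", 90, 140), ("chr1", 150, 250),
         ("chr1", 300, 350), ("chr2", 100, 150)]
rate = on_target_rate(reads, targets)

# Made-up per-base depths across the target.
depths = [50, 60, 40, 5, 45]
unif = uniformity(depths)
print(rate, unif)
```

Low values for either metric would point to the capture problems described above, such as poorly designed probes or difficult GC-rich and repetitive regions.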

About CD Genomics Bioinformatics Analysis

The bioinformatics analysis department of CD Genomics provides novel solutions for data-driven innovation aimed at discovering the hidden potential in biological data, tapping new insights related to life science research, and predicting new prospects.


* For Research Use Only. Not for use in diagnostic procedures.