Variant calling is the computational process by which laboratories identify variants in sequencing data. As one of the first steps in many NGS data analysis workflows, accurate variant calling is critical for downstream analysis and interpretation. The success of many clinical, association, and population genetics studies relies heavily on a properly executed variant calling step.
In turn, variant calling underpins many genomic applications, from helping to advance our understanding of genomics to facilitating the study of hereditary disease and cancer genomics, paving the way for the future of precision medicine.
More precisely, variant calling identifies positions where a sample's sequence differs from a reference genome. Our genomes are full of genetic variants that underlie the richness of human diversity. Single nucleotide variants (SNVs) and short insertions and deletions (indels) are the most common forms of genetic variation. However, the human genome also contains more complex forms of genetic variation, such as copy number variation (CNV), structural variation (SV), and larger chromosomal aberrations. Explore our SNP Calling Service and Copy Number Variation (CNV) Analysis for more information.
In addition, there are other, more exotic types of variation, such as multinucleotide variants (MNVs), tandem repeats, mobile elements (MEs), gene retrocopy insertion polymorphisms (GRIPs), and nuclear insertions of mitochondrial DNA (NUMTs). Accurately identifying these variant types can be challenging but is essential for a comprehensive understanding of the genome.
DNA Isolation and Fragmentation
Methods of DNA isolation can affect how well different genomic regions are represented in the library and thus affect variant detection. DNA fragmentation methods range from mechanical shearing to chemical and enzymatic approaches, the latter including Tn5 transposases that deliver PCR or sequencing adapters directly to the DNA. The DNA source, isolation method, and fragmentation method will have a significant impact on the project and should be agreed upon before the collaboration begins.
The presence of PCR duplicates can lead to false positives in variant detection. There are several ways to deal with these duplicates, such as Unique Molecular Identifiers (UMIs), amplification-free library construction, or computational marking of likely duplicates. The choice of method depends on the available input material and the library construction scheme.
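Computational marking of duplicates can be sketched in a few lines. The example below is a simplified illustration, not the algorithm of any particular tool: it assumes each read is already reduced to a hypothetical `Read` record, groups reads that share chromosome, start position, strand, and UMI, and keeps the read with the highest mean base quality in each group.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical, simplified read record used only for this sketch."""
    name: str
    chrom: str
    pos: int          # 1-based alignment start
    strand: str       # "+" or "-"
    umi: str          # empty string if the library has no UMIs
    mean_qual: float  # mean base quality of the read


def mark_duplicates(reads):
    """Return the names of reads considered duplicates.

    Reads sharing (chrom, pos, strand, umi) are treated as copies of one
    original molecule; the copy with the highest mean quality is kept.
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.pos, read.strand, read.umi)].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r.mean_qual)
        duplicates.update(r.name for r in group if r.name != best.name)
    return duplicates


reads = [
    Read("r1", "chr1", 100, "+", "ACGT", 35.0),
    Read("r2", "chr1", 100, "+", "ACGT", 30.0),  # same position and UMI as r1
    Read("r3", "chr1", 100, "+", "TTGA", 32.0),  # same position, different UMI
    Read("r4", "chr1", 250, "-", "ACGT", 28.0),
]
print(sorted(mark_duplicates(reads)))  # ['r2']
```

Without UMIs (all `umi` fields empty), r3 would also fall into the same group as r1 and be marked, which is why UMIs help distinguish true duplicates from independent molecules that happen to start at the same position.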
The chosen sequencing strategy determines the depth of genome coverage. Short-read whole-genome sequencing is typically performed at about 30× coverage, while long-read sequencing is typically performed at about 60×. Targeted resequencing, such as exome sequencing, yields uneven coverage, so a higher average coverage (90×-100×) is used to compensate.
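As a rough planning aid, mean coverage can be estimated as the number of reads times the read length, divided by the genome size. The sketch below applies that relationship; the genome size of 3.1 Gb and the read length of 150 bp are illustrative assumptions, and real projects lose some depth to duplicates, unmapped reads, and uneven coverage.

```python
import math


def mean_coverage(n_reads, read_length, genome_size):
    """Expected mean depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target mean depth."""
    return math.ceil(target_coverage * genome_size / read_length)


HUMAN_GENOME_BP = 3_100_000_000  # assumed approximate genome size
READ_LENGTH_BP = 150             # assumed short-read length

n = reads_needed(30, READ_LENGTH_BP, HUMAN_GENOME_BP)
print(n)                                                   # 620000000
print(mean_coverage(n, READ_LENGTH_BP, HUMAN_GENOME_BP))   # 30.0
```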
Platforms and Read Lengths
The cost-effectiveness of variant detection usually favors short-read sequencing. However, longer reads, while more costly and more error-prone, can span larger structural variants, repetitive regions, and regions affected by GC-content bias. Advanced technologies such as linked reads, circular consensus reads, and combinations of short-read and long-read sequencing data can provide high-quality data with good mappability.
Fig. 1. Overview of experimental factors that are important for planning and performing a genome sequencing study. (Zverinova S, et al. 2022)
The variant calling process identifies single nucleotide variants present in whole-genome and exome data. Explore our whole exome and whole genome sequencing services to learn more. Variants are identified by comparing each sample's data to a reference sequence. The variant calling pipeline consists of a series of sequential, interdependent steps:
Quality control is the first, and arguably one of the most critical, steps in the variant calling process. Sequencing platforms (e.g., Illumina) generate raw reads in FASTQ format. These raw reads typically contain low-quality bases and adapter sequences that need to be removed for reliable downstream analysis. Tools such as Trimmomatic or Cutadapt are effective at filtering out these unwanted sequences and improving data quality.
In addition, reads shorter than 20 bases are discarded at this stage. Such short reads often map ambiguously to multiple positions on the reference genome, which can introduce bias during variant calling.
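The two filters described above, trimming low-quality bases and dropping reads shorter than 20 bases, can be illustrated with a minimal sketch. It assumes Phred+33 quality encoding and a simple trailing-base trim with a hypothetical threshold of Q20; it does not reproduce the algorithms of Trimmomatic or Cutadapt, and it does no adapter removal.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(char) - offset for char in quality_string]


def trim_3prime(seq, qual, min_q=20):
    """Remove trailing bases whose quality is below min_q."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def filter_reads(records, min_q=20, min_len=20):
    """Yield (name, seq, qual) for reads that are long enough after trimming."""
    for name, seq, qual in records:
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) >= min_len:
            yield name, seq, qual


# 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
records = [
    ("read1", "A" * 25, "I" * 22 + "#" * 3),   # trimmed to 22 bases, kept
    ("read2", "C" * 24, "I" * 10 + "#" * 14),  # trimmed to 10 bases, dropped
]
for name, seq, qual in filter_reads(records):
    print(name, len(seq))
```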
After stringent quality control, the filtered reads are aligned to the reference genome. Alignment is performed with tools such as the Burrows-Wheeler Aligner (BWA-MEM or BWA-aln) or Bowtie 2. The choice of aligner depends strongly on the length and type of the reads (single-end or paired-end).
The alignment step generates Sequence Alignment/Map (SAM) files, which are subsequently converted to Binary Alignment/Map (BAM) files to save storage space.
Deduplication removes duplicate reads, multi-mapped reads, and supplementary alignments from the analysis to minimize false positive results. Tools such as Picard support this step, so that only uniquely aligned reads proceed to downstream variant identification.
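In SAM and BAM files these read categories are recorded in the FLAG field, so the filtering can be expressed as a bitmask test. The sketch below uses the flag bits defined in the SAM specification; the mapping-quality cutoff of 20 used as a proxy for multi-mapped reads is an assumption for illustration, and the function name `keep_alignment` is hypothetical.

```python
# FLAG bits defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800
EXCLUDE = UNMAPPED | SECONDARY | DUPLICATE | SUPPLEMENTARY


def keep_alignment(sam_line, min_mapq=20):
    """Return True if a SAM record should go on to variant calling.

    min_mapq is an assumed threshold: aligners commonly give reads that
    map equally well to several places a low mapping quality.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: FLAG
    mapq = int(fields[4])   # column 5: MAPQ
    return (flag & EXCLUDE) == 0 and mapq >= min_mapq


def sam_record(name, flag, mapq):
    """Build a minimal 11-column SAM line for the example."""
    return "\t".join(
        [name, str(flag), "chr1", "100", str(mapq), "10M", "=", "300", "210",
         "ACGTACGTAC", "IIIIIIIIII"]
    )


examples = [
    sam_record("primary", 99, 60),
    sam_record("duplicate", 99 | DUPLICATE, 60),
    sam_record("supplementary", 99 | SUPPLEMENTARY, 60),
    sam_record("secondary", 99 | SECONDARY, 60),
    sam_record("multimapped", 99, 0),
]
print([line.split("\t")[0] for line in examples if keep_alignment(line)])
```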
Local Realignment Around Indels and Variant Calling
The alignment stage sometimes produces artifacts around insertions and deletions (indels), so a local realignment step is needed to correct these inaccuracies. The Genome Analysis Toolkit (GATK) is widely used for this purpose. GATK's HaplotypeCaller identifies potential variants in the processed aligned reads and writes them to a Variant Call Format (VCF) file.
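To make the idea of calling a variant from aligned reads concrete, here is a deliberately naive sketch. It is not how HaplotypeCaller works (that tool reassembles haplotypes locally and uses a statistical model); it simply counts non-reference bases at each position and reports a site when hypothetical depth and allele-fraction thresholds are met, formatting the result as the eight fixed VCF columns.

```python
from collections import Counter


def call_snvs(reference, pileups, min_depth=8, min_alt_fraction=0.2):
    """Naive SNV calls from per-position base pileups.

    reference: reference sequence as a string.
    pileups:   dict mapping 1-based position -> string of observed bases.
    Thresholds are illustrative assumptions, not recommended values.
    """
    calls = []
    for pos in sorted(pileups):
        bases = pileups[pos].upper()
        depth = len(bases)
        if depth < min_depth:
            continue  # too few reads to make a call
        ref_base = reference[pos - 1].upper()
        alt_counts = Counter(b for b in bases if b != ref_base)
        if not alt_counts:
            continue  # every read supports the reference
        alt, alt_count = alt_counts.most_common(1)[0]
        if alt_count / depth >= min_alt_fraction:
            calls.append((pos, ref_base, alt, alt_count, depth))
    return calls


def to_vcf_line(chrom, call):
    """Format one call as CHROM POS ID REF ALT QUAL FILTER INFO."""
    pos, ref, alt, alt_count, depth = call
    info = f"DP={depth};AF={alt_count / depth:.2f}"
    return "\t".join([chrom, str(pos), ".", ref, alt, ".", ".", info])


reference = "ACGTACGT"
pileups = {
    3: "GGGGGGGGGG",  # all reads match the reference G
    5: "AAAAAGGGGG",  # half the reads carry G instead of A
    7: "GGT",         # depth too low to call
}
for call in call_snvs(reference, pileups):
    print(to_vcf_line("chr1", call))
```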
The variant annotation phase aims to determine the function and effect of all identified variants. Annotation tools such as SnpEff predict the effects of variants on genes and annotate variants based on their genomic location and potential coding effects. Databases such as dbSNP and ClinVar, which contain comprehensive data on nucleotide variants and on the relationship between variants and human phenotypes, respectively, are often used at this stage.
Visualization and Validation
Genome browsers let researchers visualize the aligned reads and inspect the variants identified in them. Validation is then essential to confirm the results of the variant calling process. Sanger sequencing and microarray genotyping are often used for this purpose, and computational tools such as MutationValidator categorize mutations as somatic, germline, or artifactual.
Fig. 2. Standard pipelines for NGS analysis. (Koboldt D C, 2020)