Identification and Analysis of Genomic Variants: Introduction and Detailed Pipeline

Introduction to Genomic Variants

A variant is a difference from the most prevalent (reference) DNA sequence. Variants can be described in a number of ways, but the most common classification is based on the type of DNA change, such as a base substitution, insertion, or deletion. Some variants, including many single-base substitutions, have no functional importance. In oncology, the functionally important variants are those that affect protein synthesis. Variants can be tumor-specific (somatic) or inherited (germline), and both germline and somatic variants can serve as biomarkers. The word variant is increasingly used in place of the word mutation.

Figure 1. The flowchart of combinations using different sequencers and variant calling pipelines for germline variants. (Chen, 2019)

Detection of Genomic Variation: Pipeline

Detecting genomic variants from raw read data is a multistep process that can be carried out with a variety of tools and resources. The steps in the procedure are as follows:

- Sequence the entire genome or exome to generate FASTQ files.

- Create BAM or CRAM files by aligning the sequences to a reference genome.

- Create a VCF file by identifying where the aligned reads vary from the reference genome.

Acquisition of Raw Read Data: The FASTQ File Format

Raw data from a sequencing machine are most commonly supplied as FASTQ files. Like FASTA files, these contain sequence information, but they also carry extra information, such as a quality score for each sequenced base.
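To make the FASTQ layout concrete, the following is a minimal sketch of a reader in Python. It assumes the common four-line record layout (header starting with `@`, sequence, separator starting with `+`, quality string) and Phred+33 quality encoding. The names `FastqRecord` and `parse_fastq` are illustrative and do not come from any library.

```python
import io
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def parse_fastq(handle, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a text handle holding four-line FASTQ entries.

    Assumption: qualities are ASCII-encoded with the given offset
    (33 is the usual value for current Illumina data).
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        scores = [ord(char) - phred_offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


if __name__ == "__main__":
    example = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
    for record in parse_fastq(example):
        print(record.name, record.sequence, record.qualities)
```

In practice an established parser would normally be preferred; the sketch only shows what information each record carries.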

Quality Control

In general, raw sequence data obtained from a sequencing service provider are not immediately ready for variant discovery. Following data acquisition, quality control (QC) is the first and most essential step in the whole-exome/whole-genome sequencing (WES/WGS) analysis workflow. QC is the process of improving raw data by removing any errors that can be identified. Performing QC at the start of the analysis reduces the chance of encountering contamination, bias, error, or missing data later on.

QC is an iterative procedure in which (i) the quality is assessed, (ii) QC stops if the quality is sufficient, (iii) otherwise a data-modification step (e.g., trimming of low-quality reads, removal of adapters) is carried out, and (iv) the procedure is repeated from step (i).
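The assess, stop-or-modify, repeat cycle can be sketched as a small loop. This is an illustration under stated assumptions, not a production QC tool: reads are modeled as `(sequence, qualities)` tuples, "quality" is reduced to the mean Phred score across all bases, and the only modification step is 3' quality trimming. The function names and all threshold values are hypothetical.

```python
from typing import List, Tuple

Read = Tuple[str, List[int]]  # (sequence, per-base Phred scores)


def mean_quality(reads: List[Read]) -> float:
    """Mean Phred score over every base of every read (0.0 if no bases)."""
    total = sum(sum(quals) for _, quals in reads)
    bases = sum(len(quals) for _, quals in reads)
    return total / bases if bases else 0.0


def trim_3prime(read: Read, threshold: int) -> Read:
    """Remove trailing bases whose quality is below the threshold."""
    sequence, quals = read
    end = len(quals)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return sequence[:end], quals[:end]


def run_qc(reads: List[Read], min_mean: float = 30.0, threshold: int = 10,
           step: int = 5, max_rounds: int = 5) -> Tuple[List[Read], int]:
    """Repeat assess -> trim until quality is sufficient or rounds run out."""
    rounds = 0
    while rounds < max_rounds:
        if mean_quality(reads) >= min_mean:   # (i) assess, (ii) stop if sufficient
            break
        reads = [trim_3prime(read, threshold) for read in reads]  # (iii) modify
        reads = [read for read in reads if read[0]]  # drop reads trimmed to nothing
        threshold += step                     # trim more strictly next time
        rounds += 1                           # (iv) repeat from step (i)
    return reads, rounds
```

Returning the number of trimming rounds alongside the reads makes it easy to see how much modification was needed before the stop condition was met.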

FastQC is the most widely used tool for assessing and visualizing the quality of FASTQ data. It reports a wealth of information about data quality, including per-base sequence quality scores, GC content, sequence duplication rates, and overrepresented sequences, among other things. Alternatives to FastQC include fastqp, NGS QC Toolkit, PRINSEQ, and QC-Chain.
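As a rough illustration of what some of these metrics measure, the sketch below computes simplified versions of per-base mean quality, GC content, and duplication rate. These are not FastQC's actual algorithms (its duplication estimate in particular is more involved). The function names are hypothetical.

```python
from collections import Counter
from typing import List


def per_base_mean_quality(quality_lists: List[List[int]]) -> List[float]:
    """Mean Phred score at each read position, across reads that reach it."""
    longest = max((len(quals) for quals in quality_lists), default=0)
    means = []
    for position in range(longest):
        column = [quals[position] for quals in quality_lists if len(quals) > position]
        means.append(sum(column) / len(column))
    return means


def gc_fraction(sequences: List[str]) -> float:
    """Fraction of all bases that are G or C."""
    total = sum(len(seq) for seq in sequences)
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in sequences)
    return gc / total if total else 0.0


def duplication_rate(sequences: List[str]) -> float:
    """Fraction of reads that are exact copies of an earlier read.

    Simplification: only exact full-length string matches are counted.
    """
    if not sequences:
        return 0.0
    distinct = len(Counter(sequences))
    return 1.0 - distinct / len(sequences)
```

A falling curve from `per_base_mean_quality` toward the 3' end of the reads is the kind of pattern that would typically motivate a trimming step.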

Sequence Alignment

Each read must be aligned to a reference genome in order to determine its exact location. Because aligning large numbers of reads can take days, and a low-accuracy alignment leads to unreliable downstream analyses, both speed and accuracy are critical in this phase.

Alignment produces a Sequence Alignment Map (SAM) file, which contains the reads aligned to the reference. A Binary Alignment Map (BAM) file is the compressed binary version of a SAM file, and BAM files can be indexed to allow random access. A SAM/BAM file consists of a header section and an alignment section. The header section lists the contigs of the reference sequence, the read groups (carrying platform, library, and sample information), and the data processing tools applied to the reads. The alignment section contains one record per read alignment.
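The header/alignment split can be seen in a minimal parser for SAM text. This sketch assumes the standard SAM layout: header lines begin with `@` followed by a two-letter record type (`@SQ` for reference contigs, `@RG` for read groups, `@PG` for programs), and each alignment line has eleven mandatory tab-separated fields. Optional alignment tags are ignored, and the names `Alignment` and `parse_sam` are illustrative. For real data a dedicated library would normally be used.

```python
from typing import Dict, List, NamedTuple, Tuple


class Alignment(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags (0x10 = reverse strand, 0x400 = duplicate)
    rname: str   # reference contig
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str
    seq: str
    qual: str


def parse_sam(text: str) -> Tuple[Dict[str, List[Dict[str, str]]], List[Alignment]]:
    """Split SAM text into a header dictionary and a list of alignments."""
    header: Dict[str, List[Dict[str, str]]] = {}
    alignments: List[Alignment] = []
    for line in text.splitlines():
        if not line:
            continue
        fields = line.split("\t")
        if line.startswith("@"):
            record_type = fields[0][1:]          # e.g. "SQ", "RG", "PG"
            if record_type == "CO":              # free-text comment line
                header.setdefault("CO", []).append({"text": "\t".join(fields[1:])})
                continue
            tags = dict(field.split(":", 1) for field in fields[1:])
            header.setdefault(record_type, []).append(tags)
        else:
            alignments.append(Alignment(
                qname=fields[0], flag=int(fields[1]), rname=fields[2],
                pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
                seq=fields[9], qual=fields[10],
            ))
    return header, alignments


EXAMPLE_SAM = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@RG\tID:rg1\tPL:ILLUMINA\tLB:lib1\tSM:sample1",
    "@PG\tID:aligner\tPN:aligner",
    "read1\t16\tchr1\t101\t60\t4M\t*\t0\t0\tACGT\tIIII",
])
```

Calling `parse_sam(EXAMPLE_SAM)` returns the contig, read-group, and program entries under the `"SQ"`, `"RG"`, and `"PG"` keys, plus one `Alignment` for `read1`. The sample header values are made up for the example.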

Post-Alignment Processing

Post-alignment processing to create analysis-ready BAM files is an important stage in any reads-to-variants workflow. It consists of data-cleaning steps that reduce technical biases, such as marking duplicate reads and recalibrating base quality scores.
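Duplicate marking can be illustrated with a deliberately simplified sketch. It assumes single-end reads, treats reads with the same contig, start position, and strand as duplicates of one another, keeps the read with the highest summed base quality, and sets the SAM duplicate flag (0x400) on the rest. Production tools handle paired reads and soft-clipped ends, which this sketch ignores. The `Read` class and `mark_duplicates` function are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

REVERSE_FLAG = 0x10     # SAM flag bit: read maps to the reverse strand
DUPLICATE_FLAG = 0x400  # SAM flag bit: PCR or optical duplicate


@dataclass
class Read:
    name: str
    chrom: str
    pos: int              # 1-based leftmost mapping position
    flag: int
    qualities: List[int]  # per-base Phred scores


def mark_duplicates(reads: List[Read]) -> List[Read]:
    """Flag all but the best-quality read in each (chrom, pos, strand) group."""
    groups: Dict[Tuple[str, int, bool], List[Read]] = defaultdict(list)
    for read in reads:
        key = (read.chrom, read.pos, bool(read.flag & REVERSE_FLAG))
        groups[key].append(read)
    for group in groups.values():
        best = max(group, key=lambda read: sum(read.qualities))
        for read in group:
            if read is not best:
                read.flag |= DUPLICATE_FLAG  # mark, but do not delete
    return reads
```

Duplicates are flagged rather than removed, so downstream tools can decide whether to skip them.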

Short Variant Discovery

This section describes methods for detecting germline single-nucleotide variants (SNVs) and indels. Methods for discovering somatic short variants and structural variations are not covered here.

After the data processing steps, the reads are ready for downstream analysis, and the most common next step is variant calling. Variant calling is the process of identifying differences between the sequencing reads generated by NGS experiments and a reference genome. Because alignment and sequencing artifacts complicate variant calling, a large number of variant callers have been developed, and more are still being built, to help with this difficult task.
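The core idea of comparing aligned reads with the reference can be sketched as a naive pileup-based SNV caller. This is a toy illustration, not how established callers work: they use probabilistic models and often local reassembly to cope with the artifacts mentioned above. The sketch assumes ungapped alignments given as `(start, sequence)` pairs with 0-based starts. The depth and allele-fraction thresholds are arbitrary assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# (1-based position, reference base, alternate base, alt count, depth)
Call = Tuple[int, str, str, int, int]


def call_snvs(reference: str, reads: List[Tuple[int, str]],
              min_depth: int = 3, min_alt_fraction: float = 0.3) -> List[Call]:
    """Report positions where a non-reference base is well supported."""
    pileup: Dict[int, Counter] = defaultdict(Counter)
    for start, sequence in reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            if 0 <= position < len(reference):
                pileup[position][base] += 1   # count each observed base

    calls: List[Call] = []
    for position in sorted(pileup):
        counts = pileup[position]
        depth = sum(counts.values())
        if depth < min_depth:
            continue                          # too little evidence to call
        ref_base = reference[position]
        for base, count in counts.most_common():
            if base != ref_base and count / depth >= min_alt_fraction:
                calls.append((position + 1, ref_base, base, count, depth))
                break                         # report the best-supported alt only
    return calls
```

Positions are reported 1-based to match the convention used in VCF files.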

Filtration of Variants

The variant calling stage produces raw SNVs and indels in the Variant Call Format (VCF). These are then filtered, either by applying hard filters to the data or by using a more sophisticated method such as GATK's Variant Quality Score Recalibration (VQSR).
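A hard filter is simply a fixed threshold on a variant annotation. The sketch below applies three such thresholds to the INFO column of a VCF data line and writes the outcome to the FILTER column. The annotation names (QD, FS, MQ) and cutoffs follow commonly cited GATK hard-filtering suggestions for SNPs. They are not taken from this article and should be treated as assumptions to tune for a given dataset. The function and filter names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def parse_info(info: str) -> Dict[str, str]:
    """Turn 'QD=1.5;FS=70.0;DB' into {'QD': '1.5', 'FS': '70.0', 'DB': ''}."""
    parsed: Dict[str, str] = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, _, value = item.partition("=")  # flags have no '=' and get ''
        parsed[key] = value
    return parsed


# (filter name, INFO key, predicate that is True when the variant FAILS)
HARD_FILTERS: List[Tuple[str, str, Callable[[float], bool]]] = [
    ("QD2", "QD", lambda value: value < 2.0),    # low quality by depth
    ("FS60", "FS", lambda value: value > 60.0),  # strong strand bias
    ("MQ40", "MQ", lambda value: value < 40.0),  # low mapping quality
]


def apply_hard_filters(vcf_line: str) -> str:
    """Return the VCF data line with its FILTER column (index 6) filled in."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = parse_info(fields[7])               # INFO is the eighth column
    failed = [name for name, key, fails in HARD_FILTERS
              if key in info and fails(float(info[key]))]
    fields[6] = ";".join(failed) if failed else "PASS"
    return "\t".join(fields)
```

Annotations that are missing from a record are skipped rather than treated as failures. That is a design choice of this sketch, not a rule.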

Variant Annotation

Another crucial process in the WES/WGS analysis workflow is variant annotation. All functional annotation tools aim to annotate variants with information about their effects or consequences, such as (i) listing which gene(s) or transcript(s) are affected, (ii) determining the impact on the protein sequence, (iii) correlating the variant with known genomic annotations, and (iv) matching the variant against known variants in variant databases. The effect of each variant is described using Sequence Ontology (SO) terms. Qualifiers are frequently added to indicate the severity and impact of these consequences.
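Step (ii), determining the impact on the protein sequence, can be sketched for the simplest case: a single-base substitution inside a coding sequence. The example translates the affected codon before and after the change using the standard genetic code and returns one of four SO terms (`synonymous_variant`, `missense_variant`, `stop_gained`, `stop_lost`). It assumes the coding sequence is given on the coding strand, in frame, with a 1-based position. Real annotators also handle splicing, strand, multiple transcripts, and many more consequence types. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
                "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def classify_coding_snv(cds: str, position: int, alt: str) -> str:
    """Return an SO term for substituting `alt` at 1-based `position` in `cds`."""
    index = position - 1
    codon_start = index - index % 3           # first base of the affected codon
    ref_codon = cds[codon_start:codon_start + 3].upper()
    if len(ref_codon) != 3:
        raise ValueError("position falls in an incomplete trailing codon")
    offset = index - codon_start
    alt_codon = ref_codon[:offset] + alt.upper() + ref_codon[offset + 1:]

    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous_variant"           # protein unchanged
    if alt_aa == "*":
        return "stop_gained"                  # premature stop codon
    if ref_aa == "*":
        return "stop_lost"                    # original stop codon removed
    return "missense_variant"                 # one amino acid replaced
```

For the toy coding sequence `ATGGAATGGTAA` (Met, Glu, Trp, stop), changing the third base of the `GAA` codon to `G` leaves glutamate unchanged, so the function returns `synonymous_variant`.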

References

Neerman N, Faust G, Meeks N, et al. A clinically validated whole genome pipeline for structural variant detection and analysis. BMC Genomics. 2019, 20(8).
Chen J, Li X, Zhong H, et al. Systematic comparison of germline variant calling pipelines cross multiple next-generation sequencers. Scientific Reports. 2019, 9(1).
Liu F, Zhang Y, Zhang L, et al. Systematic comparative analysis of single-nucleotide variant detection methods from single-cell RNA sequencing data. Genome Biology. 2019, 20(1).
Peng G, Fan Y, Palculict TB, et al. Rare variant detection using family-based sequencing analysis. Proceedings of the National Academy of Sciences. 2013, 110(10).