Whole-Genome Sequencing: Introduction and Data Analysis Protocols

Introduction to Whole-Genome Sequencing

Whole-genome sequencing (WGS) determines an organism's complete chromosomal (nuclear) and mitochondrial DNA sequence by reading short fragments of DNA and assembling them. When no reference or template sequence is available, de novo sequencing is used to sequence the new genome: overlapping sequencing reads are assembled into contigs (contiguous consensus sequences). Once a de novo genome has been fully sequenced, assembled, and annotated, a draft or consensus reference sequence is produced. Single nucleotide polymorphisms (SNPs), copy number variations (CNVs), rearrangements, and indels are routinely detected using focused sequencing approaches such as exome or targeted resequencing. WGS can detect these genomic variants on its own, but reaching a comparable sequencing depth is currently far cheaper with focused or targeted resequencing.

Whole-Genome Sequencing Bioinformatics Workflow

The bioinformatics workflow for WGS includes the following stages: raw read quality control and data preprocessing; alignment; variant calling; genome assembly; genome annotation; and further advanced analyses, such as phylogenetic analysis, depending on the research question.

Read Filtering and Cleaning

Reads are 'cleaned' to raise average quality and reduce the amount of erroneous data. Cleaning removes low-quality or biased sequences, ambiguous bases, duplicate reads, homopolymers that arise at the edge of the flow cell, and adapter dimers. Other filtering options include deduplication, enforcing a minimum read length, and trimming low-quality (low Q score) sequences.
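The filters described above can be sketched in a few lines of Python. This is a minimal illustration, not a substitute for a dedicated trimming tool: the `Read` type, the `clean_reads` function, and the default thresholds are all hypothetical choices made for this example, and the quality string is assumed to use Phred+33 encoding.

```python
from typing import Iterable, Iterator, NamedTuple


class Read(NamedTuple):
    """One sequencing read (hypothetical container for this sketch)."""
    name: str
    seq: str
    qual: str  # assumed Phred+33 encoded quality string


def mean_phred(qual: str) -> float:
    """Mean Phred score of a Phred+33 encoded quality string."""
    if not qual:
        return 0.0
    return sum(ord(c) - 33 for c in qual) / len(qual)


def clean_reads(
    reads: Iterable[Read],
    min_len: int = 50,        # illustrative threshold, not a recommendation
    min_mean_q: float = 20.0,  # illustrative threshold, not a recommendation
    max_n: int = 0,
) -> Iterator[Read]:
    """Yield reads that pass length, ambiguity, quality, and duplicate filters."""
    seen = set()
    for read in reads:
        if len(read.seq) < min_len:
            continue  # too short
        if read.seq.upper().count("N") > max_n:
            continue  # too many ambiguous bases
        if mean_phred(read.qual) < min_mean_q:
            continue  # low average quality
        if read.seq in seen:
            continue  # exact duplicate of an earlier read
        seen.add(read.seq)
        yield read
```

Exact-sequence deduplication is the simplest possible rule; tools used in practice often deduplicate on alignment position instead.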


It is necessary to establish a reference genome. Mash allows us to decide genetic distance and connectedness by comparing the sequencing reads to the NCBI RefSeq genomes reference set. The quality-controlled reads must now be mapped to the reference genome. Two famous short-read alignment algorithms are Burrows-Wheeler Aligner (BWA) and Bowtie2. The conventional sequence alignment/map template known as SAM is produced by BWA and Bowtie2, which makes the following steps easier.
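To give a feel for how a sketch-based distance of the kind Mash reports can be estimated, here is a simplified MinHash example. It is an illustration of the general idea only and does not reproduce Mash's implementation: it uses SHA-1 from the standard library as the hash, skips canonical (strand-independent) k-mers, and caps the distance at 1. The function names and default parameters are assumptions made for this example.

```python
import hashlib
import math


def kmers(seq: str, k: int) -> set:
    """All distinct k-mers of a sequence (forward strand only in this sketch)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq: str, k: int, size: int) -> list:
    """Bottom-`size` MinHash sketch: the smallest k-mer hash values."""
    hashes = sorted(
        int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big")
        for km in kmers(seq, k)
    )
    return hashes[:size]


def jaccard_estimate(sketch_a: list, sketch_b: list, size: int) -> float:
    """Estimate Jaccard similarity from the smallest hashes of the merged sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def distance_from_jaccard(j: float, k: int) -> float:
    """Distance D = -(1/k) * ln(2j / (1 + j)), clamped to the range [0, 1]."""
    if j <= 0:
        return 1.0
    return min(1.0, max(0.0, -math.log(2 * j / (1 + j)) / k))


def estimate_distance(seq_a: str, seq_b: str, k: int = 21, size: int = 1000) -> float:
    """Sketch both sequences and convert the Jaccard estimate to a distance."""
    j = jaccard_estimate(sketch(seq_a, k, size), sketch(seq_b, k, size), size)
    return distance_from_jaccard(j, k)
```

A distance near 0 suggests the two sequences share most of their k-mers, which is the property used when ranking candidate reference genomes.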

Variant Calling

In whole-genome and exome data, the variant calling pipeline identifies single-nucleotide variants. Variants are discovered by comparing an individual's data to a reference sequence.
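The comparison against a reference can be illustrated with a toy pileup-based caller. This is a deliberately naive sketch: it assumes gapless alignments given as hypothetical `(start, sequence)` pairs with 0-based positions, and it uses simple depth and allele-fraction cutoffs, whereas production variant callers rely on statistical models and base and mapping qualities.

```python
from collections import Counter, defaultdict


def call_snvs(reference, alignments, min_depth=3, min_alt_frac=0.8):
    """Return (pos, ref_base, alt_base, alt_count, depth) tuples, 0-based positions.

    `alignments` is a list of (start, read_sequence) pairs, assumed to be
    gapless alignments to `reference` (an assumption of this sketch).
    """
    reference = reference.upper()

    # Build a pileup: for every reference position, count the observed bases.
    pileup = defaultdict(Counter)
    for start, seq in alignments:
        for offset, base in enumerate(seq.upper()):
            pos = start + offset
            if 0 <= pos < len(reference) and base in "ACGT":
                pileup[pos][base] += 1

    calls = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # not enough coverage to make a call
        ref_base = reference[pos]
        alts = [(count, base) for base, count in counts.items() if base != ref_base]
        if not alts:
            continue  # every read agrees with the reference
        alt_count, alt_base = max(alts)  # most frequent non-reference base
        if alt_count / depth >= min_alt_frac:
            calls.append((pos, ref_base, alt_base, alt_count, depth))
    return calls
```

For example, three reads that all carry a `T` where the reference has an `A` would yield a single call at that position, while positions covered by fewer than `min_depth` reads are skipped.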

Genome Assembly

De novo assembly is the process of aligning overlapping reads to build longer contigs (larger contiguous sequences) and ordering the contigs into scaffolds (a framework of the sequenced genome). When a reference genome from a related organism is available, a common practice is to generate contigs de novo and then align them to the reference genome for scaffold assembly. The "Align-Layout-Consensus" algorithm is another option: it aligns reads against a closely related reference genome before building contigs and scaffolds de novo.
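The idea of joining overlapping reads into longer contigs can be shown with a small greedy merge. This is a teaching sketch only: it assumes error-free reads from a single strand and repeatedly merges the pair with the longest exact suffix-prefix overlap, which is not how production assemblers work. The function names and the `min_overlap` default are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Greedily merge reads into contigs using exact suffix-prefix overlaps."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]

    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [
            c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)
        ] + [merged]
```

When the reads do not all overlap, the function returns several contigs; ordering those into scaffolds is a separate step that this sketch does not attempt.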

Genome Annotation

Genome annotation describes each gene and its product, such as an RNA or protein. An annotation records the function assigned to the gene product along with supporting evidence. Because there are so many genes and products to examine, the best approach usually combines manual and automated annotation. Structural annotation is the identification of genomic elements such as open reading frames, gene structure, and regulatory motifs. Functional annotation is the process of assigning biological function (regulation, interactions, and expression) to these elements.
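As one small example of structural annotation, open reading frames can be located by scanning for a start codon followed in-frame by a stop codon. The sketch below is a simplification: it scans only the forward strand, assumes the standard stop codons and `ATG` as the only start codon, and the `find_orfs` name and `min_codons` default are hypothetical. Real gene finders also consider the reverse strand, alternative start codons, and statistical gene models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed


def find_orfs(seq: str, min_codons: int = 3):
    """Return sorted (start, end, frame) tuples for forward-strand ORFs.

    Coordinates are 0-based and `end` is exclusive, so `seq[start:end]`
    runs from the start codon through the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)
```

Assigning a function to each predicted ORF, for instance by similarity to known proteins, would then be the functional annotation step.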

