Methods and Tools for Whole Genome Sequencing Data Analysis

Introduction to Whole Genome Sequencing

Whole-genome sequencing (WGS) uses the latest sequencing technologies to read an organism's complete genome, and it has the potential to greatly improve our understanding of genomes. WGS supports a variety of applications, including variant calling, genome annotation, phylogenetic analysis, and reference genome construction. Data management is a further challenge: as larger datasets become more available and affordable, computational analysis, rather than sequencing technology, will become the rate-limiting step.

The bioinformatics workflow for WGS consists of the following stages: (1) quality control of raw reads; (2) data preprocessing; (3) alignment; (4) variant calling; (5) genome assembly; and (6) genome annotation. Different software is required depending on the type of analysis.

Raw Read QC and Preprocessing

Poor-quality reads, as well as technical sequences such as adapters, must be removed from the raw FASTQ files. This step is critical for accurate and reliable variant detection. FastQC is an effective quality control tool for raw reads; its report covers basic statistics, sequence quality, quality scores, sequence content, GC content, sequence length distribution, overrepresented sequences, sequence duplication levels, adapter content, and k-mer content. Tools such as fastx_trimmer (from the FASTX-Toolkit) and cutadapt can be used for read trimming.
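To make the idea of quality trimming concrete, here is a minimal Python sketch. It is not how fastx_trimmer or cutadapt are implemented; it simply assumes Phred+33 encoded qualities and removes trailing bases below a threshold. The function names and the thresholds `min_q` and `min_len` are hypothetical choices for illustration.

```python
import io

def parse_fastq(handle):
    """Yield (name, sequence, quality) from a stream with 4 lines per record."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual

def phred_scores(qual, offset=33):
    """Convert a quality string to integer Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in qual]

def trim_3prime(seq, qual, min_q=20):
    """Drop trailing bases whose Phred score is below min_q."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]

def filter_reads(handle, min_q=20, min_len=5):
    """Trim every read and keep those still at least min_len bases long."""
    kept = []
    for name, seq, qual in parse_fastq(handle):
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) >= min_len:
            kept.append((name, seq, qual))
    return kept

# Toy input: 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
example = io.StringIO(
    "@read1\nACGTACGTAC\n+\nIIIIIIII##\n"
    "@read2\nACGTAC\n+\n######\n"
)
kept = filter_reads(example)
print(kept)
```

In this toy input, read1 loses its two low-quality trailing bases and read2 is discarded because nothing remains after trimming. Real trimming tools also handle adapters and more robust quality heuristics.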


Alignment

A reference genome must first be chosen. Mash makes it possible to estimate genetic distance and relatedness by comparing the sequencing reads against a reference set of NCBI RefSeq genomes. The quality-controlled reads are then mapped to the reference genome. BWA and Bowtie2 both write alignments in the standard Sequence Alignment/Map (SAM) format, which simplifies downstream processing. BLAST, on the other hand, is commonly used for local alignment.
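As an illustration of why a shared alignment format helps downstream steps, the sketch below reads the first six mandatory SAM columns (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR) and counts mapped and unmapped reads. The flag bits 0x4 (unmapped) and 0x10 (reverse strand) come from the SAM specification; the helper names, the toy records, and the `min_mapq` cutoff are assumptions made for this example.

```python
from collections import namedtuple

SamRecord = namedtuple("SamRecord", "qname flag rname pos mapq cigar")

def parse_sam_line(line):
    """Parse the first six mandatory tab-separated SAM fields."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5])

def is_unmapped(rec):
    return bool(rec.flag & 0x4)   # bit 0x4: segment unmapped

def is_reverse(rec):
    return bool(rec.flag & 0x10)  # bit 0x10: aligned to the reverse strand

def summarize(lines, min_mapq=30):
    """Count mapped, unmapped, and confidently mapped reads."""
    counts = {"mapped": 0, "unmapped": 0, "high_quality": 0}
    for line in lines:
        if line.startswith("@"):  # header lines start with '@'
            continue
        rec = parse_sam_line(line)
        if is_unmapped(rec):
            counts["unmapped"] += 1
        else:
            counts["mapped"] += 1
            if rec.mapq >= min_mapq:
                counts["high_quality"] += 1
    return counts

# Toy SAM content (hypothetical reads, not real aligner output).
sam_lines = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r2\t16\tchr1\t200\t10\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]
summary = summarize(sam_lines)
print(summary)
```

For production work, a dedicated library or samtools is a better choice than hand-rolled parsing; the point here is only that every aligner emitting SAM can feed the same downstream code.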

Variant calling

Once reads have been aligned, variants can be identified by comparing the sample genome with the reference genome. The variants discovered may be associated with disease or may simply be non-functional genomic noise. VCF is the standard format for storing sequence variation, including SNPs (single nucleotide polymorphisms), indels, and structural variants, together with their annotations. Variant calling can be difficult because of the high rates of false-positive and false-negative detection of SNVs and indels.
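The sketch below shows, under simplifying assumptions, how records in a VCF file can be read and sorted into SNVs and indels. It uses only the fixed VCF columns CHROM, POS, ID, REF, ALT, and QUAL. The QUAL cutoff `min_qual` is a hypothetical, illustrative filter and not a recommended threshold; symbolic alleles such as `<DEL>` are skipped rather than interpreted.

```python
def classify_variant(ref, alt):
    """Classify a REF/ALT pair by comparing allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"  # multi-nucleotide substitution
    return "insertion" if len(alt) > len(ref) else "deletion"

def parse_vcf(lines, min_qual=30.0):
    """Return (chrom, pos, ref, alt, type) tuples for records passing min_qual."""
    variants = []
    for line in lines:
        if line.startswith("#"):  # meta-information and column header lines
            continue
        chrom, pos, _id, ref, alts, qual = line.rstrip("\n").split("\t")[:6]
        if qual == "." or float(qual) < min_qual:
            continue
        for alt in alts.split(","):  # ALT may list several alleles
            if alt.startswith("<") or alt == "*":
                continue  # symbolic or spanning-deletion alleles not handled
            variants.append((chrom, int(pos), ref, alt, classify_variant(ref, alt)))
    return variants

# Toy VCF content (hypothetical calls, not real data).
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tAT\tA\t45\tPASS\t.",
    "chr1\t300\t.\tC\tCGG,T\t99\tPASS\t.",
    "chr1\t400\t.\tG\tA\t5\tLowQual\t.",
]
variants = parse_vcf(vcf_lines)
for v in variants:
    print(v)
```

A simple quality filter like this trades false positives for false negatives, which is one reason variant calling remains difficult in practice.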

Genome assembly

De novo assembly is the process of joining overlapping reads into longer contigs (larger contiguous sequences) and ordering the contigs into scaffolds (a framework of the sequenced genome). When a reference genome from a related species is available, it is common practice to build contigs de novo and then align them to the reference genome for scaffold assembly. The "Align-Layout-Consensus" algorithm is another option: reads are first aligned to a closely related reference genome, and contigs and scaffolds are then built de novo.

The quality of an assembly can be measured using a variety of metrics. Effective genome annotation requires a contiguous, near-complete (approximately 90%) assembly interrupted only by small gaps.
- Genome size: can be estimated using both C-value and k-mer frequency-based methods.
- Assembly contiguity: can be assessed with the N50 statistic, a length-weighted median of the assembled sequence lengths. Contigs of length N50 or longer contain at least half of the total assembled bases.
- Accuracy: Transcriptome data is a valuable resource for verifying sequence accuracy and fixing scaffolds. Mis-assemblies and chimeric contigs can also be detected using comparative genomic methods.
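The N50 statistic mentioned in the list above can be computed in a few lines. This is a minimal sketch using the common definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembled bases. The function name and the example lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig where the cumulative length covers half the assembly.
        if running * 2 >= total:
            return length

# Toy assembly: total length 290, so half is 145; 80 + 70 = 150 reaches it.
contig_lengths = [80, 70, 50, 40, 30, 20]
print(n50(contig_lengths))  # 70
```

Because N50 depends only on the length distribution, it says nothing about correctness: a mis-assembled genome can have a high N50, which is why the accuracy checks listed above are still needed.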

Genome annotation

To fully interpret the genome sequence, biologically relevant information such as gene ontology (GO) terms, KEGG pathways, and epigenetic modifications must be attached to it. Annotation has two phases:

  1. Computational phase. This phase comprises repeat masking, coding sequence (CDS) prediction, and gene model prediction.
     - Repeat masking. Because repeats are poorly conserved across species, it is advisable to build a species-specific repeat library with tools such as RepeatModeler and RepeatExplorer.
     - Gene model prediction. Protein alignments, syntenic protein lift-overs from other species, ESTs, and RNA-seq data can all help in predicting gene models.

  2. Annotation phase. All the evidence mentioned above (ab initio predictions, as well as protein, EST, and RNA alignments) is synthesized into a gene annotation. Automated annotation tools such as MAKER and PASA are available to integrate and weigh the evidence. If the gene annotations contain errors, WebApollo can be used to correct them manually through its visual interface.
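To illustrate what repeat masking does to a sequence before gene prediction, the sketch below applies the two common conventions: soft masking (repeat bases in lowercase) and hard masking (repeat bases replaced by N). The repeat coordinates are assumed to be 0-based, half-open intervals supplied by some upstream repeat finder; the function names and the toy sequence are hypothetical and do not reproduce the output of any particular tool.

```python
def soft_mask(sequence, repeats):
    """Lowercase the bases inside each (start, end) repeat interval."""
    chars = list(sequence.upper())
    for start, end in repeats:
        # Clamp the interval to the sequence bounds.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)

def hard_mask(sequence, repeats):
    """Replace the bases inside each (start, end) repeat interval with 'N'."""
    chars = list(sequence.upper())
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)

# Toy example: positions 2, 3, and 4 are annotated as a repeat.
seq = "ACGTACGTAC"
repeat_intervals = [(2, 5)]
print(soft_mask(seq, repeat_intervals))  # ACgtaCGTAC
print(hard_mask(seq, repeat_intervals))  # ACNNNCGTAC
```

Soft masking keeps the underlying bases available to downstream tools, while hard masking removes them entirely; which is appropriate depends on the gene predictor being used.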

About CD Genomics Bioinformatics Analysis

The bioinformatics analysis department of CD Genomics provides novel solutions for data-driven innovation aimed at discovering the hidden potential in biological data, tapping new insights related to life science research, and predicting new prospects.


* For Research Use Only. Not for use in diagnostic procedures.