Pan-Genomics Analysis: Introduction, Pipeline, and Applications

Introduction to Pan Genome

In molecular biology and genetics, a pan-genome (also pangenome or supragenome) is the entire set of genes from all strains within a clade. More broadly, it is the union of all the genomes of a clade. The pangenome is made up of two parts: the core genome, which contains genes present in all strains of the clade, and the accessory genome, which contains 'dispensable' genes present in only a subset of the strains as well as strain-specific genes. At least in plants, the term "dispensable" has been criticized, because accessory genes play an important role in genome evolution and in the complex interplay between the genome and the environment. Pan-genomics is the study of the pangenome.
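
The partition described above can be illustrated with a minimal Python sketch. It assumes gene families have already been clustered, so that each strain is represented simply as a set of gene-family identifiers; the function name `partition_pangenome` and the toy strains are hypothetical and exist only for illustration.

```python
from collections import Counter


def partition_pangenome(strain_genes):
    """Split gene families into core, dispensable, and strain-specific sets.

    strain_genes maps a strain name to an iterable of gene-family IDs.
    Assumption: orthologous genes already share the same ID.
    """
    n_strains = len(strain_genes)
    # Count in how many strains each gene family occurs.
    counts = Counter(g for genes in strain_genes.values() for g in set(genes))

    pan = set(counts)
    core = {g for g, c in counts.items() if c == n_strains}
    strain_specific = {g for g, c in counts.items() if c == 1} - core
    dispensable = pan - core - strain_specific  # present in a subset of strains

    return {
        "pan": pan,
        "core": core,
        "dispensable": dispensable,
        "strain_specific": strain_specific,
        # Accessory genome = dispensable genes plus strain-specific genes.
        "accessory": dispensable | strain_specific,
    }


# Toy input (hypothetical strains and gene families).
toy_strains = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4"},
    "strain_C": {"g1", "g3", "g5"},
}
parts = partition_pangenome(toy_strains)
for name, genes in parts.items():
    print(name, sorted(genes))
```

In real analyses the hard part is the clustering step that assigns the identifiers; the set arithmetic shown here only applies once that step is done.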

Technical Implementations in Pan-Genome Analysis

Pangenome analysis can operate on a range of sequence units: open reading frames (ORFs), genes, clusters of orthologous groups (COGs), coding sequences (CDS), proteins, arbitrary sequence chunks, and concatenated gene or protein entities.

Several practical choices directly affect the reliability of the conclusions drawn: the parameters the search engine uses to identify orthologous sequences, which determine the pool of core and dispensable sequence entities; the mathematical model and distribution used to extrapolate the growth of the pangenome and the core-genome size; and how fast the pangenome is expected to grow and reach a plateau (open or closed pangenome). As the number of genomes grows, scalability becomes a further limiting factor, because every possible order of genome addition would have to be considered. The number of comparisons is given by C = N! / [(n − 1)! · (N − n)!], where C is the number of comparisons, N is the total number of genomes, and n is the number of genomes included at a given step (that is, C counts the comparisons for adding the nth genome).
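
To give a feel for how quickly this number grows, the following Python sketch evaluates the formula directly. It reads the denominator as (n − 1)! · (N − n)! and takes n to be the position of the genome being added, which is an interpretation of the formula rather than something stated explicitly above; the function names are illustrative.

```python
from math import factorial


def comparisons_for_nth_genome(N, n):
    """C = N! / [(n - 1)! * (N - n)!] for adding the n-th of N genomes.

    Assumption: n is the position of the genome being added (1 <= n <= N).
    """
    if not 1 <= n <= N:
        raise ValueError("n must satisfy 1 <= n <= N")
    return factorial(N) // (factorial(n - 1) * factorial(N - n))


def total_comparisons(N):
    """Sum of C over every step n = 1..N (an illustrative aggregate)."""
    return sum(comparisons_for_nth_genome(N, n) for n in range(1, N + 1))


# Print the total for a few dataset sizes to show the growth.
for N in (8, 16, 44):
    print(N, total_comparisons(N))
```

Even for a few dozen genomes the total is far beyond what can be enumerated, which is why exhaustive evaluation is usually replaced by sampling.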

Subsampling the comparisons is a workaround to the exhaustive strategy. Comparisons are chosen at random in such a way that each genome takes part in the same number of comparisons, and the number of comparisons is capped at a value that the available computing power can sustain for the target dataset size.
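
One simple way to realize this kind of subsampling is to draw a fixed number of random genome orderings and average the pangenome and core-genome sizes at each step. The Python sketch below takes that approach; it is an assumption about how the sampling could be done, not a description of any particular tool. Because every genome appears exactly once in each ordering, all genomes are used equally often. The function name, the `n_permutations` parameter, and the toy data are hypothetical.

```python
import random


def sample_accumulation_curves(strain_genes, n_permutations=100, seed=0):
    """Estimate mean pangenome and core-genome size per number of genomes.

    strain_genes maps a strain name to a set of gene-family IDs.
    n_permutations caps the work instead of enumerating all orderings.
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    names = sorted(strain_genes)
    pan_sizes = [[] for _ in names]
    core_sizes = [[] for _ in names]

    for _ in range(n_permutations):
        order = names[:]
        rng.shuffle(order)  # one random order of genome addition
        pan, core = set(), None
        for step, name in enumerate(order):
            genes = set(strain_genes[name])
            pan |= genes  # the pangenome can only grow
            core = genes if core is None else core & genes  # the core can only shrink
            pan_sizes[step].append(len(pan))
            core_sizes[step].append(len(core))

    def mean(values):
        return sum(values) / len(values)

    return [mean(v) for v in pan_sizes], [mean(v) for v in core_sizes]


# Toy input (hypothetical strains and gene families).
toy_strains = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4"},
    "strain_C": {"g1", "g3", "g5"},
}
pan_curve, core_curve = sample_accumulation_curves(toy_strains, n_permutations=50)
print("pangenome size per step:", pan_curve)
print("core-genome size per step:", core_curve)
```

The resulting curves are the kind of data to which a growth model is fitted when judging whether a pangenome is open or closed; the number of permutations is the knob that trades accuracy against compute time.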

Pan Genome Analysis Pipeline: BGDMdocker

BGDMdocker is a Docker-based workflow for analyzing and visualizing bacterial pangenomes and biosynthetic gene clusters. The pipeline comprises Prokka v1.11 for rapid prokaryotic genome annotation, panX for pangenome analysis, and antiSMASH 3.0 for fully automated genome-wide detection and analysis of biosynthetic gene clusters. The visualization supports alignments, phylogenetic trees, mutations mapped to phylogenetic branches, and gene gain and loss events mapped onto the core-genome phylogeny. The pipeline was tested on 44 Bacillus amyloliquefaciens strains.

Software Tools Used for Pan Genome Analysis

As the field of pangenomics has grown in popularity, a number of software tools have been created to aid in the analysis of this data. In 2015, a group reviewed the kinds of analyses and tools available to researchers. Seven types of software have been developed to analyze pangenomes: tools that cluster homologous genes; identify SNPs; plot pangenomic profiles; build phylogenetic relationships of orthologous genes/families of strains/isolates; perform function-based searching; support annotation and/or curation; and provide visualization.

Panseq and the pan-genomes analysis pipeline (PGAP) were the two most cited software tools as of the end of 2014. Other options include BPGA (a pan-genome analysis pipeline for prokaryotic genomes), GET_HOMOLOGUES, Roary, and PanDelos.

A review of plant pan-genomes was published in 2015. PanTools and GET_HOMOLOGUES-EST were among the first software packages designed for plant pangenomes.

A comparative analysis of tools for extracting gene-based pangenomic content (such as GET_HOMOLOGUES, PanDelos, Roary, and others) was recently carried out. The tools were compared from a methodological standpoint, with the goal of determining what causes one methodology to outperform the others. The study considered a variety of bacterial populations that were generated synthetically by varying evolutionary parameters. The results show that the performance of each tool depends on the composition of the input genomes.

About CD Genomics Bioinformatics Analysis

The bioinformatics analysis department of CD Genomics provides novel solutions for data-driven innovation aimed at discovering the hidden potential in biological data, tapping new insights related to life science research, and predicting new prospects.

