Genome Size Estimation: Origin, Definition, and Methods

Definition and Origin of the term "Genome Size"

Genome size is the quantity of DNA in one haploid copy of an organism's nuclear genome. It is usually measured in picograms (pg) or megabases (Mb), where 1 pg is equivalent to 978 Mb. Genome size is not a measure of a species' complexity: total DNA content varies greatly among biological taxa, and for reasons that remain unclear, some single-celled organisms have far more DNA than humans.
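The unit conversion above can be sketched in a few lines of Python. This is a minimal illustration that uses only the factor stated in the text (1 pg = 978 Mb); the function names and the 3.5 pg input are hypothetical and chosen for the example.

```python
# Conversion factor stated in the text: 1 pg of DNA corresponds to 978 Mb.
MB_PER_PG = 978.0


def pg_to_mb(picograms: float) -> float:
    """Convert a DNA mass in picograms to a length in megabases."""
    return picograms * MB_PER_PG


def mb_to_pg(megabases: float) -> float:
    """Convert a length in megabases to a DNA mass in picograms."""
    return megabases / MB_PER_PG


if __name__ == "__main__":
    c_value_pg = 3.5  # hypothetical haploid DNA content, for illustration only
    print(f"{c_value_pg} pg is about {pg_to_mb(c_value_pg):.0f} Mb")
```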

Even in discussions dealing specifically with terminology in this field, the term "genome size" is frequently misattributed to a 1976 paper by Ralph Hinegardner. Hinegardner used the term only once in that paper: in the title. The term appears to have first appeared in 1968, when Hinegardner wondered, in the last paragraph of another article, whether "cellular DNA composition does, in fact, represent genome size". In that context, "genome size" referred to the set of genes in a genotype.

The authors of a publication that followed only two months later should probably be credited with coining "genome size" in its modern sense. By the early 1970s, the term had become widely accepted, owing to its inclusion in Susumu Ohno's influential book Evolution by Gene Duplication, published in 1970.

Genome Size Estimation

Genome size estimation is critical not only for understanding genome evolution, but also for several practical aspects of genome sequencing and assembly, such as estimating the amount of sequencing data needed and assessing the completeness of assembled genome sequences.

Techniques for genome size estimation fall into two categories: experimental and computational. For many years, experimental methods such as Feulgen densitometry and the widely used flow cytometry have been applied to tens of thousands of species, producing a variety of genome size datasets. It is worth noting that all experimental approaches depend on genomes that serve as internal standards. Although the importance of such standards has been acknowledged, the goal of establishing a commonly used set of them has not been achieved. This, together with other factors such as sample preparation, staining/dyeing method, and random drift of instruments, can lead to major differences in genome size estimates when the same genome is analyzed in different laboratories.

Alternatively, genome size can be estimated computationally from whole-genome sequencing data. Because most sequence assemblies are incomplete, using genome assembly length as a genome size estimate is not very effective. Instead, genome size can be inferred efficiently from sequencing reads by analyzing the frequencies of k-mers, which can be obtained with a variety of efficient k-mer counting tools such as Jellyfish or DSK. The number and average coverage of k-mers can then be used to estimate genome size; however, estimating these parameters accurately is not easy. Recently, statistical models such as negative binomial or Poisson distributions have been fitted to the histogram of distinct k-mer frequencies in order to infer these parameters and estimate genome size. Although these models are commonly used, hardly any research has validated their predictions.
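The basic idea of using the number and average coverage of k-mers can be sketched as follows. This is a simplified illustration, not the model-fitting approach of any particular tool: it assumes the k-mer histogram is already available as a mapping from depth to the number of distinct k-mers, takes the histogram peak as the average k-mer coverage, and discards low depths on the assumption that they are dominated by sequencing errors. The function name, the `min_depth` cutoff, and the toy histogram are all hypothetical.

```python
from typing import Dict


def estimate_genome_size(histogram: Dict[int, int], min_depth: int = 2) -> float:
    """Estimate genome size from a k-mer frequency histogram.

    histogram maps a depth (how many times a k-mer was seen) to the number
    of distinct k-mers observed at that depth.
    """
    # Drop low-depth bins, assumed here to be mostly sequencing errors.
    usable = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no histogram bins at or above min_depth")

    # The depth with the most distinct k-mers approximates average k-mer coverage.
    peak_depth = max(usable, key=usable.get)

    # Total k-mer occurrences across the retained bins.
    total_kmers = sum(depth * count for depth, count in usable.items())

    return total_kmers / peak_depth


if __name__ == "__main__":
    # Toy histogram: an error spike at depth 1 and a main peak at depth 20.
    toy = {1: 5000, 18: 100, 19: 300, 20: 600, 21: 300, 22: 100}
    print(estimate_genome_size(toy, min_depth=5))
```

Real histograms are rarely this clean; repeats and heterozygosity add further peaks, which is part of why fitted statistical models are used instead of a single peak.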

Finding Genome Size Using K-mer Estimation

For a sequence of length L and a k-mer size of k, the total number of possible k-mers is n = (L - k) + 1. For short fragments, this count is a poor estimate of length: a 14 bp fragment yields only n = 7 k-mers (which corresponds to k = 8), far from the actual fragment size. For larger fragments, however, the total number of k-mers (n) provides a good approximation of the actual genome size. The following table illustrates the approximation:

Figure 1. Total number of k-mers as a good approximation to the actual genome size. (Institute for Systems Genomics, Computational Biology Core, University of Connecticut)

So, for a 1 Mb genome, the difference between the approximation and the actual size is only 0.0017 percent, which makes the k-mer count a very close approximation.
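The arithmetic behind these figures can be reproduced with a short Python sketch. The k-mer size of 18 used for the larger genomes is an assumption inferred from the 0.0017 percent figure (a shortfall of 17 k-mers in 1,000,000 bp); the 14 bp sequence and the function names are hypothetical.

```python
from typing import List


def kmer_count(length: int, k: int) -> int:
    """Number of k-mers in a sequence of the given length: n = (L - k) + 1."""
    if k < 1 or k > length:
        raise ValueError("k must satisfy 1 <= k <= length")
    return length - k + 1


def percent_error(length: int, k: int) -> float:
    """Percent by which the k-mer count underestimates the true length."""
    return 100.0 * (length - kmer_count(length, k)) / length


def kmers(sequence: str, k: int) -> List[str]:
    """List every overlapping k-mer in the sequence, left to right."""
    return [sequence[i:i + k] for i in range(kmer_count(len(sequence), k))]


if __name__ == "__main__":
    # Short fragment: 14 bp with k = 8 gives 7 k-mers, far from 14.
    fragment = "ACGTACGTACGTAC"  # hypothetical 14 bp sequence
    print(len(fragment), len(kmers(fragment, 8)))

    # Larger genomes with an assumed k of 18: the error shrinks quickly.
    for size in (100, 1_000, 10_000, 100_000, 1_000_000):
        n = kmer_count(size, 18)
        print(f"L={size:>9,}  n={n:>9,}  error={percent_error(size, 18):.4f}%")
```

The error is simply (k - 1) / L, so it falls by a factor of ten each time the genome size grows tenfold.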

References

Sun H, Ding J, Piednoël M, Schneeberger K. findGSE: estimating genome size variation within human and Arabidopsis using k-mer frequencies. Bioinformatics. 2018, 34(4).
Nayfach S, Pollard KS. Average genome size estimation improves comparative metagenomics and sheds light on the functional ecology of the human microbiome. Genome Biology. 2015, 16(1).