The evolution of groups is studied at two levels of descent: phylogeny and population genetics. Phylogeny focuses on the evolution of species and of higher taxonomic ranks. Population genetics, on the other hand, considers the evolution of groups below the species level. Elements of phylogeny were seen as early as the works of Charles Darwin, where varying homologous characters were compared. The most useful characters are selectively neutral, which means that their variation tracks the passage of time rather than evolutionary forces such as selection. In 1965, it was recognized that nucleotide and protein sequences record the evolutionary history of an organism; this idea was developed further and became the basis of phylogeny. Population genetics is more computational, since it is used to simulate evolution, and it also takes the distances between mutations into account.
A phylogeny can be reconstructed either by searching for the best tree or by computing pairwise distances and then building a tree from them. Scoring trees is difficult without an alignment; the exception is alignment-free methods in which the sequence data are partitioned according to their k-mer distribution without first computing a distance matrix. Distance-based clustering uses either neighbor-joining algorithms or algorithms that minimize the difference between the tree and the data. Pairwise distances can also be used to build the tree along which a multiple sequence alignment is assembled. Alignment-free distance computation is either gene-based or residue-based. Gene-based methods still rely on alignment to find homologous genes. Residue-based methods take unannotated DNA sequences as input and are based either on word counts or on the lengths of exact matches.
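The distance-based route described above can be illustrated with a minimal sketch of neighbor-joining in its standard textbook form. The function name `neighbor_joining`, the internal node labels `node0`, `node1`, ... and the toy distance matrix are assumptions made for illustration only; they do not come from any particular tool or data set, and real implementations are considerably more optimized.

```python
def neighbor_joining(names, dist):
    """Build an unrooted tree from a symmetric distance matrix.

    names: list of taxon labels; dist: matching list of lists.
    Returns a list of edges as (parent, child, branch_length).
    Illustrative sketch only: cubic-time loops, no input validation.
    """
    nodes = list(names)
    # Store distances in a dict keyed by label pairs so nodes can be merged easily.
    d = {(a, b): float(dist[i][j])
         for i, a in enumerate(names) for j, b in enumerate(names)}
    edges = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums of the current distance matrix (d[a, a] is 0).
        totals = {a: sum(d[a, b] for b in nodes) for a in nodes}
        # Choose the pair that minimizes Q = (n - 2) * d(a, b) - total(a) - total(b).
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * d[a, b] - totals[a] - totals[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the new internal node to each joined node.
        len_a = 0.5 * d[a, b] + (totals[a] - totals[b]) / (2 * (n - 2))
        len_b = d[a, b] - len_a
        u = f"node{counter}"  # hypothetical label for the new internal node
        counter += 1
        edges.append((u, a, len_a))
        edges.append((u, b, len_b))
        # Distances from the new node to every remaining node.
        for c in nodes:
            if c not in (a, b):
                d[u, c] = d[c, u] = 0.5 * (d[a, c] + d[b, c] - d[a, b])
        d[u, u] = 0.0
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    # Connect the last two nodes with a single edge.
    a, b = nodes
    edges.append((a, b, d[a, b]))
    return edges


# Toy matrix for five taxa (made-up illustration, not data from the text).
toy_names = ["a", "b", "c", "d", "e"]
toy_dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
for parent, child, length in neighbor_joining(toy_names, toy_dist):
    print(parent, child, length)
```

The input here is only a distance matrix, so the same function could in principle be fed distances obtained with or without an alignment.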
With word counts, the difference between sets of sequences is quantified by comparing the frequencies of overlapping words of length k (k-mers). This approach is best suited to estimating the number of mutations. However, counting words lacks the power to resolve and compare closely related sequences. This can be addressed with the feature frequency profile, a refined version of word counting that improves the recovered topology. Co-phylog can estimate branch lengths in addition to the topology.
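As a rough illustration of the word-count idea, the sketch below builds normalized frequency profiles of overlapping k-mers and compares two sequences by the Euclidean distance between those profiles. The function names `kmer_profile` and `kmer_distance`, the default `k=3`, and the choice of Euclidean distance are assumptions made for this example; it is not the feature frequency profile method or Co-phylog, which use their own, more refined measures.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k):
    """Return the relative frequency of every overlapping k-mer in seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def kmer_distance(seq_a, seq_b, k=3):
    """Euclidean distance between the k-mer frequency profiles of two sequences."""
    prof_a = kmer_profile(seq_a, k)
    prof_b = kmer_profile(seq_b, k)
    words = set(prof_a) | set(prof_b)
    # A k-mer missing from one profile contributes a frequency of zero.
    return sqrt(sum((prof_a.get(w, 0.0) - prof_b.get(w, 0.0)) ** 2 for w in words))


# Toy sequences (made up for illustration).
print(kmer_distance("ACGTACGTAC", "ACGTACGTAC"))  # identical profiles give 0.0
print(kmer_distance("ACGTACGTAC", "ACGTTCGTAC"))
```

The choice of k is a trade-off: short words are shared by chance between unrelated sequences, while long words are rarely shared even between related ones.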
With match lengths, the lengths of exact matches, which are longer between closely related sequences, are computed by linear-time algorithms based on suffix trees, a generalized index of the input sequences. This method best resolves the distribution of mutations. In a suffix tree, each suffix is the concatenation of the labels on the path from the root to a leaf. The $ at the end of the sequence is a ‘sentinel character’ that differs from all other characters in the string. It ensures that a suffix that is also a prefix of another suffix is still represented by a leaf in the suffix tree. A suffix array enhanced with an array of longest common prefix lengths can simulate suffix tree operations.
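To make the suffix array and its longest common prefix (LCP) array concrete, here is a minimal sketch. It sorts suffixes naively instead of using the linear-time constructions mentioned above, so it is only suitable for short examples; the function names `suffix_array` and `lcp_array` and the toy sequence are assumptions made for illustration.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order.

    Naive construction by sorting; fine for short strings only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[i] is the length of the longest common prefix of the suffixes
    starting at sa[i - 1] and sa[i]; lcp[0] is 0 by convention."""
    lcp = [0] * len(sa)
    for i in range(1, len(sa)):
        a, b = sa[i - 1], sa[i]
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[i] = n
    return lcp


# Toy sequence with the sentinel '$' appended; '$' sorts before A, C, G, T.
text = "GATTACA$"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
for pos, shared in zip(sa, lcp):
    print(pos, shared, text[pos:])
```

Because the sentinel occurs nowhere else, no suffix is a prefix of another, and every suffix gets its own entry. Large values in the LCP array mark long exact repeats, which is the kind of information match-length methods draw on.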
Population genetics identifies the evolutionary forces acting on interbreeding groups of organisms. These forces are mutation, recombination, and selection. The spatial distribution of mutations is taken into consideration. Population structure and size are dynamic and influence these forces. For example, sudden environmental changes, bottlenecks, and the founder effect drastically affect the genomic content of a population. Although still few, alignment-free methods are slowly gaining momentum in this field and should become more relevant once de novo assembly of such data becomes feasible.
The bioinformatics analysis department of CD Genomics provides novel solutions for data-driven innovation aimed at discovering the hidden potential in biological data, tapping new insights related to life science research, and predicting new prospects.
References