Although the details of genome annotation pipelines vary, they share a common set of features. Annotation of gene structures across the genome is usually divided into two phases. The first, the 'computation' phase, involves aligning expressed sequence tags (ESTs), proteins, and other data to the genome and generating ab initio and/or evidence-driven gene predictions. In the second, the 'annotation' phase, these data are synthesized into gene annotations. Because this process is intrinsically complex and involves many different tools, the programs that organize the compute data (evidence) and use it to generate genome annotations are referred to as annotation pipelines. Although Ensembl has some functionality for annotating non-coding RNAs (ncRNAs), current pipelines are centered on protein-coding genes.
Figure 1. Basic approaches to genome annotation and some common variations. (Yandell, 2012)
Computation Phase
Repeat identification: The first step in the computation phase of genome annotation is usually repeat identification and masking. Somewhat confusingly, the term "repeat" covers two different types of sequence: low-complexity sequences, such as homopolymeric runs of nucleotides, and transposable (mobile) elements, such as viruses, long interspersed nuclear elements (LINEs), and short interspersed nuclear elements (SINEs). Eukaryotic genomes can contain a large number of repeats. Furthermore, the borders of these repeats are frequently ill-defined: repeats often insert within other repeats, and complete elements are found only rarely. Repeats complicate genome annotation. They must be identified and annotated, but the tools used to identify repeats are distinct from those used to identify the genes of the host genome.
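As a minimal illustration of masking, the sketch below handles only the simplest repeat class mentioned above, homopolymeric runs. The function names and the `min_run` threshold of 8 bases are assumptions chosen for the example; transposable elements need dedicated repeat-finding tools and are not covered here.

```python
import re

def soft_mask_homopolymers(seq: str, min_run: int = 8) -> str:
    """Lowercase every single-base run of at least `min_run` bases.

    The input is uppercased first, so any existing soft masking is discarded
    (a simplification made for this sketch).
    """
    # One base followed by at least (min_run - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: m.group(0).lower(), seq.upper())

def hard_mask_homopolymers(seq: str, min_run: int = 8) -> str:
    """Replace every single-base run of at least `min_run` bases with N."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: "N" * len(m.group(0)), seq.upper())

if __name__ == "__main__":
    example = "ACGT" + "A" * 10 + "CGT"
    print(soft_mask_homopolymers(example))
    print(hard_mask_homopolymers(example))
```

Soft masking keeps the original bases visible to downstream tools, whereas hard masking hides them entirely; which is appropriate depends on the tools that consume the masked sequence.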
Evidence alignment: After repeat masking, most pipelines align proteins, ESTs, and RNA-seq data to the genome assembly. These sequences include previously identified transcripts and proteins from the organism whose genome is being annotated.
Ab initio gene prediction: Gene predictors revolutionized genome analysis when they first became available in the 1990s because they offered a quick and easy way to identify genes in assembled DNA sequences. These tools are known as ab initio gene predictors because they identify genes and determine their intron-exon structures using mathematical models rather than external evidence (such as EST and protein alignments).
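To make the idea of predicting genes from sequence alone concrete, here is a deliberately simplified sketch: a forward-strand open reading frame (ORF) scan. It is not how real ab initio predictors work (they rely on trained mathematical models and handle introns), and the function name `find_orfs` and the `min_codons` threshold are assumptions for illustration only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq: str, min_codons: int = 30) -> list[tuple[int, int]]:
    """Return (start, end) pairs for forward-strand ORFs.

    Coordinates are 0-based and half-open; the stop codon is included.
    An ORF begins at the first ATG in a frame and ends at the next
    in-frame stop codon. Introns and the reverse strand are ignored.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Length in codons, counting the stop codon.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

if __name__ == "__main__":
    toy = "CC" + "ATG" + "GCT" * 4 + "TAA" + "GG"
    print(find_orfs(toy, min_codons=5))
```

The scan uses no alignments or other external data, which is the sense in which it is "ab initio"; everything else about it is a toy.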
Evidence-driven gene prediction: Many gene predictors can also use external evidence, such as EST, RNA-seq, and protein alignments, to improve their predictions; this is what is meant by evidence-driven (as opposed to ab initio) gene prediction. Evidence-driven gene prediction has great potential to improve gene prediction quality in newly sequenced genomes, but it can be difficult to use in practice. ESTs and proteins must first be aligned to the genome, and RNA-seq data, if available, must also be aligned. Splice sites must then be identified, and the evidence must be post-processed before a summary of these data can be passed to the gene finder.
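The post-processing step described above can be sketched as follows, assuming each spliced alignment has already been reduced to a list of aligned blocks (exon-like segments) in 0-based, half-open genome coordinates. The function names, the input layout, and the `min_support` cutoff are hypothetical; the format in which a real gene finder accepts such a summary is tool-specific and is not modeled here.

```python
from collections import Counter

Interval = tuple[int, int]

def introns_from_blocks(blocks: list[Interval]) -> list[Interval]:
    """Infer introns as the gaps between consecutive aligned blocks."""
    ordered = sorted(blocks)
    return [
        (left_end, right_start)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
        if right_start > left_end
    ]

def summarize_intron_support(
    alignments: list[list[Interval]], min_support: int = 2
) -> dict[Interval, int]:
    """Count how many alignments support each intron and keep the
    introns seen at least `min_support` times."""
    counts: Counter = Counter()
    for blocks in alignments:
        counts.update(introns_from_blocks(blocks))
    return {intron: n for intron, n in counts.items() if n >= min_support}

if __name__ == "__main__":
    alignments = [
        [(0, 100), (200, 300)],
        [(50, 100), (200, 250), (400, 500)],
        [(0, 100), (210, 300)],
    ]
    print(summarize_intron_support(alignments))
```

Requiring more than one supporting alignment is one simple way to discard spurious splice junctions before the summary reaches the gene finder; the right threshold depends on the depth and quality of the evidence.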
Annotation Phase
The ultimate goal of annotation efforts is to create a final set of gene annotations by combining alignment-based evidence with ab initio gene predictions. This was previously done manually, with human genome annotators reviewing the evidence for each gene to determine intron-exon structures. Although this produces high-quality annotation, it is so time-consuming that smaller genome projects are increasingly being forced to rely on automated annotations due to budget constraints. There are almost as many methods for constructing automated annotations as there are annotation pipelines, but the common thread is that evidence is used to enhance the precision of gene models, usually through a combination of pre- and post-processing.
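One of the many possible ways to let evidence choose among competing gene models can be sketched as follows: score each candidate by the fraction of its exon bases covered by aligned evidence and keep the best one. This is an illustrative assumption, not the method of any particular pipeline; the function names and the dictionary layout are hypothetical, coordinates are 0-based and half-open, and exons within a model are assumed not to overlap.

```python
Interval = tuple[int, int]

def merge_intervals(intervals: list[Interval]) -> list[Interval]:
    """Merge overlapping or touching intervals into a sorted list."""
    merged: list[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def supported_fraction(exons: list[Interval], evidence: list[Interval]) -> float:
    """Fraction of exon bases that overlap at least one evidence interval."""
    merged_evidence = merge_intervals(evidence)
    total = sum(end - start for start, end in exons)
    covered = 0
    for start, end in exons:
        for ev_start, ev_end in merged_evidence:
            covered += max(0, min(end, ev_end) - max(start, ev_start))
    return covered / total if total else 0.0

def pick_best_model(models: dict[str, list[Interval]], evidence: list[Interval]) -> str:
    """Return the name of the gene model best supported by the evidence."""
    return max(models, key=lambda name: supported_fraction(models[name], evidence))

if __name__ == "__main__":
    evidence = [(0, 100), (50, 120), (200, 300)]
    models = {
        "pred_a": [(0, 100), (200, 300)],
        "pred_b": [(0, 100), (150, 300)],
    }
    print(pick_best_model(models, evidence))
```

In this toy data set, `pred_b` extends its second exon into a region with no aligned evidence, so it scores lower than `pred_a`. A real pipeline would weigh additional signals, such as splice-site agreement, rather than base coverage alone.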
Eukaryotic genome annotation detects both unknown and rare transcripts, provides quantitative analysis of transcripts, reveals differences in gene expression levels between samples, and supports structural analysis for discovering alternative splicing sites, gene fusions, SNPs, and InDel sites. It is also important for revealing the regulation of gene expression in eukaryotic cells, exploring host-pathogen interactions, and monitoring disease progression.
References