Bioinformatics Basics: The Identification of Genome Repeats

Overview

Genome repeats are sequences that occur in multiple copies, often at different locations throughout the genome. Because reads from distinct copies of a repeat are very similar and assembly tools cannot tell them apart, the quantity and distribution of repeats in a genome have a large impact on assembly results. Repeats can cause mis-assemblies, in which regions of the genome that are far apart are joined incorrectly, or an inaccurate estimate of the size or copy number of the repeats themselves. High repeat content often results in a fragmented assembly, because assemblers cannot determine the correct arrangement of these regions and simply stop extending contigs at the repeat boundaries. To resolve a repeat, reads must be long enough to span it and include the unique sequences flanking it. If you know you will be working with a repeat-rich genome, it may be wise to order data from a long-read technology.
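The spanning requirement can be illustrated with a minimal sketch. The coordinates, the function name, and the 20-base minimum flank are assumptions chosen for illustration, not values taken from any assembler.

```python
def read_spans_repeat(read_start, read_end, repeat_start, repeat_end, min_flank=20):
    """Return True if the read covers the whole repeat plus at least
    min_flank unique bases on each side.

    Coordinates are 0-based and half-open: [start, end).
    min_flank is an illustrative threshold, not a recommended value.
    """
    left_flank = repeat_start - read_start
    right_flank = read_end - repeat_end
    return left_flank >= min_flank and right_flank >= min_flank


# A hypothetical 6 kb repeat at positions 10,000-16,000.
repeat = (10_000, 16_000)

short_read = (10_500, 10_650)  # 150 bp read lying entirely inside the repeat
long_read = (8_000, 18_000)    # 10 kb read covering the repeat and both flanks

print(read_spans_repeat(*short_read, *repeat))  # False
print(read_spans_repeat(*long_read, *repeat))   # True
```

The short read cannot be placed unambiguously because it contains only repeat sequence, whereas the long read anchors the repeat to unique sequence on both sides.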

Genome Repeats Identification

Repeats in the genome sequence can confound gene annotation and hamper genome assembly, so it is critical to annotate and classify them properly. The proportion of repeats in a genome ranges from a few percent (about 3% in the yeast Saccharomyces cerevisiae) to most of the genome (>80% in maize). In the human genome, repeats account for about 45% of the sequence.
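Once repeats have been annotated, a percentage like those above can be estimated directly from the sequence. The sketch below assumes a soft-masked sequence, a common convention in which repeat bases are written in lowercase; the function name and the toy sequence are hypothetical.

```python
def repeat_fraction(sequence):
    """Fraction of unambiguous bases (A/C/G/T) that are soft-masked.

    Assumes lowercase letters mark repeats. Ambiguous bases such as N
    are ignored so that assembly gaps do not distort the estimate.
    """
    masked = sum(1 for base in sequence if base in "acgt")
    unmasked = sum(1 for base in sequence if base in "ACGT")
    total = masked + unmasked
    return masked / total if total else 0.0


toy_sequence = "ACGTACGTacgtacgtACGTNNNN"
print(f"{repeat_fraction(toy_sequence):.1%}")  # 40.0%
```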

Helitron Identification

Helitrons are rolling-circle transposons that lack terminal inverted repeats and do not create target site duplications, which makes them difficult to detect. Because helitrons show poor overall sequence conservation, HelitronScanner was developed to identify the DNA motifs conserved in known helitrons and then use those motifs to locate candidate helitrons in your genome.

HelitronScanner is essentially a four-step procedure:
1. Identify 5' helitron ends by comparing the genome to known 5' end sequences.
2. Identify 3' helitron ends by comparing the genome to known 3' end sequences.
3. Pair the 5' and 3' ends to define candidate helitrons.
4. Write a FASTA sequence for each candidate helitron.
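Steps 3 and 4 can be pictured with the toy sketch below. It is not HelitronScanner's actual algorithm: the pairing rule (nearest downstream 3' end within a length window), the score handling, the length limits, and all names are assumptions made for illustration.

```python
def pair_helitron_ends(heads, tails, min_len=200, max_len=20_000):
    """Pair each 5' end with the nearest downstream 3' end whose distance
    falls within [min_len, max_len].

    heads and tails are lists of (position, score) tuples on one strand.
    Returns a list of (head_pos, tail_pos, combined_score) tuples.
    The length limits are hypothetical defaults.
    """
    pairs = []
    sorted_tails = sorted(tails)
    for head_pos, head_score in sorted(heads):
        for tail_pos, tail_score in sorted_tails:
            length = tail_pos - head_pos
            if length < min_len:
                continue  # tail is upstream of, or too close to, the head
            if length > max_len:
                break     # every later tail is even farther away
            pairs.append((head_pos, tail_pos, head_score + tail_score))
            break         # keep only the nearest qualifying tail
    return pairs


def extract_fasta(genome, pairs, name="candidate"):
    """Format each paired region of the genome string as a FASTA record."""
    records = []
    for i, (start, end, score) in enumerate(pairs, 1):
        header = f">{name}_{i} {start}-{end} score={score}"
        records.append(f"{header}\n{genome[start:end]}")
    return "\n".join(records)


# Hypothetical end predictions: (position, score).
heads = [(100, 7), (5000, 6)]
tails = [(150, 5), (1300, 8), (40000, 9)]

pairs = pair_helitron_ends(heads, tails)
print(pairs)  # [(100, 1300, 15)]

toy_genome = "ACGT" * 500  # 2,000 bp placeholder sequence
print(extract_fasta(toy_genome, pairs).splitlines()[0])
```

The 5' end at position 5000 stays unpaired because its only downstream 3' end lies beyond the maximum length.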

DNA Transposon Borders Identification

Inverted Repeats Finder is mainly used to identify inverted repeats in the genome, but it can also help define full-length DNA transposons when used in conjunction with other repeat prediction applications. To see whether the inverted repeats it reports are the terminal inverted repeats of DNA transposons, intersect their coordinates (for example, with bedtools intersect) with another repeat data set from the same genome, such as the output of REPET or RepeatModeler.
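The idea behind the intersection can be sketched in a few lines. This is a simplified stand-in for what an interval tool does, using BED-style tuples; the function names, the 10-base tolerance, and the example annotations are all hypothetical.

```python
def overlapping_pairs(inverted_repeats, transposons):
    """Return (inverted_repeat, transposon) pairs that overlap on the
    same sequence.

    Intervals are BED-style tuples (chrom, start, end, name), with
    0-based, half-open coordinates. This is a quadratic toy version;
    a dedicated interval tool is far more efficient on real data.
    """
    hits = []
    for ir in inverted_repeats:
        for te in transposons:
            same_chrom = ir[0] == te[0]
            overlaps = ir[1] < te[2] and te[1] < ir[2]
            if same_chrom and overlaps:
                hits.append((ir, te))
    return hits


def is_terminal(ir, te, tolerance=10):
    """True if the inverted repeat sits at either end of the element,
    within an assumed tolerance in bases."""
    at_left = abs(ir[1] - te[1]) <= tolerance
    at_right = abs(ir[2] - te[2]) <= tolerance
    return at_left or at_right


# Hypothetical annotations.
inverted_repeats = [
    ("chr1", 1000, 1030, "IR_left"),
    ("chr1", 3970, 4000, "IR_right"),
    ("chr1", 2500, 2530, "IR_internal"),
    ("chr2", 500, 530, "IR_other"),
]
transposons = [
    ("chr1", 1000, 4000, "DNA_candidate"),
    ("chr2", 9000, 9500, "LTR_candidate"),
]

for ir, te in overlapping_pairs(inverted_repeats, transposons):
    label = "terminal" if is_terminal(ir, te) else "internal"
    print(ir[3], te[3], label)
```

Inverted repeats that overlap a predicted element and sit at its boundaries are the ones worth examining as candidate terminal inverted repeats.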

LTR Retrotransposons Identification

LTR_FINDER is a program that characterizes LTR retrotransposons in a genome in great detail. The algorithm was designed to identify the structural features shared by LTR retrotransposons. LTR_FINDER is particularly effective at finding pairs of LTRs in the genome, and it uses PROSITE to annotate protein-coding domains in the intervening sequences. It does not provide a comprehensive analysis of the repeats in a genome (as RepeatModeler or REPET do), but it gives a focused view of LTR retroelements.

De novo Repeat Identification

RepeatModeler is a de novo repeat-detection program that generates a library of repeat family sequences, which RepeatMasker can then use to mask repeats in a genome. Keep in mind that it can take a long time to process large genomes. You must also set the RepeatModeler parameters correctly to obtain repeats that are both grouped by family and annotated.
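To make the masking step concrete, here is a minimal sketch of soft-masking, in which repeat bases are converted to lowercase. It only illustrates the end result of masking and does not reproduce how RepeatMasker finds repeats; the function name and the intervals are hypothetical.

```python
def soft_mask(sequence, repeat_intervals):
    """Lowercase every base inside the given repeat intervals.

    Intervals are (start, end) tuples with 0-based, half-open
    coordinates. Bases outside the intervals are returned in uppercase.
    """
    bases = list(sequence.upper())
    for start, end in repeat_intervals:
        for i in range(max(start, 0), min(end, len(bases))):
            bases[i] = bases[i].lower()
    return "".join(bases)


print(soft_mask("ACGTACGTACGT", [(2, 5), (8, 10)]))  # ACgtaCGTacGT
```

Soft-masking keeps the repeat sequence available to downstream tools while flagging it, whereas hard-masking would replace those bases with N.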

Tandem Repeats Detection

There are a variety of methods for detecting tandem duplications (TDs) in a genome, each with its own capabilities. Some applications can detect more ancient tandem duplications, while others detect only recent, highly similar TDs. Redtandem essentially links smaller matches together to identify more diverged, ancient TDs. MUMmer performs a genome self-alignment that quickly identifies recent TDs but misses ancient TDs that have diverged in nucleotide sequence.
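The difference between recent and diverged copies can be seen with a toy detector that accepts only exact, adjacent copies. This is a short-motif illustration under assumed parameters, not the method used by Redtandem or MUMmer; the function name and thresholds are hypothetical.

```python
def find_tandem_repeats(sequence, min_unit=2, max_unit=6, min_copies=3):
    """Find exact tandem repeats by comparing each unit to the sequence
    immediately after it.

    Returns a list of (start, unit, copies) tuples. Only perfect copies
    are counted, so a single mismatch ends the run.
    """
    hits = []
    i = 0
    while i < len(sequence):
        best = None
        for unit_len in range(min_unit, max_unit + 1):
            unit = sequence[i:i + unit_len]
            if len(unit) < unit_len:
                break  # ran off the end of the sequence
            copies = 1
            while sequence[i + copies * unit_len:i + (copies + 1) * unit_len] == unit:
                copies += 1
            longer = best is None or copies * unit_len > best[1] * len(best[0])
            if copies >= min_copies and longer:
                best = (unit, copies)
        if best:
            unit, copies = best
            hits.append((i, unit, copies))
            i += len(unit) * copies  # skip past the reported repeat
        else:
            i += 1
    return hits


recent = "GG" + "CAT" * 4 + "TT" + "AC" * 4 + "AG"
print(find_tandem_repeats(recent))    # [(2, 'CAT', 4), (16, 'AC', 4)]

diverged = "CATCATCGTCAT"  # third copy carries one substitution
print(find_tandem_repeats(diverged))  # []
```

The single substitution in the second sequence is enough to hide the repeat from an exact-match approach, which is why diverged, ancient duplications call for methods that tolerate mismatches.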


