What is Sequence Alignment?
Sequence alignment is a central method in bioinformatics for comparing DNA, RNA, or protein sequences. A common use is to compare a newly determined biological sequence with known sequences stored in a database. Alignments reveal the similarities and differences between sequences and help in understanding their functional, structural, and evolutionary relationships.
Types of Sequence Alignment
Based on Sequence Length
- Global Alignment: Global alignment compares two sequences in their entirety, from beginning to end. It is most effective when the sequences are of approximately the same length and are expected to be similar throughout. The goal is to maximize overall similarity across the full length of both sequences, so even if some regions align poorly, the algorithm still returns the best possible end-to-end alignment.
- Local Alignment: Unlike global alignment, local alignment does not attempt to compare the entire length of the sequence. Instead, it aims to identify the regions of the sequence that exhibit the highest similarity. This makes it particularly useful when dealing with sequences of different lengths or sequences that are similar only in certain regions.
Fig. 1. Global and Local Alignment of two sequences. (Mount, D. W., 2001)
Based on Sequence Number
- Pairwise Sequence Alignment: Pairwise Sequence Alignment (PSA) involves comparing only two sequences to identify regions of similarity or dissimilarity. This process helps to trace the evolutionary path of the sequences, which in turn helps to understand their function and structure.
- Multiple Sequence Alignment: In contrast, multiple sequence alignment (MSA) compares more than two sequences simultaneously. MSA is mainly used to discover gene families, protein families, and enzyme active sites, and to assess functional and evolutionary relationships.
Pairwise Sequence Alignment Methods
Dot Plotting
The dot matrix method, or dot plot, provides a visual representation of a sequence comparison. The two sequences are placed along the two axes of a matrix, and a dot is drawn at every position where they match. Runs of dots along a diagonal mark regions of similarity, so this graphical technique reveals shared motifs and repetitive elements, and it is easiest to read when the sequences are highly similar.
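As a rough illustration of the dot matrix idea described above, the following Python sketch marks matching positions between two short sequences and prints them as a text grid. The function names `dot_plot` and `render`, the `window` parameter, and the example sequences are assumptions made for this sketch, not part of any particular tool.

```python
def dot_plot(seq_a, seq_b, window=1):
    """Return the set of (i, j) where seq_a[i:i+window] equals seq_b[j:j+window]."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if seq_a[i:i + window] == seq_b[j:j + window]:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + " ".join(seq_a)]
    for j, symbol in enumerate(seq_b):
        row = ["*" if (i, j) in dots else "." for i in range(len(seq_a))]
        lines.append(symbol + " " + " ".join(row))
    return "\n".join(lines)


if __name__ == "__main__":
    a, b = "GATTACA", "GATCACA"  # hypothetical example sequences
    print(render(a, b, dot_plot(a, b)))
```

Raising `window` above 1 requires several consecutive matches before a dot is drawn, which is one simple way to suppress scattered chance matches.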
Dynamic Programming
Dynamic programming is a rigorous method for discovering the best alignment between two sequences. It uses algorithms such as the Needleman-Wunsch algorithm for global alignment and the Smith-Waterman algorithm for local alignment. The technique uses a scoring system to assign values for matches, mismatches, and gaps and ultimately selects the path with the highest cumulative score as the best alignment.
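The scoring recurrence can be sketched in a few lines of Python. This is a minimal version that returns only the best score, assuming a linear gap penalty and the illustrative values match = 1, mismatch = -1, gap = -2; the function name `alignment_score` and the `local` switch are choices made for this sketch. With `local=False` it fills the table in the Needleman-Wunsch manner, and with `local=True` it applies the Smith-Waterman rule of never letting a cell drop below zero.

```python
def alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score by dynamic programming (score only, no traceback)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap

    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
            if local:
                cell = max(cell, 0)          # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell

    # Global: score of the full-length alignment; local: best cell anywhere.
    return best if local else score[-1][-1]
```

For example, `alignment_score("GATTACA", "GATACA")` evaluates to 4 under these values: six matches and one gap. Recovering the alignment itself would require storing which of the three moves produced each cell and tracing back from the final cell (global) or from the best cell (local).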
Word or k-tuple methods
The word or k-tuple method is a heuristic used by database search tools such as FASTA and BLAST to compare sequences quickly. It first finds short identical stretches, or "words", shared by the two sequences, and then uses dynamic programming to build the alignment around these words.
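The word-finding step can be sketched as follows. This is a simplified illustration and not the actual FASTA or BLAST implementation: the names `index_words`, `find_seeds`, and `best_diagonal`, and the word size `k=3`, are assumptions for this example. It indexes every word of length k in one sequence, looks up the words of the other, and reports the diagonal (offset between the two positions) that collects the most word hits, which is where a later dynamic programming step would be focused.

```python
from collections import defaultdict


def index_words(sequence, k):
    """Map each word of length k to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where an identical k-word occurs."""
    subject_index = index_words(subject, k)
    seeds = []
    for q_pos in range(len(query) - k + 1):
        word = query[q_pos:q_pos + k]
        for s_pos in subject_index.get(word, []):
            seeds.append((q_pos, s_pos))
    return seeds


def best_diagonal(seeds):
    """Return (diagonal, hit_count) for the offset with the most seeds, or None."""
    counts = defaultdict(int)
    for q_pos, s_pos in seeds:
        counts[s_pos - q_pos] += 1
    if not counts:
        return None
    return max(counts.items(), key=lambda item: item[1])
```

Because the index is built once and each lookup is a dictionary access, the cost grows roughly with the sequence lengths rather than with their product, which is the source of the speed advantage over full dynamic programming.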
Multiple Sequence Alignment Methods
Exhaustive Algorithm
The exhaustive algorithm examines all possible alignments, which makes it computationally expensive. Although theoretically rigorous, it becomes impractical for large datasets because the search space grows exponentially with the number of sequences.
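To give a sense of scale, the sketch below counts the cells in the dynamic programming lattice when the pairwise scoring table is extended to one dimension per sequence. It assumes the exhaustive method is implemented as such a multidimensional table; the function name `lattice_cells` is made up for this example.

```python
from math import prod


def lattice_cells(lengths):
    """Number of cells in a DP lattice with one axis per sequence.

    Each axis has length + 1 positions (the extra one is the empty prefix).
    """
    return prod(n + 1 for n in lengths)


if __name__ == "__main__":
    for count in (2, 3, 5, 10):
        # Sequences of length 100 are an arbitrary choice for illustration.
        print(count, "sequences:", lattice_cells([100] * count), "cells")
```

Two sequences of length 100 need 10,201 cells, three need 1,030,301, and every additional sequence multiplies the total by another factor of 101.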
Heuristic Algorithms
Heuristic algorithms provide an efficient alternative to exhaustive methods. These include progressive, iterative, and block-based methods.
- Progressive alignment: Progressive alignment methods first align all pairs of sequences and use the resulting similarity scores to build a guide tree. The multiple alignment is then assembled stepwise, starting with the most closely related sequences and gradually adding more distant ones until all sequences are included.
Fig. 2. Consistency in progressive alignment. (Batzoglou S, 2005)
- Iterative Alignment: Iterative methods refine an initial suboptimal alignment by repeatedly modifying it until no further improvement is found. PRRN is a tool that uses an iterative approach, starting from an initial alignment and generating weighted trees that guide subsequent rounds of realignment.
- Block-based alignment: The block-based approach identifies ungapped aligned blocks shared by all sequences. It is particularly useful for identifying conserved domains and motifs in highly divergent sequences of different lengths.
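The first stage of progressive alignment, deciding the order in which sequences are joined, can be sketched as below. This is a simplified illustration under stated assumptions: `difflib.SequenceMatcher.ratio()` from the Python standard library stands in for a real pairwise alignment score, clusters are merged by average similarity, and the names `similarity` and `guide_tree` are hypothetical. It produces only the nested join order, not the multiple alignment itself.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Stand-in for a pairwise alignment score (assumption for this sketch)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def guide_tree(seqs):
    """Build a join order from a dict of name -> sequence.

    Returns nested tuples; the innermost pair is joined first.
    """
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in seqs]

    def average_similarity(pair):
        (_, left), (_, right) = pair
        total = sum(similarity(seqs[x], seqs[y]) for x in left for y in right)
        return total / (len(left) * len(right))

    while len(clusters) > 1:
        # Join the two clusters that are most similar on average.
        first, second = max(combinations(clusters, 2), key=average_similarity)
        clusters = [c for c in clusters if c is not first and c is not second]
        clusters.append(((first[0], second[0]), first[1] + second[1]))

    return clusters[0][0]


if __name__ == "__main__":
    # Hypothetical example sequences.
    example = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG"}
    print(guide_tree(example))
```

In this example the two near-identical sequences `a` and `b` are joined first and the divergent sequence `c` is added last, mirroring the closest-first order described for progressive alignment.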
Applications of Sequence Alignment
Sequence comparison has a variety of uses in molecular biology, including but not limited to:
- Identifying unknown sequences: An unknown sequence can be identified by comparing it with known sequences in a database.
- Predicting new members of gene families: Similarity to known family members can indicate that a newly determined sequence belongs to the same gene family.
- Discovering evolutionary relationships: Phylogenetic trees constructed using sequence alignment can provide insight into the evolutionary relationships between sequences.
- Predicting Regulatory Regions: Sequence alignment can predict the location and function of protein coding and transcriptional regulatory regions in genomic DNA.
- Identifying structurally or functionally similar regions: By highlighting conserved stretches, sequence alignment can help identify structurally or functionally similar regions in DNA, RNA, or protein sequences.
The Best Tool or Software for Sequence Alignment
A variety of tools and software are available for sequence alignment, and the right choice depends on the specific task. Some of the most commonly used are BLAST and FASTA, which search a query sequence against a database, and GeneWise, which compares a protein sequence with genomic DNA. These programs have been widely adopted by the bioinformatics community.
References:
- Mount, D. W. (2001) Bioinformatics: sequence and genome analysis. Cold Spring Harbor Laboratory Press.
- Batzoglou, S. (2005) The many faces of sequence alignment. Briefings in Bioinformatics, 6(1), 6-22.