Hubbry Logo
logo
Sequence alignment
Community hub

Sequence alignment

logo
0 subscribers
Read side by side
from Wikipedia

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.[1] Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.

A sequence alignment, produced by ClustalO, of mammalian histone proteins.
Sequences are the amino acids for residues 120-180 of the proteins. Residues that are conserved across all sequences are highlighted in grey. Below the protein sequences is a key denoting conserved sequence (*), conservative mutations (:), semi-conservative mutations (.), and non-conservative mutations ( ).[2]

Interpretation

[edit]

If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between amino acids occupying a particular position in the sequence can be interpreted as a rough measure of how conserved a particular region or sequence motif is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose side chains have similar biochemical properties) in a particular region of the sequence, suggest [3] that this region has structural or functional importance. Although DNA and RNA nucleotide bases are more similar to each other than are amino acids, the conservation of base pairs can indicate a similar functional or structural role.

Alignment methods

[edit]

Very short or very similar sequences can be aligned by hand. However, most interesting problems require the alignment of lengthy, highly variable or extremely numerous sequences that cannot be aligned solely by human effort. Various algorithms were devised to produce high-quality sequence alignments, and occasionally in adjusting the final results to reflect patterns that are difficult to represent algorithmically (especially in the case of nucleotide sequences). Computational approaches to sequence alignment generally fall into two categories: global alignments and local alignments. Calculating a global alignment is a form of global optimization that "forces" the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity.[4] A variety of computational algorithms have been applied to the sequence alignment problem. These include slow but formally correct methods like dynamic programming. These also include efficient, heuristic algorithms or probabilistic methods designed for large-scale database search, that do not guarantee to find best matches.

Representations

[edit]

Alignments are commonly represented both graphically and in text format. In almost all sequence alignment representations, sequences are written in rows arranged so that aligned residues appear in successive columns. In text formats, aligned columns containing identical or similar characters are indicated with a system of conservation symbols. As in the image above, an asterisk or pipe symbol is used to show identity between two columns; other less common symbols include a colon for conservative substitutions and a period for semiconservative substitutions. Many sequence visualization programs also use color to display information about the properties of the individual sequence elements; in DNA and RNA sequences, this equates to assigning each nucleotide its own color. In protein alignments, such as the one in the image above, color is often used to indicate amino acid properties to aid in judging the conservation of a given amino acid substitution. For multiple sequences the last row in each column is often the consensus sequence determined by the alignment; the consensus sequence is also often represented in graphical format with a sequence logo in which the size of each nucleotide or amino acid letter corresponds to its degree of conservation.[5]

Sequence alignments can be stored in a wide variety of text-based file formats, many of which were originally developed in conjunction with a specific alignment program or implementation. Most web-based tools allow a limited number of input and output formats, such as FASTA format and GenBank format and the output is not easily editable. Several conversion programs that provide graphical and/or command line interfaces are available, such as READSEQ[6] and EMBOSS. There are also several programming packages which provide this conversion functionality, such as BioPython, BioRuby and BioPerl. The SAM/BAM files use the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string format to represent an alignment of a sequence to a reference by encoding a sequence of events (e.g. match/mismatch, insertions, deletions).[7]

CIGAR Format

[edit]

Ref.  : GTCGTAGAATA
Read: CACGTAG—TA
CIGAR: 2S5M2D2M where:
2S = 2 soft clipping (could be mismatches, or a read longer than the matched sequence)
5M = 5 matches or mismatches
2D = 2 deletions
2M = 2 matches or mismatches

The original CIGAR format from the exonerate alignment program did not distinguish between mismatches or matches with the M character.

The SAMv1 spec document defines newer CIGAR codes. In most cases it is preferred to use the '=' and 'X' characters to denote matches or mismatches rather than the older 'M' character, which is ambiguous.

CIGAR Code BAM Integer Description Consumes query Consumes reference
M 0 alignment match (can be a sequence match or mismatch) yes yes
I 1 insertion to the reference yes no
D 2 deletion from the reference no yes
N 3 skipped region from the reference no yes
S 4 soft clipping (clipped sequences present in SEQ) yes no
H 5 hard clipping (clipped sequences NOT present in SEQ) no no
P 6 padding (silent deletion from padded reference) no no
= 7 sequence match yes yes
X 8 sequence mismatch yes yes
  • "Consumes query" and "consumes reference" indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.
  • H can only be present as the first and/or last operation.
  • S may only have H operations between them and the ends of the CIGAR string.
  • For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined.
  • Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ

Global and local alignments

[edit]

Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot start and/or end in gaps.) A general global alignment technique is the Needleman–Wunsch algorithm, which is based on dynamic programming. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith–Waterman algorithm is a general local alignment method based on the same dynamic programming scheme but with additional choices to start and end at any place.[4]

Hybrid methods, known as semi-global or "glocal" (short for global-local) methods, search for the best possible partial alignment of the two sequences (in other words, a combination of one or both starts and one or both ends is stated to be aligned). This can be especially useful when the downstream part of one sequence overlaps with the upstream part of the other sequence. In this case, neither global nor local alignment is entirely appropriate: a global alignment would attempt to force the alignment to extend beyond the region of overlap, while a local alignment might not fully cover the region of overlap.[8] Another case where semi-global alignment is useful is when one sequence is short (for example a gene sequence) and the other is very long (for example a chromosome sequence). In that case, the short sequence should be globally (fully) aligned but only a local (partial) alignment is desired for the long sequence.

Fast expansion of genetic data challenges speed of current DNA sequence alignment algorithms. Essential needs for an efficient and accurate method for DNA variant discovery demand innovative approaches for parallel processing in real time. Optical computing approaches have been suggested as promising alternatives to the current electrical implementations, yet their applicability remains to be tested [1].

Pairwise alignment

[edit]

Pairwise sequence alignment methods are used to find the best-matching piecewise (local or global) alignments of two query sequences. Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high similarity to a query). The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods;[1] however, multiple sequence alignment techniques can also align pairs of sequences. Although each method has its individual strengths and weaknesses, all three pairwise methods have difficulty with highly repetitive sequences of low information content - especially where the number of repetitions differ in the two sequences to be aligned.

Maximal unique match

[edit]

One way of quantifying the utility of a given pairwise alignment is the 'maximal unique match' (MUM), or the longest subsequence that occurs in both query sequences. Longer MUM sequences typically reflect closer relatedness [9] in the multiple sequence alignment of genomes in computational biology. Identification of MUMs and other potential anchors, is the first step in larger alignment systems such as MUMmer. Anchors are the areas between two genomes where they are highly similar. To understand what a MUM is we can break down each word in the acronym. Match implies that the substring occurs in both sequences to be aligned. Unique means that the substring occurs only once in each sequence. Finally, maximal states that the substring is not part of another larger string that fulfills both prior requirements. The idea behind this, is that long sequences that match exactly and occur only once in each genome are almost certainly part of the global alignment.

More precisely:

"Given two genomes A and B, Maximal Unique Match (MUM) substring is a common substring of A and B of length longer than a specified minimum length d (by default d= 20) such that

  • it is maximal, that is, it cannot be extended on either end without incurring a mismatch; and
  • it is unique in both sequences"[10]

Dot-matrix methods

[edit]
Self comparison of a part of a mouse strain genome. The dot-plot shows a patchwork of lines, demonstrating duplicated segments of DNA.
A DNA dot plot of a human zinc finger transcription factor (GenBank ID NM_002383), showing regional self-similarity. The main diagonal represents the sequence's alignment with itself; lines off the main diagonal represent similar or repetitive patterns within the sequence. This is a typical example of a recurrence plot.

The dot-matrix approach, which implicitly produces a family of alignments for individual sequence regions, is qualitative and conceptually simple, though time-consuming to analyze on a large scale. In the absence of noise, it can be easy to visually identify certain sequence features—such as insertions, deletions, repeats, or inverted repeats—from a dot-matrix plot. To construct a dot-matrix plot, the two sequences are written along the top row and leftmost column of a two-dimensional matrix and a dot is placed at any point where the characters in the appropriate columns match—this is a typical recurrence plot. Some implementations vary the size or intensity of the dot depending on the degree of similarity of the two characters, to accommodate conservative substitutions. The dot plots of very closely related sequences will appear as a single line along the matrix's main diagonal.

Problems with dot plots as an information display technique include: noise, lack of clarity, non-intuitiveness, difficulty extracting match summary statistics and match positions on the two sequences. There is also much wasted space where the match data is inherently duplicated across the diagonal and most of the actual area of the plot is taken up by either empty space or noise, and, finally, dot-plots are limited to two sequences. None of these limitations apply to Miropeats alignment diagrams but they have their own particular flaws.

Dot plots can also be used to assess repetitiveness in a single sequence. A sequence can be plotted against itself and regions that share significant similarities will appear as lines off the main diagonal. This effect occurs when a protein consists of multiple similar structural domains.

Dynamic programming

[edit]

The technique of dynamic programming can be applied to produce global alignments via the Needleman-Wunsch algorithm, and local alignments via the Smith-Waterman algorithm. In typical usage, protein alignments use a substitution matrix to assign scores to amino-acid matches or mismatches, and a gap penalty for matching an amino acid in one sequence to a gap in the other. DNA and RNA alignments may use a scoring matrix, but in practice often simply assign a positive match score, a negative mismatch score, and a negative gap penalty. (In standard dynamic programming, the score of each amino acid position is independent of the identity of its neighbors, and therefore base stacking effects are not taken into account. However, it is possible to account for such effects by modifying the algorithm.)[citation needed] A common extension to standard linear gap costs are affine gap costs. Here two different gap penalties are applied for opening a gap and for extending a gap. Typically the former is much larger than the latter, e.g. -10 for gap open and -2 for gap extension. This results in fewer gaps in an alignment and residues and gaps are kept together, traits more representative of biological sequences. The Gotoh algorithm implements affine gap costs by using three matrices.[11][12]

Dynamic programming can be useful in aligning nucleotide to protein sequences, a task complicated by the need to take into account frameshift mutations (usually insertions or deletions). The framesearch method produces a series of global or local pairwise alignments between a query nucleotide sequence and a search set of protein sequences, or vice versa. Its ability to evaluate frameshifts offset by an arbitrary number of nucleotides makes the method useful for sequences containing large numbers of indels, which can be very difficult to align with more efficient heuristic methods. In practice, the method requires large amounts of computing power or a system whose architecture is specialized for dynamic programming. The BLAST and EMBOSS suites provide basic tools for creating translated alignments (though some of these approaches take advantage of side-effects of sequence searching capabilities of the tools). More general methods are available from open-source software such as GeneWise.[citation needed]

The dynamic programming method is guaranteed to find an optimal alignment given a particular scoring function; however, identifying a good scoring function is often an empirical rather than a theoretical matter. Although dynamic programming is extensible to more than two sequences, it is prohibitively slow for large numbers of sequences or extremely long sequences.[citation needed]

Word methods

[edit]

Word methods, also known as k-tuple methods, are heuristic methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools FASTA and the BLAST family.[1] Word methods identify a series of short, nonoverlapping subsequences ("words") in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated.

In the FASTA method, the user defines a value k to use as the word length with which to search the database. The method is slower but more sensitive at lower values of k, which are also preferred for searches involving a very short query sequence. The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy; like FASTA, BLAST uses a word search of length k, but evaluates only the most significant word matches, rather than every word match as does FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type, and that is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found via a number of web portals, such as EMBL FASTA and NCBI BLAST.

Multiple sequence alignment

[edit]
Alignment of 27 avian influenza hemagglutinin protein sequences colored by residue conservation (top) and residue properties (bottom)

Multiple sequence alignment is an extension of pairwise alignment to incorporate more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set. Multiple alignments are often used in identifying conserved sequence regions across a group of sequences hypothesized to be evolutionarily related. Such conserved sequence motifs can be used in conjunction with structural and mechanistic information to locate the catalytic active sites of enzymes. Alignments are also used to aid in establishing evolutionary relationships by constructing phylogenetic trees. Multiple sequence alignments are computationally difficult to produce and most formulations of the problem lead to NP-complete combinatorial optimization problems.[13][14] Nevertheless, the utility of these alignments in bioinformatics has led to the development of a variety of methods suitable for aligning three or more sequences.

Dynamic programming

[edit]

The technique of dynamic programming is theoretically applicable to any number of sequences; however, because it is computationally expensive in both time and memory, it is rarely used for more than three or four sequences in its most basic form. This method requires constructing the n-dimensional equivalent of the sequence matrix formed from two sequences, where n is the number of sequences in the query. Standard dynamic programming is first used on all pairs of query sequences and then the "alignment space" is filled in by considering possible matches or gaps at intermediate positions, eventually constructing an alignment essentially between each two-sequence alignment. Although this technique is computationally expensive, its guarantee of a global optimum solution is useful in cases where only a few sequences need to be aligned accurately. One method for reducing the computational demands of dynamic programming, which relies on the "sum of pairs" objective function, has been implemented in the MSA software package.[15]

Progressive methods

[edit]

Progressive, hierarchical, or tree methods generate a multiple sequence alignment by first aligning the most similar sequences and then adding successively less related sequences or groups to the alignment until the entire query set has been incorporated into the solution. The initial tree describing the sequence relatedness is based on pairwise comparisons that may include heuristic pairwise alignment methods similar to FASTA. Progressive alignment results are dependent on the choice of "most related" sequences and thus can be sensitive to inaccuracies in the initial pairwise alignments. Most progressive multiple sequence alignment methods additionally weight the sequences in the query set according to their relatedness, which reduces the likelihood of making a poor choice of initial sequences and thus improves alignment accuracy.

Many variations of the Clustal progressive implementation[16][17][18] are used for multiple sequence alignment, phylogenetic tree construction, and as input for protein structure prediction. A slower but more accurate variant of the progressive method is known as T-Coffee.[19]

Iterative methods

[edit]

Iterative methods attempt to improve on the heavy dependence on the accuracy of the initial pairwise alignments, which is the weak point of the progressive methods. Iterative methods optimize an objective function based on a selected alignment scoring method by assigning an initial global alignment and then realigning sequence subsets. The realigned subsets are then themselves aligned to produce the next iteration's multiple sequence alignment. Various ways of selecting the sequence subgroups and objective function are reviewed in.[20]

Motif finding

[edit]

Motif finding, also known as profile analysis, constructs global multiple sequence alignments that attempt to align short conserved sequence motifs among the sequences in the query set. This is usually done by first constructing a general global multiple sequence alignment, after which the highly conserved regions are isolated and used to construct a set of profile matrices. The profile matrix for each conserved region is arranged like a scoring matrix but its frequency counts for each amino acid or nucleotide at each position are derived from the conserved region's character distribution rather than from a more general empirical distribution. The profile matrices are then used to search other sequences for occurrences of the motif they characterize. In cases where the original data set contained a small number of sequences, or only highly related sequences, pseudocounts are added to normalize the character distributions represented in the motif.

Techniques inspired by computer science

[edit]
A profile HMM modelling a multiple sequence alignment

A variety of general optimization algorithms commonly used in computer science have also been applied to the multiple sequence alignment problem. Hidden Markov models have been used to produce probability scores for a family of possible multiple sequence alignments for a given query set; although early HMM-based methods produced underwhelming performance, later applications have found them especially effective in detecting remotely related sequences because they are less susceptible to noise created by conservative or semiconservative substitutions.[21] Genetic algorithms and simulated annealing have also been used in optimizing multiple sequence alignment scores as judged by a scoring function like the sum-of-pairs method. More complete details and software packages can be found in the main article multiple sequence alignment.

The Burrows–Wheeler transform has been successfully applied to fast short read alignment in popular tools such as Bowtie and BWA. See FM-index.

Structural alignment

[edit]

Structural alignments, which are usually specific to protein and sometimes RNA sequences, use information about the secondary and tertiary structure of the protein or RNA molecule to aid in aligning the sequences. These methods can be used for two or more sequences and typically produce local alignments; however, because they depend on the availability of structural information, they can only be used for sequences whose corresponding structures are known (usually through X-ray crystallography or NMR spectroscopy). Because both protein and RNA structure is more evolutionarily conserved than sequence,[22] structural alignments can be more reliable between sequences that are very distantly related and that have diverged so extensively that sequence comparison cannot reliably detect their similarity.

Structural alignments are used as the "gold standard" in evaluating alignments for homology-based protein structure prediction[23] because they explicitly align regions of the protein sequence that are structurally similar rather than relying exclusively on sequence information. However, clearly structural alignments cannot be used in structure prediction because at least one sequence in the query set is the target to be modeled, for which the structure is not known. It has been shown that, given the structural alignment between a target and a template sequence, highly accurate models of the target protein sequence can be produced; a major stumbling block in homology-based structure prediction is the production of structurally accurate alignments given only sequence information.[23]

DALI

[edit]

The DALI method, or distance matrix alignment, is a fragment-based method for constructing structural alignments based on contact similarity patterns between successive hexapeptides in the query sequences.[24] It can generate pairwise or multiple alignments and identify a query sequence's structural neighbors in the Protein Data Bank (PDB). It has been used to construct the FSSP structural alignment database (Fold classification based on Structure-Structure alignment of Proteins, or Families of Structurally Similar Proteins). A DALI webserver can be accessed at DALI and the FSSP is located at The Dali Database.

SSAP

[edit]

SSAP (sequential structure alignment program) is a dynamic programming-based method of structural alignment that uses atom-to-atom vectors in structure space as comparison points. It has been extended since its original description to include multiple as well as pairwise alignments,[25] and has been used in the construction of the CATH (Class, Architecture, Topology, Homology) hierarchical database classification of protein folds.[26] The CATH database can be accessed at CATH Protein Structure Classification.

Combinatorial extension

[edit]

The combinatorial extension method of structural alignment generates a pairwise structural alignment by using local geometry to align short fragments of the two proteins being analyzed and then assembles these fragments into a larger alignment.[27] Based on measures such as rigid-body root mean square distance, residue distances, local secondary structure, and surrounding environmental features such as residue neighbor hydrophobicity, local alignments called "aligned fragment pairs" are generated and used to build a similarity matrix representing all possible structural alignments within predefined cutoff criteria. A path from one protein structure state to the other is then traced through the matrix by extending the growing alignment one fragment at a time. The optimal such path defines the combinatorial-extension alignment. A web-based server implementing the method and providing a database of pairwise alignments of structures in the Protein Data Bank is located at the Combinatorial Extension website.

Phylogenetic analysis

[edit]

Phylogenetics and sequence alignment are closely related fields due to the shared necessity of evaluating sequence relatedness.[28] The field of phylogenetics makes extensive use of sequence alignments in the construction and interpretation of phylogenetic trees, which are used to classify the evolutionary relationships between homologous genes represented in the genomes of divergent species. The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence identity suggests that the sequences in question have a comparatively young most recent common ancestor, while low identity suggests that the divergence is more ancient. This approximation, which reflects the "molecular clock" hypothesis that a roughly constant rate of evolutionary change can be used to extrapolate the elapsed time since two genes first diverged (that is, the coalescence time), assumes that the effects of mutation and selection are constant across sequence lineages. Therefore, it does not account for possible difference among organisms or species in the rates of DNA repair or the possible functional conservation of specific regions in a sequence. (In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between silent mutations that do not alter the meaning of a given codon and other mutations that result in a different amino acid being incorporated into the protein). More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes.

Progressive multiple alignment techniques produce a phylogenetic tree by necessity because they incorporate sequences into the growing alignment in order of relatedness. Other techniques that assemble multiple sequence alignments and phylogenetic trees score and sort trees first and calculate a multiple sequence alignment from the highest-scoring tree. Commonly used methods of phylogenetic tree construction are mainly heuristic because the problem of selecting the optimal tree, like the problem of selecting the optimal multiple sequence alignment, is NP-hard.[29]

Assessment of significance

[edit]

Sequence alignments are useful in bioinformatics for identifying sequence similarity, producing phylogenetic trees, and developing homology models of protein structures. However, the biological relevance of sequence alignments is not always clear. Alignments are often assumed to reflect a degree of evolutionary change between sequences descended from a common ancestor; however, it is formally possible that convergent evolution can occur to produce apparent similarity between proteins that are evolutionarily unrelated but perform similar functions and have similar structures.

In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts.

Methods of statistical significance estimation for gapped sequence alignments are available in the literature.[28][30][31][32][33][34][35][36]

Assessment of credibility

[edit]

Statistical significance indicates the probability that an alignment of a given quality could arise by chance, but does not indicate how much superior a given alignment is to alternative alignments of the same sequences. Measures of alignment credibility indicate the extent to which the best scoring alignments for a given pair of sequences are substantially similar. Methods of alignment credibility estimation for gapped sequence alignments are available in the literature.[37]

Scoring functions

[edit]

The choice of a scoring function that reflects biological or statistical observations about known sequences is important to producing good alignments. Protein sequences are frequently aligned using substitution matrices that reflect the probabilities of given character-to-character substitutions. A series of matrices called PAM matrices (Point Accepted Mutation matrices, originally defined by Margaret Dayhoff and sometimes referred to as "Dayhoff matrices") explicitly encode evolutionary approximations regarding the rates and probabilities of particular amino acid mutations. Another common series of scoring matrices, known as BLOSUM (Blocks Substitution Matrix), encodes empirically derived substitution probabilities. Variants of both types of matrices are used to detect sequences with differing levels of divergence, thus allowing users of BLAST or FASTA to restrict searches to more closely related matches or expand to detect more divergent sequences. Gap penalties account for the introduction of a gap - on the evolutionary model, an insertion or deletion mutation - in both nucleotide and protein sequences, and therefore the penalty values should be proportional to the expected rate of such mutations. The quality of the alignments produced therefore depends on the quality of the scoring function.

It can be very useful and instructive to try the same alignment several times with different choices for scoring matrix and/or gap penalty values and compare the results. Regions where the solution is weak or non-unique can often be identified by observing which regions of the alignment are robust to variations in alignment parameters.

Other biological uses

[edit]

Sequenced RNA, such as expressed sequence tags and full-length mRNAs, can be aligned to a sequenced genome to find where there are genes and get information about alternative splicing[38] and RNA editing.[39] Sequence alignment is also a part of genome assembly, where sequences are aligned to find overlap so that contigs (long stretches of sequence) can be formed.[40] Another use is SNP analysis, where sequences from different individuals are aligned to find single basepairs that are often different in a population.[41]

Non-biological uses

[edit]

The methods used for biological sequence alignment have also found applications in other fields, most notably in natural language processing and in social sciences, where the Needleman-Wunsch algorithm is usually referred to as Optimal matching.[42] Techniques that generate the set of elements from which words will be selected in natural-language generation algorithms have borrowed multiple sequence alignment techniques from bioinformatics to produce linguistic versions of computer-generated mathematical proofs.[43] In the field of historical and comparative linguistics, sequence alignment has been used to partially automate the comparative method by which linguists traditionally reconstruct languages.[44] Business and marketing research has also applied multiple sequence alignment techniques in analyzing series of purchases over time.[45]

Software

[edit]

A more complete list of available software categorized by algorithm and alignment type is available at sequence alignment software, but common software tools used for general sequence alignment tasks include ClustalW2[46] and T-coffee[47] for alignment, and BLAST[48] and FASTA3x[49] for database searching. Commercial tools such as DNASTAR Lasergene, Geneious, and PatternHunter are also available. Tools annotated as performing sequence alignment are listed in the bio.tools registry.

Alignment algorithms and software can be directly compared to one another using a standardized set of benchmark reference multiple sequence alignments known as BAliBASE.[50] The data set consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at BAliBASE.[51][52] A comprehensive list of BAliBASE scores for many (currently 12) different alignment tools can be computed within the protein workbench STRAP.[53]

See also

[edit]

References

[edit]
[edit]
Revisions and contributorsEdit on WikipediaRead on Wikipedia
from Grokipedia
Sequence alignment is a computational method in bioinformatics that arranges two or more biological sequences, such as DNA, RNA, or amino acid chains, to identify regions of similarity indicative of shared evolutionary origins, structural features, or functional properties.[1] This process involves inserting gaps to optimize the alignment based on scoring schemes that reward matches and penalize mismatches or insertions/deletions, enabling the inference of homology between sequences.[2] The technique originated with pairwise alignments, which compare exactly two sequences, and can be categorized as global or local depending on whether the entire sequences or only specific subsequences are aligned.[1] Global alignment, introduced by the Needleman-Wunsch algorithm in 1970, seeks the optimal end-to-end matching of sequences using dynamic programming to maximize similarity scores across their full lengths.[2] In contrast, local alignment, developed by the Smith-Waterman algorithm in 1981, focuses on the highest-scoring contiguous regions, which is particularly useful for detecting conserved domains within longer, dissimilar sequences.[3] Multiple sequence alignment (MSA) extends pairwise methods to three or more sequences, revealing conserved motifs and patterns across a family of related biomolecules, and is essential for advanced analyses like phylogenetic tree construction and secondary structure prediction.[4] Common MSA tools, such as Clustal Omega, employ progressive alignment strategies that build alignments iteratively based on guide trees derived from pairwise similarities.[5] Sequence alignment underpins numerous bioinformatics applications, including database searches via tools like BLAST for identifying homologous genes, functional annotation of newly sequenced genomes, and evolutionary studies to trace divergence among species.[6] It also supports protein structure modeling by aligning sequences to known templates and aids in drug design by highlighting binding site conservation.[1] Advances in high-throughput sequencing have increased the demand for efficient alignment algorithms, leading to heuristic approximations like FASTA and gapped BLAST to handle large-scale genomic data without exhaustive computation.[7]

Fundamentals

Definition and Interpretation

Sequence alignment is the process of arranging sequences of DNA, RNA, or protein to identify regions of similarity, which may arise from shared evolutionary origins, structural constraints, or functional roles. This arrangement maximizes the correspondence between residues while allowing for gaps to represent insertions or deletions (indels) that occurred during evolution. The resulting alignment provides a framework for comparing sequences and quantifying their relatedness through a scoring function that rewards similarities and penalizes differences or gaps.[1] In a biological context, sequence alignment serves to detect homology between sequences, infer evolutionary relationships by tracing descent from common ancestors, and highlight conserved regions indicative of functional or structural importance. For example, aligned sequences can reveal motifs critical for protein folding, enzyme activity, or regulatory elements in DNA, aiding in the prediction of gene function and phylogenetic reconstruction. Such interpretations rely on the assumption that similar sequences have undergone comparable evolutionary pressures, preserving key features over time.[4][1] Fundamental terminology in sequence alignment includes matches (identical residues aligned vertically), mismatches (non-identical residues), and gaps (hyphen symbols denoting indels, often penalized by length). Alignments are evaluated using scoring schemes, where substitution matrices assign values to matches and mismatches based on observed evolutionary substitutions; notable examples are the PAM (Percent Accepted Mutation) matrices, derived from closely related protein alignments by Dayhoff et al. in 1978, and the BLOSUM (BLOcks SUbstitution Matrix) series, constructed from conserved protein blocks by Henikoff and Henikoff in 1992. These matrices, such as PAM250 or BLOSUM62, provide higher scores for conservative substitutions (e.g., leucine to isoleucine) than radical ones, reflecting biological likelihood. Gap penalties, typically affine (a fixed cost plus per-position extension), further refine the score to balance alignment quality.[8][1] The origins of sequence alignment trace back to the 1970s, when Margaret Dayhoff pioneered systematic computational approaches for aligning protein sequences as part of her Atlas of Protein Sequence and Structure, enabling the first large-scale comparisons of known proteins. This work laid the groundwork for modern bioinformatics by integrating sequence data with evolutionary models. For a simple illustrative example, consider the pairwise alignment of two short DNA sequences:
Seq1:  A T G C G T  
Seq2:  A T - C C T
Here, matches at positions 1–2 and 6 are conserved (e.g., the "AT" and "T" motifs), a gap in Seq2 accounts for an insertion in Seq1, and the mismatch at position 5 highlights a substitution; such alignments can reveal conserved regulatory motifs in non-coding DNA. Variants of this process, such as global or local alignments, extend the method to full or partial sequence comparisons, respectively.[9][10][11]

Global versus Local Alignment

Global alignment seeks to align two entire sequences from end to end, maximizing similarity across their full lengths, and is particularly suitable for sequences of comparable length that exhibit overall homology, such as orthologous proteins or genes.[12] This approach, exemplified by the Needleman-Wunsch algorithm, assumes that the sequences are homologous throughout and penalizes gaps to enforce complete coverage.[12] It is ideal when the biological relationship involves conservation across the whole sequence, but it may produce suboptimal results if sequences have significant terminal or internal divergences.[13] In contrast, local alignment focuses on identifying the highest-scoring regions of similarity between subsequences, without requiring alignment of the full sequences, making it appropriate for detecting conserved domains or motifs within otherwise divergent or unequally sized sequences.[14] The Smith-Waterman algorithm represents this strategy, allowing for the detection of local homologies in larger genomic contexts where global methods might dilute significant but partial matches.[14] Local alignments are computationally intensive for long sequences due to the need to explore all possible subsequence pairs, but they excel in scenarios involving functional elements embedded in non-homologous regions.[13] The choice between global and local alignment depends on factors such as sequence length, degree of divergence, and computational resources; global methods are preferred for closely related sequences with uniform similarity, while local methods suit cases of high divergence or interest in specific domains, though local approaches incur higher costs for exhaustive searches.[13] For instance, global alignment is commonly applied to orthologous genes to assess evolutionary conservation across their entirety, whereas local alignment aids in motif discovery within expansive genomes to pinpoint regulatory or structural elements.[15] Scoring systems in both incorporate gap penalties to balance matches and insertions/deletions, influencing the alignment's biological interpretability.[12] Semi-global alignment serves as a hybrid strategy, aligning one full sequence to a subsequence of another while permitting free gaps at the ends of the shorter sequence, which is useful for applications like read mapping to reference genomes where terminal overhangs occur without penalization.[16] This variant addresses limitations of pure global or local methods in scenarios involving partial overlaps, such as assembling contigs or aligning exons within genes.[16]

Sequence Representations

Sequence alignments are commonly represented in text-based formats where the aligned biological sequences—such as DNA, RNA, or proteins—are arranged in rows, with homologous positions aligned in vertical columns and gaps inserted as hyphen characters ('-') to account for insertions or deletions.[17] This columnar layout facilitates the identification of matches, mismatches, and conserved regions across sequences.[18] For multiple sequence alignments, a consensus sequence is often derived below the aligned rows, representing the most frequent residue or base at each position, sometimes using ambiguity codes (e.g., 'N' for any nucleotide) to denote variability.[18] A compact text-based notation for alignments, particularly in genomic contexts, is the CIGAR (Concise Idiosyncratic Gapped Alignment Report) string, which encodes the series of alignment operations between a query sequence and a reference using a sequence of integers followed by single-letter codes, such as 'M' for alignment match or mismatch, 'D' for deletion in the query, 'I' for insertion in the query, and others like 'S' for soft-clipped regions. Introduced in the Sequence Alignment/Map (SAM) format specification in 2009, CIGAR strings enable efficient storage and processing of alignments in bioinformatics pipelines, for example, "5M2D3M" indicates five matched bases, followed by two deletions, and then three more matches. This format is widely used in tools like SAMtools for handling large-scale sequencing data. Graphical representations provide visual aids for interpreting alignments, including dot plots that display sequence similarity as diagonal lines or clusters of points on a two-dimensional grid, where one sequence runs along each axis and dots mark matching k-mers or residues. Specialized viewers, such as Jalview, render alignments as color-coded rows of sequences, highlighting conserved regions (often in similar hues), gaps, and secondary structure annotations to enhance pattern recognition. These visualizations are applicable in both pairwise and multiple sequence alignments to reveal evolutionary relationships and functional motifs. Text-based representations, including CIGAR, offer advantages in compactness for data storage and ease of parsing in computational workflows, making them suitable for high-throughput analysis. Graphical formats, while more intuitive for human analysis, support interactive exploration but require software rendering. However, linear text formats like aligned sequences or CIGAR inherently discard three-dimensional structural context, potentially overlooking spatial relationships in protein alignments.

Pairwise Alignment Techniques

Dot-Matrix Methods

Dot-matrix methods, also known as dot plots, offer a graphical approach to visualize pairwise sequence similarities and identify repetitive elements in biological sequences such as DNA, RNA, or proteins.[19] These methods plot sequences against each other or themselves to reveal patterns of conservation without requiring alignment optimization.[20] To construct a dot plot, one sequence is aligned along the horizontal axis and the other along the vertical axis, creating a two-dimensional grid where each cell represents a pair of residues. A dot is placed in the cell (i, j) if the residue at position i in the first sequence matches or exceeds a predefined similarity threshold with the residue at position j in the second sequence.[21] This threshold helps filter noise from random matches, focusing on biologically relevant similarities.[22] In interpretation, continuous diagonal lines parallel to the main diagonal indicate regions of strong sequence similarity or alignment.[21] Lines parallel to the main diagonal highlight direct repeats, anti-diagonals signify inverted repeats, and off-diagonal segments can detect inversions or tandem duplications.[19] The primary advantages of dot-matrix methods lie in their intuitive visualization, enabling rapid identification of direct repeats, inverted repeats, and tandem duplications in sequences.[21] They provide an immediate overview of structural features without the need for complex computations.[22] However, these methods suffer from quadratic space and time complexity, making them inefficient for long sequences where the grid becomes densely populated with noise unless filtering is applied.[23] Dot-matrix methods were introduced by Gibbs and McIntyre in 1970 as a simple diagrammatic tool for comparing amino acid and nucleotide sequences. For example, a self-similarity dot plot of a eukaryotic gene can reveal exons as short diagonal segments of high density, contrasting with the sparse regions corresponding to introns.[24] While dot-matrix methods excel in visualization, they serve as a precursor to dynamic programming techniques that incorporate scoring for more precise alignments.[22]

Dynamic Programming Algorithms

Dynamic programming algorithms provide an exact method for computing the optimal pairwise alignment of biological sequences by systematically evaluating all possible alignments through a scoring matrix. These approaches, pioneered in the late 1960s and early 1980s, guarantee the highest-scoring alignment under a given scoring scheme but incur quadratic time and space complexity, making them suitable for sequences up to moderate lengths.[12] The Needleman-Wunsch algorithm, introduced in 1970, computes the optimal global alignment between two sequences by aligning them end-to-end, penalizing unaligned regions at the termini. It constructs a dynamic programming matrix $ F $ where $ F(i,j) $ represents the optimal alignment score for the prefixes of sequences $ A[1..i] $ and $ B[1..j] $. The recurrence relation is defined as:
F(i,j)=max{F(i1,j1)+s(Ai,Bj)F(i1,j)dF(i,j1)d F(i,j) = \max \begin{cases} F(i-1,j-1) + s(A_i, B_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}
Here, $ s(A_i, B_j) $ is the substitution score for aligning residues $ A_i $ and $ B_j $, and $ d $ is the linear gap penalty for insertions or deletions. The matrix is initialized with $ F(0,0) = 0 $ and $ F(i,0) = -i \cdot d $, $ F(0,j) = -j \cdot d $, and the optimal alignment is recovered by traceback from $ F(n,m) $, where $ n $ and $ m $ are the sequence lengths.[12] The Smith-Waterman algorithm, developed in 1981, extends this framework to local alignments, identifying the highest-scoring subsequence pairs without forcing terminal alignments. It modifies the Needleman-Wunsch approach by initializing the matrix with zeros along the borders ($ F(0,j) = F(i,0) = 0 $) and adding a fourth option in the recurrence: $ \max(0, \dots) $, ensuring negative scores are set to zero to discard suboptimal partial alignments. Traceback begins from the maximum value in the matrix, terminating upon reaching a zero, which delineates the local alignment boundaries. This makes it particularly effective for detecting conserved domains within longer, divergent sequences. To better model biological insertions and deletions, which are often costlier to initiate than extend, affine gap penalties were introduced in 1982. Under this model, the cost of a gap of length $ k $ is $ -(h + (k-1)g) $, where $ h > 0 $ is the gap-opening penalty and $ g \geq 0 $ is the gap-extension penalty. Implementing affine penalties requires three auxiliary matrices to track states (match, insertion, deletion), increasing time complexity to $ O(nm) $ but with a constant factor of three compared to linear penalties; the recurrence for the primary matrix $ F $ incorporates maximums from these states. Substitution scores in these algorithms are derived from empirical matrices that reflect evolutionary relationships. The PAM (Point Accepted Mutation) matrices, developed in 1978 from alignments of closely related proteins (at least 85% identity), model evolutionary distances in units of 1% accepted mutations and are extrapolated for global alignments of divergent sequences, with PAM250 commonly used for distant homologs. In contrast, BLOSUM (BLOcks SUbstitution Matrix) matrices, introduced in 1992 from conserved blocks in distantly related protein families, favor local alignments and are clustered by sequence identity (e.g., BLOSUM62 for 62% clustering threshold), outperforming PAM for detecting remote similarities. For DNA sequences, simpler identity-based scoring (+1 match, -1 mismatch) is often applied.[25][8] Both algorithms exhibit $ O(nm) $ time and space complexity for sequences of lengths $ n $ and $ m $, limiting practicality for large genomes. Space optimizations, such as Hirschberg's divide-and-conquer algorithm from 1975, reduce memory to $ O(\min(n,m)) $ by recursively computing scores for halves of one sequence and using the midpoint to guide traceback, preserving optimality while halving space iteratively. Heuristics can approximate these for database-scale searches, but exact dynamic programming remains the gold standard for precision.[26] As an illustrative example, consider aligning short DNA sequences "GAATTC" (sequence A) and "GATTA" (sequence B) using Needleman-Wunsch with match score +1, mismatch -1, and gap penalty -2. The score matrix $ F $ is filled row-by-row:
  • Initialization: First row [0, -2, -4, -6, -8, -10]; first column [0, -2, -4, -6, -8, -10].
  • Key cells: $ F(1,1) = 1 $ (G-G match); $ F(2,2) = 2 $ (A-A match); proceeding to $ F(6,5) = 1 $.
The final matrix is:
εGATTA
ε0-2-4-6-8-10
G-21-1-3-5-7
A-4-120-2-4
A-6-301-1-1
T-8-5-2120
T-10-7-4-121
C-12-9-6-301
Traceback from $ F(6,5) = 1 $ yields paths: diagonal for matches/mismatches, horizontal/vertical for gaps. One optimal alignment is: GAATTC
G-ATTA
with score 1 (four matches, one mismatch, one gap).

Heuristic and Word-Based Methods

Heuristic methods in sequence alignment prioritize computational efficiency over exhaustive optimality, making them suitable for large-scale database searches where exact dynamic programming would be prohibitively slow. These approaches typically employ a seed-and-extend strategy, identifying short exact matches (seeds or words) between query and database sequences to initiate alignment, followed by localized extension using simplified scoring. Word-based methods, in particular, index short subsequences—often k-mers of length 3 to 11— to rapidly locate potential homologous regions, reducing the search space from quadratic to near-linear time complexity.[27] The Basic Local Alignment Search Tool (BLAST), introduced in 1990, exemplifies this paradigm by scanning for exact word matches as seeds and extending them bidirectionally with a heuristic version of dynamic programming to form high-scoring segment pairs (HSPs).[27] In BLAST, the query sequence is broken into overlapping words, which are indexed against a precomputed database lookup table; matches exceeding a neighborhood word score threshold trigger extensions until the alignment score drops below a predefined drop-off value. This process approximates local alignments without gaps in the initial ungapped BLAST variant, enabling searches against vast repositories like GenBank in seconds rather than hours.[27] Subsequent refinements addressed limitations in handling insertions and deletions. The ungapped BLAST requires continuous matches, potentially fragmenting alignments across indels, whereas gapped BLAST variants, developed in 1997, incorporate two-hit heuristics: after an initial ungapped seed, a second nearby word match triggers a more sensitive gapped extension using banded dynamic programming, improving detection of divergent sequences while maintaining speed. Statistical significance of HSPs is assessed via the E-value, which estimates the number of alignments with equal or better scores expected by chance in a database of given size; lower E-values (e.g., < 10^{-5}) indicate reliable homology, adjusted for factors like sequence length and composition.[28] Word-based methods extend beyond BLAST through techniques like maximal unique matches (MUMs), which identify the longest exact matches occurring uniquely in both sequences to anchor alignments in genome-scale comparisons. In tools like MUMmer, MUMs are computed efficiently using suffix trees or enhanced suffix arrays, filtering out repetitive regions and enabling rapid sketching of conserved synteny without full alignment. This indexing of short exact matches (e.g., 3-mers) as seeds prioritizes unique anchors, scoring alignments based on MUM density and order to infer evolutionary relationships. BLAST's primary advantage lies in its speed for database searches, such as NCBI's implementation against GenBank, where it processes queries 50-100 times faster than full dynamic programming, facilitating routine annotation of millions of sequences.[27] However, its heuristic nature trades optimality for efficiency, potentially missing distant homologs with low sequence identity (<25%) due to sparse seed hits, and it exhibits a sensitivity-specificity trade-off where relaxed parameters increase false positives. For example, in a protein database search using BLASTP, output parsing involves extracting HSPs from the tabular format (e.g., via -outfmt 6), where fields like max score, E-value, and percent identity rank hits; a query against UniProt might yield top matches with E-values near 0 and identities >40%, indicating strong homologs, while parsing scripts filter for query coverage >70% to prioritize full-length alignments.

Multiple Sequence Alignment Approaches

Progressive Alignment Methods

Progressive alignment methods for multiple sequence alignment (MSA) involve constructing a hierarchical alignment by first computing pairwise similarities between sequences to build a guide tree, then progressively aligning sequences or clusters in the order dictated by the tree, starting with the most similar pairs.[29] This heuristic approach transforms the computationally intensive MSA problem into a series of pairwise alignments, avoiding the exponential complexity of exact methods while aiming to preserve biological relevance. The guide tree, often generated using neighbor-joining or similar distance-based clustering from a matrix of pairwise alignment scores, serves as a roadmap for sequential addition, ensuring that closely related sequences are aligned first to establish a robust core.[29] The Clustal series represents one of the most influential implementations of progressive alignment, originating in the late 1980s and evolving through versions like ClustalW and Clustal Omega.[30] In Clustal, pairwise alignments are first performed using dynamic programming to create a distance matrix, from which a neighbor-joining guide tree is constructed; sequences are then added sequentially to the growing alignment, with initial alignments between the two most similar sequences followed by progressive incorporation of others.[31] Later enhancements in ClustalW introduced sequence weighting to emphasize informative positions and position-specific gap penalties, which increase penalties in conserved regions to discourage insertions or deletions there while allowing flexibility in variable areas, thereby better preserving functionally important motifs.[31] Clustal Omega further optimized this for scalability, enabling alignments of thousands of sequences through parallelization and mBed-like strategies for large clusters. These methods excel in efficiency, scaling well to dozens or hundreds of sequences with polynomial time complexity, making them practical for routine bioinformatics tasks. They are particularly valuable in phylogenetics, where aligned sequences form the basis for tree inference and evolutionary analysis, as the hierarchical structure mirrors expected evolutionary relationships.[32] However, a key limitation is error propagation due to the "once a gap, always a gap" rule, where early alignment decisions—such as gap placements—are fixed and propagated through subsequent steps, potentially leading to inaccuracies in divergent sequences.[29] A representative example is the MSA of globin family proteins, such as human alpha- and beta-hemoglobins alongside myoglobin, where progressive alignment highlights conserved alpha-helical regions essential for heme binding and oxygen transport, while accommodating variable loop insertions.[31] Iterative refinements can sometimes mitigate these propagation issues by realigning subsets post-progression.

Iterative and Consistency-Based Methods

Iterative methods for multiple sequence alignment (MSA) begin with an initial alignment, often generated via progressive techniques, and then refine it through repeated realignments of subsets of sequences or resampling strategies to optimize an overall score.[33] These approaches address limitations in progressive alignments by allowing corrections to early errors that propagate through the process. A key example is the Progressive, Recursive, and Permutation (PRRP) method, which uses iterative refinement with a double nested loop to ensure consistency among the alignment, guide tree, and sequence weights; it iteratively adjusts alignment weights and local alignments to maximize the sum-of-pairs (SP) score. PRRP starts with a progressive alignment and repeatedly refines the alignment by realigning against the current tree and weights until convergence, thereby improving accuracy on datasets with divergent sequences. Another prominent iterative strategy is the Partial Order Alignment (POA) approach, which represents alignments as directed acyclic graphs rather than linear profiles to preserve branching structures and avoid forced consensus in variable regions.[34] In POA, an initial progressive alignment is constructed by sequentially aligning new sequences to the evolving graph using dynamic programming, followed by iterative refinements that realign subsets against the graph to optimize the SP objective while accommodating indels and rearrangements.[34] This graph-based iteration enables handling of up to hundreds of sequences efficiently. Consistency-based methods enhance MSA by enforcing pairwise consistency across all sequence pairs in the alignment, using a library of high-quality pairwise alignments as anchors to guide the global alignment. The seminal T-Coffee algorithm (2000) constructs such a consistency library from external information, including structural alignments or precomputed pairs, and optimizes the MSA by maximizing the sum of pairwise scores derived from this library rather than a simple SP objective. For instance, T-Coffee can incorporate RNA secondary structure anchors to align non-coding sequences, where it realigns pairs iteratively against the library to resolve inconsistencies, yielding higher accuracy than progressive methods alone on datasets with low sequence identity. These iterative and consistency-based techniques generally outperform pure progressive alignments in accuracy, particularly for datasets involving structural or functional constraints, while remaining computationally feasible for hundreds of sequences through heuristic optimizations.[35] For example, T-Coffee's use of consistency libraries has demonstrated superior performance on RNA alignments by integrating structural data compared to non-consistency methods.

Exact and Motif-Finding Methods

Exact dynamic programming (DP) methods for multiple sequence alignment (MSA) extend the pairwise Needleman-Wunsch algorithm to multiple dimensions, computing the optimal alignment by filling a k-dimensional matrix for k sequences, each of length L. These approaches guarantee the globally optimal solution under a given scoring scheme but are computationally intensive, with a time and space complexity of $ O(L^k \cdot 2^k) $, where the $ 2^k $ factor arises from considering all possible combinations of matches or gaps at each position across the sequences.[36] Due to this exponential growth, exact DP is feasible only for small numbers of sequences, typically fewer than 10, and short sequence lengths, often limited to alignments of ribosomal RNA or small protein families.[37] Tree-based exact DP variants, such as those aligning sequences according to a predefined phylogenetic tree, further optimize computation by decomposing the problem into pairwise alignments along tree branches, reducing redundancy while maintaining optimality for the tree topology. These methods are particularly useful for small sets of closely related sequences, like viral genomes, where the tree structure guides the alignment without heuristic approximations. However, even tree-based implementations remain impractical for larger k due to the underlying exponential scaling.[38] Motif-finding methods complement exact DP by focusing on identifying short, conserved sequence blocks (motifs) rather than full alignments, which is valuable when sequences are unaligned or contain distant homologies. The MEME (Multiple Expectation maximization for Motif Elicitation) algorithm, introduced in 1994, uses an expectation-maximization (EM) framework to discover one or more motifs in a set of unaligned biopolymer sequences by modeling them as mixtures of motifs and background distributions. In the E-step, MEME probabilistically assigns positions to motifs; in the M-step, it updates motif parameters to maximize the likelihood, iterating until convergence. This approach excels at finding ungapped motifs of specified widths in DNA or protein sequences without requiring prior alignment.[39] Gibbs sampling provides an alternative probabilistic strategy for motif discovery, iteratively sampling motif positions in all but one sequence while fixing the others, then updating the excluded sequence based on the current model to explore the posterior distribution of motif locations. This Markov chain Monte Carlo technique avoids local optima more robustly than EM by randomly perturbing assignments, making it suitable for discovering multiple motifs with variable spacing or in noisy datasets. Seminal implementations, such as the Gibbs Motif Sampler, have been applied to detect subtle patterns in large sequence collections.[40] These methods find applications in identifying regulatory elements, such as transcription factor binding sites in promoter regions, and conserved protein domains that mediate functions like enzymatic activity or protein-protein interactions. For example, MEME and Gibbs sampling have been used to discover zinc-finger motifs—short cysteine- and histidine-rich sequences that bind DNA—in transcription factors across eukaryotic genomes, revealing conserved patterns like C2H2 that coordinate zinc ions for DNA recognition.[41] Despite their precision, exact DP methods suffer from exponential complexity that limits scalability beyond small datasets, often requiring supercomputing resources even for modest k. Motif-finding approaches, while efficient, assume motifs as ungapped blocks and may overlook gapped or low-information patterns, necessitating integration with other tools for comprehensive analysis.[37][42]

Structural Sequence Alignment

Protein Structure Comparison Metrics

Protein structure comparison metrics are essential for evaluating the similarity between three-dimensional protein conformations, particularly when integrating structural data with sequence alignments. These metrics quantify how well aligned residues in a sequence alignment correspond to spatially close positions in their folded structures, allowing for the assessment of evolutionary and functional relationships beyond sequence similarity alone. By measuring atomic or residue-level deviations after optimal superposition, they help validate whether sequence-based alignments reflect conserved structural features, which is crucial for applications like homology modeling and functional annotation. The root-mean-square deviation (RMSD) is a widely used metric that calculates the average distance between corresponding atoms in two superimposed protein structures. After aligning the structures via least-squares rotation and translation to minimize deviations, RMSD is computed as the square root of the mean of the squared pairwise distances between N equivalent atoms:
RMSD=1Ni=1Ndi2 \text{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} d_i^2}
where did_i is the Euclidean distance between atom ii in the two structures. This measure is sensitive to global conformation and is typically applied to backbone atoms like Cα for robustness, with lower values indicating greater similarity; values below 2 Å often signify high structural homology. The method originates from the optimal superposition algorithm that minimizes RMSD. The Template Modeling score (TM-score) addresses limitations of RMSD by providing a topology-based, scale-independent measure normalized by protein length, ranging from 0 (no similarity) to 1 (identical structures). Defined as
TM-score=1Ltargeti=1Laligned11+(di/d0)2, \text{TM-score} = \frac{1}{L_{\text{target}}} \sum_{i=1}^{L_{\text{aligned}}} \frac{1}{1 + (d_i / d_0)^2},
where LtargetL_{\text{target}} is the length of the target protein, LalignedL_{\text{aligned}} is the number of aligned residues, did_i is the distance between aligned residues ii, and d0d_0 is a scaling factor (often 4.8 Å for globular proteins), TM-score emphasizes global topology and is robust to local perturbations or chain length differences. A TM-score > 0.5 typically indicates the same fold, making it valuable for detecting remote homologs. This metric was developed to better assess template quality in structure prediction.[43] The Global Distance Test (GDT), particularly the total score variant (GDT-TS), evaluates similarity by calculating the percentage of residues in a structure that lie within specified distance cutoffs (e.g., 1 Å, 2 Å, 4 Å, 8 Å) from their equivalents in the reference structure after superposition. GDT-TS averages these percentages across the cutoffs, yielding a value between 0 and 100, where higher scores reflect better agreement; it is less sensitive to outliers than RMSD and focuses on residue-level correspondence. Introduced for evaluating structure predictions in CASP assessments starting from CASP3, with the total score variant used in subsequent assessments, GDT is effective for comparing models with varying alignment lengths.[44] In structural sequence alignment, these metrics score the quality of overlays derived from sequence alignments, enabling refinement or validation of alignments by prioritizing those with low RMSD, high TM-score, or high GDT. For instance, a sequence alignment can guide initial residue pairing, after which structures are superposed and metrics computed to assess fit. Structural metrics offer advantages over sequence-based scores by capturing functional similarities in proteins with low sequence identity (<30%), where divergent evolution obscures primary sequence conservation but preserves core folds essential for binding or catalysis. A representative example is the comparison of myoglobin and the alpha chain of hemoglobin, homologous oxygen-binding proteins sharing approximately 25% sequence identity. Despite this modest sequence similarity, structural alignment reveals conserved globin folds, with low RMSD values for superposed Cα atoms confirming high conformational overlap and functional analogy in heme coordination.[45]

Key Algorithms for Structural Alignment

Structural alignment algorithms aim to superimpose three-dimensional protein structures to identify similarities in their folds, even when sequences diverge significantly. These methods typically operate on atomic coordinates, prioritizing spatial proximity and geometric consistency over linear sequence order. Prominent approaches include distance-based and fragment-based techniques that iteratively refine alignments to maximize structural overlap while accounting for flexibility. The DALI (Distance matrix ALIgnment) algorithm, introduced by Holm and Sander in 1993, performs pairwise structural comparisons by aligning intra-molecular distance matrices derived from Cα atom coordinates. It identifies similar segments through an iterative process that superimposes structurally equivalent regions, optimizing for a score that balances residue similarity and gap penalties, thereby detecting distant homologs effectively. DALI has been widely used for database scanning and fold classification due to its robustness against local distortions.[46] SSAP (Sequential Structure Alignment Program), developed by Orengo and Taylor in 1996, employs a combinatorial extension of double dynamic programming to match secondary structure elements while considering three-dimensional geometry. The method first aligns vectors between residue pairs in conserved regions, then extends alignments using a scoring function that incorporates environmental and physicochemical properties, ensuring sequential order is preserved for biologically relevant comparisons. SSAP excels in handling proteins with insertions or deletions, facilitating the classification of domain architectures in large structural databases.[47] The Combinatorial Extension (CE) algorithm, proposed by Shindyalov and Bourne in 1998, builds alignments by identifying short aligned fragment pairs (AFPs) based on spatial superposition and extending them combinatorially to form a global alignment path. It uses rigid-body superposition to minimize deviations, followed by optimization of the alignment length and RMSD (root-mean-square deviation) to evaluate quality. CE is particularly effective for aligning structures with non-sequential similarities, such as circular permutations, and supports multiple structure alignments through iterative pairwise comparisons.[48] A notable recent advance is TM-align, developed by Zhang and Skolnick in 2005, which optimizes alignments using the TM-score—a scale-independent metric that emphasizes global topology over local distortions. The algorithm applies a heuristic dynamic programming approach to superimpose structures, iteratively refining rotations and translations to maximize the TM-score, often outperforming earlier methods in accuracy for low-identity pairs. TM-align integrates sequence information optionally to enhance detection of remote homologs.[49] These algorithms generally take as input Protein Data Bank (PDB) files containing three-dimensional coordinates of protein atoms, typically focusing on backbone Cα positions for computational efficiency. Outputs include superimposed coordinate sets, aligned residue mappings that can be converted to sequence alignments, and quantitative scores such as alignment length, RMSD, and similarity indices for assessing structural conservation. For validation, metrics like RMSD measure atomic deviations post-superposition, with values below 3 Å indicating significant similarity. An illustrative application involves aligning enzyme active sites across protein families, such as comparing the catalytic cores of serine proteases from different superfamilies using DALI or CE, revealing conserved geometries despite sequence divergence and aiding in functional annotation.

Applications in Biology

Role in Phylogenetic Analysis

Sequence alignment plays a foundational role in phylogenetic analysis by providing the structural framework necessary for inferring evolutionary relationships among organisms. Multiple sequence alignments (MSAs) serve as a prerequisite step, enabling the identification of homologous residues across sequences before applying substitution models that account for evolutionary changes such as mutations and insertions/deletions.[50][51] By aligning sequences to reveal positional correspondences, MSAs facilitate the construction of phylogenetic trees that represent hypothesized ancestry, with homology assessments directly informing the reliability of downstream inferences.[52] To ensure the robustness of alignments used in phylogenetics, statistical measures such as E-values and p-values are employed to evaluate the significance of sequence similarities, particularly in initial pairwise or database searches akin to BLAST. These metrics derive from Karlin-Altschul statistics, which model the expected distribution of alignment scores under a null hypothesis of random similarity, allowing researchers to distinguish biologically meaningful homologies from chance matches.[53] Once an MSA is established, tree credibility is further assessed through bootstrap resampling, where columns of the alignment are randomly sampled with replacement to generate replicate datasets; the proportion of replicates supporting a given clade provides a measure of branch reliability, as introduced by Felsenstein.[54] Phylogenetic tree-building relies on scoring functions applied to the aligned sequences to optimize tree topologies, contrasting parsimony methods—which seek the tree requiring the fewest evolutionary changes—with likelihood-based approaches that maximize the probability of observing the data under a specified evolutionary model. For instance, likelihood scoring under the Jukes-Cantor model assumes equal nucleotide frequencies and substitution rates, providing a probabilistic framework to evaluate trees by incorporating branch lengths and transition probabilities.[55][56] These aligned sequences are then integrated into specialized tools like MrBayes for Bayesian inference or RAxML for maximum likelihood estimation, where the MSA serves as direct input to compute posterior probabilities or optimized likelihoods for tree topologies.[57][58] A prominent example of sequence alignment's application in phylogenetics is the use of 16S ribosomal RNA (rRNA) MSAs to reconstruct bacterial evolutionary histories, as pioneered in Woese's foundational work, which aligned conserved and variable regions to delineate major bacterial phyla and reveal universal tree structures.[59] Structural alignments of protein domains can also briefly support such rRNA-based trees by incorporating three-dimensional homology for deeper divergences.[60]

Other Biological Applications

Sequence alignment plays a crucial role in gene prediction by facilitating the identification of exons through comparisons with known genes, particularly in genome assembly projects. Tools like SLAM integrate spliced alignments to annotate coding and conserved non-coding regions in homologous sequences, enabling de novo gene finding across species. Similarly, the spliced alignment algorithm in GeneWise aligns genomic DNA to protein sequences, exploring possible exon assemblies in polynomial time to delineate multiexon gene structures. These methods leverage sequence similarity to detect conserved splicing signals and start/stop codons, improving accuracy in eukaryotic genomes where introns complicate prediction. In functional annotation, sequence alignments enable the transfer of Gene Ontology (GO) terms based on homology, accelerating the assignment of biological roles to uncharacterized proteins. The BLAST2GO suite performs BLAST-based alignments against databases like NCBI nr, followed by GO mapping and annotation, allowing researchers to infer functions from sequence similarity in high-throughput genomics. This approach has been widely adopted for plant and microbial proteomes, where it integrates InterProScan for domain detection to refine annotations and support data mining in functional genomics. Variant analysis in next-generation sequencing (NGS) pipelines relies on aligning short reads to a reference genome to detect single nucleotide polymorphisms (SNPs) and other variants. The Burrows-Wheeler Aligner (BWA) efficiently maps reads using backward search with the Burrows-Wheeler transform, providing high accuracy for large-scale human genome resequencing. Subsequent SNP calling with the Genome Analysis Toolkit (GATK) processes these alignments through realignment and recalibration to minimize errors, enabling reliable genotype determination in clinical and population studies. In drug design, sequence alignments underpin homology modeling to predict structures of target proteins lacking experimental data, facilitating ligand docking. The MODELLER program constructs 3D models by satisfying spatial restraints derived from alignments with template structures, optimizing side-chain packing and loop conformations for targets with 30-50% sequence identity. This has been instrumental in designing inhibitors for proteins like kinases, where aligned homologs guide the identification of binding sites conserved across families. For RNA structure prediction, alignments of related sequences inform covariance models that capture both primary sequence and secondary structure covariations. Introduced by Eddy and Durbin, these probabilistic models generalize hidden Markov models to include base-pairing probabilities, enabling sensitive homology searches. The Infernal software implements covariance models for aligning and searching RNA families, such as in the Rfam database, to predict structures in non-coding RNAs like tRNAs and ribozymes. An illustrative application is in metagenomics, where aligning short reads from environmental samples to microbial reference databases reveals community diversity. MetaPhlAn profiles taxa by mapping reads to clade-specific marker genes via Bowtie alignments, estimating relative abundances without full genome assembly and uncovering uncultured species in complex ecosystems like the human gut.

Advanced and Non-Biological Uses

Machine Learning Enhancements

Machine learning has revolutionized sequence alignment by addressing limitations in traditional scoring matrices and heuristic methods, particularly for distant homologs and complex multiple sequence alignments (MSAs). Neural networks, such as protein language models (pLMs), replace position-specific scoring matrices (PSSMs) with learned embeddings that capture evolutionary and structural contexts more effectively. For instance, pLM-BLAST employs ESM-2 embeddings to detect distant homologies by comparing sequence representations, outperforming BLAST in low-identity cases with up to 20% higher sensitivity on benchmark datasets.[61] Similarly, Protein Embedding based Alignments (PEbA) uses transformer-derived embeddings to align sequences with less than 20% identity, improving accuracy over traditional tools like Clustal Omega by incorporating semantic similarities beyond pairwise substitutions.[62] Inspired by AlphaFold's success in structure prediction, recent approaches integrate predicted structural embeddings into alignment scoring for enhanced detection of remote homologs. The Ankh-score method (preprint) leverages AlphaFold3-generated structures to refine sequence alignments, achieving superior performance in identifying homologs with low sequence identities compared to TM-align, as demonstrated on BAliBASE and CDD datasets.[63] These embeddings encode three-dimensional constraints directly into the scoring function, enabling alignments that align not just residues but also functional motifs across divergent proteins. DeepBLAST further extends this by combining structural vectors (TM-Vec) with sequence embeddings, boosting remote homology detection accuracy by 10-15% over HMMER on diverse protein families.[64] For alignment refinement, convolutional neural networks (CNNs) and transformers correct errors in initial MSAs, particularly in sparse or noisy datasets. DeepMSA, introduced in 2019, constructs deep MSAs by iteratively refining alignments using multi-source homologous sequences, improving contact prediction accuracy by 5-10% for distant-homology proteins when integrated with tools like trRosetta.[65] Its successor, DeepMSA2, employs transformer architectures to generate high-quality MSAs for both monomers and multimers, reducing alignment errors in low-homology scenarios. BetaAlign, a 2024 deep learning framework, trains on simulated alignments to perform end-to-end MSA optimization, surpassing progressive methods like MAFFT in handling insertions and deletions with a 12% gain in column score on simulated benchmarks.[66] Post-2020 advances incorporate diffusion models for structure-aware alignments, simulating evolutionary processes to handle ultra-low-identity cases better than BLAST. ESM3, released in 2024, uses a diffusion-based generative model to evolve protein sequences over simulated timelines, enabling alignments that recover homologs with <10% identity by modeling co-evolutionary dynamics, as validated on 500 million years of synthetic evolution data.[67] These models facilitate end-to-end learning, where alignments are optimized jointly with downstream tasks like structure prediction, yielding significant improvements in sensitivity for metagenomic datasets where traditional methods fail due to fragmented assemblies. In metagenomics, such enhancements increase functional annotation accuracy over BLAST-based pipelines, as shown in large-scale microbial community analyses.[68] Overall, these AI integrations provide greater sensitivity for divergent sequences and scalable processing of massive datasets, marking a shift toward predictive, context-aware alignment paradigms.

Applications Outside Biology

Sequence alignment techniques, originally developed for biological data, have found significant applications in non-biological domains, particularly in computational linguistics and data processing. One of the earliest adaptations emerged in the 1960s with the introduction of edit distance metrics for string comparison in computing. The Levenshtein distance, proposed in 1965, measures the minimum number of single-character edits—insertions, deletions, or substitutions—required to transform one string into another, providing a foundational method for aligning non-biological sequences.[69] This metric was further refined by the Wagner-Fischer algorithm in 1974, which efficiently computes edit distances using dynamic programming, enabling practical implementations for string alignment in early computing tasks. In natural language processing, string alignment via Levenshtein distance is widely employed for spell-checking, where it identifies and suggests corrections for typographical errors by quantifying deviations from dictionary words. For instance, autocorrect systems in word processors and search engines use this distance to rank potential fixes based on minimal edit operations.[70] Similarly, plagiarism detection tools apply edit distance to compare documents or code snippets, flagging similarities that exceed a threshold after normalizing for minor variations like rephrasing or formatting changes; this approach has been validated in systems analyzing both textual and source code content for academic and software integrity checks.[71] In software engineering, these methods extend to tracking code evolution, treating source code versions as sequences to align and detect "code clones" or evolutionary changes, akin to genomic comparisons but focused on refactoring and duplication analysis.[72] Beyond text, sequence alignment principles underpin time series analysis through algorithms like dynamic time warping (DTW), which aligns signals of varying speeds or lengths by minimizing the distance between corresponding points. Developed in 1978 for speech recognition, DTW warps temporal axes to match patterns, such as aligning spoken words despite differences in pronunciation speed or pauses. This technique is routinely used in signal processing for applications like gesture recognition in human-computer interfaces and anomaly detection in sensor data, where non-linear alignments reveal underlying similarities not captured by rigid metrics.[73] Bioinformatics-inspired alignments also appear in forensics, where DNA barcoding sequences—short genetic markers—are aligned against reference databases to identify species in illicit trade or crime scenes, such as verifying wildlife products at customs without full genomic sequencing.[74] In materials science, protein sequence alignment concepts inspire the design of synthetic polymers; for example, aligning motifs from spider silk proteins guides the generation of novel sequences to engineer biomaterials with enhanced tensile strength and biocompatibility.[75] Despite these successes, adapting biological sequence alignment to non-biological contexts faces limitations, as scoring schemes like substitution matrices (e.g., BLOSUM) rely on evolutionary models irrelevant to strings or signals, necessitating simpler uniform costs that may overlook domain-specific nuances like semantic meaning in text.[76]

References

User Avatar
No comments yet.