Hirschberg's algorithm
In computer science, Hirschberg's algorithm, named after its inventor, Dan Hirschberg, is a dynamic programming algorithm that finds the optimal sequence alignment between two strings. Optimality is measured with the Levenshtein distance, defined to be the sum of the costs of insertions, replacements, deletions, and null actions needed to change one string into the other. Hirschberg's algorithm is simply described as a more space-efficient version of the Needleman–Wunsch algorithm that uses dynamic programming.[1] Hirschberg's algorithm is commonly used in computational biology to find maximal global alignments of DNA and protein sequences.
Algorithm information
Hirschberg's algorithm is a generally applicable algorithm for optimal sequence alignment. BLAST and FASTA are suboptimal heuristics. If X and Y are strings, where length(X) = n and length(Y) = m, the Needleman–Wunsch algorithm finds an optimal alignment in O(nm) time, using O(nm) space. Hirschberg's algorithm is a clever modification of the Needleman–Wunsch algorithm, which still takes O(nm) time, but needs only O(min{n, m}) space and is much faster in practice.[2] One application of the algorithm is finding sequence alignments of DNA or protein sequences. It is also a space-efficient way to calculate the longest common subsequence between two sets of data such as with the common diff tool.
The Hirschberg algorithm can be derived from the Needleman–Wunsch algorithm by observing that:[3]
- one can compute the optimal alignment score by only storing the current and previous row of the Needleman–Wunsch score matrix;
- if (Z, W) = NW(X, Y) is the optimal alignment of (X, Y), and X = X^l + X^r is an arbitrary partition of X, there exists a partition Y^l + Y^r of Y such that NW(X, Y) = NW(X^l, Y^l) + NW(X^r, Y^r).
Algorithm description
Xi denotes the i-th character of X, where 1 ≤ i ≤ length(X). Xi:j denotes the substring of size j − i + 1, ranging from the i-th to the j-th character of X. rev(X) is the reversed version of X.
X and Y are the sequences to be aligned. Let x be a character from X, and y be a character from Y. We assume that Del(x), Ins(y) and Sub(x, y) are well defined integer-valued functions. These functions represent the cost of deleting x, inserting y, and replacing x with y, respectively.
We define NWScore(X, Y), which returns the last line of the Needleman–Wunsch score matrix Score(i, j):
function NWScore(X, Y)
    Score(0, 0) = 0 // 2 * (length(Y) + 1) array
    for j = 1 to length(Y)
        Score(0, j) = Score(0, j - 1) + Ins(Yj)
    for i = 1 to length(X) // Init array
        Score(1, 0) = Score(0, 0) + Del(Xi)
        for j = 1 to length(Y)
            scoreSub = Score(0, j - 1) + Sub(Xi, Yj)
            scoreDel = Score(0, j) + Del(Xi)
            scoreIns = Score(1, j - 1) + Ins(Yj)
            Score(1, j) = max(scoreSub, scoreDel, scoreIns)
        end
        // Copy Score[1] to Score[0]
        Score(0, :) = Score(1, :)
    end
    for j = 0 to length(Y)
        LastLine(j) = Score(1, j)
    return LastLine
Note that at any point, NWScore requires only the two most recent rows of the score matrix. Thus, NWScore is implemented in O(length(Y)) space.
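The routine above can be rendered directly in Python. The following is an illustrative sketch (the names nw_score, ins, delete, and sub are not from the source, and the cost scheme in the demo call, gap −2, match +2, mismatch −1, is an assumption chosen to reproduce the score matrices shown later in this article):

```python
def nw_score(x, y, ins, delete, sub):
    """Last line of the Needleman-Wunsch score matrix, kept in two rows."""
    prev = [0] * (len(y) + 1)
    for j in range(1, len(y) + 1):
        prev[j] = prev[j - 1] + ins(y[j - 1])          # first row: all insertions
    for i in range(1, len(x) + 1):
        curr = [prev[0] + delete(x[i - 1])] + [0] * len(y)
        for j in range(1, len(y) + 1):
            curr[j] = max(prev[j - 1] + sub(x[i - 1], y[j - 1]),  # substitute
                          prev[j] + delete(x[i - 1]),             # delete
                          curr[j - 1] + ins(y[j - 1]))            # insert
        prev = curr                                    # older row is discarded
    return prev

# Assumed costs: gap = -2, match = +2, mismatch = -1.
row = nw_score("AGTA", "TATGC",
               ins=lambda c: -2, delete=lambda c: -2,
               sub=lambda a, b: 2 if a == b else -1)
print(row)  # [-8, -4, 0, -2, -1, -3]
```

Only two rows (prev and curr) are ever alive, matching the linear-space claim above.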
The Hirschberg algorithm follows:
function Hirschberg(X, Y)
    Z = ""
    W = ""
    if length(X) == 0
        for i = 1 to length(Y)
            Z = Z + '-'
            W = W + Yi
        end
    else if length(Y) == 0
        for i = 1 to length(X)
            Z = Z + Xi
            W = W + '-'
        end
    else if length(X) == 1 or length(Y) == 1
        (Z, W) = NeedlemanWunsch(X, Y)
    else
        xlen = length(X)
        xmid = length(X) / 2
        ylen = length(Y)

        ScoreL = NWScore(X1:xmid, Y)
        ScoreR = NWScore(rev(Xxmid+1:xlen), rev(Y))
        ymid = arg max ScoreL + rev(ScoreR)

        (Z, W) = Hirschberg(X1:xmid, Y1:ymid) + Hirschberg(Xxmid+1:xlen, Yymid+1:ylen)
    end
    return (Z, W)
In the context of observation (2), X^l + X^r with X^l = X1:xmid and X^r = Xxmid+1:xlen is a partition of X. Index ymid is computed such that Y^l = Y1:ymid and Y^r = Yymid+1:ylen.
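Putting the pieces together, the pseudocode translates into the following self-contained Python sketch (illustrative code, not the original paper's; nw_score is a linear-space scorer, and a small full-matrix Needleman–Wunsch with traceback stands in for the base-case NeedlemanWunsch call):

```python
def nw_score(x, y, ins, delete, sub):
    """Last line of the Needleman-Wunsch score matrix, in linear space."""
    prev = [0] * (len(y) + 1)
    for j in range(1, len(y) + 1):
        prev[j] = prev[j - 1] + ins(y[j - 1])
    for i in range(1, len(x) + 1):
        curr = [prev[0] + delete(x[i - 1])] + [0] * len(y)
        for j in range(1, len(y) + 1):
            curr[j] = max(prev[j - 1] + sub(x[i - 1], y[j - 1]),
                          prev[j] + delete(x[i - 1]),
                          curr[j - 1] + ins(y[j - 1]))
        prev = curr
    return prev

def needleman_wunsch(x, y, ins, delete, sub):
    """Full-matrix alignment with traceback; only ever called on tiny inputs."""
    n, m = len(x), len(y)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + delete(x[i - 1])
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + ins(y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j - 1] + sub(x[i - 1], y[j - 1]),
                          s[i - 1][j] + delete(x[i - 1]),
                          s[i][j - 1] + ins(y[j - 1]))
    z, w, i, j = "", "", n, m
    while i > 0 or j > 0:   # walk back from the bottom-right corner
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + sub(x[i - 1], y[j - 1]):
            z, w, i, j = x[i - 1] + z, y[j - 1] + w, i - 1, j - 1
        elif i > 0 and s[i][j] == s[i - 1][j] + delete(x[i - 1]):
            z, w, i = x[i - 1] + z, "-" + w, i - 1
        else:
            z, w, j = "-" + z, y[j - 1] + w, j - 1
    return z, w

def hirschberg(x, y, ins, delete, sub):
    """Optimal global alignment (x-side, y-side) of x and y in linear space."""
    if len(x) == 0:
        return "-" * len(y), y
    if len(y) == 0:
        return x, "-" * len(x)
    if len(x) == 1 or len(y) == 1:
        return needleman_wunsch(x, y, ins, delete, sub)
    xmid = len(x) // 2
    score_l = nw_score(x[:xmid], y, ins, delete, sub)
    score_r = nw_score(x[xmid:][::-1], y[::-1], ins, delete, sub)
    total = [a + b for a, b in zip(score_l, score_r[::-1])]
    ymid = total.index(max(total))          # arg max of ScoreL + rev(ScoreR)
    z1, w1 = hirschberg(x[:xmid], y[:ymid], ins, delete, sub)
    z2, w2 = hirschberg(x[xmid:], y[ymid:], ins, delete, sub)
    return z1 + z2, w1 + w2
```

With the assumed example costs (gap −2, match +2, mismatch −1), hirschberg("AGTACGCA", "TATGC", ...) returns the pair ("AGTACGCA", "--TATGC-"), matching the worked example below.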
Example
Let

    X = AGTACGCA
    Y = TATGC
    Del(x) = Ins(y) = -2
    Sub(x, y) = +2 if x = y, and -1 otherwise.

The optimal alignment is given by
W = AGTACGCA
Z = --TATGC-
Indeed, this can be verified by backtracking its corresponding Needleman–Wunsch matrix:
          T    A    T    G    C
     0   -2   -4   -6   -8  -10
A   -2   -1    0   -2   -4   -6
G   -4   -3   -2   -1    0   -2
T   -6   -2   -4    0   -2   -1
A   -8   -4    0   -2   -1   -3
C  -10   -6   -2   -1   -3    1
G  -12   -8   -4   -3    1   -1
C  -14  -10   -6   -5   -1    3
A  -16  -12   -8   -7   -3    1
One starts with the top level call to Hirschberg(AGTACGCA, TATGC), which splits the first argument in half: X = AGTA + CGCA. The call to NWScore(AGTA, TATGC) produces the following matrix:
          T    A    T    G    C
     0   -2   -4   -6   -8  -10
A   -2   -1    0   -2   -4   -6
G   -4   -3   -2   -1    0   -2
T   -6   -2   -4    0   -2   -1
A   -8   -4    0   -2   -1   -3
Likewise, NWScore(rev(CGCA), rev(TATGC)) generates the following matrix:
          C    G    T    A    T
     0   -2   -4   -6   -8  -10
A   -2   -1   -3   -5   -4   -6
C   -4    0   -2   -4   -6   -5
G   -6   -2    2    0   -2   -4
C   -8   -4    0    1   -1   -3
Their last lines (after reversing the latter) and their sum are respectively

    ScoreL      = [ -8  -4   0  -2  -1  -3 ]
    rev(ScoreR) = [ -3  -1   1   0  -4  -8 ]
    Sum         = [-11  -5   1  -2  -5 -11]
The maximum of the sum, 1, appears at ymid = 2, producing the partition Y = TA + TGC.
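These numbers can be checked mechanically. The sketch below hardcodes the assumed cost scheme (Del = Ins = −2, Sub = +2 on a match and −1 otherwise, which reproduces the matrices shown) into a small linear-space scorer:

```python
def nw_score(x, y):
    """Last row of the Needleman-Wunsch score matrix for gap = -2,
    match = +2, mismatch = -1, keeping one row in memory."""
    prev = [-2 * j for j in range(len(y) + 1)]   # row 0: cumulative insertions
    for i, a in enumerate(x, 1):
        curr = [-2 * i]                          # column 0: cumulative deletions
        for j, b in enumerate(y, 1):
            curr.append(max(prev[j - 1] + (2 if a == b else -1),
                            prev[j] - 2,
                            curr[j - 1] - 2))
        prev = curr
    return prev

score_l = nw_score("AGTA", "TATGC")
score_r = nw_score("CGCA"[::-1], "TATGC"[::-1])
total = [l + r for l, r in zip(score_l, score_r[::-1])]
ymid = total.index(max(total))
print(score_l)        # [-8, -4, 0, -2, -1, -3]
print(score_r[::-1])  # [-3, -1, 1, 0, -4, -8]
print(total)          # [-11, -5, 1, -2, -5, -11]
print(ymid)           # 2
```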
The entire Hirschberg recursion (which we omit for brevity) produces the following tree:
                (AGTACGCA,TATGC)
               /                \
      (AGTA,TA)                  (CGCA,TGC)
      /       \                  /         \
(AG, )     (TA,TA)          (CG,TG)      (CA,C)
           /     \          /     \
       (T,T)   (A,A)    (C,T)   (G,G)
The leaves of the tree contain the optimal alignment.
References
- ^ Hirschberg's algorithm.
- ^ "The Algorithm".
- ^ Hirschberg, D. S. (1975). "A linear space algorithm for computing maximal common subsequences". Communications of the ACM. 18 (6): 341–343. CiteSeerX 10.1.1.348.4774. doi:10.1145/360825.360861. MR 0375829. S2CID 207694727.
Hirschberg's algorithm
Introduction
Overview
Hirschberg's algorithm is a dynamic programming technique for computing the longest common subsequence (LCS) of two sequences of lengths m and n by employing a divide-and-conquer strategy to optimize space usage.[4] The LCS of two sequences is the longest subsequence present in both, preserving the relative order of elements but not necessarily contiguous positions.[4] The algorithm's primary innovation lies in its reduction of space complexity from the quadratic O(mn) required by the conventional dynamic programming approach to linear O(min(m, n)), while retaining the O(mn) time complexity.[4] This efficiency addresses critical memory limitations when processing lengthy sequences, such as those encountered in bioinformatics for DNA or protein analysis, where quadratic space demands can exceed practical hardware constraints.[5] Although originally formulated for the unweighted LCS problem, Hirschberg's method extends naturally to scored global sequence alignments, enabling linear-space computation of optimal alignments akin to the Needleman-Wunsch framework.[6]
Historical Development
Hirschberg's algorithm was developed by Daniel S. Hirschberg, a computer scientist then affiliated with Princeton University. In 1975, Hirschberg published the foundational work introducing the algorithm as a solution to the longest common subsequence (LCS) problem, achieving quadratic time complexity while reducing space requirements from quadratic to linear. The paper, titled "A Linear Space Algorithm for Computing Maximal Common Subsequences," appeared in Communications of the ACM and addressed the practical limitations of prior dynamic programming approaches, which demanded excessive memory for even moderately long strings.[1][7] The algorithm emerged during a period of expanding interest in efficient string processing techniques in the mid-1970s, driven by nascent applications in computational biology—such as sequence alignment following the 1970 Needleman-Wunsch algorithm—and in data compression and text differencing for resource-limited computing environments.[4] Hirschberg's innovation drew conceptual inspiration from complexity theory, particularly ideas in space-time tradeoffs exemplified by Savitch's 1970 theorem on simulating nondeterministic space with deterministic space at a quadratic penalty, adapting such principles to practical algorithmic design.[3] This context was particularly relevant as early computers, with memory capacities often in the kilobyte range, struggled with the O(n²) space demands of standard LCS methods for strings beyond a few hundred characters. 
In 1977, Hirschberg extended his contributions with a follow-up publication in the Journal of the ACM, titled "Algorithms for the Longest Common Subsequence Problem," which explored variants achieving sub-quadratic time in cases where the LCS length was small relative to input sizes, further refining efficiency for specialized scenarios like file comparisons.[4][7] The algorithm's initial impact lay in providing a viable method for memory-constrained settings of the era, enabling LCS computation on strings up to length 10,000 using only about 100K bytes of space, far more practical than the million-plus bytes required by predecessors, thus facilitating broader adoption in early software tools for text analysis and biological sequence handling.[1]
Background Concepts
Longest Common Subsequence Problem
The longest common subsequence (LCS) problem involves identifying the longest subsequence present in two given sequences while preserving the relative order of elements, though the elements need not be contiguous.[8] Formally, given two sequences X = x1 x2 ... xm and Y = y1 y2 ... yn over some alphabet, the LCS length is the maximum value of k such that there exist strictly increasing indices i1 < i2 < ... < ik and j1 < j2 < ... < jk satisfying x_{i_t} = y_{j_t} for all t.[8] For example, consider the sequences "ABCBDAB" and "BDCAB"; possible LCS include "BCAB" and "BDAB", both of length 4.[9] While the LCS problem for two sequences can be solved in polynomial time using dynamic programming, it becomes NP-hard when extended to an arbitrary number of sequences, as established in early complexity analyses. The LCS problem serves as a foundational tool in various domains, including diff utilities for comparing file versions, plagiarism detection by measuring textual similarities, and evolutionary biology for aligning biological sequences like DNA or proteins.[10][11]
Standard Dynamic Programming Approach
The standard dynamic programming approach to computing the longest common subsequence (LCS) of two sequences X = x1 ... xm and Y = y1 ... yn constructs a table C of size (m + 1) × (n + 1), where C[i, j] denotes the length of an LCS of the prefixes x1 ... xi and y1 ... yj.[1] The table is populated using the recurrence:

    C[i, j] = C[i - 1, j - 1] + 1                if x_i = y_j
    C[i, j] = max(C[i - 1, j], C[i, j - 1])      otherwise

The base cases are C[i, 0] = 0 for 0 ≤ i ≤ m and C[0, j] = 0 for 0 ≤ j ≤ n.[1] Filling the table requires O(mn) time, as each of the mn entries depends on a constant number of previous entries, and O(mn) space to store the full table.[1] To reconstruct an actual LCS (not just its length), backtracking begins at C[m, n] and traces backward through the table: if x_i = y_j, include the matching character and move to C[i - 1, j - 1]; otherwise, move to the neighbor with the maximum value (breaking ties arbitrarily). This path yields the positions of the LCS characters in reverse order.[1] The quadratic space usage limits applicability to cases where m and n are large, such as comparing lengthy genomic sequences in bioinformatics.[1]
Algorithm Mechanics
Divide-and-Conquer Strategy
Hirschberg's algorithm employs a divide-and-conquer strategy to compute the longest common subsequence (LCS) of two sequences, say A of length m and B of length n with m ≤ n, while achieving linear space complexity. The core idea is to recursively bisect one sequence (typically the shorter one, A) at its midpoint and determine an optimal split point in the other sequence (B) by computing LCS lengths for relevant prefixes and suffixes using only a single row of dynamic programming values at a time, thus avoiding the full quadratic space of the standard approach. This recursive partitioning reduces the problem size in a balanced way, ensuring the overall time remains O(mn) while space drops to O(m + n).[1] The process begins by selecting the midpoint i = ⌊m/2⌋ in A, dividing it into the prefix A[1..i] and suffix A[i+1..m]. A forward dynamic programming pass is then performed to compute the array L1, where L1[j] represents the length of the LCS between A[1..i] and B[1..j] for each j = 0 to n. This computation reuses the standard DP recurrence but stores only the final row corresponding to A[1..i], requiring O(n) space and O(mn) time.[12][1] To handle the suffix, a backward dynamic programming pass is executed on the suffix A[i+1..m] and B, yielding the array L2, where L2[j] denotes the LCS length between A[i+1..m] and the suffix B[j+1..n] for each j = 0 to n. Like the forward pass, this uses only O(n) space and O(mn) time.[12][1] The optimal split point in B is then identified by scanning the arrays to find the j that maximizes L1[j] + L2[j], which gives the position where the total LCS length across the split is longest; this scan takes O(n) time. The algorithm recurses on the two subproblems: the left half consisting of A[1..i] and B[1..j], and the right half consisting of A[i+1..m] and B[j+1..n], concatenating the results from these recursions to form the overall LCS.[6][1] Recursion terminates at base cases when the length of A (or the divided portion) is at most 1: if A is empty, the LCS is empty; if A has length 1, a linear scan of B identifies any matching character to include or returns empty otherwise.
These base cases ensure the recursion depth is O(log m), balancing the divide.[12][1]
Recursive Computation Steps
The recursive computation in Hirschberg's algorithm proceeds by dividing the problem of finding the longest common subsequence (LCS) between a string A and the full string B of length n, assuming without loss of generality that the length of A is at most that of B to optimize space. The function, typically denoted Hirschberg(A, B), first checks the base case: if length(A) = 0, it returns an empty sequence, as there are no elements to match. Otherwise, it computes the midpoint mid = ⌊length(A)/2⌋, effectively splitting A into a left half A[1..mid] and a right half A[mid+1..]. This divide-and-conquer approach builds on the strategy of halving the problem size at each recursive level. Next, the forward computation builds the dynamic programming (DP) row for the LCS lengths between the left half and all prefixes of B. Specifically, it constructs an array L1 where L1[j] represents the length of the LCS between A[1..mid] and B[1..j], computed using the standard DP recurrence for LCS but retaining only the final row to achieve linear space usage. This is done by iterating through the characters of the left half and updating a single array of size n + 1, simulating the last row of the full DP table for this subproblem. Similarly, the backward computation addresses the right half by building the DP row for the LCS lengths between the right half and all suffixes of B. This yields an array L2 where L2[j] is the LCS length between A[mid+1..] and the suffix B[j+1..n], again using only linear space for the final row. These computations ensure that the alignments for prefixes and suffixes of B are available without storing the entire two-dimensional DP table.[12] The merging step then combines these rows to identify the optimal split point in B. For each possible split position j from 0 to n, it calculates the total LCS length as L1[j] + L2[j], which represents the combined LCS length for A[1..mid] with B[1..j] and A[mid+1..] with B[j+1..n]. The position j* is selected as the split that maximizes this value, determining how B should be divided to preserve the overall LCS.
Finally, the recursion concatenates the results from two subcalls: Hirschberg(A[1..mid], B[1..j*]) for the left subproblem and Hirschberg(A[mid+1..], B[j*+1..n]) for the right subproblem, building the full subsequence bottom-up through these halvings. While the algorithm is naturally recursive, with a recursion depth logarithmic in the length of A, it can alternatively be implemented iteratively using a stack to manage subproblems, providing better control over stack overflow in deep calls, though the recursive form remains the standard presentation.[13]
Implementation and Recovery
Pseudocode Outline
Hirschberg's algorithm for computing the longest common subsequence (LCS) of two strings X and Y employs a recursive divide-and-conquer approach that achieves linear space complexity while maintaining quadratic time. The core procedure, often implemented in a functional style, recursively splits X at its midpoint and determines the optimal split point in Y by maximizing the sum of LCS lengths from forward and backward passes. This yields the actual LCS string by concatenating results from subproblems. The following pseudocode outlines the main hirschberg function, assuming strings X (length m) and Y (length n):
def hirschberg(X, Y):
    if len(X) == 0:
        return ""
    if len(X) == 1:
        return X if X in Y else ""
    mid = len(X) // 2
    L1 = dp_row(X[:mid], Y)
    L2 = dp_row_reverse(X[mid:], Y)
    j_star = argmax(L1[j] + L2[len(Y) - j] for j in range(len(Y) + 1))
    return hirschberg(X[:mid], Y[:j_star]) + hirschberg(X[mid:], Y[j_star:])
dp_row(A, B) computes the last row of the standard dynamic programming table for the LCS lengths between A and the prefixes of B up to each position j, using O(|A||B|) time and O(|B|) space. It is implemented efficiently with two arrays (prev and curr) that alternate roles as rows, avoiding quadratic space. Similarly, dp_row_reverse(A, B) computes the LCS lengths for the reversed A against the reversed B, producing a row that, when indexed from the end, simulates the backward pass over the suffixes of the original Y; this allows L2[len(Y) - j] to represent the LCS length between the suffix of X and the suffix of Y starting after position j. Both helpers follow the space-optimized DP procedure from the original formulation.
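A minimal Python sketch of the two helpers follows (illustrative code under the conventions above, not code from the source; only the final DP row is ever kept):

```python
def dp_row(a, b):
    """Last row of the LCS-length DP table between a and every prefix of b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ch == bj
                        else max(prev[j], curr[j - 1]))
        prev = curr          # previous row can now be discarded
    return prev

def dp_row_reverse(a, b):
    """dp_row on the reversed inputs; dp_row_reverse(a, b)[len(b) - j]
    is the LCS length between a and the suffix b[j:]."""
    return dp_row(a[::-1], b[::-1])
```

For instance, dp_row("XMJ", "MZJAWXU") returns [0, 1, 1, 2, 2, 2, 2, 2], the LCS lengths of "XMJ" against successive prefixes of "MZJAWXU".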
Backtracking for Sequence Reconstruction
In Hirschberg's algorithm, the reconstruction of the actual longest common subsequence (LCS) occurs during the unwind of the recursive divide-and-conquer process, integrating the recovery seamlessly with the length computations to maintain linear space efficiency. Once the optimal split point j* is determined for the midpoint of sequence A, the algorithm recursively computes the LCS for the left subproblem (the prefix of A and the prefix of B up to j*) and the right subproblem (the suffixes of A and B from the split onward). The resulting subsequences from these subcalls are then concatenated to form the overall LCS for the current level, ensuring that the full string is built incrementally without requiring additional storage beyond the recursion stack.[1] The base cases in this recursive reconstruction are handled straightforwardly to initiate the process. When the length of sequence A is 0, an empty string is returned as the LCS. If the length of A is 1, the algorithm checks for a matching character in B; if a match exists, that single character is returned, otherwise an empty string is returned. These base cases ensure termination and provide the foundational building blocks for higher-level concatenations.[1] This approach to reconstruction leverages the recursion stack for recovery, avoiding the need to store full alignment paths or an entire dynamic programming matrix, which contrasts with the standard quadratic-space DP method that requires backtracking through a complete table. By performing the string assembly on the return path of the recursion, appending the left LCS to the right LCS at each level, the algorithm achieves the actual LCS output in O(mn) time while using only O(m + n) space. If multiple values of j* maximize the combined lengths from the left and right subproblems, the algorithm selects any one, yielding a single optimal LCS without enumerating all possibilities.[1]
Analysis
Time Complexity
Hirschberg's algorithm computes the longest common subsequence of two sequences of lengths m and n in O(mn) time, matching the complexity of the standard dynamic programming approach.[1] The time complexity arises from the divide-and-conquer recurrence T(m, n) = T(m/2, q) + T(m/2, n - q) + O(mn), where the O(mn) term accounts for computing the forward and backward dynamic programming rows to determine the optimal split point q in the second sequence, and the recursive calls solve disjoint subproblems.[3] This recurrence reflects the halving of the first sequence at each step, with the split in the second sequence adapting to the problem structure.[1] To establish the bound, strong induction on m is applied. Base cases where m ≤ 1 yield T(m, n) = O(n). For the inductive hypothesis, assume T(m', n') ≤ c m' n' holds for all m' < m. In the inductive step, the recursive contributions satisfy T(m/2, q) + T(m/2, n - q) ≤ c(m/2)q + c(m/2)(n - q) = cmn/2, and adding the work at the current level gives T(m, n) ≤ cmn/2 + dmn for some constant d; choosing c ≥ 2d ensures T(m, n) ≤ cmn.[3] An equivalent perspective views the recursion tree as having depth O(log m), where subproblems at each level partition the first sequence disjointly; across all subproblems at depth k, the total dynamic programming work sums to O(mn / 2^k) because the second-sequence intervals, while potentially unbalanced, partition the second sequence, so the products of subproblem dimensions aggregate to the full size, and summing the geometric series over all levels gives O(mn).[14] Thus, the divide-and-conquer structure incurs no asymptotic time overhead compared to the quadratic baseline.[3]
Space Complexity
Hirschberg's algorithm achieves a space complexity of O(min(m, n)) in the dominant term, assuming without loss of generality that m ≥ n, where m and n are the lengths of the input sequences, by storing only the necessary rows of the dynamic programming table during computation. This contrasts sharply with the standard dynamic programming approach for the longest common subsequence problem, which requires O(mn) space to maintain the full table.[1] In each recursive call, the algorithm computes the split point by filling two rows of the dynamic programming table, one for the forward pass and one for the backward pass, each of size n + 1, thus using O(n) space per call for these arrays. These rows are discarded immediately after determining the optimal split in the shorter sequence, ensuring that intermediate computations do not accumulate. The recursion proceeds by dividing the longer sequence and splitting the shorter one accordingly, without retaining prior rows beyond the current level.[1] The recursion depth is O(log m) due to the divide-and-conquer halving of the longer sequence, which would lead to a worst-case usage of O(n log m) if each stack frame retained O(n) space for local variables or temporary storage. However, since the dynamic programming arrays are local and deallocated before deeper recursions, and stack frames typically require only O(1) additional space for indices and split values, the total space remains linear at O(n + m). An iterative implementation can further optimize this to strictly O(n) auxiliary space beyond the input and output.[3] To see why the space is linear, note that at any point in the execution, only the rows for the current recursive call's split computation are in memory; as the recursion unwinds, subproblem solutions are concatenated into the final subsequence without storing full paths or tables from prior levels.
This discard-after-use strategy ensures no quadratic growth, enabling the algorithm to handle large sequences where the standard approach would exceed memory limits.[1]
Illustrative Example
Step-by-Step Walkthrough
To illustrate Hirschberg's algorithm, consider the sequences X = XMJYAUZ (length 7) and Y = MZJAWXU (length 7), for which the longest common subsequence has length 4, such as "MJAU". The algorithm proceeds recursively by dividing X at its midpoint and finding an optimal split point in Y to balance the LCS contributions from the left and right halves.

Step 1: Forward dynamic programming on the left half. Divide X at the midpoint mid = 3, so the left half is X[1..3] = XMJ. Compute the dynamic programming row L1[j] for j = 0 to 7, where L1[j] is the LCS length between "XMJ" and the prefix Y[1..j]. This is done using the standard LCS DP recurrence in linear space by keeping only the previous row: L1[0] = 0, and for j = 1 to 7, the entry extends the diagonal value by one on a character match and otherwise takes the maximum of the left and upper neighbors, yielding the row:

| j (prefix) | 0 | 1 (M) | 2 (MZ) | 3 (MZJ) | 4 (MZJA) | 5 (MZJAW) | 6 (MZJAWX) | 7 (MZJAWXU) |
|---|---|---|---|---|---|---|---|---|
| L1[j] | 0 | 1 | 1 | 2 | 2 | 2 | 2 | 2 |

Step 2: Backward dynamic programming on the right half. The right half is X[4..7] = YAUZ; the backward pass computes L2[j], the LCS length between "YAUZ" and the suffix Y[j+1..7], for j = 0 to 7:

| j | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| L2[j] | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 0 |
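The two rows above, and the split-point selection described in the earlier sections, can be reproduced with a short script (a sketch; dp_row is an assumed linear-space helper in the style of the pseudocode, not code from the source):

```python
def dp_row(a, b):
    """Last row of the LCS-length DP table between a and prefixes of b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ch == bj
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev

X, Y = "XMJYAUZ", "MZJAWXU"
mid = len(X) // 2                            # mid = 3, left half "XMJ"
L1 = dp_row(X[:mid], Y)                      # forward pass
L2 = dp_row(X[mid:][::-1], Y[::-1])[::-1]    # backward pass on "YAUZ"
total = [a + b for a, b in zip(L1, L2)]
print(L1)                        # [0, 1, 1, 2, 2, 2, 2, 2]
print(L2)                        # [2, 2, 2, 2, 1, 1, 1, 0]
print(total.index(max(total)))   # 3  -> Y splits as MZJ | AWXU
```

The maximum combined length, 4, occurs at j = 3, so the recursion continues on the pairs ("XMJ", "MZJ") and ("YAUZ", "AWXU").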
