Recent from talks
Nothing was collected or created yet.
Generalized suffix tree
View on Wikipedia
ABAB and BABA. Suffix links not shown.In computer science, a generalized suffix tree is a suffix tree for a set of strings. Given the set of strings of total length , it is a Patricia tree containing all suffixes of the strings. It is mostly used in bioinformatics.[1]
Functionality
[edit]It can be built in time and space, and can be used to find all z occurrences of a string P of length m in time, which is asymptotically optimal (assuming the size of the alphabet is constant[2]: 119 ).
When constructing such a tree, each string should be padded with a unique out-of-alphabet marker symbol (or string) to ensure no suffix is a substring of another, guaranteeing each suffix is represented by a unique leaf node.
Algorithms for constructing a GST include Ukkonen's algorithm (1995) and McCreight's algorithm (1976).
Example
[edit]A suffix tree for the strings ABAB and BABA is shown in a figure above. They are padded with the unique terminator strings $0 and $1. The numbers in the leaf nodes are string number and starting position. Notice how a left to right traversal of the leaf nodes corresponds to the sorted order of the suffixes. The terminators might be strings or unique single symbols. Edges on $ from the root are left out in this example.
Alternatives
[edit]An alternative to building a generalized suffix tree is to concatenate the strings, and build a regular suffix tree or suffix array for the resulting string. When hits are evaluated after a search, global positions are mapped into documents and local positions with some algorithm and/or data structure, such as a binary search in the starting/ending positions of the documents.
References
[edit]- ^ Paul Bieganski; John Riedl; John Carlis; Ernest F. Retzel (1994). "Generalized Suffix Trees for Biological Sequence Data". Biotechnology Computing, Proceedings of the Twenty-Seventh Hawaii International Conference on. pp. 35–44. doi:10.1109/HICSS.1994.323593.
- ^ Gusfield, Dan (1999) [1997]. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. USA: Cambridge University Press. ISBN 978-0-521-58519-4.
External links
[edit]
Media related to Generalized suffix tree at Wikimedia Commons- A C implementation of Generalized Suffix Tree for two strings
Generalized suffix tree
View on GrokipediaFundamentals
Definition
A generalized suffix tree for a set of strings is a compressed trie, also known as a Patricia tree, containing all suffixes of each string . Each leaf in the tree is labeled with the string index and the starting position of the corresponding suffix within . To distinguish suffixes originating from different strings and prevent cross-string overlaps, a unique terminal symbol \_i S_i S_1 $_1, S_2 $_2, \dots, S_d $_d $.[3] This tree is equivalently defined as the suffix tree of the single concatenated string T = S_1 \_1 S_2 $_2 \dots S_d $_d $, where the terminal symbols ensure that no suffix of one augmented string is a prefix of another. In the trie representation, explicit nodes correspond to branching points or leaves, while implicit nodes represent positions embedded within edge labels, allowing for compression of unary paths.[3]Historical Context and Motivation
The suffix tree, a foundational data structure for efficient string processing, was first introduced by Peter Weiner in 1973 as a "position tree" enabling linear-time pattern matching.[4] This was followed by Ed McCreight's 1976 algorithm, which refined the construction to a left-to-right approach while renaming the structure a "suffix tree" and maintaining linear time complexity.[5] The generalized suffix tree emerged as a natural extension of these single-string methods in the late 1980s and early 1990s, adapting the core algorithms to handle multiple strings by concatenating them with unique terminators and building a unified tree over the combined input.[1] A key early application appeared in Dan Gusfield's 1992 work on the all-pairs suffix-prefix problem, where the generalized suffix tree facilitated efficient computation across sets of strings.[6] Esko Ukkonen's 1995 online construction algorithm further advanced this by providing a simpler, linear-time method adaptable to the generalized case without recomputing from scratch. The primary motivation for developing generalized suffix trees stemmed from the limitations of single-string suffix trees in scenarios requiring simultaneous analysis of multiple texts, such as comparing DNA sequences in bioinformatics, where identifying common motifs or alignments across genomes demands shared structural representation.[7] In natural language processing, they enable tasks like document similarity and phrase extraction from corpora by capturing shared substrings efficiently.[1] Similarly, in plagiarism detection, the structure supports rapid identification of overlapping content across documents, as highlighted in early applications for parameterized string matching.[8] These fields, particularly bioinformatics during the mid-1990s genome projects, underscored the need for tools beyond isolated string processing, as naive approaches like separate suffix trees for each string lacked the integrated view necessary for cross-string queries.[1] Compared to building independent suffix trees for each string—a naive multiple approach that, while also O(n) in space, hinders direct comparison of shared suffixes—generalized suffix trees achieve space efficiency of O(n) total size for n characters across all strings, leveraging prefix sharing in the trie structure.[9] Their construction remains linear time via adaptations of Weiner, McCreight, or Ukkonen's algorithms on the concatenated input, avoiding quadratic overheads in pattern matching or substring discovery. The evolution of generalized suffix trees progressed from rudimentary concatenation techniques in Weiner's and McCreight's eras, which treated multiple strings as a single augmented one, to more optimized online algorithms like Ukkonen's that directly incorporate multiple-string handling during incremental building.[1] This shift enabled broader adoption in computational biology and text analysis by the late 1990s, emphasizing direct multi-string processing without excessive preprocessing.[9]Construction Methods
Concatenation Approach
The concatenation approach provides a straightforward method for constructing a generalized suffix tree from a set of input strings by reducing the problem to building a standard suffix tree on a single concatenated string. This technique ensures that all suffixes from the individual strings are represented distinctly in the resulting tree structure. To implement this method, unique terminal symbols—distinct from each other and from the alphabet of the input strings—are appended to the end of each string. These terminals prevent suffixes originating from different strings from being erroneously combined during tree construction, as they act as non-overlapping delimiters. For example, given strings , append terminals \_1, $_2, \dots, $_k ) (where each ( $_i S_1 $_1, S_2 $_2, \dots, S_k $_k ). These are then concatenated into a superstring ( T = S_1 $_1 S_2 $_2 \dots S_k $_k T $ using an established linear-time algorithm, such as Ukkonen's online construction method for single strings.[10][11][12] The time and space complexity of this approach is , where is the total length of the superstring, matching the efficiency of standard suffix tree construction algorithms assuming a constant-sized alphabet.[12][10] A limitation of this method is its offline nature; it does not support efficient incremental additions of new strings, as incorporating additional input requires reconstructing the entire superstring and suffix tree anew.[11] The following pseudocode outlines the core steps:function BuildGeneralizedSuffixTree(strings S[1..k]):
terminals = generate_unique_terminals(k) // Distinct symbols outside input alphabet
T = [empty string](/page/Empty_string)
for i = 1 to k:
T = T + S[i] + terminals[i]
return SuffixTreeConstruction(T) // e.g., using Ukkonen's [algorithm](/page/Algorithm)
function BuildGeneralizedSuffixTree(strings S[1..k]):
terminals = generate_unique_terminals(k) // Distinct symbols outside input alphabet
T = [empty string](/page/Empty_string)
for i = 1 to k:
T = T + S[i] + terminals[i]
return SuffixTreeConstruction(T) // e.g., using Ukkonen's [algorithm](/page/Algorithm)
Online Construction Algorithms
Online construction algorithms for generalized suffix trees extend the principles of Ukkonen's linear-time method for single strings to handle multiple strings directly, avoiding explicit concatenation by representing substrings with references to their originating strings. This approach builds the tree incrementally as strings are added, sharing structure for common substrings across strings while maintaining distinct leaf representations for suffixes from different strings. The resulting tree encodes all suffixes of the input strings in O(n total space and time, where n is the aggregate length of the strings. Key modifications to Ukkonen's algorithm include incorporating string identifiers into edge labels, typically as triples (string_id, start_index, end_index) to denote substrings without copying characters. Suffix links are generalized to span contexts across strings, enabling efficient traversal and implicit suffix insertion regardless of the source string. End markers are handled per string during construction, often as unique terminators appended virtually to each string's suffixes to prevent cross-string suffix confusion without altering the original texts. These changes ensure the tree remains compact and queryable for multi-string operations.[13] The algorithm proceeds by initializing an empty suffix tree with a root node. Strings are added sequentially: for each new string of length m, the procedure simulates Ukkonen's online phases by processing its characters from left to right, starting with the active point at the root. For each position i in the new string, all implicit suffixes ending at i are added via a single extension step, leveraging existing tree paths for prefixes shared with prior strings. Traversals fetch characters using the edge labels' string references, splitting edges at mismatch points and creating new leaves labeled with the current string's identifier. Suffix links facilitate jumping to the longest proper suffix, amortizing the cost across phases. This ensures complete representation of all suffixes from the added string while integrating with the existing structure. The overall time complexity is O(n), as each of the n characters across all strings is involved in a constant number of operations on average, thanks to the suffix link accelerations and active point management. The method supports dynamic insertion of new strings in amortized O(m) time for a string of length m, allowing online updates to the collection without rebuilding the entire tree.[13] A notable variant, described by Gusfield drawing from Farach's divide-and-conquer framework, adapts the approach for multiple strings by recursively partitioning suffixes based on initial characters while respecting string boundaries through identifier tracking. This method constructs the tree bottom-up: it builds subtrees for groups of suffixes from the same or different strings, then merges them linearly by resolving cross-string branches at internal nodes. The extension phase in this variant focuses on integrating a new suffix group by matching against the current partial tree and updating links, achieving O(n) time overall for fixed alphabets. Brief pseudocode for the extension phase resembles:function extend(active_node, active_edge, active_length, pos, string_id, char):
if active_edge != null:
walk_down(active_node, active_edge, active_length, pos, string_id)
remaining = active_length - walked
if remaining > 0 and edge_label(active_node, next_char) matches char:
// Implicit extension via active point
update_active_length(remaining + 1)
else:
// Split edge and create new [leaf](/page/Leaf)
split_edge(active_node, active_edge, walked)
new_leaf = create_leaf(string_id, pos)
add_child(active_node, char, new_leaf)
update_suffix_link()
function extend(active_node, active_edge, active_length, pos, string_id, char):
if active_edge != null:
walk_down(active_node, active_edge, active_length, pos, string_id)
remaining = active_length - walked
if remaining > 0 and edge_label(active_node, next_char) matches char:
// Implicit extension via active point
update_active_length(remaining + 1)
else:
// Split edge and create new [leaf](/page/Leaf)
split_edge(active_node, active_edge, walked)
new_leaf = create_leaf(string_id, pos)
add_child(active_node, char, new_leaf)
update_suffix_link()