Hubbry Logo
SubstringSubstringMain
Open search
Substring
Community hub
Substring
logo
8 pages, 0 posts
0 subscribers
Be the first to start a discussion here.
Be the first to start a discussion here.
Substring
Substring
from Wikipedia
"string" is a substring of "substring"

In formal language theory and computer science, a substring is a contiguous sequence of characters within a string. For instance, "the best of" is a substring of "It was the best of times". In contrast, "Itwastimes" is a subsequence of "It was the best of times", but not a substring.

Prefixes and suffixes are special cases of substrings. A prefix of a string is a substring of that occurs at the beginning of ; likewise, a suffix of a string is a substring that occurs at the end of .

The substrings of the string "apple" would be: "a", "ap", "app", "appl", "apple", "p", "pp", "ppl", "pple", "pl", "ple", "l", "le" "e", "" (note the empty string at the end).

Substring

[edit]

A string is a substring (or factor)[1] of a string if there exists two strings and such that . In particular, the empty string is a substring of every string.

Example: The string is equal to substrings (and subsequences) of at two different offsets:

banana
 |||||
 ana||
   |||
   ana

The first occurrence is obtained with and , while the second occurrence is obtained with and being the empty string.

A substring of a string is a prefix of a suffix of the string, and equivalently a suffix of a prefix; for example, nan is a prefix of nana, which is in turn a suffix of banana. If is a substring of , it is also a subsequence, which is a more general concept. The occurrences of a given pattern in a given string can be found with a string searching algorithm. Finding the longest string which is equal to a substring of two or more strings is known as the longest common substring problem. In the mathematical literature, substrings are also called subwords (in America) or factors (in Europe). [citation needed]

Prefix

[edit]

A string is a prefix[1] of a string if there exists a string such that . A proper prefix of a string is not equal to the string itself;[2] some sources[3] in addition restrict a proper prefix to be non-empty. A prefix can be seen as a special case of a substring.

Example: The string ban is equal to a prefix (and substring and subsequence) of the string banana:

banana
|||
ban

The square subset symbol is sometimes used to indicate a prefix, so that denotes that is a prefix of . This defines a binary relation on strings, called the prefix relation, which is a particular kind of prefix order.

Suffix

[edit]

A string is a suffix[1] of a string if there exists a string such that . A proper suffix of a string is not equal to the string itself. A more restricted interpretation is that it is also not empty.[1] A suffix can be seen as a special case of a substring.

Example: The string nana is equal to a suffix (and substring and subsequence) of the string banana:

banana
  ||||
  nana

A suffix tree for a string is a trie data structure that represents all of its suffixes. Suffix trees have large numbers of applications in string algorithms. The suffix array is a simplified version of this data structure that lists the start positions of the suffixes in alphabetically sorted order; it has many of the same applications.

Border

[edit]

A border is suffix and prefix of the same string, e.g. "" is a border of "" (and also of "").[citation needed]

Superstring

[edit]

A superstring of a finite set of strings is a single string that contains every string in as a substring. For example, is a superstring of , and is a shorter one. Concatenating all members of , in arbitrary order, always obtains a trivial superstring of . Finding superstrings whose length is as small as possible is a more interesting problem.

A string that contains every possible permutation of a specified character set is called a superpermutation.

See also

[edit]

References

[edit]
Revisions and contributorsEdit on WikipediaRead on Wikipedia
from Grokipedia
In and theory, a substring is a contiguous sequence of characters within a larger , formally defined as a string yy such that the original string ww can be expressed as w=xyzw = x y z for some (possibly empty) strings xx and zz. This distinguishes substrings from subsequences, which may consist of non-adjacent characters while preserving order. Substrings are fundamental to processing and manipulation in algorithms, enabling operations like , text searching, and data extraction in computational tasks. Substrings are implemented across programming languages through dedicated methods or operators that extract portions based on indices. In , the substring(int beginIndex, int endIndex) method returns a new string from the inclusive starting index to the exclusive ending index, creating an immutable copy of the specified contiguous segment. For example, "hamburger".substring(4, 8) yields "urge". In Python, substrings are obtained via slicing with the syntax s[start:stop:step], where start is inclusive and stop is exclusive, allowing flexible extraction such as s[2:5] for characters at positions 2 through 4. In Oracle SQL, substrings are extracted using the SUBSTR function, with syntax SUBSTR(string, position, length), where position indicates the starting point (1-based indexing) and length the number of characters to extract. These mechanisms support efficient handling of s in applications ranging from bioinformatics to , where identifying and manipulating contiguous patterns is essential.

Definitions and Basic Concepts

Definition

A substring of a string SS is defined as any contiguous sequence of characters within SS. Formally, a string yy is a substring of SS if there exist strings xx and zz (possibly empty) such that S=xyzS = x y z, ensuring that the characters in yy appear in SS in the same order and without interruption. Substrings preserve both the order and adjacency of characters from the original string. Every string is a substring of itself (with x=z=ϵx = z = \epsilon, the empty string), and the empty string ϵ\epsilon is a substring of every string. Additionally, proper substrings exclude the full string itself, focusing on non-trivial contiguous segments. For example, consider the string S=S = "abc" with indices starting at 0. The substrings include "a" (positions 0 to 0), "ab" (0 to 1), "abc" (0 to 2), "b" (1 to 1), "bc" (1 to 2), and "c" (2 to 2), along with the empty string. Another illustration: for S=S = "472", substrings are ϵ\epsilon, "4", "7", "2", "47", "72", and "472", but "42" is not a substring due to lack of contiguity. In , a substring is often denoted as S[i..j]S[i..j], representing the sequence from index ii to jj inclusive, where 0ij<S0 \leq i \leq j < |S| and S|S| is the length of SS. In programming languages like Python, this corresponds to slice notation S[i:j+1]S[i:j+1], which extracts characters from index ii to jj (end index exclusive). For instance, in Python, "abc"[0:3] yields "abc". Prefixes and suffixes represent special cases of substrings that start from the beginning or end of SS, respectively.

Substring vs. Subsequence

A subsequence of a string is a sequence derived from the string by deleting some or no characters without changing the order of the remaining characters, and it does not require the selected characters to be contiguous in the original string. In contrast, a substring is defined as a contiguous portion of the string, where all characters appear consecutively without any gaps. The primary distinction between the two lies in this contiguity requirement: substrings must form a continuous block, whereas subsequences preserve only the relative order and allow for intervening characters to be skipped. For instance, in the string "abc", the sequence "ac" qualifies as a subsequence by selecting the first and third characters but fails as a substring due to the non-adjacent positions of 'a' and 'c'. Similarly, for the string S = "abcde", "ace" is a subsequence obtained from positions 1, 3, and 5, yet it is not a substring because the characters are not contiguous; conversely, "bcd" serves as both a substring (positions 2-4) and a subsequence. This difference has significant implications for their applications in computer science. Substrings, being stricter in structure, are essential in tasks like exact pattern matching and string searching, where adjacency and positional integrity are critical, such as in implementing the Knuth-Morris-Pratt algorithm for efficient substring detection. Subsequences, however, enable more flexible analyses that tolerate gaps, powering problems like the longest common subsequence for biological sequence alignment or the longest increasing subsequence for optimization in data sequences and resource scheduling.

Types of Substrings

Prefix

A prefix of a string SS is a substring that begins at the first character (position 0) and extends to some position jj, where 0jS0 \leq j \leq |S|. Formally, a string vΣv \in \Sigma^* is a prefix of a string uΣu \in \Sigma^* if there exists a string uΣu' \in \Sigma^* such that u=vuu = v u'. The set of all prefixes of SS includes strings of lengths ranging from 0 to S|S|, with the empty string (length 0) and SS itself (length S|S|) considered trivial prefixes. Proper prefixes exclude the full string SS, including the empty string and all prefixes shorter than SS. For example, the prefixes of the string S=S = "banana" are the empty string "", "b", "ba", "ban", "bana", "banan", and "banana". In string theory, prefixes play a foundational role in properties related to code design and language recognition, such as ensuring no codeword is a prefix of another in prefix codes for efficient decoding, though their primary significance lies in delineating initial segments of strings.

Suffix

In computer science, a suffix of a string SS of length nn is defined as any substring that begins at some position ii (where 0in0 \leq i \leq n) and extends to the end of SS, specifically ending at position n1n-1. More formally, a string vv is a suffix of a string uu if there exists a prefix uu' such that u=uvu = u' v. This positions the suffix as a contiguous sequence anchored at the string's terminus, distinguishing it from other substring types by its fixed endpoint. The set of all suffixes for a given string includes those of varying lengths, ranging from 0 to nn. The empty string (of length 0) and the full string itself (of length nn) are considered trivial suffixes, as they universally apply to any string without providing distinctive structural insight. Proper suffixes, by contrast, exclude the full string, including the empty string and all suffixes shorter than the original. These properties highlight the hierarchical nature of suffixes, where each starting position generates a unique tail segment, enabling analysis of a string's repetitive or overlapping end patterns. For example, consider the string S=S = "banana" (n=6n = 6). Its suffixes are:
  • "banana" (length 6, trivial full string)
  • "anana" (length 5)
  • "nana" (length 4)
  • "ana" (length 3)
  • "na" (length 2)
  • "a" (length 1)
  • "" (length 0, trivial empty string). This enumeration illustrates how suffixes capture progressively shorter terminal segments, useful for examining end-based repetitions like the recurring "a" and "na" in this case.
In string theory, suffixes form the basis for advanced data structures such as suffix arrays and trees, which sort and index these segments to facilitate efficient pattern matching and substring queries without exhaustive scanning. For instance, the seminal work on suffix arrays leverages the sorted order of all suffixes to achieve linear-time searches in preprocessed texts.

Border

In string theory, a border of a string SS of length nn is defined as a proper prefix of SS that is also a proper suffix of SS, meaning it is non-empty and has length strictly less than nn. The empty string is considered a trivial border of any non-empty string, but borders typically refer to non-trivial ones unless specified otherwise. Borders play a key role in characterizing the periodicity of strings, as the existence of a border of length bb implies a period of length nbn - b, where the string repeats every nbn - b positions up to the overlapping portion. A string is termed unbordered if it possesses no non-trivial borders. Unbordered strings are necessarily primitive, meaning they cannot be expressed as a non-trivial power uku^k (with k2k \geq 2) of a shorter string uu. For example, in the string S=S = "abab", the substring "ab" is a border because it matches both the initial prefix and the terminal suffix. In contrast, the string S=S = "abc" has no non-trivial border, making it unbordered. Among the possible borders of a string, the longest border is of particular interest and is often denoted as Border(S)\text{Border}(S). Conjugate borders extend this concept to the cyclic shifts (conjugates) of SS, analyzing how borders vary under rotation to study cyclic periodicity. Furthermore, borders are intimately linked to the notion of periods in strings; for instance, the Knuth-Morris-Pratt algorithm leverages border information implicitly through its prefix table to exploit string periodicity in pattern matching.

Advanced Concepts and Relations

Superstring

A superstring of a string SS is defined as any string TT that contains SS as a contiguous substring. The shortest superstring of SS is then SS itself, as it achieves the minimal length while satisfying the containment condition. Trivial superstrings of SS include SS prepended or appended with any additional characters, such as inserting symbols before or after SS in TT. Overlaps play a key role in constructing more compact superstrings, particularly when extending SS with characters that align partially with its prefixes or suffixes, thereby minimizing the added length. For example, given S="ab"S = \text{"ab"}, valid superstrings include "xab"\text{"xab"}, "abc"\text{"abc"}, and "ab"\text{"ab"} itself, with the latter being the shortest. In the context of multiple strings, the shortest common superstring (SCS) problem seeks the shortest string TT that contains each string in a given set as a contiguous substring, often leveraging overlaps to reduce total length. For instance, for the set {"ab","bc"}\{ \text{"ab"}, \text{"bc"} \}, a shortest common superstring is "abc"\text{"abc"}, where the overlap of "b" allows concatenation without redundancy. The SCS problem is NP-hard, as it can be reduced to finding a maximum Hamiltonian path in an overlap graph representation of the strings. This problem has significant applications in computational biology, such as DNA shotgun sequencing, where overlapping fragments are assembled into a contiguous genome sequence modeled as an SCS, and in data compression, where strings are merged to form compact representations. Seminal work on approximation algorithms for SCS, achieving factors close to optimal overlaps, underscores its theoretical importance.

Longest Common Substring

The longest common substring (LCS) of two strings SS and TT is defined as the longest contiguous sequence of characters that appears as a substring in both strings. This problem focuses on exact, uninterrupted matches, distinguishing it from related concepts like the longest common subsequence, which allows non-contiguous elements. The LCS is a fundamental primitive in string algorithms, enabling the identification of shared patterns across strings of varying lengths. Key properties of the LCS include the fact that it is not necessarily unique; multiple substrings of the maximum length may exist between the same pair of strings. For instance, in strings "ABCDGH" and "AEDFHR", both "A" and "D" could be candidates if no longer match exists, but longer ones take precedence. The length of the LCS can be computed efficiently using dynamic programming in O(ST)O(|S| \cdot |T|) time and space, where S|S| and T|T| are the lengths of the input strings, making it practical for moderate-sized inputs. A trivial case occurs when no common characters exist, yielding an empty string as the LCS with length 0. These properties highlight the problem's deterministic nature and its scalability for computational purposes. Consider the example where S=S = "xabcy" and T=T = "abcz"; the LCS is "abc" with length 3, as it is the longest contiguous match present in both. In contrast, for S=S = "abc" and T=T = "def", the LCS is the empty string "", since no characters overlap. These examples illustrate how the LCS captures exact textual overlaps, which can be visualized through a dynamic programming table where entries represent the length of the common substring ending at specific positions in SS and TT. Conceptually, the approach builds a 2D matrix filled row-by-row: for positions ii in SS and jj in TT, if S=TS = T, the value is the diagonal predecessor plus one; otherwise, it is zero. The maximum value in this table gives the LCS length, and backtracking recovers the substring itself. The LCS finds significant applications in bioinformatics for assessing gene similarity, such as aligning DNA sequences to identify conserved regions indicative of functional homology, as seen in microarray analysis and sequence comparison tasks. In plagiarism detection, it helps locate verbatim copied passages by identifying long matching substrings between suspicious and source documents, often integrated into clustering methods to merge approximate matches efficiently. These uses underscore the LCS's role in pattern matching across diverse domains, prioritizing exact overlaps for reliable inference.

Algorithms and Computations

Substring search, also known as string matching, is the problem of finding all occurrences of a given pattern string PP of length mm within a larger text string TT of length nn, typically by identifying the starting positions ii such that T[i..i+m1]=PT[i..i+m-1] = P. The goal is to report these positions efficiently, as naive methods can be computationally expensive for large strings. The simplest approach is the brute-force or naive algorithm, which slides the pattern over the text starting from each possible position ii from 0 to nmn - m and performs a character-by-character comparison until a mismatch or full match is found. This results in a worst-case time complexity of O((nm+1)m)O((n - m + 1) \cdot m), or O(nm)O(nm) when mm is comparable to nn, making it inefficient for long patterns or texts. To achieve better efficiency, the Knuth-Morris-Pratt (KMP) algorithm preprocesses the pattern to compute a prefix table (also called the failure function), which identifies the longest proper prefix of PP that is also a suffix for each prefix length, allowing the search to skip redundant comparisons in the text. This preprocessing leverages border computations from the pattern itself and enables the overall search to run in linear time, O(n+m)O(n + m), independent of the alphabet size. The Boyer-Moore algorithm, in contrast, starts comparisons from the end of the pattern and uses heuristics like the bad-character rule (skipping based on mismatched characters not in the pattern) and good-suffix rule (skipping based on partial matches) to jump ahead in the text, often achieving sublinear average-case performance, though its worst-case time is O(nm)O(nm); variants like Horspool simplify it for practical use while maintaining good speed. For example, searching for the pattern "ana" in the text "banana" using KMP would first compute the prefix table for "ana" as [0, 0, 1], indicating that after a mismatch following "an", the search can shift to align the overlapping "a". This yields matches at starting positions 1 ("ana" from indices 1-3) and 3 ("ana" from indices 3-5), with the failure function allowing efficient skipping without re-examining the initial "b". Substring search variants include exact matching, as described above, and approximate matching, which allows up to kk errors (such as insertions, deletions, or substitutions) between the pattern and text substrings, useful in applications like spell-checking or bioinformatics but requiring more complex dynamic programming or filtering techniques.

Substring Extraction

Substring extraction refers to the process of obtaining a contiguous portion of a string, known as a , specified by starting and ending indices. Given a string SS of length nn and non-negative integers ii and jj where 0ij<n0 \leq i \leq j < n, the goal is to return the substring S[i..j]S[i..j], which includes characters from position ii to jj inclusive. This operation is fundamental in string processing and is supported natively in most programming languages through slicing or substring functions. In Python, substring extraction uses slicing notation S[i:j+1]S[i:j+1], which produces a new string containing the characters from index ii to jj (exclusive of j+1j+1). For example, given S=S = "hello", S[1:4]S[1:4] yields "ell". Python employs zero-based indexing, and strings are immutable, meaning extraction creates a shallow copy without modifying the original. In C++, the std::string::substr method is invoked as S.substr(i,ji+1)S.substr(i, j - i + 1), returning a new string from position ii for the specified length; the same example S.substr(1,3)S.substr(1, 3) produces "ell". C++ strings are mutable, allowing in-place modifications, but extraction typically involves copying. Java's String class provides substring(i, j+1), which extracts from ii inclusive to j+1j+1 exclusive, again yielding "ell" for the example; Java strings are also immutable, and extraction shares the underlying character array where possible for efficiency. Key properties include handling edge cases to ensure robustness. If i>ji > j, the result is an . When i=0i = 0 and j=n1j = n-1, the full string SS is returned. Out-of-bounds indices, such as ini \geq n or jnj \geq n, typically raise exceptions like IndexError in Python or StringIndexOutOfBoundsException in to prevent invalid access. The time complexity is O(ji+1)O(j - i + 1) in the worst case, as it requires copying the characters, though some implementations optimize by avoiding full copies for immutable strings. Note that while most languages use zero-based indexing, some environments like certain SQL dialects employ one-based indexing, requiring adjustment in code. Optimizations in substring extraction often distinguish between in-place views and full copies to balance memory usage and performance. In languages with immutable strings like Python and , modern implementations may use shared references to the original , delaying actual copying until modification is needed—a technique known as . This reduces overhead for frequent extractions, especially in paradigms. In contrast, mutable languages like C++ default to copying but allow custom views via spans or views in C++20 for extraction. These approaches ensure efficient handling of large strings without unnecessary allocations.

References

  1. https://en.wiktionary.org/wiki/superstring
Add your contribution
Related Hubs
User Avatar
No comments yet.