Recent from talks
Contribute something
Nothing was collected or created yet.
Empty string
View on WikipediaThis article needs additional citations for verification. (November 2009) |
In formal language theory, the empty string, also known as the empty word or null string, is the unique string of length zero.
Formal theory
[edit]Formally, a string is a finite, ordered sequence of characters such as letters, digits or spaces. The empty string is the special case where the sequence has length zero, so there are no symbols in the string. There is only one empty string, because two strings are only different if they have different lengths or a different sequence of symbols. In formal treatments,[1] the empty string is denoted with ε or sometimes Λ or λ.
The empty string should not be confused with the empty language ∅, which is a formal language (i.e. a set of strings) that contains no strings, not even the empty string.
The empty string has several properties:
- |ε| = 0. Its string length is zero.
- ε ⋅ s = s ⋅ ε = s. The empty string is the identity element of the concatenation operation. The set of all strings forms a free monoid with respect to ⋅ and ε.
- εR = ε. Reversal of the empty string produces the empty string, so the empty string is a palindrome.
- . Statements that are about all characters in a string are vacuously true.
- The empty string precedes any other string under lexicographical order, because it is the shortest of all strings.[2]
In context-free grammars, a production rule that allows a symbol to produce the empty string is known as an ε-production, and the symbol is said to be "nullable".
Use in programming languages
[edit]In most programming languages, the term "string" often refers to instances of a data type and thus they're a concept distinct from the one in the formal theory. Such strings are typically stored at distinct memory addresses (locations) and thus have an identity. Thus, representatives of the same formal string (e.g., the empty string) may be stored in two or more places in memory and they can be taken as names of the formal empty string.
In this way, there could be multiple representatives of the empty string in memory, in contrast with the formal theory definition, for which there is only one possible empty string. However, a "string comparison function" would indicate that all of these representatives are equal to each other.
Even a string of length zero can require memory to store it, depending on the format being used. In most programming languages, the empty string is distinct from a null reference (or null pointer) because a null reference points to no string at all, not even the empty string. The empty string is a legitimate string, upon which most string operations should work. Some languages treat some or all of the following in similar ways: empty strings, null references, the integer 0, the floating point number 0, the Boolean value false, the ASCII character NUL, or other such values.
The empty string is usually represented similarly to other strings. In implementations with string terminating character (null-terminated strings or plain text lines), the empty string is indicated by the immediate use of this terminating character.
Different functions, methods, macros, or idioms exist for checking if a string is empty in different languages.[example needed]
| λ representation | Programming languages |
|---|---|
""
|
C, C#, C++, Go, Haskell, Java, JavaScript, Julia, Lua, M, Objective-C (as a C string), OCaml, Perl, PHP, PowerShell, Python, Ruby, Scala, Standard ML, Swift, Tcl, Visual Basic .NET |
''
|
APL, Delphi, JavaScript, Lua, MATLAB, Pascal, Perl, PHP, PowerShell, Python, R, Ruby, Smalltalk, SQL |
character(0)
|
R[3] |
{'\0'}
|
C, C++, Objective-C (as a C string) |
new String() (from java.lang.String)
|
Java |
string() (from std::string)
|
C++ |
""s
|
C++ (since the 2014 standard) |
@""
|
Objective-C (as a constant NSString object)
|
[NSString string]
|
Objective-C (as a new NSString object)
|
q(), qq()
|
Perl |
str()[4] """""" r"" u""
|
Python |
%{}%()
|
Ruby |
String::new() (from std::string::String)[5]
|
Rust |
String.Empty (from System.String)
|
C#, Visual Basic .NET |
String.make 0 '-'
|
OCaml |
{}
|
Tcl |
[[]]
|
Lua |
“”
|
PowerShell |
.byte 0
|
A64 |
Representations of the empty string
[edit]This section needs expansion. You can help by adding missing information. (March 2010) |
The empty string is a syntactically valid representation of zero in positional notation (in any base), which does not contain leading zeros. Since the empty string does not have a standard visual representation outside of formal language theory, the number zero is traditionally represented by a single decimal digit 0 instead.
Zero-filled memory area, interpreted as a null-terminated string, is an empty string.
Empty lines of text show the empty string. This can occur from two consecutive EOLs, as often occur in text files. This is sometimes used in text processing to separate paragraphs, e.g. in MediaWiki.
See also
[edit]References
[edit]- ^ Corcoran, John; Frank, William; Maloney, Michael (1974). "String theory". Journal of Symbolic Logic. 39 (4): 625–637. doi:10.2307/2272846. JSTOR 2272846. S2CID 2168826.
- ^ "CSE1002 Lecture Notes – Lexicographic" (PDF). Archived from the original (PDF) on 2009-12-29. Retrieved 2010-03-27.
- ^ There are two ways to create "empty strings" in R; the other is listed here as
"".character(0)creates empty character vectors, which will output 0 when counted. - ^ Another way to make an empty string is multiplying a string by 0 or a negative integer.
- ^ "String in std::string - Rust". doc.rust-lang.org. Retrieved 2022-11-30.
- ^ "about_Quoting_Rules – PowerShell". Microsoft Learn. Archived from the original on 2025-08-14. Retrieved 26 August 2025.
PowerShell treats smart quotation marks, also called typographic or curly quotes, as normal quotation marks for strings. Don't use smart quotation marks to enclose strings.
- ^ All 'smart' quotation marks work as both opening and closing marks in any combination, except that single marks must be paired with single marks and double marks with double marks. For example
"Hello, world„is valid. Guillemets are not supported. Note that the German-style low single quotation mark given here is U+201A ‚ SINGLE LOW-9 QUOTATION MARK; the similar-looking character U+002C , COMMA does not function as a quotation mark. PowerShell's official documentation recommends using straight quotation marks.[6]
Empty string
View on Grokipedia"" and has a length of 0, allowing methods like concatenation and substring operations to function predictably.[4] Similarly, in C++, the std::string class initializes to an empty string by default, and its empty() member function returns true for strings of length zero. In Python, the empty string '' is a built-in type that evaluates to false in boolean contexts and supports operations like joining with other strings.[5]
The empty string plays a critical role in algorithms and data structures, such as regular expressions where it matches positions without consuming input, and in parsing where it enables ε-productions in context-free grammars.[1] It also distinguishes between "no value" (null) and "known empty value," aiding in robust error handling and data validation across systems.[6]
Definition and Notation
Formal Definition
In formal language theory, the empty string is defined as the unique string consisting of zero symbols from a given alphabet, making it the sole sequence of length zero. This distinguishes it from the empty set, which is a collection containing no elements whatsoever, and from the empty language, which is the set of no strings at all.[2][7] Regardless of the alphabet's size or composition, there exists precisely one empty string, as it represents the complete absence of any symbols and thus serves as a fundamental unit in the construction of all possible strings over that alphabet.[8] This uniqueness underscores its role as the foundational element in string theory within formal languages, where it acts as the identity for concatenation operations.[2] The concept of the empty string originated in formal language theory during the mid-20th century, with early formalizations appearing in the work of Noam Chomsky and contemporaries in the 1950s, building on foundational ideas in computability and linguistics. Early works, such as those by Noam Chomsky, denoted it with I (capital i).[9] Syntactically, the empty string can represent zero in positional numeral systems, analogous to an empty sum or the base case in unary representation where no symbols denote the value 0.[10]Common Notations
In formal language theory and automata theory, the empty string is most commonly denoted by the Greek letter ε (epsilon). This notation serves as the identity element for string concatenation and is standard in contemporary textbooks on the subject.[11] Another frequently used symbol is λ (lambda), particularly in discussions of finite automata and regular languages, where it represents the string of length zero.[12] The uppercase variant Λ (capital lambda) appears in some lecture notes and older references on formal languages.[2] In certain fields, the empty set symbol ∅ is occasionally misused to denote the empty string, though it properly represents the set containing no elements, distinct from the unique string of zero length.[13] This distinction is critical, as the empty string belongs to languages like Σ* (all possible strings over alphabet Σ), while ∅ denotes the language with no strings. In informal discussions and programming contexts, the empty string is typically written as a pair of empty quotes, "". For instance, in languages like Python and Java, "" literalizes the empty string. The ε notation gained prominence in the mid-20th century alongside the formalization of regular languages, building on foundational work like Kleene's 1956 paper, which used λ to denote the empty word.[14] Over time, ε became the preferred symbol in theoretical contexts for its clarity and avoidance of overlap with other notations like λ (used in lambda calculus) or ∅. Guidelines for selection recommend ε for rigorous mathematical and theoretical writing to emphasize its algebraic role, while "" suits practical implementations in code where readability for developers is key. In context-free grammars, ε denotes productions that derive the empty string, enabling nullable nonterminals.Theoretical Properties
Algebraic Properties
In abstract algebra, the empty string, denoted , acts as the identity element under the operation of string concatenation. For any string over an alphabet , the concatenation satisfies .[15] This property holds because appending or prepending nothing to a string leaves it unchanged, mirroring the role of the multiplicative identity 1 in integer arithmetic.[16] The empty string also exhibits palindromic symmetry, reading the same forwards and backwards due to the absence of characters. In recursive definitions of palindromes, serves as a base case, satisfying the condition that a string equals its reverse. This property extends to formal definitions where a palindrome satisfies , with .[17] Predicates involving universal quantification over the characters of the empty string hold via vacuous truth. For instance, the statement "every character in is uppercase" is true because there are no characters to falsify it, analogous to universal statements over an empty domain in predicate logic.[18] This logical principle ensures that any property asserted for all elements of the empty set of characters is satisfied without counterexamples. The empty string remains invariant under common string operations, demonstrating closure. Its reversal yields itself, as reversing an empty sequence produces emptiness.[2] Similarly, repetition or powers of , such as for positive integers , result in , preserving the empty structure.[19]Ordering and Relations
In lexicographical ordering over a finite alphabet, the empty string precedes all non-empty strings, as it has length zero and thus comes before any string of positive length when strings are compared position by position until a difference is found or one ends.[20] This property holds in both standard lexicographic order and shortlex order (where shorter strings precede longer ones of equal prefix), ensuring the empty string is the minimal element in the ordered set of all strings.[21] The empty string exhibits specific relational properties with respect to other strings: it is a substring of every string in a vacuous sense, since any string can be expressed as a concatenation involving the empty string without altering its content.[15] Furthermore, the empty string is a prefix and a suffix of every string, because any string satisfies and , where denotes concatenation.[22] In sorted lists of strings under ascending lexicographical order, the empty string always appears first, regardless of the alphabet. For example, over the alphabet , the order begins as .[20] In the context of string-based numeral systems, such as positional representations where digits form strings, the empty string corresponds to the number zero, establishing a bijection between finite digit sequences and non-negative integers starting from 0.[23]Role in Formal Languages
In Language Composition
In formal language theory, a language over an alphabet is a subset of , the Kleene closure of , which includes all finite strings formed from symbols in as well as the empty string . The empty string may or may not be an element of ; for instance, if , then is a non-empty language consisting solely of .[24] Conversely, the empty language contains no strings, including .[25] The empty string plays a fundamental role in language composition operations. In concatenation, the singleton language serves as the identity element, satisfying for any language , since concatenating with any string leaves it unchanged.[26] For union, , establishing the empty language as the identity, while , which equals if and only if .[25] A key operation involving the empty string is the Kleene star , defined as the set of all finite concatenations of zero or more strings from . This always includes as the result of the empty (zero-length) concatenation, ensuring regardless of whether .[27] Thus, even if , .[28] As an example in regular languages, consider the language of all even-length strings over the alphabet , which is regular and includes since its length is 0, an even number. This language can be expressed by the regular expression , where the star operation incorporates as the base case.[29]In Grammars and Productions
In generative grammars, particularly context-free grammars (CFGs) within the Chomsky hierarchy, the empty string, denoted ε, is central to ε-productions, which take the form A → ε where A is a non-terminal symbol, allowing that non-terminal to generate nothing. These productions enable the derivation of ε from non-terminals, distinguishing them from terminals, which cannot be nullable.[30] A non-terminal is termed nullable if it can derive ε, either directly via an ε-production or indirectly through a chain of such derivations; identifying these requires computing the transitive closure of dependencies among productions where ε can be reached.[31] The impact of ε-productions on a grammar's structure is significant, as they introduce ambiguity in derivations and complicate normalization; for example, converting a CFG to Chomsky normal form necessitates eliminating ε-productions to ensure all rules are of the form A → BC or A → a, while preserving the generated language.[32] Algorithms for elimination typically proceed in steps: first, mark all nullable non-terminals by iteratively finding those with productions consisting solely of other nullables or ε; then, for each production A → X₁X₂⋯Xₙ, generate new rules by replacing subsets of nullable Xᵢ with ε, avoiding the original ε-production itself.[33] If the start symbol is nullable and the language includes ε, a special provision may retain S → ε, but otherwise, the process ensures no ε-productions remain.[34] In parsing applications, ε-derivations manifest as optional nodes in syntax trees, permitting structures where phrases or modifiers can be omitted without altering the overall validity of the derivation.[30] For instance, in a grammar for simple expressions, a rule like Expr → ε | Term allows expressions to optionally include terms, reflecting syntactic optionality in the parse tree.[31] The role of ε-productions traces to Noam Chomsky's foundational 1956 paper, where they were integral to defining type-2 grammars in the hierarchy, facilitating the modeling of natural languages with optional elements like adjectives or prepositional phrases.Representations in Computing
In Programming Languages
In programming languages, the empty string is typically represented using specific literals that denote a string of zero length. In C, the empty string literal"" is a null-terminated byte string consisting solely of the terminating null character \0, forming a character array of size 1.[35] In Java, an object-oriented language, the empty string can be represented as the literal "" or via the constant String.EMPTY, which points to a shared immutable instance of the String class with no characters.[4] Python supports both double and single quotes for this purpose, where "" or '' initializes a string object of length 0.[5] In R, a statistical programming language, "" denotes a character vector of length 1 containing an empty string, distinct from character(0), which creates an empty character vector of length 0.[36]
Initialization of empty strings varies by language but often leverages these literals for simplicity. For instance, in C, a pointer to an empty string can be initialized as char *s = "";, which assigns the address of the static empty literal.[35] In Java, initialization occurs via String s = ""; or String s = new String();, both creating an instance representing an empty character sequence. Python allows direct assignment like s = "", and its built-in str() constructor also yields an empty string, aligning with the language's dynamic typing where uninitialized strings are not predefined but can be set to empty literals.[5] R initializes via s <- "" for a single empty string or character(0) for an empty vector, with default string handling in data structures often starting as empty vectors.[36]
Variations in empty string handling arise with Unicode support, where the empty string contains zero Unicode code points and does not include U+0000 (the null character), which is instead used in some encodings to terminate strings but is absent from the empty string itself.[37] In object-oriented languages like Java and Python, empty strings are treated as instances of string classes, while in procedural languages like C, they are arrays or pointers without object overhead.[4][35]
Even with zero length, an empty string requires allocation of a minimal structure—such as an object header in object-oriented languages or a small array in procedural ones—unlike null, which merely indicates the absence of any allocated reference.[4][35] This distinction ensures the empty string can be manipulated as a valid entity, separate from uninitialized or absent values.
Storage and Memory Aspects
In computing, the empty string is typically represented internally as a minimal data structure to optimize memory and processing efficiency. In the C programming language, strings are null-terminated arrays of characters, where the empty string is implemented as a pointer to a single null character\0, requiring only 1 byte for the terminator itself plus the size of the pointer (usually 8 bytes on 64-bit systems).[38] This representation allows standard library functions like strlen to immediately recognize the empty string by encountering the null terminator without scanning further.[38]
In languages like Java, the empty string benefits from interning in the string pool, where all instances of the empty string literal "" reference a single shared object, acting as a singleton to prevent redundant allocations.[4] This immutable object incurs minimal object overhead but is reused across the application, minimizing memory footprint for frequent empty string usage. Similar optimizations appear in other managed languages, such as C#'s string.Empty, which also points to a pre-allocated empty instance to avoid heap allocations.[39]
Memory usage for the empty string is inherently minimal across implementations, often limited to a flag, pointer, or small fixed object, contrasting with non-empty strings that scale with content length. For instance, in low-level representations like C, the total footprint remains under 10 bytes, while higher-level optimizations in Java ensure no additional instances are created beyond the initial singleton.[38][4] These designs prioritize efficiency in scenarios involving temporary or placeholder strings, such as default values in data structures.
Common operations on the empty string emphasize its neutral role in string manipulation. A length check, such as s.length() == 0 in Java or strlen(s) == 0 in C, is a constant-time O(1) verification that confirms emptiness without iterating contents.[4][38] Concatenation with the empty string acts as an identity operation, where appending it to any string s yields s unchanged, as in s + "" or s + ε, avoiding unnecessary copying or allocation in optimized implementations.[4] Trimming operations, like removing whitespace, on an already empty string simply return the same empty instance, further enhancing efficiency.[4]
Practical examples highlight the empty string's role in input handling and performance-critical code. In file I/O, detecting empty inputs is essential for validating reads; for instance, after using fgets in C to read a line, checking if the resulting string's length is zero (after stripping newline) prevents processing invalid or blank entries from files.[38] Similarly, in Java's BufferedReader.readLine(), an empty string indicates a blank line, allowing robust error handling in data parsing.
In string construction scenarios, tools like Java's StringBuilder or C#'s StringBuilder leverage the empty string to initialize buffers without initial content, avoiding reallocations during growth. Initializing with an empty string sets a capacity (default 16 characters) that accommodates subsequent appends, reducing the frequency of internal array resizes and improving throughput in loops building dynamic strings from variable data. This approach is particularly effective in high-volume operations, such as logging or report generation, where starting from empty minimizes overhead compared to repeated concatenations on immutable strings.[39]
Distinctions and Applications
Versus Null and Empty Values
In computing, the empty string is defined as a valid string object with a length of zero, such as"" in most programming languages, representing a deliberate absence of characters while still being an instance of the string type.[6][40] In contrast, null (or equivalents like nullptr in C++, None in Python, or null in Java) denotes the complete absence of any value or object reference, indicating that no string instance exists at all.[41][42][43]
A key behavioral difference arises in operations like concatenation: appending an empty string to another string preserves the original content unchanged (e.g., "" + "a" yields "a" in Python or Java), as the empty string acts as an identity element for string concatenation.[40][44] However, attempting to concatenate with null typically results in an error, such as a NullPointerException in Java, or returns null/undefined, preventing unintended data corruption but requiring explicit null checks.[44] In C++, dereferencing a null pointer for string operations leads to undefined behavior, whereas an empty std::string can be safely manipulated.[42]
In database contexts like SQL, the empty string ('') in a VARCHAR column represents a known value of zero length, distinct from NULL, which signifies an unknown or missing value according to ANSI SQL standards.[45] For instance, queries using IS NULL filter NULL values but not empty strings, and aggregate functions like COUNT treat empty strings as valid data points while ignoring NULL.[46] Note that some databases, such as Oracle, treat empty strings as equivalent to NULL for storage and comparison purposes, potentially leading to interoperability issues with ANSI-compliant systems.[46]
In APIs and data interchange formats like JSON, an empty string ("") is a valid string value indicating no content, while null explicitly denotes the absence of a value, as defined in RFC 8259.[47] This distinction allows clients to differentiate between intentionally empty data (e.g., an empty username field) and unprovided or unknown data (e.g., an optional field not set), improving API clarity and reducing parsing errors.[47]
Best practices recommend using the empty string for scenarios where a string field is valid but contains no data, such as an optional address line left blank, as it avoids null-related exceptions and maintains type consistency.[44] Conversely, reserve null for uninitialized variables, unknown values, or explicitly absent optional fields to signal intent clearly; for example, in databases, using NULL for missing phone numbers prevents false positives in searches treating empty strings as matches.[46] Mixing them can cause errors, such as runtime exceptions when invoking methods on null references or incorrect query results when WHERE column = '' inadvertently includes or excludes NULL values.[44] In memory terms, empty strings may allocate minimal space for the object itself, unlike null which requires no allocation.[3]
In Regular Expressions and Pattern Matching
In regular expressions, the empty string, denoted as ε in formal theory, is matched explicitly by the pattern consisting solely of anchors^ and $, which specify the beginning and end of the string with no intervening characters. This pattern succeeds only on an entirely empty input, as the anchors collapse to the same position without any content between them.[48] In Python's re module, for instance, re.match(r'^$', '') returns a successful match object, confirming that the empty string is recognized at the zero-length position.[48] Similarly, in Perl, the pattern /^$/ matches an empty string, anchoring to both the start and end simultaneously.[49]
Quantifiers in regex enable implicit matching of the empty string by permitting zero occurrences of preceding elements. The Kleene star (*) matches zero or more repetitions, so a pattern like a* succeeds on the empty string (zero as), as well as on "a" or "aa".[50] The question mark (?) matches zero or one occurrence, allowing a? to match either the empty string or a single "a".[50] These behaviors hold across major engines; for example, in Python, re.match(r'a*', '') matches successfully, while re.match(r'a?', 'b') matches the empty string before "b" due to the zero option.[48] In Perl, /a*/ applied to an empty input or before non-matching characters yields a zero-length match.[49]
Optional groups further illustrate this utility, where (a)? matches either "a" or the empty string, providing flexibility in pattern design. In practice, Python's re.match(r'(a)?', '') captures the empty string in the group, and Perl's /(a)?/ behaves analogously on empty input.[48][49]
At an advanced level, regex engines often compile patterns into non-deterministic finite automata (NFAs) that incorporate ε-transitions, which enable state changes without consuming input characters, directly supporting empty string matching in composite expressions. This approach stems from Thompson's construction algorithm, where ε-moves connect sub-automata for operations like union and Kleene star, allowing the NFA to "skip" to accepting states via empty paths.[51] Such transitions are essential for efficient implementation, as they model zero-length advancements underlying quantifiers and optional elements without explicit backtracking in every case.