Hubbry Logo
Empty stringEmpty stringMain
Open search
Empty string
Community hub
Empty string
logo
7 pages, 0 posts
0 subscribers
Be the first to start a discussion here.
Be the first to start a discussion here.
Contribute something
Empty string
Empty string
from Wikipedia

In formal language theory, the empty string, also known as the empty word or null string, is the unique string of length zero.

Formal theory

[edit]

Formally, a string is a finite, ordered sequence of characters such as letters, digits or spaces. The empty string is the special case where the sequence has length zero, so there are no symbols in the string. There is only one empty string, because two strings are only different if they have different lengths or a different sequence of symbols. In formal treatments,[1] the empty string is denoted with ε or sometimes Λ or λ.

The empty string should not be confused with the empty language , which is a formal language (i.e. a set of strings) that contains no strings, not even the empty string.

The empty string has several properties:

  • |ε| = 0. Its string length is zero.
  • ε ⋅ s = s ⋅ ε = s. The empty string is the identity element of the concatenation operation. The set of all strings forms a free monoid with respect to ⋅ and ε.
  • εR = ε. Reversal of the empty string produces the empty string, so the empty string is a palindrome.
  • . Statements that are about all characters in a string are vacuously true.
  • The empty string precedes any other string under lexicographical order, because it is the shortest of all strings.[2]

In context-free grammars, a production rule that allows a symbol to produce the empty string is known as an ε-production, and the symbol is said to be "nullable".

Use in programming languages

[edit]

In most programming languages, the term "string" often refers to instances of a data type and thus they're a concept distinct from the one in the formal theory. Such strings are typically stored at distinct memory addresses (locations) and thus have an identity. Thus, representatives of the same formal string (e.g., the empty string) may be stored in two or more places in memory and they can be taken as names of the formal empty string.

In this way, there could be multiple representatives of the empty string in memory, in contrast with the formal theory definition, for which there is only one possible empty string. However, a "string comparison function" would indicate that all of these representatives are equal to each other.

Even a string of length zero can require memory to store it, depending on the format being used. In most programming languages, the empty string is distinct from a null reference (or null pointer) because a null reference points to no string at all, not even the empty string. The empty string is a legitimate string, upon which most string operations should work. Some languages treat some or all of the following in similar ways: empty strings, null references, the integer 0, the floating point number 0, the Boolean value false, the ASCII character NUL, or other such values.

The empty string is usually represented similarly to other strings. In implementations with string terminating character (null-terminated strings or plain text lines), the empty string is indicated by the immediate use of this terminating character.

Different functions, methods, macros, or idioms exist for checking if a string is empty in different languages.[example needed]

λ representation Programming languages
"" C, C#, C++, Go, Haskell, Java, JavaScript, Julia, Lua, M, Objective-C (as a C string), OCaml, Perl, PHP, PowerShell, Python, Ruby, Scala, Standard ML, Swift, Tcl, Visual Basic .NET
'' APL, Delphi, JavaScript, Lua, MATLAB, Pascal, Perl, PHP, PowerShell, Python, R, Ruby, Smalltalk, SQL
character(0) R[3]
{'\0'} C, C++, Objective-C (as a C string)
new String() (from java.lang.String) Java
string() (from std::string) C++
""s C++ (since the 2014 standard)
@"" Objective-C (as a constant NSString object)
[NSString string] Objective-C (as a new NSString object)
q(), qq() Perl
str()[4] """""" r"" u"" Python
%{}
%()
Ruby
String::new() (from std::string::String)[5] Rust
String.Empty (from System.String) C#, Visual Basic .NET
String.make 0 '-' OCaml
{} Tcl
[[]] Lua
“”

‘’ „” ‚‘[7]

PowerShell
.byte 0

.ascii "" .asciz ""

A64

Representations of the empty string

[edit]

The empty string is a syntactically valid representation of zero in positional notation (in any base), which does not contain leading zeros. Since the empty string does not have a standard visual representation outside of formal language theory, the number zero is traditionally represented by a single decimal digit 0 instead.

Zero-filled memory area, interpreted as a null-terminated string, is an empty string.

Empty lines of text show the empty string. This can occur from two consecutive EOLs, as often occur in text files. This is sometimes used in text processing to separate paragraphs, e.g. in MediaWiki.

See also

[edit]

References

[edit]
Revisions and contributorsEdit on WikipediaRead on Wikipedia
from Grokipedia
In formal language theory, the empty string, also known as the empty word, is the unique string of length zero, consisting of no symbols from an alphabet. It serves as the identity element for string concatenation, meaning that appending the empty string to any other string yields the original string unchanged. Denoted typically by the symbol ε (epsilon) or sometimes λ (lambda), the empty string is a fundamental component of the set of all possible strings over an alphabet Σ, known as Σ*, which includes the empty string alongside all non-empty strings. In programming languages, the empty string represents a valid string object with zero characters, distinct from null (which indicates the absence of any object). For example, in , it is created as "" and has a length of 0, allowing methods like and operations to function predictably. Similarly, in C++, the std::string class initializes to an empty string by default, and its empty() member function returns true for strings of length zero. In Python, the empty string '' is a built-in type that evaluates to false in contexts and supports operations like joining with other strings. The empty string plays a critical role in algorithms and data structures, such as regular expressions where it matches positions without consuming input, and in where it enables ε-productions in context-free grammars. It also distinguishes between "no value" (null) and "known empty value," aiding in robust error handling and across systems.

Definition and Notation

Formal Definition

In formal language theory, the is defined as the unique consisting of zero symbols from a given , making it the sole of zero. This distinguishes it from the , which is a collection containing no elements whatsoever, and from the empty language, which is the set of no at all. Regardless of the alphabet's size or composition, there exists precisely one empty string, as it represents the complete absence of any symbols and thus serves as a fundamental unit in the construction of all possible strings over that . This uniqueness underscores its role as the foundational element in within formal languages, where it acts as the identity for operations. The concept of the empty string originated in formal language theory during the mid-20th century, with early formalizations appearing in the work of and contemporaries in the , building on foundational ideas in and . Early works, such as those by Noam Chomsky, denoted it with I (capital i). Syntactically, the empty string can represent zero in positional numeral systems, analogous to an or the base case in unary representation where no symbols denote the value 0.

Common Notations

In theory and , the empty string is most commonly denoted by the Greek letter ε (epsilon). This notation serves as the for string concatenation and is standard in contemporary textbooks on the subject. Another frequently used symbol is λ (lambda), particularly in discussions of finite automata and regular languages, where it represents the string of length zero. The uppercase variant Λ (capital lambda) appears in some lecture notes and older references on s. In certain fields, the empty set symbol ∅ is occasionally misused to denote the empty string, though it properly represents the set containing no elements, distinct from the unique string of zero length. This distinction is critical, as the empty string belongs to languages like Σ* (all possible strings over alphabet Σ), while ∅ denotes the language with no strings. In informal discussions and programming contexts, the empty string is typically written as a pair of empty quotes, "". For instance, in languages like Python and Java, "" literalizes the empty string. The ε notation gained prominence in the mid-20th century alongside the formalization of regular languages, building on foundational work like Kleene's 1956 paper, which used λ to denote the empty word. Over time, ε became the preferred symbol in theoretical contexts for its clarity and avoidance of overlap with other notations like λ (used in ) or ∅. Guidelines for selection recommend ε for rigorous mathematical and theoretical writing to emphasize its algebraic role, while "" suits practical implementations in code where readability for developers is key. In context-free grammars, ε denotes productions that derive the empty string, enabling nullable nonterminals.

Theoretical Properties

Algebraic Properties

In , the empty string, denoted ϵ\epsilon, acts as the under the operation of string concatenation. For any ss over an Σ\Sigma, the concatenation satisfies ϵs=sϵ=s\epsilon \cdot s = s \cdot \epsilon = s. This property holds because appending or prepending nothing to a string leaves it unchanged, mirroring the role of the multiplicative identity 1 in integer arithmetic. The empty string also exhibits palindromic symmetry, reading the same forwards and backwards due to the absence of characters. In recursive definitions of palindromes, ϵ\epsilon serves as a base case, satisfying the condition that a string equals its reverse. This property extends to formal definitions where a palindrome pp satisfies p=pRp = p^R, with ϵR=ϵ\epsilon^R = \epsilon. Predicates involving universal quantification over the characters of the empty string hold via vacuous truth. For instance, the statement "every character in ϵ\epsilon is uppercase" is true because there are no characters to falsify it, analogous to universal statements over an empty domain in predicate logic. This logical principle ensures that any property asserted for all elements of the empty set of characters is satisfied without counterexamples. The empty string remains invariant under common string operations, demonstrating closure. Its reversal yields ϵ\epsilon itself, as reversing an empty produces emptiness. Similarly, repetition or powers of ϵ\epsilon, such as ϵn\epsilon^n for positive integers nn, result in ϵ\epsilon, preserving the empty structure.

Ordering and Relations

In lexicographical ordering over a finite alphabet, the empty string precedes all non-empty strings, as it has length zero and thus comes before any string of positive length when strings are compared position by position until a difference is found or one ends. This property holds in both standard lexicographic order and shortlex order (where shorter strings precede longer ones of equal prefix), ensuring the empty string is the minimal element in the ordered set of all strings. The empty string exhibits specific relational properties with respect to other strings: it is a of every string in a vacuous sense, since any string can be expressed as a involving the empty string without altering its content. Furthermore, the empty string is a prefix and a of every string, because any string ww satisfies w=ϵww = \epsilon \cdot w and w=wϵw = w \cdot \epsilon, where \cdot denotes . In sorted lists of strings under ascending lexicographical order, the empty string always appears first, regardless of the alphabet. For example, over the alphabet {a,b}\{a, b\}, the order begins as ϵ<a<aa<aab<ab<aba<b<ba<baa\epsilon < a < aa < aab < ab < aba < b < ba < baa. In the context of string-based numeral systems, such as positional representations where digits form strings, the empty string corresponds to the number zero, establishing a bijection between finite digit sequences and non-negative integers starting from 0.

Role in Formal Languages

In Language Composition

In formal language theory, a LL over an Σ\Sigma is a of Σ\Sigma^*, the Kleene closure of Σ\Sigma, which includes all finite strings formed from symbols in Σ\Sigma as well as the empty string ε\varepsilon. The empty string may or may not be an element of LL; for instance, if L={ε}L = \{\varepsilon\}, then LL is a non-empty consisting solely of ε\varepsilon. Conversely, the empty \emptyset contains no strings, including ε\varepsilon. The empty string plays a fundamental role in language composition operations. In concatenation, the singleton language {ε}\{\varepsilon\} serves as the identity element, satisfying {ε}L=L{ε}=L\{\varepsilon\} \cdot L = L \cdot \{\varepsilon\} = L for any language LL, since concatenating ε\varepsilon with any string leaves it unchanged. For union, L=L\emptyset \cup L = L, establishing the empty language as the identity, while {ε}L=L{ε}\{\varepsilon\} \cup L = L \cup \{\varepsilon\}, which equals LL if and only if εL\varepsilon \in L. A key operation involving the empty string is the LL^*, defined as the set of all finite concatenations of zero or more strings from LL. This always includes ε\varepsilon as the result of the empty (zero-length) concatenation, ensuring εL\varepsilon \in L^* regardless of whether εL\varepsilon \in L. Thus, even if L=L = \emptyset, L={ε}L^* = \{\varepsilon\}. As an example in regular languages, consider the language of all even-length strings over the alphabet {a,b}\{a, b\}, which is regular and includes ε\varepsilon since its length is 0, an even number. This language can be expressed by the (aa+ab+ba+bb)(aa + ab + ba + bb)^*, where the star operation incorporates ε\varepsilon as the base case.

In Grammars and Productions

In generative grammars, particularly context-free grammars (CFGs) within the , the empty string, denoted ε, is central to ε-productions, which take the form A → ε where A is a non-terminal , allowing that non-terminal to generate . These productions enable the derivation of ε from non-terminals, distinguishing them from terminals, which cannot be nullable. A non-terminal is termed nullable if it can derive ε, either directly via an ε-production or indirectly through a of such derivations; identifying these requires computing the of dependencies among productions where ε can be reached. The impact of ε-productions on a grammar's structure is significant, as they introduce in derivations and complicate normalization; for example, converting a CFG to necessitates eliminating ε-productions to ensure all rules are of the form A → BC or A → a, while preserving the generated language. Algorithms for elimination typically proceed in steps: first, mark all nullable non-terminals by iteratively finding those with productions consisting solely of other nullables or ε; then, for each production A → X₁X₂⋯Xₙ, generate new rules by replacing subsets of nullable Xᵢ with ε, avoiding the original ε-production itself. If the start symbol is nullable and the language includes ε, a special provision may retain S → ε, but otherwise, the process ensures no ε-productions remain. In parsing applications, ε-derivations manifest as optional nodes in syntax trees, permitting structures where phrases or modifiers can be omitted without altering the overall validity of the derivation. For instance, in a for simple expressions, a rule like Expr → ε | Term allows expressions to optionally include terms, reflecting syntactic optionality in the . The role of ε-productions traces to Noam Chomsky's foundational 1956 paper, where they were integral to defining type-2 grammars in the hierarchy, facilitating the modeling of natural languages with optional elements like adjectives or prepositional phrases.

Representations in Computing

In Programming Languages

In programming languages, the empty string is typically represented using specific literals that denote a string of zero length. In C, the empty string literal "" is a null-terminated byte string consisting solely of the terminating null character \0, forming a character array of size 1. In Java, an object-oriented language, the empty string can be represented as the literal "" or via the constant String.EMPTY, which points to a shared immutable instance of the String class with no characters. Python supports both double and single quotes for this purpose, where "" or '' initializes a string object of length 0. In R, a statistical programming language, "" denotes a character vector of length 1 containing an empty string, distinct from character(0), which creates an empty character vector of length 0. Initialization of empty strings varies by language but often leverages these literals for simplicity. For instance, , a pointer to an empty string can be initialized as char *s = "";, which assigns the address of the static empty literal. In , initialization occurs via String s = ""; or String s = new String();, both creating an instance representing an empty character sequence. Python allows direct assignment like s = "", and its built-in str() constructor also yields an empty string, aligning with the language's dynamic typing where uninitialized strings are not predefined but can be set to empty literals. initializes via s <- "" for a single empty string or character(0) for an empty vector, with default string handling in data structures often starting as empty vectors. Variations in empty string handling arise with support, where the empty string contains zero Unicode code points and does not include U+0000 (the ), which is instead used in some encodings to terminate strings but is absent from the empty string itself. In object-oriented languages like and Python, empty strings are treated as instances of string classes, while in procedural languages like , they are arrays or pointers without object overhead. Even with zero length, an empty string requires allocation of a minimal structure—such as an object header in object-oriented languages or a small in procedural ones—unlike null, which merely indicates the absence of any allocated reference. This distinction ensures the empty string can be manipulated as a valid , separate from uninitialized or absent values.

Storage and Aspects

In computing, the empty string is typically represented internally as a minimal to optimize memory and processing efficiency. In , strings are null-terminated arrays of characters, where the empty string is implemented as a pointer to a single \0, requiring only 1 byte for the terminator itself plus the size of the pointer (usually 8 bytes on 64-bit systems). This representation allows functions like strlen to immediately recognize the empty string by encountering the null terminator without scanning further. In languages like , the empty string benefits from interning in the string pool, where all instances of the empty string literal "" reference a single shared object, acting as a singleton to prevent redundant allocations. This incurs minimal object overhead but is reused across the application, minimizing memory footprint for frequent empty string usage. Similar optimizations appear in other managed languages, such as C#'s string.Empty, which also points to a pre-allocated empty instance to avoid heap allocations. Memory usage for the empty string is inherently minimal across implementations, often limited to a flag, pointer, or small fixed object, contrasting with non-empty strings that scale with content length. For instance, in low-level representations like C, the total footprint remains under 10 bytes, while higher-level optimizations in Java ensure no additional instances are created beyond the initial singleton. These designs prioritize efficiency in scenarios involving temporary or placeholder strings, such as default values in data structures. Common operations on the empty string emphasize its neutral role in string manipulation. A length check, such as s.length() == 0 in Java or strlen(s) == 0 in C, is a constant-time O(1) verification that confirms emptiness without iterating contents. Concatenation with the empty string acts as an identity operation, where appending it to any string s yields s unchanged, as in s + "" or s + ε, avoiding unnecessary copying or allocation in optimized implementations. Trimming operations, like removing whitespace, on an already empty string simply return the same empty instance, further enhancing efficiency. Practical examples highlight the empty string's role in input handling and performance-critical code. In file I/O, detecting empty inputs is essential for validating reads; for instance, after using fgets to read a line, checking if the resulting string's length is zero (after stripping ) prevents processing invalid or blank entries from files. Similarly, in Java's BufferedReader.readLine(), an empty string indicates a blank line, allowing robust handling in data parsing. In string construction scenarios, tools like Java's StringBuilder or C#'s StringBuilder leverage the empty string to initialize buffers without initial content, avoiding reallocations during growth. Initializing with an empty string sets a capacity (default 16 characters) that accommodates subsequent appends, reducing the frequency of internal array resizes and improving throughput in loops building dynamic strings from variable . This approach is particularly effective in high-volume operations, such as or report generation, where starting from empty minimizes overhead compared to repeated concatenations on immutable strings.

Distinctions and Applications

Versus Null and Empty Values

In computing, the empty string is defined as a valid string object with a length of zero, such as "" in most programming languages, representing a deliberate absence of characters while still being an instance of the string type. In contrast, null (or equivalents like nullptr in C++, None in Python, or null in Java) denotes the complete absence of any value or object reference, indicating that no string instance exists at all. A key behavioral difference arises in operations like : appending an empty string to another string preserves the original content unchanged (e.g., "" + "a" yields "a" in Python or ), as the empty string acts as an for string concatenation. However, attempting to concatenate with null typically results in an error, such as a NullPointerException in Java, or returns null/undefined, preventing unintended but requiring explicit null checks. In C++, dereferencing a for string operations leads to , whereas an empty std::string can be safely manipulated. In database contexts like SQL, the empty string ('') in a VARCHAR column represents a known value of zero length, distinct from NULL, which signifies an unknown or missing value according to ANSI SQL standards. For instance, queries using IS NULL filter NULL values but not empty strings, and aggregate functions like COUNT treat empty strings as valid data points while ignoring NULL. Note that some databases, such as Oracle, treat empty strings as equivalent to NULL for storage and comparison purposes, potentially leading to interoperability issues with ANSI-compliant systems. In APIs and data interchange formats like , an empty ("") is a valid value indicating no content, while null explicitly denotes the absence of a value, as defined in RFC 8259. This distinction allows clients to differentiate between intentionally empty data (e.g., an empty username field) and unprovided or unknown data (e.g., an optional field not set), improving API clarity and reducing parsing errors. Best practices recommend using the empty string for scenarios where a string field is valid but contains no data, such as an optional address line left blank, as it avoids null-related exceptions and maintains type consistency. Conversely, reserve null for uninitialized variables, unknown values, or explicitly absent optional fields to signal intent clearly; for example, in databases, using NULL for missing phone numbers prevents false positives in searches treating empty strings as matches. Mixing them can cause errors, such as runtime exceptions when invoking methods on null references or incorrect query results when WHERE column = '' inadvertently includes or excludes NULL values. In memory terms, empty strings may allocate minimal space for the object itself, unlike null which requires no allocation.

In Regular Expressions and Pattern Matching

In regular expressions, the empty string, denoted as ε in formal theory, is matched explicitly by the pattern consisting solely of anchors ^ and $, which specify the beginning and end of the string with no intervening characters. This pattern succeeds only on an entirely empty input, as the anchors collapse to the same position without any content between them. In Python's re module, for instance, re.match(r'^$', '') returns a successful object, confirming that the empty string is recognized at the zero-length position. Similarly, in , the pattern /^$/ matches an empty string, anchoring to both the start and end simultaneously. Quantifiers in regex enable implicit matching of the empty string by permitting zero occurrences of preceding elements. The (*) matches zero or more repetitions, so a pattern like a* succeeds on the empty string (zero as), as well as on "a" or "aa". The (?) matches zero or one occurrence, allowing a? to match either the empty string or a single "a". These behaviors hold across major engines; for example, in Python, re.match(r'a*', '') matches successfully, while re.match(r'a?', 'b') matches the empty string before "b" due to the zero option. In , /a*/ applied to an empty input or before non-matching characters yields a zero-length match. Optional groups further illustrate this utility, where (a)? matches either "a" or the empty string, providing flexibility in . In practice, Python's re.match(r'(a)?', '') captures the empty string in the group, and Perl's /(a)?/ behaves analogously on empty input. At an advanced level, regex engines often compile patterns into non-deterministic finite automata (NFAs) that incorporate ε-transitions, which enable state changes without consuming input characters, directly supporting empty string matching in composite expressions. This approach stems from algorithm, where ε-moves connect sub-automata for operations like union and , allowing the NFA to "skip" to accepting states via empty paths. Such transitions are essential for efficient implementation, as they model zero-length advancements underlying quantifiers and optional elements without explicit in every case.

References

Add your contribution
Related Hubs
Contribute something
User Avatar
No comments yet.