Alphabet (formal languages)
from Wikipedia

In formal language theory, an alphabet, often called a vocabulary in the context of terminal and nonterminal symbols, is a non-empty set of indivisible symbols/characters/glyphs,[1] typically thought of as representing letters, characters, digits, phonemes, or even words.[2][3] The definition is used in a diverse range of fields including logic, mathematics, computer science, and linguistics. An alphabet may have any cardinality ("size") and, depending on its purpose, may be finite (e.g., the alphabet of letters "a" through "z"), countable (e.g., {v₁, v₂, …}), or even uncountable (e.g., {vₓ : x ∈ ℝ}).

Strings, also known as "words" or "sentences", over an alphabet are defined as finite sequences of symbols from the alphabet.[4] For example, the alphabet of lowercase letters "a" through "z" can be used to form English words like "iceberg", while the alphabet of both upper- and lower-case letters can also be used to form proper names like "Wikipedia". A common alphabet is {0,1}, the binary alphabet, and "00101111" is an example of a binary string. Infinite sequences of symbols may be considered as well (see Omega language).

Strings are often written as the concatenation of their symbols, and when using this notational convention it is convenient for practical purposes to restrict the symbols in an alphabet so that this notation is unambiguous. For instance, if the two-member alphabet is {00,0}, a string written in concatenated form as "000" is ambiguous because it is unclear if it is a sequence of three "0" symbols, a "00" followed by a "0", or a "0" followed by a "00". However, this is a limitation on the notation for writing strings, not on their underlying definitions. Like any finite set, {00,0} can be used as an alphabet, whose strings can be written unambiguously in a different notational convention with commas separating their elements: 0,00 ≠ 0,0,0 ≠ 00,0.
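
The ambiguity can be made concrete with a short enumeration. The following sketch (illustrative only, not part of the original article) counts the ways the written form "000" can be segmented into symbols of the alphabet {00, 0}:

```python
# Enumerate every way a written string can be split into symbols of the
# two-member alphabet {"00", "0"} -- three readings exist for "000".
def segmentations(text, alphabet=("00", "0")):
    if text == "":
        return [[]]
    results = []
    for symbol in alphabet:
        if text.startswith(symbol):
            for rest in segmentations(text[len(symbol):], alphabet):
                results.append([symbol] + rest)
    return results

print(segmentations("000"))
# [['00', '0'], ['0', '00'], ['0', '0', '0']]
```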

Notation

By definition, the alphabet of a formal language L is a set Σ, which can be any non-empty set of symbols from which every string in L is built. For example, the set of letters, decimal digits, and the underscore character can serve as the alphabet of the formal language consisting of all valid variable identifiers in the C programming language. A language is not required to use every symbol of its alphabet in its strings.

Given an alphabet Σ, the set of all strings of length n over the alphabet is indicated by Σ^n. The set of all finite strings (regardless of their length) is indicated by the Kleene star operator as Σ*, and is also called the Kleene closure of Σ. The notation Σ^ω indicates the set of all infinite sequences over the alphabet Σ, and Σ^∞ = Σ* ∪ Σ^ω indicates the set of all finite or infinite sequences.

For example, using the binary alphabet {0,1}, the strings ε, 0, 1, 00, 01, 10, 11, 000, etc. are all in the Kleene closure of the alphabet (where ε represents the empty string).
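
As an illustrative sketch (not part of the original article; it assumes Python's itertools for enumeration in length order), the shortest elements of the Kleene closure of {0,1} can be listed directly:

```python
from itertools import product

# List all strings over {0,1} of length 0 through 3, i.e. the shortest
# elements of the Kleene closure {0,1}*; the empty product gives ε ("").
sigma = ["0", "1"]
kleene_prefix = ["".join(p) for n in range(4) for p in product(sigma, repeat=n)]
print(kleene_prefix)
# ['', '0', '1', '00', '01', '10', '11', '000', '001', ..., '111']
```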

Applications

Alphabets are important in the use of formal languages, automata and semiautomata. In most cases, for defining instances of automata, such as deterministic finite automata (DFAs), it is required to specify an alphabet from which the input strings for the automaton are built. In these applications, an alphabet is usually required to be a finite set, but is not otherwise restricted.
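
As a minimal sketch of how an alphabet enters a DFA's definition (the state names and language chosen here are illustrative assumptions, not from the article), consider a two-state DFA over Σ = {0, 1} accepting the strings with an even number of 1s:

```python
# A DFA is a 5-tuple (Q, Σ, δ, q0, F); here Σ = {'0', '1'} and the machine
# accepts exactly the binary strings containing an even number of 1s.
SIGMA = {"0", "1"}
DELTA = {                                # transition function δ: Q × Σ → Q
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}
START, ACCEPT = "even", {"even"}

def accepts(w):
    assert set(w) <= SIGMA, "input strings must be built from the alphabet"
    state = START
    for symbol in w:
        state = DELTA[(state, symbol)]
    return state in ACCEPT

print(accepts("00101111"))   # False: five 1s
print(accepts("0110"))       # True: two 1s
```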

When using automata, regular expressions, or formal grammars as part of string-processing algorithms, the alphabet may be assumed to be the character set of the text to be processed by these algorithms, or a subset of allowable characters from the character set.

from Grokipedia
In formal language theory, an alphabet is a finite set of symbols, often denoted by the Greek letter Σ (sigma), from which strings—finite sequences of these symbols—are constructed to define languages. These symbols are atomic abstractions, meaningless in isolation but essential as building blocks for more complex structures like words and sentences in computational models. The set of all possible finite strings over an alphabet Σ, including the empty string (denoted ε), is called the Kleene closure Σ* and forms the universal set from which any language L ⊆ Σ* is drawn. Alphabets are typically non-empty and finite to ensure computability and alignment with practical applications in computer science, such as the binary alphabet {0, 1} used in digital systems or larger sets like {a, b, c} used to model more complex languages. Common examples include the unary alphabet Σ = {a} for simple counting problems or the ASCII character set for text manipulation, highlighting their versatility across theoretical and applied contexts.

Alphabets underpin key concepts in automata theory and computability, enabling the specification of regular, context-free, and other language classes via grammars and machines like finite automata. Operations on languages over an alphabet, such as concatenation and union, facilitate the generation of complex languages, while their finite nature ensures that recognition problems remain decidable for certain classes. This foundational role extends to fields like compiler design, where alphabets define token sets, and information theory, where symbol sets model encoding schemes.

Definition and Fundamentals

Formal Definition

In formal language theory, an alphabet Σ is defined as a finite, non-empty set whose elements, called symbols, are indivisible atomic units that carry no inherent meaning beyond their utility in forming strings. This definition presupposes a basic understanding of set theory, wherein symbols serve as the primitive elements of Σ, akin to the letters of a written script but abstracted for mathematical purposes. The symbols themselves are typically abstract and can include any distinct entities, such as characters, digits, or other markers, provided they remain distinguishable within the set.

The concept of an alphabet emerged in the 1950s as part of the foundational work in formal language theory, notably through Chomsky's efforts to model linguistic structures mathematically, drawing an analogy to the scripts of natural languages. In Chomsky's 1956 paper, alphabets are employed to specify the terminal symbols from which grammars generate sentences, establishing them as a core primitive in syntactic analysis. This terminology and framework were further refined in subsequent developments, solidifying the alphabet's role in distinguishing formal models from earlier linguistic descriptions. Standard theory excludes the empty set as a valid alphabet, since it would yield only the empty string and preclude the formation of non-trivial languages or structures. While infinite alphabets occasionally appear in advanced or non-classical extensions of formal language theory, such as in studies of data languages or nominal automata, they deviate from the finite constraint essential to core results like the Chomsky hierarchy.

Basic Components

In formal language theory, the fundamental building blocks of an alphabet are its symbols, which are indivisible, abstract entities treated without any inherent semantic interpretation. These symbols, such as letters (e.g., a, b), digits (e.g., 0, 1), or special characters (e.g., #), serve solely as generators for constructing strings, functioning as atomic units in the absence of assigned meaning within the pure mathematical framework. A key aspect of these symbols in the context of formal grammars is their role as terminals, distinguishing them from non-terminals, which are variables used in production rules to derive strings. While non-terminals (often denoted by uppercase letters like A or S) represent syntactic categories and do not appear in the final output strings, alphabet symbols as terminals are the concrete, leaf-level elements that directly compose the language's strings, emphasizing their purely atomic and irreplaceable role in string formation.

Alphabets are required to be finite sets, a constraint that supports the decidability of recognition and processing by automata, including Turing machines, which operate with finite control and finite tape alphabets to simulate any computation over such inputs. This finiteness limits the alphabet's cardinality to a manageable size, preventing the difficulties that arise with infinite symbol sets in standard models of computation. Finally, the symbols within an alphabet must be pairwise distinct, as dictated by the set-theoretic foundation of the definition, ensuring no duplicates and thus maintaining unambiguous identification during string construction and analysis. The cardinality of such an alphabet, often denoted |Σ|, quantifies this finite distinctness and is discussed further under notation.
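
A small sketch can make the terminal/non-terminal distinction concrete. The toy grammar below (an assumed example, not taken from the source) has the alphabet {a, b} as its terminals and a single nonterminal S with rules S → aSb | ab, so every finished derivation is a string over the alphabet only:

```python
import random

# Toy context-free grammar: terminals are the alphabet {a, b}; S is a nonterminal.
ALPHABET = {"a", "b"}
RULES = {"S": [["a", "S", "b"], ["a", "b"]]}   # S -> aSb | ab

def derive(symbol="S", depth=0, max_depth=5):
    """Expand nonterminals until only alphabet symbols (terminals) remain."""
    if symbol in ALPHABET:
        return [symbol]
    options = RULES[symbol] if depth < max_depth else [RULES[symbol][-1]]
    production = random.choice(options)
    out = []
    for s in production:
        out.extend(derive(s, depth + 1, max_depth))
    return out

word = "".join(derive())
print(word)                    # e.g. 'aaabbb'
assert set(word) <= ALPHABET   # finished strings contain terminals only
```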

Notation and Conventions

Symbolic Representation

In formal language theory, the alphabet is conventionally denoted by the uppercase Greek letter Σ, which represents the finite set of distinct symbols from which strings are constructed. For a concrete alphabet, this is typically expressed as Σ = {a, b, c, ...}, where the elements are the individual symbols, such as letters, digits, or other abstract tokens. To indicate the size of the alphabet, the subscript notation Σ_k is commonly used, where k specifies the number of symbols. A prominent example is the binary alphabet, denoted Σ_2 = {0, 1}, which serves as the foundational set for many theoretical constructions in formal language and automata theory. Although Σ is the predominant notation in standard literature on formal languages and automata, alternative capital Greek letters such as Γ may occasionally appear, particularly for auxiliary alphabets like tape symbols in Turing machines or stack symbols in pushdown automata. Boldface or script variants are rare and typically confined to specialized contexts, but the plain uppercase Σ has been the established convention in major textbooks since its widespread adoption in the field. In mathematical proofs and formal specifications, conventions distinguish the alphabet set from its individual elements: the set Σ is often rendered in plain (roman) font, while symbols like a or b are italicized to denote variables or literals. This typographic practice helps maintain clarity, especially in differentiating the summation operator ∑ from the alphabet symbol Σ, where contextual usage—such as surrounding set braces or discussions of strings—resolves any potential ambiguity.

Size and Cardinality Notation

The cardinality of an alphabet Σ in formal language theory, denoted |Σ|, represents the number of distinct symbols it contains and is a positive integer, reflecting the standard assumption that alphabets are finite and non-empty sets. For example, a binary alphabet such as {0, 1} has |Σ| = 2. A larger |Σ| exponentially increases the number of possible strings of length n over Σ, with exactly |Σ|^n such strings, thereby influencing the size and overall complexity of formal languages constructed from those strings. The formalization of finite alphabets, ensuring a bounded |Σ| as a positive integer, originated in Noam Chomsky's 1956 paper, which defined the terminal vocabulary as a finite set to limit generative power and computational demands in the hierarchy of formal grammars.
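
A quick check (an illustrative sketch, not from the source) shows how the alphabet size governs the count |Σ|^n of length-n strings:

```python
# Number of strings of length n = 0..4 over alphabets of various sizes |Σ|.
for size in (1, 2, 10, 256):   # unary, binary, decimal, byte-valued alphabets
    print(size, [size ** n for n in range(5)])
# 1   [1, 1, 1, 1, 1]
# 2   [1, 2, 4, 8, 16]
# 10  [1, 10, 100, 1000, 10000]
# 256 [1, 256, 65536, 16777216, 4294967296]
```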

Properties and Operations

Structural Properties

In formal language theory, an alphabet Σ is fundamentally a finite, non-empty set of indivisible symbols, ensuring that it serves as a basic building block for constructing meaningful structures like strings and languages. This non-emptiness is essential because an empty alphabet would result in Σ* = {ε}, limiting the generated languages to the trivial case of either the empty language or the singleton containing only the empty string, thereby preventing the formation of non-trivial languages with positive-length strings. The requirement for at least one symbol allows alphabets to support the infinite variety of finite sequences needed for expressive power in computational models.

No single universal alphabet exists across all contexts in formal language theory; instead, each alphabet is tailored to specific applications, such as the binary alphabet {0, 1} for digital computing or larger sets for modeling natural languages. However, for any given alphabet Σ, the free monoid Σ* generated by Σ under concatenation provides a universal domain encompassing all possible finite strings over that alphabet, serving as the foundational carrier for languages as its subsets. This universality is intrinsic to the algebraic properties of free monoids, where the elements of Σ act as free generators without imposed relations beyond associativity and the empty string as identity.

The symbols within an alphabet are discrete and atomic, treated as indivisible units that enable precise enumeration and manipulation in theoretical constructions, in contrast to the continuous domains prevalent in analysis or real-valued functions. This discreteness facilitates the combinatorial explosion of strings in Σ*, allowing formal languages to model countable sets effectively. Additionally, alphabets are inherently unordered sets, meaning the arrangement of symbols holds no intrinsic significance unless explicitly defined for purposes like encoding or lexicographic ordering in specific algorithms.

Algebraic Operations

Alphabets in formal language theory are treated as finite sets of symbols, and thus admit the standard set-theoretic operations, which allow for the construction of new alphabets from existing ones while preserving key structural properties such as finiteness. The union of two alphabets Σ and Γ, denoted Σ ∪ Γ, is the set containing all distinct symbols from both, resulting in a larger alphabet that includes every symbol present in either original set. This operation preserves finiteness when both Σ and Γ are finite, as the cardinality of the union is at most the sum of their individual cardinalities. The intersection Σ ∩ Γ consists of the symbols common to both alphabets, yielding a subset that may be empty if no symbols overlap, thereby potentially reducing the expressive power of the resulting structure. The Cartesian product Σ × Γ forms a new alphabet comprising all ordered pairs (a, b) where a ∈ Σ and b ∈ Γ, which is particularly useful in relational structures such as product automata, where paired symbols represent combined states or transitions. If both alphabets are finite, the resulting product alphabet has cardinality equal to the product of their individual cardinalities.

While algebraic operations like the Kleene star apply to the free monoid generated by an alphabet rather than to the alphabet itself, the notation Σ* denotes the set of all finite strings over Σ, formally defined as Σ* = ⋃_{n≥0} Σ^n, where the case n = 0 contributes only the empty string ε. This operation generates the universal language over Σ and underpins string formation in formal systems. The power notation Σ^n specifically refers to the set of all strings of exact length n formed by concatenating n symbols from Σ, serving as a building block for higher operations like the Kleene star and enabling precise control over string lengths in theoretical constructions.
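
These operations are straightforward to mirror with ordinary finite sets; the sketch below (an assumed illustration using Python sets, not part of the source) builds the union, intersection, Cartesian product, Σ^n, and a finite slice of Σ*:

```python
from itertools import product

sigma = {"a", "b"}
gamma = {"b", "c"}

union = sigma | gamma                    # {'a', 'b', 'c'}   -- still finite
intersection = sigma & gamma             # {'b'}
cartesian = set(product(sigma, gamma))   # ordered pairs (a, b) with a in Σ, b in Γ

def sigma_n(alphabet, n):
    """Σ^n: all strings of exact length n over the alphabet."""
    return {"".join(p) for p in product(sorted(alphabet), repeat=n)}

def kleene_prefix(alphabet, max_len):
    """A finite slice of Σ* (Σ* itself is infinite): strings of length <= max_len."""
    out = set()
    for n in range(max_len + 1):
        out |= sigma_n(alphabet, n)
    return out

print(len(cartesian))                  # |Σ| * |Γ| = 4
print(sorted(kleene_prefix(sigma, 2))) # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```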

Relation to Languages and Strings

Generation of Strings

In formal language theory, a string, also known as a word, over an alphabet Σ is defined as a finite sequence of symbols drawn from Σ. For example, if Σ = {a, b}, then the string w = aba is a sequence of three symbols from Σ. Strings over Σ can be combined using the operation of concatenation, often denoted by juxtaposition or the symbol ·, which forms a new string by appending one to the other. For strings w and v over Σ, the concatenation w · v (or simply wv) is the sequence obtained by following w with v; this operation is associative, meaning (w · v) · u = w · (v · u) for any strings w, v, u, but not commutative in general, as ab ≠ ba when a ≠ b. The length of a string w, denoted |w|, is the number of symbols it contains. The empty string, denoted ε, is the unique string of length zero, with |ε| = 0, serving as the identity element for concatenation since w · ε = ε · w = w for any string w.

The set of all possible strings over Σ, including ε, forms the free monoid generated by Σ under concatenation and is denoted Σ*. This structure is called a free monoid because every element corresponds uniquely to a sequence of generators from Σ without additional relations beyond associativity, with ε as the identity. The number of distinct strings of exact length n over Σ is |Σ|^n, reflecting the combinatorial choices for each position in the sequence. For instance, with |Σ| = 2 and n = 3, there are 2³ = 8 such strings.
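
The monoid laws above can be checked directly by modelling strings over Σ = {a, b} as ordinary character strings, with + as concatenation and '' as ε (a minimal sketch under those assumptions, not from the source):

```python
# Strings over Σ = {'a', 'b'}; concatenation is +, the empty string '' plays ε.
w, v, u = "ab", "a", "ba"

assert (w + v) + u == w + (v + u)      # concatenation is associative
assert w + "" == "" + w == w           # ε is the identity element
assert w + v != v + w                  # but not commutative in general
assert len(w + v) == len(w) + len(v)   # |wv| = |w| + |v|
```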

Languages as Subsets

In formal language theory, a language L over an alphabet Σ is defined as any subset L ⊆ Σ*, where Σ* is the Kleene closure, denoting the set of all finite strings (including the empty string ε) formed from symbols in Σ. This definition encompasses a wide range of languages, from finite sets of strings to infinite collections, including classes like regular languages, context-free languages, and beyond, all grounded in the strings derivable from Σ.

The Chomsky hierarchy organizes formal languages into four levels based on the generative grammars that produce them, with the alphabet Σ underpinning every level as the source of terminal symbols. Type 3 (regular) languages are the simplest, generated by regular grammars and recognized by finite automata; Type 2 (context-free) languages arise from context-free grammars, as in programming language syntax; Type 1 (context-sensitive) languages require context-sensitive grammars; and Type 0 (recursively enumerable) languages are the most general, produced by unrestricted grammars. This hierarchy, introduced by Noam Chomsky, highlights how the fixed alphabet Σ both constrains and enables expressive power across all types.

The choice of alphabet Σ fundamentally determines the possible languages L, as altering Σ changes the universe Σ* from which subsets are drawn. For example, over the unary alphabet {a}, languages consist solely of strings like ε, a, aa, aaa, …, limiting expressiveness to powers of a single symbol, whereas over the binary alphabet {0, 1}, languages can encode arbitrary binary sequences, enabling representations of numbers, computations, or more intricate structures. This dependence illustrates how the cardinality and nature of Σ shape the complexity and utility of the associated languages.

Recursive languages represent the class of decidable languages over Σ, where decidability means there exists a Turing machine M that, for every input string w ∈ Σ*, halts in a finite number of steps and accepts w if w ∈ L or rejects it otherwise. The alphabet Σ serves as the input tape symbols for M, allowing the machine to process and decide membership precisely within Σ*; this halting behavior ensures effective decidability, distinguishing recursive languages from the broader recursively enumerable class, where machines may not halt on non-members. Alphabets thus enable the formalization of decidability by providing the symbolic basis for inputs and operations.
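
As a sketch of decidability over a fixed alphabet (the particular language is an assumed example, not from the source), the procedure below halts on every string in Σ* and so decides the context-free language {aⁿbⁿ : n ≥ 0} over Σ = {a, b}:

```python
# A decider for L = { a^n b^n : n >= 0 } over Σ = {'a', 'b'}: it halts on every
# input in Σ*, so L is a recursive (decidable) subset of Σ*.
SIGMA = {"a", "b"}

def decide(w):
    assert set(w) <= SIGMA, "input must be a string over Σ"
    n = len(w) // 2
    return len(w) % 2 == 0 and w == "a" * n + "b" * n

print(decide("aabb"))   # True
print(decide("abab"))   # False
print(decide(""))       # True: ε = a^0 b^0
```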

Examples and Illustrations

Simple Finite Alphabets

In formal language theory, the binary alphabet is defined as the set Σ = {0, 1}, consisting of two distinct symbols that serve as the basic building blocks for generating binary strings through concatenation. These strings include the empty string ε, the single symbols 0 and 1, and longer sequences such as 01, 110, and 1011, forming the set Σ* of all possible finite combinations. The binary alphabet is particularly pervasive in computing due to its direct correspondence with binary representations in digital systems, where each symbol aligns with the on/off states of bits in hardware.

The unary alphabet represents the simplest non-trivial finite case, defined as Σ = {a} with a single symbol. Over this alphabet, the generated strings are ε, a, aa, aaa, and so forth, corresponding to Σ* = {aⁿ | n ≥ 0}, where n denotes the length of the string. This structure illustrates fundamental concepts like string length and repetition without the complexity of multiple symbols, and it is often used to explore properties of regular languages in their most basic form.

A ternary alphabet extends this to three symbols, such as Σ = {0, 1, 2}, which allows for strings like ε, 0, 12, and 210. Balanced ternary systems provide an example of a three-symbol numeral representation, using an alphabet such as {T, 0, 1} (where T denotes −1) to encode the digit values −1, 0, and +1, enabling unique and compact integer representations without a separate sign symbol.

For encoding natural numbers, the decimal alphabet Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} provides a familiar set of ten symbols, used to form strings that represent numerical values in base 10, such as 0, 42, and 314159. This digit-based alphabet facilitates the string representation of integers, where the position of each symbol determines its contribution to the overall value via powers of 10.
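
The balanced-ternary alphabet mentioned above lends itself to a short encoding sketch (an assumed illustration; the digit-to-symbol mapping T/0/1 follows the description in this section):

```python
# Encode integers as strings over the balanced-ternary alphabet {'T', '0', '1'},
# where 'T' stands for the digit -1.
def to_balanced_ternary(n):
    if n == 0:
        return "0"
    digits = []
    while n != 0:
        r = n % 3
        if r == 0:
            digits.append("0")
        elif r == 1:
            digits.append("1")
            n -= 1
        else:                 # remainder 2 is written as digit -1 with a carry
            digits.append("T")
            n += 1
        n //= 3
    return "".join(reversed(digits))

print(to_balanced_ternary(5))    # '1TT'  = 9 - 3 - 1
print(to_balanced_ternary(-2))   # 'T1'   = -3 + 1
```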

Extended Alphabets

In formal language theory, large finite alphabets extend well beyond small sets of symbols, such as the American Standard Code for Information Interchange (ASCII), which defines 128 distinct symbols including letters, digits, punctuation, and control characters. These expansive alphabets dramatically increase the variety of possible strings, enabling representation of complex data like program source text or natural-language text, but they also complicate processing by requiring automata or grammars to handle a vastly larger transition space. For instance, in non-deterministic finite automata (NFAs), minimization becomes computationally intensive as the alphabet size grows, with costs scaling with the number of symbols, often necessitating specialized techniques like symbolic representations to manage efficiency.

Alphabets with imposed structure, particularly total orders on symbols, facilitate operations like sorting and comparison in formal string processing. An ordered alphabet, such as Σ = {a, b, c} with a < b < c, induces a lexicographic (dictionary) order on strings, where shorter strings precede longer ones sharing the same prefix, and otherwise strings are compared position by position using the symbol order until a difference arises. This structure is essential for defining canonical representations of languages, such as in learning algorithms for regular languages over large ordered alphabets, where the order ensures consistent enumeration and querying of strings. In practice, such ordered alphabets underpin dictionary-like comparisons in text processing, extending naturally to larger symbol sets without altering the foundational properties of the languages generated.

Non-standard alphabets deviate from the conventional finite sets of atomic symbols, occasionally incorporating representations like the empty string ε as if it were a symbol, though this is rare since ε denotes a string of length zero rather than a symbol eligible for concatenation. Including ε as a symbol leads to notational contradictions in standard theory, as it blurs the distinction between symbols and strings, limiting its utility in automata. More commonly in applied contexts, multi-character sequences are treated as composite symbols—effectively enlarging the alphabet by abstracting tokens like keywords in compiler design—while preserving closure under formal operations through symbolic extensions.

Despite their expressive power, very large alphabets impose significant limitations, particularly in automata construction, where the size |Σ| contributes to state explosion by inflating the number of possible transitions, often growing as |Q| × |Σ| in deterministic models. Classical finite-state automata struggle with scalability for alphabets exceeding practical bounds, as the growth in transition density hinders minimization and simulation; techniques like symbolic automata address this by encoding transitions implicitly rather than enumerating them. In computing practice, alphabet sizes are theoretically unbounded but often constrained to around 256 symbols due to 8-bit byte encoding standards, which align with hardware representations such as the byte and support memory-efficient implementations.
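
The lexicographic orders discussed above can be sketched directly over the ordered alphabet a < b < c (an assumed example, not from the source), contrasting pure dictionary order with shortlex order:

```python
# Orders induced on strings by the symbol order a < b < c.
ORDER = {"a": 0, "b": 1, "c": 2}

def lex_key(word):
    """Key for pure dictionary order: compare position by position."""
    return [ORDER[ch] for ch in word]

words = ["ba", "ab", "b", "aab", "c", ""]
print(sorted(words, key=lex_key))
# ['', 'aab', 'ab', 'b', 'ba', 'c']            -- dictionary order
print(sorted(words, key=lambda w: (len(w), lex_key(w))))
# ['', 'b', 'c', 'ab', 'ba', 'aab']            -- shortlex: by length, then lex
```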

Applications and Extensions

In Automata and Computability

In automata theory, the alphabet plays a central role in defining the input for finite automata, which are abstract machines used to recognize regular languages. A deterministic finite automaton (DFA) is formally defined as a 5-tuple (Q, Σ, δ, q₀, F), where Q is a finite set of states, Σ is the input alphabet (a finite non-empty set of symbols), δ : Q × Σ → Q is the transition function that specifies a unique next state for each state and input symbol, q₀ ∈ Q is the initial state, and F ⊆ Q is the set of accepting states. The automaton processes a string over Σ by starting in q₀ and applying δ sequentially for each symbol, accepting if it ends in a state in F. A nondeterministic finite automaton (NFA) extends this with a transition function δ : Q × Σ → P(Q), where P(Q) is the power set of Q, allowing multiple or no transitions per input symbol, which introduces nondeterminism but does not increase expressive power over DFAs. Despite the differences, both DFAs and NFAs rely on Σ to define the possible inputs, ensuring that only strings formed from symbols in Σ are valid for recognition.

Turing machines, which model general computation, incorporate alphabets in a more flexible manner to handle both input and working storage. A standard Turing machine is a 7-tuple (Q, Σ, Γ, δ, q₀, B, F), where Σ is the input alphabet, Γ is the tape alphabet with Σ ⊆ Γ and B ∈ Γ as the blank symbol, δ : Q × Γ → Q × Γ × {L, R} is the transition function, q₀ is the start state, and F ⊆ Q is the set of halting states. The input string is placed on the tape using symbols from Σ, with the rest of the infinite tape filled with blanks B; during computation, the machine can write any symbol from Γ, enabling auxiliary storage beyond the input alphabet. This separation allows Turing machines to simulate any effective computation, with Γ often larger than Σ to support temporary markings or intermediate results.

The pumping lemma for regular languages highlights how the alphabet's structure imposes periodicity constraints on recognizable languages. For any regular language L over alphabet Σ, there exists a pumping length p (dependent on the DFA's state count) such that any string w ∈ L with |w| ≥ p can be written as w = xyz with |xy| ≤ p, |y| ≥ 1, and xyⁱz ∈ L for all i ≥ 0. This lemma ties the finite nature of automata to repeatable substrings y (pumps) in long strings, reflecting cycles in the DFA's state graph; the size of Σ influences the minimal p but not the lemma's applicability, as periodicity arises from state finiteness rather than from |Σ|. It serves as a tool to prove non-regularity by contradiction, showing that certain languages over Σ cannot exhibit such bounded repetition.

Undecidability results in computability theory demonstrate limits on what can be decided over strings from Σ*, independent of the alphabet's size for sufficiently expressive models. The halting problem, which asks whether a Turing machine M halts on input w ∈ Σ*, is undecidable: there exists no Turing machine that, for all M and w, correctly determines whether M halts on w. This holds for any fixed non-trivial Σ (with |Σ| ≥ 2), as the proof via diagonalization constructs a machine that simulates others over Σ* and behaves oppositely on its own description, leading to a contradiction if the problem were solvable. Such limits persist across alphabet sizes even though recursive languages (decidable subsets of Σ*) exist over every Σ, underscoring that computational limits stem from the infinite nature of Σ* rather than from |Σ|.
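
The separation between input alphabet Σ and tape alphabet Γ can be sketched with a tiny simulator (the machine below is an assumed illustration, not from the source): its input alphabet is {0, 1}, while its tape alphabet additionally contains a marker X and the blank B.

```python
# A Turing machine with Σ = {'0', '1'} and Γ = {'0', '1', 'X', 'B'} (Σ ⊆ Γ).
# It scans right, overwriting every input symbol with the marker 'X', and
# accepts on reaching the first blank.
SIGMA = {"0", "1"}
BLANK = "B"

# δ: (state, tape symbol) -> (next state, symbol to write, head move)
DELTA = {
    ("q0", "0"): ("q0", "X", +1),
    ("q0", "1"): ("q0", "X", +1),
    ("q0", BLANK): ("q_accept", BLANK, +1),
}

def run(w):
    assert set(w) <= SIGMA, "the input string must be over Σ"
    tape = dict(enumerate(w))            # unwritten cells read as the blank B
    state, head = "q0", 0
    while state != "q_accept":
        symbol = tape.get(head, BLANK)
        state, write, move = DELTA[(state, symbol)]
        tape[head] = write               # may write symbols from Γ \ Σ
        head += move
    return True

print(run("0110"))   # True; the visited cells now hold 'X'
```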

In Coding and Compression

In formal language theory, the concept of an alphabet—a finite set of symbols—plays a foundational role in coding and compression by defining the discrete symbol space over which information is represented and manipulated. In source coding, which aims to compress data by removing redundancy, the alphabet specifies the possible symbols generated by a discrete information source. Claude Shannon's source coding theorem establishes that for a stationary ergodic source with a finite alphabet of size n, the entropy H = −∑_{i=1}^{n} p_i log₂ p_i bits per symbol provides a lower bound on the average number of bits needed to encode the source's output without loss of information, where p_i are the symbol probabilities. This theorem implies that reliable compression is achievable at rates approaching H, provided the code maps source symbols to a binary (or other finite) code alphabet efficiently.

Huffman coding exemplifies a practical algorithm for near-optimal compression over finite alphabets, constructing prefix codes that assign shorter codewords to more probable symbols, thereby bringing the expected code length to within 1 bit of the entropy bound. The method assumes a known probability distribution over the finite source alphabet and builds a binary tree whose leaf nodes represent symbols, ensuring instantaneous decodability. For example, symbol-by-symbol encoding of English text over a 26-letter alphabet plus space yields average code lengths close to the per-symbol entropy of roughly 4 bits per character, an improvement on the naive 5-bit fixed-length encoding; Shannon's estimates place the entropy of English nearer 1.3 bits per character once longer-range context is exploited.

In channel coding, finite alphabets underpin error-correcting codes that protect transmitted data against noise in discrete memoryless channels. Shannon's noisy-channel coding theorem states that for a channel with finite input and output alphabets, reliable communication is possible at rates up to the capacity C = max_{p(x)} I(X;Y) bits per channel use, where I(X;Y) is the mutual information between input and output, provided the code operates over the input alphabet. Block codes are typically defined as subsets of Σ^k for an alphabet Σ of size q and block length k, with a minimum distance ensuring error correction; for binary alphabets (q = 2), this enables robust transmission over binary symmetric channels. Generalized to q-ary alphabets, such codes extend to higher-radix systems like Reed–Solomon codes over finite fields, where the alphabet size q directly impacts the code's minimum distance and error-correcting capability.

Compression and coding intersect in joint source-channel coding schemes, where the source alphabet's entropy informs the allocation of channel uses over finite alphabets to optimize end-to-end rate-distortion performance. For instance, in finite-alphabet compressive sensing, signals whose entries come from a finite alphabet are recovered from compressed measurements, leveraging the alphabet's discreteness to bound reconstruction error. These applications highlight how the finiteness of alphabets enables tractable mathematical models, from entropy calculations to bounds like the Gilbert–Varshamov bound on code sizes over the finite field 𝔽_q.
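
The entropy bound and Huffman's construction can be illustrated with a small source alphabet (a sketch under assumed probabilities, not from the source); for this dyadic distribution the Huffman code meets the entropy exactly:

```python
import heapq
from math import log2

# Source alphabet with assumed symbol probabilities.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

entropy = -sum(p * log2(p) for p in probs.values())   # H = 1.75 bits/symbol

# Huffman coding: repeatedly merge the two least probable subtrees.
heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
heapq.heapify(heap)
counter = len(heap)                       # tie-breaker so dicts are never compared
while len(heap) > 1:
    p1, _, c1 = heapq.heappop(heap)
    p2, _, c2 = heapq.heappop(heap)
    merged = {s: "0" + code for s, code in c1.items()}
    merged.update({s: "1" + code for s, code in c2.items()})
    counter += 1
    heapq.heappush(heap, (p1 + p2, counter, merged))
code = heap[0][2]

avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(entropy, avg_len, code)   # 1.75 1.75 {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```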
