Production (computer science)
In computer science, a production or production rule is a rewrite rule that replaces some symbols with other symbols. A finite set of productions is the main component in the specification of a formal grammar (specifically a generative grammar).
In such grammars, a set of productions is a special case of a relation on the set of strings V^* (where ^* is the Kleene star operator) over a finite set of symbols V called a vocabulary, defining which non-empty strings can be substituted with others. The set of productions is thus a special kind of subset
P ⊆ (V^* ∖ {ε}) × V^*,
and productions (u, v) ∈ P are then written in the form u → v to mean that u may be rewritten as v (not to be confused with → being used as function notation, since there may be multiple rules u → v for the same u). Given two subsets A, B ⊆ V^*, productions can be restricted to satisfy P ⊆ A × B, in which case productions are said to be "of the form A → B". Different choices and constructions of A and B lead to different types of grammars. In general, any production of the form
u → ε,
where ε is the empty string (sometimes also denoted λ), is called an erasing rule, while productions that would produce strings out of nowhere, namely of the form
ε → v,
are never allowed.
In order to allow the production rules to create meaningful sentences, the vocabulary V is partitioned into (disjoint) sets Σ and N providing two different roles:
- Σ denotes the set of terminal symbols, known as an alphabet, containing the symbols allowed in a sentence;
- N denotes the set of nonterminal symbols, containing a distinguished start symbol S, which are needed together with the production rules to define how to build the sentences.
In the most general case of an unrestricted grammar, a production u → v is allowed to map arbitrary strings u and v in (Σ ∪ N)^* (terminals and nonterminals), as long as u is not empty. So unrestricted grammars have productions of the form
- (Σ ∪ N)^+ → (Σ ∪ N)^*,
or, if we want to disallow changing finished sentences,
- (Σ ∪ N)^* ∘ N ∘ (Σ ∪ N)^* → (Σ ∪ N)^*,
where ∘ indicates concatenation and the factor N forces a non-terminal symbol to always be present in the left-hand side of the productions, ∪ denotes set union, and ∖ denotes set minus or set difference. If we do not allow the start symbol S to occur in v (the word on the right side), we have to replace (Σ ∪ N)^* by (Σ ∪ (N ∖ {S}))^* on the right-hand side.[1]
The other types of formal grammar in the Chomsky hierarchy impose additional restrictions on what constitutes a production. Notably, in a context-free grammar, the left-hand side of a production must be a single nonterminal symbol, so productions are of the form
- A → w, where A ∈ N and w ∈ (Σ ∪ N)^*.
Grammar generation
To generate a string in the language, one begins with a string consisting of only a single start symbol, and then successively applies the rules (any number of times, in any order) to rewrite this string. This stops when a string containing only terminals is obtained. The language consists of all the strings that can be generated in this manner. Any particular sequence of legal choices taken during this rewriting process yields one particular string in the language. If there are multiple different ways of generating this single string, then the grammar is said to be ambiguous.
For example, assume the alphabet consists of a and b, with the start symbol S, and we have the following rules:
- 1. S → aSb
- 2. S → ba
then we start with S, and can choose a rule to apply to it. If we choose rule 1, we replace S with aSb and obtain the string aSb. If we choose rule 1 again, we replace the S with aSb and obtain the string aaSbb. This process is repeated until we only have symbols from the alphabet (i.e., a and b). If we now choose rule 2, we replace S with ba and obtain the string aababb, and are done. We can write this series of choices more briefly, using symbols: S ⇒ aSb ⇒ aaSbb ⇒ aababb. The language of the grammar is the set of all the strings that can be generated using this process: { ba, abab, aababb, aaababbb, … } = { a^n ba b^n | n ≥ 0 }.
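The rewriting process just described can be sketched in code. The enumerator below (an illustrative helper, not part of the grammar formalism) applies the two rules exhaustively and collects every terminal string reachable within a bounded number of steps:

```python
# Grammar from the example above: S -> aSb (rule 1), S -> ba (rule 2).
RULES = {"S": ["aSb", "ba"]}

def generate(max_steps):
    """Exhaustively expand sentential forms, collecting all terminal
    strings reachable within max_steps rule applications."""
    language = set()
    forms = {"S"}
    for _ in range(max_steps):
        next_forms = set()
        for form in forms:
            for i, sym in enumerate(form):
                for rhs in RULES.get(sym, []):
                    next_forms.add(form[:i] + rhs + form[i + 1:])
        # keep the strings made only of terminal symbols
        language |= {f for f in next_forms if all(c not in RULES for c in f)}
        forms = next_forms
    return language

print(sorted(generate(4), key=len))  # ['ba', 'abab', 'aababb', 'aaababbb']
```

Each pass rewrites one occurrence of S per form, so after n + 1 steps the string a^n ba b^n appears, matching the derivations above.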
See also
- Formal grammar
- Finite automata
- Generative grammar
- L-system
- Rewrite rule
- Backus–Naur form (A compact form for writing the productions of a context-free grammar.)
- Phrase structure rule
- Post canonical system (Emil Post's production systems, a model of computation.)
References
- ^ See Klaus Reinhardt: Prioritätszählerautomaten und die Synchronisation von Halbspursprachen Archived 2018-01-17 at the Wayback Machine; Fakultät Informatik der Universität Stuttgart; 1994 (German)
Definition and Notation
Formal Definition
In formal language theory, a production is a rewriting rule of the form α → β, where α and β are strings over a finite vocabulary V, α is non-empty, and the rule specifies that any occurrence of α in a string may be replaced by β during a derivation process.[4] A formal grammar incorporating such productions is defined as the four-tuple (V, Σ, P, S), where V is the finite set of symbols (the vocabulary), Σ ⊆ V is the finite set of terminal symbols (which cannot be rewritten), P is the finite set of productions viewed as ordered pairs (α, β), and S ∈ V ∖ Σ is the start symbol.[1] Productions distinguish between terminal symbols from Σ, which appear in the final output strings and are not subject to further replacement, and nonterminal symbols from V ∖ Σ, which serve as placeholders that can be expanded via productions.[4] For example, in a grammar generating balanced parentheses, a production such as S → (A)B might introduce nonterminals A and B (e.g., each later rewritten as a further parenthesized group or the empty string ε), illustrating how productions build structured strings from the start symbol S.[1]
Standard Notation
In formal language theory, production rules are conventionally denoted using a rightward arrow (→) to indicate the replacement of a left-hand side string (containing at least one nonterminal) by a right-hand side string during derivation, as exemplified by a rule such as A → aB (typical of context-free productions, where A and B are nonterminals and a is a terminal).[4] This notation, introduced by Noam Chomsky in his foundational work on phrase structure grammars, provides a clear and compact representation of rewriting rules within the components of a formal grammar.[4] A prominent variant of this notation is the Backus-Naur Form (BNF), which employs the assignment-like symbol ::= to define productions, particularly for context-free grammars, as in <expr> ::= <expr> + <term>.[5] BNF was introduced by John Backus in 1959 to specify the syntax of the proposed international algebraic language (later influencing ALGOL 60), offering a metalanguage that emphasizes hierarchical structure and readability in syntactic descriptions.[5] In the original BNF, alternatives for a single nonterminal are indicated using the word "or" to separate options; modern variants use the vertical bar (|) for conciseness, allowing multiple right-hand sides to be compactly expressed in one rule, e.g., <expr> ::= <expr> + <term> | <term>.[5] To address limitations in expressing repetition and optional elements succinctly, Extended BNF (EBNF) extends the BNF notation with operators borrowed from regular expressions, including the Kleene star (*) for zero or more repetitions of a symbol or group, as in digit*. Parentheses are also used in EBNF for grouping subexpressions, enabling more concise rules like term ( "+" term )*, which denotes one or more terms joined by +. This extension was proposed by Niklaus Wirth in 1977 to standardize and simplify syntactic notations across diverse formal descriptions.[6]
Types of Productions
Context-Free Productions
In formal language theory, a context-free production is a rule in a context-free grammar (CFG) where the left-hand side consists of a single non-terminal symbol from the set of variables V, and the right-hand side is any finite string composed of symbols from the union of variables and terminals, denoted as A → α with A ∈ V and α ∈ (V ∪ Σ)^*.[7] This structure was formalized by Noam Chomsky as part of Type-2 grammars in the Chomsky hierarchy, distinguishing them from more restrictive or permissive rule types.[4] A key property of context-free productions is their independence from the surrounding context in a derivation string; the replacement of A with α can occur regardless of the adjacent symbols, which simplifies the modeling of recursive and nested structures in languages.[7] This context independence facilitates efficient parsing algorithms, such as the Cocke-Younger-Kasami (CYK) algorithm, that recognize membership in context-free languages in polynomial time for grammars in restricted forms.[8] For instance, a CFG for simple arithmetic expressions with addition, multiplication, parentheses, and identifiers can be defined using the following productions, where E represents expressions, T terms, F factors, and id identifiers:
E → E + T | T
T → T * F | F
F → ( E ) | id
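Because E → E + T is left-recursive, a literal recursive-descent transcription would loop forever; the recursive alternatives can instead be realized as iteration. A minimal Python sketch of a recognizer for this grammar (the tokenizer and function names are illustrative, not from the sources cited above):

```python
import re

def tokenize(s):
    # id, parentheses, and the operators used by the E/T/F grammar
    return re.findall(r"id|[()+*]", s)

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok):
        nonlocal pos
        assert peek() == tok, f"expected {tok!r}, got {peek()!r}"
        pos += 1

    def expr():          # E -> E + T | T, iterated as T ("+" T)*
        term()
        while peek() == "+":
            eat("+")
            term()

    def term():          # T -> T * F | F, iterated as F ("*" F)*
        factor()
        while peek() == "*":
            eat("*")
            factor()

    def factor():        # F -> ( E ) | id
        if peek() == "(":
            eat("(")
            expr()
            eat(")")
        else:
            eat("id")

    expr()
    return pos == len(tokens)   # accepted iff all input was consumed

print(parse(tokenize("id + id * ( id + id )")))  # True
```

The nesting of expr/term/factor mirrors the precedence that the grammar encodes: * binds tighter than + because F sits below T, which sits below E.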
Context-Sensitive Productions
Context-sensitive productions form the core of context-sensitive grammars, allowing rewriting rules where the application of a production depends on the surrounding symbols in the derivation string. These rules take the form αAβ → αγβ, where A is a single non-terminal, α and β are (possibly empty) strings of terminals and non-terminals providing the left and right contexts, respectively, and γ is a non-empty string consisting of terminals and non-terminals.[10] This structure ensures that the length of the right-hand side is at least as long as the left-hand side, preserving or increasing the string length during derivations, which distinguishes them from more permissive Type-0 productions. Context-sensitive languages, generated using these productions, are recognized exactly by non-deterministic linear bounded automata (LBAs), a restricted class of Turing machines where the tape space is linearly bounded by the input size.[11] This equivalence highlights the computational limits of context-sensitive productions, as LBAs simulate the context-dependent rewriting while operating within linear space, connecting grammatical formalisms to automata-based models of computation. A representative example of a language requiring context-sensitive productions is { a^n b^n c^n | n ≥ 1 }, which captures strings where the counts of a's, b's, and c's are equal but which cannot be generated context-freely due to the cross-dependencies.
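A classical rule set for this language (the textbook-style grammar below is one standard choice, assumed here for illustration) can be exercised with a small rewriting search; because the rules never shrink the string, forms longer than the target can be pruned:

```python
# Classical non-contracting grammar for { a^n b^n c^n | n >= 1 }:
RULES = [
    ("S", "aSBC"), ("S", "aBC"),
    ("CB", "BC"),
    ("aB", "ab"), ("bB", "bb"),
    ("bC", "bc"), ("cC", "cc"),
]

def derivable(target, start="S"):
    """Exhaustive search over sentential forms; monotonic rules let us
    discard any form longer than the target string."""
    frontier, seen = {start}, set()
    while frontier:
        form = frontier.pop()
        if form == target:
            return True
        seen.add(form)
        for lhs, rhs in RULES:
            i = form.find(lhs)
            while i != -1:
                new = form[:i] + rhs + form[i + len(lhs):]
                if len(new) <= len(target) and new not in seen:
                    frontier.add(new)
                i = form.find(lhs, i + 1)
    return False

print(derivable("aabbcc"))  # True
```

This brute-force search is exponential in general; it only demonstrates the productions at work and is not a practical recognizer.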
In a grammar for this language, a production such as bB → bb exemplifies context preservation: the left context b ensures a B is rewritten as the terminal b only when adjacent to a b already produced, while non-terminals such as B and C track pending b's and c's through additional rules to maintain equality across the segments.[12] The problem of recognizing whether a string belongs to a context-sensitive language is PSPACE-complete, reflecting the high computational demands of simulating non-deterministic derivations within linear space bounds.[13] This complexity exceeds that of context-free recognition (which is in P) and stems from the ability of LBAs to solve PSPACE-hard problems, such as quantified Boolean formulas, within their tape limits.
Role in Language Generation
Derivation Process
In formal grammar theory, a derivation is a sequence of applications of production rules that transforms the start symbol S into a string over the terminal alphabet Σ, by repeatedly replacing the left-hand side of a production with its right-hand side until only terminals remain.[14] Each intermediate result in this process is known as a sentential form, which is any string in V^* (where V is the vocabulary) that can be derived from S via zero or more production applications; sentential forms may contain both terminals and nonterminals.[14] A single application of a production rule constitutes a direct derivation, denoted u ⇒ v, where u and v are sentential forms and v is obtained by replacing one occurrence of a production's left-hand side in u with the corresponding right-hand side.[14] A full derivation, or multi-step derivation, is the reflexive transitive closure of direct derivation, denoted ⇒*, allowing zero or more steps. Derivations can proceed in different orders: a leftmost derivation replaces the leftmost nonterminal in the current sentential form at each step; a rightmost derivation replaces the rightmost nonterminal; and an arbitrary derivation replaces any nonterminal without a fixed order.[14] These variants ensure that for any context-free grammar, every derivable terminal string has at least one leftmost and one rightmost derivation, though the sequences may differ.[15] The language generated by a grammar G, denoted L(G), consists of all terminal strings w ∈ Σ^* such that S ⇒* w.[14] This set captures the generative power of the productions, where only strings reachable via complete derivations (ending in terminals) belong to L(G). For example, consider a simple context-free grammar for basic arithmetic expressions with productions E → E + T | T and T → id, where E is the start symbol and id represents an identifier (a terminal).
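The leftmost replacement mechanics can be sketched in a few lines (symbol names follow the example grammar; the helper itself is illustrative):

```python
# Grammar from the example above: E -> E + T | T  and  T -> id.
GRAMMAR = {"E": [["E", "+", "T"], ["T"]], "T": [["id"]]}

def leftmost(form, choice):
    """Rewrite the leftmost nonterminal in `form` using alternative
    number `choice` of its productions (forms are lists of symbols,
    so that the terminal 'id' stays a single symbol)."""
    for i, sym in enumerate(form):
        if sym in GRAMMAR:
            return form[:i] + GRAMMAR[sym][choice] + form[i + 1:]
    return form  # no nonterminal left: a terminal string

# Leftmost derivation of id + id in four direct derivations:
form = ["E"]
for c in [0, 1, 0, 0]:      # E -> E + T, E -> T, T -> id, T -> id
    form = leftmost(form, c)
print(" ".join(form))       # id + id
```

Choosing the rightmost nonterminal instead would yield the same string via a different sequence of sentential forms, illustrating the leftmost/rightmost distinction.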
A leftmost derivation of the string id + id proceeds as follows: E ⇒ E + T ⇒ T + T ⇒ id + T ⇒ id + id. This sequence replaces the leftmost nonterminal at each step, yielding the target string after four direct derivations.[16]
Chomsky Hierarchy Integration
The Chomsky hierarchy classifies formal grammars, and thus the languages they generate, based on the restrictions imposed on their productions, establishing a nested structure of increasing expressive power that corresponds to different computational models.[17] At the base level, Type-3 grammars, also known as regular grammars, restrict productions to right-linear forms A → wB or A → w, where A and B are nonterminals and w is a string of terminals, or equivalently to left-linear forms; these generate regular languages equivalent to those recognized by finite automata.[17] Type-2 grammars, or context-free grammars, relax this restriction by allowing productions of the form A → α, where A is a single nonterminal and α is any string of terminals and nonterminals, enabling the generation of context-free languages through derivations that replace nonterminals independently of surrounding context.[17] Type-1 grammars, termed context-sensitive, further broaden expressiveness with productions u → v where |u| ≤ |v|, ensuring non-contracting rules that allow context-dependent replacements while preserving or increasing string length, thus generating context-sensitive languages.[17] At the apex, Type-0 grammars impose no restrictions on productions, permitting arbitrary u → v where u contains at least one nonterminal, yielding unrestricted grammars that generate recursively enumerable languages equivalent in power to Turing machines.[17] This hierarchy, proposed by Noam Chomsky in 1956, links the restrictive power of productions to progressively more capable models of computation, with derivations serving as the mechanism for language generation across all levels.[18]
Applications in Computing
Parsing Algorithms
Parsing algorithms utilize productions from a context-free grammar in reverse to analyze an input string, determining membership in the language by constructing a valid derivation tree or leftmost derivation that yields the input.[19] This process inverts the forward generation of strings via productions, verifying syntactic structure through systematic application of grammar rules.[20] Top-down parsing starts from the grammar's start symbol and recursively expands non-terminals using productions to match the input from left to right. LL parsers exemplify this approach, employing predictive production selection based on lookahead tokens to choose applicable rules without backtracking in LL(1) cases. Recursive descent parsers implement this via mutually recursive procedures, each corresponding to a non-terminal and invoking expansions as dictated by the grammar.[21] In contrast, bottom-up parsing builds the parse tree from the input string upward, recognizing substrings as non-terminals by applying the inverse of productions through reduction steps. LR parsers, such as those using shift-reduce techniques, maintain a stack of symbols and shift input tokens before reducing matched handles to non-terminals, enabling efficient handling of a broad class of context-free grammars. Ambiguity arises when multiple productions can apply to the same substring, potentially yielding distinct parse trees for the input and complicating unique structure identification.[22] Resolving this often requires grammar redesign or disambiguation rules to ensure deterministic parsing. A representative example is the Cocke-Younger-Kasami (CYK) algorithm, which parses context-free grammars in Chomsky normal form using dynamic programming. The algorithm constructs a triangular table in which entry T[i, j] holds the non-terminals deriving the substring of length j starting at position i; for length 1, productions with terminal right-hand sides populate the base, and longer spans are combined via productions A → BC whenever B and C derive adjacent subintervals.
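A direct transcription of this table-filling scheme, using an illustrative CNF grammar for { a^n b^n | n ≥ 1 } (the rule encoding below is an assumption for the sketch, not from the cited sources):

```python
from itertools import product

# CNF grammar: S -> AB | AT,  T -> SB,  A -> a,  B -> b
UNIT = {"a": {"A"}, "b": {"B"}}                      # terminal rules A -> a
PAIR = {("A", "B"): {"S"}, ("A", "T"): {"S"}, ("S", "B"): {"T"}}

def cyk(w, start="S"):
    n = len(w)
    if n == 0:
        return False
    # table[i][l] = nonterminals deriving the substring of length l+1 at i
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(w):
        table[i][0] = set(UNIT.get(ch, ()))
    for length in range(2, n + 1):           # span length
        for i in range(n - length + 1):      # span start
            for split in range(1, length):   # split point inside the span
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for pair in product(left, right):
                    table[i][length - 1] |= PAIR.get(pair, set())
    return start in table[0][n - 1]

print(cyk("aabb"), cyk("aab"))  # True False
```

The three nested loops over span length, start, and split point give the cubic running time discussed below.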
The algorithm yields recognition in O(n³) time for an input of length n, and can also build the parse forest if needed.[23]
Compiler Design
In compiler design, productions form the core of syntax specification for programming languages, defining the hierarchical structure of valid code through context-free grammars (CFGs) in the front-end phases. A CFG consists of non-terminals, terminals, productions, and a start symbol, where each production rule of the form A → α (with A a non-terminal and α a string over terminals and non-terminals) specifies how syntactic constructs expand. These grammars enable the parser to validate input tokens from the lexical analyzer against the language's rules, constructing parse trees or abstract syntax trees (ASTs) that represent the program's structure for subsequent semantic analysis and optimization. This approach ensures syntactic correctness and facilitates error detection, as the parser reduces input sequences according to the productions to derive the start symbol.[24] Tools like Yacc and its open-source successor Bison automate parser generation by taking BNF-style productions as input to produce efficient LALR(1) parsers for compiler front-ends. The input to Yacc includes a declarations section for token definitions, a rules section specifying productions (e.g., stmt: IF '(' expr ')' stmt ELSE stmt ;, with alternatives via |), and a programs section for auxiliary code, separated by %%. Associated actions in curly braces, such as { $$ = create_if_node($3, $5, $7); }, execute C code upon matching a production, enabling semantic actions like AST node creation during parsing. Yacc generates a deterministic parser subroutine that calls the lexical analyzer (e.g., one produced by Lex) for tokens, handling shifts and reductions based on the grammar to build structured output, thus streamlining compiler implementation for languages with complex syntax.[25]
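As a sketch of the layout just described, a minimal Yacc/Bison input for arithmetic expressions might look as follows (token names and precedence choices are illustrative; the precedence declarations resolve the ambiguity of the ambiguous rules):

```yacc
%token NUM
%left '+'
%left '*'          /* higher precedence than '+' */
%%
expr : expr '+' expr { $$ = $1 + $3; }
     | expr '*' expr { $$ = $1 * $3; }
     | '(' expr ')'  { $$ = $2; }
     | NUM           { $$ = $1; }
     ;
%%
```

Here the declarations section supplies the token and precedence information, the rules section lists the productions with their semantic actions, and the (empty) programs section would hold supporting C code.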
Attribute grammars extend standard productions by attaching attributes to grammar symbols, allowing synthesized attributes (computed bottom-up from children) and inherited attributes (computed top-down from parents and siblings) to incorporate semantic analysis directly into the syntax phase. Introduced by Knuth, these augment CFGs to define language semantics, such as type checking or scope resolution, where production rules include equations like T.type = E1.type for a binary operator, evaluated during parsing to propagate information across the AST. In compilers, attribute grammars support syntax-directed translation, evaluating attributes in one pass to generate intermediate code or detect errors, with tools evaluating dependencies to ensure efficiency.[26]
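The synthesized-attribute idea can be sketched with a tiny evaluator: each parse-tree node's attribute (here a type; node labels and the widening rule are hypothetical) is computed bottom-up from its children, mirroring the equations attached to productions:

```python
# Minimal synthesized-attribute sketch (illustrative, not Knuth's notation).
# Each node is a tuple: (production label, child nodes...).
def type_of(node):
    """Bottom-up evaluation: a node's `type` attribute is synthesized
    from the attributes of its children."""
    label = node[0]
    if label == "num":
        return "int"
    if label == "real":
        return "float"
    if label == "add":                   # production E -> E1 + E2 with rule
        t1, t2 = map(type_of, node[1:])  # E.type = widen(E1.type, E2.type)
        return "float" if "float" in (t1, t2) else "int"
    raise ValueError(f"unknown production {label!r}")

tree = ("add", ("num",), ("add", ("num",), ("real",)))
print(type_of(tree))  # float
```

An inherited attribute would flow the other way, e.g. a symbol table passed from a declaration down into the subtree that uses it.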
A representative example appears in the C++ grammar, where productions for selection statements handle conditional logic: selection-statement: if (init-statement_opt condition) statement or if (init-statement_opt condition) statement else statement, with optional constexpr or consteval modifiers. These rules, part of the ISO standard's grammar notation, allow the parser to recognize structures like if (x > 0) y = 1; else y = 0;, resolving ambiguities via precedence and associating the else with the nearest if. Such productions integrate with attribute evaluation for semantic checks, like ensuring the condition's type is boolean.[27]
The use of productions in syntax specification evolved from ALGOL 60's pioneering adoption of Backus-Naur Form (BNF) metalinguistic formulae, which formalized rules like <statement> ::= <unconditional statement> | <conditional statement>, replacing earlier informal syntax diagrams with precise, recursive definitions for unambiguous parsing. This shift enabled rigorous language descriptions, influencing subsequent standards. Modern tools build on this by automating transformations, such as eliminating left recursion (e.g., converting E → E + T | T to right-recursive forms via substitutions) to support top-down or bottom-up parsers, while handling larger grammars through modularization and error recovery.[28][24]
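The left-recursion transformation mentioned above (A → Aα | β becomes A → βA′, A′ → αA′ | ε) can be sketched for the immediate case; the representation of productions as symbol tuples is an assumption of this sketch:

```python
def eliminate_immediate_left_recursion(nt, productions):
    """Transform  A -> A a1 | ... | b1 | ...  into
         A  -> b1 A' | ...
         A' -> a1 A' | ... | epsilon
    Productions are tuples of symbols; the fresh nonterminal is nt + "'"
    and the empty tuple () stands for epsilon."""
    recursive = [p[1:] for p in productions if p and p[0] == nt]
    base = [p for p in productions if not p or p[0] != nt]
    if not recursive:
        return {nt: productions}        # nothing to do
    new = nt + "'"
    return {
        nt: [b + (new,) for b in base],
        new: [a + (new,) for a in recursive] + [()],
    }

# E -> E + T | T   becomes   E -> T E',  E' -> + T E' | epsilon
print(eliminate_immediate_left_recursion("E", [("E", "+", "T"), ("T",)]))
```

The transformed grammar generates the same language but is suitable for top-down (LL/recursive-descent) parsing, at the cost of changing the shape of the parse trees.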