Natural language generation

from Wikipedia

Natural language generation (NLG) is a software process that produces natural language output. A widely cited survey of NLG methods describes NLG as "the subfield of artificial intelligence and computational linguistics that is concerned with the construction of computer systems that can produce understandable texts in English or other human languages from some underlying non-linguistic representation of information".[1]

While it is widely agreed that the output of any NLG process is text, there is some disagreement about whether the inputs of an NLG system need to be non-linguistic.[2] Common applications of NLG methods include the production of various reports, for example weather[3] and patient reports;[4] image captions;[5] and chatbots like ChatGPT.

Automated NLG can be compared to the process humans use when they turn ideas into writing or speech. Psycholinguists prefer the term language production for this process, which can also be described in mathematical terms, or modeled in a computer for psychological research. NLG systems can also be compared to translators of artificial computer languages, such as decompilers or transpilers, which also produce human-readable code generated from an intermediate representation. Human languages tend to be considerably more complex and allow for much more ambiguity and variety of expression than programming languages, which makes NLG more challenging.

NLG may be viewed as complementary to natural-language understanding (NLU): whereas in natural-language understanding, the system needs to disambiguate the input sentence to produce the machine representation language, in NLG the system needs to make decisions about how to put a representation into words. The practical considerations in building NLU vs. NLG systems are not symmetrical. NLU needs to deal with ambiguous or erroneous user input, whereas the ideas the system wants to express through NLG are generally known precisely. NLG needs to choose a specific, self-consistent textual representation from many potential representations, whereas NLU generally tries to produce a single, normalized representation of the idea expressed.[6]

NLG has existed since ELIZA was developed in the mid 1960s, but the methods were first used commercially in the 1990s.[7] NLG techniques range from simple template-based systems like a mail merge that generates form letters, to systems that have a complex understanding of human grammar. NLG can also be accomplished by training a statistical model using machine learning, typically on a large corpus of human-written texts.[8]

Example

The Pollen Forecast for Scotland system[9] is a simple example of an NLG system that could essentially be based on a template. This system takes as input six numbers, which give predicted pollen levels in different parts of Scotland. From these numbers, the system generates a short textual summary of pollen levels as its output.

For example, using the historical data for July 1, 2005, the software produces:

Grass pollen levels for Friday have increased from the moderate to high levels of yesterday with values of around 6 to 7 across most parts of the country. However, in Northern areas, pollen levels will be moderate with values of 4.

In contrast, the actual forecast (written by a human meteorologist) from this data was:

Pollen counts are expected to remain high at level 6 over most of Scotland, and even level 7 in the south east. The only relief is in the Northern Isles and far northeast of mainland Scotland with medium levels of pollen count.

Comparing these two illustrates some of the choices that NLG systems must make; these are further discussed below.
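
A minimal sketch of how such a template-based generator might work is shown below. The region names, thresholds, and wording are illustrative assumptions, not the actual rules of the Pollen Forecast for Scotland system.

# Hypothetical template-based generator in the spirit of the pollen example.
# Input: predicted pollen levels (0-10) for six illustrative regions.

def describe_level(value):
    # Map a numeric level to a verbal category (thresholds are assumptions).
    if value >= 6:
        return "high"
    if value >= 4:
        return "moderate"
    return "low"

def pollen_summary(levels):
    """levels: dict mapping region name -> predicted pollen level."""
    overall = max(levels.values())
    low_regions = [r for r, v in levels.items() if v < overall - 1]
    text = (f"Grass pollen levels will be {describe_level(overall)}, "
            f"with values of around {overall} across most parts of the country.")
    if low_regions:
        low_value = min(levels[r] for r in low_regions)
        text += (f" However, in {', '.join(low_regions)}, levels will be "
                 f"{describe_level(low_value)} with values of {low_value}.")
    return text

print(pollen_summary({"Central": 6, "South East": 7, "South West": 6,
                      "North East": 4, "North West": 4, "Northern Isles": 4}))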

Stages

The process to generate text can be as simple as keeping a list of canned text that is copied and pasted, possibly linked with some glue text. The results may be satisfactory in simple domains such as horoscope machines or generators of personalized business letters. However, a sophisticated NLG system needs to include stages of planning and merging of information to enable the generation of text that looks natural and does not become repetitive. The typical stages of natural-language generation, as proposed by Dale and Reiter,[6] are:

Content determination
Deciding what information to mention in the text. For instance, in the pollen example above, deciding whether to explicitly mention that pollen level is 7 in the southeast.
Document structuring
Overall organisation of the information to convey. For example, deciding to describe the areas with high pollen levels first, instead of the areas with low pollen levels.
Aggregation
Merging of similar sentences to improve readability and naturalness. For instance, merging the sentences "Grass pollen levels for Friday have increased from the moderate to high levels of yesterday." and "Grass pollen levels will be around 6 to 7 across most parts of the country." into a single sentence of "Grass pollen levels for Friday have increased from the moderate to high levels of yesterday with values of around 6 to 7 across most parts of the country."
Lexical choice
Putting words to the concepts. For example, deciding whether medium or moderate should be used when describing a pollen level of 4.
Referring expression generation
Creating referring expressions that identify objects and regions. For example, deciding to use in the Northern Isles and far northeast of mainland Scotland to refer to a certain region in Scotland. This task also includes making decisions about pronouns and other types of anaphora.
Realization
Creating the actual text, which should be correct according to the rules of syntax, morphology, and orthography. For example, using will be for the future tense of to be.

An alternative approach to NLG is to use "end-to-end" machine learning to build a system, without having separate stages as above.[10] In other words, we build an NLG system by training a machine learning algorithm (often an LSTM) on a large data set of input data and corresponding (human-written) output texts. The end-to-end approach has perhaps been most successful in image captioning,[11] that is, automatically generating a textual caption for an image.

Applications

Automatic report generation

From a commercial perspective, the most successful NLG applications have been data-to-text systems which generate textual summaries of databases and data sets; these systems usually perform data analysis as well as text generation. Research has shown that textual summaries can be more effective than graphs and other visuals for decision support,[12][13][14] and that computer-generated texts can be superior (from the reader's perspective) to human-written texts.[15]

The first commercial data-to-text systems produced weather forecasts from weather data. The earliest such system to be deployed was FoG,[3] which was used by Environment Canada to generate weather forecasts in French and English in the early 1990s. The success of FoG triggered other work, both research and commercial. Recent applications include the UK Met Office's text-enhanced forecast.[16]

Data-to-text systems have since been applied in a range of settings. Following the minor earthquake near Beverly Hills, California on March 17, 2014, The Los Angeles Times reported details about the time, location and strength of the quake within 3 minutes of the event. This report was automatically generated by a 'robo-journalist', which converted the incoming data into text via a preset template.[17][18] Currently there is considerable commercial interest in using NLG to summarise financial and business data. Indeed, Gartner has said that NLG will become a standard feature of 90% of modern BI and analytics platforms.[19] NLG is also being used commercially in automated journalism, chatbots, generating product descriptions for e-commerce sites, summarising medical records,[20][4] and enhancing accessibility (for example by describing graphs and data sets to blind people[21]).

An example of an interactive use of NLG is the WYSIWYM framework. It stands for What you see is what you meant and allows users to see and manipulate the continuously rendered view (NLG output) of an underlying formal language document (NLG input), thereby editing the formal language without learning it.

Looking ahead, the current progress in data-to-text generation paves the way for tailoring texts to specific audiences. For example, data from babies in neonatal care can be converted into text differently in a clinical setting, with different levels of technical detail and explanatory language, depending on intended recipient of the text (doctor, nurse, patient). The same idea can be applied in a sports setting, with different reports generated for fans of specific teams.[22]

Image captioning

Over the past few years, there has been an increased interest in automatically generating captions for images, as part of a broader endeavor to investigate the interface between vision and language. A case of data-to-text generation, the algorithm of image captioning (or automatic image description) involves taking an image, analyzing its visual content, and generating a textual description (typically a sentence) that verbalizes the most prominent aspects of the image.

An image captioning system involves two sub-tasks. In Image Analysis, features and attributes of an image are detected and labelled, before mapping these outputs to linguistic structures. Recent research utilizes deep learning approaches through features from a pre-trained convolutional neural network such as AlexNet, VGG or Caffe, where caption generators use an activation layer from the pre-trained network as their input features. Text Generation, the second task, is performed using a wide range of techniques. For example, in the Midge system, input images are represented as triples consisting of object/stuff detections, action/pose detections and spatial relations. These are subsequently mapped to <noun, verb, preposition> triples and realized using a tree substitution grammar.[22]

A common method in image captioning is to use a vision model (such as a ResNet) to encode an image into a vector, then use a language model (such as an RNN) to decode the vector into a caption.[23][24]
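
A minimal sketch of this encoder-decoder pattern in PyTorch follows. The model sizes, vocabulary handling, and training loop are assumptions or omitted; it is an illustrative skeleton, not a reference implementation.

import torch
import torch.nn as nn
import torchvision

# Illustrative encoder-decoder captioner: a ResNet encodes the image into a
# vector, and an LSTM decodes that vector into a caption, token by token.

class Captioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Newer torchvision uses the weights argument; older versions use pretrained=True.
        resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])  # drop classifier head
        self.img_proj = nn.Linear(512, embed_dim)       # image vector -> embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) token ids (teacher forcing).
        with torch.no_grad():
            feats = self.encoder(images).flatten(1)     # (B, 512)
        img_emb = self.img_proj(feats).unsqueeze(1)     # (B, 1, E)
        word_emb = self.embed(captions)                 # (B, T, E)
        inputs = torch.cat([img_emb, word_emb], dim=1)  # image vector starts the sequence
        hidden, _ = self.decoder(inputs)
        return self.out(hidden)                         # (B, T+1, vocab) logits

model = Captioner(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))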

Despite advancements, challenges and opportunities remain in image captioning research. Although the recent introduction of Flickr30K, MS COCO and other large datasets has enabled the training of more complex models such as neural networks, it has been argued that research in image captioning could benefit from larger and more diversified datasets. Designing automatic measures that can mimic human judgments in evaluating the suitability of image descriptions is another need in the area. Other open challenges include visual question-answering (VQA),[25] as well as the construction and evaluation of multilingual repositories for image description.[22]

Chatbots

Another area where NLG has been widely applied is automated dialogue systems, frequently in the form of chatbots. A chatbot or chatterbot is a software application used to conduct an on-line chat conversation via text or text-to-speech, in lieu of providing direct contact with a live human agent. While natural language processing (NLP) techniques are applied in deciphering human input, NLG informs the output part of the chatbot algorithms in facilitating real-time dialogues.

Early chatbot systems, including Cleverbot created by Rollo Carpenter in 1988 and published in 1997,[citation needed] reply to questions by identifying how a human has responded to the same question in a conversation database using information retrieval (IR) techniques.[citation needed] Modern chatbot systems predominantly rely on machine learning (ML) models, such as sequence-to-sequence learning and reinforcement learning, to generate natural language output. Hybrid models have also been explored. For example, the Alibaba shopping assistant first uses an IR approach to retrieve the best candidates from the knowledge base, then uses the ML-driven seq2seq model to re-rank the candidate responses and generate the answer.[26]

Creative writing and computational humor

Creative language generation by NLG has been hypothesized since the field's origins. A recent pioneer in the area is Phillip Parker, who has developed an arsenal of algorithms capable of automatically generating textbooks, crossword puzzles, poems and books on topics ranging from bookbinding to cataracts.[27] The advent of large pretrained transformer-based language models such as GPT-3 has also enabled breakthroughs, with such models demonstrating recognizable ability for creative-writing tasks.[28]

A related area of NLG application is computational humor production.  JAPE (Joke Analysis and Production Engine) is one of the earliest large, automated humor production systems that uses a hand-coded template-based approach to create punning riddles for children. HAHAcronym creates humorous reinterpretations of any given acronym, as well as proposing new fitting acronyms given some keywords.[29]

Despite progress, many challenges remain in producing automated creative and humorous content that rivals human output. In one experiment on generating satirical headlines, outputs of the best BERT-based model were perceived as funny 9.4% of the time (while real headlines from The Onion were rated funny 38.4% of the time), and a GPT-2 model fine-tuned on satirical headlines achieved 6.9%.[30] It has been pointed out that two main issues with humor-generation systems are the lack of annotated data sets and the lack of formal evaluation methods,[29] which could be applicable to other creative content generation. Some have argued that, relative to other applications, there has been a lack of attention to creative aspects of language production within NLG. NLG researchers stand to benefit from insights into what constitutes creative language production, as well as structural features of narrative that have the potential to improve NLG output even in data-to-text systems.[22]

Evaluation

As in other scientific fields, NLG researchers need to test how well their systems, modules, and algorithms work. This is called evaluation. There are three basic techniques for evaluating NLG systems:

  • Task-based (extrinsic) evaluation: give the generated text to a person, and assess how well it helps them perform a task (or otherwise achieves its communicative goal). For example, a system which generates summaries of medical data can be evaluated by giving these summaries to doctors, and assessing whether the summaries help doctors make better decisions.[4]
  • Human ratings: give the generated text to a person, and ask them to rate the quality and usefulness of the text.
  • Metrics: compare generated texts to texts written by people from the same input data, using an automatic metric such as BLEU, METEOR, ROUGE and LEPOR.
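
For example, sentence-level BLEU can be computed with off-the-shelf tooling such as NLTK; the texts below are made up for illustration.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["pollen", "levels", "will", "be", "high", "across", "most",
              "of", "scotland", "today"]]            # human-written reference(s)
candidate = ["pollen", "levels", "are", "high", "across", "scotland", "today"]

# BLEU-4 with smoothing, since short sentences often have zero higher-order
# n-gram matches.
score = sentence_bleu(reference, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")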

Ultimately, what matters is how useful NLG systems are at helping people, which is what the first of the above techniques measures. However, task-based evaluations are time-consuming and expensive, and can be difficult to carry out (especially if they require subjects with specialised expertise, such as doctors). Hence (as in other areas of NLP) task-based evaluations are the exception, not the norm.

Recently, researchers have been assessing how well human ratings and metrics correlate with (i.e., predict) task-based evaluations. Work is being conducted in the context of Generation Challenges[31] shared-task events. Initial results suggest that human ratings are much better than metrics in this regard. In other words, human ratings usually do predict task-effectiveness at least to some degree (although there are exceptions), while ratings produced by metrics often do not predict task-effectiveness well. These results are preliminary. In any case, human ratings are the most popular evaluation technique in NLG; this is in contrast to machine translation, where metrics are widely used.

An AI can be graded on faithfulness to its training data or, alternatively, on factuality. A response that reflects the training data but not reality is faithful but not factual. A confident but unfaithful response is a hallucination. In Natural Language Processing, a hallucination is often defined as "generated content that is nonsensical or unfaithful to the provided source content".[32]

from Grokipedia
Natural language generation (NLG) is the subfield of artificial intelligence and computational linguistics focused on the automatic production of human-readable text from structured data or other non-linguistic inputs to achieve specific communicative objectives. This process involves transforming abstract representations, such as databases, knowledge graphs, or semantic structures, into coherent, fluent, and contextually appropriate outputs. Unlike natural language understanding, which interprets text, NLG emphasizes deliberate construction to meet goals like informing, persuading, or entertaining. The core architecture of NLG systems typically follows a pipeline comprising content planning (selecting relevant information), discourse structuring (organizing it logically), sentence planning (choosing words and aggregation), and surface realization (ensuring grammaticality and fluency).

Early NLG efforts relied on rule-based and template-filling methods for simple tasks, such as generating reports or database summaries, but these were limited in flexibility and scalability. Later, more sophisticated systems emerged, incorporating knowledge representation techniques to handle complex domains like medical reporting and explanations. The advent of statistical and neural approaches then revolutionized NLG, enabling end-to-end models that learn directly from data-to-text pairs without explicit modular stages. Transformer-based large language models, such as GPT variants including the GPT-5 series released in 2025, have further advanced the field by producing diverse, creative text for applications including dialogue systems, content summarization, and creative writing. These neural methods excel in handling open-ended generation but introduce challenges like factual inaccuracies (hallucinations) and the need for controllability.

As of 2025, NLG applications span numerous domains, from personalized descriptions derived from product ontologies to accessible explanations of data for non-experts. In task-oriented dialogue systems, NLG integrates with natural language understanding to generate responses that align with user intents and system policies. Evaluation metrics for NLG emphasize fluency, adequacy, and informativeness, often using automated scores like BLEU alongside human judgments. Ongoing research addresses ethical concerns, such as bias mitigation and ensuring generated text's trustworthiness, particularly in high-stakes areas like legal or healthcare communication.

Fundamentals

Definition and Scope

Natural language generation (NLG) is the subfield of artificial intelligence and computational linguistics concerned with the construction of computer systems that produce understandable texts in human languages from underlying non-linguistic representations of information, such as databases, knowledge bases, or structured inputs. This process involves deliberately constructing natural language text to meet specified communicative goals, transforming data into coherent, fluent output that mimics human-like expression.

In scope, NLG focuses on output generation rather than input parsing, distinguishing it from natural language understanding (NLU), which maps text to meaning representations. While NLU interprets unstructured language, NLG inverts this by generating text from semantic or structured sources, encompassing sub-tasks such as text summarization from documents and dialogue response creation in conversational systems. NLG operates within the broader natural language processing (NLP) pipeline, where NLP encompasses both understanding and generation but NLG specifically handles the production phase.

Key concepts in NLG include the transformation of diverse input types—ranging from numerical data to semantic representations—into varied output forms like reports, captions, or descriptive narratives. For instance, inputs from knowledge bases might yield explanatory texts, emphasizing the need for coherence and appropriateness in the generated language. As a core component of human-AI interaction, NLG enables machines to communicate effectively in human language, bridging the gap between computational systems and human users by producing readable and informative text from abstract representations. This capability supports applications in AI-driven interfaces, where generated language enhances the accessibility and interpretability of machine outputs.

Historical Development

The origins of natural language generation (NLG), the subfield of artificial intelligence focused on producing coherent and contextually appropriate text from non-linguistic inputs, trace back to the mid-20th century amid broader advances in computational linguistics and artificial intelligence. Foundational theoretical work in the 1950s and 1960s, particularly Noam Chomsky's introduction of transformational-generative grammar in 1957, emphasized hierarchical structures and transformational rules for language production, influencing early computational efforts to model text generation as a systematic process akin to human language production. In the following decades, initial experiments in rule-based systems emerged, building on these linguistic theories to generate simple sentences from logical representations, though limited by computational constraints and a lack of empirical data.

The classical era of NLG in the 1980s and 1990s shifted toward structured pipeline architectures, emphasizing modular processes for content planning, sentence structuring, and surface realization. David D. McDonald's 1982 work on salience in selection mechanisms highlighted how prioritizing key information could guide text construction in rule-based generators, addressing challenges in choosing what to express from complex inputs. Concurrently, the PENMAN project, developed by William C. Mann at the USC Information Sciences Institute, introduced a comprehensive text generation system that integrated knowledge representation with rhetorical planning, enabling the production of multi-sentence discourses. A pivotal contribution was Rhetorical Structure Theory (RST), formalized by Mann and Thompson in 1988, which modeled text coherence through hierarchical relations between spans (e.g., elaboration, contrast), providing a framework for organizing generated content to mimic human argumentation and narrative flow. These template- and rule-driven approaches dominated, focusing on domain-specific applications like weather reports, but struggled with scalability and flexibility.

The 2000s marked a transition to data-driven paradigms, incorporating statistical methods to handle variability in language output. Irene Langkilde's forest-based generation system (2000) represented a breakthrough by combining symbolic input representations with statistical optimization over vast realization forests, drawing on statistical language modeling techniques to select fluent sentences probabilistically rather than exhaustively via rules. This integration allowed NLG to leverage parallel corpora and n-gram models, improving robustness in noisy or ambiguous scenarios, and paved the way for hybrid systems that balanced interpretability with empirical performance. Key events during this period included the establishment of the International Natural Language Generation Conference (INLG), with workshops dating back to 1983 and formal conferences beginning around 2000, fostering collaboration on benchmarks and evaluation metrics.

From the 2010s onward, the advent of deep learning revolutionized NLG, enabling end-to-end models that bypassed traditional pipelines. The Transformer architecture, introduced by Vaswani et al. in 2017, used self-attention mechanisms to capture long-range dependencies in sequences, dramatically enhancing generation quality and efficiency for tasks like summarization and dialogue. Subsequent models like OpenAI's GPT series, starting with the original GPT in 2018, scaled unsupervised pretraining on massive corpora to produce diverse, context-aware text, while Google's T5 (Raffel et al., 2020) unified NLG tasks under a text-to-text framework, achieving state-of-the-art results through fine-tuning on diverse datasets. Notable advancements since then include OpenAI's GPT-3 (2020) and GPT-4 (2023), which demonstrated unprecedented scale in parameter size and performance, alongside models like Google's PaLM (2022) and Meta's Llama series, enhancing NLG's versatility and integration with multimodal tasks. The confluence of large datasets, neural networks, and increased computational power has since driven NLG toward more scalable, general-purpose systems, with ongoing INLG conferences highlighting the field's broadening impact and ethical considerations.

Methodologies

Classical Pipeline Approaches

Classical pipeline approaches in natural language generation (NLG) rely on a modular, sequential architecture that decomposes the generation process into distinct stages, transforming non-linguistic input data—such as databases or semantic representations—into coherent human-readable text. This pipeline typically consists of three primary phases: content planning, which determines the relevant information to include; sentence planning (or microplanning), which organizes that information into logical structures; and surface realization, which applies linguistic rules to produce grammatical output. Unlike end-to-end neural models, these pipelines offer high interpretability and fine-grained control, allowing developers to intervene at specific stages for debugging or error correction, though they require extensive manual engineering.

The key components form a structured framework where each stage builds on the previous one to ensure systematic text production. In content planning, rules or schemas select and organize messages from input data, often drawing on domain-specific knowledge bases to decide what facts to convey and in what order, such as prioritizing critical events in a report. Sentence planning then aggregates related messages, performs lexical choice to select appropriate words, and generates referring expressions to maintain coherence. Finally, surface realization linearizes this abstract structure into surface forms using syntactic and morphological rules, ensuring fluency and correctness. This decomposition, rooted in early NLG theory, enables targeted development but demands integration across modules to avoid inconsistencies.

Rule-based methods dominate these pipelines, employing templates for simple slot-filling, formal grammars for syntactic construction, and knowledge bases for semantic guidance. Templates provide predefined patterns with placeholders for data, offering efficiency in controlled domains but limited variability. More sophisticated approaches use unification-based grammars, which merge feature structures to resolve choices like lexical selection through argumentation over rhetorical relations. A seminal example is FUF (Functional Unification Formalism), an early system that implements unification grammars to control lexical choice and generate varied realizations from abstract inputs, emphasizing declarative rules over procedural coding. These methods leverage hand-crafted resources, such as systemic grammars or meaning-text theory, to encode linguistic knowledge explicitly.

Despite their strengths, classical pipelines exhibit notable limitations, including rigidity in handling novel inputs or ambiguities, as rules cannot easily generalize beyond encoded scenarios. Developing and maintaining these systems is labor-intensive, requiring expert knowledge engineering for grammars, lexicons, and domain rules, which scales poorly to new applications. They dominated NLG research and deployment until the early 2000s, when machine learning techniques began offering greater flexibility.

A representative example is the FoG (Forecast Generator) system, which produces textual forecasts from meteorological data using a domain-specific sublanguage. FoG employs rule-based content planning to select key weather events, sentence planning for aggregation and phrasing choices (e.g., choosing appropriate wording for uncertain predictions), and surface realization via templates and simple grammars to generate readable bulletins. Deployed for Canadian weather services, it demonstrated the practicality of pipelines in operational settings, producing forecasts in English and French while highlighting the need for corpus-informed rules to ensure naturalness. Modern alternatives, such as neural decoders, have since reduced the manual engineering effort such systems require, broadening their applicability.

Machine Learning-Based Techniques

Machine learning-based techniques in natural language generation (NLG) represent a shift from rule-based systems to data-driven approaches that learn patterns directly from annotated corpora, enabling more flexible and scalable text production. Early statistical methods laid the foundation by employing probabilistic models to capture linguistic regularities. N-gram-based generation, for instance, models the probability of word sequences using Markov assumptions, where the likelihood of a word depends on the preceding n-1 words, facilitating simple yet effective sentence completion in NLG tasks. These models were often combined with maximum entropy frameworks, which optimize feature-based probabilities without assuming feature independence, as demonstrated in trainable systems for surface realization that generate text from semantic representations using annotated corpora.

A pivotal advancement came with neural architectures, particularly sequence-to-sequence (seq2seq) models, which use encoder-decoder frameworks to map input sequences—such as structured data or meaning representations—to output text. Introduced using long short-term memory (LSTM) networks, these models encode the input into a fixed-dimensional vector and decode it autoregressively, achieving strong performance in tasks like machine translation that parallel NLG applications. To address limitations in handling long-range dependencies, attention mechanisms were integrated, allowing the decoder to focus dynamically on relevant input parts during generation; this culminated in the Transformer architecture, which relies entirely on self-attention layers to process sequences in parallel, revolutionizing NLG by improving coherence and efficiency in producing fluent text from diverse inputs.

End-to-end learning extends these neural approaches by directly mapping structured inputs, like database records or RDF triples, to natural language outputs without intermediate symbolic stages, trained via maximum likelihood estimation. The core objective in such frameworks is to minimize the negative log-likelihood loss

L = -\sum_{t=1}^{T} \log P(y_t \mid y_{<t}, x)

where x denotes the input data, y_{<t} the partial output up to timestep t, and T the output length, enabling models to learn holistic mappings from data to text. This has been applied effectively in data-to-text generation, producing descriptive text from tabular or graph-structured information. Key datasets supporting these methods include WebNLG, which provides RDF triple sets paired with verbalizations for training RDF-to-text systems across multiple languages, and the E2E dataset, comprising dialogue acts in the restaurant domain mapped to reference texts, designed to evaluate end-to-end NLG in spoken dialogue systems.

More recent developments leverage pre-trained large language models (LLMs) for NLG by fine-tuning them on task-specific data, enhancing generation quality through transfer learning from vast unlabeled corpora. Models like GPT, pre-trained generatively on next-token prediction, excel in open-ended text production and can incorporate controllability through structured prompts that guide output towards desired attributes, such as style or factual accuracy. Similarly, BERT's bidirectional pre-training on masked language modeling allows fine-tuning for conditional generation tasks, where encoder components process inputs to inform decoder outputs, though encoder-decoder adaptations extend this for full NLG. These techniques have demonstrated superior fluency and diversity in applications ranging from summarization to personalized content generation, often outperforming earlier neural baselines on automatic metrics such as BLEU in controlled evaluations.
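
The negative log-likelihood objective above can be sketched as a teacher-forced training step in PyTorch. The tiny model, dimensions, and random token ids below are placeholders, not a benchmark setup.

import torch
import torch.nn as nn

# Minimal teacher-forced seq2seq step: encode the (linearised) structured
# input, decode the target text, and minimise the negative log-likelihood
# of each gold token given the previous ones.

vocab, embed_dim, hidden = 1000, 64, 128
embedding = nn.Embedding(vocab, embed_dim)
encoder = nn.GRU(embed_dim, hidden, batch_first=True)
decoder = nn.GRU(embed_dim, hidden, batch_first=True)
project = nn.Linear(hidden, vocab)
loss_fn = nn.CrossEntropyLoss()   # cross-entropy is the per-token NLL here

x = torch.randint(0, vocab, (8, 10))   # linearised input records, batch of 8
y = torch.randint(0, vocab, (8, 15))   # target texts (token ids)

_, state = encoder(embedding(x))                    # summarise the input
dec_out, _ = decoder(embedding(y[:, :-1]), state)   # condition on y_<t
logits = project(dec_out)                           # (8, 14, vocab)
loss = loss_fn(logits.reshape(-1, vocab), y[:, 1:].reshape(-1))
loss.backward()                                     # gradients for all modules
print(float(loss))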

Hybrid and Emerging Methods

Hybrid systems in natural language generation integrate rule-based planning with neural realization to leverage the strengths of both paradigms, enabling structured content while producing fluent outputs. For instance, the Plan-and-Generate framework separates the process into a symbolic planning stage that ensures fidelity to input data and a neural generation stage for linguistic realization, improving control over output structure without sacrificing naturalness. This approach balances the interpretability and precision of classical methods with the flexibility of neural networks, particularly in data-to-text tasks where adherence to source information is crucial.

Controllable natural language generation techniques allow for targeted attribute control during text production, addressing limitations in unconstrained neural models. Plug-and-Play Language Models (PPLM) achieve this by steering pretrained language models using lightweight attribute classifiers that manipulate activation patterns without fine-tuning the base model, enabling attributes like sentiment or topic to be controlled dynamically. Reinforcement learning methods further enhance fidelity by optimizing generation for specific constraints, such as factual accuracy in summaries, through reward signals derived from external verifiers.

Multimodal natural language generation extends text production to incorporate non-textual inputs like images or videos, fostering richer interactions. Vision-language models such as CLIP facilitate this by aligning visual and textual representations, allowing generators to produce descriptive captions or narratives grounded in visual content through integrated encoding-decoding pipelines. Recent advancements as of 2025 include natively multimodal large language models like Llama 4 variants, which process text and images for more coherent cross-modal generation in applications such as visual question answering. These systems improve coherence between modalities, as seen in applications where image features guide narrative flow, reducing mismatches in generated descriptions.

Neuro-symbolic methods that merge neural networks with symbolic logic have become established approaches to enhance reasoning and interpretability in NLG. These integrate logical rules into neural architectures so that symbolic inference ensures consistency while neural components handle linguistic variability, as surveyed in recent frameworks from 2024. Ethical considerations, particularly bias mitigation, are integral to these developments; techniques such as counterfactual data augmentation and fairness interventions counteract gender or racial biases in generated text by balancing training distributions and evaluating outputs against fairness metrics.

Addressing scalability issues in large language models for NLG involves strategies to curb hallucinations, where models produce unverifiable content. Retrieval-augmented generation (RAG) mitigates this by conditioning outputs on retrieved external knowledge, improving factual accuracy in knowledge-intensive tasks by up to 20-30% on benchmarks like open-domain question answering without expanding model parameters. Recent hybrid RAG systems as of 2025 further refine this by combining dense and sparse retrieval for enhanced performance in dynamic environments. This method supports efficient scaling by offloading memory to non-parametric stores, enabling reliable generation in resource-constrained environments.
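
A minimal retrieval-augmented generation sketch follows: retrieve the most relevant passages with TF-IDF and condition a generator on them. The generate function is a placeholder standing in for any language model call, and the toy knowledge base is an assumption for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy knowledge base; in a real system this would be a document index.
docs = [
    "FoG generated bilingual weather forecasts for Environment Canada in the 1990s.",
    "BLEU measures n-gram overlap between generated and reference texts.",
    "Retrieval-augmented generation conditions output on retrieved evidence.",
]

def retrieve(query, k=2):
    vec = TfidfVectorizer().fit(docs + [query])
    doc_m, query_m = vec.transform(docs), vec.transform([query])
    scores = cosine_similarity(query_m, doc_m)[0]
    return [docs[i] for i in scores.argsort()[::-1][:k]]

def generate(prompt):
    # Placeholder for a neural generator (e.g. an LLM API call).
    return f"[generated answer conditioned on a prompt of {len(prompt)} characters]"

question = "How did early systems produce weather forecasts?"
context = "\n".join(retrieve(question))
print(generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"))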

Core Processes

Content Determination

Content determination is the initial phase in the natural language generation (NLG) pipeline, where the system transforms raw input data—such as database records or sensor outputs—into a set of communicative goals by selecting, aggregating, and prioritizing relevant information for expression in text. This process ensures that the generated output focuses on key facts while avoiding redundancy, aligning the content with the intended purpose, such as informing or persuading the audience. Aggregation involves grouping related data points to improve conciseness and reduce repetition.

Techniques for content determination include schema-based selection, which uses predefined templates to identify pertinent data elements based on domain-specific criteria, and Rhetorical Structure Theory (RST), which organizes selected content into a hierarchical discourse structure to guide overall text coherence. RST, introduced by Mann and Thompson, defines relations between text spans (e.g., elaboration or contrast) to prioritize information that supports the primary communicative intent. Content planning algorithms often employ rule-based systems to evaluate input against goals, such as including only statistically significant trends in a report.

A representative example occurs in automated report generation from sensor data, where the system selects key statistics—like average temperature and peak wind speed from hourly readings—while omitting redundant entries, applying aggregation rules to summarize numerical data into concise descriptors such as "The weather was mild, with gusty winds." This selection ensures the text remains focused and readable without overwhelming the reader with raw details.

Unique challenges in content determination arise when handling incomplete or conflicting sources, such as missing values in a database or contradictory records from multiple sensors, which can lead to biased or inaccurate selections if not resolved through imputation or heuristics. Systems must incorporate validation steps to detect and mitigate these issues, ensuring robust content choices. This phase applies to various input types, ranging from structured data like relational tables, where selection involves querying specific rows and columns, to semi-structured formats such as knowledge graphs, where traversal algorithms identify relevant nodes and edges for inclusion. The determined content then informs subsequent microplanning, where rhetorical relations and ordering are refined for textual expression.
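
A toy content-determination step over hourly sensor readings is sketched below, selecting and aggregating only the facts worth reporting. The thresholds, variables, and wording are assumptions for illustration.

# Toy content determination: from raw hourly readings, keep only the messages
# worth expressing (mean temperature, peak wind), dropping redundant detail.

temps = [14.8, 15.1, 15.6, 16.0, 15.4]         # hourly temperature, degrees C
winds = [18, 22, 31, 35, 27]                   # hourly wind speed, km/h

messages = []
mean_temp = sum(temps) / len(temps)
messages.append(("temperature", "mild" if 10 <= mean_temp <= 20 else "notable",
                 round(mean_temp, 1)))
peak_wind = max(winds)
if peak_wind >= 30:                            # only mention wind if significant
    messages.append(("wind", "gusty", peak_wind))

# Later stages would turn these messages into e.g.
# "The weather was mild, with gusty winds."
print(messages)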

Microplanning

Microplanning is the intermediate stage in natural language generation (NLG) pipelines where selected content from the document planning phase is organized into coherent textual units, focusing on decisions that ensure logical flow and linguistic appropriateness before surface realization. This process transforms abstract representations, such as propositions or facts, into structured specifications for realization, addressing how information is packaged to achieve communicative goals. According to the classic framework outlined by Reiter and Dale, microplanning bridges high-level content selection and low-level syntactic formation by handling choices that impact readability and coherence.

Core tasks in microplanning include structuring sentences through discourse relations, lexical choice, and aggregation of clauses. Discourse relations, such as elaboration or contrast, are often modeled using Rhetorical Structure Theory (RST), which organizes text into hierarchical trees where relations link spans to convey intentions like explanation or justification. For instance, in explanatory texts, a cause-effect relation might connect a precipitating event to its outcome, ensuring the generated paragraph flows logically from bullet-point facts like "The patient experienced low oxygen levels" to "This led to respiratory distress." Lexical choice involves selecting words or phrases that best convey meaning while considering context, such as choosing "decline" over "drop" for medical reports to match register, guided by resources like VerbNet for semantic compatibility. Aggregation merges related clauses to avoid repetition, for example, combining multiple similar events into a single sentence like "The baby had three successive bradycardias down to 69 bpm" instead of separate statements, using rule-based heuristics or statistical methods.

Referring expressions are generated during microplanning to maintain coherence, resolving anaphora through theories like centering theory, which tracks salience across utterances to decide between pronouns and full descriptions. For example, in a sequence describing events, a highly salient entity (e.g., "the patient") might be referred to as "he" in subsequent sentences if it remains the focus, following principles of local coherence. An incremental algorithm prioritizes attributes in descriptions, such as type before color, to generate concise yet informative references like "the red car" only when necessary (a sketch of this algorithm follows below). Formalisms such as discourse schemas support these decisions by providing patterns for rhetorical relations in RST-based generation, ensuring relations are realized appropriately in text spans.

Linguistic aspects such as tense, aspect, and modality are selected based on contextual cues during microplanning to align with the intended temporal or evidential stance. For instance, past tense and perfective aspect might be chosen for completed events in reports, as in "The treatment had been administered," while modality like "may" introduces uncertainty for hypothetical outcomes. These choices are encoded in semantic representations passed to realization, drawing from input specifications like event types and arguments.
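
The incremental algorithm for referring expression generation can be sketched as follows: attributes are considered in a fixed order of preference and added only while they rule out remaining distractors. The entities and attribute order below are illustrative assumptions.

# Sketch of the incremental algorithm for referring expression generation:
# add attributes in a preferred order (type before colour before size) only
# if they exclude at least one remaining distractor.

entities = {
    "car1": {"type": "car", "colour": "red", "size": "small"},
    "car2": {"type": "car", "colour": "blue", "size": "small"},
    "van1": {"type": "van", "colour": "red", "size": "large"},
}

def refer(target, preference=("type", "colour", "size")):
    distractors = {e for e in entities if e != target}
    description = {}
    for attr in preference:
        value = entities[target][attr]
        excluded = {e for e in distractors if entities[e][attr] != value}
        if excluded:                       # attribute has discriminatory power
            description[attr] = value
            distractors -= excluded
        if not distractors:
            break
    return description

print(refer("car1"))   # {'type': 'car', 'colour': 'red'} -> "the red car"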

Realization and Generation

Realization and generation, often termed surface realization, constitute the concluding phase of the natural language generation (NLG) pipeline, transforming abstract representations from microplanning—such as conceptual structures or deep syntactic forms—into coherent, grammatical text. This process ensures that the output adheres to linguistic rules, producing sentences that are syntactically correct and morphologically appropriate for the target language.

Syntactic realization maps logical forms to surface structures by selecting and ordering words within grammatical frameworks. Early systems leveraged Generalized Phrase Structure Grammar (GPSG), a context-free formalism that supports efficient generation through feature percolation and unification, enabling the production of varied syntactic variants from a single input representation. Tree-adjoining grammars (TAG) offer an alternative, using elementary trees as building blocks that can be combined via substitution and adjunction to handle dependencies like relative clauses, providing precise control over sentence complexity in NLG. Comprehensive implementations integrate grammatical principles with unification to realize deep-syntactic inputs into full English sentences, demonstrating reusability across diverse NLG applications.

Morphological generation addresses word-level adjustments, inflecting lemmas according to syntactic features like tense, number, and case to form complete lexical items. For example, it conjugates verbs (e.g., "rain" to "rained" for the past tense) and pluralizes nouns based on orthographic rules and exceptions. Robust finite-state implementations achieve high accuracy by prioritizing rules for irregularities, such as deriving "stimuli" from "stimulus+s_N" while handling over 1,100 exceptional lemmata for consonant doubling and other patterns. A simple illustration of the process transforms an abstract input like "event: rain, location: city, time: yesterday" into the sentence "It rained in the city yesterday," where syntactic frames embed the event, adjuncts specify location and time, and morphological rules apply past-tense inflection.

Key algorithms for syntactic realization include chart-based methods, which use bottom-up dynamic programming to parse and assemble structures from lexical entries, as adapted for combinatory categorial grammar (CCG) to cover logical forms with bit-vector tracking for efficiency. Optimization techniques employ integer linear programming to jointly select lexical choices and structures, minimizing sentence length or maximizing compactness while enforcing grammatical constraints, often integrating with content selection for improved output density. Output polishing refines the generated text by applying orthographic rules for capitalization, punctuation, and spacing, alongside basic checks for spelling, ensuring the final product reads naturally without altering core semantics.
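
A small sketch of rule-plus-exception morphological realization of the kind described above follows; the exception lists are tiny illustrative samples, not a real lexicon.

# Toy morphological realization: apply regular inflection rules, checking a
# small exception lexicon first (real systems store thousands of exceptions).

PAST_EXCEPTIONS = {"go": "went", "be": "was", "have": "had"}
PLURAL_EXCEPTIONS = {"stimulus": "stimuli", "child": "children"}

def past_tense(verb):
    if verb in PAST_EXCEPTIONS:
        return PAST_EXCEPTIONS[verb]
    if verb.endswith("e"):
        return verb + "d"
    return verb + "ed"

def plural(noun):
    if noun in PLURAL_EXCEPTIONS:
        return PLURAL_EXCEPTIONS[noun]
    if noun.endswith(("s", "x", "ch", "sh")):
        return noun + "es"
    return noun + "s"

# Realizing the abstract input {event: rain, location: city, time: yesterday}:
print(f"It {past_tense('rain')} in the city yesterday.")   # "It rained in the city yesterday."
print(plural("stimulus"), plural("forecast"))              # "stimuli forecasts"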

Applications

Data-to-Text Systems

Data-to-text systems in natural language generation (NLG) focus on transforming structured data, such as tables or databases, into coherent, human-readable text. These systems are particularly valuable in domains requiring regular reporting from quantitative inputs, where manual writing is time-intensive or error-prone. Early examples include the SUMTIME system, which generates textual forecasts from numerical meteorological data for offshore marine reports, demonstrating how rule-based pipelines can produce reliable summaries tailored to specific user needs like safety-critical marine operations. Similarly, in finance, data-to-text approaches automate summaries of market data, converting tabular records of prices, volumes, and trends into overviews that highlight key movements and implications for investors.

Adapting classical NLG pipelines for data-to-text involves customizing stages like content determination and microplanning to handle tabular or relational inputs. For instance, meaning representation languages such as Abstract Meaning Representation (AMR) facilitate the mapping of structured data to semantic graphs, enabling systematic selection and aggregation of relevant facts while preserving logical relationships. This adaptation ensures that generated text adheres to domain-specific conventions, such as emphasizing temporal sequences in weather data or causal inferences in financial trends.

Case studies illustrate practical impacts, including systems for generating product descriptions from attribute-value pairs, which support scalable content creation for e-commerce catalogs. Such systems enhance accessibility, particularly for visually impaired users, by converting geo-referenced or tabular data into auditory-readable narratives via screen readers, as explored in projects linking map data to descriptive text. Modern enhancements leverage neural architectures for more fluent and context-aware generation. For example, end-to-end models like DataTuner employ sequence-to-sequence approaches to process structured inputs, improving coherence and factual alignment in outputs compared to traditional methods. The domain has also seen growth in sports commentary, where systems like those trained on the SportSett:Basketball dataset produce NBA game recaps from play-by-play statistics, capturing highlights and narratives with fidelity to event data.

Conversational and Interactive Uses

Natural language generation (NLG) plays a pivotal role in conversational and interactive systems, enabling the production of human-like responses in real-time dialogue. These systems integrate NLG with natural language understanding (NLU) to form end-to-end pipelines that interpret user intents and generate coherent outputs, evolving from modular architectures to unified neural models that reduce error propagation. Early examples include rule-based chatbots like A.L.I.C.E., developed by Richard Wallace in 1995, which used pattern-matching via the Artificial Intelligence Markup Language (AIML) to generate responses without deep contextual reasoning. This marked a foundational shift toward interactive NLG, though limited to scripted interactions.

Advancements in neural architectures have transformed conversational NLG, with systems like BlenderBot, released by Facebook AI in 2020, employing large-scale transformer-based models to produce open-domain responses that maintain fluency and relevance across turns. Techniques such as response generation from dialogue acts—abstract representations of communicative intentions—allow NLG modules to convert structured plans into natural utterances, often integrated with dialogue management frameworks like Partially Observable Markov Decision Processes (POMDPs) for tracking hidden user states and context. POMDPs enable probabilistic belief updates over dialogue history, facilitating adaptive generation in uncertain environments.

In practical applications, NLG powers task-oriented personal assistants like Apple's Siri and Amazon's Alexa, which generate responses to fulfill user goals such as scheduling or answering queries by verbalizing dialogue states and actions. Customer service chatbots in banking domains similarly leverage NLG to produce personalized, context-aware replies, drawing on non-linguistic data like transaction histories to enhance response relevance in multi-turn interactions.

Key challenges in conversational NLG include maintaining coherent context across extended dialogues, where models must resolve coreferences and track evolving states to avoid repetition or drift. Handling ambiguity in user inputs—such as vague intents or polysemous queries—further complicates generation, often requiring clarification strategies to elicit precise information without disrupting flow. Recent advancements distinguish between retrieval-based approaches, which select pre-defined responses from a corpus for consistency and speed, and generative methods, which synthesize novel outputs for flexibility but risk hallucinations. Datasets like MultiWOZ, introduced by Budzianowski et al. in 2018, have driven progress by providing multi-domain, annotated dialogues for training end-to-end systems that simulate real-world task-oriented interactions.
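
A toy example of the dialogue-act-to-utterance step described above follows: the dialogue manager decides what to say (an act plus slots), and NLG decides how to say it. The act schema and templates are assumptions for illustration.

# Toy dialogue-act realization for a task-oriented assistant.

TEMPLATES = {
    "inform":  "I found {name}, a {food} restaurant in the {area}.",
    "request": "What {slot} are you looking for?",
    "confirm": "Just to confirm, you want a {food} restaurant, is that right?",
}

def realize_act(act, **slots):
    # Fill the template associated with the chosen dialogue act.
    return TEMPLATES[act].format(**slots)

print(realize_act("inform", name="Luigi's", food="Italian", area="city centre"))
print(realize_act("request", slot="price range"))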

Multimedia and Creative Generation

Natural language generation (NLG) in multimedia contexts involves producing textual descriptions from non-textual inputs such as images and videos, enabling applications like automated captioning for accessibility and content indexing. A seminal approach is the "Show and Tell" model, which combines a convolutional neural network (CNN) to encode visual features with a recurrent neural network (RNN) to decode them into coherent sentences, achieving state-of-the-art performance on benchmarks at the time. This encoder-decoder architecture has influenced subsequent multimodal NLG systems by demonstrating how visual embeddings can guide sequence generation. Datasets like MS COCO, released in 2014, have been pivotal for training such models, providing over 120,000 images paired with multiple human-annotated captions to support evaluation of descriptive accuracy and diversity.

In creative NLG tasks, systems generate artistic text outputs, such as stories or poetry, often using character-level RNNs to capture stylistic nuances. The Char-RNN framework exemplifies this by training on literary corpora like Shakespeare's works to produce novel verses, highlighting the potential of neural language models to mimic poetic structures through training on character sequences. Computational humor generation, particularly punchline prediction, employs probabilistic models to extend setups with unexpected resolutions, as seen in frameworks integrating surprise metrics for punchline creation. These methods underscore NLG's role in fostering originality, though outputs often require human refinement to align with cultural nuances.

Representative examples include meme generation, where multimodal models pair image templates with contextually humorous captions generated via transformer-based language models, as in systems trained on meme corpora to automate viral content creation. Interactive fiction leverages NLG for dynamic storytelling, with AI-driven engines generating branching narratives in response to user inputs, exemplified by platforms that use large language models to evolve plotlines in real-time. A key challenge in these creative applications is ensuring novelty, as generative models tend to produce individually innovative text but reduce collective diversity by converging on similar patterns, potentially limiting broader artistic impact.

Multimodal NLG extends to integrating audio and speech inputs, particularly in accessibility tools that transcribe spoken content into readable text for the hearing impaired. Systems combining automatic speech recognition and NLG generate captions from live audio streams, improving accessibility in video conferencing and educational videos. Emerging trends highlight AI-human collaborations in artistic domains, such as using GPT-3 for scriptwriting, where the model generates dialogue and plot outlines from prompts, facilitating co-creative processes in film and theater production, as demonstrated by its few-shot learning capabilities. These integrations briefly extend to conversational elements in video games, enhancing immersive narratives with generated responses.

Assessment and Challenges

Evaluation Metrics

Evaluating the quality of natural language generation (NLG) systems requires a combination of intrinsic and extrinsic metrics to assess aspects such as fluency, adequacy, and coherence. Intrinsic metrics focus on the generated text in isolation, often comparing it to reference texts, while extrinsic metrics evaluate the text's effectiveness in achieving a specific task or goal, typically through user interaction or downstream performance. These approaches address the challenges of NLG evaluation, where traditional metrics from machine translation have been adapted but often fall short in capturing semantic nuances and contextual appropriateness.

Intrinsic metrics, such as BLEU, measure surface-level similarities between generated and reference texts using n-gram overlap. Introduced for machine translation evaluation, BLEU computes a score based on precision of n-grams, modified by a brevity penalty to avoid favoring short outputs. The score is given by

\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)

where BP is the brevity penalty, p_n is the modified n-gram precision, w_n are weights (typically uniform), and N is the maximum n-gram order (often 4); a code sketch of this computation appears below. Despite its widespread use, BLEU has limitations in capturing semantics, as it penalizes valid paraphrases and struggles with diverse expressions common in NLG. Another intrinsic metric, ROUGE, is particularly suited for summarization tasks within NLG and emphasizes recall over precision by measuring overlap of n-grams, longest common subsequences, or skip-bigrams between generated and reference summaries. Variants like ROUGE-N (n-gram recall) and ROUGE-L (sequence-based) provide flexible assessments, though, like BLEU, they overlook deeper meaning and coherence. These metrics enable quick, automated comparisons but correlate poorly with human perceptions of quality in open-ended generation scenarios.

Extrinsic metrics assess NLG output through its practical impact, such as task success rates in applications like report generation, where user comprehension or accuracy is measured. For instance, in data-to-text systems, success might be quantified by how well generated reports inform user actions compared to human-written ones. Human judgments often complement these, using Likert scales to rate dimensions like fluency (grammaticality and naturalness) or adequacy (fidelity to input data), providing nuanced insights but requiring careful guidelines to ensure reliability.

Advanced measures like BERTScore address the semantic shortcomings of n-gram-based metrics by leveraging contextual embeddings from pre-trained models such as BERT to compute token-level similarities via cosine distance. This yields precision, recall, and F1 scores that better align with human evaluations, especially for paraphrases and diverse phrasings in NLG tasks, though it remains computationally intensive. Benchmarks and shared tasks facilitate standardized evaluation, such as the Second Multilingual Surface Realisation Shared Task (SR'19), which assessed NLG systems across languages using automatic metrics alongside human assessments to promote multilingual robustness. These initiatives highlight the need for diverse datasets and metrics tailored to non-English generation.

Balancing human and automatic evaluation involves trade-offs: automatic metrics offer scalability and consistency, while human judgments capture subjective qualities but introduce variability and cost. Crowdsourcing platforms like Amazon Mechanical Turk enable large-scale human evaluations by distributing annotation tasks to remote workers, often with quality controls such as qualification tests, though results must be validated against expert judgments to mitigate biases.
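
The BLEU formula above can be written out directly. The sketch below computes modified n-gram precision and the brevity penalty for a single hypothesis-reference pair; smoothing and multi-reference handling are omitted, and the example sentences are made up.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    # Modified n-gram precisions p_n, clipped by reference counts.
    precisions = []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in hyp.items())
        total = max(sum(hyp.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)   # avoid log(0); no smoothing
    # Brevity penalty BP.
    bp = 1.0 if len(hypothesis) > len(reference) \
         else math.exp(1 - len(reference) / len(hypothesis))
    # BLEU = BP * exp(sum_n w_n log p_n) with uniform weights w_n = 1/max_n.
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

hyp = "pollen levels are high across scotland".split()
ref = "pollen levels will be high across most of scotland".split()
print(round(bleu(hyp, ref), 3))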

Key Limitations and Future Directions

Neural natural language generation (NLG) models frequently produce hallucinations, generating fluent but factually incorrect or unsubstantiated content due to inconsistencies in training data or inadequate decoding strategies. This issue is particularly pronounced in abstractive tasks like summarization, where models invent details not present in the input, undermining reliability in applications such as journalism or healthcare. Additionally, bias amplification occurs when models perpetuate and exacerbate societal stereotypes from training data, such as gender biases in occupational descriptions, as demonstrated in analyses of word embeddings that influence generated text.

Ethical concerns in NLG arise from the potential for misinformation propagation through hallucinated outputs, which can mislead users in high-stakes domains like legal or news reporting, and from gaps in controllability, where large language models (LLMs) struggle to adhere to user-specified constraints without veering into harmful content. Scalability challenges further compound these issues, as the high computational costs of training and deploying large models limit accessibility, restrict output length, and hinder real-time adaptation to new domains or data. Domain adaptation remains difficult, often requiring extensive retraining that exacerbates resource demands for non-English or specialized contexts. These limitations also highlight inadequacies in current evaluation metrics, which struggle to detect subtle hallucinations or biases comprehensively.

Future directions in NLG emphasize developing interpretable systems through explainable AI techniques, such as leveraging LLMs to generate human-readable rationales for outputs, to enhance trust and debugging in complex models. Integration with robotics for embodied communication represents another promising avenue, enabling robots to produce context-aware responses grounded in physical interactions and perception. Personalized generation, which tailors outputs to individual user profiles via multimodal contexts, is gaining traction to improve relevance in recommendation and feedback systems. Research gaps persist in low-resource languages, where scarce datasets impede effective NLG development, and in real-time ethical filtering mechanisms to dynamically mitigate biases or harmful content during generation. Addressing these could involve hybrid approaches combining external knowledge bases with efficient, lightweight models to broaden NLG's applicability.
