Automatic summarization
from Wikipedia

Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence (AI) algorithms are commonly developed and employed to achieve this, specialized for different types of data.

Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.[1] On the other hand, visual content can be summarized using computer vision algorithms. Image summarization is the subject of ongoing research; existing approaches typically attempt to display the most representative images from a given image collection, or generate a video that only includes the most important content from the entire collection.[2][3][4] Video summarization algorithms identify and extract from the original video content the most important frames (key-frames), and/or the most important video segments (key-shots), normally in a temporally ordered fashion.[5][6][7][8] Video summaries simply retain a carefully selected subset of the original video frames and, therefore, are not identical to the output of video synopsis algorithms, where new video frames are being synthesized based on the original video content.

Commercial products

In 2022 Google Docs released an automatic summarization feature.[9]

Approaches

There are two general approaches to automatic summarization: extraction and abstraction.

Extraction-based summarization

Here, content is extracted from the original data, but the extracted content is not modified in any way. Examples of extracted content include key-phrases that can be used to "tag" or index a text document, key sentences (including headings) that collectively comprise an abstract, and representative images or video segments, as stated above. For text, extraction is analogous to the process of skimming, where the summary (if available), headings and subheadings, figures, the first and last paragraphs of a section, and optionally the first and last sentences in a paragraph are read before one chooses to read the entire document in detail.[10] Other examples of extraction include key sequences of text selected for clinical relevance (covering patient/problem, intervention, and outcome).[11]

Abstraction-based summarization

Abstractive summarization methods generate new text that did not exist in the original text.[12] This has been applied mainly for text. Abstractive methods build an internal semantic representation of the original content (often called a language model), and then use this representation to create a summary that is closer to what a human might express. Abstraction may transform the extracted content by paraphrasing sections of the source document, to condense a text more strongly than extraction. Such transformation, however, is computationally much more challenging than extraction, involving both natural language processing and often a deep understanding of the domain of the original text in cases where the original document relates to a special field of knowledge. "Paraphrasing" is even more difficult to apply to images and videos, which is why most summarization systems are extractive.

Aided summarization

Approaches aimed at higher summarization quality rely on combined software and human effort. In Machine Aided Human Summarization, extractive techniques highlight candidate passages for inclusion (to which the human adds or removes text). In Human Aided Machine Summarization, a human post-processes software output, in the same way that one edits the output of automatic translation by Google Translate.

Applications and systems for summarization

There are broadly two types of extractive summarization tasks depending on what the summarization program focuses on. The first is generic summarization, which focuses on obtaining a generic summary or abstract of the collection (whether documents, or sets of images, or videos, news stories etc.). The second is query relevant summarization, sometimes called query-based summarization, which summarizes objects specific to a query. Summarization systems are able to create both query relevant text summaries and generic machine-generated summaries depending on what the user needs.

An example of a summarization problem is document summarization, which attempts to automatically produce an abstract from a given document. Sometimes one is interested in generating a summary from a single source document, while in other cases the summary draws on multiple source documents (for example, a cluster of articles on the same topic); the latter problem is called multi-document summarization. A related application is summarizing news articles: imagine a system that automatically pulls together news articles on a given topic from the web and concisely represents the latest news as a summary.

Image collection summarization is another application example of automatic summarization. It consists in selecting a representative set of images from a larger set of images.[13] A summary in this context is useful to show the most representative images of results in an image collection exploration system. Video summarization is a related domain, where the system automatically creates a trailer of a long video. This also has applications in consumer or personal videos, where one might want to skip the boring or repetitive actions. Similarly, in surveillance videos, one would want to extract important and suspicious activity, while ignoring all the boring and redundant frames captured.

At a very high level, summarization algorithms try to find a subset of objects (such as a set of sentences or a set of images) that covers the information of the entire set. This is also called the core-set. These algorithms model notions like diversity, coverage, information and representativeness of the summary. Query-based summarization techniques additionally model the relevance of the summary to the query. Techniques and algorithms that naturally model summarization problems include TextRank and PageRank, submodular set functions, determinantal point processes, and maximal marginal relevance (MMR).

Keyphrase extraction

The task is the following. Given a piece of text, such as a journal article, produce a list of keywords or keyphrases that capture the primary topics discussed in the text.[14] In the case of research articles, many authors provide manually assigned keywords, but most text lacks pre-existing keyphrases. For example, news articles rarely have keyphrases attached, but it would be useful to be able to assign them automatically for a number of applications discussed below. Consider the example text from a news article:

"The Army Corps of Engineers, rushing to meet President Bush's promise to protect New Orleans by the start of the 2006 hurricane season, installed defective flood-control pumps last year despite warnings from its own expert that the equipment would fail during a storm, according to documents obtained by The Associated Press".

A keyphrase extractor might select "Army Corps of Engineers", "President Bush", "New Orleans", and "defective flood-control pumps" as keyphrases. These are pulled directly from the text. In contrast, an abstractive keyphrase system would somehow internalize the content and generate keyphrases that do not appear in the text, but more closely resemble what a human might produce, such as "political negligence" or "inadequate protection from floods". Abstraction requires a deep understanding of the text, which makes it difficult for a computer system. Keyphrases have many applications. They can enable document browsing by providing a short summary, improve information retrieval (if documents have keyphrases assigned, a user could search by keyphrase to produce more reliable hits than a full-text search), and be employed in generating index entries for a large text corpus.

Depending on the literature and on how key terms, words, or phrases are defined, keyword extraction is a closely related task.

Supervised learning approaches

Beginning with the work of Turney,[15] many researchers have approached keyphrase extraction as a supervised machine learning problem. Given a document, we construct an example for each unigram, bigram, and trigram found in the text (though other text units are also possible, as discussed below). We then compute various features describing each example (e.g., does the phrase begin with an upper-case letter?). We assume there are known keyphrases available for a set of training documents. Using the known keyphrases, we can assign positive or negative labels to the examples. Then we learn a classifier that can discriminate between positive and negative examples as a function of the features. Some classifiers make a binary classification for a test example, while others assign a probability of being a keyphrase. For instance, in the above text, we might learn a rule that says phrases with initial capital letters are likely to be keyphrases.

After training a learner, we can select keyphrases for test documents in the following manner. We apply the same example-generation strategy to the test documents, then run each example through the learner. We can determine the keyphrases by looking at binary classification decisions or probabilities returned from our learned model. If probabilities are given, a threshold is used to select the keyphrases.

Keyphrase extractors are generally evaluated using precision and recall. Precision measures how many of the proposed keyphrases are actually correct. Recall measures how many of the true keyphrases the system proposed. The two measures can be combined in an F-score, which is the harmonic mean of the two (F = 2PR/(P + R)). Matches between the proposed keyphrases and the known keyphrases can be checked after stemming or applying some other text normalization.
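
As an illustration of the precision/recall/F-score evaluation just described, the following minimal Python sketch compares a set of proposed keyphrases against a gold-standard set. The example phrases are hypothetical, and a real evaluation would typically stem or otherwise normalize phrases before matching, as noted above.

```python
# A minimal sketch of keyphrase evaluation: precision, recall, and their
# harmonic mean F = 2PR/(P + R). Phrases are compared as exact strings here;
# real evaluations usually normalize (e.g., stem) before matching.

def evaluate_keyphrases(proposed, gold):
    proposed, gold = set(proposed), set(gold)
    true_positives = len(proposed & gold)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# Hypothetical system output and gold keyphrases for the news example above.
proposed = {"army corps of engineers", "new orleans", "hurricane season"}
gold = {"army corps of engineers", "new orleans", "flood-control pumps"}
print(evaluate_keyphrases(proposed, gold))  # roughly (0.67, 0.67, 0.67)
```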

Designing a supervised keyphrase extraction system involves several choices (some of which also apply to unsupervised systems). The first choice is exactly how to generate examples. Turney and others have used all possible unigrams, bigrams, and trigrams without intervening punctuation and after removing stopwords. Hulth showed that some improvement can be gained by selecting examples that are sequences of tokens matching certain patterns of part-of-speech tags. Ideally, the mechanism for generating examples produces all the known labeled keyphrases as candidates, though this is often not the case. For example, if we use only unigrams, bigrams, and trigrams, then we will never be able to extract a known keyphrase containing four words, and recall may suffer. However, generating too many examples can also lead to low precision.

We also need to create features that describe the examples and are informative enough to allow a learning algorithm to discriminate keyphrases from non-keyphrases. Typical features include various term frequencies (how many times a phrase appears in the current text or in a larger corpus), the length of the example, the relative position of its first occurrence, various Boolean syntactic features (e.g., contains all caps), and so on. The Turney paper used about 12 such features. Hulth used a reduced set of features, which were found most successful in the KEA (Keyphrase Extraction Algorithm) work derived from Turney's seminal paper.
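
The sketch below illustrates, under simplifying assumptions, the example-generation and feature-computation steps described above: unigram, bigram, and trigram candidates that do not cross punctuation, with term frequency, phrase length, and relative position of first occurrence as features. The tiny stopword list and tokenizer are illustrative stand-ins, not the exact preprocessing used by Turney or Hulth.

```python
# A minimal sketch (not Turney's or Hulth's exact system) of generating
# n-gram candidates without intervening punctuation and computing a few of
# the features discussed above: term frequency, phrase length, and relative
# position of first occurrence.

import re

STOPWORDS = {"the", "of", "a", "an", "to", "in", "and", "by", "its", "was", "would"}

def candidates_and_features(text, max_n=3):
    # Split into punctuation-free chunks so n-grams never cross punctuation.
    chunks = [re.findall(r"[A-Za-z'-]+", c) for c in re.split(r"[.,;:!?()\"]", text)]
    tokens = [w.lower() for chunk in chunks for w in chunk]
    total = max(len(tokens), 1)
    examples = {}
    offset = 0
    for chunk in chunks:
        words = [w.lower() for w in chunk]
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue  # candidates may not start or end with a stopword
                phrase = " ".join(gram)
                feats = examples.setdefault(phrase, {
                    "tf": 0,
                    "length": n,
                    "first_pos": (offset + i) / total,  # relative first occurrence
                })
                feats["tf"] += 1
        offset += len(words)
    return examples

feats = candidates_and_features(
    "The Army Corps of Engineers installed defective flood-control pumps. "
    "The Corps was warned the pumps would fail.")
print(feats["flood-control pumps"])
```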

In the end, the system needs to return a list of keyphrases for a test document, so we need a way to limit the number. Ensemble methods (i.e., using votes from several classifiers) have been used to produce numeric scores that can be thresholded to yield a user-specified number of keyphrases. This is the technique used by Turney with C4.5 decision trees. Hulth used a single binary classifier, so the learning algorithm implicitly determines the appropriate number.

Once examples and features are created, we need a way to learn to predict keyphrases. Virtually any supervised learning algorithm could be used, such as decision trees, Naive Bayes, and rule induction. In the case of Turney's GenEx algorithm, a genetic algorithm is used to learn parameters for a domain-specific keyphrase extraction algorithm. The extractor follows a series of heuristics to identify keyphrases. The genetic algorithm optimizes parameters for these heuristics with respect to performance on training documents with known key phrases.

Unsupervised approach: TextRank

Another keyphrase extraction algorithm is TextRank. While supervised methods have some nice properties, like being able to produce interpretable rules for what features characterize a keyphrase, they also require a large amount of training data. Many documents with known keyphrases are needed. Furthermore, training on a specific domain tends to customize the extraction process to that domain, so the resulting classifier is not necessarily portable, as some of Turney's results demonstrate. Unsupervised keyphrase extraction removes the need for training data. It approaches the problem from a different angle. Instead of trying to learn explicit features that characterize keyphrases, the TextRank algorithm[16] exploits the structure of the text itself to determine keyphrases that appear "central" to the text in the same way that PageRank selects important Web pages. Recall this is based on the notion of "prestige" or "recommendation" from social networks. In this way, TextRank does not rely on any previous training data at all, but rather can be run on any arbitrary piece of text, and it can produce output simply based on the text's intrinsic properties. Thus the algorithm is easily portable to new domains and languages.

TextRank is a general purpose graph-based ranking algorithm for NLP. Essentially, it runs PageRank on a graph specially designed for a particular NLP task. For keyphrase extraction, it builds a graph using some set of text units as vertices. Edges are based on some measure of semantic or lexical similarity between the text unit vertices. Unlike PageRank, the edges are typically undirected and can be weighted to reflect a degree of similarity. Once the graph is constructed, it is used to form a stochastic matrix, combined with a damping factor (as in the "random surfer model"), and the ranking over vertices is obtained by finding the eigenvector corresponding to eigenvalue 1 (i.e., the stationary distribution of the random walk on the graph).

The vertices should correspond to what we want to rank. Potentially, we could do something similar to the supervised methods and create a vertex for each unigram, bigram, trigram, etc. However, to keep the graph small, the authors decide to rank individual unigrams in a first step, and then include a second step that merges highly ranked adjacent unigrams to form multi-word phrases. This has a nice side effect of allowing us to produce keyphrases of arbitrary length. For example, if we rank unigrams and find that "advanced", "natural", "language", and "processing" all get high ranks, then we would look at the original text and see that these words appear consecutively and create a final keyphrase using all four together. Note that the unigrams placed in the graph can be filtered by part of speech. The authors found that adjectives and nouns were the best to include. Thus, some linguistic knowledge comes into play in this step.

Edges are created based on word co-occurrence in this application of TextRank. Two vertices are connected by an edge if the unigrams appear within a window of size N in the original text. N is typically around 2–10. Thus, "natural" and "language" might be linked in a text about NLP. "Natural" and "processing" would also be linked because they would both appear in the same string of N words. These edges build on the notion of "text cohesion" and the idea that words that appear near each other are likely related in a meaningful way and "recommend" each other to the reader.

Since this method simply ranks the individual vertices, we need a way to threshold or produce a limited number of keyphrases. The technique chosen is to set a count T to a user-specified fraction of the total number of vertices in the graph. Then the top T vertices/unigrams are selected based on their stationary probabilities. A post-processing step is then applied to merge adjacent instances of these T unigrams. As a result, potentially more or fewer than T final keyphrases will be produced, but the number should be roughly proportional to the length of the original text.
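
A minimal sketch of the pipeline just described: build a co-occurrence graph over unigrams, run PageRank, keep the top fraction of vertices, and merge adjacent selected unigrams into multi-word keyphrases. It assumes the third-party networkx library for PageRank, uses a toy stopword list, and omits the part-of-speech filter (adjectives and nouns) used by the original authors.

```python
# A minimal TextRank-style keyphrase sketch. Simplifications: no POS filter,
# a toy stopword list, and co-occurrence counted over the stopword-filtered
# word sequence rather than raw token positions.

import re
import networkx as nx

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "that"}

def textrank_keyphrases(text, window=2, top_fraction=0.33):
    tokens = [t.lower() for t in re.findall(r"[A-Za-z][A-Za-z-]*", text)]
    candidates = [t for t in tokens if t not in STOPWORDS]

    # Co-occurrence edges: words within `window` positions of each other.
    graph = nx.Graph()
    graph.add_nodes_from(set(candidates))
    for i, word in enumerate(candidates):
        for other in candidates[i + 1:i + window + 1]:
            if other != word:
                graph.add_edge(word, other)

    scores = nx.pagerank(graph)  # damping factor 0.85 by default
    t = max(1, int(top_fraction * graph.number_of_nodes()))
    selected = set(sorted(scores, key=scores.get, reverse=True)[:t])

    # Post-processing: merge adjacent selected unigrams into multi-word phrases.
    phrases, current = set(), []
    for token in tokens:
        if token in selected:
            current.append(token)
        else:
            if current:
                phrases.add(" ".join(current))
            current = []
    if current:
        phrases.add(" ".join(current))
    return sorted(phrases, key=lambda p: -max(scores[w] for w in p.split()))

print(textrank_keyphrases(
    "Natural language processing studies automatic summarization. "
    "Summarization systems process natural language documents."))
```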

It is not initially clear why applying PageRank to a co-occurrence graph would produce useful keyphrases. One way to think about it is the following. A word that appears multiple times throughout a text may have many different co-occurring neighbors. For example, in a text about machine learning, the unigram "learning" might co-occur with "machine", "supervised", "un-supervised", and "semi-supervised" in four different sentences. Thus, the "learning" vertex would be a central "hub" that connects to these other modifying words. Running PageRank/TextRank on the graph is likely to rank "learning" highly. Similarly, if the text contains the phrase "supervised classification", then there would be an edge between "supervised" and "classification". If "classification" appears several other places and thus has many neighbors, its importance would contribute to the importance of "supervised". If it ends up with a high rank, it will be selected as one of the top T unigrams, along with "learning" and probably "classification". In the final post-processing step, we would then end up with keyphrases "supervised learning" and "supervised classification".

In short, the co-occurrence graph will contain densely connected regions for terms that appear often and in different contexts. A random walk on this graph will have a stationary distribution that assigns large probabilities to the terms in the centers of the clusters. This is similar to densely connected Web pages getting ranked highly by PageRank. This approach has also been used in document summarization, considered below.

Document summarization

Like keyphrase extraction, document summarization aims to identify the essence of a text. The only real difference is that now we are dealing with larger text units—whole sentences instead of words and phrases.

Supervised learning approaches

Supervised text summarization is very much like supervised keyphrase extraction. Basically, if you have a collection of documents and human-generated summaries for them, you can learn features of sentences that make them good candidates for inclusion in the summary. Features might include the position in the document (i.e., the first few sentences are probably important), the number of words in the sentence, etc. The main difficulty in supervised extractive summarization is that the known summaries must be manually created by extracting sentences so the sentences in an original training document can be labeled as "in summary" or "not in summary". This is not typically how people create summaries, so simply using journal abstracts or existing summaries is usually not sufficient. The sentences in these summaries do not necessarily match up with sentences in the original text, so it would be difficult to assign labels to examples for training. Note, however, that these natural summaries can still be used for evaluation purposes, since ROUGE-1 evaluation only considers unigrams.
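
The following toy sketch illustrates this supervised setup: each sentence becomes a small feature vector (relative position and word count) and a classifier is trained on hypothetical "in summary"/"not in summary" labels. It assumes scikit-learn and is only meant to show the shape of the data, not a realistic feature set or corpus.

```python
# A minimal sketch of supervised extractive summarization: sentences are
# labeled 1 if they appear in a human-extracted summary, 0 otherwise, and a
# classifier scores unseen sentences. The toy documents and labels below are
# hypothetical.

from sklearn.linear_model import LogisticRegression

def sentence_features(sentences):
    n = len(sentences)
    return [[i / n, len(s.split())] for i, s in enumerate(sentences)]

train_docs = [
    ["Storm hits coast.", "Residents evacuate.", "Officials comment later."],
    ["Markets rally today.", "Tech stocks lead gains.", "Analysts urge caution."],
]
train_labels = [[1, 1, 0], [1, 0, 0]]  # 1 = "in summary", 0 = "not in summary"

X = [f for doc in train_docs for f in sentence_features(doc)]
y = [label for doc in train_labels for label in doc]
model = LogisticRegression().fit(X, y)

test_doc = ["Flood warnings issued.", "Rivers rising fast.",
            "More rain expected tomorrow."]
probs = model.predict_proba(sentence_features(test_doc))[:, 1]
print(sorted(zip(test_doc, probs), key=lambda p: -p[1])[:2])  # top-2 sentences
```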

Maximum entropy-based summarization

During the DUC 2001 and 2002 evaluation workshops, TNO developed a sentence extraction system for multi-document summarization in the news domain. The system was based on a hybrid system using a Naive Bayes classifier and statistical language models for modeling salience. Although the system exhibited good results, the researchers wanted to explore the effectiveness of a maximum entropy (ME) classifier for the meeting summarization task, as ME is known to be robust against feature dependencies. Maximum entropy has also been applied successfully for summarization in the broadcast news domain.

Adaptive summarization

A promising approach is adaptive document/text summarization.[17] It involves first recognizing the text genre and then applying summarization algorithms optimized for this genre. Such software has been created.[18]

TextRank and LexRank

The unsupervised approach to summarization is also quite similar in spirit to unsupervised keyphrase extraction and gets around the issue of costly training data. Some unsupervised summarization approaches are based on finding a "centroid" sentence, which is the mean word vector of all the sentences in the document. Then the sentences can be ranked with regard to their similarity to this centroid sentence.
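
A minimal sketch of the centroid idea, assuming scikit-learn for TF-IDF and cosine similarity: the centroid is the mean TF-IDF vector of all sentences, and sentences are ranked by their similarity to it. The example sentences are made up.

```python
# Centroid-based sentence ranking: represent each sentence as a TF-IDF
# vector, take the mean vector as the "centroid sentence", and rank
# sentences by cosine similarity to that centroid.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The hurricane flooded several districts of the city.",
    "Emergency crews rescued stranded residents overnight.",
    "A local bakery reopened after a week of repairs.",
]
tfidf = TfidfVectorizer().fit_transform(sentences)   # one row per sentence
centroid = np.asarray(tfidf.mean(axis=0))            # mean word vector
ranking = cosine_similarity(tfidf, centroid).ravel()
for score, sent in sorted(zip(ranking, sentences), reverse=True):
    print(f"{score:.2f}  {sent}")
```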

A more principled way to estimate sentence importance is using random walks and eigenvector centrality. LexRank[19] is an algorithm essentially identical to TextRank, and both use this approach for document summarization. The two methods were developed by different groups at the same time, and LexRank simply focused on summarization, but could just as easily be used for keyphrase extraction or any other NLP ranking task.

In both LexRank and TextRank, a graph is constructed by creating a vertex for each sentence in the document.

The edges between sentences are based on some form of semantic similarity or content overlap. While LexRank uses cosine similarity of TF-IDF vectors, TextRank uses a very similar measure based on the number of words two sentences have in common (normalized by the sentences' lengths). The LexRank paper explored using unweighted edges after applying a threshold to the cosine values, but also experimented with using edges with weights equal to the similarity score. TextRank uses continuous similarity scores as weights.

In both algorithms, the sentences are ranked by applying PageRank to the resulting graph. A summary is formed by combining the top ranking sentences, using a threshold or length cutoff to limit the size of the summary.
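
The following sketch, assuming scikit-learn and networkx, follows this recipe in a LexRank-like way: sentences become vertices, edge weights are TF-IDF cosine similarities, PageRank scores the vertices, and the top-ranked sentences (kept in original order) form the summary. Thresholding of low similarities and length cutoffs are omitted for brevity.

```python
# Graph-based sentence ranking: vertices are sentences, edges are weighted
# by TF-IDF cosine similarity, and PageRank on the weighted graph gives the
# ranking used to pick summary sentences.

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(sentences, top_k=2):
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    graph = nx.Graph()
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim[i, j] > 0:
                graph.add_edge(i, j, weight=sim[i, j])
    scores = nx.pagerank(graph, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [sentences[i] for i in sorted(top)]  # keep original document order

print(rank_sentences([
    "The dam failed after days of heavy rain.",
    "Heavy rain caused the dam to fail, officials said.",
    "Engineers had warned about the aging dam for years.",
    "A music festival was postponed because of the weather.",
]))
```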

It is worth noting that TextRank was applied to summarization exactly as described here, while LexRank was used as part of a larger summarization system (MEAD) that combines the LexRank score (stationary probability) with other features like sentence position and length using a linear combination with either user-specified or automatically tuned weights. In this case, some training documents might be needed, though the TextRank results show the additional features are not absolutely necessary.

Unlike TextRank, LexRank has been applied to multi-document summarization.

Multi-document summarization

Multi-document summarization is an automatic procedure aimed at extracting information from multiple texts written about the same topic. The resulting summary report allows individual users, such as professional information consumers, to quickly familiarize themselves with the information contained in a large cluster of documents. In this way, multi-document summarization systems complement news aggregators, taking the next step down the road of coping with information overload. Multi-document summarization may also be done in response to a question.[20][11]

Multi-document summarization creates information reports that are both concise and comprehensive. With different opinions being put together and outlined, every topic is described from multiple perspectives within a single document. While the goal of a brief summary is to simplify information search and cut the time by pointing to the most relevant source documents, a comprehensive multi-document summary should itself contain the required information, limiting the need to access original files to cases where refinement is required. Automatic summaries present information extracted from multiple sources algorithmically, without editorial touch or subjective human intervention.

Diversity

Multi-document extractive summarization faces a problem of redundancy. Ideally, we want to extract sentences that are both "central" (i.e., contain the main ideas) and "diverse" (i.e., they differ from one another). For example, in a set of news articles about some event, each article is likely to have many similar sentences. To address this issue, LexRank applies a heuristic post-processing step that adds sentences in rank order, but discards sentences that are too similar to ones already in the summary. This method is called Cross-Sentence Information Subsumption (CSIS). These methods work based on the idea that sentences "recommend" other similar sentences to the reader. Thus, if one sentence is very similar to many others, it will likely be a sentence of great importance. Its importance also stems from the importance of the sentences "recommending" it. Thus, to get ranked highly and placed in a summary, a sentence must be similar to many sentences that are in turn also similar to many other sentences. This makes intuitive sense and allows the algorithms to be applied to an arbitrary new text. The methods are domain-independent and easily portable. One could imagine the features indicating important sentences in the news domain might vary considerably from the biomedical domain. However, the unsupervised "recommendation"-based approach applies to any domain.

A related method is Maximal Marginal Relevance (MMR),[21] which ranks candidates by balancing relevance to the query against similarity to content already selected. A general-purpose graph-based ranking algorithm in the style of Page/Lex/TextRank that handles both "centrality" and "diversity" in a unified mathematical framework, based on absorbing Markov chain random walks (a random walk where certain states end the walk), is GRASSHOPPER.[22] In addition to explicitly promoting diversity during the ranking process, GRASSHOPPER incorporates a prior ranking (based on sentence position in the case of summarization).
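
A minimal sketch of MMR-style re-ranking for the redundancy problem discussed above, assuming scikit-learn for TF-IDF and cosine similarity: each step greedily adds the sentence that best trades off relevance to the query against similarity to sentences already selected, with the lambda_ parameter controlling the trade-off. The query and sentences are hypothetical.

```python
# Maximal Marginal Relevance (MMR) re-ranking: greedily select sentences that
# are relevant to the query but dissimilar from already-selected sentences.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mmr_select(query, sentences, k=2, lambda_=0.7):
    vec = TfidfVectorizer().fit(sentences + [query])
    S = vec.transform(sentences)
    q = vec.transform([query])
    relevance = cosine_similarity(S, q).ravel()   # similarity to the query
    pairwise = cosine_similarity(S)               # sentence-to-sentence similarity

    selected, remaining = [], list(range(len(sentences)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            redundancy = max((pairwise[i, j] for j in selected), default=0.0)
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [sentences[i] for i in selected]

print(mmr_select(
    "flood damage in the city",
    ["Floodwater damaged hundreds of homes in the city.",
     "Hundreds of city homes were damaged by floodwater.",
     "Repair costs are expected to run into the millions.",
     "The mayor thanked volunteers for their help."],
))
```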

The state-of-the-art results for multi-document summarization are obtained using mixtures of submodular functions, which achieved the best results on the DUC 2004–2007 document summarization corpora.[23] Similar results were achieved with the use of determinantal point processes (which are a special case of submodular functions) for DUC-04.[24]

A new method for multi-lingual multi-document summarization that avoids redundancy generates ideograms to represent the meaning of each sentence in each document, then evaluates similarity by comparing ideogram shape and position. It does not use word frequency, training or preprocessing. It uses two user-supplied parameters: equivalence (when are two sentences to be considered equivalent?) and relevance (how long is the desired summary?).

Submodular functions as generic tools for summarization

The idea of a submodular set function has recently emerged as a powerful modeling tool for various summarization problems. Submodular functions naturally model notions of coverage, information, representation and diversity. Moreover, several important combinatorial optimization problems occur as special instances of submodular optimization. For example, the set cover problem is a special case of submodular optimization, since the set cover function is submodular. The set cover function attempts to find a subset of objects which cover a given set of concepts. For example, in document summarization, one would like the summary to cover all important and relevant concepts in the document; this is an instance of set cover. Similarly, the facility location problem is a special case of submodular optimization, and the facility location function naturally models coverage and diversity. Another example of a submodular optimization problem is using a determinantal point process to model diversity. Similarly, the maximal marginal relevance procedure can also be seen as an instance of submodular optimization. All of these models, which encourage coverage, diversity and information, are submodular. Moreover, submodular functions can be efficiently combined, and the resulting function is still submodular. Hence, one could combine one submodular function which models diversity with another which models coverage, and use human supervision to learn the right combination for the problem at hand.

While submodular functions are a natural fit for summarization problems, they also admit very efficient optimization algorithms. For example, a simple greedy algorithm admits a constant-factor approximation guarantee.[25] Moreover, the greedy algorithm is extremely simple to implement and scales to large datasets, which is very important for summarization problems.
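
To make the greedy algorithm concrete, the sketch below maximizes a simple monotone submodular coverage objective (the number of distinct content words covered by the summary) by repeatedly adding the sentence with the largest marginal gain. Real systems combine richer coverage, diversity, and relevance terms; this toy objective and stopword list are assumptions for illustration only.

```python
# Greedy maximization of a toy submodular coverage function: the value of a
# summary is the number of distinct content words it covers, and each step
# adds the sentence with the largest marginal gain.

import re

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "was", "were", "for", "on"}

def content_words(sentence):
    return {w for w in re.findall(r"[a-z]+", sentence.lower())} - STOPWORDS

def greedy_submodular_summary(sentences, budget=2):
    covered, summary = set(), []
    for _ in range(budget):
        gains = [(len(content_words(s) - covered), s)
                 for s in sentences if s not in summary]
        gain, best = max(gains)
        if gain == 0:
            break  # no sentence adds new content words
        summary.append(best)
        covered |= content_words(best)
    return summary

print(greedy_submodular_summary([
    "Heavy rain flooded the valley and closed roads.",
    "Roads were closed after the valley flooded.",
    "Schools reopened on Monday after repairs.",
]))
```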

Submodular functions have achieved state-of-the-art results on almost all summarization problems. For example, work by Lin and Bilmes, 2012[26] shows that submodular functions achieve the best results to date on the DUC-04, DUC-05, DUC-06 and DUC-07 document summarization tasks. Similarly, work by Lin and Bilmes, 2011,[27] shows that many existing systems for automatic summarization are instances of submodular functions. This was a breakthrough result establishing submodular functions as the right models for summarization problems.[citation needed]

Submodular functions have also been used for other summarization tasks. Tschiatschek et al., 2014 show[28] that mixtures of submodular functions achieve state-of-the-art results for image collection summarization. Similarly, Bairi et al., 2015[29] show the utility of submodular functions for summarizing multi-document topic hierarchies. Submodular functions have also been used successfully for summarizing machine learning datasets.[30]

Applications

Specific applications of automatic summarization include:

  • The Reddit bot "autotldr",[31] created in 2011, summarizes news articles in the comment section of Reddit posts. It was found to be very useful by the Reddit community, which upvoted its summaries hundreds of thousands of times.[32] The name is a reference to TL;DR, Internet slang for "too long; didn't read".[33][34]
  • Adversarial stylometry may make use of summaries, if the detail lost is not major and the summary is sufficiently stylistically different to the input.[35]

Evaluation

The most common way to evaluate the informativeness of automatic summaries is to compare them with human-made model summaries.

Evaluation can be intrinsic or extrinsic,[36] and inter-textual or intra-textual.[37]

Intrinsic versus extrinsic

Intrinsic evaluation assesses the summaries directly, while extrinsic evaluation evaluates how the summarization system affects the completion of some other task. Intrinsic evaluations have assessed mainly the coherence and informativeness of summaries. Extrinsic evaluations, on the other hand, have tested the impact of summarization on tasks like relevance assessment, reading comprehension, etc.

Inter-textual versus intra-textual

Intra-textual evaluation assesses the output of a specific summarization system, while inter-textual evaluation focuses on contrastive analysis of the outputs of several summarization systems.

Human judgement often varies greatly in what it considers a "good" summary, so creating an automatic evaluation process is particularly difficult. Manual evaluation can be used, but this is both time and labor-intensive, as it requires humans to read not only the summaries but also the source documents. Other issues are those concerning coherence and coverage.

The most common way to evaluate summaries is ROUGE (Recall-Oriented Understudy for Gisting Evaluation). It is very common for summarization and translation systems in NIST's Document Understanding Conferences.[2] ROUGE is a recall-based measure of how well a summary covers the content of human-generated summaries known as references. It calculates n-gram overlaps between automatically generated summaries and previously written human summaries. It is recall-based to encourage inclusion of all important topics in summaries. Recall can be computed with respect to unigram, bigram, trigram, or 4-gram matching. For example, ROUGE-1 is the fraction of unigrams that appear in both the reference summary and the automatic summary out of all unigrams in the reference summary. If there are multiple reference summaries, their scores are averaged. A high level of overlap should indicate a high degree of shared concepts between the two summaries.
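
As a concrete illustration of ROUGE-1, the sketch below computes unigram recall of a system summary against a single reference, clipping each word's credit at its count in the system summary. Full ROUGE implementations add stemming, stopword options, multiple references, and further n-gram and sequence variants.

```python
# ROUGE-1 recall: the fraction of reference-summary unigrams that also appear
# in the system summary, with counts clipped so a word is not credited more
# times than it occurs in the system summary.

from collections import Counter

def rouge_1_recall(system_summary, reference_summary):
    sys_counts = Counter(system_summary.lower().split())
    ref_counts = Counter(reference_summary.lower().split())
    overlap = sum(min(count, sys_counts[word]) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

print(rouge_1_recall(
    "the storm flooded the city overnight",
    "the storm flooded several city districts"))  # 4 of 6 reference unigrams
```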

ROUGE cannot determine whether the result is coherent, that is, whether the sentences flow together sensibly. High-order n-gram ROUGE measures help to some degree.

Another unsolved problem is anaphora resolution. Similarly, for image summarization, Tschiatschek et al. developed a Visual-ROUGE score which judges the performance of algorithms for image summarization.[38]

Domain-specific versus domain-independent summarization

Domain-independent summarization techniques apply sets of general features to identify information-rich text segments. Recent research focuses on domain-specific summarization using knowledge specific to the text's domain, such as medical knowledge and ontologies for summarizing medical texts.[39]

Qualitative

The main drawback of the evaluation systems so far is that we need a reference summary (for some methods, more than one), to compare automatic summaries with models. This is a hard and expensive task. Much effort has to be made to create corpora of texts and their corresponding summaries. Furthermore, some methods require manual annotation of the summaries (e.g. SCU in the Pyramid Method). Moreover, they all perform a quantitative evaluation with regard to different similarity metrics.

History

The first publication in the area dates back to 1957[40] (Hans Peter Luhn), starting with a statistical technique. Research increased significantly in 2015. Term frequency–inverse document frequency had been used by 2016. Pattern-based summarization was the most powerful option for multi-document summarization found by 2016. In the following year it was surpassed by latent semantic analysis (LSA) combined with non-negative matrix factorization (NMF). Although they did not replace other approaches and are often combined with them, by 2019 machine learning methods dominated the extractive summarization of single documents, which was considered to be nearing maturity. By 2020, the field was still very active, and research was shifting towards abstractive summarization and real-time summarization.[41]

Recent approaches

Recently, the rise of transformer models, which have replaced more traditional recurrent neural networks (RNNs and LSTMs), has provided flexibility in mapping text sequences to text sequences of a different type, which is well suited to automatic summarization. This includes models such as T5[42] and Pegasus.[43]

from Grokipedia
Automatic summarization, also known as automatic text summarization (ATS), is the computational process of generating a concise version of one or more source documents while retaining their core information content and overall meaning, typically reducing them to 5–30% of the original length or less. The task addresses information overload in an era of vast digital text data, enabling efficient access to key insights from sources such as news articles, scientific papers, and legal documents. ATS methods are broadly categorized into two primary types: extractive summarization, which identifies and extracts salient sentences or phrases directly from the input text to form the summary, and abstractive summarization, which interprets the source material and generates novel sentences that express the essential ideas in a more fluent, human-like manner. Hybrid approaches combine elements of both to leverage their strengths, such as the factual accuracy of extractive methods and the coherence of abstractive ones. Further distinctions include single-document versus multi-document summarization, where the latter aggregates information across multiple related sources, and generic versus query-focused summarization, the latter tailored to specific user needs.

The field originated in the mid-20th century with early statistical techniques, notably Hans Peter Luhn's 1958 work on auto-abstracting, which used word frequency and proximity to select significant sentences from technical literature. Subsequent advances evolved through recurrent neural network (RNN) and long short-term memory (LSTM) models to transformer-based architectures such as BERT and the GPT series in the 2020s. Large language models (LLMs) have recently revolutionized ATS by enabling few-shot learning and context-aware generation, though they introduce challenges such as factual hallucination.

Applications of ATS span diverse domains, including news aggregation for quick overviews, medical report condensation to aid diagnostics, and legal analysis for faster case reviews, thereby enhancing productivity in information-intensive fields. Evaluation typically relies on intrinsic metrics like ROUGE scores for lexical overlap with reference summaries and extrinsic measures assessing summary utility in downstream tasks, alongside human judgments of coherence and relevance. Ongoing challenges include ensuring factual consistency, handling long-context documents, and adapting to specific domains without extensive retraining.

Fundamentals

Definition and Scope

Automatic summarization is the computational process of producing a concise text that captures the most important information from one or more source documents, typically at 10–30% of the original length, while preserving semantic meaning and key details without human involvement. The task aims to create a reductive transformation of the source material through selection, generalization, or interpretation, enabling efficient information access in an era of information overload.

The scope of automatic summarization delineates it from related tasks such as paraphrasing or translation by focusing on condensation rather than equivalence or reformulation. It distinguishes between generic summarization, which provides domain-independent overviews of the source content, and query-focused summarization, which generates tailored outputs responsive to specific user queries or interests. It also covers single-document approaches, which condense individual texts, and multi-document strategies, which integrate and synthesize content across multiple sources to avoid redundancy and highlight contrasts or updates. Summaries may also vary in length, ranging from brief indicative versions that outline main topics to more detailed informative ones that elaborate on core elements.

Core objectives emphasize producing outputs that achieve coherence for logical flow, coverage of essential facts and viewpoints, non-redundancy to eliminate repetition, and fluency to ensure grammatical and stylistic naturalness akin to human writing. These goals guide system design to balance brevity with informativeness, often drawing on the high-level distinction between extractive methods, which select existing phrases, and abstractive ones, which generate new content.

Foundational to automatic summarization are basic natural language processing concepts, including tokenization, which segments text into words, subwords, or sentences for analysis, and sentence parsing, which decomposes sentence structure to identify dependencies and relationships. These preprocessing steps enable subsequent modeling of text semantics, forming the bedrock for more advanced summarization techniques.

Importance and Challenges

Automatic summarization addresses information overload by enabling efficient processing of vast textual data, such as news articles, legal documents, and medical reports, condensing complex information into concise forms that support quick comprehension and decision-making. In practical applications, it facilitates news aggregation by extracting essential events and opinions from multiple sources, streamlines legal document review by highlighting key clauses and precedents, and aids medical report condensation by summarizing patient histories and diagnoses for healthcare professionals. It also enhances search-result snippets by providing brief overviews of web content, improving user navigation in online environments. These capabilities are particularly beneficial for accessibility, offering simplified summaries for non-native speakers and, through text-to-speech integration, for visually impaired users.

The societal impact of automatic summarization has intensified with the explosion of digital content across social media, academic publications, and online news, which overwhelms human processing capacity, with global data creation already exceeding 150 zettabytes annually as of 2025. Summarization delivers gains in time-sensitive fields such as journalism, where automated tools can generate summaries of breaking stories in seconds, allowing reporters to focus on analysis rather than initial reading. Broader benefits include supporting research workflows by distilling large bodies of literature, thereby accelerating discovery, and aiding educational settings by creating digestible overviews for students. Overall, these advances promote more equitable access to information amid growing data volumes.

Despite its value, automatic summarization faces significant technical challenges, including gaps in semantic understanding that lead to incomplete or distorted representations of source nuance, particularly in context-dependent languages or specialized domains. Bias amplification occurs when models perpetuate skewed perspectives from training data, resulting in unbalanced summaries that favor dominant viewpoints. Hallucination in generative models, such as abstractive systems, introduces fabricated details not present in the original text, undermining reliability. Scalability issues arise with long texts, where computational demands and loss of coherence degrade performance on documents exceeding thousands of words.

Ethical concerns further complicate deployment, as poor summaries risk misrepresenting source intent and spreading misinformation, especially in high-stakes areas like legal or medical contexts where inaccuracies could lead to erroneous decisions. Preserving authorial nuance requires safeguards against oversimplification, while addressing potential harms from biased outputs demands diverse training data and transparency in model operation. These issues highlight the need for robust guidelines to mitigate societal risks from automated content generation.

Methods

Extractive Summarization

Extractive summarization is a method in automatic text summarization that identifies and selects key sentences or phrases directly from the source document to form a coherent summary, without paraphrasing or generating new content. The core mechanism involves computing scores for candidate sentences based on linguistic and structural features, followed by ranking and selection to meet a desired summary length, typically 10–30% of the original text. Common features include sentence position, where earlier sentences often receive higher weights because journalistic convention places important information upfront; TF-IDF, which measures term importance by balancing word frequency in the document against its rarity across a corpus; and centrality, which assesses how representative a sentence is of the overall text through its connectivity to other sentences. These scores are aggregated to prioritize sentences that capture the document's main ideas.

Pioneering work in extractive summarization dates to the late 1950s with Hans Peter Luhn's frequency-based approach, which scans documents to identify "significant" words (those appearing frequently, excluding common function words) and selects sentences containing the highest concentrations of these words to form an "auto-abstract." This method laid the foundation for statistical extraction by emphasizing lexical salience without requiring deep semantic analysis. A decade later, in 1969, H. P. Edmundson advanced the field with his cue method, which combines multiple cues: location (favoring sentences near the beginning or end), the frequency of key words, overlap with title words, and predefined "cue phrases" (e.g., "in conclusion" or "the purpose of") that signal importance, weighted subjectively to compute sentence relevance scores. Edmundson's approach improved on pure frequency counting by incorporating structural and contextual indicators, achieving better performance on scientific and technical texts in early evaluations.

Statistical models in extractive summarization build on these early ideas with baselines and refinements focused on frequency and position. The Lead-3 baseline, a simple yet robust method, extracts the first three sentences of a document as the summary, exploiting the inverted-pyramid structure common in news articles where key facts appear early; it serves as a strong reference in benchmarks, often outperforming more complex systems on datasets like CNN/DailyMail because it aligns with human writing patterns. Frequency-driven extraction extends Luhn's principle by using variants of term weighting, such as summing TF-IDF scores across the words in a sentence to gauge informativeness, then greedily selecting non-redundant high-scoring sentences to avoid overlap. These models prioritize computational efficiency and interpretability, making them suitable for large-scale applications.

Extractive summarization offers advantages such as faithfulness to the source material, ensuring summaries contain verbatim content and avoiding fabrication or distortion of facts, which is particularly valuable in domains such as legal or medical text. Its reliance on direct selection also facilitates intrinsic evaluation with metrics like ROUGE, as overlap with the original can be measured precisely against gold-standard summaries. However, limitations include reduced coherence, as concatenated sentences may lack smooth transitions and cohesive flow, potentially producing a disjointed read. Redundancy is another challenge: similar sentences may be selected if diversity is not explicitly enforced during selection.
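
A minimal sketch of the Lead-3 baseline mentioned above: return the first three sentences of the article as the summary. The naive regex sentence splitter and the example article are simplifications for illustration.

```python
# Lead-3 baseline: take the first three sentences of a news article as the
# summary, exploiting the inverted-pyramid structure of news writing.

import re

def lead_3(document):
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return " ".join(sentences[:3])

article = ("A levee failed early Tuesday. Water spread through low-lying "
           "neighborhoods within hours. Officials ordered evacuations. "
           "Residents described scenes of confusion.")
print(lead_3(article))
```
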
A representative example of extractive summarization via graph-based ranking is the TextRank algorithm, which constructs an undirected graph whose nodes represent sentences and whose edges are weighted by inter-sentence similarity (often computed over TF-IDF vectors); it then applies the PageRank algorithm to compute centrality scores and selects the top-ranked sentences to form the summary. This approach captures global text structure by propagating importance through similarity links, improving over purely local features such as frequency. Similarly, LexRank applies eigenvector centrality to a sentence-similarity graph to identify salient nodes, emphasizing cluster-based representativeness for more diverse selections. These graph methods, while unsupervised, have demonstrated competitive performance on single-document tasks by modeling inter-sentence relationships.

A subtype of extractive summarization is query-focused summarization, which tailors the selection of sentences to a specific user query by computing relevance scores, for example cosine similarity between the query and sentence representations using TF-IDF or embeddings. This prioritizes the content most pertinent to the query, enhancing targeted information extraction from documents. In retrieval-augmented generation (RAG) systems, query-focused summarization supports efficient processing by concentrating on query-relevant content from retrieved documents, compressing context to fit within model context windows and facilitating the synthesis of information from multiple sources for accurate response generation.

Abstractive Summarization

Abstractive summarization generates novel sentences that paraphrase and synthesize information from the source text, aiming to capture its semantic essence in a more concise and fluent form than the original. The core mechanism entails first deriving a semantic representation of the input, for example through syntactic parse trees or dense vector embeddings, which encodes key concepts and relationships. This representation then drives a generation process that constructs new text, often guided by linguistic rules or learned patterns to ensure coherence and grammaticality.

Early approaches to abstractive summarization primarily used template-based systems and rule-driven paraphrasing. Kathleen McKeown's foundational work, including her discourse-focused text generation framework, employed predefined templates populated with entities and events extracted from the source, combined with rephrasing rules, to produce summaries that mimicked human abstracts. These methods prioritized interpretability and control but were constrained by hand-crafted rules, limiting their scalability to diverse texts.

The paradigm shifted toward neural architectures in the mid-2010s, when sequence-to-sequence models with attention mechanisms enabled end-to-end learning for abstraction. Rush et al. (2015) pioneered this by introducing a local attention-based encoder-decoder model for sentence summarization, where the decoder generates each summary word conditioned on attended input representations, achieving substantial improvements over prior baselines on benchmark datasets. Building on this, Nallapati et al. (2016) adapted attentional encoder-decoder recurrent neural networks for longer documents, addressing challenges such as rare-word handling and hierarchical structure to produce state-of-the-art abstractive outputs. A pivotal advance came with pointer-generator networks, proposed by See et al. (2017), which hybridize generation and extraction within a neural framework: the model computes a probability distribution over the vocabulary that interpolates between generating new words and pointing to source tokens, allowing it to reproduce factual details accurately while still paraphrasing, and an added coverage mechanism mitigates repetition by penalizing overlooked input elements.

Despite these innovations, abstractive methods face persistent challenges such as factual inconsistency, where models may hallucinate or distort information absent from the source, undermining reliability. Abstractive summarization offers advantages in producing human-like fluency and cohesion, enabling summaries that integrate information across sentences more naturally than extractive alternatives. However, it incurs higher computational cost due to the complexity of generation and remains error-prone, as reliance on learned abstractions can amplify inaccuracies in underrepresented domains.

Hybrid and Aided Approaches

Hybrid approaches in automatic summarization integrate extractive and abstractive techniques to leverage the strengths of both paradigms, typically employing extractive methods for initial content selection followed by abstractive refinement to generate coherent output. Early hybrid models from the 2010s, such as the hierarchical approach proposed by Wang et al., combined statistical sentence scoring with semi-supervised learning to identify salient elements before generating summaries, achieving better coherence than purely extractive systems on multi-document tasks. In the neural era, the model introduced by Pilault et al. in 2020 used transformer-based extractive pre-selection to compress long documents into key segments, which were then summarized abstractively, yielding ROUGE improvements of up to 2 points over standalone abstractive baselines on long-document datasets. More recent hybrids, such as SEHY (2022), exploit document section structure for extractive selection prior to abstractive processing, balancing faithfulness to the source content with fluency.

Aided summarization extends hybrid methods by incorporating human guidance to improve accuracy and adaptability, often through interactive interfaces where users refine outputs via queries, edits, or feedback loops. For instance, the query-assisted framework of Narayan et al. (2022) iteratively updates summaries based on user-specified queries, enabling targeted content selection from document sets while reducing hallucination risks. Semi-supervised hybrids, such as the salient representation learning model of Zhong et al. (2023), blend statistical scoring for extractive candidate generation with neural abstractive refinement, using limited labeled data to train on unlabeled corpora for multi-document tasks. Crowd-sourced aided tools, such as those in the aspect-based summarization benchmark of Roit et al. (2023), use controlled human annotations to guide hybrid pipelines, ensuring diverse perspectives in summary generation for open-domain topics. These systems address the limitations of fully automated methods by allowing user interventions, such as editing salient phrases, to maintain factual accuracy.

The benefit of hybrid and aided approaches lies in balancing extractive fidelity (preserving original semantics) with abstractive creativity (producing novel, concise expressions), while mitigating issues such as incoherence or factual errors in the pure paradigms. For example, a joint extractive-abstractive model for financial narratives (2021) reported gains of 5–10% in semantic-consistency metrics over non-hybrid baselines, highlighting improved usability in domain-specific applications. Developments since 2020 emphasize human-AI collaboration frameworks, such as SUMMHELPER (2023), which supports real-time human-computer co-editing of summaries, and design-space mappings by Zhang et al. (2022) that outline interaction modes such as iterative feedback to foster trust in collaborative summarization. These emerging systems, often integrated with large language models, promote scalable processes that enhance summary quality through complementary human oversight.

Techniques

Keyphrase Extraction

Keyphrase extraction is the task of automatically identifying and selecting multi-word terms, such as noun phrases, that best represent the essence or main topics of a document. These keyphrases serve as concise descriptors of the document's content, aiding indexing, retrieval, and understanding without requiring a full reading. Unlike single keywords, keyphrases capture compound concepts (e.g., "automatic summarization" rather than just "summarization"), making them particularly valuable for representing complex ideas in technical or lengthy texts.

Supervised methods for keyphrase extraction typically frame the problem as a sequence labeling or classification task, where classifiers are trained on annotated datasets to distinguish keyphrases from non-keyphrases. Common features include word position (e.g., proximity to the document's beginning, as phrases appearing early often indicate importance), frequency (e.g., term frequency-inverse document frequency, tf-idf, which weighs occurrence against rarity), and co-occurrence (e.g., measures of semantic relatedness with surrounding terms). Conditional random field (CRF) models excel in this setting by modeling dependencies across phrase boundaries, using features such as part-of-speech tags, dependency parses, and contextual windows to label candidate phrases. In evaluations on scientific articles, CRF-based approaches have outperformed baselines such as SVMs, achieving F-measures around 32-33% on datasets such as SemEval-2010.

Unsupervised methods, in contrast, rely on intrinsic text properties rather than labeled training data, often employing graph-based ranking to identify salient phrases. The TextRank algorithm, introduced in 2004, exemplifies this approach: candidate words serve as graph nodes, edges connect words that co-occur within a sliding window (typically 2-10 words), and node scores are computed iteratively with a PageRank-style voting mechanism until convergence, typically after 20-30 iterations; the top-scoring nodes form the extracted keyphrases. This method has shown competitive results, with precision around 31% and recall around 43% for a window size of 2 on the Inspec benchmark of scientific abstracts.

Evaluation of keyphrase extraction commonly uses precision (the fraction of extracted phrases that are correct), recall (the fraction of gold-standard phrases retrieved), and their F-measure, often computed at top-K (e.g., the top 5 or 10 candidates) to assess performance under practical constraints. These metrics highlight trade-offs, such as high precision at low K versus broader recall at higher K, and are standard on datasets such as Inspec and SemEval. In automatic summarization, extracted keyphrases are frequently used as features for sentence scoring in extractive methods, where sentences containing more keyphrases receive higher scores, guiding the selection of summary content and improving topical focus.

Single-Document Summarization

Single-document summarization focuses on generating a concise representation of the key content of a single document, such as a news article, scientific paper, or report, while preserving its core meaning and structure. Unlike multi-document approaches, it emphasizes the internal coherence and logical flow of one text, often using extractive or abstractive techniques tailored to the source's structure and content. Early methods relied on rule-based heuristics, but modern techniques incorporate machine learning to select or generate summary elements more effectively.

Supervised learning approaches for extractive single-document summarization commonly employ features such as sentence length, position in the document, and similarity to the title or headings to rank and select important sentences. These features capture indicators of salience, like shorter sentences for key facts or sentences closely aligned with the document's title. A seminal work in this area is the trainable document summarizer by Kupiec et al., which uses a Bayesian classifier to estimate the probability of a sentence being included in a human-generated summary based on such features, achieving improved performance over baseline methods on technical documents. Adaptive methods in single-document summarization dynamically adjust the output based on user-specified needs, such as desired summary length or focus on particular aspects like key events or entities. For instance, systems can modulate sentence selection thresholds or reweight features in real time to produce shorter or more targeted summaries without retraining, enhancing flexibility for diverse applications.

Graph-based techniques model the document as a graph where sentences are nodes connected by similarity edges, enabling centrality measures to identify salient content. LexRank, introduced by Erkan and Radev, applies eigenvector centrality inspired by PageRank to compute lexical importance scores for sentences, treating the similarity graph as a Markov chain and converging on stable rankings for extractive selection; this method outperforms frequency-based baselines on news corpora by better capturing thematic clusters.

One key challenge in single-document summarization is maintaining narrative flow, particularly in non-news texts like stories or reports, where extractive methods may disrupt chronological or causal sequences by selecting disjoint sentences. Abstractive approaches aim to mitigate this through paraphrasing, but they risk introducing inconsistencies if not grounded in the source structure. Datasets like CNN/DailyMail, comprising over 300,000 news articles paired with human-written highlights, have become standard for training and evaluating single-document models, facilitating advancements in both extractive and abstractive paradigms.
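
The following is a minimal sketch of LexRank-style centrality scoring, using plain bag-of-words counts and a binary similarity threshold in place of the tf-idf weighted cosine similarity of the original formulation; sentence splitting and summary length control are omitted.

```python
# Sketch of LexRank-style sentence scoring: sentences become graph nodes,
# edges connect pairs whose cosine similarity exceeds a threshold, and a
# power iteration over the row-normalized adjacency yields centrality.
import math
from collections import Counter

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[w] * b[w] for w in common)
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def lexrank(sentences, threshold=0.1, damping=0.85, iterations=50):
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(sentences)
    # Binary adjacency: keep edges above the similarity threshold.
    adj = [[1.0 if i != j and cosine(vecs[i], vecs[j]) >= threshold else 0.0
            for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n + damping * sum(
                      adj[j][i] / max(sum(adj[j]), 1) * scores[j] for j in range(n))
                  for i in range(n)]
    # Return sentences ranked by centrality, most central first.
    return [s for _, s in sorted(zip(scores, sentences), reverse=True)]

doc = ["The agency released its annual report on Tuesday.",
       "The annual report shows rising revenue.",
       "Officials celebrated the rising revenue at a press event.",
       "Weather on Tuesday was mild."]
print(lexrank(doc)[0])
```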

Multi-Document Summarization

Multi-document summarization (MDS) aims to generate a concise, coherent overview from a collection of related documents, such as news articles or research papers, by integrating key information while minimizing overlap and ensuring comprehensive coverage. Unlike single-document approaches, MDS must synthesize diverse perspectives, often from sources with varying emphases, to produce a unified summary that captures the essence of the topic. This process emphasizes redundancy reduction (eliminating repetitive content across documents) and synthesis, where complementary details are fused into novel expressions. Early frameworks, such as those based on Cross-document Structure Theory (CST), highlight the need to model relations like elaboration (adding details) and subsumption (overlapping ideas) to achieve this balance.

Core challenges in MDS include managing redundancy, where identical or similar facts recur across sources, potentially inflating summary length without adding value; handling contradictions, such as conflicting reports on events or findings; and performing topic clustering to identify and group sub-themes within the document set. Redundancy is often addressed through similarity measures like cosine distance on sentence vectors, which filter out near-duplicate content during selection. Contradictions require relational modeling, as in CST, where conflicting segments (e.g., one document claiming an outcome increase while another reports a decrease) are flagged for inclusion or reconciliation based on user needs, ensuring summaries avoid unsubstantiated claims. Topic clustering, meanwhile, involves grouping documents or sentences by shared themes, using clustering techniques to partition content and prevent scattered narratives. These issues are exacerbated in large corpora, where input size can exceed thousands of sentences, demanding scalable algorithms to maintain coherence.

Key techniques for MDS include Maximal Marginal Relevance (MMR), introduced in 1998, which balances relevance to the central topic with novelty to promote diversity and curb redundancy. MMR selects sentences by maximizing a score that weighs similarity to a query or centroid against dissimilarity to already chosen elements, formalized as:

\text{MMR} = \arg\max_{D_i \in R \setminus S} \left[ \lambda \cdot \text{Sim}_1(D_i, Q) - (1 - \lambda) \cdot \max_{D_j \in S} \text{Sim}_2(D_i, D_j) \right]

where R is the candidate set, S the selected set, Q the query, \lambda tunes the trade-off (typically 0.5–0.7), and \text{Sim}_1, \text{Sim}_2 are cosine similarities. This greedy reranking has been widely adopted for extractive MDS, reducing overlap in news clusters by up to 20–30% in early evaluations. Hierarchical clustering extends this by organizing documents into nested structures, such as temporal layers for evolving events, enabling summaries that reflect progression (e.g., initial reports to updates). In the SUMMA system, sentences are clustered recursively by burstiness (peaks in coverage) and evenness, optimizing for salience and coherence across levels, which improved human preference by 92% over flat methods on news corpora. Supervised approaches, such as Integer Linear Programming (ILP), formulate MDS as an optimization problem for optimal sentence selection under constraints like length limits.
ILP models maximize a linear objective combining sentence importance (predicted via supervised regression on features like position and n-gram overlap) and diversity (e.g., unique bigram coverage), subject to binary variables indicating selection and non-overlap penalties. A 2012 method using Support Vector Regression for importance scoring achieved state-of-the-art ROUGE-2 scores of 0.0817 on DUC 2005 datasets, outperforming greedy baselines by incorporating global constraints solvable in seconds via solvers like GLPK. This supervised paradigm trains on annotated corpora to prioritize informative, non-redundant content. Evaluation in MDS uniquely emphasizes coverage of events or entities across documents, assessing how well summaries capture distributed information rather than isolated facts. Metrics like ROUGE variants measure n-gram overlap with references, but event-focused approaches, such as QA-based scoring in the DIVERSE SUMM benchmark, quantify inclusivity by checking if summaries address diverse question-answer pairs (e.g., "what" and "how" events), revealing gaps in large language models where coverage hovers at 36% despite high faithfulness. Human judgments often prioritize event completeness, as partial coverage can mislead on multi-source topics. Practical examples include summarizing news clusters, as in the Multi-News dataset, which comprises 56,216 pairs of 2–10 articles on events like arrests or elections, enabling models to fuse timelines and perspectives into 260-word overviews that reduce redundancy by integrating overlapping reports. In scientific literature, MDS supports reviews by synthesizing study abstracts; the MSLR2022 shared task, using datasets like MS² (20,000 reviews), tasked systems with generating conclusions on evidence directions (e.g., treatment effects), where top entries improved ROUGE-L by 2+ points via hybrid extractive-abstractive methods tailored to domain-specific clustering.
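
As an illustration of the greedy MMR reranking formalized above, the sketch below selects sentences using a simple bag-of-words cosine similarity standing in for Sim_1 and Sim_2; a real system would typically use tf-idf or embedding vectors and a length budget rather than a fixed count.

```python
# Greedy Maximal Marginal Relevance (MMR) selection following the MMR formula:
# score = lambda * sim(candidate, query) - (1 - lambda) * max sim(candidate, selected).
import math
from collections import Counter

def cosine(a, b):
    common = set(a) & set(b)
    num = sum(a[w] * b[w] for w in common)
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def mmr_select(candidates, query, k=2, lam=0.7):
    q_vec = Counter(query.lower().split())
    vecs = {c: Counter(c.lower().split()) for c in candidates}
    selected = []
    while len(selected) < k and len(selected) < len(candidates):
        best, best_score = None, -float("inf")
        for c in candidates:
            if c in selected:
                continue
            relevance = cosine(vecs[c], q_vec)
            redundancy = max((cosine(vecs[c], vecs[s]) for s in selected), default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = c, score
        selected.append(best)
    return selected

docs = ["The storm caused flooding across the coastal region.",
        "Coastal flooding followed the storm, officials said.",
        "Emergency shelters opened for displaced residents."]
print(mmr_select(docs, query="storm flooding response", k=2))
```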

Advanced Optimization Methods

Advanced optimization methods in automatic summarization leverage mathematical frameworks to select optimal summary elements under constraints such as length budgets. A prominent approach involves submodular functions, set functions exhibiting the property of diminishing returns, which enables efficient diverse subset selection for extractive tasks like sentence ranking and coverage maximization. These functions model summarization as optimizing an objective F(S), where S is the summary set, to balance representativeness and diversity while satisfying submodularity:

F(A \cup \{e\}) - F(A) \geq F(B \cup \{e\}) - F(B) \quad \text{for all } A \subseteq B \text{ and } e \notin B.

Recent advancements integrate large language models (LLMs) into these frameworks, using techniques like prompt-based optimization and fine-tuning to enhance abstractive summarization while addressing hallucinations through submodular coverage constraints. For example, as of 2024, LLM-based methods have improved ROUGE scores on benchmarks like CNN/DailyMail by incorporating deterministic constraints for factual consistency.

Hierarchical summarization is another advanced technique for handling very long documents, involving iterative summarization of smaller sections followed by summarization of those intermediate summaries to produce a concise overall representation. This approach enables effective context compression, particularly in Retrieval-Augmented Generation (RAG) systems, where it allows more documents to fit within the limited context windows of LLMs by densely packing and synthesizing information from multiple sources while preserving key details and query relevance. Quality in such applications is evaluated based on faithfulness to the original content and relevance to the specific query.

In practice, submodular functions facilitate greedy algorithms that iteratively select sentences to maximize coverage, providing a principled way to approximate the best summary under budget constraints. Complementary to this, Bayesian approaches address uncertainty in summaries by modeling probabilistic dependencies, such as query relevance or sentence importance, through posterior distributions that incorporate prior knowledge and observed data. For instance, Bayesian query-focused summarization uses hidden variables to estimate sentence contributions, enabling robust handling of ambiguous inputs. These methods offer theoretical advantages, including the (1 - 1/e) approximation guarantee of the greedy algorithm for maximizing monotone submodular functions under cardinality constraints. Their main limitation is computational cost, often O(n^2) for evaluating marginal gains over large document sets, which can hinder scalability in real-time applications.
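
A minimal sketch of greedy selection under a monotone submodular objective is given below; it uses distinct-word coverage as F(S), which is one simple instance of the coverage functions used in practice, and a sentence-count budget rather than a word-length budget.

```python
# Greedy maximization of a monotone submodular coverage objective:
# F(S) = number of distinct words covered by the selected sentences.
# The classic greedy algorithm attains a (1 - 1/e) approximation guarantee.

def coverage(selected_sets):
    return len(set().union(*selected_sets)) if selected_sets else 0

def greedy_submodular(sentences, budget=2):
    word_sets = {s: set(s.lower().split()) for s in sentences}
    selected = []
    while len(selected) < budget:
        current = coverage([word_sets[s] for s in selected])
        # Pick the sentence with the largest marginal gain in coverage.
        gains = {s: coverage([word_sets[t] for t in selected + [s]]) - current
                 for s in sentences if s not in selected}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break
        selected.append(best)
    return selected

doc = ["Submodular objectives reward covering new content.",
       "Covering new content avoids redundant sentences.",
       "Greedy selection is simple and comes with guarantees."]
print(greedy_submodular(doc, budget=2))
```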

Evaluation

Intrinsic and Extrinsic Metrics

Evaluation of automatic summarization systems relies on intrinsic and extrinsic metrics to assess summary quality. Intrinsic metrics evaluate the summary directly by comparing it to reference summaries or the source text, focusing on aspects such as content coverage, fluency, and coherence without requiring human task performance. These metrics are typically automated and domain-independent, enabling scalable assessment, though they may not fully capture semantic nuances. In contrast, extrinsic metrics measure the utility of a summary in supporting downstream tasks, such as question answering or reading comprehension, where the summary's effectiveness is gauged by its impact on task outcomes like accuracy or completion time.

A prominent intrinsic metric is ROUGE (Recall-Oriented Understudy for Gisting Evaluation), introduced in 2004, which quantifies n-gram overlap between the candidate summary and multiple reference summaries to approximate human judgments of informativeness. ROUGE variants include ROUGE-1 for unigram overlap, ROUGE-2 for bigram overlap emphasizing phrase-level matching, and ROUGE-L based on the longest common subsequence to account for sentence-level structure. The core ROUGE-N score is defined as:

\text{ROUGE-N} = \frac{\sum_{\text{gram}_n} \min\big(\text{Count}_{\text{match}}(\text{gram}_n), \text{Count}(\text{gram}_n)\big)}{\sum_{\text{gram}_n} \text{Count}(\text{gram}_n)}

where the numerator sums the matching n-gram counts between candidate and references (clipped to the reference counts) and the denominator sums the n-gram counts in the references; this recall-focused approach correlates well with human evaluations on datasets like DUC. Another intrinsic method is the Pyramid approach, proposed in 2004, which evaluates content selection by identifying semantic content units (SCUs) from human summaries and scoring a candidate summary based on how many unique SCUs it covers, weighted by their pyramid rank to reflect varying human priorities. This method addresses limitations in n-gram metrics by prioritizing semantic informativeness over surface form.

Intrinsic evaluations can be categorized as intra-textual, comparing the summary to the source text for aspects like grammaticality or non-redundancy, or inter-textual, comparing it to reference summaries for content adequacy. Domain-independent intrinsic metrics, such as cosine similarity on text embeddings (e.g., using TF-IDF or neural embeddings like BERT), provide a generic measure of semantic overlap without relying on domain-specific references, often serving as a baseline for broader applicability. For instance, cosine similarity computes the angular distance between vector representations of the summary and reference, yielding values from -1 to 1, where higher scores indicate greater alignment.

Extrinsic metrics assess summarization through task-oriented performance, revealing practical utility but requiring controlled experiments. In question answering tasks, summaries are evaluated by how well they enable accurate answers to queries derived from the source, with metrics like the F1-score on answer extraction showing that effective summaries reduce retrieval time while maintaining high precision. Similarly, in reading comprehension benchmarks, extrinsic evaluation measures improvements in comprehension scores when participants use summaries versus full texts, demonstrating correlations with intrinsic scores but highlighting real-world impacts like faster task completion in audit settings. These approaches underscore that while intrinsic metrics scale well for development, extrinsic ones validate end-use effectiveness.
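
The sketch below computes the ROUGE-N recall defined above against a single reference (with N = 1 for unigrams), using whitespace tokenization; production toolkits add stemming, multiple references, and precision and F-measure variants.

```python
# Minimal ROUGE-N recall against a single reference, using the clipped-count
# formulation above (N = 1 here for unigrams).
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    # Clip candidate counts by reference counts, then normalize by reference size.
    overlap = sum(min(cand[g], ref[g]) for g in ref)
    total = sum(ref.values())
    return overlap / total if total else 0.0

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(round(rouge_n_recall(candidate, reference, n=1), 3))  # 5 of 6 reference unigrams matched
```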

Qualitative and Domain-Specific Assessment

Qualitative evaluation in automatic summarization relies on human assessors to rate summaries based on subjective criteria such as informativeness, coherence, and readability, providing insights into aspects that automated metrics may overlook. Assessors typically use Likert-style scales, such as 5-point Mean Opinion Scores (MOS), where ratings range from "very poor" to "very good" for attributes like grammaticality, non-redundancy, referential clarity, focus, and structure and coherence, which collectively address readability and coherence. For informativeness, overall responsiveness is scored on similar scales, evaluating how well the summary covers key content without redundancy. Pairwise comparisons, where assessors rank two summaries side by side for relative quality, are also employed to reduce rater bias and improve reliability.

In domain-specific contexts, evaluation adapts these qualitative methods to prioritize fidelity to source nuances, including specialized vocabulary and critical entities. For biomedicine, human assessors focus on entity preservation, using rubrics like SaferDx or PDQI-9 to check for omission of key medical facts, diagnoses, and terminology accuracy, with UMLS-based scoring tools for groundedness and faithfulness. In finance, ratings emphasize numerical accuracy, verifying retention of vital figures like monetary values through entity-aware assessments to ensure factual precision in summaries. For legal domains, assessors evaluate preservation of case references, dates, and clause linkages, maintaining domain-specific relevance and coherence. These adaptations ensure summaries align intra-textually with source intricacies, such as technical terms in medical or legal texts.

Challenges in qualitative and domain-specific assessment include high subjectivity, where inter-rater agreement varies, necessitating expert involvement that escalates costs. Creating reference summaries is resource-intensive, requiring domain experts for annotation and clear guidelines to mitigate interpretation variability. The Text Analysis Conference (TAC) exemplifies structured qualitative scoring, using 5-point scales for readability and overall responsiveness, alongside Pyramid-style scoring of content units to guide human judgments in guided summarization tasks. Such human evaluations complement automatic baselines like ROUGE by capturing nuanced quality.

Modern Evaluation Challenges

One of the primary modern challenges in evaluating automatic summarization lies in assessing factuality, particularly the detection of hallucinations, where summaries introduce unsubstantiated or incorrect information not present in the source material. Traditional metrics like ROUGE often fail to capture these issues, as they prioritize lexical overlap rather than semantic fidelity, leading to overestimation of summary quality in abstractive systems. Recent approaches, such as the FactCC metric, employ weakly supervised models to verify factual consistency by applying rule-based transformations to source documents and detecting conflicts with generated summaries, achieving improved correlation with human judgments on datasets like CNN/DailyMail. Despite these advances, evaluation remains complex due to its contextual nature, with automatic methods struggling to distinguish subtle factual errors in diverse domains.

Multilingual evaluation introduces significant gaps, as most benchmarks and metrics are English-centric, resulting in poor generalization to non-English languages, where data scarcity exacerbates issues like translation-induced errors in cross-lingual summarization. For instance, automatic metrics such as BERTScore exhibit biases toward high-resource languages, undervaluing summaries in low-resource ones such as Swahili, and failing to account for linguistic nuances across scripts and morphologies. Efforts to address this include meta-evaluation datasets that test metric robustness across languages, revealing that reference-based evaluators correlate weakly with human assessments outside English, prompting calls for more inclusive, multilingual corpora.

Bias and fairness pose additional hurdles, as summarization models can amplify inherent biases in source texts, such as gender or racial stereotypes in news articles, which standard metrics overlook by focusing on surface-level accuracy rather than equitable representation. Metrics like bias amplification ratios have been proposed to quantify how summaries exacerbate source biases, but their integration into evaluation pipelines remains limited, often requiring domain-specific adaptations. FactCC has been extended in fairness contexts to flag biased factual inconsistencies, yet comprehensive tools for detecting amplification in real-time generation are still emerging.

Scalability challenges arise in evaluating long-form or streaming summaries, where processing extended inputs like books or live news feeds overwhelms traditional metrics designed for short texts, leading to incomplete assessments of coherence over thousands of tokens. Benchmarks for long-context tasks highlight failures in maintaining factual accuracy across extended narratives, with evaluators like those in the ETHIC framework showing that even large models degrade in performance on inputs exceeding 100,000 tokens. This necessitates efficient, hierarchical evaluation methods that can handle dynamic, incremental summarization without prohibitive computational costs.

Emerging reference-free metrics leveraging large language models (LLMs) offer promising solutions by bypassing the need for gold-standard references, instead scoring summaries on criteria like coherence and faithfulness directly against their sources. For example, SummaC aggregates sentence-level natural language inference scores between source and summary to check factual consistency, outperforming reference-based alternatives on benchmarks such as XSum and achieving up to 20% better alignment with human evaluations.
However, these reference-free metrics face generalization failures across languages and styles: evaluators trained predominantly on English data produce inconsistent scores for morphologically rich or low-resource languages, underscoring the need for multilingual fine-tuning.
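
The sketch below illustrates an NLI-style consistency check in the spirit of the approaches above, assuming the Hugging Face transformers library and the public roberta-large-mnli checkpoint; each summary sentence is scored for entailment against the source, and the simple mean aggregation here stands in for the more elaborate aggregation used in published systems.

```python
# NLI-based factual-consistency sketch: each summary sentence is treated as a
# hypothesis and scored for entailment against the source document.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Look up the entailment label index from the model config instead of hardcoding it.
entail_idx = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]

def consistency_score(source, summary_sentences):
    scores = []
    for hypothesis in summary_sentences:
        inputs = tokenizer(source, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        scores.append(probs[0, entail_idx].item())  # probability of entailment
    return sum(scores) / len(scores)

source_doc = "The company reported a 12% rise in quarterly profit and hired 200 staff."
summary = ["Profit rose 12% in the quarter.", "The company cut 200 jobs."]
print(round(consistency_score(source_doc, summary), 3))
```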

Applications

Commercial Systems

Commercial automatic summarization systems have proliferated since the mid-2010s, driven by the maturation of cloud computing and the integration of advanced natural language processing (NLP) capabilities into scalable platforms. This shift enabled enterprises to access sophisticated summarization without on-premises infrastructure, with major providers launching dedicated services around 2016, coinciding with the broader adoption of deep learning in cloud environments. Pricing models typically follow a pay-as-you-go structure, charging based on input volume (e.g., per 1,000 characters or per API call), often with tiered options for high-volume users and free tiers for initial testing.

Google Cloud offers summarization through Vertex AI and Document AI, leveraging generative AI models for abstractive summarization that produce concise, human-like overviews of documents or text. These tools support hybrid extractive-abstractive approaches, where key sentences are identified before rephrasing, and integrate with other Google Cloud services for enterprise workflows such as document processing and legal review. Developers access the features via APIs, with options for custom fine-tuning on proprietary data.

IBM Watson, via watsonx.ai, provides document summarization using foundation models for both extractive and abstractive methods, emphasizing hybrid cloud deployments for secure enterprise use. While earlier components like Watson Tone Analyzer focused on sentiment alongside basic text processing, current capabilities extend to generative summarization for reports, transcripts, and legal documents, reducing processing time by up to 90% in case studies such as the media firm Blendow Group. APIs enable integration into applications, supporting retrieval-augmented generation (RAG) for context-aware summaries.

Microsoft Azure AI Language (formerly Text Analytics) delivers key phrase extraction alongside extractive and abstractive summarization, using encoder models to rank and generate summaries from unstructured text or conversations. Extractive mode selects salient sentences with relevance scores, while abstractive mode generates novel phrasing for coherence; both handle documents up to 125,000 characters total across batches via asynchronous APIs in languages like Python and C#. This facilitates enterprise applications in compliance monitoring and knowledge management, with scalability for batch processing.

Open-source integrations like Hugging Face Transformers enable enterprise deployment of summarization models, such as T5 or BART, fine-tuned for specific domains and shared via the Hugging Face Hub. Companies leverage these for custom pipelines in production environments, deploying abstractive models that generate summaries while preserving key information, often combined with cloud infrastructure for inference at scale. Enterprise features include model sharing, evaluation metrics like ROUGE, and paid Hub services for private repositories and accelerated inference. In news applications, Apple Intelligence incorporates summarization tools within the iOS and macOS ecosystems, using on-device generative models to condense articles and notifications into short digests. This feature prioritizes key points from lengthy content, enhancing user experience in fast-paced media consumption, though it has faced challenges with accuracy in beta releases.
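
As an example of the open-source route described above, the snippet below uses the Hugging Face transformers summarization pipeline with a publicly available abstractive model; the model choice and generation limits are illustrative rather than a recommendation.

```python
# Open-source abstractive summarization via the Hugging Face `transformers` pipeline.
from transformers import pipeline

# Downloads the model on first use; any summarization checkpoint could be substituted.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "The city council approved a new transit plan on Monday. The plan adds "
    "three bus routes, extends evening service, and funds bicycle lanes over "
    "the next two years, with a review scheduled after the first twelve months."
)
summary = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```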

Real-World Use Cases

In journalism, automatic summarization has enabled the generation of concise news briefs from structured data, allowing outlets to scale coverage efficiently. Since 2014, the Associated Press (AP) has employed natural language generation (NLG) technology to automate summaries of corporate earnings reports, transforming raw financial data into readable articles and increasing output from about 300 stories per quarter to over 4,000 without additional staff. This approach has since expanded to other routine reporting, such as sports recaps and election results, freeing journalists for in-depth analysis while maintaining factual accuracy through templated extraction methods.

In the legal and enterprise sectors, automatic summarization streamlines contract review and e-discovery processes by condensing voluminous documents into key insights, reducing manual review time significantly. Tools integrated with e-discovery platforms use extractive and abstractive techniques to highlight clauses, risks, and obligations in contracts, enabling faster due diligence in mergers and litigation. In e-discovery, for instance, summarization aids early case assessment by generating overviews of document sets, helping legal teams prioritize relevant evidence from terabytes of data and cutting review costs in large cases.

In healthcare, automatic summarization facilitates patient record abstraction by synthesizing electronic health records (EHRs) into coherent narratives, supporting clinical decision-making and reducing cognitive overload for providers. Such systems assist data abstractors in quality metric abstraction by extracting and prioritizing key events from longitudinal records, improving efficiency in tasks such as identifying comorbidities or treatment histories. Similarly, for biomedical literature, summarization techniques applied to abstracts automate the generation of lay summaries or evidence overviews from clinical trials, enhancing accessibility for researchers and patients; recent work highlights abstractive methods for condensing trial results while preserving medical accuracy.

In education, automatic summarization condenses lecture notes and course materials, aiding student comprehension and study efficiency. Applications process video lectures or transcripts to produce structured summaries with key points, timestamps, and concept maps, as shown in evaluations where large language models like GPT-3 generated summaries that improved learner retention by 15-20% compared to unassisted notes. For accessibility, summarization supports students with disabilities by adapting content into simplified formats, such as variable-length overviews in e-learning platforms that integrate with screen readers, thereby promoting inclusive education through on-demand customization of dense academic texts.

On social media, automatic summarization handles Twitter (now X) threads by distilling multi-post discussions into single-paragraph overviews, helping users navigate complex conversations quickly. Bots and platform features employ NLG to generate thread summaries, capturing main arguments and conclusions, and have been implemented in tools that process viral threads to boost engagement without overwhelming readers. In content moderation, summarization assists by abstracting user reports or dialogue chains, enabling moderators to review abusive content faster; multimodal systems, for example, summarize text and image interactions in posts, reducing false positives in automated detection by providing contextual digests for human review.
In retrieval-augmented generation (RAG) systems, automatic summarization supports context compression, document understanding, and response generation from multiple sources. It enables denser packing of information within limited context windows of large language models and facilitates synthesis across diverse documents. Summarization types such as query-focused methods ensure relevance to specific queries, while hierarchical summarization handles very long documents through iterative compression. Quality is evaluated based on faithfulness to the original content—preserving key information without hallucinations—and relevance to the query. Large language models have advanced these capabilities, improving efficiency in applications like question answering and knowledge retrieval.
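
A schematic of the hierarchical, map-then-reduce summarization pattern mentioned here is sketched below; the summarize function is a hypothetical placeholder for whatever model or API a deployment actually uses (for example, the pipeline shown earlier), and the chunking is deliberately naive.

```python
# Sketch of hierarchical ("map-reduce") summarization for long inputs in a RAG
# setting: chunks are summarized independently, then the intermediate summaries
# are summarized again into a single overview.

def summarize(text: str, max_words: int = 60) -> str:
    # Placeholder: a real system would call a summarization model or LLM here.
    return " ".join(text.split()[:max_words])

def chunk(text: str, chunk_words: int = 400):
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]

def hierarchical_summary(document: str, chunk_words: int = 400) -> str:
    partials = [summarize(c) for c in chunk(document, chunk_words)]
    # Second pass compresses the concatenated partial summaries.
    return summarize(" ".join(partials), max_words=120)

long_document = "word " * 2000  # stand-in for a book chapter or long report
print(len(hierarchical_summary(long_document).split()))
```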

History and Developments

Early Foundations

The origins of automatic summarization trace back to the 1950s, when early efforts focused on rule-based systems inspired by information retrieval (IR) techniques. In 1958, Hans Peter Luhn introduced one of the first automated methods for generating abstracts from technical papers, using a statistical approach to identify significant sentences based on word frequency and proximity, effectively creating "auto-abstracts" by extracting key excerpts without deep semantic understanding. This work, rooted in IBM's punch-card innovations, marked a foundational shift toward computational text processing and drew heavily from emerging IR practices, such as indexing and keyword weighting, to prioritize content relevance in large document collections. During the 1960s, these ideas influenced broader IR developments, including models that treated documents as bags of words, laying groundwork for later extractive summarization by emphasizing automated term weighting over manual abstracting.

The 1970s and 1980s saw a pivot to linguistic and knowledge-based approaches, emphasizing deeper text comprehension through structured representations. Schank's script theory, developed in the mid-1970s, proposed using predefined "scripts" (stereotypical sequences of events) to model human understanding of narratives, enabling systems to infer and summarize implied content from partial descriptions in stories or reports. This rationalist paradigm, prominent in AI research, influenced summarization by incorporating structured world knowledge to parse and reorganize text elements, moving beyond simple extraction to simulate human-like inference, though it required extensive hand-crafted knowledge bases that limited scalability. By the late 1980s and early 1990s, these methods intersected with government-funded initiatives; the TIPSTER program, initiated in the early 1990s as a precursor to structured evaluations, began integrating linguistic tools for text analysis, funding research that bridged IR with natural language processing to handle real-world document sets.

In the 1990s, the field transitioned to statistical extractive methods, driven by advances in machine learning and the need for more robust, data-oriented systems. Seminal work by Julian Kupiec and colleagues in 1995 demonstrated a trainable summarizer using probabilistic models to score sentences based on features like title overlap and position, achieving effective extracts by learning from annotated corpora without relying on rigid rules. This data-driven shift, influenced by IR techniques for relevance ranking and early machine translation (MT) efforts in sentence alignment, enabled scalable summarization for news and technical texts, marking a departure from knowledge-intensive approaches toward empirical optimization. Key milestones included the first large-scale evaluations under the TIPSTER program's SUMMAC initiative in 1998, which assessed summarization's utility for IR tasks, and the inception of the Document Understanding Conference (DUC) in 2001, building on these roots to standardize benchmarks, with associated workshops at ACL conferences beginning in 2002. These developments solidified influences from IR (e.g., term weighting) and MT (e.g., fluency in rephrasing), fostering a hybrid foundation for future progress.

Recent Advances

The advent of Transformer architectures marked a pivotal shift in automatic summarization, transitioning from statistical methods to neural approaches capable of capturing complex semantic relationships. The introduction of the Transformer model in 2017 revolutionized the field by relying solely on attention mechanisms, eliminating the need for recurrent or convolutional layers, which enabled more efficient parallel processing of long sequences and improved performance in sequence-to-sequence tasks like abstractive summarization. This architecture laid the groundwork for subsequent advancements, allowing models to generate coherent summaries that better mimic human-like abstraction.

Building on Transformers, specialized models emerged for abstractive summarization. In 2019, BART (Bidirectional and Auto-Regressive Transformers) was proposed as a denoising autoencoder pre-trained on large corpora through tasks like text infilling and sentence permutation, achieving state-of-the-art results on datasets such as CNN/Daily Mail when fine-tuned for summarization, where it outperformed prior extractive baselines in ROUGE scores by up to 2-3 points. Similarly, the T5 (Text-to-Text Transfer Transformer) model, introduced in late 2019 and refined in 2020, unified all NLP tasks under a text-to-text framework, demonstrating superior abstractive summarization capabilities when fine-tuned, with ROUGE-2 improvements of approximately 1.5 points over non-pretrained models on news summarization benchmarks.

The integration of large language models (LLMs) further advanced summarization, particularly through fine-tuning and zero-shot capabilities post-2020. Models in the GPT series, starting with GPT-3 in 2020, enabled zero-shot summarization through in-context prompting, where summaries are generated without task-specific training, achieving competitive ROUGE scores (around 20-25 on CNN/Daily Mail) comparable to supervised models in few-shot settings. Prompt-based approaches extended this by incorporating instructions for style or focus, enhancing flexibility without retraining. In the 2020s, innovations addressed limitations in context length and multilingual support; for instance, Longformer (2020) introduced sparse attention patterns to handle documents up to 4,096 tokens, four times longer than standard Transformers, improving summarization of extended texts like legal or scientific articles by reducing quadratic complexity. Multilingual extensions, such as mT5 (2021), pretrained on 101 languages, facilitated cross-lingual summarization, yielding ROUGE improvements of 5-10 points on non-English datasets when fine-tuned.

Key datasets have driven these developments by providing diverse training resources. The XSum dataset (2018), comprising over 200,000 articles paired with single-sentence abstractive summaries, emphasized extreme summarization and novel content generation, boosting model training for concise outputs. Complementing this, MLSUM (2020) offered 1.5 million article-summary pairs across five languages (French, German, Spanish, Russian, and Turkish), enabling multilingual model evaluation and reducing language biases in training. Emerging trends focus on controllability and ethical considerations. Controllable summarization techniques, such as CTRLsum (2020), allow users to specify attributes like length or entity focus via prompts or prefixes, generating tailored summaries with up to 15% better alignment to the specified attributes on benchmarks like Multi-News.
Post-2023, ethical AI has gained prominence, with research emphasizing mitigation of biases, hallucinations, and privacy risks in summarization systems, including guidelines for factual consistency evaluation and stakeholder impact assessment in deployment. From 2024 onward, newer LLMs achieved stronger zero-shot summarization, with higher ROUGE scores (e.g., ROUGE-1 around 40-45 on news benchmarks) than earlier models through better contextual understanding. LLMs have significantly advanced summarization in retrieval-augmented generation (RAG), supporting few-shot learning, context compression, document understanding, and response generation from multiple sources. In RAG, summarization spans extractive (selecting important sentences), abstractive (generating new text), and query-focused (tailored to specific queries) types, enabling denser context packing within limited windows and synthesis across sources while ensuring quality via faithfulness to key information and relevance to queries. Retrieval itself boosts factual accuracy and reduces hallucinations in abstractive summaries, with graph-based variants like GraphRAG enabling query-focused multi-document summarization; hallucinations nonetheless persist and require ongoing mitigation. Hierarchical summarization has emerged to handle very long documents through iterative compression, further enhancing RAG applications. New datasets, including CS-PaperSum (2025) for AI-generated summaries of computer science papers, supported domain-specific training and evaluation as of 2025.

References
