METEOR
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric for the evaluation of machine translation output. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. It also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching. The metric was designed to fix some of the problems found in the more popular BLEU metric, and also produce good correlation with human judgement at the sentence or segment level. This differs from the BLEU metric in that BLEU seeks correlation at the corpus level.

Results have been presented which give correlation of up to 0.964 with human judgement at the corpus level, compared to BLEU's achievement of 0.817 on the same data set. At the sentence level, the maximum correlation with human judgement achieved was 0.403.[1]

Algorithm
As with BLEU, the basic unit of evaluation is the sentence. The algorithm first creates an alignment between two sentences: the candidate translation string and the reference translation string. The alignment is a set of mappings between unigrams. A mapping can be thought of as a line between a unigram in one string and a unigram in the other. The constraint is that every unigram in the candidate translation must map to zero or one unigram in the reference. Mappings are selected to produce an alignment as defined above. If two alignments have the same number of mappings, the one with the fewest crosses (that is, the fewest intersections of two mappings) is chosen. Stages are run consecutively, and each stage only adds to the alignment those unigrams which have not been matched in previous stages. Each stage applies one matching module; the table below gives an example match for each module:

| Module | Candidate | Reference | Match |
|---|---|---|---|
| Exact | Good | Good | Yes |
| Stemmer | Goods | Good | Yes |
| Synonymy | well | Good | Yes |

Once the final alignment is computed, the score is computed as follows. Unigram precision P is calculated as:

$$P = \frac{m}{w_t}$$

where $m$ is the number of unigrams in the candidate translation that are also found in the reference translation, and $w_t$ is the number of unigrams in the candidate translation. Unigram recall R is computed as:

$$R = \frac{m}{w_r}$$

where $m$ is as above, and $w_r$ is the number of unigrams in the reference translation. Precision and recall are combined using the harmonic mean in the following fashion, with recall weighted 9 times more than precision:

$$F_{mean} = \frac{10PR}{R + 9P}$$
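As a minimal sketch of the three quantities just described, the following Python function takes the match count and the two sentence lengths and returns precision, recall, and the recall-weighted harmonic mean. The function and parameter names are illustrative assumptions, not part of any METEOR release, and the zero-match guard is an added convention.

```python
def precision_recall_fmean(matches: int, candidate_len: int, reference_len: int) -> tuple[float, float, float]:
    """Return (P, R, F_mean) for one candidate/reference pair.

    matches       -- mapped unigrams (m)
    candidate_len -- unigrams in the candidate translation
    reference_len -- unigrams in the reference translation
    """
    if matches == 0:
        # Assumed convention: no matches means all three values are zero.
        return 0.0, 0.0, 0.0
    precision = matches / candidate_len
    recall = matches / reference_len
    # Harmonic mean with recall weighted 9 times more than precision.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    return precision, recall, f_mean


# Six matches, a seven-word candidate and a six-word reference.
p, r, f = precision_recall_fmean(matches=6, candidate_len=7, reference_len=6)
print(f"P={p:.4f} R={r:.4f} F_mean={f:.4f}")
```

Because recall carries nine times the weight, an extra word in the candidate (lower precision) costs far less than a word missing from it (lower recall).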
The measures that have been introduced so far only account for congruity with respect to single words but not with respect to larger segments that appear in both the reference and the candidate sentence. In order to take these into account, longer n-gram matches are used to compute a penalty p for the alignment. The more mappings there are that are not adjacent in the reference and the candidate sentence, the higher the penalty will be.
In order to compute this penalty, unigrams are grouped into the fewest possible chunks, where a chunk is defined as a set of unigrams that are adjacent in the candidate and in the reference. The longer the adjacent mappings between the candidate and the reference, the fewer chunks there are. A translation that is identical to the reference will give just one chunk. The penalty p is computed as follows:

$$p = 0.5 \left( \frac{c}{u_m} \right)^3$$

where $c$ is the number of chunks, and $u_m$ is the number of unigrams that have been mapped. The final score for a segment is calculated as M below:

$$M = F_{mean} (1 - p)$$

The penalty has the effect of reducing the $F_{mean}$ by up to 50% if there are no bigram or longer matches.
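The chunk count, penalty, and final score can be sketched as follows, assuming the alignment is already available as a list of (candidate index, reference index) pairs. The representation and all function names are assumptions made for illustration; the penalty uses the 0.5 factor and cubed chunk ratio described above.

```python
def count_chunks(alignment: list[tuple[int, int]]) -> int:
    """Count chunks in an alignment of (candidate_index, reference_index) pairs.

    A new chunk starts whenever two consecutive mappings (in candidate order)
    are not adjacent in both the candidate and the reference.
    """
    if not alignment:
        return 0
    ordered = sorted(alignment)  # sort by candidate position
    chunks = 1
    for (c_prev, r_prev), (c_cur, r_cur) in zip(ordered, ordered[1:]):
        if c_cur != c_prev + 1 or r_cur != r_prev + 1:
            chunks += 1
    return chunks


def fragmentation_penalty(chunks: int, mapped: int) -> float:
    """Penalty p = 0.5 * (chunks / mapped unigrams) ** 3."""
    return 0.5 * (chunks / mapped) ** 3 if mapped else 0.0


def segment_score(f_mean: float, chunks: int, mapped: int) -> float:
    """Final segment score M = F_mean * (1 - p)."""
    return f_mean * (1 - fragmentation_penalty(chunks, mapped))


# Candidate "the cat was sat on the mat" vs. reference "the cat sat on the mat":
# the unmatched "was" splits the six mappings into two chunks.
alignment = [(0, 0), (1, 1), (3, 2), (4, 3), (5, 4), (6, 5)]
chunks = count_chunks(alignment)
print(chunks, round(segment_score(60 / 61, chunks, len(alignment)), 4))
```

Taking the alignment as input keeps the sketch independent of how ties between repeated words (such as the two occurrences of "the") are resolved.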
To calculate a score over a whole corpus, or collection of segments, the aggregate values for P, R and p are taken and then combined using the same formula. The algorithm also works for comparing a candidate translation against more than one reference translation. In this case the algorithm compares the candidate against each of the references and selects the highest score.
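The multi-reference rule amounts to taking a maximum over per-reference scores. The sketch below shows this with a pluggable scoring function; `toy_overlap` is a hypothetical stand-in scorer used only to make the example runnable, not the METEOR score itself.

```python
from typing import Callable, Sequence


def best_reference_score(
    candidate: str,
    references: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> float:
    """Score the candidate against every reference and keep the highest score."""
    if not references:
        raise ValueError("at least one reference translation is required")
    return max(score_fn(candidate, reference) for reference in references)


def toy_overlap(candidate: str, reference: str) -> float:
    """Hypothetical scorer: fraction of reference words that occur in the candidate."""
    candidate_words = set(candidate.split())
    reference_words = reference.split()
    return sum(word in candidate_words for word in reference_words) / len(reference_words)


references = ["a cat sat on a mat", "the cat sat on the mat"]
print(best_reference_score("the cat sat on the mat", references, toy_overlap))
```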
Examples

| Reference | the | cat | sat | on | the | mat |
|---|---|---|---|---|---|---|
| Hypothesis | on | the | mat | sat | the | cat |
| Score | | | | | | |
| Fmean | | | | | | |
| Penalty | | | | | | |
| Fragmentation | | | | | | |

| Reference | the | cat | sat | on | the | mat |
|---|---|---|---|---|---|---|
| Hypothesis | the | cat | sat | on | the | mat |
| Score | | | | | | |
| Fmean | | | | | | |
| Penalty | | | | | | |
| Fragmentation | | | | | | |

| Reference | the | cat | sat | on | the | mat | |
|---|---|---|---|---|---|---|---|
| Hypothesis | the | cat | was | sat | on | the | mat |
| Score | | | | | | | |
| Fmean | | | | | | | |
| Penalty | | | | | | | |
| Fragmentation | | | | | | | |
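
The arithmetic for two of these hypotheses can be worked through by hand. The derivation below is a sketch under stated assumptions: the recall-weighted harmonic mean $F_{mean} = 10PR/(R + 9P)$, the penalty $p = 0.5\,(c/u_m)^3$ with $c$ chunks and $u_m$ mapped unigrams, and the final score $M = F_{mean}(1 - p)$. The symbols and rounding are chosen here for illustration. The reordered hypothesis is left out because its chunk count depends on how the two occurrences of "the" are aligned.

```latex
\begin{align*}
\intertext{Identical hypothesis: $m = 6$, both sentences have 6 unigrams, $c = 1$.}
P &= R = \tfrac{6}{6} = 1, &
F_{mean} &= \frac{10 \cdot 1 \cdot 1}{1 + 9 \cdot 1} = 1, \\
p &= 0.5\left(\tfrac{1}{6}\right)^{3} \approx 0.0023, &
M &= 1 \cdot (1 - 0.0023) \approx 0.9977.
\intertext{Hypothesis ``the cat was sat on the mat'': $m = 6$, candidate has 7 unigrams, reference has 6, $c = 2$.}
P &= \tfrac{6}{7} \approx 0.8571, \quad R = \tfrac{6}{6} = 1, &
F_{mean} &= \frac{10 \cdot 0.8571 \cdot 1}{1 + 9 \cdot 0.8571} \approx 0.9836, \\
p &= 0.5\left(\tfrac{2}{6}\right)^{3} \approx 0.0185, &
M &= 0.9836 \cdot (1 - 0.0185) \approx 0.9654.
\end{align*}
```

Even an exact copy of the reference scores slightly below 1, because a single chunk still gives a small non-zero penalty.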
Notes
1. Banerjee, S. and Lavie, A. (2005)

References
- Banerjee, S. and Lavie, A. (2005) "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments" in Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-2005), Ann Arbor, Michigan, June 2005
- Lavie, A., Sagae, K. and Jayaraman, S. (2004) "The Significance of Recall in Automatic Metrics for MT Evaluation" in Proceedings of AMTA 2004, Washington DC, September 2004

External links
- The METEOR Automatic Machine Translation Evaluation System (including link for download)
METEOR
Overview
Definition and Purpose
METEOR, or Metric for Evaluation of Translation with Explicit ORdering, is an automatic evaluation metric designed for assessing the quality of machine translation (MT) output.[1] It aims to provide a linguistically motivated score that aligns more closely with human judgments than earlier lexical metrics, by extending beyond simple surface-level matching to incorporate semantic and structural elements of language.[1] At its core, METEOR evaluates MT by computing unigram precision and recall, while accounting for synonymy through resources like WordNet, stemming to handle morphological variations, and explicit penalties for word order deviations that affect fluency.[1] This approach addresses limitations in metrics like BLEU, which rely heavily on n-gram overlaps and often overlook recall or semantic equivalence; METEOR has demonstrated higher correlation with human assessments on datasets such as Arabic-to-English and Chinese-to-English translations.[1]
Key design goals of METEOR include tunable parameters in its penalty functions, allowing adaptation across different languages and evaluation scenarios, as well as an explicit mechanism to penalize unnatural word ordering through fragmentation analysis, thereby promoting scores that reflect both adequacy and fluency.[1] Developed at Carnegie Mellon University in 2005, METEOR was introduced as a robust alternative to n-gram-based metrics, emphasizing reliability in capturing subtle quality differences in MT systems.[1]
Historical Context
Prior to the introduction of METEOR, the primary automated metric for evaluating machine translation (MT) systems was BLEU, proposed in 2002 by Papineni et al. at IBM Research. BLEU computes a modified n-gram precision score between the machine-generated translation and one or more human reference translations, incorporating a brevity penalty to discourage overly short outputs, but it fundamentally emphasizes precision over recall and does not account for semantic equivalences such as synonyms or stemming variations. This approach stemmed from the need for a quick, reference-based metric to rank MT systems during the early stages of statistical machine translation (SMT), which gained prominence in the late 1990s and early 2000s through probabilistic models trained on bilingual corpora.[5]
Despite its widespread adoption, BLEU exhibited significant limitations, including poor capture of translation fluency and semantic adequacy, as it relied solely on exact surface-level n-gram matches without considering word order flexibility or contextual meaning. Studies in the early 2000s revealed low correlation with human judgments, typically ranging from 0.2 to 0.3 at the segment level, though higher (around 0.8-0.9) at the system level for ranking purposes.[6][7] The rise of SMT in this period, fueled by advances in computational resources and large parallel datasets from initiatives like the Europarl corpus, highlighted gaps in automated evaluation, as BLEU often penalized valid rephrasings and failed to assess whether translations preserved the source meaning or read naturally in the target language.[5]
Human evaluation methods, which served as the gold standard, typically involved bilingual assessors rating translations on adequacy (how well the meaning was conveyed, on a 0-5 scale) and fluency (grammatical and idiomatic naturalness, also on a 0-5 scale), often through direct comparison to references or post-editing tasks.[5] These manual assessments were essential for capturing nuanced aspects like ordering and coherence but were criticized for subjectivity, low inter-annotator agreement (kappa scores around 0.4-0.6), and high costs, limiting their scalability in iterative MT research.
The demand for reliable automatic metrics intensified with the early 2000s MT landscape, driven by DARPA's TIDES program (starting 2001) and NIST's annual evaluations, which tested systems on languages such as Arabic and Chinese and underscored the need for metrics enabling rapid system development and comparison without exhaustive human involvement.[8][9] METEOR emerged as a direct response to these shortcomings in BLEU and to the limitations of human evaluation.[5]
Development
Original Creation
METEOR was originally developed by Satanjeev Banerjee and Alon Lavie at the Language Technologies Institute of Carnegie Mellon University. The metric was first presented at the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization in June 2005.[10]
The development process centered on creating an automatic evaluation metric that addressed limitations in prior approaches by incorporating a broader range of linguistic matches. METEOR is built upon the harmonic mean of unigram precision and recall, with a stronger emphasis on recall to better align with human judgments of translation adequacy. The initial implementation focused on three primary matching types: exact word matches, stem matches using the Porter stemmer, and synonym matches derived from WordNet, enabling improved coverage of semantic and morphological variations in translations.[10][11]
Key innovations in the original METEOR included the integration of synonymy and stemming to enhance linguistic flexibility beyond simple exact matching, alongside a method for parameter tuning that maximizes Pearson correlation with human adequacy scores. The metric was first released in 2004 and was tested on Arabic-English and Chinese-English corpora from the NIST MT evaluations, specifically the LDC TIDES 2003 datasets comprising 664 Arabic sentences and 920 Chinese sentences. Early performance evaluations showed segment-level adequacy correlations of 0.347 for Arabic-English and 0.331 for Chinese-English, ranging approximately 0.34-0.39 overall, significantly outperforming BLEU at the segment level, where BLEU shows lower correlation with human judgments.[10][11]
Versions and Extensions
The METEOR metric evolved through several key releases following its initial 2005 introduction, with updates focusing on enhanced matching capabilities, parameter tuning, and broader language support to improve correlation with human evaluations. The 2007 version, developed by Lavie and Agarwal, refined parameter tuning for higher correlation with human judgments on adequacy and fluency, and extended support to non-English languages including Spanish, French, and German through language-specific synonym tables and optimized weights.[12][13]
In 2010, METEOR-NEXT introduced paraphrase matching modules using tables extracted from parallel corpora, enabling better capture of semantic equivalences beyond synonyms, and provided tuned resources for English, Czech, German, French, and Spanish.[14] The same year, Denkowski and Lavie extended the metric to phrase-level alignments, allowing multi-word units to be matched holistically for improved evaluation of structured translations.[15]
Meteor 1.3, released in 2011 by Denkowski and Lavie, incorporated refinements to optimization algorithms and higher-precision paraphrase matching while distinguishing content from function words, facilitating more reliable system tuning.[16] Parameter adjustments across versions, such as increasing the synonym match weight from 0.25 in early releases to around 0.3 in tuned configurations for European languages, aimed at balancing recall and precision for cross-lingual performance.[13]
The 2014 Meteor Universal version automated resource extraction and parameter learning from parallel data, enabling language-specific evaluations for any target language without manual tuning, thus broadening applicability to low-resource scenarios.[17] No major core updates have occurred since 2014, though open-source implementations, such as those in the Hugging Face Evaluate library, have seen minor tweaks for compatibility with neural machine translation datasets and modern corpora as of 2022.[18] METEOR remains relevant in contemporary surveys on evaluation metrics, praised for its linguistic sensitivity despite the rise of neural approaches like COMET.[19]
Algorithm
Matching Mechanisms
METEOR employs a unigram-based matching process to align words from a candidate machine translation with one or more reference translations, focusing on semantic and morphological similarity rather than strict surface form identity. This foundation allows the metric to capture a broader range of linguistically valid equivalences, thereby improving recall in evaluation while preserving precision for assessing translation adequacy.[10]
The matching mechanisms operate in a prioritized sequence, starting with exact matches where unigrams share identical surface forms, excluding common stop words to avoid irrelevant alignments. Next, stemming matches are applied using the Porter stemmer, which reduces words to their base form (e.g., "running" aligns with "run"), accommodating morphological variations common in English. Synonym matching follows, leveraging WordNet synsets to align unigrams that are semantically equivalent but not morphologically related (e.g., "big" with "large"). In versions from 2010 onward, paraphrase matching extends this to phrasal patterns, aligning multi-word phrases if one is listed as a paraphrase of the other in automatically generated tables derived from parallel corpora via pivot-based methods. This progression (exact, then stem, synonym, and finally paraphrase) ensures higher-confidence matches are prioritized, enhancing the metric's sensitivity to semantic fidelity.[10][11][20]
Alignments are generated greedily by selecting the largest non-crossing subset of mappings at each stage to minimize alignment crosses and maximize coverage. The alignment is thus built iteratively, stage by stage, favoring the mapping sequence that best preserves word order while incorporating the prioritized match types.[10][11]
When multiple reference translations are available, METEOR computes alignments independently against each reference and selects the one yielding the highest matching score, allowing the metric to credit the candidate for adequacy relative to any valid human rendition.[10]
Linguistically, these mechanisms are motivated by the need to evaluate translations based on content preservation rather than literal equivalence; exact and stem matches handle surface and inflectional fidelity, while synonym and paraphrase extensions boost recall for synonymous expressions, collectively yielding higher correlation with human adequacy judgments without sacrificing precision.[10][20]
Alignment and Chunking
In METEOR, the alignment process begins after identifying potential unigram matches between the candidate translation and the reference translation, utilizing types such as exact matches, stem matches, and synonym matches.[1] A greedy matching algorithm is then applied to select a set of these mappings, prioritizing the largest number of matches while minimizing the number of crossing alignments to preserve word order.[1] This algorithm operates in stages (first exact matches, then stemmed matches, and finally synonym matches) and forms one-to-one links between unigrams in the candidate and reference, ensuring each unigram is mapped at most once.[1] Two mappings $(t_i, r_j)$ and $(t_k, r_l)$ cross when the relative positions of the mapped unigrams differ in sign between the candidate and the reference, that is, when $(\mathrm{pos}(t_i) - \mathrm{pos}(t_k)) \cdot (\mathrm{pos}(r_j) - \mathrm{pos}(r_l)) < 0$.[1]
Once aligned, the matched unigrams are grouped into chunks, which are defined as maximal sequences of consecutive matched unigrams in the candidate that align to consecutive matched unigrams in the reference, forming contiguous phrases.[1] For instance, in comparing the candidate "the president spoke" to the reference "the president then spoke," the alignments form two chunks: "the president" as one contiguous group and "spoke" as another, reflecting the insertion of an unmatched word in the reference.[1] The number of such chunks serves as a measure of translation fluency, where fewer chunks indicate longer, more coherent matching sequences that align well with natural phrasing.[1]
Fragmentation is the ratio of the number of chunks to the number of matched unigrams. It is lowest when all matches form a single chunk and reaches 1 when every matched unigram is its own chunk, thereby penalizing scattered or non-contiguous matches that disrupt overall structure.[1] This measure highlights translations where matched elements are fragmented by extraneous or reordered words, reducing the perceived quality compared to those with compact, gap-free alignments.[1]
The chunk-based approach explicitly captures word order differences: it rewards translations whose matched chunks appear in the same sequence as in the reference and penalizes inversions or disruptions, a key innovation in the 2005 formulation of METEOR to better correlate with human judgments of fluency and adequacy.[1]
Scoring Computation
The scoring computation in METEOR begins with calculating unigram precision (P) and recall (R) based on the total number of matched unigrams after applying the matching and alignment processes. Precision is defined as the ratio of the number of matched unigrams in the candidate (system) translation to the total number of unigrams in the candidate translation:
$$P = \frac{m}{w_t}$$
where $m$ is the number of matched unigrams and $w_t$ is the total number of unigrams in the candidate translation.
Examples
Simple Translation Example
To illustrate METEOR's application in a basic scenario, consider the example from the original paper: a reference sentence "the president then spoke to the audience" evaluated against a candidate "the president spoke to the audience."[11] This example focuses on unigram matching through exact word forms, where the reference contains one word ("then") that the candidate omits.
The matching process identifies six exact unigram matches: "the," "president," "spoke," "to," "the," "audience." The candidate has 6 unigrams, the reference 7, yielding precision (P) of 1.0 and recall (R) of 6/7 ≈ 0.857. The Fmean, as the harmonic mean weighted toward recall, is 10 × P × R / (R + 9 × P) = 10 × 1.0 × 0.857 / (0.857 + 9 × 1.0) ≈ 0.870. The matches form two contiguous chunks ("the president" and "spoke to the audience") because of the omitted "then," resulting in fragmentation (chunks/matches) = 2/6 ≈ 0.333 and penalty (Pen) = 0.5 × (0.333)^3 ≈ 0.0185. The final METEOR score (S) is Fmean × (1 - Pen) ≈ 0.853.[11]
This score reflects perfect precision and a mostly linear alignment with only a minor fragmentation penalty; most of the loss comes from the recall-weighted Fmean, which penalizes the missing reference word.[11] The computation uses default parameters, including the Porter stemmer (not needed here) and no synonymy, suitable for English evaluations.[2]
Complex Example with Synonyms
To illustrate METEOR's ability to handle synonyms while penalizing ordering discrepancies, consider the reference sentence "The dog runs quickly" and a candidate "Canine fast hurries," where "dog" matches "canine," "runs" matches "hurries," and "quickly" matches "fast" as synonyms via the WordNet module.[1]
In the matching process, exact matches are first attempted (none are found), followed by stemming (inapplicable), and then synonymy mapping, yielding three unigram matches. The candidate has 3 unigrams, the reference 4 (including "The"), producing precision (P) of 1.0 and recall (R) of 3/4 = 0.75, resulting in an Fmean of 10 × 1.0 × 0.75 / (0.75 + 9 × 1.0) ≈ 0.77.[1] The alignment then groups matched unigrams into chunks based on their contiguity and order relative to the reference; the swapped adverb-verb order in the candidate ("fast hurries" versus "runs quickly") and overall positioning produce three separate chunks rather than a single contiguous phrase, with fragmentation (chunks/matches) = 3/3 = 1.0 and a penalty of 0.5 × (1.0)^3 = 0.5.[1] The final score is thus S = Fmean × (1 - penalty) ≈ 0.77 × 0.5 ≈ 0.38, demonstrating how the explicit ordering penalty reduces the score despite strong semantic coverage through synonyms.[1]
In later extensions, such as those incorporating paraphrase tables post-2007, a candidate like "The canine moves rapidly" could further boost matches by recognizing "moves rapidly" as a paraphrase of "runs quickly," derived from bilingual pivot tables, potentially linking the phrase into fewer chunks and improving the overall score beyond synonymy alone.[22] This highlights METEOR's evolution to capture nuanced rephrasings while still enforcing structural fidelity.[22]
Evaluation and Comparisons
Human Correlation Studies
The initial empirical studies on METEOR's correlation with human judgments were presented in the 2005 ACL paper by Banerjee and Lavie, which evaluated the metric on Arabic-English and Chinese-English translation datasets from the LDC TIDES 2003 collection. At the segment level, METEOR achieved a Pearson correlation of 0.347 with combined human adequacy and fluency scores on Arabic-English translations, surpassing BLEU's correlation of 0.281 on combined scores; for Chinese-English, the correlation was 0.331, surpassing BLEU's 0.243. System-level correlations reached up to 0.52 across systems, demonstrating METEOR's robustness in ranking translation outputs relative to human assessments.[10]
Subsequent evaluations through the NIST MetricsMaTr tasks from 2008 to 2012 further validated METEOR's performance, consistently showing strong correlations with human judgments and outperforming baselines like BLEU.[23]
In the neural machine translation (NMT) era, recent analyses such as the 2023 MDPI survey affirm METEOR's sustained segment-level correlations of 0.3 to 0.4 with human judgments on high-resource languages, though performance drops to around 0.25 for low-resource settings due to limited synonym and stemming resources. The same survey, which evaluated legacy metrics, positioned METEOR as the closest to human evaluations among pre-neural approaches like BLEU and TER. METEOR generally correlates better with human fluency judgments than adequacy, as its synonym and paraphrasing mechanisms capture grammatical naturalness more effectively; tuned parameter versions, optimized via human data, boost overall correlations by 5-10%. As of 2025, METEOR continues to be applied in domain-specific evaluations, such as quantifying gaps between machine and human translation in medical contexts.[24][19]
Performance Against Other Metrics
METEOR addresses key limitations of BLEU, particularly in capturing semantic flexibility and recall, by incorporating synonym matching via WordNet and stemming, which allows it to better handle linguistic variations in translations.[1] Empirical evaluations demonstrate that METEOR achieves higher correlation with human judgments compared to BLEU; for instance, on segment-level assessments for Arabic-to-English and Chinese-to-English translations, METEOR yields Pearson r values of 0.35 and 0.33, respectively, versus BLEU's 0.28 and 0.24.[1] However, METEOR's reliance on linguistic resources makes it computationally more demanding, whereas BLEU's focus on n-gram precision enables faster computation, especially for large-scale n-gram overlap checks.[1] Relative to TER and the NIST metric, METEOR's explicit inclusion of an ordering penalty through alignment chunking provides a more nuanced assessment of translation fluency beyond TER's simple edit-distance operations.[1] In the 2008 NIST Metrics for Machine Translation Challenge, which evaluated metrics across adequacy, fluency, and ranking tasks using human judgments on multiple language pairs, METEOR demonstrated strong performance, particularly in fluency evaluations.[25] Against contemporary neural metrics, METEOR shows reduced effectiveness in capturing deep semantic nuances, as highlighted in 2023-2024 reviews of machine translation evaluation. 
For example, COMET v2.0, a reference-free neural estimator trained on direct quality assessments, achieves higher correlation with human judgments (Spearman rho ≈ 0.45) than METEOR (≈ 0.35) across diverse datasets, due to its multilingual embeddings and end-to-end learning.[26] Despite this, METEOR maintains robustness for legacy statistical machine translation data, where neural metrics may overfit to modern architectures.[19]
METEOR is particularly favored in use cases emphasizing linguistic diversity, such as translations involving morphologically rich languages, owing to its balanced precision-recall framework and synonym handling.[1] To adapt to large language model-based machine translation, recent proposals introduce hybrids like the TQFLL framework, which combines METEOR with fuzzy logic and deep learning components to enhance accuracy in low-resource scenarios.[27]
Implementations
Available Tools
The official implementation of METEOR is the Java-based package developed by Carnegie Mellon University (CMU), with version 1.5 released in 2014 as part of the system's update for the Workshop on Machine Translation (WMT).[28] This package includes the core aligner and scorer components, requiring integration with Princeton WordNet for synonym and stemming operations to enable flexible unigram matching.[17] Open-source alternatives provide accessible integrations in popular programming ecosystems. The Natural Language Toolkit (NLTK) in Python has supported METEOR scoring since version 3.4.1 in 2019, offering a straightforward API via the nltk.translate.meteor_score module for computing scores against reference translations.[29] Similarly, the Hugging Face Evaluate library, introduced in 2022, incorporates METEOR as one of its metrics, allowing seamless computation within machine learning pipelines through a unified interface.[30]
METEOR's core design is English-centric, relying on English WordNet for linguistic features, but extensions enable multilingual use. The Meteor Universal variant, bundled in version 1.5, supports any target language by automatically extracting paraphrase tables and morphological analyzers from parallel corpora, bypassing the need for language-specific resources like EuroWordNet.[17]
Installation typically requires Java 8 or later for the official package, along with WordNet version 3.0 downloaded separately and configured via environment paths.[28] Python-based options like NLTK necessitate installing the library via pip (pip install nltk) and downloading WordNet data with nltk.download('wordnet'), while Hugging Face Evaluate installs similarly (pip install evaluate) and loads METEOR on demand. Scoring remains efficient due to its rule-based nature.[29]
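As a hedged usage sketch of the NLTK route described above, the snippet below wraps `nltk.translate.meteor_score.meteor_score` in a small helper. The helper names and the whitespace tokenizer are assumptions made for illustration. Recent NLTK releases expect pre-tokenized input, while older ones accepted plain strings, and some releases also ask for the `omw-1.4` resource alongside `wordnet`, so the exact setup may differ by version.

```python
def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokenizer, used only for this sketch."""
    return text.lower().split()


def nltk_meteor(reference: str, hypothesis: str) -> float:
    """Score one hypothesis against one reference with NLTK's METEOR implementation."""
    # Imported lazily so this module still loads where NLTK is not installed.
    import nltk
    from nltk.translate.meteor_score import meteor_score

    # WordNet data is needed for the synonym-matching stage.
    nltk.download("wordnet", quiet=True)
    # First argument: list of tokenized references; second: tokenized hypothesis.
    return meteor_score([tokenize(reference)], tokenize(hypothesis))


# Example call (requires NLTK and the WordNet data):
#   nltk_meteor("the cat sat on the mat", "the cat was sat on the mat")
```

Passing several token lists in the first argument scores the hypothesis against multiple references.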
