Meteor

METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric for the evaluation of machine translation output. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. It also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching. The metric was designed to fix some of the problems found in the more popular BLEU metric, and also to produce good correlation with human judgement at the sentence or segment level. This differs from the BLEU metric in that BLEU seeks correlation at the corpus level.

(source: Wikipedia)

In other words, the Meteor score expresses how similar the MT output is to the human translation. The scale ranges from 0 to 1.0 (100%), where a higher number indicates higher quality of the MT output.

Note that this value is only indicative and cannot be used for any price calculations. For a more precise evaluation of MT quality, please refer to a Human Evaluation.

The principle is described in more detail here: http://www.cs.cmu.edu/~alavie/METEOR.
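For illustration, a Meteor score can be computed with the NLTK library. This sketch uses NLTK's implementation and its default parameters, which are not necessarily the exact configuration used here; it assumes NLTK and its WordNet data are installed.

    import nltk
    from nltk.translate.meteor_score import meteor_score

    # One-time downloads needed for stem and synonym matching.
    nltk.download("wordnet")
    nltk.download("omw-1.4")

    reference = "the cat sat on the mat".split()            # human translation, tokenized
    hypothesis = "the cat was sitting on the mat".split()   # MT output, tokenized

    # Recent NLTK versions expect pre-tokenized input; the result lies in [0, 1].
    score = meteor_score([reference], hypothesis)
    print(f"Meteor score: {score:.2f}")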

Implementation Details

This is a short description of the score calculation. First, the metric creates all possible alignments between the hypothesis and the reference using exact, stem, synonym, and paraphrase matches.
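As an illustration, the first three match stages could be checked for a pair of tokens roughly as follows. This is a minimal Python sketch assuming NLTK's Porter stemmer and WordNet; paraphrase matching requires an external paraphrase table and is omitted.

    from nltk.corpus import wordnet          # assumes the WordNet corpus is downloaded
    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()

    def match_stage(hyp_word, ref_word):
        """Return the first stage (exact, stem, synonym) under which the two
        tokens are considered a match, or None if they do not match."""
        if hyp_word == ref_word:
            return "exact"
        if stemmer.stem(hyp_word) == stemmer.stem(ref_word):
            return "stem"
        hyp_synonyms = {l for s in wordnet.synsets(hyp_word) for l in s.lemma_names()}
        if ref_word in hyp_synonyms:
            return "synonym"
        return None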

Then it selects the best alignment using criteria that maximize the number of covered words and minimize the number of chunks and the distances between matched words in the reference and the hypothesis. With the best alignment selected, it calculates the score:

  • Applies separate, tuned weights to function and content words and calculates the weighted recall R and precision P.

  • Calculates the parameterized harmonic mean of P and R, weighted towards recall.

  • Computes a fragmentation penalty to account for gaps and differences in word order.

  • Finally, the score is calculated from the penalty and the harmonic mean, as shown in the sketch after this list.
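The following is a minimal sketch of the final combination step, assuming the weighted precision P, recall R, the number of chunks, and the number of matched words have already been computed from the best alignment. The alpha, beta, and gamma values shown are the commonly cited defaults; actual implementations may tune them differently.

    def meteor_final_score(P, R, chunks, matches,
                           alpha=0.9, beta=3.0, gamma=0.5):
        """Combine weighted precision P and recall R into the final Meteor score."""
        if P == 0 or R == 0:
            return 0.0
        # Parameterized harmonic mean of P and R, weighted towards recall.
        f_mean = P * R / (alpha * P + (1 - alpha) * R)
        # Fragmentation penalty: more chunks (contiguous runs of matched words)
        # relative to the number of matched words means a larger penalty.
        penalty = gamma * (chunks / matches) ** beta
        return (1 - penalty) * f_mean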