BLEU

The bilingual evaluation understudy (BLEU) is an automatic metric introduced in 2002. It measures the n-gram precision between a candidate translation and one or more reference translations, combined with a brevity penalty. The value lies between 0 and 100%, where higher is better. See the official paper [Papineni_2002] for details.
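
For reference, the score as defined in [Papineni_2002] combines the modified n-gram precisions p_n (typically up to N = 4, with uniform weights w_n = 1/N) with a brevity penalty BP based on the candidate length c and the effective reference length r:

  \mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
  \qquad
  \mathrm{BP} =
  \begin{cases}
    1 & \text{if } c > r \\
    e^{\,1 - r/c} & \text{if } c \le r
  \end{cases}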

The BLEU score has its limits; several reports show that it tends to underestimate neural machine translation output. The Slator article “How BLEU Measures Translation and Why It Matters” [Hazel_Mae_2016] gives a nice summary.

Implementation Details

MT Companion applies BLEU on the document level only, as the sentence-level reliability is not high (see e.g. the results of the WMT18 Metrics Shared Task [Ma_Bojar_2018]: the best metrics achieved a maximum correlation of 0.5 on the segment level, whereas on the document level it goes over 0.9).
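
To illustrate the difference, the sketch below contrasts document/corpus-level scoring, which pools the n-gram statistics over all segments, with per-sentence scoring, using the SacreBLEU Python API on a hypothetical two-segment document:

  import sacrebleu

  # Hypothetical two-segment document (candidate and reference).
  hyps = ["The cat sat on the mat.", "It was raining heavily outside."]
  refs = ["The cat sat on the mat.", "It rained heavily outside."]

  # Document/corpus level: n-gram counts are aggregated over the whole
  # document before precision and brevity penalty are computed.
  doc_score = sacrebleu.corpus_bleu(hyps, [refs])
  print(doc_score.score)

  # Sentence level: each segment is scored in isolation, which is the
  # less reliable use of BLEU discussed above.
  for hyp, ref in zip(hyps, refs):
      print(sacrebleu.sentence_bleu(hyp, [ref]).score)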

The implementation is based on the SacreBLEU project [Matt_Post_2018], which is the result of an effort to unify both the computation and the availability of test sets.

We follow the paper's guidance to achieve comparable results: the BLEU settings are published with each evaluation, and no additional output/reference processing is applied.
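
As a minimal sketch of how such a reproducible setting can be reported with the SacreBLEU Python API (the metric parameters here are illustrative, not MT Companion's published configuration):

  from sacrebleu.metrics import BLEU

  hyps = ["The cat sat on the mat."]
  refs = [["The cat sat on the mat."]]

  # Configure the metric explicitly so that the setting can be published
  # together with the score.
  bleu = BLEU(tokenize="intl")
  result = bleu.corpus_score(hyps, refs)

  print(result.score)          # the BLEU value
  print(bleu.get_signature())  # tokenizer, smoothing and version information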

The only normalization applied on the MT Companion side is the following (a sketch is shown after the list):

  1. All EOL (end-of-line) characters are replaced with spaces.

  2. All XML elements are replaced with their content during the TMX to plain-text conversion.
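
A minimal sketch of these two steps in Python (the function names are illustrative; MT Companion's actual TMX conversion code is not shown here):

  import xml.etree.ElementTree as ET

  def replace_eol(text: str) -> str:
      # Step 1: replace all end-of-line characters with spaces.
      return text.replace("\r", " ").replace("\n", " ")

  def flatten_seg(seg_xml: str) -> str:
      # Step 2: during TMX to plain-text conversion, replace XML elements
      # (e.g. inline tags inside a <seg>) with their textual content.
      return "".join(ET.fromstring(seg_xml).itertext())

  print(replace_eol("line one\nline two"))
  print(flatten_seg("<seg>Click <hi>here</hi> to continue.</seg>"))
  # -> "Click here to continue."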

Tokenization is performed on the metric level and set to “international”, which is known to correlate slightly better with human judgements (see [Machacek_2013]).

Chinese and Japanese are evaluated with the “zh” tokenization. For Thai, where the SacreBLEU tokenizers would be insufficient, we apply the OpenNLP tokenizer.
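
A sketch of how such a per-language tokenizer choice could look with SacreBLEU (the helper function and the language-code mapping are illustrative; the Thai/OpenNLP pre-tokenization happens outside SacreBLEU and is not shown):

  from sacrebleu.metrics import BLEU

  def bleu_for_language(target_lang: str) -> BLEU:
      # Illustrative mapping: Chinese and Japanese use the "zh" tokenizer,
      # everything else falls back to the international tokenizer.
      # (Thai text would be pre-tokenized with OpenNLP before scoring.)
      tokenize = "zh" if target_lang in ("zh", "ja") else "intl"
      return BLEU(tokenize=tokenize)

  metric = bleu_for_language("ja")
  print(metric.corpus_score(["今日は良い天気です。"], [["今日は良い天気です。"]]).score)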

References

[Papineni_2002]

Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu: Bleu: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002. https://www.aclweb.org/anthology/P02-1040.pdf

[Matt_Post_2018]

Matt Post: A Call for Clarity in Reporting BLEU Scores. Proceedings of the Third Conference on Machine Translation: Research Papers. 2018. http://aclweb.org/anthology/W18-6319

[Ma_Bojar_2018]

Qingsong Ma, Ondřej Bojar and Yvette Graham: Results of the WMT18 Metrics Shared Task. Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers. 2018. http://www.aclweb.org/anthology/W18-6450

[Hazel_Mae_2016]

Hazel Mae Pan: How BLEU Measures Translation and Why It Matters. 2016. https://slator.com/technology/how-bleu-measures-translation-and-why-it-matters

[Machacek_2013]

Matouš Macháček and Ondřej Bojar: Results of the WMT13 Metrics Shared Task. Proceedings of the Eighth Workshop on Statistical Machine Translation. 2013. http://www.aclweb.org/anthology/W13-2202