chrF (character n-gram F-score)

Introduced by [Popovic_2015], the metric is designed to overcome common issues of word-based metrics, such as language and tokenization dependency, which makes it especially promising for morphologically rich languages.

It computes the F-score from character n-gram precision and recall, with recall weighted \(\beta\) times as much as precision; a minimal sketch of the computation is given after the parameter list below.

\[chrF=(1+\beta^{2})\frac{ngrP \cdot ngrR}{\beta^{2} \cdot ngrP+ngrR}\]
Where
  • \(\beta\) controls how much weight is assigned to recall relative to precision. Default is 2.

  • \(ngrP\) is the n-gram precision, arithmetically averaged over all n-gram orders n = 1 to N. Default N = 6.

  • \(ngrR\) is the n-gram recall, arithmetically averaged over all n-gram orders n = 1 to N. Default N = 6.
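
To make the definitions concrete, below is a minimal, illustrative Python sketch of the computation (not the reference implementation; the helper names char_ngram_counts and chrf_score are ours, whitespace is simply stripped, and refinements such as smoothing are omitted):

    from collections import Counter

    def char_ngram_counts(text, n):
        """Multiset of character n-grams of order n (whitespace stripped for simplicity)."""
        chars = text.replace(" ", "")
        return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

    def chrf_score(hypothesis, reference, max_order=6, beta=2.0):
        """chrF = (1 + beta^2) * ngrP * ngrR / (beta^2 * ngrP + ngrR)."""
        precisions, recalls = [], []
        for n in range(1, max_order + 1):
            hyp = char_ngram_counts(hypothesis, n)
            ref = char_ngram_counts(reference, n)
            matched = sum((hyp & ref).values())                 # clipped n-gram matches
            precisions.append(matched / max(sum(hyp.values()), 1))
            recalls.append(matched / max(sum(ref.values()), 1))
        ngr_p = sum(precisions) / len(precisions)               # arithmetic mean over orders 1..N
        ngr_r = sum(recalls) / len(recalls)
        if ngr_p + ngr_r == 0:
            return 0.0
        return (1 + beta ** 2) * ngr_p * ngr_r / (beta ** 2 * ngr_p + ngr_r)

    print(chrf_score("the cat sat on the mat", "the cat is on the mat"))  # value in [0, 1]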

There are de-facto standard parameter settings, known under the following metric labels:

chrF

character ngrams only, N=6

chrF+

character ngrams (N=6), word unigrams

chrF++

character ngrams (N=6), word unigrams and bigrams.

For all of these, \(\beta=2\). The beta value is sometimes specified as a digit after the score label, i.e. chrF2 means \(\beta=2\).
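
Assuming the sacrebleu Python package (version 2.x), these variants roughly correspond to the word_order parameter of its CHRF metric; the sentences below are placeholders:

    from sacrebleu.metrics import CHRF

    hypotheses = ["The cat sat on the mat."]
    references = [["The cat is sitting on the mat."]]  # one reference stream

    chrf = CHRF()                  # chrF:   character n-grams only, N=6
    chrf_pp = CHRF(word_order=2)   # chrF++: adds word unigrams and bigrams
    # chrF+ would correspond to CHRF(word_order=1)

    print(chrf.corpus_score(hypotheses, references))
    print(chrf_pp.corpus_score(hypotheses, references))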

Implementation Details

We use the SacreBLEU [Matt_Post_2018] Python library with the default settings (chrF2): character order N=6, case-sensitive.
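
For illustration, a sentence-level call with those default settings might look like the following (again assuming sacrebleu 2.x; the sentences are placeholders):

    from sacrebleu.metrics import CHRF

    # Defaults described above: beta=2, character order=6, no lowercasing (case-sensitive).
    metric = CHRF()

    score = metric.sentence_score(
        "The quick brown fox jumps over the dog.",
        ["The quick brown fox jumped over the lazy dog."],
    )
    print(score.score)  # chrF2 reported on a 0-100 scale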

Reference

[Popovic_2015]

Maja Popović: chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation (WMT 2015), Lisbon, Portugal, pages 392–395, 2015. https://aclanthology.org/W15-3049/

[Matt_Post_2018]

Matt Post: A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers (WMT 2018), Brussels, Belgium, 2018. https://aclanthology.org/W18-6319/