chrF (character n-gram F-score)
Introduced by [Popovic_2015], chrF is designed to overcome common weaknesses of word-based metrics, such as their dependence on language and tokenization, which makes it especially promising for morphologically rich languages.
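It computes an F-score, i.e. the weighted harmonic mean of n-gram precision and recall, in which the \(\beta\) parameter controls how much more weight recall receives than precision:
\[chrF_{\beta} = (1 + \beta^2) \cdot \frac{ngrP \cdot ngrR}{\beta^2 \cdot ngrP + ngrR}\]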
- Where
\(\beta\) sets how much more weight is assigned to recall than to precision (recall counts \(\beta\) times as much). Default is 2.
\(ngrP\) is the n-gram precision, averaged arithmetically over all n-gram orders n = 1 to N. Default N = 6.
\(ngrR\) is the n-gram recall, averaged arithmetically over all n-gram orders n = 1 to N. Default N = 6.
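The definitions above can be illustrated with a minimal pure-Python sketch for a single hypothesis/reference pair. It is a simplification, not the SacreBLEU implementation: the function names (`char_ngrams`, `f_beta`, `simple_chrf`) are hypothetical, whitespace is dropped before extracting n-grams, orders longer than either string are skipped, and the result is scaled to 0-100. All of these are assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n (assumption: whitespace is ignored)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Weighted harmonic mean; recall counts beta times as much as precision."""
    denom = beta ** 2 * precision + recall
    if denom == 0.0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


def simple_chrf(hypothesis: str, reference: str,
                max_order: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale (illustrative only)."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        hyp_total = sum(hyp.values())
        ref_total = sum(ref.values())
        if hyp_total == 0 or ref_total == 0:
            continue  # assumption: skip orders that one side cannot form
        matches = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(matches / hyp_total)
        recalls.append(matches / ref_total)
    if not precisions:
        return 0.0
    ngr_p = sum(precisions) / len(precisions)  # ngrP: mean over orders
    ngr_r = sum(recalls) / len(recalls)        # ngrR: mean over orders
    return 100.0 * f_beta(ngr_p, ngr_r, beta)


print(simple_chrf("the cat sat on a mat", "the cat sat on the mat"))
```

With \(\beta=2\), a pair with high recall and low precision scores higher than the reverse, which is the intended bias toward recall. Scores from this sketch are not guaranteed to match a reference implementation, because edge-case handling and smoothing differ.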
The following metric labels denote de facto standard parameter settings:
- chrF
character n-grams only (N=6)
- chrF+
character n-grams (N=6) and word unigrams
- chrF++
character n-grams (N=6), word unigrams and word bigrams
In all of these, \(\beta=2\). The \(\beta\) value is sometimes appended as a digit to the metric label, e.g. chrF2 means \(\beta=2\).
Implementation Details
We use the SacreBLEU [Matt_Post_Bleu_2018] Python library with its default settings (chrF2): character n-gram order 6, case-sensitive.
Reference
Maja Popović: chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation (WMT-15). 2015. Lisbon, Portugal, pages 392–395. https://aclanthology.org/W15-3049/
Matt Post: A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers. 2018. http://aclweb.org/anthology/W18-6319