Levenshtein

In information theory and computer science, the Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e., insertions, deletions or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965.

Levenshtein distance may also be referred to as edit distance, although that may also denote a larger family of distance metrics. It is closely related to pairwise string alignments.
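As an illustration of the definition above, the distance can be computed with the classic dynamic-programming (Wagner-Fischer) approach. This is a minimal sketch rather than the code used in our implementation; the function name `levenshtein_distance` is chosen here for illustration only.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the rows short

    # previous[j] = distance between the empty prefix of `a` and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]
```

Only two rows of the table are kept at a time, so memory use grows with the length of the shorter string rather than with the product of both lengths.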

Important: In our implementation, we changed the representation of the Levenshtein distance from an absolute number of edited characters (which is not comparable across segments of different lengths) to a percentage and, following the same scale used for Meteor and BLEU, inverted its value. The value now ranges from 0% to 100%, where 0% means the segments are completely different and 100% means they are identical.

In other words, in our implementation the Levenshtein value represents the level of similarity between two strings, calculated from the Levenshtein distance, not the level of difference (number of edits) between them.
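The conversion described above can be written as a single formula. The notation is introduced here for illustration: $t$ and $r$ are the target and reference segments, $|t|$ and $|r|$ their lengths in characters, and $\operatorname{lev}(t, r)$ their Levenshtein distance. Dividing by the length of the longer segment follows the worked example in this section.

$$
\text{value}(t, r) = 100\% \times \left(1 - \frac{\operatorname{lev}(t, r)}{\max(|t|, |r|)}\right)
$$

Because the distance can never exceed the length of the longer segment, the result always falls between 0% and 100%.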

Example

Target segment (MT):    This color is **black**.
Reference segment (MT): This color is **white**.

Levenshtein distance = 5
Percentage = 5 / 20 (length of the longest string in the comparison) = 25%
Percentage inverted (our value) = 100% - 25% = 75%