Evaluation
Brian Thompson
slides by Philipp Koehn
25 September 2018


Evaluation

• How good is a given machine translation system?
• Hard problem, since many different translations acceptable
  → semantic equivalence / similarity
• Evaluation metrics
  – subjective judgments by human evaluators
  – automatic evaluation metrics
  – task-based evaluation, e.g.:
    – how much post-editing effort?
    – does information come across?


Ten Translations of a Chinese Sentence

Israeli officials are responsible for airport security.
Israel is in charge of the security at this airport.
The security work for this airport is the responsibility of the Israel government.
Israeli side was in charge of the security of this airport.
Israel is responsible for the airport's security.
Israel is responsible for safety work at this airport.
Israel presides over the security of the airport.
Israel took charge of the airport security.
The safety of this airport is taken charge of by Israel.
This airport's security is the responsibility of the Israeli security officials.

(a typical example from the 2001 NIST evaluation set)


adequacy and fluency


Adequacy and Fluency

• Human judgement
  – given: machine translation output
  – given: source and/or reference translation
  – task: assess the quality of the machine translation output
• Metrics
  Adequacy: Does the output convey the same meaning as the input sentence?
            Is part of the message lost, added, or distorted?
  Fluency:  Is the output good fluent English?
            This involves both grammatical correctness and idiomatic word choices.


Fluency and Adequacy: Scales

  Adequacy            Fluency
  5  all meaning      5  flawless English
  4  most meaning     4  good English
  3  much meaning     3  non-native English
  2  little meaning   2  disfluent English
  1  none             1  incomprehensible


Annotation Tool

[screenshot of an annotation tool for human evaluation]


Hands On: Judge Translations

• Rank according to adequacy and fluency on a 1-5 scale (5 is best)
  – Source:    L'affaire NSA souligne l'absence totale de débat sur le renseignement
  – Reference: NSA Affair Emphasizes Complete Lack of Debate on Intelligence
  – System1:   The NSA case underscores the total lack of debate on intelligence
  – System2:   The case highlights the NSA total absence of debate on intelligence
  – System3:   The matter NSA underlines the total absence of debates on the piece of information


Hands On: Judge Translations

• Rank according to adequacy and fluency on a 1-5 scale (5 is best)
  – Source:    N'y aurait-il pas comme une vague hypocrisie de votre part ?
  – Reference: Is there not an element of hypocrisy on your part?
  – System1:   Would it not as a wave of hypocrisy on your part?
  – System2:   Is there would be no hypocrisy like a wave of your hand?
  – System3:   Is there not as a wave of hypocrisy from you?


Hands On: Judge Translations

• Rank according to adequacy and fluency on a 1-5 scale (5 is best)
  – Source:    La France a-t-elle bénéficié d'informations fournies par la NSA concernant des opérations terroristes visant nos intérêts ?
  – Reference: Has France benefited from the intelligence supplied by the NSA concerning terrorist operations against our interests?
  – System1:   France has benefited from information supplied by the NSA on terrorist operations against our interests?
  – System2:   Has the France received information from the NSA regarding terrorist operations aimed our interests?
  – System3:   Did France profit from furnished information by the NSA concerning of the terrorist operations aiming our interests?


Evaluators Disagree

• Histogram of adequacy judgments by different human evaluators

[five histograms, one per evaluator, of judgments on the 1–5 scale; y-axis 10%–30%]

(from WMT 2006 evaluation)


Measuring Agreement between Evaluators

• Kappa coefficient

  $K = \frac{p(A) - p(E)}{1 - p(E)}$

  – p(A): proportion of times that the evaluators agree
  – p(E): proportion of times that they would agree by chance
    (5-point scale → p(E) = 1/5)

• Example: inter-evaluator agreement in the WMT 2007 evaluation campaign

  Evaluation type   P(A)   P(E)   K
  Fluency           .400   .2     .250
  Adequacy          .380   .2     .226


Ranking Translations

• Task for evaluator: Is translation X better than translation Y?
  (choices: better, worse, equal)
• Evaluators are more consistent:

  Evaluation type    P(A)   P(E)   K
  Fluency            .400   .2     .250
  Adequacy           .380   .2     .226
  Sentence ranking   .582   .333   .373


Ways to Improve Consistency

• Evaluate fluency and adequacy separately
• Normalize scores
  – use 100-point scale with "analog" ruler
  – normalize mean and variance of evaluators
• Check for bad evaluators (e.g., when using Amazon Mechanical Turk)
  – repeat items
  – include reference
  – include artificially degraded translations


Goals for Evaluation Metrics

Low cost:   reduce time and money spent on carrying out evaluation
Tunable:    automatically optimize system performance towards metric
Meaningful: score should give intuitive interpretation of translation quality
Consistent: repeated use of metric should give same results
Correct:    metric must rank better systems higher


Other Evaluation Criteria

When deploying systems, considerations go beyond quality of translations

Speed:         we prefer faster machine translation systems
Size:          fits into memory of available machines (e.g., handheld devices)
Integration:   can be integrated into existing workflow
Customization: can be adapted to user's needs


automatic metrics


Automatic Evaluation Metrics

• Goal: computer program that computes the quality of translations
• Advantages: low cost, tunable, consistent
• Basic strategy
  – given: machine translation output
  – given: human reference translation
  – task: compute similarity between them


Precision and Recall of Words

  SYSTEM A:  Israeli officials responsibility of airport safety
  REFERENCE: Israeli officials are responsible for airport security

• Precision

  $\frac{\text{correct}}{\text{output-length}} = \frac{3}{6} = 50\%$

• Recall

  $\frac{\text{correct}}{\text{reference-length}} = \frac{3}{7} = 43\%$

• F-measure

  $\frac{\text{precision} \times \text{recall}}{(\text{precision} + \text{recall})/2} = \frac{.5 \times .43}{(.5 + .43)/2} = 46\%$
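As an illustration of the word-level precision, recall, and F-measure above, here is a minimal Python sketch (not part of the original slides); clipped word counts reproduce the System A numbers.

```python
from collections import Counter

def precision_recall_f(output, reference):
    """Word-level precision, recall, and F-measure with clipped counts."""
    out_words = output.lower().split()
    ref_words = reference.lower().split()
    # clipped matches: a word counts at most as often as it occurs in the reference
    correct = sum((Counter(out_words) & Counter(ref_words)).values())
    precision = correct / len(out_words)
    recall = correct / len(ref_words)
    f_measure = precision * recall / ((precision + recall) / 2)  # harmonic mean
    return precision, recall, f_measure

reference = "Israeli officials are responsible for airport security"
system_a  = "Israeli officials responsibility of airport safety"
print(precision_recall_f(system_a, reference))
# -> (0.5, 0.428..., 0.461...), i.e. 50% / 43% / 46% as on the slide
```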
Precision and Recall

  SYSTEM A:  Israeli officials responsibility of airport safety
  REFERENCE: Israeli officials are responsible for airport security
  SYSTEM B:  airport security Israeli officials are responsible

  Metric      System A   System B
  precision   50%        100%
  recall      43%        100%
  f-measure   46%        100%

  flaw: no penalty for reordering


Word Error Rate

• Minimum number of editing steps to transform output to reference
  match:        words match, no cost
  substitution: replace one word with another
  insertion:    add word
  deletion:     drop word
• Levenshtein distance

  $\text{WER} = \frac{\text{substitutions} + \text{insertions} + \text{deletions}}{\text{reference-length}}$


Example

[Levenshtein distance matrices aligning System A and System B against the reference]

  Metric                  System A   System B
  word error rate (WER)   57%        71%


BLEU

• N-gram overlap between machine translation output and reference translation
• Compute precision for n-grams of size 1 to 4
• Add brevity penalty (for too short translations)

  $\text{BLEU} = \min\left(1, \frac{\text{output-length}}{\text{reference-length}}\right) \left(\prod_{i=1}^{4} \text{precision}_i\right)^{1/4}$

• Typically computed over the entire corpus, not single sentences


Example

  SYSTEM A:  Israeli officials responsibility of airport safety
             (2-gram match: "Israeli officials"; 1-gram match: "airport")
  SYSTEM B:  airport security Israeli officials are responsible
             (2-gram match: "airport security"; 4-gram match: "Israeli officials are responsible")
  REFERENCE: Israeli officials are responsible for airport security

  Metric               System A   System B
  precision (1-gram)   3/6        6/6
  precision (2-gram)   1/5        4/5
  precision (3-gram)   0/4        2/4
  precision (4-gram)   0/3        1/3
  brevity penalty      6/7        6/7
  BLEU                 0%         52%
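The following Python sketch (not from the slides) implements the simplified BLEU formula above, with clipped n-gram precisions and the min(1, output/reference) brevity penalty; standard BLEU implementations instead use an exponential brevity penalty and corpus-level counts.

```python
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def bleu(output, reference, max_n=4):
    """Simplified sentence-level BLEU following the formula on the slide."""
    out, ref = output.lower().split(), reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        out_counts, ref_counts = Counter(ngrams(out, n)), Counter(ngrams(ref, n))
        matches = sum((out_counts & ref_counts).values())   # clipped n-gram matches
        precisions.append(matches / max(len(out) - n + 1, 1))
    brevity_penalty = min(1.0, len(out) / len(ref))
    product = 1.0
    for p in precisions:
        product *= p   # zero if any precision is zero (real BLEU smooths this)
    return brevity_penalty * product ** (1.0 / max_n)

reference = "Israeli officials are responsible for airport security"
print(bleu("Israeli officials responsibility of airport safety", reference))  # 0.0
print(bleu("airport security Israeli officials are responsible", reference))  # ~0.52
```

In practice, corpus-level BLEU sums the n-gram counts over all sentences before computing the precisions, which avoids the zero-precision problem seen for System A.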
Multiple Reference Translations

• To account for variability, use multiple reference translations
  – n-grams may match in any of the references
  – closest reference length used
• Example

  SYSTEM:     Israeli officials responsibility of airport safety
  REFERENCES: Israeli officials are responsible for airport security
              Israel is in charge of the security at this airport
              The security work for this airport is the responsibility of the Israel government
              Israeli side was in charge of the security of this airport

  (e.g., the 2-gram "Israeli officials" matches the first reference, the 2-gram "responsibility of" the third, and the 1-gram "airport" any of them)


METEOR: Flexible Matching

• Partial credit for matching stems
  SYSTEM:    Jim went home
  REFERENCE: Joe goes home
• Partial credit for matching synonyms
  SYSTEM:    Jim walks home
  REFERENCE: Joe goes home
• Use of paraphrases


Critique of Automatic Metrics

• Ignore relevance of words
  (names and core concepts more important than determiners and punctuation)
• Operate on local level
  (do not consider overall grammaticality of the sentence or sentence meaning)
• Scores are meaningless
  (scores are very test-set specific, absolute value not informative)
• Human translators score low on BLEU
  (possibly because of higher variability, different word choices)


Evaluation of Evaluation Metrics

• Automatic metrics are low cost, tunable, consistent
• But are they correct?
  → Yes, if they correlate with human judgement


Correlation with Human Judgement

[figure: correlation of automatic metric scores with human judgments]


Pearson's Correlation Coefficient

• Two variables: automatic score x, human judgment y
• Multiple systems (x1, y1), (x2, y2), ...
• Pearson's correlation coefficient r_xy:

  $r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$

• Note:

  mean $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$,  variance $s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$


Metric Research

• Active development of new metrics
  – syntactic similarity
  – semantic equivalence or entailment
  – metrics targeted at reordering
  – trainable metrics
  – etc.
• Evaluation campaigns that rank metrics
  (using Pearson's correlation coefficient)


Evidence of Shortcomings of Automatic Metrics

Post-edited output vs. statistical systems (NIST 2005)

[scatter plot: BLEU score (x-axis) against human adequacy score (y-axis)]


Evidence of Shortcomings of Automatic Metrics

Rule-based vs. statistical systems

[scatter plot: BLEU score (x-axis) against human adequacy and fluency scores (y-axis) for SMT System 1, SMT System 2, and a rule-based system (Systran)]


Automatic Metrics: Conclusions

• Automatic metrics are an essential tool for system development
• Not fully suited to rank systems of different types
• Evaluation metrics remain an open challenge


statistical significance


Hypothesis Testing

• Situation
  – system A has score x on a test set
  – system B has score y on the same test set
  – x > y
• Is system A really better than system B?
• In other words: is the difference in score statistically significant?


Core Concepts

• Null hypothesis
  – assumption that there is no real difference
• P-levels
  – related to how unlikely the observed difference is to arise by chance alone
  – p-level p < 0.01: less than 1% probability that a difference this large would occur by chance
  – typically used: p-level 0.05 or 0.01
• Confidence intervals
  – given that the measured score is x
  – what is the true score (on an infinitely large test set)?
  – interval [x − d, x + d] contains the true score with, e.g., 95% probability


Computing Confidence Intervals

• Example
  – 100 sentence translations evaluated
  – 30 found to be correct
• True translation score?
  (i.e. probability that any randomly chosen sentence is correctly translated)


Normal Distribution

[plot of a normal distribution]

The true score lies in the interval [x̄ − d, x̄ + d] around the sample score x̄ with probability 0.95.


Confidence Interval for Normal Distribution

• Compute mean x̄ and variance s² from the data

  $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$    $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

• True mean µ?


Student's t-distribution

• Confidence interval $p(\mu \in [\bar{x} - d, \bar{x} + d]) \geq 0.95$ computed by

  $d = t \cdot \frac{s}{\sqrt{n}}$

• Values for t depend on test sample size and significance level:

  Significance   Test Sample Size
  Level          100      300      600      ∞
  99%            2.6259   2.5923   2.5841   2.5759
  95%            1.9849   1.9679   1.9639   1.9600
  90%            1.6602   1.6499   1.6474   1.6449


Example

• Given
  – 100 sentence translations evaluated
  – 30 found to be correct
• Sample statistics
  – sample mean $\bar{x} = \frac{30}{100} = 0.3$
  – sample variance $s^2 = \frac{1}{99}\left(70 \times (0 - 0.3)^2 + 30 \times (1 - 0.3)^2\right) = 0.2121$, so $s = \sqrt{0.2121} = 0.46$
• Consulting the table for t at the 95% level → 1.9849
• Computing the interval

  $d = 1.9849 \cdot \frac{0.46}{\sqrt{100}} = 0.091$ → [0.209; 0.391]
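To make the arithmetic concrete, here is a small Python sketch (not part of the slides) that reproduces the confidence interval of the worked example, using the t value for a test sample size of 100 from the table above.

```python
import math

# Example from the slide: 100 sentences evaluated, 30 judged correct
scores = [1] * 30 + [0] * 70
n = len(scores)

mean = sum(scores) / n                                     # 0.3
variance = sum((x - mean) ** 2 for x in scores) / (n - 1)  # 0.2121
std_dev = math.sqrt(variance)                              # 0.46

t = 1.9849                       # 95% level, test sample size 100 (from the table)
d = t * std_dev / math.sqrt(n)   # d = t * s / sqrt(n)

print(f"{mean:.3f} +/- {d:.3f}") # 0.300 +/- 0.091 -> interval [0.209, 0.391]
```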
Pairwise Comparison

• Typically, the absolute score is less interesting
• More important
  – Is system A better than system B?
  – Is a change to my system an improvement?
• Example
  – given a test set of 100 sentences
  – system A better on 60 sentences
  – system B better on 40 sentences
• Is system A really better?


Sign Test

• Using the binomial distribution
  – system A better with probability p_A
  – system B better with probability p_B (= 1 − p_A)
  – probability of system A being better on k sentences out of a sample of n sentences:

  $\binom{n}{k} p_A^k\, p_B^{n-k} = \frac{n!}{k!(n-k)!}\, p_A^k\, p_B^{n-k}$

• Null hypothesis: p_A = p_B = 0.5

  $\binom{n}{k} p^k (1-p)^{n-k} = \binom{n}{k} 0.5^n = \frac{n!}{k!(n-k)!}\, 0.5^n$


Examples

  n     p ≤ 0.01              p ≤ 0.05              p ≤ 0.10
  5     -                     -                     k = 5  (k/n = 1.00)
  10    k = 10 (k/n = 1.00)   k ≥ 9  (k/n ≥ 0.90)   k ≥ 9  (k/n ≥ 0.90)
  20    k ≥ 17 (k/n ≥ 0.85)   k ≥ 15 (k/n ≥ 0.75)   k ≥ 15 (k/n ≥ 0.75)
  50    k ≥ 35 (k/n ≥ 0.70)   k ≥ 33 (k/n ≥ 0.66)   k ≥ 32 (k/n ≥ 0.64)
  100   k ≥ 64 (k/n ≥ 0.64)   k ≥ 61 (k/n ≥ 0.61)   k ≥ 59 (k/n ≥ 0.59)

  Given n sentences, the system has to be better on at least k sentences
  to achieve statistical significance at the specified p-level.
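For the worked example (system A better on 60 out of 100 sentences), the sign test p-value can be computed exactly; this small Python sketch is an illustration, not part of the original slides, and uses a two-sided test consistent with the table above.

```python
from math import comb

def sign_test_p_value(k, n):
    """Two-sided sign test p-value under the null hypothesis p_A = p_B = 0.5."""
    k = max(k, n - k)                                     # larger of the two counts
    tail = sum(comb(n, i) for i in range(k, n + 1)) * 0.5 ** n
    return min(1.0, 2 * tail)                             # two-sided

print(sign_test_p_value(60, 100))  # ~0.057: not significant at p <= 0.05
print(sign_test_p_value(61, 100))  # ~0.035: significant, matching k >= 61 in the table
```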
Bootstrap Resampling

• The methods described so far require a score at the sentence level
• But: common metrics such as BLEU are computed over the whole corpus
• Sampling
  1. test set of 2000 sentences, sampled from a large collection
  2. compute the BLEU score for this set
  3. repeat steps 1–2 1000 times
  4. ignore the 25 highest and 25 lowest obtained BLEU scores
  → 95% confidence interval
• Bootstrap resampling: sample from the same 2000 sentences, with replacement


other evaluation methods


Task-Oriented Evaluation

• Machine translation is a means to an end
• Does machine translation output help accomplish a task?
• Example tasks
  – producing high-quality translations by post-editing machine translation
  – information gathering from foreign language sources


Post-Editing Machine Translation

• Measuring time spent on producing translations
  – baseline: translation from scratch
  – post-editing machine translation
  But: time consuming, depends on the skills of the translator and post-editor
• Metrics inspired by this task
  – TER: based on the number of editing steps
    Levenshtein operations (insertion, deletion, substitution) plus movement
  – HTER: manually construct a reference translation for the output, then apply TER
    (very time consuming, used in the DARPA GALE program 2005–2011)


Content Understanding Tests

• Given machine translation output, can a monolingual target-side speaker answer questions about it?
  1. basic facts: who? where? when? names, numbers, and dates
  2. actors and events: relationships, temporal and causal order
  3. nuance and author intent: emphasis and subtext
• Very hard to devise questions
• Sentence editing task (WMT 2009–2010)
  – person A edits the translation to make it fluent
    (with no access to source or reference)
  – person B checks if the edit is correct
  → did person A understand the translation correctly?
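Both WER (earlier in these slides) and TER are built on word-level Levenshtein edit distance. The sketch below (not part of the slides) computes the edit rate but deliberately omits TER's additional block-movement (shift) operation, so it corresponds to WER rather than full TER.

```python
def edit_rate(output, reference):
    """Word-level Levenshtein edit rate: (sub + ins + del) / reference length."""
    out, ref = output.lower().split(), reference.lower().split()
    # dp[i][j] = edit distance between out[:i] and ref[:j]
    dp = [[0] * (len(ref) + 1) for _ in range(len(out) + 1)]
    for i in range(len(out) + 1):
        dp[i][0] = i
    for j in range(len(ref) + 1):
        dp[0][j] = j
    for i in range(1, len(out) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if out[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match or substitution
    return dp[len(out)][len(ref)] / len(ref)

reference = "Israeli officials are responsible for airport security"
print(edit_rate("Israeli officials responsibility of airport safety", reference))  # 4/7 = 57%
```

A full TER implementation additionally searches for block movements (shifts) of word sequences before counting the remaining Levenshtein operations.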