PLIN009 - Machine translation
Automatic MT quality evaluation
Other MT topics
Vít Baisa

Motivation
► fluency - is the translation fluent, in a natural word order?
► adequacy - does the translation preserve the meaning, or does it change/skew it?
► intelligibility - do we understand the translation?

Evaluation scale
  adequacy            fluency
  5  all meaning      5  flawless English
  4  most meaning     4  good
  3  much meaning     3  non-native
  2  little meaning   2  disfluent
  1  no meaning       1  incomprehensible

Annotation tool
[Screenshot of a judging interface: "You have already judged 14 of 3064 sentences, taking 86.4 seconds per sentence."
 Source: les deux pays constituent plutôt un laboratoire nécessaire au fonctionnement interne de l'UE.
 Reference: rather, the two countries form a laboratory needed for the internal working of the EU.
 Two candidate translations ("both countries are rather a necessary laboratory the internal operation of the eu" and
 "both countries are a necessary laboratory at internal functioning of the eu") are each rated for adequacy and fluency on the 1-5 scales.]

► bigger IAA (inter-annotator agreement)
► time spent on post-editing: how much the cost of translation is reduced

Automatic translation evaluation
► advantages: speed, cost
► disadvantages: do we really measure the quality of the translation?
► gold standard: manually prepared reference translations
► candidate c is compared with n reference translations r_i
► the paradox of automatic evaluation: the task resembles a situation where students have to grade their own exam: how do they know where they made a mistake?
► various approaches: n-grams shared between c and r_i, edit distance, ...

Recall and precision on words
The simplest method of automatic evaluation.

  system A:  Israeli officials responsibility of airport safety
  reference: Israeli officials are responsible for airport security

► precision = correct / output-length = 3/6 = 50%
► recall = correct / reference-length = 3/7 = 43%
► f-score = (precision × recall) / ((precision + recall) / 2) = (.50 × .43) / ((.50 + .43) / 2) = 46%

Recall and precision - shortcomings

  system A:  Israeli officials responsibility of airport safety
  reference: Israeli officials are responsible for airport security
  system B:  airport security Israeli officials are responsible

  metrics     system A   system B
  precision   50%        100%
  recall      43%        100%
  f-score     46%        100%

It does not capture wrong word order.

BLEU
► the most famous (standard), the most used, the oldest (2001)
► IBM, author Papineni
► n-gram match between reference and candidate translations
► precision is calculated for 1-, 2-, 3- and 4-grams, plus a brevity penalty
► BLEU = min(1, output-length / reference-length) × (precision_1 × precision_2 × precision_3 × precision_4)^(1/4)

BLEU - an example

  system A:  Israeli officials | responsibility of | airport | safety
             (one 2-gram match, further 1-gram matches)
  reference: Israeli officials are responsible for airport security
  system B:  airport security | Israeli officials are responsible
             (one 2-gram match, one 4-gram match)

  metrics              system A   system B
  precision (1-gram)   3/6        6/6
  precision (2-gram)   1/5        4/5
  precision (3-gram)   0/4        2/4
  precision (4-gram)   0/3        1/3
  brevity penalty      6/7        6/7
  BLEU                 0%         52%
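To make the example above concrete, here is a minimal sketch of the BLEU computation as given on the slide: clipped n-gram precisions up to 4-grams, a single reference, no smoothing. The function names (ngrams, bleu) are illustrative, not from any particular toolkit; real implementations (e.g. in MT toolkits) add smoothing and multiple references.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """BLEU as on the slide: min(1, output-len / reference-len) * (prod of n-gram precisions)^(1/4)."""
    c, r = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(c, n), ngrams(r, n)
        overlap = sum((cand_ngrams & ref_ngrams).values())   # clipped n-gram matches
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if 0 in precisions:                                       # any zero precision -> BLEU = 0 (no smoothing)
        return 0.0
    brevity = min(1.0, len(c) / len(r))
    geo_mean = 1.0
    for p in precisions:
        geo_mean *= p
    return brevity * geo_mean ** (1.0 / max_n)

reference = "Israeli officials are responsible for airport security"
system_a  = "Israeli officials responsibility of airport safety"
system_b  = "airport security Israeli officials are responsible"

print(f"system A: {bleu(system_a, reference):.2f}")   # 0.00
print(f"system B: {bleu(system_b, reference):.2f}")   # about 0.52, matching the table
```

System A gets 0 because it has no 4-gram match, which reproduces the 0% vs. 52% contrast in the table above.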
Other metrics
► NIST
  ► NIST: National Institute of Standards and Technology
  ► weighted matches of n-grams (information value)
  ► very similar results to BLEU (a variant of it)
► NEVA
  ► Ngram EVAluation
  ► BLEU score adapted for short sentences
  ► takes synonyms into account (stylistic richness)
► WAFT
  ► Word Accuracy for Translation
  ► edit distance between c and r
  ► WAFT = 1 - edit-distance(c, r) / max(l_r, l_c)

Other metrics II
► TER
  ► Translation Edit Rate
  ► the fewest edit steps (deletion, insertion, swap, replacement)
  ► TER = number of edits / average number of reference words
  ► r = dnes jsem si při fotbalu zlomil kotník
  ► c = při fotbalu jsem si dnes zlomil kotník
    (both: "I broke my ankle today playing football", with different word order)
  ► TER = 4/7
► HTER
  ► Human TER
  ► r is prepared manually and then TER is applied
► METEOR
  ► takes into account synonyms (WordNet) and morphological variants of words
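Below is a minimal sketch of the edit-distance idea behind WAFT and TER: a word-level Levenshtein distance (insertions, deletions, substitutions only). Full TER additionally allows shifts of whole word blocks, which this sketch omits; the function name word_edit_distance is illustrative. On the Czech example from the slide it yields the quoted 4 edits over 7 reference words.

```python
def word_edit_distance(cand, ref):
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    c, r = cand.split(), ref.split()
    # dp[i][j] = edits needed to turn the first i words of c into the first j words of r
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i in range(len(c) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(c) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if c[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(c)][len(r)]

ref  = "dnes jsem si při fotbalu zlomil kotník"
cand = "při fotbalu jsem si dnes zlomil kotník"

edits = word_edit_distance(cand, ref)
ref_len, cand_len = len(ref.split()), len(cand.split())
print(f"edits = {edits}, TER-style score = {edits}/{ref_len} = {edits / ref_len:.2f}")  # 4/7 = 0.57
print(f"WAFT-style score = {1 - edits / max(ref_len, cand_len):.2f}")                   # 1 - 4/7 = 0.43
```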
Evaluation of evaluation metrics
Correlation of automatic evaluation with manual evaluation.

Translation evaluation example - EuroMatrix
[Table: EuroMatrix, a matrix of BLEU scores for translation between EU languages (Danish, German, Greek, English, Finnish, French, Italian, Spanish, ...), with the output language in columns; the individual values are not readable in this extraction.]
[Table: "Translation quality by language pairs", a larger matrix covering all official EU languages; values not readable in this extraction.]

Factored translation models
► common SMT models do not use linguistic knowledge
► usage of lemmas, PoS tags and stems helps to overcome data sparsity
► vectors of factors are translated instead of plain words (tokens)
[Figure: input and output factors - word, lemma, part-of-speech, morphology, word class.]

Factored translation models
► in standard SMT, dům and domy ("house" and "houses") are independent tokens
► in FTM they share the lemma, the PoS tag and part of the morphological information
► the lemma and the morphological information are translated separately
► in the target language, the appropriate word form is then generated
[Figure: the lemma and the part-of-speech/morphology factors are mapped separately from input to output; the output word form is generated from them.]
Implemented in Moses.

Tree-based translation models
► SMT translates word sequences
► many situations can be better explained with syntax: moving the verb around a sentence, grammatical agreement over long distances, ...
► translation models based on syntactic trees
► a current topic; for some language pairs it gives the best results

TBTM II - synchronous phrase grammar
► EN rule: NP → DET JJ NN
► DE rule: NP → DET NN JJ
► synchronous rule: NP → DET1 NN2 JJ3 | DET1 JJ3 NN2
► final rule: N → dům | house
► mixed rule: N → la maison JJ1 | the JJ1 house

Parallel treebank
[Figure: word-aligned sentence pair with PoS tags - "I shall be passing on to you some comments" (PRP MD ...) aligned with "Ich werde Ihnen die entsprechenden Anmerkungen aushändigen" (PPER VAFIN PPER ART ADJ NN VVFIN).]

Syntactic rules extraction
[Figure: the PoS-tagged English sentence "I shall be passing on to you some comments" with extracted subtrees, e.g. the phrase "to you" (TO PRP).]

Hybrid systems of machine translation
► combination of rule-based and statistical systems
► rule-based translation with post-editing by SMT (e.g. smoothing with a LM)
► data preparation for SMT based on rules, changing the output of SMT based on rules

Computer-aided translation
► CAT - computer-assisted (aided) translation
► out of scope of pure MT
► tools belonging to the CAT realm:
  ► spell checkers (typos): hunspell
  ► grammar checkers: Lingea Grammaticon
  ► terminology management: Trados TermBase
  ► electronic translation dictionaries: Metatrans
  ► corpus managers: Manatee/Bonito
  ► translation memories: MemoQ, Trados

Translation memory
► database of segments: titles, phrases, sentences, terms, paragraphs
► which have already been translated (manually): translation units
► advantages:
  ► everything is translated only once
  ► cost reduction (repeated translation of manuals)
► disadvantages:
  ► the majority of the best (biggest) systems are commercial
  ► translation units are hard to get
  ► an inappropriate translation is repeated again and again
► CAT systems suggest translations based on exact match, exact context match, or fuzzy match (see the sketch after this list)
► CAT systems can automatically translate repeated texts
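As a rough illustration of fuzzy matching in a translation memory, the sketch below looks up the stored segment most similar to a new sentence using a character-based similarity ratio from Python's standard difflib module. The memory contents and the suggest function are invented for this example; commercial CAT tools use their own, more sophisticated fuzzy-match scoring.

```python
import difflib

# A toy translation memory: previously translated segments (made-up examples).
memory = {
    "The printer is out of paper.": "V tiskárně došel papír.",
    "Press the power button to restart the device.": "Stisknutím tlačítka napájení zařízení restartujete.",
    "The device does not respond.": "Zařízení nereaguje.",
}

def suggest(segment, memory, threshold=0.7):
    """Return (source, translation, score) of the best fuzzy match above the threshold, or None."""
    best = None
    for source, translation in memory.items():
        score = difflib.SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if best is None or score > best[2]:
            best = (source, translation, score)
    return best if best and best[2] >= threshold else None

match = suggest("The printer is out of toner.", memory)
if match:
    source, translation, score = match
    print(f"{score:.0%} match: '{source}' -> '{translation}'")   # suggests the "out of paper" segment
else:
    print("no match above threshold - translate from scratch")
```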
Questions I
► Enumerate at least 3 rule-based MT systems.
► What does the abbreviation FAHQMT mean?
► What does the IBM-2 model add to IBM-1?
► Explain the noisy channel principle with its formula.
► State at least 3 metrics for MT quality evaluation.
► State the types of translation according to R. Jakobson.
► What does the Sapir-Whorf hypothesis claim?
► Describe the Georgetown experiment (facts).
► State at least 3 examples of morphologically rich languages (from different language families).
► What is the advantage of systems with an interlingua over transfer systems? Draw a scheme of translations between 5 languages for these two types of systems.
► Give an example of a problematic string for tokenization (English, Czech).

Questions II
► What is a tagset, treebank, PoS tagging, WSD, FrameNet, gisting, sense granularity?
► What advantages does space-based meaning representation have?
► Which classes of WSD methods do we distinguish?
► Draw Vauquois' triangle with SMT IBM-1 in it.
► Explain the garden path phenomenon and come up with an example for Czech (or English) not used in the slides.
► Draw the dependency structure for the sentence "Máma vidí malou Emu." ("Mom sees little Ema.")
► Draw the scheme of SMT.
► Give at least 3 sources of parallel data.
► Explain Zipf's law.
► Explain (using an example) Bayes' rule (state its formula).
► What is the purpose of decoding algorithms?

Questions III
► Write down the formula for, or describe in words, the Markov assumption.
► Give at least 3 examples of frequent word trigrams and quadrigrams for Czech (English).
► Do we aim at low or high perplexity for language models?
► Describe the IBM models (1-5) briefly.
► Draw the word alignment matrix for the sentences "I am very hungry." and "Jsem velmi hladový."