PLIN009 - Machine translation
Automatic MT quality evaluation
Other MT topics
Vít Baisa

Motivation
► fluency - is the translation fluent, in a natural word order?
► adequacy - does the translation preserve the meaning, or does it change/skew it?
► intelligibility - do we understand the translation?

Evaluation scale
  adequacy            fluency
  5  all meaning      5  flawless English
  4  most meaning     4  good
  3  much meaning     3  non-native
  2  little meaning   2  disfluent
  1  no meaning       1  incomprehensible

Annotation tool
[Screenshot of a judging interface: "You have already judged 14 of 3064 sentences, taking 86.4 seconds per sentence."
 Source: les deux pays constituent plutôt un laboratoire nécessaire au fonctionnement interne de l'UE.
 Reference: rather, the two countries form a laboratory needed for the internal working of the EU.
 Two candidate translations ("both countries are rather a necessary laboratory the internal operation of the eu" and
 "both countries are a necessary laboratory at internal functioning of the eu") are each rated for adequacy and fluency on the 1-5 scales.]

► bigger IAA (inter-annotator agreement)
► time spent on post-editing: how much the cost of translation is reduced

Automatic translation evaluation
► advantages: speed, cost
► disadvantages: do we really measure the quality of the translation?
► gold standard: manually prepared reference translations
► candidate c is compared with n reference translations r_i
► the paradox of automatic evaluation: the task resembles a situation where students have to grade their own exam: how do they know where they made a mistake?
► various approaches: n-grams shared between c and r_i, edit distance, ...

Recall and precision on words
The simplest method of automatic evaluation.

  system A:  Israeli officials responsibility of airport safety
  reference: Israeli officials are responsible for airport security

► precision = correct / output-length = 3/6 = 50%
► recall = correct / reference-length = 3/7 = 43%
► f-score = (precision × recall) / ((precision + recall) / 2) = (.50 × .43) / ((.50 + .43) / 2) = 46%

Recall and precision - shortcomings

  system A:  Israeli officials responsibility of airport safety
  reference: Israeli officials are responsible for airport security
  system B:  airport security Israeli officials are responsible

  metrics     system A   system B
  precision   50%        100%
  recall      43%        100%
  f-score     46%        100%

It does not capture wrong word order.

BLEU
► the most famous (standard), the most used, the oldest (2001)
► IBM, author Papineni
► n-gram match between reference and candidate translations
► precision is calculated for 1-, 2-, 3- and 4-grams, plus a brevity penalty
► BLEU = min(1, output-length / reference-length) × (precision_1 × precision_2 × precision_3 × precision_4)^(1/4)

BLEU - an example

  system A:  Israeli officials | responsibility of | airport | safety
             (one 2-gram match, further 1-gram matches)
  reference: Israeli officials are responsible for airport security
  system B:  airport security | Israeli officials are responsible
             (one 2-gram match, one 4-gram match)

  metrics              system A   system B
  precision (1-gram)   3/6        6/6
  precision (2-gram)   1/5        4/5
  precision (3-gram)   0/4        2/4
  precision (4-gram)   0/3        1/3
  brevity penalty      6/7        6/7
  BLEU                 0%         52%
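To make the example above concrete, here is a minimal sketch of the BLEU computation as given on the slide: clipped n-gram precisions up to 4-grams, a single reference, no smoothing. The function names (ngrams, bleu) are illustrative, not from any particular toolkit; real implementations (e.g. in MT toolkits) add smoothing and multiple references.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """BLEU as on the slide: min(1, output-len / reference-len) * (prod of n-gram precisions)^(1/4)."""
    c, r = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(c, n), ngrams(r, n)
        overlap = sum((cand_ngrams & ref_ngrams).values())   # clipped n-gram matches
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if 0 in precisions:                                       # any zero precision -> BLEU = 0 (no smoothing)
        return 0.0
    brevity = min(1.0, len(c) / len(r))
    geo_mean = 1.0
    for p in precisions:
        geo_mean *= p
    return brevity * geo_mean ** (1.0 / max_n)

reference = "Israeli officials are responsible for airport security"
system_a  = "Israeli officials responsibility of airport safety"
system_b  = "airport security Israeli officials are responsible"

print(f"system A: {bleu(system_a, reference):.2f}")   # 0.00
print(f"system B: {bleu(system_b, reference):.2f}")   # about 0.52, matching the table
```

System A gets 0 because it has no 4-gram match, which reproduces the 0% vs. 52% contrast in the table above.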
Other metrics
► NIST
  ► NIST: National Institute of Standards and Technology
  ► weighted matches of n-grams (information value)
  ► very similar results to BLEU (a variant of it)
► NEVA
  ► Ngram EVAluation
  ► BLEU score adapted for short sentences
  ► takes synonyms into account (stylistic richness)
► WAFT
  ► Word Accuracy for Translation
  ► edit distance between c and r
  ► WAFT = 1 - edit-distance(c, r) / max(l_r, l_c)

Other metrics II
► TER
  ► Translation Edit Rate
  ► the fewest edit steps (deletion, insertion, swap, replacement)
  ► TER = number of edits / average number of reference words
  ► r = dnes jsem si při fotbalu zlomil kotník
  ► c = při fotbalu jsem si dnes zlomil kotník
    (both: "I broke my ankle today playing football", with different word order)
  ► TER = 4/7
► HTER
  ► Human TER
  ► r is prepared manually and then TER is applied
► METEOR
  ► takes into account synonyms (WordNet) and morphological variants of words
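Below is a minimal sketch of the edit-distance idea behind WAFT and TER: a word-level Levenshtein distance (insertions, deletions, substitutions only). Full TER additionally allows shifts of whole word blocks, which this sketch omits; the function name word_edit_distance is illustrative. On the Czech example from the slide it yields the quoted 4 edits over 7 reference words.

```python
def word_edit_distance(cand, ref):
    """Levenshtein distance over words (insertions, deletions, substitutions)."""
    c, r = cand.split(), ref.split()
    # dp[i][j] = edits needed to turn the first i words of c into the first j words of r
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i in range(len(c) + 1):
        dp[i][0] = i
    for j in range(len(r) + 1):
        dp[0][j] = j
    for i in range(1, len(c) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if c[i - 1] == r[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(c)][len(r)]

ref  = "dnes jsem si při fotbalu zlomil kotník"
cand = "při fotbalu jsem si dnes zlomil kotník"

edits = word_edit_distance(cand, ref)
ref_len, cand_len = len(ref.split()), len(cand.split())
print(f"edits = {edits}, TER-style score = {edits}/{ref_len} = {edits / ref_len:.2f}")  # 4/7 = 0.57
print(f"WAFT-style score = {1 - edits / max(ref_len, cand_len):.2f}")                   # 1 - 4/7 = 0.43
```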
Evaluation of evaluation metrics
Correlation of automatic evaluation with manual evaluation.

Translation evaluation example - EuroMatrix
[Table: EuroMatrix, a matrix of BLEU scores for translation between EU languages (Danish, German, Greek, English, Finnish, French, Italian, Spanish, ...), with the output language in columns; the individual values are not readable in this extraction.]
[Table: "Translation quality by language pairs", a larger matrix covering all official EU languages; values not readable in this extraction.]

Factored translation models
► common SMT models do not use linguistic knowledge
► usage of lemmas, PoS tags and stems helps to overcome data sparsity
► vectors of factors are translated instead of plain words (tokens)
[Figure: input and output factors - word, lemma, part-of-speech, morphology, word class.]

Factored translation models
► in standard SMT, dům and domy ("house" and "houses") are independent tokens
► in FTM they share the lemma, the PoS tag and part of the morphological information
► the lemma and the morphological information are translated separately
► in the target language, the appropriate word form is then generated
[Figure: the lemma and the part-of-speech/morphology factors are mapped separately from input to output; the output word form is generated from them.]
Implemented in Moses.

Tree-based translation models
► SMT translates word sequences
► many situations can be better explained with syntax: moving the verb around a sentence, grammatical agreement over long distances, ...
► translation models based on syntactic trees
► a current topic; for some language pairs it gives the best results

TBTM II - synchronous phrase grammar
► EN rule: NP → DET JJ NN
► DE rule: NP → DET NN JJ
► synchronous rule: NP → DET1 NN2 JJ3 | DET1 JJ3 NN2
► final rule: N → dům | house
► mixed rule: N → la maison JJ1 | the JJ1 house

Parallel treebank
[Figure: word-aligned sentence pair with PoS tags - "I shall be passing on to you some comments" (PRP MD ...) aligned with "Ich werde Ihnen die entsprechenden Anmerkungen aushändigen" (PPER VAFIN PPER ART ADJ NN VVFIN).]

Syntactic rules extraction
[Figure: the PoS-tagged English sentence "I shall be passing on to you some comments" with extracted subtrees, e.g. the phrase "to you" (TO PRP).]

Hybrid systems of machine translation
► combination of rule-based and statistical systems
► rule-based translation with post-editing by SMT (e.g. smoothing with a LM)
► data preparation for SMT based on rules, changing the output of SMT based on rules

Computer-aided translation
► CAT - computer-assisted (aided) translation
► out of scope of pure MT
► tools belonging to the CAT realm:
  ► spell checkers (typos): hunspell
  ► grammar checkers: Lingea Grammaticon
  ► terminology management: Trados TermBase
  ► electronic translation dictionaries: Metatrans
  ► corpus managers: Manatee/Bonito
  ► translation memories: MemoQ, Trados

Translation memory
► database of segments: titles, phrases, sentences, terms, paragraphs
► which have already been translated (manually): translation units
► advantages:
  ► everything is translated only once
  ► cost reduction (repeated translation of manuals)
► disadvantages:
  ► the majority of the best (biggest) systems are commercial
  ► translation units are hard to get
  ► an inappropriate translation is repeated again and again
► CAT systems suggest translations based on exact match, exact context match, or fuzzy match (see the sketch after this list)
► CAT systems can automatically translate repeated texts
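As a rough illustration of fuzzy matching in a translation memory, the sketch below looks up the stored segment most similar to a new sentence using a character-based similarity ratio from Python's standard difflib module. The memory contents and the suggest function are invented for this example; commercial CAT tools use their own, more sophisticated fuzzy-match scoring.

```python
import difflib

# A toy translation memory: previously translated segments (made-up examples).
memory = {
    "The printer is out of paper.": "V tiskárně došel papír.",
    "Press the power button to restart the device.": "Stisknutím tlačítka napájení zařízení restartujete.",
    "The device does not respond.": "Zařízení nereaguje.",
}

def suggest(segment, memory, threshold=0.7):
    """Return (source, translation, score) of the best fuzzy match above the threshold, or None."""
    best = None
    for source, translation in memory.items():
        score = difflib.SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if best is None or score > best[2]:
            best = (source, translation, score)
    return best if best and best[2] >= threshold else None

match = suggest("The printer is out of toner.", memory)
if match:
    source, translation, score = match
    print(f"{score:.0%} match: '{source}' -> '{translation}'")   # suggests the "out of paper" segment
else:
    print("no match above threshold - translate from scratch")
```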
Questions I
► Enumerate at least 3 rule-based MT systems.
► What does the abbreviation FAHQMT mean?
► What does the IBM-2 model add to IBM-1?
► Explain the noisy channel principle with its formula.
► State at least 3 metrics for MT quality evaluation.
► State the types of translation according to R. Jakobson.
► What does the Sapir-Whorf hypothesis claim?
► Describe the Georgetown experiment (facts).
► State at least 3 examples of morphologically rich languages (from different language families).
► What is the advantage of systems with an interlingua over transfer systems? Draw a scheme of translations between 5 languages for these two types of systems.
► Give an example of a problematic string for tokenization (English, Czech).

Questions II
► What is a tagset, treebank, PoS tagging, WSD, FrameNet, gisting, sense granularity?
► What advantages does space-based meaning representation have?
► Which classes of WSD methods do we distinguish?
► Draw Vauquois' triangle with SMT IBM-1 in it.
► Explain the garden path phenomenon and come up with an example for Czech (or English) not used in the slides.
► Draw the dependency structure for the sentence "Máma vidí malou Emu." ("Mom sees little Ema.")
► Draw the scheme of SMT.
► Give at least 3 sources of parallel data.
► Explain Zipf's law.
► Explain (using an example) Bayes' rule (state its formula).
► What is the purpose of decoding algorithms?

Questions III
► Write down the formula for, or describe in words, the Markov assumption.
► Give at least 3 examples of frequent word trigrams and quadrigrams for Czech (English).
► Do we aim at low or high perplexity for language models?
► Describe the IBM models (1-5) briefly.
► Draw the word alignment matrix for the sentences "I am very hungry." and "Jsem velmi hladový."