PA153 Vít Baisa

ENGLISH TO CZECH MT
Moses is an implementation of the statistical (or data-driven) approach to machine translation (MT). This is the dominant approach in the field at the moment and is employed by the online translation systems deployed by the likes of Google and Microsoft.
Three Czech renderings of the paragraph above, of varying quality:
• Mojžíš je implementace statistické (nebo řízené daty) přístupu k strojového překladu (MT). To je převládajícím přístupem v oblasti v současné době, a je zaměstnán pro on-line překladatelských systémů nasazených likes Google a Microsoft.
• Moses je implementace statistického (nebo daty řízeného) přístupu k strojovému překladu (MT). V současné době jde o převažující přístup v rámci strojového překladu, který je použit online překladovými systémy nasazenými Googlem a Microsoftem.
• Mojžíš je provádění statistické (nebo aktivovaný) přístup na strojový překlad (mt). To je dominantní přístup v oblasti v tuto chvíli, a zaměstnává on-line překlad systémů uskutečněné takové, Google a Microsoft.

QUESTIONS
• Is accurate translation possible at all?
• What is easier: to translate from or into your mother tongue?
• How do we know that a word w1 is equivalent to a word w2?
• English wind types: airstream, breeze, crosswind, dust devil, easterly, gale, gust, headwind, jet stream, mistral, monsoon, prevailing wind, sandstorm, sea breeze, sirocco, southwester, tailwind, tornado, trade wind, turbulence, twister, typhoon, whirlwind, wind, windstorm, zephyr

EXAMPLES OF HARD WORDS
• alkáč, večerníček, telka, čokl, buřt, knížečka, ČSSD... ?
• matka, macecha, mamka, máma, maminka, matička, máti, mama, mamča, mamina
• scvrnkls, nejneobhospo....nějšími
• Navajo Code: language as a cipher
• Leacock: Nonsense Novels (Literární poklesky)

MACHINE TRANSLATION
We consider only technical / specialized texts:
• web pages,
• technical manuals,
• scientific documents and papers,
• leaflets and catalogues,
• law texts and
• in general, texts from specific domains.
Nuances on the different language levels of artistic literature are out of scope of current MT systems.

MACHINE TRANSLATION: ISSUES
In practice, the output of MT is always revised; we distinguish pre-editing and post-editing. MT systems make different types of errors than human translators do.
These mistakes are characteristic of human translators:
• wrong prepositions (I am in school)
• missing determiners (I saw man)
• wrong tense (Viděl jsem: I was seeing), ...
For computers, errors in meaning are characteristic:
• Kiss me honey. → Polib mi med.

[Figure: a linguistically motivated taxonomy of MT errors: orthography (punctuation, capitalization, spelling), omission and addition (content word, function word), grammar (misselection of word class, verb tense and person, agreement in gender, number and person, contractions, misordering), semantics (confusion of senses, wrong choice, collocational errors, idioms) and discourse (style, variety, should not be translated)]
Costa, Angela, et al. "A linguistically motivated taxonomy for Machine Translation error analysis." Machine Translation 29.2 (2015): 127–161.

FREE WORD ORDER
The more morphologically rich a language is, the freer its word order tends to be.
Katka snědla kousek koláče.
The same sentence in Hungarian, in several word orders:
• Kati megevett egy szelet tortát.
• Egy szelet tortát Kati evett meg.
• Kati egy szelet tortát evett meg.
• Egy szelet tortát evett meg Kati.
• Megevett egy szelet tortát Kati.
• Megevett Kati egy szelet tortát.
English renderings of the six permutations:
• Katie eating a piece of cake
• Katie ate a piece of cake
• Katie ate a piece of cake
• Katie ate a piece of cake
• Katie eating a piece of cake
• Katie ate a piece of cake

FREE WORD ORDER IN CZECH
• Víš, že se z kávy vyrábí mouka?
• Víš, že se z kávy mouka vyrábí?
• Víš, že se mouka vyrábí z kávy?
• Víš, že se mouka z kávy vyrábí?
• Víš, že se vyrábí mouka z kávy?
• Víš, že se vyrábí z kávy mouka?
How do their meanings differ?

DIRECT METHODS FOR IMPROVING MT QUALITY
• limit the input to a:
■ sublanguage (indicative sentences)
■ domain (informatics)
■ document type (patents)
• text pre-processing (e.g. manual syntactic analysis)

CLASSIFICATION BASED ON APPROACH
• rule-based, knowledge-based (RBMT, KBMT)
■ transfer
■ with interlingua
• statistical machine translation (SMT)
• hybrid machine translation (HMT, HyTran)
• neural networks

VAUQUOIS' TRIANGLE
[Figure: Vauquois' triangle: analysis of the source text rises through the syntactic and semantic levels towards an interlingua, generation of the target text descends on the other side; direct translation crosses at the bottom, transfer in the middle]

MACHINE TRANSLATION NOWADAYS
• big companies (Microsoft) focused on English as SL
• large pairs (En:Sp, En:Fr): very good translation quality
• SMT enriched with syntax
• Google Translate as a gold standard
• morphologically rich languages neglected
• pairs with English on one side (En→X, X→En) prevail
• neural networks being deployed

MOTIVATION IN THE 21ST CENTURY
• translation of web pages for gisting (getting the main message)
• methods for substantially speeding up human translation (translation memories)
• cross-language extraction of facts and search for information
• instant translation of e-communication
• translation on mobile devices

RULE-BASED MT

RULE-BASED MACHINE TRANSLATION (RBMT)
• linguistic knowledge in the form of rules
• rules for analysis of the SL
• rules for transfer between the languages
• rules for generation/rendering/synthesis of the TL

KNOWLEDGE-BASED MACHINE TRANSLATION
• systems using linguistic knowledge about languages
• older types, a more general notion
• analysis of the meaning of the SL is crucial
• no total meaning (connotations, common sense): to translate vrána na stromě it is not necessary to know that a vrána (crow) is a bird and can fly
• the term KBMT is used rather for systems with an interlingua; for us KBMT = RBMT

KBMT CLASSIFICATION
• direct translation
• systems with interlingua
• transfer systems
The only types of MT until the 1990s.

DIRECT TRANSLATION
• the oldest systems
• one-step process: transfer
• Georgetown experiment, METEO
• interest dropped quickly

DIRECT TRANSLATION
• focus on correspondences between SL and TL elements
• first experiments on the En–Ru pair
• all components are bound to one language pair (and one direction)
• typically consists of:
■ a translation dictionary
■ a monolithic program dealing with analysis and generation
• necessarily one-directional and bilingual
• efficacy: for N languages we need ? systems

MT WITH INTERLINGUA
• we suppose it is possible to convert the SL to a language-independent representation
• the interlingua (IL) must be unambiguous
• two steps: analysis & synthesis (generation)
• from the IL, the TL is generated
• analysis is SL-dependent but TL-independent, and vice versa for synthesis
• for translation among N languages, only 2N modules (N analysers + N generators) are needed

KBMT-89
[Figure: architecture of the KBMT-89 system: an analyzer (syntactic parser, mapping rules, augmentor, automatic and interactive analysis, analysis grammars and lexicons), a concept lexicon with ontology/domain acquisition and grammar-writing tools, structural meaning representations (MRs), and a generator (lexical selection, syntactic selection, syntactic generator, generation grammars and lexicons)]
Nirenburg, Sergei. "Knowledge-based machine translation." Machine Translation 4.1 (1989): 5–24.

TRANSFER TRANSLATION
• analysis up to a certain level
• transfer rules: SL forms → TL forms
• not necessarily between the same levels; usually on the syntactic level
• context constraints (not available in direct translation)
• the distinction IL vs. transfer is blurred
• three-step translation

INTERLINGUA VS. TRANSFER

SOURCE LANGUAGE ANALYSIS

TOKENIZATION
• first level in Vauquois' triangle
• input text → tokens (words, numbers, punctuation)
• token = a sequence of non-white characters
• output = a list of tokens
• input for further processing

OBSTACLES OF TOKENIZATION
• don't: do n't, don 't, don't, ?
• červeno-černý: červeno - černý, červeno-černý, červeno- černý

SCRIPTIO CONTINUA
[Example: a text in a script written without spaces between words (scriptio continua)]
What is a word?

TOKENIZATION
• in most cases a heuristic is used
• alphabetic writing systems: split on spaces and on other punctuation marks ?!.,-()/:;
• demo: unitok.py

SENTENCE SEGMENTATION
• MT almost always works with sentences
• 90% of periods are sentence boundary indicators (Riley 1989)
• using a list of punctuation (!?.<>)
• Měl jsem 5 (sic!) poznámek.
• exceptions:
■ abbreviations (aj. atd. etc. e.g.)
■ degrees (RNDr., prof.)
• HTML elements might be used (p, div, td, li)
• demo: tag_sentences
• paper on tokenization

OBSTACLES OF SENTENCE SEGMENTATION
• Zeleninu jako rajče, mrkev atd. Petr nemá rád.
• Složil zkoušku a získal titul Mgr. Petr mu dost záviděl.
• John F. Kennedy = one token? John F. Kennedy's
• related to named entity recognition
• a neglected step in the processing pipeline (DCEP, EUR-Lex)

MORPHOLOGICAL LEVEL

MORPHOLOGY
• morpheme: the smallest item carrying a meaning
• pří-lež-it-ost-n-ým-i
• prefix-root-infix-suffix-suffix-suffix-affix
• case, number, gender, lemma, affix

MORPHOLOGICAL LEVEL
• second level in Vauquois' triangle
• reducing the immense number of wordforms
• demo: lexicon sizes of various corpora
• conversion from wordforms to lemmata:
■ give, gives, gave, given, giving → give
■ dělá, dělám, dělal, dělaje, dělejme, ... → dělat
• analysis of grammatical categories of wordforms:
■ dělali → dělat + past tense + continuous + plural + 3rd person
■ did → do + past tense + perfective + person ? + number ?
■ Robertovým → Robert + case ? + adjective + number ?
• demo: wwwajka

MORPHOLOGICAL ANALYSIS
• for each token we get a base form, grammatical categories and a segmentation into morphemes
• What is a base form? A lemma.
• nouns: singular, nominative, positive, masculine
■ bycha → bych?, nejpomalejšími → pomalý, neschopný → schopný?, mimochodem → mimochod
• verbs: infinitive
■ neraď → radit?, bojím se → bát (se)
• Why the infinitive? It is the most frequent form of verbs.
• example

MORPHOLOGICAL TAGS, TAGSETS
• language-dependent (various morphological categories)
• attribute system: category–value pairs
■ maminkou → k1gFnSc7
■ udělány → k5eAaPmNgFnP
• positional system: 16 fixed positions
■ kontury → NNFP1-----A----
■ zdají → VB-P---3P-AA---
• Penn Treebank tagset (English): a limited set of tags
■ faster → RBR, doing → VBG
• CLAWS tagset (English) and others (German):
■ gigantische → ADJA.ADJA.Pos.Acc.Sg.Fem
■ erreicht → VPP.VPP.Full.Psp

MORPHOLOGICAL POLYSEMY
• in many cases words have more than one tag
• PoS polysemy (>1 lemma), in Czech:
■ jednou → k4gFnSc7, k6eAd1, k9
■ ženu → k1gFnSc4, k5eAaImIp1nS
■ k1 + k2, k3 + k5? what about English?
• demo: SkELL auto
• polysemy within a PoS, in Czech: nominative = accusative
■ víno → k1gNnSc1, k1gNnSc4, ...
■ odhalení → 10 tags
MORPHOLOGICAL DISAMBIGUATION
• for each word: one tag and one lemma
• morphological disambiguation; the tool: a tagger
• translational polysemy is another issue: pubblico → Öffentlichkeit, Publikum, Zuschauer
• most methods use context, i.e. the surrounding words, lemmas and tags

STATISTICAL DISAMBIGUATION
• find the most probable sequence of tags (a toy sketch of this idea appears a few slides below)
• Ženu je domů. → ženu: k5 | k1, je: k3 | k5, domů: k6 | k1
• Mladé muže → gF | gM, nS | nP
• there are tough situations: dítě škádlí lvíče
• machine learning trained on manually tagged/disambiguated data
• Brill's tagger, TreeTagger, FreeLing, RFTagger
• demo for Czech: Desamb (hybrid, DESAM)

RULE-BASED DISAMBIGUATION
• the only option if an annotated corpus is not available
• also used as a filter before a statistical method
• rules help to capture a wider context
• case, number and gender agreement in noun phrases: malému (c3, gIMN) chlapci (nPc57, nSc36, gM)
• a more mature approach: valency structure of sentences
• valency: vidět koho/co (vidím dům → c4); to give OBJ to DIROBJ (I gave the present to her → DIROBJ)
• VerbaLex, PDEV

GUESSER
• we aim at high coverage: as many words as possible
• for out-of-vocabulary (OOV) tokens
• new, borrowed, compound words
• stemming, guessing the PoS from the word suffix
• vygooglit, olajkovat, zaxzovat
• sedm dunhillek
• třitisícedvěstědevadesátpět znaků
• funny errors: Matka bozit, topit box

MORPHOLOGICAL DISAMBIGUATION—EXAMPLE
Pravidelné krmení je pro správný růst důležité.
word | analyses | disambiguated tag
• Pravidelné | k2eAgMnPc4d1, k2eAgInPc1d1, k2eAgInPc4d1, k2eAgInPc5d1, k2eAgFnSc2d1, k2eAgFnSc3d1, k2eAgFnSc6d1, k2eAgFnPc1d1, k2eAgFnPc4d1, k2eAgFnPc5d1, k2eAgNnSc1d1, k2eAgNnSc4d1, k2eAgNnSc5d1, ... (+5) | k2eAgNnSc1d1
• krmení | k2eAgMnPc1d1, k2eAgMnPc5d1, k1gNnSc1, k1gNnSc4, k1gNnSc5, k1gNnSc6, k1gNnSc3, k1gNnSc2, k1gNnPc2, k1gNnPc1, k1gNnPc4, k1gNnPc5 | k1gNnSc1
• je | k5eAaImIp3nS, k3p3gMnPc4, k3p3gInPc4, k3p3gNnSc4, k3p3gNnPc4, k3p3gFnPc4, k0 | k5eAaImIp3nS
• pro | k7c4 | k7c4
• správný | k2eAgMnSc1d1, k2eAgMnSc5d1, k2eAgInSc1d1, k2eAgInSc4d1, k2eAgInSc5d1, ... (+18) | k2eAgInSc4d1
• růst | k5eAaImF, k1gInSc1, k1gInSc4 | k1gInSc4
• důležité | k2eAgMnPc4d1, k2eAgInPc1d1, k2eAgInPc4d1, k2eAgInPc5d1, k2eAgFnSc2d1, k2eAgFnSc3d1, k2eAgFnSc6d1, k2eAgFnPc1d1, k2eAgFnPc4d1, k2eAgFnPc5d1, k2eAgNnSc1d1, k2eAgNnSc4d1, k2eAgNnSc5d1, ... (+5) | k2eAgNnSc1d1

PROBLEMS WITH PoS TAGGING
• the quality of MA affects all further levels of analysis
• quality depends on the language (English vs. Hungarian)
• chončaam: my small house (Tajik)
• kahramoni: you are a hero (Tajik)
• legeslegmagasabb: the very highest (Hungarian)
• raněný: SUBS/ADJ
• the big red fire truck: SUBS / ADJ?
• The Duchess was entertaining last night.
• Pokojem se neslo tiché pšššš.

MORPHOLOGY—SUMMARY
• errors in MA are critical, as they propagate into all further analysis
• the goal is to reduce the immense number of wordforms
• wordform → lemma + tag
• much simpler for English (ca. 35 tags)
• PoS tagging accuracy depends on the language, usually around 95%

LEXICAL LEVEL: DICTIONARIES

DICTIONARIES IN MT I
• the connection between languages
• transfer systems: syntactic-level dictionaries
• crucial for KBMT systems
• GNU-FDL slovník, Wiktionary

DICTIONARIES IN MT II
• how many items in a dictionary do we need / want? → named entities, slang, MWEs
• listeme: a lexical item whose meaning cannot be deduced from the principle of compositionality (slaměný vdovec)
• which form in a dictionary? → lemmatization
• how many different senses is it reasonable to distinguish? → granularity

POLYSEMY IN DICTIONARIES
• words relate to senses
• what is the meaning of meaning? we need a formal definition for computers
• data is discrete, meaning is continuous
• man: an adult male person; what about a 17-year-old male person?
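Returning to the statistical disambiguation slides above ("the most probable sequence of tags"), here is a minimal, self-contained sketch of that idea for the sentence Ženu je domů. All transition and emission probabilities are invented toy numbers, not estimates from any corpus, and the tagset is reduced to four coarse classes.

    # Toy Viterbi disambiguation of "Ženu je domů." (k1 noun, k3 pronoun,
    # k5 verb, k6 adverb). All probabilities are made up for illustration.
    from math import log

    TAGS = ["k1", "k3", "k5", "k6"]

    # P(tag | previous tag); "<s>" marks the sentence start
    trans = {
        ("<s>", "k5"): 0.4, ("<s>", "k1"): 0.3, ("<s>", "k3"): 0.2, ("<s>", "k6"): 0.1,
        ("k5", "k3"): 0.5, ("k5", "k1"): 0.2, ("k5", "k6"): 0.2, ("k5", "k5"): 0.1,
        ("k1", "k5"): 0.3, ("k1", "k1"): 0.3, ("k1", "k3"): 0.2, ("k1", "k6"): 0.2,
        ("k3", "k5"): 0.4, ("k3", "k6"): 0.3, ("k3", "k3"): 0.2, ("k3", "k1"): 0.1,
        ("k6", "k1"): 0.4, ("k6", "k5"): 0.3, ("k6", "k6"): 0.2, ("k6", "k3"): 0.1,
    }

    # P(word | tag) for the ambiguous readings listed on the slide
    emit = {
        ("ženu", "k5"): 0.6, ("ženu", "k1"): 0.4,   # "I drive" vs. "woman (acc.)"
        ("je", "k5"): 0.7, ("je", "k3"): 0.3,        # "is" vs. "them"
        ("domů", "k6"): 0.8, ("domů", "k1"): 0.2,    # "home" vs. "of houses"
    }

    def viterbi(words):
        """Most probable tag sequence under the toy bigram model."""
        trellis = {"<s>": (0.0, [])}                 # tag -> (log prob, best path)
        for w in words:
            new = {}
            for t in TAGS:
                e = emit.get((w, t), 1e-4)           # tiny floor for unseen pairs
                new[t] = max((lp + log(trans.get((prev, t), 1e-2) * e), path + [t])
                             for prev, (lp, path) in trellis.items())
            trellis = new
        return max(trellis.values())[1]

    print(viterbi(["ženu", "je", "domů"]))           # ['k5', 'k3', 'k6'] with these numbers

Real taggers estimate the probabilities from a manually disambiguated corpus (DESAM for Czech) and use a far richer tagset, but the search over tag sequences is the same.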
SMOOTH SENSE TRANSITIONS
[Figure: images showing a smooth visual transition between two objects, e.g. from a dog to a chair, illustrating that sense boundaries are not sharp]

POLYSEMY ON SEVERAL LEVELS
• morphology: -s
• word level: key
• multiword expressions: bílá vrána
• sentence level: I saw a man with a telescope.
• homonymy: accidental
■ full homonymy: líčit, kolej
■ partial homonymy: los, stát
• polysemy is natural and ubiquitous

MEANING REPRESENTATION
• list: a common dictionary
• graph: senses = vertices, semantic relations = edges
• space: senses = dots, similarity = distance
[Figure: a two-dimensional word space with clusters around BANK (reserve, federal, money, loans, commercial, deposits), FIELD (stream, river, deep, meadow, woods) and OIL (gasoline, petroleum, crude, drill)]

WORD SENSE DISAMBIGUATION
• finding the proper sense of a word in a given context
• trivial for humans, very hard for computers
• we need a finite inventory of senses
• accuracy about 90%
• a crucial task for KBMT: Ludvig dodávka Beethoven, kiss me honey, ..., box in the pen (Bar-Hillel)
• granularity affects the quality of WSD

SYNTACTIC LEVEL (Miloš and Vojta)

SEMANTIC LEVEL / ANALYSIS (Zuzka Nevěřilová, Adam Rambousek)

TECTOMT
• PDT formalism, high modularity
• splitting tasks into a sequence of blocks: scenarios
• blocks are Perl scripts communicating via an API
• blocks allow massive data processing and parallelisation
• rule-based, statistical and hybrid methods
• processing: conversion to the tmt format → application of a scenario → conversion to an output format

TECTOMT: A SIMPLE BLOCK
A block that marks English negative particles as auxiliary attributes of their parent verbs:

    sub process_document {
        my ($self, $document) = @_;
        foreach my $bundle ($document->get_bundles()) {
            my $a_root = $bundle->get_tree('SEnglishA');
            foreach my $a_node ($a_root->get_descendants) {
                my ($eff_parent) = $a_node->get_eff_parents;
                # a negative particle whose effective parent is a verb
                if ($a_node->get_attr('m/lemma') =~ /^(not|n't)$/
                    and $eff_parent->get_attr('m/tag') =~ /^V/) {
                    $a_node->set_attr('is_aux_to_parent', 1);
                }
            }
        }
    }

RULE-BASED SYSTEMS: CONCLUSION
• (purely) rule-based systems are not used anymore
• statistical systems achieve better results
• still, some methods from RBMT may improve SMT

STATISTICAL MACHINE TRANSLATION

INTRODUCTION
• rule-based systems were motivated by linguistic theories; SMT is inspired by information theory and statistics
• Google, IBM, Microsoft develop SMT systems
• millions of webpages are translated with SMT daily
• gisting: we don't need an exact translation, sometimes the gist of a text is enough (one of the most frequent uses of SMT)
• SMT in computer-assisted translation (CAT)
• trending right now: neural network models for MT
• the data-driven approach has proved more viable than RBMT

SMT SCHEME
[Figure: the classic SMT scheme: statistical analysis of a Spanish/English bilingual text yields a translation model, statistical analysis of English text yields a language model; a decoding algorithm searches for argmax_e p(e) · p(s|e), turning Spanish input through "broken English" into an English translation]

PARALLEL CORPORA I
• the basic data source for SMT
• available sources: ~10–100 M words
• size depends heavily on the language pair
• multilingual webpages (online newspapers)
• paragraph and sentence alignment needed

PARALLEL CORPORA II
• Europarl: 11 languages, 40 M words
• OPUS: parallel texts of various origin, open subtitles, UI localizations
• Acquis Communautaire: law documents of the EU (20 languages)
• Hansards: 1.3 M pairs of text chunks from the official records of the Canadian Parliament
• EUR-Lex
• comparable corpora ...

SENTENCE ALIGNMENT
• sometimes sentences are not in a 1:1 ratio
• Church–Gale alignment; tool: hunalign
• alignment probabilities:
  P       alignment
  0.89    1:1
  0.0099  1:0, 0:1
  0.089   2:1, 1:2
  0.011   2:2

SMT: NOISY CHANNEL PRINCIPLE
Claude Shannon (1948): self-correcting codes transferred through noisy channels, reconstructed from information about the original data and about the errors made in the channel. Used for MT, ASR, OCR.
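Written out for MT (this compact formulation is standard and is implied by the SMT scheme above, though not spelled out on the slide): the best translation of a source sentence f is

\[
\hat{e} = \arg\max_e p(e \mid f)
        = \arg\max_e \frac{p(f \mid e)\, p(e)}{p(f)}
        = \arg\max_e \underbrace{p(f \mid e)}_{\text{translation model}}\ \underbrace{p(e)}_{\text{language model}},
\]

since p(f) is constant for a given input.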
Optical Character Recognition is erroneous, but with a language model we can estimate what was damaged in a text; a language model ranks the alternatives by probability, e.g. pečivo > zákusek > mléko > babičku.

CHOMSKY WAS WRONG
• Colorless green ideas sleep furiously vs. Furiously sleep ideas green colorless
• an LM assigns a higher p to the first! (Mikolov, 2012)

GENERATING RANDOM TEXT
• To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have Every enter now severally so, let. (unigrams)
• Sweet prince, Falstaff shall die. Harry of Monmouth's grave. This shall forbid it should be branded, if renown made it empty. (trigrams)
Can you guess the author of the original text?

CBLM

MAXIMUM LIKELIHOOD ESTIMATION

p(w_3 \mid w_1, w_2) = \frac{count(w_1, w_2, w_3)}{count(w_1, w_2)}

(the, green, *): 1,748× in Europarl
  w      count  p(w)
  paper  801    0.458
  group  640    0.367
  light  110    0.063
  party  27     0.015
  ecu    21     0.012

LM QUALITY
We need to compare the quality of various LMs. Two approaches: extrinsic and intrinsic evaluation. A good LM should assign a higher probability to a good(-looking) text than to an incorrect text. For a fixed test text we can compare various LMs.

ENTROPY
• Shannon, 1949
• the expected value (average) of the information contained in a message
• information viewed as the negative of the logarithm of the probability distribution
• events that always occur do not communicate information
• pure randomness has the highest entropy (uniform distribution: log_2 n)

PERPLEXITY

PP = 2^{H(p_{LM})}, \qquad PP(W) = p(w_1 w_2 \ldots w_n)^{-\frac{1}{n}}

A good LM should not waste probability mass on improbable phenomena. The lower the entropy, the better → the lower the perplexity, the better. Maximizing the probability of the test data = minimizing perplexity.

WHAT INFLUENCES LM QUALITY?
• size of the training data
• order of the language model
• smoothing, interpolation, back-off

LARGE LM – N-GRAM COUNTS
How many unique n-grams are in a corpus?
  order    types       singletons         %
  unigram  86,700      33,447         (38.6%)
  bigram   1,948,935   1,132,844      (58.1%)
  trigram  8,092,798   6,022,286      (74.4%)
  4-gram   15,303,847  13,081,621     (85.5%)
  5-gram   19,882,175  18,324,577     (92.2%)
Taken from Europarl with 30 million tokens.

ZERO FREQUENCY, OOV, RARE WORDS
• probability must always be non-zero
• otherwise perplexity cannot be measured
• maximum likelihood estimation is bad at this
• training data: work on Tuesday/Friday/Wednesday
• test data: work on Sunday → p(Sunday | work on) = 0

EXTRINSIC EVALUATION: SENTENCE COMPLETION
• Microsoft Research Sentence Completion Challenge
• evaluation of language models where perplexity is not available
• from five Holmes novels
• training data: Project Gutenberg
  Model            Accuracy
  Human            90
  smoothed 3-gram  36
  smoothed 4-gram  39
  RNN              59
  RNN (LSTM)       69

SENTENCE COMPLETION
• The stage lost a fine XXX, even as science lost an acute reasoner, when he became a specialist in crime. a) linguist b) hunter c) actor d) estate e) horseman
• What passion of hatred can it be which leads a man to XXX in such a place at such a time. a) lurk b) dine c) luxuriate d) grow e) wiggle
• My heart is already XXX since I have confided my trouble to you. a) falling b) distressed c) soaring d) lightened e) punished
• My morning's work has not been XXX, since it has proved that he has the very strongest motives for standing in the way of anything of the sort. a) invisible b) neglected c) overlooked d) wasted e) deliberate
• That is his XXX fault, but on the whole he's a good worker. a) main b) successful c) mother's d) generous e) favourite

NEURAL NETWORK LANGUAGE MODELS
• an old approach (1940s)
• only recently applied successfully to LM
• 2003 Bengio et al. (feed-forward NNLM)
• 2012 Mikolov (RNN)
• trending right now
• key concept: distributed representations of words
• 1-of-V (one-hot) representation

RECURRENT NEURAL NETWORK
• Tomáš Mikolov (VUT)
• the hidden layer feeds itself
• shown to beat n-grams by a large margin
  Model                                            params [B]  time [h]  CPUs  perplexity
  Interpolated KN 5-gram, 1.1B n-grams (KN)        1.76        3         100   67.6
  Katz 5-gram, 1.1B n-grams                        1.74        2         100   79.9
  Stupid Backoff 5-gram (SBO)                      1.13        0.4       200   87.9
  Interpolated KN 5-gram, 15M n-grams              0.03        3         100   243.2
  Katz 5-gram, 15M n-grams                         0.03        2         100   127.5
  Binary MaxEnt 5-gram (n-gram features)           1.13        1         5000  115.4
  Binary MaxEnt 5-gram (n-gram + skip-1 features)  1.8         1.25      5000  107.1
  Hierarchical Softmax MaxEnt 4-gram (HME)         6           3         1     101.3
  Recurrent NN-256 + MaxEnt 9-gram                 20          60        24    58.3
  Recurrent NN-512 + MaxEnt 9-gram                 20          120       24    54.5
  Recurrent NN-1024 + MaxEnt 9-gram                20          240       24    51.3

WORD EMBEDDINGS
• distributional semantics with vectors
• skip-gram, CBOW (continuous bag-of-words)
[Figure: CBOW and skip-gram architectures: input, projection and output layers over the context words w(t−2), w(t−1), w(t+1), w(t+2)]
[Figure: vector offsets capture relations such as MAN→WOMAN, UNCLE→AUNT, KING→QUEEN]
  Expression                               Nearest token
  Paris − France + Italy                   Rome
  bigger − big + cold                      colder
  sushi − Japan + Germany                  bratwurst
  Cu − copper + gold                       Au
  Windows − Microsoft + Google             Android
  Montreal Canadiens − Montreal + Toronto  Toronto Maple Leafs
[Figure: a 2-D projection of country and capital-city embeddings, with roughly parallel offsets for pairs such as China–Beijing, Russia–Moscow, Japan–Tokyo, Turkey–Ankara, Poland–Warsaw, Germany–Berlin, France–Paris, Italy–Rome, Greece–Athens, Spain–Madrid, Portugal–Lisbon]

EMBEDDINGS IN MT
[Figure: embeddings of English animal words (cat, horse, cow, pig, dog) and of their Spanish counterparts (gato, caballo, vaca, cerdo, perro) occupy similar relative positions in the two vector spaces]

LONG SHORT-TERM MEMORY
• an RNN model that can learn to memorize and learn to forget
• beats plain RNNs in sequence learning

LSTM TRANSLATION MODELS

LEXICAL TRANSLATION
A standard lexicon does not contain information about the frequency of the translations of the individual meanings of words.
• key → klíč, tónina, klávesa
How often are the individual translations actually used?
• key → klíč (0.7), tónina (0.18), klávesa (0.11)
A probability distribution p_f:

\sum_e p_f(e) = 1, \qquad \forall e: 0 \le p_f(e) \le 1

EM ALGORITHM - INITIALIZATION
[Figure: the initial state on the toy corpus "la maison / the house", "la maison bleu / the blue house", "la fleur / the flower": every French word is aligned to every English word of its sentence with equal probability]

EM ALGORITHM - FINAL PHASE
[Figure: after convergence the alignments have sharpened; p(la|the) = 0.453, p(le|the) = 0.334, p(maison|house) = 0.876, p(bleu|blue) = 0.563]
(A toy implementation of this EM loop appears a few slides below.)

IBM MODELS
IBM-1 does not take context into account and cannot add or skip words. Each of the following models adds something to the previous one.
• IBM-1: lexical translation
• IBM-2: + absolute alignment model
• IBM-3: + fertility model
• IBM-4: + relative alignment model
• IBM-5: + further tuning

WORD ALIGNMENT MATRIX

WORD ALIGNMENT ISSUES

PHRASE-BASED TRANSLATION MODEL
natürlich hat john spass am spiel → of course john has fun with the game
Phrases are not linguistically but statistically motivated. German am is seldom translated as a single English word. Cf. (fun (with (the game))).

ADVANTAGES OF PBTM
• translating n:m words
• a word is not a suitable unit of translation for many language pairs
• models learn to translate longer phrases
• simpler: no fertility, no NULL token etc.
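As promised above, before moving on to the phrase-based model formulas, here is a minimal sketch of the IBM Model 1 EM loop on the toy la maison / the house corpus. The corpus, the number of iterations and the omission of the NULL token are simplifications for illustration, so the resulting numbers will not match the figures quoted on the slide.

    # IBM Model 1: estimate lexical translation probabilities t(f|e) with EM.
    from collections import defaultdict
    from itertools import product

    corpus = [
        ("the house".split(),      "la maison".split()),
        ("the blue house".split(), "la maison bleu".split()),
        ("the flower".split(),     "la fleur".split()),
    ]
    e_vocab = {e for es, _ in corpus for e in es}
    f_vocab = {f for _, fs in corpus for f in fs}

    # uniform initialization: every f is equally likely for every e
    t = {(f, e): 1.0 / len(f_vocab) for f, e in product(f_vocab, e_vocab)}

    for _ in range(10):                      # a handful of EM iterations suffice here
        count = defaultdict(float)           # expected counts c(f, e)
        total = defaultdict(float)           # expected counts c(e)
        for es, fs in corpus:                # E-step: fractional alignment counts
            for f in fs:
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for f, e in t:                       # M-step: renormalize (unseen pairs drop to 0)
            t[(f, e)] = count[(f, e)] / total[e]

    for e in ("the", "house", "blue", "flower"):
        best = sorted(((t[(f, e)], f) for f in f_vocab), reverse=True)[:2]
        print(e, "->", [(f, round(p, 2)) for p, f in best])

After a few iterations t(la|the) and t(maison|house) come out on top of their rows, which is exactly the sharpening shown in the EM figures above.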
PHRASE-BASED MODEL
The translation probability p(f|e) is decomposed into phrases:

p(\bar{f}_1^I \mid \bar{e}_1^I) = \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, d(start_i - end_{i-1} - 1)

The sentence f is split into I phrases \bar{f}_i; all segmentations are considered equally probable. The function \phi is the phrase translation probability. The function d is a distance-based reordering model; start_i is the position of the first word of the phrase of sentence f that is translated into the i-th phrase of sentence e, and end_{i-1} the position of the last word of the phrase translated into the (i−1)-th phrase.

PHRASE EXTRACTION
[Figure: a word alignment matrix between the German sentence "michael geht davon aus , dass er im haus bleibt" and the English sentence "michael assumes that he will stay in the house", from which consistent phrase pairs are extracted]

EXTRACTED PHRASES
  English phrase        German phrase(s)
  michael               michael
  assumes               geht davon aus / geht davon aus ,
  that                  dass / , dass
  he                    er
  will stay             bleibt
  in the                im
  house                 haus
  michael assumes       michael geht davon aus / michael geht davon aus ,
  assumes that          geht davon aus , dass
  assumes that he       geht davon aus , dass er
  that he               dass er / , dass er
  in the house          im haus
  michael assumes that  michael geht davon aus , dass

PHRASE-BASED MODEL OF SMT

e^{*} = \arg\max_e \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, d(start_i - end_{i-1} - 1) \prod_{i=1}^{|e|} p_{LM}(e_i \mid e_1 \ldots e_{i-1})

DECODING
Given a language model p_LM and a translation model p(f|e), we need to find the translation with the highest probability out of an exponential number of possible translations. Heuristic search methods are used, so finding the best translation is not guaranteed. Errors in translations are caused by 1) the decoding process, when the best translation is not found owing to the heuristics, or 2) the models, where the best translation according to the probability functions is not the best possible one.

EXAMPLE OF NOISE-INDUCED ERRORS (GOOGLE TRANSLATE)
• Rinneadh claruchan an usaideora yxca eiteach go rathuil. → The user registration yxc made a successful rejection.
• Rinneadh claruchan an usaideora qqq a eiteach go rathuil. → Qqq made registration a user successfully refused.

PHRASE-WISE SENTENCE TRANSLATION
er geht ja nicht nach hause → he does not go home
In each step of the translation we compute preliminary probability values from the translation, reordering and language models.

SEARCH SPACE OF TRANSLATION HYPOTHESES
[Figure: translation options for "er geht ja nicht nach hause": each German word or phrase has many English candidates (he / it / , it / he will be; goes / go / is; yes / is / , of course; not / do not / does not / is not; after / to / according to / in; house / home / chamber / at home / return home; ...)]
An exponential space of all possible translations → we must limit this space!

HYPOTHESIS CONSTRUCTION, BEAM SEARCH

BEAM SEARCH
• breadth-first search; on each level of the tree, generate all children of the nodes on that level and sort them according to various heuristics
• store only a limited number of the best states on each level (the beam width); only these states are investigated further
• the wider the beam, the fewer children are pruned; with unlimited width it becomes plain breadth-first search
• the width correlates with memory consumption
• the best final state might not be found, since it can be pruned

NEURAL NETWORK MACHINE TRANSLATION
• very close to the state of the art (PBSMT)
• a problem: variable-length input and output
• learning to translate and align at the same time
• LISA
• a hot topic (2014, 2015)

NN MODELS IN MT
[Figure: ways of using neural networks in MT: a neural network as a component feeding an SMT system (Schwenk et al. 2006), a neural network integrated into the SMT system (Devlin et al. 2014), and end-to-end neural MT: source sentence → neural network → target sentence]
SUMMARY VECTOR FOR SENTENCES
[Figure: 2-D projections of sentence vectors: paraphrases such as "Mary admires John" / "Mary is in love with John" cluster together and apart from "John admires Mary" / "John is in love with Mary"; likewise "She gave me a card in the garden" and its passive or reordered variants cluster by meaning rather than by surface word order]

BIDIRECTIONAL RNN
[Figure: a bidirectional RNN encoder reading the sentence e = (Economic, growth, has, slowed, down, in, recent, years, .) in both directions]

ATTENTION MECHANISM
A neural network with a single hidden layer and a single scalar output.
f = (La, croissance, économique, s'est, ralentie, ces, dernières, années, .)
e = (Economic, growth, has, slowed, down, in, recent, years, .)

ALIGNMENT FROM ATTENTION
[Figure: attention weights form soft word alignments between "Economic growth has slowed down in recent years ." and its German translation "Das Wirtschaftswachstum hat sich in den letzten Jahren verlangsamt ." and its French translation "La croissance économique s'est ralentie ces dernières années ."]
...more details here

ALIGNMENT WITH CO-OCCURRENCE STATISTICS
Dice coefficient:

D = \frac{2 f_{xy}}{f_x + f_y}, \qquad logDice = 14 + \log_2 D

• biterms in SkE (Sketch Engine)

MT QUALITY EVALUATION

OTHER MINOR TOPICS

MOTIVATION FOR MT EVALUATION
• fluency: is the translation fluent, in a natural word order?
• adequacy: does the translation preserve the meaning?
• intelligibility: do we understand the translation?

EVALUATION SCALE
  adequacy           fluency
  5  all meaning     5  flawless English
  4  most meaning    4  good
  3  much meaning    3  non-native
  2  little meaning  2  disfluent
  1  no meaning      1  incomprehensible

DISADVANTAGES OF MANUAL EVALUATION
• slow, expensive, subjective
• inter-annotator agreement (IAA) shows people agree more on fluency than on adequacy
• another option: is X better than Y? → higher IAA
• or the time spent on post-editing
• or how much the cost of translation is reduced

AUTOMATIC TRANSLATION EVALUATION
• advantages: speed, cost
• disadvantages: do we really measure the quality of translation?
• gold standard: manually prepared reference translations
• a candidate c is compared with n reference translations r_i
• the paradox of automatic evaluation: the task corresponds to a situation where students assess their own exam; how do they know where they made a mistake?
• various approaches: n-grams shared between c and r_i, edit distance, ...

RECALL AND PRECISION ON WORDS
system A:  Israeli officials responsibility of airport safety
reference: Israeli officials are responsible for airport security

precision = correct / output-length = 3/6 = 50%
recall    = correct / reference-length = 3/7 = 43%
f-score   = 2 × (precision × recall) / (precision + recall) = 2 × (0.5 × 0.43) / (0.5 + 0.43) ≈ 46%

RECALL AND PRECISION: SHORTCOMINGS
system A:  Israeli officials responsibility of airport safety
reference: Israeli officials are responsible for airport security
system B:  airport security Israeli officials are responsible

  metric     system A  system B
  precision  50%       100%
  recall     43%       100%
  f-score    46%       100%

This does not capture wrong word order.
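A minimal sketch of the word-level computation above (bag-of-words overlap with clipped counts; the helper name word_prf is ours):

    # Word-level precision / recall / f-score of a candidate against a reference.
    from collections import Counter

    def word_prf(candidate, reference):
        c, r = candidate.split(), reference.split()
        overlap = Counter(c) & Counter(r)        # word matches, clipped by counts
        correct = sum(overlap.values())
        precision = correct / len(c)
        recall = correct / len(r)
        f = 2 * precision * recall / (precision + recall) if correct else 0.0
        return precision, recall, f

    ref   = "Israeli officials are responsible for airport security"
    sys_a = "Israeli officials responsibility of airport safety"
    sys_b = "airport security Israeli officials are responsible"

    print(word_prf(sys_a, ref))   # ~ (0.50, 0.43, 0.46)
    print(word_prf(sys_b, ref))   # near-perfect scores despite the scrambled order

The second call shows the shortcoming discussed above: the bag-of-words view ignores word order entirely, which is what motivates n-gram based metrics such as BLEU below.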
BLEU
• the standard metric (2001)
• IBM, Papineni
• n-gram matches between the reference and the candidate translation
• precision is calculated for 1-, 2-, 3- and 4-grams
• + brevity penalty

BLEU = \min\left(1, \frac{\text{output-length}}{\text{reference-length}}\right) \left(\prod_{i=1}^{4} precision_i\right)^{\frac{1}{4}}

BLEU: AN EXAMPLE
system A:  Israeli officials | responsibility of | airport | safety   ("Israeli officials" is a 2-gram match, "airport" a 1-gram match)
reference: Israeli officials are responsible for airport security
system B:  airport security | Israeli officials are responsible   ("airport security" is a 2-gram match, "Israeli officials are responsible" a 4-gram match)

  metric              system A  system B
  precision (1-gram)  3/6       6/6
  precision (2-gram)  1/5       4/5
  precision (3-gram)  0/4       2/4
  precision (4-gram)  0/3       1/3
  brevity penalty     6/7       6/7
  BLEU                0%        52%

NIST
• NIST: National Institute of Standards and Technology
• weighted matches of n-grams (information value)
• results very similar to BLEU (a variant of it)

NEVA
• N-gram EVAluation
• BLEU score adapted for short sentences
• it takes synonyms into account (stylistic richness)

WAFT
• Word Accuracy for Translation
• edit distance between c and r
• WAFT = 1 − (d + s + i) / max(|c|, |r|), where d, s and i are the numbers of deletions, substitutions and insertions

TER
• Translation Edit Rate
• the smallest number of edit steps (deletion, insertion, swap, replacement)
• r = dnes jsem si při fotbalu zlomil kotník
• c = při fotbalu jsem si dnes zlomil kotník
• TER = ?

TER = \frac{\text{number of edits}}{\text{average number of reference words}}

HTER
• Human TER
• the reference r is manually prepared and then TER is applied

METEOR
• aligns hypotheses to one or more references
• exact, stem (morphology), synonym (WordNet) and paraphrase matches
• various scores, including WMT ranking and NIST adequacy
• extended support for English, Czech, German, French, Spanish and Arabic
• high correlation with human judgments

EVALUATION OF EVALUATION METRICS
Correlation of automatic evaluation with manual evaluation.
[Figure: a scatter plot of an automatic metric against human judgments of adequacy and fluency, with a fitted regression line]

EUROMATRIX
[Table: the Euromatrix: BLEU scores for all translation directions among EU languages (Danish, German, Greek, English, Finnish, French, Italian, Portuguese, Spanish, Swedish); scores are lowest for pairs involving Finnish and highest for closely related pairs such as Spanish–French]

EUROMATRIX II
[Table: Euromatrix II: BLEU scores for all translation directions among 22 EU languages (BG, CS, DA, DE, EL, EN, ES, ET, FI, FR, HU, IT, LT, LV, MT, NL, PL, PT, RO, SK, SL, SV)]

SYNTACTIC RULES EXTRACTION
[Example: extraction of syntactic transfer rules from a parsed sentence: I/PRP shall/MD be/VB passing/VBG on/RP to/TO you/PRP some/DT comments/NNS]

HYBRID SMT+RBMT
• Chimera, ÚFAL
• TectoMT + Moses
• better than Google Translate (En–Cz)