Neural Machine Translation II – Refinements

Philipp Koehn

17 October 2017

Neural Machine Translation

[Figure: the attentional encoder–decoder from the last lecture — input word embeddings, left-to-right recurrent NN, right-to-left recurrent NN, attention, input context, hidden state, output word predictions, given output words, error, output word embedding — illustrated on the sentence pair "the house is big ." / "das Haus ist groß ."]

Neural Machine Translation

• Last lecture: architecture of the attentional sequence-to-sequence neural model
• Today: practical considerations and refinements
  – ensembling
  – handling large vocabularies
  – using monolingual data
  – deep models
  – alignment and coverage
  – use of linguistic annotation
  – multiple language pairs

ensembling

Ensembling

• Train multiple models
• Say, by different random initializations
• Or, by using model dumps from earlier iterations
  (most recent, or interim models with highest validation score)

Decoding with Single Model

[Figure: one decoder step — the previous context c_{i-1} and state s_{i-1} produce the new context c_i and state s_i, which yield the word prediction t_i over candidates such as "the", "cat", "this", "of", "fish", "there", "dog", "these"; the selected word y_i is embedded as E y_i and fed to the next step.]

Combine Predictions

           Model 1   Model 2   Model 3   Model 4   Average
  the        .54       .52       .12       .29       .37
  cat        .01       .02       .33       .03       .10
  this       .11       .12       .06       .14       .11
  of         .00       .00       .01       .08       .02
  fish       .00       .01       .15       .00       .04
  there      .03       .03       .00       .07       .03
  dog        .00       .00       .05       .20       .06
  these      .05       .09       .09       .00       .06

Ensembling

• Surprisingly reliable method in machine learning
• Long history, many variants: bagging, ensembles, model averaging, system combination, ...
• Works because errors are random, but correct decisions are unique

Right-to-Left Inference

• Neural machine translation generates words left to right (L2R)
  the → cat → is → in → the → bag → .
• But it could also generate them right to left (R2L)
  the ← cat ← is ← in ← the ← bag ← .

Obligatory notice: some languages (Arabic, Hebrew, ...) have writing systems that run right to left, so the use of "right-to-left" here refers to generation order, not script direction.

Right-to-Left Reranking

• Train both an L2R and an R2L model
• Score sentences with both
  ⇒ use both left and right context during translation
• Only possible once the full sentence is produced → re-ranking (a small sketch follows below)
  1. generate an n-best list with the L2R model
  2. score the candidates in the n-best list with the R2L model
  3. choose the translation with the best average score
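The following is a minimal sketch of this reranking step, assuming hypothetical scoring functions score_l2r and score_r2l that return (length-normalized) log-probabilities of a candidate under the two directional models; it only illustrates the score averaging, not any particular toolkit's implementation.

```python
def rerank_with_r2l(source, nbest, score_l2r, score_r2l):
    """Re-rank an n-best list from the L2R model with an R2L model.

    nbest:      list of candidate translations produced by the L2R decoder
    score_l2r:  (source, candidate) -> log-probability under the L2R model
    score_r2l:  (source, candidate) -> log-probability under the R2L model
                (internally the R2L model scores the reversed candidate)
    Returns the candidate with the best average of the two scores (step 3 above).
    """
    best, best_score = None, float("-inf")
    for candidate in nbest:
        # average the two directional scores
        score = 0.5 * (score_l2r(source, candidate) + score_r2l(source, candidate))
        if score > best_score:
            best, best_score = candidate, score
    return best
```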
large vocabularies

Zipf's Law: Many Rare Words

[Figure: word frequency plotted against frequency rank; frequency × rank = constant]

Many Problems

• Sparse data
  – words that occur once or twice have unreliable statistics
• Computation cost
  – input word embedding matrix: |V| × 1000
  – output word prediction matrix: 1000 × |V|

Some Causes for Large Vocabularies

• Morphology
  tweet, tweets, tweeted, tweeting, retweet, ...
  → morphological analysis?
• Compounding
  homework, website, ...
  → compound splitting?
• Names
  Netanyahu, Jones, Macron, Hoboken, ...
  → transliteration?
⇒ Breaking up words into subwords may be a good idea

Byte Pair Encoding

• Start by breaking up words into characters
  t h e   f a t   c a t   i s   i n   t h e   t h i n   b a g
• Merge frequent pairs
  t h → th    th e   f a t   c a t   i s   i n   th e   th i n   b a g
  a t → at    th e   f at    c at    i s   i n   th e   th i n   b a g
  i n → in    th e   f at    c at    i s   in    th e   th in    b a g
  th e → the  the    f at    c at    i s   in    the    th in    b a g
• Each merge operation increases the vocabulary size
  – starting with the size of the character set (maybe 100 for Latin script)
  – stopping at, say, 50,000
  (a sketch of the merge-learning loop follows after the example below)

Example: 49,500 BPE Operations

Obama receives Net@@ any@@ ahu
the relationship between Obama and Net@@ any@@ ahu is not exactly friendly .
the two wanted to talk about the implementation of the international agreement and about Teheran ’s destabil@@ ising activities in the Middle East .
the meeting was also planned to cover the conflict with the Palestinians and the disputed two state solution .
relations between Obama and Net@@ any@@ ahu have been stra@@ ined for years .
Washington critic@@ ises the continuous building of settlements in Israel and acc@@ uses Net@@ any@@ ahu of a lack of initiative in the peace process .
the relationship between the two has further deteriorated because of the deal that Obama negotiated on Iran ’s atomic programme .
in March , at the invitation of the Republic@@ ans , Net@@ any@@ ahu made a controversial speech to the US Congress , which was partly seen as an aff@@ ront to Obama .
the speech had not been agreed with Obama , who had rejected a meeting with reference to the election that was at that time im@@ pending in Israel .
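A minimal sketch of the BPE merge-learning loop described above: count adjacent symbol pairs over a word-frequency dictionary, merge the most frequent pair, and repeat for a fixed number of operations. End-of-word markers and the efficiency tricks of real implementations are omitted; tie-breaking between equally frequent pairs is arbitrary here.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a vocabulary of space-separated symbol sequences."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Apply one merge operation (e.g. ('t', 'h') -> 'th') to every word in the vocabulary."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(word_freqs, num_merges):
    """Learn a list of BPE merge operations from a {word: frequency} dictionary."""
    vocab = {" ".join(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# toy example from the slide
words = Counter("the fat cat is in the thin bag".split())
print(learn_bpe(words, 4))  # first merge is ('t', 'h'); later merges depend on tie-breaking
```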
using monolingual data

Traditional View

• Two core objectives for translation

      Adequacy                               Fluency
      meaning of source and target match     target is well-formed
      translation model                      language model
      parallel data                          monolingual data

• Language model is key to good performance in statistical models
• But: current neural translation models are only trained on parallel data

Integrating a Language Model

• Integrating a language model into the neural architecture
  – word prediction informed by translation model and language model
  – gated unit that decides the balance
• Use of a language model in decoding
  – train the language model in isolation
  – add the language model score during inference (similar to ensembling)
• Proper balance between the models (amount of training data, weights) unclear

Backtranslation

• No changes to the model architecture
• Create synthetic parallel data
  – train a system in the reverse direction
  – translate target-side monolingual data into the source language
  – add as additional parallel data
• Simple, yet effective

[Figure: a reverse system produces the synthetic parallel data on which the final system is trained]

deeper models

Deeper Models

• Encoder and decoder are recurrent neural networks
• We can add additional layers for each step
• Recall shallow and deep language models

[Figure: a shallow network (input, one hidden layer, output) next to a deep network (input, hidden layers 1–3, output)]

• Adding residual connections (short-cuts through deep layers) helps

Deep Decoder

• Two ways of adding layers
  – deep transitions: several layers on the path to the output
  – deeply stacking recurrent neural networks
• Why not both?

[Figure: the context feeds decoder states organized as Stack 1, Transition 1 → Stack 1, Transition 2 → Stack 2, Transition 1 → Stack 2, Transition 2]

Deep Encoder

• Previously proposed encoder already has 2 layers
  – left-to-right recurrent network, to encode left context
  – right-to-left recurrent network, to encode right context
⇒ Third way of adding layers

[Figure: input word embeddings feed a stack of alternating encoder layers — Layer 1: L2R, Layer 2: R2L, Layer 3: L2R, Layer 4: R2L]

Reality Check: Edinburgh WMT 2017

[Figure: results of Edinburgh's WMT 2017 systems]

alignment and coverage

Alignment

• Attention model fulfills the role of alignment
• Traditional methods for word alignment
  – based on co-occurrence, word position, etc.
  – expectation maximization (EM) algorithm
  – popular: IBM models, fast-align
  (a minimal IBM Model 1 sketch follows below)
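As a reminder of how such alignments are estimated, here is a minimal sketch of IBM Model 1 lexical translation probabilities trained with EM on a toy corpus. Real alignment tools (GIZA++, fast_align) add a NULL word, distortion or position models, and many refinements omitted in this sketch.

```python
from collections import defaultdict

def train_ibm_model1(corpus, iterations=20):
    """Estimate lexical translation probabilities t(f|e) with the EM algorithm.

    corpus: list of (foreign_tokens, english_tokens) sentence pairs.
    """
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialization of t(f|e)

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for fs, es in corpus:
            for f in fs:
                # E-step: distribute the alignment probability of f over all e
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: re-estimate t(f|e) from the expected counts
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

# toy corpus
corpus = [("das haus".split(), "the house".split()),
          ("das buch".split(), "the book".split()),
          ("ein buch".split(), "a book".split())]
t = train_ibm_model1(corpus)
print(t[("das", "the")])  # approaches 1.0 as EM converges
```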
Attention vs. Alignment

relations between Obama and Netanyahu have been strained for years .
die Beziehungen zwischen Obama und Netanjahu sind seit Jahren angespannt .

[Figure: attention weights between the English and German words of this sentence pair]

Guided Alignment

• Guided alignment training for neural networks
  – traditional objective function: match output words
  – now: also match given word alignments
• Add as cost to objective function
  – given alignment matrix $A$, with $\sum_j A_{ij} = 1$ (from IBM models)
  – computed attention $\alpha_{ij}$ (also $\sum_j \alpha_{ij} = 1$ due to the softmax)
  – added training objective (cross-entropy)
    $\text{cost}_{\text{CE}} = -\frac{1}{I} \sum_{i=1}^{I} \sum_{j=1}^{J} A_{ij} \log \alpha_{ij}$

Coverage

in order to solve the problem , the " Social Housing " alliance suggests a fresh start .
um das Problem zu lösen , schlägt das Unternehmen der Gesellschaft für soziale Bildung vor .

[Figure: attention weights for this sentence pair, with the accumulated coverage of each input word]

Tracking Coverage

• Neural machine translation may drop or duplicate content
• Track coverage during decoding (a small sketch follows below)
  $\text{coverage}(j) = \sum_i \alpha_{i,j}$
  $\text{over-generation} = \sum_j \max\big(0, \text{coverage}(j) - 1\big)$
  $\text{under-generation} = \sum_j \min\big(1, \text{coverage}(j)\big)$
• Add as cost to hypotheses

Coverage Models

• Use as information for state progression
  $a(s_{i-1}, h_j) = W^a s_{i-1} + U^a h_j + V^a \, \text{coverage}(j) + b^a$
• Add to objective function
  $\log \prod_i P(y_i|x) + \lambda \sum_j \big(1 - \text{coverage}(j)\big)^2$
• May also model fertility
  – some words are typically dropped
  – some words produce multiple output words
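A small sketch of these coverage statistics, computed from a decoder's attention matrix with NumPy. The exact form of the over-/under-generation penalty and how it is weighted against the model score varies between implementations; the aggregation below just mirrors the formulas above.

```python
import numpy as np

def coverage_statistics(attention):
    """Compute coverage statistics from an attention matrix.

    attention: array of shape (output_length, input_length); row i holds the
               attention weights alpha_{i,j} used when producing output word i
               (each row sums to 1 because of the softmax).
    """
    coverage = attention.sum(axis=0)               # coverage(j) = sum_i alpha_{i,j}
    over = np.maximum(0.0, coverage - 1.0).sum()   # attention mass beyond 1 per input word
    under = np.minimum(1.0, coverage).sum()        # capped at 1 per word; J means full coverage
    return coverage, over, under

# toy example: 3 input words, 4 output words; word 0 is attended twice, word 2 barely at all
alpha = np.array([[0.9, 0.1, 0.0],
                  [0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.7, 0.1]])
cov, over, under = coverage_statistics(alpha)
print(cov)    # approx. [2.0, 1.7, 0.3]
print(over)   # approx. 1.7 -> over-generation
print(under)  # approx. 2.3 -> compare with J = 3: the shortfall signals under-generation
```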
linguistic annotation

Example

  Words              the    girl     watched   attentively   the        beautiful   fireflies
  Part of speech     DET    NN       VFIN      ADV           DET        JJ          NNS
  Lemma              the    girl     watch     attentive     the        beautiful   firefly
  Morphology         -      SING.    PAST      -             -          -           PLURAL
  Noun phrase        BEGIN  CONT     OTHER     OTHER         BEGIN      CONT        CONT
  Verb phrase        OTHER  OTHER    BEGIN     CONT          CONT       CONT        CONT
  Synt. dependency   girl   watched  -         watched       fireflies  fireflies   watched
  Depend. relation   DET    SUBJ     -         ADV           DET        ADJ         OBJ
  Semantic role      -      ACTOR    -         MANNER        -          MOD         PATIENT
  Semantic type      -      HUMAN    VIEW      -             -          -           ANIMATE

Input Annotation

• Input words are encoded as one-hot vectors
• Additional linguistic annotation
  – part-of-speech tag
  – morphological features
  – etc.
• Encode each annotation in its own one-hot vector space
• Concatenate the one-hot vectors
• Essentially:
  – each annotation maps to an embedding
  – the embeddings are added

Output Annotation

• The same can be done for the output
• Additional output annotation is a latent feature
  – ultimately, we do not care if the right part-of-speech tag is predicted
  – only the right output words matter
• Optimizing for correct output annotation → better prediction of output words

Linearized Output Syntax

Sentence: the girl watched attentively the beautiful fireflies

[Figure: constituency tree for this sentence]

Linearized:
(S (NP (DET the ) (NN girl ) ) (VP (VFIN watched ) (ADVP (ADV attentively ) ) (NP (DET the ) (JJ beautiful ) (NNS fireflies ) ) ) )

multiple language pairs

One Model, Multiple Language Pairs

• One language pair → train one model
• Multiple language pairs → train one model for each
• Multiple language pairs → train one model for all

Multiple Input Languages

• Given
  – French–English corpus
  – German–English corpus
• Train one model on the concatenated corpora
• Benefit: sharing monolingual target language data

Multiple Output Languages

• Multiple output languages
  – French–English corpus
  – French–Spanish corpus
• Need to mark the desired output language with a special token
  (a data preparation sketch follows at the end of this section)
  [ENGLISH] N’y a-t-il pas ici deux poids, deux mesures?
  ⇒ Is this not a case of double standards?
  [SPANISH] N’y a-t-il pas ici deux poids, deux mesures?
  ⇒ No puede verse con toda claridad que estamos utilizando un doble rasero?

Zero Shot

[Figure: one system connecting English, French, Spanish, and German]

• Can the model translate German to Spanish?
  [SPANISH] Messen wir hier nicht mit zweierlei Maß?
  ⇒ No puede verse con toda claridad que estamos utilizando un doble rasero?

Zero Shot: Vision

• Direct translation only requires a bilingual mapping
• Zero shot requires an interlingual representation

Zero Shot: Reality

[Figure: zero-shot translation results]
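To close the loop on the multilingual setup above, here is a minimal data-preparation sketch: corpora for several language pairs are concatenated into one training set, with a special token such as [SPANISH] prepended to each source sentence to select the output language. The helper name tag_source and the exact tagging scheme are illustrative assumptions, not a fixed recipe.

```python
def tag_source(src_sentences, tgt_sentences, target_language):
    """Prepend a target-language token to every source sentence of one corpus."""
    token = f"[{target_language.upper()}]"
    return [(f"{token} {src}", tgt) for src, tgt in zip(src_sentences, tgt_sentences)]

# toy corpora standing in for real French-English and French-Spanish parallel data
fr_en = [("N'y a-t-il pas ici deux poids, deux mesures ?",
          "Is this not a case of double standards ?")]
fr_es = [("N'y a-t-il pas ici deux poids, deux mesures ?",
          "No puede verse con toda claridad que estamos utilizando un doble rasero ?")]

# one training set for one model covering both output languages
training_data = tag_source(*zip(*fr_en), "english") + tag_source(*zip(*fr_es), "spanish")
for src, tgt in training_data:
    print(src, "=>", tgt)
```

Training a single model on such concatenated, tagged data is what makes the zero-shot direction above conceivable: a German source sentence with a [SPANISH] tag can be decoded even though no German–Spanish parallel data was ever seen.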