Neural Machine Translation III
Philipp Koehn
24 October 2017

Neural Machine Translation
[Figure: recap of the attention model: input word embeddings, left-to-right and right-to-left recurrent NNs, attention, input context, hidden state, output word predictions, error given output words, output word embeddings; example input "das Haus ist groß", output "the house is ..."]

Google: Neural vs. Statistical MT
[Figure: human quality ratings, from imperfect to perfect translation, of human, neural (GNMT), and phrase-based (PBMT) translation models on English→Spanish, English→French, English→Chinese, Spanish→English, French→English, and Chinese→English]

WMT 2016
[Figure: human evaluation score vs. BLEU (roughly 18 to 36) for WMT 2016 submissions; the neural systems (UEDIN-NMT, METAMIND, NYU-UMONTREAL) score highest, ahead of the statistical systems (UEDIN-SYNTAX, KIT-LIMSI, CAMBRIDGE, JHU-SYNTAX, JHU-PBMT, UEDIN-PBMT), the online systems (ONLINE-A, ONLINE-B, ONLINE-F, ONLINE-G), and rule-based systems]
(in 2017 barely any statistical machine translation submissions)

Today's Agenda
• Challenges
  - lack of training data
  - domain mismatch
  - noisy data
  - sentence length
  - word alignment
  - beam search
• Alternative architectures
  - convolutional neural networks
  - self-attention

challenges

Amount of Training Data
[Figure: learning curve of translation quality against corpus size (English words); English-Spanish systems trained on 0.4 million to 385.7 million words]

Translation Examples
Source: A Republican strategy to counter the re-election of Obama
1/1024 of the data: Un órgano de coordinación para el anuncio de libre determinación
1/512: Lista de una estrategia para luchar contra la elección de hojas de Ohio
1/256: Explosion realiza una estrategia divisiva de luchar contra las elecciones de autor
1/128: Una estrategia republicana para la eliminación de la reelección de Obama
1/64: Estrategia siria para contrarrestar la reelección del Obama .
1/32: Una estrategia republicana para contrarrestar la reelección de Obama

domain mismatch

Domain Mismatch
BLEU when training on one domain (rows) and testing on another (columns); each cell shows NMT / SMT:

Training \ Test   Law          Medical      IT           Koran        Subtitles
All Data          30.5 / 32.8  45.1 / 42.2  35.3 / 44.7  17.9 / 17.9  26.4 / 20.8
Law               31.1 / 34.4  12.1 / 18.2   3.5 /  6.9   1.3 /  2.2   2.8 /  6.0
Medical            3.9 / 10.2  39.4 / 43.5   2.0 /  8.5   0.6 /  2.0   1.4 /  5.8
IT                 1.9 /  3.7   6.5 /  5.3  42.1 / 39.8   1.8 /  1.6   3.9 /  4.7
Koran              0.4 /  1.8   0.0 /  2.1   0.0 /  2.3  15.9 / 18.8   1.0 /  5.5
Subtitles          7.0 /  9.9   9.3 / 17.8   9.2 / 13.6   9.0 /  8.4  25.9 / 22.1

Translation Examples
Source     Schaue um dich herum.
Reference  Look around you.
All Data   NMT: Look around you.
           SMT: Look around you.
Law        NMT: Sughum gravecorn.
           SMT: In order to implement dich Schaue .
Medical    NMT: EMEA / MB / 049 / 01-EN-Final Work programme for 2002
           SMT: Schaue by dich around .
IT         NMT: Switches to paused.
           SMT: To Schaue by itself . \t \t
Koran      NMT: Take heed of your own souls.
           SMT: And you see.
Subtitles  NMT: Look around you.
           SMT: Look around you .
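The cross-domain grid above is just corpus-level BLEU computed for every training-domain/test-domain pair. Below is a minimal sketch of how such a grid could be produced, assuming sacrebleu is installed; the directory layout and file names are invented for illustration only.

```python
# Hypothetical sketch: compute a cross-domain BLEU grid with sacrebleu.
# File names and directory layout are made up for illustration.
import sacrebleu

DOMAINS = ["law", "medical", "it", "koran", "subtitles"]

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

for train_domain in DOMAINS:                 # system trained on this domain
    scores = []
    for test_domain in DOMAINS:              # evaluated on this test set
        hyps = read_lines(f"output/{train_domain}-system.{test_domain}.hyp")
        refs = read_lines(f"test/{test_domain}.ref")
        bleu = sacrebleu.corpus_bleu(hyps, [refs])   # one reference per sentence
        scores.append(f"{bleu.score:5.1f}")
    print(f"{train_domain:<10}", " ".join(scores))
```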
noisy data

Noise in Training Data
• Chen et al. [2016] add noise to WMT English-French training data
  - artificial noise: permute the order of the target sentences
  - conclusion: NMT is more sensitive to (some types of) noise than SMT

Noise   0%     10%           20%           50%
SMT     32.7   32.7 (±0.0)   32.6 (-0.1)   32.0 (-0.7)
NMT     35.4   34.8 (-0.6)   32.1 (-3.3)   30.1 (-5.3)

• Other kinds of noise: non-text content, text in the wrong language

sentence length

Sentence Length
[Figure: BLEU by sentence length (source, subword count, 0-80); the neural system peaks around 34.7 BLEU and the phrase-based system around 33.9, but on the longest sentences the neural system drops to about 26.9 while the phrase-based system holds at about 27.7]

word alignment

Word Alignment
[Figure: attention weights (in percent) between the German source "die Beziehungen zwischen Obama und Netanjahu sind seit Jahren angespannt" and its English translation; the largest weights (e.g. 89, 96, 98) sit on word pairs that a traditional word alignment would also connect]

Word Alignment?
[Figure: attention weights for "the relationship between Obama and Netanyahu has been stretched for years"; here the attention weights only loosely correspond to a word alignment, with several target words spreading their attention across the source]

beam search

[Figure: translation quality as a function of beam size (1, 2, 4, 8, 12, 20, 30, 50, 100, 200, 500, 1,000); quality improves only up to small beam sizes and degrades as the beam grows very large]

Just Better Fluency?
[Figure: human judgments of adequacy (about +1%) and fluency (about +13%) comparing ONLINE-B and UEDIN-NMT on CS→EN, DE→EN, RO→EN, and RU→EN]
(from: Sennrich and Haddow, 2017)

alternative architectures

Beyond Recurrent Neural Networks
• We presented the currently dominant model
  - recurrent neural networks for encoder and decoder
  - attention
• Convolutional neural networks
• Self-attention

convolutional neural networks

Convolutional Neural Networks
[Figure: input word embeddings merged bottom-up through convolution layers (K2, K3, L3) into a sentence representation]
• Build the sentence representation bottom-up
  - merge any n neighboring nodes
  - n may be 2, 3, ...

Generation
[Figure: input word embeddings, K2 encoding layers, a transfer layer, K3 and K2 decoding layers, output word embedding, selected word]

Generation
• Encode with a convolutional neural network
• Decode with a convolutional neural network
• Also include a linear recurrent neural network
• Important: predict the length of the output sentence
• Does it work?
  - used successfully in re-ranking (Cho et al., 2014)
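The core building block above is a convolution that merges a window of n neighboring word representations. Here is a minimal PyTorch sketch of that idea; layer sizes, kernel width, and class names are illustrative and not the architecture shown in the figures.

```python
# Illustrative sketch of convolutional sentence encoding: each layer
# merges n neighboring positions (here n = 3, with "same" padding so
# every layer keeps one vector per input word).
import torch
import torch.nn as nn

class ConvSentenceEncoder(nn.Module):
    def __init__(self, vocab_size=10000, dim=256, kernel=3, layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel, padding=kernel // 2) for _ in range(layers)]
        )

    def forward(self, word_ids):                  # (batch, sentence_length)
        x = self.embed(word_ids).transpose(1, 2)  # (batch, dim, length)
        for conv in self.convs:
            x = torch.relu(conv(x))               # merge neighboring positions
        return x.transpose(1, 2)                  # (batch, length, dim)

# Example: encode a batch of two 6-word sentences (random word ids)
enc = ConvSentenceEncoder()
states = enc(torch.randint(0, 10000, (2, 6)))
print(states.shape)  # torch.Size([2, 6, 256])
```

Stacking layers widens the window each position can see, which is exactly the context trade-off noted for the convolutional encoder in the next slides.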
Convolutional Network with Attention
[Figure: convolutional encoder, attention, and convolutional decoder, translating "la maison de Lea" into "Lea 's ..."]
(Facebook, 2017)

Convolutional Encoder
[Figure: input word embeddings refined by stacked convolution layers 1, 2, and 3]
• Similar idea as deep recurrent neural networks
• Good: more parallelizable
• Bad: less context when refining the representation of a word

Convolutional Decoder
[Figure: decoder convolutions 1 and 2 over the output word embeddings, producing the next selected word]
• Convolutions over output words
• Only over previously produced output words (still left-to-right decoding)

Convolutional Decoder
[Figure: the decoder convolutions additionally receive the input context]
• Inclusion of the input context
• The context is the result of an attention mechanism (similar to the previous one)

Convolutional Decoder
[Figure: the top decoder layer produces the output word predictions]
• Predict the output word distribution
• Select the output word

self-attention

Attention
[Figure: encoder states, attention, input context, hidden state]
• Compute the association between the last hidden state and the encoder states

Attention Math
• Input word representations h_k
• Decoder state s_j
• Computations
  - raw association: a_jk
  - normalized association (softmax): α_jk = exp(a_jk) / Σ_κ exp(a_jκ)
  - weighted sum: self-attention(h_j) = Σ_k α_jk h_k

Self-Attention
• Attention: a_jk = (1/|h|) s_j^T h_k
• Self-attention: a_jk = (1/|h|) h_j^T h_k
• Refine the representation of a word with related words
  [Example: the representation of "making" is refined by attending to related words such as "more difficult"]
• Good: more parallelizable than a recurrent neural network
• Good: wide context when refining the representation of a word

Stacked Attention in Decoder
[Figure: input word embeddings pass through self-attention layers 1 and 2; decoder layers 1 and 2 combine them with the output word embeddings to produce the output word prediction and the selected output word]

Where Are We Now?
• The recurrent neural network with attention is currently the dominant model
• Still many challenges
• New proposals in spring 2017
  - convolutions (Facebook)
  - self-attention (Google)
• Too early to tell if either becomes the new paradigm
• Open source implementations are available

questions?
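As a worked companion to the attention formulas above, here is a minimal numpy sketch of one self-attention step. It is illustrative only: real self-attention models such as Google's add learned projections and multiple heads, which are not shown here.

```python
# Minimal numpy sketch of the self-attention step from the slides:
#   a_jk = (1/|h|) h_j . h_k,  alpha = softmax(a),  out_j = sum_k alpha_jk h_k
# Illustrative only: no learned projections, no multiple heads, no masking.
import numpy as np

def self_attention(H):
    """H: (length, dim) matrix of word representations h_1..h_n."""
    length, dim = H.shape
    a = H @ H.T / dim                      # raw associations a_jk
    a = a - a.max(axis=1, keepdims=True)   # numerical stability for the softmax
    alpha = np.exp(a)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)   # normalized associations
    return alpha @ H                       # weighted sums: refined representations

# Example: five 8-dimensional word representations
H = np.random.randn(5, 8)
refined = self_attention(H)
print(refined.shape)   # (5, 8)
```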