Neural Machine Translation III
Philipp Koehn 24 October 2017
Neural Machine Translation
[Diagram (recap): the attentional encoder-decoder model. Input words "the house is ..." → input word embeddings → left-to-right and right-to-left recurrent NNs → attention → input context → hidden state → output word predictions, which are compared against the given output words "das Haus ist groß" (via their output word embeddings) to compute the training error.]
Google: Neural vs. Statistical MT
[Bar chart: human judgments of translation quality (up to "perfect translation") for phrase-based (PBMT), neural (GNMT), and human translation, for English↔Spanish, English↔French, and English↔Chinese.]
WMT 2016
[Scatter plot of WMT 2016 news task systems, human evaluation score vs. BLEU (18-36): neural systems (UEDIN-NMT, METAMIND, UEDIN-SYNTAX, NYU-UMONTREAL) score highest, ahead of statistical systems (CAMBRIDGE, KIT-LIMSI, JHU-PBMT, UEDIN-PBMT, JHU-SYNTAX, ONLINE-A/B/F/G) and a rule-based system.]
(in 2017 barely any statistical machine translation submissions)
Today's Agenda
• Challenges
— lack of training data
— domain mismatch
— noisy data
— sentence length
— word alignment
— beam search
• Alternative architectures
— convolutional neural networks
— self-attention
challenges
Amount of Training Data
[Learning curve: translation quality as a function of corpus size (English words); English-Spanish systems trained on 0.4 million to 385.7 million words.]
Translation Examples
Source | A Republican strategy to counter the re-election of Obama
1/1024 | Un órgano de coordinación para el anuncio de libre determinación
1/512  | Lista de una estrategia para luchar contra la elección de hojas de Ohio
1/256  | Explosion realiza una estrategia divisiva de luchar contra las elecciones de autor
1/128  | Una estrategia republicana para la eliminación de la reelección de Obama
1/64   | Estrategia siria para contrarrestar la reelección del Obama .
1/32   | Una estrategia republicana para contrarrestar la reelección de Obama
(rows: fraction of the full training corpus used)
domain mismatch
Domain Mismatch

Trained on | Law | Medical | IT | Koran | Subtitles
(BLEU on each test domain, NMT / SMT)
All Data  | 30.5 / 32.8 | 45.1 / 42.2 | 35.3 / 44.7 | 17.9 / 17.9 | 26.4 / 20.8
Law       | 31.1 / 34.4 | 12.1 / 18.2 |  3.5 /  6.9 |  1.3 /  2.2 |  2.8 /  6.0
Medical   |  3.9 / 10.2 | 39.4 / 43.5 |  2.0 /  8.5 |  0.6 /  2.0 |  1.4 /  5.8
IT        |  1.9 /  3.7 |  6.5 /  5.3 | 42.1 / 39.8 |  1.8 /  1.6 |  3.9 /  4.7
Koran     |  0.4 /  1.8 |  0.0 /  2.1 |  0.0 /  2.3 | 15.9 / 18.8 |  1.0 /  5.5
Subtitles |  7.0 /  9.9 |  9.3 / 17.8 |  9.2 / 13.6 |  9.0 /  8.4 | 25.9 / 22.1
Translation Examples
Source Schaue um dich herum.
Ref. Look around you.
All NMT: Look around you. SMT: Look around you.
Law NMT: Sughum gravecorn. SMT: In order to implement dich Schaue .
Medical NMT: EMEA / MB / 049 / 01-EN-Final Work progamme for 2002 SMT: Schaue by dich around .
IT NMT: Switches to paused. SMT: To Schaue by itself .
Koran NMT: Take heed of your own souls. SMT: And you see.
Subtitles NMT: Look around you. SMT: Look around you .
noisy data
Noise in Training Data
• Chen et al. [2016] add noise to WMT EN-FR training data
— artificial noise: permute order of target sentences
— conclusion: NMT is more sensitive to (some types of) noise than SMT
Noise | 0% | 10% | 20% | 50%
SMT   | 32.7 | 32.7 (±0.0) | 32.6 (-0.1) | 32.0 (-0.7)
NMT   | 35.4 (-0.1) | 34.8 (-0.6) | 32.1 (-3.3) | 30.1 (-5.3)
• Other kinds of noise: non-text content, text in the wrong language
sentence length
Sentence Length
[Line chart: BLEU by sentence length (source, subword count, up to 80), neural vs. phrase-based. Quality is comparable for most lengths, but the neural system falls behind the phrase-based system on the longest sentences.]
word alignment
Word Alignment
[Attention weight matrix (values in percent) between the German input "die Beziehungen zwischen Obama und Netanjahu sind seit Jahren angespannt ." and its English translation "the relationship between Obama and Netanyahu has been stretched for years ."; most of the weight falls on the intuitively aligned word pairs.]
Word Alignment?
[Attention weight matrix (values in percent) for the same sentence pair, shown for the English output words "the relationship between Obama and Netanyahu has been stretched for years ."; here the attention is more diffuse and does not always correspond to an intuitive word alignment.]
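The matrices above can be read as soft word alignments. Below is a minimal sketch of turning such a matrix into a hard alignment by linking each output word to its highest-weighted input word; the weights are invented for illustration, not copied from the figures.

```python
import numpy as np

# Rows = output words, columns = input words; weights roughly in percent,
# as in the figures above (values here are made up for illustration).
attn = np.array([[89,  5,  3,  2,  1],
                 [ 4, 72, 14,  6,  4],
                 [ 2, 10, 81,  4,  3],
                 [ 1,  6,  8, 47, 38],
                 [ 1,  2,  4, 16, 77]])

# Hard alignment: each output word is linked to its highest-weighted input word.
alignment = attn.argmax(axis=1)
print([(out_pos, int(in_pos)) for out_pos, in_pos in enumerate(alignment)])
```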
beam search
[Line chart: translation quality (BLEU) as a function of beam size, from 1 to 1,000; quality improves up to a moderate beam size and then degrades as the beam grows very large.]
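For reference, a minimal sketch of the beam search procedure whose beam size is varied in the plot above. The step function next_word_logprobs and the end-of-sentence id EOS are illustrative stand-ins for a real NMT model, and no length normalization is applied.

```python
import numpy as np

EOS = 0  # assumed end-of-sentence word id (illustrative)

def next_word_logprobs(prefix):
    """Stand-in for the model: log P(next word | prefix) over a toy vocabulary."""
    np.random.seed(len(prefix))
    return np.log(np.random.dirichlet(np.ones(10)))

def beam_search(beam_size=4, max_len=20):
    beams = [([], 0.0)]                                  # (output prefix, log probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            logp = next_word_logprobs(prefix)
            for w in np.argsort(logp)[-beam_size:]:      # expand with the best next words
                candidates.append((prefix + [int(w)], score + logp[w]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:     # keep the beam_size best hypotheses
            (finished if prefix[-1] == EOS else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])     # best finished (or partial) hypothesis

print(beam_search(beam_size=4))
```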
Just Better Fluency?
Adequacy: +1%
[Bar chart: adequacy judgments for ONLINE-B vs. UEDIN-NMT on CS→EN, DE→EN, RO→EN, RU→EN]
Fluency: +13%
[Bar chart: fluency judgments for ONLINE-B vs. UEDIN-NMT on CS→EN, DE→EN, RO→EN, RU→EN]
(from: Sennrich and Haddow, 2017)
alternative architectures
Beyond Recurrent Neural Networks
• We presented the currently dominant model
— recurrent neural networks for encoder and decoder
— attention
• Convolutional neural networks
• Self-attention
convolutional neural networks
Convolutional Neural Networks
[Diagram: input word embeddings combined bottom-up through convolution layers (K2 layer, K3 layer, L3 layer) into a sentence representation.]
• Build sentence representation bottom-up
— merge any n neighboring nodes
— n may be 2, 3, ... (see the sketch below)
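A minimal sketch of one such merge step, assuming a single convolution layer with window size n = 2 and a made-up weight matrix W.

```python
import numpy as np

def conv_layer(X, W, n=2):
    """Merge every window of n neighboring word vectors into one new vector.
    X: (num_words, d) word vectors; W: (n*d, d) merge weights."""
    windows = [X[i:i + n].reshape(-1) for i in range(len(X) - n + 1)]
    return np.tanh(np.stack(windows) @ W)        # one merged node per window

X = np.random.randn(6, 4)                        # 6 words, embedding size 4
W = np.random.randn(8, 4)
print(conv_layer(X, W).shape)                    # (5, 4): one node per pair of neighbors
```

Stacking such layers repeats the merge and builds the sentence representation bottom-up, as the diagram above indicates.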
Generation
[Diagram: input word embeddings → K2 and K3 encoding layers → transfer layer → K3 and K2 decoding layers → output word embeddings → selected words.]
Generation
• Encode with convolutional neural network
• Decode with convolutional neural network
• Also include a linear recurrent neural network
• Important: predict length of output sentence
• Does it work?
— used successfully in re-ranking (Cho et al., 2014)
Convolutional Network with Attention
[Diagram: convolutional encoder over the input "la maison de Lea", attention, and convolutional decoder producing output beginning "Lea 's".]
(Facebook, 2017)
Convolutional Encoder
[Diagram: input word embeddings processed by stacked convolution layers 1, 2, and 3.]
• Similar idea as deep recurrent neural networks
• Good: more parallelizable
• Bad: less context when refining the representation of a word (see the calculation below)
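The "less context" point can be made concrete with a small calculation: with a convolution window of n words per layer (n = 3 assumed here), the context a word sees grows only linearly with the number of layers, whereas a bidirectional recurrent encoder always sees the whole sentence.

```python
# Receptive field of stride-1 convolutions: after L layers with window size n,
# a word's representation depends on at most 1 + L*(n-1) input words.
def receptive_field(num_layers, n=3):
    return 1 + num_layers * (n - 1)

for L in range(1, 5):
    print(L, receptive_field(L))   # 1 -> 3, 2 -> 5, 3 -> 7, 4 -> 9
```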
Convolutional Decoder
[Diagram: output word embeddings of previously selected words feed into decoder convolution layers 1 and 2.]
• Convolutions over output words
• Only over previously produced output words (still left-to-right decoding); see the sketch below
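A minimal sketch of such a decoder convolution, with left-only padding so that each output position only sees previously produced words; window size n = 3 and the weights are illustrative assumptions.

```python
import numpy as np

def causal_conv(Y, W, n=3):
    """Y: (num_output_words, d) embeddings of words produced so far.
    Each position i is computed only from positions <= i (left-padded windows)."""
    d = Y.shape[1]
    padded = np.vstack([np.zeros((n - 1, d)), Y])            # pad on the left only
    windows = [padded[i:i + n].reshape(-1) for i in range(len(Y))]
    return np.tanh(np.stack(windows) @ W)                    # one state per output position

Y = np.random.randn(4, 5)                                    # 4 output words so far
W = np.random.randn(15, 5)
print(causal_conv(Y, W).shape)                               # (4, 5)
```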
Convolutional Decoder
[Diagram: the input context computed by attention over the encoder is added to the decoder convolution layers.]
• Inclusion of the input context
• The context is the result of an attention mechanism (similar to the one presented earlier)
Convolutional Decoder
[Diagram: decoder states and input context are used to compute output word predictions, from which the output word is selected.]
• Predict output word distribution
• Select output word
self-attention
Attention
[Diagram: encoder states → attention → input context → hidden state.]
• Compute association between last hidden state and encoder states
Attention Math
• Input word representations $h_k$
• Decoder state $s_j$
• Computations
— raw association: $a_{jk}$ (computed from $s_j$ and $h_k$)
— normalized association (softmax): $\alpha_{jk} = \dfrac{\exp(a_{jk})}{\sum_{\kappa} \exp(a_{j\kappa})}$
— weighted sum (input context): $\sum_k \alpha_{jk} h_k$
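These three steps can be written out directly. The sketch below uses a plain dot product as the raw association (the slides leave the exact scoring function open) and made-up dimensions.

```python
import numpy as np

def attention(s_j, H):
    """s_j: decoder state (d,); H: encoder states, one row per input word (k, d)."""
    a = H @ s_j                                  # raw associations a_jk (dot product assumed)
    alpha = np.exp(a - a.max())                  # softmax, numerically stable
    alpha /= alpha.sum()                         # normalized associations
    context = alpha @ H                          # weighted sum of encoder states
    return context, alpha

H = np.random.randn(5, 8)                        # 5 input words, hidden size 8
s = np.random.randn(8)
c, weights = attention(s, H)
print(weights.round(2), c.shape)
```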
Self-Attention
• Attention
$$a_{jk} = \frac{1}{\sqrt{d}}\, s_j^\top h_k$$
• Self-attention
$$a_{jk} = \frac{1}{\sqrt{d}}\, h_j^\top h_k$$
(where $d$ is the dimension of the state vectors)
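A minimal sketch of self-attention applied to all words of a sentence at once; the scaled dot product used here is an assumption consistent with the formulas above.

```python
import numpy as np

def self_attention(H):
    """H: (num_words, d) word representations; each word attends to all words."""
    d = H.shape[1]
    a = (H @ H.T) / np.sqrt(d)                   # raw associations a_jk = h_j . h_k / sqrt(d)
    a -= a.max(axis=1, keepdims=True)
    alpha = np.exp(a)
    alpha /= alpha.sum(axis=1, keepdims=True)    # softmax over k for each j
    return alpha @ H                             # refined representations

H = np.random.randn(6, 8)                        # 6 words, hidden size 8
print(self_attention(H).shape)                   # (6, 8)
```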
• Refine representation of word with related words
[Example: in the phrase "making ... more difficult", the representation of "making" is refined by attending to "more difficult".]
• Good: more parallelizable than recurrent neural network
• Good: wide context when refining representation of a word
Stacked Attention in Decoder
[Diagram: input word embeddings → self-attention layer 1 → self-attention layer 2 (encoder); decoder layer 1 → decoder layer 2 → output word prediction → selected output word, with the selected words fed back in through output word embeddings.]
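A schematic sketch of this layer stack. It is heavily simplified (no feed-forward sublayers, residual connections, or multiple attention heads) and uses made-up dimensions; it is meant only to show how self-attention layers and decoder layers are stacked.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    """Scaled dot-product attention of queries Q over keys K / values V."""
    return softmax(Q @ K.T / np.sqrt(K.shape[1])) @ V

def encode(X, num_layers=2):
    for _ in range(num_layers):                  # self-attention layers 1..2
        X = attend(X, X, X)
    return X

def decode_step(Y, H_enc, num_layers=2):
    for _ in range(num_layers):                  # decoder layers 1..2
        Y = attend(Y, Y, Y)                      # self-attention over the output prefix
        Y = attend(Y, H_enc, H_enc)              # attention over the encoder states
    return Y[-1]                                 # state used for the next word prediction

H = encode(np.random.randn(5, 8))                # 5 input words, hidden size 8
print(decode_step(np.random.randn(3, 8), H).shape)   # (8,)
```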
Where Are We Now?
• The recurrent neural network with attention is currently the dominant model
• Still many challenges
• New proposals in Spring 2017
— convolutions (Facebook)
— self-attention (Google)
• Too early to tell if either becomes the new paradigm
• Open source implementations are available
questions?