Neural Machine Translation II
Refinements
Philipp Koehn
17 October 2017
Neural Machine Translation
[Figure: attentional encoder-decoder: input word embeddings, left-to-right and right-to-left recurrent encoders, attention, input context, decoder hidden state, output word predictions, given output words, error, output word embedding; example: "the house is big ." → "das Haus ist groß ."]

Neural Machine Translation
• Last lecture: architecture of attentional sequence-to-sequence neural model
• Today: practical considerations and refinements
– ensembling
– handling large vocabularies
– using monolingual data
– deep models
– alignment and coverage
– use of linguistic annotation
– multiple language pairs
ensembling

Ensembling
• Train multiple models
• Say, by different random initializations
• Or, by using model dumps from earlier iterations
(most recent, or interim models with highest validation score)
Decoding with Single Model

[Figure: one decoder step: previous state s_{i-1} and input context c_i produce the word prediction t_i; the selected word y_i and its embedding E y_i feed the next state s_i; predicted candidate words include: the, cat, this, of, fish, there, dog, these]
Combine Predictions

           Model 1   Model 2   Model 3   Model 4   Average
the          .54       .52       .12       .29       .37
cat          .01       .02       .33       .03       .10
this         .11       .12       .06       .14       .08
of           .00       .00       .01       .08       .02
fish         .00       .01       .15       .00       .07
there        .03       .03       .00       .07       .03
dog          .00       .00       .05       .20       .00
these        .05       .09       .09       .00       .06
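In code, ensembling at each decoding step is just an elementwise average of the per-model output distributions. Below is a minimal numpy sketch using the illustrative probabilities from the table above; the function and variable names are ours, not from any particular toolkit.

import numpy as np

def ensemble_predict(distributions):
    # distributions: one probability vector over the output vocabulary per model,
    # all computed for the same partial translation
    return np.mean(np.stack(distributions), axis=0)

vocab = ["the", "cat", "this", "of", "fish", "there", "dog", "these"]
model_probs = [
    np.array([.54, .01, .11, .00, .00, .03, .00, .05]),  # model 1
    np.array([.52, .02, .12, .00, .01, .03, .00, .09]),  # model 2
    np.array([.12, .33, .06, .01, .15, .00, .05, .09]),  # model 3
    np.array([.29, .03, .14, .08, .00, .07, .20, .00]),  # model 4
]
averaged = ensemble_predict(model_probs)
print(vocab[int(np.argmax(averaged))])  # -> the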
Ensembling
• Surprisingly reliable method in machine learning
• Long history, many variants:
bagging, ensemble, model averaging, system combination, ...
• Works because the errors of individual models are largely random, while the correct decision is shared across models
Right-to-Left Inference
• Neural machine translation generates words left to right (L2R)
the → cat → is → in → the → bag → .
• But it could also generate them right to left (R2L)
the ← cat ← is ← in ← the ← bag ← .
Obligatory notice: some languages (Arabic, Hebrew, ...) have writing systems that are right-to-left, so the use of "right-to-left" is not precise here.
Right-to-Left Reranking
• Train both L2R and R2L model
• Score sentences with both
⇒ use both left and right context during translation
• Only possible once the full sentence is produced → re-ranking (sketched below)
1. generate n-best list with L2R model
2. score candidates in n-best list with R2L model
3. choose translation with best average score
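A minimal sketch of this reranking loop, assuming two hypothetical scoring functions score_l2r and score_r2l that return sentence-level log-probabilities under the respective models:

def rerank(nbest, score_l2r, score_r2l):
    # nbest: candidate translations from the L2R decoder's n-best list
    best, best_score = None, float("-inf")
    for candidate in nbest:
        # average the two model scores (both sentence-level log-probabilities)
        score = 0.5 * (score_l2r(candidate) + score_r2l(candidate))
        if score > best_score:
            best, best_score = candidate, score
    return best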
large vocabularies

Zipf's Law: Many Rare Words
[Figure: word frequency plotted against frequency rank; frequency × rank = constant]
Many Problems
• Sparse data
– words that occur once or twice have unreliable statistics
• Computation cost
– input word embedding matrix: |V| × 1000
– output word prediction matrix: 1000 × |V|
Some Causes for Large Vocabularies
• Morphology
tweet, tweets, tweeted, tweeting, retweet, ...
→ morphological analysis?
• Compounding
homework, website, ...
→ compound splitting?
• Names
Netanyahu, Jones, Macron, Hoboken, ...
→ transliteration?
⇒ Breaking up words into subwords may be a good idea
Byte Pair Encoding
• Start by breaking up words into characters
  t h e  f a t  c a t  i s  i n  t h e  t h i n  b a g
• Merge frequent pairs
  t h → th:    th e  f a t  c a t  i s  i n  th e  th i n  b a g
  a t → at:    th e  f at  c at  i s  i n  th e  th i n  b a g
  i n → in:    th e  f at  c at  i s  in  th e  th in  b a g
  th e → the:  the  f at  c at  i s  in  the  th in  b a g
• Each merge operation increases the vocabulary size
– starting with the size of the character set (maybe 100 for Latin script)
– stopping at, say, 50,000
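The merge-learning loop sketched above fits in a few lines of Python; this is a simplified illustration (not the reference BPE implementation), operating on a dictionary that maps each word, split into symbols, to its corpus frequency. Ties between equally frequent pairs are broken arbitrarily.

from collections import Counter

def learn_bpe(word_freqs, num_merges):
    # word_freqs: {tuple of symbols (initially characters): corpus frequency}
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in word_freqs.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair, e.g. ('t', 'h')
        merges.append(best)
        merged = {}
        for symbols, freq in word_freqs.items():
            # replace every occurrence of the best pair by its concatenation
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged[key] = merged.get(key, 0) + freq
        word_freqs = merged
    return merges

corpus = "the fat cat is in the thin bag".split()
print(learn_bpe(dict(Counter(tuple(w) for w in corpus)), 4))
# e.g. [('t', 'h'), ('a', 't'), ('i', 'n'), ('th', 'e')]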
Example: 49,500 BPE Operations
Obama receives Net@@ any@@ ahu
the relationship between Obama and Net@@ any@@ ahu is not exactly
friendly . the two wanted to talk about the implementation of the
international agreement and about Teheran ’s destabil@@ ising activities
in the Middle East . the meeting was also planned to cover the conflict
with the Palestinians and the disputed two state solution . relations
between Obama and Net@@ any@@ ahu have been stra@@ ined for years .
Washington critic@@ ises the continuous building of settlements in
Israel and acc@@ uses Net@@ any@@ ahu of a lack of initiative in the
peace process . the relationship between the two has further
deteriorated because of the deal that Obama negotiated on Iran ’s
atomic programme . in March , at the invitation of the Republic@@ ans
, Net@@ any@@ ahu made a controversial speech to the US Congress , which
was partly seen as an aff@@ ront to Obama . the speech had not been
agreed with Obama , who had rejected a meeting with reference to the
election that was at that time im@@ pending in Israel .
using monolingual data

Traditional View
• Two core objectives for translation
  Adequacy                               Fluency
  meaning of source and target match     target is well-formed
  translation model                      language model
  parallel data                          monolingual data
• Language model is key to good performance in statistical models
• But: current neural translation models are trained only on parallel data
Integrating a Language Model
• Integrating a language model into neural architecture
– word prediction informed by translation model and language model
– gated unit that decides balance
• Use of language model in decoding
– train language model in isolation
– add language model score during inference (similar to ensembling)
• Proper balance between models (amount of training data, weights) unclear
Backtranslation
• No changes to model architecture
• Create synthetic parallel data
– train a system in the reverse direction (the "reverse system")
– translate target-side monolingual data into the source language
– add the back-translations as additional parallel data for the final system
• Simple, yet effective (a sketch of the pipeline follows below)
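The whole recipe can be expressed as a short pipeline. Here is a sketch assuming two hypothetical helpers, train_nmt(src, tgt) returning a trained model and translate_corpus(model, sentences) returning translations; they stand in for whatever NMT toolkit is used.

def backtranslate(parallel_src, parallel_tgt, mono_tgt, train_nmt, translate_corpus):
    # 1. train a reverse (target-to-source) system on the real parallel data
    reverse_model = train_nmt(parallel_tgt, parallel_src)
    # 2. translate target-side monolingual data back into the source language
    synthetic_src = translate_corpus(reverse_model, mono_tgt)
    # 3. add the synthetic pairs to the real parallel data and train the final system
    return train_nmt(parallel_src + synthetic_src, parallel_tgt + mono_tgt)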
deeper models

Deeper Models
• Encoder and decoder are recurrent neural networks
• We can add additional layers for each step
• Recall shallow and deep language models
[Figure: shallow model (input → hidden layer → output) vs. deep model (input → hidden layers 1, 2, 3 → output)]
• Adding residual connections (short-cuts through deep layers) helps
Deep Decoder
• Two ways of adding layers
– deep transitions: several layers on path to output
– deeply stacking recurrent neural networks
• Why not both?
[Figure: deep decoder combining both approaches: the input context feeds decoder states Stack 1 / Transition 1, Stack 1 / Transition 2, Stack 2 / Transition 1, Stack 2 / Transition 2]
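A schematic sketch (not any particular toolkit's API) of how deep transitions and stacking combine in one decoder step; each cell below is any recurrent update function taking (input, state) and returning a new state:

def deep_decoder_step(states, context, cells):
    # states: one hidden state per stacked layer
    # cells:  cells[k] is the list of transition cells for stacked layer k
    new_states, inp = [], context
    for k, layer_cells in enumerate(cells):
        s = states[k]
        for cell in layer_cells:   # deep transitions: several updates within one layer
            s = cell(inp, s)
        new_states.append(s)
        inp = s                    # stacking: layer k feeds layer k + 1
    return new_states

# toy demonstration with a trivial scalar "cell" so the sketch runs end to end
toy_cell = lambda inp, state: 0.5 * state + 0.5 * inp
print(deep_decoder_step([0.0, 0.0], 1.0, [[toy_cell, toy_cell], [toy_cell, toy_cell]]))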
Deep Encoder
• Previously proposed encoder already has 2 layers
– left-to-right recurrent network, to encode left context
– right-to-left recurrent network, to encode right context
⇒ Third way of adding layers
[Figure: stacked encoder: input word embedding feeds Encoder Layer 1 (L2R), Layer 2 (R2L), Layer 3 (L2R), Layer 4 (R2L)]
Reality Check: Edinburgh WMT 2017

alignment and coverage

Alignment
• Attention model fulfills role of alignment
• Traditional methods for word alignment
– based on co-occurrence, word position, etc.
– expectation maximization (EM) algorithm
– popular: IBM models, fast-align
Attention vs. Alignment
[Figure: attention weights (×100) between the English sentence "relations between Obama and Netanyahu have been strained for years ." and the German translation "die Beziehungen zwischen Obama und Netanjahu sind seit Jahren angespannt ."; the strongest weights largely follow the word alignment]
Guided Alignment
• Guided alignment training for neural networks
– traditional objective function: match output words
– now: also match given word alignments
• Add as cost to objective function
– given alignment matrix A, with \sum_j A_{ij} = 1 (from IBM Models)
– computed attention \alpha_{ij} (also \sum_j \alpha_{ij} = 1 due to softmax)
– added training objective (cross-entropy)
  \text{cost}_{CE} = -\frac{1}{I} \sum_{i=1}^{I} \sum_{j=1}^{J} A_{ij} \log \alpha_{ij}
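As a concrete illustration, a minimal numpy sketch of this cost, assuming A and alpha are arrays of shape (target length I, source length J) with rows summing to 1:

import numpy as np

def guided_alignment_cost(A, alpha, eps=1e-9):
    # A:     reference alignment matrix (e.g. from IBM models), rows sum to 1
    # alpha: attention weights computed by the model, rows sum to 1 (softmax)
    I = A.shape[0]
    # cross-entropy between reference alignment and attention, averaged over target words
    return -np.sum(A * np.log(alpha + eps)) / I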
Coverage
[Figure: attention weights (×100) between the English sentence "in order to solve the problem , the " Social Housing " alliance suggests a fresh start ." and the German output "um das Problem zu lösen , schlägt das Unternehmen der Gesellschaft für soziale Bildung vor ."; the bottom row gives the accumulated attention (coverage ×100) per input word: 43 7 46 161 108 89 62 112 392 121 110 130 26 132 22 19 6 6]
Tracking Coverage
• Neural machine translation may drop or duplicate content
• Track coverage during decoding
  \text{coverage}(j) = \sum_i \alpha_{i,j}
  \text{over-generation} = \sum_j \max(0, \text{coverage}(j) - 1)
  \text{under-generation} = \sum_j \min(1, \text{coverage}(j))
• Add as cost to hypotheses
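A minimal numpy sketch of these quantities; how heavily they are weighted when added to the hypothesis score is a tuning choice not specified here:

import numpy as np

def coverage_penalties(alpha):
    # alpha: attention weights of shape (output length I, input length J)
    coverage = alpha.sum(axis=0)                   # accumulated attention per input word
    over = np.maximum(0.0, coverage - 1.0).sum()   # attention mass beyond 1 per input word
    under = np.minimum(1.0, coverage).sum()        # covered mass, capped at 1 per input word
    return over, under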
Coverage Models
• Use as information for state progression
  a(s_{i-1}, h_j) = W^a s_{i-1} + U^a h_j + V^a \, \text{coverage}(j) + b^a
• Add to objective function
  \sum_i \log P(y_i|x) + \lambda \sum_j (1 - \text{coverage}(j))^2
• May also model fertility
– some words are typically dropped
– some words produce multiple output words
linguistic annotation

Example
Words              the     girl      watched   attentively   the        beautiful   fireflies
Part of speech     DET     NN        VFIN      ADV           DET        JJ          NNS
Lemma              the     girl      watch     attentive     the        beautiful   firefly
Morphology         -       SING.     PAST      -             -          -           PLURAL
Noun phrase        BEGIN   CONT      OTHER     OTHER         BEGIN      CONT        CONT
Verb phrase        OTHER   OTHER     BEGIN     CONT          CONT       CONT        CONT
Synt. dependency   girl    watched   -         watched       fireflies  fireflies   watched
Depend. relation   DET     SUBJ      -         ADV           DET        ADJ         OBJ
Semantic role      -       ACTOR     -         MANNER        -          MOD         PATIENT
Semantic type      -       HUMAN     VIEW      -             -          -           ANIMATE
Input Annotation
• Input words are encoded in one-hot vectors
• Additional linguistic annotation
– part-of-speech tag
– morphological features
– etc.
• Encode each annotation in its own one-hot vector space
• Concatenate one-hot vectors
• Essentially:
– each annotation maps to embedding
– embeddings are added
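A small numpy sketch of that equivalence, with made-up vocabulary and embedding sizes: multiplying the concatenated one-hot vectors by one stacked embedding matrix gives exactly the sum of the per-annotation embeddings.

import numpy as np

rng = np.random.default_rng(0)
vocab_sizes = [5000, 50, 20]   # word, part-of-speech tag, morphology (illustrative sizes)
dim = 512                      # embedding dimensionality (illustrative)
tables = [rng.normal(size=(v, dim)) for v in vocab_sizes]   # one embedding table per annotation

word, pos, morph = 42, 7, 3    # indices of a word and its annotations

# sum of per-annotation embeddings ...
summed = tables[0][word] + tables[1][pos] + tables[2][morph]

# ... equals the concatenated one-hot vectors times the stacked embedding matrix
one_hot = np.zeros(sum(vocab_sizes))
one_hot[word] = 1
one_hot[vocab_sizes[0] + pos] = 1
one_hot[vocab_sizes[0] + vocab_sizes[1] + morph] = 1
assert np.allclose(summed, one_hot @ np.concatenate(tables, axis=0))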
Output Annotation
• Same can be done for output
• Additional output annotation is latent feature
– ultimately, we do not care if the right part-of-speech tag is predicted
– only the right output words matter
• Optimizing for correct output annotation → better prediction of output words
Linearized Output Syntax
Sentence the girl watched attentively the beautiful fireflies
Syntax tree (drawn as a tree diagram; equivalent to the linearized form below)
Linearized (S (NP (DET the ) (NN girl ) ) (VP (VFIN watched ) (ADVP (ADV attentively
) ) (NP (DET the ) (JJ beautiful ) (NNS fireflies ) ) ) )
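The linearization itself is a simple tree traversal; a small sketch using nested tuples (label, children...) as a stand-in for whatever parse representation is actually used:

def linearize(tree):
    # tree: (label, children...) where a child is either a nested tuple or a word
    label, children = tree[0], tree[1:]
    parts = ["(" + label]
    for child in children:
        parts.append(linearize(child) if isinstance(child, tuple) else child)
    parts.append(")")
    return " ".join(parts)

tree = ("S",
        ("NP", ("DET", "the"), ("NN", "girl")),
        ("VP", ("VFIN", "watched"),
               ("ADVP", ("ADV", "attentively")),
               ("NP", ("DET", "the"), ("JJ", "beautiful"), ("NNS", "fireflies"))))
print(linearize(tree))
# (S (NP (DET the ) (NN girl ) ) (VP (VFIN watched ) (ADVP (ADV attentively ) ) (NP (DET the ) (JJ beautiful ) (NNS fireflies ) ) ) )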
multiple language pairs

One Model, Multiple Language Pairs
• One language pair → train one model
• Multiple language pairs → train one model for each
• Multiple language pairs → train one model for all
Multiple Input Languages
• Given
– French–English corpus
– German–English corpus
• Train one model on concatenated corpora
• Benefit: sharing monolingual target language data
Multiple Output Languages
• Multiple output languages
– French–English corpus
– French–Spanish corpus
• Need to mark desired output language with special token
[ENGLISH] N’y a-t-il pas ici deux poids, deux mesures?
⇒ Is this not a case of double standards?
[SPANISH] N’y a-t-il pas ici deux poids, deux mesures?
⇒ No puede verse con toda claridad que estamos utilizando un doble rasero?
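Preparing the training data for this setup amounts to prepending the desired output-language token to every source sentence before concatenating the corpora; a tiny sketch (the token format is illustrative):

def mark_target_language(sentences, lang):
    # prepend a target-language token such as [ENGLISH] or [SPANISH] to each source sentence
    return ["[" + lang.upper() + "] " + s for s in sentences]

french = ["N'y a-t-il pas ici deux poids, deux mesures ?"]
train_source = mark_target_language(french, "english") + mark_target_language(french, "spanish")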
Zero Shot
[Figure: one multilingual MT system connecting English, French, Spanish, and German]
• Can the model translate German to Spanish?
[SPANISH] Messen wir hier nicht mit zweierlei Maß?
⇒ No puede verse con toda claridad que estamos utilizando un doble rasero?
Zero Shot: Vision
• Direct translation only requires bilingual mapping
• Zero shot requires interlingual representation
Zero Shot: Reality