IBM Model 1 and the EM Algorithm

Philipp Koehn

10 September 2020


Lexical Translation

• How to translate a word → look up in dictionary
  Haus — house, building, home, household, shell
• Multiple translations
  – some more frequent than others
  – for instance: house and building are most common
  – special case: the Haus of a snail is its shell
• Note: in all lectures, we translate from a foreign language into English


Collect Statistics

Look at a parallel corpus (German text along with English translation)

  Translation of Haus    Count
  house                  8,000
  building               1,600
  home                     200
  household                150
  shell                     50


Estimate Translation Probabilities

Maximum likelihood estimation:

  p_f(e) =
  \begin{cases}
  0.8   & \text{if } e = \text{house} \\
  0.16  & \text{if } e = \text{building} \\
  0.02  & \text{if } e = \text{home} \\
  0.015 & \text{if } e = \text{household} \\
  0.005 & \text{if } e = \text{shell}
  \end{cases}


Alignment

• In a parallel text (or when we translate), we align words in one language with the words in the other

  das Haus ist klein
  the house is small

• Word positions are numbered 1–4


Alignment Function

• Formalizing alignment with an alignment function
• Mapping an English target word at position i to a German source word at position j with a function a : i → j
• Example: a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}


Reordering

Words may be reordered during translation

  klein ist das Haus
  the house is small

  a : {1 → 3, 2 → 4, 3 → 2, 4 → 1}


One-to-Many Translation

A source word may translate into multiple target words

  das Haus ist klitzeklein
  the house is very small

  a : {1 → 1, 2 → 2, 3 → 3, 4 → 4, 5 → 4}


Dropping Words

Words may be dropped when translated (the German article das is dropped)

  das Haus ist klein
  house is small

  a : {1 → 2, 2 → 3, 3 → 4}


Inserting Words

• Words may be added during translation
  – the English just does not have an equivalent in German
  – we still need to map it to something: special NULL token

  NULL das Haus ist klein
  the house is just small

  a : {1 → 1, 2 → 2, 3 → 3, 4 → 0, 5 → 4}


IBM Model 1

• Generative model: break up translation process into smaller steps
  – IBM Model 1 only uses lexical translation
• Translation probability
  – for a foreign sentence f = (f_1, ..., f_{l_f}) of length l_f
  – to an English sentence e = (e_1, ..., e_{l_e}) of length l_e
  – with an alignment of each English word e_j to a foreign word f_i according to the alignment function a : j → i

  p(e, a|f) = \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} t(e_j | f_{a(j)})

  – the parameter \epsilon is a normalization constant
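To make the formula concrete, here is a minimal Python sketch that scores one sentence pair under a fixed alignment (the function name, the dictionary representation of t, and ε = 1 are illustrative assumptions, not part of the model definition):

# p(e, a | f) under IBM Model 1 for a fixed alignment function a.
# t maps (english_word, foreign_word) pairs to probabilities t(e|f).
def model1_prob(e, f, a, t, epsilon=1.0):
    f = ["NULL"] + f                      # position 0 is the special NULL token
    le, lf = len(e), len(f) - 1
    p = epsilon / (lf + 1) ** le          # uniform probability of the alignment
    for j in range(le):                   # multiply in t(e_j | f_a(j))
        p *= t.get((e[j], f[a[j]]), 0.0)
    return p

t = {("the", "das"): 0.7, ("house", "Haus"): 0.8,
     ("is", "ist"): 0.8, ("small", "klein"): 0.4}
p = model1_prob(["the", "house", "is", "small"],
                ["das", "Haus", "ist", "klein"],
                [1, 2, 3, 4], t)          # a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}
print(p)                                  # 0.1792 * epsilon / 5^4 ≈ 0.000287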
Example

das Haus ist klein

  t(e|das):   the 0.7,     that 0.15,      which 0.075,  who 0.05,         this 0.025
  t(e|Haus):  house 0.8,   building 0.16,  home 0.02,    household 0.015,  shell 0.005
  t(e|ist):   is 0.8,      ’s 0.16,        exists 0.02,  has 0.015,        are 0.005
  t(e|klein): small 0.4,   little 0.4,     short 0.1,    minor 0.06,       petty 0.04

  p(e, a|f) = \frac{\epsilon}{4^3} \times t(the|das) \times t(house|Haus) \times t(is|ist) \times t(small|klein)
            = \frac{\epsilon}{4^3} \times 0.7 \times 0.8 \times 0.8 \times 0.4
            = 0.0028\epsilon


finding translations


Centauri-Arcturan Parallel Text

  1a. ok-voon ororok sprok .                  1b. at-voon bichat dat .
  2a. ok-drubel ok-voon anok plok sprok .     2b. at-drubel at-voon pippat rrat dat .
  3a. erok sprok izok hihok ghirok .          3b. totat dat arrat vat hilat .
  4a. ok-voon anok drok brok jok .            4b. at-voon krat pippat sat lat .
  5a. wiwok farok izok stok .                 5b. totat jjat quat cat .
  6a. lalok sprok izok jok stok .             6b. wat dat krat quat cat .
  7a. lalok farok ororok lalok sprok izok enemok .
  7b. wat jjat bichat wat dat vat eneat .
  8a. lalok brok anok plok nok .              8b. iat lat pippat rrat nnat .
  9a. wiwok nok izok kantok ok-yurp .         9b. totat nnat quat oloat at-yurp .
  10a. lalok mok nok yorok ghirok clok .      10b. wat nnat gat mat bat hilat .
  11a. lalok nok crrrok hihok yorok zanzanok .
  11b. wat nnat arrat mat zanzanat .
  12a. lalok rarok nok izok hihok mok .       12b. wat nnat forat arrat vat gat .

Translation challenge: farok crrrok hihok yorok clok kantok ok-yurp

(from Knight (1997): Automating Knowledge Acquisition for Machine Translation)


em algorithm


Learning Lexical Translation Models

• We would like to estimate the lexical translation probabilities t(e|f) from a parallel corpus
• ... but we do not have the alignments
• Chicken-and-egg problem
  – if we had the alignments, we could estimate the parameters of our generative model
  – if we had the parameters, we could estimate the alignments


EM Algorithm

• Incomplete data
  – if we had complete data, we could estimate the model
  – if we had the model, we could fill in the gaps in the data
• Expectation Maximization (EM) in a nutshell (see the sketch below):
  1. initialize model parameters (e.g. uniform)
  2. assign probabilities to the missing data
  3. estimate model parameters from completed data
  4. iterate steps 2–3 until convergence
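As a schematic, the nutshell recipe in Python (purely illustrative: t_init, e_step, and m_step are hypothetical arguments that the caller supplies; they are made concrete for IBM Model 1 later in these slides):

def em(corpus, t_init, e_step, m_step, iterations=10):
    t = dict(t_init)                 # 1. initialize model parameters (e.g. uniform)
    for _ in range(iterations):      # 4. iterate steps 2-3 (fixed count stands in
        counts = e_step(corpus, t)   # 2. assign probabilities to the missing data
        t = m_step(counts)           # 3. estimate parameters from completed data
    return t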
EM Algorithm

(figure: word alignments in a small corpus — ... la maison ... / ... the house ...,
 ... la maison bleu ... / ... the blue house ..., ... la fleur ... / ... the flower ...)

• Initial step: all alignments equally likely
• Model learns that, e.g., la is often aligned with the


EM Algorithm

(figure: the same corpus after one iteration)

• After one iteration
• Alignments, e.g., between la and the are more likely


EM Algorithm

(figure: the same corpus after another iteration)

• After another iteration
• It becomes apparent that alignments, e.g., between fleur and flower are more likely (pigeonhole principle)


EM Algorithm

(figure: the same corpus at convergence)

• Convergence
• Inherent hidden structure revealed by EM


EM Algorithm

(figure: the converged alignments, annotated with parameter estimates)

  p(la|the) = 0.453
  p(le|the) = 0.334
  p(maison|house) = 0.876
  p(bleu|blue) = 0.563
  ...

• Parameter estimation from the aligned corpus


IBM Model 1 and EM

• EM algorithm consists of two steps
• Expectation step: apply model to the data
  – parts of the model are hidden (here: alignments)
  – using the model, assign probabilities to possible values
• Maximization step: estimate model from data
  – take assigned values as fact
  – collect counts (weighted by probabilities)
  – estimate model from counts
• Iterate these steps until convergence


IBM Model 1 and EM

• We need to be able to compute:
  – Expectation step: probability of alignments
  – Maximization step: count collection


IBM Model 1 and EM

• Probabilities

  p(the|la) = 0.7       p(house|la) = 0.05
  p(the|maison) = 0.1   p(house|maison) = 0.8

• Alignments (the four ways of aligning la maison to the house):

  the–la, house–maison       p(e, a|f) = 0.56    p(a|e, f) = 0.824
  the–la, house–la           p(e, a|f) = 0.035   p(a|e, f) = 0.052
  the–maison, house–maison   p(e, a|f) = 0.08    p(a|e, f) = 0.118
  the–maison, house–la       p(e, a|f) = 0.005   p(a|e, f) = 0.007

• Counts

  c(the|la) = 0.824 + 0.052       c(house|la) = 0.052 + 0.007
  c(the|maison) = 0.118 + 0.007   c(house|maison) = 0.824 + 0.118


IBM Model 1 and EM: Expectation Step

• We need to compute p(a|e, f)
• Applying the chain rule:

  p(a|e, f) = \frac{p(e, a|f)}{p(e|f)}

• We already have the formula for p(e, a|f) (definition of Model 1)


IBM Model 1 and EM: Expectation Step

• We need to compute p(e|f)

  p(e|f) = \sum_a p(e, a|f)
         = \sum_{a(1)=0}^{l_f} \cdots \sum_{a(l_e)=0}^{l_f} p(e, a|f)
         = \sum_{a(1)=0}^{l_f} \cdots \sum_{a(l_e)=0}^{l_f} \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} t(e_j|f_{a(j)})


IBM Model 1 and EM: Expectation Step

  p(e|f) = \sum_{a(1)=0}^{l_f} \cdots \sum_{a(l_e)=0}^{l_f} \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} t(e_j|f_{a(j)})
         = \frac{\epsilon}{(l_f + 1)^{l_e}} \sum_{a(1)=0}^{l_f} \cdots \sum_{a(l_e)=0}^{l_f} \prod_{j=1}^{l_e} t(e_j|f_{a(j)})
         = \frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} \sum_{i=0}^{l_f} t(e_j|f_i)

• Note the trick in the last line
  – removes the need for an exponential number of products
  → this makes IBM Model 1 estimation tractable
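The difference is easy to check numerically. A small sketch (illustrative code; the naive version enumerates all (l_f + 1)^{l_e} alignments, which is exactly what the factored form avoids):

import itertools

# naive p(e|f): sum over every alignment function a : j -> i, with i = 0 the NULL token
def p_naive(e, f, t, epsilon=1.0):
    f = ["NULL"] + f
    total = 0.0
    for a in itertools.product(range(len(f)), repeat=len(e)):
        p = 1.0
        for j, i in enumerate(a):
            p *= t.get((e[j], f[i]), 0.0)
        total += p
    return epsilon / len(f) ** len(e) * total

# factored p(e|f): product of sums, linear in the sentence lengths
def p_factored(e, f, t, epsilon=1.0):
    f = ["NULL"] + f
    p = epsilon / len(f) ** len(e)
    for ej in e:
        p *= sum(t.get((ej, fi), 0.0) for fi in f)
    return p

t = {("the", "la"): 0.7, ("house", "la"): 0.05,
     ("the", "maison"): 0.1, ("house", "maison"): 0.8}
e, f = ["the", "house"], ["la", "maison"]
assert abs(p_naive(e, f, t) - p_factored(e, f, t)) < 1e-12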
The Trick (case l_e = l_f = 2)

  \sum_{a(1)=0}^{2} \sum_{a(2)=0}^{2} \frac{\epsilon}{3^2} \prod_{j=1}^{2} t(e_j|f_{a(j)})
  = \frac{\epsilon}{3^2} [ t(e_1|f_0) t(e_2|f_0) + t(e_1|f_0) t(e_2|f_1) + t(e_1|f_0) t(e_2|f_2)
                         + t(e_1|f_1) t(e_2|f_0) + t(e_1|f_1) t(e_2|f_1) + t(e_1|f_1) t(e_2|f_2)
                         + t(e_1|f_2) t(e_2|f_0) + t(e_1|f_2) t(e_2|f_1) + t(e_1|f_2) t(e_2|f_2) ]
  = \frac{\epsilon}{3^2} [ t(e_1|f_0) (t(e_2|f_0) + t(e_2|f_1) + t(e_2|f_2))
                         + t(e_1|f_1) (t(e_2|f_0) + t(e_2|f_1) + t(e_2|f_2))
                         + t(e_1|f_2) (t(e_2|f_0) + t(e_2|f_1) + t(e_2|f_2)) ]
  = \frac{\epsilon}{3^2} (t(e_1|f_0) + t(e_1|f_1) + t(e_1|f_2)) (t(e_2|f_0) + t(e_2|f_1) + t(e_2|f_2))


IBM Model 1 and EM: Expectation Step

• Combine what we have:

  p(a|e, f) = \frac{p(e, a|f)}{p(e|f)}
            = \frac{\frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} t(e_j|f_{a(j)})}{\frac{\epsilon}{(l_f + 1)^{l_e}} \prod_{j=1}^{l_e} \sum_{i=0}^{l_f} t(e_j|f_i)}
            = \prod_{j=1}^{l_e} \frac{t(e_j|f_{a(j)})}{\sum_{i=0}^{l_f} t(e_j|f_i)}


IBM Model 1 and EM: Maximization Step

• Now we have to collect counts
• Evidence from a sentence pair (e, f) that word e is a translation of word f:

  c(e|f; e, f) = \sum_a p(a|e, f) \sum_{j=1}^{l_e} \delta(e, e_j) \delta(f, f_{a(j)})

• With the same simplification as before:

  c(e|f; e, f) = \frac{t(e|f)}{\sum_{i=0}^{l_f} t(e|f_i)} \sum_{j=1}^{l_e} \delta(e, e_j) \sum_{i=0}^{l_f} \delta(f, f_i)


IBM Model 1 and EM: Maximization Step

After collecting these counts over a corpus, we can estimate the model:

  t(e|f) = \frac{\sum_{(e,f)} c(e|f; e, f)}{\sum_e \sum_{(e,f)} c(e|f; e, f)}


IBM Model 1 and EM: Pseudocode

Input: set of sentence pairs (e, f)
Output: translation probabilities t(e|f)

  initialize t(e|f) uniformly
  while not converged do
    // initialize
    count(e|f) = 0 for all e, f
    total(f) = 0 for all f
    for all sentence pairs (e, f) do
      // compute normalization
      for all words e in e do
        s-total(e) = 0
        for all words f in f do
          s-total(e) += t(e|f)
        end for
      end for
      // collect counts
      for all words e in e do
        for all words f in f do
          count(e|f) += t(e|f) / s-total(e)
          total(f) += t(e|f) / s-total(e)
        end for
      end for
    end for
    // estimate probabilities
    for all foreign words f do
      for all English words e do
        t(e|f) = count(e|f) / total(f)
      end for
    end for
  end while


Convergence

Corpus: das Haus / the house,   das Buch / the book,   ein Buch / a book

  e      f      initial  1st it.  2nd it.  3rd it.  ...  final
  the    das    0.25     0.5      0.6364   0.7479   ...  1
  book   das    0.25     0.25     0.1818   0.1208   ...  0
  house  das    0.25     0.25     0.1818   0.1313   ...  0
  the    buch   0.25     0.25     0.1818   0.1208   ...  0
  book   buch   0.25     0.5      0.6364   0.7479   ...  1
  a      buch   0.25     0.25     0.1818   0.1313   ...  0
  book   ein    0.25     0.5      0.4286   0.3466   ...  0
  a      ein    0.25     0.5      0.5714   0.6534   ...  1
  the    haus   0.25     0.5      0.4286   0.3466   ...  0
  house  haus   0.25     0.5      0.5714   0.6534   ...  1
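The pseudocode translates almost line for line into Python. A minimal sketch (assumptions: tokenized sentence pairs, a fixed iteration count instead of a convergence test, and no NULL token, matching the pseudocode above):

from collections import defaultdict

def train_ibm1(corpus, iterations):
    # corpus: list of (english_tokens, foreign_tokens) pairs
    e_vocab = {e for es, _ in corpus for e in es}
    f_vocab = {f for _, fs in corpus for f in fs}
    t = {(e, f): 1.0 / len(e_vocab) for e in e_vocab for f in f_vocab}  # uniform init
    for _ in range(iterations):
        count = defaultdict(float)              # count(e|f)
        total = defaultdict(float)              # total(f)
        for es, fs in corpus:
            # compute normalization s-total(e) = sum over f of t(e|f)
            s_total = {e: sum(t[(e, f)] for f in fs) for e in es}
            # collect fractional counts
            for e in es:
                for f in fs:
                    count[(e, f)] += t[(e, f)] / s_total[e]
                    total[f] += t[(e, f)] / s_total[e]
        # estimate probabilities
        t = {(e, f): count[(e, f)] / total[f] for e in e_vocab for f in f_vocab}
    return t

corpus = [("the house".split(), "das haus".split()),
          ("the book".split(), "das buch".split()),
          ("a book".split(), "ein buch".split())]
t = train_ibm1(corpus, iterations=3)
print(round(t[("the", "das")], 4))   # 0.7479, as in the 3rd-iteration column above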
Perplexity

• How well does the model fit the data?
• Perplexity: derived from the probability of the training data according to the model

  \log_2 PP = - \sum_s \log_2 p(e_s|f_s)

• Example (\epsilon = 1)

                           initial  1st it.  2nd it.  3rd it.  ...  final
  p(the house|das haus)    0.0625   0.1875   0.1905   0.1913   ...  0.1875
  p(the book|das buch)     0.0625   0.1406   0.1790   0.2075   ...  0.25
  p(a book|ein buch)       0.0625   0.1875   0.1907   0.1913   ...  0.1875
  perplexity               4095     202.3    153.6    131.6    ...  113.8


Higher IBM Models

  IBM Model 1   lexical translation
  IBM Model 2   adds absolute reordering model
  IBM Model 3   adds fertility model
  IBM Model 4   relative reordering model
  IBM Model 5   fixes deficiency

• Only IBM Model 1 has a global maximum
  – training of a higher IBM model builds on the previous model
• Computationally, the biggest change is in Model 3
  – the trick to simplify estimation does not work anymore
  → exhaustive count collection becomes computationally too expensive
  – sampling over high-probability alignments is used instead


word alignment


Word Alignment

Given a sentence pair, which words correspond to each other?

(figure: alignment matrix between michael geht davon aus , dass er im haus bleibt
 and michael assumes that he will stay in the house)


Word Alignment?

(figure: alignment matrix between john wohnt hier nicht and john does not live here)

Is the English word does aligned to the German wohnt (verb) or nicht (negation) or neither?


Word Alignment?

(figure: alignment matrix between john ins gras biss and john kicked the bucket)

How do the idioms kicked the bucket and biss ins gras match up?
Outside this exceptional context, bucket is never a good translation for gras


Measuring Word Alignment Quality

• Manually align corpus with sure (S) and possible (P) alignment points (S ⊆ P)
• Common metric for evaluating word alignments: Alignment Error Rate (AER)

  AER(S, P; A) = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}

• AER = 0: alignment A contains all sure alignment points, and only possible alignment points
• However: different applications require different precision/recall trade-offs


symmetrization


Word Alignment with IBM Models

• IBM Models create a many-to-one mapping
  – words are aligned using an alignment function
  – a function may return the same value for different inputs (one-to-many mapping)
  – but a function cannot return multiple values for one input (no many-to-one mapping)
• Real word alignments have many-to-many mappings


Symmetrization

• Run IBM Model training in both directions
  → two sets of word alignment points
• Intersection: high-precision alignment points
• Union: high-recall alignment points
• Refinement methods explore the space between intersection and union (see the sketch below)
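A minimal sketch of the two starting points, assuming each directional alignment is given as a set of (English position, foreign position) pairs (this representation is an illustrative choice):

def symmetrize(e2f, f2e):
    # e2f, f2e: alignment points from the two training directions,
    # both expressed as sets of (english_position, foreign_position) pairs
    intersection = e2f & f2e   # points both models agree on: high precision
    union = e2f | f2e          # points proposed by either model: high recall
    return intersection, union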
Example

(figure: three alignment matrices for Maria no daba una bofetada a la bruja verde /
 Mary did not slap the green witch — the english-to-spanish alignment, the
 spanish-to-english alignment, and their intersection)


Growing Heuristics

(figure: alignment matrix for the same sentence pair — black: intersection points,
 grey: additional points in the union)

• Add alignment points from the union based on heuristics:
  – directly/diagonally neighboring points
  – finally, add alignments that connect unaligned words in source and/or target
• Popular method: grow-diag-final-and (sketched below)
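A rough sketch of the growing idea (illustrative and simplified; real implementations such as the one in Moses also run the final/final-and steps, which are omitted here):

def grow_diag(intersection, union):
    # start from the high-precision intersection
    alignment = set(intersection)
    neighbors = [(-1, 0), (0, -1), (1, 0), (0, 1),      # directly neighboring
                 (-1, -1), (-1, 1), (1, -1), (1, 1)]    # diagonally neighboring
    added = True
    while added:
        added = False
        for e, f in sorted(alignment):
            for de, df in neighbors:
                ne, nf = e + de, f + df
                # add a union point next to an existing point if it aligns
                # an English or foreign word that is still unaligned
                if (ne, nf) in union and (ne, nf) not in alignment:
                    e_unaligned = all(pe != ne for pe, _ in alignment)
                    f_unaligned = all(pf != nf for _, pf in alignment)
                    if e_unaligned or f_unaligned:
                        alignment.add((ne, nf))
                        added = True
    return alignment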