Decoding
Philipp Koehn
16 September 2021
Philipp Koehn Machine Translation: Decoding 16 September 2021
Decoding
• We have a mathematical model for translation
$p(e \mid f)$
• Task of decoding: find the translation $e_{\text{best}}$ with the highest probability
$e_{\text{best}} = \arg\max_e \; p(e \mid f)$
• Two types of error
– the most probable translation is bad → fix the model
– search does not find the most probable translation → fix the search
• Decoding is evaluated by search error, not quality of translations
(although these are often correlated)
translation process
Translation Process
• Task: translate this sentence from German into English
er geht ja nicht nach hause
er → he
• Pick phrase in input, translate
er → he, ja nicht → does not
• Pick phrase in input, translate
– it is allowed to pick words out of sequence (reordering)
– phrases may have multiple words: many-to-many translation
er → he, ja nicht → does not, geht → go
er → he, ja nicht → does not, geht → go, nach hause → home
Computing Translation Probability
• Probabilistic model for phrase-based translation:
$$e_{\text{best}} = \arg\max_e \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\; d(\text{start}_i - \text{end}_{i-1} - 1)\; p_{\text{LM}}(e)$$
• Score is computed incrementally for each partial hypothesis
• Components
– Phrase translation: picking phrase $\bar{f}_i$ to be translated as phrase $\bar{e}_i$
→ look up score $\phi(\bar{f}_i \mid \bar{e}_i)$ from phrase translation table
– Reordering: previous phrase ended at $\text{end}_{i-1}$, current phrase starts at $\text{start}_i$
→ compute $d(\text{start}_i - \text{end}_{i-1} - 1)$
– Language model: for an n-gram model, need to keep track of the last $n - 1$ words
→ compute score $p_{\text{LM}}(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$ for added words $w_i$
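The three components can be sketched in code as a sum of log scores. This is only an illustration under assumptions: the exponential decay `alpha ** |x|` for the reordering score, the toy phrase table, and the function names `distortion_logprob` and `score_derivation` are not from the slides.

```python
import math

def distortion_logprob(start, prev_end, alpha=0.5):
    """Log of d(start_i - end_{i-1} - 1); the form alpha^|x| is an assumption."""
    return abs(start - prev_end - 1) * math.log(alpha)

def score_derivation(phrase_pairs, phrase_table, lm_logprob, alpha=0.5):
    """Score a full derivation in log space.

    phrase_pairs: list of (start, end, source_phrase, target_phrase) in output
    order, with 1-based inclusive source positions.
    phrase_table: dict mapping (source_phrase, target_phrase) to log phi.
    lm_logprob: function from a list of output words to a log probability.
    """
    total = 0.0
    prev_end = 0          # nothing translated yet
    output = []
    for start, end, src, tgt in phrase_pairs:
        total += phrase_table[(src, tgt)]                    # phrase translation
        total += distortion_logprob(start, prev_end, alpha)  # reordering
        prev_end = end
        output.extend(tgt.split())
    total += lm_logprob(output)                              # language model
    return total

# Toy usage with made-up scores (not taken from a real phrase table).
toy_table = {("er", "he"): -0.5, ("geht", "goes"): -1.0}
toy_lm = lambda words: -1.0 * len(words)
monotone = [(1, 1, "er", "he"), (2, 2, "geht", "goes")]
print(score_derivation(monotone, toy_table, toy_lm))
```

A real decoder adds these terms incrementally each time a hypothesis is expanded, rather than rescoring the whole derivation.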
decoding process
Translation Options
er geht ja nicht nach hause
[Figure: translation options from the phrase table for the words and phrases of the input sentence. Options shown include: he, it, goes, go, is, are, yes, of course, not, do not, does not, is not, after, to, according to, in, house, home, chamber, at home, under house, return home, it is, he will be, it goes, he goes, is after all, does, following, not after, not to, are not, is not a]
• Many translation options to choose from
– in Europarl phrase table: 2727 matching phrase pairs for this sentence
– by pruning to the top 20 per phrase, 202 translation options remain
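Collecting and pruning translation options can be sketched as below. The phrase table layout (source phrase mapped to a list of `(target, log_score)` pairs), the maximum phrase length, and the name `collect_options` are assumptions for illustration. Only the "top 20 per phrase" rule comes from the slide.

```python
def collect_options(words, phrase_table, max_phrase_len=3, top_k=20):
    """Return translation options as (start, end, target, log_score).

    Spans are 0-based and half-open: words[start:end].
    For each source span, only the top_k best-scoring targets are kept.
    """
    options = []
    n = len(words)
    for start in range(n):
        for end in range(start + 1, min(start + max_phrase_len, n) + 1):
            src = " ".join(words[start:end])
            candidates = phrase_table.get(src, [])
            # Higher log score is better, so sort in descending order.
            best = sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]
            options.extend((start, end, tgt, score) for tgt, score in best)
    return options

# Toy phrase table with made-up scores.
toy_table = {
    "er": [("he", -0.4), ("it", -1.2)],
    "geht": [("goes", -0.5)],
    "er geht": [("he goes", -0.6), ("it goes", -1.5)],
}
for option in collect_options("er geht".split(), toy_table, top_k=1):
    print(option)
```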
• The machine translation decoder does not know the right answer
– picking the right translation options
– arranging them in the right order
→ Search problem solved by heuristic beam search
Decoding: Precompute Translation Options
er geht ja nicht nach hause
consult phrase translation table for all input phrases
Decoding: Start with Initial Hypothesis
er geht ja nicht nach hause
initial hypothesis: no input words covered, no output produced
Decoding: Hypothesis Expansion
er geht ja nicht nach hause
[Figure: the initial hypothesis is expanded into a new hypothesis with output "are"]
pick any translation option, create new hypothesis
[Figure: hypotheses "it" and "he" are added next to "are"]
create hypotheses for all other translation options
[Figure: partial hypotheses are expanded further, adding outputs such as "goes", "does not", "yes", "go", "to" and "home"]
also create hypotheses from previously created partial hypotheses
Decoding: Find Best Path
er geht ja nicht nach hause
[Figure: the same search graph of partial hypotheses]
backtrack from highest scoring complete hypothesis
dynamic programming
Computational Complexity
• The suggested process creates an exponential number of hypotheses
• Machine translation decoding is NP-complete
• Reduction of search space:
– recombination (risk-free)
– pruning (risky)
Recombination
• Two hypothesis paths lead to two matching hypotheses
– same foreign words translated
– same English words in the output
[Figure: two different paths both produce the output "it is"]
• Worse hypothesis is dropped
[Figure: a single "it is" hypothesis remains]
Recombination
• Two hypothesis paths lead to hypotheses indistinguishable in subsequent search
– same foreign words translated
– same last two English words in output (assuming trigram language model)
– same last foreign word translated
[Figure: hypotheses "it" and "he" are each extended with "does not"]
• Worse hypothesis is dropped
[Figure: "it" and "he" now lead into a single "does not" hypothesis]
Restrictions on Recombination
• Translation model: phrase translations are independent of each other
→ no restriction on hypothesis recombination
• Language model: last n − 1 words are used as history in an n-gram language model
→ recombined hypotheses must match in their last n − 1 words
• Reordering model: distance-based reordering model depends on the distance to the end position of the previous input phrase
→ recombined hypotheses must have the same end position
• Other feature functions may introduce additional restrictions
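These restrictions amount to a recombination key: two hypotheses can be merged only if their keys are equal. The sketch below assumes a hypothetical `Hypothesis` record and helper names; it covers only the three models listed above, so any extra feature function would need its own entry in the key.

```python
from collections import namedtuple

# Hypothetical record: coverage is a frozenset of covered source positions,
# output a tuple of target words, last_end the end of the last source phrase.
Hypothesis = namedtuple("Hypothesis", "coverage output last_end score")

def recombination_key(hyp, n=3):
    """State that matters for future expansions under an n-gram LM."""
    history = tuple(hyp.output[-(n - 1):]) if n > 1 else ()
    return (hyp.coverage, history, hyp.last_end)

def recombine(hypotheses, n=3):
    """Keep only the best-scoring hypothesis for each recombination key."""
    best = {}
    for hyp in hypotheses:
        key = recombination_key(hyp, n)
        if key not in best or hyp.score > best[key].score:
            best[key] = hyp
    return list(best.values())

covered = frozenset({0, 2, 3})
a = Hypothesis(covered, ("it", "does", "not"), 4, -4.1)
b = Hypothesis(covered, ("he", "does", "not"), 4, -3.2)
print(recombine([a, b]))  # with a trigram LM only "does not" matters
```

A full decoder would usually keep a back-pointer from the surviving hypothesis to the dropped one so that n-best lists can still be extracted; that is omitted here.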
pruning
Pruning
• Recombination reduces search space, but not enough
(we still have an NP-complete problem on our hands)
• Pruning: remove bad hypotheses early
– put comparable hypotheses into stacks
(hypotheses that have translated same number of input words)
– limit number of hypotheses in each stack
Stacks
[Figure: hypotheses ("are", "it", "he", "goes", "does not", "yes") organized into stacks by number of input words translated: no word, one word, two words, three words]
• Hypothesis expansion in a stack decoder
– translation option is applied to hypothesis
– new hypothesis is dropped into a stack further down
Stack Decoding Algorithm
place empty hypothesis into stack 0
for all stacks 0 ... n − 1 do
    for all hypotheses in stack do
        for all translation options do
            if applicable then
                create new hypothesis
                place in stack
                recombine with existing hypothesis if possible
                prune stack if too big
            end if
        end for
    end for
end for
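A minimal runnable sketch of this loop follows. To stay short it makes several simplifying assumptions that are not in the slides: there is no language model (so the recombination key is only coverage plus last end position), the reordering score is `alpha ** |jump|`, and future cost is ignored when pruning. The record `Hyp` and the function `stack_decode` are illustrative names.

```python
import math
from collections import namedtuple

# Hypothetical record for a partial translation.
Hyp = namedtuple("Hyp", "score coverage last_end output")

def stack_decode(n_words, options, max_stack=10, alpha=0.5):
    """options: list of (start, end, target, logprob), source span [start, end)."""
    # One stack per number of covered source words; each stack maps a
    # recombination key to the best hypothesis with that key.
    stacks = [dict() for _ in range(n_words + 1)]
    stacks[0][(frozenset(), 0)] = Hyp(0.0, frozenset(), 0, ())
    for k in range(n_words):
        for hyp in list(stacks[k].values()):
            for start, end, target, logprob in options:
                span = frozenset(range(start, end))
                if span & hyp.coverage:
                    continue  # not applicable: overlaps covered words
                jump = abs(start - hyp.last_end)
                new = Hyp(hyp.score + logprob + jump * math.log(alpha),
                          hyp.coverage | span, end, hyp.output + (target,))
                stack = stacks[len(new.coverage)]
                key = (new.coverage, new.last_end)
                # Recombine: keep the better of two equivalent hypotheses.
                if key not in stack or new.score > stack[key].score:
                    stack[key] = new
                # Histogram pruning: drop the worst entry if the stack is full.
                if len(stack) > max_stack:
                    worst = min(stack, key=lambda key_: stack[key_].score)
                    del stack[worst]
    complete = list(stacks[n_words].values())
    return max(complete, key=lambda h: h.score) if complete else None

# Toy options for "er geht" with made-up scores.
toy_options = [(0, 1, "he", -0.5), (0, 1, "it", -1.0), (1, 2, "goes", -0.7),
               (1, 2, "go", -1.2), (0, 2, "he goes", -1.0)]
print(stack_decode(2, toy_options))
```

Keeping each stack as a dictionary keyed by the recombination state makes recombination a single lookup; sorting is only needed when the stack overflows.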
Pruning
• Pruning strategies
– histogram pruning: keep at most k hypotheses in each stack
– stack pruning: keep only hypotheses whose score (as a probability) is at least α × best score (α < 1)
• Computational time complexity of decoding with histogram pruning
O(max stack size × translation options × sentence length)
• Number of translation options is linear with sentence length, hence:
O(max stack size × sentence length²)
• Quadratic complexity
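Both pruning strategies are easy to state in code. The sketch below assumes hypotheses are `(log_score, label)` pairs, so the α rule becomes an additive cutoff `best + log(α)` in log space. The function names are illustrative; the α-based rule is the one the slide calls stack pruning.

```python
import math

def histogram_prune(hyps, k):
    """Keep at most k hypotheses, best log score first."""
    return sorted(hyps, key=lambda h: h[0], reverse=True)[:k]

def threshold_prune(hyps, alpha):
    """Keep hypotheses with probability >= alpha * best probability.

    Scores are log probabilities, so the cutoff is best + log(alpha).
    """
    if not hyps:
        return []
    cutoff = max(h[0] for h in hyps) + math.log(alpha)
    return [h for h in hyps if h[0] >= cutoff]

stack = [(-1.0, "a"), (-4.0, "c"), (-1.5, "b")]
print(histogram_prune(stack, 2))
print(threshold_prune(stack, 0.5))
```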
Reordering Limits
• Limiting reordering to maximum reordering distance
• Typical reordering distance 5–8 words
– depending on language pair
– larger reordering limit hurts translation quality
• Reduces complexity to linear
O(max stack size × sentence length)
• Speed / quality trade-off by setting maximum stack size
future cost estimation
Translating the Easy Part First?
the tourism initiative addresses this for the first time
[Figure: two competing partial hypotheses, each covering three input words]
– the → die (tm: -0.19, lm: -0.4, d: 0, all: -0.65), then tourism → touristische (tm: -1.16, lm: -2.93, d: 0, all: -4.09), then initiative → initiative (tm: -1.21, lm: -4.67, d: 0, all: -5.88)
– the first time → das erste mal (tm: -0.56, lm: -2.81, d: -0.74, all: -4.11)
both hypotheses translate 3 words
worse hypothesis has better score
Estimating Future Cost
• Future cost estimate: how expensive is translation of rest of sentence?
• Optimistic: choose cheapest translation options
• Cost for each translation option
– translation model: cost known
– language model: output words known, but not context
→ estimate without context
– reordering model: unknown, ignored for future cost estimation
Cost Estimates from Translation Options
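the tourism initiative addresses this for the first time
[Figure: cheapest translation option for each input span. Single-word costs: the -1.0, tourism -2.0, initiative -1.5, addresses -2.4, this -1.4, for -1.0, the -1.0, first -1.9, time -1.6. Multi-word spans carry the costs -4.0, -2.5, -1.3, -2.2, -2.4, -2.7, -2.3, -2.3, -2.3]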
cost of cheapest translation options for each input span (log-probabilities)
Cost Estimates for all Spans
• Compute cost estimate for all contiguous spans by combining cheapest options
Future cost estimate for a span of n words, starting from the given first word:

first word   n=1    n=2    n=3    n=4    n=5    n=6    n=7    n=8    n=9
the          -1.0   -3.0   -4.5   -6.9   -8.3   -9.3   -9.6  -10.6  -10.6
tourism      -2.0   -3.5   -5.9   -7.3   -8.3   -8.6   -9.6   -9.6
initiative   -1.5   -3.9   -5.3   -6.3   -6.6   -7.6   -7.6
addresses    -2.4   -3.8   -4.8   -5.1   -6.1   -6.1
this         -1.4   -2.4   -2.7   -3.7   -3.7
for          -1.0   -1.3   -2.3   -2.3
the          -1.0   -2.2   -2.3
first        -1.9   -2.4
time         -1.6

• Function words are cheaper (the: -1.0) than content words (tourism: -2.0)
• Common phrases are cheaper (for the first time: -2.3)
than unusual ones (tourism initiative addresses: -5.9)
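The table can be filled by dynamic programming over span length: the estimate for a span is the better of its own cheapest translation option and the best way of splitting it into two smaller spans. In the sketch below the single-word costs come from the slides; the multi-word option costs are an assumption, read off the table entries that cannot be explained by combining smaller spans. The function name is illustrative.

```python
def future_cost_table(n, option_cost):
    """Return cost[(start, end)] for every span [start, end) of n words.

    option_cost maps a span to the log score of its cheapest translation
    option; spans without any option are simply absent.
    """
    cost = {}
    for length in range(1, n + 1):
        for start in range(n - length + 1):
            end = start + length
            best = option_cost.get((start, end), float("-inf"))
            # Try every split into two already-computed smaller spans.
            for mid in range(start + 1, end):
                best = max(best, cost[(start, mid)] + cost[(mid, end)])
            cost[(start, end)] = best
    return cost

words = "the tourism initiative addresses this for the first time".split()
single = [-1.0, -2.0, -1.5, -2.4, -1.4, -1.0, -1.0, -1.9, -1.6]
option_cost = {(i, i + 1): c for i, c in enumerate(single)}
# Assumed multi-word options, inferred from the table.
option_cost.update({
    (5, 7): -1.3,  # for the
    (6, 8): -2.2,  # the first
    (7, 9): -2.4,  # first time
    (5, 8): -2.3,  # for the first
    (6, 9): -2.3,  # the first time
    (5, 9): -2.3,  # for the first time
})
table = future_cost_table(len(words), option_cost)
print(round(table[(0, 9)], 1))  # estimate for the whole sentence
```

Because the table is computed once per sentence, looking up the future cost of a hypothesis during search only requires summing the entries for its uncovered spans.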
Combining Score and Future Cost
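[Figure: three hypotheses, their scores, and the future cost estimates of the spans they leave uncovered]
– the tourism initiative → die touristische initiative (tm: -1.21, lm: -4.67, d: 0, all: -5.88); future cost -6.1; -5.88 + -6.1 = -11.98
– the first time → das erste mal (tm: -0.56, lm: -2.81, d: -0.74, all: -4.11); future cost -9.3; -4.11 + -9.3 = -13.41
– this for ... time → für diese zeit (tm: -0.82, lm: -2.98, d: -1.06, all: -4.86); future costs -6.9 and -2.2 (together -9.1); -4.86 + -9.1 = -13.96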
• Hypothesis score and future cost estimate are combined for pruning
– left hypothesis starts with hard part: the tourism initiative
score: -5.88, future cost: -6.1 → total cost -11.98
– middle hypothesis starts with easiest part: the first time
score: -4.11, future cost: -9.3 → total cost -13.41
– right hypothesis picks easy parts: this for ... time
score: -4.86, future cost: -9.1 → total cost -13.96
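The ranking used for pruning can be reproduced with a few lines of arithmetic. The numbers below are the ones from the slide; the dictionary layout and the helper name `total_cost` are assumptions for illustration.

```python
# Hypothesis score and future cost estimates of the uncovered spans.
hypotheses = {
    "the tourism initiative": (-5.88, [-6.1]),
    "the first time": (-4.11, [-9.3]),
    "this for ... time": (-4.86, [-6.9, -2.2]),  # two uncovered spans
}

def total_cost(score, future_costs):
    """Score so far plus the summed estimates for all uncovered spans."""
    return score + sum(future_costs)

ranked = sorted(hypotheses, key=lambda h: total_cost(*hypotheses[h]), reverse=True)
for name in ranked:
    print(name, round(total_cost(*hypotheses[name]), 2))
```

Ranked by score alone, "the first time" would come first; with the future cost added, the hypothesis that already translated the hard part ranks best.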
cube pruning
Stack Decoding Algorithm
• Exhaustive matching of hypotheses to applicable translation options
→ too much computation
place empty hypothesis into stack 0
for all stacks 0 ... n − 1 do
    for all hypotheses in stack do
        for all translation options do
            if applicable then
                create new hypothesis
                place in stack
                recombine with existing hypothesis if possible
                prune stack if too big
            end if
        end for
    end for
end for
Group Hypotheses and Options
• Group hypotheses by coverage vector
[Figure: list of example coverage vectors]
• Group translation options by span
[Figure: list of example spans]
⇒ Loop over groups, check for applicability once for each pair of groups
(not much gained so far)
All Hypotheses, All Options
Translation options: go, walk, goes, are, is
Hypotheses: he does not, he just does, it does not, he just does not, he is not, it is not
• Example: group with 6 hypotheses, group with 5 translation options
• Should we really create all 6 × 5 of them?
Rank by Score
Translation options: go (-1.0), walk (-1.2), goes (-1.4), are (-1.7), is (-2.1)
Hypotheses: he does not (-3.2), he just does (-3.5), it does not (-4.1), he just does not (-4.3), he is not (-4.7), it is not (-5.1)
• Rank hypotheses by score so far
• Rank translation options by score estimate
Expected Score of New Hypothesis
                          go     walk   goes   are    is
                          -1.0   -1.2   -1.4   -1.7   -2.1
he does not        -3.2   -4.2   -4.4   -4.6   -4.9   -5.3
he just does       -3.5   -4.5   -4.7   -4.9   -5.2   -5.6
it does not        -4.1   -5.1   -5.3   -5.5   -5.8   -6.2
he just does not   -4.3   -5.3   -5.5   -5.7   -6.0   -6.4
he is not          -4.7   -5.7   -5.9   -6.1   -6.4   -6.8
it is not          -5.1   -6.1   -6.3   -6.5   -6.8   -7.2
• Expected score: hypothesis score + translation option score
• Real score will be different, since language model score depends on context
Only Compute Half
[Figure: the same grid of expected scores as on the previous slide]
• If we want to save computational cost, we could decide to only compute some
• One way to do this: based on expected score
Cube Pruning
[Figure: grid of expected scores as before; the cell for "he does not" + "go" now holds its actual score -3.9 (expected: -4.2)]
• Start with best hypothesis, best translation option
• Create new hypothesis (actual score becomes available)
Cube Pruning (2)
[Figure: same grid; the two neighbors of the first cell now hold actual scores: "he does not" + "walk" -4.1 (expected: -4.4) and "he just does" + "go" -4.3 (expected: -4.5)]
• Commit it to the stack
• Create its neighbors
Cube Pruning (3)
[Figure: same grid; the neighbors of "he does not" + "walk" now hold actual scores: "he does not" + "goes" -4.7 (expected: -4.6) and "he just does" + "walk" -4.4 (expected: -4.7)]
• Commit best neighbor to the stack
• Create its neighbors in turn
Cube Pruning (4)
[Figure: same grid; the cell "it does not" + "go" now holds the actual score -4.0 (expected: -5.1)]
• Keep doing this for a specific number of hypotheses
• Different hypothesis / translation options groups compete as well
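The walk through the grid can be sketched with a priority queue over grid cells. This is a simplified single-grid version under assumptions: `actual_score(i, j)` is a hypothetical callback that builds the hypothesis for row i and column j and returns its true score, and competition between different groups is left out. The example reuses the scores from the slides, with the six actual scores shown there overriding the expected ones.

```python
import heapq

def cube_prune(n_hyps, n_opts, actual_score, limit):
    """Commit up to `limit` cells, always taking the best computed cell next.

    Rows (hypotheses) and columns (options) are assumed sorted best first.
    Returns a list of (score, row, column) in the order they were committed.
    """
    frontier = [(-actual_score(0, 0), (0, 0))]  # min-heap on negated score
    seen = {(0, 0)}
    committed = []
    while frontier and len(committed) < limit:
        neg_score, (i, j) = heapq.heappop(frontier)
        committed.append((-neg_score, i, j))
        # Create the neighbors below and to the right of the committed cell.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < n_hyps and nj < n_opts and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(frontier, (-actual_score(ni, nj), (ni, nj)))
    return committed

hyp_scores = [-3.2, -3.5, -4.1, -4.3, -4.7, -5.1]
opt_scores = [-1.0, -1.2, -1.4, -1.7, -2.1]
# Actual scores shown on the slides; other cells fall back to the estimate.
known = {(0, 0): -3.9, (0, 1): -4.1, (1, 0): -4.3,
         (0, 2): -4.7, (1, 1): -4.4, (2, 0): -4.0}

def toy_actual_score(i, j):
    return known.get((i, j), hyp_scores[i] + opt_scores[j])

for score, i, j in cube_prune(len(hyp_scores), len(opt_scores), toy_actual_score, 4):
    print(score, i, j)
```

With a limit of 4, only six of the thirty cells are ever scored, which is where the savings come from.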
heafield pruning
Heafield Pruning
• Main idea
– a lot of hypotheses share suffixes
– a lot of translation options share prefixes
– combining
∗ the last word of a hypothesis
∗ the first word of a translation option
may already indicate if we should pursue further
• Method
– organize hypotheses by suffix tree
– organize translation options by prefix tree
– process priority queue based on pairs of nodes in these trees
Example
Hypotheses with 2 words translated
• -2.1 a big country
• -2.2 large countries
• -2.7 the big countries
• -2.8 a large country
• -2.9 the big country
• -3.1 a big nation
Translation options for a source span
• -1.1 does not waver
• -1.5 do not waver
• -1.7 wavers not
• -1.9 does not hesitate
• -2.1 does rarely waver
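A prefix tree over the translation options above can be sketched as nested dictionaries in which every node stores the best score found below it. The layout and the name `build_prefix_tree` are assumptions for illustration; the same idea applied to reversed word order gives a suffix tree for the hypotheses.

```python
def build_prefix_tree(options):
    """options: list of (log_score, phrase).

    Each node is {"best": best score below this node, "children": {...}}.
    """
    root = {"best": float("-inf"), "children": {}}
    for score, phrase in options:
        node = root
        node["best"] = max(node["best"], score)
        for word in phrase.split():
            node = node["children"].setdefault(
                word, {"best": float("-inf"), "children": {}})
            node["best"] = max(node["best"], score)
    return root

options = [(-1.1, "does not waver"), (-1.5, "do not waver"),
           (-1.7, "wavers not"), (-1.9, "does not hesitate"),
           (-2.1, "does rarely waver")]
tree = build_prefix_tree(options)
does = tree["children"]["does"]
# Offset of a child relative to the best entry under its parent.
print(round(does["children"]["rarely"]["best"] - does["best"], 1))
```

Storing the best score per node is what allows a pair of tree nodes to be ranked in a priority queue before the full phrases are known.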
Encode in Suffix and Prefix Trees
[Figure: hypotheses encoded as a suffix tree, translation options as a prefix tree; top-level branches carry the score of their best entry, deeper nodes the score difference to that best entry]
Suffix tree (hypotheses)
– country: -2.1
    – big: 0
        – a: 0
        – the: -0.8
    – a large: -0.7
– countries: -2.2
    – large: 0
    – the big: -0.5
– a big nation: -3.1
Prefix tree (translation options)
– does: -1.1
    – not: 0
        – waver: 0
        – hesitate: -0.8
    – rarely waver: -1.0
– do not waver: -1.5
– wavers not: -1.7
Set up Priority Queue
• Priority queue
– (ε, ε), score: -3.2 (-2.1 + -1.1)
Pop off First Item
• Priority queue
– (ε, ε), score: -3.2 (-2.1 + -1.1)
• Pop off: (ε, ε)
• Expand left (hypothesis): best is country
• Add new items
– (country, ε), score: -3.2 (-2.1 + -1.1)
– (ε[1+], ε), score: -3.3 (-2.2 + -1.1)
Pop off Second Item
• Priority queue
– (country, ε), score: -3.2 (-2.1 + -1.1)
– (ε[1+], ε), score: -3.3 (-2.2 + -1.1)
• Pop off: (country, ε)
• Expand right (translation option): best is does
• Update language model probability estimate: $\log \frac{p(\text{does} \mid \text{country})}{p(\text{does})} = +0.2$
• Add new items
– (country, does), score: -3.0 (-2.1 + -1.1 + 0.2)
– (country, ε[1+]), score: -3.6 (-2.1 + -1.5)
Pop off Next Item
• Priority queue
– (country, does), score: -3.0 (-2.1 + -1.1 + 0.2)
– (ε[1+], ε), score: -3.3 (-2.2 + -1.1)
– (country, ε[1+]), score: -3.6 (-2.1 + -1.5)
• Pop off: (country, does)
• Expand left (hypothesis): best is big
• Update language model probability estimate: $\log \frac{p(\text{does} \mid \text{big country})}{p(\text{does} \mid \text{country})} = +0.1$
• Add new items
– (big country, does), score: -2.9 (-2.1 + -1.1 + 0.2 + 0.1)
– (country[1+], does), score: -3.7 (-2.1 + -1.1 + 0.2 + -0.7)
Continue...
• Priority queue
– (big country, does), score: -2.9 (-2.1 + -1.1 + 0.2 + 0.1)
– (ε[1+], ε), score: -3.3 (-2.2 + -1.1)
– (country, ε[1+]), score: -3.6 (-2.1 + -1.5)
– (country[1+], does), score: -3.7 (-2.1 + -1.1 + 0.2 + -0.7)
• And so on...
– once a full combination is completed (a big country, does not waver), add it to the stack
– badly matching updates will push items down the priority queue
e.g., $\log \frac{p(\text{does} \mid \text{countries})}{p(\text{does})} = -2.1$
Performance
other decoding algorithms
Other Decoding Algorithms
• A* search
• Greedy hill-climbing
• Using finite state transducers (standard toolkits)
A* Search
[Figure: A* search graph. Axes: probability + heuristic estimate against number of words covered. Annotations: (1) depth-first expansion to completed path, (2) recombination, (3) alternative path leading to hypothesis beyond threshold; the cheapest score is marked]
• Uses admissible future cost heuristic: never overestimates cost
• Translation agenda: create hypothesis with lowest score + heuristic cost
• Done, when complete hypothesis created
Greedy Hill-Climbing
• Create one complete hypothesis with depth-first search (or other means)
• Search for better hypotheses by applying change operators
– change the translation of a word or phrase
– combine the translation of two words into a phrase
– split up the translation of a phrase into two smaller phrase translations
– move parts of the output into a different position
– swap parts of the output with the output at a different part of the sentence
• Terminates if no operator application produces a better translation
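The search loop itself is generic and can be sketched independently of the translation model. In the toy below, the only change operator is swapping two adjacent output words, and the scoring function simply counts word-order inversions against an assumed preferred order; both stand in for the real operators and model score and are not from the slides.

```python
def hill_climb(initial, neighbors, score):
    """Steepest-ascent hill climbing: stop when no neighbor scores better."""
    current, current_score = initial, score(initial)
    while True:
        best, best_score = None, current_score
        for candidate in neighbors(current):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
        if best is None:
            return current  # local optimum reached
        current, current_score = best, best_score

def adjacent_swaps(words):
    """Toy change operator: swap two neighboring output words."""
    for i in range(len(words) - 1):
        yield words[:i] + (words[i + 1], words[i]) + words[i + 2:]

# Toy stand-in for a model score: fewer inversions against RANK is better.
RANK = {"he": 0, "does": 1, "not": 2, "go": 3, "home": 4}

def toy_score(words):
    return -sum(1 for i in range(len(words)) for j in range(i + 1, len(words))
                if RANK[words[i]] > RANK[words[j]])

start = ("he", "go", "does", "not", "home")
print(hill_climb(start, adjacent_swaps, toy_score))
```

With a real model score the procedure can stop in a local optimum, which is the usual trade-off of greedy search against the beam search described earlier.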
Summary
• Translation process: produce output left to right
• Translation options
• Decoding by hypothesis expansion
• Reducing search space
– recombination
– pruning (requires future cost estimate)
• Other decoding algorithms