Language Models

Philipp Koehn

5 September 2023


Language Models

• Language models answer the question: how likely is it that a string of English words is good English?
• They help with reordering:
  pLM(the house is small) > pLM(small the is house)
• They help with word choice:
  pLM(I am going home) > pLM(I am going house)


N-Gram Language Models

• Given: a string of English words W = w1, w2, w3, ..., wn
• Question: what is p(W)?
• Sparse data: many good English sentences will not have been seen before
→ Decompose p(W) using the chain rule:

  p(w1, w2, w3, ..., wn) = p(w1) p(w2|w1) p(w3|w1, w2) ... p(wn|w1, w2, ..., wn−1)

  (not much gained yet: p(wn|w1, w2, ..., wn−1) is equally sparse)


Markov Chain

• Markov assumption:
  – only the previous history matters
  – limited memory: only the last k words are included in the history (older words are less relevant)
→ kth-order Markov model
• For instance, a 2-gram language model:

  p(w1, w2, w3, ..., wn) ≃ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn−1)

• What is conditioned on (here wi−1) is called the history


Estimating N-Gram Probabilities

• Maximum likelihood estimation:

  p(w2|w1) = count(w1, w2) / count(w1)

• Collect counts over a large text corpus
• Millions to billions of words are easy to get (trillions of English words are available on the web)


Example: 3-Gram

• Counts for trigrams and estimated word probabilities

  the green (total: 1748)      the red (total: 225)       the blue (total: 54)
  word    count  prob.         word    count  prob.       word    count  prob.
  paper     801  0.458         cross     123  0.547       box       16   0.296
  group     640  0.367         tape       31  0.138       .          6   0.111
  light     110  0.063         army        9  0.040       flag       6   0.111
  party      27  0.015         card        7  0.031       ,          3   0.056
  ecu        21  0.012         ,           5  0.022       angel      3   0.056

  – 225 trigrams in the Europarl corpus start with the red
  – 123 of them end with cross
  → the maximum likelihood probability is 123/225 = 0.547


How good is the LM?

• A good model assigns a text of real English W a high probability
• This can also be measured with cross-entropy:

  H(W) = −(1/n) log2 p(w1, w2, ..., wn)

• Or with perplexity:

  perplexity(W) = 2^H(W)


Example: 3-Gram

  prediction                    pLM      −log2 pLM
  pLM(i|<s>)                    0.109    3.197
  pLM(would|<s> i)              0.144    2.791
  pLM(like|i would)             0.489    1.031
  pLM(to|would like)            0.905    0.144
  pLM(commend|like to)          0.002    8.794
  pLM(the|to commend)           0.472    1.084
  pLM(rapporteur|commend the)   0.147    2.763
  pLM(on|the rapporteur)        0.056    4.150
  pLM(his|rapporteur on)        0.194    2.367
  pLM(work|on his)              0.089    3.498
  pLM(.|his work)               0.290    1.785
  pLM(</s>|work .)              0.99999  0.000014
  average                                2.634
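
To make the estimation and perplexity formulas above concrete, here is a minimal Python sketch (not part of the original slides; the function names train_trigram_lm, trigram_prob, and perplexity are my own). It collects trigram counts by maximum likelihood and computes perplexity(W) = 2^H(W); unseen n-grams receive probability 0, which is exactly the problem the count smoothing section below addresses.

import math
from collections import Counter

def train_trigram_lm(sentences):
    """Collect trigram counts and history (bigram) counts from tokenized sentences."""
    tri_counts, hist_counts = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            history = tuple(padded[i - 2:i])
            tri_counts[history + (padded[i],)] += 1
            hist_counts[history] += 1
    return tri_counts, hist_counts

def trigram_prob(word, history, tri_counts, hist_counts):
    """Maximum likelihood estimate p(word | history) = count(history, word) / count(history)."""
    if hist_counts[history] == 0:
        return 0.0
    return tri_counts[history + (word,)] / hist_counts[history]

def perplexity(words, tri_counts, hist_counts):
    """Cross-entropy H(W) = -(1/n) sum log2 p(w_i | history); perplexity = 2^H(W)."""
    padded = ["<s>", "<s>"] + words + ["</s>"]
    total_log2, n = 0.0, 0
    for i in range(2, len(padded)):
        p = trigram_prob(padded[i], tuple(padded[i - 2:i]), tri_counts, hist_counts)
        if p == 0.0:
            return float("inf")   # unseen n-gram: the whole sentence gets probability 0
        total_log2 += math.log2(p)
        n += 1
    return 2.0 ** (-total_log2 / n)

# Toy usage: on a one-sentence training corpus every trigram has been seen exactly
# once, so the model assigns probability 1 to each prediction and perplexity 1.0.
corpus = [["i", "would", "like", "to", "commend", "the", "rapporteur", "on", "his", "work", "."]]
tri, hist = train_trigram_lm(corpus)
print(perplexity(corpus[0], tri, hist))   # 1.0

On held-out text, any unseen trigram drives the unsmoothed estimate to zero and the perplexity to infinity, which motivates the count smoothing discussed below.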

Comparison 1–4-Gram

• −log2 pLM for each word under unigram, bigram, trigram, and 4-gram models

  word        unigram   bigram  trigram   4-gram
  i             6.684    3.197    3.197    3.197
  would         8.342    2.884    2.791    2.791
  like          9.129    2.026    1.031    1.290
  to            5.081    0.402    0.144    0.113
  commend      15.487   12.335    8.794    8.633
  the           3.885    1.402    1.084    0.880
  rapporteur   10.840    7.319    2.763    2.350
  on            6.765    4.140    4.150    1.862
  his          10.678    7.316    2.367    1.978
  work          9.993    4.816    3.498    2.394
  .             4.896    3.020    1.785    1.510
  </s>          4.828    0.005    0.000    0.000
  average       8.051    4.072    2.634    2.251
  perplexity  265.136   16.817    6.206    4.758


Count Smoothing


Unseen N-Grams

• We have seen i like to in our corpus
• We have never seen i like to smooth in our corpus
→ p(smooth|i like to) = 0
• Any sentence that includes i like to smooth will be assigned probability 0


Add-One Smoothing

• For all possible n-grams, add a count of one:

  p = (c + 1) / (n + v)

  – c = count of the n-gram in the corpus
  – n = count of the history
  – v = vocabulary size

• But there are many more unseen n-grams than seen n-grams
• Example: Europarl bigrams
  – 86,700 distinct words
  – 86,700² = 7,516,890,000 possible bigrams
  – but only about 30,000,000 words (and bigrams) in the corpus


Efficiency


Managing the Size of the Model

• Millions to billions of words are easy to get (trillions of English words are available on the web)
• But: huge language models do not fit into RAM


Number of Unique N-Grams

• Number of unique n-grams in the Europarl corpus (29,501,088 tokens: words and punctuation)

  Order     Unique n-grams   Singletons
  unigram           86,700       33,447 (38.6%)
  bigram         1,948,935    1,132,844 (58.1%)
  trigram        8,092,798    6,022,286 (74.4%)
  4-gram        15,303,847   13,081,621 (85.5%)
  5-gram        19,882,175   18,324,577 (92.2%)

→ remove singletons of higher-order n-grams


Efficient Data Structures

[Figure: a trie over n-gram histories. A 4-gram node for the history "the very large" stores word probabilities (e.g., majority p:−1.147, number p:−0.275); 3-gram, 2-gram, and 1-gram backoff nodes for "very large", "large", and the unigram vocabulary store their own word probabilities p and backoff weights boff.]

• Need to store probabilities for
  – the very large majority
  – the very large number
• Both share the history the very large
→ no need to store the history twice
→ Trie


Reducing Vocabulary Size

• For instance: each number is treated as a separate token
• Replace them with a number token NUM
  – but: we want our language model to prefer
    pLM(I pay 950.00 in May 2007) > pLM(I pay 2007 in May 950.00)
  – this is not possible with a single number token:
    pLM(I pay NUM in May NUM) = pLM(I pay NUM in May NUM)
• Replace each digit with a unique symbol (e.g., @ or 5), which retains some distinctions:
  pLM(I pay 555.55 in May 5555) > pLM(I pay 5555 in May 555.55)
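
As an illustration of the digit-replacement idea on the last slide, here is a small Python sketch (my own example, not from the slides; the function name normalize_numbers and the choice of "5" as the placeholder digit are assumptions, and "@" would work equally well):

import re

def normalize_numbers(text, placeholder="5"):
    # Replace every digit with one placeholder symbol, so 950.00 becomes 555.55
    # and 2007 becomes 5555: number "shapes" remain distinguishable, but the
    # vocabulary no longer needs an entry for every individual number.
    return re.sub(r"[0-9]", placeholder, text)

print(normalize_numbers("I pay 950.00 in May 2007"))   # I pay 555.55 in May 5555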