Language Models

Philipp Koehn

5 September 2023


Language Models

• Language models answer the question: how likely is it that a string of English words is good English?
• They help with reordering:
  pLM(the house is small) > pLM(small the is house)
• They help with word choice:
  pLM(I am going home) > pLM(I am going house)


N-Gram Language Models

• Given: a string of English words W = w1, w2, w3, ..., wn
• Question: what is p(W)?
• Sparse data: many good English sentences will not have been seen before
→ Decompose p(W) using the chain rule:

  p(w1, w2, w3, ..., wn) = p(w1) p(w2|w1) p(w3|w1, w2) ... p(wn|w1, w2, ..., wn−1)

  (not much gained yet: p(wn|w1, w2, ..., wn−1) is equally sparse)


Markov Chain

• Markov assumption:
  – only the previous history matters
  – limited memory: only the last k words are included in the history (older words are less relevant)
→ kth-order Markov model
• For instance, a 2-gram language model:

  p(w1, w2, w3, ..., wn) ≃ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn−1)

• What is conditioned on (here wi−1) is called the history


Estimating N-Gram Probabilities

• Maximum likelihood estimation:

  p(w2|w1) = count(w1, w2) / count(w1)

• Collect counts over a large text corpus
• Millions to billions of words are easy to get (trillions of English words are available on the web)


Example: 3-Gram

• Counts for trigrams and estimated word probabilities

  the green (total: 1748)      the red (total: 225)       the blue (total: 54)
  word    count  prob.         word    count  prob.       word    count  prob.
  paper     801  0.458         cross     123  0.547       box       16   0.296
  group     640  0.367         tape       31  0.138       .          6   0.111
  light     110  0.063         army        9  0.040       flag       6   0.111
  party      27  0.015         card        7  0.031       ,          3   0.056
  ecu        21  0.012         ,           5  0.022       angel      3   0.056

  – 225 trigrams in the Europarl corpus start with the red
  – 123 of them end with cross
  → the maximum likelihood probability is 123/225 = 0.547


How good is the LM?

• A good model assigns a text of real English W a high probability
• This can also be measured with cross-entropy:

  H(W) = −(1/n) log2 p(w1, w2, ..., wn)

• Or with perplexity:

  perplexity(W) = 2^H(W)


Example: 3-Gram

  prediction                    pLM      −log2 pLM
  pLM(i|<s>)                    0.109    3.197
  pLM(would|<s> i)              0.144    2.791
  pLM(like|i would)             0.489    1.031
  pLM(to|would like)            0.905    0.144
  pLM(commend|like to)          0.002    8.794
  pLM(the|to commend)           0.472    1.084
  pLM(rapporteur|commend the)   0.147    2.763
  pLM(on|the rapporteur)        0.056    4.150
  pLM(his|rapporteur on)        0.194    2.367
  pLM(work|on his)              0.089    3.498
  pLM(.|his work)               0.290    1.785
  pLM(</s>|work .)              0.99999  0.000014
  average                                2.634
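
To make the estimation and perplexity formulas above concrete, here is a minimal Python sketch (not part of the original slides; the function names train_trigram_lm, trigram_prob, and perplexity are my own). It collects trigram counts by maximum likelihood and computes perplexity(W) = 2^H(W); unseen n-grams receive probability 0, which is exactly the problem the count smoothing section below addresses.

import math
from collections import Counter

def train_trigram_lm(sentences):
    """Collect trigram counts and history (bigram) counts from tokenized sentences."""
    tri_counts, hist_counts = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for i in range(2, len(padded)):
            history = tuple(padded[i - 2:i])
            tri_counts[history + (padded[i],)] += 1
            hist_counts[history] += 1
    return tri_counts, hist_counts

def trigram_prob(word, history, tri_counts, hist_counts):
    """Maximum likelihood estimate p(word | history) = count(history, word) / count(history)."""
    if hist_counts[history] == 0:
        return 0.0
    return tri_counts[history + (word,)] / hist_counts[history]

def perplexity(words, tri_counts, hist_counts):
    """Cross-entropy H(W) = -(1/n) sum log2 p(w_i | history); perplexity = 2^H(W)."""
    padded = ["<s>", "<s>"] + words + ["</s>"]
    total_log2, n = 0.0, 0
    for i in range(2, len(padded)):
        p = trigram_prob(padded[i], tuple(padded[i - 2:i]), tri_counts, hist_counts)
        if p == 0.0:
            return float("inf")   # unseen n-gram: the whole sentence gets probability 0
        total_log2 += math.log2(p)
        n += 1
    return 2.0 ** (-total_log2 / n)

# Toy usage: on a one-sentence training corpus every trigram has been seen exactly
# once, so the model assigns probability 1 to each prediction and perplexity 1.0.
corpus = [["i", "would", "like", "to", "commend", "the", "rapporteur", "on", "his", "work", "."]]
tri, hist = train_trigram_lm(corpus)
print(perplexity(corpus[0], tri, hist))   # 1.0

On held-out text, any unseen trigram drives the unsmoothed estimate to zero and the perplexity to infinity, which motivates the count smoothing discussed below.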

Comparison 1–4-Gram

• −log2 pLM for each word under unigram, bigram, trigram, and 4-gram models

  word        unigram   bigram  trigram   4-gram
  i             6.684    3.197    3.197    3.197
  would         8.342    2.884    2.791    2.791
  like          9.129    2.026    1.031    1.290
  to            5.081    0.402    0.144    0.113
  commend      15.487   12.335    8.794    8.633
  the           3.885    1.402    1.084    0.880
  rapporteur   10.840    7.319    2.763    2.350
  on            6.765    4.140    4.150    1.862
  his          10.678    7.316    2.367    1.978
  work          9.993    4.816    3.498    2.394
  .             4.896    3.020    1.785    1.510
  </s>          4.828    0.005    0.000    0.000
  average       8.051    4.072    2.634    2.251
  perplexity  265.136   16.817    6.206    4.758


Count Smoothing


Unseen N-Grams

• We have seen i like to in our corpus
• We have never seen i like to smooth in our corpus
→ p(smooth|i like to) = 0
• Any sentence that includes i like to smooth will be assigned probability 0


Add-One Smoothing

• For all possible n-grams, add a count of one:

  p = (c + 1) / (n + v)

  – c = count of the n-gram in the corpus
  – n = count of the history
  – v = vocabulary size

• But there are many more unseen n-grams than seen n-grams
• Example: Europarl bigrams
  – 86,700 distinct words
  – 86,700² = 7,516,890,000 possible bigrams
  – but only about 30,000,000 words (and bigrams) in the corpus


Efficiency


Managing the Size of the Model

• Millions to billions of words are easy to get (trillions of English words are available on the web)
• But: huge language models do not fit into RAM


Number of Unique N-Grams

• Number of unique n-grams in the Europarl corpus (29,501,088 tokens: words and punctuation)

  Order     Unique n-grams   Singletons
  unigram           86,700       33,447 (38.6%)
  bigram         1,948,935    1,132,844 (58.1%)
  trigram        8,092,798    6,022,286 (74.4%)
  4-gram        15,303,847   13,081,621 (85.5%)
  5-gram        19,882,175   18,324,577 (92.2%)

→ remove singletons of higher-order n-grams


Efficient Data Structures

[Figure: a trie over n-gram histories. A 4-gram node for the history "the very large" stores word probabilities (e.g., majority p:−1.147, number p:−0.275); 3-gram, 2-gram, and 1-gram backoff nodes for "very large", "large", and the unigram vocabulary store their own word probabilities p and backoff weights boff.]

• Need to store probabilities for
  – the very large majority
  – the very large number
• Both share the history the very large
→ no need to store the history twice
→ Trie


Reducing Vocabulary Size

• For instance: each number is treated as a separate token
• Replace them with a number token NUM
  – but: we want our language model to prefer
    pLM(I pay 950.00 in May 2007) > pLM(I pay 2007 in May 950.00)
  – this is not possible with a single number token:
    pLM(I pay NUM in May NUM) = pLM(I pay NUM in May NUM)
• Replace each digit with a unique symbol (e.g., @ or 5), which retains some distinctions:
  pLM(I pay 555.55 in May 5555) > pLM(I pay 5555 in May 555.55)
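
As an illustration of the digit-replacement idea on the last slide, here is a small Python sketch (my own example, not from the slides; the function name normalize_numbers and the choice of "5" as the placeholder digit are assumptions, and "@" would work equally well):

import re

def normalize_numbers(text, placeholder="5"):
    # Replace every digit with one placeholder symbol, so 950.00 becomes 555.55
    # and 2007 becomes 5555: number "shapes" remain distinguishable, but the
    # vocabulary no longer needs an entry for every individual number.
    return re.sub(r"[0-9]", placeholder, text)

print(normalize_numbers("I pay 950.00 in May 2007"))   # I pay 555.55 in May 5555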