MUNI FI
LM Smoothing (The EM Algorithm)
PA154 Language Modeling (3.2)
Pavel Rychlý, pary@fi.muni.cz
March 5, 2024

The Zero Problem
"Raw" n-gram language model estimate: necessarily, some zeros
■ many: a trigram model has ~2.16 x 10^14 parameters, but data only ~10^9 words
■ which are true 0?
■ optimal situation: even the least frequent trigram would be seen several times, in order to distinguish its probability from that of other trigrams
■ this optimal situation cannot happen, unfortunately (open question: how much data would we need?)
■ -> we do not know which zeros are real
■ we must eliminate zeros
Two kinds of zeros: p(w|h) = 0, or even p(h) = 0!
Source: Introduction to Natural Language Processing (600.465), Jan Hajič, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

Why do we need Nonzero Probs?
To avoid infinite cross entropy:
■ happens when an event is found in the test data which has not been seen in the training data
■ H(p) = infinity prevents comparing data with > 0 "errors"
To make the system more robust:
■ low count estimates: they typically happen for "detailed" but relatively rare appearances
■ high count estimates: reliable but less "detailed"

Eliminating the Zero Probabilities: Smoothing
■ Get a new p'(w) (same Ω): almost p(w), but no zeros
■ Discount w for (some) p(w) > 0: new p'(w) < p(w); the total discounted mass Σ_w (p(w) - p'(w)) = D
■ Distribute D to all w with p(w) = 0: new p'(w) > p(w); possibly also to other w with low p(w)
■ For some w (possibly): p'(w) = p(w)
■ Make sure Σ_w p'(w) = 1
■ There are many ways of smoothing

Smoothing by Adding 1
Simplest but not really usable:
■ predicting words w from a vocabulary V, training data T:
  p'(w|h) = (c(h,w) + 1) / (c(h) + |V|)
■ for non-conditional distributions: p'(w) = (c(w) + 1) / (|T| + |V|)
■ problem if |V| > c(h) (as is often the case; even >> c(h)!)
Example
■ Training data: what is it what is small?   |T| = 8
■ V = {what, is, it, small, ?, ., flying, birds, are, a, bird}, |V| = 12
■ p(it) = .125, p(what) = .25, p(.) = 0
■ p(what is it?) = .25^2 x .125^2 ≈ .001
■ p(it is flying.) = .125 x .25 x 0^2 = 0
■ p'(it) = .1, p'(what) = .15, p'(.) = .05
■ p'(what is it?) = .15^2 x .1^2 ≈ .0002
■ p'(it is flying.) = .1 x .15 x .05^2 ≈ .00004

Adding Less than 1
Equally simple:
■ predicting words w from a vocabulary V, training data T:
  p'(w|h) = (c(h,w) + λ) / (c(h) + λ|V|),   λ < 1
■ for non-conditional distributions: p'(w) = (c(w) + λ) / (|T| + λ|V|)
Example
■ Training data: what is it what is small?   |T| = 8, |V| = 12 (as above)
■ p(it) = .125, p(what) = .25, p(.) = 0
■ p(what is it?) = .25^2 x .125^2 ≈ .001
■ p(it is flying.) = .125 x .25 x 0^2 = 0
■ Use λ = .1:
■ p'(it) ≈ .12, p'(what) ≈ .23, p'(.) ≈ .01
■ p'(what is it?) = .23^2 x .12^2 ≈ .0007
■ p'(it is flying.) = .12 x .23 x .01^2 ≈ .000003

Good-Turing
Suitable for estimation from large data
■ similar idea: discount/boost the relative frequency estimate:
  p_r(w) = (c(w) + 1) x N(c(w) + 1) / (|T| x N(c(w)))
  where N(c) is the count of words with count c (count-of-counts)
■ specifically, for c(w) = 0 (unseen words): p_r(w) = N(1) / (|T| x N(0))
■ good for small counts (< 5-10, where N(c) is high)
■ normalization! (so that we have Σ_w p'(w) = 1)
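To make the count-of-counts manipulation concrete, here is a minimal Python sketch of the unigram case, following the recipe above: use the adjusted count when N(c+1) > 0, keep the raw relative frequency otherwise, then normalize globally. The function and variable names are illustrative, not from the lecture.

```python
from collections import Counter

def good_turing(counts, vocab):
    """Return smoothed probabilities p'(w) for every w in vocab.

    counts: dict mapping word -> c(w) observed in training data T
    vocab:  iterable of all words (the unseen ones define N(0))
    """
    T = sum(counts.values())                      # |T|, number of training tokens
    c = {w: counts.get(w, 0) for w in vocab}      # c(w), zero for unseen words
    N = Counter(c.values())                       # N(c): count-of-counts

    p_r = {}
    for w, cw in c.items():
        if N[cw + 1] > 0:                         # (c+1) * N(c+1) / (|T| * N(c))
            p_r[w] = (cw + 1) * N[cw + 1] / (T * N[cw])
        else:                                     # N(c+1) = 0: keep the raw estimate
            p_r[w] = cw / T

    Z = sum(p_r.values())                         # normalize so the p'(w) sum to 1
    return {w: p / Z for w, p in p_r.items()}
```

The final division by Z is exactly the normalization step used in the worked example that follows.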
Good-Turing: An Example
Remember: p_r(w) = (c(w) + 1) x N(c(w) + 1) / (|T| x N(c(w)))
■ Training data: what is it what is small?   |T| = 8
■ V = {what, is, it, small, ?, ., flying, birds, are, a, bird}, |V| = 12
■ p(it) = .125, p(what) = .25, p(.) = 0
■ p(what is it?) = .25^2 x .125^2 ≈ .001
■ p(it is flying.) = .125 x .25 x 0^2 = 0
■ Raw estimation (N(0) = 6, N(1) = 4, N(2) = 2, N(i) = 0 for i > 2):
  p_r(it) = (1+1) x N(2) / (8 x N(1)) = 2 x 2 / (8 x 4) = .125
  p_r(what) = (2+1) x N(3) / (8 x N(2)) = 3 x 0 / (8 x 2) = 0: keep orig. p(what)
  p_r(.) = (0+1) x N(1) / (8 x N(0)) = 1 x 4 / (8 x 6) = .083
■ Normalize (divide by 1.5 = Σ_{w in V} p_r(w)) and compute:
  p'(it) = .08, p'(what) = .17, p'(.) = .06
  p'(what is it?) = .17^2 x .08^2 ≈ .0002
  p'(it is flying.) = .08 x .17 x .06^2 ≈ .00004

Smoothing by Combination: Linear Interpolation
Combine what?
■ distributions of various levels of detail vs. reliability
■ n-gram models: use (n-1)-gram, (n-2)-gram, ..., uniform (-> more reliability, <- more detail)
Simplest possible combination: sum of probabilities, then normalize, i.e. p'(y|x) = .5 x (p(y|x) + p(y)):
■ p(0|0) = .8, p(1|0) = .2, p(0|1) = 1, p(1|1) = 0, p(0) = .4, p(1) = .6
■ p'(0|0) = .6, p'(1|0) = .4, p'(0|1) = .7, p'(1|1) = .3

Typical n-gram LM Smoothing
■ Weight in less detailed distributions using λ = (λ0, λ1, λ2, λ3):
  p'_λ(w_i | w_{i-2}, w_{i-1}) = λ3 p3(w_i | w_{i-2}, w_{i-1}) + λ2 p2(w_i | w_{i-1}) + λ1 p1(w_i) + λ0 / |V|
■ Normalize: λ_j >= 0, Σ_{j=0..3} λ_j = 1 is sufficient (λ0 = 1 - Σ_{j=1..3} λ_j) (here n = 3)
■ Estimation using MLE:
  - fix the p3, p2, p1 and |V| parameters as estimated from the training data
  - then find the {λ_j} which minimize the cross entropy (maximize the probability of the data): -(1/|D|) Σ_{i=1..|D|} log2(p'_λ(w_i | h_i))

Held-out Data
What data to use?
■ try the training data T: but we will always get λ3 = 1
  - why? let p_iT be an i-gram distribution estimated using relative frequencies from T
  - minimize H_T(p'_λ) over the vector λ, where p'_λ = λ3 p_3T + λ2 p_2T + λ1 p_1T + λ0 / |V|
  - remember: H_T(p'_λ) = H(p_3T) + D(p_3T || p'_λ); p_3T is fixed, so H(p_3T) is fixed (and best)
  - which p'_λ minimizes H_T(p'_λ)? obviously, a p'_λ for which D(p_3T || p'_λ) = 0
  - ...and that is p_3T itself (because D(p||p) = 0, as we know)
  - ...and certainly p'_λ = p_3T if λ3 = 1 (maybe in some other cases, too)
  - (p'_λ = 1 x p_3T + 0 x p_2T + 0 x p_1T + 0 / |V|)
■ thus: do not use the training data for estimation of λ!
  - must hold out part of the training data (heldout data, H)
  - ...call the remaining data the (true/raw) training data, T
  - the test data S (e.g., for comparison purposes): still different data!

The Formulas
■ Repeat: minimizing -(1/|H|) Σ_{i=1..|H|} log2(p'_λ(w_i | h_i)) over λ
  p'_λ(w_i | h_i) = p'_λ(w_i | w_{i-2}, w_{i-1}) = λ3 p3(w_i | w_{i-2}, w_{i-1}) + λ2 p2(w_i | w_{i-1}) + λ1 p1(w_i) + λ0 / |V|
■ "Expected counts of lambdas", j = 0..3:
  c(λ_j) = Σ_{i=1..|H|} λ_j p_j(w_i | h_i) / p'_λ(w_i | h_i)
■ "Next λ", j = 0..3:
  λ_{j,next} = c(λ_j) / Σ_{k=0..3} c(λ_k)

The (Smoothing) EM Algorithm
1. Start with some λ such that λ_j > 0 for all j in 0..3.
2. Compute the "Expected Counts" for each λ_j.
3. Compute the new set of λ_j using the "Next λ" formula.
4. Start over at step 2, unless a termination condition is met.
■ Termination condition: convergence of λ. Simply set an ε and finish if |λ_j - λ_{j,next}| < ε for each j (step 3).
■ Guaranteed to converge: follows from Jensen's inequality, plus a technical proof.
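A minimal Python sketch of this EM loop, transcribing the "Expected Counts" and "Next λ" formulas directly. The callables p3, p2, p1 (relative-frequency estimates fixed from the training data) and the heldout list of (word, history) pairs are assumed to be given; their names are illustrative, not part of the lecture.

```python
def em_lambdas(heldout, p3, p2, p1, V_size, eps=1e-3):
    """heldout: list of (w_i, h_i) pairs, where h_i = (w_{i-2}, w_{i-1})."""
    lambdas = [0.25, 0.25, 0.25, 0.25]            # step 1: any start with all lambda_j > 0
    while True:
        # step 2: expected counts c(lambda_j), accumulated over the heldout data
        c = [0.0, 0.0, 0.0, 0.0]
        for w, h in heldout:
            parts = [lambdas[0] / V_size,         # lambda_0 * uniform
                     lambdas[1] * p1(w),          # lambda_1 * unigram
                     lambdas[2] * p2(w, h[1]),    # lambda_2 * bigram
                     lambdas[3] * p3(w, h)]       # lambda_3 * trigram
            p_lambda = sum(parts)                 # p'_lambda(w|h)
            for j in range(4):
                c[j] += parts[j] / p_lambda
        # step 3: next lambdas = normalized expected counts
        new = [cj / sum(c) for cj in c]
        # step 4: stop once every lambda has moved by less than eps
        if max(abs(n - o) for n, o in zip(new, lambdas)) < eps:
            return new
        lambdas = new
```

Keeping λ0 strictly positive at the start guarantees p'_λ(w|h) > 0 for every heldout event, so the expected counts are always well defined.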
Remark on Linear Interpolation Smoothing
"Bucketed smoothing":
■ use several vectors of λ instead of one, based on (the frequency of) the history: λ(h)
  - e.g. for h = (micrograms, per) we will have λ(h) = (.999, .0009, .00009, .00001) (because "cubic" is the only word to follow...)
■ actually: not a separate set for each history, but rather a set for "similar" histories ("buckets"):
  λ(b(h)), where b: V^2 -> N (in the case of trigrams)
  b classifies histories according to their reliability (~ frequency)

Bucketed Smoothing: The Algorithm
■ First, determine the bucketing function b (use the heldout data!):
  - decide in advance that you want e.g. 1000 buckets
  - compute the total frequency per bucket (f_max(b))
  - gradually fill your buckets from the most frequent bigrams so that the sum of frequencies does not exceed f_max(b) (you might end up with slightly more than 1000 buckets)
■ Divide your heldout data according to the buckets
■ Apply the previous algorithm to each bucket and its data

Simple Example
■ Raw distribution (unigram only; smooth with uniform):
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α in {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z
■ Heldout data: baby; use one set of λ (λ1: unigram, λ0: uniform)
■ Start with λ1 = λ0 = .5:
  p'_λ(b) = .5 x .5 + .5/26 = .27
  p'_λ(a) = .5 x .25 + .5/26 = .14
  p'_λ(y) = .5 x 0 + .5/26 = .02
  c(λ1) = .5x.5/.27 + .5x.25/.14 + .5x.5/.27 + .5x0/.02 = 2.72
  c(λ0) = .5x.04/.27 + .5x.04/.14 + .5x.04/.27 + .5x.04/.02 = 1.28
  Normalize: λ1,next = .68, λ0,next = .32
■ Repeat from step 2 (recompute p'_λ first for efficient computation, then c(λ_j), ...). Finish when the new lambdas are almost equal to the old ones (say, < 0.01 difference).

Some More Technical Hints
■ Set V = {all words from the training data}.
  - You may also consider V = T ∪ H, but it does not make the coding in any way simpler (in fact, harder).
  - But: you must never use the test data for your vocabulary!
■ Prepend two "words" in front of all data:
  - avoids beginning-of-data problems
  - call these indices -1 and 0: then the formulas hold exactly
■ When c_n(w,h) = 0:
  - assign 0 probability to p_n(w|h) where c_{n-1}(h) > 0, but a uniform probability (1/|V|) to those p_n(w|h) where c_{n-1}(h) = 0
  - (this must be done both when working on the heldout data during EM and when computing cross entropy on the test data!)

Back-off Model
Combines n-gram models: use a lower order when there is not enough information in the higher order.
  p_BO(w_i | w_{i-n+1} ... w_{i-1}) =
    d_{w_{i-n+1}...w_i} x C(w_{i-n+1} ... w_i) / C(w_{i-n+1} ... w_{i-1})   if C(w_{i-n+1} ... w_i) > k
    α_{w_{i-n+1}...w_{i-1}} x p_BO(w_i | w_{i-n+2} ... w_{i-1})             otherwise
  where d is a discount factor and α a back-off weight chosen so that the distribution sums to one.
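A rough Python sketch of this back-off recursion (Katz-style), assuming n-gram counts are stored in a dict keyed by word tuples and that the discount factors d and back-off weights α have already been computed. Choosing d and α so that the distribution sums to one is the real work of such a model and is deliberately left out; all names here are illustrative.

```python
def p_bo(w, history, counts, d, alpha, k=0):
    """history: tuple (w_{i-n+1}, ..., w_{i-1}); counts[()] must hold |T|."""
    ngram = history + (w,)
    if counts.get(ngram, 0) > k:
        # enough evidence: discounted relative frequency at this order
        return d.get(ngram, 1.0) * counts[ngram] / counts[history]
    if not history:
        return 0.0                    # unseen even as a unigram (no extra smoothing here)
    # otherwise back off to the shorter history, weighted by alpha(history)
    return alpha.get(history, 1.0) * p_bo(w, history[1:], counts, d, alpha, k)
```

In contrast to linear interpolation, which always mixes all orders, the back-off model consults a lower-order distribution only when the higher-order count falls at or below the threshold k.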