LM Smoothing (The EM Algorithm)
Introduction to Natural Language Processing
Dr. Jan Hajic, CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu, www.cs.jhu.edu/~hajic
PA154 Statistické nástroje pro korpusy
June 26, 2014

The Zero Problem
• "Raw" n-gram language model estimate:
  - necessarily, some zeros
    ► many: a trigram model has about 2.16 × 10^14 parameters, but data only ~10^9 words
  - which are the true zeros?
    ► optimal situation: even the least frequent trigram would be seen several times, in order to distinguish its probability from that of the other trigrams
    ► the optimal situation cannot happen, unfortunately (open question: how much data would we need?) → we do not know
  - we must eliminate the zeros
• Two kinds of zeros: p(w|h) = 0, or even p(h) = 0!

Why do we need Nonzero Probs?
• To avoid infinite cross entropy:
  - happens when an event is found in the test data which has not been seen in the training data
  - H(p) = ∞ prevents comparing data with > 0 "errors"
• To make the system more robust:
  - low count estimates: they typically happen for "detailed" but relatively rare appearances
  - high count estimates: reliable but less "detailed"

Eliminating the Zero Probabilities: Smoothing
• Get a new p'(w) (same Ω): almost p(w), but with no zeros
• Discount (some) w with p(w) > 0: new p'(w) < p(w),
  Σ_{discounted w} (p(w) - p'(w)) = D
• Distribute D over all w with p(w) = 0: new p'(w) > p(w)
  - possibly also to other w with low p(w)
• For some w (possibly): p'(w) = p(w)
• Make sure Σ_{w∈Ω} p'(w) = 1
• There are many ways of smoothing

Smoothing by Adding 1
• Simplest but not really usable:
  - predicting words w from a vocabulary V, training data T:
    p'(w|h) = (c(h,w) + 1) / (c(h) + |V|)
    ► for non-conditional distributions: p'(w) = (c(w) + 1) / (|T| + |V|)
  - problem if |V| > c(h) (as is often the case; it can even be >> c(h)!)
• Example: training data: what is it what is small ?   |T| = 8
  V = {what, is, it, small, ?, <s>, flying, birds, are, a, bird, .}, |V| = 12
  p(it) = .125, p(what) = .25, p(.) = 0
  p(what is it?) = .25² × .125² ≈ .001
  p(it is flying.) = .125 × .25 × 0² = 0
  p'(it) = .1, p'(what) = .15, p'(.) = .05
  p'(what is it?) = .15² × .1² ≈ .0002
  p'(it is flying.) = .1 × .15 × .05² ≈ .00004

Adding less than 1
• Equally simple:
  - predicting words w from a vocabulary V, training data T:
    p'(w|h) = (c(h,w) + λ) / (c(h) + λ|V|), λ < 1
    ► for non-conditional distributions: p'(w) = (c(w) + λ) / (|T| + λ|V|)
• Example (same training data as above), using λ = .1:
  p'(it) ≈ .12, p'(what) ≈ .23, p'(.) ≈ .01
  p'(what is it?) = .23² × .12² ≈ .0007
  p'(it is flying.) = .12 × .23 × .01² ≈ .000003

Good-Turing
• Suitable for estimation from large data:
  - similar idea: discount/boost the relative frequency estimate:
    p_r(w) = (c(w) + 1) × N(c(w) + 1) / (|T| × N(c(w))),
    where N(c) is the count of words with count c (count-of-counts);
    in particular, for c(w) = 0 (unseen words), p_r(w) = N(1) / (|T| × N(0))
  - good for small counts (< 5-10, where N(c) is high)
  - variants (see MS)
  - normalization! (so that we have Σ_w p'(w) = 1)
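The short Python sketch below (not part of the original slides) reproduces the add-one, add-λ and Good-Turing unigram estimates on the toy corpus from the examples; the explicit <s> token (included so that |T| = 8), the variable names and the final normalization step are illustrative assumptions.

    from collections import Counter

    # Toy corpus from the slides: |T| = 8 tokens, |V| = 12 word types.
    tokens = ["<s>", "what", "is", "it", "what", "is", "small", "?"]
    vocab = ["what", "is", "it", "small", "?", "<s>",
             "flying", "birds", "are", "a", "bird", "."]

    counts = Counter(tokens)
    T = len(tokens)                        # |T| = 8
    V = len(vocab)                         # |V| = 12
    N = Counter(counts[w] for w in vocab)  # count-of-counts N(c), incl. N(0) for unseen words

    def add_lambda(word, lam):
        # Add-lambda estimate; lam = 1 gives add-one smoothing.
        return (counts[word] + lam) / (T + lam * V)

    def good_turing_raw(word):
        # Unnormalized Good-Turing: p_r(w) = (c+1) * N(c+1) / (|T| * N(c));
        # if N(c+1) = 0, keep the raw relative-frequency estimate (as on the slides).
        c = counts[word]
        if N[c + 1] == 0:
            return counts[word] / T
        return (c + 1) * N[c + 1] / (T * N[c])

    # Normalization so that the smoothed distribution sums to 1 over V
    # (Z = 1.5 for this corpus, as in the Good-Turing example below):
    Z = sum(good_turing_raw(w) for w in vocab)

    for w in ["it", "what", "."]:
        print(w, add_lambda(w, 1.0), add_lambda(w, 0.1),
              good_turing_raw(w), good_turing_raw(w) / Z)

Running it gives .1/.15/.05 for add-one, roughly .12/.23/.01 for λ = .1, and roughly .08/.17/.06 for normalized Good-Turing, matching the slide figures for "it", "what" and ".".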
Good-Turing: An Example
• Example: remember p_r(w) = (c(w) + 1) × N(c(w) + 1) / (|T| × N(c(w)))
  Training data: what is it what is small ?   |T| = 8
  V = {what, is, it, small, ?, <s>, flying, birds, are, a, bird, .}, |V| = 12
  p(it) = .125, p(what) = .25, p(.) = 0
  p(what is it?) = .25² × .125² ≈ .001
  p(it is flying.) = .125 × .25 × 0² = 0
  ► Raw estimation (N(0) = 6, N(1) = 4, N(2) = 2, N(i) = 0 for i > 2):
    p_r(it) = (1+1) × N(2) / (8 × N(1)) = 2 × 2 / (8 × 4) = .125
    p_r(what) = (2+1) × N(3) / (8 × N(2)) = 3 × 0 / (8 × 2) = 0: keep the original p(what)
    p_r(.) = (0+1) × N(1) / (8 × N(0)) = 1 × 4 / (8 × 6) ≈ .083
  ► Normalize (divide by 1.5 = Σ_{w∈V} p_r(w)) and compute:
    p'(it) ≈ .08, p'(what) ≈ .17, p'(.) ≈ .06
    p'(what is it?) = .17² × .08² ≈ .0002
    p'(it is flying.) = .08 × .17 × .06² ≈ .00004

Smoothing by Combination: Linear Interpolation
• Combine what?
  ► distributions of various levels of detail vs. reliability
• n-gram models:
  ► use (n-1)-gram, (n-2)-gram, ..., uniform (→ more reliability, ← more detail)
• Simplest possible combination:
  - sum of probabilities, normalize:
    - p(0|0) = .8, p(1|0) = .2, p(0|1) = 1, p(1|1) = 0, p(0) = .4, p(1) = .6:
    - p'(0|0) = .6, p'(1|0) = .4, p'(0|1) = .7, p'(1|1) = .3

Typical n-gram LM Smoothing
• Weight in less detailed distributions using λ = (λ0, λ1, λ2, λ3) (trigram case):
  p'_λ(w_i | w_{i-2}, w_{i-1}) = λ3 p3(w_i | w_{i-2}, w_{i-1}) + λ2 p2(w_i | w_{i-1}) + λ1 p1(w_i) + λ0 / |V|
• Normalize: λ_i ≥ 0, Σ_{i=0..n} λ_i = 1 is sufficient (λ0 = 1 - Σ_{i=1..n} λ_i) (here n = 3)
• Estimation using MLE:
  - fix the p3, p2, p1 and |V| parameters as estimated from the training data
  - then find the {λ_i} which minimize the cross entropy (i.e., maximize the probability of the data):
    -(1/|D|) Σ_{i=1..|D|} log2(p'_λ(w_i | h_i))

Held-out Data
• What data to use?
  - try the training data T: but then we will always get λ3 = 1
    ► why? let p_iT be the i-gram distribution estimated by relative frequency from T
    ► minimize H_T(p'_λ) over the vector λ, where p'_λ = λ3 p3T + λ2 p2T + λ1 p1T + λ0 / |V|
  - remember: H_T(p'_λ) = H(p3T) + D(p3T || p'_λ); p3T is fixed, so H(p3T) is fixed (and best)
  - which p'_λ minimizes H_T(p'_λ)? Obviously, a p'_λ for which D(p3T || p'_λ) = 0
  - ...and that is p3T itself (because D(p||p) = 0, as we know)
  - ...and certainly p'_λ = p3T if λ3 = 1 (maybe in some other cases, too)
    (p'_λ = 1 × p3T + 0 × p2T + 0 × p1T + 0/|V|)
  - thus: do not use the training data for the estimation of λ!
    ► must hold out part of the training data (heldout data, H)
    ► ...call the remaining data the (true/raw) training data, T
    ► the test data S (e.g., for comparison purposes): still different data!

The Formulas
• Repeat: minimize -(1/|H|) Σ_{i=1..|H|} log2(p'_λ(w_i | h_i)) over λ, where
  p'_λ(w_i | h_i) = p'_λ(w_i | w_{i-2}, w_{i-1}) = λ3 p3(w_i | w_{i-2}, w_{i-1}) + λ2 p2(w_i | w_{i-1}) + λ1 p1(w_i) + λ0 / |V|
• "Expected counts of lambdas" (j = 0..3):
  c(λ_j) = Σ_{i=1..|H|} λ_j p_j(w_i | h_i) / p'_λ(w_i | h_i)
• "Next λ" (j = 0..3):
  λ_{j,next} = c(λ_j) / Σ_{k=0..3} c(λ_k)

The (Smoothing) EM Algorithm
1. Start with some λ such that λ_j > 0 for all j ∈ 0..3.
2. Compute the "expected counts" for each λ_j.
3. Compute the new set of λ_j, using the "next λ" formula.
4. Start over at step 2, unless a termination condition is met.
• Termination condition: convergence of λ.
  - Simply set an ε, and finish if |λ_j - λ_{j,next}| < ε for each j (in step 3).
• Guaranteed to converge: follows from Jensen's inequality, plus a technical proof.
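The four-step loop above is easy to implement. Below is a minimal Python sketch (not from the slides), assuming the fixed distributions are available as callables p3(w, h), p2(w, w_prev) and p1(w), and the heldout data as a list of (w, h) pairs; these names and representations are illustrative assumptions.

    def em_lambdas(heldout, p3, p2, p1, V, eps=1e-4):
        # heldout: list of (w_i, h_i) pairs with h_i = (w_{i-2}, w_{i-1});
        # p3, p2, p1: fixed distributions estimated from the training data;
        # V: vocabulary size |V|.
        lam = [0.25, 0.25, 0.25, 0.25]            # step 1: any lambda_j > 0
        while True:
            c = [0.0, 0.0, 0.0, 0.0]              # step 2: expected counts c(lambda_j)
            for w, h in heldout:
                comp = [lam[0] / V,                # lambda_0 / |V|  (uniform)
                        lam[1] * p1(w),            # lambda_1 p1(w_i)
                        lam[2] * p2(w, h[1]),      # lambda_2 p2(w_i | w_{i-1})
                        lam[3] * p3(w, h)]         # lambda_3 p3(w_i | w_{i-2}, w_{i-1})
                p_lam = sum(comp)                  # p'_lambda(w_i | h_i); > 0 since lam[0] > 0
                for j in range(4):
                    c[j] += comp[j] / p_lam
            total = sum(c)
            new_lam = [cj / total for cj in c]     # step 3: "next lambda"
            if max(abs(n - o) for n, o in zip(new_lam, lam)) < eps:
                return new_lam                     # termination: lambdas converged
            lam = new_lam                          # step 4: start over at step 2

With only the unigram and uniform components kept, the same loop covers the two-lambda case of the "Simple Example" later in the slides.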
Remark on Linear Interpolation Smoothing
• "Bucketed smoothing":
  - use several vectors of λ instead of one, based on (the frequency of) the history: λ(h)
    ► e.g., for h = (micrograms, per) we will have λ(h) = (.999, .0009, .00009, .00001)
      (because "cubic" is the only word to follow...)
  - actually: not a separate set for each history, but rather one set for "similar" histories (a "bucket"): λ(b(h)), where b: V² → N (in the case of trigrams);
    b classifies histories according to their reliability (~ frequency)

Bucketed Smoothing: The Algorithm
• First, determine the bucketing function b (use the heldout data!):
  - decide in advance that you want, e.g., 1000 buckets
  - compute the total frequency of histories per bucket (f_max(b))
  - gradually fill your buckets, starting from the most frequent bigrams, so that the sum of frequencies does not exceed f_max(b) (you might end up with slightly more than 1000 buckets)
• Divide your heldout data according to the buckets
• Apply the previous algorithm to each bucket and its data

Simple Example
• Raw distribution (unigram only; smooth with uniform):
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, p(α) = 0 for the rest: s, t, u, v, w, x, y, z
• Heldout data: baby; use one set of λ (λ1: unigram, λ0: uniform)
• Start with λ1 = λ0 = .5:
  p'_λ(b) = .5 × .5 + .5/26 = .27
  p'_λ(a) = .5 × .25 + .5/26 = .14
  p'_λ(y) = .5 × 0 + .5/26 = .02
  c(λ1) = .5×.5/.27 + .5×.25/.14 + .5×.5/.27 + .5×0/.02 = 2.72
  c(λ0) = .5×.04/.27 + .5×.04/.14 + .5×.04/.27 + .5×.04/.02 = 1.28
  Normalize: λ1,next = .68, λ0,next = .32
  Repeat from step 2 (recompute p'_λ first for efficient computation, then the c(λ_i), ...).
  Finish when the new lambdas are almost equal to the old ones (say, a difference of less than 0.01).

Some More Technical Hints
• Set V = {all words from the training data}.
  ► You may also consider V = T ∪ H, but it does not make the coding in any way simpler (in fact, it is harder).
  ► But: you must never use the test data for your vocabulary!
• Prepend two "words" in front of all data:
  ► avoids beginning-of-data problems
  ► call these indices -1 and 0: then the formulas hold exactly
• When c_n(w,h) = 0:
  ► Assign zero probability to p_n(w|h) where c_{n-1}(h) > 0, but a uniform probability (1/|V|) to those p_n(w|h) where c_{n-1}(h) = 0 (this must be done both when working on the heldout data during EM and when computing cross-entropy on the test data!)
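As an illustration of the last hint, here is a minimal Python sketch (not from the slides) of the raw conditional n-gram estimate with the uniform fallback for unseen histories; the Counter-based representation and the function name are assumptions.

    from collections import Counter

    def p_raw(w, h, ngram_counts: Counter, hist_counts: Counter, V: int) -> float:
        # Raw n-gram estimate used both inside EM (on heldout data) and when
        # computing cross-entropy on test data; h is a tuple of history words.
        if hist_counts[h] == 0:
            return 1.0 / V                               # unseen history: uniform 1/|V|
        return ngram_counts[h + (w,)] / hist_counts[h]   # may be 0 when c(h,w) = 0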