The Zero Problem
LM Smoothing (The EM Algorithm)
PA154 Jazykové modelování (3)
Pavel Rychlý
pary@fi.muni.cz
March 16, 2021
Source: Introduction to Natural Language Processing (600.465)
Jan Hajic, CS Dept., Johns Hopkins Univ.
www.cs.jhu.edu/~hajic

The Zero Problem
"Raw" n-gram language model estimate:
- necessarily, some zeros
  ► many: a trigram model has ~2.16 × 10^14 parameters, but data only ~10^9 words
- which are true zeros?
  ► optimal situation: even the least frequent trigram would be seen several times, in order to distinguish its probability from the other trigrams
  ► unfortunately, the optimal situation cannot happen (open question: how much data would we need?) → we don't know
- we must eliminate zeros
Two kinds of zeros: p(w|h) = 0, or even p(h) = 0!

Why Do We Need Nonzero Probs?
To avoid infinite cross entropy:
  ► happens when an event is found in the test data which has not been seen in the training data
  ► H(p) = ∞: prevents comparing data with > 0 "errors"
To make the system more robust:
  ► low count estimates: they typically happen for "detailed" but relatively rare appearances
  ► high count estimates: reliable but less "detailed"

Eliminating the Zero Probabilities: Smoothing
Get a new p'(w) (same Ω): almost p(w), but with no zeros.
Discount w for (some) p(w) > 0: new p'(w) < p(w)
  Σ_{w discounted} (p(w) − p'(w)) = D
Distribute D to all w with p(w) = 0: new p'(w) > p(w)
  ► possibly also to other w with low p(w)
For some w (possibly): p'(w) = p(w)
Make sure Σ_w p'(w) = 1
There are many ways of smoothing.

Smoothing by Adding 1
Simplest but not really usable: predicting words w from a vocabulary V, training data T:
  p'(w|h) = (c(h,w) + 1) / (c(h) + |V|)
  ► for non-conditional distributions: p'(w) = (c(w) + 1) / (|T| + |V|)
Problem if |V| > c(h) (as is often the case; even |V| ≫ c(h)!)

Adding Less than 1
Equally simple: predicting word w from a vocabulary V, training data T:
  p'(w|h) = (c(h,w) + λ) / (c(h) + λ|V|),  λ < 1
  ► for non-conditional distributions: p'(w) = (c(w) + λ) / (|T| + λ|V|)
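The add-λ estimator above is a one-liner per word; here is a minimal sketch. The function name is mine, and the eighth training token (the comma) is my reconstruction of the toy corpus from the example slides, chosen so that |T| = 8 and c(.) = 0 as the slides state:

```python
from collections import Counter

def add_lambda_estimate(tokens, vocab, lam=0.1):
    """Unconditional add-lambda estimate: p'(w) = (c(w) + lam) / (|T| + lam*|V|)."""
    counts = Counter(tokens)
    denom = len(tokens) + lam * len(vocab)
    return {w: (counts[w] + lam) / denom for w in vocab}

# Toy corpus from the slides: |T| = 8, |V| = 12 (comma token is an assumption)
tokens = ["what", "is", "it", ",", "what", "is", "small", "?"]
vocab = ["what", "is", "it", "small", "?", ",",
         "flying", "birds", "are", "a", "bird", "."]
p = add_lambda_estimate(tokens, vocab, lam=0.1)
# p["it"] ≈ .12, p["what"] ≈ .23, p["."] ≈ .01, as in the λ = .1 example
```

Note that the distribution normalizes by construction: the numerators sum to |T| + λ|V|, which is exactly the denominator.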
Adding 1: An Example
Training data: what is it what is small?  (|T| = 8)
V = {what, is, it, small, ?, ",", flying, birds, are, a, bird, .}, |V| = 12
p(it) = .125, p(what) = .25, p(.) = 0
p(what is it?) = .25² × .125² ≈ .001
p(it is flying.) = .125 × .25 × 0² = 0
p'(it) = .1, p'(what) = .15, p'(.) = .05
p'(what is it?) = .15² × .1² ≈ .0002
p'(it is flying.) = .1 × .15 × .05² ≈ .00004

Adding Less than 1: An Example
Training data: what is it what is small?  (|T| = 8)
V = {what, is, it, small, ?, ",", flying, birds, are, a, bird, .}, |V| = 12
p(it) = .125, p(what) = .25, p(.) = 0
p(what is it?) = .25² × .125² ≈ .001
p(it is flying.) = .125 × .25 × 0² = 0
Use λ = .1:
p'(it) = .12, p'(what) = .23, p'(.) = .01
p'(what is it?) = .23² × .12² ≈ .0007
p'(it is flying.) = .12 × .23 × .01² ≈ .000003

Good-Turing
Suitable for estimation from large data:
- similar idea: discount/boost the relative frequency estimate:
  p_r(w) = (c(w) + 1) × N(c(w) + 1) / (|T| × N(c(w)))
  where N(c) is the count of words with count c (count-of-counts)
- specifically, for c(w) = 0 (unseen words): p_r(w) = N(1) / (|T| × N(0))
- good for small counts (< 5–10, where N(c) is high)
- normalization! (so that we have Σ_w p'(w) = 1)

Good-Turing: An Example
Remember: p_r(w) = (c(w) + 1) × N(c(w) + 1) / (|T| × N(c(w)))
Training data: what is it what is small?  (|T| = 8)
V = {what, is, it, small, ?, ",", flying, birds, are, a, bird, .}, |V| = 12
p(it) = .125, p(what) = .25, p(.) = 0
p(what is it?) = .25² × .125² ≈ .001
p(it is flying.) = .125 × .25 × 0² = 0
Raw estimation (N(0) = 6, N(1) = 4, N(2) = 2, N(i) = 0 for i > 2):
p_r(it) = (1+1) × N(2) / (8 × N(1)) = 2 × 2 / (8 × 4) = .125
p_r(what) = (2+1) × N(3) / (8 × N(2)) = 3 × 0 / (8 × 2) = 0: keep orig. p(what)
p_r(.)
= (0+1) × N(1) / (8 × N(0)) = 1 × 4 / (8 × 6) ≈ .083
Normalize (divide by 1.5 = Σ_w p_r(w)) to obtain the final p'(w).

Smoothing by Combination: Linear Interpolation
Combine what?
- distributions of various levels of detail vs. reliability
- n-gram models: use the (n−1)-gram, (n−2)-gram, ..., uniform models
  → reliability ← detail
Simplest possible combination:
- sum of probabilities, normalize (here: average the bigram and unigram distributions):
  ► p(0|0) = .8, p(1|0) = .2, p(0|1) = 1, p(1|1) = 0, p(0) = .4, p(1) = .6
  ► p'(0|0) = .6, p'(1|0) = .4, p'(0|1) = .7, p'(1|1) = .3
Weight in the less detailed distributions using λ = (λ0, λ1, λ2, λ3):
  p'_λ(w_i|w_{i−2}, w_{i−1}) = λ3 p3(w_i|w_{i−2}, w_{i−1}) + λ2 p2(w_i|w_{i−1}) + λ1 p1(w_i) + λ0/|V|
Normalize: λ_i ≥ 0, Σ_{i=0}^{3} λ_i = 1 is sufficient (λ0 = 1 − Σ_{i=1}^{3} λ_i)  (n = 3)
Estimation using MLE:
- fix the p3, p2, p1 and |V| parameters as estimated from the training data
- then find such {λ_i} which minimize the cross entropy (maximize the probability of the data): −(1/|D|) Σ_{i=1}^{|D|} log2(p'_λ(w_i|h_i))

Held-out Data
What data to use?
- try the training data T: but we will always get λ3 = 1
- why? let p_iT be the i-gram distribution estimated by relative frequency from T
  ► minimize H_T(p'_λ) over the vector λ, where p'_λ = λ3 p3T + λ2 p2T + λ1 p1T + λ0/|V|
  ► remember: H_T(p'_λ) = H(p3T) + D(p3T‖p'_λ); p3T is fixed, hence H(p3T) is fixed (and best)
- which p'_λ minimizes H_T(p'_λ)? Obviously, a p'_λ for which D(p3T‖p'_λ) = 0
- ...and that's p3T (because D(p‖p) = 0, as we know)
- ...and certainly p'_λ = p3T if λ3 = 1 (maybe in some other cases, too)
  (p'_λ = 1 × p3T + 0 × p2T + 0 × p1T + 0/|V|)
- thus: do not use the training data for the estimation of λ!
  ► we must hold out part of the training data (heldout data, H)
  ► ...call the remaining data the (true/raw) training data, T
  ► the test data S (e.g., for comparison purposes): still different data!

The Formulas
Repeat: minimize −(1/|H|) Σ_{i=1}^{|H|} log2(p'_λ(w_i|h_i)) over λ:
  p'_λ(w_i|h_i) = p'_λ(w_i|w_{i−2}, w_{i−1}) =
    = λ3 p3(w_i|w_{i−2}, w_{i−1}) + λ2 p2(w_i|w_{i−1}) + λ1 p1(w_i) + λ0/|V|
"Expected counts of lambdas":
  c(λ_j) = Σ_{i=1}^{|H|} λ_j p_j(w_i|h_i) / p'_λ(w_i|h_i),  j = 0..3
"Next λ":
  λ_{j,next} = c(λ_j) / Σ_{k=0}^{3} c(λ_k),  j = 0..3

The (Smoothing) EM Algorithm
1. Start with some λ such that λ_j > 0 for all j ∈ 0..3.
2. Compute the "expected counts" for each λ_j.
3. Compute the new set of λ_j, using the "Next λ" formula.
4. Start over at step 2, unless a termination condition is met.
Termination condition: convergence of λ.
- simply set an ε, and finish if |λ_j − λ_{j,next}| < ε for each j (step 3)
Guaranteed to converge: follows from Jensen's inequality, plus a technical proof.

Remark on Linear Interpolation Smoothing
"Bucketed smoothing":
- use several vectors of λ instead of one, based on (the frequency of) the history: λ(h)
  ► e.g. for h = (micrograms, per) we will have λ(h) = (.999, .0009, .00009, .00001) (because "cubic" is the only word to follow...)
- actually: not a separate set for each history, but rather one set for "similar" histories (a "bucket"): λ(b(h)), where b: V² → N (in the case of trigrams)
  ► b classifies histories according to their reliability (~frequency)

Bucketed Smoothing: The Algorithm
First, determine the bucketing function b (use heldout!):
- decide in advance that you want e.g. 1000 buckets
- compute the total frequency of histories allowed in one bucket (f_max(b))
- gradually fill your buckets from the most frequent bigrams so that the sum of frequencies does not exceed f_max(b) (you might end up with slightly more than 1000 buckets)
Then divide your heldout data according to the buckets, and apply the previous algorithm to each bucket and its data.

Simple Example
Raw distribution (unigram only; smooth with uniform):
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z
Heldout data: baby; use one set of λ (λ1: unigram, λ0: uniform)
1. Start with λ1 = λ0 = .5:
   p'_λ(b) = .5 × .5 + .5/26 ≈ .27
   p'_λ(a) = .5 × .25 + .5/26 ≈ .14
   p'_λ(y) = .5 × 0 + .5/26 ≈ .02
2. c(λ1) = .5×.5/.27 + .5×.25/.14 + .5×.5/.27 + .5×0/.02 ≈ 2.72
   c(λ0) = .5×.04/.27 + .5×.04/.14 + .5×.04/.27 + .5×.04/.02 ≈ 1.28
3. Normalize: λ1,next = 2.72/(2.72 + 1.28) = .68, λ0,next = .32
4. Repeat from step 2 (recompute p'_λ first for efficient computation, then c(λ_j), ...). Finish when the new lambdas are almost equal to the old ones (say, < .01 difference).
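The EM iteration above fits in a few lines of Python. This is a minimal sketch (the function name is mine); the component distributions and heldout string mirror the Simple Example, so one EM step should reproduce λ1 ≈ .68, λ0 ≈ .32:

```python
def em_interpolation(dists, heldout, n_iters=20, eps=0.01):
    """EM for linear-interpolation weights.

    dists: list of component distributions, each a function w -> p_j(w).
    Returns the weight vector (lambda_j), one weight per component.
    """
    k = len(dists)
    lambdas = [1.0 / k] * k                      # start with all lambda_j > 0
    for _ in range(n_iters):
        counts = [0.0] * k
        for w in heldout:
            p_mix = sum(l * d(w) for l, d in zip(lambdas, dists))
            for j in range(k):                   # expected count of lambda_j
                counts[j] += lambdas[j] * dists[j](w) / p_mix
        new = [c / sum(counts) for c in counts]  # "next lambda": normalize
        converged = all(abs(a - b) < eps for a, b in zip(lambdas, new))
        lambdas = new
        if converged:
            break
    return lambdas

# Toy example from the slides: unigram p(a)=.25, p(b)=.5, p(c..r)=1/64, rest 0;
# smoothed with the uniform distribution 1/26; heldout data "baby".
unigram = lambda w: {"a": .25, "b": .5}.get(w, 1 / 64 if w <= "r" else 0.0)
uniform = lambda w: 1 / 26
l0, l1 = em_interpolation([uniform, unigram], "baby", n_iters=1)
# After one EM step: l1 ≈ .68, l0 ≈ .32, matching the Simple Example
```

Note that the expected counts always sum to |H| (here 4), so normalizing them is all the "Next λ" step needs.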
Some More Technical Hints
Set V = {all words from the training data}.
  ► You may also consider V = T ∪ H, but it does not make the coding in any way simpler (in fact, harder).
  ► But: you must never use the test data for your vocabulary!
Prepend two "words" in front of all data: avoids beginning-of-data problems
  ► call these indices −1 and 0: then the formulas hold exactly
When c_n(w,h) = 0:
  ► assign 0 probability to p_n(w|h) where c_{n−1}(h) > 0, but a uniform probability (1/|V|) to those p_n(w|h) where c_{n−1}(h) = 0 (this must be done both when working on the heldout data during EM, and when computing the cross-entropy on the test data!)

Back-off Model
Combines n-gram models by backing off to the lower order when there is not enough information in the higher order:
  p_bo(w_i|w_{i−n+1} ... w_{i−1}) =
    (1 − d_{w_{i−n+1} ... w_{i−1}}) × C(w_{i−n+1} ... w_i) / C(w_{i−n+1} ... w_{i−1})   if C(w_{i−n+1} ... w_i) > k
    α_{w_{i−n+1} ... w_{i−1}} × p_bo(w_i|w_{i−n+2} ... w_{i−1})                          otherwise
(d discounts the seen n-grams; α distributes the discounted mass so that the distribution normalizes.)
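The recursive structure of the back-off model can be sketched as follows. This is only a schematic, not a full Katz implementation: the discount d and back-off weight α are fixed illustrative constants here, whereas a proper implementation derives them (e.g. from Good-Turing discounting) so that each conditional distribution normalizes; the corpus and function name are hypothetical:

```python
from collections import Counter

def backoff_prob(ngram, counts, vocab_size, k=0, d=0.1, alpha=0.4):
    """Recursive back-off: use the n-gram estimate if its count exceeds k,
    otherwise back off to the (n-1)-gram model. `counts` maps tuples of
    words (of any order) to their frequency in the training data."""
    if len(ngram) == 1:                     # base case: discounted unigram,
        c = counts.get(ngram, 0)            # uniform floor for unseen words
        total = sum(v for t, v in counts.items() if len(t) == 1)
        return (1 - d) * c / total if c > 0 else 1.0 / vocab_size
    history = ngram[:-1]
    if counts.get(ngram, 0) > k and counts.get(history, 0) > 0:
        return (1 - d) * counts[ngram] / counts[history]
    return alpha * backoff_prob(ngram[1:], counts, vocab_size, k, d, alpha)

# Collect 1-, 2-, 3-gram counts from a tiny (hypothetical) corpus:
corpus = "what is it what is small".split()
counts = Counter()
for n in (1, 2, 3):
    for i in range(len(corpus) - n + 1):
        counts[tuple(corpus[i:i+n])] += 1

p_seen = backoff_prob(("what", "is", "it"), counts, vocab_size=12)
p_unseen = backoff_prob(("what", "is", "birds"), counts, vocab_size=12)
```

The unseen trigram backs off twice (trigram → bigram → unigram), picking up a factor α at each step, so it still receives a small nonzero probability.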