Introduction to Natural Language Processing (600.465)
Language Modeling (and the Noisy Channel)

Dr. Jan Hajic
CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu
www.cs.jhu.edu/~hajic


The Noisy Channel

• Prototypical case: Input --> The channel (adds noise) --> Output (noisy)
• Model: probability of error (noise)
• Example: p(0|1) = .3, p(1|1) = .7, p(1|0) = .4, p(0|0) = .6
• The Task: known: the noisy output; want to know: the input (decoding)


Noisy Channel Applications

• OCR
  - straightforward: text --> print (adds noise), scan --> image
• Handwriting recognition
  - text --> neurons, muscles ("noise"), scan/digitize --> image
• Speech recognition (dictation, commands, etc.)
  - text --> conversion to acoustic signal ("noise") --> acoustic waves
• Machine Translation
  - text in target language --> translation ("noise") --> source language
• Also: Part of Speech Tagging
  - sequence of tags --> selection of word forms --> text


Noisy Channel: The Golden Rule of OCR, ASR, HR, MT, ...

• Recall: p(A|B) = p(B|A) p(A) / p(B)   (Bayes formula)
  A_best = argmax_A p(B|A) p(A)   (The Golden Rule)
• p(B|A): the acoustic/image/translation/lexical model
  - application-specific name
  - will explore later
• p(A): the language model


The Perfect Language Model

• Sequence of word forms   [forget about tagging for the moment]
• Notation: A ~ W = (w_1, w_2, w_3, ..., w_d)
• The big (modeling) question: p(W) = ?
• Well, we know (Bayes/chain rule -->):
  p(W) = p(w_1, w_2, w_3, ..., w_d)
       = p(w_1) x p(w_2|w_1) x p(w_3|w_1,w_2) x ... x p(w_d|w_1,w_2,...,w_{d-1})
• Not practical (even short W --> too many parameters)


Markov Chain

• Unlimited memory (cf. previous foil):
  - for w_i, we know all its predecessors w_1, w_2, w_3, ..., w_{i-1}
• Limited memory:
  - we disregard "too old" predecessors
  - remember only k previous words: w_{i-k}, w_{i-k+1}, ..., w_{i-1}
  - called a "k-th order Markov approximation"
• + stationary character (no change over time):
  p(W) = Π_{i=1..d} p(w_i | w_{i-k}, ..., w_{i-1}),   d = |W|


n-gram Language Models

• (n-1)-th order Markov approximation --> n-gram LM:
  p(W) =df Π_{i=1..d} p(w_i | w_{i-n+1}, w_{i-n+2}, ..., w_{i-1})
                          (prediction | history)
• In particular (assume vocabulary |V| = 60k):
  - 0-gram LM: uniform model,  p(w) = 1/|V|,            1 parameter
  - 1-gram LM: unigram model,  p(w),                    6x10^4 parameters
  - 2-gram LM: bigram model,   p(w_i|w_{i-1}),          3.6x10^9 parameters
  - 3-gram LM: trigram model,  p(w_i|w_{i-2},w_{i-1}),  2.16x10^14 parameters


LM: Observations

• How large n?
  - nothing is enough (theoretically)
  - but anyway: as much as possible (--> close to the "perfect" model)
  - empirically: 3
    * parameter estimation? (reliability vs. data availability, storage space, ...)
    * 4 is too much: |V| = 60k --> 1.296x10^19 parameters
    * but: 6-7 would be (almost) ideal (having enough data): in fact, one can
      recover the original from 7-grams!
• Reliability ~ (1 / Detail)   (--> need a compromise)
• For now, keep word forms (no "linguistic" processing)
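To make the factorization and the parameter counts concrete, here is a minimal Python sketch; it is not part of the original slides, and the uniform cond_prob stand-in is purely illustrative (a real conditional model is estimated from data later in the lecture).

```python
def ngram_prob(words, n, cond_prob):
    """p(W) under the (n-1)-th order Markov approximation:
    p(W) = prod_i p(w_i | w_{i-n+1}, ..., w_{i-1})."""
    prob = 1.0
    for i, w in enumerate(words):
        history = tuple(words[max(0, i - (n - 1)):i])   # at most n-1 previous words
        prob *= cond_prob(w, history)
    return prob

def uniform_cond_prob(word, history, vocab_size=3):
    return 1.0 / vocab_size          # toy stand-in for p(w | h); ignores the history

print(ngram_prob("a b a".split(), 2, uniform_cond_prob))   # 1/27 under the toy model

# Parameter counts from the slide: an n-gram LM over |V| = 60k word forms has
# about |V|^n conditional probabilities to estimate.
V = 60_000
for n in (1, 2, 3, 4):
    print(f"{n}-gram LM: ~{V ** n:.4g} parameters")
# ~6e+04, ~3.6e+09, ~2.16e+14, ~1.296e+19 -- the numbers on the slide
```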
The Length Issue

• ∀n: Σ_{w∈Ω^n} p(w) = 1   =>   Σ_{n=1..∞} Σ_{w∈Ω^n} p(w) >> 1   (--> ∞)
• We want to model all sequences of words
  - for "fixed" length tasks: no problem - n fixed, sum is 1
    * tagging, OCR/handwriting (if words identified ahead of time)
  - for "variable" length tasks: have to account for it:
    * discount shorter sentences
• General model: for each sequence of words of length n, define p'(w) = λ_n p(w)
  such that Σ_{n=1..∞} λ_n = 1   =>   Σ_{n=1..∞} Σ_{w∈Ω^n} p'(w) = 1
  - e.g., estimate λ_n from data; or use the normal or another distribution


Parameter Estimation

• Parameter: numerical value needed to compute p(w|h)
• From data (how else?)
• Data preparation:
  - get rid of formatting etc. ("text cleaning")
  - define words (separate but include punctuation, call it a "word")
  - define sentence boundaries (insert the "words" <s> and </s>)
  - letter case: keep, discard, or be smart:
    * name recognition
    * number type identification
    [these are huge problems per se!]
  - numbers: keep, replace by <num>, or be smart (form ~ pronunciation)


Maximum Likelihood Estimate

• MLE: Relative Frequency...
  - ...best predicts the data at hand (the "training data")
• Trigrams from Training Data T:
  - count sequences of three words in T: c3(w_{i-2}, w_{i-1}, w_i)
    [NB: the notation just says that the three words follow each other]
  - count sequences of two words in T: c2(w_{i-1}, w_i):
    * either use c2(y,z) = Σ_w c3(y,z,w)
    * or count differently at the beginning (& end) of data!
• p(w_i | w_{i-2}, w_{i-1}) =est c3(w_{i-2}, w_{i-1}, w_i) / c2(w_{i-2}, w_{i-1})


Character Language Model

• Use individual characters instead of words:
  p(W) =df Π_{i=1..d} p(c_i | c_{i-n+1}, c_{i-n+2}, ..., c_{i-1})
• Same formulas etc.
• Might consider 4-grams, 5-grams or even more
• Good only for language comparison
• Transform cross-entropy between letter- and word-based models:
  H_S(p_c) = H_S(p_w) / (avg. # of characters per word in S)


LM: an Example

• Training data: <s> <s> He can buy the can of soda .
  - Unigram: p1(He) = p1(buy) = p1(the) = p1(of) = p1(soda) = p1(.) = .125,
    p1(can) = .25
  - Bigram: p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5, p2(of|can) = .5,
    p2(the|buy) = 1, ...
  - Trigram: p3(He|<s>,<s>) = 1, p3(can|<s>,He) = 1, p3(buy|He,can) = 1,
    p3(of|the,can) = 1, ..., p3(.|of,soda) = 1
  - Entropy: H(p1) = 2.75, H(p2) = .25, H(p3) = 0   <-- Great?!


LM: an Example (The Problem)

• Cross-entropy:
  - S = It was the greatest buy of all.
• Even H_S(p1) fails (= H_S(p2) = H_S(p3) = ∞), because:
  - all unigrams but p1(the), p1(buy), p1(of) and p1(.) are 0
  - all bigram probabilities are 0
  - all trigram probabilities are 0
• We want: to make all (theoretically possible*) probabilities non-zero
  *in fact, all of them: remember our graph from day 1?


Introduction to Natural Language Processing (600.465)
LM Smoothing (The EM Algorithm)

Dr. Jan Hajic
CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu
www.cs.jhu.edu/~hajic


The Zero Problem

• "Raw" n-gram language model estimate:
  - necessarily, some zeros
    * many!: trigram model --> 2.16x10^14 parameters, data ~ 10^9 words
  - which are the true 0s?
    * optimal situation: even the least frequent trigram would be seen several
      times, in order to distinguish its probability from the other trigrams
    * the optimal situation cannot happen, unfortunately
      (open question: how much data would we need?)
  - --> we don't know
  - we must eliminate the zeros
• Two kinds of zeros: p(w|h) = 0, or even p(h) = 0!
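The zero problem can be seen directly on the toy example. The following minimal sketch (not part of the original slides) builds MLE unigram, bigram and trigram models from the training sentence and evaluates the cross-entropy on the test sentence; whitespace tokenization and the two <s> padding tokens are assumptions made so that the histories are always defined.

```python
import math
from collections import Counter

train = "<s> <s> He can buy the can of soda .".split()
test  = "<s> <s> It was the greatest buy of all .".split()
PAD = 2                                    # two <s> tokens, not predicted themselves

def mle_ngrams(tokens, n):
    """Relative frequencies: p(w | h) = c_n(h, w) / c_{n-1}(h)."""
    grams     = Counter(tuple(tokens[i - n + 1:i + 1]) for i in range(PAD, len(tokens)))
    histories = Counter(tuple(tokens[i - n + 1:i])     for i in range(PAD, len(tokens)))
    return {g: c / histories[g[:-1]] for g, c in grams.items()}

def cross_entropy(model, tokens, n):
    """-(1/|S|) sum_i log2 p(w_i | h_i); infinite as soon as one p(w|h) is zero."""
    total, count = 0.0, 0
    for i in range(PAD, len(tokens)):
        p = model.get(tuple(tokens[i - n + 1:i + 1]), 0.0)
        total += math.log2(p) if p > 0 else -math.inf
        count += 1
    return -total / count if total != 0 else 0.0

for n in (1, 2, 3):
    model = mle_ngrams(train, n)
    print(n, cross_entropy(model, train, n), cross_entropy(model, test, n))
# on training data: 2.75, 0.25, 0.0 (the H values on the slide);
# on test data: inf, inf, inf -- the zero problem
```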
Why do we need Nonzero Probs?

• To avoid infinite Cross-Entropy:
  - happens when an event is found in the test data which has not been seen in
    the training data: H(p) = ∞ prevents comparing data with > 0 "errors"
• To make the system more robust
  - low count estimates:
    * they typically happen for "detailed" but relatively rare appearances
  - high count estimates: reliable but less "detailed"


Eliminating the Zero Probabilities: Smoothing

• Get a new p'(w) (same Ω): almost p(w), but with no zeros
• Discount w for (some) p(w) > 0: new p'(w) < p(w)
  Σ_{w∈discounted} (p(w) - p'(w)) = D
• Distribute D to all w with p(w) = 0: new p'(w) > p(w)
  - possibly also to other w with low p(w)
• For some w (possibly): p'(w) = p(w)
• Make sure Σ_{w∈Ω} p'(w) = 1
• There are many ways of smoothing


Smoothing by Adding 1

• Simplest but not really usable:
  - Predicting words w from a vocabulary V, training data T:
    p'(w|h) = (c(h,w) + 1) / (c(h) + |V|)
    * for non-conditional distributions: p'(w) = (c(w) + 1) / (|T| + |V|)
  - Problem if |V| > c(h) (as is often the case; even >> c(h)!)
• Example: Training data: <s> what is it what is small ?   |T| = 8
  - V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . },  |V| = 12
  - p(it) = .125, p(what) = .25, p(.) = 0
    p(what is it?) = .25^2 x .125^2 ≈ .001
    p(it is flying.) = .125 x .25 x 0^2 = 0
  - p'(it) = .1, p'(what) = .15, p'(.) = .05
    p'(what is it?) = .15^2 x .1^2 ≈ .0002
    p'(it is flying.) = .1 x .15 x .05^2 ≈ .00004


Adding less than 1

• Equally simple:
  - Predicting words w from a vocabulary V, training data T:
    p'(w|h) = (c(h,w) + λ) / (c(h) + λ|V|),   λ < 1
    * for non-conditional distributions: p'(w) = (c(w) + λ) / (|T| + λ|V|)
• Example: Training data: <s> what is it what is small ?   |T| = 8
  - V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . },  |V| = 12
  - p(it) = .125, p(what) = .25, p(.) = 0
    p(what is it?) = .25^2 x .125^2 ≈ .001
    p(it is flying.) = .125 x .25 x 0^2 = 0
  - Use λ = .1:
    p'(it) ≈ .12, p'(what) ≈ .23, p'(.) ≈ .01
    p'(what is it?) = .23^2 x .12^2 ≈ .0007
    p'(it is flying.) = .12 x .23 x .01^2 ≈ .000003


Good-Turing

• Suitable for estimation from large data
  - similar idea: discount/boost the relative frequency estimate:
    p_r(w) = (c(w) + 1) x N(c(w) + 1) / (|T| x N(c(w))),
    where N(c) is the count of words with count c (count-of-counts)
  - specifically, for c(w) = 0 (unseen words): p_r(w) = N(1) / (|T| x N(0))
  - good for small counts (< 5-10, where N(c) is high)
  - variants (see M&S)
  - normalization! (so that we have Σ_w p'(w) = 1)


Good-Turing: An Example

• Example: remember: p_r(w) = (c(w) + 1) x N(c(w) + 1) / (|T| x N(c(w)))
  Training data: <s> what is it what is small ?   |T| = 8
  - V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . },  |V| = 12
  - p(it) = .125, p(what) = .25, p(.) = 0
    p(what is it?) = .25^2 x .125^2 ≈ .001
    p(it is flying.) = .125 x .25 x 0^2 = 0
• Raw reestimation (N(0) = 6, N(1) = 4, N(2) = 2, N(i) = 0 for i > 2):
  p_r(it) = (1+1) x N(1+1) / (8 x N(1)) = 2 x 2 / (8 x 4) = .125
  p_r(what) = (2+1) x N(2+1) / (8 x N(2)) = 3 x 0 / (8 x 2) = 0: keep orig. p(what)
  p_r(.) = (0+1) x N(0+1) / (8 x N(0)) = 1 x 4 / (8 x 6) ≈ .083
• Normalize (divide by 1.5 = Σ_{w∈V} p_r(w)) and compute:
  p'(it) ≈ .08, p'(what) ≈ .17, p'(.) ≈ .06
  p'(what is it?) = .17^2 x .08^2 ≈ .0002
  p'(it is flying.) = .08 x .17 x .06^2 ≈ .00004
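The add-1 and add-λ numbers above are easy to verify. The sketch below is not from the original slides; it just recomputes the smoothed unigram probabilities and the two sentence probabilities, with λ = 1 giving "adding 1" and λ = .1 the second example.

```python
from collections import Counter

train = "<s> what is it what is small ?".split()
V = ["what", "is", "it", "small", "?", "<s>", "flying", "birds", "are", "a", "bird", "."]
counts = Counter(train)                     # c(w); |T| = 8, |V| = 12

def p_add(word, lam):
    # p'(w) = (c(w) + lambda) / (|T| + lambda * |V|)
    return (counts.get(word, 0) + lam) / (len(train) + lam * len(V))

def p_sentence(words, lam):
    prob = 1.0
    for w in words:
        prob *= p_add(w, lam)               # unigram model: just a product
    return prob

for lam in (1.0, 0.1):                      # adding 1, adding less than 1
    print(lam, round(p_add("it", lam), 3), round(p_add("what", lam), 3), round(p_add(".", lam), 3))
    print("  p'(what is it ?)   =", p_sentence("what is it ?".split(), lam))
    print("  p'(it is flying .) =", p_sentence("it is flying .".split(), lam))
# lam = 1:  .1 / .15 / .05, sentence probs ~.0002 and ~.00004 (the slide's values)
# lam = .1: .12 / .228 / .011, sentence probs ~.0007 and ~.000003
```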
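Similarly, a minimal Good-Turing sketch (again not from the slides) reproduces the example just shown; the "keep the raw relative frequency when N(c+1) = 0" fallback and the final renormalization follow the slide.

```python
from collections import Counter

train = "<s> what is it what is small ?".split()
vocab = ["what", "is", "it", "small", "?", "<s>", "flying", "birds", "are", "a", "bird", "."]

T = len(train)                                    # |T| = 8
counts = Counter(train)                           # c(w)
N = Counter(counts.get(w, 0) for w in vocab)      # count-of-counts N(c) over V

def good_turing(w):
    c = counts.get(w, 0)
    if N[c + 1] == 0:                             # e.g. c(what) = 2 and N(3) = 0
        return c / T                              # keep the raw relative frequency
    return (c + 1) * N[c + 1] / (T * N[c])

raw = {w: good_turing(w) for w in vocab}          # sums to 1.5 here
p = {w: pr / sum(raw.values()) for w, pr in raw.items()}   # renormalize to 1

print(round(p["it"], 3), round(p["what"], 3), round(p["."], 3))
# 0.083 0.167 0.056, i.e. the ~.08 / ~.17 / ~.06 of the slide
```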
Smoothing by Combination: Linear Interpolation

• Combine what?
  - distributions of various levels of detail vs. reliability
• n-gram models:
  - use the (n-1)-gram, (n-2)-gram, ..., uniform distributions
    (--> reliability increases, <-- detail increases)
• Simplest possible combination:
  - sum of probabilities, normalize:
    * p(0|0) = .8, p(1|0) = .2, p(0|1) = 1, p(1|1) = 0, p(0) = .4, p(1) = .6:
    * p'(0|0) = .6, p'(1|0) = .4, p'(0|1) = .7, p'(1|1) = .3


Typical n-gram LM Smoothing

• Weight in less detailed distributions using λ = (λ0, λ1, λ2, λ3):
  p'_λ(w_i | w_{i-2}, w_{i-1}) = λ3 p3(w_i | w_{i-2}, w_{i-1}) + λ2 p2(w_i | w_{i-1})
                                 + λ1 p1(w_i) + λ0 / |V|
• Normalize: λ_j > 0, Σ_{j=0..3} λ_j = 1 is sufficient (λ0 = 1 - Σ_{j=1..3} λ_j)   (n=3)
• Estimation using MLE:
  - fix the p3, p2, p1 and |V| parameters as estimated from the training data
  - then find such {λ_j} which minimizes the cross entropy
    (maximizes the probability of the data): -(1/|D|) Σ_{i=1..|D|} log2(p'_λ(w_i|h_i))


Held-out Data

• What data to use?
  - try the training data T: but we will always get λ3 = 1
    * why? (let p_iT be an i-gram distribution estimated using relative frequencies from T)
    * minimizing H_T(p'_λ) over a vector λ, p'_λ = λ3 p_3T + λ2 p_2T + λ1 p_1T + λ0/|V|
      - remember: H_T(p'_λ) = H(p_3T) + D(p_3T || p'_λ); p_3T is fixed --> H(p_3T) fixed, best
    * which p'_λ minimizes H_T(p'_λ)? Obviously, a p'_λ for which D(p_3T || p'_λ) = 0
    * ...and that is p_3T (because D(p||p) = 0, as we know)
    * ...and certainly p'_λ = p_3T if λ3 = 1 (maybe in some other cases, too)
      (p'_λ = 1 x p_3T + 0 x p_2T + 0 x p_1T + 0/|V|)
  - thus: do not use the training data for estimation of λ!
    * must hold out part of the training data (heldout data, H)
    * ...call the remaining data the (true/raw) training data, T
    * the test data S (e.g., for comparison purposes): still different data!


The Formulas

• Repeat: minimizing -(1/|H|) Σ_{i=1..|H|} log2(p'_λ(w_i|h_i)) over λ:
  p'_λ(w_i | h_i) = p'_λ(w_i | w_{i-2}, w_{i-1}) = λ3 p3(w_i | w_{i-2}, w_{i-1})
                    + λ2 p2(w_i | w_{i-1}) + λ1 p1(w_i) + λ0 / |V|
• "Expected Counts (of lambdas)":   j = 0..3
  c(λ_j) = Σ_{i=1..|H|} λ_j p_j(w_i|h_i) / p'_λ(w_i|h_i)     (with p_0(w_i|h_i) = 1/|V|)
• "Next λ":   j = 0..3
  λ_{j,next} = c(λ_j) / Σ_{k=0..3} c(λ_k)


The (Smoothing) EM Algorithm

1. Start with some λ such that λ_j > 0 for all j ∈ 0..3.
2. Compute the "Expected Counts" for each λ_j.
3. Compute the new set of λ_j, using the "Next λ" formula.
4. Start over at step 2, unless a termination condition is met.
• Termination condition: convergence of λ.
  - Simply set an ε and finish if |λ_j - λ_{j,next}| < ε for each j (in step 3).
• Guaranteed to converge: follows from Jensen's inequality, plus a technical proof.
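The expected-count and next-λ formulas translate almost line by line into code. The sketch below is an illustration, not the course implementation: it assumes the component models are given as a list of functions p[j](w, h), with p[0] being the uniform 1/|V| distribution, and that the heldout data is a list of (w, h) pairs.

```python
def em_lambdas(p, heldout, eps=1e-4, start=None):
    """EM for linear-interpolation weights: p is a list of functions p[j](w, h),
    heldout a list of (w, h) pairs. Assumes the interpolated probability is > 0
    for every heldout item (true if p[0] is uniform and all lambdas start > 0)."""
    J = len(p)                                       # e.g. 4: uniform, uni-, bi-, trigram
    lams = start or [1.0 / J] * J                    # step 1: strictly positive start
    while True:
        counts = [0.0] * J
        for w, h in heldout:                         # step 2: expected counts c(lambda_j)
            p_interp = sum(lams[j] * p[j](w, h) for j in range(J))
            for j in range(J):
                counts[j] += lams[j] * p[j](w, h) / p_interp
        new = [c / sum(counts) for c in counts]      # step 3: "next lambda"
        if max(abs(n - o) for n, o in zip(new, lams)) < eps:
            return new                               # step 4: converged
        lams = new
```

With only two components (uniform and unigram) this is exactly the computation carried out by hand in the Simple Example a few slides below.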
Remark on Linear Interpolation Smoothing

• "Bucketed" smoothing:
  - use several vectors of λ instead of one, based on (the frequency of) the history: λ(h)
    * e.g. for h = (micrograms, per) we will have λ(h) = (.999, .0009, .00009, .00001)
      (because "cubic" is the only word to follow...)
  - actually: not a separate set for each history, but rather a set for "similar"
    histories ("buckets"): λ(b(h)), where b: V^2 --> N (in the case of trigrams)
    * b classifies histories according to their reliability (~ frequency)


Bucketed Smoothing: The Algorithm

• First, determine the bucketing function b (use the heldout data!):
  - decide in advance that you want e.g. 1000 buckets
  - compute the total frequency of histories in one bucket (f_max(b))
  - gradually fill your buckets from the most frequent bigrams so that the sum of
    frequencies does not exceed f_max(b) (you might end up with slightly more
    than 1000 buckets)
• Divide your heldout data according to the buckets
• Apply the previous algorithm to each bucket and its data


Simple Example

• Raw distribution (unigram only; smooth with uniform):
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z
• Heldout data: baby; use one set of λ (λ1: unigram, λ0: uniform)
• Start with λ1 = λ0 = .5:
  p'_λ(b) = .5 x .5 + .5/26 ≈ .27
  p'_λ(a) = .5 x .25 + .5/26 ≈ .14
  p'_λ(y) = .5 x 0 + .5/26 ≈ .02
  c(λ1) = .5x.5/.27 + .5x.25/.14 + .5x.5/.27 + .5x0/.02 ≈ 2.72
  c(λ0) = .5x.04/.27 + .5x.04/.14 + .5x.04/.27 + .5x.04/.02 ≈ 1.28
• Normalize: λ1,next ≈ .68, λ0,next ≈ .32
• Repeat from step 2 (recompute p'_λ first for efficient computation, then c(λ_j), ...).
  Finish when the new lambdas are almost equal to the old ones (say, < 0.01 difference).


Some More Technical Hints

• Set V = {all words from the training data}.
  - You may also consider V = T ∪ H, but it does not make the coding in any way
    simpler (in fact, harder).
  - But: you must never use the test data for your vocabulary!
• Prepend two "words" in front of all data:
  - avoids beginning-of-data problems
  - call these index -1 and 0: then the formulas hold exactly
• When c_n(w,h) = 0:
  - Assign 0 probability to p_n(w|h) where c_{n-1}(h) > 0, but a uniform
    probability (1/|V|) to those p_n(w|h) where c_{n-1}(h) = 0
    [this must be done both when working on the heldout data during EM, and when
    computing the cross-entropy on the test data!]
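As a closing check, this small sketch (not part of the original slides) runs the two-component EM from the Simple Example above: unigram p1 smoothed with the uniform 1/26 distribution, heldout data "baby", starting from λ1 = λ0 = .5.

```python
# Unigram distribution from the Simple Example; letters s..z are simply absent (p = 0).
p1 = {"a": .25, "b": .5, **{chr(c): 1 / 64 for c in range(ord("c"), ord("r") + 1)}}
uniform = 1 / 26
heldout = "baby"

lam1, lam0 = .5, .5
for step in range(10):                      # a handful of EM iterations suffices here
    c1 = c0 = 0.0
    for ch in heldout:
        p_interp = lam1 * p1.get(ch, 0.0) + lam0 * uniform      # p'_lambda(ch)
        c1 += lam1 * p1.get(ch, 0.0) / p_interp                 # expected count c(lambda1)
        c0 += lam0 * uniform / p_interp                         # expected count c(lambda0)
    lam1, lam0 = c1 / (c1 + c0), c0 / (c1 + c0)                 # "next lambda"
    print(f"step {step + 1}: c(l1) = {c1:.2f}, c(l0) = {c0:.2f}, "
          f"lambdas = {lam1:.2f} / {lam0:.2f}")
# step 1 prints c(l1) = 2.72, c(l0) = 1.28 and new lambdas .68 / .32, as on the slide
```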