Introduction to Natural Language Processing (600.465)
Language Modeling (and the Noisy Channel)

Dr. Jan Hajic
CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu
www.cs.jhu.edu/~hajic


The Noisy Channel

• Prototypical case: Input --> The channel (adds noise) --> Output (noisy)
• Model: probability of error (noise)
• Example: p(0|1) = .3, p(1|1) = .7, p(1|0) = .4, p(0|0) = .6
• The Task: known: the noisy output; want to know: the input (decoding)


Noisy Channel Applications

• OCR
  - straightforward: text --> print (adds noise), scan --> image
• Handwriting recognition
  - text --> neurons, muscles ("noise"), scan/digitize --> image
• Speech recognition (dictation, commands, etc.)
  - text --> conversion to acoustic signal ("noise") --> acoustic waves
• Machine Translation
  - text in target language --> translation ("noise") --> source language
• Also: Part of Speech Tagging
  - sequence of tags --> selection of word forms --> text


Noisy Channel: The Golden Rule of OCR, ASR, HR, MT, ...

• Recall: p(A|B) = p(B|A) p(A) / p(B)   (Bayes formula)
  A_best = argmax_A p(B|A) p(A)   (The Golden Rule)
• p(B|A): the acoustic/image/translation/lexical model
  - application-specific name
  - will explore later
• p(A): the language model


The Perfect Language Model

• Sequence of word forms   [forget about tagging for the moment]
• Notation: A ~ W = (w_1, w_2, w_3, ..., w_d)
• The big (modeling) question: p(W) = ?
• Well, we know (Bayes/chain rule -->):
  p(W) = p(w_1, w_2, w_3, ..., w_d)
       = p(w_1) x p(w_2|w_1) x p(w_3|w_1,w_2) x ... x p(w_d|w_1,w_2,...,w_{d-1})
• Not practical (even short W --> too many parameters)


Markov Chain

• Unlimited memory (cf. previous foil):
  - for w_i, we know all its predecessors w_1, w_2, w_3, ..., w_{i-1}
• Limited memory:
  - we disregard "too old" predecessors
  - remember only k previous words: w_{i-k}, w_{i-k+1}, ..., w_{i-1}
  - called a "k-th order Markov approximation"
• + stationary character (no change over time):
  p(W) = Π_{i=1..d} p(w_i | w_{i-k}, ..., w_{i-1}),   d = |W|


n-gram Language Models

• (n-1)-th order Markov approximation --> n-gram LM:
  p(W) =df Π_{i=1..d} p(w_i | w_{i-n+1}, w_{i-n+2}, ..., w_{i-1})
                          (prediction | history)
• In particular (assume vocabulary |V| = 60k):
  - 0-gram LM: uniform model,  p(w) = 1/|V|,            1 parameter
  - 1-gram LM: unigram model,  p(w),                    6x10^4 parameters
  - 2-gram LM: bigram model,   p(w_i|w_{i-1}),          3.6x10^9 parameters
  - 3-gram LM: trigram model,  p(w_i|w_{i-2},w_{i-1}),  2.16x10^14 parameters


LM: Observations

• How large n?
  - nothing is enough (theoretically)
  - but anyway: as much as possible (--> close to the "perfect" model)
  - empirically: 3
    * parameter estimation? (reliability vs. data availability, storage space, ...)
    * 4 is too much: |V| = 60k --> 1.296x10^19 parameters
    * but: 6-7 would be (almost) ideal (having enough data): in fact, one can
      recover the original from 7-grams!
• Reliability ~ (1 / Detail)   (--> need a compromise)
• For now, keep word forms (no "linguistic" processing)
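To make the factorization and the parameter counts concrete, here is a minimal Python sketch; it is not part of the original slides, and the uniform cond_prob stand-in is purely illustrative (a real conditional model is estimated from data later in the lecture).

```python
def ngram_prob(words, n, cond_prob):
    """p(W) under the (n-1)-th order Markov approximation:
    p(W) = prod_i p(w_i | w_{i-n+1}, ..., w_{i-1})."""
    prob = 1.0
    for i, w in enumerate(words):
        history = tuple(words[max(0, i - (n - 1)):i])   # at most n-1 previous words
        prob *= cond_prob(w, history)
    return prob

def uniform_cond_prob(word, history, vocab_size=3):
    return 1.0 / vocab_size          # toy stand-in for p(w | h); ignores the history

print(ngram_prob("a b a".split(), 2, uniform_cond_prob))   # 1/27 under the toy model

# Parameter counts from the slide: an n-gram LM over |V| = 60k word forms has
# about |V|^n conditional probabilities to estimate.
V = 60_000
for n in (1, 2, 3, 4):
    print(f"{n}-gram LM: ~{V ** n:.4g} parameters")
# ~6e+04, ~3.6e+09, ~2.16e+14, ~1.296e+19 -- the numbers on the slide
```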
The Length Issue

• ∀n: Σ_{w∈Ω^n} p(w) = 1   =>   Σ_{n=1..∞} Σ_{w∈Ω^n} p(w) >> 1   (--> ∞)
• We want to model all sequences of words
  - for "fixed" length tasks: no problem - n fixed, sum is 1
    * tagging, OCR/handwriting (if words identified ahead of time)
  - for "variable" length tasks: have to account for it:
    * discount shorter sentences
• General model: for each sequence of words of length n, define p'(w) = λ_n p(w)
  such that Σ_{n=1..∞} λ_n = 1   =>   Σ_{n=1..∞} Σ_{w∈Ω^n} p'(w) = 1
  - e.g., estimate λ_n from data; or use the normal or another distribution


Parameter Estimation

• Parameter: numerical value needed to compute p(w|h)
• From data (how else?)
• Data preparation:
  - get rid of formatting etc. ("text cleaning")
  - define words (separate but include punctuation, call it a "word")
  - define sentence boundaries (insert the "words" <s> and </s>)
  - letter case: keep, discard, or be smart:
    * name recognition
    * number type identification
    [these are huge problems per se!]
  - numbers: keep, replace by <num>, or be smart (form ~ pronunciation)


Maximum Likelihood Estimate

• MLE: Relative Frequency...
  - ...best predicts the data at hand (the "training data")
• Trigrams from Training Data T:
  - count sequences of three words in T: c3(w_{i-2}, w_{i-1}, w_i)
    [NB: the notation just says that the three words follow each other]
  - count sequences of two words in T: c2(w_{i-1}, w_i):
    * either use c2(y,z) = Σ_w c3(y,z,w)
    * or count differently at the beginning (& end) of data!
• p(w_i | w_{i-2}, w_{i-1}) =est c3(w_{i-2}, w_{i-1}, w_i) / c2(w_{i-2}, w_{i-1})


Character Language Model

• Use individual characters instead of words:
  p(W) =df Π_{i=1..d} p(c_i | c_{i-n+1}, c_{i-n+2}, ..., c_{i-1})
• Same formulas etc.
• Might consider 4-grams, 5-grams or even more
• Good only for language comparison
• Transform cross-entropy between letter- and word-based models:
  H_S(p_c) = H_S(p_w) / (avg. # of characters per word in S)


LM: an Example

• Training data: <s> <s> He can buy the can of soda .
  - Unigram: p1(He) = p1(buy) = p1(the) = p1(of) = p1(soda) = p1(.) = .125,
    p1(can) = .25
  - Bigram: p2(He|<s>) = 1, p2(can|He) = 1, p2(buy|can) = .5, p2(of|can) = .5,
    p2(the|buy) = 1, ...
  - Trigram: p3(He|<s>,<s>) = 1, p3(can|<s>,He) = 1, p3(buy|He,can) = 1,
    p3(of|the,can) = 1, ..., p3(.|of,soda) = 1
  - Entropy: H(p1) = 2.75, H(p2) = .25, H(p3) = 0   <-- Great?!


LM: an Example (The Problem)

• Cross-entropy:
  - S = It was the greatest buy of all.
• Even H_S(p1) fails (= H_S(p2) = H_S(p3) = ∞), because:
  - all unigrams but p1(the), p1(buy), p1(of) and p1(.) are 0
  - all bigram probabilities are 0
  - all trigram probabilities are 0
• We want: to make all (theoretically possible*) probabilities non-zero
  *in fact, all of them: remember our graph from day 1?


Introduction to Natural Language Processing (600.465)
LM Smoothing (The EM Algorithm)

Dr. Jan Hajic
CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu
www.cs.jhu.edu/~hajic


The Zero Problem

• "Raw" n-gram language model estimate:
  - necessarily, some zeros
    * many!: trigram model --> 2.16x10^14 parameters, data ~ 10^9 words
  - which are the true 0s?
    * optimal situation: even the least frequent trigram would be seen several
      times, in order to distinguish its probability from the other trigrams
    * the optimal situation cannot happen, unfortunately
      (open question: how much data would we need?)
  - --> we don't know
  - we must eliminate the zeros
• Two kinds of zeros: p(w|h) = 0, or even p(h) = 0!
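The zero problem can be seen directly on the toy example. The following minimal sketch (not part of the original slides) builds MLE unigram, bigram and trigram models from the training sentence and evaluates the cross-entropy on the test sentence; whitespace tokenization and the two <s> padding tokens are assumptions made so that the histories are always defined.

```python
import math
from collections import Counter

train = "<s> <s> He can buy the can of soda .".split()
test  = "<s> <s> It was the greatest buy of all .".split()
PAD = 2                                    # two <s> tokens, not predicted themselves

def mle_ngrams(tokens, n):
    """Relative frequencies: p(w | h) = c_n(h, w) / c_{n-1}(h)."""
    grams     = Counter(tuple(tokens[i - n + 1:i + 1]) for i in range(PAD, len(tokens)))
    histories = Counter(tuple(tokens[i - n + 1:i])     for i in range(PAD, len(tokens)))
    return {g: c / histories[g[:-1]] for g, c in grams.items()}

def cross_entropy(model, tokens, n):
    """-(1/|S|) sum_i log2 p(w_i | h_i); infinite as soon as one p(w|h) is zero."""
    total, count = 0.0, 0
    for i in range(PAD, len(tokens)):
        p = model.get(tuple(tokens[i - n + 1:i + 1]), 0.0)
        total += math.log2(p) if p > 0 else -math.inf
        count += 1
    return -total / count if total != 0 else 0.0

for n in (1, 2, 3):
    model = mle_ngrams(train, n)
    print(n, cross_entropy(model, train, n), cross_entropy(model, test, n))
# on training data: 2.75, 0.25, 0.0 (the H values on the slide);
# on test data: inf, inf, inf -- the zero problem
```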
Why do we need Nonzero Probs?

• To avoid infinite Cross-Entropy:
  - happens when an event is found in the test data which has not been seen in
    the training data: H(p) = ∞ prevents comparing data with > 0 "errors"
• To make the system more robust
  - low count estimates:
    * they typically happen for "detailed" but relatively rare appearances
  - high count estimates: reliable but less "detailed"


Eliminating the Zero Probabilities: Smoothing

• Get a new p'(w) (same Ω): almost p(w), but with no zeros
• Discount w for (some) p(w) > 0: new p'(w) < p(w)
  Σ_{w∈discounted} (p(w) - p'(w)) = D
• Distribute D to all w with p(w) = 0: new p'(w) > p(w)
  - possibly also to other w with low p(w)
• For some w (possibly): p'(w) = p(w)
• Make sure Σ_{w∈Ω} p'(w) = 1
• There are many ways of smoothing


Smoothing by Adding 1

• Simplest but not really usable:
  - Predicting words w from a vocabulary V, training data T:
    p'(w|h) = (c(h,w) + 1) / (c(h) + |V|)
    * for non-conditional distributions: p'(w) = (c(w) + 1) / (|T| + |V|)
  - Problem if |V| > c(h) (as is often the case; even >> c(h)!)
• Example: Training data: <s> what is it what is small ?   |T| = 8
  - V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . },  |V| = 12
  - p(it) = .125, p(what) = .25, p(.) = 0
    p(what is it?) = .25^2 x .125^2 ≈ .001
    p(it is flying.) = .125 x .25 x 0^2 = 0
  - p'(it) = .1, p'(what) = .15, p'(.) = .05
    p'(what is it?) = .15^2 x .1^2 ≈ .0002
    p'(it is flying.) = .1 x .15 x .05^2 ≈ .00004


Adding less than 1

• Equally simple:
  - Predicting words w from a vocabulary V, training data T:
    p'(w|h) = (c(h,w) + λ) / (c(h) + λ|V|),   λ < 1
    * for non-conditional distributions: p'(w) = (c(w) + λ) / (|T| + λ|V|)
• Example: Training data: <s> what is it what is small ?   |T| = 8
  - V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . },  |V| = 12
  - p(it) = .125, p(what) = .25, p(.) = 0
    p(what is it?) = .25^2 x .125^2 ≈ .001
    p(it is flying.) = .125 x .25 x 0^2 = 0
  - Use λ = .1:
    p'(it) ≈ .12, p'(what) ≈ .23, p'(.) ≈ .01
    p'(what is it?) = .23^2 x .12^2 ≈ .0007
    p'(it is flying.) = .12 x .23 x .01^2 ≈ .000003


Good-Turing

• Suitable for estimation from large data
  - similar idea: discount/boost the relative frequency estimate:
    p_r(w) = (c(w) + 1) x N(c(w) + 1) / (|T| x N(c(w))),
    where N(c) is the count of words with count c (count-of-counts)
  - specifically, for c(w) = 0 (unseen words): p_r(w) = N(1) / (|T| x N(0))
  - good for small counts (< 5-10, where N(c) is high)
  - variants (see M&S)
  - normalization! (so that we have Σ_w p'(w) = 1)


Good-Turing: An Example

• Example: remember: p_r(w) = (c(w) + 1) x N(c(w) + 1) / (|T| x N(c(w)))
  Training data: <s> what is it what is small ?   |T| = 8
  - V = { what, is, it, small, ?, <s>, flying, birds, are, a, bird, . },  |V| = 12
  - p(it) = .125, p(what) = .25, p(.) = 0
    p(what is it?) = .25^2 x .125^2 ≈ .001
    p(it is flying.) = .125 x .25 x 0^2 = 0
• Raw reestimation (N(0) = 6, N(1) = 4, N(2) = 2, N(i) = 0 for i > 2):
  p_r(it) = (1+1) x N(1+1) / (8 x N(1)) = 2 x 2 / (8 x 4) = .125
  p_r(what) = (2+1) x N(2+1) / (8 x N(2)) = 3 x 0 / (8 x 2) = 0: keep orig. p(what)
  p_r(.) = (0+1) x N(0+1) / (8 x N(0)) = 1 x 4 / (8 x 6) ≈ .083
• Normalize (divide by 1.5 = Σ_{w∈V} p_r(w)) and compute:
  p'(it) ≈ .08, p'(what) ≈ .17, p'(.) ≈ .06
  p'(what is it?) = .17^2 x .08^2 ≈ .0002
  p'(it is flying.) = .08 x .17 x .06^2 ≈ .00004
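The add-1 and add-λ numbers above are easy to verify. The sketch below is not from the original slides; it just recomputes the smoothed unigram probabilities and the two sentence probabilities, with λ = 1 giving "adding 1" and λ = .1 the second example.

```python
from collections import Counter

train = "<s> what is it what is small ?".split()
V = ["what", "is", "it", "small", "?", "<s>", "flying", "birds", "are", "a", "bird", "."]
counts = Counter(train)                     # c(w); |T| = 8, |V| = 12

def p_add(word, lam):
    # p'(w) = (c(w) + lambda) / (|T| + lambda * |V|)
    return (counts.get(word, 0) + lam) / (len(train) + lam * len(V))

def p_sentence(words, lam):
    prob = 1.0
    for w in words:
        prob *= p_add(w, lam)               # unigram model: just a product
    return prob

for lam in (1.0, 0.1):                      # adding 1, adding less than 1
    print(lam, round(p_add("it", lam), 3), round(p_add("what", lam), 3), round(p_add(".", lam), 3))
    print("  p'(what is it ?)   =", p_sentence("what is it ?".split(), lam))
    print("  p'(it is flying .) =", p_sentence("it is flying .".split(), lam))
# lam = 1:  .1 / .15 / .05, sentence probs ~.0002 and ~.00004 (the slide's values)
# lam = .1: .12 / .228 / .011, sentence probs ~.0007 and ~.000003
```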
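Similarly, a minimal Good-Turing sketch (again not from the slides) reproduces the example just shown; the "keep the raw relative frequency when N(c+1) = 0" fallback and the final renormalization follow the slide.

```python
from collections import Counter

train = "<s> what is it what is small ?".split()
vocab = ["what", "is", "it", "small", "?", "<s>", "flying", "birds", "are", "a", "bird", "."]

T = len(train)                                    # |T| = 8
counts = Counter(train)                           # c(w)
N = Counter(counts.get(w, 0) for w in vocab)      # count-of-counts N(c) over V

def good_turing(w):
    c = counts.get(w, 0)
    if N[c + 1] == 0:                             # e.g. c(what) = 2 and N(3) = 0
        return c / T                              # keep the raw relative frequency
    return (c + 1) * N[c + 1] / (T * N[c])

raw = {w: good_turing(w) for w in vocab}          # sums to 1.5 here
p = {w: pr / sum(raw.values()) for w, pr in raw.items()}   # renormalize to 1

print(round(p["it"], 3), round(p["what"], 3), round(p["."], 3))
# 0.083 0.167 0.056, i.e. the ~.08 / ~.17 / ~.06 of the slide
```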
Smoothing by Combination: Linear Interpolation

• Combine what?
  - distributions of various levels of detail vs. reliability
• n-gram models:
  - use the (n-1)-gram, (n-2)-gram, ..., uniform distributions
    (--> reliability increases, <-- detail increases)
• Simplest possible combination:
  - sum of probabilities, normalize:
    * p(0|0) = .8, p(1|0) = .2, p(0|1) = 1, p(1|1) = 0, p(0) = .4, p(1) = .6:
    * p'(0|0) = .6, p'(1|0) = .4, p'(0|1) = .7, p'(1|1) = .3


Typical n-gram LM Smoothing

• Weight in less detailed distributions using λ = (λ0, λ1, λ2, λ3):
  p'_λ(w_i | w_{i-2}, w_{i-1}) = λ3 p3(w_i | w_{i-2}, w_{i-1}) + λ2 p2(w_i | w_{i-1})
                                 + λ1 p1(w_i) + λ0 / |V|
• Normalize: λ_j > 0, Σ_{j=0..3} λ_j = 1 is sufficient (λ0 = 1 - Σ_{j=1..3} λ_j)   (n=3)
• Estimation using MLE:
  - fix the p3, p2, p1 and |V| parameters as estimated from the training data
  - then find such {λ_j} which minimizes the cross entropy
    (maximizes the probability of the data): -(1/|D|) Σ_{i=1..|D|} log2(p'_λ(w_i|h_i))


Held-out Data

• What data to use?
  - try the training data T: but we will always get λ3 = 1
    * why? (let p_iT be an i-gram distribution estimated using relative frequencies from T)
    * minimizing H_T(p'_λ) over a vector λ, p'_λ = λ3 p_3T + λ2 p_2T + λ1 p_1T + λ0/|V|
      - remember: H_T(p'_λ) = H(p_3T) + D(p_3T || p'_λ); p_3T is fixed --> H(p_3T) fixed, best
    * which p'_λ minimizes H_T(p'_λ)? Obviously, a p'_λ for which D(p_3T || p'_λ) = 0
    * ...and that is p_3T (because D(p||p) = 0, as we know)
    * ...and certainly p'_λ = p_3T if λ3 = 1 (maybe in some other cases, too)
      (p'_λ = 1 x p_3T + 0 x p_2T + 0 x p_1T + 0/|V|)
  - thus: do not use the training data for estimation of λ!
    * must hold out part of the training data (heldout data, H)
    * ...call the remaining data the (true/raw) training data, T
    * the test data S (e.g., for comparison purposes): still different data!


The Formulas

• Repeat: minimizing -(1/|H|) Σ_{i=1..|H|} log2(p'_λ(w_i|h_i)) over λ:
  p'_λ(w_i | h_i) = p'_λ(w_i | w_{i-2}, w_{i-1}) = λ3 p3(w_i | w_{i-2}, w_{i-1})
                    + λ2 p2(w_i | w_{i-1}) + λ1 p1(w_i) + λ0 / |V|
• "Expected Counts (of lambdas)":   j = 0..3
  c(λ_j) = Σ_{i=1..|H|} λ_j p_j(w_i|h_i) / p'_λ(w_i|h_i)     (with p_0(w_i|h_i) = 1/|V|)
• "Next λ":   j = 0..3
  λ_{j,next} = c(λ_j) / Σ_{k=0..3} c(λ_k)


The (Smoothing) EM Algorithm

1. Start with some λ such that λ_j > 0 for all j ∈ 0..3.
2. Compute the "Expected Counts" for each λ_j.
3. Compute the new set of λ_j, using the "Next λ" formula.
4. Start over at step 2, unless a termination condition is met.
• Termination condition: convergence of λ.
  - Simply set an ε and finish if |λ_j - λ_{j,next}| < ε for each j (in step 3).
• Guaranteed to converge: follows from Jensen's inequality, plus a technical proof.
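The expected-count and next-λ formulas translate almost line by line into code. The sketch below is an illustration, not the course implementation: it assumes the component models are given as a list of functions p[j](w, h), with p[0] being the uniform 1/|V| distribution, and that the heldout data is a list of (w, h) pairs.

```python
def em_lambdas(p, heldout, eps=1e-4, start=None):
    """EM for linear-interpolation weights: p is a list of functions p[j](w, h),
    heldout a list of (w, h) pairs. Assumes the interpolated probability is > 0
    for every heldout item (true if p[0] is uniform and all lambdas start > 0)."""
    J = len(p)                                       # e.g. 4: uniform, uni-, bi-, trigram
    lams = start or [1.0 / J] * J                    # step 1: strictly positive start
    while True:
        counts = [0.0] * J
        for w, h in heldout:                         # step 2: expected counts c(lambda_j)
            p_interp = sum(lams[j] * p[j](w, h) for j in range(J))
            for j in range(J):
                counts[j] += lams[j] * p[j](w, h) / p_interp
        new = [c / sum(counts) for c in counts]      # step 3: "next lambda"
        if max(abs(n - o) for n, o in zip(new, lams)) < eps:
            return new                               # step 4: converged
        lams = new
```

With only two components (uniform and unigram) this is exactly the computation carried out by hand in the Simple Example a few slides below.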
Remark on Linear Interpolation Smoothing

• "Bucketed" smoothing:
  - use several vectors of λ instead of one, based on (the frequency of) the history: λ(h)
    * e.g. for h = (micrograms, per) we will have λ(h) = (.999, .0009, .00009, .00001)
      (because "cubic" is the only word to follow...)
  - actually: not a separate set for each history, but rather a set for "similar"
    histories ("buckets"): λ(b(h)), where b: V^2 --> N (in the case of trigrams)
    * b classifies histories according to their reliability (~ frequency)


Bucketed Smoothing: The Algorithm

• First, determine the bucketing function b (use the heldout data!):
  - decide in advance that you want e.g. 1000 buckets
  - compute the total frequency of histories in one bucket (f_max(b))
  - gradually fill your buckets from the most frequent bigrams so that the sum of
    frequencies does not exceed f_max(b) (you might end up with slightly more
    than 1000 buckets)
• Divide your heldout data according to the buckets
• Apply the previous algorithm to each bucket and its data


Simple Example

• Raw distribution (unigram only; smooth with uniform):
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z
• Heldout data: baby; use one set of λ (λ1: unigram, λ0: uniform)
• Start with λ1 = λ0 = .5:
  p'_λ(b) = .5 x .5 + .5/26 ≈ .27
  p'_λ(a) = .5 x .25 + .5/26 ≈ .14
  p'_λ(y) = .5 x 0 + .5/26 ≈ .02
  c(λ1) = .5x.5/.27 + .5x.25/.14 + .5x.5/.27 + .5x0/.02 ≈ 2.72
  c(λ0) = .5x.04/.27 + .5x.04/.14 + .5x.04/.27 + .5x.04/.02 ≈ 1.28
• Normalize: λ1,next ≈ .68, λ0,next ≈ .32
• Repeat from step 2 (recompute p'_λ first for efficient computation, then c(λ_j), ...).
  Finish when the new lambdas are almost equal to the old ones (say, < 0.01 difference).


Some More Technical Hints

• Set V = {all words from the training data}.
  - You may also consider V = T ∪ H, but it does not make the coding in any way
    simpler (in fact, harder).
  - But: you must never use the test data for your vocabulary!
• Prepend two "words" in front of all data:
  - avoids beginning-of-data problems
  - call these index -1 and 0: then the formulas hold exactly
• When c_n(w,h) = 0:
  - Assign 0 probability to p_n(w|h) where c_{n-1}(h) > 0, but a uniform
    probability (1/|V|) to those p_n(w|h) where c_{n-1}(h) = 0
    [this must be done both when working on the heldout data during EM, and when
    computing the cross-entropy on the test data!]
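As a closing check, this small sketch (not part of the original slides) runs the two-component EM from the Simple Example above: unigram p1 smoothed with the uniform 1/26 distribution, heldout data "baby", starting from λ1 = λ0 = .5.

```python
# Unigram distribution from the Simple Example; letters s..z are simply absent (p = 0).
p1 = {"a": .25, "b": .5, **{chr(c): 1 / 64 for c in range(ord("c"), ord("r") + 1)}}
uniform = 1 / 26
heldout = "baby"

lam1, lam0 = .5, .5
for step in range(10):                      # a handful of EM iterations suffices here
    c1 = c0 = 0.0
    for ch in heldout:
        p_interp = lam1 * p1.get(ch, 0.0) + lam0 * uniform      # p'_lambda(ch)
        c1 += lam1 * p1.get(ch, 0.0) / p_interp                 # expected count c(lambda1)
        c0 += lam0 * uniform / p_interp                         # expected count c(lambda0)
    lam1, lam0 = c1 / (c1 + c0), c0 / (c1 + c0)                 # "next lambda"
    print(f"step {step + 1}: c(l1) = {c1:.2f}, c(l0) = {c0:.2f}, "
          f"lambdas = {lam1:.2f} / {lam0:.2f}")
# step 1 prints c(l1) = 2.72, c(l0) = 1.28 and new lambdas .68 / .32, as on the slide
```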