Introduction to Natural Language Processing (600.465)
Probability
Dr. Jan Hajic
CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu
www.cs.jhu.edu/~hajic

Experiments & Sample Spaces
• Experiment, process, test, ...
• Set of possible basic outcomes: sample space Ω
  - coin toss (Ω = {head, tail}), die (Ω = {1..6})
  - yes/no opinion poll, quality test (bad/good) (Ω = {0,1})
  - lottery (|Ω| ≈ 10^7 .. 10^12)
  - # of traffic accidents somewhere per year (Ω = N)
  - spelling errors (Ω = Z*), where Z is an alphabet and Z* is the set of possible strings over that alphabet
  - missing word (|Ω| ≈ vocabulary size)

Events
• Event A is a set of basic outcomes
• Usually A ⊆ Ω, and all A ∈ 2^Ω (the event space)
  - Ω is then the certain event, ∅ is the impossible event
• Example:
  - experiment: three coin tosses
    • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  - count cases with exactly two tails: then
    • A = {HTT, THT, TTH}
  - all heads:
    • A = {HHH}

Probability
• Repeat the experiment many times, record how many times a given event A occurred ("count" c_1).
• Do this whole series many times; remember all c_i.
• Observation: if repeated really many times, the ratios c_i/T_i (where T_i is the number of experiments run in the i-th series) are close to some (unknown but) constant value.
• Call this constant the probability of A. Notation: p(A)

Estimating probability
• Remember: ... close to an unknown constant.
• We can only estimate it:
  - from a single series (typical case, as mostly the outcome of a series is given to us and we cannot repeat the experiment), set p(A) = c_1/T_1
  - otherwise, take the weighted average of all c_i/T_i (or, if the data allows, simply treat the set of series as a single long series).
• This is the best estimate.

Example
• Recall our example:
  - experiment: three coin tosses
    • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  - count cases with exactly two tails: A = {HTT, THT, TTH}
• Run the experiment 1000 times (i.e. 3000 tosses)
• Counted: 386 cases with two tails (HTT, THT, or TTH)
• estimate: p(A) = 386 / 1000 = .386
• Run again: 373, 399, 382, 355, 372, 406, 359
  - p(A) = .379 (weighted average), or simply 3032 / 8000
• Uniform distribution assumption: p(A) = 3/8 = .375

Basic Properties
• Basic properties:
  - p: 2^Ω → [0,1]
  - p(Ω) = 1
  - Disjoint events: p(∪ A_i) = Σ_i p(A_i)
• [NB: axiomatic definition of probability: take the above three conditions as axioms]
• Immediate consequences:
  - p(∅) = 0, p(Ā) = 1 - p(A), A ⊆ B ⇒ p(A) ≤ p(B)
  - Σ_{a∈Ω} p(a) = 1

Joint and Conditional Probability
• p(A,B) = p(A ∩ B)
• p(A|B) = p(A,B) / p(B)
  - Estimating from counts:
    • p(A|B) = p(A,B) / p(B) = (c(A ∩ B) / T) / (c(B) / T) = c(A ∩ B) / c(B)

Bayes Rule
• p(A,B) = p(B,A), since p(A ∩ B) = p(B ∩ A)
  - therefore: p(A|B) p(B) = p(B|A) p(A), and therefore
    p(A|B) = p(B|A) p(A) / p(B)

Independence
• Can we compute p(A,B) from p(A) and p(B)?
• Recall from the previous foil:
  p(A|B) = p(B|A) p(A) / p(B)
  p(A|B) p(B) = p(B|A) p(A)
  p(A,B) = p(B|A) p(A)
  ... we're almost there: how does p(B|A) relate to p(B)?
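As a sanity check on the relative-frequency estimate above (p(A) ≈ c_i/T_i), here is a minimal simulation sketch that is not part of the original slides: it reruns the three-toss experiment in several series and pools the counts. The series sizes and the random seed are arbitrary choices for illustration.

```python
import random

def run_series(num_experiments: int) -> float:
    """One series: return the relative frequency c/T of the event A = 'exactly two tails'."""
    count_a = 0
    for _ in range(num_experiments):
        tosses = [random.choice("HT") for _ in range(3)]  # one experiment = 3 coin tosses
        if tosses.count("T") == 2:                        # did the basic outcome fall into A?
            count_a += 1
    return count_a / num_experiments

random.seed(0)                                            # arbitrary seed, for repeatability
estimates = [run_series(1000) for _ in range(8)]          # 8 series of 1000 experiments each
print("per-series estimates:", estimates)
print("pooled estimate:", sum(estimates) / len(estimates))  # close to the uniform value 3/8 = .375
```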
  - p(B|A) = p(B) iff A and B are independent
• Example: two coin tosses; the weather today and the weather on March 4th, 1789
• Any two events for which p(B|A) = p(B)!

Chain Rule
• p(A_1, A_2, A_3, A_4, ..., A_n) = p(A_1|A_2,A_3,A_4,...,A_n) × p(A_2|A_3,A_4,...,A_n) × p(A_3|A_4,...,A_n) × ... × p(A_{n-1}|A_n) × p(A_n)
• this is a direct consequence of the Bayes rule.

The Golden Rule (of Classic Statistical NLP)
• Interested in an event A given B (where it is not easy or practical or desirable to estimate p(A|B) directly):
• take Bayes rule, max over all As:
• argmax_A p(A|B) = argmax_A p(B|A) · p(A) / p(B) = argmax_A p(B|A) p(A)
• ... as p(B) is constant when changing As

Random Variables
• is a function X: Ω → Q
  - in general: Q = R^n, typically R
  - easier to handle real numbers than real-world events
• a random variable is discrete if Q is countable (i.e. also if finite)
• Example: die: natural "numbering" [1,6]; coin: {0,1}
• Probability distribution:
  - p_X(x) = p(X = x) =_df p(A_x), where A_x = {a ∈ Ω: X(a) = x}
  - often just p(x) if it is clear from context what X is

Expectation; Joint and Conditional Distributions
• Expectation is the mean of a random variable (weighted average)
  - E(X) = Σ_{x∈X(Ω)} x · p_X(x)
• Example: one six-sided die: 3.5; two dice (sum): 7
• Joint and Conditional distribution rules:
  - analogous to probabilities of events
• Bayes: p_{X|Y}(x|y) =_notation p_{XY}(x|y) =_even simpler notation p(x|y) = p(y|x) · p(x) / p(y)
• Chain rule: p(w,x,y,z) = p(z) · p(y|z) · p(x|y,z) · p(w|x,y,z)

Standard distributions
• Binomial (discrete)
  - outcome: 0 or 1 (thus: binomial)
  - make n trials
  - interested in the (probability of the) number of successes r
• Must be careful: it's not uniform!
• p_b(r|n) = (n choose r) / 2^n (for equally likely outcomes)
• (n choose r) counts how many possibilities there are for choosing r objects out of n; = n! / ((n-r)! r!)

Continuous Distributions
• The normal distribution ("Gaussian")
• p_norm(x|μ,σ) = e^(-(x-μ)²/(2σ²)) / (σ √(2π))

• Upper bound on H(p)?
  - none in general
  - for |Ω| = n: H(p) ≤ log2 n
    • nothing can be more uncertain than the uniform distribution

Entropy and Expectation
• Recall:
  - E(X) = Σ_{x∈X(Ω)} p_X(x) · x
• Then:
  E(log2(1/p_X(x))) = Σ_{x∈X(Ω)} p_X(x) log2(1/p_X(x))
  = -Σ_{x∈X(Ω)} p_X(x) log2 p_X(x)
  = H(p_X) =_notation H(p)

Perplexity: motivation
• Recall:
  - 2 equiprobable outcomes: H(p) = 1 bit
  - 32 equiprobable outcomes: H(p) = 5 bits
  - 4.3 billion equiprobable outcomes: H(p) ≈ 32 bits
• What if the outcomes are not equiprobable?
  - 32 outcomes, 2 equiprobable at .5, rest impossible:
    • H(p) = 1 bit
  - Is there any measure for comparing the entropy (i.e. uncertainty/difficulty of prediction) (also) for random variables with different numbers of outcomes?
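The expectation and entropy formulas above are easy to check numerically. The following short sketch is not from the slides; it simply recomputes the six-sided-die expectation and the two 32-outcome entropy examples.

```python
from math import log2

def expectation(values, probs):
    """E(X) = sum over x of x * p_X(x)."""
    return sum(x * p for x, p in zip(values, probs))

def entropy(probs):
    """H(p) = -sum over x of p(x) * log2 p(x); zero-probability outcomes contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(expectation(range(1, 7), [1/6] * 6))   # one six-sided die: 3.5

print(entropy([1/32] * 32))                  # 32 equiprobable outcomes: 5.0 bits = log2(32)
print(entropy([0.5, 0.5] + [0.0] * 30))      # 32 outcomes, only 2 of them possible: 1.0 bit
```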
Perplexity
• Perplexity:
  - G(p) = 2^H(p)

Properties of Entropy I
• H(p) ≥ 0
  - proof: (recall: H(X) = -Σ_{x∈Ω} p(x) log2 p(x))
    • log(p(x)) is negative or zero, since p(x) ≤ 1;
    • p(x) is non-negative; their product p(x) log(p(x)) is thus negative (or zero);
    • a sum of such terms is negative (or zero);
    • and -f is positive for negative f
• Chain rule:
  - H(X,Y) = H(Y|X) + H(X), as well as
  - H(X,Y) = H(X|Y) + H(Y) (since H(Y,X) = H(X,Y))

Properties of Entropy II
• Conditional entropy is better (than unconditional):
  - H(Y|X) ≤ H(Y)
• H(p) is concave:
  - a concave function f satisfies f(λx + (1-λ)y) ≥ λf(x) + (1-λ)f(y)
  - a function f is convex if -f is concave
• [for proofs and generalizations, see Cover/Thomas]

"Coding" Interpretation of Entropy
• The least (average) number of bits needed to encode a message (string, sequence, series, ...), each element being the result of a random process with some distribution p: = H(p)
• Remember various compression algorithms?
  - they do well on data with repeating (= easily predictable = low-entropy) patterns
  - their results, though, have high entropy => compressing compressed data does nothing

Coding: Example
• How many bits do we need for ISO Latin 1?
  - => the trivial answer: 8
• Experience: some chars are more common, some (very) rare:
  - ... so what if we use more bits for the rare ones, and fewer bits for the frequent ones? [be careful: we want to decode (easily)!]
  - suppose: p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, the rest: p(x) ≈ .0004
• code: 'a' ~ 00, 'b' ~ 01, 'c' ~ 10, rest: 11b_1b_2b_3b_4b_5b_6b_7b_8
• code acbb?cbaac (with one rare character in the middle):
  00 10 01 01 1100001111 10 01 00 00 10
• number of bits used: 28 (vs. 80 using "naive" coding)
• code length ~ 1 / probability; conditional prob OK!

Entropy of a Language
• Imagine that we produce the next letter using p(l_{n+1}|l_1,...,l_n), where l_1,...,l_n is the sequence of all the letters which had been uttered so far (i.e. n is really big!); let's call l_1,...,l_n the history h (of l_{n+1}), and all histories H:
• Then compute its entropy:
  - -Σ_{h∈H} Σ_{l∈A} p(l,h) log2 p(l|h)
• Not very practical, is it?

Cross-Entropy
• Typical case: we've got a series of observations T = {t_1, t_2, t_3, t_4, ...} (numbers, words, ...; t_i ∈ Ω); estimate (simple):
  ∀y ∈ Ω: p(y) = c(y) / |T|, where c(y) =_def |{t ∈ T; t = y}|
• ... but the true distribution is unknown; every sample is too small!
• Natural question: how well do we do using this estimate p [instead of the true distribution]?
• Idea: simulate the actual distribution by using a different T' (or rather: by using a different observation we simulate the insufficiency of T vs. some other data ("random" difference))

Cross Entropy: The Formula
• H_{p'}(p) = H(p') + D(p'||p)
  H_{p'}(p) = -Σ_{x∈Ω} p'(x) log2 p(x)
• p' is certainly not the true distribution, but we can consider it the "real world" distribution against which we test p
• note on notation (confusing ...): p vs. p'; also written H_{T'}(p)
• (Cross)Perplexity: G_{p'}(p) = G_{T'}(p) = 2^{H_{p'}(p)}

Conditional Cross Entropy
• So far: "unconditional" distribution(s) p(x), p'(x) ...
• In practice: virtually always conditioning on context
• Interested in: sample space Ψ, r.v. Y, y ∈ Ψ; context: sample space Ω, r.v. X, x ∈ Ω; "our" distribution p(y|x), tested against p'(y,x), which is taken from some independent data:
  - H_{p'}(p) = -Σ_{y∈Ψ, x∈Ω} p'(y,x) log2 p(y|x)
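Returning to the "Coding: Example" slide above, the variable-length code can be sketched in a few lines. This is not from the course materials; the rare character used here ('é', encoded as its 8-bit Latin-1 byte) is an assumption standing in for whichever rare character the slide intended, but the bit count comes out the same.

```python
def encode(text: str) -> str:
    """Frequent chars get 2 bits; anything else gets 2 + 8 bits (prefix 11 + Latin-1 byte)."""
    short = {"a": "00", "b": "01", "c": "10"}
    return "".join(short.get(ch, "11" + format(ord(ch), "08b")) for ch in text)

message = "acbb\u00e9cbaac"       # 10 characters; the rare one ('é') is an assumed stand-in
bits = encode(message)
print(bits)                       # nine 2-bit codewords and one 10-bit codeword
print(len(bits))                  # 28 bits, vs. 8 * 10 = 80 with the naive fixed-width coding
```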
Sample Space vs. Data
• In practice, it is often inconvenient to sum over the sample space(s) Ψ, Ω (especially for cross entropy!)
• Use the following formula instead:
  H_{p'}(p) = -Σ_{y∈Ψ, x∈Ω} p'(y,x) log2 p(y|x) = -1/|T'| Σ_{i=1..|T'|} log2 p(y_i|x_i)
• This is in fact the normalized log probability of the "test" data:
  H_{p'}(p) = -1/|T'| log2 Π_{i=1..|T'|} p(y_i|x_i)

Computation Example
• Ω = {a, b, ..., z}, prob. distribution (assumed/estimated from data):
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s,t,u,v,w,x,y,z
• Data (test): barb; p'(a) = p'(r) = .25, p'(b) = .5
• Sum over Ω:
  α:                  a     b    c..q     r    s..z
  -p'(α) log2 p(α):  .5  + .5  + 0+...+ 1.5  + 0+...  = 2.5
• Sum over the data:
  i / s_i:           1/b   2/a   3/r   4/b
  -log2 p(s_i):       1  +  2  +  6  +  1   = 10
  × 1/|T'|:          (1/4) × 10 = 2.5

Cross Entropy: Some Observations
• H(p) ??<, =, >?? H_{p'}(p): ALL of these are possible!
• Previous example [p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, 0 for the rest]:
  - test data abba: (1/4)(2 + 1 + 1 + 2) = 1.5 bits = H_{p'}(p)
• But what about test data baby?
  - -p'('y') log2 p('y') = -.25 × log2 0 = ∞ (??)

Cross Entropy: Usage
• Comparing data?? NO! (we believe that we test on real data!)
• Rather: comparing distributions (vs. real data)
• Have (got) 2 distributions: p and q (on some Ω, X)
  - which is better?
  - better: the one with lower cross-entropy (perplexity) on real data S
• "Real" data: S
  H_S(p) = -1/|S| Σ_{i=1..|S|} log2 p(y_i|x_i)
  H_S(q) = -1/|S| Σ_{i=1..|S|} log2 q(y_i|x_i)
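A short sketch, not from the slides, of the "sum over the data" computation above: it reproduces the 2.5-bit result for the test string "barb" and the infinite cross entropy for "baby" under the same assumed distribution.

```python
from math import log2

# "our" distribution p over {a..z}: p(a) = .25, p(b) = .5, p(c..r) = 1/64, p(s..z) = 0
p = {ch: 1 / 64 for ch in "cdefghijklmnopqr"}
p.update({"a": 0.25, "b": 0.5})
p.update({ch: 0.0 for ch in "stuvwxyz"})

def cross_entropy(test_data: str) -> float:
    """H_p'(p) computed as the normalized negative log probability of the test data."""
    total = 0.0
    for ch in test_data:
        if p[ch] == 0.0:
            return float("inf")   # -log2(0): the outcome occurs in the test data, but p calls it impossible
        total += -log2(p[ch])
    return total / len(test_data)

print(cross_entropy("barb"))      # (1 + 2 + 6 + 1) / 4 = 2.5 bits, matching the example above
print(cross_entropy("baby"))      # inf, since p('y') = 0
```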