Cross Entropy
PA154 Jazykové modelování (2.1)
Pavel Rychlý
pary@fi.muni.cz
February 24, 2020
Source: Introduction to Natural Language Processing (600.465), Jan Hajic, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

"Coding" Interpretation of Entropy
■ The least (average) number of bits needed to encode a message (string, sequence, series, ...), each element being the result of a random process with some distribution p: = H(p)
■ Remember various compression algorithms?
► they do well on data with repeating (= easily predictable = low-entropy) patterns
► their results, though, have high entropy ⇒ compressing compressed data does nothing

Coding: Example
■ How many bits do we need for ISO Latin 1?
► ⇒ the trivial answer: 8
■ Experience: some chars are more common, some (very) rare:
► ... so what if we use more bits for the rare and fewer bits for the frequent? (be careful: we still want to decode (easily)!)
► suppose: p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, the rest: p(x) ≈ .0004
► code: 'a' → 00, 'b' → 01, 'c' → 10, rest: 11b₁b₂b₃b₄b₅b₆b₇b₈ (i.e. '11' followed by the character's 8-bit code)
► code 'acbbecbaac': 00 10 01 01 1100001111 10 01 00 00 10 (a c b b e c b a a c)
► number of bits used: 28 (vs. 80 using "naive" 8-bit coding)
■ code length ≈ −log(probability)

Entropy of Language
■ Imagine that we produce the next letter using p(l_{n+1}|l_1,...,l_n), where l_1,...,l_n is the sequence of all the letters which had been uttered so far (i.e. n is really big!); let's call l_1,...,l_n the history h(l_{n+1}), and all histories H:
■ Then compute its entropy:
  H = −Σ_{h∈H} Σ_{l∈A} p(l,h) log₂ p(l|h)   (A ... the alphabet)
■ Not very practical, is it?

Cross-Entropy
■ Typical case: we've got a series of observations T = {t₁, t₂, t₃, t₄, ..., t_n} (numbers, words, ...; t_i ∈ Ω); estimate (sample):
  ∀y ∈ Ω: p̃(y) = c(y)/|T|, def. c(y) = |{t ∈ T; t = y}|
■ ... but the true p is unknown; every sample is too small!
■ Natural question: how well do we do using p̃ (instead of p)?
■ Idea: simulate actual p by using a different T' (or rather: by using a different observation we simulate the insufficiency of T vs. some other data ("random" difference))

Cross Entropy: The Formula
■ H_{p'}(p̃) = H(p') + D(p' || p̃)
■ p' is certainly not the true p, but we can consider it the "real world" distribution against which we test p̃
■ note on notation (confusing ...): H_{p'}(p̃), also H_{T'}(p̃)
■ (Cross-)Perplexity: G_{p'}(p̃) = G_{T'}(p̃) = 2^{H_{p'}(p̃)}

Conditional Cross Entropy
■ So far: "unconditional" distribution(s) p(x), p'(x) ...
■ In practice: virtually always conditioning on context
■ Interested in: sample space Ψ, r.v. Y, y ∈ Ψ; context: sample space Ω, r.v. X, x ∈ Ω; "our" distribution p(y|x), tested against p'(y,x), which is taken from some independent data:
  H_{p'}(p) = −Σ_{y∈Ψ, x∈Ω} p'(y,x) log₂ p(y|x)

Sample Space vs. Data
■ In practice, it is often inconvenient to sum over the space(s) Ψ, Ω (especially for cross entropy!)
■ Use the following formula instead (it holds because p' is just the relative frequency in the test data T'):
  H_{p'}(p) = −Σ_{y∈Ψ, x∈Ω} p'(y,x) log₂ p(y|x) = −(1/|T'|) Σ_{i=1..|T'|} log₂ p(y_i|x_i)
■ This is in fact the normalized log probability of the "test" data:
  H_{p'}(p) = −(1/|T'|) log₂ ∏_{i=1..|T'|} p(y_i|x_i)
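A minimal Python sketch of the unconditional case of this formula (the function name and the dictionary encoding of p are illustrative choices); the distribution and the test string "barb" anticipate the Computation Example that follows:

```python
import math

def cross_entropy(p, test_data):
    """Normalized negative log2 probability of the test data under p:
    H_p'(p) = -(1/|T'|) * sum_i log2 p(t_i)."""
    return -sum(math.log2(p[t]) for t in test_data) / len(test_data)

# Unigram distribution used in the computation example below:
# p(a) = .25, p(b) = .5, p(letter) = 1/64 for c..r, 0 for the rest.
p = {'a': 0.25, 'b': 0.5}
for letter in 'cdefghijklmnopqr':
    p[letter] = 1 / 64

print(cross_entropy(p, 'barb'))       # -> 2.5 bits
print(2 ** cross_entropy(p, 'barb'))  # cross-perplexity, about 5.66
```

The first number reproduces the 2.5 bits worked out on the next slide; the second applies the perplexity definition G = 2^H from the formula slide.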
Computation Example
■ Ω = {a, b, ..., z}, probability distribution (assumed/estimated from data):
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, p(α) = 0 for the rest: s, t, u, v, w, x, y, z
■ Data (test): barb;  p'(a) = p'(r) = .25, p'(b) = .5
■ Sum over Ω:
  α:                  a    b    c  d  e  f  g ... p  q    r    s  t ... z
  −p'(α) log₂ p(α):  .5 + .5 + 0+ 0+ 0+ 0+ 0+...+0+ 0 + 1.5 + 0+ 0+...+0  = 2.5
■ Sum over data:
  i / s_i:        1/b   2/a   3/r   4/b
  −log₂ p(s_i):    1  +  2  +  6  +  1   = 10
  1/|T'| = 1/4:   (1/4) × 10 = 2.5

Cross Entropy: Some Observations
■ Is H(p) ≤, =, or ≥ H_{p'}(p)? ALL are possible!
■ Previous example: p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z
  H(p) = 2.5 bits = H_{p'}(p) (p' taken from barb)
■ Other data: probable: (1/8)(6 + 6 + 6 + 1 + 2 + 1 + 6 + 6) = 4.25
  H(p) < 4.25 bits = H_{p'}(p) (p' taken from probable)
■ And finally: abba: (1/4)(2 + 1 + 1 + 2) = 1.5
  H(p) > 1.5 bits = H_{p'}(p) (p' taken from abba)
■ But what about baby? −p'('y') log₂ p('y') = −.25 log₂ 0 = ∞ (??)

Cross Entropy: Usage
■ Comparing data?? NO! (we believe that we test on real data!)
■ Rather: comparing distributions (vs. real data)
■ Have (got) 2 distributions: p and q (on some Ω, X)
► which is better?
► better: the one with lower cross entropy (perplexity) on real data S
■ "Real" data: S
  H_S(p) = −(1/|S|) Σ_{i=1..|S|} log₂ p(y_i|x_i)  vs.  H_S(q) = −(1/|S|) Σ_{i=1..|S|} log₂ q(y_i|x_i)

Comparing Distributions
■ p(.) from the previous example:
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z
■ q(.|.) (conditional; defined by a table, rows = y, columns = context x):
  q(y|x)    x=a   x=b   x=e   x=l   x=o   x=p    x=r   x=other
  y=a        0    .5     0     0     0    .125    0      0
  y=b        1     0     0     0     1    .125    0      0
  y=e        0     0     0     1     0    .125    0      0
  y=l        0    .5     0     0     0    .125    0      0
  y=o        0     0     0     0     0    .125    1      0
  y=p        0     0     0     0     0    .125    0      1
  y=r        0     0     0     0     0    .125    0      0
  y=other    0     0     1     0     0    .125    0      0
  ex.: q(o|r) = 1, q(r|p) = .125
■ S = probable:
  H_S(q) = (1/8)(−log₂ q(p|oth.) − log₂ q(r|p) − log₂ q(o|r) − log₂ q(b|o) − log₂ q(a|b) − log₂ q(b|a) − log₂ q(l|b) − log₂ q(e|l))
         = (1/8)(0 + 3 + 0 + 0 + 1 + 0 + 1 + 0) = .625
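The whole comparison can be replayed in a short Python sketch (the function names and the dictionary encodings of p and q are illustrative); following the slide's computation, the word-initial letter is conditioned on the 'other' context, and mapping letters outside {a, b, e, l, o, p, r} to 'other' is an assumption of this sketch:

```python
import math

def unigram_cross_entropy(p, data):
    # H_S(p) = -(1/|S|) * sum_i log2 p(s_i)
    return -sum(math.log2(p[ch]) for ch in data) / len(data)

def bigram_cross_entropy(q, data, start='other'):
    # H_S(q) = -(1/|S|) * sum_i log2 q(s_i | s_{i-1}); the word-initial
    # letter is conditioned on the 'other' context, as in the slide.
    def bucket(ch):                     # letters outside the table map to 'other'
        return ch if ch in 'abelopr' else 'other'
    total, prev = 0.0, start
    for ch in data:
        total -= math.log2(q[(bucket(ch), prev)])
        prev = bucket(ch)
    return total / len(data)

# p(.) -- the unigram distribution from the previous example
p = {'a': 0.25, 'b': 0.5}
for letter in 'cdefghijklmnopqr':
    p[letter] = 1 / 64

# q(y|x) -- the bigram table above, stored as q[(y, x)]; only nonzero cells
q = {('a', 'b'): 0.5, ('l', 'b'): 0.5, ('b', 'a'): 1.0, ('b', 'o'): 1.0,
     ('e', 'l'): 1.0, ('o', 'r'): 1.0, ('p', 'other'): 1.0, ('other', 'e'): 1.0}
for y in ('a', 'b', 'e', 'l', 'o', 'p', 'r', 'other'):
    q[(y, 'p')] = 0.125

print(unigram_cross_entropy(p, 'probable'))  # -> 4.25 bits
print(bigram_cross_entropy(q, 'probable'))   # -> 0.625 bits
```

Both numbers match the slides: H_S(p) = 4.25 bits vs. H_S(q) = .625 bits, so by the usage criterion (lower cross entropy on real data) q is the better model of this data.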