MUNI FI "Coding" Interpretation of Entropy Cross Entropy PA154 Language Modeling (2.2) Pavel Rychly pary@fi.muni.cz February 27, 2024 The Least (average) number of bits needed to encode a message (string, sequence, series,...) (each element having being a result of a random process with some distribution p): = H(p) Remember various compressing algorithms? ■ they do well on data with repeating (= easily predictabLe = = Low entropy) patterns ■ their resuLtsthough have high entropy^ compressing compressed data does nothing Source: Introduction to Natural Language Processing (600.465) Jan Hajic.CS Dept.,Johns Hopkins Univ. www.cs.jn u.edu/~najic ^avel Rycniy ■ Cross Entropy ■ Fepruary 27,2024 Coding: Example Entropy of Language How many bits do we need for ISO Latin 1? ■ =>■ the trivia L answer: 8 Experience: some chars are more common, some (very) rare: ■ ...so what if we use more bits for the rare, and Less bits for the frequent? (be carefuL: want to decode (easiLy)!) ■ suppose: p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, the rest: p(x)^.0004 ■ code: 'a' ~ 00, 'b' ~ 01, 'c' ~ 10, rest: llb^b^bsbebybs ■ code 'acbbecbaac': 00 10 01 01 1100001111 10 01 00 00 10 acbb e cbaac ■ number of bits used: 28 (vs. 80 using "naive" coding) code length ~ -log(probability) Imagine that we produce the next letter using p(ln+1\k,...l„), where . A„ is the sequence of all the letters which had been uttered so far (i.e. n is really big!); let's call llt... l„ the history h(hn+i), and all histories H: Then compute its entropy: ■ -E/,eHE^P(',")iog2P('l") Not very practical, isn't it? ^avel Rycniy ■ Cross Entropy ■ Fepruary 27,2024 ^avel Rycniy ■ Cross Entropy ■ Fepruary 27,2024 Cross-Entropy Cross Entropy: The Formula Typical case: we've got series of observations T = {ti, t2, h, ?4,..., t„} (numbers, words,...; ti_ e ft); estimate cM (sample): Vy Ell : p(y) =-jyp def. c(y) = \{t e T; t = y} ...but the true p is unknown; every sample is too small! Natural question: how well do we do using p (instead of p)l Idea: simulate actual p by using a different T(or rather: by using different observation we simulate the insufficiency of Tvs. some other data ("random" difference)) Hp,(~p) = H(p') + D(p'\\~p) p' is certainly not the true p, but we can consider it the "real world" distribution against which we test p note on notation (confusing ...): —f o p, also Hr{p) (Cross)Perplexity: Gp,{p) = Gr{p) = 2h/x)|og2P(yM In practice, it is often inconvenient to sum over the space(s) V, ft (especially for cross entropy!) Use the following formula: Hp. (p) = - Eyev,xenP'(y>x) loS2P(yM = - Wl E;=i...|7'| io&2 PM*,) This is in fact the normalized log probability of the "test" data: H„(p) =-l/\r\log2 J] p[y,\xi) /=1...|7"'| ^avel Rychly ■ Cross Entropy ■ February 27,2024 ^avel Rychly ■ Cross Entropy ■ February 27,2024 Computation Example ft = {a, b, ..,z}, prob. distribution (assumed/estimated from data): p(a) = .25, p(b) = .5, p(a) = ^ for a e {c.r}, p(a)= 0 for the rest: s,t,u,v,w,x,y,z Data (test): barb p'(a) = p'(r) = .25, p'(t>) = .5 Sum over ft: a abedefg__.pqrst.__3 -p'(a)log2p(a) -5+.5+0+0+0+0+0+0+0+0+0+1.S+0+0+0+0+0 = 2.5 Sum over data: i/s1 1/b 2/a 3/r 4/b 4og2p(Si) 1 + 2 + S + 1 1/|T!| 10 (1/4) x 10 = 2.S Cross Entropy: Some Observations ■ H(p) ??<,=,>?? Hp.(p):ALL! 
Cross Entropy: Some Observations

■ H(p) ??<, =, >?? H_{p'}(p): ALL of them are possible!
■ Previous example: p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, p(α) = 0 for the rest: s, t, u, v, w, x, y, z
  ■ H(p) = 2.5 bits = H_{p'}(p) (test data: barb)
■ Other data, probable: (1/8)(6 + 6 + 6 + 1 + 2 + 1 + 6 + 6) = 4.25
  ■ H(p) < 4.25 bits = H_{p'}(p) (test data: probable)
■ And finally, abba: (1/4)(2 + 1 + 1 + 2) = 1.5
  ■ H(p) > 1.5 bits = H_{p'}(p) (test data: abba)
■ But what about baby? -p'('y') log2 p('y') = -.25 log2 0 = ∞ (!!)

Cross Entropy: Usage

■ Comparing data?? NO! (we believe that we test on real data!)
■ Rather: comparing distributions (vs. real data)
■ Have (got) 2 distributions: p and q (on some Ω, X)
  ■ which is better?
  ■ better: the one with the lower cross-entropy (perplexity) on real data S
■ "Real" data: S
  ■ $H_S(p) = -\frac{1}{|S|} \sum_{i=1}^{|S|} \log_2 p(y_i \mid x_i)$
  ■ $H_S(q) = -\frac{1}{|S|} \sum_{i=1}^{|S|} \log_2 q(y_i \mid x_i)$

Comparing Distributions

■ p(.) from the previous example (unigram): H_S(p) = 4.25
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c, ..., r}, p(α) = 0 for the rest: s, t, u, v, w, x, y, z
■ q(.|.) (conditional; defined by a table, rows = next letter y, columns = history x; entries not shown are 0):

  q(y|x)   x=a   x=b   x=e   x=l   x=o   x=p    x=r   x=oth.
  y=a       0    .5     0     0     0    .125    0     0
  y=b       1     0     0     0     1    .125    0     0
  y=e       0     0     0     1     0    .125    0     0
  y=l       0    .5     0     0     0    .125    0     0
  y=o       0     0     0     0     0    .125    1     0
  y=p       0     0     0     0     0    .125    0     1
  y=r       0     0     0     0     0    .125    0     0
  y=oth.    0     0     1     0     0    .125    0     0

  ex.: q(o|r) = 1, q(r|p) = .125
■ Test data S: probable
  H_S(q) = -(1/8)(log2 q(p|oth.) + log2 q(r|p) + log2 q(o|r) + log2 q(b|o) + log2 q(a|b) + log2 q(b|a) + log2 q(l|b) + log2 q(e|l))
         = (1/8)(0 + 3 + 0 + 0 + 1 + 0 + 1 + 0)
  H_S(q) = .625
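A short Python sketch of this comparison (not from the slides; it assumes the bigram table above, keyed as q(y, x), with the initial history taken as 'oth' as in the worked computation, and the unigram p from the previous example):

```python
import math

# Unigram p from the previous example
p = {'a': 0.25, 'b': 0.5}
p.update({ch: 1 / 64 for ch in 'cdefghijklmnopqr'})

# Bigram q(y|x): nonzero entries of the table above, keyed as (y, x)
q = {('b', 'a'): 1.0,
     ('a', 'b'): 0.5, ('l', 'b'): 0.5,
     ('oth', 'e'): 1.0,
     ('e', 'l'): 1.0,
     ('b', 'o'): 1.0,
     ('o', 'r'): 1.0,
     ('p', 'oth'): 1.0}
for y in ('a', 'b', 'e', 'l', 'o', 'p', 'r', 'oth'):
    q[(y, 'p')] = 0.125  # the column for history 'p' is uniform

S = 'probable'  # "real" data

# H_S(p): unigram cross entropy, no conditioning
h_p = -sum(math.log2(p[y]) for y in S) / len(S)

# H_S(q): bigram cross entropy, starting from the history 'oth'
history = 'oth'
log_sum = 0.0
for y in S:
    log_sum += math.log2(q[(y, history)])
    history = y
h_q = -log_sum / len(S)

print(h_p, h_q)  # 4.25 and 0.625 -> q has the lower cross entropy, so q is better
```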