Probability
PA154 Jazykové modelování (1.2)
Pavel Rychlý, pary@fi.muni.cz
February 17, 2020
Source: Introduction to Natural Language Processing (600.465), Jan Hajic, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

Experiments & Sample Spaces
■ Experiment, process, test, ...
■ Set of possible basic outcomes: sample space $\Omega$ (the basic space containing the possible outcomes)
► coin toss ($\Omega = \{\text{head}, \text{tail}\}$), die ($\Omega = \{1..6\}$)
► yes/no opinion poll, quality test (bad/good) ($\Omega = \{0, 1\}$)
► lottery ($|\Omega| \approx 10^7 .. 10^{12}$)
► # of traffic accidents somewhere per year ($\Omega = \mathbb{N}$)
► spelling errors ($\Omega = Z^*$), where $Z$ is an alphabet and $Z^*$ is the set of possible strings over that alphabet
► missing word ($|\Omega| \approx$ vocabulary size)

Events
■ Event (Czech: jev) $A$ is a set of basic outcomes
■ Usually $A \subseteq \Omega$, and all $A \in 2^{\Omega}$ (the event space, Czech: jevové pole)
► $\Omega$ is the certain event (jistý jev), $\emptyset$ is the impossible event (nemožný jev)
■ Example:
► experiment: three coin tosses
► $\Omega = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}$
► count cases with exactly two tails: $A = \{HTT, THT, TTH\}$
► all heads: $A = \{HHH\}$

Probability
■ Repeat the experiment many times and record how many times a given event $A$ occurred ("count" $c_1$).
■ Do this whole series many times; remember all $c_i$s.
■ Observation: if repeated really many times, the ratios $c_i / T_i$ (where $T_i$ is the number of experiments run in the $i$-th series) are close to some (unknown but) constant value.
■ Call this constant the probability of $A$. Notation: $p(A)$

Estimating Probability
Remember: ... close to an unknown constant.
We can only estimate it:
► from a single series (the typical case, since mostly the outcome of a series is given to us and we cannot repeat the experiment): set $p(A) = c_1 / T_1$
► otherwise, take the weighted average of all $c_i / T_i$ (or, if the data allows, simply look at the set of series as if it were a single long series). This is the best estimate.
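The estimation recipe can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coin is simulated and assumed fair, the event is "exactly two tails in three tosses" from the Events slide, and the helper names `run_series` and `pooled_estimate` are hypothetical, not part of the course material.

```python
import random

def run_series(trials, rng):
    """Run `trials` three-toss experiments; return the count c of event A
    (exactly two tails). The fair coin is an assumption of this sketch."""
    count = 0
    for _ in range(trials):
        outcome = "".join(rng.choice("HT") for _ in range(3))
        if outcome.count("T") == 2:
            count += 1
    return count

def pooled_estimate(counts, sizes):
    """Weighted average of the ratios c_i / T_i with weights T_i,
    which equals treating all series as one long series: sum(c_i) / sum(T_i)."""
    return sum(counts) / sum(sizes)

rng = random.Random(0)                      # fixed seed so reruns agree
sizes = [1000] * 8                          # T_i: eight series of 1000 experiments
counts = [run_series(t, rng) for t in sizes]

single = counts[0] / sizes[0]               # estimate from a single series
pooled = pooled_estimate(counts, sizes)     # estimate from all series
print(single, pooled)
```

Both printed values should land near the same constant, with the pooled estimate typically fluctuating less than any single-series estimate.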

Example
Recall our example:
► experiment: three coin tosses
► $\Omega = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}$
► count cases with exactly two tails: $A = \{HTT, THT, TTH\}$
Run the experiment 1000 times (i.e. 3000 tosses)
Counted: 386 cases with two tails (HTT, THT or TTH)
► estimate: $p(A) = 386/1000 = .386$
Run again: 373, 399, 382, 355, 372, 406, 359
► $p(A) = .379$ (weighted average), or simply $3032/8000$
Uniform distribution assumption: $p(A) = 3/8 = .375$

Basic Properties
Basic properties:
► $p: 2^{\Omega} \to [0, 1]$
► $p(\Omega) = 1$
► Disjoint events: $p(\bigcup_i A_i) = \sum_i p(A_i)$
NB: axiomatic definition of probability: take the above three conditions as axioms
Immediate consequences:
► $p(\emptyset) = 0$
► $p(\bar{A}) = 1 - p(A)$
► $A \subseteq B \Rightarrow p(A) \le p(B)$
► $\sum_{a \in \Omega} p(a) = 1$

Joint and Conditional Probability
► $p(A, B) = p(A \cap B)$
► $p(A|B) = \frac{p(A, B)}{p(B)}$
► Estimating from counts:
$$p(A|B) = \frac{p(A, B)}{p(B)} = \frac{c(A \cap B) / T}{c(B) / T} = \frac{c(A \cap B)}{c(B)}$$

Bayes Rule
► $p(A, B) = p(B, A)$ since $p(A \cap B) = p(B \cap A)$
► therefore $p(A|B)\, p(B) = p(B|A)\, p(A)$, and therefore
$$p(A|B) = \frac{p(B|A) \times p(A)}{p(B)}$$

Independence
Can we compute $p(A, B)$ from $p(A)$ and $p(B)$?
Recall from the previous slide:
$$p(A|B) = \frac{p(B|A) \times p(A)}{p(B)}$$
$$p(A|B) \times p(B) = p(B|A) \times p(A)$$
$$p(A, B) = p(B|A) \times p(A)$$
... we're almost there: how does $p(B|A)$ relate to $p(B)$?
► $p(B|A) = p(B)$ iff $A$ and $B$ are independent
Example: two coin tosses; the weather today and the weather on March 4th 1789; any two events for which $p(B|A) = p(B)$!

Chain Rule
$$p(A_1, A_2, A_3, A_4, \ldots, A_n) = p(A_1|A_2, A_3, A_4, \ldots, A_n) \times p(A_2|A_3, A_4, \ldots, A_n) \times p(A_3|A_4, \ldots, A_n) \times \cdots \times p(A_{n-1}|A_n) \times p(A_n)$$
■ this is a direct consequence of the Bayes rule.
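The conditional probability, Bayes rule, independence, and chain rule identities can be checked exactly on the three-toss sample space. The sketch below assumes a uniform distribution over the eight outcomes (a fair coin); the events `B`, `C`, `D` and the helpers `p` and `cond` are hypothetical names chosen for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space of three coin tosses: HHH, HHT, ..., TTT.
omega = ["".join(t) for t in product("HT", repeat=3)]

def p(event):
    """Probability of an event (a set of outcomes), assuming all outcomes equally likely."""
    return Fraction(len(event), len(omega))

def cond(a, b):
    """Conditional probability p(a|b) = p(a, b) / p(b)."""
    return p(a & b) / p(b)

A = {w for w in omega if w.count("T") == 2}   # exactly two tails
B = {w for w in omega if w[0] == "T"}         # first toss is tails
C = {w for w in omega if w[1] == "H"}         # second toss is heads
D = {w for w in omega if w[2] == "T"}         # third toss is tails

# Bayes rule: p(A|B) = p(B|A) * p(A) / p(B)
bayes = cond(B, A) * p(A) / p(B)
print(cond(A, B), bayes)

# Independence: p(C|B) = p(C) because different tosses do not influence each other.
print(cond(C, B) == p(C))

# Chain rule for three events: p(B, C, D) = p(B|C, D) * p(C|D) * p(D)
chain = cond(B, C & D) * cond(C, D) * p(D)
print(chain == p(B & C & D))
```

Using `Fraction` keeps every value exact, so the identities hold with equality rather than up to floating-point error.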

The Golden Rule of Classic Statistical NLP
■ Interested in an event $A$ given $B$ (where it is not easy, practical, or desirable to estimate $p(A|B)$ directly):
■ take Bayes rule, max over all $A$s:
$$\operatorname{argmax}_A\, p(A|B) = \operatorname{argmax}_A \frac{p(B|A) \times p(A)}{p(B)} = \operatorname{argmax}_A \left( p(B|A) \times p(A) \right)$$
■ as $p(B)$ is constant when changing $A$s

Random Variables
■ A random variable is a function $X: \Omega \to Q$
► in general $Q = \mathbb{R}^n$, typically $\mathbb{R}$
► easier to handle real numbers than real-world events
■ a random variable is discrete if $Q$ is countable (i.e. also if finite)
■ Example: die: natural "numbering" $[1, 6]$, coin: $\{0, 1\}$
■ Probability distribution:
► $p_X(x) = p(X = x) =_{df} p(A_x)$ where $A_x = \{a \in \Omega : X(a) = x\}$
► often just $p(x)$ if it is clear from context what $X$ is

Expectation; Joint and Conditional Distributions
■ Expectation is the mean of a random variable (weighted average)
► $E(X) = \sum_{x \in X(\Omega)} x \cdot p_X(x)$
■ Example: one six-sided die: 3.5, two dice (sum): 7
■ Joint and conditional distribution rules:
► analogous to probability of events
■ Bayes: $p_{X|Y}(x, y) =_{\text{notation}} p_{XY}(x|y) =_{\text{even simpler notation}} p(x|y) = \frac{p(y|x) \cdot p(x)}{p(y)}$
■ Chain rule: $p(w, x, y, z) = p(z) \cdot p(y|z) \cdot p(x|y, z) \cdot p(w|x, y, z)$

Standard Distributions
■ Binomial (discrete)
► outcome: 0 or 1 (thus binomial)
► make $n$ trials
► interested in the (probability of the) number of successes $r$
■ Must be careful: it's not uniform!
■ $p_b(r|n) = \binom{n}{r} / 2^n$ (for equally likely outcomes)
■ $\binom{n}{r}$ counts how many possibilities there are for choosing $r$ objects out of $n$: $\binom{n}{r} = \frac{n!}{(n-r)!\, r!}$

Continuous Distributions
■ The normal distribution ("Gaussian")
$$p_{norm}(x|\mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
■ where:
► $\mu$ is the mean (x-coordinate of the peak) (0)
► $\sigma$ is the standard deviation (1)
■ other: hyperbolic, t
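As a rough numerical companion to the two standard distributions, the following Python sketch evaluates the binomial probability for equally likely outcomes and the normal density. The function names `binomial_fair` and `normal_pdf` are hypothetical, and the defaults of mean 0 and standard deviation 1 are an assumption of this sketch.

```python
import math

def binomial_fair(r, n):
    """Probability of r successes in n trials when 0 and 1 are equally likely:
    C(n, r) / 2**n."""
    return math.comb(n, r) / 2 ** n

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Two successes in three equally likely 0/1 trials.
print(binomial_fair(2, 3))

# The binomial probabilities over r = 0..n sum to 1, but they are not uniform.
print([binomial_fair(r, 4) for r in range(5)])
print(sum(binomial_fair(r, 10) for r in range(11)))

# The density peaks at x = mu.
print(normal_pdf(0.0), normal_pdf(1.0))
```

`math.comb` requires Python 3.8 or later; on older versions the factorial formula can be used instead.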