Probability
PA154 Jazykové modelování (1.2)
Pavel Rychlý, pary@fi.muni.cz
March 2, 2021
Source: Introduction to Natural Language Processing (600.465), Jan Hajic, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

Experiments & Sample Spaces

Experiment, process, test, ...
Set of possible basic outcomes: sample space Ω (základní prostor obsahující možné výsledky)
► coin toss (Ω = {head, tail}), die (Ω = {1, ..., 6})
► yes/no opinion poll, quality test (bad/good) (Ω = {0, 1})
► lottery (|Ω| ≈ 10^7 .. 10^12)
► # of traffic accidents somewhere per year (Ω = N)
► spelling errors (Ω = Z*), where Z is an alphabet and Z* is the set of possible strings over that alphabet
► missing word (|Ω| ≈ vocabulary size)

Events

Event (jev) A is a set of basic outcomes.
Usually A ⊆ Ω, and all A ∈ 2^Ω (the event space, jevové pole)
► Ω is the certain event (jistý jev), ∅ is the impossible event (nemožný jev)
Example:
► experiment: three coin tosses
► Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
► count cases with exactly two tails: then A = {HTT, THT, TTH}
► all heads: A = {HHH}

Probability

Repeat the experiment many times, record how many times a given event A occurred (the "count" c_1).
Do this whole series many times; remember all the c_i's.
Observation: if repeated really many times, the ratios c_i/T_i (where T_i is the number of experiments run in the i-th series) are close to some (unknown but) constant value.
Call this constant the probability of A. Notation: p(A)

Estimating Probability

Remember: ... close to an unknown constant. We can only estimate it:
► from a single series (the typical case, as mostly the outcome of a series is given to us and we cannot repeat the experiment): p(A) ≈ c_1/T_1
► otherwise, take the weighted average of all c_i/T_i (or, if the data allows, simply treat the set of series as a single long series). This is the best estimate.

Example

Recall our example:
► experiment: three coin tosses
► Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
► count cases with exactly two tails: A = {HTT, THT, TTH}
Run the experiment 1000 times (i.e. 3000 tosses).
Counted: 386 cases with two tails (HTT, THT or TTH)
► estimate: p(A) = 386/1000 = .386
Run again: 373, 399, 382, 355, 372, 406, 359
► p(A) = .379 (weighted average) or simply 3032/8000
Uniform distribution assumption: p(A) = 3/8 = .375

Basic Properties

Basic properties:
► p: 2^Ω → [0, 1]
► p(Ω) = 1
► disjoint events: p(∪_i A_i) = Σ_i p(A_i)
NB: the axiomatic definition of probability takes the above three conditions as axioms.
Immediate consequences:
► p(∅) = 0
► p(Ā) = 1 − p(A)
► A ⊆ B ⇒ p(A) ≤ p(B)
► Σ_{a∈Ω} p(a) = 1

Joint and Conditional Probability

p(A, B) = p(A ∩ B)
p(A|B) = p(A, B) / p(B)
Estimating from counts:
p(A|B) = p(A, B) / p(B) = (c(A ∩ B)/T) / (c(B)/T) = c(A ∩ B) / c(B)

Bayes Rule

p(A, B) = p(B, A), since p(A ∩ B) = p(B ∩ A)
► therefore p(A|B) p(B) = p(B|A) p(A), and therefore:
p(A|B) = p(B|A) × p(A) / p(B)

Independence

Can we compute p(A, B) from p(A) and p(B)?
Recall from the previous foil:
p(A|B) = p(B|A) × p(A) / p(B)
p(A|B) × p(B) = p(B|A) × p(A)
p(A, B) = p(B|A) × p(A)
... we're almost there: how does p(B|A) relate to p(B)?
► p(B|A) = p(B) iff A and B are independent
Example: two coin tosses; the weather today and the weather on March 4th, 1789
Any two events for which p(B|A) = p(B)!

Chain Rule

p(A_1, A_2, A_3, A_4, ..., A_n) = p(A_1|A_2, A_3, A_4, ..., A_n) × p(A_2|A_3, A_4, ..., A_n) × ... × p(A_{n−1}|A_n) × p(A_n)
This is a direct consequence of the Bayes rule.
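The counting view of probability and of p(A|B) = c(A ∩ B)/c(B) above is easy to check by simulation. Below is a minimal illustrative sketch (not part of the original slides; the conditioning event B = "first toss is tails" is my own choice) that re-runs the three-toss example and estimates a conditional probability from counts:

```python
import random

# Re-run the slides' example: 8 series of 1000 three-toss experiments,
# counting A = "exactly two tails" (HTT, THT, TTH).
def run_series(n_experiments=1000):
    count = 0
    for _ in range(n_experiments):
        tosses = [random.choice("HT") for _ in range(3)]
        if tosses.count("T") == 2:
            count += 1
    return count

series = [run_series() for _ in range(8)]
print(series, sum(series) / 8000)   # pooled estimate of p(A); exact value 3/8 = .375

# Conditional probability from counts, p(A|B) = c(A ∩ B) / c(B),
# with the (hypothetical) conditioning event B = "first toss is tails".
c_B = 0
c_AB = 0
for _ in range(8000):
    tosses = [random.choice("HT") for _ in range(3)]
    if tosses[0] == "T":
        c_B += 1
        if tosses.count("T") == 2:
            c_AB += 1
print(c_AB / c_B)                   # estimates p(A|B); exact value 2/4 = .5
```

The exact p(A|B) = .5 here follows from the counting definition: of the four outcomes starting with T, exactly two (THT, TTH) have two tails in total.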
The Golden Rule of Classic Statistical NLP

Interested in an event A given B (where it is not easy, practical, or desirable to estimate p(A|B) directly): take the Bayes rule and maximize over all A's:
argmax_A p(A|B) = argmax_A (p(B|A) × p(A) / p(B)) = argmax_A (p(B|A) × p(A))
... as p(B) is constant when changing the A's.

Random Variables

A random variable is a function X : Ω → Q
► in general Q = R^n, typically R
► easier to handle real numbers than real-world events
A random variable is discrete if Q is countable (hence also if it is finite).
Example: die: natural "numbering" [1, 6]; coin: {0, 1}
Probability distribution:
► p_X(x) = p(X = x) =_df p(A_x), where A_x = {a ∈ Ω : X(a) = x}
► often just p(x) if it is clear from context what X is

Expectation

The expectation is the mean of a random variable (a weighted average):
► E(X) = Σ_{x∈X(Ω)} x · p_X(x)
Example: one six-sided die: 3.5; two dice (sum): 7

Joint and Conditional Distributions

Joint and conditional distribution rules:
► analogous to probabilities of events
Bayes: p_{X|Y}(x|y) — notation; p(x|y) — even simpler notation:
p(x|y) = p(y|x) · p(x) / p(y)
Chain rule:
p(w, x, y, z) = p(z) · p(y|z) · p(x|y, z) · p(w|x, y, z)

Standard Distributions

Binomial (discrete)
► outcome: 0 or 1 (thus binomial)
► make n trials
► interested in the (probability of the) number of successes r
Must be careful: it's not uniform!
► p_b(r|n) = C(n, r) / 2^n (for equally likely outcomes)
► C(n, r) counts how many possibilities there are for choosing r objects out of n: C(n, r) = n! / (r!(n − r)!)

Continuous Distributions

The normal distribution ("Gaussian"):
p_norm(x|μ, σ) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
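The last few formulas can be verified numerically. A minimal sketch (my own illustration, not part of the slides) implementing the binomial formula, the expectation from the Expectation slide, and the Gaussian density:

```python
import math

# Binomial for equally likely binary outcomes, as on the slide:
# p_b(r|n) = C(n, r) / 2^n.
def p_binomial(r, n):
    return math.comb(n, r) / 2 ** n

print(p_binomial(2, 3))             # p(exactly two tails in three tosses) = 3/8 = .375

# Expectation E(X) = sum of x * p_X(x) over x in X(Omega),
# for a discrete distribution given as a dict {value: probability}.
def expectation(dist):
    return sum(x * p for x, p in dist.items())

die = {x: 1 / 6 for x in range(1, 7)}
print(expectation(die))             # 3.5, as on the Expectation slide

two_dice = {}                       # distribution of the sum of two dice
for a in range(1, 7):
    for b in range(1, 7):
        two_dice[a + b] = two_dice.get(a + b, 0) + 1 / 36
print(expectation(two_dice))        # 7.0

# Normal (Gaussian) density p_norm(x|mu, sigma).
def p_norm(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(p_norm(0.0, 0.0, 1.0))        # ~0.3989, the standard normal density at its mean
```

Note how p_binomial(2, 3) recovers the .375 from the coin-toss example: the binomial with equally likely outcomes is exactly the uniform-distribution assumption made there.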