Probability
PA154 Jazykové modelování (1.2)
Pavel Rychlý, pary@fi.muni.cz
March 2, 2021
Source: Introduction to Natural Language Processing (600.465), Jan Hajic, CS Dept., Johns Hopkins Univ., www.cs.jhu.edu/~hajic

Experiments & Sample Spaces

Experiment, process, test, ...
Set of possible basic outcomes: sample space Ω (základní prostor obsahující možné výsledky)
► coin toss (Ω = {head, tail}), die (Ω = {1, ..., 6})
► yes/no opinion poll, quality test (bad/good) (Ω = {0, 1})
► lottery (|Ω| ≈ 10^7 .. 10^12)
► # of traffic accidents somewhere per year (Ω = N)
► spelling errors (Ω = Z*), where Z is an alphabet and Z* is the set of possible strings over that alphabet
► missing word (|Ω| ≈ vocabulary size)

Events

Event (jev) A is a set of basic outcomes.
Usually A ⊆ Ω, and all A ∈ 2^Ω (the event space, jevové pole)
► Ω is the certain event (jistý jev), ∅ is the impossible event (nemožný jev)
Example:
► experiment: three coin tosses
► Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
► count cases with exactly two tails: then A = {HTT, THT, TTH}
► all heads: A = {HHH}

Probability

Repeat the experiment many times, record how many times a given event A occurred (the "count" c_1).
Do this whole series many times; remember all the c_i's.
Observation: if repeated really many times, the ratios c_i/T_i (where T_i is the number of experiments run in the i-th series) are close to some (unknown but) constant value.
Call this constant the probability of A. Notation: p(A)

Estimating Probability

Remember: ... close to an unknown constant. We can only estimate it:
► from a single series (the typical case, as mostly the outcome of a series is given to us and we cannot repeat the experiment): p(A) ≈ c_1/T_1
► otherwise, take the weighted average of all c_i/T_i (or, if the data allows, simply treat the set of series as a single long series). This is the best estimate.

Example

Recall our example:
► experiment: three coin tosses
► Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
► count cases with exactly two tails: A = {HTT, THT, TTH}
Run the experiment 1000 times (i.e. 3000 tosses).
Counted: 386 cases with two tails (HTT, THT or TTH)
► estimate: p(A) = 386/1000 = .386
Run again: 373, 399, 382, 355, 372, 406, 359
► p(A) = .379 (weighted average) or simply 3032/8000
Uniform distribution assumption: p(A) = 3/8 = .375

Basic Properties

Basic properties:
► p: 2^Ω → [0, 1]
► p(Ω) = 1
► disjoint events: p(∪_i A_i) = Σ_i p(A_i)
NB: the axiomatic definition of probability takes the above three conditions as axioms.
Immediate consequences:
► p(∅) = 0
► p(Ā) = 1 − p(A)
► A ⊆ B ⇒ p(A) ≤ p(B)
► Σ_{a∈Ω} p(a) = 1

Joint and Conditional Probability

p(A, B) = p(A ∩ B)
p(A|B) = p(A, B) / p(B)
Estimating from counts:
p(A|B) = p(A, B) / p(B) = (c(A ∩ B)/T) / (c(B)/T) = c(A ∩ B) / c(B)

Bayes Rule

p(A, B) = p(B, A), since p(A ∩ B) = p(B ∩ A)
► therefore p(A|B) p(B) = p(B|A) p(A), and therefore:
p(A|B) = p(B|A) × p(A) / p(B)

Independence

Can we compute p(A, B) from p(A) and p(B)?
Recall from the previous foil:
p(A|B) = p(B|A) × p(A) / p(B)
p(A|B) × p(B) = p(B|A) × p(A)
p(A, B) = p(B|A) × p(A)
... we're almost there: how does p(B|A) relate to p(B)?
► p(B|A) = p(B) iff A and B are independent
Example: two coin tosses; the weather today and the weather on March 4th, 1789
Any two events for which p(B|A) = p(B)!

Chain Rule

p(A_1, A_2, A_3, A_4, ..., A_n) = p(A_1|A_2, A_3, A_4, ..., A_n) × p(A_2|A_3, A_4, ..., A_n) × ... × p(A_{n−1}|A_n) × p(A_n)
This is a direct consequence of the Bayes rule.
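The counting view of probability and of p(A|B) = c(A ∩ B)/c(B) above is easy to check by simulation. Below is a minimal illustrative sketch (not part of the original slides; the conditioning event B = "first toss is tails" is my own choice) that re-runs the three-toss example and estimates a conditional probability from counts:

```python
import random

# Re-run the slides' example: 8 series of 1000 three-toss experiments,
# counting A = "exactly two tails" (HTT, THT, TTH).
def run_series(n_experiments=1000):
    count = 0
    for _ in range(n_experiments):
        tosses = [random.choice("HT") for _ in range(3)]
        if tosses.count("T") == 2:
            count += 1
    return count

series = [run_series() for _ in range(8)]
print(series, sum(series) / 8000)   # pooled estimate of p(A); exact value 3/8 = .375

# Conditional probability from counts, p(A|B) = c(A ∩ B) / c(B),
# with the (hypothetical) conditioning event B = "first toss is tails".
c_B = 0
c_AB = 0
for _ in range(8000):
    tosses = [random.choice("HT") for _ in range(3)]
    if tosses[0] == "T":
        c_B += 1
        if tosses.count("T") == 2:
            c_AB += 1
print(c_AB / c_B)                   # estimates p(A|B); exact value 2/4 = .5
```

The exact p(A|B) = .5 here follows from the counting definition: of the four outcomes starting with T, exactly two (THT, TTH) have two tails in total.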
The Golden Rule of Classic Statistical NLP

Interested in an event A given B (where it is not easy, practical, or desirable to estimate p(A|B) directly): take the Bayes rule and maximize over all A's:
argmax_A p(A|B) = argmax_A (p(B|A) × p(A) / p(B)) = argmax_A (p(B|A) × p(A))
... as p(B) is constant when changing the A's.

Random Variables

A random variable is a function X : Ω → Q
► in general Q = R^n, typically R
► easier to handle real numbers than real-world events
A random variable is discrete if Q is countable (hence also if it is finite).
Example: die: natural "numbering" [1, 6]; coin: {0, 1}
Probability distribution:
► p_X(x) = p(X = x) =_df p(A_x), where A_x = {a ∈ Ω : X(a) = x}
► often just p(x) if it is clear from context what X is

Expectation

The expectation is the mean of a random variable (a weighted average):
► E(X) = Σ_{x∈X(Ω)} x · p_X(x)
Example: one six-sided die: 3.5; two dice (sum): 7

Joint and Conditional Distributions

Joint and conditional distribution rules:
► analogous to probabilities of events
Bayes: p_{X|Y}(x|y) — notation; p(x|y) — even simpler notation:
p(x|y) = p(y|x) · p(x) / p(y)
Chain rule:
p(w, x, y, z) = p(z) · p(y|z) · p(x|y, z) · p(w|x, y, z)

Standard Distributions

Binomial (discrete)
► outcome: 0 or 1 (thus binomial)
► make n trials
► interested in the (probability of the) number of successes r
Must be careful: it's not uniform!
► p_b(r|n) = C(n, r) / 2^n (for equally likely outcomes)
► C(n, r) counts how many possibilities there are for choosing r objects out of n: C(n, r) = n! / (r!(n − r)!)

Continuous Distributions

The normal distribution ("Gaussian"):
p_norm(x|μ, σ) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
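The last few formulas can be verified numerically. A minimal sketch (my own illustration, not part of the slides) implementing the binomial formula, the expectation from the Expectation slide, and the Gaussian density:

```python
import math

# Binomial for equally likely binary outcomes, as on the slide:
# p_b(r|n) = C(n, r) / 2^n.
def p_binomial(r, n):
    return math.comb(n, r) / 2 ** n

print(p_binomial(2, 3))             # p(exactly two tails in three tosses) = 3/8 = .375

# Expectation E(X) = sum of x * p_X(x) over x in X(Omega),
# for a discrete distribution given as a dict {value: probability}.
def expectation(dist):
    return sum(x * p for x, p in dist.items())

die = {x: 1 / 6 for x in range(1, 7)}
print(expectation(die))             # 3.5, as on the Expectation slide

two_dice = {}                       # distribution of the sum of two dice
for a in range(1, 7):
    for b in range(1, 7):
        two_dice[a + b] = two_dice.get(a + b, 0) + 1 / 36
print(expectation(two_dice))        # 7.0

# Normal (Gaussian) density p_norm(x|mu, sigma).
def p_norm(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(p_norm(0.0, 0.0, 1.0))        # ~0.3989, the standard normal density at its mean
```

Note how p_binomial(2, 3) recovers the .375 from the coin-toss example: the binomial with equally likely outcomes is exactly the uniform-distribution assumption made there.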