Probability
PA154 Jazykové modelování (1.2) Pavel Rychlý
pary@fi.muni.cz
February 17, 2020
Source: Introduction to Natural Language Processing (600.465) Jan Hajic, CS Dept., Johns Hopkins Univ. www.cs.jhu.edu/~hajic
Experiments & Sample Spaces
Experiment, process, test, ...
Set of possible basic outcomes: sample space Q (základní prostor obsahující možné výsledky)
► coin toss (O = {head, tail}), die (O = {1..6})
► yes/no opinion poll, quality test (bad/good) (O = {0,1})
► lottery      \9á 10r..1012)
► # of traffic accidents somewhere per year (O = N)
► spelling errors (O = Z*), where Z is an aplhabet, and Z* is set of possible strings over such alphabet
► missing word (|0 |= vocabulary size)
PA154 Jazykové modelování (1.2)
Probability
Events
■ Event (jev) A is a set of basic outcomes
■ Usually A C fl, and all A £ 2^ (the event space, jevové pole)
► O is the certain event (jistý jev), 0 is the impossible event (nemôž
jev)
■ Example:
► experiment: three times coin toss
► ft = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
► count cases with exactly two tails: then
► A = {HTT, THT, TTH}
► all heads:
► A = {HHH}
PA154 Jazykové modelování (1.2)
Probability
3/16
Probability
■ Repeat experiment many times, record how many times a given event A occured ("count" ci).
■ Do this whole series many times; remember all qs.
■ Observation: if repeated really many times, the ratios of (where
Tj is the number of experiments run in the i-th series) are close to some (unknown but) constant value.
■ Call this constant a probability of A. Notation: p(A)
PA154 Jazykové modelování (1.2)
Probability
4/16
Estimating Probability
Remember: ... close to an unknown constant.
We can only estimate it:
► from a single series (typical case, as mostly the outcome of a series is given to us we cannot repeat the experiment):
7i
C;
► otherwise, take the weighted average of all — (or, if the data allows, simply look at the set of series as if it is a single long series). This is the best estimate.
PA154 Jazykové modelování (1.2)
Probability
5/16
Example
Recall our example:
► experiment: three times coin toss
► ft = {HHH, HHT, HTH, HTTTHH, THT, TTH, TTT}
► count cases with exactly two tails: A = {HTT, THT, TTH}
Run an experiment 1000 times (i.e. 3000 tosses)
Counted: 386 cases with two tails (HTT, THT or TTH)
estimate: p(A) = 386/100 = .386
Run again: 373, 399, 382, 355, 372, 406, 359
► p(A) = .379 (weighted average) or simply 3032/8000
Uniform distribution assumption: p(A) = 3/8 = .375
PA154 Jazykové modelování (1.2)
Probability
6/16
Basic Properties
Basic properties:
► p: 2Q -> [0,1] - P(fl) = 1
► Disjoint events: p(U A,) = X);P(A;)
NB: axiomatic definiton of probability: take the above three conditions as axioms
Immediate consequences:
► P(0) = 0
► p(A) = 1 - p(a)
► ACB^ p(A) < P(B)
" EaenP(a) = 1
PA154 Jazykové modelování (1.2)
Probability
7/16
Joint and Conditional Probability
p(A, B) = p(A n ß)
> Estimating form counts:
P(A\B) =
p{A, B) P{B)
c(A n B)
c(B) T
c(A n B) c(ß)
	
(  A (	ß j
PA154 Jazykové modelování (1.2)
Probability
8/16
Bayes Rule
p(A,B) = p(B,A) since p(A n B) = p(B n A) ► therefore p(A\B)p(B) = p(B\A)p(A), and therefore
p(A\B) =
p{B\A) x p(A) P(B)
Probability
Independence
Can we compute p(A,B) from p(A) and p(B)? Recall from previous foil:
p(A\B) =
p(B\A) x p(A) P(B)
p(A\B) x p(B) = p(B\A) x p(A)
p(A, B) = p(B\A) x p(A)
.. .we're almost there: how p(B\A) relates to p(B)? ► p(B|A) = p(B) iff A and B are independent
Example: two coin tosses, weather today and weather on March 4th 1789;
Any two events for which p(B|A) = P(B)!
PA154 Jazykové modelování (1.2)
Probability
10/16
Chain Rule
		
p(A1,A2,A3,A4,..	■ ■> An)	—
p{Ai\A2,A3,AA,...	,An)	x p(A2\A3,A4,.. .,An)x
xp(A3\A4,...,A„)	X • • •	x p{An-1\An) x p(A„)
■ this is a direct consequence of the Bayes rule.
PA154 Jazykové modelování (1.2)
Probability
11/16
"he Golden Rule of Classic Statistical NLP
Interested in an event A given B (where it is not easy or practical or desirable) to estimate p(A\B)\
take Bayes rule, max over all Bs:
p(B\A) x p(A)
argmaxAp(A\B) = argmaxA
P(B)
argmaxA(p(B\A) x p(A))
as p(B) is constant when changing As
PA154 Jazykové modelování (1.2)
Probability
12/16
Random Variables
■ is a function X : ft —>> Q
► in general (? = Rn, typically R
► easier to handle real numbers than real-world events
■ random variable is discrete if Q is countable (i.e. also if finite)
■ Example: die: natural "numbering" [1,6], coin: {0,1}
■ Probability distribution:
► px(x) = p(X = x) =df p(Ax) where Ax = {a e ft : X(a) = x}
► often just p(x) if it is clear from context what X is
PA154 Jazykové modelování (1.2)
Probability
13/16
Expectation
Joint and Conditional Distributions
■ is a mean of a random variable (weighted average)
► E(X) = £xGX(Q)x.px(x)
■ Example: one six-sided die: 3.5, two dice (sum): 7
■ Joint and Conditional distribution rules:
► analogous to probability of events
■ Bayes: Pxiy(x,y) —notation even simpler notation
■ Chain rule:  p(w,x,y,z) = p{z).p{y\z).p{x\y,z).p{w\x,y,z)
p{y\x).p{x)
p(y)
p(*|y)
Probability
14/16
Standard Distributions
Binomial (discrete)
► outcome: 0 or 1 (thus b/nomial)
► make n trials
► interested in the (probability of) numbers of successes r
Must be careful: it's not uniform!
(n)
Pt>(r\n) =       (for equally likely outcome)
(") counts how many possibilities there are for choosing r objects out of r?;
n = n!
KrJ {n-r)\r\
PA154 Jazykové modelování (1.2)
Probability
15/16
Continuous Distributions
The normal distribution ("Gaussian")
-(x-M)2
Pnorm{x\fl,a) = exP
2a2
where:
► ji is the mean (x-coordinate of the peak) (0)
a is the standard deviation (1)
other: hyperbolic, t
PA154 Jazykové modelování (1.2)
Probability
16/16