1
A review of basic probability theory
This is a book about the applications of probability. We hope to convey that the subject is both fascinating and important. The examples are drawn mainly from the biological sciences, but some originate in the engineering, physical, social and statistical sciences. Furthermore, the techniques are not limited to any one area.
The reader is assumed to be familiar with the elements of probability or to be studying it concomitantly. In this chapter we will briefly review some of this basic material. This will establish notation and provide a convenient reference place for some formulas and theorems which are needed later at various points.
1.1 PROBABILITY AND RANDOM VARIABLES
When an experiment is performed whose outcome is uncertain, the collection of possible elementary outcomes is called a sample space, often denoted by $\Omega$. Points in $\Omega$, denoted in the discrete case by $\omega_i$, $i = 1, 2, \dots$, have an associated probability $P\{\omega_i\}$. This enables the probability of any subset $A$ of $\Omega$, called an event, to be ascertained by finding the total probability associated with all the points in the given subset:
$$P\{A\} = \sum_{\omega_i \in A} P\{\omega_i\}.$$
We always have
$$0 \le P\{A\} \le 1,$$
and in particular $P\{\Omega\} = 1$ and $P\{\emptyset\} = 0$, where $\emptyset$ is the empty set relative to $\Omega$.
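As an illustration of our own (not taken from the text), the sum defining the probability of an event can be computed directly for a small discrete sample space. The fair six-sided die and the helper name `prob` are assumptions made only for this sketch.

```python
from fractions import Fraction

# Assumed sample space: a fair six-sided die, each point carrying probability 1/6.
omega = {face: Fraction(1, 6) for face in range(1, 7)}

def prob(event, space=omega):
    """Return P{A}: the total probability of the sample points belonging to A."""
    return sum(space[w] for w in event if w in space)

even = {2, 4, 6}
print(prob(even))        # probability of an even face
print(prob(set(omega)))  # probability of the whole sample space
print(prob(set()))       # probability of the empty set
```

Exact fractions are used so that the two boundary cases, the whole sample space and the empty set, come out exactly rather than up to rounding.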
A random variable is a real-valued function defined on the elements of a sample space. Roughly speaking it is an observable which takes on numerical values with certain probabilities.
Discrete random variables take on finitely many or countably infinitely many values. Their probability laws are often called probability mass functions. The following discrete random variables are frequently encountered.
Binomial
A binomial random variable $X$ with parameters $n$ and $p$ has the probability law
$$p_k = \Pr\{X = k\} = \binom{n}{k} p^k q^{n-k} \doteq b(k; n, p), \qquad k = 0, 1, 2, \dots, n, \tag{1.1}$$
where $0 \le p \le 1$, $q \doteq 1 - p$ and $n$ is a positive integer ($\doteq$ means we are defining a new symbol). The binomial coefficients are
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!},$$
being the number of ways of choosing $k$ items, without regard for order, from $n$ distinguishable items. When $n = 1$, so that
$$\Pr\{X = 1\} = p = 1 - \Pr\{X = 0\},$$
the random variable is called Bernoulli. Note the following.
Convention
Random variables are always designated by capital letters (e.g. $X$, $Y$) whereas symbols for the values they take on, as in $\Pr\{X = k\}$, are always designated by lowercase letters.
The converse, however, is not true. Sometimes we use capital letters for non-random quantities.
Poisson
A Poisson random variable with parameter $\lambda > 0$ takes on non-negative integer values and has the probability law
$$p_k = \Pr\{X = k\} = \frac{e^{-\lambda} \lambda^k}{k!}, \qquad k = 0, 1, 2, \dots \tag{1.2}$$
For any random variable the total probability mass is unity. Hence if $p_k$ is given by either (1.1) or (1.2),
$$\sum_k p_k = 1,$$
where summation is over the possible values $k$ as indicated.
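The normalization of the two laws can be checked numerically. The following is a minimal sketch of our own: the function names and the parameter values are assumptions, and the infinite Poisson sum is truncated, which is an approximation.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """b(k; n, p) = C(n, k) p^k q^(n-k), as in (1.1)."""
    q = 1.0 - p
    return comb(n, k) * p**k * q**(n - k)

def poisson_pmf(k, lam):
    """p_k = exp(-lam) lam^k / k!, as in (1.2)."""
    return exp(-lam) * lam**k / factorial(k)

# Total mass of the binomial law over k = 0, ..., n (assumed n = 10, p = 0.3).
binomial_total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))

# The Poisson sum is infinite; stopping at k = 99 leaves a negligible tail for lam = 2.5.
poisson_total = sum(poisson_pmf(k, 2.5) for k in range(100))

print(binomial_total, poisson_total)
```

Both totals should agree with unity to within floating-point rounding.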
For any random variable X, the distribution function is
$$F(x) = \Pr\{X \le x\}, \qquad -\infty < x < \infty.$$
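Continuous random variables take on a continuum of values. Usually the probability law of a continuous random variable can be expressed through its probability density function, $f(x)$, which is the derivative of the distribution function. Thus
$$\begin{aligned}
f(x) &= \frac{d}{dx} F(x) \\
&= \lim_{\Delta x \to 0} \frac{F(x + \Delta x) - F(x)}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{\Pr\{X \le x + \Delta x\} - \Pr\{X \le x\}}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{\Pr\{x < X \le x + \Delta x\}}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{\Pr\{X \in (x, x + \Delta x]\}}{\Delta x}.
\end{aligned} \tag{1.3}$$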
The last two expressions in (1.3) often provide a convenient prescription for calculating probability density functions. Often the latter is abbreviated to p.d.f. but we will usually just say 'density'.
If the interval $(x_1, x_2)$ is in the range of $X$ then the probability that $X$ takes values in this interval is obtained by integrating the probability density over $(x_1, x_2)$:
$$\Pr\{x_1 < X < x_2\} = \int_{x_1}^{x_2} f(x)\,dx.$$
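The two uses of the density described above, as a limit of difference quotients and as an integrand, can be sketched numerically. As an assumed example we take the distribution function $F(x) = 1 - e^{-x}$, $x \ge 0$ (the exponential case with unit parameter, which appears later in this section); the helper names and the midpoint rule are our own choices.

```python
from math import exp

def F(x):
    """Distribution function of the assumed example: exponential with unit parameter."""
    return 1.0 - exp(-x) if x >= 0 else 0.0

def f(x):
    """Its density, the derivative of F for x > 0."""
    return exp(-x) if x >= 0 else 0.0

def difference_quotient(x, dx):
    """Pr{x < X <= x + dx} / dx, the ratio in (1.3) before the limit is taken."""
    return (F(x + dx) - F(x)) / dx

def integrate(func, a, b, steps=10_000):
    """Midpoint-rule approximation to the integral of func over (a, b)."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))

x = 1.0
for dx in (0.1, 0.01, 0.001):
    # The quotient should approach f(x) as dx shrinks.
    print(dx, difference_quotient(x, dx), f(x))

# Pr{x1 < X < x2} by integrating the density, compared with F(x2) - F(x1).
print(integrate(f, 0.5, 2.0), F(2.0) - F(0.5))
```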
The following continuous random variables are frequently encountered.
Normal (or Gaussian)
A random variable with density
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right], \qquad -\infty < x < \infty, \tag{1.4}$$
where $-\infty < \mu < \infty$ and $0 < \sigma^2 < \infty$, is called normal. The quantities $\mu$ and $\sigma^2$ are the mean and variance (elaborated upon below) and such a random variable is often designated $N(\mu, \sigma^2)$. If $\mu = 0$ and $\sigma = 1$ the random variable is called a standard normal random variable, for which the usual symbol is $Z$.
Uniform
A random variable with constant density
$$f(x) = \frac{1}{b - a}, \qquad -\infty < a \le x \le b < \infty,$$
is said to be uniformly distributed on $(a, b)$ and is denoted $U(a, b)$. If $a = 0$, $b = 1$ the density is unity on the unit interval,
$$f(x) = 1, \qquad 0 \le x \le 1,$$
and the random variable is designated $U(0, 1)$.
Gamma
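A random variable is said to have a gamma density (or gamma distribution) with parameters $\lambda$ and $p$ if
$$f(x) = \frac{\lambda (\lambda x)^{p-1} e^{-\lambda x}}{\Gamma(p)}, \qquad x \ge 0; \quad \lambda, p > 0.$$
The quantity $\Gamma(p)$ is the gamma function defined as
$$\Gamma(p) = \int_0^\infty x^{p-1} e^{-x}\,dx, \qquad p > 0.$$
When $p = 1$ the gamma density is that of an exponentially distributed random variable
$$f(x) = \lambda e^{-\lambda x}, \qquad x \ge 0.$$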
For continuous random variables the density must integrate to unity:
$$\int f(x)\,dx = 1,$$
where the interval of integration is the whole range of values of $X$.
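That the normal, uniform and gamma densities integrate to unity can be checked numerically. The sketch below is our own: the function names, the parameter values and the finite integration limits (which cut off negligible tails) are assumptions, and the gamma density is written with the gamma variate's mean equal to the ratio of its second parameter to its first.

```python
from math import exp, gamma, pi, sqrt

def normal_density(x, mu=0.0, sigma2=1.0):
    """Normal density (1.4) with mean mu and variance sigma2."""
    return exp(-(x - mu) ** 2 / (2.0 * sigma2)) / sqrt(2.0 * pi * sigma2)

def uniform_density(x, a=0.0, b=1.0):
    """Constant density 1/(b - a) on the interval from a to b."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def gamma_density(x, lam=1.0, p=1.0):
    """Gamma density with parameters lam and p; p = 1 gives the exponential density."""
    if x < 0:
        return 0.0
    return lam * (lam * x) ** (p - 1.0) * exp(-lam * x) / gamma(p)

def integrate(func, a, b, steps=100_000):
    """Midpoint-rule approximation to the integral of func over (a, b)."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))

print(integrate(normal_density, -10.0, 10.0))
print(integrate(lambda x: uniform_density(x, 2.0, 5.0), 2.0, 5.0))
print(integrate(lambda x: gamma_density(x, 1.5, 2.5), 0.0, 60.0))
```

The midpoint rule never evaluates the integrand at the end points, which avoids the singular behaviour of the gamma density at the origin when the second parameter is less than one.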
1.2 MEAN AND VARIANCE
Let $X$ be a discrete random variable with
$$\Pr\{X = x_k\} = p_k, \qquad k = 1, 2, \dots$$
The mean, average or expectation of $X$ is
$$E(X) = \sum_k p_k x_k.$$
For a binomial random variable $E(X) = np$ whereas a Poisson random variable has mean $E(X) = \lambda$. For a continuous random variable with density $f(x)$,
$$E(X) = \int x f(x)\,dx.$$
If $X$ is normal with density given by (1.4) then $E(X) = \mu$; a uniform $(a, b)$ random variable has mean $E(X) = \tfrac{1}{2}(a + b)$; and a gamma variate has mean $E(X) = p/\lambda$.
The $n$th moment of $X$ is the expected value of $X^n$:
$$E(X^n) = \begin{cases} \displaystyle\sum_k p_k x_k^n & \text{if } X \text{ is discrete,} \\[1ex] \displaystyle\int x^n f(x)\,dx & \text{if } X \text{ is continuous.} \end{cases}$$
If $n = 2$ we obtain the second moment $E(X^2)$. The variance, which measures the degree of dispersion of the probability mass of a random variable about its mean, is
$$\operatorname{Var}(X) = E[(X - E(X))^2] = E(X^2) - E^2(X).$$
The variances of the above-mentioned random variables are: binomial, $npq$; Poisson, $\lambda$; normal, $\sigma^2$; uniform, $\tfrac{1}{12}(b - a)^2$; gamma, $p/\lambda^2$. The square root of the variance is called the standard deviation.
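The stated means and variances of the binomial and Poisson laws can be recovered from the definitions of the first and second moments. This is a sketch of our own; the helper names and parameter values are assumptions, and the Poisson law is truncated at a large index.

```python
from math import comb, exp

def binomial_law(n, p):
    """Return {k: Pr{X = k}} for a binomial random variable."""
    q = 1.0 - p
    return {k: comb(n, k) * p**k * q**(n - k) for k in range(n + 1)}

def poisson_law(lam, kmax=150):
    """Return {k: Pr{X = k}} for a Poisson law, truncated at kmax (an approximation)."""
    law, term = {}, exp(-lam)
    for k in range(kmax + 1):
        law[k] = term
        term *= lam / (k + 1)  # recurrence: p_{k+1} = p_k * lam / (k + 1)
    return law

def moment(law, n):
    """nth moment: the sum of p_k * x_k**n over the possible values."""
    return sum(p * x**n for x, p in law.items())

def variance(law):
    """Variance computed as the second moment minus the squared mean."""
    return moment(law, 2) - moment(law, 1) ** 2

n, p, lam = 12, 0.25, 4.0
print(moment(binomial_law(n, p), 1), n * p)
print(variance(binomial_law(n, p)), n * p * (1 - p))
print(moment(poisson_law(lam), 1), variance(poisson_law(lam)), lam)
```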
1.3 CONDITIONAL PROBABILITY AND INDEPENDENCE
Let $A$ and $B$ be two random events. The conditional probability of $A$ given $B$ is, provided $\Pr\{B\} \ne 0$,
$$\Pr\{A \mid B\} = \frac{\Pr\{AB\}}{\Pr\{B\}},$$
where $AB$ is the intersection of $A$ and $B$, being the event that both $A$ and $B$ occur (sometimes written $A \cap B$). Thus only the occurrences of $A$ which are simultaneous with those of $B$ are taken into account. Similarly, if $X$, $Y$ are random variables defined on the same sample space, taking on values $x_i$, $i = 1, 2, \dots$, $y_j$, $j = 1, 2, \dots$, then the conditional probability that $X = x_i$ given $Y = y_j$ is, if $\Pr\{Y = y_j\} \ne 0$,
$$\Pr\{X = x_i \mid Y = y_j\} = \frac{\Pr\{X = x_i, Y = y_j\}}{\Pr\{Y = y_j\}},$$
the comma between $X = x_i$ and $Y = y_j$ meaning 'and'. The conditional expectation of $X$ given $Y = y_j$ is
$$E(X \mid Y = y_j) = \sum_i x_i \Pr\{X = x_i \mid Y = y_j\}.$$
The expected value of $XY$ is
$$E(XY) = \sum_i \sum_j x_i y_j \Pr\{X = x_i, Y = y_j\},$$
and the covariance of $X$, $Y$ is
$$\operatorname{Cov}(X, Y) = E[(X - E(X))(Y - E(Y))] = E(XY) - E(X)E(Y).$$
The covariance is a measure of the linear dependence of X on Y.
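The conditional probability, conditional expectation and covariance defined above can be computed directly from a joint probability law. In this sketch of our own, the joint law on a small grid is invented purely for illustration and the helper names are assumptions.

```python
from fractions import Fraction as Fr

# Invented joint law Pr{X = x, Y = y} on a 2 x 3 grid; the entries sum to 1.
joint = {
    (0, 0): Fr(1, 8), (0, 1): Fr(1, 4), (0, 2): Fr(1, 8),
    (1, 0): Fr(1, 4), (1, 1): Fr(1, 8), (1, 2): Fr(1, 8),
}

def marginal_y(y):
    """Pr{Y = y}, obtained by summing the joint law over the values of X."""
    return sum(p for (xi, yj), p in joint.items() if yj == y)

def cond_prob(x, y):
    """Pr{X = x | Y = y} = Pr{X = x, Y = y} / Pr{Y = y}, provided Pr{Y = y} is nonzero."""
    return joint.get((x, y), Fr(0)) / marginal_y(y)

def cond_expectation_x(y):
    """E(X | Y = y): the sum over x of x * Pr{X = x | Y = y}."""
    xs = {xi for (xi, _) in joint}
    return sum(x * cond_prob(x, y) for x in xs)

def expectation(g):
    """E[g(X, Y)], summed over the joint law."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def covariance():
    """Cov(X, Y) = E(XY) - E(X) E(Y)."""
    e_xy = expectation(lambda x, y: x * y)
    return e_xy - expectation(lambda x, y: x) * expectation(lambda x, y: y)

print(cond_prob(1, 0), cond_expectation_x(1), covariance())
```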
If X, Y are independent then the value of Y should have no effect on the probability that X takes on any of its values. Thus we define X, Y as independent if
$$\Pr\{X = x_i \mid Y = y_j\} = \Pr\{X = x_i\}, \qquad \text{all } i, j.$$
Equivalently $X$, $Y$ are independent if
$$\Pr\{X = x_i, Y = y_j\} = \Pr\{X = x_i\} \Pr\{Y = y_j\},$$
with a similar formula for arbitrary independent events. Hence for independent random variables
$$E(XY) = E(X)E(Y),$$
so their covariance is zero. Note, however, that Cov (X, Y) = 0 does not always imply X, Y are independent. The covariance is often normalized by defining the correlation coefficient
$$\rho_{XY} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y},$$
where $\sigma_X$, $\sigma_Y$ are the standard deviations of $X$, $Y$. The coefficient $\rho_{XY}$ is bounded above and below by
$$-1 \le \rho_{XY} \le 1.$$
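The remark that zero covariance does not imply independence can be made concrete. In the following sketch of our own, the pair is an assumed example: X takes the values -1, 0, 1 with equal probability and Y is the square of X.

```python
from fractions import Fraction as Fr

# Assumed example: X is -1, 0 or 1 with probability 1/3 each, and Y = X**2.
joint = {(x, x * x): Fr(1, 3) for x in (-1, 0, 1)}

def expectation(g):
    """E[g(X, Y)], summed over the joint law."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

def marginal(index, value):
    """Marginal probability that coordinate `index` (0 for X, 1 for Y) equals value."""
    return sum(p for point, p in joint.items() if point[index] == value)

cov = (expectation(lambda x, y: x * y)
       - expectation(lambda x, y: x) * expectation(lambda x, y: y))

# Independence would require the joint law to factor for every pair of values.
independent = all(
    joint.get((x, y), Fr(0)) == marginal(0, x) * marginal(1, y)
    for x in (-1, 0, 1) for y in (0, 1)
)
print(cov, independent)
```

Here the covariance vanishes because the law of X is symmetric about zero, yet Y is completely determined by X, so the factorization fails.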
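Let $X_1, X_2, \dots, X_n$ be mutually independent random variables. That is,
$$\Pr\{X_1 \in A_1, X_2 \in A_2, \dots, X_n \in A_n\} = \Pr\{X_1 \in A_1\} \Pr\{X_2 \in A_2\} \cdots \Pr\{X_n \in A_n\},$$
for all appropriate sets $A_1, \dots, A_n$. Then
$$\operatorname{Var}\left(\sum_{i=1}^n X_i\right) = \sum_{i=1}^n \operatorname{Var}(X_i),$$
so that variances add in the case of independent random variables. We also note the formula
$$\operatorname{Var}(aX + bY) = a^2 \operatorname{Var}(X) + b^2 \operatorname{Var}(Y),$$
which holds if $X$, $Y$ are independent. If $X_1, X_2, \dots, X_n$ are independent identically distributed (abbreviated to i.i.d.) random variables with $E(X_1) = \mu$, $\operatorname{Var}(X_1) = \sigma^2$, then
$$E\left(\sum_{i=1}^n X_i\right) = n\mu, \qquad \operatorname{Var}\left(\sum_{i=1}^n X_i\right) = n\sigma^2.$$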
If $X$ is a random variable and $\{X_1, X_2, \dots, X_n\}$ are i.i.d. with the distribution of $X$, then the collection $\{X_k\}$ is called a random sample of size $n$ for $X$. Random samples play a key role in computer simulation (Chapter 5) and of course are fundamental in statistics.
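The additivity of means and variances for i.i.d. random variables can be verified exactly for a small case by building the law of the sum. This sketch is our own; the fair die and the helper names are assumptions, and independence enters where the joint probability is formed as a product.

```python
from fractions import Fraction as Fr
from itertools import product

def law_of_sum(laws):
    """Distribution of X_1 + ... + X_n for mutually independent X_i.

    Each law is a dict {value: probability}; independence is used when the
    probability of a combination is taken as the product of the marginals.
    """
    total = {}
    for combo in product(*(law.items() for law in laws)):
        value = sum(x for x, _ in combo)
        prob = Fr(1)
        for _, p in combo:
            prob *= p
        total[value] = total.get(value, Fr(0)) + prob
    return total

def mean(law):
    return sum(x * p for x, p in law.items())

def var(law):
    return sum(x * x * p for x, p in law.items()) - mean(law) ** 2

die = {face: Fr(1, 6) for face in range(1, 7)}  # assumed: a fair die
n = 3
s = law_of_sum([die] * n)
print(mean(s), n * mean(die))
print(var(s), n * var(die))
```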
1.4 LAW OF TOTAL PROBABILITY
Let $\Omega$ be a sample space for a random experiment and let $\{A_i, i = 1, 2, \dots\}$ be a collection of nonempty subsets of $\Omega$ such that
(i) $A_i A_j = \emptyset$, $i \ne j$,
(ii) $\bigcup_i A_i = \Omega$.
(Here $\emptyset$ is the null set, the impossible event, being the complement of $\Omega$.) Condition (i) says that the $A_i$ represent mutually exclusive events. Condition (ii) states that when an experiment is performed, at least one of the $A_i$ must be observed. Under these conditions the sets or events $\{A_i, i = 1, 2, \dots\}$ are said to form a partition or decomposition of the sample space.
The law or theorem of total probability states that for any event (set) $B$,
$$\Pr\{B\} = \sum_i \Pr\{B \mid A_i\} \Pr\{A_i\}.$$
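A similar relation holds for expectations. By definition the expectation of $X$ conditioned on the event $A_i$ is
$$E(X \mid A_i) = \sum_k x_k \Pr\{X = x_k \mid A_i\},$$
where $\{x_k\}$ is the set of possible values of $X$. Thus
$$\begin{aligned}
E(X) &= \sum_k x_k \Pr\{X = x_k\} \\
&= \sum_k x_k \sum_i \Pr\{X = x_k \mid A_i\} \Pr\{A_i\} \\
&= \sum_i \sum_k x_k \Pr\{X = x_k \mid A_i\} \Pr\{A_i\}.
\end{aligned}$$
Thus
$$E(X) = \sum_i E(X \mid A_i) \Pr\{A_i\},$$
which we call the law of total probability applied to expectations.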
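Both forms of the law of total probability can be checked on a finite sample space. In this sketch of our own, the fair die, the three-set partition, the event B (an even face) and the random variable X (the face shown) are all assumptions chosen for illustration.

```python
from fractions import Fraction as Fr

omega = {w: Fr(1, 6) for w in range(1, 7)}  # assumed: a fair die
partition = [{1, 2}, {3, 4, 5}, {6}]         # disjoint sets whose union is omega

def pr(event):
    """Probability of an event as the sum over its sample points."""
    return sum(omega[w] for w in event)

def pr_given(b, a):
    """Conditional probability of b given a, the ratio Pr{ab} / Pr{a}."""
    return pr(b & a) / pr(a)

def expectation_given(a):
    """E(X | a) for X(w) = w, the face shown."""
    return sum(w * omega[w] for w in a) / pr(a)

B = {2, 4, 6}
total_prob = sum(pr_given(B, a) * pr(a) for a in partition)
total_expectation = sum(expectation_given(a) * pr(a) for a in partition)

print(total_prob, pr(B))
print(total_expectation, sum(w * p for w, p in omega.items()))
```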
We note also the fundamental relation for any two events A, B in the same sample space:
$$\Pr\{A \cup B\} = \Pr\{A\} + \Pr\{B\} - \Pr\{AB\},$$
where $A \cup B$ is the union of $A$ and $B$, consisting of those points which are in $A$ or in $B$ or in both $A$ and $B$.
1.5 CHANGE OF VARIABLES
Let X be a continuous random variable with distribution function Fx and density fx. Let
y = g(x)
be a strictly increasing function of x (see Fig. 1.1) with inverse function
x = h(y).
Then
Y = g(X)
is a random variable which we let have distribution function FY and density fY.
Figure 1.1 g(x) is a strictly increasing function of x.
It is easy to see that $X \le x$ implies $Y \le g(x)$. Hence we arrive at
$$\Pr\{X \le x\} = \Pr\{Y \le g(x)\}.$$
By the definition of distribution functions this can be written
$$F_X(x) = F_Y(g(x)). \tag{1.5}$$
Therefore
$$F_Y(y) = F_X(h(y)).$$
On differentiating with respect to $y$ we obtain, assuming that $h$ is differentiable,
$$\frac{dF_Y}{dy} = \left.\frac{dF_X(x)}{dx}\right|_{x = h(y)} \frac{dh}{dy},$$
or in terms of densities
$$f_Y(y) = f_X(h(y)) \frac{dh}{dy}. \tag{1.6}$$
If $y$ is a strictly decreasing function of $x$ we obtain $\Pr\{X \le x\} = \Pr\{Y \ge g(x)\}$. Working through the steps between (1.5) and (1.6) in this case gives
$$f_Y(y) = f_X(h(y)) \left(-\frac{dh}{dy}\right). \tag{1.7}$$
Both formulas (1.6) and (1.7) are covered by the single formula
$$f_Y(y) = f_X(h(y)) \left|\frac{dh}{dy}\right|,$$
where $|\ |$ denotes absolute value. Cases where $g$ is neither strictly increasing nor strictly decreasing require special consideration.
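The change of variables formula can be checked numerically against a direct differentiation of the distribution function. As an assumed example of our own we take X standard normal and the strictly increasing map g(x) = exp(x), so that the inverse is the natural logarithm with derivative 1/y; the function names are likewise our own.

```python
from math import erf, exp, log, pi, sqrt

def f_X(x):
    """Standard normal density, (1.4) with mean 0 and variance 1."""
    return exp(-x * x / 2.0) / sqrt(2.0 * pi)

def F_X(x):
    """Standard normal distribution function, written with the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def h(y):
    """Inverse of the assumed increasing map g(x) = exp(x)."""
    return log(y)

def f_Y(y):
    """Density of Y = g(X): f_X(h(y)) times |dh/dy|, with dh/dy = 1/y."""
    return f_X(h(y)) * abs(1.0 / y)

def f_Y_numeric(y, dy=1e-5):
    """Central difference of F_Y(y) = F_X(h(y)), used as an independent check."""
    return (F_X(h(y + dy)) - F_X(h(y - dy))) / (2.0 * dy)

for y in (0.5, 1.0, 3.0):
    print(y, f_Y(y), f_Y_numeric(y))
```

The two columns should agree closely, since the second differentiates the relation between the distribution functions without using the density formula.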
1.6 TWO-DIMENSIONAL RANDOM VARIABLES
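Let $X$, $Y$ be random variables defined on the same sample space. Then their joint distribution function is
$$F_{XY}(x, y) = \Pr\{X \le x, Y \le y\}.$$
The mixed partial derivative of $F_{XY}$, if it exists, is the joint density of $X$ and $Y$:
$$f_{XY}(x, y) = \frac{\partial^2 F_{XY}(x, y)}{\partial x\,\partial y}.$$
As a rough guide we have, for small enough $\Delta x$, $\Delta y$,
$$f_{XY}(x, y)\,\Delta x\,\Delta y \approx \Pr\{X \in (x, x + \Delta x],\ Y \in (y, y + \Delta y]\}.$$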
If X, Y are independent then their joint distribution function and joint density function factor into those of the individual random variables:
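$$F_{XY}(x, y) = F_X(x) F_Y(y),$$
$$f_{XY}(x, y) = f_X(x) f_Y(y).$$
In particular, if $X$, $Y$ are independent standard normal random variables,
$$f_{XY}(x, y) = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{x^2}{2}\right\} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{y^2}{2}\right\}, \qquad -\infty < x, y < \infty,$$
which can be written
$$f_{XY}(x, y) = \frac{1}{2\pi} \exp\left\{-\tfrac{1}{2}(x^2 + y^2)\right\}. \tag{1.8}$$
In fact if the joint density of $X$, $Y$ is as given by (1.8) we may conclude that $X$, $Y$ are independent standard normal random variables.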
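The rough guide relating the joint density to the probability of a small rectangle can be illustrated for independent standard normal random variables. In this sketch of our own, the point and the rectangle sizes are arbitrary assumptions, and the rectangle probability is obtained by factoring it into two one-dimensional probabilities, which is valid only under independence.

```python
from math import erf, exp, pi, sqrt

def Phi(x):
    """Standard normal distribution function, written with the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def f_XY(x, y):
    """Joint density (1.8) of two independent standard normal random variables."""
    return exp(-0.5 * (x * x + y * y)) / (2.0 * pi)

def rectangle_prob(x, y, dx, dy):
    """Probability that X lies in (x, x+dx] and Y in (y, y+dy], using independence."""
    return (Phi(x + dx) - Phi(x)) * (Phi(y + dy) - Phi(y))

x, y = 0.3, -1.1
for d in (0.1, 0.01, 0.001):
    # The two quantities should agree more and more closely as d shrinks.
    print(d, f_XY(x, y) * d * d, rectangle_prob(x, y, d, d))
```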
Change of variables
Let $U$, $V$ be random variables with joint density $f_{UV}(u, v)$. Suppose that the one-one mappings
means that 5% of the time, values of $\chi^2$ greater than the critical value occur even when $H_0$ is true. That is, there is a 5% chance that we will (incorrectly) reject $H_0$ when it is true.
In applying the above $\chi^2$ goodness of fit test, the number of degrees of freedom is given by the number $n$ of 'cells', minus the number of linear relations between the $N_i$. (There is at least one, $\sum_i N_i = N$.) The number of degrees of freedom is reduced further by one for each estimated parameter needed to describe the distribution under $H_0$.
It is recommended that the expected numbers of observations in each category should not be less than 5, but this requirement can often be relaxed. A table of critical values of $\chi^2$ is given in the Appendix, p. 219.
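The counting of degrees of freedom can be sketched in code. Everything specific here is an assumption of ours: the statistic is taken in the usual Pearson form (the sum over cells of the squared difference between observed and expected numbers, divided by the expected number), the die-roll counts are invented for illustration, and the critical value 11.07 is the commonly tabulated 5% point for 5 degrees of freedom.

```python
def chi_square_statistic(observed, probabilities):
    """Pearson statistic: sum over cells of (observed - expected)^2 / expected."""
    total = sum(observed)
    return sum((n_i - total * p_i) ** 2 / (total * p_i)
               for n_i, p_i in zip(observed, probabilities))

def degrees_of_freedom(cells, estimated_parameters=0, linear_relations=1):
    """Cells minus linear relations minus parameters estimated under H0."""
    return cells - linear_relations - estimated_parameters

observed = [18, 23, 16, 21, 18, 24]  # invented counts for 120 rolls of a die
probabilities = [1 / 6] * 6          # H0: the die is fair, no parameters estimated
critical_value = 11.07               # assumed: tabulated 5% point, 5 degrees of freedom

statistic = chi_square_statistic(observed, probabilities)
dof = degrees_of_freedom(cells=len(observed))
reject_h0 = statistic > critical_value
print(statistic, dof, reject_h0)
```

With these invented counts every expected number is 20, comfortably above the recommended minimum of 5.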
For a detailed account of hypothesis testing and introductory statistics generally, see for example Walpole and Myers (1985), Hogg and Craig (1978) and Mendenhall, Scheaffer and Wackerly (1981). For full accounts of basic probability theory see also Chung (1979) and Feller (1968). Two recent books on applications of probability at an undergraduate level are those of Ross (1985) and Taylor and Karlin (1984).
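1.8 NOTATION
Little o
A quantity which depends on $\Delta x$ but vanishes more quickly than $\Delta x$ as $\Delta x \to 0$ is said to be 'little o of $\Delta x$', written $o(\Delta x)$. Thus for example $(\Delta x)^2$ is $o(\Delta x)$ because $(\Delta x)^2$ vanishes more quickly than $\Delta x$. In general, if
$$\lim_{\Delta x \to 0} \frac{g(\Delta x)}{\Delta x} = 0,$$
we write
$$g(\Delta x) = o(\Delta x).$$
The little o notation is very useful to abbreviate expressions in which terms will not contribute after a limiting operation is taken. To illustrate, consider the Taylor expansion of $e^{\Delta x}$:
$$e^{\Delta x} = 1 + \Delta x + \frac{(\Delta x)^2}{2!} + \frac{(\Delta x)^3}{3!} + \cdots = 1 + \Delta x + o(\Delta x).$$
We then have
$$\begin{aligned}
\left.\frac{d}{dx} e^x\right|_{x = 0} &= \lim_{\Delta x \to 0} \frac{e^{\Delta x} - 1}{\Delta x} \\
&= \lim_{\Delta x \to 0} \frac{1 + \Delta x + o(\Delta x) - 1}{\Delta x} \\
&= \lim_{\Delta x \to 0} \left[\frac{\Delta x}{\Delta x} + \frac{o(\Delta x)}{\Delta x}\right] \\
&= 1.
\end{aligned}$$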
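The little o behaviour in this example can be observed numerically. The sketch below is our own; the function name `remainder` and the chosen step sizes are assumptions.

```python
from math import exp

def remainder(dx):
    """exp(dx) - (1 + dx): the part of the Taylor expansion claimed to be little o of dx."""
    return exp(dx) - 1.0 - dx

for dx in (1e-1, 1e-2, 1e-3, 1e-4):
    # remainder/dx should shrink toward 0, while (exp(dx) - 1)/dx approaches 1.
    print(dx, remainder(dx) / dx, (exp(dx) - 1.0) / dx)
```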
Equal by definition
As seen already, when we write, for example,
$$q \doteq (1 - p)$$
we are defining the symbol $q$ to be equal to $1 - p$. This is not to be confused with approximately equal to, which is indicated by $\approx$.
Unit step function
The unit (or Heaviside) step function located at x0 is
$$H(x - x_0) = \begin{cases} 0, & x < x_0, \\ 1, & x \ge x_0. \end{cases}$$
Thus $H(x - x_0)$ has a jump of $+1$ at $x_0$ and it is right-continuous.
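A unit step that takes the value 1 at the jump point itself is what makes it right-continuous. A minimal sketch of our own, with the function name `unit_step` an assumption:

```python
def unit_step(x, x0=0.0):
    """Unit (Heaviside) step located at x0: 0 for x < x0 and 1 for x >= x0."""
    return 1 if x >= x0 else 0

# Approaching x0 = 1 from the right gives the value at x0; from the left it does not.
print(unit_step(1.0, 1.0), unit_step(1.0 + 1e-9, 1.0), unit_step(1.0 - 1e-9, 1.0))
```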
i.i.d.
As seen already, the letters i.i.d. stand for independent and identically distributed.
Probability
Usually the probability of an event A is written
$$\Pr\{A\}$$
but occasionally we just write
$$P\{A\}.$$
REFERENCES
Blake, I.F. (1979). An Introduction to Applied Probability. Wiley, New York.
Chung, K.L. (1979). Elementary Probability Theory. Springer-Verlag, New York.
Feller, W. (1968). An Introduction to Probability Theory and its Applications. Wiley, New York.
Hogg, R.V. and Craig, A.T. (1978). Introduction to Mathematical Statistics. Macmillan, New York.
Mendenhall, W., Scheaffer, R.L. and Wackerly, D.D. (1981). Mathematical Statistics with Applications. Duxbury, Boston.