Some applications of the hypergeometric and Poisson distributions

In this chapter we will consider some practical situations in which the hypergeometric and Poisson distributions arise. We first consider a technique for estimating animal populations known as capture-recapture. This, as we shall see, involves the hypergeometric distribution. Poisson random variables arise when we consider randomly distributed points in space or time. One application of this is in the analysis of spatial patterns of plants, which is important in forestry. Finally we consider compound Poisson random variables with a view to analysing some experimental results in neurophysiology.

3.1 THE HYPERGEOMETRIC DISTRIBUTION

The hypergeometric distribution is obtained as follows. A sample of size $n$ is drawn, without replacement, from a population of size $N$ composed of $M$ individuals of type 1 and $N - M$ individuals of type 2. Then the number $X$ of individuals of type 1 in the sample is a hypergeometric random variable with probability mass function

$$p_k = \Pr\{X = k\} = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}, \qquad \max(0, n - N + M) \le k \le \min(M, n). \tag{3.1}$$

To see why $k$ is restricted to this range, note that if $n > N - M$ there must be some, and in fact at least $n - (N - M)$, type 1 individuals in the sample. Thus the smallest possible value of $X$ is the larger of 0 and $n - N + M$. Also, there can be no more than $n$ individuals of type 1 if $n \le M$, and no more than $M$ if $M \le n$. Hence the largest possible value of $X$ is the smaller of $M$ and $n$.

[Figure 3.1 Probability mass functions for hypergeometric distributions with various values of the parameters $N$, $M$ and $n$; the panels show $N = 5$, $n = 3$ with $M = 1$, $M = 2$ and $M = 3$.]

For the curious we note that the distribution is called hypergeometric because the corresponding generating function is a hypergeometric series (Kendall and Stuart, 1958). The shape of the hypergeometric distribution depends on the values of the parameters $N$, $M$ and $n$. Some examples for small parameter values are shown in Fig. 3.1. Tables are given in Lieberman and Owen (1961).

Mean and variance

The mean of $X$ is

$$E(X) = \frac{nM}{N},$$

and its variance is

$$\operatorname{Var}(X) = \frac{nM(N - n)(N - M)}{N^2(N - 1)}. \tag{3.2}$$

Proof We follow the method of Moran (1968). Introduce indicator random variables defined as follows. Let

$$X_i = \begin{cases} 1, & \text{if the $i$th member of the sample is type 1,} \\ 0, & \text{otherwise.} \end{cases}$$

Then the total number of type 1 individuals is

$$X = \sum_{i=1}^{n} X_i.$$

Each member of the sample has the same probability of being type 1. Indeed,

$$\Pr\{X_i = 1\} = \frac{M}{N}, \qquad i = 1, \ldots, n, \tag{3.3}$$

as follows from the probability law (3.1) when $k = n = 1$. Since

$$E(X_i) = \frac{M}{N},$$

we see that

$$E(X) = nE(X_i) = \frac{nM}{N}.$$

To find $\operatorname{Var}(X)$ we note that the second moment of $X$ is

$$E(X^2) = E\left( \sum_{i=1}^{n} X_i^2 + \mathop{\sum\sum}_{i \ne j} X_i X_j \right). \tag{3.4}$$

The expected value of $X_i^2$ is, from (3.3) and since $X_i^2 = X_i$,

$$E(X_i^2) = \frac{M}{N}, \qquad i = 1, 2, \ldots, n. \tag{3.5}$$

We now use (3.1) with $n = k = 2$ to get the probability that $X = 2$ when $n = 2$. This gives

$$\Pr\{X_i = 1, X_j = 1\} = \frac{\binom{M}{2}\binom{N-M}{0}}{\binom{N}{2}} = \frac{M(M-1)}{N(N-1)}.$$

Thus

$$E(X_i X_j) = \frac{M(M-1)}{N(N-1)}, \qquad i, j = 1, 2, \ldots, n, \; i \ne j, \tag{3.6}$$

and there are $n(n-1)$ such terms in (3.4). Substituting (3.5) and (3.6) in (3.4) gives

$$E(X^2) = \frac{nM}{N} + \frac{n(n-1)M(M-1)}{N(N-1)}.$$

Formula (3.2) follows from the relation $\operatorname{Var}(X) = E(X^2) - E^2(X)$.
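As a quick numerical check, the following sketch (Python; not part of the original text, and the function name is ours) evaluates the probability mass function (3.1) directly and confirms the mean and variance formulas (3.2) for one of the small parameter sets of Fig. 3.1.

```python
from math import comb

def hypergeometric_pmf(k, N, M, n):
    """Pr{X = k} from (3.1): k type-1 individuals in a sample of size n
    drawn without replacement from N individuals, M of them type 1."""
    if k < max(0, n - N + M) or k > min(M, n):
        return 0.0
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

# One of the small examples of Fig. 3.1.
N, M, n = 5, 3, 3
support = range(max(0, n - N + M), min(M, n) + 1)
mean = sum(k * hypergeometric_pmf(k, N, M, n) for k in support)
second = sum(k * k * hypergeometric_pmf(k, N, M, n) for k in support)

print(mean, n * M / N)                               # both 1.8
print(second - mean**2,
      n * M * (N - n) * (N - M) / (N**2 * (N - 1)))  # both approx. 0.36
```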
3.2 ESTIMATING A POPULATION FROM CAPTURE-RECAPTURE DATA

Assume now that a population size $N$ is unknown. The population may be the kangaroos or emus in a certain area, or perhaps the fish in a lake or stream. We wish to estimate $N$ without counting every individual. A method of doing this is called capture-recapture. Here $M$ individuals are captured, marked in some distinguishing way and then released back into the population. Later, after a satisfactory mixing of the marked and unmarked individuals is attained, a sample of size $n$ is taken from the population and the number $X$ of marked individuals is noted. This method, introduced by Laplace in 1786 to estimate the population of France, is often employed by biologists and resource managers to estimate animal populations. The method we have described is called direct sampling. Another method, inverse sampling, is considered in Exercises 5 and 6.

In the capture-recapture model the marked individuals correspond to type 1 and the unmarked to type 2. The probability that the number of marked individuals in the sample is $k$ is thus given by (3.1). Suppose now that $k$ is the observed number of marked individuals in the sample; then values of $N$ for which $\Pr\{X = k\}$ is very small are considered unlikely. One takes as an estimate, $\hat{N}$, of $N$ that value which maximizes $\Pr\{X = k\}$. $\hat{N}$ is a random variable and is called the maximum likelihood estimate of $N$.

Theorem 3.1 The maximum likelihood estimate of $N$ is

$$\hat{N} = \left[ \frac{Mn}{k} \right],$$

where $[z]$ denotes the greatest integer less than $z$.

Proof We follow Feller (1968). Holding $M$ and $n$ constant, let $\Pr\{X = k\}$ for a fixed value of $N$ be denoted by $p_N(k)$. Then

$$\frac{p_N(k)}{p_{N-1}(k)} = \frac{(N - M)(N - n)}{N(N - M - n + k)},$$

which simplifies to

$$\frac{p_N(k)}{p_{N-1}(k)} = \frac{N^2 - MN - nN + Mn}{N^2 - MN - nN + kN}.$$

We see that $p_N$ is greater than, equal to, or less than $p_{N-1}$ according as $Mn$ is greater than, equal to, or less than $kN$; or, equivalently, as $N$ is less than, equal to, or greater than $Mn/k$. Excluding for now the case in which $Mn/k$ is an integer, the sequence $\{p_N, N = 1, 2, \ldots\}$ is increasing as long as $N < Mn/k$ and decreasing for $N > Mn/k$. Thus the maximum value of $p_N$ occurs at $\hat{N} = [Mn/k]$, which is the largest integer less than $Mn/k$. In the event that $Mn/k$ is an integer, the maximum value of $p_N$ is attained at both $p_{Mn/k}$ and $p_{Mn/k - 1}$, these being equal. One may then use either

$$\hat{N} = \frac{Mn}{k} \quad \text{or} \quad \hat{N} = \frac{Mn}{k} - 1$$

as an estimate of the population. This completes the proof.

Approximate confidence intervals for N

In situations of practical interest $N$ will be much larger than both $M$ and $n$. Let us assume in fact that $N$ is large enough to regard the sampling as approximately with replacement. If $\hat{X}_i$ approximates $X_i$ in this scheme, then, for all $i$ from 1 to $n$,

$$\Pr\{\hat{X}_i = 1\} = \frac{M}{N} = 1 - \Pr\{\hat{X}_i = 0\}.$$

The approximation to $X$ is then given by

$$\hat{X} = \sum_{i=1}^{n} \hat{X}_i.$$

This is a binomial random variable with parameters $n$ and $M/N$, so that

$$\Pr\{\hat{X} = k\} = \binom{n}{k} \left( \frac{M}{N} \right)^k \left( 1 - \frac{M}{N} \right)^{n-k}, \qquad k = 0, 1, \ldots, n.$$

The expectation and variance of $\hat{X}$ are

$$E(\hat{X}) = \frac{nM}{N}, \qquad \operatorname{Var}(\hat{X}) = \frac{nM}{N} \left( 1 - \frac{M}{N} \right).$$

Furthermore, if the sample size $n$ is fairly large, the distribution of $\hat{X}$ can be approximated by that of a normal random variable with the same mean and variance (see Chapter 6). Replacing $N$ by the observed value, $\hat{N}$, of its maximum likelihood estimator gives

$$\hat{X} \doteq N\!\left( \frac{nM}{\hat{N}}, \; \frac{nM}{\hat{N}} \left( 1 - \frac{M}{\hat{N}} \right) \right),$$

where $\doteq$ means 'has approximately the same distribution as'. Ignoring the technicality of integer values, we have $nM/\hat{N} = k$, where $k$ is the observed value of $X$, so

$$\frac{M}{\hat{N}} = \frac{k}{n}, \qquad 1 - \frac{M}{\hat{N}} = 1 - \frac{k}{n},$$

and the approximating normal distribution has mean $k$ and variance $k(1 - k/n)$. Using the standard symbol $Z$ for an $N(0, 1)$ random variable and the usual notation $\Pr\{Z > z_{\alpha/2}\} = \alpha/2$, we find

$$\Pr\left\{ -z_{\alpha/2} < \frac{\hat{X} - nM/N}{\sqrt{k(1 - k/n)}} < z_{\alpha/2} \right\} \approx 1 - \alpha.$$

Setting $\hat{X}$ equal to its observed value $k$ and solving the inequalities for $N$ gives the approximate $100(1 - \alpha)\%$ confidence interval

$$\frac{nM}{k + z_{\alpha/2}\sqrt{k(1 - k/n)}} < N < \frac{nM}{k - z_{\alpha/2}\sqrt{k(1 - k/n)}}.$$
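The point estimate of Theorem 3.1 and the interval just derived are easy to compute. The following sketch (Python; the function name and the data values are ours, chosen purely for illustration) uses $z_{\alpha/2} = 1.96$ for an approximate 95% interval.

```python
from math import floor, sqrt

def capture_recapture(M, n, k, z=1.96):
    """Point estimate (Theorem 3.1) and approximate confidence interval
    for the population size N: M marked individuals released, a recapture
    sample of size n, and k marked individuals observed in it."""
    N_hat = floor(M * n / k)          # Theorem 3.1: [Mn/k]
    half = z * sqrt(k * (1 - k / n))  # z_{alpha/2} * sqrt(k(1 - k/n))
    lower = M * n / (k + half)
    # If k <= half the normal approximation gives no finite upper limit.
    upper = M * n / (k - half) if k > half else float("inf")
    return N_hat, lower, upper

# Hypothetical numbers: 100 animals marked, a recapture sample of 50
# contains 10 marked ones.
print(capture_recapture(M=100, n=50, k=10))
# -> (500, approx. 322, approx. 1122)
```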
Discussion

The above estimates have been obtained for direct sampling in the ideal situation. Before applying them in any real situation an examination of the assumptions made would be worthwhile. Among these are:

(i) The marked individuals disperse randomly and homogeneously throughout the population.
(ii) All marked individuals retain their marks.
(iii) Each individual, whether marked or not, has the same chance of being in the recaptured sample.
(iv) There are no losses due to death or emigration and no gains due to birth or immigration.

Some of these assumptions can be relaxed in a relatively simple way (see Exercise 7). In the approach mentioned earlier, called inverse sampling, the recapturing continues until a predetermined number of marked individuals is obtained. Useful refinements of the basic method presented above can be found in the references.

3.3 THE POISSON DISTRIBUTION

A random variable $X$ is a Poisson random variable with parameter $\lambda > 0$ if

$$p_k = \Pr\{X = k\} = \frac{e^{-\lambda} \lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots \tag{3.7}$$

From the definition of $e^{\lambda}$ as $\sum_{k=0}^{\infty} \lambda^k / k!$ we find

$$\sum_{k=0}^{\infty} \Pr\{X = k\} = 1.$$

The mean and variance of $X$ are easily found to be

$$E(X) = \operatorname{Var}(X) = \lambda.$$

The shape of the probability mass function depends on $\lambda$, as Table 3.1 and the graphs of Fig. 3.2 illustrate.

Table 3.1 Probability mass functions for some Poisson random variables

    lambda   p0     p1     p2     p3     p4     p5     p6     p7     p8
    1/2     .607   .303   .076   .013   .002  <.001
    1       .368   .368   .184   .061   .015   .003  <.001
    2       .135   .271   .271   .180   .090   .036   .012   .003  <.001
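The entries of Table 3.1 are straightforward to reproduce from (3.7). The short sketch below (Python; not part of the original text) does so, printing entries below .001 as "<.001" in the manner of the table.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Pr{X = k} = e^(-lambda) * lambda**k / k!  -- equation (3.7)."""
    return exp(-lam) * lam**k / factorial(k)

for lam in (0.5, 1.0, 2.0):
    probs = [poisson_pmf(k, lam) for k in range(9)]
    row = [f"{p:.3f}" if p >= 0.001 else "<.001" for p in probs]
    print(f"lambda = {lam}:", " ".join(row))
# lambda = 0.5: 0.607 0.303 0.076 0.013 0.002 <.001 ...
# lambda = 1.0: 0.368 0.368 0.184 0.061 0.015 0.003 <.001 ...
# lambda = 2.0: 0.135 0.271 0.271 0.180 0.090 0.036 0.012 0.003 <.001
```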