Some applications of the hypergeometric and Poisson distributions

In this chapter we will consider some practical situations in which the hypergeometric and Poisson distributions arise. We first consider a technique for estimating animal populations known as capture-recapture. This, as we shall see, involves the hypergeometric distribution. Poisson random variables arise when we consider randomly distributed points in space or time. One application of this is in the analysis of spatial patterns of plants, which is important in forestry. Finally we consider compound Poisson random variables with a view to analysing some experimental results in neurophysiology.

3.1 THE HYPERGEOMETRIC DISTRIBUTION

The hypergeometric distribution is obtained as follows. A sample of size $n$ is drawn, without replacement, from a population of size $N$ composed of $M$ individuals of type 1 and $N - M$ individuals of type 2. Then the number $X$ of individuals of type 1 in the sample is a hypergeometric random variable with probability mass function

$$p_k = \Pr\{X = k\} = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}, \qquad \max(0, n - N + M) \le k \le \min(M, n). \tag{3.1}$$

To see why $k$ is restricted to this range, note that if $n > N - M$ there must be some, and in fact at least $n - (N - M)$, type 1 individuals in the sample. Thus the smallest possible value of $X$ is the larger of 0 and $n - N + M$. Also, there can be no more than $n$ individuals of type 1 if $n \le M$, and no more than $M$ if $M \le n$. Hence the largest possible value of $X$ is the smaller of $M$ and $n$.

[Figure 3.1 Probability mass functions for hypergeometric distributions with various values of the parameters $N$, $M$ and $n$; the panels show $N = 5$, $n = 3$ with $M = 1$, $M = 2$ and $M = 3$.]

For the curious we note that the distribution is called hypergeometric because the corresponding generating function is a hypergeometric series (Kendall and Stuart, 1958). The shape of the hypergeometric distribution depends on the values of the parameters $N$, $M$ and $n$. Some examples for small parameter values are shown in Fig. 3.1. Tables are given in Lieberman and Owen (1961).

Mean and variance

The mean of $X$ is

$$E(X) = \frac{nM}{N},$$

and its variance is

$$\operatorname{Var}(X) = \frac{nM(N - n)(N - M)}{N^2(N - 1)}. \tag{3.2}$$

Proof We follow the method of Moran (1968). Introduce indicator random variables defined as follows. Let

$$X_i = \begin{cases} 1, & \text{if the $i$th member of the sample is type 1,} \\ 0, & \text{otherwise.} \end{cases}$$

Then the total number of type 1 individuals is

$$X = \sum_{i=1}^{n} X_i.$$

Each member of the sample has the same probability of being type 1. Indeed,

$$\Pr\{X_i = 1\} = \frac{M}{N}, \qquad i = 1, \ldots, n, \tag{3.3}$$

as follows from the probability law (3.1) when $k = n = 1$. Since

$$E(X_i) = \frac{M}{N},$$

we see that

$$E(X) = nE(X_i) = \frac{nM}{N}.$$

To find $\operatorname{Var}(X)$ we note that the second moment of $X$ is

$$E(X^2) = E\left( \sum_{i=1}^{n} X_i^2 + \mathop{\sum\sum}_{i \ne j} X_i X_j \right). \tag{3.4}$$

The expected value of $X_i^2$ is, from (3.3) and since $X_i^2 = X_i$,

$$E(X_i^2) = \frac{M}{N}, \qquad i = 1, 2, \ldots, n. \tag{3.5}$$

We now use (3.1) with $n = k = 2$ to get the probability that $X = 2$ when $n = 2$. This gives

$$\Pr\{X_i = 1, X_j = 1\} = \frac{\binom{M}{2}\binom{N-M}{0}}{\binom{N}{2}} = \frac{M(M-1)}{N(N-1)}.$$

Thus

$$E(X_i X_j) = \frac{M(M-1)}{N(N-1)}, \qquad i, j = 1, 2, \ldots, n, \; i \ne j, \tag{3.6}$$

and there are $n(n-1)$ such terms in (3.4). Substituting (3.5) and (3.6) in (3.4) gives

$$E(X^2) = \frac{nM}{N} + \frac{n(n-1)M(M-1)}{N(N-1)}.$$

Formula (3.2) follows from the relation $\operatorname{Var}(X) = E(X^2) - E^2(X)$.
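As a quick numerical check, the following sketch (Python; not part of the original text, and the function name is ours) evaluates the probability mass function (3.1) directly and confirms the mean and variance formulas (3.2) for one of the small parameter sets of Fig. 3.1.

```python
from math import comb

def hypergeometric_pmf(k, N, M, n):
    """Pr{X = k} from (3.1): k type-1 individuals in a sample of size n
    drawn without replacement from N individuals, M of them type 1."""
    if k < max(0, n - N + M) or k > min(M, n):
        return 0.0
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

# One of the small examples of Fig. 3.1.
N, M, n = 5, 3, 3
support = range(max(0, n - N + M), min(M, n) + 1)
mean = sum(k * hypergeometric_pmf(k, N, M, n) for k in support)
second = sum(k * k * hypergeometric_pmf(k, N, M, n) for k in support)

print(mean, n * M / N)                               # both 1.8
print(second - mean**2,
      n * M * (N - n) * (N - M) / (N**2 * (N - 1)))  # both approx. 0.36
```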
3.2 ESTIMATING A POPULATION FROM CAPTURE-RECAPTURE DATA

Assume now that a population size $N$ is unknown. The population may be the kangaroos or emus in a certain area, or perhaps the fish in a lake or stream. We wish to estimate $N$ without counting every individual. A method of doing this is called capture-recapture. Here $M$ individuals are captured, marked in some distinguishing way and then released back into the population. Later, after a satisfactory mixing of the marked and unmarked individuals is attained, a sample of size $n$ is taken from the population and the number $X$ of marked individuals is noted. This method, introduced by Laplace in 1786 to estimate the population of France, is often employed by biologists and resource managers to estimate animal populations. The method we have described is called direct sampling. Another method, inverse sampling, is considered in Exercises 5 and 6.

In the capture-recapture model the marked individuals correspond to type 1 and the unmarked to type 2. The probability that the number of marked individuals in the sample is $k$ is thus given by (3.1). Suppose now that $k$ is the observed number of marked individuals in the sample; then values of $N$ for which $\Pr\{X = k\}$ is very small are considered unlikely. One takes as an estimate, $\hat{N}$, of $N$ that value which maximizes $\Pr\{X = k\}$. $\hat{N}$ is a random variable and is called the maximum likelihood estimate of $N$.

Theorem 3.1 The maximum likelihood estimate of $N$ is

$$\hat{N} = \left[ \frac{Mn}{k} \right],$$

where $[z]$ denotes the greatest integer less than $z$.

Proof We follow Feller (1968). Holding $M$ and $n$ constant, let $\Pr\{X = k\}$ for a fixed value of $N$ be denoted by $p_N(k)$. Then

$$\frac{p_N(k)}{p_{N-1}(k)} = \frac{(N - M)(N - n)}{N(N - M - n + k)},$$

which simplifies to

$$\frac{p_N(k)}{p_{N-1}(k)} = \frac{N^2 - MN - nN + Mn}{N^2 - MN - nN + kN}.$$

We see that $p_N$ is greater than, equal to, or less than $p_{N-1}$ according as $Mn$ is greater than, equal to, or less than $kN$; or, equivalently, as $N$ is less than, equal to, or greater than $Mn/k$. Excluding for now the case in which $Mn/k$ is an integer, the sequence $\{p_N, N = 1, 2, \ldots\}$ is increasing as long as $N < Mn/k$ and decreasing for $N > Mn/k$. Thus the maximum value of $p_N$ occurs at $\hat{N} = [Mn/k]$, which is the largest integer less than $Mn/k$. In the event that $Mn/k$ is an integer, the maximum value of $p_N$ is attained at both $p_{Mn/k}$ and $p_{Mn/k - 1}$, these being equal. One may then use either

$$\hat{N} = \frac{Mn}{k} \quad \text{or} \quad \hat{N} = \frac{Mn}{k} - 1$$

as an estimate of the population. This completes the proof.

Approximate confidence intervals for N

In situations of practical interest $N$ will be much larger than both $M$ and $n$. Let us assume in fact that $N$ is large enough to regard the sampling as approximately with replacement. If $\hat{X}_i$ approximates $X_i$ in this scheme, then, for all $i$ from 1 to $n$,

$$\Pr\{\hat{X}_i = 1\} = \frac{M}{N} = 1 - \Pr\{\hat{X}_i = 0\}.$$

The approximation to $X$ is then given by

$$\hat{X} = \sum_{i=1}^{n} \hat{X}_i.$$

This is a binomial random variable with parameters $n$ and $M/N$, so that

$$\Pr\{\hat{X} = k\} = \binom{n}{k} \left( \frac{M}{N} \right)^k \left( 1 - \frac{M}{N} \right)^{n-k}, \qquad k = 0, 1, \ldots, n.$$

The expectation and variance of $\hat{X}$ are

$$E(\hat{X}) = \frac{nM}{N}, \qquad \operatorname{Var}(\hat{X}) = \frac{nM}{N} \left( 1 - \frac{M}{N} \right).$$

Furthermore, if the sample size $n$ is fairly large, the distribution of $\hat{X}$ can be approximated by that of a normal random variable with the same mean and variance (see Chapter 6). Replacing $N$ by the observed value, $\hat{N}$, of its maximum likelihood estimator gives

$$\hat{X} \doteq N\!\left( \frac{nM}{\hat{N}}, \; \frac{nM}{\hat{N}} \left( 1 - \frac{M}{\hat{N}} \right) \right),$$

where $\doteq$ means 'has approximately the same distribution as'. Ignoring the technicality of integer values, we have $nM/\hat{N} = k$, where $k$ is the observed value of $X$, so

$$\frac{M}{\hat{N}} = \frac{k}{n}, \qquad 1 - \frac{M}{\hat{N}} = 1 - \frac{k}{n},$$

and the approximating normal distribution has mean $k$ and variance $k(1 - k/n)$. Using the standard symbol $Z$ for an $N(0, 1)$ random variable and the usual notation $\Pr\{Z > z_{\alpha/2}\} = \alpha/2$, we find

$$\Pr\left\{ -z_{\alpha/2} < \frac{\hat{X} - nM/N}{\sqrt{k(1 - k/n)}} < z_{\alpha/2} \right\} \approx 1 - \alpha.$$

Setting $\hat{X}$ equal to its observed value $k$ and solving the inequalities for $N$ gives the approximate $100(1 - \alpha)\%$ confidence interval

$$\frac{nM}{k + z_{\alpha/2}\sqrt{k(1 - k/n)}} < N < \frac{nM}{k - z_{\alpha/2}\sqrt{k(1 - k/n)}}.$$
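The point estimate of Theorem 3.1 and the interval just derived are easy to compute. The following sketch (Python; the function name and the data values are ours, chosen purely for illustration) uses $z_{\alpha/2} = 1.96$ for an approximate 95% interval.

```python
from math import floor, sqrt

def capture_recapture(M, n, k, z=1.96):
    """Point estimate (Theorem 3.1) and approximate confidence interval
    for the population size N: M marked individuals released, a recapture
    sample of size n, and k marked individuals observed in it."""
    N_hat = floor(M * n / k)          # Theorem 3.1: [Mn/k]
    half = z * sqrt(k * (1 - k / n))  # z_{alpha/2} * sqrt(k(1 - k/n))
    lower = M * n / (k + half)
    # If k <= half the normal approximation gives no finite upper limit.
    upper = M * n / (k - half) if k > half else float("inf")
    return N_hat, lower, upper

# Hypothetical numbers: 100 animals marked, a recapture sample of 50
# contains 10 marked ones.
print(capture_recapture(M=100, n=50, k=10))
# -> (500, approx. 322, approx. 1122)
```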
Discussion

The above estimates have been obtained for direct sampling in the ideal situation. Before applying them in any real situation an examination of the assumptions made would be worthwhile. Among these are:

(i) The marked individuals disperse randomly and homogeneously throughout the population.
(ii) All marked individuals retain their marks.
(iii) Each individual, whether marked or not, has the same chance of being in the recaptured sample.
(iv) There are no losses due to death or emigration and no gains due to birth or immigration.

Some of these assumptions can be relaxed in a relatively simple way (see Exercise 7). In the approach mentioned earlier, called inverse sampling, the recapturing continues until a predetermined number of marked individuals is obtained. Useful refinements of the basic method presented above can be found in the references.

3.3 THE POISSON DISTRIBUTION

A random variable $X$ is a Poisson random variable with parameter $\lambda > 0$ if

$$p_k = \Pr\{X = k\} = \frac{e^{-\lambda} \lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots \tag{3.7}$$

From the definition of $e^{\lambda}$ as $\sum_{k=0}^{\infty} \lambda^k / k!$ we find

$$\sum_{k=0}^{\infty} \Pr\{X = k\} = 1.$$

The mean and variance of $X$ are easily found to be

$$E(X) = \operatorname{Var}(X) = \lambda.$$

The shape of the probability mass function depends on $\lambda$, as Table 3.1 and the graphs of Fig. 3.2 illustrate.

Table 3.1 Probability mass functions for some Poisson random variables

    lambda   p0     p1     p2     p3     p4     p5     p6     p7     p8
    1/2     .607   .303   .076   .013   .002  <.001
    1       .368   .368   .184   .061   .015   .003  <.001
    2       .135   .271   .271   .180   .090   .036   .012   .003  <.001
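The entries of Table 3.1 are straightforward to reproduce from (3.7). The short sketch below (Python; not part of the original text) does so, printing entries below .001 as "<.001" in the manner of the table.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Pr{X = k} = e^(-lambda) * lambda**k / k!  -- equation (3.7)."""
    return exp(-lam) * lam**k / factorial(k)

for lam in (0.5, 1.0, 2.0):
    probs = [poisson_pmf(k, lam) for k in range(9)]
    row = [f"{p:.3f}" if p >= 0.001 else "<.001" for p in probs]
    print(f"lambda = {lam}:", " ".join(row))
# lambda = 0.5: 0.607 0.303 0.076 0.013 0.002 <.001 ...
# lambda = 1.0: 0.368 0.368 0.184 0.061 0.015 0.003 <.001 ...
# lambda = 2.0: 0.135 0.271 0.271 0.180 0.090 0.036 0.012 0.003 <.001
```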