Quasi-Likelihood And Its Application: A General Approach to Optimal Parameter Estimation
Christopher C. Heyde
Springer

Preface

This book is concerned with the general theory of optimal estimation of parameters in systems subject to random effects and with the application of this theory. The focus is on choice of families of estimating functions, rather than the estimators derived therefrom, and on optimization within these families. Only assumptions about means and covariances are required for an initial discussion. Nevertheless, the theory that is developed mimics that of maximum likelihood, at least to the first order of asymptotics.

The term quasi-likelihood has often had a narrow interpretation, associated with its application to generalized linear model type contexts, while that of optimal estimating functions has embraced a broader concept. There is, however, no essential distinction between the underlying ideas, and the term quasi-likelihood has herein been adopted as the general label. This emphasizes its role in the extension of likelihood based theory. The idea throughout involves finding quasi-scores from families of estimating functions. Then, the quasi-likelihood estimator is derived from the quasi-score by equating to zero and solving, just as the maximum likelihood estimator is derived from the likelihood score.

This book had its origins in a set of lectures given in September 1991 at the 7th Summer School on Probability and Mathematical Statistics held in Varna, Bulgaria, the notes of which were published as Heyde (1993). Subsets of the material were also covered in advanced graduate courses at Columbia University in the Fall Semesters of 1992 and 1996. The work originally had a quite strong emphasis on inference for stochastic processes, but the focus gradually broadened over time. Discussions with V.P. Godambe and with R. Morton have been particularly influential in helping to form my views.

The subject of estimating functions has evolved quite rapidly over the period during which the book was written, and important developments have been emerging so fast as to preclude any attempt at exhaustive coverage. Among the topics omitted is that of quasi-likelihood in survey sampling, which has generated quite an extensive literature (see the edited volume Godambe (1991), Part 4 and references therein), and also the emergent linkage with Bayesian statistics (e.g., Godambe (1994)). It became quite evident at the Conference on Estimating Functions held at the University of Georgia in March 1996 that a book in the area was much needed, as many known ideas were being rediscovered. This realization provided the impetus to round off the project rather earlier than would otherwise have been the case.

The emphasis in the monograph is on concepts rather than on mathematical theory. Indeed, formalities have been suppressed to avoid obscuring "typical" results with the phalanx of regularity conditions and qualifiers necessary to avoid the usual uninformative types of counterexamples which detract from most statistical paradigms. In discussing theory which holds to the first order of asymptotics, the treatment is especially informal, as befits the context. Sufficient conditions which ensure the behaviour described are not difficult to furnish but are fundamentally unenlightening. A collection of complements and exercises has been included to make the material more useful in a teaching environment, and the book should be suitable for advanced courses and seminars.
Prerequisites are sound basic courses in measure theoretic probability and in statistical inference. Comments and advice from students and other colleagues has also contributed much to the final form of the book. In addition to V.P. Godambe and R. Morton mentioned above, grateful thanks are due in particular to Y.-X. Lin, A. Thavaneswaran, I.V. Basawa, E. Saavendra and T. Zajic for suggesting corrections and other improvements and to my wife Beth for her encouragement. C.C. Heyde Canberra, Australia February 1997 Contents Preface v 1 Introduction 1 1.1 The Brief . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.3 The Gauss-Markov Theorem . . . . . . . . . . . . . . . . . . . 3 1.4 Relationship with the Score Function . . . . . . . . . . . . . . . 6 1.5 The Road Ahead . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.6 The Message of the Book . . . . . . . . . . . . . . . . . . . . . 10 1.7 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2 The General Framework 11 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 Fixed Sample Criteria . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Scalar Equivalences and Associated Results . . . . . . . . . . . 19 2.4 Wedderburn's Quasi-Likelihood . . . . . . . . . . . . . . . . . . 21 2.4.1 The Framework . . . . . . . . . . . . . . . . . . . . . . . 21 2.4.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.4.3 Generalized Estimating Equations . . . . . . . . . . . . 25 2.5 Asymptotic Criteria . . . . . . . . . . . . . . . . . . . . . . . . 26 2.6 A Semimartingale Model for Applications . . . . . . . . . . . . 30 2.7 Some Problem Cases for the Methodology . . . . . . . . . . . . 35 2.8 Complements and Exercises . . . . . . . . . . . . . . . . . . . . 38 3 An Alternative Approach: E-Sufficiency 43 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2 Definitions and Notation . . . . . . . . . . . . . . . . . . . . . . 43 3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.4 Complement and Exercise . . . . . . . . . . . . . . . . . . . . . 51 4 Asymptotic Confidence Zones of Minimum Size 53 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.2 The Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.3 Confidence Zones: Theory . . . . . . . . . . . . . . . . . . . . . 56 vii viii CONTENTS 4.4 Confidence Zones: Practice . . . . . . . . . . . . . . . . . . . . 60 4.5 On Best Asymptotic Confidence Intervals . . . . . . . . . . . . 62 4.5.1 Introduction and Results . . . . . . . . . . . . . . . . . 62 4.5.2 Proof of Theorem 4.1 . . . . . . . . . . . . . . . . . . . 64 4.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5 Asymptotic Quasi-Likelihood 69 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.2 The Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.3 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 5.3.1 Generalized Linear Model . . . . . . . . . . . . . . . . . 79 5.3.2 Heteroscedastic Autoregressive Model . . . . . . . . . . 79 5.3.3 Whittle Estimation Procedure . . . . . . . . . . . . . . 82 5.3.4 Addendum to the Example of Section 5.1 . . . . . . . . 87 5.4 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.5 Exercises . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . 88 6 Combining Estimating Functions 91 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 6.2 Composite Quasi-Likelihoods . . . . . . . . . . . . . . . . . . . 92 6.3 Combining Martingale Estimating Functions . . . . . . . . . . 93 6.3.1 An Example . . . . . . . . . . . . . . . . . . . . . . . . 98 6.4 Application. Nested Strata of Variation . . . . . . . . . . . . . 99 6.5 State-Estimation in Time Series . . . . . . . . . . . . . . . . . . 103 6.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 7 Projected Quasi-Likelihood 107 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 7.2 Constrained Parameter Estimation . . . . . . . . . . . . . . . . 107 7.2.1 Main Results . . . . . . . . . . . . . . . . . . . . . . . . 109 7.2.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 111 7.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 112 7.3 Nuisance Parameters . . . . . . . . . . . . . . . . . . . . . . . . 113 7.4 Generalizing the E-M Algorithm: The P-S Method . . . . . . . 116 7.4.1 From Log-Likelihood to Score Function . . . . . . . . . 117 7.4.2 From Score to Quasi-Score . . . . . . . . . . . . . . . . 118 7.4.3 Key Applications . . . . . . . . . . . . . . . . . . . . . . 121 7.4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 122 7.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 8 Bypassing the Likelihood 129 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129 8.2 The REML Estimating Equations . . . . . . . . . . . . . . . . . 129 8.3 Parameters in Diffusion Type Processes . . . . . . . . . . . . . 131 8.4 Estimation in Hidden Markov Random Fields . . . . . . . . . . 136 8.5 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 CONTENTS ix 9 Hypothesis Testing 141 9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 9.2 The Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 9.3 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145 10 Infinite Dimensional Problems 147 10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 10.2 Sieves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 10.3 Semimartingale Models . . . . . . . . . . . . . . . . . . . . . . 148 11 Miscellaneous Applications 153 11.1 Estimating the Mean of a Stationary Process . . . . . . . . . . 153 11.2 Estimation for a Heteroscedastic Regression . . . . . . . . . . . 159 11.3 Estimating the Infection Rate in an Epidemic . . . . . . . . . . 162 11.4 Estimating Population Size . . . . . . . . . . . . . . . . . . . . 164 11.5 Robust Estimation . . . . . . . . . . . . . . . . . . . . . . . . . 169 11.5.1 Optimal Robust Estimating Functions . . . . . . . . . . 170 11.5.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 173 11.6 Recursive Estimation . . . . . . . . . . . . . . . . . . . . . . . . 176 12 Consistency and Asymptotic Normality for Estimating Functions 179 12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 12.2 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180 12.3 The SLLN for Martingales . . . . . . . . . . . . . . . . . . . . . 186 12.4 The CLT for Martingales . . . . . . . . . . . . . . . . . . . . . 190 12.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
195 13 Complements and Strategies for Application 199 13.1 Some Useful Families of Estimating Functions . . . . . . . . . . 199 13.1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 199 13.1.2 Transform Martingale Families . . . . . . . . . . . . . . 199 13.1.3 Use of the Infinitesimal Generator of a Markov Process 200 13.2 Solution of Estimating Equations . . . . . . . . . . . . . . . . . 201 13.3 Multiple Roots . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 13.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 202 13.3.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 204 13.3.3 Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 13.4 Resampling Methods . . . . . . . . . . . . . . . . . . . . . . . . 210 References 211 Index 227 Chapter 1 Introduction 1.1 The Brief This monograph is primarily concerned with parameter estimation for a random process {Xt} taking values in r-dimensional Euclidean space. The distribution of Xt depends on a characteristic taking values in a open subset of p-dimensional Euclidean space. The framework may be parametric or semiparametric; may be, for example, the mean of a stationary process. The object will be the "efficient" estimation of based on a sample {Xt, t T}. 1.2 Preliminaries Historically there are two principal themes in statistical parameter estimation theory: least squares (LS) - introduced by Gauss and Legendre and founded on finite sample considerations (minimum distance interpretation) maximum likelihood (ML) - introduced by Fisher and with a justification that is primarily asymptotic (minimum size asymptotic confidence intervals, ideas of which date back to Laplace) It is now possible to unify these approaches under the general description of quasi-likelihood and to develop the theory of parameter estimation in a very general setting. The fixed sample optimality ideas that underly quasi-likelihood date back to Godambe (1960) and Durbin (1960) and were put into a stochastic process setting in Godambe (1985). The asymptotic justification is due to Heyde (1986). The ideas were combined in Godambe and Heyde (1987). It turns out that the theory needs to be developed in terms of estimating functions (functions of both the data and the parameter) rather than the estimators themselves. Thus, our focus will be on functions that have the value of the parameter as a root rather than the parameter itself. The use of estimating functions dates back at least to K. Pearson's introduction of the method of moments (1894) although the term "estimating function" may have been coined by Kimball (1946). Furthermore, all the standard methods of estimation, such as maximum likelihood, least-squares, conditional least-squares, minimum chi-squared, and M-estimation, are included under minor regularity conditions. The subject has now developed to the stage where books are being devoted to it, e.g., Godambe (1991), McLeish and Small (1988). 1 2 CHAPTER 1. INTRODUCTION The rationale for the use of the estimating function rather than the estimator derived therefrom lies in its more fundamental character. The following dot points illustrate the principle. * Estimating functions have the property of invariance under one-to-one transformations of the parameter . * Under minor regularity conditions the score function (derivative of the log-likelihood with respect to the parameter), which is an estimating function, provides a minimal sufficient partitioning of the sample space. 
However, there is often no single sufficient statistic. For example, suppose that $\{Z_t\}$ is a Galton-Watson process with offspring mean $E(Z_1 \mid Z_0 = 1) = \theta$. Suppose that the offspring distribution belongs to the power series family (which is the discrete exponential family). Then, the score function is
$$U_T(\theta) = c \sum_{t=1}^T (Z_t - \theta Z_{t-1}),$$
where $c$ is a constant, and the maximum likelihood estimator
$$\hat{\theta}_T = \sum_{t=1}^T Z_t \Big/ \sum_{t=1}^T Z_{t-1}$$
is not a sufficient statistic. Details are given in Chapter 2.

* Fisher's information is an estimating function property (namely, the variance of the score function) rather than that of the maximum likelihood estimator (MLE).

* The Cramér-Rao inequality is an estimating function property rather than a property of estimators. It gives the variance of the score function as a bound on the variances of standardized estimating functions.

* The asymptotic properties of an estimator are almost invariably obtained, as in the case of the MLE, via the asymptotics of the estimating function and then transferred to the parameter space via local linearity.

* Separate estimating functions, each with information to offer about an unknown parameter, can be combined much more readily than the estimators therefrom.

We shall begin our discussion by examining the minimum variance ideas that underlie least squares and then see how optimality is conveniently phrased in terms of estimating functions. Subsequently, we shall show how the score function and maximum likelihood ideas mesh with this. The approach is along the general lines of the brief overviews that appear in Godambe and Heyde (1987), Heyde (1989b), Desmond (1991), Godambe and Kale (1991). An earlier version appeared in the lecture notes Heyde (1993). Another approach to the subject of optimal estimation, which also uses estimating functions but is based on extension of the idea of sufficiency, appears in McLeish and Small (1988); the theories do substantially overlap, although this is not immediately transparent. Details are provided in Chapter 3.

1.3 Estimating Functions and the Gauss-Markov Theorem

To indicate the basic LS ideas that we wish to incorporate, we consider the simplest case of independent random variables (rv's) and a one-dimensional parameter $\theta$. Suppose that $X_1, \ldots, X_T$ are independent rv's with $EX_t = \theta$, $\operatorname{var} X_t = \sigma^2$. In this context the Gauss-Markov theorem has the following form.

GM Theorem: Let the estimator $S_T = \sum_{t=1}^T a_t X_t$ be unbiased for $\theta$, the $a_t$ being constants. Then, the variance $\operatorname{var} S_T$ is minimized for $a_t = 1/T$, $t = 1, \ldots, T$. That is, the sample mean $\bar{X} = T^{-1} \sum_{t=1}^T X_t$ is the linear unbiased minimum variance estimator of $\theta$.

The proof is very simple; we have to minimize $\operatorname{var} S_T = \sigma^2 \sum_{t=1}^T a_t^2$ subject to $\sum_{t=1}^T a_t = 1$, and
$$\operatorname{var} S_T = \sigma^2 \sum_{t=1}^T \left( a_t^2 - \frac{2 a_t}{T} + \frac{1}{T^2} \right) + \frac{\sigma^2}{T} = \sigma^2 \sum_{t=1}^T \left( a_t - \frac{1}{T} \right)^2 + \frac{\sigma^2}{T} \ \ge \ \frac{\sigma^2}{T}.$$

Now we can restate the GM theorem in terms of estimating functions. Consider the set $\mathcal{G}_0$ of unbiased estimating functions $G = G(X_1, \ldots, X_T, \theta)$ of the form $G = \sum_{t=1}^T b_t (X_t - \theta)$, the $b_t$'s being constants with $\sum_{t=1}^T b_t \ne 0$. Note that the estimating functions $kG$, $k$ constant, and $G$ produce the same estimator, namely $\sum_{t=1}^T b_t X_t / \sum_{t=1}^T b_t$, so some standardization is necessary if variances are to be compared. One possible standardization is to define the standardized version of $G$ as
$$G^{(s)} = \left( \sum_{t=1}^T b_t \right) \left( \sigma^2 \sum_{t=1}^T b_t^2 \right)^{-1} \sum_{t=1}^T b_t (X_t - \theta).$$
The estimator of $\theta$ is unchanged and, of course, $kG$ and $G$ have the same standardized form. Let us now motivate this standardization.
(1) In order to be used as an estimating equation, the estimating function $G$ needs to be as close to zero as possible when $\theta$ is the true value. Thus we want $\operatorname{var} G = \sigma^2 \sum_{t=1}^T b_t^2$ to be as small as possible. On the other hand, we want $G(\theta + \delta)$, $\delta > 0$, to differ as much as possible from $G(\theta)$ when $\theta$ is the true value. That is, we want
$$\big(E \dot{G}(\theta)\big)^2 = \left( \sum_{t=1}^T b_t \right)^2,$$
the dot denoting derivative with respect to $\theta$, to be as large as possible. These requirements can be combined by maximizing $\operatorname{var} G^{(s)} = (E\dot{G})^2 / EG^2$.

(2) Also, if $\max_{1 \le t \le T} b_t / \sum_{t=1}^T b_t \to 0$ as $T \to \infty$, then
$$\sum_{t=1}^T b_t (X_t - \theta) \Big/ \left( \sigma^2 \sum_{t=1}^T b_t^2 \right)^{1/2} \ \xrightarrow{d} \ N(0, 1)$$
using the Lindeberg-Feller central limit theorem. Thus, noting that our estimator for $\theta$ is $\hat{\theta}_T = \sum_{t=1}^T b_t X_t / \sum_{t=1}^T b_t$, we have
$$\big( \operatorname{var} G^{(s)}_T \big)^{1/2} (\hat{\theta}_T - \theta) \ \xrightarrow{d} \ N(0, 1),$$
i.e., $\hat{\theta}_T - \theta$ is asymptotically $N\big(0, (\operatorname{var} G^{(s)}_T)^{-1}\big)$. We would wish to choose the best asymptotic confidence intervals for $\theta$ and hence to maximize $\operatorname{var} G^{(s)}_T$.

(3) For the standardized version $G^{(s)}$ of $G$ we have
$$\operatorname{var} G^{(s)} = \left( \sum_{t=1}^T b_t \right)^2 \Big/ \sigma^2 \sum_{t=1}^T b_t^2 = -E \dot{G}^{(s)},$$
i.e., $G^{(s)}$ possesses the standard likelihood score property.

Having introduced standardization we can say that $G^* \in \mathcal{G}_0$ is an optimal estimating function within $\mathcal{G}_0$ if
$$\operatorname{var} G^{*(s)} \ \ge \ \operatorname{var} G^{(s)}, \qquad G \in \mathcal{G}_0.$$
This leads to the following result.

GM Reformulation. The estimating function $G^* = \sum_{t=1}^T (X_t - \theta)$ is an optimal estimating function within $\mathcal{G}_0$. The estimating equation $G^* = 0$ provides the sample mean as an optimal estimator of $\theta$.

The proof follows immediately from the Cauchy-Schwarz inequality. For $G \in \mathcal{G}_0$ we have
$$\operatorname{var} G^{(s)} = \left( \sum_{t=1}^T b_t \right)^2 \Big/ \sigma^2 \sum_{t=1}^T b_t^2 \ \le \ T/\sigma^2 = \operatorname{var} G^{*(s)},$$
and the argument holds even if the $b_t$'s are functions of $\theta$.

Now the formulation that we adopted can be extended to estimating functions $G$ in general by defining the standardized version of $G$ as
$$G^{(s)} = -(E \dot{G}) (EG^2)^{-1} G.$$
Optimality based on maximization of $\operatorname{var} G^{(s)}$ leads us to define $G^*$ to be optimal within a class $\mathcal{H}$ if
$$\operatorname{var} G^{*(s)} \ \ge \ \operatorname{var} G^{(s)}, \qquad G \in \mathcal{H}.$$
That this concept does differ from least squares in some important respects is illustrated in the following example. We now suppose that $X_t$, $t = 1, 2, \ldots, T$, are independent rv's with $EX_t = \mu_t(\theta)$, $\operatorname{var} X_t = \sigma_t^2(\theta)$, the $\mu_t$'s, $\sigma_t^2$'s being specified differentiable functions. Then, for the class of estimating functions
$$\mathcal{H} = \left\{ H : H = \sum_{t=1}^T b_t(\theta) \, (X_t - \mu_t(\theta)) \right\},$$
we have
$$\operatorname{var} H^{(s)} = \left( \sum_{t=1}^T b_t(\theta) \, \dot{\mu}_t(\theta) \right)^2 \Big/ \sum_{t=1}^T b_t^2(\theta) \, \sigma_t^2(\theta),$$
which is maximized (again using the Cauchy-Schwarz inequality) if
$$b_t(\theta) = k(\theta) \, \dot{\mu}_t(\theta) \, \sigma_t^{-2}(\theta), \qquad t = 1, 2, \ldots, T,$$
$k(\theta)$ being an undetermined multiplier. Thus, an optimal estimating function is
$$H^* = \sum_{t=1}^T \dot{\mu}_t(\theta) \, \sigma_t^{-2}(\theta) \, (X_t - \mu_t(\theta)).$$
Note that this result is not what one gets from least squares (LS). If we applied LS, we would minimize $\sum_{t=1}^T (X_t - \mu_t(\theta))^2 \, \sigma_t^{-2}(\theta)$, which leads to the estimating equation
$$\sum_{t=1}^T \dot{\mu}_t(\theta) \, \sigma_t^{-2}(\theta) \, (X_t - \mu_t(\theta)) + \sum_{t=1}^T (X_t - \mu_t(\theta))^2 \, \sigma_t^{-3}(\theta) \, \dot{\sigma}_t(\theta) = 0.$$
This estimating equation will generally not be unbiased, and it may behave very badly depending on the $\sigma_t$'s. It will not in general provide a consistent estimator.

1.4 Relationship with the Score Function

Now suppose that $\{X_t, t = 1, 2, \ldots, T\}$ has likelihood function $L = \prod_{t=1}^T f_t(X_t; \theta)$. The score function in this case is a sum of independent rv's with zero means,
$$U = \partial \log L / \partial\theta = \sum_{t=1}^T \partial \log f_t(X_t; \theta) / \partial\theta,$$
and, when $H = \sum_{t=1}^T b_t(\theta)(X_t - \mu_t(\theta))$, we have
$$E(UH) = \sum_{t=1}^T b_t(\theta) \, E\left[ \frac{\partial \log f_t(X_t; \theta)}{\partial\theta} \, (X_t - \mu_t(\theta)) \right].$$
If the $f_t$'s are such that integration and differentiation can be interchanged,
$$E\left[ \frac{\partial \log f_t(X_t; \theta)}{\partial\theta} \, X_t \right] = \frac{\partial}{\partial\theta} EX_t = \dot{\mu}_t(\theta),$$
so that
$$E(UH) = \sum_{t=1}^T b_t(\theta) \, \dot{\mu}_t(\theta) = -E\dot{H}.$$
Also, using corr to denote correlation, corr2 (U, H) = (E(U H))2 /(EU2 )(EH2 ) = (var H(s) )/EU2 , which is maximized if var H(s) is maximized. That is, the choice of an optimal estimating function H H is giving an element of H that has maximum correlation with the generally unknown score function. Next, for the score function U and H H we find that E(H(s) - U(s) )2 = var H(s) + var U(s) - 2E(H(s) U(s) ) = EU2 - var H(s) , since U(s) = U 1.5. THE ROAD AHEAD 7 and EH(s) U(s) = var H(s) when differentiation and integration can be interchanged. Thus E(H(s) -U(s) )2 is minimized when an optimal estimating function H H is chosen. This gives an optimal estimating function the interpretation of having minimum expected distance from the score function. Note also that var H(s) EU2 , which is the Cram´er-Rao inequality. Of course, if the score function U H, the methodology picks out U as optimal. In the case in question U H if and only if U is of the form U = T t=1 bt() (Xt - t()), that is, log f(Xt; ) = bt() (Xt - t()), so that the Xt's are from an exponential family in linear form. Classical quasi-likelihood was introduced in the setting discussed above by Wedderburn (1974). It was noted by Bradley (1973) and Wedderburn (1974) that if the Xt's have exponential family distributions in which the canonical statistics are linear in the data, then the score function depends on the parameters only through the means and variances. They also noted that the score function could be written as a weighted least squares estimating function. Wedderburn suggested using the exponential family score function even when the underlying distribution was unspecified. In such a case the estimating function was called a quasi-score estimating function and the estimator derived therefore a quasi-likelihood estimator. The concept of optimal estimating functions discussed above conveniently subsumes that of quasi-score estimating functions in the Wedderburn sense, as we shall discuss in vector form in Chapter 2. We shall, however, in our general theory, take the names quasi-score and optimal for estimating functions to be essentially synonymous. 1.5 The Road Ahead In the above discussion we have concentrated on the simplest case of independent random variables and a scalar parameter, but the basis of a general formulation of the quasi-likelihood methodology is already evident. In Chapter 2, quasi-likelihood is developed in its general framework of a (finite dimensional) vector valued parameter to be estimated from vector valued data. Quasi-likelihood estimators are derived from quasi-score estimating functions whose selection involves maximization of a matrix valued information criterion in the partial order of non-negative definite matrices. Both fixed 8 CHAPTER 1. INTRODUCTION sample and asymptotic formulations are considered and the conditions under which they hold are shown to be substantially overlapping. Also, since matrix valued criteria are not always easy to work with, some scalar equivalences are formulated. Here there is a strong link with the theory of optimal experimental design. The original Wedderburn formulation of quasi-likelihood in an exponential family setting is then described together with the limitations of its direct extension. Also treated is the closely related methodology of generalized estimating equations, developed for longitudinal data sets and typically using approximate covariance matrices in the quasi-score estimating function. 
The basic formulation having been provided, it is now shown how a semimartingale model leads to a convenient class of estimating functions of wide applicability. Various illustrations are provided showing how to use these ideas in practice, and some discussion of problem cases is also given. Chapter 3 outlines an alternative approach to optimal estimation using estimating functions via the concepts of E-sufficiency and E-ancillarity. Here E refers to expectation. This approach, due to McLeish and Small, produces results that overlap substantially with those of quasi-likelihood, although this is not immediately apparent. The view is taken in this book that quasi-likelihood methodology is more transparent and easier to apply. Chapter 4 is concerned with asymptotic confidence zones. Under the usual sort of regularity conditions, quasi-likelihood estimators are associated with minimum size asymptotic confidence intervals within their prespecified spaces of estimating functions. Attention is given to the subtle question of whether to normalize with random variables or constants in order to obtain the smallest intervals. Random normings have some important advantages. Ordinary quasi-likelihood theory is concerned with the case where the maximum information criterion holds exactly for fixed T or for each T as T . Chapter 5 deals with the case where optimality holds only in a certain asymptotic sense. This may happen, for example, when a nuisance parameter is replaced by a consistent estimator thereof. The discussion focuses on situations where the properties of regular quasi-likelihood of consistency and possession of minimum size asymptotic confidence zones are preserved for the estimator. Estimating functions from different sources can conveniently be added, and the issue of their optimal combination is addressed in Chapter 6. Various applications are given, including dealing with combinations of estimating functions where there are nested strata of variation and providing methods of filtering and smoothing in time series estimation. The well-known Kalman filter is a special case. Chapter 7 deals with projection methods that are useful in situations where a standard application of quasi-likelihood is precluded. Quasi-likelihood approaches are provided for constrained parameter estimation, for estimation in the presence of nuisance parameters, and for generalizing the E-M algorithm for estimation where there are missing data. In Chapter 8 the focus is on deriving the score function, or more generally quasi-score estimating function, without use of the likelihood, which may be 1.5. THE ROAD AHEAD 9 difficult to deal with, or fail to exist, under minor perturbations of standard conditions. Simple quasi-likelihood derivations of the score functions are provided for estimating the parameters in the covariance matrix, where the distribution is multivariate normal (REML estimation), in diffusion type models, and in hidden Markov random fields. In each case these remain valid as quasi-score estimating functions under significantly broadened assumptions over those of a likelihood based approach. Chapter 9 deals briefly with issues of hypothesis testing. Generalizations of the classical efficient scores statistic and Wald test statistic are treated. 
These are shown to usually be asymptotically $\chi^2$ distributed under the null hypothesis and to have, asymptotically, noncentral $\chi^2$ distributions, with maximum noncentrality parameter, under the alternative hypothesis, when the quasi-score estimating function is used.

Chapter 10 provides a brief discussion of infinite dimensional parameter (function) estimation. A sketch is given of the method of sieves, in which the dimension of the parameter is increased as the sample size increases. An informal treatment of estimation in linear semimartingale models, such as occur for counting processes and estimation of the cumulative hazard function, is also provided.

A diverse collection of applications is given in Chapter 11. Estimation is discussed for the mean of a stationary process, a heteroscedastic regression, the infection rate of an epidemic, and a population size via a multiple recapture experiment. Also treated are estimation via robustified estimating functions (possibly with components that are bounded functions of the data) and recursive estimation (for example, for on-line signal processing).

Chapter 12 treats the issues of consistency and asymptotic normality of estimators. Throughout the book it is usually expected that these will ordinarily hold under appropriate regularity conditions. The focus here is on martingale based methods, and general forms of the martingale strong law and central limit theorems are provided for use in particular cases. The view is taken that it is mostly preferable to check cases individually directly rather than to rely on general theory with its multiplicity of regularity conditions.

Finally, in Chapter 13 a number of complementary issues involved in the use of quasi-likelihood methods are discussed. The chapter begins with a collection of methods for generating useful families of estimating functions. Integral transform families and the use of the infinitesimal generator of a Markov process are treated. Then, the numerical solution of estimating equations is considered, and methods are examined for dealing with multiple roots when a scalar objective function may not be available. The final section is concerned with resampling methods for the provision of confidence intervals, in particular the jackknife and bootstrap.

1.6 The Message of the Book

For estimation of parameters in stochastic systems of any kind, it has become increasingly clear that it is possible to replace likelihood based techniques by quasi-likelihood alternatives, in which only assumptions about means and variances are made, in order to obtain estimators. There is often little, if any, loss in efficiency, and all the advantages of weighted least squares methods are also incorporated. Additional assumptions are, of course, required to ensure consistency of estimators and to provide confidence intervals. If it is available, the likelihood approach does provide a basis for benchmarking of estimating functions, but not more than that. It is conjectured that everything that can be done via likelihoods has a corresponding quasi-likelihood generalization.

1.7 Exercise

1. Suppose $\{X_i, i = 1, 2, \ldots\}$ is a sequence of independent rv's, $X_i$ having a Bernoulli distribution with
$$P(X_i = 1) = p_i = \tfrac{1}{2} + \theta a_i, \qquad P(X_i = 0) = 1 - p_i,$$
where $0 < a_i \to 0$ as $i \to \infty$. Show that there is a consistent estimator of $\theta$ if and only if $\sum_{i=1}^{\infty} a_i^2 = \infty$. (Adapted from Dion and Ferland (1995).)
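To make the consistency condition of this exercise concrete, here is a small simulation sketch (an addition, not part of the original text). It assumes the model $p_i = \tfrac12 + \theta a_i$ as stated above and uses the linear unbiased estimating function $G(\theta) = \sum_i a_i (X_i - \tfrac12 - \theta a_i)$, whose root is $\hat\theta = \sum_i a_i (X_i - \tfrac12) / \sum_i a_i^2$; this particular estimator is an illustrative choice, not necessarily the construction the exercise has in mind.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate(theta, a, rng):
    """Root of G(theta) = sum_i a_i (X_i - 1/2 - theta*a_i) for one simulated sample."""
    x = rng.binomial(1, 0.5 + theta * a)
    return np.sum(a * (x - 0.5)) / np.sum(a * a)

theta, n, reps = 0.3, 100_000, 200
idx = np.arange(1, n + 1)
for label, a in [("a_i = i^(-1/4)  (sum a_i^2 diverges) ", idx ** -0.25),
                 ("a_i = i^(-1)    (sum a_i^2 converges)", 1.0 / idx)]:
    est = np.array([estimate(theta, a, rng) for _ in range(reps)])
    print(f"{label}: mean {est.mean():+.3f}, sd {est.std():.3f}")
# Only in the divergent case does the spread keep shrinking as n grows,
# matching the "if and only if" condition of the exercise.
```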
Chapter 2 The General Framework 2.1 Introduction Let {Xt, t T} be a sample of discrete or continuous data that is randomly generated and takes values in r-dimensional Euclidean space. The distribution of Xt depends on a "parameter" taking values in an open subset of pdimensional Euclidean space and the object of the exercise is the estimation of . We assume that the possible probability measures for Xt are {P} a union (possibly uncountable) of families of parametric models, each family being indexed by and that each (, F, P) is a complete probability space. We shall focus attention on the class G of zero mean, square integrable estimating functions GT = GT ({Xt, t T}, ), which are vectors of dimension p for which EGT () = 0 for each P and for which the p-dimensional matrices E ˙GT = (E GT,i()/j) and EGT GT are nonsingular, the prime denoting transpose. The expectations are always with respect to P. Note that ˙G is the transpose of the usual derivative of G with respect to . In many cases P is absolutely continuous with respect to some -finite measure T giving a density pT (). Then we write UT () = p-1 T () ˙pT () for the score function, which we suppose to be almost surely differentiable with respect to the components of . In addition we will also suppose that differentiation and integration can be interchanged in E(GT UT ) and E(UT GT ) for GT G. The score function UT provides, modulo minor regularity conditions, a minimal sufficient partitioning of the sample space and hence should be used for estimation if it is available. However, it is often unknown, or in semiparametric cases, does not exist. The framework here allows a focus on models in which the error distribution has only its first and second moment properties specified, at least initially. 2.2 Fixed Sample Criteria In practice we always work with specified subsets of G. Take H G as such a set. As motivated in the previous chapter, optimality within H is achieved by maximizing the covariance matrix of the standardized estimating functions G (s) T = -(E ˙GT ) (EGT GT )-1 GT , GT H. Alternatively, if UT exists, an optimal estimating function within H is one with minimum dispersion distance from UT . These ideas are formalized in the following definition and equivalence, which we shall call criteria for OF -optimality (fixed sample optimality). Later 11 12 CHAPTER 2. THE GENERAL FRAMEWORK we shall introduce similar criteria for optimality to hold for all (sufficiently large) sample sizes. Estimating functions that are optimal in either sense will be referred to as quasi-score estimating functions and the estimators that come from equating these to zero and solving as quasi-likelihood estimators. OF -optimality involves choice of the estimating function GT to maximize, in the partial order of nonnegative definite (nnd) matrices (sometimes known as the Loewner ordering), the information criterion E(GT ) = E(G (s) T G (s) T ) = (E ˙GT ) (EGT GT )-1 (E ˙GT ), which is a natural generalization of Fisher information. Indeed, if the score function UT exists, E(UT ) = (E ˙UT ) (EUT UT )-1 (E ˙UT ) = EUT UT is the Fisher information. Definition 2.1 G T H is an OF -optimal estimating function within H if E(G T ) - E(GT ) (2.1) is nonnegative definite for all GT H, and P. The term Loewner optimality is used for this concept in the theory of optimal experimental designs (e.g., Pukelsheim (1993, Chapter 4)). 
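As a small numerical illustration of Definition 2.1 (added here, not from the text), consider a heteroscedastic linear model $X_t = \theta_1 + \theta_2 z_t + e_t$ with known variances $\sigma_t^2$ and compare the information matrices $\mathcal{E}(G) = (E\dot G)'(EG G')^{-1}(E\dot G)$ of the unweighted and the $\sigma_t^{-2}$-weighted linear estimating functions; the design points, variances and weights below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 50
z = rng.uniform(0.0, 5.0, T)
sigma2 = 0.5 + z ** 2                    # known error variances (heteroscedastic)
C = np.column_stack([np.ones(T), z])     # c_t = (1, z_t)', so mu_t(theta) = theta_1 + theta_2 z_t

def information(w):
    """E(G) = (E Gdot)' (E G G')^{-1} (E Gdot) for G(theta) = sum_t w_t c_t (X_t - mu_t(theta))."""
    W = w[:, None] * C
    E_Gdot = -(W.T @ C)                          # E Gdot = -sum_t w_t c_t c_t'
    E_GG = (W * sigma2[:, None]).T @ W           # E G G'  =  sum_t w_t^2 sigma_t^2 c_t c_t'
    return E_Gdot.T @ np.linalg.inv(E_GG) @ E_Gdot

info_ols = information(np.ones(T))       # unweighted (ordinary least squares type) choice
info_qs = information(1.0 / sigma2)      # quasi-score weights w_t = sigma_t^{-2}

print(np.linalg.eigvalsh(info_qs - info_ols))
# Both eigenvalues are positive (up to rounding), i.e. E(G*) - E(G) is nonnegative
# definite, so the weighted family dominates in the Loewner order of Definition 2.1.
```

The same comparison applied to the trace or the determinant of the two matrices illustrates the scalar criteria discussed in Section 2.3.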
In the case where the score function exists there is the following equivalent form to Definition 2.1 phrased in terms of minimizing dispersion distance. Definition 2.2 G T H is an OF -optimal estimating function within H if E U (s) T - G (s) T U (s) T - G (s) T - E U (s) T - G (s) T U (s) T - G (s) T (2.2) is nonnegative definite for all GT H, and P. Proof of Equivalence We drop the subscript T for convenience. Note that E G(s) U(s) = -(E ˙G) (EGG )-1 EGU = E G(s) G(s) since EGU = -E ˙G G H EGU = G log L L = G L = - G L 2.2. FIXED SAMPLE CRITERIA 13 and similarly E U(s) G(s) = E G(s) G(s) . These results lead immediately to the equality of the expressions (2.1) and (2.2) and hence the equivalence of Definition 2.1 and Definition 2.2. A further useful interpretation of quasi-likelihood can be given in a Hilbert space setting. Let H be a closed subspace of L2 = L2 (, F, P0) of (equivalence classes) of random vectors with finite second moment. Then, for X, Y L2 , taking inner product (X, Y ) = E(X Y ) and norm X = (X, X)1/2 the space L2 is a Hilbert space. We say that X is orthogonal to Y , written XY , if (X, Y ) = 0 and that subsets L2 1 and L2 2 of L2 are orthogonal, which holds if XY for every X L2 1, Y L2 2 (written L2 1L2 2). For X L2 , let (X | H) denote the element of H such that X - (X | H) 2 = inf Y H X - Y 2 , that is, (X | H) is the orthogonal projection of X onto H. Now suppose that the score function UT G. Then, dropping the subscript T and using Definition 2.2, the standardized quasi-score estimating function H(s) H is given by inf H(s)H E U - H(s) U - H(s) , and since tr E(U - H(s) )(U - H(s) ) = U - H(s) 2 , tr denoting trace, the quasi-score is (U | H), the orthogonal projection of the score function onto the chosen space H of estimating functions. For further discusion of the Hilbert space approach see Small and McLeish (1994) and Merkouris (1992). Next, the vector correlation that measures the association between GT = (GT,1, . . . , GT,p) and UT = (UT,1, . . . , UT,p) , defined, for example, by Hotelling (1936), is 2 = (det(EGT UT ))2 det(EGT GT ) det(EUT UT ) , where det denotes determinant. However, under the regularity conditions that have been imposed, E ˙GT = -E(GT UT ), so a maximum correlation requirement is to maximize (det(E ˙GT ))2 /det(EGT GT ), which can be achieved by maximizing E(GT ) in the partial order of nonnegative definite matrices. This corresponds to the criterion of Definition 2.1. Neither Definition 2.1 nor Definition 2.2 is of direct practical value for applications. There is, however, an essentially equivalent form (Heyde (1988a)), 14 CHAPTER 2. THE GENERAL FRAMEWORK that is very easy to use in practice. Theorem 2.1 G T H is an OF -optimal estimating function within H if E G (s) T G (s) T = E G (s) T G (s) T = E G (s) T G (s) T (2.3) or equivalently E ˙GT -1 EGT G T is a constant matrix for all GT H. Conversely, if H is convex and G T H is an OF -optimal estimating function, then (2.3) holds. Proof. Again we drop the subscript T for convenience. When (2.3) holds, E G(s) - G(s) G(s) - G(s) = E G(s) G(s) - E G(s) G(s) is nonnegative definite, G H, since the left-hand side is a covariance function. This gives optimality via Definition 2.1. Now suppose that H is convex and G is an OF -optimal estimating function. 
Then, if H = G + G , we have that E G(s) G(s) - E H(s) H(s) is nonnegative definite, and after inverting and some algebra this gives that 2 EGG - E ˙G E ˙G -1 EG G E ˙G -1 E ˙G - -EGG + E ˙G E ˙G -1 EG G - -EG G + EG G E ˙G -1 E ˙G is nonnegative definite. This is of the form 2 A - B, where A and B are symmetric and A is nonnegative definite by Definition 2.1. Let u be an arbitrary nonzero vector of dimension p. We have u Au 0 and u Au -1 u Bu for all , which forces u Bu = 0 and hence B = 0. Now B = 0 can be rewritten as EGG E ˙G -1 C + C E ˙G -1 EGG = 0, where C = EG(s) G(s) - EG(s) G(s) E ˙G -1 EG G 2.2. FIXED SAMPLE CRITERIA 15 and, as this holds for all G H, it is possible to replace G by DG, where D = diag (1, . . . , p) is an arbitrary constant matrix. Then, in obvious notation i (EGG ) (E ˙G) -1 C j + C (E ˙G)-1 (EGG ) i j = 0 for each i, j, which forces C = 0 and hence (2.3) holds. This completes the proof. In general, Theorem 2.1 provides a straightforward way to check whether an OF -optimal estimating function exists for a particular family H. It should be noted that existence is by no means guaranteed. Theorem 2.1 is especially easy to use when the elements G H have orthogonal differences and indeed this is often the case in applications. Suppose, for example, that H = H : H = T t=1 at() ht() , with at() constants to be chosen, ht's fixed and random with zero means and Ehs()ht() = 0, s = t. Then EHH = T t=1 atEhthta t E ˙H = T t=1 atE ˙ht and (E ˙H)-1 EHH is constant for all H H if a t = E ˙ht (Ehtht)-1 . An OF -optimal estimating function is thus T t=1 E ˙ht() Eht()ht() -1 ht(). As an illustration consider the estimation of the mean of the offspring distribution in a Galton-Watson process {Zt}, = E(Z1|Z0 = 1). Here the data are {Z0, . . . , ZT }. Let Fn = (Z0, . . . , Zn). We seek a basic martingale (MG) from the {Zi}. This is simple since Zi - E (Zi | Fi-1) = Zi - Zi-1 16 CHAPTER 2. THE GENERAL FRAMEWORK are MG differences (and hence orthogonal). Let H = h : hT = T t=1 at()(Zt - Zt-1), at() is Ft-1 measurable . We find that the OF -optimal choice for at() is a t () = -1/2 , where 2 = var(Z1|Z0 = 1). The OF -optimal estimator of is (Z1 + . . . + ZT )/(Z0 + . . . + ZT -1). We would call this a quasi-likelihood estimator from the family H. It is actually the MLE for the power series family of offspring distributions P(Z1 = j|Z0 = 1) = A(j) (a())j F() , j = 0, 1, 2, . . . where F() = j=0 A(j)(a())j . These form the discrete exponential family for this context. To obtain the MLE result for the power series family note that P(Z0 = z0, . . . , ZT = zT ) = P(Z0 = z0) T k=1 P Zk = zk Zk-1 = zk-1 = P(Z0 = z0) T k=1 j1+...+jzk-1 =zk A(j1) . . . A(jzk-1 ) (a())zk (F())zk-1 = P(Z0 = z0) (a())z1+...+zT (F())z0+...+zT -1 × term not involving and hence if L = L(Z0, . . . , ZT ; ) is the likelihood, d log L d = (Z1 + . . . + ZT ) ˙a() a() - (Z0 + . . . + ZT -1) ˙F() F() . However = j=1 j A(j) (a())j F() , 1 = j=0 A(j) (a())j F() 2.2. FIXED SAMPLE CRITERIA 17 and differentiating with respect to in the latter result, 0 = j=1 j A(j)˙a() (a())j-1 F() - ˙F() F2() j=0 A(j)(a())j so that ˙a() a() = ˙F() F() and the score function is UT () = ˙a() a() [(Z1 + . . . + ZT ) - (Z0 + . . . + ZT -1)] . The same estimator can also be obtained quite generally as a nonparametric MLE (Feigin (1977)). This example illustrates one important general strategy for finding optimal estimating functions. 
This strategy is to compute the score function for some plausible underlying distribution (such as a convenient member of the appropriate exponential family) log L0/, say, then use the differences in this martingale to form ht's. Finally, choose the optimal estimating function within the class H = { T t=1 atht} by suitably specifying the weights at. The previous discussion and examples have concentrated on the case of discrete time. However, the theory operates in entirely similar fashion in the case of continuous time. For example, consider the diffusion process dXt = Xt dt + dWt, where Wt is standard Brownian motion. We wish to estimate on the basis of the data {Xt, 0 t T}. Here the natural martingale to use is Wt, and we seek an OF -optimal estimating function from the set H = { T 0 bs dWs, bs predictable}. Note that, for convenience, and to emphasize the methodology, we shall often write estimating functions in a form that emphasizes the noise component of the model and suppresses the dependence on the observations. Here T 0 bs dWs is to be interpreted as T 0 bs(dXs - Xs ds). We have, writing HT = T 0 bs dWs, H T = T 0 b s dWs, HT = T 0 bs dWs = T 0 bs dXs - T 0 bs Xs ds, so that E ˙HT = -E T 0 bs Xs ds, EHT H T = E T 0 bs b s ds 18 CHAPTER 2. THE GENERAL FRAMEWORK and (E ˙HT )-1 EHT H T is constant for all H H if b s = Xs. Then, H T = T 0 Xs dXs - T 0 X2 s ds and the QLE is ^T = T 0 Xs dXs T 0 X2 s ds. This is also the MLE. As another example, we consider the multivariate counting process Xt = (Xt,1, . . . , Xt,p) , each Xt,i being of the form Xt,i = i t 0 Ji(s) ds + Mt,i with multiplicative intensity i(t) = i Ji(t), Ji(t) > 0 a.s. being predictable and Mt,i a square integrable martingale. This is a special case of the framework considered by Aalen (1978), see also Andersen et.al. (1982) and it covers a variety of contexts for processes such as those of birth and death type. The case p = 1 has been discussed by Thavaneswaran and Thompson (1986). The data are {Xt, 0 t T} and we write XT = T 0 J(s) ds + MT , where J(s) = diag(J1(s), . . . , Jp(s)), = (1, . . . , p) , MT = (MT,1, . . . , MT,p) and we note that for counting processes MT,i and MT,j are orthogonal for i = j. Then, we seek an OF -optimal estimating function from the set H = T 0 bs dMs, bs predictable . Let HT = T 0 bs dMs, H T = T 0 b s dMs. We have E ˙HT = -E T 0 bs J(s) ds, EHT H T = E T 0 bs J(s) b s ds, and (E ˙HT )-1 EHT H T is constant for all H H if b s I. Thus H T = MT is an OF -optimal estimating function and the corresponding QLE is ^T = T 0 Js ds -1 XT . 2.3. SCALAR EQUIVALENCES AND ASSOCIATED RESULTS 19 That this ^ is also the MLE under rather general conditions follows from § 3.3 of Aalen (1978). The simplest particular case is where each Xt,i is a Poisson process with parameter i. General comment: The art, as distinct from the science, in using quasilikelihood methods is in a good choice of the family H of estimating functions with which to work. The ability to choose H is a considerable strength as the family can be tailor made to the requirements of the context, and regularity conditions can be built in. However, it is also a source of weakness, since it is by no means always clear what competing families of estimating functions might exist with better properties and efficiencies. Of course, the quasi-likelihood framework does provide a convenient basis for comparison of families via the information criterion. 
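Before leaving these examples, here is a quick numerical companion to the diffusion illustration above (an addition, not from the text). It simulates $dX_t = \theta X_t\,dt + dW_t$ by an Euler scheme and evaluates the quasi-likelihood estimator $\hat\theta_T = \int_0^T X\,dX \big/ \int_0^T X^2\,ds$; the negative value of $\theta$ and the discretization step are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(2)

theta_true = -0.7                  # drift parameter (negative, so the process is ergodic)
T, n = 200.0, 200_000
dt = T / n

# Euler-Maruyama path of dX_t = theta * X_t dt + dW_t
x = np.empty(n + 1)
x[0] = 0.0
dW = rng.normal(0.0, np.sqrt(dt), n)
for k in range(n):
    x[k + 1] = x[k] + theta_true * x[k] * dt + dW[k]

# Quasi-likelihood estimator: theta_hat = (int_0^T X dX) / (int_0^T X^2 ds),
# with int X dX approximated by sum X_{t_k} (X_{t_{k+1}} - X_{t_k}).
theta_hat = np.sum(x[:-1] * np.diff(x)) / (np.sum(x[:-1] ** 2) * dt)
print("theta_hat =", theta_hat)    # close to theta_true for large T
```

The counting-process example above admits an equally direct check, with the Brownian martingale replaced by the compensated counting process.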
For examples of estimators for autoregressive processes with positive or bounded innovations that are much better than the naive QLE chosen from the natural family of estimating functions for standard autoregressions see Davis and McCormick (1989). 2.3 Scalar Equivalences and Associated Results Comparison of information matrices in the partial order of nonnegative definite matrices may be difficult in practice, especially if the information matrices are based on quite different families of estimating functions. In the case where an OF -optimal estimating function exists, however, we may replace the matrix comparison by simpler scalar ones. The following result is essentially due to Chandrasekar and Kale (1984). Theorem 2.2 Suppose that H is a space of estimating functions for which an OF -optimal estimating function exists. The condition that G T is OF -optimal in H, i.e., that E(G T )-E(GT ) is nonnegative definite for all GT H, is equivalent to either of the two alternative conditions: for all GT H, (i) (trace (T) criterion) tr E(G T ) tr E(GT ); (ii) (determinant (D) criterion) det(E(G T )) det(E(GT )). Remark There are further alternative equivalent conditions in addition to (i) and (ii), in particular (iii) (smallest eigenvalue (E) criterion) min (E (G T )) min (E (GT )) and (iv) (average variance (A) criterion) p-1 tr (E (G T )) -1 -1 p-1 tr (E (GT )) -1 -1 . 20 CHAPTER 2. THE GENERAL FRAMEWORK Conditions (i) ­ (iv) have been widely used in the theory of optimal experimental design where they respectively correspond to T, D, E and A-optimality. See Pukelsheim (1993), e.g., Chapter 9, and references therein. For experimental designs it often happens that a Loewner optimal design ( i.e., OF -optimal estimating function) does not exist, but an A, D, E or T optimal design (i.e., estimating function optimal in the sense of the A, D, E or T criterion) can be found. See Pukelsheim (1993, p. 104) for a discussion of nonexistence of Loewner optimal designs. Proof of Theorem 2.2 We shall herein drop the subscript T for conve- nience. (i) The condition that E(G ) - E(G) is nnd immediately gives tr (E(G ) - E(G)) = tr E(G ) - tr E(G) 0. Conversely, suppose H satisfies tr E(H) tr E(G) for all G H. If there is an OF -optimal G , then tr E(H) tr E(G ). But from the definition of OF optimality we also have tr E(G ) tr E(H) and hence tr E(G ) = tr E(H). Thus, we have that A = E(G ) - E(H) is nnd and tr A = 0. But A being symmetric and nnd implies that all its eigenvalues are positive, while tr A = 0 implies that the sum of all the eigenvalues of A is zero. This forces all the eigenvalues of A to be zero, which can only happen if A 0, since the sum of squares of the elements of A is the sum of squares of its eigenvalues. Thus, E(G ) = E(H) and we have an OF -optimal estimating function. (ii) Here we apply the Simultaneous Reduction Lemma (e.g., Rao (1973, p. 41)) which states that if A and B are symmetric matrices and B is positive definite (pd), then there is a nonsingular matrix R such that A = (R-1 ) R-1 and B = (R-1 ) R-1 , where is diagonal. In the nontrivial case we first suppose that E(G ) is pd. Then using the Simultaneous Reduction Lemma we may suppose that there exists a nonsingular matrix R such that for fixed G E(G ) = (R-1 ) R-1 , E(G) = (R-1 ) GR-1 , where G is diagonal. Then the condition that E(G ) - E(G) = (R-1 ) (I - G)R-1 is nnd forces det(E(G ) - E(G)) = (det(R-1 ))2 det(I - G) 0. This means that detG 1 and hence det(E(G )) = det(R-1 )2 det(E(G)) = det(R-1 )2 det(G). 
Conversely, suppose that H satisfies det(E(H)) det(E(G)) for all G H. As with the proof of (i) we readily find that det(E(H)) = det(E(G )) when G 2.4. WEDDERBURN'S QUASI-LIKELIHOOD 21 is OF -optimal. An application of the Simultaneous Reduction Lemma to the pair E(G ), E(H), the former taken as pd, leads immediately to E(G ) = E(H) and an OF -optimal solution. Remark It must be emphasized that the existence of an OF -optimal estimating function within H is a crucial assumption in the theorem. For example, if G satisfies the trace criterion (i), it is not ensured that G is an OF -optimal estimating function within H; there may not be one. 2.4 Wedderburn's Quasi-Likelihood 2.4.1 The Framework Historically there have been two distinct approaches to parameter inference developed from both classical least squares and maximum likelihood methods. One is the optimal estimation approach introduced by Godambe (1960), and others, from the viewpoint of estimating functions. The other, introduced by Wedderburn (1974) as a basis for analyzing generalized linear regressions, was termed quasi-likelihood from the outset. Both approaches have seen considerable development in their own right. For those based on the Wedderburn approach see, for example, Liang and Zeger (1986), Morton (1987), and Mc Cullagh and Nelder (1989). In this book our emphasis is on the optimal estimating functions approach and, as we shall show in this section, the Wedderburn approach can be regarded as a particular case of the optimal estimating function approach where we restrict the space of estimating functions to a special class. Wedderburn observed that, from a computational point of view, the only assumptions on a generalized linear model necessary to fit the model were a specification of the mean (in terms of the regression parameters) and the relationship between the mean and the variance, not necessarily a fully specified likelihood. Therefore, he replaced the assumptions on the probability distribution by defining a function based solely on the mean-variance relationship, which had algebraic and frequency properties similarly to those of log-likelihoods. For example, for a regression model Y = () + e with Ee = 0, we suppose that a function q can be defined by the differential equation q = Q() = ˙ V -1 (Y - ()), for matrices ˙ = (i/j) and V = Eee . Then E{Q()} = 0. E (Q()) = - ˙ V -1 ˙. 22 CHAPTER 2. THE GENERAL FRAMEWORK cov {Q()} = ˙ V -1 ˙. Thus Q() behaves like the derivative of a log-likelihood (a score function) and is termed a quasi-score or quasi-score estimating function from the viewpoint of estimating functions, while q itself is called a quasi-(log)likelihood. A common approach to get the quasi-score estimating function has been to first write down the general weighted sum of squares of residuals, (Y - ()) V -1 (Y - ()), and then differentiate it with respect to assuming V is independent of . We now put this approach into an estimating function setting. Consider a model Y = () + e, (2.4) where Y is an n × 1 data vector, Ee = 0 and (), which now may be random but for which E(e e | ˙) = V , involves an unknown parameter of dimension p. We consider the estimating function space H = {A(Y - ())}, for p × p matrices A not depending on which are ˙-measurable and satisfy the conditions that EA ˙ and E(A e e A ) are nonsingular. Then we have the following theorem. Theorem 2.3 The estimating function G = ˙ V -1 (Y - ) (2.5) is a quasi-score estimating function within H. This result follows immediately from Theorem 2.1. 
If G = A(Y - ), we have EG G = E(A e e V -1 ˙) = E(AE(e e | ˙)V -1 ˙) = E(A ˙) = -E ˙G as required. The estimating function (2.5) has been widely used in practice, in particular through the family of generalized linear models (e.g, McCullagh and Nelder (1989, Chapter 10)). A particular issue has been to deal with dispersion , which is modeled by using V () in place of the V in (2.5) (e.g., Nelder and 2.4. WEDDERBURN'S QUASI-LIKELIHOOD 23 Lee (1992)). Another method of introducing extra variation into the model has been investigated by Morton (1989). The Wedderburn estimating function (2.5) is also appropriate well beyond the usual setting of the generalized linear model. For example, estimation for all standard ARMA time series models are covered in Theorem 2.3. In practice, we obtain the familiar Yule-Walker equations for the quasi-likelihood estimation of the parameters of an autoregressive process under assumptions on only the first and second moments of the underlying distribution. 2.4.2 Limitations It is not generally the case, however, that the Wedderburn estimating function (2.5) remains as a quasi-score estimating function if the class H of estimating functions is enlarged. Superior estimating functions may be found as we shall illustrate below. It should be noted that, when () is nonrandom, H confines attention to nonrandom weighting matrices A. Allowing for random weights may improve precision in estimation. As a simple example, suppose that Y = (x1, . . . , xn) , e = (e1, . . . , en) and the model (2.4) is of the form xi = + ei, i = 1, 2, . . . , n with E ei Fi-1 = 0, E e2 i Fi-1 = 2 x2 i-1, Fi being the -field generated by x1, . . . , xi. Then, it is easily checked using Theorem 2.1 that the quasi-score estimating function from the estimating function space H = n i=1 ai(xi - ), ai is Fi-1 measurable is G 1 = -2 n i=1 x-2 i-1(xi - ) in contrast to the Wedderburn estimating function (2.5), i.e., G = -2 n i=1 (Ex2 i-1)-1 (xi - ), which is the quasi-score estimating function from H in which all i are assumed constant. If 1 and are, respectively, the solutions of G 1( 1) = 0 and G ( ) = 0, then (E ˙G 1)2 (E(G 1)2 )-1 = -2 n i=1 E(x-2 i-1) 24 CHAPTER 2. THE GENERAL FRAMEWORK -2 n i=1 (Ex2 i-1)-1 = (E ˙G )2 (E(G )2 )-1 since (EZ)(EZ-1 ) 1 via the Cauchy-Schwarz inequality. Thus G 1 is superior to G . Now it can also happen that linear forms of the kind that are used in H are substantially inferior to nonlinear functions of the data. The motivation for H comes from exponential family considerations, and distributions that are far from this type may of course arise. These will fit within different families of estimating functions. Suppose, for example, that Y = (x1, . . . , xn) , e = (e1, . . . , en) where xi = + ei, i = 1, . . . , n with the ei being i.i.d. with density function f(x) = 21/4 e-x4 /(1/4), - < x < . Then, we find that the true score function is U() = 4 n i=1 (xi - )3 . This is much superior to the Wedderburn estimating function (2.5) for estimation of , i.e., G = -2 n i=1 (xi - ) where 2 = var(x1) = Ee2 1. Indeed, after some calculation we obtain (E ˙U)-2 E(U)2 = Ee6 1 (9(Ee2 1)2)n = -1/2 (1/4) 12 n (3/4) , which is approximately 0.729477 of (E ˙G )-2 E(G )2 = 1 n E e2 1 = -1/2 (3/4) n (1/4) . This means that the length of an asymptotic confidence interval for derived from G will be 1/0.729477 1.1708 times the corresponding one derived from U. 
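A short numerical check of this comparison may be helpful (this block is an addition, not from the text). It computes the asymptotic variance ratio $E e_1^6 / \big(9 (E e_1^2)^3\big) \approx 0.7295$ from the gamma-function moments of the density proportional to $e^{-x^4}$ and confirms it by a small Monte Carlo experiment; sampling $\pm\,Y^{1/4}$ with $Y \sim \mathrm{Gamma}(1/4, 1)$ is one convenient way to draw from this density.

```python
import numpy as np
from scipy.special import gamma
from scipy.optimize import brentq

# Moments of the density proportional to exp(-x^4):
# E e^2 = Gamma(3/4)/Gamma(1/4) and E e^6 = (3/4) E e^2, so the asymptotic
# variance ratio quoted above is E e^6 / (9 (E e^2)^3).
Ee2 = gamma(0.75) / gamma(0.25)
Ee6 = 0.75 * Ee2
print("theoretical ratio:", Ee6 / (9 * Ee2 ** 3))        # about 0.7295

# Monte Carlo check: if Y ~ Gamma(1/4, 1) and S = +/-1 with equal probability,
# then S * Y^{1/4} has density proportional to exp(-x^4).
rng = np.random.default_rng(3)
theta, n, reps = 1.0, 400, 2000
score_root, sample_mean = [], []
for _ in range(reps):
    e = rng.choice([-1.0, 1.0], n) * rng.gamma(0.25, 1.0, n) ** 0.25
    x = theta + e
    sample_mean.append(x.mean())
    # root of the score-type estimating function U(theta) = 4 sum (x_i - theta)^3
    score_root.append(brentq(lambda t: np.sum((x - t) ** 3), x.min(), x.max()))
print("empirical variance ratio:",
      np.var(score_root) / np.var(sample_mean))          # also roughly 0.73
```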
The true score estimating function here could, for example, be regarded as an estimating function from the family H2 = n i=1 (ai(xi - ) + bi(xi - )3 ), ai's, bi's constants , 2.4. WEDDERBURN'S QUASI-LIKELIHOOD 25 third moments being assumed zero, in contrast to G derived from H = n i=1 ai(xi - ), ai's constant . From the above discussion, we see the flexibility of being able to choose an appropriate space of estimating functions in the optimal estimating functions approach. 2.4.3 Generalized Estimating Equations Closely associated with Wedderburn's quasi-likelihood is the moment based generalized estimating equation (GEE) method developed by Liang and Zeger (1986), and Prentice (1988). The GEE approach was formulated to deal with problems of longitudiual data analysis where one typically has a series of repeated measurements of a response variable, together with a set of covariates, on each unit or individual observed chronologically over time. The response variables will usually be positively correlated. Furthermore, in the commonly occurring situation of data with discrete responses there is no comprehensive likelihood based approach analogous to that which comes from multivariate Gaussian assumptions. Consequently, there has been considerable interest in an approach that does not require full specification of the joint distribution of the repeated responses . For a recent survey of the area see Fitzmaurice, Laird and Rotnitzky (1993). We shall follow Desmond (1996) in describing the formulation. This deals with a longitudinal data set consisting of responses Yit, t = 1, 2, . . . , ni, i = 1, 2, . . . , k, say, where i indexes the individuals and t the repeated observations per individual. Observations on different individuals would be expected to be independent, while those on the same individual are correlated over time. Then, the vector of observations Y = (Y11, . . . , Y1n1 , . . . , Yk1, . . . , Yknk ) will have a covariance matrix V with block-diagonal structure V = diag (V 1, V 2, . . . , V k). Suppose also that V i = V i(i, i), i = 1, 2, . . . , k where i = (i1 , . . . , ini ) is the vector of means for the ith individual and i is a parameter including variance and correlation components. Finally, the means it depend on covariates and a p × 1 regression parameter , i.e., it = it(), t = 1, 2, . . . , ni, i = 1, 2, . . . , k. For example, in the case of binary response variables and ni = T for each i, we may suppose that P Yit = 1 xit, = it, log it (1 - it) = xit , where xit represents a covariate vector associated with individual i and time t. This is the logit link function. 26 CHAPTER 2. THE GENERAL FRAMEWORK The basic model is then Y = + , say, where E = 0, E = V and, assuming the i, i = 1, 2, . . . , k known, the quasi-score estimating function from the family H = {A (Y - )} for k i=1 ni × k i=1 ni matrices A satisfying appropriate conditions is Q() = ˙ V -1 (Y - ) as in Theorem 2.3. Equivalently, upon making use of the block-diagonal structure of V , this may be written as Q() = k i=1 ˙i V -1 i (Y i - i), where ˙i = i/, Y i = (Yi1, . . . , Yini ) , i = 1, 2, . . . , k. The GEE is based on the estimating equation Q() = 0. Now the particular feature of the GEE methodology is the use of "working" or approximate covariance matrices in place of the generally unknown V i. The idea is that the estimator thereby obtained will ordinarily be consistent regardless of the true correlation between the responses. 
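To make this longitudinal setup concrete (the quasi-score $Q(\beta) = \sum_i \dot{\mu}_i' V_i^{-1}(Y_i - \mu_i)$ is written out in the next paragraph), here is a minimal GEE-type sketch with the logit link and a "working independence" covariance; the function name, the Fisher-scoring iteration and the synthetic data are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np

def gee_logit_independence(X, Y, n_iter=25):
    """GEE-type estimate of beta for binary panels Y[i] (length n_i) with covariates
    X[i] (n_i x p), logit link and a 'working independence' covariance V_i."""
    p = X[0].shape[1]
    beta = np.zeros(p)
    for _ in range(n_iter):
        score = np.zeros(p)
        fisher = np.zeros((p, p))
        for Xi, Yi in zip(X, Y):
            mu = 1.0 / (1.0 + np.exp(-Xi @ beta))      # mu_it = expit(x_it' beta)
            v = mu * (1.0 - mu)                         # working Var(Y_it)
            D = Xi * v[:, None]                         # D_i = d mu_i / d beta'
            score += D.T @ ((Yi - mu) / v)              # D_i' V_i^{-1} (Y_i - mu_i), V_i diagonal
            fisher += D.T @ (D / v[:, None])            # D_i' V_i^{-1} D_i
        beta = beta + np.linalg.solve(fisher, score)    # Fisher-scoring update
    return beta

# Tiny synthetic illustration: k individuals, six repeated binary responses each.
rng = np.random.default_rng(4)
beta_true = np.array([-0.5, 1.0])
X = [np.column_stack([np.ones(6), rng.normal(size=6)]) for _ in range(300)]
Y = [rng.binomial(1, 1.0 / (1.0 + np.exp(-Xi @ beta_true))) for Xi in X]
print(gee_logit_independence(X, Y))   # roughly recovers beta_true
```

With a non-diagonal working covariance (for example, exchangeable correlation) only the construction of $V_i$ changes; the estimating equation $Q(\beta) = 0$ keeps the same form.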
Of course any replacement of the true V i by other covariance matrices renders the GEE suboptimal. It is no longer based on a quasi-score estimating function although it may be asymptotically equivalent to one. See Chapter 5 for a discussion of asymptotic quasi-likelihood and, in particular, Chapter 5.5, Exercise 3. Various common specifications of possible time dependence and the associated methods of estimation that have been proposed are detailed in Fitzmaurice, Laird and Rotnitzky (1993). 2.5 Asymptotic Criteria Here we deal with the widely applicable situation in which families of estimating functions that are martingales are being considered. Since the score function, if it exists, is usually a martingale, it is quite natural to approximate it using families of martingale estimating functions. Furthermore, the availability of comprehensive strong law and central limit results for martingales allows for straightforward discussion of issues of consistency, asymptotic normality, and efficiency in this setting. For n × 1 vector valued martingales MT and NT , the n × n process M, N T is the mutual quadratic characteristic, a predictable increasing process such that MT NT - M, N T is an n × n martingale. We shall write M T for M, M T , the quadratic characteristic of MT . A convenient sketch of these concepts is given by Shiryaev (1981); see also Rogers and Williams (1987, IV. 26 and VI. 34). 2.5. ASYMPTOTIC CRITERIA 27 Let M1 denote the subset of G that are square integrable martingales. For {GT } M1 there is, under quite broad conditions, a multivariate central limit result G - 1 2 T GT M V N(0, Ip) (2.6) in distribution, as T ; see Chapter 12 for details. For the general theory there are significant advantages for the random normalization using G T rather than constant normalization using the covariance EGT GT . There are many cases in which normings by constants are unable to produce asymptotic normality, but instead lead to asymptotic mixed normality. These are cases in which operational time involves an intrinsic rescaling of ordinary time. Estimation of the mean of the offspring distribution in a Galton-Watson branching process described in Section 2.1 illustrates the point. The (martingale) quasi-score estimating function is QT = T i=1(Zi - Zi-1), the quadratic characteristic is Q T = 2 T i=1 Zi-1 and, on the set {W = limn -n Zn > 0}, Q - 1 2 T QT d - N(0, 1), (EQ2 T )- 1 2 QT d - W 1 2 N(0, 1), the product in this last limit being a mixture of independent W 1 2 and N(0, 1) random variables. Later, in Chapter 4.5.1, it is shown that the normal limit form of central limit theorem has advantages, for provision of asymptotic confidence intervals, over the mixed normal form. Let M2 M1 be the subclass for which (2.6) obtains. Next, with GT M2, let be a solution of GT () = 0 and use Taylor's expansion to obtain 0 = GT ( ) = GT () + ˙GT ( )( - ), (2.7) where - - , the norm denoting sum of squares of elements. Then, if ˙GT () is nonsingular for in a suitable neighborhood and ( ˙GT ( ))-1 ˙GT () Ip in probability as T , expressions (2.6) and (2.7) lead to G() - 1 2 T ˙GT ()( - ) M V N(0, Ip) in distribution. Now, for {FT } the filtration corresponding to {GT } define the predictable process Gt() = t 0 E d ˙Gs() Fs- , Fs- being the -field generated by r 0. It is evident that maximizing the martingale information IG() = GT () G() -1 T GT () leads to (asymptotic) confidence regions centered on of minimum size (see for example Rao (1973, § 4b. 2)). 
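Returning to the Galton-Watson example discussed above, a brief simulation sketch follows; the Poisson(θ) offspring law (for which the offspring variance σ² equals θ) and the numerical values are illustrative assumptions. On the survival set the quasi-score normed by the square root of its quadratic characteristic behaves like a standard normal variable, as described, and equating the quasi-score to zero gives the estimator Σ Zᵢ / Σ Zᵢ₋₁.

```python
import numpy as np

# Supercritical Galton-Watson process with Poisson(theta) offspring, Z_0 = 1.
# Quasi-score: Q_T = sum(Z_i - theta*Z_{i-1}); quadratic characteristic:
# <Q>_T = sigma^2 * sum(Z_{i-1}) with sigma^2 = theta for the Poisson law.
rng = np.random.default_rng(2)
theta, T, reps = 1.4, 20, 2000
standardized, estimates = [], []

for _ in range(reps):
    Z = [1]
    for _ in range(T):
        Z.append(int(rng.poisson(theta, size=Z[-1]).sum()) if Z[-1] > 0 else 0)
    Z = np.array(Z)
    if Z[-1] == 0:                                   # keep only paths on the survival set
        continue
    Q_T = np.sum(Z[1:] - theta * Z[:-1])             # martingale quasi-score at the true theta
    qchar = theta * np.sum(Z[:-1])                   # <Q>_T
    standardized.append(Q_T / np.sqrt(qchar))
    estimates.append(Z[1:].sum() / Z[:-1].sum())     # quasi-likelihood estimator of theta

standardized = np.array(standardized)
print("randomly normed quasi-score: mean %.3f, variance %.3f" %
      (standardized.mean(), standardized.var()))     # close to 0 and 1
print("average estimate over surviving paths: %.3f" % np.mean(estimates))
```

Norming the same quasi-score by the square root of its expected value instead would give the mixed normal limit described above.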
Note that IG() can be replaced by IG( ) in (2.8). The relation between OF and OA optimality, both restricted to the same family of estimating functions, is very close and it should be noted that E GT () = E ˙GT (), E G() T = EGT () GT (). Below we shall consider the widely applicable class of martingale estimating function A = GT () = T 0 s() dMs(), s predictable (2.9) and it will be noted that OF and OA optimality coincide. Applications of this class of estimating functions will be discussed in some detail in Section 2.6. As with Definition 2.1 the criterion of Definition 2.3 is hard to apply directly, but as an analogue of Theorem 2.1 we have the following result, which 2.5. ASYMPTOTIC CRITERIA 29 is easy to use in practice. Theorem 2.4 Suppose that M M1. Then, G T () M is an OA-optimal estimating function within M if GT () -1 G(), G () T = G T () -1 G () T (2.10) for all GT M, , P and T > 0. Conversely, if M is convex and G T M is an OA-optimal estimating function, then (2.10) holds. Proof. This follows much the same lines as that of Theorem 2.1 and we shall just sketch the necessary modifications. Write G = (G1, . . . , Gp) and G = (G 1, . . . , G p) , where the subscript T has been deleted for convenience. To obtain the first part of the theorem we write the 2p × 2p quadratic characteristic matrix of the set of variables Z = G1, . . . , Gp, G 1, . . . , G p in partitioned matrix form as C = G G, G G, G G . Now C is a.s. nonnegative definite since for u, an arbitrary 2p × 1 vector, u Cu = u Z, Z u = u Z, Z u = u Z 0 and then the method of Rao (1973, p. 327) gives the a.s. nonnegative definitness of G - G, G ( G ) -1 G, G . (2.11) But, condition (2.10) gives G, G = ( G)( G )-1 G and using this in (2.11) gives OA-optimality via Definition 2.3. The converse part of the proof carries through as with Theorem 2.1 using the fact that H = G + G M for arbitrary scalar . With the aid of Theorem 2.4 we are now able to show that OA-optimality implies OF -optimality in an important set of cases. The reverse implication does not ordinary hold. Theorem 2.5 Suppose that G T is OA-optimal within the convex class of martingale estimating functions M. If ( G T )-1 G T is nonrandom for T > 0, then G T is also OF -optimal within M. 30 CHAPTER 2. THE GENERAL FRAMEWORK Proof. For each T > 0, G T = G T T , (2.12) say, where T is a nonrandom p × p matrix. Then, from Theorem 2.4 we have G, G T = GT T , (2.13) and taking expectations in (2.12) and (2.13) leads to EGT G T = (E ˙GT )T = (E ˙GT )(E ˙G T )-1 EG T G T . The required result then follows from Theorem 2.1. For estimating functions of the form GT () = T 0 s() dMs(), s predictable, of the class A (see (2.9) above) it is easily checked that G T () = T 0 (d Ms) (d M s)- dMs is OA-optimal within A. Here Adenotes generalized inverse of a matrix A, which satisfies AAA = A, A- AA= A. It is often convenient to use A+ , the Moore-Penrose generalized inverse of a matrix A, namely the unique matrix A+ possessing the properties AA+ A = A, A+ AA+ = A+ , A+ A = AA+ . Note that G T = T 0 (d Ms) ( dM s)d Ms = G T and Theorem 2.5 applies to give OF -optimality also. 2.6 A Semimartingale Model for Applications Under ordinary circumstances the process of interest can, perhaps after suitable transformation, be modeled in terms of a signal plus noise relationship, process = signal + noise. The signal incorporates the predictable trend part of the model and the noise is the stochastic disturbance left behind after the signal part of the model is fitted. 
Parameters of interest are involved in the signal term but may also be present in the noise. More specifically, there is usually a special semimartingale representation. This framework is one in which there is a filtration {Ft} and the process of interest {Xt} is (uniquely) representable in the form Xt = X0 + At() + Mt(), 2.6. A SEMIMARTINGALE MODEL FOR APPLICATIONS 31 where At() is a predictable finite variation process which is of locally integrable variation, Mt() is a local martingale and A0 = M0 = 0 (e.g., Rogers and Williams (1987, Chapter VI, 40)). The local martingale has a natural role to play in inference as it represents the residual stochastic noise after fitting of the signal which is encapsulated in the finite variation process. The semimartingale restriction itself is not severe. All discrete time processes and most respectable continuous time processes are semimartingales. Indeed, most statistical models fit very naturally into a semimartingale frame- work. Example 1. {Xt} is an autoregression of order k. This is a model of the form Xt = + 1Xt-1 + . . . + kXt-k + t consisting of a trend term +1Xt-1+. . .+kXt-k and a martingale difference disturbance t. Example 2. {Xt} is a Poisson process with intensity and Xt = t + (Xt - t), which consists of a predictable trend t and a martingale disturbance Xt - t. Example 3. {Xt} is a 1-type Galton-Watson branching process; E(X1 | X0 = 1) = . Here Xt = X (1) 1,t-1 + . . . + X (Xt-1) 1,t-1 = Xt-1 + [(X (1) 1,t-1 - ) + . . . + (X (Xt-1) 1,t-1 - )] = Xt-1 + t, say, where X (i) 1,t-1 is the number of offspring of individual i in generation t, i = 1, 2, . . . , Xt-1. Here Xt-1 represents the predictable trend and t a martingale difference disturbance. Example 4. In mathematical finance the price {St} of a risky asset is commonly modeled by St = S0 exp - 1 2 2 t + Wt , where Wt is standard Brownian motion (e.g. the Black-Scholes formulation; see e.g. Duffie (1992)). Then we can use the semimartingale representation Xt = log St = log S0 + - 1 2 2 t + Wt. (2.14) It is worth noting, however, that the semimartingale property is lost if the Brownian motion noise is replaced by fractional Brownian motion (e.g. Cutland, Kopp and Willinger (1995)). 32 CHAPTER 2. THE GENERAL FRAMEWORK For various applications the modeling is done most naturally through a stochastic differential equation (sde) representation, which we may think of informally as d (process) = d (signal) + d (noise). In the case of the model (2.14) the corresponding sde is, using the Ito formula, dSt = St ( dt + dWt) and it should be noted that the parameter is no longer explicitly present in the signal part of the representation. Of course the basic choice of the family of estimating functions from which the quasi-score estimating function is to be chosen always poses a fundamental question. There is, fortunately, a useful particular solution, which we shall call the Hutton-Nelson solution. This involves confining attention to the local M G estimating functions belonging to the class M = Gt K : Gt() = t 0 s() dMs(), s predictable , where the local M G Mt comes from the semimartingale representation of the process under consideration, or in the discrete case M = Gt K : Gt() = t s=1 s() ms(), s Fs-1 measurable , where Mt = t s=1 ms. As usual, elements of M are interpreted, via the semimartingale representation, as functions of the data and parameter of interest. Within this family M the quasi-score estimating function is easy to write down, for example, using Theorem 2.1. 
It is G t () = t s=1 E ˙ms Fs-1 E ms ms Fs-1 - ms (discrete time) = t 0 (d Ms) (d M s)- dMs (continuous time) where d Ms = E d ˙Ms Fs- , M s being the quadratic characteristic and - referring to generalized inverse. The Hutton-Nelson solution is very widely usable, so widely usable that it may engender a false sense of security. Certainly it cannot be applied thoughtlessly and we shall illustrate the need for care in a number of ways. 2.6. A SEMIMARTINGALE MODEL FOR APPLICATIONS 33 First, however, we give a straightforward application to the estimation of parameters in a neurophysiological model. It is known that under certain circumstances the membrane potential V (t) across a neuron is well described by a stochastic differential equation dV (t) = (- V (t) + ) dt + dM(t) (e.g. Kallianpur (1983)), where M(t) is a martingale with discontinuous sample paths and a (centered) generalized Poisson distribution. Here M t = 2 t for some > 0. The Hutton-Nelson quasi-score estimating function for = ( , ) on the basis of a single realization {V (s), 0 s T} gives G T = T 0 (-V (t) 1) {dV (t) - (- V (t) + ) dt }. The estimators ^ and ^ are then obtained from the estimating equations T 0 V (t) dV (t) = T 0 -^V (t) + ^ V (t) dt, V (T) - V (0) = T 0 -^V (t) + ^ dt, and it should be noted that these do not involve detailed properties of the stochastic disturbance M(t), only a knowledge of M t. In particular, they remain the same if M(t) is replaced (as holds in a certain limiting sense; see Kallianpur (1983)) by 2 W(t), where W(t) is standard Brownian motion. In this latter case ^ and ^ are actually the respective maximum likelihood estimators. There are, of course, some circumstances in which a parameter in the noise component of the semimartingale model needs to be estimated, such as the scale parameter associated with the noise in the above membrane potential model. This can ordinarily be handled by transforming the original semimartingale model into a new one in which the parameter of interest is no longer in the noise component. For the membrane potential model, for example, we can consider a new semimartingale {[M]t - M t}, {[M]t} being the quadratic variation process. Then, since T 0 (dV (t))2 = T 0 (dM(t))2 = [M]T and M T = T 0 d M t = 2 T, it is clear that 2 can be estimated by T-1 T 0 (dV (t))2 . Now for stochastic processes it is important to focus on the interplay between the data that is observed and the sources of variation in the model for which estimators are sought. 34 CHAPTER 2. THE GENERAL FRAMEWORK Stochastic process estimation raises a rich diversity of confounding and nonidentifiability problems that are often less transparent than those which occur in a nonevolutionary statistical environment. Furthermore, there are more often limitations on the data that can be collected and consequent limitations on the parameters that can be estimated. It is worth illustrating with some examples in which one must be careful to focus individually on the different sources of variation in the model. The first example (Srensen (1990)) concerns the process dXt = Xt dt + dWt + dNt, t 0, X0 = x0, where Wt is a standard Brownian motion and Nt is a Poisson process with intensity . This process may be written in semimartingale form as dXt = Xt dt + dt + dMt, where Mt = Wt + Nt - t and the Hutton-Nelson quasi-score estimating function based on Mt is G T = - 1 1 + T 0 Xs- 1 dMs. Then, the estimating equations are T 0 Xs- dXs = ^ T 0 X2 s ds + ^ T 0 Xs ds XT = ^ T 0 Xs ds + ^ T. 
These, however, are the maximum likelihood estimating equations for the model dXt = ( Xt + ) dt + dWt, that is, where Nt has been replaced by its compensator t. The entire stochastic fluctuation is described by the Brownian motion and this would only be realistic if 1. However, if we rewrite the original model by separating the discrete and continuous parts and treating them individually, we get true MLE's. Put dXc t = Xt dt + dWt Xc t = Xt - Nt (2.15) dXt - dXc t = dNt = dt + dNt - dt. (2.16) The Hutton-Nelson quasi-score estimating functions based on (2.15) and (2.16) individually are, respectively, T 0 Xs- dXc s = ~ T 0 X2 s ds, NT = ~ T. 2.7. SOME PROBLEM CASES FOR THE METHODOLOGY 35 The improvement over the earlier approach can easily be assessed. For example, E(~ - )2 /T while E(^ - )2 (1 + )/T as T . In the next example (Srensen (1990)) {Xt} behaves according to the compound Poisson process Xt = Nt i=1 Yi, where {Nt} is a Poisson process with intensity and the {Yi} are i.i.d. with mean (and independent of {Nt}). Here the semimartingale representation is Xt = t + Nt i=1 (Yi - ) + (Nt - t) = t + Mt, say, {Mt} being a martingale. Using {Mt}, the estimating equation based on the Hutton-Nelson quasi-score is XT = ( ) T so that only the product is estimable. However, if the data provide both {Xt, 0 t T} and {Nt, 0 t T}, then we can separate out the two martingales Nt 1 (Yi - ) , {Nt - t} and obtain the Hutton-Nelson quasi-score estimating functions based on each. The QL estimators are ^ = XT /NT , ^ = NT /T and these are MLE's when the Yt's are exponentially distributed. General conclusion: Identify relevant martingales that focus on the sources of variation and combine them. Relevant martingales can be constructed in many ways. The simplest single approach is, in the discrete time setting, to take the increments of the observation process, substract their conditional expectations and sum. Ordinary likelihoods and their variants such as marginal, conditional or partial likelihoods can also be used to generate martingales (see, e.g., Barndorff-Nielsen and Cox (1994, Sections 8.6 and 3.3)). General methods for optimal combination of estimating functions are discussed in Chapter 6. 2.7 Some Problem Cases for the Methodology It is by no means assured that a quasi-score estimating function is practicable or computable in terms of the available data even if it is based on the semimartingale representation as described in Section 2.6. To provide one example we take the population process in a random envi- ronment Xt = ( + t) Xt-1 + t, 36 CHAPTER 2. THE GENERAL FRAMEWORK where ( + t) is multiplicative noise coming from environmental stochasticity and t is additive noise coming from demographic stochasticity. Let {Ft} denote the past history -fields. We shall take E t Ft-1 = 0 a.s., E t Ft-1 = 0 a.s., E 2 t Ft-1 = 2 Xt-1, the last result being by analogy with the Galton-Watson branching process. The problem is to estimate on the basis of data {X0, . . . , XT }. Here the semimartingale representation is Xt = Xt-1 + ut, where the ut = t Xt-1+ t are M G differences. The Hutton-Nelson quasi-score estimating function based on {ut} is Q T () = T 1 Xt-1 E u2 t Ft-1 (Xt - Xt-1) and E u2 t Ft-1 = X2 t-1 E 2 t Ft-1 + 2Xt-1 E t t Ft-1 + E 2 t Ft-1 may not be explicitly computable in terms of the data; indeed a tractable form requires independent environments. If we attempt to restrict the family of estimating functions to ones that are explicit functions of the data H : HT = T t=1 at(Xt-1, . . . 
, X0; ) (Xt - Xt-1) , then there is not in general a solution to the optimality problem of minimizing var HS T . Suppose now that the environments in the previous example are independent and 2 = E(2 t | Ft-1). Then, E u2 t Ft-1 = 2 X2 t-1 + 2 Xt-1, so that Q T = T 1 Xt-1 2X2 t-1 + 2Xt-1 (Xt - Xt-1). The QLE is ^T = T 1 Xt Xt-1 2X2 t-1 + 2Xt-1 T 1 X2 t-1 2X2 t-1 + 2Xt-1 , 2.7. SOME PROBLEM CASES FOR THE METHODOLOGY 37 which contains the nuisance parameters 2 , 2 . However, on the set {Xt }, ^T will have the same asymptotic behavior as T-1 T t=1 Xt Xt-1 . This kind of consideration suggests the need to formulate a precise concept of asymptotic quasi-likelihood and this is done in Chapter 5. Another awkward example concerns the estimation of the growth parameter in a logistic map system whose observation is subject to additive error (e.g., Berliner (1991), Lele (1994)). Here we have observations {Yt, t = 1, 2, . . . , T} where Yt = Xt + et (2.17) with the {Xt} process given by the deterministic logistic map Xt+1 = Xt(1 - Xt), t = 0, 1, 2, . . . , while the e's are i.i.d. with zero mean and variance 2 (assumed known). Based on the representation (2.17) one might consider the family of estimating functions H = T t=1 ct(Yt - Xt), ct s constants , (2.18) although some adjustments are clearly necessary to avoid nuisance parameters, the X's not being observable. Nevertheless, proceeding formally, the quasiscore estimating function from H is QT () = T t=1 dXt d (Yt - Xt-1(1 - Xt-1)). The weight functions dXt/d are, however, not practicable to work with. Note that the system {Xt} is uniquely determined by X0 and and that Xt may be written (unhelpfully) as a polynomial of degree 2t - 1 in and 2t in X0. It may be concluded that, despite its apparent simplicity, H is a poor choice as a family of estimating functions and that one based more directly on the available data may be more useful. With this in mind, and noting that Yt(1 - Yt) + 2 - Xt(1 - Xt) = et(1 - 2Xt) + (2 - e2 t ), so that Yt(1 - Yt) + 2 is an unbiased estimator of Xt(1 - Xt), we can consider the family K = T t=1 ct(Yt - (Yt-1(1 - Yt-1) + 2 )), ct s constants (2.19) 38 CHAPTER 2. THE GENERAL FRAMEWORK of estimating functions. The element of this family for which ct = 1 for all t has been shown by Lele (1994) to provide an estimator that is strongly consistent for and asymptotically normally distributed. Quasi-likelihood methods do offer the prospect of some improvement in efficiency at the price of considerable increase in complexity (see Exercise 4, Section 2.8). 2.8 Complements and Exercises Exercise 1. Show that an equivalent form to Criterion 2.2 when the score function UT exists is E UT - G (s) T G (s) T = 0 = EG (s) T UT - G (s) T . (Note that UT = U (s) T .) (Godambe and Heyde (1987)). Exercise 2. Extend Theorem 2.2 to include the smallest eigenvalue criterion and average variance criterion in the remark immediately following the statement of the theorem as additional equivalent conditions (e.g., Pukelsheim (1993)). Exercise 3. (Geometric waiting times) If time is measured in discrete periods, a model that is often used for the time X to failure of an item is P(X = k) = k-1 (1 - ), k = 1, 2, . . . . Suppose that we have a set of i.i.d. observations X1, . . . , Xn from X. (i) Show that this is an exponential family model and find the MLE ^ of . (ii) Show that EX = (1 - )-1 and that the QLE for from the family H = { n i ci(Xi - EXi), ci constants} coincides with the MLE. Now suppose that we have censoring of the data. 
Suppose that we only record the time of failure if failure occurs on or before time r and otherwise we just note that the item has survived at least time r + 1. Thus we observe Y1, . . . , Yn, which have the distribution P(Yi = k) = k-1 (1 - ), k = 1, 2, . . . , r P(Yi = r + 1) = 1 - P(X r + 1) = r . Let M be the number of indices i such that Yi = r + 1. (iii) Show that the MLE of based on Y1, . . . , Yn is ^(Y1, . . . , Yn) = n i=1 Yi - n n i=1 Yi - M . 2.8. COMPLEMENTS AND EXERCISES 39 (iv) Show that the QLE for based on the family H = n 1 ci(Yi - EYi), ci constants does not coincide with the MLE ^(Y1, . . . , Yn). (v) Now subdivide the data into those indices i such that Yi = r + 1 and those indices i for which Yi r and consider these sets separately. To make this clear, define Zi = Yi I(Yi = r + 1), Wi = Yi I(1 Yi r), i = 1, 2, . . . , n, I denoting the indicator function. Next consider the families of estimating functions H1 = n 1 ci(Zi - EZ1), ci constants , H2 = n 1 di(Wi - EW1), di constants , and show that these lead to QSEF's (r + 1)(M - n r ) (2.20) and n 1 Yi - (r + 1) M - n 1 - r 1 - - r r , (2.21) respectively. Note that the first of these suggests the use of M to estimate nr and substituting this into n 1 Yi - (r + 1) M - n 1 - ^r 1 - ^ - r ^r = 0 with M = n^r leads to the ^(Y1, . . . , Yn) defined above in (iii). (vi) Show that the estimating functions (2.20) and (2.21) can also be added to produce ^(Y1, . . . , Yn). That is, there are multipliers and such that (r + 1)(M - nr ) + n 1 Yi - (r + 1)M - n 1 - r 1 - - rr = 0 has solution = ^(Y1, . . . , Yn). Find / . 40 CHAPTER 2. THE GENERAL FRAMEWORK (vii) Show that the solution in (vi) is the QS estimating function for the family H = n 1 (Zi - EZ1) + n 1 (Wi - EW1), , constants . Exercise 4. For the logistic map observed subject to noise investigate the quasi-score estimating function from the family (2.19) together with practicable variants. If t = Yt - Yt-1 (1 - Yt-1) + 2 , note that s and t are not independent unless |t-s| > 1. Obtain strong consistency and asymptotic normality results for estimators if possible and compare asymptotic variances with that obtained for the estimator based on the case ct = 1, all t. Infinite moments The quasi-likelihood theory developed herein is based on estimating functions with zero means and finite variances. When the data are derived from distributions that do not possess finite second moments, linear functions of the data cannot be used as a basis for constructing estimating functions and transformation of the data is required. Exercise 5 provides a simple illustation of the possibilities. More information is provided in Chapter 13. Exercise 5. Let Xi, i = 1, 2, . . . , n be i.i.d.r.v. having a Cauchy distribution with density b/((b2 + x2 )), - < x < . Show that n-1 n i=1 cos Xi can be obtained as a QLE for e-b . Extending this reasoning, show that, for any real t, n-1 n i=1 cos t Xi is a QLE for e-tb and investigate the question of a suitable choice for t. Non-regular cases In assessing the appropriateness of likelihood-based methods for parameter inference in a particular context, it is necessary to check the following basic set of requirements: (1) Is the likelihood differentiable with respect to the parameter (i.e., does the score function exist)? (2) Does Fisher information exist? (3) Does the Fisher information of the sample increase unboundedly as the sample size increases? (4) Can the parameter of interest take a value on the boundary of the parameter set? 
(5) Is the likelihood zero outside a domain that depends on the unknown parameter? 2.8. COMPLEMENTS AND EXERCISES 41 Failure of any of these regularity conditions is typically a warning signal that non-standard behavior may be expected. Similar considerations apply also to quasi-likelihood methods in respect to (2)-(4) and the information E. The following exercises illustrate consequences of warning signals (4), (5). Exercise 6. Let X1, . . . , Xn be i.i.d. uniform U(0, ) random variables. (i) Show that the maximum likelihood estimator X(n) = maxkn Xk does not satisfy the likelihood equation (obtained from equating the score function to zero). (ii) Show that n( - X(n)) converges in distribution to an exponential law. (iii) Let Sn = [n + 1/n] X(n). Show that n is preferable asymptotically to X(n) by verifying that E n(X(n) - )2 22 , E n(n - )2 2 as n . Exercise 7. Let Y1, Y2, . . . , Yn be i.i.d. such that Yj = |Xj|, where Xj has a N(, 1) distribution. The situation arises, for example, if Xj is the difference between a matched pair of random variables whose control and treatment labels are lost. The sign of the parameter is unidentifiable so we may work with = ||, with parameter space = [0, ). Show that the score function is n j=1 [Yj + tanh (Yj) - ] and observe that for > 0 the problem is a regular one and n (^n - ) d - N(0, i()) with i() = 1 - 2 2 e- 1 2 2 0 y2 e- 1 2 y2 sech (y) dy. For = 0 all this breaks down and the score function is identically zero. However, show that a first-order expansion of the score function gives ^2 n = 3 n 1 Y 2 j - n n 1 Y 4 j provided n 1 Y 2 j > n. Show that ^2 n has asymptotically a N(0, 2) distribution truncated at zero (Cox and Hinkley (1974, pp. 303-304)). Chapter 3 An Alternative Approach: E-Sufficiency 3.1 Introduction In classical statistics, the concept of "sufficiency" and its dual concept of "ancillarity" play a key role and sufficient statistics provide the essential approach to parameter inference in cases where they are available. To adapt these basic concepts to the context of estimating functions, Small and McLeish developed the expectation based concepts of E-sufficiency, E-ancillarity, local E-sufficiency and local E-ancillarity. The basic idea is that an ancillary statistic is one whose distribution is insentive to changes in the parameter and that, in an estimating function context, changes to the parameter are first evident through changes to the expectation of the estimating function. For a space of estimating functions the E-ancillary subset, A (say), are defined on the basis of expectation insensitivity to parameter changes and the E-sufficient estimating functions belong to the orthogonal complement of A with respect to . It is argued that an optimum estimating function from should be chosen from the E-sufficient or locally E-sufficient subset of , if this exists. Detailed expositions are given in McLeish and Small (1988), Small and McLeish (1994, Chapter 4). It turns out that our framework of optimal or quasi-score estimating functions in this book corresponds most closely to the notation of locally E-sufficient estimating functions and this chapter is concerned with their relationship. Our study shows that whether a quasi-score estimating function is locally Esufficient depends strongly on the estimating function space chosen, and also that, under certain regularity and other conditions, the quasi-score estimating function will be locally E-sufficient. In many cases the two approaches both lead to the same estimating equation. 
The treatment here follows Lin (1994a). 3.2 Definitions and Notation Let X be a sample space and P be a class of probability measures P on X. For each P, let vP denote the p-dimensional vector space of vector-valued functions {f} which are defined on the sample space X and satisfy EP f 2 < , where f 2 = f (x) f(x). Let be a real-valued function on the class of probability measures P and = {(P), P P} IRp be the parameter space. Now we consider a space in which every estimating function (, x) = () is a mapping from the parameter space and the sample space into IRp . 43 44 CHAPTER 3. AN ALTERNATIVE APPROACH: E-SUFFICIENCY Each element of is unbiased and square integrable, i.e., EP [((P))] = 0, and 2 (P ) EP [ ((P)) ((P))] < , (3.1) for all P and = (P). Assume that the space has constant covariance structure, i.e., the inner product EP [ ((P)) ((P))] depends on P only through (P), for all , . As usual, we also require that be a Hilbert space of estimating functions and weak square closed, i.e., for any sequence of functions n , if limn, m n - m = 0, then there is a function such that limn n - = 0. The choice of will, in general, depend on ; see McLeish and Small (1988), Small and McLeish (1994). Definition 3.1 An unbiased estimating function () is locally E-ancillary if it is a weak square limit of functions {} that satisfy EP [()] = o(|(P) - |), for (P) . (3.2) We note that if the order of the differentiation and integration can be interchanged, then equation (3.2) can be rewritten as E i j p×p = 0. Therefore, under suitable regularity conditions, Definition 3.1 can be replaced by Definition 3.2 below. Definition 3.2 An unbiased estimating function () is locally E-ancillary if it is a weak square limit of functions {} that satisfy E i j p×p = 0. The locally E-ancillary estimating functions form a subset of the estimating function space, which will be denoted by Aloc. Following the definition of local E-ancillarity, when Aloc, i.e., (1, . . . , p) Aloc, then all of the estimating functions ai,j = (0, . . . , ai, . . . , 0), where ai = j and i, j = 1, . . . , p, belong to Aloc. Definition 3.3 A subset Sloc of the estimating function space is complete locally E-sufficient when Sloc if and only if E( () ()) = 0 for all P, = (P) and Aloc. 3.2. DEFINITIONS AND NOTATION 45 Because of the special property of Aloc, when Sloc, for each element of Aloc, say = (1, . . . , p) , we have E( aij) = (0, . . . , ai, . . . , 0) , ai = j, and i, j = 1, . . . , p, i.e, E(ij) = 0 (i, j = 1, . . . , p). So E( ) = 0, which in turn implies that E( ) = 0. Therefore, the definition of Sloc has the following equivalent formulation. Definition 3.3 A subset Sloc of the estimating function space is complete locally E-sufficient when Sloc if and only if E(() () ) = 0 for all P, = (P) and Aloc. It follows from the definition of Sloc that Sloc = A loc, and by the assumptions on , we have = Sloc Aloc. However the decomposition is associated with . A more detailed discussion of the space can be found in Small and McLeish (1991). In general there is no guarantee of the existence of a complete locally Esufficient subspace. However, if it does exist it is a closed linear subspace of . Completeness is dispensed with in the following definition. Definition 3.4 A subset L of is locally E-sufficient if for , E( () ()) = 0 for all L and implies that is locally E-ancillary. 
As a simple illustration we consider the square integrable scalar process {Xt}, with Xt adapted to the increasing sequence of a fields Ft, and such that E Xt Ft-1 = t, var Xt Ft-1 = t() for an unknown parameter and Ft-1-measurable t, t() whose forms are specified. Let be the space of square integrable estimating functions of the form () = n i=1 Ai()(Xi - i), Ai's are Fi-1 measurable . (3.3) Then, if the function () = n i=1 i-1 i (Xi - i) is in the space , it generates the locally E-sufficient subspace. 46 CHAPTER 3. AN ALTERNATIVE APPROACH: E-SUFFICIENCY To check this, we note that of () is of the form (3.3), then E( () ()) = E n i=1 i -1 i iAi() = E n i=1 i Ai() = d d E n i=1 ( - )i Ai() = = d d E () = and if this is zero then is locally E-ancillary. 3.3 Results In the following, assume that satisfies the usual regularity conditions, so that Definition 3.2 can be used to define the concept of local E-ancillarity. Let G1 be a subset of . Each element G of G1 satisfies the conditions that E ˙G() = (E(Gi/j))p×p and EG() G () are nonsingular. The score function is as usual denoted by U(). The fact that = Sloc Aloc leads us to seek conditions under which a quasi-score estimating function is in Sloc or Aloc. In the following discussion, a quasi-score estimating function in G1 is always denoted by G and, for an estimating function G, if G = 0 a.s., we write G = 0. Theorem 3.1 If G1 is a convex set and the quasi-score estimating function G G1 Aloc (resp. G G1 Sloc), then G1 Sloc = (resp. G1 Aloc = ). Proof. We give only the first part of the proof. The proof of the second part is analogous. Applying Theorem 2.1, if G G1 Aloc, but G1 Sloc = , then there is a G G1 Sloc such that E ˙G -1 EG G = E ˙G -1 EG G . Following the definition of Sloc, we have EG G = 0. So the left hand side of the above equation is equal to zero. This implies EG G = 0, and G has to be a zero vector. This is contrary to the definition of a quasiscore estimating function. Therefore G1 Sloc = . 3.3. RESULTS 47 From Theorem 3.1 we note that, if there are no other restrictions on G1, it is possible that G is outside Sloc. However in general, under regularity conditions, G1 Aloc = . This implies that G can be only in the complement of Aloc. But this conclusion has not answered the questions of whether G Gloc and what the necessary conditions are for G to belong to Sloc. The conditions that force G Sloc are of interest and are provided in the following theorems. Theorem 3.2 Suppose that G1 is a convex set and G G1 - G1 Aloc. Let G ps be the projection of G on Sloc. If G ps G1 and G ps satisfies the condition that G ps + G G1 for any G with E ˙G = 0, then G Sloc. Proof. Assume that G = G ps + G pa, where G pa Aloc and G ps Sloc. Since G pa Aloc, there is a sequence {Gpa,n} in Aloc with E ˙Gpa,n = 0 (n = 1, 2, . . .), such that E((G pa - Gpa,n) (G pa - Gpa,n)) 0, as n . Let Gn = G ps + Gpa,n. By the assumption of the theorem, we have Gn G1. So by Theorem 2.1, E ˙Gn -1 EGnG = E ˙G -1 EG G , i.e., E ˙G ps -1 EG psG ps + EGpa,nG pa = E ˙G ps + E ˙G pa -1 EG psG ps + EG paG pa . (3.4) Since E Gpa,nG pa - G paG pa = E Gpa,n - G pa G pa E Gpa,n - G pa E G pa 0 as n , letting n in (3.4) gives E ˙G ps -1 EG psG ps + EG paG pa = E ˙G ps + E ˙G pa -1 EG psG ps + EG paG pa , which implies that E ˙G pa = 0. Applying Theorem 2.1 again, we have E ˙G ps -1 EG psG = E ˙G -1 EG psG ps + EG paG pa = E ˙G ps -1 EG psG ps + EG paG pa , 48 CHAPTER 3. AN ALTERNATIVE APPROACH: E-SUFFICIENCY i.e., E ˙G ps -1 EG psG ps = E ˙G ps -1 EG psG ps + EG paG pa . 
Therefore EG paG pa = 0 and hence G pa = 0. Thus, G G1 Sloc. Theorem 3.3 Suppose that G1 is a convex set. Let G = G ps + G pa, where G ps Sloc G1 and G pa Aloc. If E ˙G = E ˙G ps for all , then G Sloc. The proof of Theorem 3.3 is analogous to that of Theorem 3.2 and is omit- ted. Before another important property for the quasi-score estimating function is mentioned, we introduce some new notation and definitions. Let ~G1 = {G, E ˙G is nonsingular and G }, so that ~G1 G1. Definition 3.5 An estimating function G T is a sub-quasi-score estimating function in ~G1, if E ˙GT -1 EGT GT E ˙GT -1 - E ˙G T -1 EG T G T E ˙G T -1 is nonnegative-definite for all GT ~G1 and all . Remark. The relationship between a sub-quasi-score estimating function and a quasi-score estimating function is as follows. If a sub-quasi-score estimating function G T is an element of G1, then G T is a quasi-score estimating function in G1. However, if G T is a quasi-score estimating function in G1, G T need not to be a sub-quasi-score estimating function in ~G1 because the space ~G1 is larger than G1 Obviously a sub-quasi-score estimating function is not a quasi-score estimating function in general. However, if we restrict our consideration to the model Xt = t 0 fs ds + ms, and consider the estimating function space T = { T 0 s() dms, s() Fs-} with G1 ~G1 T , where {ms} is a martingale with respect to some -fields {Ft} and {s} is an increasing function, then following the proof in Godambe and Heyde (1987, p. 238), we can show that the quasi-score estimating function G is a sub-quasi-score estimating function in ~G1. Since the conditions required for the sub-quasi-score estimating function are weaker than those for the quasi-score estimating function, it should be easy 3.3. RESULTS 49 to find the conditions under which the sub-quasi-score estimating function is an E-sufficient estimating function. We now restrict by requiring that the subset Aloc of satisfy the following condition. Condition 3.1 G Aloc if and only if E ˙G = 0 for all . Theorem 3.4 Let G be a sub-quasi-score estimating function on ~G1 . Then, under Condition 3.1, G Sloc. Proof. Express G as G = G ps + G pa, where G ps Sloc and G pa Aloc. Since G pa Aloc and G ~G1, it follows that E ˙G ps = E ˙G is nonsingular for all . Following Definition 3.5 we have that E ˙G ps -1 EG psG ps E ˙G ps -1 - E ˙G -1 EG G E ˙G is nonnegative-definite. That is, E ˙G ps -1 EG psG ps E ˙G ps -1 - E ˙G ps -1 EG psG ps + EG paG pa E ˙G ps -1 is nonnegative-definite. Consequently, -EG paG pa is nonnegative-definite, which gives EG paG pa = 0 and G Sloc. Theorem 3.4 shows that, if Condition 3.1 is satisfied and if a quasi-score estimating function is a sub-quasi-score estimating function, then it will belong to Sloc. Therefore, another way to check whether G Sloc is by verifying Condition 3.1 and showing that G is a sub-quasi-score estimating function. As an example, consider a stochastic process {Xt} that satisfies a stochastic diferential equation of the form dXt = at() d M() t + dMt() (t T), (3.5) where {Mt} is a square integrable martingale with predictable variation process M t observable up to the value of the parameter . Also assume that the predictable process {at()} is observable up to the parameter value and there exists a real-valued predictable process f(t; , ) such that s 0 at() d M() t = s 0 f(t; , ) d M() t, for all 0 < s < T and all sufficiently close to . Noting that f(t; , ) = at() (see McLeish and Small (1988, p. 
103)) and restricting our consideration to a space of estimating functions, in which each element has the form () = T 0 Hs() dMs(), where H is a predictable process, we obtain the following 50 CHAPTER 3. AN ALTERNATIVE APPROACH: E-SUFFICIENCY proposition. Proposition 3.1 Under the conditions of the model (3.5), Condition 3.1 is satisfied, i.e., Aloc = { : E ˙ = 0, all }. Proof. We note that, for any estimating function in the estimating function space and parameters and in the parameter space, E() = E T 0 Ht()[dMt() + at() d M() t - at() d M() t] = E T 0 Ht()[f(t; , ) - f(t; , )] d M() t . Dividing by - in the above equation and then letting , we obtain E ˙() = E () = = E T 0 Ht() f(t; , ) d M() t. For any 0() = T 0 Ht,0() dMt() in Aloc, we can show that E 0 = 0. From the definition of Aloc, there is a sequence of estimating functions {n = T 0 Ht,n dMt()} with E ˙n = E T 0 Ht,n() f(t; , ) d M() t = 0, for all , which weakly square converges to 0. Therefore, applying the KunitaWatanabe inequality (see e.g. Elliott (1982, Corollary 10.12, p. 102)), we obtain E ˙0() = E ˙0() - ˙n() = E T 0 (Ht,0() - Ht,n()) f(t; , ) d M() t E T 0 (Ht,0() - Ht,n())2 d M() t 1 2 E T 0 f(t; , ) 2 d M() t 1 2 = 0 - n 2 E T 0 f(t; , ) 2 d M() t 1 2 0 as n , 3.4. COMPLEMENT AND EXERCISE 51 which yields E ˙0 = 0 for all . Therfore, Aloc = { : E ˙() = 0, all , }. Applying Proposition 3.1 and following the remark under Definition 3.5 for models satisfying (3.5), all quasi-score estimating functions from the space G1 = { T 0 Hs() dMs(); H a predictable process} lie in Sloc. As another example consider the widely used model dXt = ft() d + dMt, where {ft()} is a predictable process and M is a square integrable martingale with d M() t = at() d. Since this model can be rewritten in the form (3.5) and also satisfies the assumption required for (3.5), the quasi-score estimating function in the space = T 0 Ht() [dXt - ft() d], {Ht} a predictable process is locally E-sufficient. Condition 3.1 plays an important role in our discussion. However, checking this condition in general is still an issue. Of course Proposition 3.1 says that Condition 3.1 is always true for the model 3.5, which covers many problems and a similar analysis can be undertaken in other particular cases. 3.4 Complement and Exercise The theory in this book has been formulated on the assumption that the estimating functions G() under consideration are differentiable with respect to . It is possible, however, to extend this theory by replacing the -E ˙G() used herein by G() = EG() = . This feature has been incorporated in the theory of this chapter. The following exercise indicates the kind of result that is thereby included. 1. Estimation of the parameter in the density function 1 2 e-|x-| , - < x < , is required on the basis of a random sample {X1, . . . , XT }. Consider the space of estimating functions that contains just the two ele- ments 1 = T t=1 |Xt - | - T, 2 = T t=1 sgn (Xt - ). Show that 1 is locally E-ancillary and 2 is locally E-sufficient for estimation of . Note that 2 leads to the sample median as an estimator of . Chapter 4 Asymptotic Confidence Zones of Minimum Size 4.1 Introduction Thus far in the book we have concentrated on the optimal properties of estimating functions that are derived from maximum information, minimum distance, or, in the case of Chapter 3, sufficiency ideas, which all involve fixed finite samples. 
Now we shall proceed to asymptotic considerations about confidence zones that generalize classical results holding for maximum likelihood. However, some comments about maximum likelihood itself are appropriate as a prelude since, although it is one of the most widely used statistical techniques, it does suffer from various problems. The principal justification of maximum likelihood is asymptotic. Under certain regularity conditions the maximum likelihood estimator is asymptotically unbiased, consistent, and asymptotically normally distributed with minimum size asymptotic confidence zones for the unknown parameter. Unfortunately the regularity conditions are rather stringent, and there is no shortage of striking examples of things that can go wrong, even in such seemingly innocent contexts as for exponential families. Le Cam (1990b) has given an entertaining account of problems for i.i.d. variables. However for all its faults, maximum likelihood methods do provide a benchmark against which others are judged and the exceptions just serve to underline the principle of reviewing each application individually for compliance. Not surprisingly, whatever problems can occur for likelihood based methods can also occur for quasi-likelihood. However, the quasi-likelihood framework has one big advantage and that is the opportunity to choose the family of estimating functions within which to work. Appropriate regularity can be demanded of these estimating functions whereas there is ordinarily little control over the choice of model and consequent form of the likelihood. Despite the foregoing comments we shall ordinarily take counterparts of the standard regularity conditions for maximum likelihood for granted in developing the corresponding quasi-likelihood theory. Exceptions typically need rather special treatment. The reader is reminded that the standard regularity conditions for maximum likelihood involve a finite dimensional parameter space , say, which is an open subset of the corresponding Euclidean space and a true parameter which is an interior point of . A likelihood which is at least twice differentiable with respect to the parameter is postulated and differentiation under the integral sign is permitted. Various other technical conditions are typically 53 54 CONFIDENCE ZONES OF MINIMUM SIZE imposed which are sufficient to ensure consistency and asymptotic normality of the estimators (e.g., Cox and Hinkley (1974, Chapter 9), Basawa and Prakasa Rao (1980, particularly Chapter 7)). The most common exceptions involve unknown endpoint problems (i.e., one endpoint of the range of the distribution in question is the unknown parameter), parameters on the boundary of the parameter space and unbounded likelihoods. For discussion of nonregular cases for the MLE see, e.g., Smith (1985), (1989), Cheng and Traylor (1995). Whenever something can be done for the MLE, it may be expected that corresponding quasi-likelihood analogues can be developed. Suppose that we have experiments indexed by t [0, ) and {Ft} denotes a standard filtration generated from these experiments. Both discrete and continuous time are covered. With little loss in generality we confine attention to the class G of zero mean, square integrable, Ft-measurable, semimartingale estimating functions Gt() which are vectors of dimension p such that the p dimensional matrices ˙Gt() = (Gt,i()/j), E ˙Gt(), EGt() Gt(), and [G()]t are (a.s.) nonsingular for each t > 0, [G()]t denoting the quadratic variation process. 
Here [G()]t = [Gcm ()]t + 0 0 a.s.) we have, as T , (EGT () GT ())-1/2 E ˙GT () ( T - ) d - Z, (4.3) say, not depending on the choice of GT . The size of confidence zones for is then governed by the scaling "informa- tion" E(Gt()) = E ˙Gt() EGt() Gt() -1 E ˙Gt() = EG (s) t () G (s) t () and we prefer estimating function G1,t to G2,t if E(G1,t) E(G2,t) for each t t0 in the Loewner ordering (partial order of nonnegative definite matrices). This is precisely the requirement of OF -optimality, as discussed in Chapter 2, but for all sufficiently large samples. We shall henceforth suppose that a quasi-score or asymptotic quasi-score estimating function Qt() has been chosen for which (4.3) holds and our considerations regarding confidence zones will be based on Qt. Meta Theorem: Under sensible regularity conditions a quasi-likelihood estimator within some H is strongly consistent and, with suitable norming, asymptotically normally distributed. It can be used to construct minimum size asymptotic confidence zones for estimators within H. This is a result for which a satisfactory formal statement and proof are elusive to the extent that any readily formulated set of sufficient conditions has obvious exceptions for which the desired results continue to hold. Furthermore, it is usually preferable, and more economical, to check directly in a particular case to see whether consistency and asymptotic normality hold than to try to verify the sufficient conditions of a general theorem that might not quite apply. The results in this area mostly make use of martingale strong laws and central limit theorems and we shall provide powerful versions of these in 56 CONFIDENCE ZONES OF MINIMUM SIZE Chapter 12. They can be used in particular cases. For specific results along the lines of the meta theorem we refer the reader to Hutton and Nelson (1986), Theorem 3.1 (strong consistency) and Theorem 4.1 (asymptotic normality), Hutton, Ogunyemi and Nelson (1991), Theorem 12.1 (strong consistency) and Theorem 12.2 (asymptotic normality) and Greenwood and Wefelmeyer (1991), Proposition 10.1. See also Section 3 of Barndorff-Nielsen and Srensen (1994). Of course the issue of "sensible regularity conditions" is crucial. For some examples where asymptotic normality of the quasi-likelihood estimators does not hold, see Chan and Wei (1988). These involve unstable autoregressive processes. Much of the complication of regularity conditions can be drawn off into a single stochastic equicontinuity condition (Pollard (1984, Chapter 7)) in the case where a stochastic maximization or minimization is involved, or similarly, into stochastic differentiability conditions (Hoffmann-Jrgensen (1994, Chapter 14)). 4.3 Confidence Zones: Theory At the outset it is important to emphasize that confidence intervals based on (4.3) are mostly difficult to formulate unless Z is normally distributed and it is desirable, wherever possible, to renormalize to obtain asymptotic normality. Indeed, as we shall see in Section 4.4, there is a specific sense in which confidence intervals based on asymptotic normality are preferable, on average, to those based on an alternative limiting distribution, at least in the scalar case. For Qt() G, the result [Q] -1/2 t Qt d - MV N(0, Ip) (4.4) seems to encapsulate the most general form of asymptotic normality result. Norming by a random process such as [Q]t is essential in what is termed the nonergodic case (for which (E[Q]t)-1 [Q]t p - constant as t ). 
A simple example of this is furnished by the pure birth process Nt with intensity Nt-, where we take for Qt the score function and Qt = -1 (Nt - 1) - t 0 Ns- ds, [Q]t = -2 (Nt - 1), while [Q] -1/2 t Qt d - N(0, 1), (EQ2 t )-1/2 Qt d - W1/2 N(0, 1) and (E[Q]t)-1 [Q]t a.s. - W, where W has a gamma distribution with the form parameter N0 and shape parameter N0 and W1/2 N(0, 1) is distributed as the product of independent 4.3. CONFIDENCE ZONES: THEORY 57 W1/2 and N(0, 1) variables. Here E(Qt()) = E[Q]t = EQ2 t , the MLE ^t satisfies ^t = (Nt - 1) t 0 Ns- ds and 1 [Q] 1 2 t (^t - ) d - N(0, 1), 1 E(Qt()) 1 2 (^t - ) d - W- 1 2 N(0, 1). On average, confidence intervals for from the former are shorter than those from the latter (Section 4.4). To obtain confidence zones for from (4.4) we may use the Taylor expansion (4.2) and then under appropriate continuity conditions for ˙Qt, ˙QT (1,T ) ˙QT () -1 p - Ip and, when (4.4) holds [Q()] -1/2 T ˙QT () ( T - ) d - MV N(0, Ip) (4.5) as T . For the construction of confidence intervals we actually need this convergence to be uniform in compact intervals of ; we shall use u.d. - to denote such convergence and henceforth suppose that (4.5) holds in this mode. We shall also write E(Qt()) = - ˙Qt() [Q()]-1 t - ˙Qt() . (4.6) If C is a column vector of dimension p, convergence in (4.5) is mixing in the sense of R´enyi (see Hall and Heyde (1980, p. 64)) and E(QT ()) behaves asymptotically like a constant matrix. Then, replacing E(QT ()) by the estimated E(QT ( T )), we obtain P C C T z/2 C E(QT ( T )) -1 C 1 - where (z) = 1 - , denoting the standard normal distribution function. In particular, this provides asymptotic confidence results for the individual elements of and also any nuisance parameters can conveniently be deleted. Other confidence statements of broader scope may also be derived. Noting that ( T - ) E (QT ( T )) ( T - ) u.d. - 2 p 58 CONFIDENCE ZONES OF MINIMUM SIZE and that ( T - ) E (QT ( T )) ( T - ) = max c (C ( T - ))2 C ( E (QT ( T )))-1C , we have that P max c |C ( T - )| (C ( E (QT ( T )))-1C)1/2 p, 1 - , where 2 p, is the upper point of the chi-squared distribution with p degrees of freedom, so that P C 0 C T p, C ( E (QT ( T )))-1 C 1/2 for all C 1 - . Also, if we let c be the set of all possible satisfying the inequality ( T - ) E (QT ( T )) ( T - ) 2 p,, the so-called ellipsoids of Wald in the classical case (e.g., Le Cam (1990a)), then simultaneous confidence intervals for general functions gi(), i = 1, 2, . . . , p with asymptotic confidence possibly greater than (1 - ) are given by min c gi(), max c gi() , i = 1, 2, . . . , p. This allows us to deal conveniently with confidence intervals for nonlinear functions of the components . Other methods are available in particular cases and, it often happens, in cases where likelihood-based theory is available and tractable, that the ML ratio produces better-behaved confidence zones than the ML estimator (e.g. Cox and Hinkley (1974, p. 343)). In such a case, if = ( , ) and 0 = {0, }, then the set 0 : sup Lt() - sup 0 LT () 1 2 2 d, (4.7) based on the likelihood function L, and with d = dim , may give an asymptotic confidence region for of size . It should be noted that (4.7) is invariant under transformation of the parameter of interest. It is usually the case that the quasi-score or asymptotic quasi-score estimating function Qt() is a martingale and subject to suitable scaling, - ˙Qt() - [Q()]t, (4.8) - ˙Qt() - Q() t, (4.9) and [Q()]t - Q() t (4.10) 4.3. 
CONFIDENCE ZONES: THEORY 59 are zero mean martingales, Q() t being the quadratic characteristic of Qt. The quantity - ˙Qt() is a generalized version of what is known as the observed information, while EQt() Qt() = E[Q()]t = E Q() t = -E ˙Qt() is a generalized Fisher information. The martingale relationships (4.8) ­ (4.10) and that fact that each of the quantities - ˙Qt(), [Q()]t and Q() t goes a.s. to infinity as t increases usually implies their asymptotic equivalence. Which of the asymptotically equivalent forms should be used for asymptotic confidence zones based on (4.5) is unclear in general despite the fact that many special investigations have been conducted. Let S be the set of (possibly random) normalizing sequences {At} that are positive definite and such that A-1 t () [Q()] -1/2 t ˙Qt() p - Ip, (4.11) as t . Then, each element of S determines an asymptotic confidence zone for , E(Qt()) being replaced by At() At(). Of course (4.11) ensures that zones will be similar for large T with high probability. In the ergodic case when likelihoods rather than quasi-likelihoods are being considered, there is evidence to support the use of the observed information - ˙Qt() rather than the expected information EQt() Qt() (e.g., Efron and Hinkley (1978)). In the general case, Barndorff-Nielsen and Srensen (1994) have used a number of examples to argue the advantages of (4.6), which, as we have seen, comes naturally from (4.4). On the other hand, in Chapter 2.5 we have used the form - Qt() Q() -1 t - Qt() , where Qt() is a matrix of predictable processes such that ˙Qt() - Qt() is a martingale. This has the advantage of a concrete interpretation as an empirical information and is a direct extension of the classical Fisher information and is closely related to E(Qt()). The general situation is further complicated by the fact that E(Qt()) ordinarily involves the unknown and consequently has to be replaced by E(Qt( t )) in confidence statements. The variety of possibilities is such that each case should be examined on an individual basis. Finally, some special comments on the nonergodic case are appropriate. Suppose that (E[Q]t)-1 [Q]t p - W (> 0 a.s.) as t . As always, a sequence of normalizing matrices {At} is sought so that AT ( T - ) d - MV N(0, Ip) as T . However, it has been argued that one should condition on the limit random variable W and then treat the unobserved value w of W as a nuisance parameter to be estimated. This approach has the attraction of reducing a 60 CONFIDENCE ZONES OF MINIMUM SIZE nonergodic model to an ergodic one but at the price of introducing asymptotics which, although plausible, may be very difficult to formalize in practice (e.g., Basawa and Brockwell (1984)). A related approach based on conditioning on an asymptotic ancillary statistic has been discussed by Sweeting (1986). This also poses considerable difficulties in formalization but may be useful in some cases. The problem of precise choice of normalization, of course, usually remains. In conclusion, we also need to comment on the point estimate itself. If confidence statements are based on (4.5) and T is an estimator such that [Q()] -1/2 T ˙QT ()(T - T ) p - 0 as T , then [Q()] -1/2 T ˙QT ()(T - ) d - MV N(0, Ip) clearly offers an alternative to (4.5). There may sometimes be particular reasons to favor a certain choice of estimator, such as for the avoidance of a nuisance parameter. 
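Before turning to practice, here is a small simulation sketch of the pure birth process example given at the start of this section; the parameter value and run length are illustrative assumptions. It contrasts the quasi-score pivot normed by the observed quadratic variation [Q]_T with the same pivot normed by E[Q]_T: the first is approximately N(0,1), the second mixed normal with markedly heavier tails.

```python
import numpy as np

# Pure birth (Yule) process with intensity theta * N_{t-}, N_0 = 1.
# Q_T = theta^{-1}(N_T - 1) - int_0^T N_s ds,  [Q]_T = theta^{-2}(N_T - 1),
# E[Q]_T = theta^{-2}(exp(theta*T) - 1);  the MLE is (N_T - 1)/int_0^T N_s ds.
rng = np.random.default_rng(7)
theta, T, reps = 1.0, 6.0, 4000
piv_random, piv_constant = [], []

for _ in range(reps):
    t, n, area = 0.0, 1, 0.0                         # area accumulates int_0^T N_s ds
    while True:
        wait = rng.exponential(1.0 / (theta * n))    # holding time at population size n
        if t + wait >= T:
            area += (T - t) * n
            break
        area += wait * n
        t += wait
        n += 1
    births = n - 1
    if births == 0:                                  # skip the rare degenerate paths
        continue
    Q = births / theta - area                        # score/quasi-score at the true theta
    piv_random.append(Q / np.sqrt(births / theta ** 2))                     # [Q]_T norming
    piv_constant.append(Q / np.sqrt((np.exp(theta * T) - 1) / theta ** 2))  # E[Q]_T norming

def kurtosis(z):
    z = np.asarray(z) - np.mean(z)
    return np.mean(z ** 4) / np.mean(z ** 2) ** 2

print("random norming  : variance %.2f, kurtosis %.2f   (normal limit: 1 and 3)"
      % (np.var(piv_random), kurtosis(piv_random)))
print("constant norming: variance %.2f, kurtosis %.2f   (mixed normal: heavier tails)"
      % (np.var(piv_constant), kurtosis(piv_constant)))
```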
4.4 Confidence Zones: Practice In the foregoing theoretical discussion we have chosen to focus on the asymptotics of the estimator rather than the quasi-score estimating function Q() from which it is derived. However, if Q is asymptotically normal or mixed normal itself, there is significant empirical evidence in favor of basing confidence statements on Q directly. Suppose that (EQT () QT ())- 1 2 QT () d - MV N(0, Ip) resp. Q() - 1 2 T QT () d - MV N(0, Ip) . Then, we have the confidence regions : QT () EQT () QT () -1 QT () 2 p, (4.12) resp. : QT () Q() -1 T QT () 2 p, . The form of such regions will depend heavily on the context but it can be argued that they will usually be preferable to regions constructed on the basis of the asymptotic distribution of the estimator . It may be expected that our primary information concerns the distributions of Q() directly and that the Taylor expansion used to obtain from Q() introduces an unnecessary component of approximation. A simple example concerns nonlinear regression where Xi are independent random vectors with the distribution MV N(fi(), 2 I), i = 1, 2, . . . , T. Then, 4.4. CONFIDENCE ZONES: PRACTICE 61 the Hutton-Nelson quasi-score estimating function based on the martingale differences {mi() = Xi - fi()} is -2 ( ˙f()) m() where ˙f() = ˙f1() ..... . . . ..... ˙fT () with ˙fr() = (fr(i)/j), and m() = m1() ..... . . . ..... mT () which has an MV N(0, 2 ( ˙f()) ˙f()) distribution. Consequently, if 2 is known, the confidence zone corresponding to (4.12), namely, : -2 m () ˙f()(( ˙f()) ˙f())-1 ( ˙f()) m() 2 p, , is exact. On the other hand, the distribution of the quasi-likelihood estimator may be seriously skewed, leading to confidence regions based on ( T ) E(QT ( ))( T -), say, which are unreliable and sometimes very poor. This is despite the satisfactory asymptotics. The separate issue of possibly needing to replace 2 by an estimated value does not alter the general preference for confidence zones based on Q(). The problem referred to above can be alleviated by bias reduction methods and these have been widely studied. The usual approach is to seek a first order asymptotic expression for the bias E T = + bT () + rT (), say, where rT () = o( bT () ) and bT () 0 as T for fixed and then to replace T by the bias corrected estimator (BC) T = T - bT ( T ). In the regular classical setting of i.i.d. random variables, bT () is of the form b()/T. For a discussion of the basic methods of bias correction see Cox and Hinkley (1974, Sections 8.4 and 9.2). A more general approach to this problem involves modification of the score (or quasi-score) rather than the estimator itself. This has been developed by Firth (1993) for the case of maximum likelihood estimation but the ideas are readily adapted to the quasi-likelihood context. We replace the quasi-score estimating function QT () by Q (BC) T () = QT () + AT (), where AT () is chosen to be of the form ˙QT () bT (). This form is used since Taylor expansion gives 0 = QT ( ) QT () + ˙QT () ( T - ) 62 CONFIDENCE ZONES OF MINIMUM SIZE so that we have Q(BC) () - ˙QT () ( - - bT ()), which has been bias corrected to first order. The theory underlying such corrections can usefully be interpreted in differential geometry terms related to correcting for curvature (e.g., Barndorff-Nielsen and Cox (1994, Chapter 6) and references therein). For a recent further adaption to provide higher order corrections to estimating functions see Li (1996b). 
The idea is to remove first order bias while minimizing the mean squared error. This allows for situations in which, for example, skewness problems preclude the use of the Edgeworth expansion and the kurtosis is unknown. For another approach to adjusting score and quasi-score estimating functions, based on martingale methods, see Mykland (1995). 4.5 On Best Asymptotic Confidence Intervals 4.5.1 Introduction and Results Much asymptotic inference for stochastic processes is based on use of some obvious consistent estimator but involves a choice between competing normalizations, one being of constants and the other of random variables. In this section we study the common situation where either asymptotic normality or asymptotic mixed normality is achievable through suitable normalization. It should be remarked that asymptotic mixed nomality is also obtained under a variety of non-regular conditions (e.g., Kutoyants and Vostrikova (1995)) as well as regular ones. It is shown that there is a certain sense in which, on average, confidence intervals based on the asymptotic normality result are preferable to those based on asymptotic mixed normality. The result here is from Heyde (1992a). Theorem 4.1 Let {^t} be a sequence of estimators that is consistent for and {ct}, {dt} be norming sequences, possibly random, such that ct(^t - ) d - W, (4.13) dt(^t - ) d - -1 W (4.14) as t , where W is standard normal and > 0 a.s. is random, independent of W and such that ct d-1 t d - as t . Suppose that L (i) t (), i = 1, 2, is the minimum length of an 100(1 - )% confidence interval for based on an exact, approximate or assumed distribution of the pivot in (4.13) and (4.14), respectively, for which the 4.5. ON BEST ASYMPTOTIC CONFIDENCE INTERVALS 63 specified convergence result holds. Then lim inf t E L (2) t ()/L (1) t () 1. Remark 1. The theorem gives a sense in which, on average, confidence intervals based on the pivot in (4.13) are to be preferred to those based on the pivot in (4.14) whether or not a random norming is required. Remark 2. A principal application of the theorem is in the context of inference for the nonergodic models that are common in stochastic process estimation (e.g., Basawa and Scott (1983), Hall and Heyde (1980, Chapter 6)). In this context we have {ct} as random and {dt} as constants so that L (1) t () is a random variable and L (2) t () is a constant. One of the most important cases is that of locally asymptotic mixed normal (LAMN) families. This is the situation in which the log-likelihood ratio has asymptotically a mixed normal distribution and results of type (4.14) hold with {dt} as constants and (4.13) with {ct} as random variables, while Ect = dt for each n. For LAMN families one has, under modest regularity conditions, a striking result known as Haj´ek's convolution theorem (e.g., Theorem 2, page 47, of Basawa and Scott (1983)) to the effect that if {Tt} is any other sequence of consistent estimators of such that dt(Tt - ) d - U (4.15) for some nondegenerate U, then U d = V + -1 W, where V is independent of and W. It is readily shown that confidence intervals based on the pivot in (4.15) are wider than those for the corresponding result (4.14). That is, using (4.13) is better on average than (4.14), which is better than (4.15). 
If is the standard normal distribution function, we have for any real a, , with > , (( - a) y) - (( - a) y) 1 2 ( - ) y - - 1 2 ( - ) y (4.16) and hence, integrating with respect to dP( y), P( < -1 W + a < ) P - 1 2 ( - ) < -1 W < 1 2 ( - ) , (4.17) so that P( < -1 W + V < ) = P( < -1 W + v < ) dP(V v) P - 1 2 ( - ) < -1 W < 1 2 ( - ) . 64 CONFIDENCE ZONES OF MINIMUM SIZE Remark 3. A particular case of the result of the theorem has been obtained by Glynn and Iglehart (1990) who studied the problem of finding minimum size asymptotic confidence intervals for steady state parameters of the simulation output process from a single simulation run. They contrasted the approach of consistently estimating the variance constant in the relevant central limit theorem with the standardized time series approach which avoids estimation of the variance in a manner reminiscent of the t-statistic and suggested that the former approach is preferable on average. 4.5.2 Proof of Theorem 4.1 Suppose that t and t are exact, approximate or assumed distribution functions of the pivots ct(^t - ) and dt(^t - ), respectively. We are given the complete convergence results t c - , t c - as t , where (x) = P(W x) and (x) = P(-1 W x) = 0 (xy) dG(y) with G(x) = P( x). Now let at, bt be any numbers for which t(bt) - t(at) 1 - . This gives a confidence interval for of length l (1) t () = (bt - at)/ct. By passing to a subsequence, we may suppose that at a and bt b, where - a b . Then, by complete convergence, t(at) - t(bt) (b) - (a), and, in view of (4.16), b - a 2z, where 2(z) = 2 - . Therefore, noting that ctl (1) t () is not random, we have lim inf t ct L (1) t () 2z. (4.18) Next, let zt be the smallest z for which t(z) - t(-z) 1 - . As above, we may suppose that zt z and, because is continuous, t(zt) - t(-zt) = 1 - + o(1) 4.5. ON BEST ASYMPTOTIC CONFIDENCE INTERVALS 65 as t . By complete convergence and using the result (-x) = 1 - (x), x > 0, it follows that 2(z) = 2- and hence that z = z. Then, L (1) t () 2zt/ct, so that lim inf t ct L (1) t () 2z, (4.19) and (4.18) and (4.19) imply lim t ct L (1) t () = 2z. (4.20) Similar reasoning to that which led to (4.20) also applies to show that lim t dt L (2) t () = 2, (4.21) where 2() = 2 - . We merely replace (4.16) and (4.17) to show that the symmetric confidence interval is best, while (-x) = 1 - (x), x > 0, is evident from the definition and the corresponding result for . From (4.18) and (4.21), it then follows that, as t , dt L (2) t ()/ct L (1) t () /z. Since ct/dt d - , Slutsky's Theorem gives L (2) t ()/L (1) t () d - /z. and then, using Billingsley ((1968), Theorem 5.3, page 32), lim inf t E L (2) t ()/L (1) t () (E) /z. Thus, in order to complete the proof, it remains to show that (E) z. (4.22) Now note that = (; ) solves the equation ((; )) = P(-1 W (; )) = 1 - /2. Then, taking b > 0, we have 1 - 2 = P(W (; )) = P W 1 b (; ) b = P(W (b ; ) b ) so that continuity and strict monotonicity of imply that (b ; ) = 1 b (; ) and hence, if () = (E) (; ), 66 CONFIDENCE ZONES OF MINIMUM SIZE we have (b ) = (). (4.23) Now we see from (4.22) that it is required to show () -1 (1 - /2) and using (4.23) we may, without loss of generality, scale so that (; ) = 1. (4.24) But (4.24) implies (1) = 1 - /2, or equivalently, 0 (y) G(y) = E() = 1 - /2. Thus we have to show that () = E -1 (1 - /2) subject to E() = 1 - /2, and since is monotone, this holds if (E) E(). 
(4.25) However, since 1 - (x) is convex for x > 0, as is easily checked by differentiation, we have from Jensen's inequality that 1 - (E) E(1 - ()) and (4.25) follows. This completes the proof. Final Remarks It is worth noting that (4.22), which relates the average confidence intervals when (4.13) and (4.14) hold exactly rather than as limits, is a paraphrase of the fact that, on average, extraneous randomization does not improve confidence intervals for a normal mean. That is, if X is distributed as N(, 1), the usual symmetric confidence interval for based on X - is not improved, on average, if one instead uses the pivot (x - )/, where is independent of . In particular, when Xt, s2 t are the mean and variance of a random sample of size n from the N(, 2 ) distribution where is known, using the t-intervals for based on ( Xn - )/sn produces confidence intervals which are longer on average than z-intervals. 4.6. EXERCISES 67 4.6 Exercises 1. Suppose that X1, . . . , XT are i.i.d. random variables with mean 1/ and variance 2 . Show that a quasi-score estimating function for the estimation of is QT () = T(-1 - X), where X = T-1 T t=1 Xt and that a suitable choice for a bias corrected version is Q (BC) T () = QT () - 2 . In the case where the Xt have a Poisson distribution develop alternative forms of confidence intervals using (a) bias correction and (b) the quasi-score function directly. (Adapted from Firth (1993).) 2. Let X1, . . . , XT be a random sample from a distribution with mean and variance C2 , the coefficient of variation C being known. Taking C = 1 for convenience, consider the estimating function GT () = -2 T t=1 (Xt - ) + -3 T t=1 (Xt - )2 - 2 , which is the score function when the Xt are normally distributed. Show that a suitable choice for a bias corrected version of GT is G (BC) T () = GT () + 2 3 -1 . (Adapted from Firth (1993).) Chapter 5 Asymptotic Quasi-Likelihood 5.1 Introduction Discussion in earlier chapters on optimality of estimating functions and quasilikelihood has been concerned with exact results, where a specific criterion holds for either fixed T or for each T as T . Here we address the situation where the criteria for optimality are not satisfied exactly but hold in a certain asymptotic sense to be made precise below. These considerations give rise to an equivalence class of asymptotic quasi-likelihood estimator, which enjoy the same kind of properties as ordinary quasi-likelihood estimators, such as having asymptotic confidence zones of minimum size, within a specified family, for the "parameter" in question. One particular difficulty with the exact theory is that a quasi-likelihood estimator may contain an unknown parameter or parameters. The asymptotic quasi-likelihood theory herein allows one to focus on issues such as whether there is loss of information when an estimator is replaced by a consistent estimator thereof or, under some circumstances, is asymptotically irrelevant. As a simple illustration of the ideas take the case where {Xt} is a subcritical Galton-Watson branching process with immigration and = (m, ) is to be estimated on the basis of data {Xt, t = 0, 1, . . . , T}, where m and are, respectively, the means of the offspring and immigration distributions. This model has been widely used in practice, for example, for particle counts in colloidal solutions, and an account of various applications is given in Heyde and Seneta (1972) and Winnicki (1988). 
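A minimal Python sketch of the branching-with-immigration model may help fix ideas; it assumes Poisson offspring and Poisson immigration (the parametric special case referred to below), with purely illustrative parameter values. The closing regression check recovers (m, lambda) through the linear conditional-mean structure made explicit in the semimartingale representation that follows.

import numpy as np

rng = np.random.default_rng(2)

def simulate_gwi(m, lam, T, x0=5):
    """Subcritical Galton-Watson process with immigration: Poisson(m) offspring per
    individual and an independent Poisson(lam) immigration input each generation."""
    X = np.empty(T + 1, dtype=int)
    X[0] = x0
    for t in range(1, T + 1):
        offspring = rng.poisson(m, size=X[t - 1]).sum() if X[t - 1] > 0 else 0
        X[t] = offspring + rng.poisson(lam)
    return X

X = simulate_gwi(m=0.68, lam=0.51, T=1000)
# least-squares regression of X_t on (X_{t-1}, 1) recovers roughly (m, lam)
A = np.column_stack([X[:-1], np.ones(len(X) - 1)])
print(np.linalg.lstsq(A, X[1:], rcond=None)[0])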
Noting that for the model in question the (t + 1)th generation is obtained from the independent reproduction of each of the individuals in the tth generation, each with the basic offspring distribution, plus an independent immigration input with the immigration distribution, we obtain E (Xt | Ft-1) = m Xt-1 + so that there is a semimartingale representation T t=1 Xt = m T t=1 Xt-1 + T + T t=1 mt, where mt = Xt - E (Xt | Ft-1) are martingale differences. 69 70 CHAPTER 5. ASYMPTOTIC QUASI-LIKELIHOOD Now suppose that X0 has a distribution for which E X2 0 < and that the variances of the offspring and immigration distributions are 2 (< ) and 2 (< ), respectively. Then using Theorem 2.1, the quasi-score estimating function based on the martingale { t s=1 ms} is QT = T t=1 Xt-1 (2 Xt-1 + 2 )-1 (Xt - m Xt-1 - ) T t=1 (2 Xt-1 + 2 )-1 (Xt - m Xt-1 - ) , (5.1) which in general involves the nuisance parameters 2 , 2 . If no immigration is present, so that = 2 = 0, the nuisance parameter 2 disappears in the estimating equation Qt = 0. Furthermore, if we have a parametric model based on m, alone, such as in the case where the offspring and immigration distributions are both Poisson, there is no nuisance parameter problem. Indeed, in the Poisson case it is straightforward to check that (5.1) is a constant multiple of the true score estimating function, which leads to the maximum likelihood estimators. In general, nuisance parameters must be estimated or avoided. In this case estimation is possible using strongly consistent estimators of 2 and 2 suggested by Yanev and Tchoukova-Dantcheva (1980). We can take these estimators as ^2 ( mT , T ) = T T t=1Xt-1 ^U2 t - T t=1Xt-1 T t=1 ^U2 t T T t=1 X2 t-1 - T t=1 Xt-1 2 , ^2 T ( mT , T ) = T t=1 ^U2 t T t=1 X2 t-1 - T t=1Xt-1 T t=1Xt-1 ^U2 t T T t=1 X2 t-1 - T t=1 Xt-1 2 where ^U2 t = Xt- mT Xt-1-T , mT and T being strongly consistent estimators of m and , respectively. Wei and Winnicki (1989) studied an estimating function closely related to (5.1) in which the term 2 Xt-1 + 2 is replaced by Xt-1 + 1. Details of the corresponding limit theory are given in Theorem 3.C of Winnicki (1988). Furthermore, for ^mT , ^T based on (5.1) with estimated 2 and 2 as suggested above (using ^mT , ^T ), a similar analysis to that undertaken by Wei and Winnicki shows that T 1 2 ^mT - m ^T - d - N(0, W -1 ) as T where, if X has the stationary limit distribution of the {Xt} process, W = E X2 (2 X + 2 )-1 E X (2 X + 2 )-1 E X (2 X + 2 )-1 E (2 X + 2 )-1 , the same result as one obtains if 2 and 2 are known. The quasi-likelihood framework ensures that this estimation procedure is optimal from the point of view of asymptotic efficiency. 5.2. THE FORMULATION 71 The substitution of estimated values for the nuisance parameters amounts to replacing the quasi-likelihood estimator by an asymptotic quasi-likelihood estimator that has the same asymptotic confidence zones for the unknown parameter. Exact solutions of the optimal estimation problem leading to ordinary quasilikelihood estimators require certain orthogonality properties, which are often only approximately satisfied in practice. For example, using Theorem 2.1 and omitting the explicit for convenience, if QT H G is OF -optimal within H, then for the standardized estimating functions G (s) T , Q (s) T H we need to have E G (s) T - Q (s) T G (s) T = E G (s) T G (s) T - Q (s) T = 0. 
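The whole procedure, solving the two estimating equations in (5.1) with plug-in values for the nuisance variances, can be sketched in a few lines of Python. Given the weights, the equations are weighted least squares equations in (m, lambda). The variance estimators used below simply regress squared residuals on X_{t-1}, exploiting E(U_t^2 | F_{t-1}) = sigma^2 X_{t-1} + beta^2; this is a simple stand-in for the Yanev and Tchoukova-Dantcheva estimators quoted above, and the Poisson-Poisson simulation is purely illustrative.

import numpy as np

rng = np.random.default_rng(3)

# simulate a subcritical branching process with immigration (Poisson/Poisson case)
m_true, lam_true, T = 0.68, 0.51, 2000
X = np.empty(T + 1, dtype=int); X[0] = 5
for t in range(1, T + 1):
    X[t] = (rng.poisson(m_true, X[t - 1]).sum() if X[t - 1] else 0) + rng.poisson(lam_true)

x_prev, x_curr = X[:-1].astype(float), X[1:].astype(float)

def weighted_fit(w):
    """Solve the two components of (5.1) for (m, lambda) with weights
    w_t standing in for (sigma^2 X_{t-1} + beta^2)^{-1}."""
    A = np.column_stack([x_prev, np.ones_like(x_prev)])
    WA = A * w[:, None]
    return np.linalg.solve(A.T @ WA, WA.T @ x_curr)

# step 1: conditional least squares (unit weights) as starting values
m_hat, lam_hat = weighted_fit(np.ones_like(x_prev))

# step 2: estimate the nuisance variances from squared residuals
# (truncate at a small positive value in practice should either come out negative)
U2 = (x_curr - m_hat * x_prev - lam_hat) ** 2
B = np.column_stack([x_prev, np.ones_like(x_prev)])
sigma2_hat, beta2_hat = np.linalg.lstsq(B, U2, rcond=None)[0]

# step 3: one quasi-likelihood step with the estimated weights
m_hat, lam_hat = weighted_fit(1.0 / (sigma2_hat * x_prev + beta2_hat))
print(m_hat, lam_hat, sigma2_hat, beta2_hat)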
(5.2) Here we shall typically be dealing with circumstances under which we have sequences of estimating functions and the quantities in (5.2) tend to zero as T . 5.2 The Formulation We confine our attention to the space G of sequences of zero mean and finite variance estimating functions {GT () = GT ({Xt, 1 t T}; )}, which are vectors of dimension p and are a.s. differentiable with respect to the component of and such that E ˙GT () and EGT ()GT () are nonsingular for each T, the prime denoting transpose. We shall write G (n) T = (E ˙GT )-1 GT for the normalized estimating function and drop the for convenience. It is convenient in this context to use a different standardization of estimating functions from that of Chapters 1 and 2 that we have referred to as G (s) T . The asymptotic optimality of a sequence of estimating functions will be defined as the maximization of E(GT ) in the partial order of nonnegative definite matrices in a certain asymptotic sense. Before the definition is spelt out, however, we need to discuss certain concepts concerning matrices. For a matrix A = (ai,j) we shall denote by A the Frobenius (Euclidean) norm given by A = i j a2 i,j 1 2 . A sequence of symmetric matrices {An} is said to be asymptotically nonnegative definite if there exists a sequence of matrices {Dn} such that An -Dn is nonnegative definite and Dn 0 as n . In the sequel we need square roots of positive definite matrices. Let A 1 2 (A T 2 ) be a left (resp. right) square root of the positive definite matrix A, i.e., A 1 2 A T 2 = A. In addition, let A- 1 2 = (A 1 2 )-1 , A- T 2 = (A T 2 )-1 . The left Cholesky square root matrix is defined as the unique lower triangular matrix satisfying the square root condition, and it can be calculated easily without solving any eigenvalue problems. Though we do not require any specific form 72 CHAPTER 5. ASYMPTOTIC QUASI-LIKELIHOOD of the square root matrix for our results to hold, Cholesky's square root matrix is preferred for ease of application. Definition 5.1 Suppose that {QT } H G, and EQ (n) T Q (n) T - 1 2 EG (n) T G (n) T EQ (n) T Q (n) T - T 2 - I is asymptotically nonnegative definite for all {GT } H, where I is the p × p identity matrix. Then we say that {QT } is an asymptotic quasi-score (AQS) sequence of estimating functions within H. The solution T of QT () = 0 will be called an asymptotic quasi-likelihood estimator from H. Remark 5.1 The choice of the subset H of estimating functions is open and it can be tailored to suit the particular problem. For example, features such as being linear or bounded functions of the data could be incorporated. Remark 5.2 If GT is OF -optimal within H for each T (see Heyde (1988a)), then {GT } is an asymptotic quasi-score sequence of estimating functions within H. Remark 5.3 The asymptotic quasi-score formulation provides a natural framework for extension of Rao's concept of asymptotic first order efficiency (e.g., Rao (1973, pp. 348-349)). In the case of a scalar parameter and estimator ZT , Rao's definition takes the form EU2 T 1 2 ZT - - () UT EU2 T P - 0 (5.3) as T for some () not depending on T or the observations. Here UT is the score function and the idea is that ZT - has basically the same asymptotic properties as the score function. For example, as indicated by Rao, the Fisher information in ZT is asymptotic to the Fisher information in UT under minor additional conditions. In this chapter we generalize this framework in a variety of ways. 
First, it is possible to confine attention to estimation within a subset of estimation functions, H say, in which case the benchmark role of an estimating function containing maximum information is a quasi-score estimating function, QT , say. In addition, we can replace the linear (or linearized) estimating function by a general normalized estimating function G (n) T = (E ˙GT )-1 GT . Now, since under usual regularity conditions for quasi-score estimating functions E ˙QT = -EQ2 T , we have Q (n) T = -(EQ2 T )-1 QT . 5.2. THE FORMULATION 73 Thus, we can think of EQ2 T 1 2 G (n) T - ()Q (n) T P - 0 (5.4) as a natural generalization of (5.3). If (5.4) holds, we could say that GT is asymptotically first order efficient for within H. Now, if {GT } is an asymptotic quasi-score sequence of estimating functions within H, we have E G (n) T 2 E Q (n) T 2 = EQ2 T -1 (5.5) as T . Also, under the standard regularity conditions EG (n) T Q (n) T = -(EQ2 T )-1 . Thus, in this case we have EQ2 T E G (n) T 2 - 2EG (n) T Q (n) T + E Q (n) T 2 2 0 and (5.4) holds with () = 1 because L2 convergence holds. It is thus clear that the asymptotic quasi-score condition is a sensible (sufficient) condition for Rao's asymptotic first order efficiency. It has the advantages of natural extensions to deal with the quasi-likelihood setting, the case where Tn is not a linear function of , and also the vector case. There are several different but equivalent versions of Definition 5.1 that illuminate the concept. These are given in Proposition 5.4. As pointed out in Chapter 2, for the ordinary quasi-score case, a definition such as that given above is of limited practical use. However, the following propositions provide a rather easy route to checking whether a sequence of estimating functions has the asymptotic quasi-score property. Proposition 5.1 The sequence of estimating functions {QT } H is AQS if there exists a p × p matrix kT such that lim T kT EG (n) T Q (n) T kT = K = lim T kT EQ (n) T Q (n) T kT (5.6) for all {GT } H, where K is some nondegenerate p × p matrix. In particular, if EQ (n) T Q (n) T - 1 2 EG (n) T Q (n) T EQ (n) T Q (n) T - T 2 I (5.7) as T for all {GT } H, or if there exists constant T such that lim T -1 T EG (n) T Q (n) T = K for all {GT } H and some nondegenerate matrix K, then {QT } is AQS. 74 CHAPTER 5. ASYMPTOTIC QUASI-LIKELIHOOD Proof. Suppose that (5.6) holds for some {kT }. It is easy to see that kT is nondegenerate when T is large enough. Given any {GT } H, let AT kT EG (n) T G (n) T - EG (n) T Q (n) T EQ (n) T Q (n) T -1 EG (n) T Q (n) T kT ; then AT is nonnegative definite using a standard argument based on properties of covariance matrices (e.g. Heyde and Gay (1989, p. 227)). We may write AT = kT EG (n) T G (n) T - EQ (n) T Q (n) T kT + kT EQ (n) T Q (n) T kT - kT EG (n) T Q (n) T kT kT EQ (n) T Q (n) T kT -1 kT EG (n) T Q (n) T kT = kT EG (n) T G (n) T - EQ (n) T Q (n) T kT + DT , say. It follows from (5.6) that DT 0 as T . Thus kT EG (n) T G (n) T - EQ (n) T Q (n) T kT is asymptotically nonnegative definite. This is equivalent, as shown in Proposition 5.4 below, to saying that {QT } is AQS within H. The two particular cases are accomplished by taking kT = (EQ (n) T Q (n) T )- 1 2 and kT = - 1 2 T , respectively. This completes the proof. Remark 5.4 (5.6) and (5.7) are actually equivalent. 
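The defining condition and the Cholesky square root suggested for it are easy to check numerically. The following Python sketch does so for synthetic covariance matrices standing in for E Q_T^(n) Q_T^(n)' and E G_T^(n) G_T^(n)' (both shrinking like 1/T, as covariances of normalized estimating functions typically do); the matrices are purely hypothetical and serve only to illustrate asymptotic nonnegative definiteness.

import numpy as np

def aqs_gap(VQ, VG):
    """Smallest eigenvalue of VQ^{-1/2} VG VQ^{-T/2} - I, using the left Cholesky
    square root VQ = L L'.  For an AQS sequence this quantity should not remain
    below any fixed negative level as T grows."""
    L = np.linalg.cholesky(VQ)
    Linv = np.linalg.inv(L)
    C = Linv @ VG @ Linv.T - np.eye(VQ.shape[0])
    return np.linalg.eigvalsh(C).min()

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3)); base = A @ A.T + np.eye(3)
P = rng.standard_normal((3, 3)); P = P @ P.T            # nonnegative definite excess
D = rng.standard_normal((3, 3)); D = (D + D.T) / 2      # bounded symmetric disturbance
for T in (10, 100, 1000, 10000):
    VQ = base / T
    VG = VQ + P / T**1.5 + D / T**2
    print(T, aqs_gap(VQ, VG))      # gap tends to zero, eventually from above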
On the grounds that one seeks to maximize E(GT ), i.e., to minimize the estimating function variance EG (n) T G (n) T , for GT H, it would be sensible to avoid those estimating functions for which asymptotic relative efficiency comparisons are arbitrarily poor. With this in mind we give the following definition. Definition 5.2 The sequence of estimating function {GT } H is unacceptable if there exists a sequence {QT } H such that EQ (n) T Q (n) T - 1 2 EG (n) T G (n) T EQ (n) T Q (n) T - T 2 - I is unbounded as T . If we denote by HU H the subset of all unacceptable estimating functions, the complementary set H - HU consists of all the acceptable estimating func- tions. 5.2. THE FORMULATION 75 Proposition 5.2 A sequence of estimating functions {GT } H is unacceptable if and only if there exists a sequence of estimating functions {QT } H and vectors {cT } such that lim T cT EG (n) T G (n) T cT cT EQ (n) T Q (n) T cT = . The proof is fairly straightforward from the definition of unacceptability and we omit the details. Proposition 5.3 Suppose {QT } H is AQS where H is a linear space, then (5.7) holds for all {GT } H-HU . On the set H-HU , (5.6) is both necessary and sufficient for the AQS property. Proof. For T an arbitrary p×p matrix, consider the sequence of estimating functions {(E ˙GT ) T Q (n) T + GT } that belongs to H since H is a linear space. Because {QT } is AQS within H, by the definition of AQS and the fact that EQ (n) T = I, we have, upon writing L = E Q (n) T Q (n) T , limT kT L 1 2 E ˙GT T + E ˙GT -1 E (E ˙GT )T Q (n) T + GT Q (n) T T E ˙GT + GT T E ˙GT + E ˙GT -1 L- T 2 kT - kT kT 0, (5.8) for arbitrary sequence of unit vectors {kT }. For ease of notation, set M = EG (n) T Q (n) T and N = EG (n) T G (n) T . The quantity on the left side of (5.8) for fixed T can be written as kT L- 1 2 (T + I)-1 (E ˙GT )-1 E (E ˙GT )T Q (n) T + GT Q (n) T T E ˙GT + GT - (E ˙GT )(T + I)L(T + I)E ˙GT (E ˙GT )-1 (T + I)-1 L- T 2 kT = kT L- 1 2 (T + I)-1 T M + M T - T L - L T + (N - L) (T + I)-1 L- T 2 kT = kT L- 1 2 (T + I)-1 L 1 2 L- 1 2 T L 1 2 L- 1 2 M L- T 2 - I 76 CHAPTER 5. ASYMPTOTIC QUASI-LIKELIHOOD + (L- 1 2 M L- T 2 - I)(L- 1 2 T L 1 2 ) + L- 1 2 N L- T 2 - I L T 2 (T + I)-1 L- T 2 kT = kT CT (C-1 T - I)(L- 1 2 M L- T 2 - I) +(L- 1 2 M L- T 2 - I) (C-1 T - I) + L- 1 2 N L- T 2 - I CT kT ( where CT = L- 1 2 (T + I)-1 L 1 2 ) = kT (I - CT ) L- 1 2 M L- T 2 - I CT kT (5.9) + kT CT L- 1 2 M L- T 2 - I (I - CT ) kT + kT CT L- 1 2 N L- T 2 - I CT kT . Since {GT } H-HU and L- 1 2 N L- T 2 -I is bounded, there exists a constant A large enough such that x L- 1 2 N L- T 2 - I x A x 2 (5.10) for any p dimensional vector x. Suppose (5.7) does not hold. Then there exist two subsequences of vectors, denoted by {xT } and {yT } for simplicity in notation, such that yT = (L- 1 2 M L- T 2 - I)xT , yT = 1 and xT A. Because kT is an arbitrary unit vector, we can choose kT = yT . Moreover, since T is an arbitrary matrix, so is CT . Hence we may choose CT so that CT yT = xT , where = - 1 2A+A2 . Using (5.10), the quantity in (5.9) is less than or equal to 2 - 22 +A 2 xT 2 2 + 22 A + 2 A2 = < 0, which contradicts (5.8). This completes the proof of (5.7) and the remaining part of the proof follows immediately from Proposition 5.1. The next proposition gives five equivalent versions of the definition of asymptotic quasi-score. Proposition 5.4 The following conditions are equivalent. 1) EQ (n) T Q (n) T - 1 2 EG (n) T G (n) T EQ (n) T Q (n) T - T 2 - I is asymptotic nonnegative definite for all {GT } H. 5.2. 
THE FORMULATION 77 2) I - EG (n) T G (n) T - 1 2 EQ (n) T Q (n) T EG (n) T G (n) T - T 2 is asymptotic nonnegative definite for all {GT } H. 3) For any nonzero p-dimensional vector {cT }, limT cT EG (n) T G (n) T cT cT EQ (n) T Q (n) T cT 1 for all {GT } H. 4) There exists a sequence of p×p matrices {kT } such that kT EQ (n) T Q (n) T kT K nondegenerate, and kT EG (n) T G (n) T kT - kT EQ (n) T Q (n) T kT is asymptotic nonnegative definite for all {GT } H. 5) There exists a sequence of p × p matrices {kT } such that kT EQ (n) T Q (n) T -1 kT K nondegenerate, and kT EQ (n) T Q (n) T -1 kT kT EG (n) T G (n) T -1 kT is asymptotic nonnegative definite. Proof. 1) 3) Let xT be an arbitrary unit vector. Then, limT xT EQ (n) T Q (n) T - 1 2 EG (n) T G (n) T EQ (n) T Q (n) T - T 2 xT 1. Let cT = a EQ (n) T Q (n) T - T 2 xT ; then limT cT EG (n) T G (n) T cT cT EQ (n) T Q (n) T cT 1. Here cT is an arbitrary vector because xT is an arbitrary unit vector and a is an arbitrary nonzero constant. 3) 1). Let xT = EQ(n) T Q(n) T T 2 cT EQ(n) T Q(n) T T 2 cT . Then, limT xT EQ (n) T Q (n) T - 1 2 EG (n) T G (n) T EQ (n) T Q (n) T - T 2 xT = limT cT EG (n) T G (n) T cT cT EQ (n) T Q (n) T cT 1. Here xT can be regarded as an arbitrary unit vector because cT is an arbitrary vector. 78 CHAPTER 5. ASYMPTOTIC QUASI-LIKELIHOOD 2) 3) can be shown in a similar fashion. 1) 4) and 2) 5) are obvious. 4) 3) For an arbitrary vector cT , limT cT EG (n) T G (n) T - EQ (n) T Q (n) T cT cT k-1 T k-1 T cT 0, so that limT cT EG (n) T G (n) T cT cT EQ (n) T Q (n) T cT - 1 cT EQ (n) T Q (n) T cT cT k-1 T k-1 T cT 0, and since limT cT EQ (n) T Q (n) T cT cT k-1 T k-1 T cT min(K) > 0, because K is nondegenerate and thus is positive definite, we have limT cT EG (n) T G (n) T cT cT EQ (n) T Q (n) T cT 1. 5) 3) can be proved similarly. The final proposition indicates that two sequences of asymptotic quasi-score estimating functions are asymptotically close under appropriate norms. Proposition 5.5 Suppose { ~GT } H is AQS, H is a linear space, and there exists a sequence of matrices {kT } such that lim T kT E ~G (n) T ~G (n) T kT = K where K is some nondegenerate matrix. Then the following statements are equivalent: (i) {QT } H is AQS, (ii) limT kT EQ (n) T Q (n) T kT = K, (iii) EQ (n) T Q (n) T - 1 2 E ~G (n) T ~G (n) T EQ (n) T Q (n) T - T 2 I as T , (iv) trace EQ (n) T Q (n) T - 1 2 E ~G (n) T ~G (n) T EQ (n) T Q (n) T - T 2 p as T , (v) det E ~ G (n) T ~ G (n) T det E Q(n) T Q(n) T 1 as T , 5.3. EXAMPLES 79 (vi) cT E ~ G (n) T ~ G (n) T cT cT EQ(n) T Q(n) T cT 1 as T for all p-dimensional vectors cT = 0, (vii) kT E ~G (n) T - Q (n) T ~G (n) T - Q (n) T kT 0 as T . The proof of this proposition is straightforward and we omit the details. It is worthwhile to point out that under a small perturbation within H, an asymptotic quasi-score is still asymptotic quasi-score. (This can be seen from (vii) in Proposition 5.5.) One can see that the relative difference of the information about the parameter contained in two sequences of asymptotic quasi-score estimating functions is arbitrarily small under suitable normalization. 5.3 Examples 5.3.1 Generalized Linear Model A simple application involves the generalized linear model where the parameter is a p-vector . The distribution of the response yn is assumed to belong to a natural exponential family, which may be written as f(yn| n) = c(yn) exp(nyn - b(n)) and n and Zn are related by a link function, Zn being a p-vector of covariates. We assume yn and n are one dimensional for simplicity. 
Suppose the link is such that n = u(Zn) for some differentiable function u. The score function is ~GT = T i=1 Zi ˙u(Zi)(yi - ˙b(u(Zi))), which is the optimal estimating function within HT = { T i=1 Ci(yi-˙b(u(Zi)))} for all p-vectors Ci, i = 1, 2, .... Now suppose the assumption of the exponential family distribution of yn and the link function u is slightly relaxed to only assuming that yn has mean ˙b(u(Zn)) and finite variance 2 n. This guarantees that { ~GT } is within H. Now ~GT is no longer necessarily a score or quasi-score estimating function. However, under the assumption E ~G (n) T ~G (n) T 0 and ¨b(u(ZT )) 2 T 1 as T , one can still show that { ~GT } remains an AQS sequence within T =1HT . 5.3.2 Heteroscedastic Autoregressive Model Our second example concerns a d-dimensional heteroscedastic autoregressive model. Let {Xk, k = 1, 2, ..., T} be a sample from a d-dimensional process satisfying Xk = Xk-1 + k, 80 CHAPTER 5. ASYMPTOTIC QUASI-LIKELIHOOD where is a d×d matrix to be estimated and { k; k = 1, 2, ...} is a d-dimensional martingale difference sequence such that E k = 0 and E k k = k. One can conveniently write this model as Xk = (Xk-1 Id) + k, where = vec () = (11, ..., d1, 21, ..., d2, ..., d1, ...dd) , the vector obtained from by stacking its columns one on top of the other and for matrices A = (Aij) and B the Kronecker product, A B denotes the matrix whose (i, j), the block element, is the matrix AijB. The quasi-score estimating function based on the martingale difference sequence { k} is QT () = T k=1 (Xk-1 Id) -1 k (Xk - (Xk-1 Id)), which contains the generally unknown k. Let GT () = T k=1 (Xk-1 Id) (Xk - (Xk-1 Id)). We shall show that {GT } is a sequence of asymptotic quasi-score estimating functions under the conditions that k , a nondegenerate matrix. That is, GT is a quasi-score for the model when k is constant in k but unknown, and {GT } remains as AQS when k is not constant but converges to a constant. Now, writing V k for E(Xk-1Xk-1), we have EQ (n) T Q (n) T = T k=1 V k -1 k -1 , EG (n) T G (n) T = T k=1 V k Id -1 T k=1 V k k T k=1 V k Id -1 , it being easily checked that E ˙QT = - T k=1 V k -1 k = -EQT QT , E ˙GT = - T k=1 V k Id, EGT GT = T k=1 V k k, since for a d × d matrix M, E Xk-1 Id M Xk-1 Id = V k M. 5.3. EXAMPLES 81 Since -1 k -1 , a positive definite matrix, for any d-dimensional vector sequence {bk}, bk-1 k bk bk-1 bk 1. (5.11) Since V k is nonnegative definite, we can write V k = T T for some matrix T = (tij)d×d. For any d2 -dimensional vector sequence {ck}, let ck = (ck1, ck2, ..., ckd) where cki is a d-dimensional vector for 1 i d. Now we can write ck(V k -1 k ) ck = d l=1 d i=1 til cki -1 k d i=1 til cki . Replacing -1 k by -1 , the above formula still holds. Therefore it follows from (5.11) that ck(V k -1 k ) ck ck(V k -1 ) ck 1. (5.12) Also, min(E(QT )) gives cT ( T k=1 V k -1 k ) cT (5.13) as T . Combining (5.12) and (5.13), it follows that cT ( T k=1 V k -1 k ) cT cT ( T k=1 V k -1 ) cT 1 as T . But since {cT } can be an arbitrary unit vector sequence, so we have det( T k=1 V k -1 k ) det( T k=1 V k -1 ) 1 as T , and lim T det(EQ (n) T Q (n) T ) det(EG (n) T G (n) T ) = lim T det(( T k=1 V k) -1 )-1 (det( T k=1 V k))-ddet(( T k=1 V k) )(det( T k=1 V k))-d = lim T (det( T k=1 V k))-d (det())d (det( T k=1 V k))-d(det( T k=1 V k))d(det())d(det( T k=1 V k))-d = 1. Because QT is a quasi-score, by Proposition 5.4 Condition (v), we know {GT } is an AQS sequence of estimating functions. 82 CHAPTER 5. 
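For the scalar case (d = 1) the asymptotic quasi-score property of {G_T} can be checked numerically. The Python sketch below computes the information in the unweighted estimating function G_T and in the quasi-score Q_T (which uses the weights 1/Sigma_k) for a heteroscedastic AR(1) whose innovation variance converges; the rate Sigma_k = Sigma(1 + 5/k) and the other numbers are illustrative choices, not taken from the text.

import numpy as np

def info_ratio(T, theta=0.5, v0=1.0, s2_inf=2.0):
    """Ratio of the information in G_T (unit weights) to that in Q_T (weights 1/s2[k])
    for X_k = theta X_{k-1} + eps_k with var(eps_k) = s2_inf*(1 + 5/k) -> s2_inf."""
    k = np.arange(1, T + 1)
    s2 = s2_inf * (1.0 + 5.0 / k)
    W = np.empty(T)                # W[k] = E X_{k-1}^2, built recursively from E X_0^2 = v0
    W[0] = v0
    for i in range(1, T):
        W[i] = theta**2 * W[i - 1] + s2[i - 1]
    info_Q = np.sum(W / s2)                    # E(Q_T Q_T') = -E Qdot_T
    info_G = np.sum(W)**2 / np.sum(W * s2)     # (E Gdot_T)^2 / E(G_T G_T')
    return info_G / info_Q

for T in (100, 1000, 10000, 100000):
    print(T, info_ratio(T))        # tends to 1: {G_T} is asymptotically quasi-score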
ASYMPTOTIC QUASI-LIKELIHOOD 5.3.3 Whittle Estimation Procedure As a final illustration of the AQL methodology we shall show that the widely used estimation procedure based on the smoothed periodogram, which dates back to remarkable early work of Whittle (1951), provides an AQL-estimator under a broad range of conditions. These encompass many random processes or random fields with either short or long range dependence. Here we shall deal with the one-dimensional random process case. For the multivariate random process and random fields cases see Heyde and Gay (1989), (1993). Suppose that the stationary random process {Xt}, with E Xt = 0, is observed at the times t = 1, 2, . . . , T and let IT () = 1 2 T T i=1 Xj e-ij 2 be the corresponding periodogram. The spectral density of the random process is f(; , 2 ), where the one-step prediction variance is 2 = 2 exp (2)-1 log f(; , 2 ) d , (5.14) and we shall write g(; ) = 2 -2 f(; , 2 ). We shall assume that 2 is specified while , of dimension p, is to be estimated. It should be noted that (5.14) gives log g(; ) d = 0, which is equivalent to the fact that -1 Xt has prediction variances independent of (e.g., Rosenblatt (1985, Chapter III, Section 3)). We shall initially assume further that f is a continuous function of , including at = 0 (short range dependence), and is continuously differentiable in as a function of (, ). We consider the class of estimating functions GT = A() [IT () - EIT ()] d constructed from smoothed periodograms where the smoothing function A() (a vector of dimension p) is square integrable and symmetric about zero, that is, A () A() d < , A() = A(-). We shall show that under a considerable diversity of conditions, an asymptotic quasi-score sequence of estimating functions for is given by G T = (g(; ))-2 g(; ) [IT () - EIT ()] d 5.3. EXAMPLES 83 = (g(; ))-2 g(; ) [IT () - f(; , 2 )] d + o(1) as T . The corresponding AQL estimator T obtained from the estimating equation G T () = 0 is then asymptotically equivalent to the Whittle estimator obtained by choosing to minimize log f(; , 2 ) + IT ()(f(; , 2 ))-1 d, (5.15) i.e., to solve (g(; ))-2 g(; ) [IT () - f(; , 2 )] d = 0. The idea to minimize (5.15) comes from the observation that if the random process Xt has a Gaussian distribution, it is plausible that T-1 times the log likelihood of the data could be approximated by - 1 2 log 2 2 - (2 2 )-1 IT ()(g(; ))-1 d = - log 2 - 1 2(2) log f(; , 2 ) + IT ()(f(; , 2 ))-1 d. It is, in fact, now a commonly used strategy in large sample estimation problems to make the Gaussian assumption, maximize the corresponding likelihood, and then show that the estimator still makes good sense without the Gaussian assumption (e.g., Hannan (1970, Chapter 6, Section 6; 1973)). Our first specific illustration concerns the linear process Xt = u1=- guet-u, with Eet = 0, Eeset = 2 if s = t, 0 otherwise, and u1=- g2 u < . There are various somewhat different regularity conditions under which we could proceed and since the discussion is intended to be illustrative rather 84 CHAPTER 5. ASYMPTOTIC QUASI-LIKELIHOOD than to present minimal conditions, we shall, for convenience, use results from Section 6, Chapter IV of Rosenblatt (1985). The same conclusions, can be deduced under significantly different conditions using results of various other authors (e.g., via Theorem 4 of Parzen (1957) or Theorem 6.2.4, p. 427 of Priestley (1981)). 
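Before imposing formal conditions, the procedure itself is easily sketched in Python for the simplest short range dependent case, a Gaussian AR(1) with the one-step prediction variance treated as known; the model and the grid minimization are illustrative choices rather than part of the text's development.

import numpy as np

rng = np.random.default_rng(5)

# X_t = theta X_{t-1} + e_t, e_t ~ N(0, sigma2), sigma2 known.
theta0, sigma2, T = 0.6, 1.0, 4096
X = np.zeros(T)
for t in range(1, T):
    X[t] = theta0 * X[t - 1] + rng.standard_normal()

# periodogram I_T(w_j) at the Fourier frequencies 0 < w_j < pi
j = np.arange(1, T // 2)
w = 2 * np.pi * j / T
I_T = np.abs(np.fft.fft(X)[j]) ** 2 / (2 * np.pi * T)

def f_spec(w, theta):
    """AR(1) spectral density sigma2 / (2*pi*|1 - theta e^{-iw}|^2)."""
    return sigma2 / (2 * np.pi * (1.0 - 2.0 * theta * np.cos(w) + theta ** 2))

def whittle_objective(theta):
    f = f_spec(w, theta)
    return np.mean(np.log(f) + I_T / f)        # discretized version of (5.15)

grid = np.linspace(-0.95, 0.95, 1901)
theta_hat = grid[np.argmin([whittle_objective(th) for th in grid])]
print(theta_hat)                               # close to theta0 = 0.6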
We suppose that {Xt} is a strongly mixing stationary random process with EX8 t < and cumulants up to eighth order absolutely summable. Then, from the discussion leading up to Theorem 7, p. 118 of Rosenblatt (1985), we find that if GT = A()[IT () - EIT ()] d, G T = - A ()[IT () - EIT ()] d, then TEGT G T (2) 2 - A()(A ()) f2 () d + f4(, -, )A()(A ()) d d (5.16) as T , with f() now being written instead of f(; , 2 ). Here f4(, , ) is the fourth order cumulant spectral density f4(, , ) = (2)-3 a,b,d ca,b,d e-i(a+b +d ) with ca,b,d = cum(Xt, Xt+a, Xt+b, Xt+d) = ( - 3)4 u gugu+agu+bgu+d, ( - 3)4 being the fourth cumulant of Xt (which is zero in the case where the process is Gaussian). But, analogously to the discussion of Chapter II, Section 4 of Rosenblatt (1985, pp. 46, 47), we find that f4(, -, ) = (2)-1 ( - 3)4 f()f(), and hence the second term on the right-hand side of (5.16) is (2)-1 ( - 3) 4 A()f() d - A ()f() d , which is zero if the random field is Gaussian or if the class of smoothing functions {A()} is chosen so that A()f() d = 0. 5.3. EXAMPLES 85 We shall henceforth suppose that either (or both) of these conditions hold and then T EGT G T 2(2) - A()(A ()) f2 () d (5.17) as T . Also, E ˙GT = - - A() EIT () d - A()( ˙f()) d, (5.18) provided ˙f() is square integrable over [-, ], since EIT () ˙f() in the L2 norm by standard Fourier methods (e.g. Hannan (1970, p. 508)). Then, from (5.17) and (5.18) we see that (5.6) holds with kT = T if A () = ˙f()(f())-2 , which gives the required AQL property. Under the same conditions and a number of variants thereof it can be further deduced that if T is the estimator obtained from the estimating equation GT () = 0, then (see e.g. the discussion of the proof of Theorem 8, p. 119 of Rosenblatt (1985)) T1/2 (T - 0) d - Np 0, (A B-1 A)-1 , where A = - A()( ˙f()) d, B = - A()A ()f2 () d and that the inverse covariance matrix A B-1 A is maximized in the partial order of nonnegative definite matrices for the AQL case where A() takes the value A () = ˙f()(f())-2 . These considerations place the optimality results for random processes of Kulperger (1985, Theorem 2.1 and Corollary) and Kabaila (1980, Theorem 3.1) into more general perspective. The matrix maximization result mentioned above appears in the latter paper. 86 CHAPTER 5. ASYMPTOTIC QUASI-LIKELIHOOD The next specific example on the Whittle procedure concerns a random process exhibiting long range dependence. We shall consider the case where the spectral density f(x; , 2 ) 2 |x|-() L(x) as x 0, where 0 < () < 1 and L(x) varies slowly at zero. A typical example is provided by the fractional Brownian process. This is a stationary Gaussian process with zero mean and covariance E(XnXn+k) = 1 2 c |k + 1|2H - 2|k|2H + |k - 1|2H c H(2H - 1)k2H-2 as k , where H is a parameter satisfying 1 2 < H < 1 and c > 0, while f(x; H) 1 2 cF(H)|x|1-2H as x 0 with f(H) = (1 - cos x)|x|-1-2H dx -1 . The Whittle estimation procedure (5.15) for in this context and subject to the assumption of a Gaussian distribution has been studied in some detail by Fox and Taqqu (1986), (1987) and Dahlhaus (1988). In particular, it follows from Theorem 1 of Fox and Taqqu (1987) that if GT = A(x)[IT (x) - EIT (x)] dx with A(x) = O(|x|)-- as x 0 for some < 1 and each > 0 and + < 1 2 , then TEGT GT 4 - f2 (x)A(x)A (x) dx (5.19) as T ; see also Proposition 1 of Fox and Taqqu (1986). Furthermore, provided that f j La , 1 j p, 1 a < -1 , then j EIT (x) f j in the La norm (e.g., Hannan (1970, p. 
508)) and from H¨older's inequality we find that E ˙GT = - - A(x) EIT (x) dx - A(x)( ˙f(x)) dx. (5.20) 5.3. EXAMPLES 87 Then, from (5.19) and (5.20), 4T-1 (E ˙GT ) (EGT GT )-1 (E ˙GT ) A(x)( ˙f(x)) dx (5.21) - f2 (x)A(x)A (x) dx -1 A(x) ˙f(x)) dx and the right-hand side of (5.21) is maximized in the partial order of nonnegative definite matrices when A(x) takes the value A (x) = ˙f(x)(f(x))-2 (Kabaila (1980, Theorem 3.1)). This gives the AQL property for the corresponding {G T } by direct application of Definition 5.1. 5.3.4 Addendum to the Example of Section 5.1 Earlier approaches to this estimation problem have used conditional least squares or ad hoc methods producing essentially similar results. This amounts to using the estimating function (5.1) with the terms 2 Xt-1 + 2 removed. For references and details of the approach see Hall and Heyde (1980, Chapter 6.3). The essential point is that quasi-likelihood or asymptotic quasi-likelihood will offer advantages in asymptotic efficiency and these may be substantial. The character of the limiting covariance matrices associated with different estimators precludes direct general comparison other than via inequalities. We therefore give a numerical example as a concrete illustration. This concerns the Smoluchowski model for particles in a fluid in which the offspring distribution is assumed to be Poisson and the immigration distribution to be Bernoulli; for a discussion see Heyde and Seneta (1972). The data we use is a set of 505 observations from F¨urth (1918, Tabelle 1). We shall compare the asymptotic covariance matrices which come from the methods (i) conditional least squares (e.g., Hall and Heyde (1980, Chapter 6.3); Winnicki (1988, Theorem 3.B)), (ii) Wei and Winnicki's weighted conditional least squares (e.g., Winnicki (1988, Theorem 3.C)), (iii) asymptotic quasi-likelihood. All quantities are estimated from the data, and since the covariance matrices are all functions m and , common point estimates of m and (those which come from (i)) are used as a basis for the comparison. Methods (ii), (iii) produce slightly different point estimates. We find that ^m = 0.68, ^ = 0.51 and that the estimated asymptotic covariance matrices for (505) 1 2 ( ^m - m, ^ - ) in the cases (i), (ii), (iii) are, respectively, 0.81 -1.04 -1.04 2.05 , 0.76 -0.74 -0.74 1.57 , 0.75 -0.73 -0.73 1.56 . The reduction in variance estimates obtained by using (iii) rather than (i) or (ii) in this particular case is not of practical significance. Approximate 95% 88 CHAPTER 5. ASYMPTOTIC QUASI-LIKELIHOOD confidence intervals for m are 0.68 0.08 in each case, and for , (i) gives 0.51 0.12 while (ii) and (iii) improve this to 0.51 0.11. Although quasi-likelihood does not perform better than the simpler methods here, it nevertheless remains as the benchmark for minimum width confidence intervals. 5.4 Bibliographic Notes The bulk of this chapter follows Chen and Heyde (1995), a sequel to Heyde and Gay (1989). The branching process discussion is from Heyde and Lin (1992). If {GT } is an asymptotic quasi-score sequence under Definition 5.2, then it is also under the definition given in Heyde and Gay (1989), which is relatively weaker. In particular, the two definitions are equivalent if the parameter is one dimensional or if max EQ (n) T Q (n) T /min EQ (n) T Q (n) T is bounded for all T where max() (min()) denotes the largest (smallest) eigenvalue. 
A corresponding theory to that of this chapter, based on conditional covariances rather than ordinary ones, should be able to be developed straightforwardly. This would be easier to use in some stochastic process problems. 5.5 Exercises 1. Suppose that {X1, X2, . . . , XT } is a sample from a normal distribution N(, 2 ), where 2 is to be estimated and is a nuisance parameter. Recall that T t=1(Xt - )2 has a 2 2 T distribution and T t=1(Xt - X)2 a 2 2 T -1 distribution, X being the sample mean. Show that T t=1 [(Xt - X)2 - 2 ] can be interpreted as an AQS estimating function for 2 . 2. Let {X1, X2, . . . , XT } be a sample from the process Xt = Xt-1 + ut, where > 1 is to be estimated and if Ft = (Xt, . . . , X1) are the pasthistory -fields, E(ut | Ft-1) = 0, E(u2 t | Ft-1) = 2 X2 t-1 + 2 Xt-1, 2 and 2 being nuisance parameters. Consider the family of estimating functions H = { T t=1 t(Zt - Xt-1)} where t's are Ft-1-measurable and show that, on the set {Xt }, T t=1 X-1 t-1(Xt - Xt-1) is an AQS estimating function. 5.5. EXERCISES 89 3. For a regression model of the form Y = () + e with Ee = 0 and V = Ee e (such as is studied in Section 2.4), suppose that the quasi-score estimating function from the space H = {A(Y - ())} for p×p matrices A not depending on for which E(A ˙) and E(A e e A ) are nonsingular and E(e e | ˙) = V is ˙ V -1 (Y -) (see, e.g., Theorem 2.3). Suppose that V is not known and is replaced by an approximate covariance matrix W . Show that ˙ W -1 (Y -) is an asymptotic quasiscore estimating function iff det(E ˙ W -1 V W -1 ˙)/det(E ˙ V -1 ˙) 1 as the sample size T . This gives a necessary and sufficient condition for a GEE (see Section 2.4.3) to be fully efficient asymptotically. Chapter 6 Combining Estimating Functions 6.1 Introduction It is often possible to collect a number of estimating functions, each with information to offer about an unknown parameter. These can be combined much more readily than the estimators therefrom and, indeed, simple addition is often appropriate. This chapter addresses the issue of how to perform the combination optimally. Pragmatic considerations led to the combination of likelihood scores. Likelihood based inference encounters a range of possible problems including computational difficulty, information on only part of the model, questionable distributional assumptions, etc. As a compromise, the method of composite likelihoods, essentially the summing of any available likelihood scores of relevance was suggested (e.g., Besag (1975), Lindsay (1988)). A good example concerns observations on a lattice of points in the plane where the conditional distributions of the form f(yi | y[i]) are specified, the y[i] denoting a prescribed set of variables, for example the nearest neighbors of i. Let y be the vector of the yi's in dictionary order and write W = (wij) for the matrix of 0's and 1's such that wii = 0, wij = wji = 1 if i and j are neighbors, while wi denotes the ith column of W . Then, if yi | y[i] has a N( wi y, 2 ) distribution for parameters , 2 , there is a consistent joint distribution of the form y d = N(0, 2 (I - W )-1 ), 2 = 2 (, ). Taking 2 as known and setting it equal to 1 for convenience, it is easily seen that the score function associated with y is y W y + d d log det (I - W ). There are fundamental computational difficulties in estimating from this. On the other hand, the composite score obtained from summing the conditional scores of the yi's is y W y - y W 2 y, and this is simple to use. 
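The simplicity of the composite score can be seen in a short Python sketch for the lattice scheme just described, taking the four nearest neighbours on an n x n torus and an illustrative value of beta; setting the composite score y'Wy - beta y'W^2 y to zero gives the estimator in closed form, whereas the full likelihood would require the log-determinant term.

import numpy as np

rng = np.random.default_rng(6)

n, beta0 = 20, 0.15
N = n * n
W = np.zeros((N, N))
for i in range(n):
    for j in range(n):
        k = i * n + j
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):    # torus neighbours
            W[k, ((i + di) % n) * n + (j + dj) % n] = 1.0

# consistent joint law y ~ N(0, (I - beta W)^{-1}), sampled via I - beta W = L L'
L = np.linalg.cholesky(np.eye(N) - beta0 * W)
y = np.linalg.solve(L.T, rng.standard_normal(N))

Wy = W @ y
beta_hat = (y @ Wy) / (Wy @ Wy)    # root of the composite score y'Wy - beta y'W^2 y
print(beta_hat)                    # roughly beta0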
This kind of approach is easy to formalize into a framework of composite quasi-likelihood, and this will be the topic of the next section. The setting that we shall consider is a general one in which maximum likelihood may be difficult or impossible to use; the context may be a semiparametric 91 92 CHAPTER 6. COMBINING ESTIMATING FUNCTIONS one, for example. However, we suppose that we are able to piece together a set of quasi-likelihood estimating functions, possibly based on conditional or marginal information. The problem is how to combine these most efficiently for estimation purposes. 6.2 Composite Quasi-Likelihoods Let {Xt, t T} be a sample of discrete or continuous data that is randomly generated with values in r-dimensional Euclidean space whose distibution involves a "parameter" taking values in an open subset of p-dimensional Euclidean space. The true value of the "parameter" is 0 and this is to be estimated. Suppose that the possible probability measures for {Xt} are {P} and that each (, F, P) is a complete probability space. We shall as usual confine attention to the class G of zero mean square integrable estimating functions GT = GT ({Xt, t T}, ) for which EGT () = 0 and E ˙GT () and EGT () GT () are nonsingular for each P. Here GT is a vector of dimension p. We consider a class of K G of estimating functions GT that are a.s. differentiable with respect to the components of and such that E ˙GT = EGT,i j and EGT GT are nonsingular, the prime denoting transpose. We shall suppose that the setting is such that k distinct estimating functions Gi,T (), 1 i k, are available. Our first approach to the composition problem is to seek a quasi-score estimating function from within the set K = k i=1 i,T () Gi,T () where the weighting matrices i,T are p × p constants. The discussion here follows Heyde (1989a). Suppressing the dependence on T and for convenience, we write H = k i=1 i Gi = G, H = k i=1 i Gi = G, where = 1 ..... . . . ..... k , = 1 ..... . . . ..... k , G = G1 ..... . . . ..... Gk , 6.3. COMBINING MARTINGALE ESTIMATING FUNCTIONS 93 so that E ˙H = k i=1 i E ˙Gi = E ˙G, EH H = k i=1 k j=1 i EGi Gj j = EG G . Then, from Theorem 2.1 we see that H is a quasi-score estimating function within K if = (EG G )-1 E ˙G. This provides a general solution, although it may be less than straightforward to calculate in practice. Note in particular that since each individual Gi K, the formulation ensures that E(H ) - E(Gi) is nnd, 1 i k, so that composition is generally advantageous. Various simplifications are sometimes possible. For example, if the Gi are standardized quasi-score (or true score) estimating functions, then we can suppose that each Gi satisfies -E ˙Gi = EGi Gi, 1 i k. Circumstances under which equal weighting is obtained are especially important. Clearly this holds in the particular case when EGi Gj = 0, i = j, that is, the estimating functions are orthogonal. For an application to estimation in hidden-Markov random fields see Section 8.4. If the estimating functions are not orthogonal, then we can adjust them using Gram-Schmidt orthogonalization to get a new set Ki, 1 i k of mutually orthogonal estimating functions. For example, in the case k = 2 we can replace G1 and G2 by H1 = G1 and H2 = G2 - (EG2 G1)(EG1 G1)-1 G1 and H1, H2 are orthogonal. There is also another approach to the problem of composition that is closely related but quite distinct, and this is treated in the next section. 
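Before turning to that martingale approach, the fixed-weight combination just described can be illustrated with a small Python sketch. Two elementary estimating functions for a scalar theta, one based on the mean and one on the variance, are combined with the optimal weights alpha* = (E G G')^{-1} E Gdot, and the resulting information E(H*) = (E Gdot)'(E G G')^{-1}(E Gdot) is compared with that of each component alone. The lognormal data-generating mechanism, with E X = theta and Var X = c theta^2, is a hypothetical choice under which neither component is itself a score, so that combination genuinely helps.

import numpy as np

rng = np.random.default_rng(7)

theta, s = 2.0, 0.5
c = np.exp(s**2) - 1.0
mu = np.log(theta) - s**2 / 2.0                 # so that E X = theta
X = rng.lognormal(mu, s, size=1_000_000)        # Monte Carlo for the expectations

# per-observation estimating functions g1 = X - theta, g2 = (X - theta)^2 - c*theta^2
g1 = X - theta
g2 = (X - theta)**2 - c * theta**2
G = np.column_stack([g1, g2])

Egdot = np.array([-1.0, -2.0 * c * theta])      # E dg1/dtheta, E dg2/dtheta
EGG = G.T @ G / len(X)                          # E g g' (per observation)

alpha_star = np.linalg.solve(EGG, Egdot)        # optimal combining weights
info_combined = Egdot @ np.linalg.solve(EGG, Egdot)
info_g1 = Egdot[0]**2 / EGG[0, 0]
info_g2 = Egdot[1]**2 / EGG[1, 1]
print(alpha_star, info_combined, info_g1, info_g2)   # combined exceeds either alone

The martingale-based alternative mentioned above, which allows the combining weights to be predictable processes, is developed next.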
It is applicable only to the case of martingale estimating functions but it is nevertheless of use in a wide variety of contexts. As has been noted earlier, many models admit a semimartingale description, which naturally leads to martingale estimating functions and, furthermore, true score functions are martingales under mild conditions. It is principally spatial processes that are excluded. 6.3 Combining Martingale Estimating Functions Section 6.2 deals with the general question of finding a quasi-score by combination of functions. However, as noted earlier in Section 2.5, it is often useful 94 CHAPTER 6. COMBINING ESTIMATING FUNCTIONS to adopt a martingale setting. For example, in many cases there is actually an underlying likelihood score U, say, which is typically unknown. Then U will be a square integrable martingale under minor regularity conditions. This suggests the use of a martingale family of estimating functions as the most appropriate basis for approximation of U. Indeed, if the family is large enough to contain U, then the quasi-score estimating function on that space will be U. Again we let (Xt, 0 t T) be a sample from a process taking values in r-dimensional Euclidean space whose distribution depends on a parameter belonging to an open subset of p-dimensional Euclidean space. The chosen setting is one of continuous time but the theory also applies to the discrete time case. This is dealt with by replacing a discrete time process (Xn, n 0) by a continuous version (Xc t, t 0) defined through Xc t = Xn, n t < n+1. The discussion here follows Heyde (1987). Suppose that the possible probability measures for (Xt) are (P) and that each (, F, P) is a complete probability space. The past-history -fields (Ft, t 0) are assumed to be a standard filtration. That is, Fs Ft F for s t, F0 being augmented by sets of measure zero of F and Ft = Ft+, where Ft+ = s>t Fs. Let M denote the class of square integrable estimating functions (GT , FT ) with GT = GT {(Xt, 0 t T), } that are martingales for each P and whose elements are almost surely differentiable with respect to the components of . Here GT is a vector of dimension d not necessarily equal p. Now for each martingale estimating function GT M there is an associated family of martingales T 0 as() dGs() where as() is a predictable matrix and the quasi-score estimating function from this family is T 0 (d Gs) (d G s)- dGs. As usual the prime denotes transpose and the minus generalized inverse. For n × 1 vector valued martingales MT and NT , the n × n process M, N T , called the mutual quadratic characteristic, is the predictable increasing process such that MT NT - M, N T is an n × n martingale. Also, we write M T for M, M T , the quadratic characteristic of MT . Finally, ˙MT is the n × p matrix obtained by differentiating the elements of MT with respect to those of and d Mt = E(d ˙Mt | Ft-). If we have two basic martingales HT and KT , which belong to M and are such that one is not absolutely continuous with respect to the other, then each gives rise to a quasi-score estimating function, namely, T 0 (d Ht) (d H t)- dHt and T 0 (d Kt) (d K t)- dKt, 6.3. COMBINING MARTINGALE ESTIMATING FUNCTIONS 95 respectively, and these may be regarded as competitors. If all the relevant quantities are known, a better estimating function may be obtained by combining them. We shall discuss the best procedure for combining the estimating functions and the gains that may be expected. 
In the particular case where the Xi are independent random variables with finite mean i() and variance 2 i (), natural martingales to consider for the estimation of are n i=1 {Xi - i()} and n i=1 (Xi - i())2 - 2 i () and the optimal linear combination of these has been considered by Crowder (1987) and Firth (1987). Here we treat the problem of combination generally. Given HT and KT we combine them into a new 2p × 1 vector martingale JT = (HT , KT ) and the quasi-score estimating function for this is T 0 (dJt) (d J t)dJt. (6.1) Now J t = H t H, K t H, K t K t (6.2) and if A and D are symmetric matrices and A is nonsingular, M- = A B B D - = A-1 + F EF -F E- -EF E- (6.3) where E = (D - B A-1 B), F = A-1 B, and M is nonsingular if A and E are nonsingular. Also, if E is nonsingular (A-1 + F E-1 F )-1 = A - B D-1 B . (6.4) The result (6.3) and its supplement follows from Exercise 2.4, p. 32 and a minor modification of Exercise 2.7, p. 33 of Rao (1973) while (6.4) comes from Exercise 2.9, p. 33 of the same reference. Suppose that d H t and d K t are nonsingular. Then, using (6.2) ­ (6.4) and after some algebra, we find that (6.1) can be expressed as T 0 (d Ht) (d H t)-1 (I - RtSt)-1 - (d Kt) (d K t)-1 (I - StRt)-1 St dHt + T 0 (d Kt) (d K t)-1 (I - StRt)-1 - (d Ht) (d H t)-1 (I - RtSt)-1 Rt dKt (6.5) 96 CHAPTER 6. COMBINING ESTIMATING FUNCTIONS where Rt = (d H, K t)(d K t)-1 , St = (d H, K t) (d H t)-1 = (d K, H t)(d H t)-1 . Note that if H, K t 0, the quasi-score estimating function (6.5) becomes T 0 (d Ht) (d H t)-1 dHt + T 0 (d Kt) (d K t)-1 dKt, the sum of the quasi-score estimating functions for H and K. The quasi-score estimating function (6.5) can be conceived as arising from the martingales H and K in the following way. Based on H, K we can consider the class of estimating functions T 0 at dHt + T 0 bt dKt where at, bt are predictable matrices. Then, the corresponding quasi-score estimating function is T 0 (at d Ht + bt d Kt) (d a H + b K t)-1 (at dHt + bt dKt) and the best of these, in the sense of minimum size asymptotic confidence intervals, is obtained by choosing for at, bt the values a t , b t , where a t , b t are such that T 0 (at d Ht + bt d Kt) (d a H + b K t)-1 (at d Ht + bt d Kt) is maximized. This result is given by (6.5). The formula (6.5) can also be easily obtained through Gram-Schmidt orthogonalization. Note that for the martingales G1T = HT and G2T = KT - T 0 d H, K t(d H t)-1 dHt we have G1, G2 0 so that, upon standardizing the estimating functions, we have as a combined quasi-score estimating function T 0 (d Ht) (d H t)-1 dHt + T 0 (d G2t) (d G2 t)-1 dG2t, which, after some algebra, reduces to (6.5). In order to compare martingale estimating functions we shall use the martingale information introduced in Section 2.5. For G M we define the martingale information IG as IG = GT ( G T )-1 GT (6.6) 6.3. COMBINING MARTINGALE ESTIMATING FUNCTIONS 97 when G T is nonsingular. Note that in the one-dimensional case where GT = UT , the score function of (Xt, 0 t T), IG is the conditional Fisher information. Also, the quantity IG occurs as a scale variable in the asymptotic distribution of the estimator obtained from the estimating equation GT ( ) = 0. The quantities IG for competing martingales can be compared in the partial order of nonnegative definite matrices. Note that the quasi-score estimating function based on G is Q S(G) = T 0 (d Gt) (d G t)-1 dGt and IQS(G) = T 0 (d Gt) (d G t)-1 d Gt. 
(6.7) In many cases there is one "natural" martingale suggested by the context such as through a semimartingale model representation as is described in Section 2.6. For example, this is certainly the case if (Xt) is representable in the form Xt = t 0 fs() ds + mt() (6.8) where (t) is a real, monotone increasing, right continuous process with 0 = 0, (mt(), Ft) M and {ft()} is predictable. This framework has been extensively discussed in Hutton and Nelson (1986) and Godambe and Heyde (1987). Various other martingales can easily be constructed from a basic martingale (mt(), Ft) to use in conjuction with it. The simplest general ones are, for d = 1, t 0 (dms())2 - m() t (6.9) (discrete time) and m2 t () - m() t = 2 t 0 ms-() dms() , (6.10) the last result following from Ito's formula. Generally if Hn(x, y) is the HermiteChebyshev polynomial in x and y defined by the generating function exp t x - 1 2 t2 y = n=0 tn n! Hn(x, y), then, for each n, Hn(mt, m t) is a martingale. See for example, Chung and Williams (1983, Theorem 6.4, p. 114). 98 CHAPTER 6. COMBINING ESTIMATING FUNCTIONS 6.3.1 An Example We consider the first order autoregressive process Xi = Xi-1 + i, where the i = i() are independent and identically distributed with Ei = 0, E2 i = 2 , E4 i < . Write -3 E3 i = and -4 E4 - 3 = for the skewness and kurtosis, respectively, of the distribution of . We wish to estimate on the basis of sample (Xi, 0 i T). Now the natural martingale for the process (Xi), which of course has a representation of the form (6.8), is Hj = j i=1 (Xi - Xi-1), j = 1, 2, . . . and the associated martingale given by (6.9) is Kj = j i=1 (Xi - Xi-1)2 - j2 , j = 1, 2, . . . . Put Hj = j i=1 hi, Kj = j i=1 ki. The combined quasi-score estimating function given by (6.5) is then 1 3( + 2) 1 - 2 + 2 -1 T i=1 {(2 ˙ - ( + 2)Xi-1)hi + (Xi-1 - 2 ˙)ki} , (6.11) where ˙ = d/d. Note that the quasi-score estimating functions based separately on H and K are - 1 2 T i=1 Xi-1hi, - 2 ˙ ( + 2)3 T i=1 ki, respectively. Thus, if ˙ = 0, the K martingale will contribute to the estimation of in combination with H if = 0 even though it is useless in its own right. If the i are normally distributed, then = 0, = 0 and (6.11) reduces to minus twice the score function. That is, quasi-likelihood and maximum likelihood estimation are the same. To compare the various estimators we calculate their corresponding martingale informations. We find, after some algebra, that IQS(H) = -2 T i=1 X2 i-1, IQS(K) = 4T( ˙)2 ( + 2)2 6.4. APPLICATION. NESTED STRATA OF VARIATION 99 and, writing Q S(H, K) for the combined quasi-score estimating function (6.11), IQS(H,K) = 1 2( + 2) 1 - 2 + 2 -1 T i=1 (( + 2)Xi-1 - 2)Xi-1 + 2 ˙(2 ˙ - Xi-1) = 1 - 2 + 2 -1 IQS(H) + IQS(K) - 4 ˙ 2( + 2) T i=1 Xi-1 . In the case where (Xi) is stationary (|| < 1) we have IQS(H) T-2 EX2 1 = T(1 - 2 )-1 a.s. and IQS(H,K) 1 - 2 + 2 -1 IQS(H) + IQS(K) a.s. as T . If, on the other hand, || 1, then IQS(K) = o(IQS(H)) a.s. and IQS(H,K) 1 - 2 + 2 -1 IQS(H) a.s. as T . Note that even in this latter case combining K with H is advantageous if = 0. 6.4 Application. Nested Strata of Variation Suppose we have a nested model in which e(R) represents the error term for stratum R. Let e denote all the e(R) written in lexicographic order as a column vector. Then, the obvious approach to quasi-likelihood estimation based on the family of estimating function {C e, C constant matrix} is not practicable. Because of the nesting, the e's are correlated across strata making Ee e difficult to invert. 
The problem can be avoided, however, by working with appropriate residuals. Define, within each stratum R, residuals r(R) = (I - W (R)) e(R) (6.12) where W (R) is idempotent so that W (R)r(R) = 0. Then, if W (R) is also chosen so that V (R) W (R) = W (R) V (R) (6.13) 100 CHAPTER 6. COMBINING ESTIMATING FUNCTIONS where V (R) = Ee(R) e(R), we have that a quasi-score estimating function (QSEF) based on the set of estimating functions {C r(R), C constant matrix} is E ˙e(R) Ee(R) e(R) -1 r(R). (6.14) The result (6.14) follows because (dropping the suffix (R) for convenience) the QSEF is, from Theorem 2.1, (E ˙r) (Er r )r = (E ˙e) (I - W ) ((I - W ) V (I - W ) )r, (6.15) the minus denoting generalized inverse. Further, in view of (6.13), (I - W ) V (I - W ) = V - W V - V W + W V W = (I - W ) V since W is idempotent and it is easily checked that ((I - W ) V )= V -1 (I - W ). (6.16) Also from (6.13), (I - W ) V -1 = V -1 (I - W ), (6.17) so that using (6.16), (6.17) and (I - W ) r = r, the right hand side of (6.15) reduces to (E ˙e) V -1 r as required. Now it is necessary to combine the QSEF from each of the strata and for this it is desirable that the residuals be orthogonal across strata. That is, we need Er(R) r(S) = 0 R, S. These requirements are not difficult to achieve in practice and then the combined QSEF is the sum over strata R (Ee(R) e(R))-1 (E ˙e(R)) r(R). (6.18) As a concrete example we shall consider a model introduced by Morton (1987) to deal with several nested strata of extra Poisson variation in a generalized linear model where multiplicative errors are associated with each stratum. The motivation for this modeling came from a consideration of insect trap catches, details of which are given in the cited paper. Suppose that there are three nested strata labeled 2, 1, 0 with respective subscripts i, j, k having arbitrary ranges. Stratum 0, the bottom stratum, has scaled Poisson errors; stratum 1 introduces errors Zij and the top stratum 2 has further errors Zi. It is assumed that the Z's are mutually independent with EZij = 1, EZi = 1, var Zij = 2 1, var Zi = 2 2 6.4. APPLICATION. NESTED STRATA OF VARIATION 101 and the model can be written as Xijk = ijk Zij Zi + eijk with the eijk uncorrelated and E eijk Zij, Zi = 0, var eijk Zij, Zi = ijk Zij Zi where is an unknown scale parameter. From these assumptions it is straightforward to calculate the covariance matrix V for the data vector x = (Xijk) , the elements being written in a row in lexicographic order. Define 1 = 2 1(2 2 + 1)/ and 2 = 2 2/. Then, the elements of V are var Xijk = ijk{1 + (1 + 2) ijk}, cov (Xijk, Xijn) = (1 + 2) ijk ijn (k = n), cov (Xijk, Ximn) = 2 ijk imn (j = m), cov (Xijk, Xlmn) = 0 (i = l). If = (ijk) depends on a vector of unknown parameters, then the quasiscore estimating function can be written as (/) V -1 (x - ). However, this is typically not a tractable form to use because of its dimension and the complexity of V . We shall use the theory developed above to obtain a useful version of the quasi-score estimating function by focusing on the sources of variation in each of the strata separately and optimally combining the resulting estimating functions. This contrasts with the approach of Morton (1987) where ideas based on partitioning of a sum of squares are used (in the spirit of the Wedderburn approach to quasi-likelihood). 
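To make the covariance structure just described concrete, the following sketch builds $V$ from the quoted element formulas for a small three-stratum layout and compares it with the empirical covariance obtained by simulating the multiplicative-error model. The gamma distributions for the $Z$'s and the scaled Poisson bottom stratum are illustrative choices consistent with the stated first and second moments; the text itself fixes only those moments, and the parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# Small nested layout: i = 1,2 (top stratum), j = 1,2 (middle), k = 1,2 (bottom).
I, J, K = 2, 2, 2
mu = rng.uniform(2.0, 5.0, size=(I, J, K))        # illustrative means mu_ijk
phi, s1sq, s2sq = 1.5, 0.3, 0.2                    # scale phi and variances of Z_ij, Z_i (assumed)
phi1 = s1sq * (s2sq + 1.0) / phi
phi2 = s2sq / phi

# Covariance matrix V from the element formulas quoted above.
idx = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)]
n = len(idx)
V = np.zeros((n, n))
for a, (i, j, k) in enumerate(idx):
    for b, (l, m, q) in enumerate(idx):
        if i != l:
            V[a, b] = 0.0
        elif j != m:
            V[a, b] = phi * phi2 * mu[i, j, k] * mu[l, m, q]
        elif k != q:
            V[a, b] = phi * (phi1 + phi2) * mu[i, j, k] * mu[i, j, q]
        else:
            V[a, b] = phi * mu[i, j, k] * (1.0 + (phi1 + phi2) * mu[i, j, k])

# Monte Carlo check under one concrete distributional choice (gamma Z's with mean 1,
# scaled Poisson bottom stratum) consistent with the stated moments.
reps = 100_000
Zi = rng.gamma(1.0 / s2sq, s2sq, size=(reps, I, 1, 1))
Zij = rng.gamma(1.0 / s1sq, s1sq, size=(reps, I, J, 1))
lam = mu * Zij * Zi / phi
X = phi * rng.poisson(lam)
Xflat = X.reshape(reps, n)
print("max |V_mc - V|:", np.abs(np.cov(Xflat, rowvar=False) - V).max())  # small for large reps
```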
We write for the error terms in the three strata, eijk = Xijk - ijk Zij Zi, eij = Xij - ij Zi, ei = Xi - i, where the weighted totals are Xij = k Xijk, ij = k ijk, eij = k eijk, Xi = j wij Xij, i = j wij ij, ei = j wij eij, ei = j wij eij, 102 CHAPTER 6. COMBINING ESTIMATING FUNCTIONS with wij = ij /var ei = 1/(1 + 1 ij), and the residuals are rijk = Xijk - ijk ij Xij = eijk - ijk ij eij, rij = Xij - ij i Xi = eij - ij i ei, ri = Xi - i = ei. We find that var ei = i (1 + 2 i), var eij = ij (1 + 1 ij) and it is easily checked that the conditions (6.12) and (6.13) are satisfied for the three strata. To check orthogonality across strata we note that cov (rijk, Xij) = cov (Xijk, Xij) - ijk ij var Xij = ijk ij (1 + 2) + ijk - ijk ij {2 ij(1 + 2) + ij} = 0 and hence cov (rijk, rij) = 0, cov (rijk, ri) = 0. Also, cov (rij, Xi) = cov (Xij, Xi) - ij i var Xi = 2 ij i + ij - ij i ( 2 2 i + i} = 0 and hence cov (rij, ri) = 0. All other cross strata covariances of residuals are easily seen to be zero. The combined QSEF is then given by (6.18) as ijk ijk rijk ijk + ij ij rij ij (1 + 1 ij) + i i ri i (1 + 2 i) . For another example of a linear model with multiplicative random effects on which quasi-likelihood methods have been used, see Firth and Harris (1991). Their approach is similar to that of Morton (1987). 6.5. STATE-ESTIMATION IN TIME SERIES 103 6.5 State-Estimation in Time Series An important class of time series problems involve state space models of which a simple example is the type F1t(Yt, Yt-1, Xt) = t, F2t(Xt, Xt-1) = t, (6.19) where the Y 's are observed but not the X's, the F's are known functional forms and the { t} and {t} are each uncorrelated sequences of random variables that are mutually orthogonal. In filtering we require the estimation of XT given the observations (Y0, . . . , YT ) and in smoothing the estimation of (X0, . . . , XT ) given (Y0, . . . , YT ). Other parameters in the system are generally assumed known in the context of filtering or smoothing. The joint estimation of the parameters and the X's is known as system identification. In this section we shall indicate the estimating function approach to filtering and smoothing and in particular show how the celebrated Kalman filter follows from optimal combination of estimating functions. The material follows the ideas of Naik-Nimbalkar and Rajarshi (1995). A closely related approach, based on an extension of the E-M algorithm for missing data, is given in Chapter 7.4.4. To focus the discussion we shall specialize (6.19) to the linear case Yt = t Xt + t, Xt - = t(Xt-1 - ) + t, (6.20) where , 's, 's are known constants and if Gt and Ht are the -fields generated by Xt, . . . , X0, Yt, . . . , Y0 and Xt, . . . , X0, Yt-1, . . . , Y0, respectively, then almost surely, E t Ht = 0, E t Gt-1 = 0, E s t Ht = 0 = E s t Ht , s = t. Now the model (6.20) suggests estimating functions G1T = YT - T XT , G2T = XT - XT |T -1, where XT |T -1 is the minimum variance unbiased estimator of XT based on YT -1, . . . , Y0. Further, G1T and G2T are uncorrelated since EG1T G2T = E G2T E T HT = 0, G2T being HT -measurable. Then, according to Section 6.2 we obtain a composite quasi-score by standardizing the estimating functions G1T , G2T and adding. This leads to the estimating equation YT - T XT E 2 T (-T ) + XT - XT |T -1 E(XT - XT |T -1)2 = 0 (6.21) for XT . The solution of (6.21) gives Zehnwirth's (1988) extended Kalman filter, the classical filter corresponding to the case where E( 2 T | HT ) is nonrandom. 104 CHAPTER 6. 
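As a small illustration of the filtering recursion just described, the following sketch applies the solution of (6.21) step by step to the linear model (6.20) in the classical case where the conditional error variance is nonrandom. The noise variances (written q and r below) and other settings are illustrative, since the original symbols were lost in extraction; Gaussian noise is used only to generate data, and the update itself relies only on first and second moments.

```python
import numpy as np

rng = np.random.default_rng(3)

T = 100
mu, beta, theta = 0.0, 0.9, 1.0       # state mean, state coefficient, observation coefficient
q, r = 0.5, 1.0                       # var of the state noise and E(eps_t^2); names are mine

# Simulate the linear state-space model (6.20); Gaussian noise used only for the simulation.
x = np.empty(T); y = np.empty(T)
x_prev = mu
for t in range(T):
    x[t] = mu + beta * (x_prev - mu) + rng.normal(0, np.sqrt(q))
    y[t] = theta * x[t] + rng.normal(0, np.sqrt(r))
    x_prev = x[t]

# Filtering: at each step solve the combined estimating equation (6.21),
#   -theta (y_t - theta x) / r + (x - x_pred) / p_pred = 0,
# for x, which yields the familiar Kalman-type update.
x_filt = np.empty(T)
x_est, p_est = mu, q / (1 - beta ** 2)      # stationary prior for the first state
for t in range(T):
    x_pred = mu + beta * (x_est - mu)       # X_{t|t-1}
    p_pred = beta ** 2 * p_est + q          # E(X_t - X_{t|t-1})^2
    gain = p_pred * theta / (theta ** 2 * p_pred + r)
    x_est = x_pred + gain * (y[t] - theta * x_pred)   # root of (6.21)
    p_est = (1 - gain * theta) * p_pred     # standard linear least-squares variance update
    x_filt[t] = x_est

print("RMSE of filtered state:", np.sqrt(np.mean((x_filt - x) ** 2)))
print("RMSE of raw observations:", np.sqrt(np.mean((y / theta - x) ** 2)))
```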
COMBINING ESTIMATING FUNCTIONS We now consider the case of smoothing. Here we can conveniently use the estimating functions H1T = T t=1 ct(Yt - t Xt) H2T = T t=1 dt(Xt - - t(Xt-1 - )) where ct is Ht-1-measurable and dt is Gt-1-measurable. Then, using the ideas of Section 6.3 and noting that H1T , H2T are orthogonal, the quasi-score estimating equations for X0, . . . , XT are Yt - t Xt E 2 t Ht (-t) + Xt - - t(Xt-1 - ) E 2 t Gt-1 + Xt+1 - - t+1(Xt - ) E 2 t+1 Gt (-t+1) = 0, t = 1, 2, . . . , T - 1, YT - T XT E 2 T HT (-T ) + XT - - T (XT -1 - ) E 2 T GT -1 = 0. Further examples and references can be found in Naik-Nimbalkar and Rajarshi (1995). 6.6 Exercises 1. (Generalized linear model). Take X = (X1, . . . , Xn) as a sample of independent random variables, {F} a class of distributions and = (1, . . . , p) a vector parameter such that EF Xi = i((F)), var Xi = (F) Vi((F)) for all F F where i, Vi, i = 1, 2, . . . , n are specified real functions of the indicated variables. Suppose that the skewness () and kurtosis () are known: = EF (Xi - i)3 /(Vi)3/2 , = EF (Xi - i)4 /(Vi)2 - 3 for all F F and i = 1, 2, . . . , n. Let h1i = Xi - , h2i = (Xi - i)2 - Vi - (Vi)1/2 (Xi - i) and write hr = (hr1, . . . , hrn) , r = 1, 2. Show that the quasi-score estimating function for the family G = h1 + h2; , nonrandom n × p matrices 6.6. EXERCISES 105 is h1 + h2 where = -(Vi)-1 (i/j) , = (Vi)1/2 (i/j) + (Vi/j) (Vi)2( + 2 - 2) (Crowder (1987), Firth (1987), Godambe and Thompson (1989)). 2. Consider the random effects model Xi,j = + ai + eij (i = 1, 2, . . . , n, j = 1, 2, . . . , ni), where ai and eij are normally distributed with zero means and variances 2 a and 2 i , respectively. Take = 0 for convenience. Use the space of estimating functions H = n i=1 Ai Y i where the Ai are (n + 1) × (ni + ni(ni - 1)/2) nonrandom matrices and the Y i are the ni + ni(ni - 1)/2 vectors Y i = (X2 i1 - EX2 i1, . . . , X2 ini - EX2 ini , Xi1Xi2 - EXi1Xi2, . . . , Xi1Xin1 - EXi1Xin1 , . . . , Xini-1Xini - E(Xini-1Xini )) to find a quasi-score estimating function for estimation of 2 a and 2 i , i = 1, 2, . . . , n (Lin (1996)). Chapter 7 Projected Quasi-Likelihood 7.1 Introduction There is a considerable emphasis in theoretical statistics on conditional inference. Principles such as that inference about a parameter should be conditioned on an ancillary statistic have received much attention. This discussion requires full specification of distributions and is consequently not directly available in quasi-likelihood context. However, conducting inference conditionally on a statistic T is equivalent to projecting onto the subspace orthogonal to that generated by T and such projections can conveniently be carried out for spaces of estimating functions. Indeed, projection based methods seem to be the natural vehicle for dealing with conditioning in an estimating function context, and it has recently become clear that quasi-likelihood extensions of much conditional inference can be obtained via projection. It is worthwhile to trace the development of this subject from conditional likelihoods to conditional score functions to projected quasi-likelihoods. The principles of conditioning date back to Fisher but with extensive subsequent developments; see Cox and Hinkley (1974, Chapter 2) for a discussion and references. 
The first step to an estimating function environment was taken by Godambe (1976) who showed that the conditional score function is an optimal estimating function for a parameter of interest when the conditioning statistic is complete and sufficient for the nuisance parameters. This work was generalized by Lindsay (1982) to deal with partial likelihood factorizations and cases where the conditioning statistic may depend on the parameter of interest. The next steps in the evolution, from a likelihood to a quasi-likelihood environment, are now emerging and elements of this development are the subject of this chapter. Conditioning problems may involve either the parameter to be estimated or the data that are observed. We shall illustrate by dealing with parameters subject to constraint, nuisance parameters, and missing data (for which we provide a P-S (Project-Solve) generalization of the E-M (Expectation-Maximization) algorithm). 7.2 Constrained Parameter Estimation Suppose that the parameter is subject to the linear constraint F = d, where F is a p × q matrix not depending on the data or . For nonlinear or inequality constraints the extension of the theory is indicated briefly in Section 7.2.3. The overall discussion follows Heyde and Morton (1993). 107 108 CHAPTER 7. PROJECTED QUASI-LIKELIHOOD For an arbitrary positive-definite symmetric matrix V , we define the projection matrix P V = F (F V -1 F )F V -1 , where Adenotes a generalized inverse of A. Necessarily (Rao (1973, p. 26)) P V F = F , and (Rao (1973, p. 47)) P V is unique. In what follows, V is independent of the data but may depend on , although such dependence will be suppressed in the notation. An important property of these projections is that, for any V and W , (I - P W )(I - P V ) = I - P W . (7.1) We shall consider three methods of estimation of subject to constraint. For Methods 2 and 3 we use standardization of the type employed in Chapter 2 which ensures that the likelihood score property EG G = -E ˙G holds. Method 1: Projection of free estimator. Given any estimating function G G, let ^ solve G(^) = 0. For arbitrary V the projected estimator is ~ = (I - P V ) ^ + V -1 F (F V -1 F )d. (7.2) Provided that d is consistent in the sense that F V -1 F (F V -1 F )d = d, then the constraint F ~ = d is ensured. Method 2: Projection of standardized functions. For any standardized estimating function G G let ~ solve the equations (I - P V ) G(~) = 0, F ~ = d. (7.3) Note that from (7.1) the choice of V is immaterial, for if we multiply (7.3) by I -P W , we get (I -P W ) G(~) = 0. In practice we may like to choose V = I. Method 3: Analogue of Lagrange multipliers. If we had a quasilikelihood q, then by the method of Lagrange multipliers we would maximize q + (F - d), where is determined by the constraint. This suggests the following method. Let G G be any standardized estimating function. We solve for ~ and the equations G(~) + F = 0, F ~ = d. (7.4) This method is available even when G is not the derivative of a function q or even a quasi-score. Define V 0 = E(G) and P 0 to be projector with V = V 0. The information criterion is modified to EF (G) = (I - P 0)V 0(I - P 0) . 7.2. CONSTRAINED PARAMETER ESTIMATION 109 Its symmetric generalized inverse is EF (G)= (I - P 0) V -1 0 (I - P 0), (7.5) which is shown below to be the asymptotic variance for the projected estimator with the optimum choice of V = V 0. Our optimality criterion will be to minimize (7.5). 
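As a small numerical illustration of Method 1, the sketch below projects an unconstrained least-squares estimator onto a single linear constraint using (7.2) and checks that the result satisfies the constraint and agrees with a direct Lagrange-multiplier (constrained least squares) solution. The design, the constraint and the choice of V proportional to X'X are illustrative assumptions, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(4)

n, p = 50, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, 2.0, -1.0])
Y = X @ beta_true + rng.standard_normal(n)

# One linear constraint F' theta = d, here theta_1 + theta_2 + theta_3 = 2 (illustrative).
F = np.ones((p, 1))
d = np.array([2.0])

theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)     # free (unconstrained) estimator

# Method 1: projected estimator (7.2) with V taken proportional to X'X.
V = X.T @ X
Vinv = np.linalg.inv(V)
Ginv = np.linalg.pinv(F.T @ Vinv @ F)             # generalized inverse of F'V^{-1}F
PV = F @ Ginv @ F.T @ Vinv                        # projection matrix P_V
theta_proj = (np.eye(p) - PV).T @ theta_hat + Vinv @ F @ Ginv @ d

# Lagrange-multiplier (constrained least squares) solution via the bordered system.
KKT = np.block([[X.T @ X, F], [F.T, np.zeros((1, 1))]])
rhs = np.concatenate([X.T @ Y, d])
theta_lagr = np.linalg.solve(KKT, rhs)[:p]

print("constraint satisfied:", np.allclose(F.T @ theta_proj, d))
print("projection and Lagrange solutions agree:", np.allclose(theta_proj, theta_lagr))
```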
With this choice, it is seen that the three methods are asymptotically equivalent and that optimality occurs within H when G is a quasi-score estimating function, standardized for the purpose of (7.3) and (7.4). 7.2.1 Main Results Theorem 7.1 Let ~ be the projected estimator (7.2). Suppose that ^ has asymptotic variance E(G)-1 . Then ~ has asymptotic variance (I - P V ) E(G)-1 (I - P V ) EF (G)- ; equality holds when V = E(G). Proof. The formula for the asymptotic variance ~ is obvious from the definition of ~. To prove the inequality, we use (7.1) and the fact that V -1 0 P 0 = P 0V -1 0 and P 2 0 = P 0. The difference is (I - P V ) V -1 0 (I - P V ) - (I - P 0) V -1 0 (I - P 0) = (I - P V ) {V -1 0 - (I - P 0) V -1 0 (I - P 0)}(I - P V ) = (I - P V ) {V -1 0 - V -1 0 (I - P 0)}(I - P V ) = (I - P V ) V -1 0 P 0(I - P V ) = (I - P V ) P 0 V -1 0 P 0(I - P V ) = (P 0 - P V ) V -1 0 (P 0 - P V ) 0, since V -1 0 is positive definite. Theorem 7.2 Let G G be a standardized estimating function. Assume that with probability tending to 1 all methods (7.2)­(7.4) possess unique solutions for ~ in some neighborhood of . Then in this neighborhood, when solutions are unique: (a) under mild regularity conditions the projected estimator (7.2) with V = V 0 and the projected function estimator (7.3) agree to first order; (b) the projected function estimator (7.3) and the Lagrange analogue estimator (7.4) have identical solutions. 110 CHAPTER 7. PROJECTED QUASI-LIKELIHOOD Proof. (a) Expand G(^) about to first order and assume that ˙G approximates -V 0. Then ^ - - ˙G()-1 G() V -1 0 G(). Hence ~ - (I - P 0) V -1 0 G() and G(~) G() + ˙G()(~ - ) G() - V 0(~ - ) {I - V 0(I - P 0) V -1 0 } G() = P 0 G(). Thus (I - P 0) G(~) 0, so (7.3) is satisfied to first order. By uniqueness, the two estimators must agree to first order. (b) Given that a solution to (7.3) exists, it also solves (7.4) with = -(F V -1 F )F V -1 G. By uniqueness the two solutions must be identical. Corollary 7.1 Assume that G() is asymptotically normally distributed under variance based norming. Then, under mild regularity conditions, and for large samples, ~ is approximately normally distributed with mean and singular variance matrix EF (G), whichever method is used. Also, (I -P 0) G() is approximately normally distributed with mean zero and singular variance matrix EF (G). This Corollary may be used to construct approximate confidence regions for (see Chapter 4). We would prefer to base them on the projected estimating function, since the asymptotic normality of G rather than ^ is usually the origin of the theory, so that the further approximation due to the expansion of G is avoided. Then the confidence region for is of the form G EF (G)G c, F = d, where c is a scalar obtained from the appropriate chi-squared or F distribution. Theorem 7.3 Let Q be a quasi-score estimating function within H. Then Q is optimal in that EF (G)- EF (Q)for all G H. Proof. Since E(Q) E(G), E(G)-1 E(Q)-1 . Hence (I - P V ) {E(G)-1 - E(Q)-1 } (I - P V ) 0. Upon inserting V = E(G), we get that EF (G)- (I - P V ) E(Q)-1 (I - P V ) EF (Q)- 7.2. CONSTRAINED PARAMETER ESTIMATION 111 by Theorem 7.1. Naturally, the asymptotic properties of the corollary hold for Q. 7.2.2 Examples Example 1: Linear regression. Consider the usual regression model Y = X + , E( ) = 0, E( ) = 2 I, where X is the design matrix with full rank and F = d. In this case the quasi-score estimating function Q = X (Y - X ) gives the usual normal equation Q = 0. 
The optimal choice for V is V 0 = 2 X X, so P 0 = F {F (X X)-1 F }F (X X)-1 . Standardization of Q gives Q(s) = -2 Q, which would be the likelihood score if were normally distributed. The free estimator is ^ = (X X)-1 X Y , so its projection is ~ = ^ - (X X)-1 F {F (X X)-1 F }(F ^ - d), which is identical to that obtained by projecting Q(s) or using the Lagrange multiplier method. The result has been long known: see Judge and Takayama (1966), who used Lagrange multipliers with least squares. Example 2: Two Poisson processes, constrained total. Let Ni = {Ni(t), 0 t T} be independent Poisson processes with constant rates i (i = 1, 2), which are constrained so that 1 + 2 = 1. We have that H is the class of estimating functions of the form T 0 b1(s){dN1(s) - 1 ds}, T 0 b2(s){dN2(s) - 2 ds} , with b1, b2 being predictable. An application of Theorem 2.1 can be used to find the quasi-score estimating function Q = -1 1 N1(T) - T, -1 2 N2(T) - T . This is already standardized and in fact is the likelihood score. Here F = (1, 1) , V 0 = diag (-1 1 , -1 2 ), so that P 0 = (1 + 2)-1 1 2 1 2 . 112 CHAPTER 7. PROJECTED QUASI-LIKELIHOOD Using (7.3) we obtain the estimator ~i = Ni(T)/{N1(T) + N2(T)} (i = 1, 2), which is identical to the solution of (7.2) and (7.4). For an example in this form a likelihood and Lagrange multiplier approach is available, but this could easily have been precluded by the use of counting processes, which were not specified to be Poisson and for which only the first and second moment properties were assumed. Example 3: Linear combination of estimating functions. Suppose that we have n > p estimating functions H = {hi(X, )} such that E(H) = 0 and E(H H ) = V H exist. We wish to reduce the dimension of H to p by choosing optimal linear combinations. That is, H consists of estimating functions of the form G = C H, where C is a p × n matrix depending only on . The standardized quasi-score estimating function is Q(s) = -E( ˙H) V -1 H H. (7.6) Suppose now that we wish to reduce the dimensions of H in the presence of linear constraints F = d. By (7.4), the theory is modified to solve for ~, from E( ˙H) V -1 H H - F = 0, F ~ = d. The theory extends to the case where C may contain incidential parameters (Morton (1981a)). An important application concerns functional and structural relationships where the elements of H are the contrasts that are free of the incidential parameters (Morton (1981b)). 7.2.3 Discussion We have presented the theory assuming that the constraints are linear in order to avoid obscuring the simple arguments by messy details. If we had nonlinear constraints, say () = c, then the Lagrange multiplier analogue replaces F by F () = / , which we assume exists and has full rank. If the projection methods are used, then this substitution is also made and in (7.2) we must also approximate d() = F (0) iteratively. In (7.3) and (7.4), F = d is replaced by () = c. Inequality constraints could in principle be handled by Lagrange multipliers. If the jth constraint is (F )j dj, the corresponding multiplier j would be set equal to zero if strict inequality holds and would be free if equality holds. In many problems it would be natural to transform the parameter to ( , ) , where has r = rank(F ) elements, which define the constraints, and is free. We then need to reduce G to dimensions p - r. Picking any p - r elements of G would not be optimal; the best linear reduction would be achieved in the manner of (7.6), which would be equivalent to projecting G as in (7.3). 
Algebraic elimination of from G would be equivalent to the Lagrange 7.3. NUISANCE PARAMETERS 113 multiplier method. For nonlinear problems this could lead to an estimating function outside the class G. In the case where G is the class of functions that are linear in , all three methods agree with the parameter transformation method. In the linear regression model (Example 1), suppose that = F is not fixed but is a nuisance parameter. It can be shown that, in the terminology of Chapter 3 that Q1 = (I -P 0) Q is locally E-ancillary for and if Q2 = P 0 Q, then EQ1 Q2 = 0 and hence Q2 is locally E-sufficient for . In the general situation, assuming that linear approximation is adequate, the decomposition Q = P 0 Q + (I - P 0) Q has the first term locally E-sufficient for and the second term locally Eancillary for . Thus, the method of projection of standardized functions (7.3) discards the first order sufficient information about and replaces it by the known constraint = d. 7.3 Quasi-Likelihood in the Presence of Nuisance Parameters In this section we use methods analogous to those of the last section to treat the case where the basic parameter contains a (vector) nuisance component. Suppose that the parameter is partitioned with = ( , ), being the component of interest and a nuisance parameter and that G is a standardized estimating function chosen for the estimation of . Standardization means that the likelihood score property E ˙G = E(G/) = -EG G holds. We write G = G G and partition the information matrix V G = EG G = E(G) according to V G = V V V V where V = EG G, V = EG G, V = V = E(G G). 114 CHAPTER 7. PROJECTED QUASI-LIKELIHOOD Write also F = V V = -E(G/), F = V V = -E(G/). It turns out that the projection P = F (F V -1 G F )-1 F V -1 G identifies the information about for given and the estimating equation (I - P ) G = 0 (7.7) is optimal for estimation of in the presence of the nuisance parameter . In (7.7) the sensitive dependence of G on has been removed in the sense that E {(I - P ) G} = 0 since P F = F and G has zero mean. This is, of course, a first order approach and (I -P )G will not in general be free of . There is a convenient interpretation in terms of E-ancillary and E-sufficient estimating functions as discussed in Chapter 3. It can be seen that (I - P ) G is locally E-ancillary for and P G is locally E-sufficient for . To appreciate (7.7) in more detail, first note that, in partitioned matrix form, F V -1 G = 0 I , so that P = F V -1 0 I = 0 V V -1 0 I and (I - P ) G = G - V V -1 G 0 . The "information" concerning associated with the estimation procedure is E G - V V -1 G G - V V -1 G = V - V V -1 V , (7.8) or alternatively we can think of it in partitioned matrix form as E(G) = (I - P ) V G (I - P ) = V - V V -1 V 0 . 7.3. NUISANCE PARAMETERS 115 Of course there is no loss in efficiency if G, G are orthogonal, for then V = V = 0. The formula (7.8) is a direct generalization of the familiar result obtained from partitioning the likelihood score (e.g., Bhaphkar (1989)) and the asymptotic variance of the estimator of based on G is then V - V V -1 V -1 = V , (7.9) where V -1 G = V V V V . Optimality of estimation for thus focuses on maximizing V in the partial order of nnd matrices and we have the following theorem. Theorem 7.4 Let Q be a quasi-score estimating function within H. Then Q is optimal for estimation of in the sense that V G V Q in the partial order for all G H. Proof. 
The result follows immediately from the fact that E(Q) = V Q E(G) = V G for all G H and then V -1 G V -1 Q . As a simple example of the methodology we shall discuss the linear model Z = X + , say, where Z is a vector of dimension T, X = (X1, X2) is T × p matrix with X1 of dimension T × r and X2 of dimension T × (p - r). Also, = ( , ) where and are, respectively, vectors of dimension r and p - r and is a vector of dimension T with zero mean and covariance matrix . In this case the standardized estimating function is G = X -1 (Z - X1 - X2 ) = X1 -1 (Z - X1 - X2 ) X2 -1 (Z - X1 - X2 ) = G G . Also, V G = EG G = X -1 X = X1 -1 X1 X1 -1 X2 X2 -1 X1 X2 -1 X2 = V V V V . 116 CHAPTER 7. PROJECTED QUASI-LIKELIHOOD Then, the estimating function G - V V -1 G = X1 -1 - (X1 -1 X2)(X2 -1 X2)-1 X2 -1 (Z - X1 ) does not involve the nuisance parameter and the estimator for is obtained from equating this to zero. This solution can, of course, be obtained directly from solving the ordinary estimating equation G G = 0 0 and then eliminating from the resultant equations. Indeed, many nuisance parameter problems are amenable to such a direct approach. The focus in this section has been on projection but there is no need for this geometric interpretation to be emphasized. The first order theory seeks the best combination G - c(, )G, which is easy to calculate analytically. Also, it is not difficult to go beyond the first order theory. Second order theory allows, incorporation, for example, of the result of Godambe and Thompson (1974) who dealt with the case in which the likelihood score is U = (U, U) , and being scalars, and showed that an estimating function of the form U - c(, )(U2 - EU2 ), if c(, ) can be chosen to make it free of , is the best choice for estimating . An example where this holds is in estimation of using a random sample of observations from the N(, ) distribution. 7.4 Generalizing the E-M Algorithm: The P-S Method The E-M algorithm (e.g., Dempster et al. (1977)) is a widely used method for dealing with maximum likelihood calculations where there are missing or otherwise incomplete data. It involves the taking of the expectation of the complete-data likelihood with respect to the available data (the E-step) and then maximizing this over possible distributions (the M-step) and the procedure suggests a simple iterative computing algorithm. It has not been available in contexts where a likelihood is unknown or unavailable. In this section we extend the E-M algorithm method to deal with estimation via estimating functions, in particular the quasi-score. The transitions to estimating functions is made since there are situations where no quasi-loglikelihood exists. The discussion here follows Heyde and Morton (1995). In our approach the E-step is replaced by a step which projects the quasiscore rather than taking expectations of a log-likelihood. In many examples the 7.4. GENERALIZING THE E-M ALGORITHM: THE P-S METHOD 117 projection is equivalent to predicting the missing data or terms. The predictor will not in general be a conditional expectation. The M-step is replaced by solving the projected quasi-score set equal to zero. The approach can reasonably be described as the projection-solution (P-S) method. When the likelihood is available and the score function is included in the class of estimating functions permitted, the standard E-M procedure is recovered from the P-S method. 
More broadly, however, we seek to make the point that there is no formal difference between QL estimation for incomplete data and QL estimation for complete data. 7.4.1 From Log-Likelihood to Score Function As a prelude to generalization of the E-M algorithm formalism from log-likelihoods to estimating functions, we first show how attention can be transferred from operations on the log-likelihood to corresponding ones on its derivative, the likelihood score. We denote the full data by x and the observed data by y. The parameter of interest is , a p ( 1) dimensional vector. The likelihood of based on data z is denoted by L(; z). First note that, from the definition of conditional distributions, log L(; y) = log L(; x) - log L(; x | y) . (7.10) The second term on the right hand side of (7.10) has zero expectation conditional on y under the usual regularity conditions and so log L(; y) = E log L(; x) y . (7.11) That is, the likelihood score based on data y is obtained by taking the expectation, conditional on y fixed, of the likelihood score based on the full data x. Now the M-step of the E-M algorithm is based on maximization of E0 (log L(; x) | y), and it is clear that, under the usual regularity conditions, log L(; y) = E0 log L(; x) y 0= . (7.12) The iterative computing algorithm involving (p+1) as the value of that maximizes E(p) (log L(; x) y) then has as its obvious analogue the choice of (p+1) as the value of which solves E(p) log L(; x) y = 0 (7.13) 118 CHAPTER 7. PROJECTED QUASI-LIKELIHOOD or, equivalently, to solve log L(; y)/ = 0 via E(p) log L(; x) y =(p+1) = 0, (7.14) using (7.11). 7.4.2 From Score to Quasi-Score The Framework Now suppose that likelihoods are unsuitable or unavailable and that, based on data x, a family of zero mean, square integrable estimating functions Hx has been chosen within which the quasi-score estimating function is Q(; x). The function Q may be a likelihood score but this will not be the case in general, nor will Q in general be expressable as the derivative with respect to of a scalar function. As usual Q is chosen as the estimating function G Hx for which the "information" E(G) = (E ˙G) (EG G )-1 E ˙G (7.15) is maximized in the Loewner ordering (partial order of nonnegative definite matrices). It should be recalled that estimating functions G and M G, M being any nonsingular matrix of the dimension of G, give rise to the same estimators of . For each estimating function G we choose M = -(E ˙G) (EG G )-1 and denote G(s) = M G, the standardized version of G. This satisfies the likelihood score property EG(s) G(s) = -E ˙G (s) = E(G(s) ) (= E(G)). All estimating functions in the subsequent discussion of this section will be assumed to be in standardized form and we shall drop the superscripts for convenience. Note that the likelihood score is automatically in standardized form. We shall subsequently stipulate that all families G of estimating functions under consideration, such as Hx, are convex. Then, Q is a quasi-score estimating function within G iff EG Q = -E ˙G (7.16) for all G G (Theorem 2.1). In the case where the likelihood score U exists, an equivalent formulation of quasi-score is obtained by defining Q Hx to be a quasi-score estimating function if E{(Q - U)(Q - U) } = inf GHx E{(G - U)(G - U) }, (7.17) 7.4. GENERALIZING THE E-M ALGORITHM: THE P-S METHOD 119 the infimum being taken with respect to the partial order of nonnegative definite matrices. 
This follows easily from the formulation based on (7.15) upon observing that E{(G - U)(G - U) } = EG G - EU U (Criterion 2.2 and Theorem 2.1). Also, we see in conjunction with (7.17) that E{(G - U)(G - U) } = E{(Q - U)(Q - U) } (7.18) + E{(Q - G)(Q - G) }. The general situation with which we are concerned is when y = y(x) is observed rather than x. We seek to adapt a quasi-score estimating function Q(; x) to obtain Q (; y) where Q is a quasi-score in a class Hy, which is typically a suitable linear subspace of Hx. Then, E{(Q - U)(Q - U) } = inf GHy E{(G - U)(G - U) }, when U exists, and subtracting E{(Q-U)(Q-U) } from both sides, we have that E{(Q - Q)(Q - Q) } = inf GHy E{(G - Q)(G - Q) }, (7.19) a formula whose use we advocate even when U does not exist. Thus, Q is the element of Hy with minimum dispersion distance from Q Hx. We note that Q = E(Q | y) provided this belongs to Hy as in the E-M algorithm. However, in general Q is just given as the least squares predictor (7.19). That is, the "expectation" step is replaced by a "projection" step in the E-M algorithm procedure. We can regard this as a generalization of the E-M technique. The Algorithm In general the algorithm proceeds as follows. Write a 2 = a a for vector a. Then, we define H( | 0; y) such that E0 H( | 0; y) - Q(; x) 2 = inf GHy E0 G(; y) - Q(; x) 2 and solve iteratively for (p+1) from H((p+1) (p) ; y) = 0 starting from an initial guess (0) . A sequence {(p) } is generated and, provided (p) , we have in the limit E H( ; y) - Q(; x) 2 = inf GHy E G((; y) - Q(; x) 2 , (7.20) 120 CHAPTER 7. PROJECTED QUASI-LIKELIHOOD and Q (; y) = H( ; y) satisfies (7.19) provided a solution does exist. This last result is a straightforward consequence of the following lemma. Lemma 7.1 If a and b are random vectors such that E(a a - b b ) is nonnegative definite and E a 2 E b 2 , then Ea a = Eb b . For our application, suppose that K = K(; y) Hy satisfies (7.20) and that a solution Q = Q (; y) to (7.19) exists. We take a = K - Q and b = Q - Q in the lemma. Since (7.19) gives nonnegativity of E(a a - b b ) while (7.20) gives E a 2 E b 2 we obtain Ea a = Eb b . However, for Hy Hx we have Ea b = 0, so that E a - b 2 = E a 2 - E b 2 = 0, which gives K = Q a.s. Finally, to prove the lemma we note that, upon taking traces, 0 tr {E(a a - b b )} = tr {E a 2 - E b 2 } 0, so that A = E(a a -b b ) is a symmetric and nonnegative definite matrix with tr A = 0. This forces A 0 since the sum of squares of the elements of A is the sum of squares of its eigenvalues all of which must be zero as they are nonnegative and sum to zero. Alternatives It may be expected, by analogy with the arguments of Osborne (1992) for the score function, that the algorithm developed above will give a first order rate of convergence compared with the second order rate typical of Fisher's method of scoring. Thus, the latter would be preferred for solving Q (; y) = 0 if the necessary derivative calculation could be straightforwardly accomplished. The E-M generalization, of course, has the virtue of avoiding any such requirement. Sufficient conditions for convergence of either method can, in principle, be formulated along the lines of Section 3 of Osborne (1992), but they involve technical assumptions that ensure smoothness and the applicability of a law of large numbers, and are not transparent, so they will not be discussed herein. It should be noted that there is no formal requirement for Q in (7.19), (7.20) to be a quasi-score estimating function. 
It can, in principle, be replaced by any estimating function H Hx. However, the optimality properties associated with the quasi-score estimating function will then be lost. In practice Q can usually be calculated with the aid of Theorem 2.1, only first and second moment properties being required. 7.4. GENERALIZING THE E-M ALGORITHM: THE P-S METHOD 121 7.4.3 Key Applications Estimating Functions Linear in the Data Since explicit expressions are not available in general we shall focus on the most common area of use of quasi-likelihood methods, namely, the situation of estimating functions which are linear in the data. Let = Ex, = E y, V x = E(x - )(x - ) , V y = E(y - )(y - ) . If Hx and Hy consist, respectively, of functions that are linear in x and y, the quasi-score estimating functions may be taken as Q(; x) = V -1 x (x - ), Q (; y) = V -1 y (y - ) (7.21) (e.g. using Theorem 2.1). Also, if y belongs to a linear subspace of x, y = C x say, then we have = C , V y = C V x C and the best linear predictor of x given y (in the weighted least squares sense) is ^x(y) = + V x C V -1 y (y - ), so that Q (; y) = Q(; ^x(y)). (7.22) It should be noted that even in this particular case Q (; y) = E(Q(; x) y) (7.23) in general, the right-hand side being what is suggested by the E-M prescription. Equality in (7.23) requires that E(x | y) be linear in y as holds, for example, for Normal or Poisson variables. Another context where equality in (7.23) may not hold occurs when Hy is not a subset of Hx. Then it can also happen that Q Hx. For details involving exponential variables see the example concerned with totals from data with constant coefficient of variation below. Generalization to Polynomials The linear description developed above can be exploited to deal with quadratic, and indeed higher polynomial, functions of x. To illustrate this, take Hx to include quadratic functions of x and again y = C x. We can regard Hx as linear in vectors containing all 1 2 p(p + 3) linear and quadratic elements x = (x1, . . . , xp, x2 1, x1x2, . . . , x2 p) 122 CHAPTER 7. PROJECTED QUASI-LIKELIHOOD and Hy as linear, in a similarly augmented vector y with y = C x where C is constructed so as to generate the quadratic functions of y. 7.4.4 Examples Screening Tests Our first example deals with extension of a standard setting for the E-M algorithm concerned with developing screening tests for a disease given by Laird (1985). Our purpose is to compare the approach via the score or quasi-score estimating function to that of the familiar one via the log-likelihood which is given by Laird. The problem here involves observed data consisting of a random sample of patients measured on screening tests, each of which gives a dichotomous result. The true disease status is unknown and for Test i the sensitivity has a distribution with mean Si, i = 1, 2. It is assumed that the test results are conditionally independent given disease status and that false positives are not possible. We wish to estimate S1, S2 and the disease prevalence . Retaining Laird's notation, the observed data can be put into an array of the form Test 2 + + y11 y12 Test 1 - y21 y22 The complete data are x11, x12, x21, x221, x222 where yij = xij unless (i, j) = (2, 2) and y22 = x221 +x222, with x221 being the number of diseased individuals with negative outcomes on both tests and x222 the number of nondiseased patients. We also write x1+ = x11 + x12, x+1 = x11 + x21, N for the sample size ijyij and ND for the number of diseased patients (N - x222). 
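The linear projection step (7.21)-(7.22) used below can be checked numerically. In the following sketch the mean function, the covariance matrix and the aggregation matrix C are arbitrary illustrative choices; the point is only that the quasi-score based on y = C'x coincides with the full-data quasi-score evaluated at the best linear predictor of x given y.

```python
import numpy as np

rng = np.random.default_rng(5)

t = np.array([0.5, 1.0, 1.5, 2.0])
def mu(theta):        # illustrative mean function mu_i(theta) = exp(theta * t_i)
    return np.exp(theta * t)
def mu_dot(theta):    # derivative of the mean with respect to theta
    return t * np.exp(theta * t)

A = rng.standard_normal((4, 4))
Vx = A @ A.T + 4 * np.eye(4)           # illustrative covariance of x (taken free of theta)

C = np.array([[1.0, 0.0],              # y = C'x : the two pairwise totals
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

theta = 0.3
x = mu(theta) + np.linalg.cholesky(Vx) @ rng.standard_normal(4)
y = C.T @ x

# Quasi-score based on y directly, as in (7.21).
Vy = C.T @ Vx @ C
Q_star = (C.T @ mu_dot(theta)) @ np.linalg.solve(Vy, y - C.T @ mu(theta))

# The same quantity via the best linear predictor x_hat(y) of x, as in (7.22).
x_hat = mu(theta) + Vx @ C @ np.linalg.solve(Vy, y - C.T @ mu(theta))
Q_via_pred = mu_dot(theta) @ np.linalg.solve(Vx, x_hat - mu(theta))

print(Q_star, Q_via_pred)    # equal up to rounding error
```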
Laird has supposed that the test sensitivities are random and that, conditional on ND fixed, x1+ and x+1 have, respectively, the independent binomial distribution b(ND, S1) and b(ND, S2) while x222 has the binomial distribution b(N, 1-). Then the quasi-score estimating function based on linear functions of the essential data x1+, x+1 and ND is Q(; x) = x1+ - E x1+ ND var x1+ ND E x1+ ND + x+1 - E x+1 ND var x+1 ND E x+1 ND 7.4. GENERALIZING THE E-M ALGORITHM: THE P-S METHOD 123 + ND - END var ND END , where = (S1, S2, ) . We have, under Laird's assumptions, E x1+ ND = ND S1, var x1+ ND = ND S1 (1 - S1), E x+1 ND = ND S2, var x+1 ND = ND S2 (1 - S2), END = N , var ND = N (1 - ) so that Q(; x) = x1+ - ND S1 S1(1 - S1) , x+1 - ND S2 S2(1 - S2) , ND - N (1 - ) = x1+ S1 ND - x1+ 1 - S1 , x+1 S2 ND - x+1 1 - S2 , ND - x222 1 - , which is also the likelihood score estimating function as is easily seen from Laird's discussion. Then, the obvious choice for Hy yields the quasi-score E(Q(; x) y) = y1+ S1 E(ND | y) - y1+ 1 - S1 , y+1 S2 E(ND | y) - y+1 1 - S2 , E(ND | y) N - E(ND | y) 1 - and a simple conditional probability argument gives E(ND | y) = N - y22(1 - ) (1 - S1)(1 - S2) + (1 - ) . (7.24) If an algorithmic solution is sought, we set N (p+1) D = E(ND | y, (p) ) and we have H((p+1) | (p) , y) = y1+ S (p+1) 1 - N (p+1) D - y1+ 1 - S (p+1) 1 , y+1 S (p+1) 2 - N (p+1) D - y+1 1 - S (p+1) 2 , N (p+1) D (p+1) N - N (p+1) D 1 - (p+1) leading to (p+1) = N (p+1) D N , S (p+1) 1 = y1+ N (p+1) D , S (p+1) 2 = y+1 N (p+1) D as obtained by Laird for the E-M algorithm solution. The iterative solution converges quite quickly in this case but an explicit solution to the estimating equation E(Q(; x y) = 0 124 CHAPTER 7. PROJECTED QUASI-LIKELIHOOD is also available. After some algebra we find that ^S1 = y11 y+1 , ^S2 = y11 y1+ , ^ = 1 - y22 N 1 - (1 - y11 y+1 )(1 - y11 y1+ ) = y+1 y1+ N y11 , which gives the maximum likelihood estimator for based on y. Note that the full distributional assumptions made by Laird are not needed in the above analysis. All that is required is the first and second (conditional) moment behavior of x1+, x+1 and ND. Suppose now that the data are not homogeneous and perhaps not really from a random sample. It is, nevertheless, still reasonable to take E(x1+ ND) = ND S1, E(x+1 ND) = ND S2 and to suppose that E x221 y22 E x222 y22 = (1 - S1)(1 - S2) 1 - , which, together with E(ND y) = N - E x222 y22 still gives (7.24). The expressions for the conditional variances of x+1 | ND and x1+ | ND may need to be changed to incorporate a measure of dispersion. For example, if the individuals counted in x+1 (resp. x1+) displayed sensitivity to Test 1 (resp. 2) with a distribution of beta type with mean S1 (resp. S2), then formulas var (x1+ | ND) = 1 ND S1(1 - S1), var (x+1 | ND) = 2 ND S2(1 - S2) would be obtained for certain dispersion parameters 1, 2. If it were reasonable to assume that the dispersion parameters could be taken as the same, the original analysis would remain unchanged. If not, obvious adjustments could be made. Totals from Data with Constant Coefficient of Variation Here we consider a situation where the full data set is x = (xij, 1, 2, . . . , n; j = 1, 2) and it is the totals yi = xi1 + xi2 that are observed, so that y = (yi, i = 1, 2, . . . , n). The likelihood is not known but the xij are assumed to be independent with moments E(xij) = ij, var xij = 2 ij where ij = ij() and is the coefficient of variation, assumed to be constant. 
The aim is to estimate , taken as scalar for convenience, and for simplicity we assume that = 1 is known. 7.4. GENERALIZING THE E-M ALGORITHM: THE P-S METHOD 125 The class Hx consists of the linear functions of x and the usual quasi-score is (e.g., McCullagh and Nelder (1989)) Q(; x) = ij (xij - ij) -2 ij ij . Since we observe only the totals y we consider Hy, the class of linear functions of y. Then the corresponding quasi-score is Q (; y) = i (yi - (i1 + i2))(2 i1 + 2 i2)-2 (i1 + i2). (7.25) The linear predictor ^x of x given y is ^xij - ij = 2 ij 2 i1 + 2 i2 (yi - (i1 + i2)) and we see that Q (; y) = Q(; ^x). It should be noted that Q is not the expectation of Q, which is general is nonlinear in y. For example, suppose that xij has the exponential density f(xij; ) = -1 ij exp(-xij/ij), if 0 < xij < , 0, otherwise. Then it is easily checked that E(xij yi; ) = i yi exp(-yi/i) 1 - exp(-yi/i) , where -1 i = -1 i1 - -1 i2 . In this case Q = U is the likelihood score for the full data but U = E(U y; ) differs from Q , which is the projection of Q to be used when the exponential likelihood specification is doubtful or inappropriate. There is some evidence that there is little loss in efficiency in using Q instead of U even when the exponential assumption is actually correct. To evaluate the asymptotic relative efficiency we must compare E(U ) = E(U )2 (7.26) = i E(E(xi1 yi) - i1) -1 ij 2 + (E(xi2 yi) - i2) -1 ij 2 , 126 CHAPTER 7. PROJECTED QUASI-LIKELIHOOD and E(Q ) = E(Q )2 = i 2 i1 + 2 i2 -1 i1 + i2 2 . (7.27) The information measures (7.26) and (7.27) satisfy the inequality E(U )2 E(Q )2 but a concrete comparison requires more specific values for the ij. For example, in the particular case where i2 = i1 with fixed and taking = - 1 > 0 without loss of generality we find that E(Q )2 = (1 + )2 1 + 2 i i2 2 (7.28) while E(U )2 is given below. Indeed, temporarily dropping subscript i again, we find that f(y; 1, 2) = 1 + 2 e-y/2 (1 - e-y/2 ), y > 0, and, after some algebra, E - 2 2 log f(y; 1, 2) = 2 2 1 + (1 + ) 0 z2 e-(1+)z 1 - e-z dz = 2 2 1 + 2 (1 + ) r=1 1 (1 + r )3 = 2 2 1 + 2 1 + 2 3, 1 , where (3, 1 ) = r=1 (r + 1 )-3 is the generalized Riemann zeta-function. Then, E(E(U y))2 = 1 + 2 1 + 2 3, 1 i i2 2 (7.29) and the asymptotic relative efficiency of Q is ARE(Q )() = (1 + )2 1 + 2 1 1 + 2 ( - 1)-2 (3, ( - 1)-1) using (7.27) and (7.28). It is easily seen that ARE(Q )() approaches unity as 1 or . A plot of ARE(Q )() against (1 < < 50) has been obtained with the aid of the package MAPLE and appears in Heyde and Morton (1995). The ARE always exceeds 0.975. 7.5. EXERCISES 127 Time Series Smoothing with Missing Observations The E-M algorithm itself is mainly useful in situations where the y-likelihood is intractable but the x-likelihood is simple (so that the M-step is easily achieved). The algorithm outlined in this section is similarly useful in situations where quasi-score Q is simple but Q is intractable. Many such examples can be generated in the context of time series observed subject to noise, and with missing data as well, when Gaussian distributional assumptions are questionable or inappropriate. The example we discuss here comes from Shumway and Stoffer (1982) whose discussion has been succinctly summarized by Little and Rubin (1987, pp. 165- 168). 
It concerns data on expeditures for physician services and is treated using the model: xij = zi + ij, j = 1, 2, (7.30) zi = zi-1 + i, (7.31) where the ij are independent with E ij = 0, var ij = Bj, the i are independent with Ei = 0, var i = Q and is an inflation factor. The xij's are the observations but in practice only a subset yij have been observed yij = xij for i, j A, say. The object is to estimate the zi's, which are not assumed to be stationary. Now Shumway and Stoffer have assumed, without comment, that the ij and i are normally distributed and they have used the E-M algorithm to calculate the zi, the corresponding likelihood being intractable. However, if normality is not assumed it is still reasonable to use linear estimating equations for this problem. Then, the quasi-scores are identical to those based on the normal likelihood and the E-M algorithm results are preserved in our approach. The inferences remain appropriate for nonnormal data having unchanged first and second moments. 7.5 Exercises 1. (Mixture densities) Suppose that 0 1 and let X1, . . . , Xn be independent and identically distributed with mixture density h(x; , ) = f(x; ) + (1 - ) g(x; ). Here is to be estimated. If the functions h, f, g are known, so that the score function U = (U, U) can be used as a basis for the calculation, show that the optimal estimating equation for is U - [E(U U)/EU2 ] U = 0 (Small and McLeish (1989)). 128 CHAPTER 7. PROJECTED QUASI-LIKELIHOOD 2. (A form of Rao-Blackwellization) Let Q be a quasi-score estimating function. If K is an estimating function show that E(K | Q) is preferable to K in the usual sense that E(E(K | Q)) E(K) in the partial order of nonnegative definite matrices (Small and McLeish (1994, p. 85)). Chapter 8 Bypassing the Likelihood 8.1 Introduction Estimation based on the application of maximum likelihood methods can involve quite formidable difficulty in calculation of the joint distribution of the observations and sometimes even in differentiating the likelihood to obtain the corresponding estimating equations. Notable examples are the derivation of the restricted (or residual) (REML) estimating functions for dispersion parameters associated with linear models and the derivation of the maximum likelihood estimators of parameters in diffusion type models. In the former case both the derivation of the likelihood and its differentiation are less than straightforward while for the latter the Radon-Nikodym derivative calculations are a significant obstacle. Quasi-likelihood methods, however, allow such estimators (or estimating functions) to be obtained quite painlessly and under more general conditions. In this chapter we shall illustrate the power and simplicity of the approach through three quite different examples. 8.2 The REML Estimating Equations Suppose that the n × 1 vector y has the multivariate normal distribution MV N(X, V ()) with mean vector X and covariance matrix V (). For simplicity we take the rank of X as r, the dimension of the vector . Let A be any matrix with n rows and rank n - r satisfying A X = 0. 
Then, the distribution of A y is MV N(0, A V A) which may be singular and it can be shown, with some difficulty, that the likelihood function of based on A y has the form of a constant multiplied by n-r i=1 i 1/2 exp - 1 2 y V -1 Q y , (8.1) where Q = I - P V R(X), P V R(X) denoting the projector X (X V -1 X)X V -1 onto the subspace R(X), the range space of the matrix X that is the orthogonal projection onto the orthogonal complement of R(X) with respect to the inner product a, b = a V -1 b, while 1, . . . , n-r are the nonzero eigenvalues of V -1 Q. (See Rao (1973, Eq. (8a.4.11) p. 528) for the case where A V A has full rank.) The striking thing here is that the result (8.1) does not depend on A. That is, all rank n - r matrices that define error contrasts (i.e., for which A X = 0) generate the same likelihood function in . This follows from the 129 130 CHAPTER 8. BYPASSING THE LIKELIHOOD result A(A V A)A = V -1 Q (8.2) for all A and any g-inverse (A V A). The basic reference on this topic is Harville (1977), although detailed derivations are not included, the reader being referred to an unpublished technical report. For a more recent derivation see Verbyla (1990). The next step is to differentiate (8.1) with respect to . Again this is less than straightforward, involving vector differentiation of matrices and determinants, but there finally emerge the REML estimating equations tr V -1 Q V i = y V -1 Q V i V -1 Q y, (8.3) where tr denotes trace. We shall now show how the result (8.3) can be derived painlessly, and under more general conditions, by quasi-likelihood methods. We shall no longer suppose that y has the multivariate normal distribution but only that its mean vector is X, its covariance matrix is V (), that Ey2 i y2 j = Ey2 i Ey2 j and Ey3 i yj = 0 for i = j and that each yi has kurtosis 3. For fixed A, let z = A y and, in the expectation that it is quadratic functions of the data that should be used to estimate covariances, consider the class of quadratic form estimating functions H = {G : G = (G(Si), . . . , G(Sp)) , G(Si) = z Siz - E(z Siz), Si symmetric matrix i = 1, 2, . . . , p}, being of dimension p. Write W = A V A. Theorem 8.1 G = (G(S i ), . . . , G(S p)) H is a quasi-score estimating function in H if S i = W - W i W , i = 1, 2, . . . , p, (8.4) for any g-inverse W = (A V A). Furthermore, the G(S i ) do not depend on A and the corresponding estimating equation G(S i ) = 0 is Equation (8.3). Proof. Using standard covariance results for quadratic forms in multivariate normal variables (see, e.g., Karson (1982, p. 62)) that involve only the moment assumptions noted above, we have cov (G(Si), G(S j )) = 2 tr (W SiW S j ), (8.5) while Ez Siz = tr (W Si). Also, it is easily checked that the (i, j) element of E ˙G, (E ˙G)ij = - tr W j Si . 8.3. PARAMETERS IN DIFFUSION TYPE PROCESSES 131 Then, when S i is given by (8.4), we see that cov (G(Si), G(S j )) = - 2(E ˙G(S))ij, since tr BC = tr CB and A W W = A, W W A = A for any choice of generalized inverse (e.g., Lemma 2.2.6, p. 22 of Rao and Mitra (1971)). Consequently, the result of the theorem concerning the form of S i follows from Theorem 2.1. The REML estimating equations (8.5) then follow immediately using z = A y and the representation (8.2). The theorem is new even in the case where the mean of y is known to be zero (i.e., the case r = 0). The results could, in principle, be adjusted to deal with yi's having different moments from those of the corresponding multivariate normal variables. 
For a general discussion of variance function estimation see Davidian and Carroll (1987). The results in this section are from Heyde (1994b). For a discussion of consistency and asymptotic normality of REML estimators see Jiang (1996). 8.3 Estimating Parameters in Diffusion Type Processes The standard method of estimation for parameters in the drift coefficient of a diffusion process involves calculation of a likelihood ratio (Radon-Nikodym derivative) and thence the maximum likelihood estimator(s). This is less than straightforward for more complicated models and indeed it is often not available at all because of the nonexistence of the Radon-Nikodym derivative. The methods of quasi-likelihood , however, allow estimators to be obtained straightforwardly under very general conditions. They can deal, in particular, with the situation in which the Brownian motion in a diffusion is replaced by a general square-integrable martingale. The approach, which is based on selection of an optimal estimating function from within a specified class of such functions, involves assumptions on only the first two conditional moments of the underlying process. Nevertheless, the quasi-likelihood estimators will ordinarily be true maximum likelihood estimators in a context where the Radon-Nikodym derivative is available. Furthermore, they will generally be consistent, asymptotically normally distributed, and can be used to construct minimum size asymptotic confidence zones for the unknown parameters among estimators coming from the specified class. In this section we illustrate these ideas through a general discussion and application to the Cox-Ingersoll-Ross model for interest rates and to a modification of the Langevin model for dynamical systems. The results are from Heyde (1994a). The models which we shall consider in this section can all be written in the semimartingale form dXt = dAt() + dMt(), (8.6) 132 CHAPTER 8. BYPASSING THE LIKELIHOOD where the finite variation process {At} can be interpreted as the signal and the local martingale {Mt} can be interpreted as the noise, as has been dicussed in Section 2.6. Then, the quasi-score estimating function based on the family of local martingale estimating functions H = T 0 at() dMt(), {at} predictable is easily seen from Theorem 2.1 to be T 0 E d ˙Mt() Ft- (d M() t)dMt(), (8.7) where {Ft} is a filtration of past-history -fields, M() t is the quadratic characteristic and the - denotes the generalized inverse. Now (8.7) can be rewritten as T 0 E d ˙At() Ft- (d M() t)(dAt() - dXt) from which it is clear that the quasi-score estimating function is unaffected if the local martingale noise {Mt()} is replaced by another whose quadratic characteristic is the same. The precise distributional form of the noise does not need to be known. In the commonly met situation where Mt() = Wt with > 0 and Wt being standard Brownian motion, the results are robust to the extent that {Mt()} could be replaced by any local martingale {Zt()}, for example, one with independent increments, for which Z t = 2 t without changing the estimators. In the particular case of a diffusion process the components on the righthand side of the representation (8.6) can be written as dAt() = a(t, Xt, ) dt, dMt = b 1 2 (t, Xt) dWt, (8.8) where a and b are known vector and matrix functions, respectively, and {Wt} is standard Brownian motion. 
Then, an appropriate Radon-Nikodym derivative of the measure induced by the process {Xt, 0 t T} with parameter with respect to the corresponding measure for parameter 0 can be calculated and is given by exp T 0 C(t, Xt) dXt - T 0 D(t, Xt) dt , (8.9) where b(t, x) C(t, x) = a(t, x, ) - a(t, x, 0), D(t, x) = (a(t, x, 0)) C(t, x) + 1 2 (C(t, x)) b(t, x) C(t, x) (e.g., Basawa and Prakasa Rao (1980, p. 219)). 8.3. PARAMETERS IN DIFFUSION TYPE PROCESSES 133 From (8.9) it is easily checked that the likelihood score function (derivative of the logarithm of the Radon-Nikodym derivative with respect to ) is given by (8.7). This means that the quasi-likelihood estimator is the maximum likelihood estimator for the model (8.8). However, as indicated above, the quasi-likelihood estimator is available much more generally. The explanation for the quasi-score corresponding to the likelihood score for the model (8.8) is not hard to discern. A likelihood score is a martingale under modest regularity conditions and all square integrable martingales living on the same probability space as the Brownian motion in the noise term of the model (8.8) can be described as stochastic integrals with respect to the Brownian motion (see, e.g., Theorem 5.17 of Liptser and Shiryaev (1977)). The likelihood score will be one such martingale and will therefore be included in the relevant family H over which optimization takes place and it solves the optimization problem. As a concrete illustration of the methodology we shall discuss the stochastic differential equation dXt = ( - Xt) dt + Xt dWt, (8.10) where X0 > 0, > 0, > 0, > 0. This form was proposed by Cox, Ingersoll and Ross (1985) as a model for interest rates and it has been widely used in finance. In considering the model (8.10) we shall be concerned with the estimation of = (, ) . The parameter can be regarded as known whenever Brownian motion and continuous sampling are involved. Indeed, can be calculated with probability one on the basis of knowledge of a path of the process on any finite time interval. This follows from the definitions of the quadratic variation process and stochastic integrals with respect to Brownian motion (e.g., Rogers and Williams (1987, Chapter IV, Section 4)) from which one obtains that, writing t (n) i = min(T, 2-n i), lim n i=0 Xt (n) i+1 - Xt (n) i 2 = 2 T 0 Xt dt a.s. and lim n i=0 Xt (n) i t (n) i+1 - t (n) i = T 0 Xt dt a.s. For the model (8.10) the Radon-Nikodym derivative of the measure based on (, ) with respect to that based on (0, 0) is easily seen from (8.9) to be exp -2 T 0 X-1 t [( - Xt) - a0(0 - Xt)] dXt - 1 2 -2 T 0 X-1 t [2 ( - Xt)2 - a2 0(0 - Xt)2 ] dt 134 CHAPTER 8. BYPASSING THE LIKELIHOOD and differentiating the logarithm of this likelihood ratio with respect to = (, ) gives the likelihood score UT = -2 T 0 - Xt X - 1 2 t dWt, (8.11) which is also the quasi-score given by (8.7). If we modify the model to the form dXt = ( - Xt) dt + (, ) Xt dWt where (, ) reflects a possibly rate dependent noise, then the likelihood ratio does not exist in general. Indeed, when (1, ) = (2, ) the supports of the distributions of the two processes are disjoint. The quasi-likelihood methodology, however, is unaffected by this change. The quasi-score estimating function continues to be given by (8.11) and the asymptotic properties of the estimators are also unaffected. 
From (8.11) the maximum likelihood/quasi-likelihood estimators ^T , ^T are given by T 0 ^T - Xt X-1 t dXt - ^T ^T - Xt dt = 0, T 0 X-1 t dXt - ^T ^T - Xt dt = 0, and putting IT = T 0 X-1 t dXt, JT = T 0 X-1 t dt, KT = T 0 Xt dt, we find that ^T = (IT T - JT (XT - X0))/(JT KT - T2 ), ^T = (IT KT - T(XT - X0))/(IT T - JT (XT - X0)). For the model with 2 2 there is a strictly positive ergodic solution to (8.10) as T whose distribution has gamma density (2/2 , 2/2 ) (see, e.g., Kloeden and Platen (1992, p. 38)). Suppose X has this density; then EX = , EX-1 = 2/(2 - 2 ). (8.12) Using the ergodic theorem we obtain T-1 JT = T-1 T 0 X-1 t dt a.s. - EX-1 , (8.13) T-1 KT = T-1 T 0 Xt dt a.s. - EX, (8.14) 8.3. PARAMETERS IN DIFFUSION TYPE PROCESSES 135 and, since IT = T 0 X-1 t dXt = log X-1 0 XT + 1 2 2 JT using Ito's formula, T-1 IT a.s. - 1 2 2 EX-1 as T . These results readily give the strong consistency of the estimators ^T , ^T . Asymptotic normality of T 1 2 (^T - , ^T - ) can be obtained by applying Theorem 2.1, p. 405 of Basawa and Prakasa Rao (1980). All these results continue to hold under substantially weakened conditions on the noise component in the model. For example, they hold if {Wt} is replaced by a square integrable martingale with stationary independent increments {Zt} for which Z t t. The details involve straightforward applications of the martingale strong law and central limit theorem and are omitted. It should be remarked that there is a considerable recent literature concerning estimation of parameters in diffusion type models, partly motivated by burgeoning applications in mathematical finance. See, for example, Bibby and Srensen (1995), Pedersen (1995), Kessler and Srensen (1995), Kloeden, Platen, Schurz and Srensen (1996) and references therein. In most realistic situations the diffusion cannot be observed continuously, so discrete time approximations to stochastic integrals or a direct approach using the discrete time observations is required. The formulation via (8.6) and (8.7) has to be used with care for problems with multiple sources of variation. Suppose, for example, that the Langevin stochastic differential equation (e.g., Kloeden and Platen (1992, pp. 104­105)) is augmented with jumps coming from a Poisson process and becomes dXt = Xt- dt + dWt + dNt, (8.15) Nt being a Poisson process with intensity . Then, the process may be written in semimartingale form as Mt = Wt + Nt - t. (8.16) Using (8.15) and (8.16) in (8.7), the quasi-score estimating function based on noise {Mt} is T 0 (Xt- 1) dMt, leading to the estimating equations T 0 Xt- dXt = ^T T 0 X2 t dt + ^T T 0 Xt dt, XT = ^T T 0 X2 t dt + ^T T. 136 CHAPTER 8. BYPASSING THE LIKELIHOOD These, however, are the maximum likelihood estimating equations for the model dXt = (Xt + ) dt + dWt, i.e., a version of (8.15) in which Nt has been replaced by its compensator t. In this model the entire stochastic fluctuation is described by the Brownian process and this is only realistic if 1. This problem, first noted by Srensen (1990), can be circumvented and the true maximum likelihood estimators for , obtained if we treat the sources of variation separately. Details are provided in Chapter 2, Section 5. Models of the above kind are quite common and the general message is to identify relevant (local) martingales that focus on the individual sources of variation. It is then possible to obtain quasi-score estimating functions for each, and to combine them, provided the appropriate sample information is available. 
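Returning to the Cox-Ingersoll-Ross example, the closed-form maximum likelihood/quasi-likelihood estimators displayed above can be computed from a discretely sampled path by replacing I_T, J_T and K_T with left-point (Ito-type) sums. A minimal sketch, writing the two drift parameters as kappa and alpha:

```python
import numpy as np

def cir_ql_estimates(x, dt):
    """Quasi-likelihood estimates of the drift parameters (kappa, alpha) in the
    CIR model, from the closed forms of the text with
        I_T = int X_t^{-1} dX_t,  J_T = int X_t^{-1} dt,  K_T = int X_t dt
    approximated by left-point sums on a regular grid of mesh dt (x > 0)."""
    T = dt * (len(x) - 1)
    x0, dx = x[:-1], np.diff(x)
    I_T = np.sum(dx / x0)
    J_T = dt * np.sum(1.0 / x0)
    K_T = dt * np.sum(x0)
    span = x[-1] - x[0]                        # X_T - X_0
    kappa = (I_T * T - J_T * span) / (J_T * K_T - T ** 2)
    alpha = (I_T * K_T - T * span) / (I_T * T - J_T * span)
    return kappa, alpha
```

These are the exact closed forms obtained by solving the two estimating equations above; only the stochastic and ordinary integrals have been discretised.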
8.4 Estimation in Hidden Markov Random Fields Considerable recent interest has been shown in hidden Markov models or more generally, partially observed stochastic dynamical system models. The area embodies a wide class of problems in which a random process (or field) is not observed directly but instead is observed subject to noise through a second process (or field). The techniques that have been developed for this class of problem typically involve full distributional assumptions. Examples are the use of the E-M algorithm (e.g., Qian and Titterington (1990)) or the use of a measure transformation which changes all the signal and noise variables into independent and identically distributed random variables. The monograph Elliott et al. (1994) is devoted to this latter topic and includes a wide range of references to the area. In this section we shall feature a quite different technique, namely, that of optimal combination of estimating functions. This can be used to provide quick derivations of optimal estimating equations for problems of this type. A representative example taken from Heyde (1996) is given to illustrate the approach in the case of hidden Markov models. It is chosen for its clarity rather than its generality. Suppose that we observe continuous variables {yi} on a lattice of sites indexed by i. The dimension of the lattice is d 1 and the observation at site i depends on the value of its neighbors. In order to specify the dependence on neighbors we introduce a matrix W = (wij) for which wii = 0, while wij = 1 if i and j are neighbors and is zero otherwise. The matrix W is assumed to be symmetric and wi denotes the ith column of W . We shall write [i] to denote the neighbors of i. 8.4. ESTIMATION IN HIDDEN MARKOV RANDOM FIELDS 137 We shall consider the field {yi} that satisfies the spatial autoregression specification yi = wiy + i, (8.17) where E i y[i] = 0, var yi y[i] = 2 and is scalar. These processes are also known as conditional autoregressions (CAR). See Ripley (1988, p. 11) for a discussion. Now we are concerned with the context in which {yi} is not observed directly but is hidden in a second field {zi} that is observed on the same lattice. We suppose that the field {zi} satisfies the dynamics zi = k{i[i]} bkyk + i, (8.18) where the {i} are uncorrelated with zero mean and finite variance 2 and are uncorrelated with the { i}, while the {bk} are known. We now seek to estimate {yi}, assuming known and subsequently , on the basis of the observations {zi}. Optimal estimation in general is achieved by identifying the sources of variation and optimally combining estimating functions from each of these. Here we have two sources of noise, namely { i} and {i} to work with. As a prelude to obtaining the quasi-score estimating functions corresponding to (8.17) and (8.18), we rewrite both in matrix form as y = W y + (8.19) and z = By + , (8.20) respectively, where y = (yi), = ( i), z = (zi) and B = (bij), where bij = bj if j {i [i]}, 0 otherwise. Now, we find quasi-score estimating functions from the families {G = A , nonrandom matrix A} and {H = C, nonrandom matrix C} of estimating functions. Suppose that G = A , H = C are the required forms. Applying Theorem 2.1, we find that EG G = AE A and E ˙G = AE y (I - W )y = A(I - W ) so that (E ) A = (I - W ), while EHH = CE C = 2 CC 138 CHAPTER 8. BYPASSING THE LIKELIHOOD and E ˙H = CE y (z - By) = -CB so that C = -B /2 . 
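A small simulation helps to fix the ingredients of this hidden-field set-up. The sketch below builds the neighbour matrix W and observation matrix B for a one-dimensional lattice, draws the fields (8.19)-(8.20) (Gaussian noise is used purely for convenience, since only second moments enter the theory), and then applies the covariance identity and the reconstruction formula (8.22) that are derived in the displays just below; the labels rho, sigma2 and eta2 for the parameters are conventions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho, sigma2, eta2 = 60, 0.4, 1.0, 0.5

# Nearest-neighbour matrix W on a one-dimensional lattice (w_ii = 0,
# w_ij = 1 iff |i - j| = 1), and B = I, i.e. b_k = 1: direct but noisy observation.
W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
B = np.eye(n)

# Hidden CAR field (8.19), with E(yy') = sigma2 (I - rho W)^{-1} as derived
# below, and observed field (8.20); I - rho W is positive definite here.
cov_y = sigma2 * np.linalg.inv(np.eye(n) - rho * W)
y = rng.multivariate_normal(np.zeros(n), cov_y)
z = B @ y + np.sqrt(eta2) * rng.normal(size=n)

# Quasi-score reconstruction (8.22), derived just below:
# y_hat = [(I - rho W)/sigma2 + B'B/eta2]^{-1} B'z/eta2.
A = (np.eye(n) - rho * W) / sigma2 + B.T @ B / eta2
y_hat = np.linalg.solve(A, B.T @ z / eta2)
```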
Since we have taken and as orthogonal we then have the combined estimating function from sources (8.17) and (8.18) as the sum A + C = (I - W ) (E )-1 (I - W ) y - B (z - By)/2 (8.21) (see Section 6.3). Now the specification (8.17) gives Ey y = 2 I + W Ey y , i.e., (I - W ) Ey y = 2 I and E = 2 (I - W ) so that, assuming that (I - W ) is positive definite, (I - W ) (E )-1 (I - W ) = (I - W )/2 and (8.21) becomes ((I - W ) y/2 ) - (B (z - By)/2 ). The quasi-score estimating equation for estimation of y is then (I - W ) y/2 = B (z - By)/2 , i.e., y = (I - W )/2 + B B /2 -1 B z/2 . (8.22) The estimating equation (8.22) is of the same as the one derived by Elliott, Aggoun and Moore (1994, Section 9.3) using assumptions of Gaussianity and Girsanov transformation type arguments to change measures and obtain conditional probability densities. The difference in formulation is just that they have begun from a model in Gibbs field form. The quasi-likelihood approach has enabled the specific distributional assumptions and change of measure arguments to be avoided. Now we suppose that, rather than being known, is to be estimated. Without specific distributional assumptions ordinary likelihood based methods are not available. However, we can find a quasi-score estimating function from the family {G(S) = S - E( S ), S symmetric matrix} to be G(S ) = y W y - tr W (I - W )-1 (8.23) 8.5. EXERCISE 139 following the argument of Section 8.2 and under the supposition that each yi has kurtosis 3. The estimating function (8.23) is, in fact, the score-function under the assumption that y is normally distributed (see, e.g., Lindsay (1988, p. 226)). Estimation of y and can then, in principle, be achieved by simultaneous numerical solution (8.22) and (8.23). Other simpler, albeit less efficient, procedures can be also be found. For example, it is possible to replace (8.23) by the estimating function y W y - y W 2 y; (8.24) as has been mentioned in Section 6.1. A similar approach to that used can be employed to deal with a wide variety of models, including those for which the different sources of variability are not uncorrelated. In each case the combined quasi-score estimating function can be found via Theorem 2.1. For an example of optimal combination of correlated estimating functions see Chapter 6, Section 3. It is also worth remarking that the hidden Markov model problem is structurally similar to problems in statistics involving measurement errors where surrogate predictors are used. For a discussion of the use of quasi-likelihood based methods in this context see Carroll and Stefanski (1990). 8.5 Exercise Suppose that the parameter = ( , ) , where is of interest and is a nuisance component. For estimating functions G() H we partition into G = G G H = H H . Consider the case where the likelihood score U = (U, U) is available and write I = E(U U) where , are or . Show that the likelihood score based estimating function U - II-1 U for is a quasi-score estimating function in H if it belongs to H. (Hint. Use the results of Section 7.3.) Chapter 9 Hypothesis Testing 9.1 Introduction In this chapter we discuss hypothesis testing based on the quasi-score estimating function. This is done by developing analogues of standard procedures of classical statistics. The classical statistical setting for hypothesis testing involves a sequence of T, say, independent random variables whose distribution depends on a pdimensional parameter = (1, 2, . . . 
, p) belonging to a sample space , an open subset of p-dimensional Euclidean space p . A null hypothesis H0 usually involves specification of the i, i = 1, 2, . . . , p to be functions gi(1, 2, . . . , k) of k , or specifying restrictions Rj(i) = 0 for j = 1, 2, . . . , r, (r + k = p). Tests of H0 against the full model have typically involved one of the three test statistics: (a) Likelihood ratio statistic (Neyman-Pearson) T = 2 LT (^T ) - LT (~T ) ; (b) Efficient score statistic (Rao) T = ST (~T )(IT (~T ))-1 ST (~T ); (c) Wald test statistic T = T R (^T ) W (^T )(IT (^T ))-1 W(^T ) -1 R(^T ), which are ordinarily asymptotically equivalent. Here LT is the log likelihood, R is the vector of restrictions that define H0, W () is the matrix with elements Ri/j, ST () is the likelihood score function LT ()/, IT () is the Fisher information matrix, ^T is the MLE for the full model and ~T is the MLE under the restrictions imposed by the hypothesis H0. Note that the efficient score statistic depends only on the MLE for the restricted class of parameters under H0, while Wald's statistic depends only on the MLE over the whole parameter space. For further details see Rao (1973, Section 6e). In the case of hypothesis testing for stochastic processes there is now a reasonably well developed large sample theory based on the use of the likelihood score. See for example Basawa and Prakasa Rao (1980), particularly Chapter 141 142 CHAPTER 9. HYPOTHESIS TESTING 7 and Basawa and Scott (1983, Chapter 3). The use of what are termed generalized M-estimators is discussed in Basawa (1985) in extension of the efficient score and Wald test statistics and these ideas are further developed in Basawa (1991). This path is also followed in the present chapter. However, much of the discussion in the existing literature has involved testing relative to sequences of alternative hypotheses. We shall here avoid this formulation to focus on the special features of the quasi-likelihood setting. The discussion here is based on Thavaneswaran (1991) together with various extensions. 9.2 The Details Neither the efficient score statistic nor the Wald statistic involve the existence of a scalar objective function (the likelihood) from which an optimal estimating function is derived by differentiation. Each can therefore be generalized straightforwardly to an estimating function setting and this we shall do below. We shall not consider restricted settings, such as that of the conservative quasiscore (Li and McCullagh (1994)) in which a likelihood ratio test generalization is available. The asymptotic properties of such a likelihood ratio test will be the same as those of the efficient score and Wald test generalizations (see the exercise of Section 9.3). Suppose that Q() is a quasi-score estimating function, for , in standardized form, chosen from an appropriate family H of estimating functions. Here and below we drop the suffix T for convenience. As usual we take V = V () = E(Q()) = E(Q() Q ()) for the information matrix. In the following denotes the unrestricted quasi-likelihood estimator whereas ~ is the quasi-likelihood estimator calculated under hypothesis H0. Most testing problems for stochastic models concern linear hypotheses on the unknown parameter. We shall follow the ideas of Section 7.2 on constrained parameter estimation, and consider testing H0 : F = d against H1 : F = d where F is a p × q matrix not depending on the data or with full rank q p. 
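For the linear hypothesis H0: F theta = d just introduced, the efficient score and Wald quadratic forms of (b) and (c) can be sketched as below; the next section shows that the same forms carry over with the likelihood score and Fisher information replaced by the quasi-score Q and V = E(QQ'). The placement of the inverses follows the classical statistics, F is taken here as the matrix mapping theta to the q restriction values, and the function names are illustrative.

```python
import numpy as np

def score_statistic(score_restricted, info_restricted):
    """Efficient score (Rao) statistic S(theta~)' I(theta~)^{-1} S(theta~),
    both evaluated at the estimator computed under the restriction F theta = d."""
    return float(score_restricted @ np.linalg.solve(info_restricted, score_restricted))

def wald_statistic(theta_hat, info_hat, F, d):
    """Wald statistic (F theta^ - d)' [F I(theta^)^{-1} F']^{-1} (F theta^ - d),
    evaluated at the unrestricted estimator; approximately chi-squared with
    q = rank(F) degrees of freedom under H0."""
    r = F @ theta_hat - d
    return float(r @ np.linalg.solve(F @ np.linalg.solve(info_hat, F.T), r))
```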
Note that this framework includes the problem of testing a subvector of length q ( p), say 2, of = (1, 2) with H0 : 2 = 02 to be tested against H1 : 2 = 02. Just take the block matrix form F = O(p-q)×(p-q) O(p-q),q Oq,(p-q) Iq , where Or,s is the r × s matrix all of whose elements are zero and, of course, Iq is the q × q identity matrix. For the H0 : F = d setting and for the ergodic case (see Section 4.3) we have that the efficient score statistic analogue is = Q (~)(E(Q(~)))Q(~) (9.1) 9.2. THE DETAILS 143 and the Wald test statistic analogue is = (F - d) (F (E(Q( )))F )(F - d). (9.2) These test statistics are, of course, the classical ones in the case where the quasi-score estimating function is the likelihood score. We would envisage using the test statistics and in circumstances under which V - 1 2 () Q() d MV N(0, Ip) (9.3) and, using Taylor series expansion, V 1 2 () ( - ) d MV N(0, Ip). (9.4) Then, both and are approximately distributed as 2 q under H0. From the proof of Theorem 7.2 we have that Q(~) d P Q(), where P = F (F V -1 F )F V -1 , so that using (9.3), Q(~) MV N(0, P V P ), while E(Q(~)) P V P so that d (P Q) (P V P )P Q = V - 1 2 Q V 1 2 P (P V P )P V 1 2 (V - 1 2 Q), which is approximately distributed as 2 q in view of (9.3) and since V 1 2 P (P V P )P V 1 2 is idempotent with rank q. On the other hand, from (7.2) we have that ~ = - V -1 F (F V -1 F )(F - d) (9.5) and, using (9.4) and H0, F - d d MV N(0, F V -1 F ). Then, since E(Q( )) E(Q()) = V we have that is also approximately distributed as 2 q. 144 CHAPTER 9. HYPOTHESIS TESTING Under H1 we generally have that (9.3) and (9.4) still hold and, for large T, upon expanding Q() about to the first order, and assuming that ˙Q approximates -V (as in the discussion of the proof of Theorem 7.2), Q(~) -V (~ - ) = F (F V -1 F )(F - d) from (9.5). Thus, from (9.4), F d MV N(F , F V -1 F ) (9.6) and then Q(~) d MV N(F (F V -1 F )(F - d), F (F V -1 F )F ). (9.7) We also have E(Q( )) E(Q()) = V (9.8) and E(Q(~)) F (F V -1 F )F . (9.9) Then, from (9.7) and (9.9) in the case of and (9.6) and (9.8) in the case of we find that both and are approximately distributed as noncentral 2 q, with noncentrality parameter C(V ) = (F - d) (F V -1 F )(F - d). (9.10) The test based on or using the quasi-score Q is optimal in the sense of providing a maximum noncentrality parameter under H1 for estimating functions within the family H. If another estimating function were chosen from H, then, under appropriate regularity conditions, we would obtain (9.10) with V replaced by V 1 say, with V 1 V in the Loewner ordering. It is easily seen that C(V 1) C(V ). This justifies the choice of the quasi-score as a basis for hypothesis testing. These results can readily be extended to the non-ergodic case when {QT } is a martingale. In this setting we replace (9.1) by = QT (^) Q(^) T QT (^), (9.11) where Q() T is the quadratic characteristic (see Section 4.3) and replace (9.2) by = F - d F Q( ) T F F - d . (9.12) The role of V in the above discussion is now taken by the (random) quantity Q . 9.3. EXERCISE 145 9.3 Exercise Consider the setting of Section 9.2 in which H0 : F = d is to be tested against H1 : F = d. Suppose that Q() is a quasi-score estimating function chosen from some family H of estimating functions and that there is a scalar function q such that Q() = q/. The natural analogue of the likelihood ratio statistic in this setting is = 2(q( ) - q()). 
Using the first order approximations of this chapter show that d ( - ~) V ( - ~) and that this is approximately distributed as 2 q under H0 or as noncentral 2 q with noncentrality parameter (9.10) under H1. Chapter 10 Infinite Dimensional Problems 10.1 Introduction Many nonparametric estimation problems can be regarded as involving estimation of infinite dimensional parameters. This is the situation, for example, in the neuronal membrane potential model dV (t) = (- V (t) + (t)) dt + dM(t) discussed earlier in Section 2.6 with parameters held constant, if changes in (t) over the recording interval 0 t T are important and the function needs to be estimated. Both the situation where n replica trajectories of {V (t), 0 t T} are observed and n or one trajectory is observed and T are of interest but most of the established results deal with the former case. Various possible estimation procedures have been developed for use in such problems including kernel estimation and the method of sieves. Efforts have also been made to develop the quasi-likelihood methodology in a Banach space setting to deal directly with infinite dimensional problems (e.g., Thompson and Thavaneswaran (1990)) but the results are of limited scope and no asymptotic analysis has been provided. The topic is of sufficient intrinsic importance to be dignified with a chapter, albeit brief and more suggestive than definitive. Here the quasi-likelihood methodology described elsewhere in the book does not play a central role. We shall first sketch the approach via the method of sieves and then offer some heuristic discussion related to particular problems. 10.2 Sieves The method of sieves was first developed by Grenander (1981) as a natural extension of the finite dimensional theory. It is ordinarily applied in the context where there is replication of some basic process. We use as a sieve a suitably chosen complete orthonormal sequence {i(t)}, say, and approximate the unknown function (t) by (n) (t) = n k=1 i i(t), where n increases to infinity as the sample size increases. Then, the i are estimated, say by quasi-likelihood, to give an optimal estimate (n) (t) of (n) (t). The remaining task is to show that as n , (n) (t) is consistent for (t) and, hopefully, asymptotically normally distributed. 147 148 CHAPTER 10. INFINITE DIMENSIONAL PROBLEMS Considerable recent literature has been devoted to such studies. For example, Nguyen and Pham (1982) established consistency of the sieve estimator of the drift function in a linear diffusion model. McKeague (1986) dealt with a general semimartingale regression model and obtained a strong consistency result for certain sieve estimators. Sieve estimation problems for point processes, such as in the multiplicative intensity model, have been studied by Karr (1987), Leskow (1989) and Leskow and Rozanski (1989). Kallianpur and Selukar (1993) have provided a general discussion of sieve estimation where the focus is on maximum likelihood for the nested finite dimensional problems. Rate of convergence results for sieve estimators are discussed in Shen and Wong (1994). The method is a valuable one but it does suffer from the disadvantage of lack of exlicit expressions and the problem in practice of trade-offs between approximation error and estimation error. Much remains to be done and a general discussion, based on the use of quasi-likelihood for the nested finite dimensional problems, is awaited. 
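As a minimal illustration of the sieve idea (with the fitting step reduced to ordinary least squares rather than a full quasi-likelihood fit, purely to display the truncation), the sketch below approximates an unknown function by the first n elements of an orthonormal cosine basis of L^2[0,1] and estimates the coefficients from noisy observations; the basis choice and names are assumptions of the sketch.

```python
import numpy as np

def cosine_basis(t, n):
    """First n elements of an orthonormal basis of L^2[0,1]:
    phi_1(t) = 1 and phi_k(t) = sqrt(2) cos((k-1) pi t) for k >= 2."""
    cols = [np.ones_like(t)] + [np.sqrt(2.0) * np.cos(k * np.pi * t) for k in range(1, n)]
    return np.column_stack(cols)

def sieve_estimate(t, y, n):
    """Sieve estimate alpha^(n)(t) = sum_{k<=n} theta_k phi_k(t): the truncation
    level n grows with the sample size, and the coefficients theta_k are fitted
    here by least squares from observations y_i = alpha(t_i) + noise."""
    Phi = cosine_basis(t, n)
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return Phi @ theta, theta
```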
10.3 Semimartingale Models Here we shall consider semimartingale models of the form dXt = dAt + dMt, where {Xt} is the observation process, {Mt} is a square integrable martingale and the predictable bounded variation process {At} is of linear form dAt = t dBt in the unknown function t of interest, {Bt} being an observable process, possibly a covariate or a function of the observation process, such as dBt = Xt-dt. Observations on the process typically involve a history over an extended time interval 0 t T, say, or we have copies X1t, . . . , Xnt, say, of the process over a fixed time interval, which we can take as [0, 1]. In any case, we seek estimators that, in a suitable sense, are consistent as T or n . For models of the type considered here, the semimartingale representation immediately suggests a simple estimation procedure. If we have n histories, write Xt = (X1t, . . . , Xnt) , Mt = (M1t, . . . , Mnt) and Bt = (B1t, . . . , Bnt) . Then, dXt = t dBt + dMt, so that if 1 is the n × 1 vector, all of whose elements are unity, T 0 s ds = T 0 (1 dBs)-1 ds (1 dXs) - T 0 (1 dBs)-1 ds (1 dMs), and T = T 0 s ds (10.1) 10.3. SEMIMARTINGALE MODELS 149 is unbiasedly estimated by ^T = T 0 (1 dBs)-1 ds (1 dXs). (10.2) In a certain formal sense, t is estimated by (1 dBt)-1 (1 dXt), the derivative of a generally (pointwise) non-differentiable process. This approach is convenient but of course it does not readily lead to an estimator of t, but rather of its indefinite integral t. Since ^T - T = T 0 (1 dBs)-1 ds (1 dMs) (10.3) is a square integrable martingale, we obtain from the Burkholder, Davis, Gundy inequality (e.g., Theorem 4.2.1, p. 93 of Rogers and Williams (1987)) that for C a universal constant, E sup T [0,1] (^T - T )2 C E(^1 - 1)2 = C 0 (1 dBs)-1 ds (1 dMs) 1 = C 1 0 (1 dBs)-2 (ds)2 d 1 M s so we have consistency if 1 0 (1 dBs)-2 (ds)2 d 1 M s 0 as n . Various central limit results can also be formulated on the basis of (10.3). On the other hand, if we have a single history it is usually the case that T 0 (dBs)-1 ds dXs T 0 s ds a.s. - 1 (10.4) as T . Note that T 0 (dBs)-1 ds dXs T 0 s ds = 1 + T 0 (dBs)-1 ds dMs T 0 s ds , and for the martingale Nt = T 0 (dBs)-1 ds dMs, 150 CHAPTER 10. INFINITE DIMENSIONAL PROBLEMS we have d N t = (dBt)-2 (dt)2 d M t and, so long as N T a.s. as T , the martingale strong law (e.g., Theorem 12.5) ensures that NT / N T a.s. - 0, i.e. (10.4) holds if lim sup T T 0 (dBs)-2 (ds)2 d M s T 0 s ds < a.s. (10.5) This is easily checked in practice for simple models. For example, if dBs = ds and d M s = ds we require lim sup T T T 0 s ds < a.s. Sometimes minor adjustments are necessary, as for example in the case of the Aalen (1978) model for nonparametric estimation of the cumulative hazard function using censored lifetime data. In this setup Xt is a counting process representing the number of deaths up to time t, dBt = t Yt dt, where Yt is the number at risk at time t (possibly zero) and t is the hazard function of the lifetime distribution. Then, d M s = s Ys ds, and using I to denote the indicator function, T 0 I(Ys > 0) Y -1 s dXs T 0 I(Ys > 0) s ds = 1 + T 0 I(Ys > 0) Y -1 s dMs T 0 I(Ys > 0) s ds , and provided that its denominator goes a.s. to infinity. T 0 I(Ys > 0) Y -1 s dMs 0 I(Ys > 0) Y -1 s dMs T = T 0 I(Ys > 0) Y -1 s dMs T 0 I(Ys > 0) Y -1 s s ds a.s. - 0 using the above mentioned martingale strong law. The consistency result T 0 I(Ys > 0) Y -1 s dXs T 0 I(Ys > 0) s ds a.s. - 1 as T then follows since Ys 1 on I(Ys > 0). 
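For the Aalen set-up just described, the estimator of the cumulative hazard, namely the integral of I(Y_s > 0) Y_s^{-1} with respect to dX_s, reduces to a simple sum over the ordered event times. A minimal sketch for right-censored lifetimes with at most one event per subject and no tied times (both simplifying assumptions of the sketch):

```python
import numpy as np

def cumulative_hazard(event_times, censor_times):
    """Aalen-type estimator int_0^t I(Y_s > 0) Y_s^{-1} dX_s of the cumulative
    hazard: X_t counts observed deaths, Y_t is the number still at risk.
    Returns the ordered observation times and the estimator evaluated there."""
    times = np.concatenate([event_times, censor_times])
    is_event = np.concatenate([np.ones(len(event_times)), np.zeros(len(censor_times))])
    order = np.argsort(times)
    times, is_event = times[order], is_event[order]
    at_risk = len(times) - np.arange(len(times))   # Y_t just before each time
    return times, np.cumsum(is_event / at_risk)    # jumps of size 1/Y_t at deaths
```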
It is possible to use the quasi-likelihood theory of this book if, for example, t is assumed to be a step function with jumps at 0 = t0 < t1 < . . . < tk-1 < tk = T, namely, t = j, tj-1 t < tj, j = 1, 2, . . . , k. Then, postulating the model d ~Xt = k j=1 j I(tj-1 t < tj) d ~Bt + d ~Mt, say, the Hutton-Nelson quasi-score for estimation of = (1, . . . , k) is T 0 diag (I(t0 t < t1), . . . , I(tk-1 t < tk)) d ~Bt d ~M t d ~Mt 10.3. SEMIMARTINGALE MODELS 151 from which we see that ^j = tj tj-1 d ~Bt d ~M t d ~Xt tj tj-1 (d ~Bt)2 d ~M t , 1 j k, (10.6) are the corresponding quasi-likelihood estimators if ~M t does not involve the 's. If, on the other hand, d ~M t = t d ~Bt as for counting process models, then we find that ^j = ~Xtj - ~Xtj-1 ~Btj - ~Btj-1 , 1 j k, (10.7) for the quasi-likelihood estimators. One can envisage approximating the model dXt = t dBt + dMt, (10.8) observed over 0 t T, by a sequence of models such as d ~Xt = N j=1 j I j - 1 N t < j N d ~Bt + d ~Mt where N . This idea is formalized in the theory of the histogram sieve. An alternative approach would be to use a discrete model approximation to (10.8), such as provided by the Euler scheme, say ~Xj - ~X(j-1) = j ~Bj - ~B(j-1) + ~Mj - ~M(j-1) , (10.9) j = 1, 2, . . . , [T/], [x] denoting the integer part of x. Here the Hutton-Nelson quasi-score leads to the estimators ^j = ~Xj - ~X(j-1) ~Bj - ~B(j-1) , 1 j [T/], (10.10) regardless of the form of the quadratic characteristic of the martingale { ~Mt}. It should be noted that (10.7) and (10.10) accord with the interpretation mentioned above of the estimator of t as dXt/dBt. They are straightforward to deal with numerically. For asymptotic results we need, for example, replication as indicated above and it is again worth remarking on the model approximation error and the estimation error. It can ordinarily be expected that for Euler schemes the error in model approximation is O( 1 2 ) (see, e.g., Section 9.6 of Kloeden and Platen (1992) which readily extends beyond the diffusion context). On the other hand, if n replicates are being used, the estimation error is O(n 1 2 ) (central limit rate). Chapter 11 Miscellaneous Applications 11.1 Estimating the Mean of a Stationary Process A method of last resort, when more sophisticated methods are not tractable or not available, is the method of moments, which typically relies on an ergodic theorem and involves no significant structural assumptions. Nevertheless, in various circumstances, such as for estimating the mean of a stationary process, it turns out that the method of moments (which estimates EXi = by T-1 T i=1 Xi) produces an estimator that has the same asymptotic variance as the best linear unbiased estimator (BLUE) under broad conditions on the spectral density or covariance function. This is of considerable practical significance. The discussion in this section follows the papers Heyde (1988b), (1992b). Let {Xt, t = . . . , -1, 0, 1, . . .} be a stationary ergodic process with EXt = , which is to be estimated, and EX2 t < . Write (t - s) = cov (Xs - , Xt - ) = - ei(s-t) f() d for the covariance function, where f() = (2)-1 (0) + 2 j=1 (j) cos j , = 0, || , is the spectral density, the spectral function being assumed to be absolutely continuous. This is the case for all purely nondeterministic (i.e., contain no component which is exactly predictable) processes (e.g., Hannan (1970, Theorem 3 , p. 154)). Our concern here is to estimate on the basis of observations (X1, . . . 
, XT ) and we shall first restrict consideration to the class of estimating functions H1 = T i=1 ai,T (Xi - ), ai,T constants, T i=1 ai,T = 1 . Then, if GT = T i=1 ai,T (Xi - ), G T = T i=1 a i,T (Xi - ), we find that EGT = -1, EGT G T = T i=1 T j=1 a i,T aj,T (j - i) 153 154 CHAPTER 11. MISCELLANEOUS APPLICATIONS and, using Theorem 2.1, G T is a quasi-score estimating function within H, if T i=1 (j - i) a i,T = cT (constant), j = 1, 2, . . . , T. (11.1) In this case the quasi-score estimating function leads to a quasi-likelihood estimator that is just the best linear unbiased estimator (BLUE) MV possessing minimum asymptotic variance. To see this, note that EG2 T - EG2 T = E(GT - G T )2 + 2EG T (GT - G T ) = E(GT - G T )2 0. Of course the calculation of the BLUE involves a full knowledge of the covariance structure of the process so that it is ordinarily not feasible to use it in practice. To obtain an expression for the minimum variance cT , we write (T) for the T × T matrix ((i - j)) and let (T) and e(T) denote the T-vectors ( 1,T , . . . , T,T ) and (1, 1, . . . , 1) , respectively, the prime denoting transpose. Then, (11.1) gives (T) (T) = cT e(T), while (e(T)) (T) = 1, and hence cT = [(e(T)) ( (T))-1 e(T)]-1 . (11.2) The explicit expressions for (T) and cT are, however, of limited practical value as, even when the covariances (i) are known, they involve calculation of the inverse of the Toeplitz matrix (T), which will usually be of high order. On the other hand, the method of moments estimator M = T-1 T i=1 Xi uses equally weighted observations and requires no direct knowledge of the underlying distribution. We shall examine conditions under which var MV = [(e(T)) ( (T))-1 e(T)]-1 (11.3) and var M = T-2 (e(T)) (T) e(T) (11.4) have the same asymptotic behavior. That is, the estimator M is an asymptotic quasi-score estimating function. 11.1. ESTIMATING THE MEAN OF A STATIONARY PROCESS 155 Theorem 11.1 Suppose that the density f() of {Xt} is continuous and positive at = 0. Then T var M 2f(0) as T . (11.5) If, in addition, - (|p()|2 /f()) d < for some trigonometric polynomial p(), then T var MV 2f(0) as T , (11.6) and T 1 2 (MV - M ) p - 0 as T , (11.7) p denoting convergence in probability. Proof The result (11.5) is a well-known consequence of the representation var M = 1 T2 - sin(T/2) sin(/2) 2 f() d; see, for example, Ibragimov and Linnik (1971, p. 322). The result (11.6) follows from Theorem 3 of Adenstadt and Eisenberg (1974). To obtain (11.7) we note that T E(MV - M )2 = T E T i=1 ( i,T - T-1 ) Xi 2 = T T i=1 ( i,T - T-1 ) T j=1 ( j,T - T-1 ) (i - j) = T T i=1 ( i,T - T-1 ) cT - T-1 T j=1 (i - j) = - T i=1 ( i,T - T-1 ) T j=1 (i - j) = T(var M - var MV ) 0 as T using (11.5) and (11.6) and the result follows. This completes the proof. It should be noted from (11.7) that if a central limit result holds for T 1 2 (M -), then the corresponding one also holds for T 1 2 (MV - ). Various conditions of asymptotic independence lead to the result T 1 2 (M - ) = T- 1 2 T i=1 (Xi - ) d - N(0, 2 f(0)), 156 CHAPTER 11. MISCELLANEOUS APPLICATIONS namely, convergence in distribution to the normal law with zero mean and variance 2 f(0). For example, this holds if there is a -field M0 such that X0 is M0-measurable and k=0 E|E(Xk | M0) - | < (Hall and Heyde (1980, Theorem 5.4, p. 136)). This is a very general result from which many special cases, such as for mixing sequences, follow. For the results of Theorem 11.1 to be useful it is essential that f(0) > 0. 
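The comparison of (11.3) and (11.4) is easy to carry out numerically for a given covariance function by forming the Toeplitz matrix directly. The sketch below does so for an exponentially decaying covariance, a short-range dependent case with f(0) > 0, for which both normalised variances converge to the common limit in Theorem 11.1; the specific covariance used is an assumption of the sketch.

```python
import numpy as np
from scipy.linalg import toeplitz

def blue_and_mean_variances(gamma, T):
    """Exact variances (11.3) and (11.4) of the BLUE and of the sample mean for
    a stationary process with covariance function gamma(k), computed from the
    T x T Toeplitz covariance matrix Gamma^(T)."""
    G = toeplitz(gamma(np.arange(T)))
    e = np.ones(T)
    var_blue = 1.0 / (e @ np.linalg.solve(G, e))   # [e' Gamma^{-1} e]^{-1}
    var_mean = (e @ G @ e) / T ** 2                # T^{-2} e' Gamma e
    return var_blue, var_mean

# gamma(k) = 0.8^|k|: here 2*pi*f(0) = 1 + 2*(0.8/0.2) = 9, and both T * variance
# sequences approach this common limit as T grows, as Theorem 11.1 asserts.
for T in (50, 200, 800):
    vb, vm = blue_and_mean_variances(lambda k: 0.8 ** k, T)
    print(T, T * vb, T * vm)
```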
An example where this may not be the case is furnished by the moving average model Xt = + t - r t-1, |r| 1, where the i are stationary and orthogonal with mean 0 and variance 2 . Here f() = 2 (1 + r2 - 2r cos )/2, so that f(0) > 0 requires r < 1. If r = 1 it turns out that (Grenander and Szeg¨o (1958, p. 211)), var M 22 T-2 , var MV 122 T-3 as T , so that the method of moments estimator is far from efficient in this case. In fact, if f(0) = 0, it holds rather generally that var MV = o(var M ) as T (Adenstadt (1974), Vitale (1973)). Now Theorem 11.1 can be used to show that MV and M have the same asymptotic variance if the spectral density f() is everywhere positive and continuous. Such information is ordinarily not directly available but a useful sufficient condition in terms of covariances is given in the following theorem. Theorem 11.2 If the covariance (T) decreases monotonically to zero as T , then {Xt} has a spectral function that is absolutely continuous. The spectral density f() is nonnegative and continuous, except possibly at zero, being continuous at zero when j=1 (j) < . If {(n)} is convex and (0) + (2) > 2(1), then the spectral density is strictly positive. Proof The first two parts of the theorem are given in Theorem 3 of Daley (1969). For the last part it should be noted that when {(n)} is convex, namely, 2 (n) = (n) + (n + 2) - 2(n + 1) 0, n 0, the spectral density f() can be expressed in the form f() = (2)-1 n=0 (n + 1)2 (n) Kn(), || , 11.1. ESTIMATING THE MEAN OF A STATIONARY PROCESS 157 where Kn() = 2 n + 1 sin((n + 1)/2) 2 sin(/2) 2 0 is the Fej´er kernel (Zygmund (1959, p. 183)). Then, since K0() = 1 2 and 2 (n) Kn() 0 for n 1, f() > 0 if 2 (0) > 0 as required. Corollary 11.1 If the covariance (T) decreases monotonically to zero as T , j=1 (j) < and {(n)} is convex with (0) + (2) > 2(1), then var MV var M T-1 (0) + 2 j=1 (j) as T . This result follows immediately from Theorem 11.1 and 11.2. The particular case of the result of Corollary 11.1 for which the covariance function (n) is of the form (n) = K i=1 ai e-ri|n| (11.8) where ai > 0, i = 1, 2, . . . , K, 0 < r1 < r2 < . . . , rK, has been obtained by Halfin (1982). Covariance functions of the type (11.8) are common for Markovian queueing models with limited queue space and various applications are given in Halfin's paper. Corollary 11.1, however, is very much more widely applicable and many examples can be noted from the review article of Reynolds (1975) on the covariance structure of queues and related processes. As one example, we mention the queue length process in an M/G/ queue (Reynolds (1975, p. 386)) where (n) = (n)/(0) = n G(x) dx with -1 = ES, the mean service time and G(x) = P(S x). Then, it is easily checked that n=0 (n) < is equivalent to ES2 < , while convexity of {(n)} is automatically satisfied and (0) + (2) > 2(1) requires 1 0 G(x) dx > 2 1 G(x) dx. Another example involves the waiting time {Wn} in an M/G/1 queue where (Reynolds (1975, p. 401)) (n) = 1 - n(1 - ) - n r=1 Cr 2 2 , n 1, 158 CHAPTER 11. MISCELLANEOUS APPLICATIONS for certain constants Cr , with = EWn, 2 = var Wn, the traffic intensity and the mean arrival rate. Here it is again easily checked that {(n)} is monotone and convex with (0) + (2) > 2(1) while n=0 (n) < if and only if ES4 < . Now if the spectral density has zeros, Theorem 11.1 allows for their removal by |p()|2 . This can be done, for example, for the queue length process in an M/D/ queue where the constant service time T = 2. 
For the M/D/ queue with constant service time S, (j) = (j)/(0) has the particularly simple form (j) = 1 - |j|/S, |j| S, 0, otherwise (Reynolds (1975, p. 387)) and (j) = T2 (j) where denotes the mean arrival rate. When S = 2, the spectral density is given by 2f() = 4(1 + cos ) and we find that T var MV 8, T var M 8 (11.9) as T despite the fact that f(-) = f() = 0. Now a crucial property that underlies the parity of (11.3) and (11.4) is that of short-range dependence. It is said that short-range or long-range dependence holds according as j=1 (j) converges or diverges or, essentially equivalently, the spectral density f() converges or diverges at = 0. For long-range dependence, T var XT ordinarily diverges as T in contrast to the short-range dependence case where convergence usually holds. Indeed, rather typical of the behavior in the long-range dependence case is the convergence in distribution of T+1/2 ( XT - ) to a proper limit law as T where -1 2 < < 0. The limit law is not Gaussian in general, save when the stationary process {Xt} is itself Gaussian. Little is known in general about the asymptotic behavior of MV . This should be contrasted with the case of short-range dependence where T1/2 ( XT - ) converges to the normal law N(0, 2 f(0)) under very broad conditions and then a similar result holds for T1/2 (MV - ) as readily follows from (11.7). These results are vitally important as they provide the basis for asymptotic confidence statements about . Efficiency comparisons of X and MV for the long-range dependence case appear in Samarov and Taqqu (1988), while Beran (1989) has provided confidence intervals (under Gaussian assumptions) that take the estimation of the parameter into account. The relevance of all this, however, rests critically on the assumption of Gaussianity because efficiency calculations based on variances may be of little use in assessing sizes of confidence zones save when 11.2. ESTIMATION FOR A HETEROSCEDASTIC REGRESSION 159 Gaussian limits obtain. Gaussian limits would not be typical for queueing models under long-range dependence. The parity of (11.5) and (11.6) for short-range dependence also breaks down in the case where the spectral density f() has a zero at = 0. Again efficiency comparisons can sometimes be made, for example in the moving-average model considered earlier, but the problem remains that asymptotic normality does not hold for the estimators in general. 11.2 Estimation for a Heteroscedastic Regression The model we consider in this section is the general linear regression model y = X + u, (11.10) where y is an n × 1 vector of observations, X is an n × k matrix of known constants of rank k < n, is a k × 1 vector of unknown parameters and u is an n×1 vector of independent residuals with zero mean and covariance matrix = diag (g1(), . . . , gn()), where gi() = 2 (Xi)2(1-) , Xt = (Xt1, . . . , Xtk), (t = 1, . . . , n), and X = (X1 . . . . . . Xn) . Variances proportional to a power of expectations have been widely observed, for example in cross-sectional data on microeconomic units such as firms or households. The object is the efficient estimation of the (k + 2) × 1 vector = ( 2 ) . The results here are from Heyde and Lin (1992). The obvious martingales to use for the model (11.10) are {Ut} and {Vt} given by Ut = t s=1 us, Vt = t s=1 (u2 s - gs()). Then, the quasi-score estimating function based on {Ut} is n t=1 (Xt 0 0) g-1 t () ut = u -1 X . . . 0 0 , (11.11) which focuses on , provides no information on 2 and involves as a nuisance parameter. 
On the other hand, the quasi-score estimating function based on {Vt} is n t=1 (˙gt()) (var u2 t )-1 (u2 t - gt()) = vec vec F -1 ( - u u ), (11.12) 160 CHAPTER 11. MISCELLANEOUS APPLICATIONS where F = diag (var u2 1, . . . , var u2 n), which allows estimation of all components of provided F is a known function of . However, efficiency in the estimation is enhanced if the martingales {Ut} and {Vt} are used in combination and, using results of Chapter 6 (e.g., (6.5)), the combined quasi-score estimating function is n t=1 (Xt 0 0) [ut + Rt (u2 t - gt())] gt()(1 - RtSt) + n t=1 (˙gt()) [St ut + (u2 t - gt())] (var u2 t )(1 - RtSt) , (11.13) where Rt = Eu3 t / var u2 t , St = Eu3 t /gt(). This simplifies considerably when Eu3 t = 0 for each t, for then the martingales {Ut} and {Vt} are orthogonal and (11.13) is a sum of the quasi-score estimating functions (11.11) and (11.12). In this case we may separate the corresponding estimating equation into the two components X u + vec vec F -1 ( - u u ) = 0, (11.14) vec vec F -1 ( - u u ) = 0, where = (, 2 ) . If the ut are normally distributed, F = 2 and the quasi-likelihood estimator for coincides with the maximum likelihood estimator. The equation (11.14) then corresponds to the true score equations (4.2), (4.3) of Anh (1988). The martingale information (see Section 6.3) contained in (11.11) is IQS(U) = X -1 X Ok×2 O2×k O2×2 , which clearly contributes only to the estimation of while that in (11.12) is IQS(V ) = n t=1 (var u2 t )-1 (˙gt()) (˙gt()) = A F -1 A, where A = (gi()/j), and the combined information in (11.13) is IQS(U,V ) = X -1 (I - R S)-1 X . . . Ok×2 O2×k . . . O2×2 + A F -1 (I - R S)-1 A + X -1 (I - R S)-1 RA O2×(k+2) (11.15) 11.2. ESTIMATION FOR A HETEROSCEDASTIC REGRESSION 161 + A R(I - RS)-1 -1 X . . . O(k+2)×2 , where R = diag (R1, . . . , Rn), S = diag (S1, . . . , Sn), which of course is IQS(U)+ IQS(V ) when Eu3 t = 0 for each t. In contrast to the above estimating functions, Anh (1988) used nonlinear least squares based on minimization of n t=1 (e2 t - Ee2 t )2 , (11.16) where the et are the (observable) elements of the vector e = (I - X(X X)-1 X ) u = P u = P y, where P = I - X(X X)-1 X . Note that (11.16) amounts to using the estimating function n t=1 ˙ft()(e2 t - ft()), where ft() = Ee2 t . Anh showed that this leads to a strongly consistent and asymptotically normal estimator ^A, say, satisfying ^A - d MV N(0, (A A)-1 A F A(A A)-1 ) for large n under his Conditions 1­3. However, under the same conditions, minor modifications of the same method of proof can be used to establish that ^QS(V ) - d MV N(0, (A F -1 A)-1 ) and ^QS(U,V ) - d MV N(0, I-1 QS(U,V )), where ^QS(V ) and ^QS(U,V ) are, respectively, the quasi-likelihood estimators based on (11.12) and (11.13). By construction we have that I-1 QS(V ) - I-1 QS(U,V ) in nonnegative definite, while (A A)-1 A F A(A A)-1 - (A F -1 A)-1 is also nonnegative definite. This last result holds because, for a covariance matrix F and n×p matrix A, the Gauss-Markov Theorem gives nonnegativity of BF B - (A F -1 A)-1 for every p × n matrix B satisfying BA = Ip (e.g., Heyde (1989)), and the result holds in particular for B = (A A)-1 A . Thus, from the point of view of asymptotic variance, the order of preference is ^QS(U,V ), ^QS(V ), ^A. 162 CHAPTER 11. MISCELLANEOUS APPLICATIONS 11.3 Estimating the Infection Rate in an Epidemic The discussion in this section is based on the results of Watson and Yip (1992). 
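Referring back to the heteroscedastic regression model (11.10), the estimating equations separate as in (11.14) when the third moments of the residuals vanish, and under the further working assumption var(u_t^2) = 2 g_t^2 (the normal-like case mentioned above) the components can be evaluated as in the sketch below: a weighted least squares step for beta, and quantities proportional to the two variance-parameter quasi-score components. An estimate is then obtained by cycling between the beta step and a root search over the variance parameters; setting the first variance component to zero gives sigma^2 in closed form for fixed lambda, so only a one-dimensional search in lambda remains. All names are illustrative.

```python
import numpy as np

def beta_step(y, X, g):
    """Quasi-score (11.11) for beta with the variances g_t treated as known:
    solve X' Sigma^{-1} (y - X beta) = 0, i.e. weighted least squares."""
    Xw = X.T / g                                   # X' Sigma^{-1}
    return np.linalg.solve(Xw @ X, Xw @ y)

def variance_score(y, X, beta, sigma2, lam):
    """Quantities proportional to the V-based quasi-score components (11.12) for
    (sigma^2, lambda), under var(u_t^2) = 2 g_t^2 with
    g_t = sigma^2 (x_t' beta)^{2(1 - lambda)} and x_t' beta > 0:
        sum_t (u_t^2/g_t - 1)   and   sum_t log(x_t' beta) (u_t^2/g_t - 1)."""
    m = X @ beta
    g = sigma2 * m ** (2.0 * (1.0 - lam))
    r = (y - m) ** 2 / g - 1.0
    return np.sum(r), np.sum(np.log(m) * r)
```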
We shall begin by considering the simple stochastic epidemic model for a closed population of size N and where the infection rate is . Let S(t) denote the number of susceptibles and I(t) the number of infectives as time t, so that S(t) + I(t) = N, and suppose that I(0) = a. The simple epidemic process (e.g., Bailey (1975, p. 39)) is a Markov process for which, when I(t) infectives are present at time t, the chance of an infective contact with a specified individual in the time interval (t, t + t) is I(t) t + o(t), where is the infection rate, which is taken as constant. Then, since there are N - N(t) susceptibles present at time t, the probability of an infection in time (t, t + t) is I(t)(N - I(t)) t + o(t). It is assumed that the probability of more than one infection in (t, t + t) is in o(t). Then, if Ft denotes the -field of history up to time t, we have in an obvious notation, E d I(t) Ft- = I(t)(N - I(t)) dt and the basic zero mean martingale defined on the infectives process {I(t)} is M(t) = I(t) - I(0) - t 0 E d I(s) Fs= I(t) - I(0) - t 0 I(s)(N - I(s)) ds. (11.17) Then, the Hutton-Nelson family of estimating based on the martingale (11.17) is M = t 0 (s) dI(s) - t 0 (s) I(s)(N - I(s)) ds, (s) predictable from which the quasi-score estimating function for estimation of is obtained by choosing (s) 1 for the predictable weight function. This follows (e.g., using results of Section 2.5) since it is easily seen that E d ˙M(t) Ft- = -I(t)(N - I(t)) dt, d M t = E (dM(t))2 Ft- = E dM(t) Ft= I(t) (N - I(t)) dt. The quasi-likelihood estimator of based on complete observation of the process {I(t)} over [0, T] is then ^T = (I(T) - a) T 0 I(s)(N - I(s)) ds (11.18) 11.3. ESTIMATING THE INFECTION RATE IN AN EPIDEMIC 163 and it should be noted that this coincides with the maximum likelihood estimator. It is not difficult to check that ^T is a strongly consistent estimator of and (I(t) - a)- 1 2 M(T) d - N(0, 1) as T so that (I(t) - a) 1 2 ^-1 T (^T - ) d - N(0, 1) from which confidence intervals for could be constructed. In practice, one would not expect complete observation of the process. The available data might, for example, be of the form {(tk, ik), k = 0, 1, . . . , m}, where m 1, 0 t0 < t1 < . . . < Tm T are fixed time points and ik is the observed number of infectives at time tk. Then, a reasonable approximation to the estimator (11.18) is provided by (I(tm)-a) 1 2 (I(t0)(N - I(t0)) + I(tm)(N - I(tm))) + m-1 r=1 I(tr)(N - I(tr)) which is obtained by using the simple trapezoidal rule to approximate the stochastic integral. This estimator is due to Choi and Severo (1988). A rather more realistic model is provided by the general stochastic epidemic model that allows for removals from the population (e.g., Bailey (1975, pp. 88- 89)). Here the probability of one infection in the time interval (t, t + t) is I(t) S(t) t + o(t) and of one removal is, say, I(t) (t) + o(t), while the probability of more than one change is o(t). Suppose S(0) = n and I(0) = a. Since the removals come only from the infectives, we have that E dS(t) Ft- = - I(t) S(t) dt and the basic zero-mean martingale defined on the process {S(t)} is N(t) = n - S(t) + t 0 E dS(s) Fs= n - S(t) - t 0 I(s) S(s) ds. Estimation of can then be achieved along similar lines to those described above for the case of the simple epidemic. The quasi-score estimating function for the Hutton-Nelson family based on the martingale {N(t)} is n - S(T) - T 0 I(s) S(s) ds, so that the quasi-likelihood estimator is ^T = (n - S(T)) T 0 I(s) S(s) ds. 
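For the simple epidemic, the Choi and Severo (1988) approximation above amounts to applying the trapezoidal rule to the denominator of (11.18). A minimal sketch from data (t_k, i_k), with the function name illustrative:

```python
import numpy as np

def infection_rate_estimate(t, i, N):
    """Approximate quasi-likelihood/ML estimator (11.18) of the infection rate:
    (I(T) - I(0)) / int_0^T I(s)(N - I(s)) ds, with the integral replaced by the
    trapezoidal rule over the observation times t_0 < ... < t_m."""
    t, i = np.asarray(t, dtype=float), np.asarray(i, dtype=float)
    f = i * (N - i)
    integral = 0.5 * np.sum((f[1:] + f[:-1]) * np.diff(t))
    return (i[-1] - i[0]) / integral
```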
(11.19) 164 CHAPTER 11. MISCELLANEOUS APPLICATIONS Again ^T is strongly consistent for and asymptotically normally distributed, this time with (n - S(T)) 1 2 ^-1 T (^T - ) d - N(0, 1) as T . Similar ideas can be applied for the estimation of the removal rate . If R(t) denotes the number of removals up to time t, we have that E dR(t) Ft- = I(t) dt and the basic zero-mean martingale defined on the process {R(t)} is K(t) = R(t) - t 0 E dR(s) Fs= R(t) - s 0 I(s) ds. We have E d ˙K(t) Ft- = I(t) dt, the dot now referring to differentiation with respect to , and d K t = E (dR(t))2 Ft- = I(t) dt, so that, as in the previous cases, the basic zero-mean martingale is itself the quasi-score for the Hutton-Nelson family. The quasi-likelihood estimator of is then ^T = R(T) T 0 I(s) ds. (11.20) Again we would not expect complete observation of the epidemic process. The available information might be of the form {(tk, sk, ik), k = 0, 1, . . . , m}, which provides progressive information on both the numbers of susceptibles and infectives. This would allow for approximation to the quasi-likelihood estimator ^T given in (11.19), but the removal rate would not be able to be estimated via (11.20) without information on the number of removals. 11.4 Estimating Population Size from Multi- ple Recapture Experiments Consider a population in which there are animals, where is unknown. A multiple recapture experiment is conducted in which animals are marked at the time of their first capture. At each subsequent capture the mark is correctly 11.4. ESTIMATING POPULATION SIZE 165 recorded. Apart from time allowed for marking or noting of marks, each captured animal is released immediately. The aim is to make inferences about from the set of observed captures of marked and unmarked animals. The results in this section follow Becker and Heyde (1990) and are derived using a continuous-time formulation. Analogous results also hold for multiple recapture experiments in discrete time. We label the animals by 1, 2, . . . , and let Ni(t) denote the number of times animal i has been caught in [0, t]. Write Nt for (N1(t), . . . , N(t)) . Each {Ni(t); t 0} is a continuous-time counting process with right-continuous sample paths and common intensity function defined by t = lim h0 1 h Pr Ni(t + h) = Ni(t) + 1 Ft , where Ft is the -field generated by {Nx : 0 x t}. The dependence of on t can reflect the dependence of the animal's behavior on time as well as variations in the intensity of trapping. Consider observations from a multiple capture experiment over [0, ]. The total number of captures by time t is given by Nt = i=1 Ni(t). Write Mt for the number of animals that are marked by time t. Then Mt is the number of captures of unmarked animals, while Kt = Nt - Mt gives the number of captures of marked animals by time t. Maximum likelihood estimation of based on observation of {Mt, Nt, 0 t } is complicated by the fact that the intensity function x is generally unknown. We find that the log-likelihood function is L = 0 log(x) dNx - 0 x dx + 0 log( - Mx-) dMx = 0 log( x) dNx - 0 x dx - N log + 0 log( - Mx-) dMx, where x = x is an arbitrary positive value if x is arbitrary positive. The likelihood function is maximized with respect to by the solution of - L = -1 N - 0 dMx - Mx= 0, (11.21) irrespective of the form of , which provides the maximum likelihood estimator for . See Samuel (1969), and references quoted by her, for more detailed consideration of related maximum likelihood estimation. 
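Reading the likelihood equation (11.21) in event form, each capture of an unmarked animal contributes a term 1/(nu - M_{x-}), so the equation reduces to N/nu = sum over j = 0, ..., M-1 of 1/(nu - j), where N is the total number of captures and M the number of animals marked by the end of the experiment. A sketch of the one-dimensional root search this calls for (the upper bracket is an arbitrary assumption of the sketch):

```python
import numpy as np
from scipy.optimize import brentq

def mle_population_size(total_captures, marked):
    """Solve the likelihood equation (11.21) for the population size nu:
        N / nu = sum_{j=0}^{M-1} 1 / (nu - j),
    which requires at least one recapture (N > M).  The difference of the two
    sides tends to -infinity as nu decreases to M and is positive for large nu,
    so a root exists in (M, infinity)."""
    N, M = total_captures, marked
    def eq(nu):
        return N / nu - np.sum(1.0 / (nu - np.arange(M)))
    return brentq(eq, M + 1e-6, 1e7)
```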
Estimation of via equation (11.21) requires the use of numerical computation because of the term - Mx- in the denominator. However, there are alternative estimators for that have explicit expressions, are easy to compute, and have high asymptotic efficiency relative to the benchmark provided by (11.21). To facilitate avoidance of the computationally troublesome (-Mx-) in the denominator of (11.21) it is sensible to work with the martingale Ht defined by dHt = Mt- dMt - ( - Mt-) dKt, (11.22) 166 CHAPTER 11. MISCELLANEOUS APPLICATIONS which is suggested by the right-hand alternative representation for the martin- gale N - 0 dMx - Mx= K - 0 Mx- dMx - Mx, (11.23) and to seek effective estimating functions from within the class of martingale estimating functions 0 fx dHx, fx predictable . (11.24) The quasi-score estimating function from this class is 0 (d Hx)(d H x)-1 dHx, (11.25) where d Hx = E d(dHx/d) Fx- = -E dKx Fx- = -Mx- dx and H t is the quadratic characteristic of the martingale Ht, so that d H x = E (dHx)2 Fx- = E M2 x- dMx + ( - Mx-)2 dKx Fx= Mx-( - Mx-) dx, and (11.25) corresponds to (11.23). Thus the quasi-likelihood estimator is the maximum likelihood estimator (MLE), ^ say. Now the martingale information in (11.23) (see Chapter 6), whose reciprocal is essentially an asymptotic variance, is IML = -1 0 Mx dx - Mx , (11.26) and the central limit result Theorem 12.6 or, for example, those of Aalen (1977) and Rebolledo (1980) can be used to show that I 1/2 ML(^ - ) tends in distribution to the standard normal law as . Maximum likelihood provides a benchmark but other estimating functions from the family (11.24) of the type 0 gx dHx, (11.27) where gx is of the form g(Nx-), produce simple and easily computable estima- tors ^g = 0 gx Mx- dNx 0 gx dKx, (11.28) 11.4. ESTIMATING POPULATION SIZE 167 whose performance relative to the benchmark is well worth considering. We shall examine a number of choices for gx and, in particular, determine whether gx 1 is a good choice. The martingale information in the estimating function (11.27) is Ig = 0 gx Mx dx 2 0 g2 x Mx( - Mx) dx (11.29) and, subject to suitable restrictions on g, the above cited central limit results can be used to show that I1/2 g (^g - ) tends in distribution to a standard normal law as . To investigate the relative behavior of the random quantities (11.26) and (11.29) as increases, information on the distributions of Mx and Nx, x > 0, is needed and one can show that the probability generating function of (Mx, Nx) is E(yMx zNx ) = e-x (1 - y + y exx ) . Thus, Mx is distributed as Binomial (, 1 - e-x ) and Nx as Poisson ( x). In particular, -1 Mx a.s. - 1 - e-x and -1 Nx a.s. - x as . Then, if the function g is well behaved, and, in particular, there exists a sequence of constants {c} such that c gx p - gd(x), x [0, ], (11.30) as , where gd(x) is a deterministic function, then it follows that IML a.s. - 0 (ex - 1) dx = e - 1 - and Ig p - 0 gd(x)(1 - e-x ) dx 2 0 g2 d(x) e-x (1 - e-x ) dx. The asymptotic relative efficiency of ^g to the benchmark of the MLE is the ratio ARE(^g) = 0 gd(x)(1 - e-x ) dx 2 (e- - 1 - ) 0 g2 d(x) e-x (1 - e-x ) dx . (11.31) That ARE(^g) is less than or equal to unity follows from the Cauchy-Schwarz inequality and equality holds iff gd(x) = exp x. There is no obvious choice for g which satisfies the requirement (11.30) for gd(x) = exp x and does not involve . Using (11.28), we consider the convenient estimators ^1 = 0 Mx- dNx 0 dKx, ^2 = 0 M2 x- dNx 0 Mx- dKx, 168 CHAPTER 11. 
MISCELLANEOUS APPLICATIONS and ^3 = 0 Mx- Nx- dNx 0 Nx- dKx for which the form gd(x) in (11.30) is 1, 1 - e-x and x, respectively. By direct integration, ARE(^1) = 2( + e- 1)2 (e - 1 - )(1 - e- )2 , ARE(^2) = (2 + 1 - (2 - e- )2 )2 (e - 1 - )(1 - e- )4 , and ARE(^3) = (2 + 2 e- 2 + 2e- )2 (e -1- ){3 + 2e- ( +1)( e- -2 -2)+ (2-e- )2} . It is convenient to compare these efficiencies against the scale 1 - e- which represents the expected fraction of animals marked by time and therefore has direct experimental relevance. Graphs are given in Becker and Heyde (1990). All three estimators are very efficient over the range 0 < 1 - e< 0.5, which should contain the range of most practical importance. The efficiency of ^1 is highest over this range, which suggests that it is better to have a function g in (11.28) that is not zero initially. The efficiency of ^1 drops significantly once the expected fraction of marked animals exceeds 0.5, as might be expected, since gx = 1 is not an increasing function. These observations encourage us to consider the estimator ^4 = 0 exp Nx- Mx- 1{Mx->0} Mx- dNx 0 exp Nx- Mx- 1{Mx->0} dKx. Corresponding to this estimator we have gd(x) = exp x 1 - e-x , which is an increasing function starting at the value e and is close to the optimal ex for large x. The efficiency of this estimator, ARE(^4) is 0.995 over the range 0 < 1 - e< 0.5 and never falls below 0.9774. It should finally be remarked that the argument leading to estimators given by (11.28) remains valid even when is random. For example, random fluctuations in environmental factors might affect the trapping rate. If a single trap is used that captures just one animal at a time, then the estimators given by (11.28) remain valid even when immediate release of the animals is not possible. Occupancy of the trap simply has the effect of suspending the entire capture process temporarily. 11.5. ROBUST ESTIMATION 169 11.5 Robust Estimation The results in this section are from Kulkarni and Heyde (1987). Let Y1, . . . , Yn be a sample of n observations from a discrete time stochastic process whose distribution depends on a real valued parameter belonging to an open interval of the real line. It is known that when there are outliers in the observations, the standard methods of estimation (maximum likelihood, method of moments etc.) may be seriously affected with adverse results. Robustified versions of some of these procedures are discussed in quite a number of sources, for example, Huber (1981), Gastwirth and Rubin (1975), Denby and Martin (1979), Martin (1980, 1982), Martin and Yohai (1985), K¨unsch (1984), Basawa, Huggins and Staudte (1985), Bustos (1982) and references therein. These authors, however, concentrate on particular problems and we shall here discuss a general procedure of broad applicability based on robustifying the quasi-likelihood framework. As usual this produces an estimating function, which has certain optimality properties, within a specified class of estimating functions. First note, that it is generally possible to specify a set of functions hi = hi(Y1, . . . , Yi, ), i = 1, 2, . . . , n, that are martingale differences, namely, E hi Fi-1 a.s., 1 i n, (11.32) where Fk denotes the past history -field generated by Yj, 1 j k, k 1, and F0 is the trivial -field. A natural choice for hi in the case of integrable Yi is just Yi - E(Yi | Fi-1). 
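Returning to the explicit recapture estimators of (11.28), the quantities defining them change only at capture epochs, so they can be computed directly from the capture history. A minimal sketch in which the data are recorded as an ordered sequence of capture outcomes (True for a recapture of an already marked animal):

```python
import numpy as np

def recapture_estimators(recapture):
    """Estimators nu_1, nu_2, nu_3 of (11.28) from the ordered capture history.
    recapture[k] is True if the k-th captured animal was already marked.
    M_before[k] and N_before[k] play the roles of M_{x-} and N_{x-} just before
    the k-th capture; at least one recapture with M_before > 0 is assumed."""
    recapture = np.asarray(recapture, dtype=bool)
    new_mark = (~recapture).astype(int)
    M_before = np.cumsum(new_mark) - new_mark      # marked animals before each capture
    N_before = np.arange(len(recapture))           # captures before each capture
    K = recapture.astype(float)                    # dK_x: recapture indicator
    nu1 = M_before.sum() / K.sum()
    nu2 = (M_before ** 2).sum() / (M_before * K).sum()
    nu3 = (M_before * N_before).sum() / (N_before * K).sum()
    return nu1, nu2, nu3
```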
Now in this book we have given special consideration to the class M of square integrable martingale estimating functions Gn = n i=1 ai-1 hi, (11.33) where the hi are specified and the coefficients ai-1 are functions of Yi, . . . , Yi-1 and , i.e., are Fi-1-measurable. We have shown in Chapter 2 how to choose from M a quasi-score estimating function G n = n i=1 a i-1 hi with certain optimality properties. However, although the hi can be chosen to be robust functions of the observations, outliers can enter the estimating function through the weights ai-1 and hence adversely affect the properties of the estimates. Thus the class M of estimating functions is not resistant to outliers. To overcome this difficulty we constrain the ai-1's in (11.33) to be of robust form by considering a subset S of M whose elements are of the form Sn = n i=1 hi i j=1 bj,i ki-j, (11.34) where the bj,i, 1 j i, are constants and the ki-j, 1 j i, i 1, are specified functions of Y1, . . . , Yi-j and such that k0 = 1 and E(ki-j Fi-j-1) = 0, i > j. (11.35) 170 CHAPTER 11. MISCELLANEOUS APPLICATIONS That is, the ki-j, i > j, are martingale differences. Note that, unlike the ai-1's in (11.33), the bj,i's in (11.34) are constants and free of the observations. Also, the ki-j's in (11.34) being specified functions, can be chosen to be robust, just as with the hi's. We may take, for example ki-j = hi-j. Thus, the estimating functions in S can be chosen to be robust and we shall show how to select the optimum S within S. In these introductory remarks we have described the case of a scalar parameter for clarity. The results given below, however, deal with a vector parameter and vector valued estimating functions. In Section 11.5.1 we shall develop a general theory of robust quasi-score estimating functions and discuss the loss of efficiency due to robustification. The analysis is carried out formally without specific attention to whether the resultant quasi-score estimating functions are strictly allowable in the sense of involving only the data and the parameter to be estimated. Indeed, they often involve unobservable quantities but are nevertheless amenable to iterative solution. In Section 11.5.2 we shall illustrate by considering the example of a regression model with autoregressive errors. 11.5.1 Optimal Robust Estimating Functions Suppose that Y 1, Y 2, . . . , Y n is a vector process and that = (1, . . . , p) is a parameter taking values in an open subset of p dimensional Euclidean space. As usual we consider the class G of zero mean, square integrable p dimensional vector estimating functions Gn = Gn(Y 1, . . . , Y n, ) that are a.s. differentiable with respect to the components of and such that E ˙Gn = (E Gn,i/j) and EGn Gn are nonsingular. Now let hi and ki-j be specified vectors of dimension q satisfying vector forms of (11.32) and (11.35) for 1 j i - 1, i 1 and k0 = (1, 1, . . . , 1) . Usually q p. We consider the subset S of G having elements Sn = n i=1 i j=1 bj,i ki-j hi (11.36) where the bj,i are constant vectors of dimension p. Then, with the aid of Theorem 2.1 we shall obtain the following theorem. Theorem 11.3 The quasi-score estimating function S n within the class S defined by (11.36) is given by S n = n i=1 i j=1 b j,i ki-j hi, 11.5. ROBUST ESTIMATION 171 where the b j,i, 1 i n, satisfy E(ki-1 hi hi ki-1) E(ki-1 hi hi k0) E(ki-2 hi hi ki-1) E(ki-2 hi hi k0) ... ... E(k0 hi hi ki-1) E(k0 hi hi k0) b 1,i b 2,i ... b i,i = E(ki-1 ˙hi) E(ki-2 ˙hi) ... E(k0 ˙hi) . 
(11.37) For the special case when E(ki hm hm kj) = 0, i = j and all m (which holds, for example, if E(hm hm | Fm-1) = cm, a constant, for each m), b j,i = (E(ki-j hi hi ki-j))-1 E(ki-j ˙hi), 1 j i, 1 i n. (11.38) Proof We have, using the fact that the hi's are martingale differences, E ˙Sn = E n i=1 i j=1 bj,i ki-j ˙hi, while ESn S n = E n i=1 i j=1 bj,i ki-j hi hi i l=1 ki-l b l,i . Then, the result (11.37) follows from Theorem 2.1 since E ˙Sn = ESn S n for all Sn S when E ki-j hi hi i l=1 ki-l b l,i = E(ki-j ˙hi), 1 j i, and this gives (11.37). The important special case (11.38) follows immediately from (11.37). Now it is important to be able to assess the effect on efficiency of choosing a quasi-score estimating function from S rather than the broader class M of martingales of the form Gn = n i=1 ai-1 hi with the ai-1 being matrices of dimension p × q depending on Y 1, . . . , Y i-1 and , 1 i n. Efficiency may conveniently be assessed by comparing the martingale information in the quasi-score estimating functions within G and S. In particular, it determines the size of the confidence zone in asymptotic confidence statements about the unknown parameter . 172 CHAPTER 11. MISCELLANEOUS APPLICATIONS Let G n be a quasi-score estimating function within G. Then, G n = n i=1 a i-1 hi (11.39) with a i-1 = E ˙hi Fi-1 E hi hi Fi-1 -1 , the inverse being assumed to exist a.s. and the martingale information in G n is IG n = n i=1 E ˙hi Fi-1 E hi hi Fi-1 -1 E ˙hi Fi-1 (11.40) (see Section 6.3). On the other hand, the corresponding result for the quasiscore estimating function from within S is IS n = n i=1 BiE ˙hi Fi-1 n i=1 BiE hi hi Fi-1 Bi -1 n i=1 BiE ˙hi Fi-1 , (11.41) where Bi = i j=1 b j,i ki-j and the b j,i are given by (11.37). It should be noted that if ^G and ^S are the estimators obtained from the estimating equations G n = 0 and S n = 0, respectively, then, under regularity conditions that are usually satisfied in cases which are of practical relevance, (^G - ) IG (^G - ) d - 2 p, (11.42) (^S - ) IS (^S - ) d - 2 p, (11.43) as n (Godambe and Heyde (1987, Section 4)). Here 2 p is the chi-squared distribution with p degrees of freedom. Of course the matrices IG and IS are of dimension p × p and their comparison is not straightforward in general. However, comparison can often be confined to scalar criteria; see Section 2.3. Furthermore, in special cases, such as that treated in the next section, useful results can be obtained from examination of the diagonal elements of interest. It should be remarked that statistics based on optimal robust estimating functions usually involve quantities which cannot be calculated explicitely in terms of the data provided. However, in specific cases iterative computations can be carried out beginning from preliminary estimates of the unknown parameters. For information on similar procedures see Martin and Yohai (1985). 11.5. ROBUST ESTIMATION 173 11.5.2 Example A regression model with autoregressive errors. Suppose that (Y1, . . . , Yn) comes from a model of the form Yi = Ci + Xi, Xi = Xi-1 + i, i = 1, 2, . . . , n, where = (1, . . . , r), Ci = (C1i, . . . , Cri), || < 1, X0 = 0 and the i are independent and identically distributed random variables (i.i.d. r.v.'s) having zero mean and unit variance. Here the Xi are not directly observed, the Ci are fixed regressors and = (, ) are unknown parameters. Let hi = (Xi - Xi) = ( i) and ki-j = hi-j where () is a bounded function chosen so that E( i) = 0. 
If the i have a symmetric distribution, then a convenient choice for is Huber's function (u) = u if |u| < m, m sgn u if |u| m, m being any specified constant. Here E(h2 i Fi-1) = E2 ( ) is constant and from Theorem 11.3 we readily find (see (11.38)) that b j,i = K1(j-1 , 0 ) , j < i, K2(0, Ci - Ci-1) , j = i, where 0 = (0, . . . , 0) (of dimension r) while K1 and K2 are (scalar) constants and the quasi-score robust estimating function for within S based on {hi} is given by S n = (S n1, S n2) with S n1 = K1 N i=1 ( i) i-1 j=1 j-1 ( i-j) = K1 n i=1 ( i) ~Xi-1, S n2 = K2 n i=1 ( i)(Ci - Ci-1), where ~Xt is given by ~Xt = ~Xt-1 + ( t) = t j=1 j-1 ( t-j+1). This shows that the estimating function approach discussed by Basawa et al. (1985, Example 1), under the assumption of normally distributed i, gives a quasi-score estimating function within S whatever the distribution of the i. Now using the same set of hi, the estimating function which is optimal within G is, from (11.39), G n = (G n1, G n2) 174 CHAPTER 11. MISCELLANEOUS APPLICATIONS with G n1 = K3 n i=1 ( i) Xi-1, G n2 = K3 n i=1 ( i) (Ci - Ci-1), K3 being a scalar constant. Clearly the ( i) will not be directly observable in practice and iterative calculations will be necessary to deal with all these estimating functions. For example, the basic strategy for is as follows. Start from ~X0(^0, 0) where (^0, ^0) are robust preliminary estimates. Then calculate ~0,i = (Yi - ^0 Ci - ^0Yi-1 + ^0Ci-1), i = 1, 2, . . . and obtain { ~X0,i, i = 1, 2, . . .} recursively from ~X0,i = ^0 ~X0,i-1 + ~0,i taking ~X0,0 = 0. Next compute ^1 = n i=1 ~X0,i ~X0,i-1 n i=1 ~X2 0,i-1 and repeat the procedure. Continue the iterations until a stable value for ^ is obtained. We note that S n2 and G n2 are the same, except for the constant multiplier, both being robust, and it is of special interest to compare the performance of the robust estimating function S n1 with that of the nonrobust G n1. For the purpose of this comparison the regression part of the model is irrelevant. Consequently, we now focus attention on the autoregression and suppose that the data is (X1, . . . , Xn). From (11.40) and (11.41) we obtain IG 1 = d n i=1 X2 i-1, (11.44) IS 1 = d n i=1 ~Xi-1 Xi-1 2 n i=1 ~X2 i-1 -1 , (11.45) where d = (E ˙( ))2 /E2 ( ). Now suppose that || < 1, i.e., the autoregression is (asymptotically) stationary. Then, from the martingale strong law of large numbers applied in turn to the martingales n i=1 X2 i - E X2 i Fi-1 , 11.5. ROBUST ESTIMATION 175 n i=1 Xi ~Xi - E Xi ~Xi Fi-1 and n i=1 ~X2 i - E ~X2 i Fi-1 , we obtain after some straightforward analysis that n-1 n i=1 X2 i-1 a.s. - (1 - 2 )-1 , (11.46) n-1 n i=1 ~Xi-1Xi-1 a.s. - E( ( ))(1 - 2 )-1 , (11.47) n-1 n i=1 ~X2 i-1 a.s. - E2 ( )(1 - 2 )-1 (11.48) as n . Consequently, IG 1 nd (1 - 2 )-1 a.s., IS 1 nd (E( ( )))2 (E2 ( ))-1 (1 - 2 )-1 , which gives IS 2 I-1 G 1 a.s. - corr2 ( , ( )) 1 where corr denotes correlation. The quantity corr2 ( , ( )) is the asymptotic relative efficiency of the estimator ^S obtained from the estimating equation S n1 = 0 compared with the estimator ^G obtained from the estimating equation G n1 = 0. This follows along the lines of (11.42), (11.43), which are readily formalized in the present context. We also note that the standard (nonrobust) estimator for is ^QL = n i=1 XiXi-1 n i=1 X2 i-1, which is a quasi-score estimating function within the class n i=1 ai-1(Xi - Xi-1), ai-1 being Fi-1-measurable. 
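The iterative scheme described above is straightforward to program. The sketch below is confined to the observed autoregression with no regression component; it contaminates a small fraction of the innovations to mimic outliers and compares the nonrobust quasi-likelihood estimator with the robust estimator obtained by iterating to a root of the robust quasi-score. The contamination mechanism, the Huber tuning constant and the fixed number of iterations are all illustrative choices rather than recommendations.

```python
import numpy as np

rng = np.random.default_rng(1)

def huber_psi(u, m=1.345):
    """Huber's bounded psi function: identity inside [-m, m], clipped outside."""
    return np.clip(u, -m, m)

def simulate_ar1_with_outliers(rho, n, p_outlier=0.05, outlier_scale=10.0):
    """AR(1) with occasional inflated innovations (an illustrative contamination model)."""
    x = np.zeros(n + 1)
    for i in range(1, n + 1):
        scale = outlier_scale if rng.random() < p_outlier else 1.0
        x[i] = rho * x[i - 1] + scale * rng.standard_normal()
    return x

def rho_robust(x, rho0, n_iter=20, m=1.345):
    """Iterate the scheme described above to a root of the robust quasi-score (pure AR case)."""
    rho = rho0
    for _ in range(n_iter):
        eps = x[1:] - rho * x[:-1]            # residuals at the current estimate
        psi = huber_psi(eps, m)
        xt = np.zeros(len(x))                 # filtered series: X~_i = rho * X~_{i-1} + psi(eps_i)
        for i in range(1, len(x)):
            xt[i] = rho * xt[i - 1] + psi[i - 1]
        rho = np.sum(xt[1:] * xt[:-1]) / np.sum(xt[:-1] ** 2)
    return rho

rho_true = 0.5
x = simulate_ar1_with_outliers(rho_true, n=2000)

rho_ql = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)  # nonrobust quasi-likelihood estimator
rho_rb = rho_robust(x, rho0=rho_ql)                    # a robust preliminary estimate would normally start the iteration
print(rho_ql, rho_rb)
```

Under contamination the robust version is typically the more stable of the two, while the asymptotic efficiency statement that follows quantifies what is given up when there is no contamination.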
It is easily shown that n1/2 (^QL - ) d - N(0, 1 - 2 ), so that the asymptotic efficiency of ^S relative to ^QL is d corr2 ( , ( )). 176 CHAPTER 11. MISCELLANEOUS APPLICATIONS 11.6 Recursive Estimation There are important applications in which data arrive sequentially in time and parameter estimates need to be updated expeditiously, for example, for on-line signal processing. We shall illustrate the general principles in this section. To focus discussion we shall, following Thavaneswaran and Abraham (1988), consider a time series model of the form Xt = ft-1 + t, (11.49) where is to be estimated, ft-1 is measurable w.r.t. Ft-1, the -field generated by Xt-1, . . . , X1 and the { t, Ft} are martingale differences. This framework covers many non-linear time series models such as random coefficient autoregressive and threshold autoregressive models. However, we shall just treat the scalar case here for simplicity. For the family of estimating functions HT = T t=1 at(Xt - ft-1), at Ft-1 measurable we have that the quasi-score estimating function is G T = T t=1 a t (Xt - ft-1), where a t = E ˙ht Ft-1 E h2 t Ft-1 = ft-1 E 2 t Ft-1 . The optimal estimator for based on HT and the first T observations is then given by ^T = T t=1 a t Xt T t=1 a t ft-1. (11.50) When the (T +1)st observation becomes available, we now carry out estimation within HT +1. The estimator for based on the first (T +1) observations, ^T +1, is given by (11.50) with T replaced by (T + 1) and ^T +1 - ^T = KT +1 T +1 t=1 a t Xt - ^T K-1 T +1 , where K-1 T +1 = T +1 t=1 a t ft-1. After a little more algebra we find that KT +1 = KT 1 + fT a T +1 KT (11.51) 11.6. RECURSIVE ESTIMATION 177 and ^T +1 = ^T + KT a T +1 1 + fT a T +1 KT XT +1 - ^T fT . (11.52) The algorithm given by (11.52) provides the new estimate at time (T + 1) in terms of the old estimate at T plus an adjustment. This is based on the prediction error since the forecast of XT +1 at time T is ^T fT . From starting values 0 and K0 the estimator can be calculated recursively from (11.51) and (11.52). In practice 0 and K0 are often chosen on the basis of an additional data run. Various extensions are possible. A similar recursive procedure may be developed for a rather more complicated model that (11.49) proposed by Aase (1983). For details see Thavaneswaran and Abraham (1988). Chapter 12 Consistency and Asymptotic Normality for Estimating Functions 12.1 Introduction Throughout this book it has at least been implicit that consistency and asymptotic normality will ordinarily hold, under appropriate regularity conditions, just as for the case of ordinary likelihood. Sometimes we have been quite explicit about this expectation, such as with the meta theorem enunciated in Chapter 4. We have chosen not to attempt a substantiation of these principles because of the elusiveness of a satisfying general statement and the delicacy of the results as exercises in mathematics as distinct from the reality of practically relevant examples. Also, we have made it clear already that it is generally preferable to check directly, in any particular example, for consistency and asymptotic normality, rather than trying to check the conditions of a special purpose theorem. When the quasi-score (or more generally, the estimating function) is a martingale, consistency and asymptotic nomality are usually straightforward to check directly using the strong law of large numbers (SLLN) and central limit theorem (CLT), respectively, for martingales. 
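As an indication of what such a direct check involves, the sketch below uses a stationary first-order autoregression with unit innovation variance, a model chosen purely for illustration. The quasi-likelihood estimator is computed on repeated samples and the estimation error, normed by the square root of the observed conditional variance term, is compared with the standard normal distribution; this is the kind of statement that the martingale SLLN and CLT developed in this chapter deliver.

```python
import numpy as np

rng = np.random.default_rng(2)

def standardized_error(theta0, n):
    """Simulate an AR(1) path and return (sum X_{i-1}^2)^{1/2} * (theta_hat - theta0)."""
    x = np.zeros(n + 1)
    for i in range(1, n + 1):
        x[i] = theta0 * x[i - 1] + rng.standard_normal()
    s = np.sum(x[:-1] ** 2)
    theta_hat = np.sum(x[1:] * x[:-1]) / s
    return np.sqrt(s) * (theta_hat - theta0)

z = np.array([standardized_error(0.7, n=500) for _ in range(2000)])
# If the martingale CLT is doing its job, z behaves like a standard normal sample.
print(z.mean(), z.std(), np.mean(np.abs(z) < 1.96))  # approximately 0, 1 and 0.95
```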
In this chapter we shall give some general consistency results and versions of the SLLN and CLT for martingales which enable most typical examples to be treated. There are, of course, applications for which the martingale paradigm is not a satisfactory general tool. The most notable context in which this is the case is that of random fields. Sometimes causal representations of fields are possible for which martingale asymptotics works well but other limit theorems are necessary in general. Consistency results are usually straightforward to obtain, especially in the context of stationary fields where ergodic theorems can be used, but asymptotic normality poses more problems. A standard approach to the central limit theorem here has been to develop results based on the use of mixing conditions to provide the necessary asymptotic independence (e.g., Rosenblatt (1985), Doukhan (1994), Guyon (1995, Chapter 3)). However, mixing conditions are difficult, if not impossible, to check, so this approach is not entirely satisfactory. Fortunately, recent central limit results for random fields based on conditional centering (e.g., Comets and Janˇzura (1996)) seem very promising. Non-random field examples of estimating functions that are not martingales can often be replaced, for asymptotic purposes, by martingales. Thus, for 179 180 CONSISTENCY AND ASYMPTOTIC NORMALITY example, if GT () = T t=1 (Xt - ) where Xt is the moving average Xt = + t - r t-1, |r| < 1, the 's being i.i.d., then GT () = (1 - r) T t=1 t + r( T - 0), which is well approximated by the martingale {(1-r) T t=1 t} for the purposes of studying the asymptotics of ^T = T-1 T t=1 Xt. 12.2 Consistency A range of consistency results for quasi-likelihood estimators, and more generally for the roots of estimating functions, can be developed to parallel corresponding ones for the maximum likelihood estimator (MLE). We therefore begin with a brief discussion of methods that have been established for this context. There are two classical approaches for the MLE. These are the one due to Wald (1949), which yields consistency of global maximizers of the likelihood, and the one due to Cram´er (1946), which yields a consistent sequence of local maximizers. The Wald approach is not available in general in a quasi-likelihood setting. Certainly its adoption requires the quasi-score to be the derivative with respect to the parameter of a scalar objective function. However, there are certainly cases in which the principles can be used. For a discussion of the approach see, for example, Cox and Hinkley (1974, pp. 288­289). Similar comments on applicability also apply to some recent necessary and sufficient results of Vajda (1995). There is, however, a newly developed minimax approach to consistency of quasi-likelihood estimators (Li (1996a)) that has some of the advantages of the Wald approach. The Cram´er approach is based on local Taylor series expansion of the likelihood function and an analogous route can easily be followed for estimating functions. It may be remarked that the law of large numbers plays a central role in both approaches. We shall use a stochastic process setting to describe the ideas behind the Cram´er approach. Let LT ({Xt}; ) denote the likelihood that we suppose is differentiable with respect to , here assumed to be scalar for clarity, and for which the derivative of the log likelihood is square integrable. 
Then, under modest regularity conditions which involve differentiability under an integral sign, {d log LT ({Xt}; )/d, FT , T 1} is a martingale, Ft denoting the field generated by X1, . . . , Xt, t 1, e.g., Hall and Heyde (1980, Chapter 6, p. 157). 12.2. CONSISTENCY 181 We write d d log LT ({Xt}; ) = T t=1 ut(), where the ut() are martingale differences, and take the local Taylor series expansion for . This gives d d log LT ({Xt}; ) = T t=1 ut( ) = T t=1 ut() - ( - ) IT () +( - )(JT ( T ) - IT ()), (12.1) where IT () = T t=1 E u2 t () Ft-1 , JT () = T t=1 d d ui(), and T = + ( - ) with = ({Xt}; ) satisfying || < 1. Now the martingale strong law of large numbers (e.g., using Theorem 2.18 of Hall and Heyde (1980); see pp. 158, 159, or Theorem 12.4 herein) gives (IT ())-1 T i=1 ui() a.s. - 0 (12.2) provided that IT () a.s. - as T . Thus, using (12.1) and (12.2), the score function has a root ^T , which is strongly consistent for the true parameter if IT () a.s. - and lim sup T |IT () + JT ( T ) | IT () < 1 a.s. (12.3) Various sufficient conditions can be imposed to ensure that (12.3) holds but these are not illuminating. The approach discussed above can be paralleled in many quasi-likelihood contexts. If convenient families of martingale estimating functions are available for a particular context, they would ordinarily be used. Then the quasi-score estimating function will be a martingale, local Taylor series expansion can be used just as above, and the martingale strong law of large numbers retains its role. The case of vector valued can be dealt with in the same way. The different approaches to establishing consistency results depend fundamentally on whether the estimating function is the derivative with respect to of a scalar objective function. If such a representation is possible, as it is in the case of the score function (which is the derivative of the log likelihood), then the consistency question can be translated into the problem of checking 182 CONSISTENCY AND ASYMPTOTIC NORMALITY for local maxima. However, the representation is not possible in general for quasi-score estimating functions; a scalar objective function (quasi-likelihood) may not exist. If there is a scalar objective function and the parameter space is a compact subset of p-dimensional Euclidean space, then an estimator obtained by maximization always exists. The situation, however, is more complicated if a scalar objective function does not exist or is not specified. Example (Chen (1991)). Consider the quadratic estimating function g = T i=1 a(Xi - ) + b (Xi - )2 - 2 , where the Xi's are i.i.d. with EXi = , var Xi = 2 , a, b and 2 being known and is the parameter to be estimated. This is a quasi-score estimating function if 3 a + (4 - 4 ) b = 0, where k = E(X1 - )k . It is easily seen that there are two roots for g() = 0, namely, ^1,T = a 2b + 1 T T i=1 Xi - 1/2 T , ^2,T = a 2b + 1 T T i=1 Xi + 1/2 T , where T = a 2b + 1 T T i=1 Xi 2 - a Tb T i=1 Xi + 1 T T i=1 X2 i - 2 . The roots only exist on the sequence of events {T 0}, but it follows from the strong law of large numbers that T a.s. - (a/2b)2 as T , so that they exist almost surely. Note that ^1,T a.s. - , but ^2,T a.s. - + (a/b). The point here is that both the existence and the consistency of estimators can at most be expected in the almost sure sense. For various problems we have to deal with situations in which the asymptotic results on which we rely hold only on a set E, say, with P(E) < 1. 
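A short simulation makes the two roots in the example above tangible. In the sketch below the observations are shifted exponentials, chosen only to be non-normal with known mean and variance; the constants a and b are likewise purely illustrative, and the moment side condition that makes g a quasi-score is not imposed, since it has no bearing on where the roots lie.

```python
import numpy as np

rng = np.random.default_rng(3)

def both_roots(x, a, b, sigma2):
    """Solve g(theta) = sum[a(X_i - theta) + b((X_i - theta)^2 - sigma2)] = 0 for its two roots."""
    xbar = x.mean()
    m2 = np.mean(x ** 2)
    delta = (a / (2 * b) + xbar) ** 2 - a * xbar / b - m2 + sigma2
    root = np.sqrt(delta)             # the roots exist on the event {delta >= 0}
    centre = a / (2 * b) + xbar
    return centre - root, centre + root

theta0, sigma2 = 2.0, 1.0
a, b = 1.0, 0.5                        # illustrative constants with a/(2b) > 0
x = theta0 + rng.exponential(1.0, size=100_000) - 1.0  # mean theta0, variance 1, non-normal

t1, t2 = both_roots(x, a, b, sigma2)
print(t1, t2)                          # t1 near theta0, t2 near theta0 + a/b
```

The first root settles at the true mean and the second at the true mean plus a/b; how to discriminate between such roots in practice is taken up in Section 13.3.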
This is the case, for example, for estimating the mean of the offspring distribution for a supercritical branching process; the asymptotic results hold on the nonextinction set whose probability is less than one in general. Thus, in view of the above considerations, and given a sequence of estimating functions {GT ()}, we shall say that existence and consistency of the estimator ^ that solves GT () = 0 holds if: for all sufficiently small > 0, a.s. on E, and for T sufficiently large, GT () = 0 has a root in the sphere { : - 0 }, 0 denoting the "true value". We shall now give a consistency criterion that does not require the estimating function to be the derivative with respect to of a scalar function. This 12.2. CONSISTENCY 183 is an adaption of a result originally due to Aitchison and Silvey (1958), which uses the Brouwer fixed-point theorem. Theorem 12.1 (Consistency Criterion) Let {GT ()} be a sequence of estimating functions that are continuous in a.e. on E for T 1, and for all small > 0 a.e. on E, there an > 0 so that lim sup T sup -0 = ( - 0) GT () < - , (12.4) then there exists a sequence of estimators ^T such that for any E, ^T () 0 and GT (^T ()) = 0 when T > T, a constant depending on . On the other hand, if instead we have an in probability version of (12.4), namely, lim T P sup -0 = ( - 0) GT () < - E = 1 (12.5) for any > 0, then there is a sequence of estimators ^T such that ^T converges to 0 in probability on E and P GT (^T ) = 0 E - 1 as T . Proof For any suitable small positive and let ET = : sup -0 = ( - 0) GT () - . For ET , define ^() to be a root of GT () = 0 with ^ - 0 . The existence of such a root is ensured by Brouwer's fixed point theorem. Next, define ^T () for Ec T , the complementary event, to be an arbitrary point in the parameter space. Then, from the definition of ^T , we have { ^T - 0 < } ET so that { ^T - 0 } Ec T and hence P( ^T - 0 i.o. | E) P(Ec T i.o. | E) = 0, i.o. denoting infinitely often and the right-hand equality following from condition (12.4). This gives a.s. convergence of ^ to 0 on E. Furthermore, since P(Ec T i.o. | E) = 0, we have GT (^T ()) = 0 on E when T > T. 184 CONSISTENCY AND ASYMPTOTIC NORMALITY In the case where (12.5) holds, the results follow from P( ^T - 0 < | E) P(ET | E) 1 as T . Theorem 12.1 is not usually easy to use in practice. The principal difficulty is in showing that (12.4) holds uniformly in . Some sufficient conditions for ensuring applicability are given in Theorem 12.1 of Hutton, Ogunyemi and Nelson (1991). However, direct checking is usually preferable, albeit tedious. See Exercises 12.1, 12.2 for example. Next we shall consider the case where GT () is the gradient of a scalar potential function q(), GT () = q()/. For this to be the case the matrix GT ()/ must be symmetrical (e.g., McCullagh and Nelder (1989)). The results that can then be obtained parallel these which have been developed for maximum likelihood estimators. The scalar potential function takes the place of the likelihood. In the following we will use x for vector x to denote the Euclidean norm (x x)1/2 and A for matrix A to denote the corresponding matrix norm (maxA A)1/2 , max(resp. min) being the maximum (resp. minimum) eigen- value. For a sequence {AT } of p × p matrices, possibly random or depending on the true parameter value 0, we shall define the sets MT () = p : AT ( - 0) (AT )GT (0) , T = 1, 2, . . ., 0 < < , the minus denoting generalized inverse. 
Then, we shall make use of the following consistency conditions, W (for weak) and S (for strong). These conditions refer to the true probability measure and hence they must usually be checked for all possible 0, with c perhaps depending on 0. Condition W (i) For all 0 < < , sup MT () - 0 p - 0. (ii) For some random variable c, 0 < c < , and = 2/c, P(- ˙GT () - cAT AT 0 for all MT () ) 1, being the parameter space, an open subset of p . Condition S (i) For all 0 < < , sup MT () - 0 a.s. - 0. (ii) For some random variables T1, c, 0 < c < , and = 2/c, - ˙GT () cAT AT 0 for all MT () and all T T0, AT is nonsingular, T T0, for some (random) T0 a.s., and (AT )GT (0) 2 = o (min(AT AT )) a.s. 12.2. CONSISTENCY 185 Some explanations are in order. The Condition W(i) can be thought of as MT () shrinking to 0 in probability and it is equivalent to P(AT nonsingular) 1, (AT )GT (0) 2 = op (min(AT AT )), op meaning smaller order in probability, from which the parallel with Condition S is evident. Theorem 12.2 Under Condition W, there exists a weakly consistent sequence {^T } of estimators. Actually, with from W(ii), we have P(^T MT ()) 1 and ^T p - 0. To facilitate comparison with published results, assume that (AT )GT () = Op(1), order one in probability having the obvious meaning. Then, it is natural to consider the simpler sets NT () = { p : AT ( - 0) }, T = 1, 2, . . . 0 < < and it is not difficult to see that the Condition W(i) is implied by the analogous assumption with NT () replacing MT (). However, no such simple substitution is possible for W(ii). The usual Cram´er type results on weak consistency in the literature (e.g., Basawa and Scott (1983)) are essentially covered by this formulation. A plausible choice for the norming matrix AT is usually (-EG ˙GT ())1/2 . For the case of strong consistency we have the following result. Theorem 12.3 Under Condition S, there exists a strongly consistent sequence of estimators {T }. We have ^T MT () a.s. for all T T2 with some random T2 and ^T a.s. - 0. Proof of Theorem 12.2 If Condition W(i) holds for all constants , 0 < < , it also holds if is a random variable with 0 < < . Let c, = 2/c be the random variables of W(ii) and denote by ET the combined event: AT is nonsingular, MT () , - ˙GT () - c AT AT 0 for all MT () . 186 CONSISTENCY AND ASYMPTOTIC NORMALITY Conditions W(i),(ii) and open ensure P(ET ) 1. On the subset of ET where GT (0) = 0, MT () is degenerate to {0}. Taking ^T = 0, we have GT (^T ) = 0, - ˙GT (^T ) > 0 and ^T is a (locally unique) maximizer of q with ^T MT (). On the subset of ET where GT (0) = 0, MT () is a compact ellipsoidal neighborhood of 0. For on the boundary of MT (), set v = AT ( - 0). Then, taking the Taylor expansion qT () - qT (0) = v (AT )-1 GT () - 1 2 v (AT )-1 ˙GT (~) A-1 T v with ~ MT (), we obtain qT () - qT (0) v (AT )-1 GT () - 1 2 v 2 minA-1 T ˙GT (~)(AT )-1 v (AT )-1 GT () - 1 2 (AT )-1 GT () c = 0. Let M denote the subset of MT () where supMT () T () is attained. Since MT () is compact, M is nonempty. Also, since qT () qT (0) for all on the boundary of MT (), if M were to contain the whole line segment between and 0, it would also contain the whole line segment between and 0. But - ˙GT (~) > 0, ~ MT () makes this impossible and we conclude that M consists of a unique point in the interior of MT (). This point defines a (measurable) local maximizer of q(), on the set ET {GT (0) = 0}. Proof of Theorem 12.3 Let {ET } be defined as in the proof of Theorem 12.2. 
Using the fact that lim infT ET requires ET to occur for all T T2, say, it is seen that Condition S implies lim infT ET = 1 with probability one. Arguing as above, ^T MT (), T T2, and ^T a.s. - 0 follow. The general results given above obscure the simplicity of many applications. Very often the estimating functions {GT ()} of interest form a square integrable martingale. Then, direct application of an appropriate strong law of large numbers (SLLN) for martingales will usually give strong consistency of the corresponding estimator without requiring the checking of uniwieldy conditions. Useful general purpose SLLN's for martingales are given in Section 12.3 and a corresponding central limit result is given in Section 12.4. 12.3 The SLLN for Martingales The first result given here is a multivariate version of the standard SLLN for martingales. It is from Lin (1994b, Theorems 2 and 3) and it extends work of Kaufmann (1987) to allow for random norming. The martingales here will be assumed to be vectors of dimension p 1. As usual we shall use the norm x = (x x)1/2 for vector x and, for matrix A we 12.3. THE SLLN FOR MARTINGALES 187 shall write max(A) and min(A) for the maximum and minimum eigenvalues, tr A for the trace and A = (max(A A))1/2 for the norm. A sequence of matrices {An} will be called monotonically increasing if An+1 An+1 - An An is nonnegative definite for all n. Theorem 12.4 Let {Sn = n i=1 Xi, Fn, n 1} be a p-dimensional square integrable martingale and {An} a monotone increasing sequence of nonsingular symmetric matrices such that Ai is Fi-1-measurable for each i, and min(An) a.s. as n . If n=1 E A-1 n Xn 2 < (12.6) or n=1 E A-1 n Xn 2 Fn-1 < a.s., (12.7) then A-1 n Sn a.s. - 0 as n . Remarks (1) Since A-1 n Xn 2 = tr A-1 n Xn Xn A-1 n , we have E A-1 n Xn 2 = tr A-1 n E(Xn Xn) A-1 n , E A-1 n Xn 2 Fn-1 = tr A-1 n E Xn Xn Fn-1 A-1 n . (2) Theorem 12.4 is for the case of discrete time but corresponding results are available for continuous time with the conditions (12.6) and (12.7) replaced by tr 0 A-1 t d (E S t) A-1 t < (12.8) and tr 0 A-1 t d S t A-1 t < a.s., (12.9) respectively, { S t} being the quadratic characteristic process for {St}. In some applications it may be convenient to use one dimensional martingale results on each component in the vector rather than the multivariate results quoted above. A comprehensive collection of one dimensional stong law results for martingales in discrete time are given in Chapter 2 of Hall and Heyde (1980). Again these results have natural analogues in continuous time. However, an additional one dimensional result for continuous time which is particularly useful is given in the following theorem due to L´epingle (1977). 188 CONSISTENCY AND ASYMPTOTIC NORMALITY Theorem 12.5 Let {St, Ft, t 0} be a locally square integrable martingale that is right continuous with limits from the left (cadlag) and h(x) be an increasing nonnegative function such that 0 (h(x))-2 dx < . Then, St/h( S t) 0 a.s. on the set { S t }. Proof of Theorem 12.4 Suppose supn E( n k=1 A-1 k Xk 2 ) < . Then, since x + y 2 - y 2 = x 2 + 2x y, we have, upon setting x = A-1 k Xk and y = A-1 k Sk-1 and using the martingale property, E A-1 k Sk 2 - E A-1 k Sk-1 2 = E A-1 k Xk 2 . Write zn = n k=1 A-1 k Sk 2 - A-1 k Sk-1 2 . Now supn Ezn < and since {zn, Fn, n 1} is a nonnegative submartingale we must have the a.s. convergence of zn = A-1 n Sn 2 + n-1 k=1 A-1 k Sk 2 - A-1 k+1 Sk 2 . 
But, the monotonicity condition for the A's ensures that n-1 k=1 A-1 k Sk 2 - A-1 k+1 Sk 2 = zn - A-1 n Sn 2 is nondecreasing and it is a.s. bounded by supn zn, so that it must converge a.s. This forces A-1 n Sn 2 to converge a.s. Hence, it suffices to check that A-1 n Sn 2 p - 0 to ensure that A-1 n Sn 2 a.s. - 0 (and, consequently, A-1 n Sn a.s. - 0). Now, for any > 0 and N < n, P A-1 n Sn 2 > P A-1 n SN 2 > 1 2 + P An (Sn - SN ) 2 > 1 2 = I1 + I2, (12.10) 12.3. THE SLLN FOR MARTINGALES 189 say. But, using Chebyshev's inequality and the monotonicity of the A's 1 2 I2 E A-1 n n k=N+1 Xk 2 E n k=N+1 Xk A-2 k Xk + n k=N+2 Xk A-2 k k-1 j=N+1 Xj + n k=N+2 Xk A-2 k n j=k+1 Xj = E n k=N+1 Xk A-2 k Xk = E n k=N+1 A-1 k Xk 2 , since Ak is Fk-1-measurable. Also, since E k=1 A-1 k Xk 2 < , for given > 0 there is an N0 > 0 such that I2 E n k=N0+1 A-1 k Xk 2 ( /2) < ( /2) (12.11) for all n > N0. Now fix N0. In view of the condition that min(An) a.s. as n there is an N > N0 such that if n > N, I1 P A-1 n 2 SN0 2 > /2 < /2 (12.12) and hence, from (12.10), (12.11) and (12.12), P A-1 n Sn 2 > < for n > N. This gives A-1 n Sn 2 p - 0 and hence A-1 n Sn a.s. - 0 and completes the proof of the first part of the theorem. Now suppose that (12.7) holds. Given a real C > 0, let TC = max n : n k=1 E A-1 k Xk 2 Fk-1 C . Note that TC is a stopping time since, for any n0, {TC < n0} = n0 k=1 E A-1 k Xk 2 Fk-1 > C Fn0-1. Therefore, for each C we can define a new random norming, such as I(n TC) A-1 n , I denoting the indicator function, which is still Fn-1-measurable. Clearly n k=1 E I(k TC) A-1 k Xk 2 Fk-1 C 190 CONSISTENCY AND ASYMPTOTIC NORMALITY for all n and hence n k=1 E I(k TC) A-1 k Xk 2 C. Also, {I(n TC) A-1 n } is monotonically decreasing, so applying the result of the first part of the theorem we obtain I(n TC) A-1 n Sn a.s. - 0 and hence A-1 n Sn a.s. - 0 and {TC = }. The required result follows since C can be any positive number. Proof of Theorem 12.5 The result basically follows from the martingale convergence theorem and a Kronecker lemma argument. Put At = S t, cu = inf{t : At > u} and define the martingale Zt = t 0 (h(Au))-1 dSu. As Acu u for all u < A, Z = 0 dAu (h(Au))2 = A 0 du (h(Acu ))2 0 dt (h(t))2 < . It follows that Z is square integrable and converges a.s. Now, using integration by parts, St = t 0 h(Au) dZu = Zt h(At) - t 0 Zu- dh(Au) = t 0 (Zt - Zu-) dh(Au). Furthermore, if 0 < v < t < , St h(At) 1 h(At) v 0 (Zt - Zu-) dh(Au) + 1 h(At) (h(At) - h(Av)) sup v 0, then H-1 T (0)(- ˙GT (0))(^T - 0) d - Z. This last result is the basis for confidence zones for 0 as discussed in Chapter 4 and, as indicated in Section 4.2, - ˙GT (0) may often be replaced asymptotically by equivalent normalizers [G(0)]T or G(0) T . The usual practice is for a martingale CLT to be used to provide (12.13), while (12.14) is obtained from ad hoc use of law of large number results and inequalities. The multivariate central limit result given here is an adaption of a result of Hutton and Nelson (1984) to deal with more general norming sequences. A similar, but slightly weaker, result appears in Srensen (1991). Our proof follows that of Hutton and Nelson (1984). The result is phrased to deal with the continuous time case but it also covers the discrete time one. This is dealt with by replacing a discrete time process {Xn, n 0} by a continuous version {Xc t , t 0} defined through Xc t = Xn, n t n + 1. 
Here all the processes that are considered are assumed to be right continuous with limits from the left (cadlag) and defined on a complete filtered probability space (, F, {Ft}, P) satisfying the standard conditions. As usual we denote by Ip the p × p identity matrix. We shall make use of the concepts of stable and mixing convergence in distribution and these are defined below. Let Y n = (Y1n, . . . , Ypn) d - Y = (Y1, . . . , Yp) . Suppose that y = (y1, . . . , yp) is a continuity point for the distribution function of Y and E F. Then, we say that the convergence holds stably if lim n P(Y1n y1, . . . , Ypn yp, E) = Qy(E) 192 CONSISTENCY AND ASYMPTOTIC NORMALITY exists and Qy(E) P(E) as yi , i = 1, 2, . . . , p. On the other hand, we say that the convergence holds in the mixing sense if lim n P(Y1n y1, . . . , Ypn yp, E) = P(Y1 y1, . . . , Yp yp) P(E). For details see Hall and Heyde (1980, pp. 56­57) and references therein. Theorem 12.6 Let {St = (S1t, . . . , Spt) , Ft, t 0} be a p-dimensional martingale with quadratic variation matrix [S]t. Suppose that there exists a non-random vector function kt = (k1t, . . . , kpt) with kit > 0 increasing to infinity as t for i = 1, 2, . . . , p such that as t : (i) k-1 it supst | Sis| p - 0, i = 1, 2, . . . , p, where Sis = Sis - Sis-; (ii) K-1 t [S]t K-1 t p - 2 where Kt = diag (k1t, . . . , kpt) and 2 is a random nonnegative definite matrix; (iii) K-1 t E(St St) K-1 t where is a positive definite matrix. Then, K-1 t St d - Z (stably) (12.15) where the distribution of Z is the normal variance mixture with characteristic function E(exp(-1 2 u 2 u)), u = (u1, . . . , up) , Kt [S]-1 t Kt 1/2 K-1 t (St det2 > 0) d - N(0, Ip) (mixing) (12.16) det denoting determinant, and St [S]-1 t St det2 > 0 d - 2 p (mixing) (12.17) as t . Remark It should be noted that {[S]t - S t, Ft, t 0} is a martingale, S t being the quadratic characteristic matrix. Each of the quantities [S]t, S t goes a.s. to infinity as t and ordinarily [S]t, S t are asymptotically equivalent. Proof of Theorem 12.6 First we deal with the case p = 1 and for this purpose we shall use an adaption of Theorem 3.2 of Hall and Heyde (1980). As it stands, this theorem deals with finite martingale arrays (Sni, Fni, 1 i kn, n 1) with differences Xni and it gives sufficient conditions under which Yn kn i=1 Xni d - Z (stably) 12.4. THE CLT FOR MARTINGALES 193 where Z is as defined in (12.15). But, if {kn} increases monotonically to infinity in such a way that E i>kn X2 ni - 0 as n , the theorem extends to infinite sums. That i=1 Xni = kn i=1 Xni + i>kn Xni d - Z (stably) under the condition i>kn Xni L1 - 0 follows from the definition of stable convergence. To prove (12.15) in the p = 1 case it suffixes to show that for any sequence {Tn} diverging monotonically to infinity sufficiently fast that KTn+1 /KTn 4, we have K-1 Tn STn d - Z (stably). (12.18) Indeed, if K-1 t St does not converge stably we can choose a subsequence along which the convergence is not stable and then a sufficiently sparse subsequence {Tn} for which KTn+1 /KTn 4 and K-1 Tn STn does not converge stably thus giving a contradiction. To show (12.18), define for each n the stopping times: T0 n = 0, Tk+1 n = inf t > Tk n : K-1 Tn |St - ST k n | 2-n = Tn if no such t Tn exists. We shall consider the martingale differences Xik = K-1 Tn ST k n - ST k-1 n , Fnk = FT k n . Note that Tk n = Tn ultimately in k for n = 1, 2, . . . so that k Xnk = K-1 Tn STn . 
Now, to establish (12.15) we need to check (a) supn E ( k Xnk) 2 < , (b) supk |Xnk| p - 0, (c) k X2 nk p - 2 , (d) E supk X2 nk is bounded in n, (e) for all n and k, Fnk = (Xn1, . . . , Xnk) Fn+1,k = (Xn1, . . . , X(n+1)k). 194 CONSISTENCY AND ASYMPTOTIC NORMALITY Now (a) is satisfied because K-2 Tn E S2 Tn is bounded, (b) is satisfied since ST k n - ST k-1 n 2-n + sup tTn St, (d) is satisfied since K-2 Tn E sup k ST k n - ST k-1 n 2 K-2 Tn E sup k 2S2 T k n + 2S2 T k-1 n 4K-2 Tn E S2 Tn and (c) holds, while (e) is satisfied since by induction on k, we can show that Tk n Tk n+1 a.s. To check (c), it suffices to show that k X2 nk - K-2 Tn [S]Tn p - 0. Now, by Ito's formula, [S]Tn = S2 Tn - S2 0 - 2 Tn 0 Su- dSu, and similarly, k X2 nk = K-2 Tn S2 Tn - S2 0 - 2 k=0 ST k n (ST k+1 n - ST k n ) . Then, setting n(u) = k=0 ST k n I(Tk n < u Tk+1 n ), we have k X2 nk - K-2 Tn S2 Tn = 2 K-2 Tn Tn 0 (Su- - n(u)) dSu = In, say. Now, from the definitions of Tk n and n, we have K-1 Tn (Su - u(u)) 2-n a.s., so that for > 0, P(|In| > ) -2 EI2 n = 4 -2 K-4 Tn E Tn 0 (Su- - n(u))2 d[S]u 4 -2 K-2 Tn 2-2n E[S]Tn - 0 12.5. EXERCISES 195 as n . The completes the proof in the case p = 1. To deal with the general case we use the so-called Cram´er-Wold device. Let c = (c1, . . . , cp) be a nonzero vector. By applying the theorem to the martingale {Rt = c St}, we have (c E(St St) c)-1/2 (c St) d - Zc (stably) where E(exp(i u Zc)) = E exp - 1 2 u2 2 c with c given by (c E(St St) c)-1 [c S]t p - 2 c . But, using Conditions (ii) and (iii) of the theorem, 2 c = (c 2 c)/(c c). It then follows that c K-1 t St d - (c c)1/2 Zc (stably) and since the characteristic function of (c c)1/2 Zc is E[exp(-1 2 c 2 c u2 )], the result (12.15) follows. The remaining results (12.16) and (12.17) are easy consequences of the definition of stable convergence together with (12.15). This completes the proof. A broad range of one-dimensional CLT results may be found in Chapter 3 of Hall and Heyde (1980). These may be turned into multivariate results using the Cram´er-Wold device as in the proof of Theorem 12.6 above. 12.5 Exercises 1. Suppose that {Zi} is a supercritical, finite variance Galton-Watson branching process with = E(Z1 | Z0 = 1), the mean of the offspring distribution to be estimated from a sample {Z1, . . . , ZT }. Show that QT () = T i=1 (Zi - Zi-1) may be obtained as a quasi-score estimating function for from an appropriate family of estimating functions. Establish the strong consistency of the corresponding quasi-likelihood estimator on the non-extinction set (i) using Theorem 12.1, and (ii) using the martingale SLLN, Theorem 12.4. (Hint. Note that -n Zn a.s. - W as n where W > 0 a.s. on the nonextinction set.) 196 CONSISTENCY AND ASYMPTOTIC NORMALITY 2. Let {Mt, Ft} be a square integrable martingale with quadratic characteristic M T = T a.s. Thus, {Mt} could be a process with stationary independent increments and need not be Brownian motion. Consider the model dXt = 1 2 + 1(c tc-1 )1/2 d + dMt, 1 > 0, 2 > 0, 1 2 < c < 1, and obtain the Hutton-Nelson solution (Section 2.3) for the quasi-score estimating function for = (1, 2) . Show that the quasi-likelihood estimator is strongly consistent for using Theorem 12.1. (Hint. Use Taylor expansion and L´epingle's version of the SLLN for martingales (Theorem 12.5).) (Hutton and Nelson (1986).) 3. 
Suppose that {Xk} is a process and {Fk} an increasing sequence of -fields such that E(Xk | Fk-1) = , E((Xk - )2 | Fk-1) = 1 + uk-1, k 1, with {uk} a family of non-negative random variables adapted to {Fk}. For a sample {X1, . . . , XT }, obtain the estimating function QT () = T k=1 Xk - 1 + uk-1 (12.19) as a quasi-score. Noting the difficulty of working with this quasi-score, it is interesting to make a comparison with the simple, non-optimal, estimating function HT () = T k=1 (Xk - ) u-1 k-1 I (uk-1 > 0). (12.20) In the particular case where (u2 k, Fk) is a Poisson process with parameter 1 sampled at integer times, and Xk = + wk (1 + uk-1)1/2 , k 1, the {wk} being i.i.d. with finite fourth moment and E(wk | Fk-1) = 0, E(w2 k | Fk-1) = 1, k 1, investigate the consistency and asymptotic normality of the estimators obtained from (12.19) and (12.20) (Hutton, Ogunyemi and Nelson (1991)). 4. Consider the stochastic differential equation dXt = (1 + 2 Xt) dt + dWt + dZt where Wt denotes standard Brownian motion and Zt is a compound Poisson process Zt = Nt i=1 i, 12.5. EXERCISES 197 Nt being a Poisson process with parameter 3 and the i's being i.i.d random variables with mean 4, which are independent of Nt. This s.d.e has been used to model the dynamics of soil moisture; see Mtundu and Koch (1987). For a sample {Xt, 0 t T} investigate the consistency and asymptotic normality, of quasi-likelihood estimators for = (1, 2, 3, 4). Take as known (Srensen (1991)). Chapter 13 Complements and Strategies for Application 13.1 Some Useful Families of Estimating Functions 13.1.1 Introduction A broad array of families of estimating functions has been presented in this monograph but many others are available and may be of particular use in special circumstances. The focus of the theory has usually been on polynomial functions of the data and these will not always provide ideal families of estimating functions. It must be remembered that the general theory requires estimating functions which are at least square integrable and if the underlying variables of interest do not have finite second moments then transformations are required. A number of useful approaches are indicated below. 13.1.2 Transform Martingale Families Suppose we have real valued observations X1, . . . , Xn whose distribution involves a parameter . Let Fj be the -field generated by X1, . . . , Xj, j 1. Then, following Merkouris (1992), write Fj(x | Fj-1) = P(Xj x | Fj-1), ^Fj(x) = I(Xj x), I being the indicator function. Now introduce a suitably chosen index set of {gt(x), t T }, which is real or complex valued and such that the integral transform j(t) = E(gt(Xj) | Fj-1) = gt(x) dFj(x | Fj-1) exists and is finite for all and all t T. Transforms such as the characteristic function and moment generating function, for which gt(x) is eitx and etx , respectively, are clearly included. Now let ^j(t) = gt(Xj) = gt(x) d ^Fj(x) and write hj(t) = ^j(t) - j(t) = gt(Xj) - E(gt(Xj) | Fj-1). 199 200 COMPLEMENTS AND STRATEGIES For fixed t T the {hj(t), Fj} are martingale differences and the family of martingale estimating functions H = n j=1 aj hj(t), aj's Fj-1-measurable will be useful in various settings. For an application to estimation of the scale parameter in a Cauchy distribution using gt(x) = cos tx, see Section 2.8, Exercise 5. Estimation of parameters of progression time distributions in multi-stage models is treated by this methodology, modulo minor adaption to the context, in Schuh and Tweedie (1970) and Feigin, Tweedie and Belyea (1983). 
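To see the transform idea in its simplest i.i.d. form, the sketch below estimates the scale parameter of a Cauchy distribution from the empirical cosine transform, in the spirit of the application cited above; the transform point t = 1 and the use of equal weights a_j are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(4)

def cauchy_scale_estimate(x, t=1.0):
    """
    Solve sum_j [cos(t X_j) - exp(-theta t)] = 0 for theta, using the fact that
    E cos(t X) = exp(-theta |t|) when X is Cauchy with location 0 and scale theta.
    """
    c = np.mean(np.cos(t * x))
    if c <= 0:                       # the root only exists when the empirical transform is positive
        return np.nan
    return -np.log(c) / t

theta0 = 2.0
x = theta0 * rng.standard_cauchy(100_000)  # Cauchy(0, theta0): no finite moments of any order
print(cauchy_scale_estimate(x, t=1.0))     # close to theta0
```

No moments of the observations are required here, which is precisely the situation for which transform based families were introduced at the start of this section.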
It is shown that Laplace transform based estimators may be very efficient when maximum likelihood estimators are available. 13.1.3 Use of the Infinitesimal Generator of a Markov Process A number of useful approaches to estimation are based on the use of the infinitesimal generator of a continuous time Markov process {Xt, t > 0} with states in a sample space , say a locally compact metric space. Suppose that X has distribution P, . Let L = L be the infinitesimal generator of the Markov process X. That is, for suitable functions f, Lf is given by lim 0 -1 [E(f(X) | X0 = x) - f(x)] provided that the limit exists (see, e.g., Karlin and Taylor (1981, p. 294)). Often a bounded limit is prescribed. Now let D be the domain of L. Then, for an eigenfunction D with eigenvalue , L = and it is not difficult to prove, using the Markov property, that L E((X) | X0 = x) = - E((X) | X0 = x) and hence that E((X) | X0 = x) = e- (x). This last result is a basis for showing that {Yt = e-t (Xt), t > 0} is a martingale and hence for the defining of families of estimating functions. Kessler and Srensen (1995) have used these ideas for the estimation of parameters in a discretely observed diffusion process with observations (Xt0 , . . . , Xtn ) and ti - ti-1 = for each i. In this context the family of martingale estimating functions n i=1 (Xti-1 ; ) (Xti ; ) - e-() (Xti-1 ; ) 13.2. SOLUTION OF ESTIMATING EQUATIONS 201 with 's arbitrary functions, is considered. Also, more generally, when k eigenfunctions 1(x; ), . . . , k(x; ) with distinct eigenvalues 1(), . . . , k() are available, the family of estimating functions n i=1 k j=1 j(Xti-1 ; ) j(Xti ; ) - e-j () j(Xti-1 ; ) , with j's arbitrary functions, can be used. Kessler and Srensen (1995) have discussed quasi-likelihood estimators from these families and their consistency and asymptotic normality. Another approach to estimation which makes use of the infinitesimal generator has been suggested by Baddeley (1995). Here a parametric model given by a family of distributions {P, } is assumed and the aim is to estimate on the basis of x drawn from P. The idea is to find a continuous time Markov process {Xt, t > 0} for which P is an equilibrium distribution for each . Now let L be the infinitesimal generator of {Yt} and choose a family of statistics {S(x)} belonging to the domain D of L. Each of these produces an estimating function (LS)(x) and it is easily checked that these have zero mean. Baddeley calls the corresponding estimating equations time invariant. As an example we consider the discrete random fields X = (Xi, i G) in which the set of "sites" G is an arbitrary finite set and the site "labels" Xi take values in an arbitrary finite set. Let P(X = x) = (x), x and assume that (x) > 0 for all x , . If {Yt, t > 0} is the Gibbs sampler for (e.g., Guyon (1995, p. 211)), then the infinitesimal generator of {Yt} operating on the statistic S turns out to be (L S)(x) = iG {E [S(X) | XG\i = xG\i] - S(x)}, where XB = (Xi, i B) denotes the restriction of X to B G. The corresponding time invariant estimator is thus the solution of 1 |G| iG E [S(X) | XG\i = xG\i] = S(x) where |G| is the number of "sites" in G. For more details see Baddeley (1995). Questions of choice of family and of optimality are largely open. 13.2 Solution of Estimating Equations Iterative numerical methods are very often necessary for obtaining estimators from estimating equations. Detailed discussion of computational procedures is outside the scope of this monograph. 
However, there are various points which can usefully be noted. 202 COMPLEMENTS AND STRATEGIES Preliminary transformation of the parameter so that the covariance matrix is approximately diagonal can be very helpful. Also, if the dimension of the problem can be reduced via expressing some of the components explicitly in terms of others, then this is usually advantageous. For details on computational methods the reader is referred to Thisted (1988). In particular, Chapter 3, Section 10 of this reference deals with GLIM and generalized least squares methods and Chapter 4 with the Newton-Raphson method and Fisher's method of scoring. These are perhaps the most widely used methods. For going beyond the primary tool of GLIM for possibly nonlinear models see Gay and Welsh (1988) and Nelder and Pregibon (1987). Let G(^) = 0 be an estimating equation. Then, the Newton-Raphson and scoring iterative schemes can be defined by ^ (r+1) = ^ (r) - ( ˙G(^ (r) ))-1 G(^ (r) ) and ^ (r+1) = ^ (r) - (E ˙G(^ (r) ))-1 G(^ (r) ), respectively. The former uses the observed - ˙G() and the latter the expected -E ˙G() which, when G() is a standardized estimating function, equals the covariance matrix EG() G () These schemes often work well, with rapid convergence, especially if good starting values are chosen. A simple method of estimation, such as the method of moments, if available, can provide suitable starting values. An alternative approach which has the attraction of not involving matrices of partial derivatives of G has been suggested by Mak (1993). The idea here is as follows. For any , let H(, ) = E G(). Then, if ^ (0) is a given starting value, the sequence {^ (r) , r = 0, 1, . . . , } is defined via G(^ (r) ) = H(^ (r+1) , ^ (r) ). The properties of this scheme, which is equivalent to the method of scoring when H is linear in , have been investigated in Mak (1993). In particular, it is shown, under the usual kinds of regularity conditions, that if ^ is the required estimator, then ^ (r) - ^ = Op(n-r/2 ), n being the sample size and Op denoting order in probability. 13.3 Multiple Roots 13.3.1 Introduction For maximum likelihood estimation the accepted practice is to use the likelihood itself to discriminate between multiple roots. However, in cases where 13.3. MULTIPLE ROOTS 203 the estimating equation G() = 0, say, has been derived from other principles, there has been uncertainty about how to choose the appropriate root (e.g., Stefanski and Carroll (1987, Section 2.3), McCullagh (1991)). This has provided motivation for the search for scalar potential functions from which the estimating equation can be derived and, in particular, the development of a conservative quasi-score (Li and McCullagh (1994)), which, albeit with some loss in efficiency in general, is always the derivative of a scalar function. Recently Li (1993) and Hanfelt and Liang (1995) have developed approximate likelihood ratio methods for quasi-likelihood settings. In each of these papers it is shown that, under various regularity conditions the approximate likelihood ratio can be used to discriminate asymptotically, with probability one, to allow the choice of the correct root. To use these methods it is necessary that the estimating functions, G(s) (), say, are standardized to have the likelihood score property E - ˙G (s) () = E G(s) () G(s) () . However, unless has small dimension, the process of standardization may be tedious if explicit analytical expressions are desired. 
For example, the method of moments, or the eliminiation of incidential parameters, typically lead to equations which are not standardized. Furthermore, a wide range of estimating functions are used in practice, and many of these have no physically convincing interpretation. For example, if is 1-dimensional and G has an even number of roots, then its integral must increase steadily at one or other end of the range of . In many examples the integral is unbounded above, implying that the appropriate root is at a local rather than a global maximum. For these reasons we propose three simple direct methods which do not involve the construction of a potential function. This is clearly advantageous, for example, when (i) the estimating function is not optimal (for example being derived by the method of moments), or (ii) many incidental parameters are to be eliminated (e.g. for functional relationships). Also, conceptually unsatisfying issues such as the arbitrariness of paths for integrals defining approximate likelihood ratios in non-conservative cases are then avoided. The results of this section are from Heyde and Morton (1997). The methods which we advocate for choosing the correct root of G() = 0 are: (1) examining the asymptotics to see which root provides a consistent result; (2) picking the root for which ˙G() behaves asymptotically as its expected value E{ ˙G()}; and (3) using a least squares or goodness of fit criterion to select the best root. 204 COMPLEMENTS AND STRATEGIES The motivation for Method 2 comes from generalization of the usual technique for identifying the maximum of a quasi-likelihood where - ˙G() and -E ˙G() have interpretations as empirical and expected information respectively. The principles which underly the advocated methods are simple and transparent and we give three examples to illustrate their use. This is followed by a discussion of the underlying theoretical considerations. 13.3.2 Examples We begin with some simple examples, the first of which concerns the estimation of the mean of a non-normal distribution using the first two moments. Let X1, . . . , Xn be i.i.d. with mean (- < < ) and known variance 2 and consider the estimating function G() = n-1 n i=1 {a(Xi - ) + (Xi - )2 - 2 }, where a is a constant. Certainly we can calculate, say, () = 0 G(u) du, but to regard this cubic in as an appropriate likelihood surrogate requires a leap of faith. Nevertheless, the multiple root issue is simply resolved by each of the approaches mentioned above. Let = 1 4 a2 + X2 - n-1 n i=1 X2 i + 2 , X being the sample mean. Then, on the set { > 0} the roots of G are ^1 = 1 2 a + X - 1/2 , ^2 = 1 2 a + X + 1/2 , and the strong law of large numbers gives ^1 a.s. - 0, ^2 a.s. - 0 + a, 0 being the true value. Thus, the root ^1 is the correct one. This is Method 1. Using Method 2, we see that - ˙G() = a + 2( X - ), and E ˙G() = -a so that the ratio ˙G()/E ˙G() tends to 1 when = ^1 and to -1 when = ^2, identifying ^1 as the correct root. For Method 3 we use the criterion S() = n-1 n i=1 (Xi - )2 , and see that S(^2) - S(^1) = 2a 1 2 a2 n 13.3. MULTIPLE ROOTS 205 as n , from which we again prefer ^1 to ^2. The next example concerns estimation of the angle in circular data. This problem, which has been studied by McCullagh (1991), concerns the model in which (X1i, X2i), i = 1, . . . , n are i.i.d. multivariate normal, with mean vector (r cos , r sin ) and covariance matrix I. 
First we consider r = 1 and use the quasi-score (also the score) function Q() = -( X1 - cos ) sin + ( X2 - sin ) cos = - X1 sin + X2 cos . Then Q() = 0 possesses two solutions ^1 and ^2 with ^2 = ^1 + , ^1 = tan-1 ( X2/ X1) being the root in (-/2, /2). Assume that -/2 < 0 < /2 is the true value. Then the law of large numbers shows that ^1 is the consistent root. This is the approach of Method 1. For Method 2, we see that ˙Q() = - X2 sin - X1 cos , while E ˙Q() = -1 and ˙Q(^1)/E^1 ˙Q(^1) 1, ˙Q(^2)/E^2 ˙Q(^2) -1, so we choose ^1. Finally, for Method 3 we can take the sum of squares criterion S() = n i=1 (X1i - cos )2 + (X2i - sin )2 , and S(^2) - S(^1) = 4n( X1 cos ^1 + X2 sin ^2) 4n if ^1 is in the same half circle as 0. There is also an interesting sequel to this example. If r is also to be estimated, then the constraint r > 0 removes the ambiguity in and there is only one solution, which is ^1. In this case, two parameters are simpler than one! These have been toy examples and we now move to something more substantial. Here we follow up on the example on a functional normal regression model treated in Section 2.3 of Stefanski and Carroll (1987). The setting is one in which Y has a normal distribution with mean + u and variance 2 . It is supposed that u cannot be observed but that independent measurements X on it are available and var (X | u) = 2 where is known. For this setting, Stefanski and Carroll derive the estimating function (cf. their equation (2.15)) G() = - SY X + (SY Y - SXX) + SY X 206 COMPLEMENTS AND STRATEGIES for where SY Y = n-1 n i=1 (Yi - Y )2 , SY X = (SY X1 , . . . , SY Xp ) , SXX = (SXiYj ), with SY Xi = n-1 n k=1 (Yk - Y )(Xik - Xi), SXiXj = n-1 n k=1 (Xik - Xi)(Xjk - Xj). In contrast to Stefanski and Carroll we shall not specialize to the case where is scalar (p = 1). For Method 1 we need to examine the asymptotics. We suppose that SU,U = n-1 n i=1 (ui - u)(ui - u) var u, say, as n and then, using the law of large numbers in an obvious notation which covers the structural model case as well as the functional one, SY X a.s. - cov (X, Y ), SXX a.s. - var X, SY Y a.s. - var Y. Also var Y = E(var (Y | u)) + var (E(Y | u)) = 2 + (var u) , var X = E(var (X | u)) + var (E(X | u)) = 2 + var u, cov (X, Y ) = E(cov (X, Y ) | u)) + cov (E(X | u), E(Y | u)) = cov (u, + u) = (var u) . Thus, asymptotically, G(^) -^ (var u)^ + ( (var u) - var u)^ + (var u) = (-^ - I)(var u)(^ - ). (13.1) One root of (13.1) is ^ = and any other requires that (I +^ ) be singular and hence that ^ = -1 since, otherwise, (I + ^ )-1 = I - ^ 1 + ^ . 13.3. MULTIPLE ROOTS 207 Any such other root satisfies ^ - = (var u)-1 [(I + ^ )(I + ^ ) - I]Z, where Z is an arbitrary vector and, noting that we can take (I + ^ )= I + ^ , ^ - = (var u)-1 ^ Z. (13.2) Next, using ^ = -1 in (13.2), we have (var u)(^ - ) = - Z, (13.3) and, setting Z = s, (13.2) can be rewritten as ^ = [I - (var u)-1 s]-1 , (13.4) provided s det{(var u)-1 } = 1 so that I - (var u)-1 s is nonsingular. Note that, using (13.4) in (13.3), (I - (var u)-1 s)-1 = -1, (13.5) and s is obtained from this equation. The equations (13.4), (13.5) then specify the second root. When solving the estimating equation G() = 0, it is clear that the correct root is close to (SXX - ^2 )-1 SY X which consistently estimates when ^2 = SY Y - SY X(SXX - ^2 )-1 SY X. If we have an incorrect root ^, then SY X(SXX - ^2 )-1 ^ -1. Thus, Method 1 should pick the correct root given a suitably large sample. 
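Before moving to Method 2 for the present model, it may help to see both screening devices at work on the simpler circular-data example considered earlier. The sketch below, with unit radius and identity error covariance as in that example, locates the two roots, checks which of them is consistent (Method 1), and evaluates the ratio of the observed to the expected derivative of the quasi-score at each root (Method 2).

```python
import numpy as np

rng = np.random.default_rng(5)

theta0, n = 0.8, 500
x1 = np.cos(theta0) + rng.standard_normal(n)   # unit radius, identity covariance, as in the example
x2 = np.sin(theta0) + rng.standard_normal(n)
x1bar, x2bar = x1.mean(), x2.mean()

# The two roots of Q(theta) = -x1bar*sin(theta) + x2bar*cos(theta) = 0.
theta1 = np.arctan2(x2bar, x1bar)              # root in the direction of the sample mean vector
theta2 = theta1 + np.pi

def ratio(theta):
    """Method 2 diagnostic: Qdot(theta) / E_theta Qdot(theta), where E_theta Qdot(theta) = -1."""
    qdot = -x2bar * np.sin(theta) - x1bar * np.cos(theta)
    return qdot / (-1.0)

print(theta1, ratio(theta1))   # theta1 near theta0 (Method 1), ratio near +1 (Method 2)
print(theta2, ratio(theta2))   # spurious root near theta0 + pi, ratio near -1
```

The ratio is close to +1 at the consistent root and close to -1 at the spurious one, in line with the general discussion.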
To use Method 2 in this context we find ˙G() = -SY X - SY X + (SY Y - SXX) -0(var u) - (var u)0 +(0(var u)0 - var u) almost surely, 0 denoting the true value, and E ˙G() -( + I)var u almost surely, so that if ^0 is a root which is consistent for 0, ˙G(^0)(E^0 ˙G(^0))-1 p - I, 208 COMPLEMENTS AND STRATEGIES while if ^1 is a root which is consistent for 1 = 0, ˙G(^1)(E^1 ˙G( ^1))-1 p - (10 + I)(11 + I) + (1 - 0) (var u) 0 (var u)-1 (11 + I)-1 = I + 10 + (1 - 0) (var u) 0 (var u)-1 (11 + I)-1 = I - s(11 + I) (var u)(11 + I)-1 = I since s = 0. Here we have used, in the algebra, the intermediate results (1 - 0) (var u) 0 = -s and 10(11 + I) var u = -s11. The ratio thus detects the correct root. 13.3.3 Theory Informal general explanations of the properties observed in the examples treated in Section 13.3.2 are straightforward to provide. The asymptotics of an estimating function G may usually be examined directly. Suppose, for example, that the data are {Xt, t T}, T being some index set, and that G is of the form GT = G(AT ({Xt, t T}); ) with AT ({Xt, t T}) a.s. - A(0) in the limit, 0 denoting the true value of . Then, the equation G(A(0); ) = 0 gives a clear indication of the large sample behavior of the roots of G() = 0. Results of the kind AT ({Xt, t T}) a.s. - A(0) are typically established using law of large number or ergodic theorem arguments. This is the basis of Method 1. To use Method 2, we assume that the matrix derivative ˙G() H(0, ) in probability, where H(0, ) = E0 ( ˙G()). Both quantities should be unbounded as sampling continues and their difference has zero expectation. Then, if ^0 is a consistent estimator of 0, (H(^0, ^0))-1 ˙G(^0) I in probability. But, if ^1 is a consistent estimator of 1 = 0, then (H(^1, ^1))-1 ˙G(^1) (H(1, 1))-1 H(0, 1), 13.3. MULTIPLE ROOTS 209 which in most problems is different from Is the main exception being where H(0, 1) is constant in 0. Apart from any such exception, (H(^, ^))-1 ˙G(^) can be used to identify the correct root. The matrix may not need to be examined in detail in order to discard incorrect roots. We could, for example, inspect the signs of the diagonal elements of ˙G(^) and H(^, ^). If we have a quasiscore, Q(), the asymptotic positive definiteness of - ˙Q(^) for ^ a consistent estimator of 0 may be a useful diagnostic. We note that for likelihood scores, - ˙Q(^) > 0 identifies a local maximum of the likelihood. Thus, Method 2 could be regarded as a generalization of this idea to non-standardized and nonoptimal estimating functions. The positive definiteness of - ˙Q is particularly useful when ˙Q depends on further variance parameters which are estimated through its magnitude. Finally, for the goodness of fit approach of Method 3 we assume that we have some acceptable criterion for the goodness of fit of the data. A modest requirement might be that the expectation of the criterion is a minimum when = 0. While we do not use the minimization of the criterion for estimating 0, it can be used to determine preference between the roots of G. Least squares, or some generalization of it, seems appropriate for this purpose. An informal justification can easily be given. We shall consider the situation in which there are consistent roots and possibly others. Let 0 denote the true value, for which there is a consistent estimator ^0 and suppose that ^1 p - 1 = 0. Then, for a least squares type setting with minimization criterion S() = n i=1 (Xi - f()) (Xi - f()), say, where X1, . . . 
$X_1, \ldots, X_n$ are the data and $EX_i = f(\theta)$, we have
\[ n^{-1}[S(\hat\theta_1) - S(\hat\theta_0)] = n^{-1}(f(\hat\theta_0) - f(\hat\theta_1))'\sum_{i=1}^n (2X_i - f(\hat\theta_0) - f(\hat\theta_1)) \xrightarrow{p} (f(\theta_0) - f(\theta_1))'(f(\theta_0) - f(\theta_1)), \]
using law of large number considerations. If, on the other hand, $\hat\theta_2$ is not a consistent root, then
\[ n^{-1}[S(\hat\theta_2) - S(\hat\theta_0)] = n^{-1}(f(\hat\theta_0) - f(\hat\theta_2))'\sum_{i=1}^n (2X_i - f(\hat\theta_0) - f(\hat\theta_2)) \approx (f(\theta_0) - f(\hat\theta_2))'(f(\theta_0) - f(\hat\theta_2)) \]
in probability, which is $O_p(1)$, and the discrimination between roots should still be clear. Note also that $n^{-1}S(\hat\theta_0)$ will typically converge in probability to a constant.

13.4 Resampling Methods

Resampling methods, especially the bootstrap and jackknife, have proved to be highly useful in practice for obtaining confidence intervals for an unknown $\theta$, particularly when the sample size is limited. The methods were originally developed for i.i.d. random variables, but considerable effort has been expended recently in providing adaptations to various dependent variable contexts. Nevertheless, the ideas behind resampling always require a focus on variables within the model of interest which do not depart drastically from stationarity. A detailed discussion of this subject is outside the scope of this monograph, in part because only a small proportion of the current literature involves estimating functions, but the prospects for future work are clear.

Both jackknife and bootstrap methods of providing confidence intervals for estimating functions which include martingales are discussed by Lele (1991b). The jackknife results come from Lele (1991a) and the bootstrap results, which are rather less complete, adapt work of Wu (1986) and Liu (1988). One may expect the technology associated with the moving block bootstrap to be able to deal with estimating functions based on stationary variables (e.g., Künsch (1989) for weakly dependent variables; Lahiri (1993), (1995) for strongly dependent ones). The idea behind the moving block bootstrap is to resample blocks of observations, rather than single observations as with the ordinary bootstrap, thereby adequately preserving the dependence of the data within blocks.

References

Aalen, O. O. (1977). Weak convergence of stochastic integrals related to counting processes. Z. Wahrsch. Verw. Geb. 38, 261-277.

Aalen, O. O. (1978). Nonparametric inference for a family of counting processes. Ann. Statist. 6, 701-726.

Aase, K. K. (1983). Recursive estimation in nonlinear time series of autoregressive type. J. Roy. Statist. Soc. Ser. B 45, 228-237.

Adenstadt, R. K. (1974). On large-sample estimation for the mean of a stationary random sequence. Ann. Statist. 2, 1095-1107.

Adenstadt, R. K., and Eisenberg, B. (1974). Linear estimation of regression coefficients. Quart. Appl. Math. 32, 317-327.

Aitchison, J., and Silvey, S. D. (1958). Maximum likelihood estimation of parameters subject to restraint. Ann. Math. Statist. 29, 813-828.

Andersen, P. K., Borgan, O., Gill, R. D., and Keiding, N. (1982). Linear nonparametric tests for comparison of counting processes, with applications to censored survival data. Internat. Statist. Rev. 50, 219-258.

Anh, V. V. (1988). Nonlinear least squares and maximum likelihood estimation for a heteroscedastic regression model. Stochastic Process. Appl. 29, 317-333.

Baddeley, A. J. (1995). Time-invariance estimating equations. Research Report. Dept. Mathematics, Univ. Western Australia.

Bailey, N. T. J. (1975). The Mathematical Theory of Infectious Diseases. Griffin, London.

Barndorff-Nielsen, O. E., and Cox, D. R. (1994). Inference and Asymptotics. Chapman and Hall, London.
Barndorff-Nielsen, O. E., and Srensen, M. (1994). A review of some aspects of asymptotic likelihood theory for stochastic processes. Internat. Statist. Rev. 62, 133­165. 212 REFERENCES Basawa, I. V. (1985). Neyman-Le Cam tests based on estimating functions. In L. Le Cam and R. A. Olshen Eds., Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer 2, Wadsworth, Belmont, CA, 811­825. Basawa, I. V. (1991). Generalized score tests for composite hypotheses. In V. P. Godambe, Ed. Estimating Functions, Oxford Science Publications, Oxford, 121­131. Basawa, I. V., and Brockwell, P. J. (1984). Asymptotic conditional inference for regular non-ergodic models with an application to autoregressive processes. Ann. Statist. 12, 161­171. Basawa, I. V., Huggins, R. M., and Staudte, R. G. (1985). Robust tests for time series with an application to first-order autoregressive processes. Biometrika 72, 559­571. Basawa, I. V., and Koul, H. L. (1988). Large-sample statistics based on quadratic dispersion. Internat. Statist. Rev. 56, 199­219. Basawa, I. V., and Prakasa Rao, B. L. S. (1980). Statistical Inference for Stochastic Processes. Academic Press, London. Basawa, I. V., and Scott, D. J. (1983). Asymptotic Optimal Inference for NonErgodic Models. Lecture Notes in Statistics 17, Springer, New York. Becker, N. G., and Heyde, C. C. (1990). Estimating population size from multiple recapture-experiments. Stochastic Process. Appl. 36, 77­83. Beran, J. (1989). A test of location for data with slowly decaying serial correlations. Biometrika 76, 261­269. Berliner, L. M. (1991). Likelihood and Bayesian prediction for chaotic systems. J. Amer. Statist. Assoc. 86, 938­952. Besag, J. E. (1975). Statistical analysis of non-lattice data. The Statistician 24, 179­195. Bhapkar, V. P. (1972). On a measure of efficiency in an estimating equation. Sankhya Ser. A 34, 467­472. Bhapkar, V. P. (1989). On optimality of marginal estimating equations. Technical Report No. 274, Dept. Statistics, Univ. Kentucky. Bibby, B. M., and Srensen, M. (1995). Martingale estimation functions for discretely observed diffusion processes. Bernoulli 1, 17­39. Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York. Bradley, E. L. (1973). The equivalence of maximum likelihood and weighted least squares estimates in the exponential family. J. Amer. Statist. Assoc. 68, 199­200. REFERENCES 213 Bustos, O. H. (1982). General M-estimates for contaminated p-th order autoregressive processes: consistency and asymptotic normality. Robustness in autoregressive processes. Z. Wahrsch. Verw. Geb. 59, 491­504. Carroll, R. J., and Stefanski, L. A. (1990). Approximate quasi-likelihood estimation in models with surrogate predictors. J. Amer. Statist. Assoc. 85, 652­663. Chan, N. H., and Wei, C. Z. (1988). Limiting distributions of least squares estimates of unstable autoregressive processes. Ann. Statist. 16, 367­401. Chandrasekar, B., and Kale, B. K. (1984). Unbiased statistical estimation for parameters in presence of nuisance parameters. J. Statistical Planning Inf. 9, 45­54. Chen, K., and Heyde, C. C. (1995). On asymptotic optimality of estimating functions. J. Statistical Planning Inf. 48, 102­112. Chen, Y. (1991). On Quasi-likelihood Estimations. PhD thesis, University of Wisconsin-Madison. Cheng, R. C. H., and Traylor, L. (1995). Non-regular maximum likelihood problems (with discussion). J. Roy. Statist. Soc. Ser. B 57, 3­44. Choi, Y. J., and Severo, N. C. (1988). 
An approximation for the maximum likelihood estimator for the infection rate in a simple stochastic epidemic. Biometrika 75, 392­394. Chung, K. L., and Williams, R. J. (1983). Introduction to Stochastic Integration. Birkh¨auser, Boston. Comets, F., and Janˇzura, M. (1996). A central limit theorem for conditionally centered random fields with an application to Markov fields (Preprint). Cox, D. R., and Hinkley, D. V. (1974). Theoretical Statistics. Chapman and Hall, London. Cox, J. S., Ingersoll, J. E., and Ross, S. A. (1985). A theory of term structure of interest rates. Econometrica 53, 363­384. Cram´er, H. (1946). Methods of Mathematical Statistics. Princeton University Press, Princeton. Crowder, M. (1987). On linear and quadratic estimating functions. Biometrika 74, 591­597. Cutland, N. J., Kopp, P. E., and Willinger, W. (1995). Stock price returns and the Joseph effect: a fractional version of the Black-Scholes model. Seminar on Stochastic Analysis, Random Fields and Applications (Ascona 1993), Progr. Probab. 36, Birkh¨auser, Basel, 327­351. 214 REFERENCES Dahlhaus, R. (1988). Efficient parameter estimation for self-similar processes. Ann. Statist. 17, 1749­1766. Daley, D. J. (1969). Integral representations of transition probabilities and serial covariances of certain Markov chains. J. Appl. Prob. 6, 648­659. Davidian, M., and Carroll, R. J. (1987). Variance function estimation. J. Amer. Statist. Assoc. 82, 1079­1091. Davis, R. A., and McCormick, W. P. (1989). Estimation for first order autoregressive processes with positive or bounded innovations. Stochastic Process. Appl. 31, 237­250. Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the E-M algorithm. J. Roy. Statist. Soc. Ser. B 39, 1­38. Denby, L., and Martin, R. L. (1979). Robust estimation of the first order autoregressive parameter. J. Amer. Statist. Assoc. 74, 140­146. Desmond, A. F. (1991). Quasi-likelihood, stochastic processes and optimal estimating functions. In V. P. Godambe, Ed., Estimating Functions, Oxford Science Publications, Oxford, 133­146. Desmond, A. F. (1996). Optimal estimating functions quasi-likelihood and statistical modelling. J. Statistical Planning Inf., in press. Dion, J.-P., and Ferland, R. (1995). Absolute continuity, singular measures and asymptotics for estimators. J. Statistical Planning Inf. 43, 235­242. Doukhan, P. (1994). Mixing. Lecture Notes in Statistics 85, Springer, New York. Duffie, D. (1992). Dynamic Asset Pricing Theory. Princeton Univ. Press, Princeton. Durbin, J. (1960). Estimation of parameters in time series regression models. J. Roy. Statist. Soc. Ser. B 22, 139­153. Efron, B., and Hinkley, D. V. (1978). Assessing the accuracy of the maximum likelihood estimator: observed versus Fisher information. Biometrika 65, 457­487. Elliott, R. J. (1982). Stochastic Calculus and Applications. Springer, New York. Elliott, R. J., Aggoun, L., and Moore, J. B. (1994). Hidden Markov Model Estimation and Control. Springer, New York. Feigin, P. D. (1977). A note on maximum likelihood estimation for simple branching processes. Austral. J. Statist. 19, 152­154. REFERENCES 215 Feigin, P. D. (1985). Stable convergence of semimartingales. Stochastic Process. Appl. 19, 125­134. Feigin, P. D., Tweedie, R. L., and Belyea, C. (1983). Weighted area techniques for explicit parameter estimation in multi-stage models. Austral. J. Statist. 25, 1­16. Firth, D. (1987). On efficiency of quasi-likelihood estimation. Biometrika 74, 233­246. Firth, D. (1993). 
Bias reduction for maximum likelihood estimates. Biometrika 80, 27­38. Firth, D., and Harris, I. R. (1991). Quasi-likelihood for multiplicative random effects. Biometrika 78, 545­555. Fitzmaurice, G. M., Laird, N. M., and Rotnitzky, A. G. (1993). Regression models for discrete longitudinal responses. Statistical Science 8, 284­309. Fox, R., and Taqqu, M. S. (1986). Large sample properties of parameter estimates for strongly dependent stationary Gaussian time series. Ann. Statist. 14, 517­532. Fox, R., and Taqqu, M. S. (1987). Central limit theorems for quadratic forms in random variables having long-range dependence. Probab. Th. Rel. Fields 74, 213­240. F¨urth, R. (1918). Statistik und Wahrscheinlichkeitsnachwirkung. Physik Z. 19, 421­426. Gastwirth, J. I., and Rubin, H. (1975). The behavior of robust estimators on dependent data. Ann. Statist. 3, 1070­1100. Gauss, C. F. (1880). Theoria Combationis Observatorium Erroribus Minimus Obnaie Part 1, 1821, Part 2, 1823; Suppl. 1826. In Werke 4, G¨ottingen, 1­108. Gay, D. M., and Welsh, R. E. (1988). Maximum-likelihood and quasi-likelihood for nonlinear exponential family regression models. J. Amer. Statist. Assoc. 83, 990­998. Glynn, P., and Iglehart, D. L. (1990). Simultaneous output analysis using standardized time series. Math. Oper. Res. 15, 1­16. Godambe, V. P. (1960). An optimum property of regular maximum-likelihood estimation. Ann. Math. Statist. 31, 1208­1211. Godambe, V. P. (1976). Conditional likelihood and unconditional optimum estimating equations. Biometrika 63, 277­284. Godambe, V. P. (1985). The foundations of finite sample estimation in stochastic processes. Biometrika 72, 419­428. 216 REFERENCES Godambe, V. P. (Ed.) (1991) Estimating Functions. Oxford Science Publications, Oxford. Godambe, V. P. (1994). Linear Bayes and optimal estimation. Technical Report STAT-94-11, Dept. Statistical & Actuarial Sciences, University of Waterloo. Godambe, V. P., and Heyde C. C. (1987). Quasi-likelihood and optimal estimation. Internat. Statist. Rev. 55, 231­244. Godambe, V. P., and Kale, B. K. (1991). Estimating functions: an overview. In V. P. Godambe, Ed., Estimating Functions, Oxford Science Publications, Oxford, 3­20. Godambe, V. P., and Thompson, M. E. (1974). Estimating equations in the presence of a nuisance parameter. Ann. Statist. 2, 568­571. Godambe, V. P., and Thompson, M. E. (1986). Parameters of superpopulation and survey population: their relationship and estimation. Internat. Statist. Rev. 54, 127­138. Godambe, V. P., and Thompson, M. E. (1989). An extension of quasilikelihood estimation. J. Statistical Planning Inf. 22, 137­152. Greenwood, P. E., and Wefelmeyer, W. (1991). On optimal estimating functions for partially specified counting process models. In V. P. Godambe, Ed., Estimating Functions, Oxford Science Publications, Oxford, 147­ 168. Grenander, U. (1981). Abstract Inference. Wiley, New York. Grenander, U., and Szeg¨o, G. (1958). Toeplitz Forms and their Applications. Univ. Calif. Press, Berkeley and Los Angeles. Guyon, X. (1982). Parameter estimation for a stationary process on a ddimensional lattice. Biometrika 69, 95­105. Guyon, X. (1995). Random Fields on a Network, Modeling, Statistics and Applications. Springer, New York. Halfin, S. (1982). Linear estimators for a class of stationary queueing processes. Oper. Res. 30, 515­529. Hall, P., and Heyde, C. C. (1980). Martingale Limit Theory and its Application. Academic Press, New York. Hanfelt, J. J., and Liang, K.-Y. (1995). 
Approximate likelihood ratios for general estimating functions. Biometrika 82, 461­477. Hannan, E. J. (1970). Multiple Time Series. Wiley, New York. REFERENCES 217 Hannan, E. J. (1973). The asymptotic theory of linear time series models. J. Appl. Prob. 10, 130­145. Hannan, E. J. (1974). Correction to Hannan (1973). J. Appl. Prob. 11, 913. Hannan, E. J. (1976). The asymptotic distribution of serial covariances. Ann. Statist. 4, 396­399. Harville, D. A. (1977). Maximum likelihood approaches to variance components estimation and related problems. J. Amer. Statist. Assoc. 72, 320­338. Heyde, C. C. (1986). Optimality in estimation for stochastic processes under both fixed and large sample conditions. In Yu. V. Prohorov, V. A. Statulevicius, V. V. Sazonov and B. Grigelionis, Eds., Probability Theory and Mathematical Statistics. Proceedings of the Fourth Vilnius Conference, Vol. 1, VNU Sciences Press, Utrecht, 535­541. Heyde, C. C. (1987). On combining quasi-likelihood estimating functions. Stochastic Process. Appl. 25, 281­287. Heyde, C. C. (1988a). Fixed sample and asymptotic optimality for classes of estimating functions. Contemp. Math. 80, 241­247. Heyde, C. C. (1988b). Asymptotic efficiency results for the method for moments with application to estimation for queueing processes. In O. J. Boxma and R. Syski, Eds., Queueing Theory and its Application. Liber Amicorum for J. W. Cohen. CWI Monograph No. 7, North-Holland, Amsterdam, 405­412. Heyde, C. C. (1989a). On efficiency for quasi-likelihood and composite quasilikelihood methods. In Y. Dodge, Ed., Statistical Data Analysis and Inference, Elsevier, Amsterdam, 209­213. Heyde, C. C. (1989b). Quasi-likelihood and optimality for estimating functions: some current unifying themes. Bull. Internat. Statist. Inst. 53, Book 1, 19­29. Heyde, C. C. (1992a). On best asymptotic confidence intervals for parameters of stochastic processes. Ann. Statist. 20, 603­607. Heyde, C. C. (1992b). Some results on inference for stationary processes and queueing systems. In U. N. Bhat and I. V. Basawa, Eds., Queueing and Related Models. Oxford Univ. Press, Oxford, 337­345. Heyde, C. C. (1993). Quasi-likelihood and general theory of inference for stochastic processes. In A. Obretenov and V. T. Stefanov, Eds., 7th International Summer School on Probability Theory and Mathematical Statistics, Lecture Notes, Science Culture Technology Publishing, Singapore, 122­152. 218 REFERENCES Heyde, C. C. (1994a). A quasi-likelihood approach to estimating parameters in diffusion type processes. In J. Galambos and J. Gani, Eds., Studies in Applied Probability, J. Applied Prob. 31A, 283­290. Heyde, C. C. (1994b). A quasi-likelihood approach to the REML estimating equations. Statistics & Probability Letters 21, 381­384. Heyde, C. C. (1996). On the use of quasi-likelihood for estimation in hiddenMarkov random fields. J. Statistical Planning Inf. 50, 373­378. Heyde, C. C., and Gay, R. (1989). On asymptotic quasi-likelihood. Stochastic Process. Appl. 31, 223­236. Heyde, C. C., and Gay, R. (1992). Thoughts on modelling and identification of random processes and fields subject to possible long-range dependence. In L. H. Y. Chen, K. P. Choi, K. Hu and J.-H. Lou, Eds., Probability Theory, de Gruyter, Berlin, 75­81. Heyde, C. C., and Gay, R. (1993). Smoothed periodogram asymptotics and estimation for processes and fields with possible long-range dependence. Stochastic Process. Appl. 45, 169­182. Heyde, C. C., and Lin, Y.-X. (1991). Approximate confidence zones in an estimating function context. 
In V. P. Godambe, Ed., Estimating Functions, Oxford Science Publications, Oxford, 161­168. Heyde, C. C., and Lin, Y.-X. (1992). On quasi-likelihood methods and estimation for branching processes and heteroscedastic regression models. Austral. J. Statist. 34, 199­206. Heyde, C. C., and Morton, R. (1993). On constrained quasi-likelihood estimation. Biometrika 80, 755­761. Heyde, C. C., and Morton, R. (1995). Quasi-likelihood and generalizing the E-M algorithm. J. Roy. Statist. Soc. Ser. B 58, 317-327. Heyde, C. C., and Morton, R. (1997). Multiple roots and dimension reduction issues for general estimating equations (Preprint). Heyde, C. C., and Seneta, E. (1972). Estimation theory for growth and immigration rates in a multiplicative process. J. Appl. Prob. 9, 235­256. Heyde, C. C., and Seneta, E. (1977). I. J. Bienaym´e: Statistical Theory Anticipated. Springer, New York. Hoffmann-Jrgensen, J. (1994). Probability With a View Towards Statistics. Vol. II. Chapman and Hall, New York. Hotelling, H. (1936). Relations between two sets of variables. Biometrika 28, 321­377. REFERENCES 219 Huber, P. J. (1981). Robust Statistics. Wiley, New York. Hutton, J. E., and Nelson, P. I. (1984). A mixing and stable central limit theorem for continuous time martingales. Technical Report No. 42, Kansas State University. Hutton, J. E., and Nelson, P. I. (1986). Quasi-likelihood estimation for semimartingales. Stochastic Process. Appl. 22, 245­257. Hutton, J. E., Ogunyemi, O. T., and Nelson, P. I. (1991). Simplified and twostage-quasi-likelihood estimators. In V. P. Godambe, Ed., Estimating Functions, Oxford Science Publications, Oxford, 169­187. Ibragimov, I. A., and Linnik, Yu. V. (1971). Independent and Stationary Sequences of Random Variables. Wolters-Noordhoff, Groningen. Jiang, J. (1996). REML estimation: asymptotic behaviour and related topics. Ann. Statist. 24, 255­286. Judge, G. G., and Takayama, T. (1966). Inequality restrictions in regression analysis. J. Amer. Statist. Assoc. 61, 166­181. Kabaila, P. V. (1980). An optimal property of the least squares estimate of the parameter of the spectrum of a purely non-deterministic time series. Ann. Statist. 8, 1082­1092. Kabaila, P. V. (1983). On estimating time series parameters using sample autocorrelations. J. Roy. Statist. Soc. Ser. B 45, 107­119. Kallianpur, G. (1983). On the diffusion approximation to a discountinuous model for a single neuron. In P. K. Sen, Ed., Contributions to Statistics: Essays in Honor of Norman L. Johnson. North-Holland, Amsterdam, 247­258. Kallianpur, G., and Selukar, R. S. (1993). Estimation of Hilbert space valued random variables by the method of sieves. In J. K. Ghost et al., Eds., Statistics and Probability, A Raghu Rag Bahadur Festschrift, Wiley Eastern, New Delhi, 325­347. Karlin, S., and Taylor, H. M. (1981). A Second Course in Stochastic Processes. Academic Press, New York. Karr, A. F. (1987). Maximum likelihood estimation in the multiplicative intensity model via sieves. Ann. Statist. 15, 473­490. Karson, M. J. (1982). Multivariate Statistical Methods. Iowa State Univ. Press, Ames, IA. Kaufmann, H. (1987). On the strong law of large numbers for multivariate martingales. Stochastic Process. Appl. 26, 73­85. 220 REFERENCES Kaufmann, H. (1989). On weak and strong consistency of the maximum likelihood estimator in stochastic processes (Preprint). Kessler, M., and Srensen, M. (1995). Estimating equations based on eigenfunctions for a discretely observed diffusion process. Research Report No. 332, Dept. 
Theoretical Statistics, Univ. Aarhus. Kimball, B. F. (1946). Sufficient statistical estimation functions for the parameters of the distribution of maximum values. Ann. Math. Statist. 17, 299­309. Kloeden, P. E., and Platen, E. (1992). Numerical Solution of Stochastic Differential Equations. Springer, Berlin. Kloeden, P. E., Platen, E., Schurz, H., and Srensen, M. (1996). On effects of discretization on estimators of drift parameters for diffusion processes. J. Appl. Prob. 33, 1061-1076. K¨unsch, H. (1984). Infinitesimal robustness for autoregressive processes. Ann. Statist. 12, 843­863. K¨unsch, H. R. (1989). The jackknife and bootstrap for general stationary observations. Ann. Statist. 17, 1217­1241. Kulkarni, P. M., and Heyde, C. C. (1987). Optimal robust estimation for discrete time stochastic processes. Stochastic Process. Appl. 26, 267­ 276. Kulperger, R. (1985). On an optimal property of Whittle's Gaussian estimate of the parameter of the spectrum of a time series. J. Time Ser. Anal. 6, 253­259. Kutoyants, Yu., and Vostrikova, L. (1995). On non-consistency of estimators. Stochastics and Stochastic Reports 53, 53­80. Lahiri, S. N. (1993). On the moving block bootstrap under long range dependence. Statistics & Probability Letters 18, 405­413. Lahiri, S. N. (1995). On the asymptotic behaviour of the moving block bootstrap for normalized sums of heavy-tail random variables. Ann. Statist. 23, 1331­1349. Lai, T. L., and Wei, C. Z. (1982). Least squares estimates in stochastic regression models with applications to identification of dynamic systems. Ann. Statist. 10, 154­166. Laird, N. (1985). Missing information principle. In N. L. Johnson and S. Kotz, Eds., Encyclopedia of Statistical Sciences, Wiley, New York, 5, 548­552. Le Cam, L. (1990a). On the standard asymptotic confidence ellipsoids of Wald. Internat. Statist. Rev. 58, 129­152. REFERENCES 221 Le Cam, L. (1990b). Maximum likelihood: an introduction. Internat. Statist. Rev. 58, 153­171. Lele, S. (1991a). Jackknifing linear estimating equations: asymptotic theory and applications in stochastic processes. J. Roy. Statist. Soc. Ser. B 53, 253­267. Lele, S. (1991b). Resampling using estimating equations. In V. P. Godambe, Ed., Estimating Functions, Oxford Science Publications, Oxford, 295­ 304. Lele, S. (1994). Estimating functions in chaotic systems. J. Amer. Statist. Assoc. 89, 512­516. L´epingle, D. (1977). Sur le comportement asymptotique des martingales locales. Springer Lecture Notes in Mathematics 649, 148­161. Leskow, J. (1989). A note on kernel smoothing of an estimator of a periodic function in the multiplicative model. Statistics & Probability Letters 7, 395­400. Leskow, J., and Rozanski, R. (1989). Histogram maximum likelihood estimation in the multiplicative intensity model. Stochastic Process. Appl. 31, 151­159. Li, B. (1993). A deviance function for the quasi-likelihood method. Biometrika 80, 741­753. Li, B. (1996a). A minimax approach to the consistency and efficiency for estimating equations. Ann. Statist. 24, 1283­1297. Li, B. (1996b). An optimal estimating equation based on the first three cumulants (Preprint). Li, B., and McCullagh, P. (1994). Potential functions and conservative estimating equations. Ann. Statist. 22, 340­356. Liang, K. Y., and Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13­22. Lin, Y.-X. (1992). The quasi-likelihood method. PhD thesis, Australian National University. Lin, Y.-X. (1994a). 
The relationship between quasi-score estimating functions and E-sufficient estimating functions. Austral. J. Statist. 36, 303­311. Lin, Y.-X. (1994b). On the strong law of large numbers of multivariate martingales with random norming. Stochastic Process. Appl. 54, 355­360. Lin, Y.-X. (1996). Quasi-likelihood estimation of variance components of heteroscedastic random ANOVA model (Preprint). 222 REFERENCES Lin, Y.-X., and Heyde, C. C. (1993). Optimal estimating functions and Wedderburn's quasi-likelihood. Comm. Statist. Theory Meth. 22, 2341­2350. Lindsay, B. (1982). Conditional score functions: some optimality results. Biometrika 69, 503­512. Lindsay, B. G. (1988). Composite likelihood methods. Contemp. Math. 80, 221­239. Liptser, R. S., and Shiryaev, A. N. (1977). Statistics of Random Processes I. General Theory. Springer, New York. Little, R. J. A., and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York. Liu, R. Y. (1988). Bootstrap procedures under some non-i.i.d. models. Ann. Statist. 16, 1696­1708. Mak, T. K. (1993). Solving non-linear estimating equations. J. Roy. Statist. Soc. Ser. B 55, 945­955. Martin, R. D. (1980). Robust estimation of autoregressive models (with discussion). In D. R. Brillinger and G. C. Tiao, Eds, Directions of Time Series, Inst. Math. Statist., Hayward, CA, 228­262. Martin, R. D. (1982). The Cram´er-Rao bound and robust M-estimates for autoregressions. Biometrika 69, 437­442. Martin, R. D., and Yohai, V. J. (1985). Robustness in time series and estimating ARMA models. In E. J. Hannan, P. R. Krishnaiah and M. M. Rao, Eds., Handbook of Statistics 5, Elsevier Science Publishers, New York, 119­155. McCullagh, P. (1983). Quasi-likelihood functions. Ann. Statist. 11, 59­67. McCullagh, P. (1991). Quasi-likelihood and estimating functions. In D. V. Hinkley, N. Reid and E. J. Snell, Eds., Statistical Theory and Modelling. In Honour of Sir David Cox, FRS. Chapman and Hall, London, 265­286. McCullagh, P., and Nelder, J. A. (1989). Generalized Linear Models, 2nd Ed., Chapman and Hall, New York. McKeague, I. W. (1986). Estimation for a semimartingale model using the method of sieves. Ann. Statist. 14, 579­589. McLeish, D. L., and Small, C. G. (1988). The Theory and Applications of Statistical Inference Functions. Lecture Notes in Statistics 44, Springer, New York. REFERENCES 223 Merkouris, T. (1992). A transform method for optimal estimation in stochastic processes: basic aspects. In J. Chen, Ed. Proceedings of a Symposium in Honour of Professor V. P. Godambe, University of Waterloo, Waterloo, Canada, 42 pp. Morton, R. (1981a). Efficiency of estimating equations and the use of pivots. Biometrika 68, 227­233. Morton, R. (1981b). Estimating equations for an ultrastructural relationship. Biometrika 68, 735­737. Morton, R. (1987). A generalized linear model with nested strata of extraPoisson variation. Biometrika 74, 247­257. Morton, R. (1988). Analysis of generalized linear models with nested strata of variation. Austral. J. Statist. 30A, 215­224. Morton, R. (1989). On the efficiency of the quasi-likelihood estimators for exponential families with extra variation. Austral. J. Statist. 31, 194­ 199. Mtundu, N. D., and Koch, R. W. (1987). A stochastic differential equation approach to soil moisture. Stochastic Hydrol. Hydraul. 1, 101­116. Mykland, P. A. (1995). Dual likelihood. Ann. Statist. 23, 386­421. Naik-Nimbalkar, U. V., and Rajarshi, M. B. (1995). Filtering and smoothing via estimating functions. J. Amer. Statist. Assoc. 90, 301­306. Nelder, J. 
A., and Lee, Y. (1992). Likelihood, quasi-likelihood and pseudolikelihood: some comparisons, J. Roy. Statist. Soc. Ser. B 54, 273­284. Nelder, J. A., and Pregibon, D. (1987). An extended quasi-likelihood function. Biometrika 74, 221­232. Nguyen, H. T. and Pham, D. P. (1982). Identification of the nonstationary diffusion model by the method of sieves. SIAM J. Optim. Control 20, 603­611. Osborne, M. R. (1992). Fisher's method of scoring, Internat. Statist. Rev. 60, 99­117. Parzen, E. (1957). On consistent estimates of the spectrum of a stationary time series. Ann. Math. Statist. 28, 329­348. Pedersen, A. R. (1995). Consistency and asymptotic normality of an approximate maximum likelihood estimator for discretely observed diffusion processes. Bernoulli 1, 257­279. Pollard, D. (1984). Convergence of Stochastic Processes. Springer, New York. 224 REFERENCES Prentice, R. L. (1988). Correlated binary regression with covariates specific to each binary observation. Biometrics 44, 1033­1048. Priestley, M. B. (1981). Spectral Analysis and Time Series. Academic Press, London. Pukelsheim, F. (1993). Optimal Design of Experiments. Wiley, New York. Qian, W., and Titterington, D. M. (1990). Parameter estimation for hidden Markov chains. Statistics & Probability Letters 10, 49­58. Rao, C. R. (1973). Linear Statistical Inference and its Applications, 2nd Ed., Wiley, New York. Rao, C. R., and Mitra, S. K. (1971). Generalized Inverse of Matrices and its Applications, Wiley, New York. Rebolledo, R. (1980). Central limit theorems for local martingales. Z. Wahrsch. Verw. Geb. 51, 269­286. Reynolds, J. F. (1975). The covariance structure of queues and related processes - a survey of recent work. Adv. Appl. Prob. 7, 383­415. Ripley, B. D. (1988). Statistical Inference for Spatial Processes. Cambridge Univ. Press, Cambridge. Rogers, L. C. G., and Williams, D. (1987). Diffusions, Markov Processes and Martingales, Vol. 2, Ito Calculus. Wiley, Chichester. Rosenblatt, M. (1985). Stationary Sequences and Random Fields. Birkh¨auser, Boston. Samarov, A., and Taqqu, M. S. (1988). On the efficiency of the sample mean in long-memory noise. J. Time Series Anal. 9, 191­200. Samuel, E. (1969). Comparison of sequential rules for estimation of the size of a population. Biometrics 25, 517­527. Schuh, H.-J., and Tweedie, R. L. (1979). Parameter estimation using transform estimation in time-evolving models. Math. Biosciences 45, 37­67. Shen, X., and Wong, W. H. (1994). Convergence rate of sieve estimates. Ann. Statist. 22, 580­615. Shiryaev, A. N. (1981). Martingales, recent developments, results and applications. Internat. Statist. Rev. 49, 199­233. Shumway, R. H., and Stoffer, D. S. (1982). An approach to time series smoothing and forecasting using the E-M algorithm. J. Time Series Anal. 3, 253­264. REFERENCES 225 Small, C. G., and McLeish, D. L. (1989). Projection as a method for increasing sensitivity and eliminating nuisance parameters. Biometrika 76, 693­703. Small, C. G., and McLeish, D. L. (1991). Geometrical aspects of efficiency criteria for spaces of estimating functions. In V. P. Godambe, Ed., Estimating Functions, Oxford Science Publications, Oxford, 267­276. Small, C. G., and McLeish, D. L. (1994). Hilbert Space Methods in Probability and Statistical Inference, Wiley, New York. Smith, R. L. (1985). Maximum likelihood estimation in a class of non-regular cases. Biometrika 72, 67­92. Smith, R. L. (1989). A survey of nonregular problems. Bull. Internat. Statist. Inst. 53, Book 3, 353­372. Srensen, M. (1990). 
On quasi-likelihood for semimartingales. Stochastic Process. Appl. 35, 331­346. Srensen, M. (1991). Likelihood methods for diffusions with jumps. In N. U. Prabhu and I. V. Basawa, Eds., Statistical Inference in Stochastic Processes, Marcel Dekker, New York, 67­105. Stefanski, L. A., and Carroll, R. J. (1987). Conditional score and optimal scores for generalized linear measurement-error models. Biometrika 74, 703­716. Sweeting, T. J. (1986). Asymptotic conditional inference for the offspring mean of a supercritical Galton-Watson process. Ann. Statist. 14, 925­ 933. Thavaneswaran, A. (1991). Tests based on an optimal estimate. In V. P. Godambe, Ed., Estimating Functions, Oxford Science Publications, Oxford, 189­197. Thavaneswaran, A., and Abraham, B. (1988). Estimation for non-linear time series using estimating equations, J. Time Ser. Anal. 9, 99­108. Thavaneswaran, A., and Thompson, M. E. (1986). Optimal estimation for semimartingales. J. Appl. Prob. 23, 409­417. Thisted, R. A. (1988). Elements of Statistical Computing. Chapman and Hall, New York. Thompson, M. E., and Thavaneswaran, A. (1990). Optimal nonparametric estimation for some semimartingale stochastic differential equations. Appl. Math. Computation 37, 169­183. Tjstheim, D. (1978). Statistical spatial series modelling. Adv. Appl. Prob. 10, 130­154. 226 REFERENCES Tjstheim, D. (1983). Statistical spatial series modelling II: some further results on unilateral lattice processes. Adv. Appl. Prob. 15, 562­584. Vajda, I. (1995). Conditions equivalent to consistency of approximate MLE's for stochastic processes. Stochastic Process. Appl. 56, 35­56. Verbyla, A. P. (1990). A conditional derivation of residual maximum likelihood. Austral. J. Statist. 32, 227­230. Vitale, R. A. (1973). An asymptotically efficient estimate in time series analysis. Quart. Appl. Math. 30, 421­440. Wald, A. (1949). Note on the consistency of the maximum likelihood estimate. Ann. Math. Statist. 20, 595­601. Watson, R. K., and Yip, P. (1992). A note on estimation of the infection rate. Stochastic Process. Appl. 41, 257­260. Wedderburn, R. W. M. (1974). Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61, 439­447. Wei, C. Z., and Winnicki, J. (1989). Some asymptotic results for branching processes with immigration. Stochastic Process. Appl. 31, 261­282. Whittle, P. (1951). Hypothesis Testing in Time Series Analysis. Almqvist and Wicksell, Uppsala. Whittle, P. (1952). Tests of fit in time series. Biometrika 39, 309­318. Whittle, P. (1953). The analysis of multiple time series. J. Roy. Statist. Soc. Ser. B 15, 125­139. Whittle, P. (1954). On stationary processes in the plane. Biometrika 41, 434­449. Winnicki, J. (1988). Estimation theory for the branching process with immigration. Contemp. Math. 80, 301-322. Wu, C. F. J. (1986). Jackknife, bootstrap and other resampling methods in regression analysis. Ann. Statist. 14, 1261­1295. Yanev, N. M., and Tchoukova-Dantcheva, S. (1980). On the statistics of branching processes with immigration. C. R. Acad. Sci. Bulg. 33, 463­ 471. Zeger, S. L., and Liang, K.-Y. (1986). Longitudial data analysis for discrete and continuous outcomes. Biometrics 44, 1033­1048. Zehnwirth, B. (1988). A generalization of the Kalman filter for models with state-dependent observation variance. J. Amer. Statist. Assoc. 83, 164­ 167. Zygmund, A. (1959). Trigonometric Series, Vol. 1, 2nd Ed., Cambridge Univ. Press, Cambridge. 
Index Aalen, O.O., 18, 19, 150, 166, 211 Aase, K.K., 177, 211 Abraham, B., 176, 177, 225 acceptable estimating function, 74 Adenstadt, R.K., 155, 156, 211 Aggoun, L., 136, 138, 214 Aitchison, J., 183, 211 algorithm, 127, 177 E-M, 116­119, 123, 127 P-S, 116­119, 123, 127 ancillary, 43, 60, 107 (see also E- ancillary) Anderson, P.K., 18, 211 Anh, V.V., 160, 161, 211 ARMA model, 23 asymptotically non-negative definite, 71, 72 asymptotic mixed normality, 27, 60, 62­64 asymptotic normality, 26, 27, 40, 41, 54­56, 60, 62­64, 110, 136, 147, 159, 161, 163, 164, 179, 191, 196, 197, 201 asymptotic first order efficiency, 72 asymptotic quasi-likelihood (see quasilikelihood, asymptotic) asymptotic relative efficiency, 74, 125, 126, 167, 168, 175 autoregressive process, 19, 23, 31, 56, 79, 98, 170, 173, 174, 176 spatial, 137 conditional, 137 random coefficient, 176 threshold, 176 Baddeley, A.J., 201, 211 Bailey, N.T.J., 162, 163, 211 Barndorff-Nielsen, O.E., 35, 56, 59, 62, 86, 211 Banach space, 147 Basawa, I.V., 54, 60, 63, 132, 135, 141, 142, 169, 173, 185, 212 Becker, N.G., 165, 168, 212 Belyea, C., 200, 215 Beran, J., 158, 212 Berliner, L.M., 37, 212 Bernoulli distribution, 87 Besag, J.E., 91, 212 best linear unbiased estimator (BLUE), 153, 154 beta distribution, 124 Bhapkhar, V.P., 115, 212 bias correction, 61, 67 Bibby, B.M., 135, 212 binary response, 25 Billingsley, P., 65, 212 binomial distribution, 122, 167 birth and death process, 18 birth process, 56 Black-Scholes model, 31 bootstrap, 9, 210 Borgan, O., 18, 211 Bradley, E.L., 7, 212 branching process, 182 Galton-Watson, 2, 15, 27, 31, 35, 36, 69, 87, 195 Brockwell, P.J., 60, 212 Brouwer fixed point theorem, 183 Brownian motion, 17, 31, 33, 34, 131­133, 136, 196 Burkholder, Davis, Gundy inequality, 149 Bustos, O.H., 169, 213 227 228 INDEX cadlag, 191 Carroll, R.J., 131, 139, 203, 205, 213, 214, 225 Cauchy distribution, 40, 200 Cauchy-Schwarz inequality, 5, 24, 167 censored data, 150 central limit results, 4, 9, 26, 27, 64, 149, 155, 166, 167, 179, 190­ 195 Chan, N.H., 56, 213 Chandrasekar, B., 19, 213 characteristic function, 199 Chebyshev inequality, 189 Chen, K., 88, 213 Chen, Y., 182, 213 Cheng, R.C.H., 54, 213 chi-squared distribution, 58, 110, 143­ 145, 172 Choi, Y.J., 163, 213 Cholesky square root matrix, 71, 72 Chung, K.L., 97, 213 coefficient of variation, 121, 124 colloidal solution, 69 Comets, F., 179, 213 complete probability space, 92 conditional centering, 179 conditional inference, 107 confidence intervals (or zones), 1, 4, 8, 9, 24, 27, 53­67, 69, 71, 88, 110, 131, 158, 163, 171, 191, 210 average, 56, 66 simulteneous, 58 consistency, 6, 8, 10, 26, 38, 40, 54, 55, 63, 64, 70, 131, 135, 147­150, 161, 163, 179­186, 190, 196, 197, 201, 203 constrained parameter estimation, 8, 107­112, 142 control, 41 convergence stable, 191, 193, 195 mixing, 57, 191, 192 convex, 14, 46, 156­158 correlation, 6, 13, 25 counting process, 18, 112, 150, 151, 165 covariate, 25, 148 Cox, D.R., 35, 41, 54, 58, 61, 62, 107, 180, 211, 213 Cox, J.S., 131, 133, 213 Cox-Ingersoll-Ross model, 131, 133 Cram´er, H., 2, 7, 180, 185, 195, 213 Cram´er-Rao inequality, 2, 7 Cram´er-Wold device, 195 cross-sectional data, 159 Crowder, M., 95, 105, 213 cumulant spectal density, 159 cumulative hazard function, 9, 150 curvature, 62 Cutland, N.J., 31, 213 Dahlhaus, R., 86, 214 Daley, D.J., 156, 214 Davidian, M., 131, 214 Davis, R.A., 19, 214 demographic stochasticity, 36 Dempster, A.P., 116, 214 Denby, L., 169, 214 dererminant criterion for optimality, 19 Desmond, A.F., 2, 
25, 214 differentiability (stochastic), 56 differential geometry, 62 diffusion, 8, 17, 129, 131, 132, 135, 148, 151, 200 Dion, J.-P., 10, 214 discrete exponential family, 2, 16 dispersion, 12, 22, 119, 124, 129 distance, 7, 12 Doob-Meyer decomposition, 27 drift coefficient, 131, 148 Doukhan, P., 179, 214 Duffie, D., 31, 214 Durbin, J., 1, 214 dynamical system, 37, 38, 40, 131, 136 E-ancillary, 8, 43­51, 113, 114 Edgeworth expansion, 62 efficient score statistic, 9, 141, 142 INDEX 229 Efron, B., 59, 214 eigenfunction, 200, 201 Eisenberg, B., 155, 211 Elliott, R.J., 50, 136, 138, 214 ellipsoids (of Wald), 58 E-M algorithm, 8, 103, 107, 116, 117, 119­123, 127 epidemic, 9, 162, 163 equicontinuity (stochastic), 56 ergodic, 134, 142, 153 ergodic theorem, 134, 153, 208 error approximation, 148, 151 contrasts, 139 estimation, 148, 151 estimating functions, 1, 2 combination, 2, 8, 35, 103, 137, 138 optimal, 4­6 robust, 9, 169­175 standardized, 3­5, 118, 138 estimating function space, 3­5, 8, 11, 13, 15­19, 22­25, 28, 32, 36­40, 43­51, 71, 75, 88, 89, 92, 94­96, 99, 100, 104, 105, 111, 130, 132, 137, 138, 153, 162­164, 166, 169­ 171, 176, 195, 199­201 Hutton-Nelson, 32, 33, 162­164 convex, 14, 46­48 estimation constrained parameter, 8, 107­ 112, 142 function, 147­151 recursive, 176­177 robust, 169­175 E-sufficient, 8, 43­51, 113, 114 Euclidean space, 11, 92, 94, 141, 182 Euler scheme, 151 exponential distribution, 41, 121, 125 exponential family, 2, 7, 16, 24, 38, 53, 79, 125 failure, 38 F-distribution, 110 Feigin, P.D., 17, 200, 214, 215 Fejer kernel, 157 Ferland, R., 10, 214 field (see random field) filtering, 8, 103 filtration, 27, 30, 54, 94, 132 finance, 31, 135 finite variation process, 31, 148 first order efficient, 72, 73 Firth, D., 61, 67, 95, 102, 105, 215 Fisher, R.A., 1, 2, 12, 40, 59, 72, 97, 107, 141, 202 Fisher information, 2, 12, 40, 59, 72, 141 conditional, 97 Fisher method of scoring, 120, 202 Fitzmaurice, G.M., 25, 26, 215 Fourier methods, 85 Fox, R., 86, 215 fractional Brownian motion, 31, 86 function estimation, 9, 147­151 functional relationships, 112, 205 F¨urth, R., 87, 215 Galton-Watson process, 2, 15, 27, 31, 36, 69, 195 gamma distribution, 56, 134 Gastwirth, J.L., 169, 215 Gauss, C.F. 
i, 1, 3, 25, 83, 84, 86, 127, 138, 158, 159, 161, 215 Gaussian distribution, 83, 86, 127, 138, 158, 159 (see also normal distribution) Gauss-Markov theorem, 3, 4, 161 Gay, D.M., 202, 215 Gay, R., 74, 82, 88, 218 generalized estimating equation (GEE), 8, 25, 26, 89 generalized inverse, 30, 108, 109, 130, 132, 184 generalized linear model (GLIM), 21, 22, 79, 100, 104, 202 geometric distribution, 38 Gibbs field, 138 Gibbs sampler, 201 Gill, R.D., 18, 211 230 INDEX Girsanov transformation, 138 Glynn, P., 64, 215 Godambe, V.P., 1, 2, 21, 38, 48, 97, 105, 107, 116, 172, 215, 216 Gram-Schmidt orthogonalization, 93, 96 Greenwood, P.E., 56, 216 Grenander, U., 147, 156, 216 Guyon, X., 179, 201, 216 Hajek convolution theorem, 63 Halfin, S., 157, 216 Hall, P., 54, 57, 63, 87, 156, 180, 181, 187, 192, 195, 216 Hanfelt, J.J., 203, 216 Hannan, E.J., 83, 85, 86, 153, 216, 217 Harris, I.R., 102, 215 Harville, D., 130, 217 hazard function, 150 Hermite-Chebyshev polynomials, 97 heteroscedastic autoregression, 79 regression, 9, 159­161 Heyde, C.C., 1, 2, 13, 38, 48, 54, 57, 62, 63, 69, 72, 74, 82, 87, 88, 92, 94, 97, 107, 116, 126, 131, 136, 153, 156, 159, 161, 165, 168, 169, 172, 180, 181, 187, 192, 195, 203, 212, 213, 216­218, 220, 222 hidden Markov models, 9, 93, 136, 139 Hinkley, D.V., 35, 41, 54, 58, 59, 61, 107, 180, 213, 214 Hilbert Space, 13, 44 Hoffmann-Jrgensen, J., 56, 218 Holder inequality, 86 Hotelling, H., 13, 218 Huber, P.J., 169, 173, 219 Huber function, 173 Huggins, R.M., 169, 173, 212 Hutton, J.E., 32­36, 56, 61, 97, 150, 151, 162­164, 184, 191, 196, 219 Hutton-Nelson estimating function, 32­36, 150, 151, 162­164, 196 hypothesis testing, 9, 141­145 Ibragimov, I.A., 155, 219 idempotent matrix, 99, 100, 143 Iglehart, D.L., 64, 215 immigration distribution, 69, 70, 87 Ingersoll, J.E., 131, 133, 213 infection rate, 9, 162 infectives, 162, 164 infinitesimal generator, 9, 200, 201 information, 2, 7, 8, 12, 40, 41, 55, 72, 92, 108, 113, 114, 118, 126, 142, 159 empirical, 59, 204 expected, 59, 204 Fisher, 2, 12, 40, 59, 72, 97, 141 martingale, 28, 96­98, 160, 166, 167, 172 observed, 59 integration by parts, 190 intensity, 18, 34, 56, 135, 148, 165 interest rate, 133 invariance, 2, 58 Ito formula, 32, 97, 135, 194 Janˇzura, M., 179, 213 jackknife, 9, 210 Jensen inequality, 66 Jiang, J., 131, 219 Judge, G.G., 111, 219 Kabaila, P.V., 85, 87, 219 Kale, B.K., 2, 19, 211, 216 Kallianpur, G.K., 33, 148, 219 Kalman filter, 8, 103 Karlin, S., 200, 219 Karr, A., 148, 219 Karson, M.J., 130, 219 Kaufmann, H., 186, 219, 220 Keiding, N., 18, 211 kernel estimation, 147 Kessler, M., 135, 200, 201, 220 INDEX 231 Kimball, B.F., 1, 220 Kloeden, P.E., 134, 135, 151, 220 Koch, R.W., 197, 223 Kopp, P.E., 31, 213 Kronecker lemma, 190 Kronecker product, 80 K¨unsch, H., 169, 210, 220 Kulkarni, P.M., 169, 220 Kulperger, R., 85, 220 Kunita-Watanabe inequality, 50 Kutoyants, Yu., 62, 220 kurtosis, 62, 98, 104, 130, 139 Lagrange multiplier, 108, 111­113 Lahiri, S.N., 208, 220 Laird, N.M., 25, 26, 116, 122­124, 212, 215, 220 Langevin model, 131, 135 Laplace, P.S. 
de, 1 Laplace transform, 200 lattice, 81, 136 law of large numbers, 120, 204, 205, 207, 208 (see also martingale strong law) least squares, 1, 2, 3, 5, 7, 10, 21, 87, 161, 202­205, 209 Le Cam, L., 53, 58, 220, 221 Lee, Y., 23, 223 Legendre, A., 1 Lele, S., 37, 38, 210, 221 L´epingle, D., 187, 196, 221 Leskow, J., 148, 221 lexicographic order, 101 Li, B., 62, 142, 180, 203, 221 Liang, K.Y., 21, 25, 221, 226 lifetime distribution, 150 likelihood, 6, 8, 10, 16, 25, 35, 40, 41, 53, 54, 58, 72, 83, 91, 111, 117, 118, 122, 123, 127, 129, 133, 138, 141, 142, 165, 179, 180, 184, 202, 203 conditional, 107 constrained non-regular cases, 40, 41 partial, 107 likelihood ratio, 58, 63, 131, 134, 141, 142, 145, 203 Lin, Y.-X., 43, 88, 105, 159, 186, 218, 221 Lindeberg-Feller theorem, 4 Lindsay, B., 91, 107, 139, 222 Linnik, Yu.V., 155, 219 Liptser, R.S., 133, 222 Little, R.J.A., 127, 222 Liu, R.Y., 208, 222 Loewner optimality, 12, 20 Loewner ordering, 12, 55, 118, 144 logistic map, 37, 40 logit link function, 25 longitudinal data, 8, 25 long-range dependence, 82, 86, 158, 159 Mak, T.K., 202, 222 Markov, A.A. i, 3, 9, 93, 139, 157, 161, 162, 200, 201 Markov process, 9, 157, 162, 200, 201 Martin, R.D., 169, 172, 214, 222 martingale, 15, 17, 18, 26­28, 35, 48, 49, 51, 59, 61, 62, 69, 70, 93, 94­98, 131­133, 135, 136, 148­150, 159, 160, 162­ 167, 169­171, 176, 180, 181, 186­196, 200, 210 central limit theorem, 55, 179, 186, 190­195 continuous part, 34, 54 information (see information, mar- tingale) strong law, 55, 150, 174, 179, 181, 182, 186­190, 195, 196 maximum likelihood, 1, 2, 16­19, 21, 34, 35, 38, 39, 41, 53, 54, 57, 58, 61, 70, 92, 98, 116, 124, 129, 131, 134, 136, 141, 148, 160, 163, 165, 166, 169, 180, 200, 202 non-parametric, 17 regularity conditions, 40, 41, 54, 62 232 INDEX restricted, 141 McCormick, W.P., 19, 214 McCullagh, P., 21, 22, 125, 142, 184, 203, 205, 221, 222 McKeague, I.W., 148, 222 McLeish, D.L., 1, 3, 8, 13, 43­45, 49, 127, 128, 222, 225 measurement errors, 139 membrane potential, 33, 147 Merkouris, T., 13, 199, 223 M-estimation, 1, 142 method of moments, 1, 153, 154, 156, 169, 202 metric space, 200 minimum chi-squared, 1 minimal sufficient, 1 missing data, 8, 107, 116, 117, 127 Mitra, S.K., 131, 224 mixed normality, 27, 62, 63, 192 mixing conditions, 84, 156, 179 mixture densities, 127 models branching (see branching pro- cess) epidemic, 162­164 hidden Markov, 93 interest rate, 133 logistic, 37, 38, 40 membrane potential, 33, 147 multi-stage, 200 nearest neighbour, 91 nested, 99­102 particles in a fluid, 87 physician services, 127 population, 35­37 queueing, 157, 158 recapture, 164­168 risky asset, 31 soil moisture, 196­197 moment generating function, 199 Moore, J.B., 136, 138, 214 Moore-Penrose inverse, 30 Morton, R., 21, 23, 100­102, 107, 112, 116, 126, 203, 218, 223 Mtundu, N.D., 197, 223 multiple roots, 202­209 multiplicative errors, 100, 102 mutual quadratic characteristic, 26 Mykland, P.A., 62, 223 Naik-Nimbalkar, U.V., 103, 104, 222 Nelder, J.A., 21, 22, 125, 184, 202, 222, 223 Nelson, P.I., 32­36, 56, 61, 97, 150, 151, 162­164, 184, 191, 196, 219 nested strata, 8, 99 neurophsiological model, 32, 147 Newton-Raphson method, 202 Nguyen, H.T., 148, 223 noise, 17, 30, 31, 33, 127, 132, 133, 135, 136 additive, 36, 37 multiplicative, 36 non-ergodic models, 59, 60, 63, 144 nonparametric estimation, 147, 150 norm (Euclidean), 55, 71, 184, 186 normal distribution, 4, 27, 28, 41, 55­57, 62­66, 88, 91, 105, 111, 116, 121, 129­131, 139, 156, 161, 163, 164, 166, 173 nuisance parameter, 
8, 57, 59, 60, 70, 71, 88, 107, 113­115, 139, 159 offspring distribution, 16, 27, 69, 70, 87, 182, 195 Ogunyemi, O.T., 56, 184, 196, 219 on-line signal processing, 9, 176 optimal asymptotic (OA), 1, 2, 7, 8, 11, 12, 29, 30 experimental design, 8, 12, 20 fixed sample (OF ), 1, 5, 7, 11­ 21, 28, 30, 55 orthogonal, 13, 71, 93, 100, 102­ 104, 138, 156, 160 orthonormal sequence, 147 Osborne, M.R., 120, 223 outliers, 169 parameters INDEX 233 constrained, 107­112, 142 incidental, 112 nuisance (see nuisance param- eters) scale, 200 parametric model, 1, 11 parameter space, 41, 43, 50, 53 partial likelihood, 107 Parzen, E., 84, 223 Pearson, K., 1 Pedersen, A.R., 135, 223 periodogram, 82 Pham, D.P., 148, 223 pivot, 66 Platen, E., 134, 135, 151, 220 point process, 148 (see also counting process, Poisson pro- cess) Poisson distribution, 33, 67, 70, 87, 100, 121, 167 Poisson process, 19, 31, 34, 35, 111, 112, 135, 196, 197 compound, 35, 196 Pollard, D., 56, 223 population process, 35 (see also branching process) power series family, 2, 16 Prakasa-Rao, B.L.S., 54, 132, 135, 141, 212 predictable process, 18, 31, 51, 94, 97, 162 prediction variance, 82 Pregibon, D., 202, 223 Prentice, R.L., 25, 224 Priestley, M.B., 84, 224 probability generating function, 167 projected estimating function, 108­ 110, 143 projected estimator, 13, 108, 143, 144 projection, 13, 47, 107­109, 114, 125, 129 P-S algorithm, 107, 116 Pukelsheim, F., 12, 20, 38, 224 purely nondeterministic, 153 Qian, W., 136, 224 quadratic characteristic, 27, 29, 59, 94, 132, 144, 151, 166, 187, 192, 196 quadratic form, 130 quadratic variation process, 33, 54, 133 quasi-likelihood, 1, 7­9, 12, 16, 18, 19, 21, 22, 36, 38, 40, 53, 61, 69, 70, 73, 87, 107, 116, 121, 129, 130, 131, 134, 138, 139, 142, 147, 148, 150, 151, 160­165, 169, 180­182, 195­ 197 asymptotic, 8 composite, 91, 92 quasi-score, 7­9, 12, 13, 22, 23, 26, 35, 39, 43, 46, 48, 50, 51, 55, 58, 62, 67, 69­72, 80, 82, 83, 85, 87, 92­96, 98­ 100, 102, 104, 105, 108, 109­ 112, 116­118, 122, 125, 127, 128, 130, 132, 134, 135, 137­ 139, 142, 144, 145, 154, 159, 162, 163, 166, 170­173, 175, 176, 179, 182, 195, 196 asymptotic, 26, 35, 37, 58, 69, 72--76, 78­80, 88, 154 combined, 98, 99, 103, 160 conservative, 142, 203 existence of, 15, 21, 36­38 Hutton-Nelson, 33­36, 61, 150, 151, 162­164 robust, 170­174 sub-, 48, 49 queueing models, 157, 158 Radon-Nikodym derivitive, 129, 131­ 133 Rajarshi, M.B., 103, 104, 223 random coefficient autoregression, 176 random effects model, 105 random environment, 35, 36 random field, 8, 82, 93, 136, 137, 179, 201 random norming, 8, 56, 63, 189 234 INDEX Rao, C.R., 2, 7, 20, 28, 29, 72, 73, 95, 108, 128, 129, 131, 141, 224 Rao-Blackwell theorem, 128 Rebolledo, R., 166, 224 recapture experiment, 9, 164 recursive estimation, 9, 176, 177 regression, 9, 21, 23, 25, 60, 89, 111, 113, 148, 159, 170, 173, 174, 205 REML estimation, 8, 129­131 removal rate, 164 resampling, 9, 210 residuals, 99 Reynolds, J.F., 157, 158, 224 Riemann zeta function, 126 Ripley, B.D., 137, 224 robust methods, 9, 169 Rogers, L.C.G., 26, 31, 54, 133, 149, 224 Rosenblatt, M., 82, 84, 85, 179, 224 Rotnitzky, A.G., 25, 26, 215 Rozanski, R., 148, 221 Rubin, D.B., 116, 127, 214, 222 Rubin, H., 169, 215 Samarov, A., 158, 224 sample space, 2, 43 Samuel, E., 165, 224 Schuh, H.-J., 200, 224 score function, 2, 4, 6­8, 17, 24, 41, 62, 67, 91, 93, 94, 107, 111, 113, 115­118, 122, 127, 134, 139, 141, 160, 181, 205 Scott, D.J., 63, 142, 185, 212 screening tests, 122 Selukar, R.S., 148, 219 semimartingale, 8, 9., 
30­36, 54, 69, 93, 97, 131, 135, 148 semiparametric model, 1, 9 Seneta, E., 69, 87, 218 service time, 157 Severo, N., 163, 213 Shen, X., 148, 224 Shiryaev, A.N., 26, 133, 222, 224 short-range dependence, 82, 158, 159 Shumway, R.H., 127, 224 sieve, 9, 147, 148, 151 signal, 30­32, 132, 136 Silvey, S.D., 183, 211 simulation, 64 simultaneous reduction lemma, 20, 21 skewness, 62, 98, 104 Slutsky theorem, 65 Small, C.G., 1, 3, 8, 13, 43­45, 49, 127, 128, 222, 225 Smith, R.L., 54, 225 smoothed periodogram, 82 smoothing, 8, 103, 104 smoothing function, 82, 84 soil moisture, 197 Srensen, M., 34, 35, 56, 59, 135, 136, 191, 197, 200, 201, 212, 220, 225 sources of variation, 33­35, 101, 136, 137, 139 spatial process, 93 spectral density, 82, 86, 153, 155, 156, 158, 159 square root matrix, 71, 72 standardization, 3­5, 11, 71 state space models, 103 stationary process, 1, 153, 156, 158, 210 Staudte, R.G., 169, 173, 212 Stefansky, L.A., 139, 203, 205, 206, 213, 225 stochastic differential equation, 32, 49, 133, 135, 196, 197 stochastic disturbance (see noise) stochastic integral, 133, 163 Stoffer, D.S., 127, 224 stopping time, 189, 193 strata, 99, 100, 102 strong consistency, 56, 70, 135, 148, 161, 163, 181, 182, 185, 186, 190, 195 structural relationships, 112, 206 sufficiency, 2, 3, 43 (see also E-sufficient) surrogate predictors, 139 Sweeting, T.J., 60, 225 INDEX 235 Szeg¨o, G., 156, 216 Takayama, T., 111, 219 Taqqu, M.S., 86, 158, 215, 224 Taylor, H.M., 200, 219 Tchoukova-Dantcheva, S., 70, 226 Thavaneswaran, A., 18, 142, 147, 176, 177, 225 Thisted, R.A., 202, 225 Thompson, M.E., 18, 105, 116, 147, 216, 225 time series, 8, 103, 176 moving average, 156, 159, 180 (see also autoregressive pro- cess) standardized, 64 Titterington, D.M., 136, 224 Toeplitz matrix, 154 trace criterion (for optimality), 19 traffic intensity, 158 transform martingale families, 199 trapping rate, 168 Traylor, L., 54, 213 treatment, 4 t-statistic, 64, 66, 168 Tweedie, R.L., 200, 215, 22 unacceptable estimating function, 74, 75 uniform convergence, 57 uniform distribution, 41 Vajda, I., 180, 226 Verbyla, A.P., 130, 226 Vitale, R.A., 156, 226 Vostrikova, L., 62, 220 Wald, A., 9, 58, 141­143, 180, 226 Wald test, 9, 141­143 Watson, R.K., 162, 226 Wedderburn, R.W.M. i, 7, 21, 23­ 25, 101, 226 Wedderburn estimating function, 24 Wefelmeyer, W., 56, 216 Wei, C.Z., 56, 70, 87, 211, 226 Welsh, R.E., 202, 215 Whittle, P. ii, 82, 83, 86, 226 Whittle estimator, 83, 86 Williams, D., 26, 31, 54, 133, 149, 224 Williams, R.J., 97, 213 Willinger, W., 31, 213 Winnicki, J., 69, 70, 87, 226 Wong, W.H., 148, 224 Wu, C.F.J., 208, 226 Yanev, N.M., 70, 226 Yip, P., 162, 226 Yohai, V.J., 169, 172, 222 Yule-Walker equations, 23 Zeger, S.L., 21, 25, 200, 226 Zehnwirth, B., 103, 226 Zygmund, A., 157, 226