ESTIMATION WITH QUADRATIC LOSS
W. JAMES
FRESNO STATE COLLEGE
AND
CHARLES STEIN
STANFORD UNIVERSITY
1. Introduction
It has long been customary to measure the adequacy of an estimator by the
smallness of its mean squared error. The least squares estimators were studied
by Gauss and by other authors later in the nineteenth century. A proof that the
best unbiased estimator of a linear function of the means of a set of observed
random variables is the least squares estimator was given by Markov [12], a
modified version of whose proof is given by David and Neyman [4]. A slightly
more general theorem is given by Aitken [1]. Fisher [5] indicated that for large
samples the maximum likelihood estimator approximately minimizes the mean
squared error when compared with other reasonable estimators. This paper will
be concerned with optimum properties or failure of optimum properties of the
natural estimator in certain special problems with the risk usually measured by
the mean squared error or, in the case of several parameters, by a quadratic
function of the estimators. We shall first mention some recent papers on this
subject and then give some results, mostly unpublished, in greater detail.
Pitman [13] in 1939 discussed the estimation of location and scale parameters
and obtained the best estimator among those invariant under the affine transformations
leaving the problem invariant. He considered various loss functions,
in particular, mean squared error. Wald [18], also in 1939, in what may be
considered the first paper on statistical decision theory, did the same for location
parameters alone, and tried to show in his theorem 5 that the estimator
obtained in this way is admissible, that is, that there is no estimator whose risk
is no greater at any parameter point, and smaller at some point. However, his
proof of theorem 5 is not convincing since he interchanges the order of integration
in (30) without comment, and it is not clear that this integral is absolutely
convergent. To our knowledge, no counterexample to this theorem is known,
but in higher-dimensional cases, where the analogous argument seems at first
glance only slightly less plausible, counterexamples are given in Blackwell [2]
(which is discussed briefly at the end of section 3 of this paper) and in [14],
This work was supported in part by an ONR contract at Stanford University.
361
362 FOURTH BERKELEY SYMPOSIUM: JAMES AND STEIN
which is repeated in section 2 of this paper. In the paper of Blackwell an analogue
of Wald's theorem 5 is proved for the special case of a distribution concentrated
on a finite arithmetic progression. Hodges and Lehmann [7] proved some results
concerning special exponential families, including Wald's theorem 5 for the
special problem of estimating the mean of a normal distribution. Some more
general results for the estimation of the mean of a normal distribution including
sequential estimation with somewhat arbitrary loss function were obtained by
Blyth [3] whose method is a principal tool of this paper. Girshick and Savage [6]
proved the minimax property (which is weaker than admissibility here) of the
natural estimator in Wald's problem and also generalized the results of Hodges
and Lehmann to an arbitrary exponential family. Karlin [8] proved Wald's
theorem 5 for mean squared error and for certain other loss functions under
fairly weak conditions and also generalized the,results of Girshick and Savage
for exponential families. The author in [16] proved the result for mean squared
error under weaker and simpler conditions than Karlin. This is given without
complete proof as theorem 1 in section 3 of the present paper.
In section 2 of this paper is given a new proof by the authors of the result of
[14] that the usual estimator of the mean of a multivariate normal distribution
with the identity as covariance matrix is inadmissible when the loss is the sum
of squares of the errors in the different coordinates if the dimension is at least
three. An explicit formula is given for an estimator, still inadmissible, whose
risk is never more than that of the usual estimator and considerably less near the
origin. Other distributions and other loss functions are considered later in section
2. In section 3 the general problem of admissibility of estimators for problems
with quadratic loss is formulated and a sufficient condition for admissibility is
given and its relation to the necessary and sufficient condition [15] is briefly
discussed. In section 4 theorems are given which show that under weak conditions
Pitman's estimator for one or two location parameters is admissible when the
loss is taken to be equal to the sum of squares of the errors. Conjectures are discussed
for the more difficult problem where unknown location parameters are
also present as nuisance parameters, and Blackwell's example is given. In section
5 a problem in multivariate analysis is given where the natural estimator is
not even minimax although it has constant risk. These are related to the examples
of one of the authors quoted by Kiefer [9] and Lehmann [11]. In section 6
some unsolved problems are mentioned.
The results of section 2 were obtained by the two authors working together.
The remainder of the paper is the work of C. Stein.
2. Inadmissibility of the usual estimator for three or more
location parameters
Let us first look at the spherically symmetric normal case where the inadmissibility
of the usual estimator was first proved in [14]. Let X be a normally
distributed p-dimensional coordinate vector with unknown mean t = EX and
ESTIMATION WITH QUADRATIC LOSS 363
covariance matrix equal to the identity matrix, that is, E(X - t)(X - t)I.
We are interested in estimating t, say by 4 and define the loss to be
(1) LQ(, 4) = (t - = |) - J112,
using the notation
(2) -1X112 = x'x.
The usual estimator is 'po, defined by
(3) OW(x) = x,
and its risk is
(4) p(Q, po) = EL[t, po(X)] = E(X -t)'(X- = p.
It is well known that among all unbiased estimators, or among all translationinvariant
estimators (those s for which po(x + c) = so(x) + c for all vectors x
and c), this estimator go has minimum risk for all t. However, we shall see that
for p > 3,
(5) E||(1-lX| )X t|| = p -E P2+2 < p'(5)~ ~ ~ IX12p - 2 +2K
where K has a Poisson distribution with mean tl12/2. Thus the estimator pi
defined by
(6) sol(X) = 1 - 2 X
has smaller risk than so for all t. In fact, the risk of ,1 is 2 at 0= 0 and increases
gradually with 11112 to the value p as I
tl 2 oo. Although solis not admissible
it seems unlikely that there are spherically symmetrical estimators which are
appreciably better than pi. An analogous result is given in formulas (19) and
(21) for the case where E(X - t)(X - t)' = a21, where a2 is unknown but we
observe S distributed independently of X as a2 times a x2 with n degrees of
freedom. For p _ 2 it is shown in [14] and also follows from the results of section
3 that the usual estimator is admissible.
We compute the risk of the estimator 2 defined by
(7) <2(X) = - __X,
where b is a positive constant. We have
(8) P(t P2) = E X-t
= EIlX-_lI2 -2bE (Xllx X + b2E11l12
= p - 2bE (X x)'X + b2E 11X1
364 FOURTH BERKELEY SYMPOSIUM: JAMES AND STEIN
It is well known that IIXI 12, a noncentral x2 with p degrees of freedom and noncentrality
parameter lItI12, is distributed the same as a random variable W
obtained by taking a random variable K having a Poisson distribution with
mean 1/211,1 2 and then taking the conditional distribution of W given K to be
that of a central x2 with p + 2K degrees of freedom. Thus
(9) 1 1 =E(1E KIE1KE
(9)IEj= X2+2K E X+2K) p-2+2K
To compute the expected value of the middle term on the right side of (8) let
(10) U= V=I2
Then
(11) W = I1X12 = U2 + V,
and U is normally distributed with mean IltlI and variance 1, and V is independent
of U and has a x2 distribution with p - 1 d.f. The joint density of U
and V is
(12) , exp{-(U P{2(UV1(p)2}2(P21),2F](-
1)/2] V(P-)/2e-V/2
if v _ 0 and 0 if v <0. Thus the joint density of U and W is
(13) 27r2(v-))/2r[(p - 1)/2] (W U2)(23)I2 exp 221I12 + IIfl U 1W}
if u2 < w and 0 elsewhere. It follows that
(14) EU exp {-2- r dwW x27r 2(P')'2r[(p - 1)/2] O
i!U (W - U2)(P-3)/2 exp {lllu- w}
Making the change of variable t = u/V\w we find
(15) f dw f (w - U2)(p-3)2 exp{llllu - W}du
10f w~~3~' ex{p -w2}11dw exp {l1tltt\w} dt
= f W(p3)/2 exp {-w w}dw )i f t'+'(1 - t2)(p3)/2 dt
Itl
(2ji+1)!|Wp/2+j-l tx --W W| f2(j+l)(1 -t2)(p-3)/2 dt
~ (Ca WP/2+1 exp dwwCd,=o(2j+ 1)! C ~ 2 W} 1t~~~( 2()2d
ESTIMATION WITH QUADRATIC LOSS 365
2'0(2j+1)!2' +ir(-2+j)B(j+3,P 1-jo0 (2+;!21
Illf12i+1 2P/2+j r ( + 3)r (P )
=° (2j+ 1)
(P
9p12 X 2(2 + j)
(~uI) - =02 i+lr(j + 1) (2 +i)
It follows from (10), (11), (14), and (15) that
(16) E (Xll2 = 1- it IlE U
=1-110~111t X r j21tjj2i+1
lIrf V j=02i+lr(j + 1)(2 +i)
= exp {-2 11011!} { E X -11011 Ej!(p + 2j)}
= exp {-2 pp{-2)
i} i! p -2+2i
p - 2 + 2
where K again has a Poisson distribution with mean jl1112/2. Combining (8),
(9), and (16) we find
(17) P(%, P2) =
Ell(i-_
I_ )X-
t|l
= p-2(p-2)bE 1+ + b2E _ 2+2K
This is minimized, for all {, by taking b = p - 2 and leads to the use of the estimator
ioi defined by (6) and the formula (5) for its risk.
Now let us look at the case where X has mean t and covariance matrix given
by
(18) E(X - t)(X -t)' = a2
366 FOURTH BERKELEY SYMPOSIUM: JAMES AND STEIN
and we observe S independent of X distributed as a2 times a x2 with n degrees of
freedom. Both t and a2 are unknown. We consider the estimator (p3 defined by
(19) p3(X, S) = (1 X,
where a is a nonnegative constant. We have
(20) p(%, ps) = Ellp3(X, S) - t1|2
EIIX -t112 - 2aES(X_ tyX + a2E S2
11X1,12 IalX.112
= -j -2aE S E ($)')°+a2E(S)2E 1
-a2 {p -2an(p - 2)E - 2+ 2K + a2n(n + 2)E - 2K}
by (9) and (16) with X and t replaced by X/a and t/a respectively. Here K has a
Poisson distribution with mean ItI12/2a2. The choice a = (p - 2)/(n + 2) minimizes
(20) for all t, giving it the value
(21) p( $p3) = cr2 - n+2 (P- 2)2E -2+rp-n p 1 2+~Kf
We can also treat the case where the covariance matrix is unknown but an
estimate based on a Wishart matrix is available. Let X and S be independently
distributed, X having a p-dimensional normal distribution with mean t and covariance
matrix 2 and S being distributed as a p X p Wishart matrix with n degrees
of freedom and expectation n2, where both t and 2 are unknown and 2 is
nonsingular. Suppose we want to estimate t by I with loss function
(22) L[(Q, 2), j] = (t -)t-(-.
We consider estimators of the form
(23) (p(X, S) I(i - X.
The risk function of (p is given by
(24) p[(Q, I), sp] = EE,x[(p(X, S) - t]' I-'[((X, S)z
1 - - M-1 [(i- c
-=
R,[(1- -,S X) X-]21[1X'S-,X) Xt
= EE,1 - XS-,X) ] [(i X'S- X)
where t*' = [(Q'Z-1j)1'2, 0, *, 0]. But it is well known (see, for example
Wijsman [19]) that the conditional distribution of X'S-1X given X is that of
X'X/S*, where S* is distributed as x2-p+l independent of X. Comparing (24)
ESTIMATION WITH QUADRATIC LOSS 367
and (20) we see that the optimum choice of c is (p - 2)/(n - p + 3) and, for
this choice, the risk function is given by
(25) P[(' 1) f] = P-n p+ 1 (p- 2)2 1
n - p+3 p- 2 +2K
where K has a Poisson distribution with mean (1/2)t'2;'.
The improvement achieved by these estimators over the usual estimator may
be understood better if we break up the error into its component along X and
its component orthogonal to X. For simplicity we consider the case where the
covariance matrix is known to be the identity. If we consider any estimator
which lies along X, the error orthogonal to X is t - (t'X/lXlI2)X and its mean
square is
(26) Et lit- X || = lftl-l2Et ((X)12
1l2(1-E 1+ 2K)
= (p 1)1101 p + 2K
= (p - 1)[1-I -2)Ep 2 + 2K]
Thus the mean square of the component along X of the error of [1(p
- 2)/1 XII2]Xis
(27)
p p)2Et(P -(p -2)Et2 2K
[P (p 2)2Et p-2 + 2K 1)[1- 2 + 2K]
1(P-2)E~ p-2 +2K =2.
On seeing the results given above several people have expressed fear that they
were closely tied up with the use of an unbounded loss function, which many
people consider unreasonable. We give an example to show that, at least qualitatively,
this is not so. Again, for simplicity we suppose X a p-variate random
vector normally distributed with mean t and covariance matrix equal to the
identity matrix. Suppose we are interested in estimating t by f with loss function
(28) L(Q, j) = F( I -t112)
where F has a bounded derivative and is continuouusly differentiable and concave
(that is, F"(t) _ 0 for all t > 0). We shall show that for sufficiently small b
and large a (independent of t) the estimator so defined by
(29) 'p(X) = (1-- a +1IIX1l2)X
has smaller risk than the usual estimator X. We have (with Y = X -t)
368 FOURTH BERKELEY SYMPOSIUM: JAMES AND STEIN
(30) pQ, p) = EZF 1- + X_t1|2]
=EtF [!IX _ t12-2b X'(X-0) + b2 I1X112)2
_ EeF(flX-tI2) - 2bE+[X1a+ (Xa12b IX-122b
XVY+)t'-
=
EF(1XY112)-2bE a
+IM Y +
tiI22 F'(1XY-12) 2
(Y+ t)'Y--
(31) E a + {lY + t112 2 F'(11Y1a2)
= Y+a '+IIllyl 1- lll2 + IIYII) ({2) )+°[(a +Il4112)J
a+E ll |2+ a,fl+2!eY F+(flYfl2)+1111\}
b~~~~~~~~
(IYl+V
FI(1Y1-) +F[(I +II2 II2)1
{a + I2 + 11Y112 - (a + I1lI! + /ly +O112
To see that this is everywhere positive if b is sufficiently small and a sufficiently
large we look at
(32) +Vy -
and
(33) AE AF(I1IY =12)
a+IIYA + 1Y12 = (A)
ESTIMATION WITH QUADRATIC LOSS 369
It is clear thatf and g are continuous strictly positive valued functions on [1, )
with finite nonzero limits at X. It follows that
(34) c = inf f(A) = 0,A>1
(35) d = sup g(A) < oo.
A>1
Then, if b is chosen so that
(36) 1--) _-2 d > O
it follows from (30) and (31) that, for sufficiently large a
(37) p(, p) < EZF(IiX -
l12)
for all t.
The inadmissibility of the usual estimator for three or more location parameters
does not require the assumption of normality. We shall give the following
result without proof. Let X be a p-dimensional random coordinate vector with
mean t, and finite fourth absolute moments:
(38) E(Xi-t)4,_C< oc.
For simplicity of notation we assume the Xi are uncorrelated and write
(39) 2 = EJX,-i)2.
Then for p 2 3 and
(40) b < 2(p-2) min at
and sufficiently large a depending only on C
(41) E [(1 2[a + E X2/y2])Xi -{i
< Et (X_-.)2 = a2
It would be desirable to obtain explicit formulas for estimators one can seriously
recommend in the last two cases considered above.
3. Formulation of the general problem of admissible estimation
with quadratic loss
Let Z be a set (the sample space), 63 a a-algebra of subsets of Z and X a a-finite
measure on 63. Let 0 be another set (the parameter space), e a o-algebra of subsets
of 0, and p(. ) a nonnegative valued (6B-measurable function on Z X 0
such that for each (E 0, p(. 10) is a probability density with respect to X, that is,
(42) fp(zO)dX(z) = 1.
Let A, the action space, be the k-dimensional real coordinate space, a a e-measurable
function on 0 to the set of positive semidefinite symmetric k X k matrices,
370 FOURTH BERKELEY SYMPOSIUM: JAMES AND STEIN
and ,q a C-measurable function on 0 to A. We observe Z distributed over Z
according to the probability density p(. 10) with respect to X, where 0 is unknown,
then choose an action a E A and suffer the loss
(43) L(0, a) = [a - q(0)]'a(6)[a - o()]An
estimator (p of q(o) is a B-measurable function on Z to A, the interpretation
being that after observing Z one takes action sp(Z), or in other words, estimates
,1(0) to be p(Z). The risk function p(., so) is the function on 0 defined by
(44) p(0, sp) = Eo[<p(Z) - q(0)]'a(0)[(p(Z) -1(0)].
Roughly speaking, we want to choose so so as to keep p( , s) small, but this
is not a precise statement since, for any given <, it will usually be possible to
modify p so as to decrease p(O, ,o) at some 0 but increase it at other 0. In many
problems there is a commonly used estimator, for example, the maximum likelihood
estimator or one suggested by invariance, linearity, unbiasedness, or some
combination of these. Then it is natural to ask whether this estimator is admissible
in the sense of Wald. The estimator so0 is said to be admissible if there is no
estimator (p for which
(45) PO@, () _PO, soo)
for all 0 with strict inequality for some 0. If there does exist such a so then (p is
said to be better than 'po and v7o is said to be inadmissible. We shall also find it
useful to define an estimator spo to be almost admissible with respect to a measure
II on the a-algebra e of subsets of 0 if there is no estimator (p for which (45) holds
for all 0 with strict inequality on a set having positive II-measure.
Next we give a simple sufficient condition for almost-admissibility of certain
estimators. Although we do not discuss the necessity of the condition, its similarity
to the necessary and sufficient condition of [15] leads us to apply it with
confidence. It should be remarked that in [15] the condition of boundedness of
the risk of the estimator (or strategy, bo in the notation of that paper) was inadvertently
omitted in theorems 3 and 4. It is needed in order to justify the
reduction to (48). If HI is a a-finite measure on e we shall define
(46) (pn(x) = [f a(0)p(x10) dH(0)] f a(0)q(0)p(x10) dH(0)
provided the integrals involved are finite almost everywhere. Observe that if II
is a probability measure, 'pi is the Bayes' estimator of ,q(0), that is, (p= (pM
minimizes
(47) f dH(0)Eo[(p(X) - ,(0)]'a(0)[(P(X) -(0).
If q is a probability density with respect to H we shall write (q, HJ) for the induced
probability measure, that is,
(48) (q, HJ)S = f q(0) dH(0).
AS
ESTIMATION WITH QUADRATIC LOSS 371
THEOREM 3.1. If P is an estimator of q(0) with bounded risk such that for each
set C in a denumerable family 9: of sets whose union is 0
J q(O) dll(O)Ee[,p(X) - p(qju)(X)]'a(O)[,p(X) - p(q,j)(X)]
q0s(c) f q(O) dH(O)
C
where S(C) is the set of probability densities with respect to II which are constant
(but not 0) on C, then 'p is almost admissible with respect to H.
PROOF. Suppose to the contrary that 'pj is such that
(50) Ee['p1(X) - q(6)]'a(0)[Pi(X) -q(0)]
< Ee[sp(X) -rq(0)]'a(O)[1p(X) - 71(0)]
with strict inequality on a set S of positive II-measure. Choose E > 0 and C a
set in 5Y such that II(C n Sf) > 0 where S, is the set of 0 for which
(51) Ee[,pi(X) - 7(0)]'a(0)[p(X) -()]
< Ee[sp(X) -r(O)]'a(6)['p(X) -r7(0)] - e.
Then, for any q E g(C)
(52) f q(6) dH(O)Ee[&p1(X) - 7j()]'a(0)['p1(X) -7(0)]
ff q(O) dH(O)Ee[4p(X) - 7(0)]'a(6)['P(X) - tn()] - e f q(D) dH(O).
cfnsE
It follows that
(53) eII(Ccn )
f q(O) dHI(O)
cns
= f
q(O)dIH(O)
c
-fi(@ d ( )fq(O)dll(O)Ee[vp(X) - 7(0)]'a(0)[sp>(X) - n()]
< fi f q(O) dH(O)Ee[,p(X) -q(O)]'a(O)['p(X) -7(0)]
f q(O) dHl(9)
c
- f q(O) dHJ(O)Ee['p(q,n)(X) -l(0)]'a(O)['p(q,ll)(X) - n()]
fq(O) dH(o)E0['p(X) - 'p(qF,)(X)]'a(O)['P(X) -'p(qjn)(X)]
f q(o) dIH(O)
and this contradicts (49).
372 FOURTH BERKELEY SYMPOSIUM: JAMES AND STEIN
4. Admissibility of Pitman's estimator for location parameters
in certain low dimensional cases
The sample space Z of section 3 is now of the form $E X yIJ, where $tis a finitedimensional
real coordinate space, and yijarbitrary and the o-algebra (B is a product
a-algebra (B = M1M2 where G3, consists of the Borel sets in $r and (B2 is an
arbitrary a-algebra of subsets of y3. Here X is the product measure X = yv
where ,u is a Lebesgue measure on G3i and v is an arbitrary probability measure
on (B2. The parameter space 0 and the action space A coincide with $c. The loss
function is
(54) L(6, a) = (a - 6)'(a - 0).
We observe (X, Y) whose distribution, for given 0, is such that Y is distributed
according to v and the conditional density of X - 0 given Y is p(. Y), a known
density. We assume
(55) f p(x, y) dx = 1
and
(56) fxp(x, y) dx =0
for all y. Condition (56) is introduced only for the purpose of making the natural
estimator X (see [6] or [16]). The condition (49) for the natural estimator X
to be almost admissible becomes the existence of a sequence 7r, of densities with
respect to Lebesgue measure in $Csuch that
1 r |||, _¢X)(1v |
(57) ~inf 1 dp(y) dlf t~rg(x - ,;)p(iijy) dlii2(57) inf 7r(x) j 0.
l4il 1 f 7r6(X - ,q)p(6qIy) d'q
This is derived in formula (63) below.
In [16] one of the authors proved the following theorem.
THEOREM 4.1. When dim SC = 1, if in addition to the above conditions
(58) f dv(y) [f x2p(xjy) dx]312 < oo,
then X is an admissible estimator of 0, that is, there does not exist a function so
such that
(59) f dv(y) f [,p(x, y) - 0]2p(X- Bly) dx < f dv(y) f x2p(xly) dx
for all 0 with strict inequality for some 0.
This is proved by first showing that (57) holds with
(60) 'ir,(x) =
wa( +
ESTIMATION WITH QUADRATIC LOSS 373
so that X is almost admissible, and then proving that this implies that X is
admissible. It is not clear that the condition (58) is necessary.
We shall sketch the proof of a similar but more difficult theorem in the twodimensional
case.
THEOREM 2. When dim X = 2, if in addition to (55) and (56)
(61) f d'(y) [f llxll2 log'+' jlxjl2p(x, y) dx] < o,
then X is an admissible estimator of 0, that is, there does not exist a function so on
DC X lJ to $t such that
(62) f d'(y) f ij,(x, y) - 01l2p(x - Oly) dx _ f dv(y) f l|xl|2p(x|y) dx
for all 0 with strict inequality for some 0.
SKETCH OF PROOF. First we show that (57) implies (49) in the present case.
We take II to be a Lebesgue measure in $C and in accordance with the notation
of the present section write 7r instead of q. Because of the invariance of our
problem under translation we need prove (49) only for C equal to a solid sphere
of radius 1 about the origin. Then the integral in the numerator of (49) is
f lp(X- ti, Y)ir(.i) dt, 2
(63) j 7(r,) diEE X J
|~~~~p(X -,1 Y)ir(tai)
= f 7r,) dt f dv(y) f p(x -, y) dx 2f (x ti)p(x
- i
y)wr,) dt1 2
[f p(x - ,y)prx(-i,)dyd]
f dv(y) f dx f 7r,(Q) dt p(x - i, Y) 2f (x -i -til
[f p(x -
1, y)7r,(Q) dt1]
= f dv(y) f dx J (X- )p(x - , y)7r.() dti|
f p(x - t, y)7r.(Q) dt
=|dp(y) vdx ?1l7r,(x-rnwn, y) dillf
T,(X - q)p(i1, y) dr,
and it is clear that (57) implies (49).
374 FOURTH BERKELEY SYMPOSIUM: JAMES AND STEIN
To prove theorem 2 we define ire by
-,log2V, III2< 11
0*[K, log2 a 1t112 < Ma,
(64) |-log (1 () 1 _
II1I 2 _ Ma,
a aTlogL2og 1il_Ma
where 1 < < 1 + 6, 0 < M < 1, A'VM > 1, and these constants and B are
chosen so that 'r is continuous everywhere and continuously differentiable except
at lltllf = 1. A computation analogous to that in [16] but much more tedious
leads to the following bound for the inner integral on the right side of (63)
12
1l||717re( -t)p(, y) do||
(65) I(y) = dx
f 7rz(x -71)p('i2, y) dr7
<Cllogff | 72(1+ log rq1)p(,7, y) d]
if
(66) f 12p(n, y) d'q _ C2(
where Ci and C2 are positive constants. Also by applying (63) with (J reduced
to the point y it is easy to see that
(67) I(y) _ f 72p(7,) y) d77 for all y.
Now let S. be the set of y for which
(68) f 712(1 + log 712)p(1, y) d7l < C2o.
Then
1f |||717ru(X- n)p(71, y) dq|1
(69) f dv(y) Jdx
f7r,(x- q)p(,q, y) d71
f dv(y)Clog1 r [f212(1 + logo 72)p(71, y) d7]
S"
+ f dv(y) f n2(1 + log,n 2)p(iq, y) d7i.
But
(70) f dv(y) f 712(1 + logo 7)2)p(t, y) dr.
< f dv(y) [f 2(1 + log1 172)p(X, y) dq
ESTIMATION WITH QUADRATIC LOSS 375
It follows from (69) and (70) and (64) that
12
(71) I dp(y) dx
7r,(0) |f 7r(Xn)p(n,q
y) dt1
< C3 log av °-+ as a-a:
--log2 0./a.
so that theorem 2 follows from the remarks around (57).
Theorems 1 and 2 together with the results of section 2 settle in a fairly complete
manner the question of admissibility of Pitman's estimator of a number of
location parameters with positive definite translation-invariant quadratic loss
function. If enough moments exist, Pitman's estimator is admissible if the
number of parameters to be estimated is one or two, but inadmissible if this number
is at least three. In the case where there are also parameters which enter as
nuisance parameters, the only known results are an example of Blackwell [2]
and some trivial consequences of the results given above.
Blackwell formulates his example as a somewhat pathological case involving
one unknown location parameter where Pitman's estimator is almost admissible
in the class of Borel-measurable estimators but not admissible (and not even
almost admissible if measurability is not required). However, it can be reformulated
as a nonpathological example involving four unknown location parameters
with a quadratic loss function of rank one.
Now consider the problem where we have p unknown location parameters and
the positive semidefinite translation-invariant quadratic loss function has rank r.
From the result given without proof at the end of section 2, it follows that if
r _ 3 and all fourth absolute moments exist, then Pitman's estimator is inadmissible.
This follows from the application of the result at the end of section 2
to the problem we obtain if we look only at Pitman's estimator of the r parameters
which enter effectively into the loss function, ignoring the rest of the observation.
If p = 2 and r = 1, then, subject to the conditions of theorem 2,
Pitman's estimator is admissible. If it were not, we could obtain a better estimator
than Pitman's for a problem with r = 2 (contradicting theorem 2) by
using the better estimator for the parameter which occurs in the original loss
function of rank 1 and Pitman's estimator for another linearly independent
parameter, with the new loss function equal to the sum of squares of the errors
in estimating the parameters defined in this way. If r is 1 or 2 but p arbitrary,
Pitman's estimator is admissible (subject to the existence of appropriate moments)
if, for some choice of the coordinate system defining the r - p nuisance
parameters, Pitman's estimator coincides with what would be Pitman's estimator
if these r - p nuisance parameters were known. This is always the case if r _ 2
and the observed random vector X is normally distributed.
These are all the results known to me for the problem of estimating unknown
location parameters with positive semidefinite translation-invariant quadratic
loss function. In the following conjectures it is to be understood that in all cases
376 FOURTH BERKELEY SYMPOSIUM: JAMES AND STEIN
sufficiently many moments are assumed to exist. I conjecture that if p = 3
and r = 1, in the case of a single orbit (that is, when '8, analogous to that of
theorems 1 and 2, reduces to a single point) Pitman's estimator is admissible,
but this does not hold in general when ly does not reduce to a point. In the
other cases not covered in the preceding paragraph, that is, if p _ 3 and r = 2
or if p _ 4 and r = 1, I conjecture that Pitman's estimator is, in general, inadmissible,
but of course there are many exceptions, in particular those mentioned
at the end of the last paragraph. Blackwell's example supports this
conjecture for p _ 4 and r = 1.
5. Some problems where the natural estimator is not mininiax
Kiefer [9] and Kudo [10] have shown that under certain conditions, a statistical
problem invariant under a group of transformations possesses a minimax
solution which is also invariant under this group of transformations. However,
these conditions do not hold for the group of all nonsingular linear transformations
in a linear space of dimension at least two. I shall give here a problem in
multivariate analysis for which I can derive a minimax solution and show that
the natural estimator (invariant under the full linear group) is not minimax.
Consider the problem in which we observe X1, - - , X. independently normally
distributed p-dimensional random vectors with mean 0 and unknown
covariance matrix 2 where n 2 p. Suppose we want to estimate 2, say by 2
with loss function
(72) L(2, :) = tr -1 2 - log det -1 -p.
The problem is invariant under the transformations Xi -> aXi, aa',
- a±a' where a is an arbitrary nonsingular p X p matrix. Also
n
(73) S = xixi=l
is a sufficient statistic and if we make the transformation Xi -* aXi then
S -- aSa'. We may confine our attention to estimators which are functions of
S alone. The condition of invariance of an estimator p (a function on the set of
positive definite p X p symmetric matrices to itself) under transformation by
the matrix a is
(74) (p(asa') = ad(s)a' for all s.
Let us look for the best estimator (p satisfying (74) for all lower triangular
matrices a, that is, those satisfying aij = 0 for j > i. We shall find that this
p(S) is not a scalar multiple of S. At the end of the section we shall sketch the
proof that such an estimator is minimax. Similar results hold for the quadratic
loss function
(75) L*(2, :) = tr (-' 2 -I)2
ESTIMATION WITH QUADRATIC LOSS 377
but I have not been able to get an explicit formula for a minimax estimator in
this case.
Putting s = I in (74) we find
(76) (p(aa') = asp(I)a'.
When we let a range over the set of diagonal matrices with 4 1 on the diagonal,
this yields
(77) (I) = ap(I)a',
which implies that so(I) is a diagonal matrix, say A, with ith diagonal element Ai.
This together with (74) determines so since any positive definite symmetric
matrix S can be factored as
(78) S = KK'
with K lower triangular (with positive diagonal elements) and we then have
(79) p(S) = KAK'.
Since the group of lower triangular matrices operates transitively on the parameter
space, the risk of an invariant procedure (p is constant. Thus we compute the
risk only for 2 = I. We then have
(80) p(I, s) = E[tr p(S) - log det sp(S) - p]
= E(tr KAK' - log det KAK' - p]
E tr KAK' - log detA - E log det S - p.
But
(81) E tr KAK' = EAiEKk1i,k
- EAiExn-+l+pi =_ Ai(n + p - 2i + 1)
since the elements of K are independent of each other, the ith diagonal element
being distributed as Xn-i+l and the elements below the diagonal normal with
mean 0 and variance 1. Also, for the same reason,
(82) E log det S = E log X2 -+1.
It follows that
(83) p(2, p) = p(I, so)
= E [(n + p - 2i + 1)Ai -log Ai] - E E log xn-i+i p.
This attains its minimum value of
p X21
(84) p(2Z, p*) = 1 - log -Elgn_1(=1L n + p-2i + I x =
E [log (n + p - 2i + 1) - E log xn-_+I]
378 FOURTH BERKELEY SYMPOSIUM: JAMES AND STEIN
when
(85) i. = n + p - 2i + 1
We have thus found the minimax estimator in a class of estimators which
includes the natural estimators (multiples of S) to be different from the natural
estimators. Since the group of lower triangular matrices is solvable it follows
from the results of Kiefer [9] that the estimator given by (79) and (85) is
minimax. However, it is not admissible. One can get a better estimator by averaging
this estimator and one obtained by permuting the coordinates, applying
the method given above and then undoing the permutation. It must be admitted
that the problem is somewhat artificial.
6. Some more unsolved problems
In section 4 several conjectures and unsolved problems concerning estimation
of location parameters have been mentioned. Some other problems are listed
below. Of course, one can combine these in many ways to produce more difficult
problems.
(i) What are the admissible estimators of location parameters? In particular,
what are the admissible minimax estimators of location parameters?
(ii) What results can be obtained for more general loss functions invariant
under translation?
(iii) For a problem invariant under a group other than a translation group,
when is the best invariant estimator admissible? In particular, is Pitman's
estimator admissible when both locations and scale parameters are unknown?
(iv) What can we say in the case of more complicated problems where there
may be no natural estimator? For example, consider the problem in which we
observe S1, * , S. independently distributed as aTxk and want to estimate
a22
, a2 by &2
* , &2
with loss function
(86) L(al * *Xn;a, * n** ) =
It is clear that
(87) 1k + 2 Si
is a minimax estimator since the risk for this estimator is constant and it is
minimax when all except one of the at are 0 (see Hodges and Lehmann [7]).
But this estimator is clearly very poor if k is small and n is large. This problem
arises in the estimation of the covariances in a finite stationary circular Gaussian
process.
ESTIMATION WITH QUADRATIC LOSS 379
REFERENCES
[1] A. C. AITKEN, "On least squares and linear combinationi of observations," Proc. Roy. Soc.
Edinburgh, Sect. A, Vol. 55 (1935), pp. 42-48.
[2] D. BLACKWELL, "On the translation parameter problem for discrete variables," Ann.
Math. Statist., Vol. 22 (1951), pp. 393-399.
[3] C. BLYTH, "On minimax statistical decision procedures and their admissibility," Ann.
Math. Statist., Vol. 22 (1951), pp. 22-42.
[4] F. N. DAVID and J. NEYMAN, "Extension of the Markoff theorem on least squares,"
Statist. Res. Mem., Vol. 1 (1938), pp. 105-116.
[5] R. A. FISHER, "On the mathematical foundations of theoretical statistics." Philos. Trans.
Roy. Soc. London, Ser. A, Vol. 222 (1922), pp. 309-368.
[6] M. A. GIRSHICK and L. J. SAVAGE, "Bayes and minimax estimates for quadratic loss
functions," Proceedings of the Second Berkeley Symposium on Mathenaatical Statistics and
Probability, Berkeley and Los Angeles, University of California Press, 1951, pp. 53-73.
[7] J. L. HODGES, JR., and E. L. LEHMANN, "Some applications of the Cram6r-Rao inequality,"
Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability,
Berkeley and Los Angeles, University of California Press, 1951, pp. 13-22.
[8] S. KARLIN, "Admissibility for estimation with quadratic loss," Ann. Math. Statist., Vol.
29 (1958), pp. 406-436.
[9] J. KIEFER, "Invariance, minimax sequential estimation, and continuous time processes,"
Ann. Math. Statist., Vol. 28 (1957), pp. 573-601.
[10] H. KUDO, "On minimax invariant estimators of the transformation parameter," Nat. Sci.
Rep. Ochanomizu Univ., Vol. 6 (1955), pp. 31-73.
[11] E. L. LEHMANN, Testing Statistical Hypotheses, New York, Wiley, 1959, pp. 231 and 338.
[12] A. MARKOV, Calculus of Probability, St. Petersburg, 1908 (2nd ed.). (In Russian.)
[13] E. J. G. PITMAN, "Location and scale parameters," Biometrika, Vol. 30 (1939), pp. 391-
421.
[14] C. STEIN, "Inadmissibility of the usual estimator for the mean of a multivariate normal
distribution," Proceedings of the Third Berkeley Symposium on Mathematical Statistics and
Probability, Berkeley and Los Angeles, University of California Press, 1956, Vol. 1, pp.
197-206.
[15] ,"A necessary and sufficient condition for admissibility," Ann. Math. Statist.,
Vol. 26 (1955), pp. 518-522.
[16] , "The admissibility of Pitman's estimator for a single location parameter,"
Ann. Math. Statist., Vol. 30 (1959), pp. 970-979.
[17] , "Multiple regression," Contributions to Probability and Statistics, Essays in
Honor of Harold Hotelling, Stanford, Stanford University Press, 1960, pp. 424-443.
[18] A. WALD, "Contributions to the theory of statistical estimation and testing hypotheses,"
Ann. Math. Statist., Vol. 10 (1939), pp. 299-326.
[19] R. A. WIJSMAN, "Random orthogonal transformations and their use in some classical
distribution problems in multivariate analysis," Ann. Math. Statist., Vol. 28 (1957),
pp. 415-423.