13 ROBUST MEAN

One of the most influential methods of science is averaging. The result of averaging provides a convenient smooth representation of the studied quantities and also naturally suppresses potentially unrelated effects. The arithmetic mean is a commonly used method of averaging, characterised by simple formulas and widely known statistical properties. The arithmetic mean can be introduced by definition (as in Section 13.1) or derived from the principle of maximum likelihood (Section 13.2). The rest of this chapter develops, from maximum likelihood, methods for determination of the robust mean: the averaging method "insensitive to small deviations from assumptions", as introduced in Chapter 3.

This chapter is intended as a detailed description of methods suitable for computation of the robust mean, including their deeper explanation. A reader not interested in the details will find a short summary of a reliable algorithm in Section 13.11.

Two important books give the background for this chapter. The general introduction to statistics in data processing, including the development of the maximum likelihood method, is the book Brandt (2014). The robust parts of this chapter are developed on the basis of ideas from the book Huber (1981).

Although averaging realised by robust mean estimation looks trivial at first sight, it demonstrates, in plain basic form, all the important ideas of robust algorithms. This chapter opens the gate to a robust land: an abstract land where butterflies can fly without fear that their wings will be broken by storm drops they raised themselves.

13.1 ARITHMETIC MEAN

The arithmetic mean is the standard way to estimate the average of a data sample. The general practice for computing the arithmetic mean of a set of N values

    \{x_1, x_2, \ldots, x_N\}    (13.1)

is, by definition, the formula

    \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i.    (13.2)

The mean x̄ estimates the location of the centre of the input set. The scatter of the points x_i around the mean x̄ is represented by the standard (quadratic) deviation

    s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2.    (13.3)

The mean x̄ itself is localised more precisely than s indicates. To estimate the statistical uncertainty σ of x̄, we suppose that all points x_i have approximately the same deviation σ_i ≈ s and that the points are statistically independent. Under these assumptions we can apply the model of error propagation (Brandt (2014))

    \sigma^2 = \sum_{i=1}^{N} \left( \frac{\partial f}{\partial x_i} \right)^2 \sigma_i^2    (13.4)

to the arithmetic mean regarded as the function f(\{x_i\}) = (x_1 + x_2 + \cdots + x_N)/N, for which ∂f/∂x_i = 1/N for every point. Putting the terms together and summing the constant term gives

    \sigma^2 = \sum_{i=1}^{N} \frac{1}{N^2}\, s^2 = \frac{s^2}{N},    (13.5)

or σ = s/√N. The result of arithmetic averaging is usually presented in the form of the confidence interval

    \bar{x} \pm \sigma,    (13.6)

which optimistically estimates that the true value X of the quantity lies inside the interval x̄ − σ ≤ X ≤ x̄ + σ with probability 68%.

The arithmetic mean is a commonly used method. It is easy to use, numerically stable, and gives smooth results. The arithmetic mean is ideal for computing by hand: the formula (13.2) is very simple, the data can be easily inspected, and potentially false data suppressed. On the contrary, the arithmetic mean cannot be recommended for machine processing. It is sensitive to deviated data; if such data are present in a sample, the results will be scattered or, much worse, completely random.
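As a minimal illustration of (13.2)-(13.6), the following sketch (Python with NumPy; the function name and the small data subset are mine, taken from the example of Section 13.13) computes the mean, the sample deviation with Bessel's correction, and the standard error:

```python
import numpy as np

def arithmetic_mean(x):
    """Arithmetic mean (13.2), sample deviation (13.3) and its standard error (13.5)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    xbar = x.sum() / N                             # (13.2)
    s = np.sqrt(((x - xbar)**2).sum() / (N - 1))   # (13.3), Bessel-corrected
    sigma = s / np.sqrt(N)                         # (13.5)
    return xbar, s, sigma

xbar, s, sigma = arithmetic_mean([0.719, 0.983, 0.818, 0.933, 1.034])
print(f"{xbar:.4f} +- {sigma:.4f}  (s = {s:.4f})")  # confidence interval (13.6)
```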
13.2 ARITHMETIC MEAN BY MAXIMUM LIKELIHOOD

The method of maximum likelihood presents an impressive point of view on the whole subject of the arithmetic mean. The method also brings a very effective framework which provides optimal estimates of the average for data with an arbitrary statistical distribution; this is what makes it so important for generalisation.

This section briefly summarises the steps used to estimate the arithmetic mean and its statistical properties by the maximum likelihood method. The description is a summary of the equivalent chapter of Brandt (2014), which provides a more detailed and exact treatment.

The basic principle of the maximum likelihood method is maximisation of the probability which describes the measured quantities. In the case of the mean, we are looking for a probability common to all points. Supposing statistical independence of the individual points, each with the probability density function φ(x_i | x̄, s), the joint probability of all data points can be composed as the product

    \varphi(x_1|\bar{x},s) \cdot \varphi(x_2|\bar{x},s) \cdots \varphi(x_N|\bar{x},s).    (13.7)

To use the method on data, one a priori assumes a statistical distribution for every data point. In the case of the arithmetic mean, we suppose the Normal distribution N(x̄, s) with the probability density function

    \varphi(x|\bar{x},s) = \frac{1}{\sqrt{2\pi}\, s}\, e^{-(x-\bar{x})^2 / 2s^2}.    (13.8)

The analytic form of the probability is given by our assumption. The function represents a "distance" (a measure) assigned to every single point x: how much the point belongs to the distribution. Let us define the likelihood function

    L(\bar{x}, s) = \prod_{i=1}^{N} \varphi(x_i|\bar{x},s)    (13.9)

as a function of the parameters (independent variables) x̄, s. The parameters are then estimated in the following way: the goal of our effort is to set x̄, s so as to maximise the a priori known probability.

The methods for searching for the maximum of such products are indirect. We use the properties of the logarithm: it converts products to sums, and it is a monotone function, so the location of the maximum is unchanged. The logarithm of the likelihood is

    \ln L = \sum_{i=1}^{N} \ln \varphi(x_i|\bar{x},s),    (13.10)

which depends on the distribution function. Substituting φ and summing the constant elements gives

    \ln L = -\sum_{i=1}^{N} \frac{(x_i-\bar{x})^2}{2s^2} - N \ln s - \frac{N}{2} \ln 2\pi,    (13.11)

where the last term is the result of summing a constant function over all data. The extreme of the function lies at the point where the derivatives with respect to both parameters vanish:

    \frac{\partial \ln L}{\partial \bar{x}} = \sum_{i=1}^{N} \frac{x_i-\bar{x}}{s^2} = 0,    (13.12a)

    \frac{\partial \ln L}{\partial s} = \sum_{i=1}^{N} \frac{(x_i-\bar{x})^2}{s^3} - \frac{N}{s} = 0.    (13.12b)

The solution illustrates how the arithmetic mean can be derived from the general principle of maximum likelihood and the Normal distribution. The solution of (13.12a) for x̄ is

    \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i    (13.13)

and is identical to (13.2). The solution of (13.12b) for s gives

    s^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i-\bar{x})^2.    (13.14)

(Strictly speaking, s here is a parameter of the distribution φ; it is numerically equivalent to the s defined by (13.3), the same symbol being used for different quantities with the same meaning.) This solution is no longer equal to (13.3): we get the estimate with maximum probability, but the estimate is biased. A convenient way to correct it (Bessel's correction) is the replacement (Brandt (2014))

    \frac{N}{N-1}\, s^2 \rightarrow s^2,    (13.15)

which turns the solution of (13.12b) into (13.3); the difference is negligible for larger data sets.
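The bias in (13.14) is easy to see numerically. A small simulation (a sketch; the sample size, seed and repeat count are my arbitrary choices) compares the maximum-likelihood estimate with the Bessel-corrected one (13.3) against the true variance of 1:

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials = 5, 100_000

ml, bessel = [], []
for _ in range(trials):
    x = rng.normal(0.0, 1.0, N)
    ss = ((x - x.mean())**2).sum()
    ml.append(ss / N)            # maximum-likelihood estimate (13.14), biased
    bessel.append(ss / (N - 1))  # Bessel-corrected estimate (13.3), unbiased

print(np.mean(ml))      # close to (N-1)/N = 0.8, systematically low
print(np.mean(bessel))  # close to the true variance 1.0
```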
The scatter of the parameters is commonly estimated from the Hessian at the extreme. Suppose that a function f of the parameter vector

    \mathbf{r} = \begin{pmatrix} \bar{x} \\ s \end{pmatrix}    (13.16)

can be expanded at the point r into the Taylor series

    f(\mathbf{r} + \Delta\mathbf{r}) \approx f(\mathbf{r}) + J(\mathbf{r})\,\Delta\mathbf{r} + \tfrac{1}{2}\, \Delta\mathbf{r}^T \hat{H}(\mathbf{r})\, \Delta\mathbf{r} + \ldots,    (13.17)

where we introduce the Jacobian matrix J as the gradient

    J = \nabla f = \begin{pmatrix} \partial f / \partial \bar{x} \\ \partial f / \partial s \end{pmatrix}.    (13.18)

For f = ln L the elements of J are given by the formulas (13.12). The Hessian Ĥ is the matrix of second derivatives (taken at the maximum, in our case)

    \hat{H} = \begin{pmatrix} \partial^2 f / \partial\bar{x}^2 & \partial^2 f / \partial\bar{x}\,\partial s \\ \partial^2 f / \partial s\,\partial\bar{x} & \partial^2 f / \partial s^2 \end{pmatrix},    (13.19)

which at a general point is

    \hat{H} = -\frac{1}{s^2} \begin{pmatrix} N & (2/s)\sum_i (x_i-\bar{x}) \\ (2/s)\sum_i (x_i-\bar{x}) & (3/s^2)\sum_i (x_i-\bar{x})^2 - N \end{pmatrix},    (13.20)

and at a correctly determined extreme, where Σ_i (x_i − x̄) → 0 and Σ_i (x_i − x̄)² → N s² (see (13.12)), we get

    \hat{H} = -\frac{N}{s^2} \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}.    (13.21)

The covariance matrix is the inverse of the Hessian at the extreme, which gives in this case the diagonal matrix

    \hat{H}^{-1} = -\frac{s^2}{N} \begin{pmatrix} 1 & 0 \\ 0 & 1/2 \end{pmatrix}.    (13.22)

A diagonal matrix means an orthogonal choice of parameters: changes in the coordinates are independent. The statistical error of x̄ can be estimated as

    \sigma^2 = |(\hat{H}^{-1})_{11}|,    (13.23)

or σ = s/√N, which reproduces the result (13.5). While the statistical error of the arithmetic mean is routinely quoted, the error of s is usually left unnoticed, although from (13.22) it can easily be estimated as s/√(2N).

Near the maximum, the function ln L can be approximated as

    \ln L(x, s) = \ln L(\bar{x}, s) + \tfrac{1}{2}\, \Delta\mathbf{r}^T \hat{H}(\mathbf{r})\, \Delta\mathbf{r} + \ldots,    (13.24)

which gives

    L(x|\bar{x},\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\bar{x})^2 / 2\sigma^2}.    (13.25)

The joint distribution function can thus be approximated again by a Normal distribution with the narrower width σ. The fact can be illustrated on the test data of Section 13.13. Figure 13.3 shows the likelihood function of this section, labelled "Normal". The generated points have the distribution N(1, 0.1), so the individual densities are wider than the plotted range; but the product of all the functions has a width of about 0.026, in agreement with σ = 0.1/√15 ≈ 0.026 from (13.5). The shape of the product "Normal" and of N(1, 0.026) is approximately the same, as expected. This is one demonstration of the law of large numbers.

The parameter x̄ represents the centre of the distribution of the measured values. This centre is usually the goal of our measurements, and usually it describes the real system we are interested in. In contrast, s is only seldom related to the measured system; it is rather a property of the measuring device, namely its precision. Occasionally there is a requirement to combine measurements from devices with different precision; for that case, see Section 13.8.
13.3 ROBUST MEAN BY MAXIMUM LIKELIHOOD

The maximum likelihood framework summarised in the previous section can also be used for robust estimation of the mean, as the central moment of a general, and in particular a robust, distribution. We present the application in the robust case with a robust distribution, but it can be used in the general case as well.

The robust mean can be described by a general distribution ϱ and its parameters in the fashion

    \frac{1}{\Gamma}\, \frac{1}{s}\, \varrho\!\left( \frac{x_i - \tilde{x}}{s} \right).    (13.26)

The function is designed for estimation of the central moment x̃ and the scale s, in analogy with the Normal distribution. Note the factor 1/s which scales the whole function; the factor is necessary to make the scale estimable. The transformation of x_i is important because the distribution functions are centred on the origin and have unit scatter. The normalisation factor Γ (like √(2π) for the Normal distribution) is defined, with s = 1, by the property

    \frac{1}{\Gamma} \int_{-\infty}^{\infty} \varrho(x - \tilde{x})\, dx = 1.

The maximum likelihood principle for this case is, as one expects,

    L = \prod_{i=1}^{N} \frac{1}{\Gamma}\, \frac{1}{s}\, \varrho\!\left( \frac{x_i - \tilde{x}}{s} \right)    (13.27)

and its logarithmic version

    \ln L = \sum_{i=1}^{N} \ln \varrho\!\left( \frac{x_i - \tilde{x}}{s} \right) - N \ln s - N \ln \Gamma.    (13.28)

For the derivatives we use the substitution (note the choice of sign)

    \psi = -(\ln \varrho)' = -\frac{\varrho'}{\varrho},    (13.29)

where ψ is a suitable (robust) function. To simplify the notation, we also substitute the normalised residuals

    u_i = \frac{x_i - \tilde{x}}{s}.    (13.30)

The solution leads to the system of equations (the analogy of (13.12))

    \frac{\partial \ln L}{\partial \tilde{x}} = \frac{1}{s} \sum_{i=1}^{N} \psi(u_i) = 0,    (13.31a)

    \frac{\partial \ln L}{\partial s} = \frac{1}{s} \sum_{i=1}^{N} \psi(u_i)\, u_i - \frac{N}{s} = 0.    (13.31b)

The Hessian is a symmetric matrix with the elements

    \frac{\partial^2 \ln L}{\partial \tilde{x}^2} = -\frac{1}{s^2} \sum_{i=1}^{N} \psi'(u_i),    (13.32a)

    \frac{\partial^2 \ln L}{\partial \tilde{x}\, \partial s} = -\frac{1}{s^2} \sum_{i=1}^{N} \left[ \psi(u_i) + \psi'(u_i)\, u_i \right],    (13.32b)

    \frac{\partial^2 \ln L}{\partial s^2} = -\frac{1}{s^2} \left[ \sum_{i=1}^{N} \left( 2\,\psi(u_i)\, u_i + \psi'(u_i)\, u_i^2 \right) - N \right].    (13.32c)

Near the extreme, using (13.31), the Hessian becomes

    \hat{H} = -\frac{1}{s^2} \begin{pmatrix} \sum \psi'(u_i) & \sum \psi'(u_i)\, u_i \\ \sum \psi'(u_i)\, u_i & N + \sum \psi'(u_i)\, u_i^2 \end{pmatrix}.    (13.33)

The off-diagonal terms tend to vanish because ψ' is approximately constant over the core of the data and the u_i are distributed so as to give a minimal sum. The statistical errors of the parameters are estimated using the robust version of the covariance matrix (Huber (1981))

    \sigma^2 = s^2\, \frac{N}{N-1}\, \frac{\sum \psi^2(u_i)}{\left[ \sum \psi'(u_i) \right]^2},    (13.34)

where we recognise the bias correction of the scale. (Huber (1981) suggests multiplying the term under the square by a further correction; for m parameters it is

    K = 1 + \frac{m}{N}\, \frac{\mathrm{var}(\psi')}{(\mathrm{E}\psi')^2}    (13.35)

with

    \mathrm{E}\psi' \approx \frac{1}{N} \sum_i \psi'(u_i)    (13.36)

and

    \mathrm{var}(\psi') \approx \frac{1}{N} \sum_i \left[ \psi'(u_i) - \mathrm{E}\psi' \right]^2.    (13.37)

Due to the 1/N dependence, this correction is negligible except for very noisy data and small data sets.)

The equations may look unfamiliar. Fortunately, they can be easily understood with the special choice which asymptotically reproduces least squares, ϱ(u) ∝ exp(−u²/2); by (13.29) we get

    \psi(u) = u,    (13.38)

which reduces the general functions to the already known set of equations of the arithmetic mean. Indeed, the relations

    \psi'(u) = 1,    (13.39a)
    \psi'(u_i) \rightarrow 1,    (13.39b)
    \psi(u_i) \rightarrow u_i    (13.39c)

reduce the system (13.32) to (13.20).

The meaning of the individual terms in (13.34) can be easily understood by analogy with the Normal distribution case. The quantity

    s^2 \sum \psi(u_i)^2    (13.40)

is practically the robust analogy of the residual sum S₀, and

    N \approx \sum \psi'(u_i)    (13.41)

is an estimate of the number of acceptable data points.
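To make (13.29)-(13.31) concrete, here is a sketch of Huber's ψ together with the two estimating equations evaluated at given parameters; the roots of this system are the robust estimates. The clipping constant k = 1.345 is a common tuning choice of mine, not a value prescribed by the text:

```python
import numpy as np

def psi_huber(u, k=1.345):
    """Huber's psi: identity in the core, clipped at +-k for outliers."""
    return np.clip(u, -k, k)

def dpsi_huber(u, k=1.345):
    """Derivative of Huber's psi: 1 in the core, 0 outside."""
    return (np.abs(u) <= k).astype(float)

def estimating_equations(x, t, s):
    """Left-hand sides of (13.31a) and (13.31b); both vanish at the solution."""
    x = np.asarray(x, float)
    u = (x - t) / s                                       # normalised residuals (13.30)
    eq_location = psi_huber(u).sum() / s                  # (13.31a)
    eq_scale = (psi_huber(u) * u).sum() / s - len(x) / s  # (13.31b)
    return eq_location, eq_scale
```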
13.4 SOLUTION OF THE NON-LINEAR SYSTEM

Computation of the system of equations for the robust mean (13.31) is more complicated than for the arithmetic mean (13.12). While the unknowns can easily be separated in the case of (13.12), the system (13.31) is solvable only numerically.

The basic method with a fixed s is described first below; joint estimates of x̃ and s follow, but that method cannot be recommended unconditionally, as the text of this section describes. The generally recommended method is the use of standard minimisation procedures.

The estimation of the robust mean in the unkind numerical environment of modern machines is an extremely delicate job. The estimation of the arithmetic mean is also a little delicate, but in practice it is limited only by the precision of the floating-point representation. The robust mean is more complicated but, thanks to the scaling of the values, numerically more stable. The possible complications arise from the general non-linearity of common robust functions: the solution is more time consuming and depends drastically on the initial estimates.

There are many methods for the solution of non-linear systems of equations like (13.31). The methods need (good) initial estimates of the solution x̃⁽⁰⁾, s⁽⁰⁾. Huber (1981), in section 6.7 "The computation of M-estimates", recommends four variants. I have tested extensively (many years of testing; possibly 10¹² or more computations have been performed) these two principal variants:

NEWTON'S ITERATION

The Newton method solves our problem in the form

    f(\tilde{x}) = \sum_{i=1}^{N} \psi\!\left( \frac{x_i - \tilde{x}}{s} \right) = 0.    (13.42)

Its derivative is

    f'(\tilde{x}) = -\frac{1}{s} \sum_{i=1}^{N} \psi'\!\left( \frac{x_i - \tilde{x}}{s} \right),    (13.43)

which we suppose to be non-zero. The general form of Newton's method is

    \tilde{x}^{(k+1)} = \tilde{x}^{(k)} - \frac{f(\tilde{x}^{(k)})}{f'(\tilde{x}^{(k)})}.    (13.44)

The initial estimate of the scale s⁽⁰⁾ is kept fixed, and Newton's method is used to compute successively better approximations of the robust mean:

    \tilde{x}^{(k+1)} = \tilde{x}^{(k)} + s^{(0)}\, \frac{\sum \psi(u_i)}{\sum \psi'(u_i)}.    (13.45)

The method is very reliable. It converges very quickly on average data: the number of iterations is up to 10 for single precision (10⁻⁷) and up to 15 for double precision (10⁻¹⁶). Therefore I limit it to seventeen.

The precision is tested against the machine precision ε, both for the relative and for the absolute size of the correction

    \delta = \left| s^{(0)}\, \frac{\sum \psi(u_i)}{\sum \psi'(u_i)} \right|    (13.46)

via the conditions

    \delta / |\tilde{x}^{(k+1)}| < \varepsilon \ \wedge\ \delta < \varepsilon.    (13.47)

It is important to check the denominator Σ ψ'(u_i), because a bad initial estimate of s⁽⁰⁾ or bad data will make it vanish. With normally distributed data the term is practically constant, with the meaning of the amount of data. This property is also noticed by Huber (1981).

JOINT M-ESTIMATES OF LOCATION AND SCALE

The previous method relies on a proper initial estimate of the scale. That assumption is only partially true: the estimate can be imprecise at the level of the order of ten percent. Because the scale estimate influences the estimate of the mean, the difference can lead to a difference in the mean at the level of a few percent, which is inadequate for precise results. Therefore the simultaneous estimation is required.

Huber (1981) recommends replacing the equations (13.31) by

    \sum_{i=1}^{N} \psi\!\left( \frac{x_i - \tilde{x}}{s} \right) = 0,    (13.48a)

    \sum_{i=1}^{N} \psi^2\!\left( \frac{x_i - \tilde{x}}{s} \right) = N - 1,    (13.48b)

where N has been replaced by (N − 1) and we suppose that the expectation value of the second moment Eψ² (the deviation) is exactly one. (ψ² is insensitive to outliers; if winsorisation is applied to the data, the original (13.31) works just as well.) The system is recommended to be solved as

    s^{(k+1)} = \sqrt{ \frac{[s^{(k)}]^2}{N-1} \sum \psi^2\!\left( \frac{x_i - \tilde{x}^{(k)}}{s^{(k)}} \right) },    (13.49a)

    \tilde{x}^{(k+1)} = \tilde{x}^{(k)} + s^{(k)}\, \frac{\sum \psi(u_i)}{\sum \psi'(u_i)}.    (13.49b)

The first equation is a fixed-point iteration while the second is Newton's method. Separating the solution into these two parts is only possible when the initial estimate is very good and both parameters are nearly orthogonal to each other. The orthogonality means independence of the mean from the scatter; it holds only partially in real situations (see Section 13.13).

The use of a general method for the solution of the set of non-linear equations, as provided by Minpack's lmder (or lmdif), is recommended. One important point: all methods rely on satisfying the condition

    s > 0,    (13.50)

which is very important, since both the Hessian and the gradient are symmetric under s → −s and depend on the absolute value only.
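A sketch of the simultaneous iteration (13.49) under the assumptions above (Huber's ψ with my choice k = 1.345, and reasonable initial estimates t0, s0, for example the median and MAD of Section 13.6); this follows the scheme recommended by Huber (1981) but is not a polished implementation:

```python
import numpy as np

def psi_huber(u, k=1.345):
    return np.clip(u, -k, k)

def dpsi_huber(u, k=1.345):
    return (np.abs(u) <= k).astype(float)

def joint_iteration(x, t0, s0, max_iter=50, eps=1e-12):
    """Simultaneous location/scale iteration: (13.49a) fixed point, (13.49b) Newton."""
    x = np.asarray(x, float)
    N, t, s = len(x), t0, s0
    for _ in range(max_iter):
        u = (x - t) / s
        s_new = np.sqrt(s**2 * (psi_huber(u)**2).sum() / (N - 1))  # (13.49a)
        denom = dpsi_huber(u).sum()
        if denom == 0.0:             # guard: all points clipped, Newton step undefined
            break
        t_new = t + s * psi_huber(u).sum() / denom                 # (13.49b)
        if abs(t_new - t) < eps and abs(s_new - s) < eps:
            return t_new, s_new
        t, s = t_new, s_new
    return t, s
```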
13.5 REGION OF CONVERGENCE OF GRADIENT

The function (13.28) and its equivalents have one global minimum, because they are sums of functions with one global minimum; the location of the minimum is unique. The fast gradient methods, such as the Levenberg-Marquardt method, do not use the function itself. They locate the minimum of the gradient g = ∇f via

    \sum_i g_i \cdot g_i.    (13.51)

For the functions to which we restrict ourselves, the minimum lies at the same point, but the function itself can differ far from the minimum. For robust functions the surface has a shape like a high-heeled shoe; see Figure 13.1.

[Figure 13.1: Minimum of the gradient function, plotted over x̃ (horizontal axis) and σ (vertical axis).]

Figure 13.1 is derived from 666 data points with 95% from N(0, 1) and 5% from N(0, 5). The graph shows the value of the function g² for various values of x̃, σ. The function is displayed as a surface with the lightness of gray proportional to the function value. The heel has its minimum at (0, 1).

There is an important implication: the initial estimate must lie relatively near the right minimum. The region of convergence is only about 2 wide in Figure 13.1. As a general recommendation, the optimisation strategy has two ways:

a) Directly minimise (13.28). The method cannot use the gradient (except in the near neighbourhood of the minimum). The simplex method (Nelder-Mead) can be used. The convergence is slow, and the estimation of statistical errors from the Hessian is complicated.

b) Minimise the gradient (13.31), which requires a reliable initial estimate (ideally obtained by a non-gradient method). The convergence is fast and precise, and the estimation of the Hessian at the minimum is a side effect of the minimisation.

13.6 INITIAL ESTIMATES

The initial estimates x̃⁽⁰⁾, s⁽⁰⁾ are crucial for the success of everything. Huber (1981) recommends, on the basis of a study of the statistical properties of various distributions, the median (med) and the median absolute deviation (MAD):

    \mu = \mathrm{median}\{x_1, x_2, \ldots, x_N\}.    (13.52)

The MAD is related to the standard deviation through Φ⁻¹(3/4) ≈ 0.6745, the inverse cumulative function of the Normal distribution, so

    s^{(0)} = \frac{\mathrm{median}\, |x_i - \mu|}{0.6745}.    (13.53)

My experience with these recommended estimators is excellent for large data sets of normally distributed data with an insignificant number of outliers (up to 1/4 of the data). For a few points (up to ten), the estimates are unreliable and sometimes random. For few data, it is better to use quantiles, which are more robust and less sensitive to a non-normal distribution of the data.

To prevent these problems, quantiles of the empirical distribution are used, by the following algorithm. (The algorithm can also replace the estimate by the median; unfortunately, it is significantly slower.) Let us define the empirical distribution function from the sorted input data by the plotting positions

    F_k = \frac{k - 1/2}{N} \quad \text{for } k = 1, \ldots, N,    (13.54)

which symmetrises the covering of the interval by the points 1/N into

    \frac{1}{2N},\ \frac{3}{2N},\ \ldots,\ 1 - \frac{1}{2N},    (13.55)

and the location parameter is estimated as the quantile q₁/₂ = 1/2, linearly interpolated in the quantile function: for k such that F_k < q₁/₂ ≤ F_{k+1},

    \mu = \frac{x_{k+1} - x_k}{F_{k+1} - F_k}\, (q_{1/2} - F_k) + x_k,    (13.56)

and analogously for s⁽⁰⁾, with |x − µ| used in place of x. The use of the cumulative distribution function is a refined version of the median: the median uses sorted values, as the CDF does, but its centre is roughly taken as the middle point, whereas here an interpolation at q = 0.5 is used.
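A sketch of the classical initial estimates (13.52)-(13.53); the constant 0.6745 ≈ Φ⁻¹(3/4) is the one from the text:

```python
import numpy as np

def initial_estimates(x):
    """Median (13.52) and MAD-based scale (13.53)."""
    x = np.asarray(x, float)
    mu = np.median(x)                          # (13.52)
    s0 = np.median(np.abs(x - mu)) / 0.6745    # (13.53): MAD rescaled for Normal data
    return mu, s0

# The interpolated quantile of (13.54)-(13.56) uses the Hazen plotting
# positions (k - 1/2)/N; NumPy >= 1.22 offers this interpolation directly:
#   mu = np.quantile(x, 0.5, method="hazen")
```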
13.7 STUDENTIZING

13.8 QUESTION OF WEIGHTS

The presented approach can be generalised to data with a different σ_i for each observation x_i. This situation is encountered when the observations have been acquired by devices with different internal precision, or under different conditions. Data with significantly different σ_i are, for normally distributed data, met very rarely. (Metaphorically, one assigns a weight w_i to every point. The choice of the individual weights depends on "experience"; from the statistical point of view, the weights should be w_i = 1/σ_i². Unfortunately, weights have often been chosen arbitrarily or, worse, so as to remove inconvenient points. This kind of manipulation cannot be recommended in any way; primarily, the robust methods offer a better alternative.)

The key change is to use the already known dispersions σ_i in the transformation (x_i − x̄)/σ_i of all data to the normal-like distribution N(0, 1). This presumption is very optimistic. It is therefore better to suppose that the values σ_i are estimated only roughly, and to write the transformation to the normal distribution as

    \frac{x_i - \bar{x}}{t\, \sigma_i},    (13.57)

where t is an effective scale of the observed scatter. When the estimates of σ_i are correct, one can expect t ≈ 1. This is especially important for distributions other than the least-squares one.

The generalisation of equation (13.9) is straightforward:

    L = \prod_{i=1}^{N} \varphi(x_i | \bar{x}, t\sigma_i),    (13.58)

which for the Normal distribution (the non-robust case) leads to the set of equations

    \frac{\partial \ln L}{\partial \bar{x}} = \frac{1}{t^2} \sum_{i=1}^{N} \frac{x_i - \bar{x}}{\sigma_i^2} = 0,    (13.59a)

    \frac{\partial \ln L}{\partial t} = \frac{1}{t} \left[ \sum_{i=1}^{N} \frac{(x_i - \bar{x})^2}{t^2 \sigma_i^2} - N \right] = 0,    (13.59b)

with the solution

    \bar{x} = \frac{ \sum_{i=1}^{N} x_i / \sigma_i^2 }{ \sum_{i=1}^{N} 1 / \sigma_i^2 }    (13.60)

and

    t^2 = \frac{1}{N} \sum_{i=1}^{N} \frac{(x_i - \bar{x})^2}{\sigma_i^2}.    (13.61)

The solution for t² is a χ² statistic divided by the total number of data points. Another important aspect is the changed meaning of the scale: it no longer measures the scatter of the data, but is a scale of the data scatter relative to the σ_i.

The robust approach is similar. We again start from (13.9), where ϱ is a robust function. The logarithm of the likelihood function is

    \ln L = \sum_{i=1}^{N} \ln \varrho\!\left( \frac{x_i - \tilde{x}}{t\sigma_i} \right) - \sum_{i=1}^{N} \ln \sigma_i - N \ln t - N \ln \Gamma,    (13.62)

with the solution given by the set of non-linear equations (substitution (13.29))

    \frac{\partial \ln L}{\partial \tilde{x}} = \sum_{i=1}^{N} \psi\!\left( \frac{x_i - \tilde{x}}{t\sigma_i} \right) \frac{1}{t\sigma_i} = 0,    (13.63a)

    \frac{\partial \ln L}{\partial t} = \sum_{i=1}^{N} \psi\!\left( \frac{x_i - \tilde{x}}{t\sigma_i} \right) \frac{x_i - \tilde{x}}{t^2 \sigma_i} - \frac{N}{t} = 0,    (13.63b)

which must be solved numerically. Note that the equation (13.63b) for t should be rewritten into the equivalent form

    \sum_{i=1}^{N} \psi^2\!\left( \frac{x_i - \tilde{x}}{t\sigma_i} \right) = N - 1.    (13.64)

The simultaneous estimation of both x̃ and t is highly recommended, because the estimates of σ_i are rarely correct, which can significantly degrade the robust estimates. Just for information, the robust mean can be rewritten in the form of a weighted mean as ....

13.9 QUESTION OF DESCENDING ESTIMATOR

By the shape of the ψ function, there are two kinds of robust estimators: monotone and descending. Huber's function is monotone; the others (Hampel's, Tukey's, Andrews') are descending. The importance of the shape of the function can be shown on Newton's method. The convergence behaviour of Newton's method is known (Ralston and Rabinowitz (2012)): the error obeys

    \epsilon^{(k+1)} = \frac{|f''(x^{(k)})|}{2\, |f'(x^{(k)})|} \left( \epsilon^{(k)} \right)^2,    (13.65)

so the condition

    |f''(x)| < 2\, |f'(x)|    (13.66)

must be satisfied. The region of convergence can easily be discovered from (13.65); see Figure 13.2. For Tukey's or Andrews' functions, for example, the condition is not satisfied over the whole interval −a ≤ u ≤ a, as the graphs show.

[Figure 13.2: Convergence intervals for descending functions, for Tukey's and Andrews' ψ.]
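The contrast between a monotone and a descending ψ can be made concrete with a small sketch (Huber's k = 1.345 and Tukey's c = 4.685 are common tuning constants, my choice, not values from the text). For residuals beyond c, Tukey's ψ and ψ' are identically zero, so the Newton denominator Σ ψ'(u_i) can vanish for a badly placed initial estimate:

```python
import numpy as np

def psi_huber(u, k=1.345):
    """Monotone: grows, then stays at +-k; outliers keep a bounded influence."""
    return np.clip(u, -k, k)

def psi_tukey(u, c=4.685):
    """Descending (biweight): returns to zero; far outliers lose all influence."""
    u = np.asarray(u, float)
    return np.where(np.abs(u) <= c, u * (1 - (u / c)**2)**2, 0.0)

u = np.array([-10.0, -2.0, -0.5, 0.5, 2.0, 10.0])
print(psi_huber(u))   # [-1.345 -1.345 -0.5    0.5    1.345  1.345]
print(psi_tukey(u))   # [ 0.    -1.337 -0.489  0.489  1.337  0.   ]
```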
13.10 QUESTION OF OUTLIERS

Robust estimators work excellently with "small changes from the presumptions", but large deviations, such as data from another statistical sample or erroneous data, commonly known as outliers, can destroy the robust estimate.

There are two alternatives which minimise the influence of outliers: the choice of a robust function which vanishes at infinity, and removing or replacing the outliers.

Robust functions which vanish at infinity are Hampel's (11.3), Andrews' (11.4) or Tukey's (11.5). The vanishing condition requires a non-monotone function, which implies that Newton's method (13.45) may diverge. The convergence of the method requires that the ratio Σψ / Σψ' be small. Unfortunately, in a certain range of the parameter this condition is not met, which leads to divergence. Therefore the use of these functions cannot be recommended for the Newton method. The use of a gradient-free method can be satisfactory when the estimation of uncertainties is not required. Huber (1981) has his "A Word of Caution" on this point in section 4.8, Descending M-estimates.

WINSORIZATION

A more reliable way of handling outliers is the technique which replaces them. The replacement by the formula (for the definition of the sign function see (11.2))

    x_i^* = \begin{cases} x_i, & |x_i - \mu| \le c\,s, \\ \mu + c\,s\, \mathrm{sign}(x_i - \mu), & |x_i - \mu| > c\,s \end{cases}    (13.67)

is known as data winsorization (winsorizing: we Winsorize the data; the procedure is named after its inventor, Charles P. Winsor (1895-1951)). The estimation of the parameters must be performed by a robust method, by the median µ, as hinted above. The parameter c setting the limit should be given an appropriate value 1 < c < 2; see Huber (1981), Section 1.7. My experience with data containing outliers shows that a better interval is 1 < c < h (the upper constant is given by Huber as h = 1.349), say c = 0.9 h.

An alternative to winsorization is the well-known "clipping", in which the outliers are removed from the sample, usually on the basis of a criterion similar to that of winsorisation. The result will generally be similar, except for short samples, where clipping can remove a significant amount of data.
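A sketch of (13.67); µ and the limit (here passed as a single value chi = c·s) must come from robust estimates such as the median and MAD, otherwise the outliers corrupt their own clipping limits:

```python
import numpy as np

def winsorize(x, mu, chi):
    """Winsorization (13.67): pull points farther than chi from mu onto the boundary."""
    x = np.asarray(x, float)
    return np.where(np.abs(x - mu) <= chi, x, mu + chi * np.sign(x - mu))
```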
13.11 AN ALGORITHM

Here is a summary of the development of this chapter in the form of an algorithm for the computation of the robust mean. The algorithm has been heavily tested as part of the Munipack code and can be considered a prototype of a robust algorithm.

Prerequisite. Let {x_1, x_2, …, x_N} be a set of N single non-identical numbers from R. The data should be represented in the computer by an array of real (floating-point) numbers.

Robust Mean Algorithm.

i) The initial estimate of the central moment µ is given by the median (13.52):

    \mu = \mathrm{median}\{x_1, x_2, \ldots, x_N\}.    (13.68)

ii) The initial estimate of the standard deviation s. Let the absolute deviations be

    d_i = |x_i - \mu| \quad \text{for } i = 1, \ldots, N    (13.69)

and the median of the absolute deviations (MAD) be

    s_\mathrm{mad} = \mathrm{median}\{d_1, d_2, \ldots, d_N\}.    (13.70)

The estimate of the standard deviation is finally, by (13.53),

    s = \frac{s_\mathrm{mad}}{0.6745}.    (13.71)

It is strongly recommended to check the condition s > η (η being a non-zero positive constant larger than the machine precision).

iii) Winsorisation according to (13.67) with the substitution χ = 1.2 s_mad:

    x_i^* = \begin{cases} x_i, & |x_i - \mu| \le \chi, \\ \mu + \chi\, \mathrm{sign}(x_i - \mu), & |x_i - \mu| > \chi. \end{cases}    (13.72)

iv) Location of the minimum of the robust function. Defining the residuals

    u_i = \frac{x_i^* - \mu}{s} \quad \text{for } i = 1, \ldots, N,    (13.73)

we minimise the function

    -\ln L(u_i | \tilde{x}, s) = \sum_{i=1}^{N} \varrho(u_i) + N \ln s,    (13.74)

where ϱ(u) denotes here the integral of the robust function ψ, i.e. minus the logarithm of the assumed density (cf. (13.29)). This step locates the extreme of ln L with a certain precision. The recommended method for the minimisation is the simplex method (Nelder and Mead (1965)) or any other method using no derivatives.

v) Robust estimation by solving the set of equations for the parameters x̃, s (via u_i):

    \sum_{i=1}^{N} \psi(u_i) = 0,    (13.75a)

    \sum_{i=1}^{N} \psi(u_i)\, u_i = N.    (13.75b)

The recommended method is Levenberg-Marquardt (Marquardt (1963)) with the analytic Jacobian given by (13.32). The method is regularised (insensitive to errors), fast, and provides the most precise solution.

vi) The uncertainty σ of the robust mean is estimated at the minimum by (13.34), using the results of the previous step:

    \sigma^2 = s^2\, \frac{N}{N-1}\, \frac{\sum_{i=1}^{N} \psi^2(u_i)}{\left[ \sum_{i=1}^{N} \psi'(u_i) \right]^2}.    (13.76)

Result. The result of the algorithm is the estimate of the robust mean with its uncertainty,

    \tilde{x} \pm \sigma,    (13.77)

and the standard deviation s.

Recommendations. A reliable implementation should check that N > 0 and s > 0 (the latter during all iterations).

13.12 A SIMPLIFIED ALGORITHM

The general algorithm of Section 13.11 simultaneously optimises both parameters x̃, s. This simplified version estimates the scale parameter (standard deviation) s by the median of absolute deviations, which reduces the parameter space by one dimension. Newton's root-finding method can then be used, which speeds up the iteration considerably thanks to its quadratic convergence. There is a small loss of precision (up to 10-20 %).

The simplified algorithm has the same assumptions and notation as the general algorithm of Section 13.11. The initial steps i)-iii) (through winsorisation) are the same; the alternative way replaces steps iv) and v) by the single step:

v) Iterate by (13.45), with x̃⁽⁰⁾ = µ and u_i⁽ᵏ⁾ = (x_i^* − x̃⁽ᵏ⁾)/s:

    \tilde{x}^{(k+1)} = \tilde{x}^{(k)} + s\, \frac{ \sum_{i=1}^{N} \psi(u_i^{(k)}) }{ \sum_{i=1}^{N} \psi'(u_i^{(k)}) } \quad \text{for } k = 0, 1, \ldots    (13.78)

The iterations can stop when |x̃⁽ᵏ⁺¹⁾ − x̃⁽ᵏ⁾| < ε, where ε is the required precision (the machine precision). The estimation of the uncertainty is again by step vi).

A reliable implementation should check N > 0 and s > 0 once the initialisation is finished. The iteration can converge only when the correction term s Σψ(u_i) / Σψ'(u_i) decreases in magnitude, and also Σψ'(u_i) ≠ 0. When no convergence occurs, the loop should be limited to an appropriate number of iterations (42).
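Putting the steps together, here is a compact sketch of the simplified algorithm of Section 13.12: median/MAD initialisation, winsorisation, Newton iteration on the location with fixed scale, and the uncertainty (13.76). It is a prototype under the stated assumptions (Huber's ψ with my choice k = 1.345), not the Munipack implementation:

```python
import numpy as np

def psi(u, k=1.345):
    return np.clip(u, -k, k)

def dpsi(u, k=1.345):
    return (np.abs(u) <= k).astype(float)

def robust_mean(x, max_iter=42, eps=1e-12):
    x = np.asarray(x, float)
    N = len(x)
    assert N > 0
    # i)-ii) initial estimates: median and MAD-based scale
    mu = np.median(x)
    s_mad = np.median(np.abs(x - mu))
    s = s_mad / 0.6745
    assert s > 0, "degenerate sample: zero scale"
    # iii) winsorisation with chi = 1.2 * s_mad  (13.72)
    chi = 1.2 * s_mad
    xw = np.where(np.abs(x - mu) <= chi, x, mu + chi * np.sign(x - mu))
    # v) Newton iteration (13.78) for the location, scale fixed
    t = mu
    for _ in range(max_iter):
        u = (xw - t) / s
        denom = dpsi(u).sum()
        if denom == 0.0:       # all points clipped: step undefined, give up
            break
        step = s * psi(u).sum() / denom
        t += step
        if abs(step) < eps:
            break
    # vi) uncertainty of the robust mean (13.76)
    u = (xw - t) / s
    sigma = np.sqrt(s**2 * N / (N - 1) * (psi(u)**2).sum() / dpsi(u).sum()**2)
    return t, sigma, s
```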
13.13 AN EXAMPLE

This section illustrates the methods on a numerical example, which can also be used for testing purposes. All results are rounded to 4 digits, although they have been computed with at least 16 digits.

As a test set, a sequence of 15 elements has been generated from Normal distributions with the same dispersion but with centres at the points 1 (good) and 0 (bad):

    \{x_i \in N(1, 0.1),\ i = 1, \ldots, 13\} + \{x_i \in N(0, 0.1),\ i = 14, 15\}    (13.80)

with the result

    \{0.719, 0.983, 0.818, 0.933, 1.034, 1.005, 1.145, 1.255, 1.039, 1.041, 1.078, 1.111, 0.872, 0.288, 0.137\}.    (13.81)

The amount of data is small and the example is instructive only. The data of the bad distribution represent 13% of all points.

(A generator of random numbers from the standard library of a Fortran compiler has been used. The elements are drawn from the Uniform distribution q ∈ U(0, 1) (on the interval 0 ≤ q < 1) and q represents a probability. The Normal distribution has been established from the inverse cumulative distribution function of x ∈ N(µ, σ) as x = µ − √2 σ erf⁻¹(2q − 1), where the inverse erf function has an approximation with precision better than 3.5·10⁻⁴:

    \mathrm{erf}^{-1}(x) \approx \mathrm{sign}(x) \sqrt{ \sqrt{ \left( \frac{2}{\pi a} + \frac{\ln(1-x^2)}{2} \right)^{\!2} - \frac{\ln(1-x^2)}{a} } - \left( \frac{2}{\pi a} + \frac{\ln(1-x^2)}{2} \right) },    (13.79)

where

    a = \frac{8(\pi - 3)}{3\pi(4 - \pi)}.

This approximation has been published only at Wikipedia (2016).)

ARITHMETIC MEAN

The results of the arithmetic mean, as introduced in Section 13.2, are

    \bar{x} = 0.8972,    (13.82a)
    s = 0.3088,    (13.82b)
    \sigma = 0.0797.    (13.82c)

At the minimum, the Hessian is

    \hat{H} = \begin{pmatrix} -157.2703 & 0.0000 \\ 0.0000 & -314.5407 \end{pmatrix}.    (13.83)

The results demonstrate a strong bias toward the bad data and cannot be accepted. Figure 13.3 shows the likelihood function of the Normal distribution for the data; the maximum of the function is visibly shifted, and the confidence interval does not include the expected centre location.

ROBUST MEAN

Application of the single steps of the algorithm of Section 13.11:

i) The initial estimate of the robust mean by the median is

    \mu = 0.9940,    (13.84)

ii) and its scatter

    s = 0.1490.    (13.85)

iii) The limit for winsorisation is χ ≈ 0.12, so untouched values lie in the range 0.872 … 1.116. The original set is transformed to (changed values are starred)

    \{0.872^*, 0.983, 0.818, 0.933, 1.034, 1.005, 1.145, 1.116^*, 1.039, 1.041, 1.078, 1.111, 0.872, 0.872^*, 0.872^*\}.    (13.86)

The total number of changed values is 4 (27%). Because the limits lie at about 1.21 s for N(0, 1) data, the expected number of chance changes was 11% (1-2 elements); together with the two true outliers present in the sample, this is exactly what we expected.

iv) Robust estimation by minimising the integral of Huber's function with the simplex algorithm gives

    \tilde{x} = 0.9753,    (13.87)
    s = 0.1205.    (13.88)

v) The same result for the mean (within rounding) is given by the Levenberg-Marquardt minimisation of Huber's function:

    \tilde{x} = 0.9753,    (13.89)
    s = 0.1386,    (13.90)
    \sigma = 0.0547.    (13.91)

The Hessian at the minimum is

    \hat{H} = \begin{pmatrix} -894.9690 & 185.7405 \\ 185.7405 & -1780.6926 \end{pmatrix}.    (13.92)

[Figure 13.3: Likelihood functions for various distributions (Huber, Laplace, Normal, N(1, 0.026)) against the location parameter.]

Another point of view is given by the graph of the cumulative distribution function in Fig. 13.4. The maximum difference between the theoretical and the empirical distribution, 0.13, occurs at the point 0.288 and lies above the critical value of the Kolmogorov-Smirnov test (xxx), which confirms the hypothesis that this point violates the Normal distribution. The theoretical distribution function in Fig. 13.4 is the cumulative distribution of N(0.9753, 0.14), i.e. the Normal distribution with the parameters from the robust estimation. The graph confirms that the fit is appropriate; the original data set has a biased mean due to the limited number of points (the confidence looks better asymptotically). The diagram is relatively step-like due to the small amount of data.

[Figure 13.4: Distribution functions of the example data: the empirical distribution and Φ(0.9753, 0.14).]

Fig. 13.3 shows the likelihood functions for both the initial estimate and the robust function. The initial estimate, labelled "Laplace", shows a piecewise profile; its maximum is equivalent to the median. The robust likelihood "Huber" is shifted from the expected value, which must be considered a random coincidence of the generated data; an appropriate amount of data would confirm this hypothesis. The Hessian shows a weak dependence between the two parameters; it reveals the effects of winsorising.

SIMPLIFIED ROBUST MEAN

The results of the simplified algorithm of Section 13.12 are

    \tilde{x} = 0.9688,    (13.93)
    \sigma = 0.0499,    (13.94)
    s = 0.1654.    (13.95)

The estimate is a little bit worse; on the other hand, the algorithm is significantly simpler and faster.
Proper rounding gives the same value for both algorithms, 0.97 ± 0.05, so there is no important difference.
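For completeness, the sketch of Section 13.12 can be run on the example data (13.81). With the hypothetical robust_mean defined there, the result should land near the values reported above, though the exact digits depend on the initial estimator, the ψ constant and the stopping rule:

```python
data = [0.719, 0.983, 0.818, 0.933, 1.034, 1.005, 1.145, 1.255,
        1.039, 1.041, 1.078, 1.111, 0.872, 0.288, 0.137]

t, sigma, s = robust_mean(data)   # robust_mean from the sketch in Section 13.12
print(f"robust mean = {t:.4f} +- {sigma:.4f}, scale = {s:.4f}")
```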