13 ROBUST MEAN

One of the most influential methods of science is averaging. The result of averaging provides a convenient smooth representation of the studied quantities and also naturally suppresses potentially unrelated effects. The arithmetic mean is a commonly used method of averaging, characterised by simple formulas and widely known statistical properties. The arithmetic mean can be introduced by definition (as in Section 13.1) or derived from the principle of maximum likelihood (Section 13.2). The rest of this chapter develops, from maximum likelihood, methods for determination of the robust mean: the averaging method "insensitive to small deviations from assumptions", as introduced in Chapter 3.

This chapter is intended as a detailed description of methods suitable for computation of the robust mean, including their deeper explanation. A reader not interested in the details will find a short summary of a reliable algorithm in Section 13.11.

Two important books give the background for this chapter. The general introduction to statistics in data processing, including the development of the maximum likelihood method, is the book Brandt (2014). The robust parts of this chapter are developed on the basis of ideas from the book Huber (1981).

Although averaging realised by robust mean estimation looks trivial at first sight, it demonstrates, in plain basic form, all the important ideas of robust algorithms. This chapter opens the gate to a robust land: an abstract land where butterflies can fly without fear that their wings will be broken by storm drops they raised themselves.

13.1 ARITHMETIC MEAN

The arithmetic mean is the standard way to estimate the average of a data sample. The general practice for computing the arithmetic mean of a set of N values

    \{x_1, x_2, \ldots, x_N\}    (13.1)

is, by definition, the formula

    \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i.    (13.2)

The mean x̄ estimates the location of the centre of the input set. The scatter of the points x_i around the mean x̄ is represented by the standard (quadratic) deviation

    s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2.    (13.3)

The mean x̄ itself is localised more precisely than s indicates. To estimate the statistical uncertainty σ of x̄, we suppose that all points x_i have approximately the same deviation σ_i ≈ s and that the points are statistically independent. Under these assumptions we can apply the model of error propagation (Brandt (2014))

    \sigma^2 = \sum_{i=1}^{N} \left( \frac{\partial f}{\partial x_i} \right)^2 \sigma_i^2    (13.4)

to the arithmetic mean regarded as the function f(\{x_i\}) = (x_1 + x_2 + \cdots + x_N)/N, for which ∂f/∂x_i = 1/N for every point. Putting the terms together and summing the constant term gives

    \sigma^2 = \sum_{i=1}^{N} \frac{1}{N^2}\, s^2 = \frac{s^2}{N},    (13.5)

or σ = s/√N. The result of arithmetic averaging is usually presented in the form of the confidence interval

    \bar{x} \pm \sigma,    (13.6)

which optimistically estimates that the true value X of the quantity lies inside the interval x̄ − σ ≤ X ≤ x̄ + σ with probability 68%.

The arithmetic mean is a commonly used method. It is easy to use, numerically stable, and gives smooth results. The arithmetic mean is ideal for computing by hand: the formula (13.2) is very simple, the data can be easily inspected, and potentially false data suppressed. On the contrary, the arithmetic mean cannot be recommended for machine processing. It is sensitive to deviated data; if such data are present in a sample, the results will be scattered or, much worse, completely random.
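As a minimal illustration of (13.2)-(13.6), the following sketch (Python with NumPy; the function name and the small data subset are mine, taken from the example of Section 13.13) computes the mean, the sample deviation with Bessel's correction, and the standard error:

```python
import numpy as np

def arithmetic_mean(x):
    """Arithmetic mean (13.2), sample deviation (13.3) and its standard error (13.5)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    xbar = x.sum() / N                             # (13.2)
    s = np.sqrt(((x - xbar)**2).sum() / (N - 1))   # (13.3), Bessel-corrected
    sigma = s / np.sqrt(N)                         # (13.5)
    return xbar, s, sigma

xbar, s, sigma = arithmetic_mean([0.719, 0.983, 0.818, 0.933, 1.034])
print(f"{xbar:.4f} +- {sigma:.4f}  (s = {s:.4f})")  # confidence interval (13.6)
```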
13.2 ARITHMETIC MEAN BY MAXIMUM LIKELIHOOD

The method of maximum likelihood presents an impressive point of view on the whole subject of the arithmetic mean. The method also brings a very effective framework which provides optimal estimates of the average for data with an arbitrary statistical distribution; this is what makes it so important for generalisation.

This section briefly summarises the steps used to estimate the arithmetic mean and its statistical properties by the maximum likelihood method. The description is a summary of the equivalent chapter of Brandt (2014), which provides a more detailed and exact treatment.

The basic principle of the maximum likelihood method is maximisation of the probability which describes the measured quantities. In the case of the mean, we are looking for a probability common to all points. Supposing statistical independence of the individual points, each with the probability density function φ(x_i | x̄, s), the joint probability of all data points can be composed as the product

    \varphi(x_1|\bar{x},s) \cdot \varphi(x_2|\bar{x},s) \cdots \varphi(x_N|\bar{x},s).    (13.7)

To use the method on data, one a priori assumes a statistical distribution for every data point. In the case of the arithmetic mean, we suppose the Normal distribution N(x̄, s) with the probability density function

    \varphi(x|\bar{x},s) = \frac{1}{\sqrt{2\pi}\, s}\, e^{-(x-\bar{x})^2 / 2s^2}.    (13.8)

The analytic form of the probability is given by our assumption. The function represents a "distance" (a measure) assigned to every single point x: how much the point belongs to the distribution. Let us define the likelihood function

    L(\bar{x}, s) = \prod_{i=1}^{N} \varphi(x_i|\bar{x},s)    (13.9)

as a function of the parameters (independent variables) x̄, s. The parameters are then estimated in the following way: the goal of our effort is to set x̄, s so as to maximise the a priori known probability.

The methods for searching for the maximum of such products are indirect. We use the properties of the logarithm: it converts products to sums, and it is a monotone function, so the location of the maximum is unchanged. The logarithm of the likelihood is

    \ln L = \sum_{i=1}^{N} \ln \varphi(x_i|\bar{x},s),    (13.10)

which depends on the distribution function. Substituting φ and summing the constant elements gives

    \ln L = -\sum_{i=1}^{N} \frac{(x_i-\bar{x})^2}{2s^2} - N \ln s - \frac{N}{2} \ln 2\pi,    (13.11)

where the last term is the result of summing a constant function over all data. The extreme of the function lies at the point where the derivatives with respect to both parameters vanish:

    \frac{\partial \ln L}{\partial \bar{x}} = \sum_{i=1}^{N} \frac{x_i-\bar{x}}{s^2} = 0,    (13.12a)

    \frac{\partial \ln L}{\partial s} = \sum_{i=1}^{N} \frac{(x_i-\bar{x})^2}{s^3} - \frac{N}{s} = 0.    (13.12b)

The solution illustrates how the arithmetic mean can be derived from the general principle of maximum likelihood and the Normal distribution. The solution of (13.12a) for x̄ is

    \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i    (13.13)

and is identical to (13.2). The solution of (13.12b) for s gives

    s^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i-\bar{x})^2.    (13.14)

(Strictly speaking, s here is a parameter of the distribution φ; it is numerically equivalent to the s defined by (13.3), the same symbol being used for different quantities with the same meaning.) This solution is no longer equal to (13.3): we get the estimate with maximum probability, but the estimate is biased. A convenient way to correct it (Bessel's correction) is the replacement (Brandt (2014))

    \frac{N}{N-1}\, s^2 \rightarrow s^2,    (13.15)

which turns the solution of (13.12b) into (13.3); the difference is negligible for larger data sets.
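The bias in (13.14) is easy to see numerically. A small simulation (a sketch; the sample size, seed and repeat count are my arbitrary choices) compares the maximum-likelihood estimate with the Bessel-corrected one (13.3) against the true variance of 1:

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials = 5, 100_000

ml, bessel = [], []
for _ in range(trials):
    x = rng.normal(0.0, 1.0, N)
    ss = ((x - x.mean())**2).sum()
    ml.append(ss / N)            # maximum-likelihood estimate (13.14), biased
    bessel.append(ss / (N - 1))  # Bessel-corrected estimate (13.3), unbiased

print(np.mean(ml))      # close to (N-1)/N = 0.8, systematically low
print(np.mean(bessel))  # close to the true variance 1.0
```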
The scatter of the parameters is commonly estimated from the Hessian at the extreme. Suppose that a function f of the parameter vector

    \mathbf{r} = \begin{pmatrix} \bar{x} \\ s \end{pmatrix}    (13.16)

can be expanded at the point r into the Taylor series

    f(\mathbf{r} + \Delta\mathbf{r}) \approx f(\mathbf{r}) + J(\mathbf{r})\,\Delta\mathbf{r} + \tfrac{1}{2}\, \Delta\mathbf{r}^T \hat{H}(\mathbf{r})\, \Delta\mathbf{r} + \ldots,    (13.17)

where we introduce the Jacobian matrix J as the gradient

    J = \nabla f = \begin{pmatrix} \partial f / \partial \bar{x} \\ \partial f / \partial s \end{pmatrix}.    (13.18)

For f = ln L the elements of J are given by the formulas (13.12). The Hessian Ĥ is the matrix of second derivatives (taken at the maximum, in our case)

    \hat{H} = \begin{pmatrix} \partial^2 f / \partial\bar{x}^2 & \partial^2 f / \partial\bar{x}\,\partial s \\ \partial^2 f / \partial s\,\partial\bar{x} & \partial^2 f / \partial s^2 \end{pmatrix},    (13.19)

which at a general point is

    \hat{H} = -\frac{1}{s^2} \begin{pmatrix} N & (2/s)\sum_i (x_i-\bar{x}) \\ (2/s)\sum_i (x_i-\bar{x}) & (3/s^2)\sum_i (x_i-\bar{x})^2 - N \end{pmatrix},    (13.20)

and at a correctly determined extreme, where Σ_i (x_i − x̄) → 0 and Σ_i (x_i − x̄)² → N s² (see (13.12)), we get

    \hat{H} = -\frac{N}{s^2} \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}.    (13.21)

The covariance matrix is the inverse of the Hessian at the extreme, which gives in this case the diagonal matrix

    \hat{H}^{-1} = -\frac{s^2}{N} \begin{pmatrix} 1 & 0 \\ 0 & 1/2 \end{pmatrix}.    (13.22)

A diagonal matrix means an orthogonal choice of parameters: changes in the coordinates are independent. The statistical error of x̄ can be estimated as

    \sigma^2 = |(\hat{H}^{-1})_{11}|,    (13.23)

or σ = s/√N, which reproduces the result (13.5). While the statistical error of the arithmetic mean is routinely quoted, the error of s is usually left unnoticed, although from (13.22) it can easily be estimated as s/√(2N).

Near the maximum, the function ln L can be approximated as

    \ln L(x, s) = \ln L(\bar{x}, s) + \tfrac{1}{2}\, \Delta\mathbf{r}^T \hat{H}(\mathbf{r})\, \Delta\mathbf{r} + \ldots,    (13.24)

which gives

    L(x|\bar{x},\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\bar{x})^2 / 2\sigma^2}.    (13.25)

The joint distribution function can thus be approximated again by a Normal distribution with the narrower width σ. The fact can be illustrated on the test data of Section 13.13. Figure 13.3 shows the likelihood function of this section, labelled "Normal". The generated points have the distribution N(1, 0.1), so the individual densities are wider than the plotted range; but the product of all the functions has a width of about 0.026, in agreement with σ = 0.1/√15 ≈ 0.026 from (13.5). The shape of the product "Normal" and of N(1, 0.026) is approximately the same, as expected. This is one demonstration of the law of large numbers.

The parameter x̄ represents the centre of the distribution of the measured values. This centre is usually the goal of our measurements, and usually it describes the real system we are interested in. In contrast, s is only seldom related to the measured system; it is rather a property of the measuring device, namely its precision. Occasionally there is a requirement to combine measurements from devices with different precision; for that case, see Section 13.8.
13.3 ROBUST MEAN BY MAXIMUM LIKELIHOOD

The maximum likelihood framework summarised in the previous section can also be used for robust estimation of the mean, as the central moment of a general, and in particular a robust, distribution. We present the application in the robust case with a robust distribution, but it can be used in the general case as well.

The robust mean can be described by a general distribution ϱ and its parameters in the fashion

    \frac{1}{\Gamma}\, \frac{1}{s}\, \varrho\!\left( \frac{x_i - \tilde{x}}{s} \right).    (13.26)

The function is designed for estimation of the central moment x̃ and the scale s, in analogy with the Normal distribution. Note the factor 1/s which scales the whole function; the factor is necessary to make the scale estimable. The transformation of x_i is important because the distribution functions are centred on the origin and have unit scatter. The normalisation factor Γ (like √(2π) for the Normal distribution) is defined, with s = 1, by the property

    \frac{1}{\Gamma} \int_{-\infty}^{\infty} \varrho(x - \tilde{x})\, dx = 1.

The maximum likelihood principle for this case is, as one expects,

    L = \prod_{i=1}^{N} \frac{1}{\Gamma}\, \frac{1}{s}\, \varrho\!\left( \frac{x_i - \tilde{x}}{s} \right)    (13.27)

and its logarithmic version

    \ln L = \sum_{i=1}^{N} \ln \varrho\!\left( \frac{x_i - \tilde{x}}{s} \right) - N \ln s - N \ln \Gamma.    (13.28)

For the derivatives we use the substitution (note the choice of sign)

    \psi = -(\ln \varrho)' = -\frac{\varrho'}{\varrho},    (13.29)

where ψ is a suitable (robust) function. To simplify the notation, we also substitute the normalised residuals

    u_i = \frac{x_i - \tilde{x}}{s}.    (13.30)

The solution leads to the system of equations (the analogy of (13.12))

    \frac{\partial \ln L}{\partial \tilde{x}} = \frac{1}{s} \sum_{i=1}^{N} \psi(u_i) = 0,    (13.31a)

    \frac{\partial \ln L}{\partial s} = \frac{1}{s} \sum_{i=1}^{N} \psi(u_i)\, u_i - \frac{N}{s} = 0.    (13.31b)

The Hessian is a symmetric matrix with the elements

    \frac{\partial^2 \ln L}{\partial \tilde{x}^2} = -\frac{1}{s^2} \sum_{i=1}^{N} \psi'(u_i),    (13.32a)

    \frac{\partial^2 \ln L}{\partial \tilde{x}\, \partial s} = -\frac{1}{s^2} \sum_{i=1}^{N} \left[ \psi(u_i) + \psi'(u_i)\, u_i \right],    (13.32b)

    \frac{\partial^2 \ln L}{\partial s^2} = -\frac{1}{s^2} \left[ \sum_{i=1}^{N} \left( 2\,\psi(u_i)\, u_i + \psi'(u_i)\, u_i^2 \right) - N \right].    (13.32c)

Near the extreme, using (13.31), the Hessian becomes

    \hat{H} = -\frac{1}{s^2} \begin{pmatrix} \sum \psi'(u_i) & \sum \psi'(u_i)\, u_i \\ \sum \psi'(u_i)\, u_i & N + \sum \psi'(u_i)\, u_i^2 \end{pmatrix}.    (13.33)

The off-diagonal terms tend to vanish because ψ' is approximately constant over the core of the data and the u_i are distributed so as to give a minimal sum. The statistical errors of the parameters are estimated using the robust version of the covariance matrix (Huber (1981))

    \sigma^2 = s^2\, \frac{N}{N-1}\, \frac{\sum \psi^2(u_i)}{\left[ \sum \psi'(u_i) \right]^2},    (13.34)

where we recognise the bias correction of the scale. (Huber (1981) suggests multiplying the term under the square by a further correction; for m parameters it is

    K = 1 + \frac{m}{N}\, \frac{\mathrm{var}(\psi')}{(\mathrm{E}\psi')^2}    (13.35)

with

    \mathrm{E}\psi' \approx \frac{1}{N} \sum_i \psi'(u_i)    (13.36)

and

    \mathrm{var}(\psi') \approx \frac{1}{N} \sum_i \left[ \psi'(u_i) - \mathrm{E}\psi' \right]^2.    (13.37)

Due to the 1/N dependence, this correction is negligible except for very noisy data and small data sets.)

The equations may look unfamiliar. Fortunately, they can be easily understood with the special choice which asymptotically reproduces least squares, ϱ(u) ∝ exp(−u²/2); by (13.29) we get

    \psi(u) = u,    (13.38)

which reduces the general functions to the already known set of equations of the arithmetic mean. Indeed, the relations

    \psi'(u) = 1,    (13.39a)
    \psi'(u_i) \rightarrow 1,    (13.39b)
    \psi(u_i) \rightarrow u_i    (13.39c)

reduce the system (13.32) to (13.20).

The meaning of the individual terms in (13.34) can be easily understood by analogy with the Normal distribution case. The quantity

    s^2 \sum \psi(u_i)^2    (13.40)

is practically the robust analogy of the residual sum S₀, and

    N \approx \sum \psi'(u_i)    (13.41)

is an estimate of the number of acceptable data points.
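To make (13.29)-(13.31) concrete, here is a sketch of Huber's ψ together with the two estimating equations evaluated at given parameters; the roots of this system are the robust estimates. The clipping constant k = 1.345 is a common tuning choice of mine, not a value prescribed by the text:

```python
import numpy as np

def psi_huber(u, k=1.345):
    """Huber's psi: identity in the core, clipped at +-k for outliers."""
    return np.clip(u, -k, k)

def dpsi_huber(u, k=1.345):
    """Derivative of Huber's psi: 1 in the core, 0 outside."""
    return (np.abs(u) <= k).astype(float)

def estimating_equations(x, t, s):
    """Left-hand sides of (13.31a) and (13.31b); both vanish at the solution."""
    x = np.asarray(x, float)
    u = (x - t) / s                                       # normalised residuals (13.30)
    eq_location = psi_huber(u).sum() / s                  # (13.31a)
    eq_scale = (psi_huber(u) * u).sum() / s - len(x) / s  # (13.31b)
    return eq_location, eq_scale
```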
13.4 SOLUTION OF THE NON-LINEAR SYSTEM

Computation of the system of equations for the robust mean (13.31) is more complicated than for the arithmetic mean (13.12). While the unknowns can easily be separated in the case of (13.12), the system (13.31) is solvable only numerically.

The basic method with a fixed s is described first below; joint estimates of x̃ and s follow, but that method cannot be recommended unconditionally, as the text of this section describes. The generally recommended method is the use of standard minimisation procedures.

The estimation of the robust mean in the unkind numerical environment of modern machines is an extremely delicate job. The estimation of the arithmetic mean is also a little delicate, but in practice it is limited only by the precision of the floating-point representation. The robust mean is more complicated but, thanks to the scaling of the values, numerically more stable. The possible complications arise from the general non-linearity of common robust functions: the solution is more time consuming and depends drastically on the initial estimates.

There are many methods for the solution of non-linear systems of equations like (13.31). The methods need (good) initial estimates of the solution x̃⁽⁰⁾, s⁽⁰⁾. Huber (1981), in section 6.7 "The computation of M-estimates", recommends four variants. I have tested extensively (many years of testing; possibly 10¹² or more computations have been performed) these two principal variants:

NEWTON'S ITERATION

The Newton method solves our problem in the form

    f(\tilde{x}) = \sum_{i=1}^{N} \psi\!\left( \frac{x_i - \tilde{x}}{s} \right) = 0.    (13.42)

Its derivative is

    f'(\tilde{x}) = -\frac{1}{s} \sum_{i=1}^{N} \psi'\!\left( \frac{x_i - \tilde{x}}{s} \right),    (13.43)

which we suppose to be non-zero. The general form of Newton's method is

    \tilde{x}^{(k+1)} = \tilde{x}^{(k)} - \frac{f(\tilde{x}^{(k)})}{f'(\tilde{x}^{(k)})}.    (13.44)

The initial estimate of the scale s⁽⁰⁾ is kept fixed, and Newton's method is used to compute successively better approximations of the robust mean:

    \tilde{x}^{(k+1)} = \tilde{x}^{(k)} + s^{(0)}\, \frac{\sum \psi(u_i)}{\sum \psi'(u_i)}.    (13.45)

The method is very reliable. It converges very quickly on average data: the number of iterations is up to 10 for single precision (10⁻⁷) and up to 15 for double precision (10⁻¹⁶). Therefore I limit it to seventeen.

The precision is tested against the machine precision ε, both for the relative and for the absolute size of the correction

    \delta = \left| s^{(0)}\, \frac{\sum \psi(u_i)}{\sum \psi'(u_i)} \right|    (13.46)

via the conditions

    \delta / |\tilde{x}^{(k+1)}| < \varepsilon \ \wedge\ \delta < \varepsilon.    (13.47)

It is important to check the denominator Σ ψ'(u_i), because a bad initial estimate of s⁽⁰⁾ or bad data will make it vanish. With normally distributed data the term is practically constant, with the meaning of the amount of data. This property is also noticed by Huber (1981).

JOINT M-ESTIMATES OF LOCATION AND SCALE

The previous method relies on a proper initial estimate of the scale. That assumption is only partially true: the estimate can be imprecise at the level of the order of ten percent. Because the scale estimate influences the estimate of the mean, the difference can lead to a difference in the mean at the level of a few percent, which is inadequate for precise results. Therefore the simultaneous estimation is required.

Huber (1981) recommends replacing the equations (13.31) by

    \sum_{i=1}^{N} \psi\!\left( \frac{x_i - \tilde{x}}{s} \right) = 0,    (13.48a)

    \sum_{i=1}^{N} \psi^2\!\left( \frac{x_i - \tilde{x}}{s} \right) = N - 1,    (13.48b)

where N has been replaced by (N − 1) and we suppose that the expectation value of the second moment Eψ² (the deviation) is exactly one. (ψ² is insensitive to outliers; if winsorisation is applied to the data, the original (13.31) works just as well.) The system is recommended to be solved as

    s^{(k+1)} = \sqrt{ \frac{[s^{(k)}]^2}{N-1} \sum \psi^2\!\left( \frac{x_i - \tilde{x}^{(k)}}{s^{(k)}} \right) },    (13.49a)

    \tilde{x}^{(k+1)} = \tilde{x}^{(k)} + s^{(k)}\, \frac{\sum \psi(u_i)}{\sum \psi'(u_i)}.    (13.49b)

The first equation is a fixed-point iteration while the second is Newton's method. Separating the solution into these two parts is only possible when the initial estimate is very good and both parameters are nearly orthogonal to each other. The orthogonality means independence of the mean from the scatter; it holds only partially in real situations (see Section 13.13).

The use of a general method for the solution of the set of non-linear equations, as provided by Minpack's lmder (or lmdif), is recommended. One important point: all methods rely on satisfying the condition

    s > 0,    (13.50)

which is very important, since both the Hessian and the gradient are symmetric under s → −s and depend on the absolute value only.
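A sketch of the simultaneous iteration (13.49) under the assumptions above (Huber's ψ with my choice k = 1.345, and reasonable initial estimates t0, s0, for example the median and MAD of Section 13.6); this follows the scheme recommended by Huber (1981) but is not a polished implementation:

```python
import numpy as np

def psi_huber(u, k=1.345):
    return np.clip(u, -k, k)

def dpsi_huber(u, k=1.345):
    return (np.abs(u) <= k).astype(float)

def joint_iteration(x, t0, s0, max_iter=50, eps=1e-12):
    """Simultaneous location/scale iteration: (13.49a) fixed point, (13.49b) Newton."""
    x = np.asarray(x, float)
    N, t, s = len(x), t0, s0
    for _ in range(max_iter):
        u = (x - t) / s
        s_new = np.sqrt(s**2 * (psi_huber(u)**2).sum() / (N - 1))  # (13.49a)
        denom = dpsi_huber(u).sum()
        if denom == 0.0:             # guard: all points clipped, Newton step undefined
            break
        t_new = t + s * psi_huber(u).sum() / denom                 # (13.49b)
        if abs(t_new - t) < eps and abs(s_new - s) < eps:
            return t_new, s_new
        t, s = t_new, s_new
    return t, s
```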
13.5 REGION OF CONVERGENCE OF GRADIENT

The function (13.28) and its equivalents have one global minimum, because they are sums of functions with one global minimum; the location of the minimum is unique. The fast gradient methods, such as the Levenberg-Marquardt method, do not use the function itself. They locate the minimum of the gradient g = ∇f via

    \sum_i g_i \cdot g_i.    (13.51)

For the functions to which we restrict ourselves, the minimum lies at the same point, but the function itself can differ far from the minimum. For robust functions the surface has a shape like a high-heeled shoe; see Figure 13.1.

[Figure 13.1: Minimum of the gradient function, plotted over x̃ (horizontal axis) and σ (vertical axis).]

Figure 13.1 is derived from 666 data points with 95% from N(0, 1) and 5% from N(0, 5). The graph shows the value of the function g² for various values of x̃, σ. The function is displayed as a surface with the lightness of gray proportional to the function value. The heel has its minimum at (0, 1).

There is an important implication: the initial estimate must lie relatively near the right minimum. The region of convergence is only about 2 wide in Figure 13.1. As a general recommendation, the optimisation strategy has two ways:

a) Directly minimise (13.28). The method cannot use the gradient (except in the near neighbourhood of the minimum). The simplex method (Nelder-Mead) can be used. The convergence is slow, and the estimation of statistical errors from the Hessian is complicated.

b) Minimise the gradient (13.31), which requires a reliable initial estimate (ideally obtained by a non-gradient method). The convergence is fast and precise, and the estimation of the Hessian at the minimum is a side effect of the minimisation.

13.6 INITIAL ESTIMATES

The initial estimates x̃⁽⁰⁾, s⁽⁰⁾ are crucial for the success of everything. Huber (1981) recommends, on the basis of a study of the statistical properties of various distributions, the median (med) and the median absolute deviation (MAD):

    \mu = \mathrm{median}\{x_1, x_2, \ldots, x_N\}.    (13.52)

The MAD is related to the standard deviation through Φ⁻¹(3/4) ≈ 0.6745, the inverse cumulative function of the Normal distribution, so

    s^{(0)} = \frac{\mathrm{median}\, |x_i - \mu|}{0.6745}.    (13.53)

My experience with these recommended estimators is excellent for large data sets of normally distributed data with an insignificant number of outliers (up to 1/4 of the data). For a few points (up to ten), the estimates are unreliable and sometimes random. For few data, it is better to use quantiles, which are more robust and less sensitive to a non-normal distribution of the data.

To prevent these problems, quantiles of the empirical distribution are used, by the following algorithm. (The algorithm can also replace the estimate by the median; unfortunately, it is significantly slower.) Let us define the empirical distribution function from the sorted input data by the plotting positions

    F_k = \frac{k - 1/2}{N} \quad \text{for } k = 1, \ldots, N,    (13.54)

which symmetrises the covering of the interval by the points 1/N into

    \frac{1}{2N},\ \frac{3}{2N},\ \ldots,\ 1 - \frac{1}{2N},    (13.55)

and the location parameter is estimated as the quantile q₁/₂ = 1/2, linearly interpolated in the quantile function: for k such that F_k < q₁/₂ ≤ F_{k+1},

    \mu = \frac{x_{k+1} - x_k}{F_{k+1} - F_k}\, (q_{1/2} - F_k) + x_k,    (13.56)

and analogously for s⁽⁰⁾, with |x − µ| used in place of x. The use of the cumulative distribution function is a refined version of the median: the median uses sorted values, as the CDF does, but its centre is roughly taken as the middle point, whereas here an interpolation at q = 0.5 is used.
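A sketch of the classical initial estimates (13.52)-(13.53); the constant 0.6745 ≈ Φ⁻¹(3/4) is the one from the text:

```python
import numpy as np

def initial_estimates(x):
    """Median (13.52) and MAD-based scale (13.53)."""
    x = np.asarray(x, float)
    mu = np.median(x)                          # (13.52)
    s0 = np.median(np.abs(x - mu)) / 0.6745    # (13.53): MAD rescaled for Normal data
    return mu, s0

# The interpolated quantile of (13.54)-(13.56) uses the Hazen plotting
# positions (k - 1/2)/N; NumPy >= 1.22 offers this interpolation directly:
#   mu = np.quantile(x, 0.5, method="hazen")
```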
13.7 STUDENTIZING

13.8 QUESTION OF WEIGHTS

The presented approach can be generalised to data with a different σ_i for each observation x_i. This situation is encountered when the observations have been acquired by devices with different internal precision, or under different conditions. Data with significantly different σ_i are, for normally distributed data, met very rarely. (Metaphorically, one assigns a weight w_i to every point. The choice of the individual weights depends on "experience"; from the statistical point of view, the weights should be w_i = 1/σ_i². Unfortunately, weights have often been chosen arbitrarily or, worse, so as to remove inconvenient points. This kind of manipulation cannot be recommended in any way; primarily, the robust methods offer a better alternative.)

The key change is to use the already known dispersions σ_i in the transformation (x_i − x̄)/σ_i of all data to the normal-like distribution N(0, 1). This presumption is very optimistic. It is therefore better to suppose that the values σ_i are estimated only roughly, and to write the transformation to the normal distribution as

    \frac{x_i - \bar{x}}{t\, \sigma_i},    (13.57)

where t is an effective scale of the observed scatter. When the estimates of σ_i are correct, one can expect t ≈ 1. This is especially important for distributions other than the least-squares one.

The generalisation of equation (13.9) is straightforward:

    L = \prod_{i=1}^{N} \varphi(x_i | \bar{x}, t\sigma_i),    (13.58)

which for the Normal distribution (the non-robust case) leads to the set of equations

    \frac{\partial \ln L}{\partial \bar{x}} = \frac{1}{t^2} \sum_{i=1}^{N} \frac{x_i - \bar{x}}{\sigma_i^2} = 0,    (13.59a)

    \frac{\partial \ln L}{\partial t} = \frac{1}{t} \left[ \sum_{i=1}^{N} \frac{(x_i - \bar{x})^2}{t^2 \sigma_i^2} - N \right] = 0,    (13.59b)

with the solution

    \bar{x} = \frac{ \sum_{i=1}^{N} x_i / \sigma_i^2 }{ \sum_{i=1}^{N} 1 / \sigma_i^2 }    (13.60)

and

    t^2 = \frac{1}{N} \sum_{i=1}^{N} \frac{(x_i - \bar{x})^2}{\sigma_i^2}.    (13.61)

The solution for t² is a χ² statistic divided by the total number of data points. Another important aspect is the changed meaning of the scale: it no longer measures the scatter of the data, but is a scale of the data scatter relative to the σ_i.

The robust approach is similar. We again start from (13.9), where ϱ is a robust function. The logarithm of the likelihood function is

    \ln L = \sum_{i=1}^{N} \ln \varrho\!\left( \frac{x_i - \tilde{x}}{t\sigma_i} \right) - \sum_{i=1}^{N} \ln \sigma_i - N \ln t - N \ln \Gamma,    (13.62)

with the solution given by the set of non-linear equations (substitution (13.29))

    \frac{\partial \ln L}{\partial \tilde{x}} = \sum_{i=1}^{N} \psi\!\left( \frac{x_i - \tilde{x}}{t\sigma_i} \right) \frac{1}{t\sigma_i} = 0,    (13.63a)

    \frac{\partial \ln L}{\partial t} = \sum_{i=1}^{N} \psi\!\left( \frac{x_i - \tilde{x}}{t\sigma_i} \right) \frac{x_i - \tilde{x}}{t^2 \sigma_i} - \frac{N}{t} = 0,    (13.63b)

which must be solved numerically. Note that the equation (13.63b) for t should be rewritten into the equivalent form

    \sum_{i=1}^{N} \psi^2\!\left( \frac{x_i - \tilde{x}}{t\sigma_i} \right) = N - 1.    (13.64)

The simultaneous estimation of both x̃ and t is highly recommended, because the estimates of σ_i are rarely correct, which can significantly degrade the robust estimates. Just for information, the robust mean can be rewritten in the form of a weighted mean as ....

13.9 QUESTION OF DESCENDING ESTIMATOR

By the shape of the ψ function, there are two kinds of robust estimators: monotone and descending. Huber's function is monotone; the others (Hampel's, Tukey's, Andrews') are descending. The importance of the shape of the function can be shown on Newton's method. The convergence behaviour of Newton's method is known (Ralston and Rabinowitz (2012)): the error obeys

    \epsilon^{(k+1)} = \frac{|f''(x^{(k)})|}{2\, |f'(x^{(k)})|} \left( \epsilon^{(k)} \right)^2,    (13.65)

so the condition

    |f''(x)| < 2\, |f'(x)|    (13.66)

must be satisfied. The region of convergence can easily be discovered from (13.65); see Figure 13.2. For Tukey's or Andrews' functions, for example, the condition is not satisfied over the whole interval −a ≤ u ≤ a, as the graphs show.

[Figure 13.2: Convergence intervals for descending functions, for Tukey's and Andrews' ψ.]
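The contrast between a monotone and a descending ψ can be made concrete with a small sketch (Huber's k = 1.345 and Tukey's c = 4.685 are common tuning constants, my choice, not values from the text). For residuals beyond c, Tukey's ψ and ψ' are identically zero, so the Newton denominator Σ ψ'(u_i) can vanish for a badly placed initial estimate:

```python
import numpy as np

def psi_huber(u, k=1.345):
    """Monotone: grows, then stays at +-k; outliers keep a bounded influence."""
    return np.clip(u, -k, k)

def psi_tukey(u, c=4.685):
    """Descending (biweight): returns to zero; far outliers lose all influence."""
    u = np.asarray(u, float)
    return np.where(np.abs(u) <= c, u * (1 - (u / c)**2)**2, 0.0)

u = np.array([-10.0, -2.0, -0.5, 0.5, 2.0, 10.0])
print(psi_huber(u))   # [-1.345 -1.345 -0.5    0.5    1.345  1.345]
print(psi_tukey(u))   # [ 0.    -1.337 -0.489  0.489  1.337  0.   ]
```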
13.10 QUESTION OF OUTLIERS

Robust estimators work excellently with "small changes from the presumptions", but large deviations, such as data from another statistical sample or erroneous data, commonly known as outliers, can destroy the robust estimate.

There are two alternatives which minimise the influence of outliers: the choice of a robust function which vanishes at infinity, and removing or replacing the outliers.

Robust functions which vanish at infinity are Hampel's (11.3), Andrews' (11.4) or Tukey's (11.5). The vanishing condition requires a non-monotone function, which implies that Newton's method (13.45) may diverge. The convergence of the method requires that the ratio Σψ / Σψ' be small. Unfortunately, in a certain range of the parameter this condition is not met, which leads to divergence. Therefore the use of these functions cannot be recommended for the Newton method. The use of a gradient-free method can be satisfactory when the estimation of uncertainties is not required. Huber (1981) has his "A Word of Caution" on this point in section 4.8, Descending M-estimates.

WINSORIZATION

A more reliable way of handling outliers is the technique which replaces them. The replacement by the formula (for the definition of the sign function see (11.2))

    x_i^* = \begin{cases} x_i, & |x_i - \mu| \le c\,s, \\ \mu + c\,s\, \mathrm{sign}(x_i - \mu), & |x_i - \mu| > c\,s \end{cases}    (13.67)

is known as data winsorization (winsorizing: we Winsorize the data; the procedure is named after its inventor, Charles P. Winsor (1895-1951)). The estimation of the parameters must be performed by a robust method, by the median µ, as hinted above. The parameter c setting the limit should be given an appropriate value 1 < c < 2; see Huber (1981), Section 1.7. My experience with data containing outliers shows that a better interval is 1 < c < h (the upper constant is given by Huber as h = 1.349), say c = 0.9 h.

An alternative to winsorization is the well-known "clipping", in which the outliers are removed from the sample, usually on the basis of a criterion similar to that of winsorisation. The result will generally be similar, except for short samples, where clipping can remove a significant amount of data.
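A sketch of (13.67); µ and the limit (here passed as a single value chi = c·s) must come from robust estimates such as the median and MAD, otherwise the outliers corrupt their own clipping limits:

```python
import numpy as np

def winsorize(x, mu, chi):
    """Winsorization (13.67): pull points farther than chi from mu onto the boundary."""
    x = np.asarray(x, float)
    return np.where(np.abs(x - mu) <= chi, x, mu + chi * np.sign(x - mu))
```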
13.11 AN ALGORITHM

Here is a summary of the development of this chapter in the form of an algorithm for the computation of the robust mean. The algorithm has been heavily tested as part of the Munipack code and can be considered a prototype of a robust algorithm.

Prerequisite. Let {x_1, x_2, …, x_N} be a set of N single non-identical numbers from R. The data should be represented in the computer by an array of real (floating-point) numbers.

Robust Mean Algorithm.

i) The initial estimate of the central moment µ is given by the median (13.52):

    \mu = \mathrm{median}\{x_1, x_2, \ldots, x_N\}.    (13.68)

ii) The initial estimate of the standard deviation s. Let the absolute deviations be

    d_i = |x_i - \mu| \quad \text{for } i = 1, \ldots, N    (13.69)

and the median of the absolute deviations (MAD) be

    s_\mathrm{mad} = \mathrm{median}\{d_1, d_2, \ldots, d_N\}.    (13.70)

The estimate of the standard deviation is finally, by (13.53),

    s = \frac{s_\mathrm{mad}}{0.6745}.    (13.71)

It is strongly recommended to check the condition s > η (η being a non-zero positive constant larger than the machine precision).

iii) Winsorisation according to (13.67) with the substitution χ = 1.2 s_mad:

    x_i^* = \begin{cases} x_i, & |x_i - \mu| \le \chi, \\ \mu + \chi\, \mathrm{sign}(x_i - \mu), & |x_i - \mu| > \chi. \end{cases}    (13.72)

iv) Location of the minimum of the robust function. Defining the residuals

    u_i = \frac{x_i^* - \mu}{s} \quad \text{for } i = 1, \ldots, N,    (13.73)

we minimise the function

    -\ln L(u_i | \tilde{x}, s) = \sum_{i=1}^{N} \varrho(u_i) + N \ln s,    (13.74)

where ϱ(u) denotes here the integral of the robust function ψ, i.e. minus the logarithm of the assumed density (cf. (13.29)). This step locates the extreme of ln L with a certain precision. The recommended method for the minimisation is the simplex method (Nelder and Mead (1965)) or any other method using no derivatives.

v) Robust estimation by solving the set of equations for the parameters x̃, s (via u_i):

    \sum_{i=1}^{N} \psi(u_i) = 0,    (13.75a)

    \sum_{i=1}^{N} \psi(u_i)\, u_i = N.    (13.75b)

The recommended method is Levenberg-Marquardt (Marquardt (1963)) with the analytic Jacobian given by (13.32). The method is regularised (insensitive to errors), fast, and provides the most precise solution.

vi) The uncertainty σ of the robust mean is estimated at the minimum by (13.34), using the results of the previous step:

    \sigma^2 = s^2\, \frac{N}{N-1}\, \frac{\sum_{i=1}^{N} \psi^2(u_i)}{\left[ \sum_{i=1}^{N} \psi'(u_i) \right]^2}.    (13.76)

Result. The result of the algorithm is the estimate of the robust mean with its uncertainty,

    \tilde{x} \pm \sigma,    (13.77)

and the standard deviation s.

Recommendations. A reliable implementation should check that N > 0 and s > 0 (the latter during all iterations).

13.12 A SIMPLIFIED ALGORITHM

The general algorithm of Section 13.11 simultaneously optimises both parameters x̃, s. This simplified version estimates the scale parameter (standard deviation) s by the median of absolute deviations, which reduces the parameter space by one dimension. Newton's root-finding method can then be used, which speeds up the iteration considerably thanks to its quadratic convergence. There is a small loss of precision (up to 10-20 %).

The simplified algorithm has the same assumptions and notation as the general algorithm of Section 13.11. The initial steps i)-iii) (through winsorisation) are the same; the alternative way replaces steps iv) and v) by the single step:

v) Iterate by (13.45), with x̃⁽⁰⁾ = µ and u_i⁽ᵏ⁾ = (x_i^* − x̃⁽ᵏ⁾)/s:

    \tilde{x}^{(k+1)} = \tilde{x}^{(k)} + s\, \frac{ \sum_{i=1}^{N} \psi(u_i^{(k)}) }{ \sum_{i=1}^{N} \psi'(u_i^{(k)}) } \quad \text{for } k = 0, 1, \ldots    (13.78)

The iterations can stop when |x̃⁽ᵏ⁺¹⁾ − x̃⁽ᵏ⁾| < ε, where ε is the required precision (the machine precision). The estimation of the uncertainty is again by step vi).

A reliable implementation should check N > 0 and s > 0 once the initialisation is finished. The iteration can converge only when the correction term s Σψ(u_i) / Σψ'(u_i) decreases in magnitude, and also Σψ'(u_i) ≠ 0. When no convergence occurs, the loop should be limited to an appropriate number of iterations (42).
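Putting the steps together, here is a compact sketch of the simplified algorithm of Section 13.12: median/MAD initialisation, winsorisation, Newton iteration on the location with fixed scale, and the uncertainty (13.76). It is a prototype under the stated assumptions (Huber's ψ with my choice k = 1.345), not the Munipack implementation:

```python
import numpy as np

def psi(u, k=1.345):
    return np.clip(u, -k, k)

def dpsi(u, k=1.345):
    return (np.abs(u) <= k).astype(float)

def robust_mean(x, max_iter=42, eps=1e-12):
    x = np.asarray(x, float)
    N = len(x)
    assert N > 0
    # i)-ii) initial estimates: median and MAD-based scale
    mu = np.median(x)
    s_mad = np.median(np.abs(x - mu))
    s = s_mad / 0.6745
    assert s > 0, "degenerate sample: zero scale"
    # iii) winsorisation with chi = 1.2 * s_mad  (13.72)
    chi = 1.2 * s_mad
    xw = np.where(np.abs(x - mu) <= chi, x, mu + chi * np.sign(x - mu))
    # v) Newton iteration (13.78) for the location, scale fixed
    t = mu
    for _ in range(max_iter):
        u = (xw - t) / s
        denom = dpsi(u).sum()
        if denom == 0.0:       # all points clipped: step undefined, give up
            break
        step = s * psi(u).sum() / denom
        t += step
        if abs(step) < eps:
            break
    # vi) uncertainty of the robust mean (13.76)
    u = (xw - t) / s
    sigma = np.sqrt(s**2 * N / (N - 1) * (psi(u)**2).sum() / dpsi(u).sum()**2)
    return t, sigma, s
```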
13.13 AN EXAMPLE

This section illustrates the methods on a numerical example, which can also be used for testing purposes. All results are rounded to 4 digits, although they have been computed with at least 16 digits.

As a test set, a sequence of 15 elements has been generated from Normal distributions with the same dispersion but with centres at the points 1 (good) and 0 (bad):

    \{x_i \in N(1, 0.1),\ i = 1, \ldots, 13\} + \{x_i \in N(0, 0.1),\ i = 14, 15\}    (13.80)

with the result

    \{0.719, 0.983, 0.818, 0.933, 1.034, 1.005, 1.145, 1.255, 1.039, 1.041, 1.078, 1.111, 0.872, 0.288, 0.137\}.    (13.81)

The amount of data is small and the example is instructive only. The data of the bad distribution represent 13% of all points.

(A generator of random numbers from the standard library of a Fortran compiler has been used. The elements are drawn from the Uniform distribution q ∈ U(0, 1) (on the interval 0 ≤ q < 1) and q represents a probability. The Normal distribution has been established from the inverse cumulative distribution function of x ∈ N(µ, σ) as x = µ − √2 σ erf⁻¹(2q − 1), where the inverse erf function has an approximation with precision better than 3.5·10⁻⁴:

    \mathrm{erf}^{-1}(x) \approx \mathrm{sign}(x) \sqrt{ \sqrt{ \left( \frac{2}{\pi a} + \frac{\ln(1-x^2)}{2} \right)^{\!2} - \frac{\ln(1-x^2)}{a} } - \left( \frac{2}{\pi a} + \frac{\ln(1-x^2)}{2} \right) },    (13.79)

where

    a = \frac{8(\pi - 3)}{3\pi(4 - \pi)}.

This approximation has been published only at Wikipedia (2016).)

ARITHMETIC MEAN

The results of the arithmetic mean, as introduced in Section 13.2, are

    \bar{x} = 0.8972,    (13.82a)
    s = 0.3088,    (13.82b)
    \sigma = 0.0797.    (13.82c)

At the minimum, the Hessian is

    \hat{H} = \begin{pmatrix} -157.2703 & 0.0000 \\ 0.0000 & -314.5407 \end{pmatrix}.    (13.83)

The results demonstrate a strong bias toward the bad data and cannot be accepted. Figure 13.3 shows the likelihood function of the Normal distribution for the data; the maximum of the function is visibly shifted, and the confidence interval does not include the expected centre location.

ROBUST MEAN

Application of the single steps of the algorithm of Section 13.11:

i) The initial estimate of the robust mean by the median is

    \mu = 0.9940,    (13.84)

ii) and its scatter

    s = 0.1490.    (13.85)

iii) The limit for winsorisation is χ ≈ 0.12, so untouched values lie in the range 0.872 … 1.116. The original set is transformed to (changed values are starred)

    \{0.872^*, 0.983, 0.818, 0.933, 1.034, 1.005, 1.145, 1.116^*, 1.039, 1.041, 1.078, 1.111, 0.872, 0.872^*, 0.872^*\}.    (13.86)

The total number of changed values is 4 (27%). Because the limits lie at about 1.21 s for N(0, 1) data, the expected number of chance changes was 11% (1-2 elements); together with the two true outliers present in the sample, this is exactly what we expected.

iv) Robust estimation by minimising the integral of Huber's function with the simplex algorithm gives

    \tilde{x} = 0.9753,    (13.87)
    s = 0.1205.    (13.88)

v) The same result for the mean (within rounding) is given by the Levenberg-Marquardt minimisation of Huber's function:

    \tilde{x} = 0.9753,    (13.89)
    s = 0.1386,    (13.90)
    \sigma = 0.0547.    (13.91)

The Hessian at the minimum is

    \hat{H} = \begin{pmatrix} -894.9690 & 185.7405 \\ 185.7405 & -1780.6926 \end{pmatrix}.    (13.92)

[Figure 13.3: Likelihood functions for various distributions (Huber, Laplace, Normal, N(1, 0.026)) against the location parameter.]

Another point of view is given by the graph of the cumulative distribution function in Fig. 13.4. The maximum difference between the theoretical and the empirical distribution, 0.13, occurs at the point 0.288 and lies above the critical value of the Kolmogorov-Smirnov test (xxx), which confirms the hypothesis that this point violates the Normal distribution. The theoretical distribution function in Fig. 13.4 is the cumulative distribution of N(0.9753, 0.14), i.e. the Normal distribution with the parameters from the robust estimation. The graph confirms that the fit is appropriate; the original data set has a biased mean due to the limited number of points (the confidence looks better asymptotically). The diagram is relatively step-like due to the small amount of data.

[Figure 13.4: Distribution functions of the example data: the empirical distribution and Φ(0.9753, 0.14).]

Fig. 13.3 shows the likelihood functions for both the initial estimate and the robust function. The initial estimate, labelled "Laplace", shows a piecewise profile; its maximum is equivalent to the median. The robust likelihood "Huber" is shifted from the expected value, which must be considered a random coincidence of the generated data; an appropriate amount of data would confirm this hypothesis. The Hessian shows a weak dependence between the two parameters; it reveals the effects of winsorising.

SIMPLIFIED ROBUST MEAN

The results of the simplified algorithm of Section 13.12 are

    \tilde{x} = 0.9688,    (13.93)
    \sigma = 0.0499,    (13.94)
    s = 0.1654.    (13.95)

The estimate is a little bit worse; on the other hand, the algorithm is significantly simpler and faster.
Proper rounding gives the same value for both algorithms, 0.97 ± 0.05, so there is no important difference.
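For completeness, the sketch of Section 13.12 can be run on the example data (13.81). With the hypothetical robust_mean defined there, the result should land near the values reported above, though the exact digits depend on the initial estimator, the ψ constant and the stopping rule:

```python
data = [0.719, 0.983, 0.818, 0.933, 1.034, 1.005, 1.145, 1.255,
        1.039, 1.041, 1.078, 1.111, 0.872, 0.288, 0.137]

t, sigma, s = robust_mean(data)   # robust_mean from the sketch in Section 13.12
print(f"robust mean = {t:.4f} +- {sigma:.4f}, scale = {s:.4f}")
```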