An Improved Estimator for Removing Boundary Bias in Kernel CDF Estimation

Jan Koláček
Department of Mathematics and Statistics, Faculty of Science
Masaryk University, Brno, Czech Republic
www.muni.cz

COMPSTAT’08, 28. August, Porto

Contents

• Introduction
• Kernel distribution estimators
• Boundary effects
• Proposed estimator
• Examples
• References

Kernel function

Let $\nu$, $k$ be nonnegative integers with $0 \le \nu \le k - 2$, $k \le k_0$, and $\nu + k$ even. Let $K$ be a real-valued function, continuous on $\mathbb{R}$, satisfying $K \in \mathrm{Lip}[-1, 1]$, $\operatorname{support}(K) = [-1, 1]$, and

$$
\int_{-1}^{1} x^j K(x)\,dx =
\begin{cases}
0, & 0 \le j < k,\ j \ne \nu,\\
(-1)^{\nu}\,\nu!, & j = \nu,\\
\beta_k \ne 0, & j = k.
\end{cases}
$$

Such a function $K$ is called a kernel of order $k$, and the class of such functions is denoted by $S_{\nu,k}$.

Table of kernels

ν   k   Kernel (on [−1, 1])
0   2   $K_{0,2}(x) = \frac{3}{4}(1 - x^2)$
0   2   $K_{0,2}(x) = \frac{15}{16}(1 - x^2)^2$
0   2   $K_{0,2}(x) = \frac{35}{32}(1 - x^2)^3$
0   4   $K_{0,4}(x) = \frac{15}{32}(x^2 - 1)(7x^2 - 3)$
2   4   $K_{2,4}(x) = \frac{105}{16}(1 - x^2)(5x^2 - 1)$
1   3   $K_{1,3}(x) = \frac{15}{4}x(1 - x^2)$

Kernel distribution estimators

Let $X_1, \dots, X_n$ be independent real random variables, each having the same cumulative distribution function $F$. Our model is defined by the assumption $F \in C^{k_0}$, where $k_0$ is a positive integer. For a given data set, the corresponding kernel estimate of the distribution function $F$ is

$$
F_{h,K}(x) = \frac{1}{n}\sum_{i=1}^{n} W\!\left(\frac{x - X_i}{h}\right), \qquad W(x) = \int_{-1}^{x} K(t)\,dt, \tag{1}
$$

where $h$ is a smoothing parameter called the bandwidth ($h = h(n)$ is a non-random sequence of positive numbers) and $K \in S_{0,2}$, $K(x) \ge 0$ on $[-1, 1]$.

Optimal bandwidth

Under the additional assumptions $\lim_{n\to\infty} h = 0$ and $\lim_{n\to\infty} nh = \infty$, it can be shown (e.g. Bowman, A., Hall, P., Prvan, T.
[2]) that the leading term of the MISE (Mean Integrated Square Error) takes the form

$$
\operatorname{MISE}(F_{h,K}) = \underbrace{\frac{1}{n}\int F(x)\bigl(1 - F(x)\bigr)\,dx - q_1\frac{h}{n}}_{\operatorname{var}(F_{h,K})} + \underbrace{q_2 h^4}_{\operatorname{bias}^2(F_{h,K})},
$$

$$
q_1 = \int_{-1}^{1} W(x)\bigl(1 - W(x)\bigr)\,dx > 0, \qquad q_2 = \frac{\beta_2^2}{4}\int \bigl(F^{(2)}(x)\bigr)^2\,dx.
$$

Hence the optimal bandwidth $h^F_{\mathrm{opt},0,2}$, which minimizes the MISE with respect to $h$, is

$$
h^F_{\mathrm{opt},0,2} = n^{-1/3}\left(\frac{q_1}{4 q_2}\right)^{1/3}. \tag{2}
$$

Boundary Effects

Assumptions:
• $X_i$, $i = 1, \dots, n$, are nonnegative
• the distribution function $F$ has support $[0, \infty)$
• $f(0) \ne 0$, where $f = F^{(1)}$ is the density

Boundary effects arise when estimating at points near the left boundary, that is, for $x \in [0, h]$. In what follows we write $x = ch$, $0 \le c \le 1$.

[Figure: $X \sim \mathrm{Exp}(1)$, the kernel estimate of $F$ ($n = 100$, $h^F_{\mathrm{opt},0,2} = 0.8479$).]

The bias of $F_{h,K}(x)$ at $x = ch$

• near the left boundary ($0 \le c < 1$):

$$
E(F_{h,K}(x)) - F(x) = h f(0)\int_{-1}^{-c} W(t)\,dt + h^2 f^{(1)}(0)\left\{\frac{c^2}{2} + c\int_{-1}^{-c} W(t)\,dt - \int_{-1}^{c} tW(t)\,dt\right\} + o(h^2)
$$

• interior points ($c \ge 1$):

$$
E(F_{h,K}(x)) - F(x) = \frac{h^2}{2} f^{(1)}(0)\,\beta_2 + o(h^2), \qquad \beta_2 = \int_{-1}^{1} t^2 K(t)\,dt = 1 - 2\int_{-1}^{1} tW(t)\,dt
$$

The interior expression is the boundary expression evaluated at $c = 1$. Near the boundary the bias is of order $h$, whereas in the interior it is of order $h^2$.

Possible solutions

• boundary kernels: the estimates could be negative; some remedies have been proposed
• pseudo-data: generating some extra data near the boundary and then combining them with the original data
• data transformation:
  (a) a transformation is selected from a parametric family,
  (b) a kernel estimator is applied to the transformed data,
  (c) the estimated values are converted back by an inverse formula
• reflection method: reflecting the data and applying the classical kernel estimator

$$
F_{h,K}(x) = \frac{1}{n}\sum_{i=1}^{n}\left\{W\!\left(\frac{x - X_i}{h}\right) - W\!\left(-\frac{x + X_i}{h}\right)\right\} \tag{3}
$$

Proposed estimator

“Generalized” reflection method (Zhang et al.
[10], Karunamuni and Alberts [5] for the density case):

$$
F_{h,K}(x) = \frac{1}{n}\sum_{i=1}^{n}\left\{W\!\left(\frac{x - g_1(X_i)}{h}\right) - W\!\left(-\frac{x + g_2(X_i)}{h}\right)\right\}
$$

$g_1 = g_2 \Rightarrow F_{h,K}(0) = 0$

Set $g := g_1 = g_2$, where
• $g$ is a nonnegative, continuous and monotonically increasing function defined on $[0, \infty)$
• $g^{-1}$ exists
• $g(0) = 0$
• $g^{(1)}(0) = 1$
• $g^{(2)}$ exists and is continuous on $[0, \infty)$.

The bias of $F_{h,K}(x)$ at $x = ch$, $0 \le c < 1$:

$$
E(F_{h,K}(x)) - F(x) = h^2\left\{f^{(1)}(0)\left[\frac{c^2}{2} + 2cI_1 - I_2\right] - f(0)\,g^{(2)}(0)\left[c^2 + 2cI_1 - I_2\right]\right\} + O(h^3),
$$

where

$$
I_1 = \int_{-1}^{-c} W(t)\,dt, \qquad I_2 = \int_{-c}^{c} tW(t)\,dt.
$$

The bias of $F_{h,K}(x)$ at $x = ch$, $c \ge 1$:

$$
E(F_{h,K}(x)) - F(x) = \frac{1}{2}h^2\left\{f^{(1)}(0)\,\beta_2 - f(0)\,g^{(2)}(0)\left[c^2 + \beta_2\right]\right\} + O(h^3)
$$

Set

$$
g^{(2)}(0) =
\begin{cases}
d_1\,\dfrac{\frac{c^2}{2} + 2cI_1 - I_2}{c^2 + 2cI_1 - I_2}, & 0 \le c < 1,\\[2ex]
d_1\,\dfrac{\beta_2}{c^2 + \beta_2}, & c \ge 1,
\end{cases}
\qquad (= A_c)
$$

where $d_1 = f^{(1)}(0)/f(0)$. With this choice the $h^2$ term of the bias vanishes in both cases.

A construction of $g(y)$

An estimate of $d_1$:

$$
d_1 = \frac{f^{(1)}(0)}{f(0)} = \bigl(\ln f(x)\bigr)^{(1)}\Big|_{x=0} \approx \hat{d}_1 = \frac{\ln f^*(h_1) - \ln f^*(0)}{h_1}, \qquad h_1 \approx n^{-1/6}
$$

(see Zhang et al. [10], Karunamuni, R.J., Alberts, T. [5]).

Hence $\hat{d}_1 \Rightarrow A_c$, and

$$
g_c(y) = \lambda A_c^2 y^3 + \frac{1}{2}A_c y^2 + y,
$$

where $\lambda$ is a positive constant such that $\lambda > \frac{1}{12}$ (our experience: $\lambda = 0.1$).

A simulation study

• $X \sim \mathrm{Exp}(0.005)$, $n = 100$ (Dette, H., Weissbach, R. [3])
• 1,000 replications
• We used the quartic kernel $K_{0,2}(x) = \frac{15}{16}(1 - x^2)^2 I_{[-1,1]}$, where $I_A$ is the indicator function of the set $A$.
• The optimal bandwidth was computed from (2).
• The results were compared with the classical estimator (1) and the reflection method (3).

[Figure: $X \sim \mathrm{Exp}(0.005)$, the kernel estimate of $F$ ($n = 100$, $h^F_{\mathrm{opt},0,2} = 231.35$).]

A comparison

MISE (Mean Integrated Square Error) on the interval $[0, h^F_{\mathrm{opt},0,2}]$:

Method       Mean     STD
Classical    0.0068   0.0014
Reflection   0.0020   0.0020
Proposed     0.0010   0.0014

Table 1. Means and STDs for MISE.

[Figure: MISE for estimates of the CDF for the classical estimator with boundary effects (1), the reflection method (2) and our proposed method (3).]

        Classical         Reflection        Proposed
c       Mean     STD      Mean     STD      Mean     STD
0.00    0.0215   0.0048   0.0000   0.0000   0.0000   0.0000
0.25    0.0009   0.0013   0.0023   0.0017   0.0008   0.0010
0.50    0.0021   0.0025   0.0032   0.0032   0.0016   0.0021
0.75    0.0026   0.0033   0.0027   0.0034   0.0017   0.0024

Table 2. Means and STDs for MSE at $x = c\,h^F_{\mathrm{opt},0,2}$.

[Figure: MSE at the points $x = c\,h^F_{\mathrm{opt},0,2}$, $c = 0, 0.25, 0.5, 0.75$, for the classical estimator, the reflection method and our proposed method.]

Practical usage: ROC

• The Receiver Operating Characteristic (ROC) describes the performance of a diagnostic test that classifies subjects into either the group without the condition, $G_0$, or the group with the condition, $G_1$, by means of a continuous discriminant score $X$: a subject is classified as $G_1$ if $X \ge d$ and as $G_0$ otherwise, for a given cutoff point $d \in \mathbb{R}$.
• Let $F_0$ and $F_1$ be the distribution functions of $X$ in $G_0$ and $G_1$, respectively.
• The ROC is defined as a plot of the probability of false classification of subjects from $G_1$ versus the probability of true classification of subjects from $G_0$ across all possible cutoff point values of $X$.
• The ROC curve can be written as

$$
R(p) = 1 - F_1\bigl(F_0^{-1}(1 - p)\bigr), \qquad 0 < p < 1,
$$

where $p$ is the false positive rate in $(0, 1)$ as the corresponding cutoff point $d$ ranges from $-\infty$ to $+\infty$.
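As a rough illustration of the formula for R(p), the sketch below plugs the classical kernel estimate (1), with the quartic kernel, into R(p) = 1 − F1(F0^{-1}(1 − p)). It is a minimal sketch under stated assumptions, not the procedure behind the results in these slides: it applies no boundary correction, the function names (`quartic_W`, `kernel_cdf`, `roc_kernel`), the cutoff grid and the bandwidths passed in are hypothetical choices made for illustration, and the inverse of F0 is obtained by numerical interpolation on the grid.

```python
import numpy as np


def quartic_W(t):
    """Integrated quartic kernel W(t) = int_{-1}^{t} 15/16 (1 - s^2)^2 ds.

    Arguments are clipped to [-1, 1], so W is 0 left of -1 and 1 right of 1.
    """
    t = np.clip(np.asarray(t, dtype=float), -1.0, 1.0)
    return 0.5 + (15.0 * t - 10.0 * t**3 + 3.0 * t**5) / 16.0


def kernel_cdf(x, sample, h):
    """Classical kernel estimate (1) of a CDF at the points x (no boundary correction)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    sample = np.asarray(sample, dtype=float)
    # Rows index evaluation points, columns index the observations X_i.
    return quartic_W((x[:, None] - sample[None, :]) / h).mean(axis=1)


def roc_kernel(p, sample0, sample1, h0, h1, grid_size=2001):
    """Evaluate R(p) = 1 - F1(F0^{-1}(1 - p)) from kernel CDF estimates.

    sample0 / sample1 are scores from G0 / G1; h0 and h1 are bandwidths
    chosen beforehand by the user (an assumption of this sketch).
    """
    sample0 = np.asarray(sample0, dtype=float)
    sample1 = np.asarray(sample1, dtype=float)
    # Cutoff grid covering the support of both estimates (hypothetical choice).
    lo = min(sample0.min() - h0, sample1.min() - h1)
    hi = max(sample0.max() + h0, sample1.max() + h1)
    d = np.linspace(lo, hi, grid_size)
    fpr = 1.0 - kernel_cdf(d, sample0, h0)  # estimated P(X >= d | G0)
    tpr = 1.0 - kernel_cdf(d, sample1, h1)  # estimated P(X >= d | G1)
    # np.interp expects increasing x-coordinates, so reverse the cutoff order.
    return np.interp(p, fpr[::-1], tpr[::-1])
```

For two score samples `s0` and `s1` and user-chosen bandwidths `h0`, `h1`, the call `roc_kernel(np.linspace(0.01, 0.99, 99), s0, s1, h0, h1)` returns the estimated curve on that grid of false positive rates. Swapping `kernel_cdf` for a boundary-corrected estimator would leave `roc_kernel` itself unchanged.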
[Figure: ROC illustration showing the groups $G_0$ and $G_1$, the cutoff point $d$, the false positive rate (FPR) and the true positive rate (TPR).]

Real data

Consumer loans data

A scoring function (not specified) is used to predict the solidity of a client. We are interested in determining which clients are able to pay their loans.

A test set: 332 clients, of whom 309 have paid back their loans (group $G_0$) and 22 had problems with payments or did not pay (group $G_1$).

We use the ROC curve to assess the discrimination between clients with and without good solidity. We want to know whether our scoring function is a good predictor of solidity.

[Figure: the estimates of $f_0(x)$ ($\hat{h}^{f_0}_{\mathrm{opt},0,2} = 0.0032$) and $f_1(x)$ ($\hat{h}^{f_1}_{\mathrm{opt},0,2} = 0.0153$) with boundary effects.]

[Figure: the estimates of $f_0(x)$ ($\hat{h}^{f_0}_{\mathrm{opt},0,2} = 0.0032$) and $f_1(x)$ ($\hat{h}^{f_1}_{\mathrm{opt},0,2} = 0.0153$) with no boundary effects.]

[Figure: the estimates of $F_0(x)$ ($\hat{h}^{F_0}_{\mathrm{opt},0,2} = 0.0068$) and $F_1(x)$ ($\hat{h}^{F_1}_{\mathrm{opt},0,2} = 0.0286$) with boundary effects.]

[Figure: the estimates of $F_0(x)$ ($\hat{h}^{F_0}_{\mathrm{opt},0,2} = 0.0068$) and $F_1(x)$ ($\hat{h}^{F_1}_{\mathrm{opt},0,2} = 0.0286$) with no boundary effects.]

[Figure: the estimate of the ROC curve.]

References

[1] Azzalini, A.: A note on the estimation of a distribution function and quantiles by a kernel method. Biometrika, Vol. 68, No. 1, pp. 326–328, 1981.
[2] Bowman, A., Hall, P., Prvan, T.: Bandwidth selection for the smoothing of distribution functions. Biometrika, Vol. 85, No. 4, pp. 799–808, 1998.
[3] Dette, H., Weissbach, R.: Kolmogorov-Smirnov-type testing for the partial homogeneity of Markov processes – with application to credit risk. Applied Stochastic Models in Business and Industry, Vol. 23, No. 3, pp. 223–234, 2007.
[4] Horová, I., Zelinka, J.: Different approaches to ROC curve fitting for a continuous diagnostic test. CSDA, submitted, 2007.
[5] Karunamuni, R.J., Alberts, T.: On boundary correction in kernel density estimation. Statistical Methodology, Vol. 2, pp. 191–212, 2005.
[6] Lloyd, C.J., Zhou Yong: Kernel estimators of the ROC curve are better than empirical. Statistics and Probability Letters, Vol. 44, pp. 221–228, 1999.
[7] Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman and Hall, New York, 1986.
[8] Terrell, G.R.: The maximal smoothing principle in density estimation. Journal of the American Statistical Association, Vol. 85, No. 410, pp. 440–447, 1990.
[9] Wand, M.P., Jones, M.C.: Kernel Smoothing. Chapman & Hall, London, 1995.
[10] Zhang, S., Karunamuni, R.J., Jones, M.C.: An improved estimator of the density function at the boundary. Journal of the American Statistical Association, No. 448, pp. 1231–1241, 1999.