Bootstrap methods for nonparametric curve estimation

Ricardo Cao (Universidade da Coruña, Spain)

Masaryk University (Brno, Czech Republic), April 13-14, 2011

Contents

1 Introduction to the bootstrap
1.1 Motivation
1.2 The bootstrap method
1.3 Use of the bootstrap
1.4 An example: the median
2 Introduction to nonparametric curve estimation
2.1 Nonparametric density estimation
2.1.1 Background
2.1.2 Bias, variance and mean squared error
2.1.3 Mean integrated squared error (MISE)
2.2 Nonparametric regression estimation
3 Bootstrap methods for density estimation
3.1 Bootstrap approximation for the sampling distribution of the Parzen-Rosenblatt kernel estimator
3.1.1 Asymptotic distribution of the Parzen-Rosenblatt estimator
3.1.2 Plug-in approximation
3.1.3 Bootstrap approximation
3.2 Bootstrap methods for bandwidth selection
3.2.1 Asymptotic expression for the optimal bandwidth
3.2.2 Bootstrap version of MISE
3.2.3 Closed expression for MISE*
3.2.4 Pilot bandwidth choice
3.2.5 Asymptotic results
3.2.6 Gaussian kernel case
3.2.7 Comparison with other bandwidth selectors
4 Bootstrap methods for nonparametric regression estimation
4.1 Asymptotic distribution of the Nadaraya-Watson estimator
4.2 Plug-in approximation
4.3 Wild bootstrap
4.4 Smoothed bootstrap in the explanatory variable
4.5 Comparison of convergence rates

1 Introduction to the bootstrap

The basic ideas behind bootstrap methods are introduced in this section.

1.1 Motivation

Let us consider a simple random sample (SRS), (X_1, X_2, ..., X_n), from a distribution function F. Consider the problem of constructing a confidence interval for the mean, μ, of F, when the standard deviation, σ, is known. The classical statistic used for this aim is the well-known standardized difference between the sample mean and the population mean,

    T = n^{1/2} (X̄ - μ)/σ, with X̄ = (1/n) Σ_{i=1}^n X_i.

If F is a normal distribution then T has exactly a N(0,1) distribution. Since the distribution of T is known (and tabulated), it can be used to construct exact confidence intervals for μ. Even when F is not normal, T → N(0,1) in distribution, but for samples of moderate or small size the approximation of the distribution of T by a standard normal may be poor.

If F is the exponential distribution with parameter λ (F(x) = 1 - exp(-λx)), then

    S = Σ_{i=1}^n X_i ~ Γ(λ, n),

so T = n^{1/2} λ (X̄ - 1/λ) = (λS - n)/n^{1/2} is a standardized gamma random variable, whose distribution is not normal.

[Figure 1: Distribution of T with n = 5 for an exponential distribution with λ = 1.]

1.4 An example: the median

Consider the sample median, X*_{(m)}, of a bootstrap resample drawn from the empirical distribution function F_n, where m = (n+1)/2 for odd sample size n. Since the resample is an SRS from F_n, the number of bootstrap observations not exceeding x is binomial, so

    P*(X*_{(m)} ≤ x) = Σ_{j=m}^n C(n, j) F_n(x)^j (1 - F_n(x))^{n-j},

where C(n, j) denotes the binomial coefficient. Consequently, the resampling distribution of the bootstrap median is supported on the order statistics X_{(1)}, X_{(2)}, ..., X_{(n)}, with

    P*(X*_{(m)} = X_{(i)}) = P*(X*_{(m)} ≤ X_{(i)}) - P*(X*_{(m)} ≤ X_{(i-1)}).

2 Introduction to nonparametric curve estimation

2.1 Nonparametric density estimation

2.1.1 Background

Given a SRS, (X_1, X_2, ..., X_n), from a population with density f, the Parzen-Rosenblatt kernel estimator of f is

    f̂_h(x) = (1/n) Σ_{i=1}^n K_h(x - X_i) = (1/(nh)) Σ_{i=1}^n K((x - X_i)/h),

where K_h(u) = (1/h) K(u/h), K is a kernel function and h > 0 is a smoothing parameter, often called bandwidth, that regulates the size of the neighbourhood used for the estimation. This estimator generalizes the well-known histogram, more precisely the moving histogram.
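As an aside not present in the original derivations, the estimator is simple to compute directly. Here is a minimal Python sketch, assuming a Gaussian kernel; the function and variable names are illustrative choices, not notation from the notes:

    import numpy as np

    def parzen_rosenblatt(x, data, h):
        # f_h(x) = (1/(n h)) * sum_i K((x - X_i)/h), with K the standard normal density
        u = (x[:, None] - data[None, :]) / h
        K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
        return K.mean(axis=1) / h

    rng = np.random.default_rng(0)
    data = rng.normal(size=200)                    # simulated sample with true density N(0,1)
    grid = np.linspace(-4.0, 4.0, 201)
    f_hat = parzen_rosenblatt(grid, data, h=0.4)   # h fixed by hand here; see Section 3.2

Any density can play the role of K; the uniform kernel treated next recovers the moving histogram.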
Choosing K as the density of a U(-1,1), the Parzen-Rosenblatt estimator results in

    f̂_h(x) = (1/(nh)) Σ_{i=1}^n K((x - X_i)/h) = #{X_i ∈ (x - h, x + h)} / (2nh),

which is the relative frequency of data X_i in the interval (x - h, x + h) divided by the length of this interval (2h).

It is typically assumed that the kernel function, K, is nonnegative and integrates to one:

    ∫_{-∞}^{∞} K(u) du = 1.

It is also common to assume that K is a symmetric function: K(-u) = K(u). The choice of the function K does not have a big impact on the properties of the estimator (only on its regularity: continuity, differentiability, etc.), but the choice of the smoothing parameter is crucial for a correct estimation. In other words, the size of the neighbourhood used for the nonparametric estimation should be adequate (not too large, not too small).

2.1.2 Bias, variance and mean squared error

Straightforward calculations lead to the bias of the Parzen-Rosenblatt estimator:

    E(f̂_h(x)) - f(x) = (K_h * f)(x) - f(x),

where * denotes the convolution operator: (f * g)(x) = ∫ f(x - y) g(y) dy. Using the bias expression, an asymptotic expression for the bias can be obtained:

    E(f̂_h(x)) - f(x) = (d_K/2) h^2 f''(x) + O(h^4),

with d_K = ∫ t^2 K(t) dt. The variance can be handled similarly:

    Var(f̂_h(x)) = (1/(nh^2)) Var(K((x - X_1)/h))
                = (1/(nh^2)) [ ∫ K((x - y)/h)^2 f(y) dy - ( ∫ K((x - y)/h) f(y) dy )^2 ]
                = (1/(nh)) ((K^2)_h * f)(x) - (1/n) ((K_h * f)(x))^2.

Its asymptotic expression results in

    Var(f̂_h(x)) = (c_K/(nh)) f(x) - (1/n) f(x)^2 + O(h/n),

with c_K = ∫ K(t)^2 dt. Consequently, the mean squared error of the estimator is

    MSE(f̂_h(x)) = E[ (f̂_h(x) - f(x))^2 ] = Bias(f̂_h(x))^2 + Var(f̂_h(x))
                = [ (K_h * f)(x) - f(x) ]^2 + (1/(nh)) ((K^2)_h * f)(x) - (1/n) ((K_h * f)(x))^2.

Its asymptotic expression is

    MSE(f̂_h(x)) = (d_K^2/4) h^4 f''(x)^2 + (c_K/(nh)) f(x) - (1/n) f(x)^2 + O(h^6) + O(h/n).

2.1.3 Mean integrated squared error (MISE)

A global error measure (not tied to a particular x) for the estimator is the mean integrated squared error:

    MISE(f̂_h) = ∫ E[ (f̂_h(x) - f(x))^2 ] dx = ∫ MSE(f̂_h(x)) dx
              = ∫ [ (K_h * f)(x) - f(x) ]^2 dx + c_K/(nh) - (1/n) ∫ [ (K_h * f)(x) ]^2 dx.

An asymptotic expression for it is the following:

    MISE(f̂_h) = (d_K^2/4) h^4 ∫ f''(x)^2 dx + c_K/(nh) - (1/n) ∫ f(x)^2 dx + O(h^6) + O(h/n).

The negative effect of choosing a too large or too small bandwidth h is evident from this expression.

2.2 Nonparametric regression estimation

Let {(X_1,Y_1), (X_2,Y_2), ..., (X_n,Y_n)} be a SRS from a two-dimensional population (X,Y), with E(|Y|) < ∞. We would like to estimate the regression function of Y given X, m(x) = E(Y | X = x). The regression function can be written as

    m(x) = ∫ y f_{2|1}(y|x) dy = ∫ y f(x,y)/f_1(x) dy = ( ∫ y f(x,y) dy ) / f_1(x) = Ψ(x)/f_1(x),

where f_1(x) is the marginal density function of X and

    Ψ(x) = ∫ y f_{1|2}(x|y) f_2(y) dy = E( Y f_{1|2}(x|Y) ).

The functions Ψ(x) and f_1(x) can be estimated using the kernel method:

    f̂_{1,h}(x) = (1/n) Σ_{i=1}^n K_h(x - X_i),   Ψ̂_h(x) = (1/n) Σ_{i=1}^n K_h(x - X_i) Y_i,

which gives the Nadaraya-Watson kernel estimator (see Nadaraya (1964) and Watson (1964)):

    m̂_h(x) = Ψ̂_h(x)/f̂_{1,h}(x) = Σ_{i=1}^n K_h(x - X_i) Y_i / Σ_{i=1}^n K_h(x - X_i).

Similar properties to those mentioned for the Parzen-Rosenblatt kernel density estimator can be proved for this regression estimator.
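As with the density estimator, a short Python sketch may help fix ideas; the Gaussian kernel and all names are illustrative choices, not part of the notes:

    import numpy as np

    def nadaraya_watson(x, X, Y, h):
        # m_h(x) = sum_i K_h(x - X_i) Y_i / sum_i K_h(x - X_i);
        # the kernel normalizing constants cancel in the ratio
        K = np.exp(-0.5 * ((x[:, None] - X[None, :]) / h) ** 2)
        return (K * Y[None, :]).sum(axis=1) / K.sum(axis=1)

    rng = np.random.default_rng(1)
    X = rng.uniform(0.0, 2.0 * np.pi, size=300)
    Y = np.sin(X) + rng.normal(scale=0.3, size=300)   # noisy observations of m(x) = sin(x)
    m_hat = nadaraya_watson(np.linspace(0.0, 2.0 * np.pi, 101), X, Y, h=0.3)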
3 Bootstrap methods for density estimation

In the next two subsections two different problems in nonparametric density estimation are considered: approximating the sampling distribution of the Parzen-Rosenblatt kernel estimator (to obtain confidence intervals for the density, for instance) and bandwidth selection. The bootstrap is used for both aims.

3.1 Bootstrap approximation for the sampling distribution of the Parzen-Rosenblatt kernel estimator

Before introducing the bootstrap in this setup, the limit distribution of the Parzen-Rosenblatt estimator will be presented. Other approximations will also be considered (see Cao (1990) for further details on these results).

3.1.1 Asymptotic distribution of the Parzen-Rosenblatt estimator

The minimal requirements for the bias and the variance to tend to zero as the sample size tends to infinity are h → 0 and nh → ∞. Under these assumptions,

    (nh)^{1/2} ( f̂_h(x) - f(x) ) → N(B, V) in distribution.

On the other hand, it can be proved that the asymptotically optimal value of h, in the sense of MSE, is h = c_0 n^{-1/5}, with

    c_0 = ( c_K f(x) / ( d_K^2 f''(x)^2 ) )^{1/5}.

This choice of h leads to the following values for the mean and the variance of the normal limit distribution:

    B = (1/2) c_0^{5/2} d_K f''(x),   V = c_K f(x).

In order to use the limit distribution to construct confidence intervals for f(x) one can

1. estimate B and V and use them in the normal distribution (plug-in method), or
2. design a resampling plan and use the bootstrap method.

3.1.2 Plug-in approximation

It consists in estimating B and V by means of

    B̂ = (1/2) c_0^{5/2} d_K f̂_g''(x),   V̂ = c_K f̂_h(x),

where g is a suitable bandwidth for estimating the second derivative of the density function. Using the Berry-Esseen inequality, the following rate of convergence can be obtained:

    sup_{z ∈ R} | P[ (nh)^{1/2} ( f̂_h(x) - f(x) ) ≤ z ] - Φ( (z - B̂)/V̂^{1/2} ) | = O_P(n^{-1/5}).

This rate is worse than that of the theoretical normal approximation, based on the exact mean and variance (B_n and V_n):

    sup_{z ∈ R} | P[ (nh)^{1/2} ( f̂_h(x) - f(x) ) ≤ z ] - Φ( (z - B_n)/V_n^{1/2} ) | = O(n^{-2/5}),

but it is not worse than that of the asymptotic normal, N(B, V), which has a rate of order O(n^{-1/5}).

3.1.3 Bootstrap approximation

The bootstrap resampling plan is as follows (a sketch in code is given at the end of this subsection):

1. Use the sample (X_1, X_2, ..., X_n) and a pilot bandwidth, g, to compute the Parzen-Rosenblatt estimator, f̂_g.
2. Draw bootstrap resamples (X*_1, X*_2, ..., X*_n) from the density f̂_g.
3. Construct the bootstrap version of the Parzen-Rosenblatt estimator:

    f̂*_h(x) = (1/(nh)) Σ_{i=1}^n K((x - X*_i)/h).

4. Approximate the sampling distribution of (nh)^{1/2} ( f̂_h(x) - f(x) ) by the resampling distribution of (nh)^{1/2} ( f̂*_h(x) - f̂_g(x) ).

If we were interested in the bias or the variance of f̂_h(x) (rather than in its sampling distribution), then Step 4 would be replaced by computing the bootstrap version of the bias, E*( f̂*_h(x) - f̂_g(x) ), or of the variance, Var*( f̂*_h(x) ).

In the previous algorithm the bandwidth g has to be asymptotically larger than h. In fact, a reasonable choice for g is the minimizer of E[ ( f̂_g''(x) - f''(x) )^2 ]. Asymptotically, this minimizer has the form

    g_0 = ( 5 f(x) ∫ K''(t)^2 dt / ( d_K^2 f''''(x)^2 ) )^{1/9} n^{-1/9}.

The convergence rate for the bootstrap approximation is given by

    sup_{z ∈ R} | P[ (nh)^{1/2} ( f̂_h(x) - f(x) ) ≤ z ] - P*[ (nh)^{1/2} ( f̂*_h(x) - f̂_g(x) ) ≤ z ] | = O_P(n^{-2/9}),

which is better than the rates of the asymptotic normal approximation and of the plug-in method.
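The plan above is easy to implement when K is Gaussian, since drawing a resample from f̂_g amounts to resampling the data with replacement and adding g times standard normal noise. A minimal sketch, with illustrative names and bandwidths:

    import numpy as np

    def kde(x, data, h):
        # Parzen-Rosenblatt estimator with Gaussian kernel at the points in x
        u = (np.atleast_1d(x)[:, None] - data[None, :]) / h
        return (np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)).mean(axis=1) / h

    def bootstrap_law(data, x, h, g, B=1000, seed=None):
        # Resampling law of (nh)^{1/2} (f*_h(x) - f_g(x)), approximating
        # the sampling law of (nh)^{1/2} (f_h(x) - f(x))
        rng = np.random.default_rng(seed)
        n = data.size
        fg_x = kde(x, data, g)[0]
        reps = np.empty(B)
        for b in range(B):
            # Step 2: X* ~ f_g, i.e. resample the data and add g * N(0,1) noise
            star = rng.choice(data, size=n, replace=True) + g * rng.normal(size=n)
            reps[b] = np.sqrt(n * h) * (kde(x, star, h)[0] - fg_x)   # Steps 3-4
        return reps

    rng = np.random.default_rng(2)
    data = rng.normal(size=100)
    reps = bootstrap_law(data, x=0.0, h=0.35, g=0.7, B=500, seed=3)
    # quantiles of reps approximate those of (nh)^{1/2} (f_h(0) - f(0)),
    # from which confidence intervals for f(0) follow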
3.2 Bootstrap methods for bandwidth selection

3.2.1 Asymptotic expression for the optimal bandwidth

The MISE has an asymptotic expression that may be used as a criterion to obtain an optimal value for the smoothing parameter:

    MISE(h) = AMISE(h) + O(h^6) + O(h/n),

with

    AMISE(h) = (d_K^2/4) h^4 ∫ f''(x)^2 dx + c_K/(nh) - (1/n) ∫ f(x)^2 dx.

The smoothing parameter that minimizes AMISE is

    h_{AMISE} = ( c_K / ( d_K^2 ∫ f''(x)^2 dx ) )^{1/5} n^{-1/5}.

There exist plenty of methods devoted to the problem of bandwidth selection. Among them we mention plug-in methods, cross-validation methods (smoothed or not) and, of course, bootstrap methods (see, for instance, Marron (1992)).

3.2.2 Bootstrap version of MISE

The basic idea consists in providing a smoothed bootstrap resampling plan to estimate MISE. We will follow the proposal by Cao (1993). It consists of the following steps:

1. Given the sample (X_1, X_2, ..., X_n), a pilot bandwidth, g, is used to compute the Parzen-Rosenblatt kernel estimator, f̂_g.
2. Bootstrap resamples (X*_1, X*_2, ..., X*_n) are drawn from the density f̂_g.
3. For every h > 0, the bootstrap version of the Parzen-Rosenblatt estimator is computed:

    f̂*_h(x) = (1/(nh)) Σ_{i=1}^n K((x - X*_i)/h).

4. The bootstrap version of MISE is constructed:

    MISE*(h) = ∫ E*( f̂*_h(x) - f̂_g(x) )^2 dx.

5. MISE*(h) is minimized in h > 0 and the bootstrap selector is obtained:

    h*_{MISE} = argmin_{h>0} MISE*(h).

3.2.3 Closed expression for MISE*

It is possible to obtain a closed expression for the bootstrap version of MISE:

    MISE*(h) = ∫ [ (K_h * f̂_g)(x) - f̂_g(x) ]^2 dx + c_K/(nh) - (1/n) ∫ [ (K_h * f̂_g)(x) ]^2 dx
             = c_K/(nh) - (1/n^3) Σ_{i,j=1}^n [ (K_h * K_g) * (K_h * K_g) ](X_i - X_j)
               + (1/n^2) Σ_{i,j=1}^n [ (K_h * K_g - K_g) * (K_h * K_g - K_g) ](X_i - X_j).

Thus MISE*(h) can be computed exactly, with no Monte Carlo resampling at all. Such a property is not very common for the bootstrap in other setups.

3.2.4 Pilot bandwidth choice

The bandwidth selection problem for the pilot smoothing parameter, g, is closely related to minimizing the mean integrated squared error of the density curvature estimator, E ∫ ( f̂_g''(x) - f''(x) )^2 dx. The asymptotic value of the pilot bandwidth, g, minimizing the previous expression is

    g_0 = ( 5 ∫ K''(t)^2 dt / ( d_K^2 ∫ f''''(x)^2 dx ) )^{1/9} n^{-1/9}.

3.2.5 Asymptotic results

Using any deterministic pilot bandwidth, g, with the property

    (g - g_0)/g_0 = O(n^{-1/14}),

it holds that

    ( h*_{MISE} - h_{MISE} ) / h_{MISE} = O_P(n^{-5/14}),
    ( MISE(h*_{MISE}) - MISE(h_{MISE}) ) / MISE(h_{MISE}) = O_P(n^{-5/7}).

Using somewhat more sophisticated techniques (that let the pilot bandwidth, g, depend on h), slightly better rates can be obtained:

    ( h*_{MISE} - h_{MISE} ) / h_{MISE} = O_P(n^{-1/2}).

3.2.6 Gaussian kernel case

If the kernel function, K, is Gaussian (the density function of a N(0,1)), then:

• K_h is the density of a N(0, h^2)
• K_g is the density of a N(0, g^2)
• K_h * K_g is the density of a N(0, h^2 + g^2)
• (K_h * K_g) * (K_h * K_g) is the density of a N(0, 2h^2 + 2g^2)
• (K_h * K_g) * K_g is the density of a N(0, h^2 + 2g^2)
• K_g * K_g is the density of a N(0, 2g^2)

Consequently,

    MISE*(h) = c_K/(nh) - (1/n^3) Σ_{i,j=1}^n K_{(2h^2+2g^2)^{1/2}}(X_i - X_j)
             + (1/n^2) Σ_{i,j=1}^n [ K_{(2h^2+2g^2)^{1/2}}(X_i - X_j) - 2 K_{(h^2+2g^2)^{1/2}}(X_i - X_j) + K_{(2g^2)^{1/2}}(X_i - X_j) ].

3.2.7 Comparison with other bandwidth selectors

The bootstrap method presented here is very similar to the smoothed cross-validation method proposed by Hall, Marron and Park (1992). In comparative simulation studies (see Cao, Cuevas and González-Manteiga (1993)) it can be observed that this bootstrap method is very competitive with other methods for bandwidth selection. In general it is among the best-behaved selectors, together with the solve-the-equation plug-in method of Sheather and Jones (1991) and smoothed cross-validation. A sketch of the selector for the Gaussian kernel case is given at the end of this section.

There exist other bootstrap bandwidth selectors that exhibit a much worse behaviour. Among them we include:

• Hall (1990), which resamples from the empirical cdf, so it does not mimic the bias.
• Faraway and Jhun (1990), which chooses g as the cross-validation bandwidth, which turns out to be too small.
• Taylor (1989), which chooses g = h, so that MISE*(h) → 0 as h → ∞, producing a global minimum of MISE* that is inconsistent with h_{MISE}.
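As announced, here is a sketch of the bootstrap bandwidth selector for the Gaussian kernel case, combining the closed expression of Section 3.2.6 with a simple grid minimization; the grid, the pilot constant and all names are illustrative choices, not prescriptions from the notes:

    import numpy as np

    def mise_star(h, data, g):
        # Closed-form MISE*(h) for Gaussian kernels (Section 3.2.6); no resampling needed
        n = data.size
        d = data[:, None] - data[None, :]               # all pairwise differences X_i - X_j
        phi = lambda s: np.exp(-0.5 * (d / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
        ck = 1.0 / (2.0 * np.sqrt(np.pi))               # c_K for the Gaussian kernel
        s1 = np.sqrt(2.0 * h ** 2 + 2.0 * g ** 2)
        s2 = np.sqrt(h ** 2 + 2.0 * g ** 2)
        s3 = np.sqrt(2.0 * g ** 2)
        return (ck / (n * h)
                - phi(s1).sum() / n ** 3
                + (phi(s1) - 2.0 * phi(s2) + phi(s3)).sum() / n ** 2)

    def bootstrap_bandwidth(data, g, h_grid):
        # h*_MISE = argmin_h MISE*(h), minimized over a grid for simplicity
        values = [mise_star(h, data, g) for h in h_grid]
        return h_grid[int(np.argmin(values))]

    rng = np.random.default_rng(4)
    data = rng.normal(size=200)
    g = data.std() * data.size ** (-1.0 / 9.0)          # pilot of order n^{-1/9}; constant ad hoc
    h_star = bootstrap_bandwidth(data, g, np.linspace(0.05, 1.5, 60))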
4 Bootstrap methods for nonparametric regression estimation

In this section, two bootstrap methods are presented for nonparametric regression. The aim is to approximate the sampling distribution of the Nadaraya-Watson estimator. The asymptotic results show the behaviour of these bootstrap methods, conditionally on the sample of the explanatory variable as well as in an unconditional sense.

4.1 Asymptotic distribution of the Nadaraya-Watson estimator

Before presenting the bootstrap in this context, it is useful to recall the limit distribution of the Nadaraya-Watson estimator,

    m̂_h(x) = Σ_{i=1}^n K_h(x - X_i) Y_i / Σ_{i=1}^n K_h(x - X_i).

As in the density case, it can be proved that the minimal conditions required for the consistency of the estimator are h → 0 and nh → ∞ as n → ∞. Under these assumptions,

    (nh)^{1/2} ( m̂_h(x) - m(x) ) → N(B, V) in distribution.

On the other hand, it can be proved that the asymptotically optimal value of h, in the sense of MSE, is of the form h = c_0 n^{-1/5}. For such a bandwidth, the mean and the variance of the normal limit distribution are

    B = (1/2) c_0^{5/2} d_K ( m''(x) f(x) + 2 m'(x) f'(x) ) / f(x),
    V = c_K σ^2(x) / f(x),

where f(x) is the marginal density function of X and σ^2(x) = Var(Y | X = x) is the conditional variance of Y given X = x.

As in the density case, to construct confidence intervals for m(x) we may

1. estimate B and V and use them in the corresponding normal distribution (plug-in method), or
2. design a resampling plan using the bootstrap method.

The rates of convergence for the approximation of the (conditional or unconditional) distribution of the statistic by the normal limit distribution are:

    sup_{z ∈ R} | P_{Y|X}[ (nh)^{1/2} ( m̂_h(x) - m(x) ) ≤ z ] - Φ( (z - B)/V^{1/2} ) | = O_P(n^{-1/5}),
    sup_{z ∈ R} | P[ (nh)^{1/2} ( m̂_h(x) - m(x) ) ≤ z ] - Φ( (z - B)/V^{1/2} ) | = O(n^{-2/5}),

where P_{Y|X}(B) denotes the conditional probability P(B | X_1, X_2, ..., X_n).

4.2 Plug-in approximation

The plug-in approximation consists in estimating B and V using suitable estimators of f(x), f'(x), m(x), m'(x), m''(x) and σ^2(x). For each of these functions one could use bandwidth selectors intended to approximate the corresponding optimal bandwidths (a rather tedious process). Using such an approach, estimators B̂ of the bias and V̂ of the variance can be obtained, for which it may be proved that B̂ - B = O_P(n^{-2/9}) and V̂ - V = O_P(n^{-2/5}). Consequently, the following (conditional and unconditional) convergence rates can be proved for the plug-in approximation:

    sup_{z ∈ R} | P_{Y|X}[ (nh)^{1/2} ( m̂_h(x) - m(x) ) ≤ z ] - Φ( (z - B̂)/V̂^{1/2} ) | = O_P(n^{-1/5}),
    sup_{z ∈ R} | P[ (nh)^{1/2} ( m̂_h(x) - m(x) ) ≤ z ] - Φ( (z - B̂)/V̂^{1/2} ) | = O_P(n^{-2/9}).

The first rate is the same as, and the second one is worse than, the corresponding rate for the theoretical normal limit approximation (see Cao (1991)).

4.3 Wild bootstrap

This bootstrap resampling method, proposed by Wu (1986) and studied by Härdle and Marron (1991), proceeds as follows (a code sketch is given below, after the choice of the resampling distribution is discussed):

1. Using the Nadaraya-Watson estimator of m(x) with the initial smoothing parameter, h, the residuals are constructed:

    ε̂_i = Y_i - m̂_h(X_i), i = 1, 2, ..., n.

2. For every index i = 1, 2, ..., n, and conditionally on the observed sample, {(X_1,Y_1), (X_2,Y_2), ..., (X_n,Y_n)}, a bootstrap residual, ε*_i, is drawn from a probability distribution fulfilling E*(ε*_i) = 0, E*(ε*_i^2) = ε̂_i^2 and E*(ε*_i^3) = ε̂_i^3. Although the third moment condition is not strictly necessary, it is useful to prove the asymptotic validity of the method.

3. Using a pilot bandwidth, g, asymptotically larger than h (i.e. g/h → ∞), bootstrap versions of the observations of the response variable are drawn:

    Y*_i = m̂_g(X_i) + ε*_i, i = 1, 2, ..., n.

4. The bootstrap resample {(X_1,Y*_1), (X_2,Y*_2), ..., (X_n,Y*_n)} is used to construct a bootstrap version of the Nadaraya-Watson estimator:

    m̂*_h(x) = Σ_{i=1}^n K_h(x - X_i) Y*_i / Σ_{i=1}^n K_h(x - X_i).

5. The sampling distribution of (nh)^{1/2} ( m̂_h(x) - m(x) ) is approximated by the resampling distribution of (nh)^{1/2} ( m̂*_h(x) - m̂_g(x) ).

Step 2 is usually carried out by considering a random variable, V*, fulfilling E*(V*) = 0, E*(V*^2) = 1 and E*(V*^3) = 1, drawing a sample of size n from it, (V*_1, V*_2, ..., V*_n), and then defining ε*_i = ε̂_i V*_i for i = 1, 2, ..., n. A common choice for the distribution of V* is the discrete distribution that gives positive probability to only two points, P*(V* = a) = p and P*(V* = b) = 1 - p, obtained as the solution of the system of three equations given by the first three moments:

    a p + b (1 - p) = 0,
    a^2 p + b^2 (1 - p) = 1,
    a^3 p + b^3 (1 - p) = 1.

This gives rise to the so-called wild bootstrap (or golden section bootstrap), with a = (1 - 5^{1/2})/2, b = (1 + 5^{1/2})/2 and p = (5 + 5^{1/2})/10, i.e.

    P*( V* = (1 - 5^{1/2})/2 ) = (5 + 5^{1/2})/10,
    P*( V* = (1 + 5^{1/2})/2 ) = (5 - 5^{1/2})/10.
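A minimal sketch of the whole wild bootstrap loop with the golden-section multipliers; a Gaussian kernel is assumed and all names are illustrative:

    import numpy as np

    def nw(x, X, Y, h):
        # Nadaraya-Watson estimator at the points in x
        K = np.exp(-0.5 * ((np.atleast_1d(x)[:, None] - X[None, :]) / h) ** 2)
        return (K * Y[None, :]).sum(axis=1) / K.sum(axis=1)

    def wild_bootstrap(X, Y, x, h, g, B=1000, seed=None):
        # Resampling law of (nh)^{1/2} (m*_h(x) - m_g(x)), following Steps 1-5 above
        rng = np.random.default_rng(seed)
        n = X.size
        res = Y - nw(X, X, Y, h)                 # Step 1: residuals with bandwidth h
        m_g = nw(X, X, Y, g)                     # pilot fit used to generate Y*
        m_g_x = nw(x, X, Y, g)[0]
        a, b = (1.0 - np.sqrt(5.0)) / 2.0, (1.0 + np.sqrt(5.0)) / 2.0
        p = (5.0 + np.sqrt(5.0)) / 10.0
        reps = np.empty(B)
        for i in range(B):
            V = np.where(rng.uniform(size=n) < p, a, b)   # Step 2: E*V = 0, E*V^2 = E*V^3 = 1
            Ystar = m_g + res * V                         # Step 3: Y*_i = m_g(X_i) + e_i V_i
            reps[i] = np.sqrt(n * h) * (nw(x, X, Ystar, h)[0] - m_g_x)   # Steps 4-5
        return reps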
The selection of the pilot bandwidth, g, appearing in Step 3, is strongly linked to the estimation of m''(x), since this is the critical term in the estimation of B and V. Taking a pilot bandwidth of optimal order in this sense, g_0 = d_0 n^{-1/9}, the following (conditional and unconditional) rates of convergence for the wild bootstrap approximation can be obtained:

    sup_{z ∈ R} | P_{Y|X}[ (nh)^{1/2} ( m̂_h(x) - m(x) ) ≤ z ] - P*[ (nh)^{1/2} ( m̂*_h(x) - m̂_g(x) ) ≤ z ] | = O_P(n^{-2/9}),
    sup_{z ∈ R} | P[ (nh)^{1/2} ( m̂_h(x) - m(x) ) ≤ z ] - P*[ (nh)^{1/2} ( m̂*_h(x) - m̂_g(x) ) ≤ z ] | = O_P(n^{-1/5}).

4.4 Smoothed bootstrap in the explanatory variable

The idea of this resampling method is to take into account the variability coming from the explanatory variable (in the wild bootstrap this part of the resampling is kept fixed) and also to allow the resampling distribution of Y*|X*=x not to be degenerate (as it is in the two-dimensional naive bootstrap). The resampling plan, proposed by Cao and González-Manteiga (1993), consists of the following steps:

1. Given the sample {(X_1,Y_1), (X_2,Y_2), ..., (X_n,Y_n)}, an estimator of the joint distribution function of (X,Y) is constructed:

    F̂_g(x, y) = (1/n) Σ_{i=1}^n 1{Y_i ≤ y} ∫_{-∞}^{x} K_g(u - X_i) du.
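When K is Gaussian, the inner integral in this estimator is a normal cdf, so F̂_g is easy to evaluate. A sketch, assuming SciPy is available and with illustrative names:

    import numpy as np
    from scipy.stats import norm

    def F_hat(x, y, X, Y, g):
        # F_g(x, y) = (1/n) sum_i 1{Y_i <= y} Phi((x - X_i)/g),
        # since integrating the Gaussian K_g up to x gives a normal cdf
        return np.mean((Y <= y) * norm.cdf((x - X) / g))

    rng = np.random.default_rng(5)
    X = rng.normal(size=100)
    Y = 2.0 * X + rng.normal(size=100)
    value = F_hat(0.0, 0.0, X, Y, g=0.3)   # estimate of P(X <= 0, Y <= 0)

Note that drawing (X*, Y*) from F̂_g amounts to choosing an index I uniformly at random and setting Y* = Y_I and X* = X_I + g Z, with Z drawn from K, which makes explicit the smoothing in the explanatory variable.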