Nonparametric Statistics, Vol. 2, pp. 307-331
© 1993 Gordon and Breach Science Publishers
Printed in the United States of America

TESTS OF LINEAR HYPOTHESES BASED ON REGRESSION RANK SCORES

C. GUTENBRUNNER (1), J. JUREČKOVÁ (2), R. KOENKER (3) and S. PORTNOY (3)

(1) Philipps Universität, Marburg, Germany, (2) Charles University, Prague, Czechoslovakia and (3) University of Illinois at Urbana-Champaign, USA

(Received September 25, 1992; in final form December 16, 1992)

Dedicated to the memory of Jaroslav Hájek

We propose a general class of asymptotically distribution-free tests of a linear hypothesis in the linear regression model. The tests are based on regression rank scores, recently introduced by Gutenbrunner and Jurečková (1992) as dual variables to the regression quantiles of Koenker and Bassett (1978). Their properties are analogous to those of the corresponding rank tests in the location model. Unlike other regression tests based on aligned rank statistics, however, our tests do not require preliminary estimation of nuisance parameters; indeed, they are invariant with respect to a regression shift of the nuisance parameters.

KEYWORDS: Ranks, regression quantiles, regression rank scores.

1. INTRODUCTION

Several authors including Koul (1970), Puri and Sen (1985) and Adichie (1978) have developed asymptotically distribution-free tests of linear hypotheses for the linear regression model based upon aligned rank statistics. Excellent reviews of these results, including extensions to multivariate models, may be found in Puri and Sen (1985) and the survey paper of Adichie (1984). The hypothesis under consideration typically involves nuisance parameters which require preliminary estimation; the aligned (or signed) rank statistics are then based on residuals from the preliminary estimate.
Alternative approaches to inference based on rank estimation have been considered by McKean and Hettmansperger (1978), Aubuchon and Hettmansperger (1988) and Draper (1988), among others.

A completely new approach to the construction of rank statistics for the linear model has recently been introduced by Gutenbrunner and Jurečková (1992). Their approach is based on the dual solutions to the regression quantile statistics of Koenker and Bassett (1978). These regression rank scores represent a natural extension of the "location rank scores" introduced by Hájek and Šidák (1967, Section V.3.5), which play a fundamental role in the classical theory of rank statistics.

AMS 1980 subject classifications: 62G10, 62J10.
The work was partially supported by NSF grants 88-02555 and 89-22472 to S. Portnoy and R. Koenker and by support from the Australian National University to J. Jurečková and R. Koenker.

In this paper we consider tests of a general linear hypothesis for the linear regression model based upon regression rank scores. These tests have the advantages of more familiar rank tests: they are robust to outliers in the response variable, and they are asymptotically distribution free in the sense that no nuisance parameter depending on the error distribution need be estimated in order to compute the test statistic. Furthermore, they are considerably simpler than many of the proposed aligned rank tests, which require preliminary estimation of the linear model by computationally demanding rank estimation methods. The robustness of the proposed tests and the sensitivity of the aligned rank procedures to response outliers are illustrated in the sensitivity analysis of the example discussed in Section 2.

In the classical linear model,

$$Y = X\beta + E, \qquad (1.1)$$

the vector $\hat\beta(\alpha) = (\hat\beta_1(\alpha), \dots, \hat\beta_p(\alpha))' \in R^p$ of $\alpha$th regression quantiles is any solution of the problem

$$\min_{t \in R^p} \sum_{i=1}^{n} \rho_\alpha(Y_i - x_i't), \qquad (1.2)$$

where

$$\rho_\alpha(u) = |u|\,\{(1-\alpha)\,I[u<0] + \alpha\,I[u>0]\}, \quad u \in R^1.$$
(1.3)

Least absolute error regression corresponds to the median case with $\alpha = 1/2$. In the one-sample location model, with $X = 1_n$, solutions to (1.2) are the ordinary sample quantiles; when $n\alpha$ is an integer we have an interval of solutions between two adjacent order statistics. Computation of the regression quantiles is greatly facilitated by expressing (1.2) as the linear program

$$\alpha\,1_n'u^+ + (1-\alpha)\,1_n'u^- := \min, \qquad X\beta + u^+ - u^- = Y, \qquad (1.4)$$

$\beta \in R^p$, $u^+, u^- \in R^n_+$ and $1_n = (1, \dots, 1)' \in R^n$, with $0 < \alpha < 1$.

Even in this form, the problem of finding all the regression quantile solutions may appear computationally demanding, since there would appear to be a distinct problem to solve for each $\alpha \in (0, 1)$. Fortunately, there are only a few distinct solutions. In the location model we know, of course, that there are at most $n$ distinct quantiles. In regression, Portnoy (1991) has shown that the number of distinct solutions to (1.2) is $O_p(n \log n)$. Finding all the regression quantiles is a straightforward exercise in parametric linear programming. From any given solution for fixed $\alpha$ we may compute the interval containing $\alpha$ for which this solution remains optimal, and one simplex pivot brings us to a new solution at either endpoint of the interval. Proceeding in this way we may compute the entire path $\hat\beta(\cdot)$, which is a piecewise constant function from $[0, 1]$ to $R^p$. Detailed descriptions of algorithms to compute the regression quantiles may be found in Koenker and d'Orey (1990) and Osborne (1992).

Finite-sample as well as asymptotic properties of $\hat\beta(\alpha)$ are studied in Koenker and Bassett (1978), Ruppert and Carroll (1980), Jurečková (1984), Gutenbrunner (1986), Koenker and Portnoy (1987), Gutenbrunner and Jurečková (1992), and Portnoy (1991b).

The regression rank scores introduced in Gutenbrunner and Jurečková (1992) arise as an $n$-vector $\hat a_n(\alpha) = (\hat a_{n1}(\alpha), \dots, \hat a_{nn}(\alpha))'$ of solutions to the dual form of the linear program required to compute the regression quantiles.
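The location-model statement above (minimizers of (1.2) are sample quantiles, with an interval of solutions when $n\alpha$ is an integer) can be checked numerically. The following is a minimal sketch, not the parametric linear programming algorithm cited in the text: it simply evaluates the objective at the data points. The sample `y` and the helper names `rho`, `objective` and `location_quantile` are hypothetical and chosen only for illustration.

```python
def rho(u, alpha):
    """Check function (1.3): |u| * ((1 - alpha) * I[u < 0] + alpha * I[u > 0])."""
    if u < 0:
        return (alpha - 1.0) * u      # equals |u| * (1 - alpha)
    return alpha * u


def objective(t, y, alpha):
    """Objective of (1.2) in the location model X = 1_n."""
    return sum(rho(yi - t, alpha) for yi in y)


def location_quantile(y, alpha):
    """Return one minimizer of the objective, searching over the data points.

    The objective is convex and piecewise linear with kinks at the data,
    so at least one data point is always a minimizer.
    """
    return min(sorted(y), key=lambda t: objective(t, y, alpha))


y = [2.0, -1.0, 4.0, 0.5, 3.0]        # hypothetical sample, n = 5
q30 = location_quantile(y, 0.3)       # n * alpha = 1.5 is not an integer: unique solution
flat_left = objective(0.5, y, 0.4)    # n * alpha = 2 is an integer: the objective is
flat_mid = objective(1.2, y, 0.4)     # constant between the 2nd and 3rd order
flat_right = objective(2.0, y, 0.4)   # statistics, 0.5 and 2.0
```

With $\alpha = 0.3$ the minimizer is the second order statistic, while with $\alpha = 0.4$ the three values `flat_left`, `flat_mid` and `flat_right` agree up to rounding, which is the interval of solutions mentioned in the text.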
The formal dual program to (1.4) can be written in the form

$$Y'\hat a(\alpha) := \max, \qquad X'\hat a(\alpha) = (1-\alpha)\,X'1_n, \qquad \hat a(\alpha) \in [0,1]^n, \quad 0 < \alpha < 1. \qquad (1.5)$$

As shown in Gutenbrunner and Jurečková (1992), many aspects of the duality of order statistics and ranks in the location model generalize naturally to the linear model through (1.4) and (1.5). Moreover, as pointed out there, $\hat a$ is regression invariant with respect to $X$, in the sense that $\hat a(\alpha)$ is unchanged if $Y$ is transformed to $Y + X\gamma$ for any $\gamma \in R^p$.

To motivate our approach, consider $\{\hat a_n(\alpha),\ 0 < \alpha < 1\}$ in the location model with $X = 1_n$. In this case, $\hat a_{ni}(\alpha)$ specializes to

$$\hat a_{ni}(\alpha) = a^*(R_i, \alpha) = \begin{cases} 1 & \text{if } \alpha \le (R_i - 1)/n \\ R_i - \alpha n & \text{if } (R_i - 1)/n < \alpha \le R_i/n \\ 0 & \text{if } R_i/n < \alpha \end{cases} \qquad (1.6)$$

where $R_i$ is the rank of $Y_i$ among $Y_1, \dots, Y_n$. The function $a^*(j, \alpha)$, $j = 1, \dots, n$, $0 \le \alpha \le 1$, [...]

$$\hat a_{ni}(\alpha) = \begin{cases} 1 & \text{if } Y_i > \sum_{j=1}^{p} x_{ij}\hat\beta_j(\alpha) \\ 0 & \text{if } Y_i < \sum_{j=1}^{p} x_{ij}\hat\beta_j(\alpha) \end{cases} \qquad (1.7)$$

while the components of $\hat a_n(\alpha)$ corresponding to $\{i : Y_i = x_i'\hat\beta(\alpha)\}$ are determined by the equality constraints of (1.5). Thus, as in the location model, the regression rank score for observation $i$ is one while $y_i$ is above the $\alpha$th quantile regression plane, zero when $y_i$ falls below this plane, and takes an intermediate value while $y_i$ falls on the $\alpha$th plane. Integrating the regression rank score function for each observation over $[0, 1]$ yields a vector of (Wilcoxon) ranks: observations falling "below" most of the others receive small ranks, while those falling "above" the others, and thus having rank score one over a wide interval, receive large ranks. This observation is completely transparent in the location model, where "above" and "below" have an obvious interpretation. In regression, the interpretation of these terms relies on the optimization problem defining the regression quantiles. The resulting rank scores, illustrated for example in Figure 2.1, are, we believe, a useful graphical diagnostic in linear regression in addition to their role in formal hypothesis testing.
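The location-model formula (1.6) is simple enough to evaluate directly, which makes two of the claims above easy to check: the scores satisfy the equality constraint of (1.5), which for $X = 1_n$ reads $\sum_i \hat a_{ni}(\alpha) = n(1-\alpha)$, and integrating each score over $[0, 1]$ returns a rescaled rank. The value $(R_i - 1/2)/n$ used below follows from integrating (1.6) piece by piece. This is a sketch under those assumptions; the sample `y` and the helper names are hypothetical, and ties are assumed absent.

```python
def location_rank_score(rank, alpha, n):
    """a*(R_i, alpha) from (1.6) for an observation with rank R_i among n."""
    if alpha <= (rank - 1) / n:
        return 1.0
    if alpha <= rank / n:
        return rank - alpha * n
    return 0.0


def ranks(y):
    """Ranks 1..n of the observations (assumes no ties)."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    r = [0] * len(y)
    for position, i in enumerate(order, start=1):
        r[i] = position
    return r


def integrated_score(rank, n, grid=20000):
    """Midpoint-rule approximation of the integral of a*(rank, .) over [0, 1]."""
    h = 1.0 / grid
    return sum(location_rank_score(rank, (k + 0.5) * h, n) for k in range(grid)) * h


y = [2.0, -1.0, 4.0, 0.5, 3.0]                   # hypothetical sample, n = 5
n = len(y)
r = ranks(y)
areas = [integrated_score(ri, n) for ri in r]    # close to (R_i - 1/2) / n
constraint_at_037 = sum(location_rank_score(ri, 0.37, n) for ri in r)
```

Here `constraint_at_037` equals $n(1 - 0.37) = 3.15$ up to rounding, and `areas` reproduces the ranks of `y` on the scale $(R_i - 1/2)/n$.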
The next section of the paper surveys our results, establishes some notation, and provides an illustrative example. Section 3 develops some theory of the regression rank score process. Section 4 treats the theory of simple linear rank statistics based on this process, and Section 5 contains a formal treatment of the proposed tests.

2. NOTATION AND PRELIMINARY CONSIDERATIONS

We will partition the classical linear regression model

$$Y = X\beta + E \qquad (2.1)$$

as

$$Y = X_1\beta_1 + X_2\beta_2 + E \qquad (2.2)$$

where $\beta_1$ and $\beta_2$ are $p$- and $q$-dimensional parameters, $X = X_n$ is a known $n \times (p+q)$ design matrix with rows $x_{ni}' = x_i' = (x_{1i}', x_{2i}') \in R^{p+q}$, $i = 1, \dots, n$. We will assume throughout that $x_{i1} = 1$ for $i = 1, \dots, n$. $Y$ is a vector of observations and $E$ is an $n \times 1$ vector of i.i.d. errors with common distribution function $F$. As in the familiar two-sample rank test, our test statistic is shift-invariant and hence independent of location. Thus, as with other rank tests, hypotheses on the intercept cannot be tested. This is immediately apparent from the regression invariance of the test statistic noted above. The precise form of $F$ need not be known, but we shall generally assume that $F$ has an absolutely continuous density $f$ on $(A, B)$, where $-\infty \le A \equiv \sup\{x : F(x) = 0\}$ and $+\infty \ge B \equiv \inf\{x : F(x) = 1\}$. Moreover, we shall impose some conditions on the tails of $f$, assuming, among other conditions, that $f$ monotonically decreases to 0 when $x \to A+$ or $x \to B-$.

Define

$$D_n = n^{-1}X_1'X_1, \qquad H_1 = X_1(X_1'X_1)^{-1}X_1' \qquad \text{and} \qquad Q_n = n^{-1}(X_2 - \hat X_2)'(X_2 - \hat X_2) \qquad (2.3)$$

with $\hat X_2 = H_1X_2$ being the projection of $X_2$ on the space spanned by the columns of $X_1$. We shall also assume

$$\lim_{n\to\infty} D_n = D, \qquad \lim_{n\to\infty} Q_n = Q \qquad (2.4)$$

where $D$ and $Q$ are positive definite $(p \times p)$ and $(q \times q)$ matrices, respectively. We are interested in testing the hypothesis

$$H_0: \beta_2 = 0, \quad \beta_1 \text{ unspecified} \qquad (2.5)$$

versus the Pitman (local) alternatives

$$H_n: \beta_{2n} = n^{-1/2}\beta_0 \qquad (2.6)$$

with $\beta_0$ being a fixed vector in $R^q$.

As in the classical theory of rank tests, we shall consider a score function [...] $(\varphi(t) - \bar\varphi)^2\,dt$.
[...] depends only on the score function and not on (the unknown) $F$. This is familiar from the theory of rank tests, but stands in sharp contrast with other methods of testing in the linear model, where typically some estimation of a scale parameter of $F$ is required to compute the test statistic. See for example the discussion in Aubuchon and Hettmansperger (1989) and Draper (1988).

We shall show in Section 5 that the asymptotic distribution of $T_n$ under $H_0$ is central $\chi^2$ with $q$ degrees of freedom, while under $H_n$ it is noncentral $\chi^2$ with $q$ degrees of freedom and noncentrality parameter $\eta^2$, where

$$\eta^2 = [\gamma^2(\varphi, F)/A^2(\varphi)]\,\beta_0'Q\beta_0 \qquad (2.11)$$

$$\gamma(\varphi, F) = -\int_0^1 \varphi(t)\,df(F^{-1}(t)).$$

Thus for Wilcoxon scores (see below) the asymptotic relative efficiency of the test based on $T_n$ relative to the classical $F$ test is $3/\pi = 0.955$ at the normal distribution and is bounded below by 0.864 for all $F$. When $F$ is heavy tailed this asymptotic efficiency is generally greater than one, and can in fact be unbounded.

For normal (van der Waerden) scores ($\varphi(t) = \Phi^{-1}(t)$) the situation is even more striking. Here the test based on $T_n$ has asymptotic efficiency of at least one, relative to the classical $F$ test, for all symmetric $F$, attaining one at the normal distribution. See e.g. Lehmann (1959, p. 239) and Lehmann (1983, pp. 383-87).

Let us now examine more closely the scores (2.7), which can be written as

$$\hat b_{ni} = -\int_0^1 \varphi(t)\,d\hat a_{ni}(t), \quad i = 1, \dots, n.$$

There are three typical choices of $\varphi$: [...]

Figure 2.1. Regression rank scores for tobacco data. (One panel per observation, each titled with the observation number and its rank; legible panel titles include Obs No 7, rank = -0.17; Obs No 22, rank = -0.16; Obs No 30, rank = -0.06; Obs No 10, rank = -0.06; Obs No 24, rank = -0.01; Obs No 16, rank = 0.41; Obs No 9, rank = 0.48.)

[...] original data. The same perturbation of $y_i$ changes the Wilcoxon regression rank score test statistic from 13.17 to 14.70, with a correlation between the two rank vectors of 0.87. A more robust initial estimator would improve the performance of the aligned rank test somewhat. The regression rank score version of the test is seen to be relatively insensitive to such perturbations. One should be aware that comparable perturbations in the $X$ design observations may wreak havoc even with the rank score form of the test.
Recent work of Antoch and Jurečková (1985) and de Jongh, de Wet, and Welsh (1988) contains suggestions on robustifying regression quantiles, and therefore the corresponding regression rank scores, to the effect of influential design points. Computation of the tests was carried out in S+ using the algorithm described in Koenker and d'Orey (1987, 1990) to compute regression quantiles.

3. PROPERTIES OF REGRESSION RANK SCORES

Consider the linear regression model (2.1) with design $X_n$ of dimension $n \times p$. Let $\hat\beta(\alpha) \in R^p$ be the $\alpha$-regression quantile and $\hat a(\alpha) \in R^n$ be the vector of $\alpha$th regression rank scores defined in (2.7). We see from the form of the linear constraints in (1.5) that the regression rank scores are regression invariant, i.e.,

$$\hat a_n(\alpha, Y + Xb) = \hat a_n(\alpha, Y), \quad b \in R^p. \qquad (3.1)$$

Moreover, in view of the invariance, we may assume

$$\sum_{i=1}^{n} x_{ij} = 0, \quad j = 2, \dots, p \qquad (3.2)$$

without loss of generality. Our primary interest in this section will be the properties of the regression rank scores process

$$\{\hat a_n(t) : 0 \le t \le 1\}. \qquad (3.3)$$

Gutenbrunner and Jurečková (1992) studied the process

$$W_n^d = \Big\{ W_n^d(t) = n^{-1/2}\sum_{i=1}^{n} d_{ni}\hat a_{ni}(t) : 0 \le t \le 1 \Big\} \qquad (3.4)$$

and showed that

$$W_n^d(t) = V_n^d(t) + o_p(1) \qquad (3.5)$$

where

$$V_n^d(t) = n^{-1/2}\sum_{i=1}^{n} d_{ni}\,I[E_i > F^{-1}(t)] \qquad (3.6)$$

as $n \to \infty$, uniformly on any fixed interval $[\varepsilon, 1-\varepsilon]$, where $0 < \varepsilon < 1/2$, for any appropriately standardized triangular array $\{d_{ni} : i = 1, \dots, n\}$ of vectors from $R^q$. They also showed that the process (3.3) (and hence (3.4)) has continuous trajectories and, under the standardization $\sum_{i=1}^{n} d_{ni} = 0$, (3.5) is tied down to 0 at $t = 0$ and $t = 1$. The same authors also established the weak convergence of (3.4) to the Brownian bridge over $[\varepsilon, 1-\varepsilon]$. Note however that Theorem V.3.5 in Hájek and Šidák (1967) establishes the weak convergence of (3.4) to the Brownian bridge over the entire interval $[0, 1]$ in the special case of the location submodel.
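In the location submodel the rank scores are given explicitly by (1.6), so the weighted process (3.4) can be traced out by hand. The sketch below does this for a hypothetical set of scalar weights `d` summing to zero and a hypothetical rank vector `rank_list`; it only illustrates the tied-down, continuous trajectory described above and says nothing about the limiting Brownian bridge.

```python
import math


def location_rank_score(rank, t, n):
    """Location-model rank score a*(R_i, t) of (1.6)."""
    if t <= (rank - 1) / n:
        return 1.0
    if t <= rank / n:
        return rank - t * n
    return 0.0


def rank_score_process(d, rank_list, t):
    """n^{-1/2} * sum_i d_i * a_i(t): the process (3.4) in the location model."""
    n = len(d)
    total = sum(di * location_rank_score(ri, t, n) for di, ri in zip(d, rank_list))
    return total / math.sqrt(n)


d = [-2.0, -1.0, 0.0, 1.0, 2.0]     # hypothetical weights, centered so that sum(d) == 0
rank_list = [3, 1, 5, 2, 4]         # hypothetical ranks of Y_1, ..., Y_5
path = [rank_score_process(d, rank_list, k / 100) for k in range(101)]
largest_step = max(abs(b - a) for a, b in zip(path, path[1:]))
```

The path starts and ends at zero because every score equals one at $t = 0$ and zero at $t = 1$, and `largest_step` stays small because each score is piecewise linear in $t$.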
Here we extend the results of Gutenbrunner and Jurečková (1992) in the tails of $[0, 1]$, in order to find the asymptotic behavior of the rank scores and the test statistics (2.7) and (2.8), for which the score functions are not constant in the tails. It may be noted that this extension is rather delicate. If the rank scores involved integration from $\varepsilon$ to $1 - \varepsilon$ (i.e., if $\varphi$ were constant near 0 and 1), then the earlier Gutenbrunner-Jurečková (1992) representation theorem could be used to obtain the asymptotic distribution theory here under somewhat weaker hypotheses (see the remark following Theorem 5.1). It is the desirability of treating such tests as the Wilcoxon and Normal Scores Tests that requires the extensions here. Nonetheless, the fact shown here that the rank score process can be represented uniformly on an interval $(\alpha_n^*, 1 - \alpha_n^*)$ with $\alpha_n^*$ decreasing as a negative power of $n$ (precisely, $\alpha_n^* = n^{-1/(1+4b)}$) is rather remarkable and of independent theoretical interest.

To this end, we will assume that the errors $E_1, \dots, E_n$ in (2.1) are independent and identically distributed according to the distribution function $F(x)$, which has an absolutely continuous density $f$. We will assume that $f$ is positive for $A < x < B$, where $A = \sup\{x : F(x) = 0\}$ and $B = \inf\{x : F(x) = 1\}$. For $0 < \alpha < 1$, let $\psi_\alpha$ denote the score function corresponding to (1.2):

$$\psi_\alpha(x) = \alpha - I[x < 0], \quad x \in R^1. \qquad (3.7)$$

We shall impose the following conditions on $F$:

(F.1) $|F^{-1}(\alpha)| \le c\,(\alpha(1-\alpha))^{-a}$ for $0 < \alpha$ [...] $> 0$ and $c > 0$.

(F.2) $1/f(F^{-1}(\alpha)) \le c\,(\alpha(1-\alpha))^{-1-a}$ for $0 < \alpha$ [...] $> 0$.

(F.3) $f(x) > 0$ is absolutely continuous, bounded and monotonically decreasing as $x \to A+$ and $x \to B-$. The derivative $f'$ is bounded a.e.

(F.4) $|f'(x)/f(x)| \le c\,|x|$ for $|x| \ge K \ge 0$, $c > 0$.

Remark. These conditions are satisfied, for example, by the normal, logistic, double exponential and $t$ distributions with 5 or more degrees of freedom. Condition (F.1) implies $\int |t|^{4+\delta}\,dF(t) < +\infty$ for some $\delta > 0$. Hence, using (F.4) also, $F$ has finite Fisher information, a fact to be applied in Theorem 5.1.

The following design assumptions will also be employed.
(X.1) $x_{i1} = 1$, $i = 1, \dots, n$.

(X.2) $\lim_{n\to\infty} D_n = D$, where $D_n = n^{-1}X_n'X_n$ and $D$ is a positive definite $p \times p$ matrix.

(X.3) $n^{-1}\sum_{i=1}^{n} \|x_i\|^4 = O(1)$ as $n \to \infty$.

(X.4) $\max_{1 \le i \le n} \|x_i\| = O(n^{(2(b-a)-\delta)/(1+4b)})$ for some $b > 0$ and $\delta > 0$ such that $0 < b - a < \varepsilon/2$ (hence $0 < b < 1/4 - \varepsilon/2$).

[...]

The following theorem, which follows from Theorem 3.2, is an extension of Theorem V.3.5 in Hájek and Šidák (1967) to the regression rank scores. Some applications of this result to Kolmogorov-Smirnov type tests appear in Jurečková (1991).

Theorem 3.3. Under the conditions of Theorem 3.2, as $n \to \infty$, [...]

[...] under which we may integrate (3.47) and obtain an asymptotic representation for $S_n$ of the form

$$S_n = n^{-1/2}\sum_{i=1}^{n} d_{ni}\,\varphi(F(E_i)) + o_p(1). \qquad (4.2)$$

We shall prove (4.2) for

[...] and denote the respective integrals by $I_1, \dots, I_5$. Regarding Theorem 3.2, we immediately get that $I_3 \to 0$ by the dominated convergence theorem. Similarly, for some [...]

Finally, $|I_1| \le$ [...] $= I_{11} + I_{12}$, [...] $= n^{-1/2}\sum_{i=1}^{n} d_{ni}\,[\varphi(\alpha_n^*) - \varphi(F(E_i))]\,I[F(E_i) < \alpha_n^*]$.

Now we may assume that $\varphi(\alpha_n^*) < 0$ for $n \ge n_0$, since otherwise, if $\varphi$ were bounded from below, then $I_{12} \to 0$. Hence

$$\mathrm{Var}(I_{12}) \le n^{-1}\sum_{i=1}^{n} d_{ni}^2\,E\big([2\varphi(F(E_i))]^2\,I[F(E_i) < \alpha_n^*]\big) \to 0$$

due to the square-integrability of $\varphi$. Treating the integrals $I_4$, $I_5$ analogously, we arrive at (4.5), and this proves the representation (4.2). □

5. TESTS OF LINEAR SUBHYPOTHESES BASED ON REGRESSION RANK SCORES

Returning to the model (2.2), assume that the design matrix $X = (X_1 : X_2)$ satisfies the conditions (X.1)-(X.4), (2.3) and (2.4). We want to test the hypothesis $H_0: \beta_2 = 0$ ($\beta_1$ unspecified) against the alternative $H_n: \beta_{2n} = n^{-1/2}\beta_0$ ($\beta_0 \in R^q$ fixed). Let $\hat a_n(\alpha) = (\hat a_{n1}(\alpha), \dots, \hat a_{nn}(\alpha))'$ denote the regression rank scores corresponding to the submodel

$$Y = X_1\beta_1 + E \quad \text{under } H_0. \qquad (5.1)$$

Let $\varphi(t): (0, 1) \to R^1$ be a nondecreasing and square integrable score-generating function. Define the scores $\hat b_{ni}$, $i = 1, \dots, n$, by the relation (2.7), and consider the test statistic

$$T_n = S_n'Q_n^{-1}S_n / A^2(\varphi) \qquad (5.2)$$

where

$$S_n = n^{-1/2}(X_{n2} - \hat X_{n2})'\hat b_n \qquad (5.3)$$

and where $Q_n$ and $A^2(\varphi)$ [...] satisfying (4.3), and nondecreasing and square-integrable on $(0, 1)$.

(i) Then, under $H_0$, the statistic $T_n$ is asymptotically central $\chi^2$ with $q$ degrees of freedom.

(ii) Under $H_n$, $T_n$ is asymptotically noncentral $\chi^2$ with $q$ degrees of freedom and with noncentrality parameter

$$\eta^2 = \beta_0'Q\beta_0\,\gamma^2(\varphi, F)/A^2(\varphi)$$

[...] and $h$. Such local alternatives provide insight into the structure of the regions of constant efficiency for regression rank [...]
Proof. (i) It follows from Theorem 4.1 that, under $H_0$, $S_n$ has the same asymptotic distribution as $\tilde S_n = n^{-1/2}(X_{n2} - \hat X_{n2})'\tilde b_n$, where $\tilde b_n = (\tilde b_{n1}, \dots, \tilde b_{nn})'$ and $\tilde b_{ni} =$