Full Terms & Conditions of access and use can be found at http://www.tandfonline.com/action/journalInformation?journalCode=utas20 The American Statistician ISSN: 0003-1305 (Print) 1537-2731 (Online) Journal homepage: http://www.tandfonline.com/loi/utas20 The Bagplot: A Bivariate Boxplot Peter J. Rousseeuw , Ida Ruts & John W. Tukey To cite this article: Peter J. Rousseeuw , Ida Ruts & John W. Tukey (1999) The Bagplot: A Bivariate Boxplot, The American Statistician, 53:4, 382-387 To link to this article: https://doi.org/10.1080/00031305.1999.10474494 Published online: 17 Feb 2012. Submit your article to this journal Article views: 739 Citing articles: 15 View citing articles Statistical Computing and Graphics~ The Bagplot: A Bivariate Boxplot Peter J. ROUSSEEUW,Ida RUTS,and John W. TUKEY We propose the bagplot, a bivariate generalization of the univariate boxplot. The key notion is the halfspace location depth of a point relative to a bivariate dataset, which extends the univariate concept of rank. The “depth median” is the deepest location, and it is surrounded by a “bag” containing the n/2 observations with largest depth. Magnifying the bag by a factor 3 yields the “fence” (which is not plotted). Observations between the bag and the fence are marked by a light gray loop, whereas observations outside the fence are flagged as outliers. The bagplot visualizes the location, spread, correlation, skewness, and tails of the data. It is equivariant for linear transformations, and not limited to elliptical distributions. Software for drawing the bagplot is made available for the S-Plus and MATLAB environments. The bagplot is illustrated on several datasets-for example, in a scatterplot matrix of multivariate data. KEY WORDS: display; Location depth; Ranks. Algorithms; Depth contours; Graphical 1. THE UNIVARIATEBOXPLOT The univariate boxplot (or box-and-whiskers plot) was proposed by Tukey (1977) as a tool for exploratory data analysis. Two examples of univariate boxplots are shown outside the frame of Figure 1,which displays the car weight IC, and engine displacement yi of 60 cars (Chambers and Hastie 1993, pages 46-47). Along the z-axis we see a horizontal boxplot of the I C ~ .It consists of a box from the lower quartile of the xi to their upper quartile, with a crossbar at the median of the xi. Outside of the box, the upper fence is given by Q2 +4(Q3 - Q2) and the lower fence by Q2 +4(Q1- Q2), where Q j is the jth quartile hence Q2 is the median. (The fences are not drawn.) The whiskers are the horizontal lines going from the box to the most extreme values inside the fences. For car weight, no xi lies outside the fences. Along the y-axis we see the vertical boxplot of the engine displacements. The four yi lying outside of the fences are Peter J. Rousseeuw is Professor, and Ida Ruts is Assistant, Department of Mathematics and ComputerScience,Universitaire Instelling Antwerpen (UIA), Universiteitsplein 1, B-2610 Wilrijk, Belgium. John W. Tnkey is Emeritus Professor, Princeton University. We are grateful to the referees, the associate editor, and the editor for their helpful suggestions. flagged as outliers. (Note that the Camaro and the Caprice share the same engine displacement, as do the Mustang and the Victoria, hence the vertical boxplot shows only two out- liers.) 2. THE BIVARIATE CASE The univariate boxplot is based on ranks since the box goes from the observation with rank 141to that with rank [%I, and the central bar of the box is drawn at the median. A natural generalization of ranks to multivariate data is the notion of halfspace depth (Tukey 1975),which we will explain in the next section. Using this concept, we propose a bivariate version of the boxplot. Its main components are a bug that contains 50% of the data points, a fence that separates inliers from outliers, and a loop indicating the points outside the bag but inside the fence. The resulting graph is called a bagplot. Consider the scatterplot in Figure 1.The depth medianthat is, the point with highest halfspace depth-lies in the center and is indicated by a cross. The bag is the polygon drawn as a full line, with dark gray interior. The observations that lie outside of the bag but inside of the fence are indicated by a light gray loop. The fence itself is not plotted Mustang V8 Ford VictoriaV8 Nissan Van w I I Weight (in pounds) Figure 1. Car weight and engine displacement of 60 cars. @ I999 American Statistical Association382 The American Statistician, November 1999, Vol. 53, No. 4 0 5 10 15 20 Figure 2. Bagplots of two datasets. because it would draw the attention away from the data. We also see four observations outside the fence. These outliers are indicated by black stars and labeled. We also labeled the Nissan Van because it came close to the fence, so it is a boundary case. Note that the bagplot generalizes the spine of the boxplot: For a very “flat” bivariate dataset (e.g., all vz FZ 0) the bag becomes a box. The light gray loop plays the same role as the two whiskers in one dimension, so we could call Figure 1 a “bag-and-bolster plot” to stress the analogy with the term “box-and-whiskers plot.” Like the univariate boxplot, the bagplot also visualizes several characteristics of the data: its location (the depth median), spread (the size of the bag), correlation (the orientation of the bag), skewness (the shape of the bag and the loop), and tails (the points near the boundary of the loop and the outliers). To illustrate these characteristics, Figure 2 contains the bagplots of two generated datasets, each with 100 points. Their medians (indicated by crosses) are far apart. The bags are of roughly the same size (area), so the datasets have a similar spread. But the bags have a different orientation: the left one slopes upward (positive correlation) and the other slopes downward. We also see that the first dataset is very skewed because the median lies in the lower left part of the bag, where the loop is also narrow, whereas the right part of the bag is wider and has a much wider loop. By contrast, the bagplot of the second dataset is nicely balanced, and its form suggests an elliptic distribution. Finally, both datasets are medium-tailed, judging from the size of the loop and the absence of outliers. When showing several (possibly overlapping) datasets in one display, it is convenient to plot the bags in different colors. For instance, one bag may be plotted in blue with a light blue loop and dark blue stars for the outliers, whereas the other bag may be red with a light red loop and dark red stars. 3. CONSTRUCTION OF THE BAGPLOT The halfspace location depth ldepth(8,Z) of some point 8 E R2relative to a bivariate data cloud 2 = {zl, z2,.. . ,z,} was introduced by Tukey (1975); see also Eddy (1985) and Green (1985). It is the smallest number of zi contained in any closed halfplane with boundary line through 8. A time-efficient algorithm for ldepth(8,Z) was provided by Rousseeuw and Ruts (1996). The depth region Dk is the set of all 8 with ldepth(8,Z) 2 k, and was algorithmically constructed by Ruts and Rousseeuw (1996). The depth regions are convex polygons, and D k + l c Dk. (Note that these regions are different from those generated by convex hull peeling, which first removes the vertices of the convex hull of the data cloud, then repeats this on the remainder of the dataset, and so on.) The depth median T*of 2 (Donoho and Gasko 1992) is defined as the 8 with highest ldepth(8,Z) if there is only one such 8.Otherwise, T*is defined as the center of gravity of the deepest region. An algorithm for T* was provided by Rousseeuw and Ruts (1998). We now construct the bag B as follows. Let #Dk denote the number of data points contained in Dk. One first determines the value k for which #Dk 5 Ln/2] < #Dkm1 and then interpolates linearly between DI,and Dk--l (relative to the point T*)to obtain the set B. The bag B is thus again a convex polygon. Appendix A of Rousseeuw and Ruts (1997) describes the construction of the bag in more detail. The fence is obtained by inflating B (relative to T*)by a factor 3. The choice of the value 3 is based on simulations (see Rousseeuw and Ruts 1997, sec. 5). The points outside the fence are flagged as outliers. The loop contains all the data points between the bag and the fence. To be precise, its outer boundary is the convex hull of the bag and the nonoutliers. When the observations zi = (zi,yi) are subjected to a translation and/or a nonsingular linear transformation (e.g., a rotation), their bagplot is transformed accordingly. This is because the halfspace depth is invariant under such mappings, and convex polygons are mapped to convex polygons. Therefore the points inside the bag remain inside, the outliers remain outliers. and so on. 4. EXAMPLES Figure 3a plots the concentration of plasma triglycerides against that of plasma cholesterol for n = 320 patients with evidence of narrowing arteries (Hand, Daly, Lunn, McConway, and Ostrowski 1994, p. 221). We see the depth median (marked by a cross), the dark gray bag, the light gray loop, and five outliers highlighted by black stars. Figure 3a illustrates the option of not plotting the points inside the bag, for people who prefer to avoid overplotting. (Our default is to plot all points.) Looking at this bagplot we see much skewness, which suggests taking the logarithm of both variables (also because they are both chemical concentrations). The result is shown in Figure 3b. We see that four of the five previous outliers are now inside the loop, whereas two new outliers arise at the bottom of the figure. The American Statistician, November 1999, Vol. 53, No. 4 383 Figure 4a plots the abundance of butterflies versus their altitudinal range, in a mountain area in the northern Iberian peninsula (Gutierrez and Menendez 1995). This figure illustrates an alternative representation of the bagplot. The bag is the polygon drawn as a full line, and now the boundary of the loop is indicated by a dotted line. In Figure 4a we note three outliers. When we plot the logarithm of the y-variable in Figure 4b, the outliers disappear. 3.5' 5. ALGORITHM AND IMPLEMENTATIONOF THE The entire display is constructed by the algorithm BAGPLOT, which draws on computational tools developed in earlier papers. Two components are the subroutines LDEPTH (Rousseeuw and Ruts 1996) which computes the BAGPLOT I I 1000 * t 700 150 200 250 300 350 400 Concentrationof plasma cholesterol(in mg/dl) (a) location depth of an arbitrary point in O(nlog n) time, and ISODEPTH (Ruts and Rousseeuw 1996) which constructs the vertices of a depth contour in O(n210gn)time. For computing the depth median, the O(n2(logn)2)subroutine HALFMED (Rousseeuw and Ruts 1998) is called. Since these times increase quickly with n, we currently use an approximation when n is larger than 150. In that case we compute the depth median and the bag for a random subsample of size 150, and perform the other computations (whose time is linear in n) on the full dataset. In this way we can easily deal with datasets of a few thousand points. Our S-Plus code for the bagplot is available from the website http://win-www.uia.ac.be/u/statis/index.html. Several options are available. For instance, the user may choose whether or not to plot the observations inside the bag. The outliers-that is, the observations outside the m m m 0 500 1000 1500 2000 Altitudinal range 7 , I I (a) . * .' = . I * * 384 Statistical Computing and Graphics fence-are always plotted. Any observation can be identified (e.g., labeled) by clicking on it. Our MATLAB code for the bagplot is also available from the website mentioned earlier. Again several options are available-for example, the shading inside the bag may be omitted. Figures 3a and 3b illustrate the option of not plotting the points inside the bag. The user may also choose to plot the fence. We prefer the bagplot as in Figures 1 and 2-that is, with dark gray shading for the bag, light gray shading for the loop, no fence, and plotting all the points. 0.5 0.45 0.4 -- 0.35 + In E 0.3Y ._ e 0.25 0.2 0.15 0.1 m cu 7 2, 0 r c? ~ - - - ~ - - - - 0 -2 -1 0 1 2 3 X Figure 5. Bagplot for linear data 50 100 150 200 250 300 350 400 450 Concentration of PCB 500 Figure 6. Blotched bagplot of PCB concentrationand shell thickness of 65 pelican eggs. The white blotch inside the bag is a 95% confidence region for the depth median of the population. By showing all the points we preserve the advantages of the scatterplot, because we can still see the local structure and note phenomena like a grid-like appearance (involvement of counts), curvature, clustering, holes in the data, and so on. For large datasets the computation time can be kept down in several ways. Recently, Johnson, Kwok, and Ng (1998) constructed a fast exact algorithm for depth contours which outperforms ISODEPTH as soon as n 2 500. Work is underway to construct algorithms for depth contours and the depth median with lower time complexity (Ileana Streinu, personal communication), but these are not yet available. For small datasets, Rousseeuw and Ruts (1997, sec. 5) found that the variability of the fence is too large to reliably detect outliers (this is unavoidable for the outlier detection problem in two dimensions). Therefore, when n < 15 we draw only the depth median T*and line segments between T* and the data points. If the dataset is linear, the bagplot reduces to a univariate boxplot. In that case our software draws it with a rectangular box, as in Figure 5. In a univariate boxplot the upper fence is given by Q2 +4(Q3- Q z ) and the lower fence by QZ+4(Q1 - Q 2 ) , where Qj is the jth quartile. (This version is well-suited for both symmetric and asymmetric distributions.) Note that the factor 4 in the univariate boxplot differs from the factor 3 used in the bivariate bagplot. Both values were obtained by simulations and experience, so the difference is not accidental but due to the dimensionality of the plots. (In Figure 5, using the factor 3 or 4 makes no difference because this example contains no points outside the fence by either definition.) 6. BLOTCHED BAGPLOTS It is possible to incorporate a confidence region for the depth median into the bagplot. In this way we can extend one idea of notches (McGill, Tukey, and Larsen 1978) to the bivariate case. Consider the example of Figure 6. For 65 Anacapa pelican eggs, the concentration in parts per million of PCB (polychlorinated biphenyl, an industrial pollutant) was measured, along with the thickness of the egg shell (Hand et al. 1994, p. 131). All the observations have been plotted, and no outliers are flagged. Around the depth median we have drawn the blotch, which is a 95% confidence region for the depth median of the population. It was obtained by means of formula (2) in Rousseeuw, Van Aelst and Hubert (1999).This formula allows us to find the largest value k for which P(ldepth(8,X,) 2 k ) 2 .95when X , comes from a distribution with population median 6.We then use the algorithm ISODEPTH to construct the corresponding depth region DI, = (0 E Rz;ldepth(0,Z) 2 k } which we call the blotch. For the pelican data we obtained k = 21 yielding the white blotch in Figure 6. 7. OTHER BIVARIATEDISPLAYS Another type of bivariate boxplot was proposed by Becketti and Gould (1987).They consider the univariate boxplot of I(: (as in Figure 1) from which they keep the median, the quartiles, and the endpoints of both whiskers. They do The American Statistician, November 1999, Vol. 53, No. 4 385 ** 8 ** Figure 7. Bagplot matrix of the three-dimensional aquifer data with 85 data points. the same for the y-variable, and then use these numbers to draw vertical and horizontal lines on the scatterplot, thereby forming a cross and a rectangle. Lenth (1988) modified this plot to put more emphasis on the univariate quartiles. However, neither version reflects the bivariate shape and correlation of the data. Goldberg and Iglewicz (1992) proposed two generalizations of the boxplot which are truly bivariate. When the data can be assumed to be elliptically symmetric, they construct a robust elliptic plot (relplot). Here the “box” is an ellipse, obtained by a robust method such as the minimum volume ellipsoid estimator proposed by Rousseeuw (1984). For asymmetric data Goldberg and Iglewicz (1992) constructed a quarter elliptic plot (quelplot) where the “box” consists of four quarter ellipses, computed by a kind of M-estimator. The bagplot differs from the relplot and the quelplot in that its shape is more general. Whereas the relplot/quelplot approach estimates parameters of (nearly) elliptical models, the bagplot is model-free because the halfspace depth is. Other variants of the bagplot have recently been constructed by Romanazzi (1997) and Liu, Parelius, and Singh (in press). Zani, Riani, and Corbellini (1998) applied convex hull peeling (see, e.g., Green 1985),which is somewhat less robust than halfspace depth, as shown by Donoho and Gasko (I992). Recently, Hyndman (1996) constructed a plot of highest density regions (HDR’s). First the bivariate density of 386 Statistical Computing and Graphics the data is estimated-for example, using a kernel method. Then the 50% HDR is given by the density contour that encompasses 50% of the mass. Typically, the 50% HDR and the 99% HDR are superimposed on the scatterplot of the data. Note that a HDR need not be convex, or even connected.This makes the HDR plot particularly useful to display multimodal distributions,because it focuses on local properties. The HDR plot is not a generalizationof the univariate boxplot;for instance,its one-dimensionalversion may contain several “boxes.” The basic notion of the HDR plot is data density,whereas the bagplot is based on ranking. Both approaches are appealing in different ways. The HDR approach is well-suited for multiple modes, but depends on the choice of a density estimator and a bandwidth. On the other hand, the bagplot is equivariant for linear transformations, needs fewer data points, and extends the familiar univariate boxplot. Note that both approacheshave different (but equally important)views of what constitutes an outlier: for the HDR plot it is a point lying in an empty area (this could be called a “thinlier”),whereas for the bagplot it is a point lying far away from the bulk of the data. It should be noted that the depth median T* and the bag B are robust, which makes the bagplot particularly suited for detecting (the latter type of) outliers. 8. MORE THAN TWO VARIABLES The halfspacedepth and the depth median exist in any dimension, so the bag can still be defined. For instance,there is now a fast approximate algorithm for the multivariate depth median (Struyf and Rousseeuw in press). This algorithm takes three minutes for 1,000 points in five dimensions. Currently, algorithms for depth contours in three or more dimensions are not yet available. In three dimensions the bag is a convex polyhedron, which in more dimensions becomes hard to visualize. However, in any dimension one can draw the bagplot matrix which containsthe bagplot of each pair of variables, as in Figure 7. This figure shows the aquifer data (Hand et al. 1994,p. 215-216) in which we changed the first two 5 2 values to create outliers. (The variables z1and z2 are coordinates measured in miles, and the water level x3 is measured in feet above sea level.) Each diagonal cell is the bagplot of a variable against itself, where all the points lie on the 45” line. By construction such a bagplot reduces to a univariate boxplot, as explained in Section 5. Here we have drawn these boxplots with their usual factor 4 (instead of the factor 3 used in the bivariate bagplots). An advantage of having all the data points in a bagplot matrix like Figure 7 is that we can still interactively color and brush as in the usual scatterplot matrices. The light gray zones of the bagplots should not detract from this, but rather aid the interpretation. [Received May 1998. Revised May 1999.1 REFERENCES Becketti, S.,and Gould, W. (1987),“RangefinderBox Plots,” TheAmerican Statistician, 41, 149. Chambers, J. M., and Hastie, T. J. (1993), StatisticalModels in S, London: Chapman and Hall. Donoho, D. L., and Gasko, M. (19921, “Breakdown Properties of Location Estimates Based on Halfspace Depth and Projected Outlyingness,” The Annals of Statistics, 20, 1803-1827. Eddy, W. F. (1985), “Ordering of Multivariate Data,” in ComputerScience and Statistics: Proceedings of the 16th Symposium on the Interface, ed. L. Billard, Amsterdam: North-Holland, pp. 25-30. Goldberg, K. M., and Iglewicz, B. (1992), “Bivariate Extensions of the Boxplot,” Technometrics,34, 307-320. Green, P. J. (1985), “Peeling Data,” in Encyclopedia of Statistical Sciences (Vol. 61, eds. S . Kotz and N. Johnson, New York Wiley, pp. 660-664. Gutihez, D., and Mentndez, R. (1995), “Distribution and Abundance of Butterflies in a Mountain Area in the Northern Iberian Peninsula,”Ecography, 18,209-216. Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J., and Ostrowski, E. (19941,A Handbook of Small Data Sets, London: Chapman and Hall. Hyndman, R. J. (1996), “Computing and Graphing Highest Density Regions,” The American Statistician, 50, 120-126. Johnson, T., Kwok, I., and Ng, R. (1998), “Fast Computation of 2Dimensional Depth Contours,” Technical Report, AT&T Research Center, Florham Park, NJ. Lenth, R. (1988), Comment on “Rangefinder Box Plots” by S. Becketti and W. Gould (with reply by Becketti), The American Statistician, 42, Liu, R. Y., Parelius, J. M., and Singh, K. (in press), “Multivariate Analysis by Data Depth Descriptive Statistics, Graphics and Inference,” Annals of Statistics. McGill, R., Tukey, J. W., and Larsen, W. A. (1978), “Variations of Box Plots,” The American Sfatistzcian,32, 12-16. Romanazzi, M. (1997), “A Schematic Plot for Bivariate Data,” Student, 2, Rousseeuw, P. J. (1984), “Least Median of Squares Regression,” Journal of the American Statistical Association, 79, 871-880. Rousseeuw, P. J., and Ruts, I. (1996), “AS 307: Bivariate Location Depth,” Applied Statistics (JRSS-C),45, 516-526. (1997), “The Bagplot: A Bivariate Box-and-Whiskers Plot,” Technical Report, Universitaire Instelling Antwerpen, Belgium. Available at http://win-www.uia.ac.be/u/statis/. (1998), “Constructing the Bivariate Tukey Median,” Statistica Sinica, 8, 827-839. Rousseeuw, P. J., Van Aelst, S., and Hubert, M. (1999), “Rejoinder to the Discussion of Regression Depth,” Journal of the American Statistical Association, 94, 419-433. Ruts, I., and Rousseeuw, P. J. (1996), “Computing Depth Contours of Bivariate Point Clouds,” Computational Statistics and Data Analysis, 23, Struyf, A., and Rousseeuw, P. J. (in press), “High-Dimensional Computation of the Deepest Location,” ComputationalStatistics and Data Anal- ysis. Tukey,J. W. (1975),“Mathematics and the Picturing of Data,” Proceedings of the International Congressof Mathematicians, 2, 523-53 1. (1977),ExploratoryData Analysis, Reading, MA: Addison-Wesley. Zani, S., Riani, M., and Corbellini, A. (1998), “Robust Bivariate Boxplots and Multiple Outlier Detection,” Computational Statistics and Data Analysis, 28, 257-270. 87-88. 149-1 58. 153-1 68. The American Statistician, November 1999, Vol. 53, No. 4 387