Full Terms & Conditions of access and use can be found at
http://www.tandfonline.com/action/journalInformation?journalCode=utas20
The American Statistician
ISSN: 0003-1305 (Print) 1537-2731 (Online) Journal homepage: http://www.tandfonline.com/loi/utas20
The Bagplot: A Bivariate Boxplot
Peter J. Rousseeuw , Ida Ruts & John W. Tukey
To cite this article: Peter J. Rousseeuw , Ida Ruts & John W. Tukey (1999) The Bagplot: A
Bivariate Boxplot, The American Statistician, 53:4, 382-387
To link to this article: https://doi.org/10.1080/00031305.1999.10474494
Published online: 17 Feb 2012.
Submit your article to this journal
Article views: 739
Citing articles: 15 View citing articles
Statistical Computing and Graphics~
The Bagplot: A Bivariate Boxplot
Peter J. ROUSSEEUW,Ida RUTS,and John W. TUKEY
We propose the bagplot, a bivariate generalization of the
univariate boxplot. The key notion is the halfspace location
depth of a point relative to a bivariate dataset, which extends
the univariate concept of rank. The “depth median” is the
deepest location, and it is surrounded by a “bag” containing
the n/2 observations with largest depth. Magnifying the
bag by a factor 3 yields the “fence” (which is not plotted).
Observations between the bag and the fence are marked by
a light gray loop, whereas observations outside the fence
are flagged as outliers. The bagplot visualizes the location,
spread, correlation, skewness, and tails of the data. It is
equivariant for linear transformations, and not limited to
elliptical distributions. Software for drawing the bagplot is
made available for the S-Plus and MATLAB environments.
The bagplot is illustrated on several datasets-for example,
in a scatterplot matrix of multivariate data.
KEY WORDS:
display; Location depth; Ranks.
Algorithms; Depth contours; Graphical
1. THE UNIVARIATEBOXPLOT
The univariate boxplot (or box-and-whiskers plot) was
proposed by Tukey (1977) as a tool for exploratory data
analysis. Two examples of univariate boxplots are shown
outside the frame of Figure 1,which displays the car weight
IC, and engine displacement yi of 60 cars (Chambers and
Hastie 1993, pages 46-47). Along the z-axis we see a horizontal
boxplot of the I C ~ .It consists of a box from the lower
quartile of the xi to their upper quartile, with a crossbar
at the median of the xi. Outside of the box, the upper
fence is given by Q2 +4(Q3 - Q2) and the lower fence
by Q2 +4(Q1- Q2), where Q j is the jth quartile hence Q2
is the median. (The fences are not drawn.) The whiskers are
the horizontal lines going from the box to the most extreme
values inside the fences. For car weight, no xi lies outside
the fences.
Along the y-axis we see the vertical boxplot of the engine
displacements. The four yi lying outside of the fences are
Peter J. Rousseeuw is Professor, and Ida Ruts is Assistant, Department
of Mathematics and ComputerScience,Universitaire Instelling Antwerpen
(UIA), Universiteitsplein 1, B-2610 Wilrijk, Belgium. John W. Tnkey is
Emeritus Professor, Princeton University. We are grateful to the referees,
the associate editor, and the editor for their helpful suggestions.
flagged as outliers. (Note that the Camaro and the Caprice
share the same engine displacement, as do the Mustang and
the Victoria, hence the vertical boxplot shows only two out-
liers.)
2. THE BIVARIATE CASE
The univariate boxplot is based on ranks since the box
goes from the observation with rank 141to that with rank
[%I, and the central bar of the box is drawn at the median.
A natural generalization of ranks to multivariate data is
the notion of halfspace depth (Tukey 1975),which we will
explain in the next section. Using this concept, we propose
a bivariate version of the boxplot. Its main components are
a bug that contains 50% of the data points, a fence that
separates inliers from outliers, and a loop indicating the
points outside the bag but inside the fence. The resulting
graph is called a bagplot.
Consider the scatterplot in Figure 1.The depth medianthat
is, the point with highest halfspace depth-lies in the
center and is indicated by a cross. The bag is the polygon
drawn as a full line, with dark gray interior. The observations
that lie outside of the bag but inside of the fence are
indicated by a light gray loop. The fence itself is not plotted
Mustang V8 Ford VictoriaV8
Nissan Van
w
I I
Weight (in pounds)
Figure 1. Car weight and engine displacement of 60 cars.
@ I999 American Statistical Association382 The American Statistician, November 1999, Vol. 53, No. 4
0 5 10 15 20
Figure 2. Bagplots of two datasets.
because it would draw the attention away from the data. We
also see four observations outside the fence. These outliers
are indicated by black stars and labeled. We also labeled
the Nissan Van because it came close to the fence, so it is
a boundary case.
Note that the bagplot generalizes the spine of the boxplot:
For a very “flat” bivariate dataset (e.g., all vz FZ 0) the bag
becomes a box. The light gray loop plays the same role as
the two whiskers in one dimension, so we could call Figure
1 a “bag-and-bolster plot” to stress the analogy with the
term “box-and-whiskers plot.”
Like the univariate boxplot, the bagplot also visualizes
several characteristics of the data: its location (the depth
median), spread (the size of the bag), correlation (the orientation
of the bag), skewness (the shape of the bag and the
loop), and tails (the points near the boundary of the loop
and the outliers).
To illustrate these characteristics, Figure 2 contains the
bagplots of two generated datasets, each with 100 points.
Their medians (indicated by crosses) are far apart. The bags
are of roughly the same size (area), so the datasets have a
similar spread. But the bags have a different orientation: the
left one slopes upward (positive correlation) and the other
slopes downward. We also see that the first dataset is very
skewed because the median lies in the lower left part of the
bag, where the loop is also narrow, whereas the right part
of the bag is wider and has a much wider loop. By contrast,
the bagplot of the second dataset is nicely balanced, and its
form suggests an elliptic distribution. Finally, both datasets
are medium-tailed, judging from the size of the loop and
the absence of outliers.
When showing several (possibly overlapping) datasets in
one display, it is convenient to plot the bags in different
colors. For instance, one bag may be plotted in blue with a
light blue loop and dark blue stars for the outliers, whereas
the other bag may be red with a light red loop and dark red
stars.
3. CONSTRUCTION OF THE BAGPLOT
The halfspace location depth ldepth(8,Z) of some
point 8 E R2relative to a bivariate data cloud 2 =
{zl, z2,.. . ,z,} was introduced by Tukey (1975); see also
Eddy (1985) and Green (1985). It is the smallest number
of zi contained in any closed halfplane with boundary line
through 8. A time-efficient algorithm for ldepth(8,Z) was
provided by Rousseeuw and Ruts (1996). The depth region
Dk is the set of all 8 with ldepth(8,Z) 2 k, and was algorithmically
constructed by Ruts and Rousseeuw (1996).
The depth regions are convex polygons, and D k + l c Dk.
(Note that these regions are different from those generated
by convex hull peeling, which first removes the vertices of
the convex hull of the data cloud, then repeats this on the
remainder of the dataset, and so on.)
The depth median T*of 2 (Donoho and Gasko 1992) is
defined as the 8 with highest ldepth(8,Z) if there is only
one such 8.Otherwise, T*is defined as the center of gravity
of the deepest region. An algorithm for T* was provided
by Rousseeuw and Ruts (1998).
We now construct the bag B as follows. Let #Dk denote
the number of data points contained in Dk. One first
determines the value k for which #Dk 5 Ln/2] < #Dkm1
and then interpolates linearly between DI,and Dk--l (relative
to the point T*)to obtain the set B. The bag B is thus
again a convex polygon. Appendix A of Rousseeuw and
Ruts (1997) describes the construction of the bag in more
detail.
The fence is obtained by inflating B (relative to T*)by a
factor 3. The choice of the value 3 is based on simulations
(see Rousseeuw and Ruts 1997, sec. 5). The points outside
the fence are flagged as outliers. The loop contains all the
data points between the bag and the fence. To be precise,
its outer boundary is the convex hull of the bag and the
nonoutliers.
When the observations zi = (zi,yi) are subjected to a
translation and/or a nonsingular linear transformation (e.g.,
a rotation), their bagplot is transformed accordingly. This
is because the halfspace depth is invariant under such mappings,
and convex polygons are mapped to convex polygons.
Therefore the points inside the bag remain inside, the outliers
remain outliers. and so on.
4. EXAMPLES
Figure 3a plots the concentration of plasma triglycerides
against that of plasma cholesterol for n = 320 patients
with evidence of narrowing arteries (Hand, Daly, Lunn, McConway,
and Ostrowski 1994, p. 221). We see the depth
median (marked by a cross), the dark gray bag, the light
gray loop, and five outliers highlighted by black stars. Figure
3a illustrates the option of not plotting the points inside
the bag, for people who prefer to avoid overplotting. (Our
default is to plot all points.) Looking at this bagplot we
see much skewness, which suggests taking the logarithm
of both variables (also because they are both chemical concentrations).
The result is shown in Figure 3b. We see that
four of the five previous outliers are now inside the loop,
whereas two new outliers arise at the bottom of the figure.
The American Statistician, November 1999, Vol. 53, No. 4 383
Figure 4a plots the abundance of butterflies versus their
altitudinal range, in a mountain area in the northern Iberian
peninsula (Gutierrez and Menendez 1995). This figure illustrates
an alternative representation of the bagplot. The
bag is the polygon drawn as a full line, and now the boundary
of the loop is indicated by a dotted line. In Figure 4a
we note three outliers. When we plot the logarithm of the
y-variable in Figure 4b, the outliers disappear.
3.5'
5. ALGORITHM AND IMPLEMENTATIONOF THE
The entire display is constructed by the algorithm BAGPLOT,
which draws on computational tools developed
in earlier papers. Two components are the subroutines
LDEPTH (Rousseeuw and Ruts 1996) which computes the
BAGPLOT
I I
1000
*
t
700 150 200 250 300 350 400
Concentrationof plasma cholesterol(in mg/dl)
(a)
location depth of an arbitrary point in O(nlog n) time, and
ISODEPTH (Ruts and Rousseeuw 1996) which constructs
the vertices of a depth contour in O(n210gn)time. For
computing the depth median, the O(n2(logn)2)subroutine
HALFMED (Rousseeuw and Ruts 1998) is called. Since
these times increase quickly with n, we currently use an
approximation when n is larger than 150. In that case we
compute the depth median and the bag for a random subsample
of size 150, and perform the other computations
(whose time is linear in n) on the full dataset. In this way
we can easily deal with datasets of a few thousand points.
Our S-Plus code for the bagplot is available from
the website http://win-www.uia.ac.be/u/statis/index.html.
Several options are available. For instance, the user may
choose whether or not to plot the observations inside the
bag. The outliers-that is, the observations outside the
m m
m
0 500 1000 1500 2000
Altitudinal range
7 , I
I
(a)
. *
.'
= .
I * *
384 Statistical Computing and Graphics
fence-are always plotted. Any observation can be identified
(e.g., labeled) by clicking on it.
Our MATLAB code for the bagplot is also available from
the website mentioned earlier. Again several options are
available-for example, the shading inside the bag may be
omitted. Figures 3a and 3b illustrate the option of not plotting
the points inside the bag. The user may also choose to
plot the fence. We prefer the bagplot as in Figures 1 and
2-that is, with dark gray shading for the bag, light gray
shading for the loop, no fence, and plotting all the points.
0.5
0.45
0.4
--
0.35
+
In
E 0.3Y
._
e
0.25
0.2
0.15
0.1
m
cu
7
2,
0
r
c?
~
-
-
-
~
-
-
-
-
0
-2 -1 0 1 2 3
X
Figure 5. Bagplot for linear data
50 100 150 200 250 300 350 400 450
Concentration of PCB
500
Figure 6. Blotched bagplot of PCB concentrationand shell thickness
of 65 pelican eggs. The white blotch inside the bag is a 95% confidence
region for the depth median of the population.
By showing all the points we preserve the advantages of the
scatterplot, because we can still see the local structure and
note phenomena like a grid-like appearance (involvement
of counts), curvature, clustering, holes in the data, and so
on.
For large datasets the computation time can be kept down
in several ways. Recently, Johnson, Kwok, and Ng (1998)
constructed a fast exact algorithm for depth contours which
outperforms ISODEPTH as soon as n 2 500. Work is underway
to construct algorithms for depth contours and the
depth median with lower time complexity (Ileana Streinu,
personal communication), but these are not yet available.
For small datasets, Rousseeuw and Ruts (1997, sec. 5)
found that the variability of the fence is too large to reliably
detect outliers (this is unavoidable for the outlier detection
problem in two dimensions). Therefore, when n < 15 we
draw only the depth median T*and line segments between
T* and the data points.
If the dataset is linear, the bagplot reduces to a univariate
boxplot. In that case our software draws it with a rectangular
box, as in Figure 5. In a univariate boxplot the upper
fence is given by Q2 +4(Q3- Q z ) and the lower fence by
QZ+4(Q1 - Q 2 ) , where Qj is the jth quartile. (This version
is well-suited for both symmetric and asymmetric distributions.)
Note that the factor 4 in the univariate boxplot differs
from the factor 3 used in the bivariate bagplot. Both values
were obtained by simulations and experience, so the difference
is not accidental but due to the dimensionality of
the plots. (In Figure 5, using the factor 3 or 4 makes no
difference because this example contains no points outside
the fence by either definition.)
6. BLOTCHED BAGPLOTS
It is possible to incorporate a confidence region for the
depth median into the bagplot. In this way we can extend
one idea of notches (McGill, Tukey, and Larsen 1978) to
the bivariate case. Consider the example of Figure 6. For
65 Anacapa pelican eggs, the concentration in parts per million
of PCB (polychlorinated biphenyl, an industrial pollutant)
was measured, along with the thickness of the egg
shell (Hand et al. 1994, p. 131). All the observations have
been plotted, and no outliers are flagged. Around the depth
median we have drawn the blotch, which is a 95% confidence
region for the depth median of the population. It
was obtained by means of formula (2) in Rousseeuw, Van
Aelst and Hubert (1999).This formula allows us to find the
largest value k for which P(ldepth(8,X,) 2 k ) 2 .95when
X , comes from a distribution with population median 6.We
then use the algorithm ISODEPTH to construct the corresponding
depth region DI, = (0 E Rz;ldepth(0,Z) 2 k }
which we call the blotch. For the pelican data we obtained
k = 21 yielding the white blotch in Figure 6.
7. OTHER BIVARIATEDISPLAYS
Another type of bivariate boxplot was proposed by Becketti
and Gould (1987).They consider the univariate boxplot
of I(: (as in Figure 1) from which they keep the median,
the quartiles, and the endpoints of both whiskers. They do
The American Statistician, November 1999, Vol. 53, No. 4 385
** 8
**
Figure 7. Bagplot matrix of the three-dimensional aquifer data with 85 data points.
the same for the y-variable, and then use these numbers to
draw vertical and horizontal lines on the scatterplot, thereby
forming a cross and a rectangle. Lenth (1988) modified this
plot to put more emphasis on the univariate quartiles. However,
neither version reflects the bivariate shape and correlation
of the data.
Goldberg and Iglewicz (1992) proposed two generalizations
of the boxplot which are truly bivariate. When the
data can be assumed to be elliptically symmetric, they construct
a robust elliptic plot (relplot). Here the “box” is an
ellipse, obtained by a robust method such as the minimum
volume ellipsoid estimator proposed by Rousseeuw (1984).
For asymmetric data Goldberg and Iglewicz (1992) constructed
a quarter elliptic plot (quelplot) where the “box”
consists of four quarter ellipses, computed by a kind of
M-estimator.
The bagplot differs from the relplot and the quelplot in
that its shape is more general. Whereas the relplot/quelplot
approach estimates parameters of (nearly) elliptical models,
the bagplot is model-free because the halfspace depth
is. Other variants of the bagplot have recently been constructed
by Romanazzi (1997) and Liu, Parelius, and Singh
(in press). Zani, Riani, and Corbellini (1998) applied convex
hull peeling (see, e.g., Green 1985),which is somewhat
less robust than halfspace depth, as shown by Donoho and
Gasko (I992).
Recently, Hyndman (1996) constructed a plot of highest
density regions (HDR’s). First the bivariate density of
386 Statistical Computing and Graphics
the data is estimated-for example, using a kernel method.
Then the 50% HDR is given by the density contour that
encompasses 50% of the mass. Typically, the 50% HDR
and the 99% HDR are superimposed on the scatterplot of
the data. Note that a HDR need not be convex, or even
connected.This makes the HDR plot particularly useful to
display multimodal distributions,because it focuses on local
properties. The HDR plot is not a generalizationof the
univariate boxplot;for instance,its one-dimensionalversion
may contain several “boxes.” The basic notion of the HDR
plot is data density,whereas the bagplot is based on ranking.
Both approaches are appealing in different ways. The HDR
approach is well-suited for multiple modes, but depends on
the choice of a density estimator and a bandwidth. On the
other hand, the bagplot is equivariant for linear transformations,
needs fewer data points, and extends the familiar
univariate boxplot. Note that both approacheshave different
(but equally important)views of what constitutes an outlier:
for the HDR plot it is a point lying in an empty area (this
could be called a “thinlier”),whereas for the bagplot it is a
point lying far away from the bulk of the data. It should be
noted that the depth median T* and the bag B are robust,
which makes the bagplot particularly suited for detecting
(the latter type of) outliers.
8. MORE THAN TWO VARIABLES
The halfspacedepth and the depth median exist in any dimension,
so the bag can still be defined. For instance,there
is now a fast approximate algorithm for the multivariate
depth median (Struyf and Rousseeuw in press). This algorithm
takes three minutes for 1,000 points in five dimensions.
Currently, algorithms for depth contours in three or
more dimensions are not yet available. In three dimensions
the bag is a convex polyhedron, which in more dimensions
becomes hard to visualize.
However, in any dimension one can draw the bagplot
matrix which containsthe bagplot of each pair of variables,
as in Figure 7. This figure shows the aquifer data (Hand
et al. 1994,p. 215-216) in which we changed the first two
5 2 values to create outliers. (The variables z1and z2 are
coordinates measured in miles, and the water level x3 is
measured in feet above sea level.) Each diagonal cell is the
bagplot of a variable against itself, where all the points lie
on the 45” line. By construction such a bagplot reduces to a
univariate boxplot, as explained in Section 5. Here we have
drawn these boxplots with their usual factor 4 (instead of
the factor 3 used in the bivariate bagplots).
An advantage of having all the data points in a bagplot
matrix like Figure 7 is that we can still interactively color
and brush as in the usual scatterplot matrices. The light
gray zones of the bagplots should not detract from this, but
rather aid the interpretation.
[Received May 1998. Revised May 1999.1
REFERENCES
Becketti, S.,and Gould, W. (1987),“RangefinderBox Plots,” TheAmerican
Statistician, 41, 149.
Chambers, J. M., and Hastie, T. J. (1993), StatisticalModels in S, London:
Chapman and Hall.
Donoho, D. L., and Gasko, M. (19921, “Breakdown Properties of Location
Estimates Based on Halfspace Depth and Projected Outlyingness,” The
Annals of Statistics, 20, 1803-1827.
Eddy, W. F. (1985), “Ordering of Multivariate Data,” in ComputerScience
and Statistics: Proceedings of the 16th Symposium on the Interface, ed.
L. Billard, Amsterdam: North-Holland, pp. 25-30.
Goldberg, K. M., and Iglewicz, B. (1992), “Bivariate Extensions of the
Boxplot,” Technometrics,34, 307-320.
Green, P. J. (1985), “Peeling Data,” in Encyclopedia of Statistical Sciences
(Vol. 61, eds. S . Kotz and N. Johnson, New York Wiley, pp. 660-664.
Gutihez, D., and Mentndez, R. (1995), “Distribution and Abundance of
Butterflies in a Mountain Area in the Northern Iberian Peninsula,”Ecography,
18,209-216.
Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J., and Ostrowski, E.
(19941,A Handbook of Small Data Sets, London: Chapman and Hall.
Hyndman, R. J. (1996), “Computing and Graphing Highest Density Regions,”
The American Statistician, 50, 120-126.
Johnson, T., Kwok, I., and Ng, R. (1998), “Fast Computation of 2Dimensional
Depth Contours,” Technical Report, AT&T Research Center,
Florham Park, NJ.
Lenth, R. (1988), Comment on “Rangefinder Box Plots” by S. Becketti
and W. Gould (with reply by Becketti), The American Statistician, 42,
Liu, R. Y., Parelius, J. M., and Singh, K. (in press), “Multivariate Analysis
by Data Depth Descriptive Statistics, Graphics and Inference,” Annals
of Statistics.
McGill, R., Tukey, J. W., and Larsen, W. A. (1978), “Variations of Box
Plots,” The American Sfatistzcian,32, 12-16.
Romanazzi, M. (1997), “A Schematic Plot for Bivariate Data,” Student, 2,
Rousseeuw, P. J. (1984), “Least Median of Squares Regression,” Journal
of the American Statistical Association, 79, 871-880.
Rousseeuw, P. J., and Ruts, I. (1996), “AS 307: Bivariate Location Depth,”
Applied Statistics (JRSS-C),45, 516-526.
(1997), “The Bagplot: A Bivariate Box-and-Whiskers Plot,” Technical
Report, Universitaire Instelling Antwerpen, Belgium. Available at
http://win-www.uia.ac.be/u/statis/.
(1998), “Constructing the Bivariate Tukey Median,” Statistica
Sinica, 8, 827-839.
Rousseeuw, P. J., Van Aelst, S., and Hubert, M. (1999), “Rejoinder to the
Discussion of Regression Depth,” Journal of the American Statistical
Association, 94, 419-433.
Ruts, I., and Rousseeuw, P. J. (1996), “Computing Depth Contours of Bivariate
Point Clouds,” Computational Statistics and Data Analysis, 23,
Struyf, A., and Rousseeuw, P. J. (in press), “High-Dimensional Computation
of the Deepest Location,” ComputationalStatistics and Data Anal-
ysis.
Tukey,J. W. (1975),“Mathematics and the Picturing of Data,” Proceedings
of the International Congressof Mathematicians, 2, 523-53 1.
(1977),ExploratoryData Analysis, Reading, MA: Addison-Wesley.
Zani, S., Riani, M., and Corbellini, A. (1998), “Robust Bivariate Boxplots
and Multiple Outlier Detection,” Computational Statistics and Data
Analysis, 28, 257-270.
87-88.
149-1 58.
153-1 68.
The American Statistician, November 1999, Vol. 53, No. 4 387