Chapter 9
Two-Sample Inferences
9.1 Introduction
9.2 Testing H0: μX = μY
9.3 Testing H0: $\sigma_X^2 = \sigma_Y^2$ (The F Test)
9.4 Binomial Data: Testing H0: pX = pY
9.5 Confidence Intervals for the Two-Sample
Problem
9.6 Taking a Second Look at Statistics (Choosing
Samples)
Appendix 9.A.1 A Derivation of the Two-Sample
t Test (A Proof of Theorem 9.2.2)
Appendix 9.A.2 Minitab Applications
After earning an Oxford degree in mathematics and chemistry, Gosset began
working in 1899 for Messrs. Guinness, a Dublin brewery. Fluctuations in materials
and temperature and the necessarily small-scale experiments inherent in brewing
convinced him of the necessity for a new, small-sample theory of statistics. Writing
under the pseudonym “Student,” he published work with the t ratio that was destined
to become a cornerstone of modern statistical methodology.
—William Sealy Gosset (“Student”) (1876–1937)
9.1 Introduction
The simplicity of the one-sample model makes it the logical starting point for
any discussion of statistical inference, but it also limits its applicability to the real
world. Very few experiments involve just a single treatment or a single set of conditions.
On the contrary, researchers almost invariably design experiments to compare
responses to several treatment levels—or, at the very least, to compare a single
treatment with a control.
In this chapter we examine the simplest of these multilevel designs, two-sample
inferences. Structurally, two-sample inferences always fall into one of two different
formats: Either two (presumably) different treatment levels are applied to two independent
sets of similar subjects or the same treatment is applied to two (presumably)
different kinds of subjects. Comparing the effectiveness of germicide A relative to
that of germicide B by measuring the zones of inhibition each one produces in two
sets of similarly cultured Petri dishes would be an example of the ﬁrst type. On the
other hand, examining the bones of sixty-year-old men and sixty-year-old women, all
lifelong residents of the same city, to see whether both sexes absorb environmental
strontium-90 at the same rate would be an example of the second type.
Inference in two-sample problems usually reduces to a comparison of location
parameters. We might assume, for example, that the population of responses associated
with, say, treatment X is normally distributed with mean μX and standard
deviation σX while the Y distribution is normal with mean μY and standard deviation
σY . Comparing location parameters, then, reduces to testing H0: μX = μY . As
always, the alternative may be either one-sided, H1: μX < μY or H1: μX > μY , or two-sided,
H1: μX ≠ μY . (If the data are binomial, the location parameters are pX and
pY , the true “success” probabilities for treatments X and Y, and the null hypothesis
takes the form H0: pX = pY .)
Sometimes, although much less frequently, it becomes more relevant to compare
the variabilities of two treatments, rather than their locations. A food company,
for example, trying to decide which of two types of machines to buy for ﬁlling cereal
boxes would naturally be concerned about the average weights of the boxes ﬁlled
by each type, but they would also want to know something about the variabilities
of the weights. Obviously, a machine that produces high proportions of “underﬁlls”
and “overﬁlls” would be a distinct liability. In a situation of this sort, the appropriate
null hypothesis is H0: $\sigma_X^2 = \sigma_Y^2$.
For comparing the means of two normal populations when σX =σY , the standard
procedure is the two-sample t test. As described in Section 9.2, this is a relatively
straightforward extension of Chapter 7’s one-sample t test. If σX ≠ σY , an approximate
t test is used. For comparing variances, though, it will be necessary to introduce
a completely new test—this one based on the F distribution of Section 7.3. The
binomial version of the two-sample problem, testing H0: pX = pY , is taken up in
Section 9.4.
It was mentioned in connection with one-sample problems that certain inferences,
for various reasons, are more aptly phrased in terms of conﬁdence intervals
rather than hypothesis tests. The same is true of two-sample problems. In Section 9.5,
conﬁdence intervals are constructed for the location difference of two populations,
μX − μY (or pX − pY ), and the variability quotient, $\sigma_X^2/\sigma_Y^2$.
9.2 Testing H0: μX = μY
We will suppose that the data for a given experiment consist of two independent
random samples, X1, X2,..., Xn and Y1,Y2,...,Ym, representing either of the models
referred to in Section 9.1. Furthermore, the two populations from which the X’s and
Y’s are drawn will be presumed normal. Let μX and μY denote their means. Our
objective is to derive a procedure for testing H0: μX = μY .
As it turns out, the precise form of the test we are looking for depends on the
variances of the X and Y populations. If it can be assumed that $\sigma_X^2$ and $\sigma_Y^2$ are equal,
it is a relatively straightforward task to produce the GLRT for H0: μX = μY . (This is,
in fact, what we will do in Theorem 9.2.2.) But if the variances of the two populations
are not equal, the problem becomes much more complex. This second case, known
as the Behrens-Fisher problem, is more than seventy-ﬁve years old and remains one
of the more famous “unsolved” problems in statistics. What headway investigators
have made has been conﬁned to approximate solutions. These will be discussed later
in this section. For what follows next, it can be assumed that $\sigma_X^2 = \sigma_Y^2$.
For the one-sample test μ = μ0, the GLRT was shown to be a function of a special
case of the t ratio introduced in Deﬁnition 7.3.3 (recall Theorem 7.3.5). We begin
this section with a theorem that gives still another special case of Deﬁnition 7.3.3.
Theorem 9.2.1
Let X1, X2,..., Xn be a random sample of size n from a normal distribution with
mean μX and standard deviation σ and let Y1,Y2,...,Ym be an independent random
sample of size m from a normal distribution with mean μY and standard deviation σ.
Let $S_X^2$ and $S_Y^2$ be the two corresponding sample variances, and $S_p^2$ the pooled variance,
where
\[
S_p^2 = \frac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}
      = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2 + \sum_{i=1}^{m}(Y_i - \bar{Y})^2}{n+m-2}
\]
Then
\[
T_{n+m-2} = \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{S_p\sqrt{\frac{1}{n} + \frac{1}{m}}}
\]
has a Student t distribution with n + m − 2 degrees of freedom.
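Proof The method of proof here is very similar to what was used for Theorem 7.3.5.
Note that an equivalent formulation of $T_{n+m-2}$ is
\[
T_{n+m-2}
= \frac{\dfrac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sigma\sqrt{\frac{1}{n} + \frac{1}{m}}}}{\sqrt{S_p^2/\sigma^2}}
= \frac{\dfrac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sigma\sqrt{\frac{1}{n} + \frac{1}{m}}}}
       {\sqrt{\dfrac{1}{n+m-2}\left[\sum_{i=1}^{n}\left(\dfrac{X_i - \bar{X}}{\sigma}\right)^2 + \sum_{i=1}^{m}\left(\dfrac{Y_i - \bar{Y}}{\sigma}\right)^2\right]}}
\]
But $E(\bar{X} - \bar{Y}) = \mu_X - \mu_Y$ and $\mathrm{Var}(\bar{X} - \bar{Y}) = \sigma^2/n + \sigma^2/m$, so the numerator of the
ratio has a standard normal distribution, $f_Z(z)$.
In the denominator,
\[
\sum_{i=1}^{n}\left(\frac{X_i - \bar{X}}{\sigma}\right)^2 = \frac{(n-1)S_X^2}{\sigma^2}
\quad\text{and}\quad
\sum_{i=1}^{m}\left(\frac{Y_i - \bar{Y}}{\sigma}\right)^2 = \frac{(m-1)S_Y^2}{\sigma^2}
\]
are independent $\chi^2$ random variables with n − 1 and m − 1 df, respectively, so
\[
\sum_{i=1}^{n}\left(\frac{X_i - \bar{X}}{\sigma}\right)^2 + \sum_{i=1}^{m}\left(\frac{Y_i - \bar{Y}}{\sigma}\right)^2
\]
has a $\chi^2$ distribution with n + m − 2 df (recall Theorem 7.3.1 and Theorem 4.6.4).
Also, by Appendix 7.A.2, the numerator and denominator are independent.
It follows from Definition 7.3.3, then, that
\[
\frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{S_p\sqrt{\frac{1}{n} + \frac{1}{m}}}
\]
has a Student t distribution with n + m − 2 df.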
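The conclusion of Theorem 9.2.1 can also be checked empirically. The following is a minimal simulation sketch, not part of the proof: it draws many pairs of normal samples with a common σ, forms the pooled t ratio for each pair, and compares two summaries with what a Student t distribution with n + m − 2 df predicts. The sample sizes, means, σ, number of replications, and seed are arbitrary illustrative choices, and the helper names are hypothetical. The cutoff 2.9208 is the value of t.005,16 quoted in Case Study 9.2.1, and df/(df − 2) is the variance of a Student t variable with df > 2 degrees of freedom.

```python
import math
import random


def sample_variance(values):
    """Unbiased sample variance (divisor len(values) - 1)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def pooled_t_ratio(xs, ys, mu_x, mu_y):
    """The ratio T_{n+m-2} of Theorem 9.2.1 for one pair of samples."""
    n, m = len(xs), len(ys)
    s2_p = ((n - 1) * sample_variance(xs)
            + (m - 1) * sample_variance(ys)) / (n + m - 2)
    diff = sum(xs) / n - sum(ys) / m - (mu_x - mu_y)
    return diff / math.sqrt(s2_p * (1 / n + 1 / m))


def simulate(n=8, m=10, mu_x=5.0, mu_y=3.0, sigma=2.0, reps=20000, seed=0):
    """Simulate `reps` pooled t ratios from normal samples sharing sigma."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(reps):
        xs = [rng.gauss(mu_x, sigma) for _ in range(n)]
        ys = [rng.gauss(mu_y, sigma) for _ in range(m)]
        ratios.append(pooled_t_ratio(xs, ys, mu_x, mu_y))
    return ratios


ratios = simulate()
df = 8 + 10 - 2
tail_fraction = sum(abs(t) >= 2.9208 for t in ratios) / len(ratios)

print("simulated variance of T:", sample_variance(ratios))
print("variance of a t with 16 df:", df / (df - 2))
print("simulated P(|T| >= 2.9208):", tail_fraction, "(theory: 0.01)")
```

Both simulated summaries should land close to their theoretical counterparts, with small differences that reflect Monte Carlo error only.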
Theorem 9.2.2
Let x1, x2,..., xn and y1, y2,..., ym be independent random samples from normal
distributions with means μX and μY , respectively, and with the same standard
deviation σ. Let
\[
t = \frac{\bar{x} - \bar{y}}{s_p\sqrt{\frac{1}{n} + \frac{1}{m}}}
\]
a. To test H0: μX = μY versus H1: μX > μY at the α level of significance, reject H0 if
$t \ge t_{\alpha,n+m-2}$.
b. To test H0: μX = μY versus H1: μX < μY at the α level of significance, reject H0 if
$t \le -t_{\alpha,n+m-2}$.
c. To test H0: μX = μY versus H1: μX ≠ μY at the α level of significance, reject H0 if
t is either (1) $\le -t_{\alpha/2,n+m-2}$ or (2) $\ge t_{\alpha/2,n+m-2}$.
Proof See Appendix 9.A.1.
Case Study 9.2.1
The mystery surrounding the nature of Mark Twain’s participation in the Civil
War was discussed (but not resolved) in Case Study 1.2.2. Recall that historians
are still unclear as to whether the creator of Huckleberry Finn and Tom Sawyer
was a civilian or a combatant in the early 1860s and whether his sympathies lay
with the North or with the South.
A tantalizing clue that might shed some light on the matter is a set of ten
war-related essays written by one Quintus Curtius Snodgrass, who claimed to
be in the Louisiana militia, although no records documenting his service have
ever been found. If Snodgrass was just a pen name Twain used, as some suspect,
then these essays are basically a diary of Twain’s activities during the war, and
the mystery is solved. If Quintus Curtius Snodgrass was not a pen name, these
essays are just a red herring, and all questions about Twain’s military activities
remain unanswered.
Assessing the likelihood that Twain and Snodgrass were one and the
same would be the job of a “forensic statistician.” Authors have characteristic
word-length proﬁles that effectively serve as verbal ﬁngerprints (much
like incriminating evidence left at a crime scene). If Authors A and B tend to
use, say, three-letter words with signiﬁcantly different frequencies, a reasonable
inference would be that A and B are different people.
Table 9.2.1 shows the proportions of three-letter words in each of the ten
Snodgrass essays and in eight essays known to have been written by Mark
Twain. If xi denotes the ith Twain proportion, i = 1,2,...,8, and yi denotes
the ith Snodgrass proportion, i = 1,2,...,10, then
\[
\sum_{i=1}^{8} x_i = 1.855 \quad\text{so}\quad \bar{x} = 1.855/8 = 0.2319
\]
and
\[
\sum_{i=1}^{10} y_i = 2.097 \quad\text{so}\quad \bar{y} = 2.097/10 = 0.2097
\]

Table 9.2.1 Proportion of Three-Letter Words

Twain                              Proportion    QCS           Proportion
Sergeant Fathom letter             0.225         Letter I      0.209
Madame Caprell letter              0.262         Letter II     0.205
Mark Twain letters in                            Letter III    0.196
  Territorial Enterprise                         Letter IV     0.210
    First letter                   0.217         Letter V      0.202
    Second letter                  0.240         Letter VI     0.207
    Third letter                   0.230         Letter VII    0.224
    Fourth letter                  0.229         Letter VIII   0.223
First Innocents Abroad letter                    Letter IX     0.220
    First half                     0.235         Letter X      0.201
    Second half                    0.217
The question to be answered is whether the difference between 0.2319 and
0.2097 is statistically signiﬁcant.
Let μX and μY denote the true average proportions of three-letter words
that Twain and Snodgrass, respectively, tended to use. Our objective is to test
H0 : μX = μY
versus
H1 : μX ≠ μY
Since
\[
\sum_{i=1}^{8} x_i^2 = 0.4316 \quad\text{and}\quad \sum_{i=1}^{10} y_i^2 = 0.4406
\]
the two sample variances are
\[
s_X^2 = \frac{8(0.4316) - (1.855)^2}{8(7)} = 0.0002103
\]
and
\[
s_Y^2 = \frac{10(0.4406) - (2.097)^2}{10(9)} = 0.0000955
\]
Combined, they give a pooled standard deviation of 0.0121:
\[
s_p = \sqrt{\frac{\sum_{i=1}^{8}(x_i - 0.2319)^2 + \sum_{i=1}^{10}(y_i - 0.2097)^2}{n+m-2}}
    = \sqrt{\frac{(n-1)s_X^2 + (m-1)s_Y^2}{n+m-2}}
    = \sqrt{\frac{7(0.0002103) + 9(0.0000955)}{8+10-2}}
    = \sqrt{0.0001457}
    = 0.0121
\]
According to Theorem 9.2.1, if H0: μX = μY is true, the sampling distribution
of
\[
T = \frac{\bar{X} - \bar{Y}}{S_p\sqrt{\frac{1}{8} + \frac{1}{10}}}
\]
is described by a Student t curve with 16 (= 8 + 10 − 2) degrees of freedom.
Suppose we let α =0.01. By part (c) of Theorem 9.2.2, H0 should be rejected
in favor of a two-sided H1 if either (1) t ≤ −tα/2,n+m−2 = −t.005,16 = −2.9208 or
(2) t ≥ tα/2,n+m−2 = t.005,16 = 2.9208 (see Figure 9.2.1).

[Figure 9.2.1: Student t distribution with 16 df. The rejection regions for H0 lie to the left of −2.9208 and to the right of 2.9208; each tail has area 0.005.]

But
\[
t = \frac{0.2319 - 0.2097}{0.0121\sqrt{\frac{1}{8} + \frac{1}{10}}} = 3.88,
\]
a value falling considerably to the right of t.005,16. Therefore, we should reject
H0—it appears that Twain and Snodgrass were not the same person. So, unfortunately,
nothing that Twain did can be inferred from anything that Snodgrass
wrote.
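The arithmetic of Case Study 9.2.1 can be reproduced directly from the proportions in Table 9.2.1. The sketch below is one way to do so using only the Python standard library; the variable names are illustrative, and the critical value 2.9208 is copied from the case study rather than computed. Because the text rounds the sums of squares to four decimals, the unrounded results here may differ from the printed ones in the last digit or two.

```python
import math
import statistics

# Proportions of three-letter words, from Table 9.2.1.
twain = [0.225, 0.262, 0.217, 0.240, 0.230, 0.229, 0.235, 0.217]
snodgrass = [0.209, 0.205, 0.196, 0.210, 0.202,
             0.207, 0.224, 0.223, 0.220, 0.201]

n, m = len(twain), len(snodgrass)
x_bar = statistics.mean(twain)
y_bar = statistics.mean(snodgrass)
s2_x = statistics.variance(twain)      # unbiased sample variance
s2_y = statistics.variance(snodgrass)

# Pooled standard deviation and the t ratio of Theorem 9.2.2.
s_p = math.sqrt(((n - 1) * s2_x + (m - 1) * s2_y) / (n + m - 2))
t = (x_bar - y_bar) / (s_p * math.sqrt(1 / n + 1 / m))

T_CRIT = 2.9208                        # t_{.005,16}, as quoted in the text
reject_h0 = abs(t) >= T_CRIT           # two-sided rule, part (c)

print(f"x_bar = {x_bar:.4f}, y_bar = {y_bar:.4f}")
print(f"s_p = {s_p:.4f}, t = {t:.2f}, reject H0: {reject_h0}")
```

The decision agrees with the case study: the observed t ratio lies beyond the upper critical value, so H0 is rejected at the 0.01 level.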
About the Data The Xi ’s and Yi ’s in Table 9.2.1, being proportions, are necessarily
not normally distributed random variables with the same variance, so the basic
conditions of Theorem 9.2.2 are not met. Fortunately, the consequences of violated
assumptions on the probabilistic behavior of Tn+m−2 are frequently minimal. The
robustness property of the one-sample t ratio that we investigated in Chapter 7 also
holds true for the two-sample t ratio.
Case Study 9.2.2
Dislike your statistics instructor? Retaliation time will come at the end of the
semester, when you pepper the student course evaluation form with 1’s. Were
you pleased? Then send a signal with a load of 5’s. Either way, students’ evaluations
of their instructors do matter. These instruments are commonly used for
promotion, tenure, and merit raise decisions.
Studies of student course evaluations show that they do have value. They
tend to show reliability and consistency. Yet questions remain as to the ability
of these questionnaires to identify good teachers and courses.
A veteran instructor of developmental psychology decided to do a study
(201) on how a single changed factor might affect his students’ course evaluations.
He had attended a workshop extolling the virtue of an enthusiastic style
in the classroom—more hand gestures, increased voice pitch variability, and the
like. The vehicle for the study was the large-lecture undergraduate developmental
psychology course he had taught in the fall semester. He set about to
teach the spring-semester offering in the same way, with the exception of a more
enthusiastic style.
The professor fully understood the difﬁculty of controlling for the many
variables. He selected the spring class to have the same demographics as the
one in the fall. He used the same textbook, syllabus, and tests. He listened
to audiotapes of the fall lectures and reproduced them as closely as possible,
covering the same topics in the same order.
The ﬁrst step in examining the effect of enthusiasm on course evaluations
is to establish that students have, in fact, perceived an increase in enthusiasm.
Table 9.2.2 summarizes the ratings the instructor received on the “enthusiasm”
question for the two semesters. Unless the increase in sample means (2.14 to
4.21) is statistically signiﬁcant, there is no point in trying to compare fall and
spring responses to other questions.
Table 9.2.2

Fall, $x_i$           Spring, $y_i$
$n = 229$             $m = 243$
$\bar{x} = 2.14$      $\bar{y} = 4.21$
$s_X = 0.94$          $s_Y = 0.83$
Let μX and μY denote the true means associated with the two different
teaching styles. There is no reason to think that increased enthusiasm on the
part of the instructor would decrease the students’ perception of enthusiasm, so
it can be argued here that H1 should be one-sided. That is, we want to test
H0: μX = μY
versus
H1: μX < μY
Let α = 0.05.
Since n = 229 and m = 243, the t statistic has 229 + 243 − 2 = 470 degrees of
freedom. Thus, the decision rule calls for the rejection of H0 if
\[
t = \frac{\bar{x} - \bar{y}}{s_P\sqrt{\frac{1}{229} + \frac{1}{243}}} \le -t_{\alpha,n+m-2} = -t_{.05,470}
\]
A glance at Table A.2 in the Appendix shows that for any value n > 100, zα is a
good approximation of tα,n. That is, $-t_{.05,470} \doteq -z_{.05} = -1.64$.
The pooled standard deviation for these data is 0.885:
\[
s_P = \sqrt{\frac{228(0.94)^2 + 242(0.83)^2}{229 + 243 - 2}} = 0.885
\]
Therefore,
\[
t = \frac{2.14 - 4.21}{0.885\sqrt{\frac{1}{229} + \frac{1}{243}}} = -25.42
\]
and our conclusion is a resounding rejection of H0—the increased enthusiasm
was, indeed, noticed.
The real question of interest is whether the change in enthusiasm produced
a perceived change in some other aspect of teaching that we know did not
change. For example, the instructor did not become more knowledgeable about
the material over the course of the two semesters. The student ratings, though,
disagree.
Table 9.2.3 shows the instructor’s fall and spring ratings on the “knowledgeable”
question. Is the increase from $\bar{x} = 3.61$ to $\bar{y} = 4.05$ statistically significant?
Yes. For these data, $s_P = 0.898$ and
\[
t = \frac{3.61 - 4.05}{0.898\sqrt{\frac{1}{229} + \frac{1}{243}}} = -5.33
\]
which falls far to the left of the 0.05 critical value (= −1.64).
What we can glean from these data is both reassuring and a bit disturbing.
Table 9.2.2 appears to conﬁrm the widely held belief that enthusiasm
is an important factor in effective teaching. Table 9.2.3, on the other hand,
strikes a more cautionary note. It speaks to another widely held belief—that
student evaluations can sometimes be difﬁcult to interpret. Questions that purport
to be measuring one trait may, in fact, be reﬂecting something entirely
different.
Table 9.2.3

Fall, $x_i$           Spring, $y_i$
$n = 229$             $m = 243$
$\bar{x} = 3.61$      $\bar{y} = 4.05$
$s_X = 0.84$          $s_Y = 0.95$
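When only summary statistics are reported, as in Tables 9.2.2 and 9.2.3, the pooled t ratio can still be computed from the sample sizes, means, and standard deviations. The following sketch does that for both questions in Case Study 9.2.2; the function name is hypothetical, and the cutoff −1.64 is the normal approximation to −t.05,470 used in the text. Small differences from the printed t values reflect rounding in the hand computation.

```python
import math


def pooled_t_from_summary(x_bar, y_bar, s_x, s_y, n, m):
    """Return (pooled standard deviation, t ratio) from summary statistics."""
    s_p = math.sqrt(((n - 1) * s_x ** 2 + (m - 1) * s_y ** 2) / (n + m - 2))
    t = (x_bar - y_bar) / (s_p * math.sqrt(1 / n + 1 / m))
    return s_p, t


CUTOFF = -1.64  # approximates -t_{.05,470} by -z_{.05}, as in the text

# "Enthusiasm" question (Table 9.2.2) and "knowledgeable" question (Table 9.2.3).
sp_enth, t_enth = pooled_t_from_summary(2.14, 4.21, 0.94, 0.83, 229, 243)
sp_know, t_know = pooled_t_from_summary(3.61, 4.05, 0.84, 0.95, 229, 243)

for label, s_p, t in (("enthusiasm", sp_enth, t_enth),
                      ("knowledgeable", sp_know, t_know)):
    # One-sided rule of Theorem 9.2.2(b): reject H0 when t <= cutoff.
    print(f"{label}: s_P = {s_p:.3f}, t = {t:.2f}, reject H0: {t <= CUTOFF}")
```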
About the Data The ﬁve-choice responses in student evaluation forms are very
common in survey questionnaires. Such questions are known as Likert items,
named after the psychologist Rensis Likert. The item typically asks the respondent
to choose his or her level of agreement with a statement, for example,
“The instructor shows concern for students.” The choices start with “strongly disagree,”
which is scored with a “1,” and go up to a “5” for “strongly agree.”
The statistic for a given question in a survey is the average value taken over all
responses.
Is a t test an appropriate way to analyze data of this sort? Maybe, but the nature
of the responses raises some serious concerns. First of all, the fact that students talk
with each other about their instructors suggests that not all the sample values will
be independent. More importantly, the ﬁve-point Likert scale hardly resembles the
normality assumption implicit in a Student t analysis. For many practitioners—but
not all—the robustness of the t test would be enough to justify the analysis described
in Case Study 9.2.2.
The Behrens-Fisher Problem
Finding a statistic with known density for testing the equality of two means from
normally distributed random samples when the standard deviations of the samples
are not equal is known as the Behrens-Fisher problem. No exact solution is known,
but a widely used approximation is based on the test statistic
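\[
W = \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}}
\]
where, as usual, $\bar{X}$ and $\bar{Y}$ are the sample means, and $S_X^2$ and $S_Y^2$ are the unbiased
estimators of the variance. B. L. Welch, a faculty member at University College,
London, in a 1938 Biometrika article showed that W is approximately distributed
as a Student t random variable with degrees of freedom given by the nonintuitive
expression
\[
\frac{\left(\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}\right)^2}
     {\dfrac{\sigma_1^4}{n_1^2(n_1-1)} + \dfrac{\sigma_2^4}{n_2^2(n_2-1)}}
\]
where the subscripts 1 and 2 refer to the X and Y samples, respectively (so $n_1 = n$ and $n_2 = m$).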
To understand Welch’s approximation, it helps to rewrite the random variable
W as
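\[
W = \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}}
  = \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}}}
    \div
    \sqrt{\frac{\dfrac{S_X^2}{n} + \dfrac{S_Y^2}{m}}{\dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m}}}
\]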
In this form, the numerator is a standard normal variable. Suppose there is a chi
square random variable V with ν degrees of freedom such that the square of the
denominator is equal to V/ν. Then the expression would indeed be a Student t
variable with ν degrees of freedom. However, in general, the denominator will
not have exactly that distribution. The strategy, then, is to ﬁnd an approximate
equality for
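\[
\frac{\dfrac{S_X^2}{n} + \dfrac{S_Y^2}{m}}{\dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m}} = \frac{V}{\nu}
\]
or, equivalently,
\[
\frac{S_X^2}{n} + \frac{S_Y^2}{m} = \left(\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}\right)\frac{V}{\nu}
\]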
At issue is the value of ν. The method of moments (recall Section 5.2) suggests a
solution. If the means and variances of both sides are equated, it can be shown that
\[
\nu = \frac{\left(\dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m}\right)^2}
           {\dfrac{\sigma_X^4}{n^2(n-1)} + \dfrac{\sigma_Y^4}{m^2(m-1)}}
\]
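The moment-matching step behind this expression can be sketched as follows. The outline relies on two standard facts about a chi square random variable V with ν df, namely E(V) = ν and Var(V) = 2ν, together with the result used in the proof of Theorem 9.2.1 that (n − 1)S_X²/σ_X² and (m − 1)S_Y²/σ_Y² are independent chi square variables with n − 1 and m − 1 df. It is an outline of the calculation, not a full proof.

```latex
% Means: both sides already agree, since E(S_X^2) = \sigma_X^2, E(S_Y^2) = \sigma_Y^2, and E(V/\nu) = 1.
E\left(\frac{S_X^2}{n} + \frac{S_Y^2}{m}\right)
  = \frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}
  = \left(\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}\right) E\left(\frac{V}{\nu}\right)

% Variances: Var(S_X^2) = 2\sigma_X^4/(n-1) and Var(S_Y^2) = 2\sigma_Y^4/(m-1), so
\mathrm{Var}\left(\frac{S_X^2}{n} + \frac{S_Y^2}{m}\right)
  = \frac{2\sigma_X^4}{n^2(n-1)} + \frac{2\sigma_Y^4}{m^2(m-1)},
\qquad
\mathrm{Var}\left[\left(\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}\right)\frac{V}{\nu}\right]
  = \left(\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}\right)^2 \frac{2}{\nu}

% Equating the two variances and solving for \nu:
\nu = \frac{\left(\dfrac{\sigma_X^2}{n} + \dfrac{\sigma_Y^2}{m}\right)^2}
           {\dfrac{\sigma_X^4}{n^2(n-1)} + \dfrac{\sigma_Y^4}{m^2(m-1)}}
```

Since the means agree for every ν, it is the variance equation alone that determines ν.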
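Moreover, the expression for ν depends only on the ratio of the variances, $\theta = \sigma_X^2/\sigma_Y^2$.
To see why, divide the numerator and denominator by $\sigma_Y^4$. Then
\[
\frac{\left(\dfrac{1}{n}\,\dfrac{\sigma_X^2}{\sigma_Y^2} + \dfrac{1}{m}\right)^2}
     {\dfrac{1}{n^2(n-1)}\left(\dfrac{\sigma_X^2}{\sigma_Y^2}\right)^2 + \dfrac{1}{m^2(m-1)}}
=
\frac{\left(\dfrac{1}{n}\,\theta + \dfrac{1}{m}\right)^2}
     {\dfrac{1}{n^2(n-1)}\,\theta^2 + \dfrac{1}{m^2(m-1)}}
\]
and multiplying numerator and denominator by $n^2$ gives the somewhat more
appealing form
\[
\nu = \frac{\left(\theta + \dfrac{n}{m}\right)^2}
           {\dfrac{1}{n-1}\,\theta^2 + \dfrac{1}{m-1}\left(\dfrac{n}{m}\right)^2}
\]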
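The claim that ν depends on the variances only through their ratio can be checked numerically. The sketch below evaluates both forms of ν, the one written in terms of the two variances and the one written in terms of θ, for several variance pairs that share the same ratio. The function names, sample sizes, and variance values are hypothetical choices made purely for illustration.

```python
import math


def welch_df_from_variances(var_x, var_y, n, m):
    """nu written in terms of the two variances."""
    numerator = (var_x / n + var_y / m) ** 2
    denominator = (var_x ** 2 / (n ** 2 * (n - 1))
                   + var_y ** 2 / (m ** 2 * (m - 1)))
    return numerator / denominator


def welch_df_from_ratio(theta, n, m):
    """nu written in terms of theta = var_x / var_y."""
    numerator = (theta + n / m) ** 2
    denominator = theta ** 2 / (n - 1) + (n / m) ** 2 / (m - 1)
    return numerator / denominator


n, m = 9, 14                                        # hypothetical sample sizes
pairs = [(4.0, 1.0), (40.0, 10.0), (0.4, 0.1)]      # all have theta = 4

results = []
for var_x, var_y in pairs:
    nu_a = welch_df_from_variances(var_x, var_y, n, m)
    nu_b = welch_df_from_ratio(var_x / var_y, n, m)
    results.append((nu_a, nu_b))
    print(f"var_x={var_x}, var_y={var_y}: nu = {nu_a:.4f} (ratio form: {nu_b:.4f})")
```

All three pairs give the same ν, and that value does not exceed n + m − 2, the degrees of freedom of the pooled test.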
Of course, the main application of this theory occurs when $\sigma_X^2$ and $\sigma_Y^2$ are
unknown and θ must thus be estimated, the obvious choice being $\hat{\theta} = s_X^2/s_Y^2$.
This leads us to the following theorem for testing the equality of means when
the variances cannot be assumed equal.
Theorem 9.2.3
Let X1, X2,..., Xn and Y1,Y2,...,Ym be independent random samples from normal
distributions with means μX and μY , and standard deviations σX and σY , respectively.
Let
\[
W = \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}}
\]
Using $\hat{\theta} = s_X^2/s_Y^2$, take ν to be the expression
\[
\frac{\left(\hat{\theta} + \dfrac{n}{m}\right)^2}
     {\dfrac{1}{n-1}\,\hat{\theta}^2 + \dfrac{1}{m-1}\left(\dfrac{n}{m}\right)^2}
\]
rounded to the nearest integer. Then W has approximately a Student t distribution
with ν degrees of freedom.
Case Study 9.2.3
Does size matter? While a successful company’s large number of sales should
mean bigger proﬁts, does it yield greater proﬁtability? Forbes magazine periodically
rates the top two hundred small companies (52), and for each gives the
proﬁtability as measured by the ﬁve-year percentage return on equity. Using
data from the Forbes article, Table 9.2.4 gives the return on equity for the twelve
companies with the largest number of sales (ranging from $679 million to $738
million) and for the twelve companies with the smallest number of sales (ranging
from $25 million to $66 million). Based on these data, can we say that the
return on equity differs between the two types of companies?
Table 9.2.4

                                  Return on                                           Return on
Large-Sales Companies             Equity (%)    Small-Sales Companies                 Equity (%)
Deckers Outdoor                   21            NVE                                   21
Jos. A. Bank Clothiers            23            Hi-Shear Technology                   21
National Instruments              13            Bovie Medical                         14
Dolby Laboratories                22            Rocky Mountain Chocolate Factory      31
Quest Software                     7            Rochester Medical                     19
Green Mountain Coffee Roasters    17            Anika Therapeutics                    19
Lufkin Industries                 19            Nathan’s Famous                       11
Red Hat                           11            Somanetics                            29
Matrix Service                     2            Bolt Technology                       20
DXP Enterprises                   30            Energy Recovery                       27
Franklin Electric                 15            Transcend Services                    27
LSB Industries                    43            IEC Electronics                       24
Let μX and μY be the respective average returns on equity. The indicated
test of hypotheses is
H0 : μX = μY
versus
H1 : μX ≠ μY
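For the data in the table, $\bar{x} = 18.6$, $\bar{y} = 21.9$, $s_X^2 = 115.9929$, and $s_Y^2 = 35.7604$. The
test statistic is
\[
w = \frac{\bar{x} - \bar{y} - (\mu_X - \mu_Y)}{\sqrt{\frac{s_X^2}{n} + \frac{s_Y^2}{m}}}
  = \frac{18.6 - 21.9}{\sqrt{\frac{115.9929}{12} + \frac{35.7604}{12}}}
  = -0.928
\]
Also,
\[
\hat{\theta} = \frac{s_X^2}{s_Y^2} = \frac{115.9929}{35.7604} = 3.244
\]
so
\[
\frac{\left(3.244 + \dfrac{12}{12}\right)^2}
     {\dfrac{1}{11}(3.244)^2 + \dfrac{1}{11}\left(\dfrac{12}{12}\right)^2} = 17.2
\]
which implies that ν = 17.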
We should reject H0 at the α = 0.05 level of signiﬁcance if w > t0.025,17 =
2.1098 or w < −t0.025,17 = −2.1098. Here, w = −0.928 falls in between the two
critical values, so the difference between $\bar{x}$ and $\bar{y}$ is not statistically significant.
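The computations in Case Study 9.2.3 follow the recipe of Theorem 9.2.3 and can be retraced in a few lines. The sketch below starts from the summary statistics quoted in the case study (not from the raw table entries), so that it mirrors the hand calculation; the critical value 2.1098 is copied from the text, and the variable names are illustrative.

```python
import math

# Summary statistics as quoted in Case Study 9.2.3.
n = m = 12
x_bar, y_bar = 18.6, 21.9
s2_x, s2_y = 115.9929, 35.7604

# Observed value of W under H0: mu_X = mu_Y.
w = (x_bar - y_bar) / math.sqrt(s2_x / n + s2_y / m)

# Estimated variance ratio and approximate degrees of freedom.
theta_hat = s2_x / s2_y
nu_raw = (theta_hat + n / m) ** 2 / (
    theta_hat ** 2 / (n - 1) + (n / m) ** 2 / (m - 1)
)
nu = round(nu_raw)

T_CRIT = 2.1098                  # t_{0.025,17}, as quoted in the text
reject_h0 = abs(w) > T_CRIT      # two-sided test at alpha = 0.05

print(f"w = {w:.3f}, theta_hat = {theta_hat:.3f}")
print(f"nu = {nu_raw:.1f} -> {nu}, reject H0: {reject_h0}")
```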
Comment It occasionally happens that an experimenter wants to test H0: μX = μY
and knows the values of $\sigma_X^2$ and $\sigma_Y^2$. For those situations, the t test of Theorem 9.2.2
is inappropriate. If the n Xi ’s and m Yi ’s are normally distributed, it follows from the
corollary to Theorem 4.3.3 that
\[
Z = \frac{\bar{X} - \bar{Y} - (\mu_X - \mu_Y)}{\sqrt{\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m}}} \tag{9.2.1}
\]
has a standard normal distribution. Any such test of H0: μX = μY , then, should be
based on an observed Z ratio rather than an observed t ratio.
If the degrees of freedom for a t test exceed 100, then the test statistic of Equation
9.2.1 is used, but it is treated as a Z ratio. In either the test of Theorem 9.2.2
or 9.2.3, if the degrees of freedom exceed 100, the statistic of Theorem 9.2.3 is used
with the z tables.
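A test based on Equation 9.2.1 might be coded as follows. This is a minimal sketch assuming the two variances are known and H0: μX = μY is tested against a two-sided alternative; the function names and all numerical inputs are hypothetical and are not taken from any data set in this chapter. The standard normal cdf comes from `statistics.NormalDist` in the Python standard library.

```python
import math
from statistics import NormalDist


def two_sample_z(x_bar, y_bar, var_x, var_y, n, m):
    """Observed Z ratio of Equation 9.2.1 under H0: mu_X = mu_Y."""
    return (x_bar - y_bar) / math.sqrt(var_x / n + var_y / m)


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical inputs: known variances 9.0 and 16.0, samples of size 25 and 40.
z = two_sample_z(x_bar=50.0, y_bar=48.0, var_x=9.0, var_y=16.0, n=25, m=40)
p = two_sided_p_value(z)
print(f"z = {z:.3f}, two-sided P-value = {p:.4f}")
```

For a one-sided alternative, the factor of 2 would be dropped and the tail chosen to match the direction of H1.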
Questions
9.2.1. Some states that operate a lottery believe that
restricting the use of lottery proﬁts to supporting education
makes the lottery more proﬁtable. Other states
permit general use of the lottery income. The proﬁtability
of the lottery for a group of states in each category is
given below.
State Lottery Profits

For Education                  For General Use
State             % Profit     State             % Profit
New Mexico        24           Massachusetts     21
Idaho             25           Maine             22
Kentucky          28           Iowa              24
South Carolina    28           Colorado          27
Georgia           28           Indiana           27
Missouri          29           Dist. Columbia    28
Ohio              29           Connecticut       29
Tennessee         31           Pennsylvania      32
Florida           31           Maryland          32
California        35
North Carolina    35
New Jersey        35

Source: New York Times, National Section, October 7, 2007, p. 14.
Test at the α = 0.01 level whether the mean proﬁt of states
using the lottery for education is higher than that of states
permitting general use. Assume that the variances of the
two random variables are equal.
9.2.2. As the United States has struggled with the growing
obesity of its citizens, diets have become big business.
Among the many competing regimens for those seeking
weight reduction are the Atkins and Zone diets. In a comparison
of these two diets for one-year weight loss, a study
(59) found that seventy-seven subjects on the Atkins diet
had an average weight loss of $\bar{x} = -4.7$ kg and a sample
standard deviation of sX = 7.05 kg. Similar figures for the
seventy-nine people on the Zone diet were $\bar{y} = -1.6$ kg
and sY = 5.36 kg. Is the greater reduction with the Atkins
diet statistically signiﬁcant? Test for α = 0.05.
9.2.3. A medical researcher believes that women typically
have lower serum cholesterol than men. To test this
hypothesis, he took a sample of 476 men between the ages
of nineteen and forty-four and found their mean serum
cholesterol to be 189.0 mg/dl with a sample standard deviation
of 34.2. A group of 592 women in the same age range
averaged 177.2 mg/dl and had a sample standard deviation
of 33.3. Is the lower average for the women statistically
signiﬁcant? Set α = 0.05.
9.2.4. In the academic year 2004–05, 1126 high school
freshmen took the SAT Reasoning Test. On the Critical
Reasoning portion, this group had a mean score of
491 with a standard deviation of 119. The following year,
5042 sophomores (none of them in the 2004–05 freshmen
group) scored an average of 498, with a standard deviation
of 129. Is the higher average score for the sophomores a
result of such factors as additional schooling and increased
maturity or simply a random effect? Test at the α = 0.05
level of signiﬁcance.
Source: College Board SAT, Total Group Proﬁle Report,
2008.
9.2.5. The University of Missouri–St. Louis gave a validation
test to entering students who had taken calculus in
high school. The group of ninety-three students receiving
no college credit had a mean score of 4.17 on the validation
test with a sample standard deviation of 3.70. For
the twenty-eight students who received credit from a high
school dual-enrollment class, the mean score was 4.61 with
a sample standard deviation of 4.28. Is there a signiﬁcant
difference in these means at the α = 0.01 level?
Source: MAA Focus, December 2008, p. 19.
9.2.6. Ring Lardner was one of this country’s most popular
writers during the 1920s and 1930s. He was also a
chronic alcoholic who died prematurely at the age of forty-eight.
The following table lists the life spans of some of
Lardner’s contemporaries (36). Those in the sample on the
left were all problem drinkers; they died, on the average,
at age sixty-ﬁve. The twelve (sober) writers on the right
tended to live a full ten years longer. Can it be argued that
an increase of that magnitude is statistically signiﬁcant?
Test an appropriate null hypothesis against a one-sided
H1. Use the 0.05 level of signiﬁcance. (Note: The pooled
sample standard deviation for these two samples is 13.9.)
Authors Noted for Alcohol Abuse        Authors Not Noted for Alcohol Abuse
Name                 Age at Death      Name                     Age at Death
Ring Lardner         48                Carl Van Doren           65
Sinclair Lewis       66                Ezra Pound               87
Raymond Chandler     71                Randolph Bourne          32
Eugene O’Neill       65                Van Wyck Brooks          77
Robert Benchley      56                Samuel Eliot Morrison    89
J.P. Marquand        67                John Crowe Ransom        86
Dashiell Hammett     67                T.S. Eliot               77
e.e. cummings        70                Conrad Aiken             84
Edmund Wilson        77                Ben Ames Williams        64
                                       Henry Miller             88
                                       Archibald MacLeish       90
                                       James Thurber            67
Average:             65.2              Average:                 75.5
9.2.7. Poverty Point is the name given to a number
of widely scattered archaeological sites throughout
Louisiana, Mississippi, and Arkansas. These are the
remains of a society thought to have ﬂourished during the
period from 1700 to 500 b.c. Among their characteristic
artifacts are ornaments that were fashioned out of clay and
then baked. The following table shows the dates (in years
b.c.) associated with four of these baked clay ornaments
found in two different Poverty Point sites, Terral Lewis
and Jaketown (86). The averages for the two samples are
1133.0 and 1013.5, respectively. Is it believable that these
two settlements developed the technology to manufacture
baked clay ornaments at the same time? Set up and test an
appropriate H0 against a two-sided H1 at the α =0.05 level
of signiﬁcance. For these data sx = 266.9 and sy = 224.3.
Terral Lewis Estimates, $x_i$    Jaketown Estimates, $y_i$
1492                             1346
1169                              942
 883                              908
 988                              858
9.2.8. A major source of “mercury poisoning” comes
from the ingestion of methylmercury ($\mathrm{CH}_3^{203}$), which is
found in contaminated ﬁsh (recall Question 5.3.3). Among
the questions pursued by medical investigators trying to
understand the nature of this particular health problem
is whether methylmercury is equally hazardous to men
and women. The following (114) are the half-lives of
methylmercury in the systems of six women and nine men
who volunteered for a study where each subject was given
an oral administration of $\mathrm{CH}_3^{203}$. Is there evidence here
that women metabolize methylmercury at a different rate
than men do? Do an appropriate two-sample t test at the
α = 0.01 level of signiﬁcance. The two sample standard
deviations for these data are sX = 15.1 and sY = 8.1.
Methylmercury ($\mathrm{CH}_3^{203}$) Half-Lives (in Days)

Females, $x_i$    Males, $y_i$
52                72
69                88
73                87
88                74
87                78
56                70
                  78
                  93
                  74
9.2.9. Lipton, a company primarily known for tea, considered
using coupons to stimulate sales of its packaged
dinner entrees. The company was particularly interested
in whether there was a difference in the effect of coupons on
singles versus married couples. A poll of consumers asked
them to respond to the question “Do you use coupons
regularly?” by a numerical scale, where 1 stands for agree
strongly, 2 for agree, 3 for neutral, 4 for disagree, and 5 for
disagree strongly. The results of the poll are given in the
following table (19).
Use Coupons Regularly

Single (X)           Married (Y)
$n = 31$             $m = 57$
$\bar{x} = 3.10$     $\bar{y} = 2.43$
$s_X = 1.469$        $s_Y = 1.350$
Is the observed difference signiﬁcant at the α =0.05 level?
9.2.10. A company markets two brands of latex paint—
regular and a more expensive brand that claims to dry
an hour faster. A consumer magazine decides to test this
claim by painting ten panels with each product. The average
drying time of the regular brand is 2.1 hours with a
sample standard deviation of 12 minutes. The fast-drying
version has an average of 1.6 hours with a sample standard
deviation of 16 minutes. Test the null hypothesis that
the more expensive brand dries an hour quicker. Use a
one-sided H1. Let α = 0.05.
9.2.11. (a) Suppose H0: μX = μY is to be tested against
H1: μX ≠ μY . The two sample sizes are 6 and 11. If sp =
15.3, what is the smallest value for $|\bar{x} - \bar{y}|$ that will result
in H0 being rejected at the α = 0.01 level of significance?
(b) What is the smallest value for $\bar{x} - \bar{y}$ that will lead to
the rejection of H0: μX = μY in favor of H1: μX > μY if
α = 0.05, sP = 214.9, n = 13, and m = 8?
9.2.12. Suppose that H0: μX = μY is being tested against
H1: μX ≠ μY , where $\sigma_X^2$ and $\sigma_Y^2$ are known to be 17.6 and
22.9, respectively. If n = 10, m = 20, $\bar{x} = 81.6$, and $\bar{y} = 79.9$,
what P-value would be associated with the observed Z
ratio?
9.2.13. An executive has two routes that she can take
to and from work each day. The ﬁrst is by interstate; the
second requires driving through town. On the average it
takes her 33 minutes to get to work by the interstate and
35 minutes by going through town. The standard deviations
for the two routes are 6 and 5 minutes, respectively.
Assume the distributions of the times for the two routes
are approximately normally distributed.
(a) What is the probability that on a given day, driving
through town would be the quicker of her choices?
(b) What is the probability that driving through town
for an entire week (ten trips) would yield a lower
average time than taking the interstate for the entire
week?
9.2.14. Prove that the Z ratio given in Equation 9.2.1 has
a standard normal distribution.
9.2.15. If $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ are independent
random samples from normal distributions with the
same $\sigma^2$, prove that their pooled sample variance, $s_p^2$, is an
unbiased estimator for $\sigma^2$.
9.2.16. Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be independent
random samples drawn from normal distributions
with means $\mu_X$ and $\mu_Y$, respectively, and with the same
known variance $\sigma^2$. Use the generalized likelihood ratio
criterion to derive a test procedure for choosing between
$H_0$: $\mu_X = \mu_Y$ and $H_1$: $\mu_X \neq \mu_Y$.
9.2.17. A person exposed to an infectious agent, either by
contact or by vaccination, normally develops antibodies
to that agent. Presumably, the severity of an infection
is related to the number of antibodies produced. The
degree of antibody response is indicated by saying that
the person’s blood serum has a certain titer, with higher
titers indicating greater concentrations of antibodies. The
following table gives the titers of twenty-two persons
involved in a tularemia epidemic in Vermont (18). Eleven
were quite ill; the other eleven were asymptomatic. Use an
approximate t ratio to test H0: μX =μY against a one-sided
H1 at the 0.05 level of signiﬁcance.
The sample standard deviations for the “Severely Ill”
and “Asymptomatic” groups are 428 and 183, respectively.
     Severely Ill           Asymptomatic
  Subject   Titer        Subject   Titer
     1       640           12        10
     2        80           13       320
     3      1280           14       320
     4       160           15       320
     5       640           16       320
     6       640           17        80
     7      1280           18       160
     8       640           19        10
     9       160           20       640
    10       320           21       160
    11       160           22       320
9.2.18. For the approximate two-sample t test described
in Question 9.2.17, it will be true that
$\nu < n + m - 2$
Why is that a disadvantage for the approximate test? That
is, why is it better to use the Theorem 9.2.1 version of the
t test if, in fact, $\sigma_X^2 = \sigma_Y^2$?
9.2.19. The two-sample data described in Question 8.2.2
would be analyzed by testing H0: μX = μY , where μX and
μY denote the true average motorcycle-related fatality
rates for states having “limited” and “comprehensive”
helmet laws, respectively.
(a) Should the t test for H0: μX = μY follow the format
of Theorem 9.2.2 or the approximation given in
Theorem 9.2.3? Explain.
(b) Is there anything unusual about these data? Explain.
9.2.20. Some ﬁnancial analysts believe that the election
of a Republican president is good for the stock market.
To test this claim, one study (155) recorded the ten-year
growth in Standard & Poor’s index following each election
of a new president. The results are given in the table
below.
          Democrats                       Republicans
  Winner          S&P Growth      Winner            S&P Growth
  Roosevelt ’36      22.4         Eisenhower ’52       45.7
  Roosevelt ’40      24.0         Eisenhower ’56       28.6
  Roosevelt ’44      38.0         Nixon ’68            14.2
  Truman ’48         45.7         Nixon ’72            18.8
  Kennedy ’60        21.2         Reagan ’80           50.3
  Johnson ’64        17.9         Reagan ’84           40.1
  Carter ’76         38.2         Bush ’88             52.4
  Clinton ’92        33.7
  Clinton ’96        23.8
Is the higher average for the Republicans statistically
signiﬁcant? Test at the 0.01 level. Do not assume the
variances are equal.
9.3 Testing $H_0$: $\sigma_X^2 = \sigma_Y^2$ (The F Test)
Although by far the majority of two-sample problems are set up to detect possible
shifts in location parameters, situations sometimes arise where it is equally
important—perhaps even more important—to compare variability parameters. Two
machines on an assembly line, for example, may be producing items whose average
dimensions (μX and μY ) of some sort—say, thickness—are not signiﬁcantly different
but whose variabilities (as measured by $\sigma_X^2$ and $\sigma_Y^2$) are. This becomes a critical piece
of information if the increased variability results in an unacceptable proportion of
items from one of the machines falling outside the engineering speciﬁcations (see
Figure 9.3.1).
Figure 9.3.1 Variability of machine outputs. [Figure: two density curves plotted
against the engineering specifications. Output from machine X (centered at μX,
standard deviation σX) leaves an acceptable proportion too thin or too thick;
output from machine Y (centered at μY, larger standard deviation σY) leaves an
unacceptable proportion too thin or too thick.]
In this section we will examine the generalized likelihood ratio test of
$H_0$: $\sigma_X^2 = \sigma_Y^2$ versus $H_1$: $\sigma_X^2 \neq \sigma_Y^2$. The data will consist of
two independent random samples of sizes n and m: The first, $x_1, x_2, \ldots, x_n$, is
assumed to have come from a normal distribution having mean $\mu_X$ and variance
$\sigma_X^2$; the second, $y_1, y_2, \ldots, y_m$, from a normal distribution having mean
$\mu_Y$ and variance $\sigma_Y^2$. (All four parameters
are assumed to be unknown.) Theorem 9.3.1 gives the test procedure that
will be used. The proof will not be given, but it follows the same basic pattern
we have seen in other GLRTs; the important step is showing that the
likelihood ratio is a monotonic function of the F random variable described in
Deﬁnition 7.3.2.
Comment Tests of $H_0$: $\sigma_X^2 = \sigma_Y^2$ arise in another, more routine context. Recall that
the procedure for testing the equality of $\mu_X$ and $\mu_Y$ depends on whether or not the
two population variances are equal. This implies that a test of $H_0$: $\sigma_X^2 = \sigma_Y^2$ should
precede every test of $H_0$: $\mu_X = \mu_Y$. If the former is accepted, the t test on $\mu_X$ and $\mu_Y$ is
done according to Theorem 9.2.2; but if $H_0$: $\sigma_X^2 = \sigma_Y^2$ is rejected, Theorem 9.2.2 is not
entirely appropriate. A frequently used alternative in that case is the approximate t
test described in Theorem 9.2.3.
Theorem 9.3.1
Let $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_m$ be independent random samples from normal
distributions with means $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$, respectively.
a. To test $H_0$: $\sigma_X^2 = \sigma_Y^2$ versus $H_1$: $\sigma_X^2 > \sigma_Y^2$ at the $\alpha$ level of significance,
reject $H_0$ if $s_Y^2/s_X^2 \le F_{\alpha,\, m-1,\, n-1}$.
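b. To test $H_0$: $\sigma_X^2 = \sigma_Y^2$ versus $H_1$: $\sigma_X^2 < \sigma_Y^2$ at the $\alpha$ level of significance,
reject $H_0$ if $s_Y^2/s_X^2 \ge F_{1-\alpha,\, m-1,\, n-1}$.
c. To test $H_0$: $\sigma_X^2 = \sigma_Y^2$ versus $H_1$: $\sigma_X^2 \neq \sigma_Y^2$ at the $\alpha$ level of significance,
reject $H_0$ if $s_Y^2/s_X^2$ is either (1) $\le F_{\alpha/2,\, m-1,\, n-1}$ or (2) $\ge F_{1-\alpha/2,\, m-1,\, n-1}$.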
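The decision rule in Theorem 9.3.1 can be sketched in Python as follows. This is an illustrative sketch rather than part of the text: the function names (`f_percentile`, `f_test_reject`) and the labels for the alternative hypothesis are hypothetical, and the F percentiles are computed numerically (regularized incomplete beta function plus bisection) instead of being read from Table A.4. The code assumes that $F_{p,\, d_1,\, d_2}$ denotes the 100p-th percentile of the F distribution with $d_1$ and $d_2$ degrees of freedom, which is the convention the theorem's inequalities imply.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for k in range(1, max_iter + 1):
        k2 = 2 * k
        # Even-numbered term of the continued fraction.
        aa = k * (b - k) * x / ((qam + k2) * (a + k2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd-numbered term of the continued fraction.
        aa = -(a + k) * (qab + k) * x / ((a + k2) * (qap + k2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def f_cdf(x, d1, d2):
    """P(F <= x) for an F random variable with d1 and d2 degrees of freedom."""
    if x <= 0.0:
        return 0.0
    return _reg_inc_beta(d1 / 2.0, d2 / 2.0, d1 * x / (d1 * x + d2))


def f_percentile(p, d1, d2):
    """Value cutting off area p to its left, found by bisection on the cdf."""
    lo, hi = 0.0, 1.0
    while f_cdf(hi, d1, d2) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f_cdf(mid, d1, d2) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def f_test_reject(s2_x, s2_y, n, m, alpha, alternative):
    """Return True if H0: sigma_X^2 = sigma_Y^2 is rejected under Theorem 9.3.1."""
    ratio = s2_y / s2_x          # the test statistic is s_Y^2 / s_X^2
    d1, d2 = m - 1, n - 1        # numerator and denominator degrees of freedom
    if alternative == "greater":     # H1: sigma_X^2 > sigma_Y^2, part (a)
        return ratio <= f_percentile(alpha, d1, d2)
    if alternative == "less":        # H1: sigma_X^2 < sigma_Y^2, part (b)
        return ratio >= f_percentile(1.0 - alpha, d1, d2)
    if alternative == "two-sided":   # H1: variances differ, part (c)
        return (ratio <= f_percentile(alpha / 2.0, d1, d2)
                or ratio >= f_percentile(1.0 - alpha / 2.0, d1, d2))
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
```

Note the direction of the inequalities: because the statistic is $s_Y^2/s_X^2$, evidence that $\sigma_X^2$ is the larger variance shows up as a small ratio, so part (a) rejects in the left tail. With the values from Case Study 9.3.1, `f_test_reject(0.21, 0.36, 10, 10, 0.05, "two-sided")` returns `False`.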
Comment The GLRT described in Theorem 9.3.1 is approximate for the same sort
of reason the GLRT for $H_0$: $\sigma^2 = \sigma_0^2$ is approximate (see Theorem 7.5.2). The distribution
of the test statistic, $S_Y^2/S_X^2$, is not symmetric, and the two ranges of variance
ratios yielding $\lambda$'s less than or equal to $\lambda^*$ (i.e., the left tail and right tail of the
critical region) have slightly different areas. For the sake of convenience, though,
it is customary to choose the two critical values so that each cuts off the same
area, α/2.
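A related fact is useful when a table lists only upper percentiles. If $S_Y^2/S_X^2$ has an F distribution with m − 1 and n − 1 degrees of freedom, its reciprocal has an F distribution with n − 1 and m − 1 degrees of freedom, so the left-tail critical value can be obtained from a right-tail one. The identity below is written in the percentile notation of Theorem 9.3.1; the numerical check assumes n = m = 10 and α = 0.05, the values used in Case Study 9.3.1.

```latex
F_{\alpha/2,\, m-1,\, n-1} = \frac{1}{F_{1-\alpha/2,\, n-1,\, m-1}},
\qquad \text{e.g.} \qquad
F_{0.025,\, 9,\, 9} = \frac{1}{F_{0.975,\, 9,\, 9}} = \frac{1}{4.03} \approx 0.248
```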
Case Study 9.3.1
Electroencephalograms are records showing ﬂuctuations of electrical activity
in the brain. Among the several different kinds of brain waves produced, the
dominant ones are usually alpha waves. These have a characteristic frequency
of anywhere from eight to thirteen cycles per second.
The objective of the experiment described in this example was to see
whether sensory deprivation over an extended period of time has any effect on
the alpha-wave pattern. The subjects were twenty inmates in a Canadian prison
who were randomly split into two equal-sized groups. Members of one group
were placed in solitary conﬁnement; those in the other group were allowed
to remain in their own cells. Seven days later, alpha-wave frequencies were
measured for all twenty subjects (60), as shown in Table 9.3.1.
Table 9.3.1 Alpha-Wave Frequencies (CPS)
Nonconﬁned, xi Solitary Conﬁnement, yi
10.7 9.6
10.7 10.4
10.4 9.7
10.9 10.3
10.5 9.2
10.3 9.3
9.6 9.9
11.1 9.5
11.2 9.0
10.4 10.9
Judging from Figure 9.3.2, there was an apparent decrease in alpha-wave
frequency for persons in solitary conﬁnement. There also appears to have been
an increase in the variability for that group. We will use the F test to determine
whether the observed difference in variability ($s_X^2 = 0.21$ versus $s_Y^2 = 0.36$) is
statistically significant.
Figure 9.3.2 Alpha-wave frequencies (cps). [Figure: plot of alpha-wave frequency
(cps) for the nonconfined and solitary groups.]
Let $\sigma_X^2$ and $\sigma_Y^2$ denote the true variances of alpha-wave frequencies for
nonconfined and solitary-confined prisoners, respectively. The hypotheses to be
tested are
$$H_0: \sigma_X^2 = \sigma_Y^2$$
versus
$$H_1: \sigma_X^2 \neq \sigma_Y^2$$
Let $\alpha = 0.05$ be the level of significance. Given that
$$\sum_{i=1}^{10} x_i = 105.8 \qquad \sum_{i=1}^{10} x_i^2 = 1121.26$$
$$\sum_{i=1}^{10} y_i = 97.8 \qquad \sum_{i=1}^{10} y_i^2 = 959.70$$
the sample variances become
$$s_X^2 = \frac{10(1121.26) - (105.8)^2}{10(9)} = 0.21$$
and
$$s_Y^2 = \frac{10(959.70) - (97.8)^2}{10(9)} = 0.36$$
Dividing the sample variances gives an observed F ratio of 1.71:
$$F = \frac{s_Y^2}{s_X^2} = \frac{0.36}{0.21} = 1.71$$
Both n and m are ten, so we would expect $S_Y^2/S_X^2$ to behave like an F random
variable with nine and nine degrees of freedom (assuming $H_0$: $\sigma_X^2 = \sigma_Y^2$ is
true). From Table A.4 in the Appendix, we see that the values cutting off areas
of 0.025 in either tail of that distribution are 0.248 and 4.03 (see Figure 9.3.3).
Since the observed F ratio falls between the two critical values, our decision
is to fail to reject H0—a ratio of sample variances equal to 1.71 does not rule out
the possibility that the two true variances are equal. (In light of the Comment
preceding Theorem 9.3.1, it would now be appropriate to test H0: μX =μY using
the two-sample t test described in Section 9.2.)
Figure 9.3.3 Distribution of $S_Y^2/S_X^2$ when $H_0$ is true. [Figure: density of the
F distribution with 9 and 9 degrees of freedom; the two rejection regions, each of
area 0.025, lie to the left of 0.248 and to the right of 4.03.]
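The arithmetic in Case Study 9.3.1 can be reproduced with a short Python sketch. The variable and function names are hypothetical; the data are the twenty observations in Table 9.3.1, and the critical values 0.248 and 4.03 are taken from the text (Table A.4) rather than computed.

```python
# Alpha-wave frequencies (cps) from Table 9.3.1.
nonconfined = [10.7, 10.7, 10.4, 10.9, 10.5, 10.3, 9.6, 11.1, 11.2, 10.4]
solitary = [9.6, 10.4, 9.7, 10.3, 9.2, 9.3, 9.9, 9.5, 9.0, 10.9]


def sample_variance(data):
    """Computing formula used in the case study: (n*sum(x^2) - (sum x)^2) / (n(n-1))."""
    n = len(data)
    total = sum(data)
    total_sq = sum(v * v for v in data)
    return (n * total_sq - total ** 2) / (n * (n - 1))


s2_x = sample_variance(nonconfined)
s2_y = sample_variance(solitary)

# The text divides the variances after rounding them to two decimals.
ratio_from_rounded = round(s2_y, 2) / round(s2_x, 2)
# Ratio of the unrounded variances, used for the decision below.
ratio = s2_y / s2_x

# Two-sided critical values for 9 and 9 degrees of freedom, alpha = 0.05 (from the text).
lower_crit, upper_crit = 0.248, 4.03
reject_h0 = ratio <= lower_crit or ratio >= upper_crit

print(f"s2_x = {s2_x:.2f}, s2_y = {s2_y:.2f}")
print(f"F from rounded variances = {ratio_from_rounded:.2f}")
print(f"F from unrounded variances = {ratio:.2f}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

The ratio of the unrounded variances is about 1.70 rather than 1.71; the small discrepancy comes from rounding the variances before dividing, and the decision is the same either way.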
Questions
9.3.1. Case Study 9.2.3 was offered as an example of testing
means when the variances are not assumed equal. Was
this a correct assumption about the variances? Test at the
0.05 level of signiﬁcance.
9.3.2. Two popular forms of mortgage are the thirty-year
ﬁxed-rate mortgage, where the borrower has thirty years
to repay the loan at a constant rate, and the adjustable-rate
mortgage (ARM), one version of which is for five
years with the possibility of yearly changes in the interest
rate. Since the ARM offers less certainty, its rates are
usually lower than those of ﬁxed-rate mortgages. However,
such vehicles should show more variability in rates.
Test this hypothesis at the 0.10 level of signiﬁcance using
the following samples of mortgage offerings for a loan
of $160,000 (the borrower needs $200,000, but must pay
$40,000 up front).
$160,000 Mortgage Rates
30-Year Fixed ARM
5.500 3.875
5.500 5.125
5.250 5.000
5.125 4.750
5.875 4.375
5.625
5.250
4.875
9.3.3. Among the standard personality inventories used
by psychologists is the thematic apperception test (TAT)
in which a subject is shown a series of pictures and is asked
to make up a story about each one. Interpreted properly,
the content of the stories can provide valuable insights
into the subject’s mental well-being. The following data
show the TAT results for 40 women, 20 of whom were
the mothers of normal children and 20 the mothers of
schizophrenic children. In each case the subject was shown
the same set of 10 pictures. The ﬁgures recorded were
the numbers of stories (out of 10) that revealed a positive
parent–child relationship, one where the mother was
clearly capable of interacting with her child in a ﬂexible,
open-minded way (199).
TAT Scores
Mothers of Normal Children     Mothers of Schizophrenic Children
8  4  6  3  1                  2  1  1  3  2
4  4  6  4  2                  7  2  1  3  1
2  1  1  4  3                  0  2  4  2  3
3  2  6  3  4                  3  0  1  2  2
(a) Test $H_0$: $\sigma_X^2 = \sigma_Y^2$ versus $H_1$: $\sigma_X^2 \neq \sigma_Y^2$, where $\sigma_X^2$ and
$\sigma_Y^2$ are the variances of the scores of mothers of normal
children and scores of mothers of schizophrenic
children, respectively. Let $\alpha = 0.05$.
(b) If $H_0$: $\sigma_X^2 = \sigma_Y^2$ is accepted in part (a), test $H_0$: $\mu_X = \mu_Y$
versus $H_1$: $\mu_X \neq \mu_Y$. Set $\alpha$ equal to 0.05.
9.3.4. In a study designed to investigate the effects of a
strong magnetic ﬁeld on the early development of mice