the basis of these data that transmitting underwater predator sounds is an effective technique for clearing fishing waters of unwanted whales?

(b) Calculate the P-value for these data. For what values of α would H0 be rejected?

6.3.2. Efforts to find a genetic explanation for why certain people are right-handed and others left-handed have been largely unsuccessful. Reliable data are difficult to find because of environmental factors that also influence a child's "handedness." To avoid that complication, researchers often study the analogous problem of "pawedness" in animals, where both genotypes and the environment can be partially controlled. In one such experiment (27), mice were put into a cage having a feeding tube that was equally accessible from the right or the left. Each mouse was then carefully watched over a number of feedings. If it used its right paw more than half the time to activate the tube, it was defined to be "right-pawed." Observations of this sort showed that 67% of mice belonging to strain A/J are right-pawed. A similar protocol was followed on a sample of thirty-five mice belonging to strain A/HeJ. Of those thirty-five, a total of eighteen were eventually classified as right-pawed. Test whether the proportion of right-pawed mice found in the A/HeJ sample was significantly different from what was known about the A/J strain. Use a two-sided alternative and let 0.05 be the probability associated with the critical region.

6.3.3. Defeated in his most recent attempt to win a congressional seat because of a sizeable gender gap, a politician has spent the last two years speaking out in favor of women's rights issues. A newly released poll claims to have contacted a random sample of 120 of the politician's current supporters and found that 72 were men. In the election that he lost, exit polls indicated that 65% of those who voted for him were men. Using an α = 0.05 level of significance, test the null hypothesis that the proportion of his male supporters has remained the same. Make the alternative hypothesis one-sided.

6.3.4. Suppose H0: p = 0.45 is to be tested against H1: p > 0.45 at the α = 0.14 level of significance, where p = P(ith trial ends in success). If the sample size is 200, what is the smallest number of successes that will cause H0 to be rejected?

6.3.5. Recall the median test described in Example 5.3.2. Reformulate that analysis as a hypothesis test rather than a confidence interval. What P-value is associated with the outcomes listed in Table 5.3.3?

6.3.6. Among the early attempts to revisit the death postponement theory introduced in Case Study 6.3.2 was an examination of the birth dates and death dates of 348 U.S. celebrities (134). It was found that 16 of those individuals had died in the month preceding their birth month. Set up and test the appropriate H0 against a one-sided H1. Use the 0.05 level of significance.

6.3.7. What α levels are possible with a decision rule of the form "Reject H0 if k ≥ k*" when H0: p = 0.5 is to be tested against H1: p > 0.5 using a random sample of size n = 7?

6.3.8. The following is a Minitab printout of the binomial pdf

$$p_X(k) = \binom{9}{k}(0.6)^k(0.4)^{9-k}, \quad k = 0, 1, \ldots, 9$$

Suppose H0: p = 0.6 is to be tested against H1: p > 0.6 and we wish the level of significance to be exactly 0.05. Use Theorem 2.4.1 to combine two different critical regions into a single randomized decision rule for which α = 0.05.

MTB > pdf;
SUBC > binomial 9 0.6.
Probability Density Function
Binomial with n = 9 and p = 0.6

 x   P(X = x)
 0   0.000262
 1   0.003539
 2   0.021234
 3   0.074318
 4   0.167215
 5   0.250823
 6   0.250823
 7   0.161243
 8   0.060466
 9   0.010078

6.3.9. Suppose H0: p = 0.75 is to be tested against H1: p < 0.75 using a random sample of size n = 7 and the decision rule "Reject H0 if k ≤ 3."

(a) What is the test's level of significance?

(b) Graph the probability that H0 will be rejected as a function of p.

6.4 Type I and Type II Errors

The possibility of drawing incorrect conclusions is an inevitable byproduct of hypothesis testing. No matter what sort of mathematical facade is laid atop the decision-making process, there is no way to guarantee that what the test tells us is the truth. One kind of error—rejecting H0 when H0 is true—figured prominently in Section 6.3: It was argued that critical regions should be defined so as to keep the probability of making such errors small, often on the order of 0.05.

In point of fact, there are two different kinds of errors that can be committed with any hypothesis test: (1) We can reject H0 when H0 is true and (2) we can fail to reject H0 when H0 is false. These are called Type I and Type II errors, respectively. At the same time, there are two kinds of correct decisions: (1) We can fail to reject a true H0 and (2) we can reject a false H0. Figure 6.4.1 shows these four possible "Decision/State of nature" combinations.

Figure 6.4.1

                            True State of Nature
Our Decision                H0 is true          H1 is true
Fail to reject H0           Correct decision    Type II error
Reject H0                   Type I error        Correct decision

Computing the Probability of Committing a Type I Error

Once an inference is made, there is no way to know whether the conclusion reached was correct. It is possible, though, to calculate the probability of having made an error, and the magnitude of that probability can help us better understand the "power" of the hypothesis test and its ability to distinguish between H0 and H1.

Recall the fuel additive example developed in Section 6.2: H0: μ = 25.0 was to be tested against H1: μ > 25.0 using a sample of size n = 30. The decision rule stated that H0 should be rejected if ȳ, the average mpg with the new additive, equalled or exceeded 25.718. In that case, the probability of committing a Type I error is 0.05:

$$
\begin{aligned}
P(\text{Type I error}) &= P(\text{Reject } H_0 \mid H_0 \text{ is true}) = P(\bar{Y} \ge 25.718 \mid \mu = 25.0) \\
&= P\left(\frac{\bar{Y} - 25.0}{2.4/\sqrt{30}} \ge \frac{25.718 - 25.0}{2.4/\sqrt{30}}\right) \\
&= P(Z \ge 1.64) = 0.05
\end{aligned}
$$

Of course, the fact that the probability of committing a Type I error equals 0.05 should come as no surprise. In our earlier discussion of how "beyond reasonable doubt" should be interpreted numerically, we specifically chose the critical region so that the probability of the decision rule rejecting H0 when H0 is true would be 0.05.

In general, the probability of committing a Type I error is referred to as a test's level of significance and is denoted α (recall Definition 6.2.3). The concept is a crucial one: The level of significance is a single-number summary of the "rules" by which the decision process is being conducted. In essence, α reflects the amount of evidence the experimenter is demanding to see before abandoning the null hypothesis.

Computing the Probability of Committing a Type II Error

We just saw that calculating the probability of a Type I error is a nonproblem: There are no computations necessary, since the probability equals whatever value the experimenter sets a priori for α.
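Before turning to Type II errors, the Type I calculation above is easy to verify numerically. The following is a minimal sketch, not part of the text's development, assuming Python with scipy is available; the critical value 25.718, μ0 = 25.0, σ = 2.4, and n = 30 are taken from the fuel-additive example.

```python
from scipy.stats import norm
import math

mu0, sigma, n = 25.0, 2.4, 30   # H0 mean, population sd, sample size
y_star = 25.718                 # critical value from the decision rule

# P(Type I error) = P(Ybar >= 25.718 | mu = 25.0)
z = (y_star - mu0) / (sigma / math.sqrt(n))
alpha = norm.sf(z)              # upper-tail probability, P(Z >= z)
print(f"z = {z:.2f}, P(Type I error) = {alpha:.4f}")   # z = 1.64, probability near 0.05
```

The upper-tail probability comes out near 0.05, matching the level of significance chosen in Section 6.2.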
A similar situation does not hold for Type II errors. To begin with, Type II error probabilities are not specified explicitly by the experimenter; also, each hypothesis test has an infinite number of Type II error probabilities, one for each value of the parameter admissible under H1.

As an example, suppose we want to find the probability of committing a Type II error in the gasoline experiment if the true μ (with the additive) were 25.750. By definition,

$$
\begin{aligned}
P(\text{Type II error} \mid \mu = 25.750) &= P(\text{We fail to reject } H_0 \mid \mu = 25.750) \\
&= P(\bar{Y} < 25.718 \mid \mu = 25.750) \\
&= P\left(\frac{\bar{Y} - 25.75}{2.4/\sqrt{30}} < \frac{25.718 - 25.75}{2.4/\sqrt{30}}\right) \\
&= P(Z < -0.07) = 0.4721
\end{aligned}
$$

So, even if the new additive increased the fuel economy to 25.750 mpg (from 25 mpg), our decision rule would be "tricked" 47% of the time: that is, it would tell us on those occasions not to reject H0. The symbol for the probability of committing a Type II error is β. Figure 6.4.2 shows the sampling distribution of Ȳ when μ = 25.0 (i.e., when H0 is true) and when μ = 25.750 (H1 is true); the areas corresponding to α and β are shaded.

[Figure 6.4.2: Sampling distributions of Ȳ when H0 is true (μ = 25.0) and when μ = 25.75; the critical value is 25.718, with shaded areas α = 0.05 and β = 0.4721.]

[Figure 6.4.3: Sampling distributions of Ȳ when H0 is true (μ = 25.0) and when μ = 26.8; the critical value is 25.718, with shaded areas α = 0.05 and β = 0.0068.]

Clearly, the magnitude of β is a function of the presumed value for μ. If, for example, the gasoline additive is so effective as to raise fuel efficiency to 26.8 mpg, the probability that our decision rule would lead us to make a Type II error is a much smaller 0.0068:

$$
\begin{aligned}
P(\text{Type II error} \mid \mu = 26.8) &= P(\text{We fail to reject } H_0 \mid \mu = 26.8) \\
&= P(\bar{Y} < 25.718 \mid \mu = 26.8) \\
&= P\left(\frac{\bar{Y} - 26.8}{2.4/\sqrt{30}} < \frac{25.718 - 26.8}{2.4/\sqrt{30}}\right) \\
&= P(Z < -2.47) = 0.0068
\end{aligned}
$$

(See Figure 6.4.3.)

Power Curves

If β is the probability that we fail to reject H0 when H1 is true, then 1 − β is the probability of the complement—that we reject H0 when H1 is true. We call 1 − β the power of the test; it represents the ability of the decision rule to "recognize" (correctly) that H0 is false.

The alternative hypothesis H1 usually depends on a parameter, which makes 1 − β a function of that parameter. The relationship they share can be pictured by drawing a power curve, which is simply a graph of 1 − β versus the set of all possible parameter values. Figure 6.4.4 shows the power curve for testing H0: μ = 25.0 versus H1: μ > 25.0, where μ is the mean of a normal distribution with σ = 2.4 and the decision rule is "Reject H0 if ȳ ≥ 25.718." The two marked points on the curve represent the (μ, 1 − β) pairs just determined, (25.75, 0.5279) and (26.8, 0.9932). One other point can be gotten for every power curve, without doing any calculations: When μ = μ0 (the value specified by H0), 1 − β = α. Of course, as the true mean gets further and further away from the H0 mean, the power will converge to 1.

[Figure 6.4.4: Power curve, 1 − β versus the presumed value of μ, for testing H0: μ = 25.0 versus H1: μ > 25.0; marked values include power = 0.29 at μ = 25.5 and power = 0.72 at μ = 26.0.]

Power curves serve two different purposes. On the one hand, they completely characterize the performance that can be expected from a hypothesis test. In Figure 6.4.4, for example, the two arrows show that the probability of rejecting H0: μ = 25 in favor of H1: μ > 25 when μ = 26.0 is approximately 0.72.
(Or, equivalently, Type II errors will be committed roughly 28% of the time when μ = 26.0.) As the true mean moves closer to μ0 (and becomes more difficult to distinguish), the power of the test understandably diminishes. If μ = 25.5, for example, the graph shows that 1 − β falls to 0.29.

Power curves are also useful for comparing one inference procedure with another. For every conceivable hypothesis testing situation, a variety of procedures for choosing between H0 and H1 will be available. How do we know which to use? The answer to that question is not always simple. Some procedures will be computationally more convenient or easier to explain than others; some will make slightly different assumptions about the pdf being sampled. Associated with each of them, though, is a power curve. If the selection of a hypothesis test is to hinge solely on its ability to distinguish H0 from H1, then the procedure to choose is the one having the steepest power curve. Figure 6.4.5 shows the power curves for two hypothetical methods A and B, each of which is testing H0: θ = θ0 versus H1: θ ≠ θ0 at the α level of significance. From the standpoint of power, Method B is clearly the better of the two—it always has a higher probability of correctly rejecting H0 when the parameter θ is not equal to θ0.

[Figure 6.4.5: Power curves, 1 − β versus θ, for two competing procedures, Method A and Method B, each with level α at θ = θ0.]

Factors That Influence the Power of a Test

The ability of a test procedure to reject H0 when H0 is false is clearly of prime importance, a fact that raises an obvious question: What can an experimenter do to influence the value of 1 − β? In the case of the Z test described in Theorem 6.2.1, 1 − β is a function of α, σ, and n. By appropriately raising or lowering the values of those parameters, the power of the test against any given μ can be made to equal any desired level.

The Effect of α on 1 − β

Consider again the test of H0: μ = 25.0 versus H1: μ > 25.0 discussed earlier in this section. In its original form, α = 0.05, σ = 2.4, n = 30, and the decision rule called for H0 to be rejected if ȳ ≥ 25.718. Figure 6.4.6 shows what happens to 1 − β (when μ = 25.75) if σ, n, and μ are held constant but α is increased to 0.10. The top pair of distributions shows the configuration that appears in Figure 6.4.2; the power in this case is 1 − 0.4721, or 0.53. The bottom portion of the graph illustrates what happens when α is set at 0.10 instead of 0.05—the decision rule changes from "Reject H0 if ȳ ≥ 25.718" to "Reject H0 if ȳ ≥ 25.561" (see Question 6.4.2) and the power increases from 0.53 to 0.67:

$$
\begin{aligned}
1 - \beta &= P(\text{Reject } H_0 \mid H_1 \text{ is true}) = P(\bar{Y} \ge 25.561 \mid \mu = 25.75) \\
&= P\left(\frac{\bar{Y} - 25.75}{2.4/\sqrt{30}} \ge \frac{25.561 - 25.75}{2.4/\sqrt{30}}\right) \\
&= P(Z \ge -0.43) = 0.6664
\end{aligned}
$$

[Figure 6.4.6: Two panels showing the sampling distributions of Ȳ when H0 is true (μ = 25.0) and when μ = 25.75. Top panel: α = 0.05, critical value 25.718, β = 0.4721, power = 0.53. Bottom panel: α = 0.10, critical value 25.561, β = 0.3336, power = 0.67.]

The specifics of Figure 6.4.6 accurately reflect what is true in general: Increasing α decreases β and increases the power. That said, it does not follow in practice that experimenters should manipulate α to achieve a desired 1 − β. For all the reasons cited in Section 6.2, α should typically be set equal to a number somewhere in the neighborhood of 0.05. If the corresponding 1 − β against a particular μ is deemed to be inappropriate, adjustments should be made in the values of σ and/or n.
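The effect of α on the power is also easy to reproduce numerically. The sketch below (again assuming Python with scipy; the variable names are purely illustrative) recomputes the critical value and the power against μ = 25.75 for α = 0.05 and α = 0.10, using the same σ = 2.4 and n = 30 as the fuel-additive example.

```python
from scipy.stats import norm
import math

mu0, mu1 = 25.0, 25.75        # H0 mean and the presumed H1 mean
sigma, n = 2.4, 30
se = sigma / math.sqrt(n)     # standard error of the sample mean

for alpha in (0.05, 0.10):
    y_star = mu0 + norm.isf(alpha) * se    # critical value: P(Ybar >= y_star | mu = mu0) = alpha
    power = norm.sf((y_star - mu1) / se)   # 1 - beta = P(Ybar >= y_star | mu = mu1)
    print(f"alpha = {alpha:.2f}: critical value = {y_star:.3f}, power = {power:.2f}")

# Expected output (approximately):
# alpha = 0.05: critical value = 25.721, power = 0.53
# alpha = 0.10: critical value = 25.562, power = 0.67
```

The critical values agree with 25.718 and 25.561 to within rounding, since the text rounds the z cutoffs to 1.64 and 1.28.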
The Effects of σ and n on 1 − β

Although it may not always be feasible (or even possible), decreasing σ will necessarily increase 1 − β. In the gasoline additive example, σ is assumed to be 2.4 mpg, the latter being a measure of the variation in gas mileages from driver to driver achieved in a cross-country road trip from Boston to Los Angeles (recall p. 351). Intuitively, the environmental differences inherent in a trip of that magnitude would be considerable. Different drivers would encounter different weather conditions and varying amounts of traffic, and would perhaps take alternate routes.

Suppose, instead, that the drivers simply did laps around a test track rather than drive on actual highways. Conditions from driver to driver would then be much more uniform and the value of σ would surely be smaller. What would be the effect on 1 − β when μ = 25.75 (and α = 0.05) if σ could be reduced from 2.4 mpg to 1.2 mpg?

As Figure 6.4.7 shows, reducing σ has the effect of making the H0 distribution of Ȳ more concentrated around μ0 (= 25) and the H1 distribution of Ȳ more concentrated around μ (= 25.75). Substituting into Equation 6.2.1 (with 1.2 for σ in place of 2.4), we find that the critical value ȳ* moves closer to μ0 [from 25.718 to $25.359 = 25 + 1.64 \cdot \frac{1.2}{\sqrt{30}}$], and the proportion of the H1 distribution above the rejection region (i.e., the power) increases from 0.53 to 0.96:

$$
1 - \beta = P(\bar{Y} \ge 25.359 \mid \mu = 25.75) = P\left(Z \ge \frac{25.359 - 25.75}{1.2/\sqrt{30}}\right) = P(Z \ge -1.78) = 0.9625
$$

[Figure 6.4.7: Two panels showing the sampling distributions of Ȳ when H0 is true (μ = 25.0) and when μ = 25.75, both with α = 0.05. Top panel (σ = 2.4): critical value 25.718, β = 0.4721, power = 0.53. Bottom panel (σ = 1.2): critical value 25.359, β = 0.0375, power = 0.96.]

In theory, reducing σ can be a very effective way of increasing the power of a test, as Figure 6.4.7 makes abundantly clear. In practice, though, refinements in the way data are collected that would have a substantial impact on the magnitude of σ are often either difficult to identify or prohibitively expensive. More typically, experimenters achieve the same effect by simply increasing the sample size.

Look again at the two sets of distributions in Figure 6.4.7. The increase in 1 − β from 0.53 to 0.96 was accomplished by cutting the denominator of the test statistic $z = \frac{\bar{y} - 25}{\sigma/\sqrt{30}}$ in half by reducing the standard deviation from 2.4 to 1.2. The same numerical effect would be produced if σ were left unchanged but n was increased from 30 to 120—that is, $\frac{1.2}{\sqrt{30}} = \frac{2.4}{\sqrt{120}}$. Because it can easily be increased or decreased, the sample size is the parameter that researchers almost invariably turn to as the mechanism for ensuring that a hypothesis test will have a sufficiently high power against a given alternative.

Example 6.4.1

Suppose an experimenter wishes to test H0: μ = 100 versus H1: μ > 100 at the α = 0.05 level of significance and wants 1 − β to equal 0.60 when μ = 103. What is the smallest (i.e., cheapest) sample size that will achieve that objective? Assume that the variable being measured is normally distributed with σ = 14.

Finding n, given values for α, 1 − β, σ, and μ, requires that two simultaneous equations be written for the critical value ȳ*, one in terms of the H0 distribution and the other in terms of the H1 distribution.
Setting the two equal will yield the minimum sample size that achieves the desired α and 1 − β.

Consider, first, the consequences of the level of significance being equal to 0.05. By definition,

$$
\begin{aligned}
\alpha &= P(\text{We reject } H_0 \mid H_0 \text{ is true}) = P(\bar{Y} \ge \bar{y}^* \mid \mu = 100) \\
&= P\left(\frac{\bar{Y} - 100}{14/\sqrt{n}} \ge \frac{\bar{y}^* - 100}{14/\sqrt{n}}\right) = P\left(Z \ge \frac{\bar{y}^* - 100}{14/\sqrt{n}}\right) = 0.05
\end{aligned}
$$

But P(Z ≥ 1.64) = 0.05, so

$$\frac{\bar{y}^* - 100}{14/\sqrt{n}} = 1.64$$

or, equivalently,

$$\bar{y}^* = 100 + 1.64 \cdot \frac{14}{\sqrt{n}} \tag{6.4.1}$$

Similarly,

$$
\begin{aligned}
1 - \beta &= P(\text{We reject } H_0 \mid H_1 \text{ is true}) = P(\bar{Y} \ge \bar{y}^* \mid \mu = 103) \\
&= P\left(\frac{\bar{Y} - 103}{14/\sqrt{n}} \ge \frac{\bar{y}^* - 103}{14/\sqrt{n}}\right) = 0.60
\end{aligned}
$$

From Appendix Table A.1, though, P(Z ≥ −0.25) = 0.5987 ≈ 0.60, so

$$\frac{\bar{y}^* - 103}{14/\sqrt{n}} = -0.25$$

which implies that

$$\bar{y}^* = 103 - 0.25 \cdot \frac{14}{\sqrt{n}} \tag{6.4.2}$$

It follows, then, from Equations 6.4.1 and 6.4.2 that

$$100 + 1.64 \cdot \frac{14}{\sqrt{n}} = 103 - 0.25 \cdot \frac{14}{\sqrt{n}}$$

Solving for n gives $\sqrt{n} = \frac{(1.64 + 0.25)(14)}{3} = 8.82$, or n = 77.8, so a minimum of seventy-eight observations must be taken to guarantee that the hypothesis test will have the desired precision.

Decision Rules for Nonnormal Data

Our discussion of hypothesis testing thus far has been confined to inferences involving either binomial data or normal data. Decision rules for other types of probability functions are rooted in the same basic principles. In general, to test H0: θ = θ0, where θ is the unknown parameter in a pdf $f_Y(y; \theta)$, we initially define the decision rule in terms of $\hat{\theta}$, where the latter is a sufficient statistic for θ. The corresponding critical region is the set of values of $\hat{\theta}$ least compatible with θ0 (but admissible under H1) whose total probability when H0 is true is α. In the case of testing H0: μ = μ0 versus H1: μ > μ0, for example, where the data are normally distributed, Ȳ is a sufficient statistic for μ, and the least likely values for the sample mean that are admissible under H1 are those for which ȳ ≥ ȳ*, where P(Ȳ ≥ ȳ* | H0 is true) = α.

Example 6.4.2

A random sample of size n = 8 is drawn from the uniform pdf, $f_Y(y; \theta) = 1/\theta$, 0 ≤ y ≤ θ, for the purpose of testing H0: θ = 2.0 versus H1: θ < 2.0 at the α = 0.10 level of significance. Suppose the decision rule is to be based on $Y_8$, the largest order statistic. What would be the probability of committing a Type II error when θ = 1.7?