the basis of these data that transmitting underwater predator sounds is an effective technique for clearing fishing waters of unwanted whales?

(b) Calculate the P-value for these data. For what values of α would H0 be rejected?

6.3.2. Efforts to find a genetic explanation for why certain people are right-handed and others left-handed have been largely unsuccessful. Reliable data are difficult to find because of environmental factors that also influence a child's "handedness." To avoid that complication, researchers often study the analogous problem of "pawedness" in animals, where both genotypes and the environment can be partially controlled. In one such experiment (27), mice were put into a cage having a feeding tube that was equally accessible from the right or the left. Each mouse was then carefully watched over a number of feedings. If it used its right paw more than half the time to activate the tube, it was defined to be "right-pawed." Observations of this sort showed that 67% of mice belonging to strain A/J are right-pawed. A similar protocol was followed on a sample of thirty-five mice belonging to strain A/HeJ. Of those thirty-five, a total of eighteen were eventually classified as right-pawed. Test whether the proportion of right-pawed mice found in the A/HeJ sample was significantly different from what was known about the A/J strain. Use a two-sided alternative and let 0.05 be the probability associated with the critical region.

6.3.3. Defeated in his most recent attempt to win a congressional seat because of a sizeable gender gap, a politician has spent the last two years speaking out in favor of women's rights issues. A newly released poll claims to have contacted a random sample of 120 of the politician's current supporters and found that 72 were men. In the election that he lost, exit polls indicated that 65% of those who voted for him were men. Using an α = 0.05 level of significance, test the null hypothesis that the proportion of his male supporters has remained the same. Make the alternative hypothesis one-sided.

6.3.4. Suppose H0: p = 0.45 is to be tested against H1: p > 0.45 at the α = 0.14 level of significance, where p = P(ith trial ends in success). If the sample size is 200, what is the smallest number of successes that will cause H0 to be rejected?

6.3.5. Recall the median test described in Example 5.3.2. Reformulate that analysis as a hypothesis test rather than a confidence interval. What P-value is associated with the outcomes listed in Table 5.3.3?

6.3.6. Among the early attempts to revisit the death postponement theory introduced in Case Study 6.3.2 was an examination of the birth dates and death dates of 348 U.S. celebrities (134). It was found that 16 of those individuals had died in the month preceding their birth month. Set up and test the appropriate H0 against a one-sided H1. Use the 0.05 level of significance.

6.3.7. What α levels are possible with a decision rule of the form "Reject H0 if k ≥ k*" when H0: p = 0.5 is to be tested against H1: p > 0.5 using a random sample of size n = 7?

6.3.8. The following is a Minitab printout of the binomial pdf

$$p_X(k) = \binom{9}{k}(0.6)^k(0.4)^{9-k}, \quad k = 0, 1, \ldots, 9$$

Suppose H0: p = 0.6 is to be tested against H1: p > 0.6 and we wish the level of significance to be exactly 0.05. Use Theorem 2.4.1 to combine two different critical regions into a single randomized decision rule for which α = 0.05.

MTB > pdf;
SUBC > binomial 9 0.6.
Probability Density Function
Binomial with n = 9 and p = 0.6

 x   P(X = x)
 0   0.000262
 1   0.003539
 2   0.021234
 3   0.074318
 4   0.167215
 5   0.250823
 6   0.250823
 7   0.161243
 8   0.060466
 9   0.010078

6.3.9. Suppose H0: p = 0.75 is to be tested against H1: p < 0.75 using a random sample of size n = 7 and the decision rule "Reject H0 if k ≤ 3."

(a) What is the test's level of significance?

(b) Graph the probability that H0 will be rejected as a function of p.

6.4 Type I and Type II Errors

The possibility of drawing incorrect conclusions is an inevitable byproduct of hypothesis testing. No matter what sort of mathematical facade is laid atop the decision-making process, there is no way to guarantee that what the test tells us is the truth. One kind of error—rejecting H0 when H0 is true—figured prominently in Section 6.3: It was argued that critical regions should be defined so as to keep the probability of making such errors small, often on the order of 0.05.

In point of fact, there are two different kinds of errors that can be committed with any hypothesis test: (1) We can reject H0 when H0 is true and (2) we can fail to reject H0 when H0 is false. These are called Type I and Type II errors, respectively. At the same time, there are two kinds of correct decisions: (1) We can fail to reject a true H0 and (2) we can reject a false H0. Figure 6.4.1 shows these four possible "Decision/State of nature" combinations.

Figure 6.4.1

                            True State of Nature
Our Decision                H0 is true          H1 is true
Fail to reject H0           Correct decision    Type II error
Reject H0                   Type I error        Correct decision

Computing the Probability of Committing a Type I Error

Once an inference is made, there is no way to know whether the conclusion reached was correct. It is possible, though, to calculate the probability of having made an error, and the magnitude of that probability can help us better understand the "power" of the hypothesis test and its ability to distinguish between H0 and H1.

Recall the fuel additive example developed in Section 6.2: H0: μ = 25.0 was to be tested against H1: μ > 25.0 using a sample of size n = 30. The decision rule stated that H0 should be rejected if ȳ, the average mpg with the new additive, equalled or exceeded 25.718. In that case, the probability of committing a Type I error is 0.05:

$$
\begin{aligned}
P(\text{Type I error}) &= P(\text{Reject } H_0 \mid H_0 \text{ is true}) = P(\bar{Y} \ge 25.718 \mid \mu = 25.0) \\
&= P\left(\frac{\bar{Y} - 25.0}{2.4/\sqrt{30}} \ge \frac{25.718 - 25.0}{2.4/\sqrt{30}}\right) \\
&= P(Z \ge 1.64) = 0.05
\end{aligned}
$$

Of course, the fact that the probability of committing a Type I error equals 0.05 should come as no surprise. In our earlier discussion of how "beyond reasonable doubt" should be interpreted numerically, we specifically chose the critical region so that the probability of the decision rule rejecting H0 when H0 is true would be 0.05.

In general, the probability of committing a Type I error is referred to as a test's level of significance and is denoted α (recall Definition 6.2.3). The concept is a crucial one: The level of significance is a single-number summary of the "rules" by which the decision process is being conducted. In essence, α reflects the amount of evidence the experimenter is demanding to see before abandoning the null hypothesis.

Computing the Probability of Committing a Type II Error

We just saw that calculating the probability of a Type I error is a nonproblem: There are no computations necessary, since the probability equals whatever value the experimenter sets a priori for α.
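Before turning to Type II errors, the Type I calculation above is easy to verify numerically. The following is a minimal sketch, not part of the text's development, assuming Python with scipy is available; the critical value 25.718, μ0 = 25.0, σ = 2.4, and n = 30 are taken from the fuel-additive example.

```python
from scipy.stats import norm
import math

mu0, sigma, n = 25.0, 2.4, 30   # H0 mean, population sd, sample size
y_star = 25.718                 # critical value from the decision rule

# P(Type I error) = P(Ybar >= 25.718 | mu = 25.0)
z = (y_star - mu0) / (sigma / math.sqrt(n))
alpha = norm.sf(z)              # upper-tail probability, P(Z >= z)
print(f"z = {z:.2f}, P(Type I error) = {alpha:.4f}")   # z = 1.64, probability near 0.05
```

The upper-tail probability comes out near 0.05, matching the level of significance chosen in Section 6.2.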
A similar situation does not hold for Type II errors. To begin with, Type II error probabilities are not specified explicitly by the experimenter; also, each hypothesis test has an infinite number of Type II error probabilities, one for each value of the parameter admissible under H1.

As an example, suppose we want to find the probability of committing a Type II error in the gasoline experiment if the true μ (with the additive) were 25.750. By definition,

$$
\begin{aligned}
P(\text{Type II error} \mid \mu = 25.750) &= P(\text{We fail to reject } H_0 \mid \mu = 25.750) \\
&= P(\bar{Y} < 25.718 \mid \mu = 25.750) \\
&= P\left(\frac{\bar{Y} - 25.75}{2.4/\sqrt{30}} < \frac{25.718 - 25.75}{2.4/\sqrt{30}}\right) \\
&= P(Z < -0.07) = 0.4721
\end{aligned}
$$

So, even if the new additive increased the fuel economy to 25.750 mpg (from 25 mpg), our decision rule would be "tricked" 47% of the time: that is, it would tell us on those occasions not to reject H0. The symbol for the probability of committing a Type II error is β. Figure 6.4.2 shows the sampling distribution of Ȳ when μ = 25.0 (i.e., when H0 is true) and when μ = 25.750 (H1 is true); the areas corresponding to α and β are shaded.

[Figure 6.4.2: Sampling distributions of Ȳ when H0 is true (μ = 25.0) and when μ = 25.75; the critical value is 25.718, with shaded areas α = 0.05 and β = 0.4721.]

[Figure 6.4.3: Sampling distributions of Ȳ when H0 is true (μ = 25.0) and when μ = 26.8; the critical value is 25.718, with shaded areas α = 0.05 and β = 0.0068.]

Clearly, the magnitude of β is a function of the presumed value for μ. If, for example, the gasoline additive is so effective as to raise fuel efficiency to 26.8 mpg, the probability that our decision rule would lead us to make a Type II error is a much smaller 0.0068:

$$
\begin{aligned}
P(\text{Type II error} \mid \mu = 26.8) &= P(\text{We fail to reject } H_0 \mid \mu = 26.8) \\
&= P(\bar{Y} < 25.718 \mid \mu = 26.8) \\
&= P\left(\frac{\bar{Y} - 26.8}{2.4/\sqrt{30}} < \frac{25.718 - 26.8}{2.4/\sqrt{30}}\right) \\
&= P(Z < -2.47) = 0.0068
\end{aligned}
$$

(See Figure 6.4.3.)

Power Curves

If β is the probability that we fail to reject H0 when H1 is true, then 1 − β is the probability of the complement—that we reject H0 when H1 is true. We call 1 − β the power of the test; it represents the ability of the decision rule to "recognize" (correctly) that H0 is false.

The alternative hypothesis H1 usually depends on a parameter, which makes 1 − β a function of that parameter. The relationship they share can be pictured by drawing a power curve, which is simply a graph of 1 − β versus the set of all possible parameter values. Figure 6.4.4 shows the power curve for testing H0: μ = 25.0 versus H1: μ > 25.0, where μ is the mean of a normal distribution with σ = 2.4 and the decision rule is "Reject H0 if ȳ ≥ 25.718." The two marked points on the curve represent the (μ, 1 − β) pairs just determined, (25.75, 0.5279) and (26.8, 0.9932). One other point can be gotten for every power curve, without doing any calculations: When μ = μ0 (the value specified by H0), 1 − β = α. Of course, as the true mean gets further and further away from the H0 mean, the power will converge to 1.

[Figure 6.4.4: Power curve, 1 − β versus the presumed value of μ, for testing H0: μ = 25.0 versus H1: μ > 25.0; marked values include power = 0.29 at μ = 25.5 and power = 0.72 at μ = 26.0.]

Power curves serve two different purposes. On the one hand, they completely characterize the performance that can be expected from a hypothesis test. In Figure 6.4.4, for example, the two arrows show that the probability of rejecting H0: μ = 25 in favor of H1: μ > 25 when μ = 26.0 is approximately 0.72.
(Or, equivalently, Type II errors will be committed roughly 28% of the time when μ = 26.0.) As the true mean moves closer to μ0 (and becomes more difficult to distinguish), the power of the test understandably diminishes. If μ = 25.5, for example, the graph shows that 1 − β falls to 0.29.

Power curves are also useful for comparing one inference procedure with another. For every conceivable hypothesis testing situation, a variety of procedures for choosing between H0 and H1 will be available. How do we know which to use? The answer to that question is not always simple. Some procedures will be computationally more convenient or easier to explain than others; some will make slightly different assumptions about the pdf being sampled. Associated with each of them, though, is a power curve. If the selection of a hypothesis test is to hinge solely on its ability to distinguish H0 from H1, then the procedure to choose is the one having the steepest power curve. Figure 6.4.5 shows the power curves for two hypothetical methods A and B, each of which is testing H0: θ = θ0 versus H1: θ ≠ θ0 at the α level of significance. From the standpoint of power, Method B is clearly the better of the two—it always has a higher probability of correctly rejecting H0 when the parameter θ is not equal to θ0.

[Figure 6.4.5: Power curves, 1 − β versus θ, for two competing procedures, Method A and Method B, each with level α at θ = θ0.]

Factors That Influence the Power of a Test

The ability of a test procedure to reject H0 when H0 is false is clearly of prime importance, a fact that raises an obvious question: What can an experimenter do to influence the value of 1 − β? In the case of the Z test described in Theorem 6.2.1, 1 − β is a function of α, σ, and n. By appropriately raising or lowering the values of those parameters, the power of the test against any given μ can be made to equal any desired level.

The Effect of α on 1 − β

Consider again the test of H0: μ = 25.0 versus H1: μ > 25.0 discussed earlier in this section. In its original form, α = 0.05, σ = 2.4, n = 30, and the decision rule called for H0 to be rejected if ȳ ≥ 25.718. Figure 6.4.6 shows what happens to 1 − β (when μ = 25.75) if σ, n, and μ are held constant but α is increased to 0.10. The top pair of distributions shows the configuration that appears in Figure 6.4.2; the power in this case is 1 − 0.4721, or 0.53. The bottom portion of the graph illustrates what happens when α is set at 0.10 instead of 0.05—the decision rule changes from "Reject H0 if ȳ ≥ 25.718" to "Reject H0 if ȳ ≥ 25.561" (see Question 6.4.2) and the power increases from 0.53 to 0.67:

$$
\begin{aligned}
1 - \beta &= P(\text{Reject } H_0 \mid H_1 \text{ is true}) = P(\bar{Y} \ge 25.561 \mid \mu = 25.75) \\
&= P\left(\frac{\bar{Y} - 25.75}{2.4/\sqrt{30}} \ge \frac{25.561 - 25.75}{2.4/\sqrt{30}}\right) \\
&= P(Z \ge -0.43) = 0.6664
\end{aligned}
$$

[Figure 6.4.6: Two panels showing the sampling distributions of Ȳ when H0 is true (μ = 25.0) and when μ = 25.75. Top panel: α = 0.05, critical value 25.718, β = 0.4721, power = 0.53. Bottom panel: α = 0.10, critical value 25.561, β = 0.3336, power = 0.67.]

The specifics of Figure 6.4.6 accurately reflect what is true in general: Increasing α decreases β and increases the power. That said, it does not follow in practice that experimenters should manipulate α to achieve a desired 1 − β. For all the reasons cited in Section 6.2, α should typically be set equal to a number somewhere in the neighborhood of 0.05. If the corresponding 1 − β against a particular μ is deemed to be inappropriate, adjustments should be made in the values of σ and/or n.
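The effect of α on the power is also easy to reproduce numerically. The sketch below (again assuming Python with scipy; the variable names are purely illustrative) recomputes the critical value and the power against μ = 25.75 for α = 0.05 and α = 0.10, using the same σ = 2.4 and n = 30 as the fuel-additive example.

```python
from scipy.stats import norm
import math

mu0, mu1 = 25.0, 25.75        # H0 mean and the presumed H1 mean
sigma, n = 2.4, 30
se = sigma / math.sqrt(n)     # standard error of the sample mean

for alpha in (0.05, 0.10):
    y_star = mu0 + norm.isf(alpha) * se    # critical value: P(Ybar >= y_star | mu = mu0) = alpha
    power = norm.sf((y_star - mu1) / se)   # 1 - beta = P(Ybar >= y_star | mu = mu1)
    print(f"alpha = {alpha:.2f}: critical value = {y_star:.3f}, power = {power:.2f}")

# Expected output (approximately):
# alpha = 0.05: critical value = 25.721, power = 0.53
# alpha = 0.10: critical value = 25.562, power = 0.67
```

The critical values agree with 25.718 and 25.561 to within rounding, since the text rounds the z cutoffs to 1.64 and 1.28.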
The Effects of σ and n on 1 − β

Although it may not always be feasible (or even possible), decreasing σ will necessarily increase 1 − β. In the gasoline additive example, σ is assumed to be 2.4 mpg, the latter being a measure of the variation in gas mileages from driver to driver achieved in a cross-country road trip from Boston to Los Angeles (recall p. 351). Intuitively, the environmental differences inherent in a trip of that magnitude would be considerable. Different drivers would encounter different weather conditions and varying amounts of traffic, and would perhaps take alternate routes.

Suppose, instead, that the drivers simply did laps around a test track rather than drive on actual highways. Conditions from driver to driver would then be much more uniform and the value of σ would surely be smaller. What would be the effect on 1 − β when μ = 25.75 (and α = 0.05) if σ could be reduced from 2.4 mpg to 1.2 mpg?

As Figure 6.4.7 shows, reducing σ has the effect of making the H0 distribution of Ȳ more concentrated around μ0 (= 25) and the H1 distribution of Ȳ more concentrated around μ (= 25.75). Substituting into Equation 6.2.1 (with 1.2 for σ in place of 2.4), we find that the critical value ȳ* moves closer to μ0 [from 25.718 to $25.359 = 25 + 1.64 \cdot \frac{1.2}{\sqrt{30}}$], and the proportion of the H1 distribution above the rejection region (i.e., the power) increases from 0.53 to 0.96:

$$
1 - \beta = P(\bar{Y} \ge 25.359 \mid \mu = 25.75) = P\left(Z \ge \frac{25.359 - 25.75}{1.2/\sqrt{30}}\right) = P(Z \ge -1.78) = 0.9625
$$

[Figure 6.4.7: Two panels showing the sampling distributions of Ȳ when H0 is true (μ = 25.0) and when μ = 25.75, both with α = 0.05. Top panel (σ = 2.4): critical value 25.718, β = 0.4721, power = 0.53. Bottom panel (σ = 1.2): critical value 25.359, β = 0.0375, power = 0.96.]

In theory, reducing σ can be a very effective way of increasing the power of a test, as Figure 6.4.7 makes abundantly clear. In practice, though, refinements in the way data are collected that would have a substantial impact on the magnitude of σ are often either difficult to identify or prohibitively expensive. More typically, experimenters achieve the same effect by simply increasing the sample size.

Look again at the two sets of distributions in Figure 6.4.7. The increase in 1 − β from 0.53 to 0.96 was accomplished by cutting the denominator of the test statistic $z = \frac{\bar{y} - 25}{\sigma/\sqrt{30}}$ in half by reducing the standard deviation from 2.4 to 1.2. The same numerical effect would be produced if σ were left unchanged but n was increased from 30 to 120—that is, $\frac{1.2}{\sqrt{30}} = \frac{2.4}{\sqrt{120}}$. Because it can easily be increased or decreased, the sample size is the parameter that researchers almost invariably turn to as the mechanism for ensuring that a hypothesis test will have a sufficiently high power against a given alternative.

Example 6.4.1

Suppose an experimenter wishes to test H0: μ = 100 versus H1: μ > 100 at the α = 0.05 level of significance and wants 1 − β to equal 0.60 when μ = 103. What is the smallest (i.e., cheapest) sample size that will achieve that objective? Assume that the variable being measured is normally distributed with σ = 14.

Finding n, given values for α, 1 − β, σ, and μ, requires that two simultaneous equations be written for the critical value ȳ*, one in terms of the H0 distribution and the other in terms of the H1 distribution.
Setting the two equal will yield the minimum sample size that achieves the desired α and 1 − β.

Consider, first, the consequences of the level of significance being equal to 0.05. By definition,

$$
\begin{aligned}
\alpha &= P(\text{We reject } H_0 \mid H_0 \text{ is true}) = P(\bar{Y} \ge \bar{y}^* \mid \mu = 100) \\
&= P\left(\frac{\bar{Y} - 100}{14/\sqrt{n}} \ge \frac{\bar{y}^* - 100}{14/\sqrt{n}}\right) = P\left(Z \ge \frac{\bar{y}^* - 100}{14/\sqrt{n}}\right) = 0.05
\end{aligned}
$$

But P(Z ≥ 1.64) = 0.05, so

$$\frac{\bar{y}^* - 100}{14/\sqrt{n}} = 1.64$$

or, equivalently,

$$\bar{y}^* = 100 + 1.64 \cdot \frac{14}{\sqrt{n}} \tag{6.4.1}$$

Similarly,

$$
\begin{aligned}
1 - \beta &= P(\text{We reject } H_0 \mid H_1 \text{ is true}) = P(\bar{Y} \ge \bar{y}^* \mid \mu = 103) \\
&= P\left(\frac{\bar{Y} - 103}{14/\sqrt{n}} \ge \frac{\bar{y}^* - 103}{14/\sqrt{n}}\right) = 0.60
\end{aligned}
$$

From Appendix Table A.1, though, P(Z ≥ −0.25) = 0.5987 ≈ 0.60, so

$$\frac{\bar{y}^* - 103}{14/\sqrt{n}} = -0.25$$

which implies that

$$\bar{y}^* = 103 - 0.25 \cdot \frac{14}{\sqrt{n}} \tag{6.4.2}$$

It follows, then, from Equations 6.4.1 and 6.4.2 that

$$100 + 1.64 \cdot \frac{14}{\sqrt{n}} = 103 - 0.25 \cdot \frac{14}{\sqrt{n}}$$

Solving for n gives $\sqrt{n} = \frac{(1.64 + 0.25)(14)}{3} = 8.82$, or n = 77.8, so a minimum of seventy-eight observations must be taken to guarantee that the hypothesis test will have the desired precision.

Decision Rules for Nonnormal Data

Our discussion of hypothesis testing thus far has been confined to inferences involving either binomial data or normal data. Decision rules for other types of probability functions are rooted in the same basic principles. In general, to test H0: θ = θ0, where θ is the unknown parameter in a pdf $f_Y(y; \theta)$, we initially define the decision rule in terms of $\hat{\theta}$, where the latter is a sufficient statistic for θ. The corresponding critical region is the set of values of $\hat{\theta}$ least compatible with θ0 (but admissible under H1) whose total probability when H0 is true is α. In the case of testing H0: μ = μ0 versus H1: μ > μ0, for example, where the data are normally distributed, Ȳ is a sufficient statistic for μ, and the least likely values for the sample mean that are admissible under H1 are those for which ȳ ≥ ȳ*, where P(Ȳ ≥ ȳ* | H0 is true) = α.

Example 6.4.2

A random sample of size n = 8 is drawn from the uniform pdf, $f_Y(y; \theta) = 1/\theta$, 0 ≤ y ≤ θ, for the purpose of testing H0: θ = 2.0 versus H1: θ < 2.0 at the α = 0.10 level of significance. Suppose the decision rule is to be based on $Y_8$, the largest order statistic. What would be the probability of committing a Type II error when θ = 1.7?