CHAPTER 2 Sampling and Measurement 2.1 VARIABLES AND THEIR MEASUREMENT 2.2 RANDOMIZATION 2.3 SAMPLING VARIABILITY AND POTENTIAL BIAS 2.4 OTHER PROBABILITY SAMPLING METHODS* 2.5 CHAPTER SUMMARY To analyze social phenomena with a statistical analysis, descriptive methods summarize the data and inferential methods use sample data to make predictions about populations. In gathering data, we must decide which subjects to sample. Selecting a sample that is representative of the population is a primary topic of this chapter. Given a sample, we must convert our ideas about social phenomena into data through deciding what to measure and how to measure it. Developing ways to measure abstract concepts such as achievement, intelligence, and prejudice is one of the most challenging aspects of social research. A measure should have validity, describing what it is intended to measure and accurately reflecting the concept. It should also have reliability, being consistent in the sense that a subject will give the same response when asked again. Invalid or unreliable data-gathering instruments render statistical manipulations of the data meaningless. The first section of this chapter introduces definitions pertaining to measurement, such as types of data. The other sections discuss ways, good and bad, of selecting the sample. 2.1 VARIABLES AND THEIR MEASUREMENT Statistical methods help us determine the factors that explain variability among subjects. For instance, variation occurs from student to student in their college grade point average (GPA). What is responsible for that variability? The way those students vary in how much they study per week? in how much they watch TV per day? in their IQ? in their college board score? in their high school GPA? Variables Any characteristic we can measure for each subject is called a variable. The name reflects that values of the characteristic vary among subjects. Variable A variable is a characteristic that can vary in value among subjects in a sample or population. Different subjects may have different values of a variable. Examples of variables are income last year, number of siblings, whether employed, and gender. The values the variable can take form the measurement scale. For gender, for instance, the 11 12 Chapter 2 Sampling and Measurement measurement scale consists of the two labels, female and male. For number of siblings it is 0,1,2,3..... The valid statistical methods for a variable depend on its measurement scale. We treat a numerical-valued variable such as annual income differently than a variable with a measurement scale consisting of categories, such as (yes, no) for whether employed. We next present ways to classify variables. The first type refers to whether the measurement scale consists of categories or numbers. Another type refers to the number of levels in that scale. Quantitative and Categorical Variables A variable is called quantitative when the measurement scale lias numerical values. The values represent different magnitudes of the variable. Examples of quantitative variables are a subject's annual income, number of siblings, age, and number of years of education completed. A variable is called categorical when the measurement scale is a set of categories. For example, marital status, with categories (single, married, divorced, widowed), is categorical. For Canadians, the province of residence is categorical, with the categories Alberta, British Columbia, and so on. Other categorical variables are whether employed (yes, no), primary clothes shopping destination (local mall, local downtown, Internet, other), ravorite type of music (classical, country, folk, jazz, rock), religious affiliation (Protestant, Catholic, Jewish, Muslim, other, none), and political party preference. For categorical variables, distinct categories differ in quality, not in numerical magnitude. Categorical variables are often called qualitative. We distinguish between categorical and quantitative variables because different statistical methods apply to each type. Some methods apply to categorical variables and others apply to quantitative variables. For example, the average is a statistical summary for a quantitative variable, because it uses numerical values. It's possible to find the average for a quantitative variable such as income, but not for a categorical variable such as religious affiliation or favorite type of music. Nominal, Ordinal, and Interval Scales of Measurement For a quantitative variable, the possible numerical values are said to form an interval scale. Interval scales have a specific numerical distance or interval between each pair of levels. Annual income is usually measured on an interval scale. The interval between $40,000 and $30,000, for instance, equals $10,000. We can compare outcomes in terms of how much larger or how much smaller one is than the other. Categorical variables have two types of scales. For the categorical variables mentioned in the previous subsection, the categories are unordered. The scale does not have a "high" or "low" end. The categories are then said to form a nominal scale. For another example, a variable measuring primary mode of transportation to work might use the nominal scale with categories (automobile, bus, subway, bicycle, walk). Although the different categories are often called the levels of the scale, for a nominal variable no level is greater than or smaller than any other level. Names or labels such as "automobile" and "bus" for mode of transportation identify the categories but do not represent different magnitudes. By contrast, each possible value of a quantitative variable is greater than or less than any other possible value. A third type of scale falls, in a sense, between nominal and interval. It consists of categorical scales having a natural ordering of values. The levels form an ordinal scale. Examples are social class (upper, middle, lower), political philosophy (very liberal, slightly liberal, moderate, slightly conservative, very conservative), Section 2.1 Variables and Their Measurement 13 government spending on the environment (too little, about right, too much), and frequency of religious activity (never, less than once a month, about 1-3 times a month, every week, more than once a week). These scales are not nominal, because the categories are ordered. They are not interval, because there is no defined distance between levels. For example, a person categorized as very conservative is more conservative than a person categorized as slightly conservative, but there'is no numerical value for how much more conservative that person is. In summary, for ordinal variables the categories have a natural ordering, whereas for nominal variables the categories are unordered. The scales refer to the actual measurement and not to the phenomena themselves. Place of residence may indicate a geographic place name such as a county (nominal), the distance of that place from a point on the globe (interval), the size of the place (interval or ordinal), or other kinds of variables. Quantitative Aspects of Ordinal Data As we've discussed, levels of nominal scales are qualitative, varying in quality, not in quantity. Levels of interval scales are quantitative, varying in magnitude. The position of ordinal scales on the quantitative-qualitative classification is fuzzy. Because their scale is a set of categories, they are often analyzed using the same methods as nominal scales. But in many respects, ordinal scales more closely resemble interval scales. They possess an important quantitative feature: Each level has a greater or smaller magnitude than another level. Some statistical methods apply specifically to ordinal variables. Often, though, it's helpful to analyze ordinal scales by assigning numerical scores to categories. By treating ordinal variables as interval rather than nominal, we can use the more powerful methods available for quantitative variables. For example, course grades (such as A, B, C, D, E) are ordinal. But we treat them as interval when we assign numbers to the grades (such as 4, 3, 2, 1, 0) to compute a grade point average. Treating ordinal variables as interval requires good judgment in assigning scores. In doing this, you can conduct a "sensitivity analysis" by checking whether conclusions would differ in any significant way for other choices of the scores. Discrete and Continuous Variables One other way to classify a variable also helps determine which statistical methods are appropriate for it. This classification refers to the number of values in the measurement scale. Discrete and Continuous Variables A variable is discrete if its possible values form a set of separate numbers, such as 0, 1, 2, 3,----It is continuous if it can take an infinite continuum of possible real number values. Examples of discrete variables are the number of siblings and the number of visits to a physician last year. Any variable phrased as "the number of ..." is discrete, because it is possible to list its possible values {0,1,2,3,4,...}. Examples of continuous variables are height, weight, and the amount of time it takes to read a passage of a book. It is impossible to write down all the distinct potential values, since they form an interval of infinitely many values. The amount of time needed to read a book, for example, could take on the value 8,6294473... hours. Discrete variables have a basic unit of measurement that cannot be subdivided. For example, 2 and 3 are possible values for the number of siblings, but 2.5716 is 14 Chapter 2 Sampling and Measurement not. For a continuous variable, by contrast, between any two possible values there is always another possible value. For example, age is continuous in the sense that an individual does not age in discrete jumps. At some well-defined point during the year in which you age from 21 to 22, you are 21.3851 years old, and similarly for every other real number between 21 and 22. A continuous, infinite collection of age values occurs between 21 and 22 alone. Any variable with a finite number of possible values is discrete. All categorical variables, nominal or ordinal, are discrete, having a finite set of categories. Quantitative variables can be discrete or continuous; age is continuous, and number of siblings is discrete. For quantitative variables the distinction between discrete and continuous variables can be blurry, because of how variables are actually measured. In practice, we round continuous variables when measuring them, so the measurement is actually discrete. We say that an individual is 21 years old whenever that person's age is somewhere between 21 and 22. On the other hand, some variables, although discrete, have a very large number of possible values. In measuring annual family income in dollars, the potential values are 0, 1, 2, 3, ..., up to some very large value in many millions. What's the implication of this? Statistical methods for discrete variables are mainly used for quantitative variables that take relatively few values, such as the number of times a person has been married. Statistical methods for continuous variables are used for quantitative variables that can take lots of values, regardless of whether they are theoretically continuous or discrete. For example, statisticians treat variables such as age, income, and IQ as continuous. In summary, • Variables are either quantitative (numerical valued) or categorical. Quantitative variables are measured on an interval scale. Categorical variables with unordered categories have a nominal scale, and categorical variables with ordered categories have an ordinal scale. « Categorical variables (nominal or ordinal) are discrete. Quantitative variables can be either discrete or continuous. In practice, quantitative variables that can take lots of values are treated as continuous. Figure 2.1 summarizes the types of variables, in terms of the (quantitative, categorical), (nominal, ordinal, interval), and (continuous, discrete) classifications. Note: Ordinal ilain arc [railed someiiiutis as categorical unil Hnmatimca as i|iianlil!ilive FIGURE 2.1: 5ummary of Quantitative-Categorical, Nominal-Ordinal-lnterval, Continuous-Discrete Classifications Section 2.2 Randomization 15 .2 RANDOMIZATION Inferential statistical methods use sample statistics to make predictions about population parameters. The quality of the inferences depends on how Well the sample represents the population. This section introduces an important sampling method that incorporates randomization, the mechanism for achieving good sample representation. Simple Random Sampling Subjects of a population to be sampled could be individuals, families, schools, cities, hospitals, records of reported crimes, and so on. Simple random sampling is a method of sampling for which every possible sample has equal chance of selection. Let n denote the number of subjects in the sample, called the sample size. Simple Random Sample A simple random sample of n subjects from a population is one in which each possible sample of that size has the same probability (chance) of being selected. For instance, suppose a researcher administers a questionnaire to one randomly selected adult in each of several households. A particular household contains four adults—mother, father, aunt, and uncle—identified as M, F, A, and U. For a simple random sample of n = 1 adult, each of the four adults is equally likely to be interviewed. You could select one by placing the four names on four identical ballots and selecting one blindly from a hat. For a simple random sample of n = 2 adults, each possible sample of size two is equally likely. The six potential samples are (M, F), (M, A), (M, U), (F, A), (F, U), and (A, U), To select the sample, you blindly select two ballots from the hat. A simple random sample is often just called a random sample. The simple adjective is used to distinguish this type of sampling from more complex sampling schemes presented in Section 2.4 that also have elements of randomization. Why is it a good idea to use random sampling? Because everyone has the same chance of inclusion in the sample, so it provides fairness. This reduces the chance that the sample is seriously biased in some way, leading to inaccurate inferences about the population. Most inferential statistical methods assume randomization of the sort provided by random sampling. How to Select a Simple Random Sample To select a random sample, we need a list of all subjects in the population. This list is called the sampling frame. Suppose you plan to sample students at your school. The population is all students at the school. One possible sampling frame is the student directory. The most common method for selecting a random sample is to (1) number the subjects in the sampling frame, (2) generate a set of these numbers randomly, and (3) sample the subjects whose numbers were generated. Using random numbers to select the sample ensures that each subject has an equal chance of selection. Random Numbers_ Random numbers are numbers that are computer generated according to a scheme whereby each digit is equally likely to be any of the integers 0,1,2,...,9 and does not depend on the other digits generated.