STATISTICAL METHODS FOR CATEGORICAL DATA ANALYSIS: 2nd EDITION

DANIEL A. POWERS
Department of Sociology and Population Research Center, University of Texas at Austin, Austin, Texas, USA

YU XIE
Department of Sociology, Department of Statistics, and Institute for Social Research, University of Michigan, Ann Arbor, Michigan, USA

United Kingdom • North America • Japan • India • Malaysia • China

Chapter 1
Introduction

1.1. Why Categorical Data Analysis?

What is common about birth, marriage, schooling, employment, occupation, migration, divorce, and death? The answer: they are all categorical variables commonly studied in social science research. In fact, most observed outcomes in social science research are measured categorically. If you are a practicing social scientist, chances are good that you have studied a phenomenon involving a categorical variable. (This is true even if you have not used any special statistical method for handling categorical data.) If you are in a graduate program to become a social scientist, you will soon, if not already, encounter a categorical variable. Notice that even our statement of whether or not you have encountered a categorical variable in your career is itself a categorical measurement!

Statistical methods and techniques for categorical data analysis have undergone rapid development in the past 25 years or so. Their applications in applied research have become commonplace in recent years, due in large part to the availability of commercial software and inexpensive computing. Since some of the material is rather new and dispersed among several disciplines, we believe that there is a need for a systematic treatment of the subject in a single book. This book is aimed at helping applied social scientists use special tools that are well suited for analyzing categorical data. In this chapter, we will first define categorical variables and then introduce our approach to the subject.

1.1.1. Defining Categorical Variables

We define categorical variables as those variables that can be measured using only a limited number of values or categories. This definition distinguishes categorical variables from continuous variables, which, in principle, can assume an infinite number of values.

Although this definition of categorical variables is clear, its application to applied work is far more ambiguous. Many variables of long-lasting interest to social scientists are clearly categorical. Such variables include race, gender, immigration status, marital status, employment, birth, and death. However, conceptually continuous variables are sometimes treated as continuous and other times as categorical. When a continuous variable is treated as a categorical variable, it is called categorization or discretization of the continuous variable.

Categorization is often necessary in practice because either the substantive meaning or the actual measurement of a continuous variable is categorical. Age is a good example. Although conceptually continuous, age is often treated as categorical in actual research for substantive and practical reasons. Substantively, age serves as a proxy for qualitative states for some research purposes, qualitatively transforming an individual's status at certain key points. Changes in legal and social status occur first during the transition into adulthood and later during the transition out of the labor force. For practical reasons, age is usually reported in single-year or five-year intervals.¹ Indeed, our usual instruments in social science research are crude in the sense that they typically constrain possible responses to a limited number of possible values. It is for this reason that we earlier stated that most, if not all, observed outcomes in social science are categorical.

What variables should then be considered categorical as opposed to continuous in empirical research?
The answer depends on many factors, two of which are their substantive meaning in the theoretical model and their measurement precision. One requirement for treating a variable as categorical is that its values are repeated for at least a significant portion of the sample.² As will be shown later, the distinction between continuous and categorical variables is far more consequential for response variables than for explanatory variables.

1. [...] "high-school diploma," "college degree," or "graduate degree" cannot be captured without categorization. A few categories offer a concise representation of the important points in the distribution of education.
2. Note that a continuous variable can be truncated, meaning that it has zero probability of yielding a value beyond a particular threshold or cut-off point. When a continuous variable is truncated, the untruncated part is still continuous, whereas the part that is truncated resembles a categorical variable.

1.1.2. Dependent and Independent Variables

A dependent (also called response, outcome, or endogenous) variable represents a population characteristic of interest being explained in a study. Independent (also called explanatory, predetermined, or exogenous) variables are variables that are used to explain the variation in the dependent variable. Typically, the characteristic of interest is the population mean of the dependent variable (or its transformation) conditioned on values of an independent variable or set of independent variables. It is in this sense that we mean that the dependent variable depends on, is explained by, or is a function of independent variables in regression-type statistical models.

By regression-type statistical models, we mean models that predict either the expected value of the dependent variable or some other characteristic of the dependent variable, as a regression function of independent variables. Although in principle we could design our models to best predict any population parameter (e.g., the median) of the dependent variable or its transformation, in practice we commonly use the term regression to denote the problem of predicting conditional means. When the regression function is a linear combination of independent variables, we have so-called linear regressions, which are widely used for continuous dependent variables.

1.1.3. Categorical Dependent Variables

Although categorical and continuous variables share many properties in common, we wish to highlight some of the differences here. The distinction between categorical and continuous variables as dependent variables requires special attention. In contrast, the distinction is of relatively minor significance when they are used as independent variables in regression-type statistical models. Our definition of regression-type statistical models includes statistical methods for the analysis of variance and covariance, which can be represented by regressing the dependent variable on a set of dummy variables and, in the case of the analysis of covariance, other continuous covariates. Hence, including categorical variables as independent variables in regression-type models does not present any particular difficulties, as it mainly involves constructing dummy variables corresponding to different categories of the independent variable; all known properties of regression models are directly generalizable to models for the analysis of variance and covariance.

As we will show later in this book, the situation changes drastically when we treat categorical variables as dependent variables, as much of our knowledge derived from linear regressions is simply inapplicable. In brief, special statistical methods are required for categorical data analysis (i.e., analysis involving categorical dependent variables).
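The dummy-variable construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' procedure: the marital-status sample, the outcome values, and the helper names `dummy_code` and `ols` are all hypothetical, and the small normal-equations solver stands in for whatever regression software one would normally use. It shows that regressing a continuous outcome on dummy variables reproduces the group means, which is the sense in which the analysis of variance is a regression-type model.

```python
def dummy_code(values, reference):
    """Return (levels, rows): one 0/1 indicator per non-reference category."""
    levels = sorted(set(values) - {reference})
    rows = [[1.0 if v == level else 0.0 for level in levels] for v in values]
    return levels, rows


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [a / A[col][col] for a in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Hypothetical sample: a nominal independent variable and a continuous outcome.
marital = ["never", "married", "married", "divorced", "never", "divorced"]
outcome = [2.0, 5.0, 7.0, 4.0, 4.0, 6.0]

# "never" is the (arbitrarily chosen) reference category.
levels, dummies = dummy_code(marital, reference="never")
X = [[1.0] + row for row in dummies]  # prepend the intercept column
coefs = ols(X, outcome)

# The intercept is the reference-group mean; each dummy coefficient is the
# difference between that category's mean and the reference-group mean.
for name, b in zip(["intercept"] + levels, coefs):
    print(f"{name:>10}: {b:.3f}")
```

With one dummy per non-reference category, the fitted values are simply the category means, so nothing about linear regression needs to change when the categorical variable is on the right-hand side.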
Although the methods for analyzing categorical variables as independent variables in regression-type models have been a part of the standard statistical knowledge base that is now required for most advanced degrees in social science, methods for the analysis of categorical dependent variables are much less widely known. Much of the fundamental research on the methodology of analyzing categorical data has been developed only recently. We aim to give a systematic treatment of several important topics on categorical data analysis in this book so as to facilitate the integration of the material into social science research.

Unlike methods for continuous variables, methods for categorical data require close attention to the type of measurement of the dependent variable. Methods for analyzing one type of categorical dependent variable may be inappropriate for analyzing another type of variable.

1.1.4. Types of Measurement

The type of measurement plays a key role in determining the appropriate method of analysis when a variable is used as a dependent variable. We present a typology for four types of measurement based on three distinctions.³ First, let us distinguish between quantitative and qualitative measurements. The distinction between the two is that quantitative measurements closely index the substantive meanings of a variable with numerical values, whereas numerical values for qualitative measurements are substantively less meaningful, sometimes serving merely as classifications to denote mutually exclusive categories of characteristics (or attributes) uniquely. Qualitative variables are categorical variables.

Within the class of quantitative variables, it is often useful to distinguish further between continuous and discrete variables. Continuous variables, also called interval variables, may assume any real value.
Variables such as income and socioeconomic status are typically treated as continuous over their plausible range of values. Discrete variables may assume only integer values and often represent event counts. Variables such as the number of children per family, the number of delinquent acts committed by a juvenile, and the number of accidents per year at a particular intersection are examples of discrete variables. According to our earlier definition, discrete (but quantitative) variables are also categorical variables.

Qualitative measurements can be further distinguished between ordinal and nominal. Ordinal measurements give rise to ordered qualitative variables, or ordinal variables. It is quite common to use numerical values to denote the ordering information in an ordered qualitative variable. However, numerical values corresponding to categories of ordinal variables reflect only the ranking order in a particular attribute; therefore, distances between two adjacent values are not the same. Attitudes toward gun control (strongly approve, approve, neutral, disapprove, and strongly disapprove), occupational skill level (highly skilled, medium skilled, low skilled, and unskilled), and the classification of levels of education as (grade school, high school, college, and graduate) are examples of ordinal variables.

Nominal measurements yield unordered qualitative variables, often referred to as nominal variables. Nominal variables possess no inherent ordering, nor numerical distance, between category levels. Classifications of race and ethnicity (white, black, Hispanic, and other), gender (male and female), and marital status (never married, married, divorced, and widowed) are examples of unordered qualitative variables.

It is worth noting at this point, however, that the distinction between ordinal and nominal variables is not always clear-cut. Much of the distinction depends on the research questions.
The same variable may be ordinal for some researchers but nominal for others. To further illustrate the last point, let us use occupation as an example. Distinct occupations are often measured by open-ended questions and then manually coded into a classification system with three-digit numerical codes that do not represent magnitudes in substantive dimensions. Since the number of potential occupations is large (usually at least a few hundred in a coding scheme for a modern society), it is desirable, and indeed necessary, to reduce the amount of detail in an occupational measure through data reduction. One method of data reduction is to collapse detailed occupational codes into major occupational categories and treat them as constituting either an ordinal or even a nominal measurement (Duncan, 1979; Hauser, 1978). Another method of data reduction is to scale occupations along the dimension of a socioeconomic index (SEI) (Duncan, 1961), and thus into an interval variable. More recently, Hauser and Warren (1997) challenged Duncan's approach and suggested instead that to measure occupational socioeconomic status, occupations are best scaled into two separate dimensions of occupational income and occupational education. Hauser and Warren's work illustrates the importance of considering multiple dimensions when nominal measures are scaled into interval measures.

3. For an historical background, see Duncan's (1984) important book Notes on Social Measurement.

Figure 1.1: Typology of the four types of measurements. [Diagram: quantitative measurements divide into continuous and discrete; qualitative measurements divide into ordinal and nominal; discrete, ordinal, and nominal are grouped as categorical.]

Figure 1.1 summarizes our typology scheme for the four types of measurements. According to this typology, there are three types of categorical variables: discrete, ordinal, and nominal, all of which will be discussed in this book. This distinction among the three types of categorical variables is useful only when the number of possible values equals or exceeds three.
When the number of possible values is two, we have a special case called a binary variable. A binary variable can be discrete, ordinal, or nominal, depending on the researcher's interpretation. For example, if a researcher is interested in studying compliance with the one-child policy in China, the dependent variable is whether a couple has given birth to more than one child. For simplicity, assume that in a particular sample a woman has at least one child and no more than two children. Let us code y so that y = 0 if a woman has one child, and y = 1 if she has two children. In this case, the dependent variable can be interpreted as discrete (number of children − 1), ordinal (one child or more than one child), or nominal (compliance vs. noncompliance). Fortunately, the researcher may apply the same statistical methods for all three cases. It is the substantive understanding of the results that varies from one interpretation to another.

1.2. Two Philosophies of Categorical Data

The development of methods for the analysis of categorical data has benefitted greatly from contributions by scholars in such diverse fields as statistics, biostatistics, economics, psychology, and sociology. This multidisciplinary origin has given categorical data analysis multiple approaches to similar problems and multiple interpretations for similar methodologies. As a result, categorical data analysis is an intellectually rich and expanding field. However, this interdisciplinary nature has also made synthesizing and consolidating available techniques difficult due to the diverse applications and differing terminology across disciplines.

Part of this difficulty stems from two fundamentally different "philosophies" concerning the nature of categorical data. One philosophy views categorical variables as being inherently categorical and relies on transformations of the data to derive regression-type models.
The other philosophy presumes that categorical variables are conceptually continuous but are observed, or measured, as categorical. In the one-child policy example, a researcher may view "compliance" as a behavioral continuum. However, he/she can only observe two distinct values of this dependent variable. This approach relies on latent variables to derive regression-type models.

These very different philosophies can be traced back to the acrimonious debate between Karl Pearson and G. Udny Yule between 1904 and 1913 (Agresti, 2001, pp. 619-622). Although these two approaches can be found in any single discipline, the first is more closely identified with statistics and biostatistics, and the second with econometrics and psychometrics. For simplicity, we will refer to the first approach as statistical or transformational and to the second as econometric or latent variable. We intend the terms statistical and econometric here as short-hand labels rather than as descriptions of the two disciplines.

1.2.1. The Transformational Approach

In the transformational, or statistical, approach, categorical data are considered as inherently categorical and should be modeled as such. In this approach, there is a direct one-to-one correspondence between population parameters of interest and sample statistics. The focus is on estimating population parameters that correspond to their sample analogs. No latent, or unobserved, variable is invoked.

In the transformational approach, statistical modeling means that the expected value of the categorical dependent variable, after some transformation, is expressed as a linear function of the independent variables. Given the categorical nature of the dependent variable, the regression function cannot be linear. The problem of nonlinearity is handled through nonlinear functions that transform the expected value of the categorical variable into a linear function of the independent variables.
Such transformation functions are now commonly referred to as link functions.⁴ For example, in the analysis of discrete (count) data, the expected frequencies (or cell counts) must be nonnegative. To ensure that the predicted values from regression models fit these constraints, the natural logarithm function (or log link) is used to transform the expected value of the dependent variable so that a model for the logged count can be expressed as a linear function of independent variables. This loglinear transformation serves two purposes: it ensures that the fitted values are appropriate for count data (i.e., nonnegative), and it permits the unknown regression parameters to lie within the entire real space (parameter space).

4. Models that can be transformed to linear models via link functions are referred to as generalized linear models. McCullagh and Nelder (1989) provide an extensive treatment of these types of models.

In binomial response models, estimated probabilities must lie in the interval [0,1], a range that is violated by any linear function if independent variables are allowed to vary freely. Instead of directly modeling probabilities in this range, we can model a transformation of probability that lies in the interval $(-\infty, +\infty)$. There are a number of ways to transform probabilities. The logit transformation, $\log[p/(1-p)]$, can be used to transform the probability scale so that it can be expressed as a linear function of independent variables. A probit transformation, $\Phi^{-1}(p)$, can be used in a similar fashion to re-scale probabilities. The probit link utilizes the inverse of the cumulative standard normal distribution function to transform the expected probability to the range $(-\infty, +\infty)$ (i.e., by transforming probabilities to Z-scores). As in the logit model, the probit link transforms the probability so that it can be expressed as a linear function of independent variables.
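The three link functions just described can be sketched with the Python standard library alone. This is an illustrative sketch rather than code from the book: the function names and the grid of linear-predictor values are assumptions chosen for the example, and `statistics.NormalDist` (Python 3.8 or later) supplies the standard normal distribution function and its inverse.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def logit(p):
    """Logit link: maps a probability in (0, 1) to the whole real line."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Inverse logit: maps any real linear predictor back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def probit(p):
    """Probit link: the inverse standard normal CDF, i.e. a Z-score."""
    return _STD_NORMAL.inv_cdf(p)


def inv_probit(eta):
    """Inverse probit: the standard normal CDF."""
    return _STD_NORMAL.cdf(eta)


def inv_log(eta):
    """Inverse of the log link: an expected count, always positive."""
    return math.exp(eta)


# Hypothetical values of a linear predictor (any real number is allowed).
etas = [-5.0, -1.0, 0.0, 1.0, 5.0]

print("   eta   exp(eta)  inv_logit  inv_probit")
for eta in etas:
    print(f"{eta:6.1f} {inv_log(eta):10.4f} {inv_logit(eta):10.6f} "
          f"{inv_probit(eta):11.6f}")
```

Whatever value the linear predictor takes, the back-transformed quantities stay in their admissible ranges: positive for the log link and strictly between 0 and 1 for the logit and probit links.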
Both the logit and probit transformations ensure that the predicted probabilities are in the proper range for all possible values of parameters and independent variables.

1.2.2. The Latent Variable Approach

The latent variable, or econometric, approach provides a somewhat different view of categorical data. The key to this approach is to assume the existence of a continuous unobserved or latent variable underlying an observed categorical variable. When the latent variable crosses a threshold, the observed categorical variable takes on a different value. According to the latent variable approach, what makes categorical variables different from usual continuously distributed variables is partial observability. That is, we can infer from observed categorical values only the intervals within which latent variables lie but not the actual values themselves. For this reason, econometricians commonly refer to categorical variables as limited-dependent variables (Maddala, 1983).

In the latent variable approach, the researcher's theoretical interest lies more in how independent variables affect the latent continuous variables (called structural analysis) than in how independent variables affect the observed categorical variable. From the latent variable perspective, it is thus convenient to think of the sample data as actual realizations of population quantities that are unobservable. For instance, the observed response categories may reflect the actual choices made by individuals in a sample, but underlying each choice at the population level is a latent variable representing the difference between the cost and the benefit of a particular choice made by an individual decision maker. Similarly, a binary variable may be thought of as the sample realization of a continuous variable representing an unobserved