Your Guide to a Powerful, State-of-the-Art Statistical Program. Now updated for use with Version 9!

For students and practicing researchers alike, Statistics with Stata opens the door to full use of the popular Stata program — a fast, flexible, and easy-to-use environment for data management and statistical analysis. Now integrating Stata's impressive new graphics, this comprehensive book presents hundreds of examples showing how you can apply Stata to accomplish a wide variety of tasks. Like Stata itself, Statistics with Stata will make it easier for you to move fluidly through the world of modern data analysis. Its contents include:

■ A complete chapter on database management, including sections on how to create, translate, update, or restructure datasets.
■ A detailed, example-based introduction to the new graphical capabilities of Stata. Topics range from simple histograms and time plots to regression diagnostics and quality control charts. New sections describe methods to combine or enhance graphs for publication.
■ Basic statistical tools, including tables, parametric tests, chi-square and other nonparametric tests, t tests, ANOVA/ANCOVA, correlation, linear regression, and multiple regression.
■ Advanced methods, including nonlinear, robust, and quantile regression; logit, multinomial logit, and other models for categorical dependent variables; survival and event-count analysis; generalized linear modeling (GLM); factor analysis; and cluster analysis — all demonstrated through practical, easy-to-follow examples with an emphasis on interpretation.
■ Guidelines for writing your own programs in Stata — user-written programs allow creation of powerful new tools for database management and statistical analysis, and support computation-intensive methods such as bootstrapping and Monte Carlo simulation.

Data files are available at http://www.duxbury.com, the Duxbury Web site.

THOMSON BROOKS/COLE
Visit Brooks/Cole online at www.brookscole.com
For your learning solutions: www.thomson.com/learning

Statistics with STATA
Updated for Version 9

Lawrence C. Hamilton
University of New Hampshire

DUXBURY
THOMSON BROOKS/COLE
Australia • Brazil • Canada • Mexico • Singapore • Spain • United Kingdom • United States

Statistics with Stata: Updated for Version 9
Lawrence C. Hamilton

Publisher: Curt Hinrichs
Senior Assistant Editor: Ann Day
Editorial Assistant: Daniel Geller
Technology Project Manager: Fiona Chong
Marketing Manager: Joe Rogove
Marketing Assistant: Brian Smith
Executive Marketing Communications Manager: Darlene Amidon-Brent

© 2006 Duxbury, an imprint of Thomson Brooks/Cole, a part of The Thomson Corporation. Thomson, the Star logo, and Brooks/Cole are trademarks used herein under license.

ALL RIGHTS RESERVED. No part of this work covered by the copyright hereon may be reproduced or used in any form or by any means — graphic, electronic, or mechanical, including photocopying, recording, taping, web distribution, information storage and retrieval systems, or in any other manner — without the written permission of the publisher.

Printed in Canada
1 2 3 4 5 6 7   09 08 07 06 05

For more information about our products, contact us at:
Thomson Learning Academic Resource Center
1-800-423-0563

For permission to use material from this text or product, submit a request online at http://www.thomsonrights.com.
Any additional questions about permissions can be submitted by email to thomsonrights@thomson.com.

Library of Congress Control Number: 2005933358
ISBN 0-495-10972-X

Project Manager, Editorial Production: Kelsey McGee
Creative Director: Rob Hugel
Art Director: Lee Friedman
Print Buyer: Darlene Suruki
Permissions Editor: Kiely Sisk
Cover Designer: Denise Davidson/Simple Design
Cover Image: © Imtek Imagineering/Masterfile
Cover Printing, Printing & Binding: Webcom Limited

Thomson Higher Education
10 Davis Drive
Belmont, CA 94002-3098
USA

Asia (including India): Thomson Learning, 5 Shenton Way #01-01, UIC Building, Singapore 068808
Australia/New Zealand: Thomson Learning Australia, 102 Dodds Street, Southbank, Victoria 3006, Australia
Canada: Thomson Nelson, 1120 Birchmount Road, Toronto, Ontario M1K 5G4, Canada
UK/Europe/Middle East/Africa: Thomson Learning, High Holborn House, 50-51 Bedford Row, London WC1R 4LR, United Kingdom
Latin America: Thomson Learning, Seneca, 53, Colonia Polanco, 11560 Mexico D.F., Mexico

Contents

Preface

1  Stata and Stata Resources
   A Typographical Note
   An Example Stata Session
   Stata's Documentation and Help Files
   Searching for Information
   Stata Corporation
   Statalist
   The Stata Journal
   Books Using Stata

2  Data Management
   Example Commands
   Creating a New Dataset
   Specifying Subsets of the Data: in and if Qualifiers
   Generating and Replacing Variables
   Using Functions
   Converting between Numeric and String Formats
   Creating New Categorical and Ordinal Variables
   Using Explicit Subscripts with Variables
   Importing Data from Other Programs
   Combining Two or More Stata Files
   Transposing, Reshaping, or Collapsing Data
   Weighting Observations
   Creating Random Data and Random Samples
   Writing Programs for Data Management
   Managing Memory

3  Graphs
   Example Commands
   Histograms
   Scatterplots
   Line Plots
   Connected-Line Plots
   Other Twoway Plot Types
   Box Plots
   Pie Charts
   Bar Charts
   Dot Plots
   Symmetry and Quantile Plots
   Quality Control Graphs
   Adding Text to Graphs
   Overlaying Multiple Twoway Plots
   Graphing with Do-Files
   Retrieving and Combining Graphs

4  Summary Statistics and Tables
   Example Commands
   Summary Statistics for Measurement Variables
   Exploratory Data Analysis
   Normality Tests and Transformations
   Frequency Tables and Two-Way Cross-Tabulations
   Multiple Tables and Multi-Way Cross-Tabulations
   Tables of Means, Medians, and Other Summary Statistics
   Using Frequency Weights

5  ANOVA and Other Comparison Methods
   Example Commands
   One-Sample Tests
   Two-Sample Tests
   One-Way Analysis of Variance (ANOVA)
   Two- and N-Way Analysis of Variance
   Analysis of Covariance (ANCOVA)
   Predicted Values and Error-Bar Charts

6  Linear Regression Analysis
   Example Commands
   The Regression Table
   Multiple Regression
   Predicted Values and Residuals
   Basic Graphs for Regression
   Correlations
   Hypothesis Tests
   Dummy Variables
   Automatic Categorical-Variable Indicators and Interactions
   Stepwise Regression
   Polynomial Regression
   Panel Data

7  Regression Diagnostics
   Example Commands
   SAT Score Regression, Revisited
   Diagnostic Plots
   Diagnostic Case Statistics
   Multicollinearity

8  Fitting Curves
   Example Commands
   Band Regression
   Lowess Smoothing
   Regression with Transformed Variables — 1
   Regression with Transformed Variables — 2
   Conditional Effect Plots
   Nonlinear Regression — 1
   Nonlinear Regression — 2

9  Robust Regression
   Example Commands
   Regression with Ideal Data
   Y Outliers
   X Outliers (Leverage)
   Asymmetrical Error Distributions
   Robust Analysis of Variance
   Further rreg and qreg Applications
   Robust Estimates of Variance — 1
   Robust Estimates of Variance — 2

10  Logistic Regression
   Example Commands
   Space Shuttle Data
   Using Logistic Regression
   Conditional Effect Plots
   Diagnostic Statistics and Plots
   Logistic Regression with Ordered-Category y
   Multinomial Logistic Regression

11  Survival and Event-Count Models
   Example Commands
   Survival-Time Data
   Count-Time Data
   Kaplan-Meier Survivor Functions
   Cox Proportional Hazard Models
   Exponential and Weibull Regression
   Poisson Regression
   Generalized Linear Models

12  Principal Components, Factor, and Cluster Analysis
   Example Commands
   Principal Components
   Rotation
   Factor Scores
   Principal Factoring
   Maximum-Likelihood Factoring
   Cluster Analysis — 1
   Cluster Analysis — 2

13  Time Series Analysis
   Example Commands
   Smoothing
   Further Time Plot Examples
   Lags, Leads, and Differences
   Correlograms
   ARIMA Models

14  Introduction to Programming
   Basic Concepts and Tools
   Example Program: Moving Autocorrelation
   Ado-File
   Help File
   Matrix Algebra
   Bootstrapping
   Monte Carlo Simulation

References

Index

Preface

Statistics with Stata is intended for students and practicing researchers, to bridge the gap between statistical textbooks and Stata's own documentation. In this intermediate role, it does not provide the detailed expositions of a proper textbook, nor does it come close to describing all of Stata's features. Instead, it demonstrates how to use Stata to accomplish a wide variety of statistical tasks. Chapter topics follow conceptual themes rather than focusing on particular Stata commands, which gives Statistics with Stata a different structure from the Stata reference manuals. The chapter on Data Management, for example, covers a variety of procedures for creating, updating, and restructuring data files. Chapters on Summary Statistics and Tables, ANOVA and Other Comparison Methods, and Fitting Curves, among others, have similarly broad themes that encompass a number of separate techniques.

The general topics of the first six chapters (through ordinary least squares regression) roughly parallel an introductory course in applied statistics, but with additional depth to cover practical issues often encountered by analysts — how to aggregate data, create dummy variables, draw publication-quality graphs, or translate ANOVA into regression, for instance. In Chapter 7 (Regression Diagnostics) and beyond, we move into the territory of advanced courses or original research. Here readers can find basic information and illustrations of how to obtain and interpret diagnostic statistics and graphs; perform robust, quantile, nonlinear, logit, ordered logit, multinomial logit, or Poisson regression; fit survival-time and event-count models; construct composite variables through factor analysis and principal components; divide observations into empirical types or clusters; and graph or model time-series data. Stata has worked hard in recent years to advance its state-of-the-art standing, and this effort is particularly apparent in the wide range of regression and model-fitting commands it now offers.

Finally, we conclude with a look at programming in Stata. Many readers will find that Stata does everything they need already, so they have no need to write original programs. For an active minority, however, programmability is one of Stata's principal attractions, and it certainly underlies Stata's currency and rapid advancement.
This chapter opens the door for new users to explore Stata programming, whether for specialized data management tasks, to establish a new statistical capability, for Monte Carlo experiments, or for teaching.

Generally similar versions ("flavors") of Stata run on Windows, Macintosh, and Unix computers. Across all platforms, Stata uses the same commands, data files, and output. The versions differ in some details of screen appearance, menus, and file handling, where Stata follows the conventions native to each platform — such as \directory\filename file specifications under Windows, in contrast with the /directory/filename specifications under Unix. Rather than display all three, I employ Windows conventions, but users with other systems should find that only minor translations are needed.

Notes on the Fifth Edition

I began using Stata in 1985, the first year of its release. (Stata's 20th anniversary in 2005 was marked by a special issue of the Stata Journal, 2005:5(1), filled with historical articles and interviews, including a brief history of Statistics with Stata.) Initially, Stata ran only on MS-DOS personal computers, but its PC orientation made it distinctly more modern than its main competitors — most of which had originated before the desktop revolution, in the 80-column punched-card Fortran environment of mainframes. Unlike mainframe statistical packages that believed each user was a stack of cards, Stata viewed the user as a conversation. Its interactive nature and integration of statistical procedures with data management and graphics supported the natural flow of analytical thought in ways that other programs did not. graph and predict soon became favorite commands. I was impressed enough by how it all fit together to start writing the original Statistics with Stata, published in 1989 for Stata version 2.

A great deal about Stata has changed since that book, in which I observed that "Stata is not a do-everything program .... The things it does, however, it does very well." The expansion of Stata's capabilities has been striking. This is very noticeable in the proliferation, and later in the steady rationalization, of model-fitting procedures. William Gould's architecture for Stata, with its programming tools and unified syntax, has aged well and proven able to incorporate new statistical ideas as these were developed. The formidable list of modeling commands that begins Chapter 10, or of functions in Chapter 2, illustrates some of the ways that Stata became richer over the years. Suites of new techniques such as those for panel (xt), survey (svy), time series (ts), or survival-time (st) data open worlds of possibility, as do programmable commands for nonlinear regression (nl) and general linear modeling (glm), or general procedures for maximum-likelihood estimation. Other critical expansions include the development of a matrix programming capability, and the wealth of new data-management features. Data management, with good reason, has been promoted from an incidental topic in the first Statistics with Stata to the second-longest chapter in this fifth edition.

Stata version 8 marked the most radical upgrade in Stata's history, led by the new menu system or GUI (graphical user interface), and completely redesigned graphing capabilities.
A limited menu system, evolved from the student program StataQuest, had been available as an option since version 4, but Stata 8 for the first time incorporated an integrated menu interface offering a full range of alternatives to typed commands. These menus are more easily learned through exploration than by reading a book, so Statistics with Stata provides only general suggestions about menus at the beginning of each chapter. For the most part, this book employs commands to show what Stata can do; those commands' menu counterparts should be easy to discover. The redesigned graphing capabilities of Stata 8 called for similarly sweeping changes in Chapter 3, turning it into the longest chapter in this edition. The topic itself is complex, as the thick Graphics Reference Manual (and other material scattered through the documentation) attests. Rather than try to condense the syntax-based reference manuals, I have taken a completely different, complementary approach based on examples. Chapter 3 thus provides an organized gallery of 49 diverse graphs, each with instructions for how it was drawn. Further examples appear throughout the book; even the last graphs in Chapter 14 demonstrate new variations. To an unexpected degree, Statistics with Stata became a showcase for the new graphics.

Less drastic but also noteworthy changes from the previous Statistics with Stata include new sections on panel data (Chapter 6), robust standard errors (Chapter 9), and cluster analysis (Chapter 12). The Example Commands sections have been revised and expanded in several chapters as well. Readers coming from the older Statistics with Stata 5 will find more here that has changed, including new emphasis on Internet resources (Chapter 1), tools for data management (Chapter 2), table commands (Chapter 4), graphs for ANOVA (Chapter 5), a sharper look at multicollinearity (Chapter 8), robust and median-based analogues to ANOVA (Chapter 9), conditional effects plots for multinomial logit (Chapter 10), generalized linear models (Chapter 11), a new chapter on time series (Chapter 13), and a rewritten chapter on programming (Chapter 14). Other new Stata features, or improvements in old commands (including graph and predict), are scattered throughout the book.

Because Stata now does so much, far beyond the scope of an introductory book, Statistics with Stata presents more procedures telegraphically in the "Example Commands" sections that begin most chapters, or in lists of options followed by advice that readers should "type help whatever" for details. Stata's online help and search features have advanced to keep pace with the program, so this is not idle advice. Beyond the help files are Stata's web site, Internet and documentation search capabilities, user-community listserver, NetCourses, the Stata Journal, and of course Stata's formidable printed documentation — presently over 5,000 pages and growing. Statistics with Stata provides an accessible gateway to Stata; these other resources can help you go further.

Acknowledgments

Stata's architect, William Gould, deserves credit for originating the elegant program that Statistics with Stata describes. My editor, Curt Hinrichs, backed by some reviewers and persistent readers, persuaded me to begin the effort of writing a new book. Pat Branton at Stata Corporation, along with many others including Shannon Driver and Lisa Gilmore, gave invaluable feedback, often on short deadlines, once this effort got underway.
It would not have been possible without their assistance. Alen Riley and Vince Wiggins responded quickly to my questions about graphics. James Hamilton contributed advice about time series. Leslie Hamilton proofread the final manuscript.

The book is built around data. To avoid endlessly recycling my own old examples, I drew on work by others including Carole Seyfrit, Sally Ward, Heather Turner, Rasmus Ole Rasmussen, Erich Buch, Paul Mayewski, Loren D. Meeker, and Dave Hamilton. Steve Selvin shared several of his examples from Practical Biostatistical Methods for Chapter 11. Data from agencies including Statistics Iceland, Statistics Greenland, Statistics Canada, the Northwest Atlantic Fisheries Organization, Greenland's Natural Resources Institute, and the Department of Fisheries and Oceans Canada can also be found among the examples. A presentation given by Brenda Topliss inspired the "gossip" programming example in Chapter 14. Other forms of encouragement or ideas came from Anna Kerttula, Richard Haedrich, Jeffrey Runge, Igor Belkin, James Morison, Oddmund Otterstad, James Tucker, and Cynthia M. Duncan.

Stata and Stata Resources

Stata is a full-featured statistical program for Windows, Macintosh, and Unix computers. It combines ease of use with speed, a library of pre-programmed analytical and data-management capabilities, and programmability that allows users to invent and add further capabilities as needed. Most operations can be accomplished either via the pull-down menu system, or more directly via typed commands. Menus help newcomers to learn Stata, and help anyone to apply an unfamiliar procedure. The consistent, intuitive syntax of Stata commands frees experienced users to work more efficiently, and also makes it straightforward to develop programs for complex or repetitious tasks. Menu and command instructions can be mixed as needed during a Stata session. Extensive help, search, and link features make it easy to look up command syntax and other information instantly, on the fly.

After introductory information, we'll begin with an example Stata session to give you a sense of the "flow" of data analysis, and of how analytical results might be used. Later chapters explain in more detail. Even without explanations, however, you can see how straightforward the commands are — use filename to retrieve dataset filename, summarize when you want summary statistics, correlate to get a correlation matrix, and so forth. Alternatively, the same results can be obtained by making choices from the Data or Statistics menus.

Stata users have available a variety of resources to help them learn about Stata and solve problems at any level of difficulty. These resources come not just from Stata Corporation, but also from an active community of users. Sections of this chapter introduce some key resources — Stata's online help and printed documentation; where to phone, fax, write, or e-mail for technical help; Stata's web site (www.stata.com), which provides many services including updates and answers to frequently asked questions; the Statalist Internet forum; and the refereed Stata Journal.

A Typographical Note

This book employs several typographical conventions as a visual cue to how words are used:

■ Commands typed by the user appear in a bold Courier font.
When the whole command line is given, it starts with a period, as seen in a Stata Results window or log (output) file:

  . list year boats men penalty

■ Variable or file names within these commands appear in italics to emphasize the fact that they are arbitrary and not a fixed part of the command.
■ Names of variables or files also appear in italics within the main text to distinguish them from ordinary words.
■ Items from Stata's menus are shown in an Arial font, with successive options separated by a dash. For example, we can open an existing dataset by selecting File - Open, and then finding and clicking on the name of the particular dataset. Note that some common menu actions can be accomplished either with text choices from Stata's top menu bar,

  File  Edit  Prefs  Data  Graphics  Statistics  User  Window  Help

or with the row of icons below these. For example, selecting File - Open is equivalent to clicking the leftmost icon, an opening file folder. One could also accomplish the same thing by typing a direct command of the form use filename.
■ Stata output as seen in the Results window is shown in a smaller Courier font. The small font allows Stata's 80-column output to fit within the margins of this book. Thus, we show the calculation of summary statistics for a variable named penalty as follows:

  . summarize penalty

      Variable |      Obs        Mean    Std. Dev.       Min        Max
  -------------+--------------------------------------------------------
       penalty |       10          63    59.59453         11        183

These typographic conventions exist only in this book, and not within the Stata program itself. Stata can display a variety of onscreen fonts, but it does not use italics in commands. Once Stata log files have been imported into a word processor, or a results table copied and pasted, you might want to format them in a Courier font, 10 point or smaller, so that columns will line up correctly.

In its commands and variable names, Stata is case sensitive. Thus, summarize is a command, but Summarize and SUMMARIZE are not. Penalty and penalty would be two different variables.

An Example Stata Session

As a preview showing Stata at work, this section retrieves and analyzes a previously-created dataset named lofoten.dta. Jentoft and Kristofferson (1989) originally published these data in an article about self-management among fishermen on Norway's arctic Lofoten Islands. There are 10 observations (years) and 5 variables, including penalty, a count of how many fishermen were cited each year for violating fisheries regulations.

If we might eventually want a record of our session, the best way to prepare for this is by opening a "log file" at the start. Log files contain commands and results tables, but not graphs. To begin a log file, click the scroll-shaped Begin Log icon, and specify a name and folder for the resulting log file. Alternatively, a log file could be started by choosing File - Log - Begin from the top menu bar, or by typing a direct command such as

  . log using monday1

Multiple ways of doing such things are common in Stata. Each has its own advantages, and each suits different situations or user tastes. Log files can be created either in a special Stata format (.smcl), or in ordinary text or ASCII format (.log). A .smcl ("Stata markup and control language") file will be nicely formatted for viewing or printing within Stata. It could also contain hyperlinks that help to understand commands or error messages. .log (text) files lack such formatting, but are simpler to use if you plan later to insert or edit the output in a word processor.
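The same choice of format can be made directly in the log using command, without the dialog. As a minimal sketch (the filename monday2 is arbitrary, not part of this chapter's session):

  . log using monday2.log

Spelling out the .log extension creates a plain-text log; typing log using monday2 with no extension creates a .smcl file instead. Either kind of log is closed with log close.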
After selecting which type of log file you want, click Save. For this session, we will create a .smcl log file named monday1.smcl.

An existing Stata-format dataset named lofoten.dta will be analyzed here. To open or retrieve this dataset, we again have several options: select File - Open - lofoten.dta using the top menu bar; click the open-folder icon and then lofoten.dta; or type the command use lofoten. Under its default Windows configuration, Stata looks for data files in the folder C:\data. If the file we want is in a different folder, we could specify its location in the use command,

  . use c:\books\sws8\chapter01\lofoten

or change the session's default folder by issuing a cd (change directory) command:

  . cd c:\books\sws8\chapter01\
  . use lofoten

Often, the simplest way to retrieve a file will be to choose File - Open and browse through folders in the usual way. To see a brief description of the dataset now in memory, type describe:

  . describe

  Contains data from C:\data\lofoten.dta
    obs:            10                          Jentoft & Kristofferson '89
   vars:             5                          30 Jun 2005
   size:           130 (99.5% of memory free)

                storage  display     value
  variable name   type   format      label      variable label
  -------------------------------------------------------------------------
  year            int    %8.0g                  Year
  boats           float  %9.0g                  Number of fishing boats
  men             float  %9.0g                  Number of fishermen
  penalty         int    %8.0g                  Number of penalties
  decade          byte   %8.0g       decade     Early 1970s or early 1980s
  -------------------------------------------------------------------------
  Sorted by:  decade

Many Stata commands can be abbreviated to their first few letters. For example, we could shorten describe to just the letter d. Using menus, the same table could be obtained by choosing Data - Describe data - Describe variables in memory - OK.

This dataset has only 10 observations and 5 variables, so we can easily list its contents by typing the command list (or the letter l; or Data - Describe data - List data - OK):

  . list

  [The output lists all 10 observations — the years 1971-1975 and 1981-1985 — with each year's values of boats, men, penalty, and decade.]

Analysis could begin with a table of means, standard deviations, minimum values, and maximum values (type summarize or su; or select Statistics - Summaries, tables, & tests - Summary statistics - Summary statistics - OK):

  . summarize

      Variable |      Obs        Mean    Std. Dev.       Min        Max
  -------------+--------------------------------------------------------
          year |       10        1978    5.477226       1971       1985
         boats |       10      1758.1    282.1328       1441       2065
           men |       10      4354.9    1055.577       2514       6794
       penalty |       10          63    59.59453         11        183
        decade |       10          .5    .5270463          0          1

To print results from the session so far, bring the Results window to the front by clicking on this window or on the Bring Results Window to Front button, and then click Print. To copy a table, commands, or other information from the Results window into a word processor, again make sure that the Results window is in front, drag the mouse to select the results you want, right-click the mouse, and then choose Copy Text from the mouse's menu. Finally, switch to your word processor and, at the desired insertion point, either right-click and Paste or click a "clipboard" icon on the word processor's menu bar.

Did the number of penalties for fishing violations change over the two decades covered by these data?
A table containing summary statistics for penalty at each value of decade shows that there were more penalties in the 1970s:

  . tabulate decade, sum(penalty)

   Early 1970s |
      or early |     Summary of Number of penalties
         1980s |        Mean   Std. Dev.       Freq.
  -------------+------------------------------------
         1970s |        96.2    67.41439           5
         1980s |        25.8    26.28117           5
  -------------+------------------------------------
         Total |          63    59.59453          10

The same table could be obtained through menus: Statistics - Summaries, tables, & tests - Tables - One/two-way table of summary statistics, then fill in decade as variable 1, and penalty as the variable to be summarized. Although menu choices are often straightforward to use, you can see that they tend to be more complicated to describe than the simple text commands. From this point on, we will focus primarily on the commands, mentioning menu alternatives only occasionally. Fully exploring the menus, and working out how to use them to accomplish the same tasks, will be left to the reader. For similar reasons, the Stata reference manuals likewise take a command-based approach.

Perhaps the number of penalties declined because fewer people were fishing in the 1980s. The number of penalties correlates strongly (r > .8) with the number of boats and fishermen:

  . correlate boats men penalty

               |    boats      men  penalty
  -------------+---------------------------
         boats |   1.0000
           men |   0.8746   1.0000
       penalty |   0.8259   0.9322   1.0000

A graph might help clarify these interrelationships. Figure 1.1 plots men and penalty against year, produced by the graph twoway connected command. In this example, we first ask for a twoway (two-variable) connected-line plot of men against year, using the left-hand y axis, yaxis(1). After the separator ||, we next ask for a connected-line plot of penalty against year, this time using the right-hand y axis, yaxis(2). The resulting graph visualizes the correspondence between the number of fishermen and the number of penalties over time.

  . graph twoway connected men year, yaxis(1)
      || connected penalty year, yaxis(2)

  Figure 1.1  [Connected-line plot of Number of fishermen (left y axis) and Number of penalties (right y axis) against Year, 1970-1985]

Because the years 1976 to 1980 are missing in these data, Figure 1.1 shows 1975 connected to 1981. For some purposes, we might hesitate to do this. Instead, we could either find the missing values or leave the gap unconnected by issuing a slightly more complicated set of commands.

To print this graph, click on the Graph window or on the Bring Graph Window to Front button, and then click the Print icon. To copy the graph directly into a word processor or other document, bring the Graph window to the front, right-click on the graph, and select Copy. Switch to your word processor, go to the desired insertion point, and issue an appropriate "paste" command such as Edit - Paste, Edit - Paste Special (Metafile), or click a "clipboard" icon (different word processors will handle this differently). To save the graph for future use, either right-click and Save, or select File - Save Graph from the top menu bar. The Save As Type submenu offers several different file formats to choose from. On a Windows system, the choices include

  Stata graph (*.gph)   (A "live" graph, containing enough information for Stata to edit.)
  As-is graph (*.gph)   (A more compact Stata graph format.)
  Windows Metafile (*.wmf)
  Enhanced Metafile (*.emf)
  Portable Network Graphics (*.png)
  TIFF (*.tif)
  PostScript (*.ps)
  Encapsulated PostScript with TIFF preview (*.eps)
  Encapsulated PostScript (*.eps)

Regardless of which graphics format we want, it might be worthwhile also to save a copy of our graph in "live" .gph format. Live .gph graphs can later be retrieved, combined, recolored, or reformatted using the graph use or graph combine commands (Chapter 3). Instead of using menus, graphs can be saved by adding a saving(filename) option to any graph command. To save a graph with the filename figure1.gph, add another separator ||, a comma, and saving(figure1). Chapter 3 explains more about the logic of graph commands. The complete command now contains the following (typed in the Stata Command window with as many spaces as you want, but no hard returns):

  . graph twoway connected men year, yaxis(1)
      || connected penalty year, yaxis(2)
      || , saving(figure1)

Through all of the preceding analyses, the log file monday1.smcl has been storing our results. There are several possible ways to review this file to see what we have done:

  choosing File - Log - View - OK
  clicking the Log button and choosing View snapshot of log file - OK
  typing the command view monday1.smcl

We could print the log file by choosing Print. Log files close automatically at the end of a Stata session, or earlier if instructed by one of the following:

  choosing File - Log - Close
  clicking the Log button and choosing Close log file - OK
  typing the command log close

Once closed, the file monday1.smcl could be opened again through File - View during a subsequent Stata session. To make an output file that can be opened easily by your word processor, either translate the log file from .smcl (a Stata format) to .log (standard ASCII text format) by typing

  . translate monday1.smcl monday1.log

or start out by creating the file in .log instead of .smcl format.

Stata's Documentation and Help Files

The complete Stata 9 Documentation Set includes over 6,000 pages in 15 volumes: a slim Getting Started manual (for example, Getting Started with Stata for Windows), the more extensive User's Guide, the encyclopedic three-volume Base Reference Manual, and separate reference manuals on data management, graphics, longitudinal and panel data, matrix programming (Mata), multivariate statistics, programming, survey data, survival analysis and epidemiological tables, and time series analysis. Getting Started helps you do just that, with the basics of installation, window management, data entry, printing, and so on. The User's Guide contains an extended discussion of general topics, including resources and troubleshooting. Of particular note for new users is the User's Guide section on "Commands everyone should know." The Base Reference Manual lists all Stata commands alphabetically. Entries for each command include the full command syntax, descriptions of all available options, examples, technical notes regarding formulas and rationale, and references for further reading. Data management, graphics, panel data, etc. are covered in the general references, but these complicated topics get more detailed treatment and examples in their own specialized manuals. A Quick Reference and Index volume rounds out the whole collection.

When we are in the midst of a Stata session, it is often simpler to ask for onscreen help instead of consulting the manuals.
Selecting Help from the top menu bar invokes a drop-down menu of further choices, including help on specific commands, general topics, online updates, the Stata Journal, or connections to Stata's web site (www.stata.com). Alternatively, we can bring the Viewer to front and use its Search or Contents features to find information. We can also use the help command. Typing help correlate, for example, causes help information to appear in a Viewer window. Like the reference manuals, onscreen help provides command syntax diagrams and complete lists of options. It also includes some examples, although often less detailed and without the technical discussions found in the manuals. The Viewer help has several advantages over the manuals, however. It can search for keywords in the documentation or on Stata's website. Hypertext links take you directly to related entries. Onscreen help can also include material about recent updates, or the "unofficial" Stata programs that you have downloaded from Stata's web site or from other users.

Searching for Information

Selecting Help - Search - Search documentation and FAQs provides a direct way to search for information in Stata's documentation or in the web site's FAQs (frequently asked questions) and other pages. The equivalent Stata command is

  . search keywords

Options available with search allow us to limit our search to the documentation and FAQs, to net resources including the Stata Journal, or to both. For example,

  . search median regression

will search the documentation and FAQs for information indexed by both keywords, "median" and "regression." To search for these keywords across Stata's Internet resources in addition to the documentation and FAQs, type

  . search median regression, all

Search results in the Viewer window contain clickable hyperlinks leading to further information or original citations.

One specialized use for the search command is to provide more information on those occasions when our command does not succeed as planned, but instead results in one of Stata's cryptic numerical error messages. For example, typing the one-word command table produces the error or "return code" r(100):

  . table
  varlist required
  r(100);

The table command evidently requires a list of variables. Often, however, the meaning of an error message is less obvious. To learn more about what return code r(100) refers to, type

  . search rc 100

  Keyword search

          Keywords:  rc 100
            Search:  (1) Official help files, FAQs, Examples, SJs, and STBs

  Search of official help files, FAQs, Examples, SJs, and STBs

  [R]     error . . . . . . . . . . . . . . . . . . . .  Return code 100
          varlist required; = exp required; using required; by() option
          required; certain commands require a varlist or another
          element of the language.  The message specifies the required
          item that was missing from the command you gave.  See the
          command's syntax diagram.  For example, merge requires using
          be specified; perhaps you meant to type append.  Or, ranksum
          requires a by() option; see [R] signrank.

  (end of search)

Type help search for more about this command.
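The scope options mentioned above work the same way with any keywords. For example, to confine a search to Stata's Internet resources (user-written programs, and Stata Journal or STB material), the net option can be added. A minimal sketch; the keywords here are arbitrary, and the results will depend on what is online at the time:

  . search quantile regression, net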
Stata Press also has its own web site, containing information about Stata publications including the datasets used for examples. http: //www. s tata-pres s .com Both web sites are well worth exploring. The mailing or physical address is Stata Corporation 4905 Lakeway Drive College Station. TX 77845 USA Telephone access includes an easy-to-remember 800 number. telephone: 1-800-STATAPC U.S. (1-800-782-8272) 1-800-248-8272 Canada 1 -979-696-4600 International fax: 1-979-696-4601 Online updates within major versions are free to licensed Stata users. These provide a fast and simple way to obtain the latest enhancements, bug fixes, etc. for your current version. To find out whether updates exist for your Stata. and initiate the simple online update process itself, type the command update query Technical support can be obtained by sending e-mail messages with your Stata serial number in the subject line to tech_support fastata.com Before calling or writing for technical help, though, you might want to look at www.stata.com to see whether your question is a FAQ. The site also provides product, ordering, and help information; international notes; and assorted news and announcements. Much attention is given to user support, including the following: FAQs — Frequently asked questions and their answers. If you are puzzled by something and can't find the answer in the manuals, check here next — it might be a FAQ. Example questions range from basic — "How can I convert other packages' files to Stata format data files?" — to more technical queries such as "How do I impose the restriction that rho is zero using the heckman command with full ml?" UPDATES — Frequent minor updates or bug fixes, downloadable at no cost by licensed Stata users. 10 Statistics with Stata Stata and Stata Resources 11 OTHER RESOURCES — Links and information including online Stata instruction (NetCourses); enhancements from the Stata Journal; an independent listserver (Statalist) for discussions among Stata users; a bookstore selling books about Stata and other up-to-date statistical references; downloadable datasets and programs for Stata-related books; and links to statistical web sites including State's competitors. The following sections describe some of the most important user-support resources. Statalist_________ Statalist provides a valuable online forum for communication among active Stata users. It is independent of Stata Corporation, although Stata programmers monitor it and often contribute to the discussion. To subscribe to Statalist, send an e-mail message to r.a j cr dcir.c ? hsphsur 2 . ha rva rd . edu The body of this message should contain only the following words: subs cr ibe atal1st The list processor will acknowledge your message and send instructions for using the list, including how to post messages of your own. Any message sent to the following address goes out to all current subscribers: sta-aiist@hsphsun2.harvard.edu Do not try to subscribe or unsubscribe by sending messages directly to the statalist address. This does not work, and your mistake goes to hundreds of subscribers. To unsubscribe from the list, write to the same majordomo address you used to subscribe: rnajordoncShsphsur. 2 . harvard, edu but send only the message unsubscribe statalist or send the equivalent message signoff statalist If you plan to be traveling or offline for a while, unsubscribing will keep your mailbox from filling up with Statalist messages. You can always re-subscribe. 
Searchable Statalist archives are available at http://www.stata.conVstatalist/archive/ The material on Statalist includes requests for programs, solutions, or advice, as well as answers and general discussion. Along with the Stata Jou/W (discussed below). Statalist plays a major role in extending the capabilities both of Stata and of serious Stata users. The Stata Journal______ From 1991 through 2001, a bimonthly publication called the Stata Technical Bulletin (STB) served as a means of distributing new commands and Stata updates, both user-written and official. Accumulated STB articles were published in book form each year as Stata Technical Bulletin Reprints, which can be ordered directly from Stata Corporation. With the growth of the Internet, instant communication among users became possible through vehicles such as Statalist. Program files could easily be downloaded from distant sources. A bimonthly printed journal and disk no longer provided the best avenues either for communicating among users, or for distributing updates and user-written programs. To adapt to a changing world, the STB had to evolve into something new. The Stata Journal was launched to meet this challenge and the needs of Stata's broadening user base. Like the old STB, the Stata Journal contains articles describing new commands by users along with unofficial commands written by Stata Corporation employees. New-commands are not its primary focus, however. The Stata Journal also contains refereed expository articles about statistics, book reviews, and a number of interesting columns, including "Speaking Stata" by Nicholas J. Cox, on effective use of the Stata programming language. The Stata Journal is intended for novice as well as experienced Stata users. For example, here are the contents from one recent issue: "Exploratory analysis of single nucleotide polymorphism (SNP) for MA. Cleves quantitative traits" "Value label utilities: labeldup and labelrename" J. Weesie "Multilingual datasets1* j_ Weesie "Multiple imputation of missing values: update" p. Royston "Estimation and testing of fixed-effect panel-data systems" J.L. Blaekwell, III "Data inspection using biplots" \j. Kohler & M. Luniak "Stata in space: Econometric analysis of spatially explicit raster data" D. Miiller "Using the file command to produce formatted output for other applications" E. Slaymaker "Teaching statistics to physicians using Stata" S.M. Hailpern "Speaking Stata: Density probability plots" N. J. Cox "Review of Regression Methods in Biostatistics: S. Lemeshow & M.L. Moeschberger Linear, Logistic, Sunival, and Repeated Measures Models" The Stata Journal is published quarterly. Subscriptions can be purchased directly from Stata Corporation by visiting wwAv.stata.com. Books Using Stata In addition to Stata's own reference manuals, a growing library of books describe Stata, or use Stata to illustrate analytical techniques. These books include general introductions; disciplinary applications such as social science, biostatistics or econometrics; and focused texts concerning survey analysis, experimental data, categorical dependent variables, and other subjects. The Bookstore pages on Stata's web site have up-to-date lists, with descriptions of content: http; //www. s t at a. c o m/b o o ks t o re / This online bookstore provides a central place to learn about and order Stata-relevant books from many different publishers. Data Management The first steps in data analysis involve organizing the raw data into a format usable by Stata. 
We can bring new data into Stata in several ways: type the data from the keyboard; read a text or ASCII file containing the raw data; paste data from a spreadsheet into the Editor; or, using a third-party data transfer program, translate the dataset directly from a system file created by another spreadsheet, database, or statistical program. Once Stata has the data in memory, we can save the data in Stata format for easy retrieval and updating in the future.

Data management encompasses the initial tasks of creating a dataset, editing to correct errors, and adding internal documentation such as variable and value labels. It also encompasses many other jobs required by ongoing projects, such as adding new observations or variables; reorganizing, simplifying, or sampling from the data; separating, combining, or collapsing datasets; converting variable types; and creating new variables through algebraic or logical expressions. When data-management tasks become complex or repetitive, Stata users can write their own programs to automate the work. Although Stata is best known for its analytical capabilities, it possesses a broad range of data-management features as well. This chapter introduces some of the basics. The User's Guide provides an overview of the different methods for inputting data, followed by eight rules for determining which input method to use.

Input, editing, and many other operations discussed in this chapter can be accomplished through the Data menus. Data menu subheadings refer to the general category of task:

  Describe data
  Data editor
  Data browser (read-only editor)
  Create or change variables
  Sort
  Combine datasets
  Labels
  Notes
  Variable utilities
  Matrices
  Other utilities

Example Commands

. append using olddata
  Reads previously-saved dataset olddata.dta and adds all its observations to the data currently in memory. Subsequently typing save newdata, replace will save the combined dataset as newdata.dta.

. browse
  Opens the spreadsheet-like Data Browser for viewing the data. The Browser looks similar to the Data Editor, but it has no editing capability, so there is no risk of inadvertently changing your data. Alternatively, click the Data Browser button.

. browse boats men if year > 1980
  Opens the Data Browser showing only the variables boats and men for observations in which year is greater than 1980. This example illustrates the if qualifier, which can be used to focus the operation of many Stata commands.

. compress
  Automatically converts all variables to their most efficient storage types to conserve memory and disk space. Subsequently typing the command save filename, replace will make these changes permanent.

. drawnorm z1 z2 z3, n(5000)
  Creates an artificial dataset with 5,000 observations and three random variables, z1, z2, and z3, sampled from uncorrelated standard normal distributions. Options could specify other means, standard deviations, and correlation or covariance matrices.

. edit
  Opens the spreadsheet-like Data Editor where data can be entered or edited. Alternatively, choose Window - Data Editor or click the Data Editor button.

. edit boats year men
  Opens the Data Editor with only the variables boats, year, and men (in that order) visible and available for editing.

. encode stringvar, gen(numvar)
  Creates a new variable named numvar, with labeled numerical values based on the string (non-numeric) variable stringvar.

. format rainfall %8.2f
  Establishes a fixed (f) display format for numeric variable rainfall: 8 columns wide, with two digits always shown after the decimal.

. generate newvar = (x + y)/100
  Creates a new variable named newvar, equal to the sum of x and y divided by 100.

. generate newvar = uniform()
  Creates a new variable with values sampled from a uniform random distribution over the interval ranging from 0 to nearly 1, written [0,1).

. infile x y z using data.raw
  Reads an ASCII file named data.raw containing data on three variables: x, y, and z. The values of these variables are separated by one or more white-space characters — blanks, tabs, and newlines (carriage return, linefeed, or both) — or by commas. With white-space
generate newvar = (x + y)/100 Creates a new variable named newvar. equal to.v plus v divided by 100. . generate newvar = uniform() Creates a new variable with values sampled from a uniform random distribution over the interval ranging from 0 to nearly 1, written [0,1). . infile x y z using data.raw Reads an ASCII file named data.raw containing data on three variables: .v, w and z. The values of these variables are separated by one or more white-space characters — blanks, tabs, and newlines (carriage return, linefeed, or both) — or by commas. With white-space Statistics with Stata Data Management 15 delimiters, missing values are represented by periods, not blanks. With comma-delimited data, missing values are represented by a period or by two consecutive commas. Stata also provides for extended missing values, which we will discuss later. Other commands are better suited for reading tab-delimited, comma-delimited, or fixed-column raw data; type help infiling for more infomation. list Lists the data in default or ■'table" format. If the dataset contains many variables, table format becomes hard to read, and list, display produces better results. See help list for other options controlling the format of data lists. list x y z in 5/20 Lists the v. r, and z values of the 5th through 20th observations, as the data are presently sorted. The in qualifier works in similar fashion with most other Stata commands as well. merge id using olddata Reads the previously-saved dataset olddata.dta and matches observations from olddata with observations in memory that have identical id values. Both olddata (the "using" data) and the data currently in memory (the "master"' data) must already be sorted by id. replace oldvar = 100 * oldvar Replaces the values of oldvar with 100 times their previous values. sample 10 Drops all the observations in memory except for a 10% random sample. Instead of selecting a certain percentage, we could select a certain number of cases. For example, sample 55, count would drop all but a random sample of size n = 55. save newfile Saves the data currently in memory, as a file named newfile.dta. If newfde.dta already exists, and you want to write over the previous version, type save new file, replace. Alternatively, use the menus: File Save or File - Save As . To save newfUe.dta in the format of Stata version 7, type saveold new file . set memory 2 4m (Windows or Unix systems only) Allocates 24 megabytes of memory for Stata data. The amount set could be greater or less than the current allocation. Virtual memory (disk space) is used if the request exceeds physical memory. Type clear to drop the current data from memory before using set memory . sort x Sorts the data from lowest to highest values of .v. Observations with missing _y values appear last after sorting because Stata views missing values as very high numbers. Type help gsort for a more general sorting command that can arrange values in either ascending or descending order and can optionally place the missing values first. tabulate x if y > 65 Produces a frequency table for.v using only those observations that have r values above 65. The if qualifier works similarly with most other Stata commands. ise oldflle Retrieves previously-saved Stata-format dataset oldfile.dta from disk, and places it in memory. If other data are currently in memory, and you want to discard those data without saving them, type use oldflle, clear . Alternatively, these tasks can be accomplished through File - Open or by clicking rj£ . 
Creating a New Dataset

Data that were previously saved in Stata format can be retrieved into memory either by typing a command of the form use filename, or by menu selections. This section describes basic methods for creating a Stata-format dataset in the first place, using as our example the 1995 data on Canadian provinces and territories listed in Table 2.1. (From the Federal, Provincial and Territorial Advisory Committee on Population Health, 1996. Canada's newest territory, Nunavut, is not listed here because it was part of the Northwest Territories until 1999.)

Table 2.1: Data on Canada and Its Provinces

                          1995 Pop.   Unemployment    Male Life   Female Life
Place                      (1000's)  Rate (percent)  Expectancy    Expectancy
Canada                      29606.1            10.6        75.1          81.1
Newfoundland                  575.4            19.6        73.9          79.8
Prince Edward Island          136.1            19.1        74.8          81.3
Nova Scotia                   937.8            13.9        74.2          80.4
New Brunswick                 760.1            13.8        74.8          80.6
Quebec                       7334.2            13.2        74.5          81.2
Ontario                     11100.3             9.3        75.5          81.1
Manitoba                     1137.5             8.5        75.0          80.8
Saskatchewan                 1015.6             7.0        75.2          81.8
Alberta                      2747.0             8.4        75.5          81.4
British Columbia             3766.0             9.8        75.8          81.4
Yukon                          30.1               -        71.3          80.4
Northwest Territories          65.8               -        70.2          78.0

The simplest way to create a dataset from Table 2.1 is through Stata's spreadsheet-like Data Editor, which is invoked either by clicking the Data Editor button, selecting Window - Data Editor from the top menu bar, or by typing the command edit. Then begin typing values for each variable, in columns that Stata automatically calls var1, var2, etc. Thus, var1 contains place names (Canada, Newfoundland, etc.); var2, populations; and so forth.

[Data Editor window: columns var1 through var5, with Canada (29606.1, 10.6, 75.1, 81.1) and Newfoundland (575.4, 19.6, 73.9, 79.8) entered as the first two rows]

We can assign more descriptive variable names by double-clicking on the column headings (such as var1) and then typing a new name in the resulting dialog box. Eight characters or fewer works best, although names with up to 32 characters are allowed. We can also create variable labels that contain a brief description. For example, var2 (population) might be renamed pop, and given the variable label "Population in 1000s, 1995". Renaming and labeling variables can also be done outside of the Data Editor through the rename and label variable commands:

. rename var2 pop
. label variable pop "Population in 1000s, 1995"

Cells left empty, such as unemployment rates for the Yukon and Northwest Territories, will automatically be assigned Stata's system (default) missing value code, a period. At any time, we can close the Data Editor and then save the dataset to disk. Clicking the Data Editor button or choosing Window - Data Editor brings the Editor back.

If the first value entered for a variable is a number, as with population, unemployment, and life expectancy, then Stata assumes that this column is a "numeric variable" and it will thereafter permit only numerical values. Numerical values can also begin with a plus or minus sign, include decimal points, or be expressed in scientific notation. For example, we could represent Canada's population as 2.96061e+7, which means 2.96061 x 10^7, or about 29.6 million people. Numerical values should not include any commas, such as 29,606,100. If we did happen to put commas within the first value typed in a column, Stata would interpret this as a "string variable" (next paragraph) rather than as a number.
If the first value entered for a variable includes non-numerical characters, as did the place names above (or "1,000" with the comma), then Stata thereafter considers this column to be a string variable. String variable values can be almost any combination of letters, numbers, symbols, or spaces up to 80 characters long in Intercooled or Small Stata, and up to 244 characters in Stata/SE. We can thus store names, quotations, or other descriptive information. String variable values can be tabulated and counted, but do not allow the calculation of means, correlations, or most other statistics. In the Data Editor or Data Browser, string variable values appear in red, so we can visually distinguish the two variable types.

After typing in the information from Table 2.1 in this fashion, we close the Data Editor and save our data, perhaps with the name canada0.dta:

. save canada0

Stata automatically adds the extension .dta to any dataset name, unless we tell it to do otherwise. If we already had saved and named an earlier version of this file, it is possible to write over that with the newest version by typing

. save, replace

At this point, our new dataset looks like this:

. describe

Contains data from canada0.dta
  obs:            13
 vars:             5
 size:           481 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------
var1            str21  %21s
pop             float  %9.0g                  Population in 1000s, 1995
var3            float  %9.0g
var4            float  %9.0g
var5            float  %9.0g
-------------------------------------------------------------------
Sorted by:

. list

          var1               pop   var3   var4   var5
  1.      Canada          29606.1   10.6   75.1   81.1
  2.      Newfoundland      575.4   19.6   73.9   79.8
  3.      Prince Edward Island
                            136.1   19.1   74.8   81.3
  4.      Nova Scotia       937.8   13.9   74.2   80.4
  5.      New Brunswick     760.1   13.8   74.8   80.6
  6.      Quebec           7334.2   13.2   74.5   81.2
  7.      Ontario         11100.3    9.3   75.5   81.1
  8.      Manitoba         1137.5    8.5     75   80.8
  9.      Saskatchewan     1015.6      7   75.2   81.8
 10.      Alberta            2747    8.4   75.5   81.4
 11.      British Columbia  3766     9.8   75.8   81.4
 12.      Yukon              30.1      .   71.3   80.4
 13.      Northwest Territories
                             65.8      .   70.2     78

. summarize

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+------------------------------------------------------
        var1 |       0
         pop |      13    4554.769    8214.304       30.1    29606.1
        var3 |      11    12.10909    4.250048          7       19.6
        var4 |      13    74.29231    1.673052       70.2       75.8
        var5 |      13    80.71539    .9754027         78       81.8

Examining such output tables gives us a chance to look for errors that should be corrected. The summarize table, for instance, provides several numbers useful in proofreading, including the count of nonmissing observations (always 0 for string variables) and the minimum and maximum for each variable. Substantive interpretation of the summary statistics would be premature at this point, because our dataset contains one observation (Canada) that represents a combination of the other 12 provinces and territories.

The next step is to make our dataset more self-documenting. The variables could be given more descriptive names, such as the following:

. rename var1 place
. rename var3 unemp
. rename var4 mlife
. rename var5 flife

Stata also permits us to add several kinds of labels to the data. label data describes the dataset as a whole. For example,

. label data "Canadian dataset 0"

label variable describes an individual variable. For example,

. label variable place "Place name"
. label variable unemp "% 15+ population unemployed, 1995"
. label variable mlife "Male life expectancy years"
. label variable flife "Female life expectancy years"

By labeling data and variables, we obtain a dataset that is more self-explanatory:

. describe
Contains data from c:\data\canada0.dta
  obs:            13                          Canadian dataset 0
 vars:             5
 size:           481 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------
place           str21  %21s                   Place name
pop             float  %9.0g                  Population in 1000s, 1995
unemp           float  %9.0g                  % 15+ population unemployed, 1995
mlife           float  %9.0g                  Male life expectancy years
flife           float  %9.0g                  Female life expectancy years
-------------------------------------------------------------------
Sorted by:

Once labeling is completed, we should save the data to disk by using File - Save or typing

. save, replace

We can later retrieve these data any time through the Open button, File - Open, or by typing

. use c:\data\canada0
(Canadian dataset 0)

We can then proceed with a new analysis. We might notice, for instance, that male and female life expectancies correlate positively with each other and also negatively with the unemployment rate. The life expectancy-unemployment rate correlation is slightly stronger for males.

. correlate unemp mlife flife
(obs=11)

             |    unemp    mlife    flife
-------------+---------------------------
       unemp |   1.0000
       mlife |  -0.7440   1.0000
       flife |  -0.6173   0.7631   1.0000

The order of observations within a dataset can be changed through the sort command. For example, to rearrange observations from smallest to largest in population, type

. sort pop

String variables are sorted alphabetically instead of numerically. Typing sort place will rearrange observations putting Alberta first, British Columbia second, and so on. We can control the order of variables in the data using the order command. For example, we could make unemployment rate the second variable, and population last:

. order place unemp mlife flife pop

The Data Editor also has buttons that perform these functions. The Sort button applies to the column currently highlighted by the cursor. The << and >> buttons move the current variable to the beginning or end of the variable list, respectively. As with any other editing, these changes only become permanent if we subsequently save our data.

The Data Editor's Hide button does not rearrange the data, but rather makes a column temporarily invisible on the spreadsheet. This feature is convenient if, for example, we need to type in more variables and want to keep the province names or some other case identification column in view, adjacent to the "active" column where we are entering data. We can also restrict the Data Editor beforehand to work only with certain variables, in a specified order, or with a specified range of values. For example,

. edit place mlife flife

or

. edit place unemp if pop > 100

The last example employs an if qualifier, an important tool described in the next section.

Specifying Subsets of the Data: in and if Qualifiers

Many Stata commands can be restricted to a subset of the data by adding an in or if qualifier. (Qualifiers are also available for many menu selections: look for an if/in or by/if/in tab along the top of the menu.) in specifies the observation numbers to which the command applies. For example,

. list in 5

tells Stata to list only the 5th observation. To list the 1st through 20th observations, type

. list in 1/20

The letter l denotes the last observation, and -4, for example, the fourth-from-last. Thus, we could list the four most populous Canadian places (which will include Canada itself) as follows:

. sort pop
. list place pop in -4/l

Note the important, although typographically subtle, distinction between 1 (number one, or first observation) and l (letter "el", or last observation). The in qualifier works in a similar way with most other analytical or data-editing commands. It always refers to the data as presently sorted.

The if qualifier also has broad applications, but it selects observations based on specific variable values. As noted, the observations in canada0.dta include not only 12 Canadian provinces or territories, but also Canada as a whole.
For many purposes, we might want to exclude Canada from analyses involving the 12 territories and provinces. One way to do so is to restrict the analysis to only those places with populations below 20 million (20,000 thousand), that is, every place except Canada:

. summarize if pop < 20000

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+------------------------------------------------------
       place |       0
         pop |      12    2467.158    3435.376       30.1    11100.3
       unemp |      10       12.26     4.44877          7       19.6
       mlife |      12      74.225    1.728965       70.2       75.8
       flife |      12    80.68333    1.011637         78       81.8

Compare this with the earlier summarize output to see how much has changed. The previous mean of population, for example, was grossly misleading because it counted every person twice.

The "<" (is less than) sign is one of six relational operators:

==   is equal to
!=   is not equal to (~= also works)
>    is greater than
<    is less than
>=   is greater than or equal to
<=   is less than or equal to

A double equals sign, "==", denotes the logical test, "Is the value on the left side the same as the value on the right?" To Stata, a single equals sign means something different: "Make the value on the left side be the same as the value on the right." The single equals sign is not a relational operator and cannot be used within if qualifiers. Single equals signs have other meanings. They are used with commands that generate new variables, or replace the values of old ones, according to algebraic expressions. Single equals signs also appear in certain specialized applications such as weighting and hypothesis tests.

Any of these relational operators can be used to select observations based on their values for numerical variables. Only two operators, == and !=, make sense with string variables. To use string variables in an if qualifier, enclose the target value in double quotes. For example, we could get a summary excluding Canada (leaving in the 12 provinces and territories):

. summarize if place != "Canada"

Two or more relational operators can be combined within a single if expression by the use of logical operators. Stata's logical operators are the following:

&    and
|    or (symbol is a vertical bar, not the number one or letter "el")
!    not (~ also works)

The Canadian territories (Yukon and Northwest) both have fewer than 100,000 people. To find the mean unemployment and life expectancies for the 10 Canadian provinces only, excluding both the smaller places (territories) and the largest (Canada), we could use this command:

. summarize unemp mlife flife if pop > 100 & pop < 20000

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+------------------------------------------------------
       unemp |      10       12.26     4.44877          7       19.6
       mlife |      10       74.92     .605163       73.9       75.8
       flife |      10       80.98     .586515       79.8       81.8

Parentheses allow us to specify the precedence among multiple operators. For example, we might list all the places that either have unemployment below 9, or have life expectancies of at least 75.4 for men and 81.4 for women:

. list if unemp < 9 | (mlife >= 75.4 & flife >= 81.4)

        place                  pop   unemp   mlife   flife
  8.    Manitoba            1137.5     8.5      75    80.8
  9.    Saskatchewan        1015.6       7    75.2    81.8
 10.    Alberta               2747     8.4    75.5    81.4
 11.    British Columbia      3766     9.8    75.8    81.4

A note of caution regarding missing values: Stata ordinarily shows missing values as a period, but in some operations (notably sort and if, although not in statistical calculations such as means or correlations), these same missing values are treated as if they were large positive numbers. Watch what happens if we sort places from lowest to highest unemployment rate, and then ask to see places with unemployment rates above 15%:

. sort unemp
. list if unemp > 15
        place                       pop   unemp   mlife   flife
 10.    Prince Edward Island      136.1    19.1    74.8    81.3
 11.    Newfoundland              575.4    19.6    73.9    79.8
 12.    Yukon                      30.1       .    71.3    80.4
 13.    Northwest Territories      65.8       .    70.2      78

Where missing values exist, we might have to deal with them explicitly as part of the if expression:

. tabulate vote if age > 65 & age < .

A less-than inequality such as age < . is a general way to select observations with nonmissing values. Stata permits up to 27 different missing values codes, although we are using only the default "." here. The other 26 codes are represented internally as numbers even larger than ".", so < . avoids them all. Type help missing for more details.

The in and if qualifiers set observations aside temporarily so that a particular command does not apply to them. These qualifiers have no effect on the data in memory, and the next command will apply to all observations, unless it too has an in or if qualifier. To drop variables from the data in memory, use the drop command. For example, to drop mlife and flife from memory, type

. drop mlife flife

We can drop observations from memory by using either the in qualifier or the if qualifier. Because we earlier sorted on unemp, the two territories occupy the 12th and 13th positions in the data. Canada itself is 6th. One way to drop these three nonprovinces employs the in qualifier: drop in 12/13 means "drop the 12th through the 13th observations."

. list

        place                       pop   unemp
  1.    Saskatchewan             1015.6       7
  2.    Alberta                    2747     8.4
  3.    Manitoba                 1137.5     8.5
  4.    Ontario                 11100.3     9.3
  5.    British Columbia           3766     9.8
  6.    Canada                  29606.1    10.6
  7.    Quebec                   7334.2    13.2
  8.    New Brunswick             760.1    13.8
  9.    Nova Scotia               937.8    13.9
 10.    Prince Edward Island      136.1    19.1
 11.    Newfoundland              575.4    19.6
 12.    Yukon                      30.1       .
 13.    Northwest Territories      65.8       .

. drop in 12/13
(2 observations deleted)

. drop in 6
(1 observation deleted)

The same change could have been accomplished through an if qualifier, with a command that says "drop if place equals Canada or population is less than 100."

. drop if place == "Canada" | pop < 100
(3 observations deleted)

After dropping Canada, the territories, and the variables mlife and flife, we have the following reduced dataset:

. list

        place                       pop   unemp
  1.    Saskatchewan             1015.6       7
  2.    Alberta                    2747     8.4
  3.    Manitoba                 1137.5     8.5
  4.    Ontario                 11100.3     9.3
  5.    British Columbia           3766     9.8
  6.    Quebec                   7334.2    13.2
  7.    New Brunswick             760.1    13.8
  8.    Nova Scotia               937.8    13.9
  9.    Prince Edward Island      136.1    19.1
 10.    Newfoundland              575.4    19.6

We can also drop selected variables or observations through the Delete button in the Data Editor.

Instead of telling Stata which variables or observations to drop, it sometimes is simpler to specify which to keep. The same reduced dataset could have been obtained as follows:

. keep place pop unemp
. keep if place != "Canada" & pop >= 100
(3 observations deleted)

Like any other changes to the data in memory, none of these reductions affect disk files until we save the data. At that point, we will have the option of writing over the old dataset (save, replace) and thus destroying it, or just saving the newly modified dataset with a new name (by choosing File - Save As, or by typing a command with the form save newname) so that both versions exist on disk.

Generating and Replacing Variables

The generate and replace commands allow us to create new variables or change the values of existing variables. For example, in Canada, as in most industrial societies, women tend to live longer than men. To analyze regional variations in this gender gap,
we might retrieve dataset canada1.dta and generate a new variable equal to female life expectancy (flife) minus male life expectancy (mlife). In the main part of a generate or replace statement (unlike if qualifiers) we use a single equals sign.

. use canada1, clear
(Canadian dataset 1)

. generate gap = flife - mlife
. label variable gap "Female-male gap life expectancy"

. describe

Contains data from C:\data\canada1.dta
  obs:            13                          Canadian dataset 1
 vars:             6                          3 Jul 2005 10:43
 size:           533 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------
place           str21  %21s                   Place name
pop             float  %9.0g                  Population in 1000s, 1995
unemp           float  %9.0g                  % 15+ population unemployed, 1995
mlife           float  %9.0g                  Male life expectancy years
flife           float  %9.0g                  Female life expectancy years
gap             float  %9.0g                  Female-male gap life expectancy
-------------------------------------------------------------------
Sorted by:

. list place flife mlife gap

        place                   flife   mlife        gap
  1.    Canada                   81.1    75.1          6
  2.    Newfoundland             79.8    73.9   5.900002
  3.    Prince Edward Island     81.3    74.8        6.5
  4.    Nova Scotia              80.4    74.2   6.200005
  5.    New Brunswick            80.6    74.8   5.799995
  6.    Quebec                   81.2    74.5   6.699997
  7.    Ontario                  81.1    75.5   5.599998
  8.    Manitoba                 80.8      75   5.800003
  9.    Saskatchewan             81.8    75.2   6.600006
 10.    Alberta                  81.4    75.5   5.900002
 11.    British Columbia         81.4    75.8   5.599998
 12.    Yukon                    80.4    71.3   9.099998
 13.    Northwest Territories      78    70.2   7.800003

For the province of Newfoundland, the true value of gap should be 79.8 - 73.9 = 5.9 years, but the output shows this value as 5.900002 instead. Like all computer programs, Stata stores numbers in binary form, and 5.9 has no exact binary representation. The small inaccuracies that arise from approximating decimal fractions in binary are unlikely to affect statistical calculations much because calculations are done in double precision (8 bytes per number). They appear disconcerting in data lists, however. We can change the display format so that Stata shows only a rounded-off version. The following command specifies a fixed display format four numerals wide, with one digit to the right of the decimal:

. format gap %4.1f

Even when the display shows 5.9, however, a command such as the following will return no observations:

. list if gap == 5.9

This occurs because Stata believes the value does not exactly equal 5.9. (More technically, Stata stores gap values in single precision but does all calculations in double, and the single- and double-precision approximations of 5.9 are not identical.) Display formats, as well as variable names and labels, can also be changed by double-clicking on a column in the Data Editor.

Fixed numeric formats such as %4.1f are one of the three most common numeric display format types. These are

%w.dg  General numeric format, where w specifies the total width or number of columns displayed and d the minimum number of digits that must follow the decimal point. Exponential notation (such as 1.00e+07, meaning 1.00 x 10^7 or 10 million) and shifts in the decimal-point position will be used automatically as needed, to display values in an optimal (but varying) fashion.

%w.df  Fixed numeric format, where w specifies the total width or number of columns displayed and d the fixed number of digits that must follow the decimal point.

%w.de  Exponential numeric format, where w specifies the total width or number of columns displayed and d the fixed number of digits that must follow the decimal point.

For example, as we saw in Table 2.1, the 1995 population of Canada was approximately 29,606,100 people, and the Yukon Territory population was 30,100.
Below we see how these two numbers appear under several different display formats:

format      Canada        Yukon
%9.0g       2.96e+07      30100
%9.1f       29606100.0    30100.0
%12.5e      2.96061e+07   3.01000e+04

Although the displayed values look different, their internal values are identical. Statistical calculations remain unaffected by display formats. Other numerical display formatting options include the use of commas, left- and right-justification, or leading zeroes. There also exist special formats for dates, time series variables, and string variables. Type help format for more information.

replace can make the same sorts of calculations as generate, but it changes values of an existing variable instead of creating a new variable. For example, the variable pop in our dataset gives population in thousands. To convert this to simple population, we just multiply ("*" means multiply) all values by 1,000:

. replace pop = pop * 1000

replace can make such wholesale changes, or it can be used with in or if qualifiers to selectively edit the data. To illustrate, suppose that we had questionnaire data with variables including age and year born (born). A command such as the following would correct one or more typos where a subject's age had been incorrectly typed as 229 instead of 29:

. replace age = 29 if age == 229

Alternatively, the following command could correct an error in the value of age for observation number 1453:

. replace age = 29 in 1453

For a more complicated example,

. replace age = 2005-born if age >= . | age < 2005-born

This replaces values of variable age with 2005 minus the year of birth if age is missing or if the reported age is less than 2005 minus the year of birth.

generate and replace provide tools to create categorical variables as well. We noted earlier that our Canadian dataset includes several types of observations: 2 territories, 10 provinces, and 1 country combining them all. Although in and if qualifiers allow us to separate these, and drop can eliminate observations from the data, it might be most convenient to have a categorical variable that indicates the observation's "type." The following example shows one way to create such a variable. We start by generating type as a constant, equal to 1 for each observation. Next, we replace this with the value 2 for the Yukon and Northwest Territories, and with 3 for Canada. The final steps involve labeling new variable type and defining labels for values 1, 2, and 3.

. use canada1, clear
(Canadian dataset 1)

. generate type = 1
. replace type = 2 if place == "Yukon" | place == "Northwest Territories"
. replace type = 3 if place == "Canada"
. label variable type "Province, territory or nation"
. label values type typelbl
. label define typelbl 1 "Province" 2 "Territory" 3 "Nation"

. list place flife mlife gap type

        place                   flife   mlife        gap        type
  1.    Canada                   81.1    75.1          6      Nation
  2.    Newfoundland             79.8    73.9   5.900002    Province
  3.    Prince Edward Island     81.3    74.8        6.5    Province
  4.    Nova Scotia              80.4    74.2   6.200005    Province
  5.    New Brunswick            80.6    74.8   5.799995    Province
  6.    Quebec                   81.2    74.5   6.699997    Province
  7.    Ontario                  81.1    75.5   5.599998    Province
  8.    Manitoba                 80.8      75   5.800003    Province
  9.    Saskatchewan             81.8    75.2   6.600006    Province
 10.    Alberta                  81.4    75.5   5.900002    Province
 11.    British Columbia         81.4    75.8   5.599998    Province
 12.    Yukon                    80.4    71.3   9.099998   Territory
 13.    Northwest Territories      78    70.2   7.800003   Territory
As illustrated, labeling the values of a categorical variable requires two commands. The label define command specifies what labels go with what numbers. The label values command specifies to which variable these labels apply. One set of labels (created through one label define command) can apply to any number of variables (that is, be referenced in any number of label values commands). Value labels can have up to 32,000 characters, but work best for most purposes if they are not too long.

generate can create new variables, and replace can produce new values, using any mixture of old variables, constants, random values, and expressions. For numeric variables, the following arithmetic operators apply:

+    add
-    subtract
*    multiply
/    divide
^    raise to power

Parentheses will control the order of calculation. Without them, the ordinary rules of precedence apply. Of the arithmetic operators, only addition, "+", works with string variables, where it connects two string values into one. Although their purposes differ, generate and replace have similar syntax. Either can use any mathematically or logically feasible combination of Stata operators and in or if qualifiers. These commands can also employ Stata's broad array of special functions, introduced in the following section.

Using Functions

This section lists many of the functions available for use with generate or replace. For example, we could create a new variable named loginc, equal to the natural logarithm of income, by using the natural log function ln within a generate command:

. generate loginc = ln(income)

ln is one of Stata's mathematical functions. These functions are as follows:

abs(x)  Absolute value of x.
acos(x)  Arc-cosine returning radians. Because 360 degrees = 2π radians, acos(x)*180/_pi gives the arc-cosine returning degrees (_pi denotes the mathematical constant π).
asin(x)  Arc-sine returning radians.
atan(x)  Arc-tangent returning radians.
atan2(y,x)  Two-argument arc-tangent returning radians.
atanh(x)  Arc-hyperbolic tangent returning radians.
ceil(x)  Integer n such that n-1 < x <= n
cloglog(x)  Complementary log-log of x: ln(-ln(1-x))
comb(n,k)  Combinatorial function (number of possible combinations of n things taken k at a time).
cos(x)  Cosine of radians. To find the cosine of y degrees, type generate y = cos(y*_pi/180)
digamma(x)  dlnΓ(x)/dx
exp(x)  Exponential (e to power x).
floor(x)  Integer n such that n <= x < n+1
trunc(x)  Integer obtained by truncating x towards zero.
invcloglog(x)  Inverse of the complementary log-log: 1 - exp(-exp(x))
invlogit(x)  Inverse of logit of x: exp(x)/(1 + exp(x))
ln(x)  Natural (base e) logarithm. For any other base number B, to find the base B logarithm of x, type generate y = ln(x)/ln(B)
lnfactorial(n)  Natural log of factorial. To find x factorial, type generate y = round(exp(lnfactorial(x)),1)
lngamma(x)  Natural log of Γ(x). To find Γ(x), type generate y = exp(lngamma(x))
log(x)  Natural logarithm; same as ln(x)
log10(x)  Base 10 logarithm.
logit(x)  Log of odds ratio of x: ln(x/(1-x))
max(x1,x2,...,xn)  Maximum of x1, x2, ..., xn.
min(x1,x2,...,xn)  Minimum of x1, x2, ..., xn.
mod(x,y)  Modulus of x with respect to y.
reldif(x,y)  Relative difference: |x-y| / (|y|+1)
round(x)  Round x to nearest whole number.
round(x,y)  Round x in units of y.
sign(x)  -1 if x < 0, 0 if x = 0, +1 if x > 0
sin(x)  Sine of radians.
sqrt(x)  Square root.
sum(x)  Running sum of x (also see help egen).
tan(x)  Tangent of radians.
tanh(x)  Hyperbolic tangent of x.
trigamma(x)  d²lnΓ(x)/dx²

Many probability functions exist as well, and are listed below. Consult help probfun and the reference manuals for important details, including definitions, constraints on parameters, and the treatment of missing values.

betaden(a,b,x)  Probability density of the beta distribution.
Binomial(n,k,p)  Probability of k or more successes in n trials when the probability of a success on a single trial is p.
binormal(h,k,r)  Joint cumulative distribution of the bivariate normal with correlation r.
Fden(n1,n2,f)  Probability density function for the F distribution with n1 numerator and n2 denominator degrees of freedom.
Ftail(n1,n2,f)  Reverse cumulative (upper-tail, survival) F distribution with n1 numerator and n2 denominator degrees of freedom. Ftail(n1,n2,f) = 1 - F(n1,n2,f)
gammaden(a,b,g,x)  Probability density function for the gamma family, where gammaden(a,1,0,x) is the probability density function corresponding to the cumulative gamma distribution gammap(a,x).
gammap(a,x)  Cumulative gamma distribution for a; also known as the incomplete gamma function.
ibeta(a,b,x)  Cumulative beta distribution for a, b; also known as the incomplete beta function.
invbinomial(n,k,p)  Inverse binomial. For P < 0.5, probability p such that the probability of observing k or more successes in n trials is P; for P > 0.5, probability p such that the probability of observing k or fewer successes in n trials is 1 - P.
invchi2(n,p)  Inverse of chi2(). If chi2(n,x) = p, then invchi2(n,p) = x
invchi2tail(n,p)  Inverse of chi2tail(). If chi2tail(n,x) = p, then invchi2tail(n,p) = x
invF(n1,n2,p)  Inverse cumulative F distribution. If F(n1,n2,f) = p, then invF(n1,n2,p) = f
invFtail(n1,n2,p)  Inverse reverse cumulative F distribution. If Ftail(n1,n2,f) = p, then invFtail(n1,n2,p) = f
invgammap(a,p)  Inverse cumulative gamma distribution. If gammap(a,x) = p, then invgammap(a,p) = x
invibeta(a,b,p)  Inverse cumulative beta distribution. If ibeta(a,b,x) = p, then invibeta(a,b,p) = x
invnchi2(n,L,p)  Inverse cumulative noncentral chi-squared distribution. If nchi2(n,L,x) = p, then invnchi2(n,L,p) = x
invnFtail(n1,n2,L,p)  Inverse reverse cumulative noncentral F distribution. If nFtail(n1,n2,L,f) = p, then invnFtail(n1,n2,L,p) = f
invnibeta(a,b,L,p)  Inverse cumulative noncentral beta distribution. If nibeta(a,b,L,x) = p, then invnibeta(a,b,L,p) = x
invnormal(p)  Inverse cumulative standard normal distribution. If normal(z) = p, then invnormal(p) = z
invttail(n,p)  Inverse reverse cumulative Student's t distribution. If ttail(n,t) = p, then invttail(n,p) = t
nbetaden(a,b,L,x)  Noncentral beta density with shape parameters a, b, noncentrality parameter L.
nchi2(n,L,x)  Cumulative noncentral chi-squared distribution with n degrees of freedom and noncentrality parameter L.
nFden(n1,n2,L,x)  Noncentral F density with n1 numerator and n2 denominator degrees of freedom, noncentrality parameter L.
nFtail(n1,n2,L,x)  Reverse cumulative (upper-tail, survival) noncentral F distribution with n1 numerator and n2 denominator degrees of freedom, noncentrality parameter L.
nibeta(a,b,L,x)  Cumulative noncentral beta distribution with shape parameters a and b, and noncentrality parameter L.
normal(z)  Cumulative standard normal distribution.
normalden(z)  Standard normal density, mean 0 and standard deviation 1.
normalden(z,s)  Normal density, mean 0 and standard deviation s.
normalden(x,m,s)  Normal density, mean m and standard deviation s.
npnchi2(n,x,p)  Noncentrality parameter L for the noncentral cumulative chi-squared distribution.
    If nchi2(n,L,x) = p, then npnchi2(n,x,p) = L
tden(n,t)  Probability density function of Student's t distribution with n degrees of freedom.
ttail(n,t)  Reverse cumulative (upper-tail) Student's t distribution with n degrees of freedom. This function returns the probability T > t.
uniform()  Pseudo-random number generator, returning values from a uniform distribution theoretically ranging from 0 to nearly 1, written [0,1). Nothing goes inside the parentheses with uniform(). Optionally, we can control the pseudo-random generator's starting seed, and hence the stream of "random" numbers, by first issuing a set seed # command, where # could be any integer from 0 to 2^31 - 1 inclusive. Omitting the set seed command corresponds to set seed 123456789, which will always produce the same stream of numbers.

Stata provides more than 40 date functions and date-related time series functions. A listing can be found in Chapter 27 of the User's Guide, or by typing help datefun. Below are some examples of date functions. "Elapsed date" in these functions refers to the number of days since January 1, 1960.

date(s1,s2[,y])  Elapsed date corresponding to s1. s1 is a string variable indicating the date in virtually any format. Months can be spelled out, abbreviated to three characters, or given as numbers; years can include or exclude the century; blanks and punctuation are allowed. s2 is any permutation of m, d, and [##]y, with their order defining the order that month, day, and year occur in s1. ##y gives the century for two-digit years in s1; the default is 19y.
d(l)  A date literal convenience function. For example, typing d(2jan1960) is equivalent to typing 1.
mdy(m,d,y)  Elapsed date corresponding to m, d, and y.
day(e)  Numeric day of the month corresponding to e, the elapsed date.
month(e)  Numeric month corresponding to e, the elapsed date.
year(e)  Numeric year corresponding to e, the elapsed date.
dow(e)  Numeric day of the week corresponding to e, the elapsed date.
doy(e)  Numeric day of the year corresponding to e, the elapsed date.
week(e)  Numeric week of the year corresponding to e, the elapsed date.
quarter(e)  Numeric quarter of the year corresponding to e, the elapsed date.
halfyear(e)  Numeric half of the year corresponding to e, the elapsed date.

Some useful special functions include the following:

autocode(x,n,xmin,xmax)  Forms categories from x by partitioning the interval from xmin to xmax into n equal-length intervals and returning the upper bound of the interval that contains x.
cond(x,a,b)  Returns a if x evaluates to "true" and b if x evaluates to "false." generate y = cond(inc1 > inc2, inc1, inc2) creates the variable y as the maximum of inc1 and inc2 (assuming neither is missing).
group(x)  Creates a categorical variable that divides the data as presently sorted into x subsamples that are as nearly equal-sized as possible.
trunc(x)  Returns the integer obtained by truncating (dropping fractional parts of) x.
max(x1,x2,...,xn)  Returns the maximum of x1, x2, ..., xn. Missing values are ignored. For example, max(3+2,.) evaluates to 5.
min(x1,x2,...,xn)  Returns the minimum of x1, x2, ..., xn.
recode(x,x1,x2,...,xn)  Returns missing if x is missing, x1 if x <= x1, or x2 if x <= x2, and so on.
round(x,y)  Returns x rounded to the nearest y.
sign(x)  Returns -1 if x < 0, 0 if x = 0, and +1 if x > 0 (missing if x is missing).
sum(x)  Returns the running sum of x, treating missing values as zero.
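As a quick illustration of how a few of these functions look in use (a sketch only; the variable income and the cutoff values are hypothetical), we might type commands such as:

. generate loginc = ln(income)
. generate highinc = cond(income > 50000, 1, 0) if income < .
. generate inc4 = autocode(income, 4, 0, 100000)
. display ln(100)

The if income < . qualifier in the cond() example keeps missing incomes missing rather than coding them 0, following the missing-value caution discussed earlier in this chapter.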
String functions, not described here, help to manipulate and evaluate string variables. Type help strfun for a complete list of string functions. The reference manuals and User's Guide give examples and details of these and other functions. Multiple functions, operators, and qualifiers can be combined in one command as needed.

The functions and algebraic operators just described can also be used in another way that does not create or change any dataset variables. The display command performs a single calculation and shows the results onscreen. For example:

. display 2+3
. display log10(10^83)
. display invttail(120,.025) * 34.1/sqrt(975)

Thus, display works as an onscreen statistical calculator. Unlike a calculator, display, generate, and replace have direct access to Stata's statistical results. For example, suppose that we summarized the unemployment rates from dataset canada1.dta:

. summarize unemp

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+------------------------------------------------------
       unemp |      11    12.10909    4.250048          7       19.6

After summarize, Stata temporarily stores the mean as a macro named r(mean):

. display r(mean)
12.109091

We could use this result to create variable unempDEV, defined as deviations from the mean:

. gen unempDEV = unemp - r(mean)
. summ unemp unempDEV

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+------------------------------------------------------
       unemp |      11    12.10909    4.250048          7       19.6
    unempDEV |      11    4.37e-08    4.250048  -5.109091   7.490909

Stata also provides another variable-creation command, egen ("extensions to generate"), which has its own set of functions to accomplish tasks not easily done by generate. These include such things as creating new variables from the sums, maxima, minima, medians, interquartile ranges, standardized values, or moving averages of existing variables or expressions. For example, the following command creates a new variable named zscore, equal to the standardized (mean 0, variance 1) values of x:

. egen zscore = std(x)

Or, the following command creates new variable avg, equal to the row mean of each observation's values on x, y, z, and w, ignoring any missing values:

. egen avg = rowmean(x,y,z,w)

To create a new variable named sum, equal to the row sum of each observation's values on x, y, z, and w, treating missing values as zeroes, type

. egen sum = rowsum(x,y,z,w)

The following command creates new variable xrank, holding ranks corresponding to values of x: xrank = 1 for the observation with highest x, xrank = 2 for the second highest, and so forth.

. egen xrank = rank(x)

Consult help egen for a complete list of egen functions, or the reference manuals for further examples.

Converting between Numeric and String Formats

Dataset canada2.dta contains one string variable, place. It also has a labeled categorical variable, type. Both seem to have nonnumerical values.

. use canada2, clear
. list place type

        place                        type
  1.    Canada                     Nation
  2.    Newfoundland             Province
  3.    Prince Edward Island     Province
  4.    Nova Scotia              Province
  5.    New Brunswick            Province
  6.    Quebec                   Province
  7.    Ontario                  Province
  8.    Manitoba                 Province
  9.    Saskatchewan             Province
 10.    Alberta                  Province
 11.    British Columbia         Province
 12.    Yukon                   Territory
 13.    Northwest Territories   Territory

Beneath the labels, however, type remains a numeric variable, as we can see if we ask for the nolabel option:

. list place type, nolabel

        place                    type
  1.    Canada                      3
  2.    Newfoundland                1
  3.    Prince Edward Island        1
  4.    Nova Scotia                 1
  5.    New Brunswick               1
  6.    Quebec                      1
  7.    Ontario                     1
  8.    Manitoba                    1
  9.    Saskatchewan                1
 10.    Alberta                     1
 11.    British Columbia            1
 12.    Yukon                       2
 13.    Northwest Territories       2

String and labeled numeric variables look similar when listed, but they behave differently when analyzed.
Most statistical operations and algebraic relations are not defined for string variables, so we might want to have both string and labeled-numeric versions of the same information in our data. The encode command generates a labeled-numeric variable from a string variable. The number 1 is given to the alphabetically first value of the string variable, 2 to the second, and so on. In the following example, we create a labeled numeric variable named placenum from the string variable place:

. encode place, gen(placenum)

The opposite conversion is possible, too: the decode command generates a string variable using the values of a labeled numeric variable. Here we create string variable typestr from numeric variable type:

. decode type, gen(typestr)

When listed, the new numeric variable placenum, and the new string variable typestr, look similar to the originals:

. list place placenum type typestr

        place                   placenum                     type      typestr
  1.    Canada                  Canada                     Nation       Nation
  2.    Newfoundland            Newfoundland             Province     Province
  3.    Prince Edward Island    Prince Edward Island     Province     Province
  4.    Nova Scotia             Nova Scotia              Province     Province
  5.    New Brunswick           New Brunswick            Province     Province
  6.    Quebec                  Quebec                   Province     Province
  7.    Ontario                 Ontario                  Province     Province
  8.    Manitoba                Manitoba                 Province     Province
  9.    Saskatchewan            Saskatchewan             Province     Province
 10.    Alberta                 Alberta                  Province     Province
 11.    British Columbia        British Columbia         Province     Province
 12.    Yukon                   Yukon                   Territory    Territory
 13.    Northwest Territories   Northwest Territories   Territory    Territory

But with the nolabel option, the differences become visible. Stata views placenum and type basically as numbers.

. list place placenum type typestr, nolabel

        place                   placenum   type      typestr
  1.    Canada                         3      3       Nation
  2.    Newfoundland                   6      1     Province
  3.    Prince Edward Island          10      1     Province
  4.    Nova Scotia                    8      1     Province
  5.    New Brunswick                  5      1     Province
  6.    Quebec                        11      1     Province
  7.    Ontario                        9      1     Province
  8.    Manitoba                       4      1     Province
  9.    Saskatchewan                  12      1     Province
 10.    Alberta                        1      1     Province
 11.    British Columbia               2      1     Province
 12.    Yukon                         13      2    Territory
 13.    Northwest Territories          7      2    Territory

Statistical analyses, such as finding means and standard deviations, work only with basically numeric variables. For calculation purposes, numeric variables' labels do not matter.

. summarize place placenum type typestr

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+------------------------------------------------------
       place |       0
    placenum |      13           7     3.89444          1         13
        type |      13    1.307692    .6304252          1          3
     typestr |       0

Occasionally we encounter a string variable where the values are all or mostly numbers. To convert these string values into their numerical counterparts, use the real function. For example, the variable siblings below is a string variable, although it only has one value, "4 or more", that could not be represented just as easily by a number.

. describe siblings

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------
siblings        str9   %9s                    Number of siblings (string)

. generate sibnum = real(siblings)

The new variable sibnum is numeric, with
a missing value where siblings had "4 or more":

. list
[listing of siblings and sibnum, identical except that sibnum is missing wherever siblings was "4 or more"]

The destring command provides a more flexible method for converting string variables to numeric. In the example above, we could have accomplished the same thing by typing

. destring siblings, generate(sibnum) force

See help destring for information about syntax and options.

Creating New Categorical and Ordinal Variables

A previous section illustrated how to construct a categorical variable called type to distinguish among territories, provinces, and nation in our Canadian dataset. You can create categorical or ordinal variables in many other ways. This section gives a few examples. type has three categories:

. tabulate type

   Province, |
territory or |
      nation |      Freq.     Percent        Cum.
-------------+-----------------------------------
    Province |         10       76.92       76.92
   Territory |          2       15.38       92.31
      Nation |          1        7.69      100.00
-------------+-----------------------------------
       Total |         13      100.00

For some purposes, we might want to re-express a multicategory variable as a set of dichotomies or "dummy variables," each coded 0 or 1. tabulate will create dummy variables automatically if we add the generate option. In the following example, this results in a set of variables called type1, type2, and type3, each representing one of the three categories of type:

. tabulate type, generate(type)

   Province, |
territory or |
      nation |      Freq.     Percent        Cum.
-------------+-----------------------------------
    Province |         10       76.92       76.92
   Territory |          2       15.38       92.31
      Nation |          1        7.69      100.00
-------------+-----------------------------------
       Total |         13      100.00

. describe

Contains data from C:\data\canada2.dta
  obs:            13                          Canadian dataset 2
 vars:            10                          3 Jul 2005 10:43
 size:           624 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------
place           str21  %21s                   Place name
pop             float  %9.0g                  Population in 1000s, 1995
unemp           float  %9.0g                  % 15+ population unemployed, 1995
mlife           float  %9.0g                  Male life expectancy years
flife           float  %9.0g                  Female life expectancy years
gap             float  %9.0g                  Female-male gap life expectancy
type            float  %9.0g       typelbl    Province, territory or nation
type1           byte   %8.0g                  type==Province
type2           byte   %8.0g                  type==Territory
type3           byte   %8.0g                  type==Nation
-------------------------------------------------------------------
Sorted by:
Note: dataset has changed since last saved

. list place type type1-type3

        place                        type   type1   type2   type3
  1.    Canada                     Nation       0       0       1
  2.    Newfoundland             Province       1       0       0
  3.    Prince Edward Island     Province       1       0       0
  4.    Nova Scotia              Province       1       0       0
  5.    New Brunswick            Province       1       0       0
  6.    Quebec                   Province       1       0       0
  7.    Ontario                  Province       1       0       0
  8.    Manitoba                 Province       1       0       0
  9.    Saskatchewan             Province       1       0       0
 10.    Alberta                  Province       1       0       0
 11.    British Columbia         Province       1       0       0
 12.    Yukon                   Territory       0       1       0
 13.    Northwest Territories   Territory       0       1       0

Re-expressing categorical information as a set of dummy variables involves no loss of information; in this example, type1 through type3 together tell us exactly as much as type itself does. Occasionally, however, analysts choose to re-express a measurement variable in categorical or ordinal form, even though this does result in a substantial loss of information. For example, unemp in canada2.dta gives a measure of the unemployment rate. Excluding Canada itself from the data, we see that unemp ranges from 7% to 19.6%, with a mean of 12.26:

. summarize unemp if type != 3

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+------------------------------------------------------
       unemp |      10       12.26     4.44877          7       19.6

Having Canada in the data becomes a nuisance at this point, so we drop it:

. drop if type == 3
(1 observation deleted)

Two commands create a dummy variable named unemp2 with values of 0 when unemployment is below average (12.26), 1 when unemployment is equal to or above average, and missing when unemp is missing. In reading the second command, recall that Stata's sorting and relational operators treat missing values as very large numbers.
. generate unemp2 = 0 if unemp < 12.26
(7 missing values generated)

. replace unemp2 = 1 if unemp >= 12.26 & unemp < .
(5 real changes made)

We might want to group the values of a measurement variable, thereby creating an ordered-category or ordinal variable. The autocode function (see "Using Functions" earlier in this chapter) provides automatic grouping of measurement variables. To create new ordinal variable unemp3, which groups values of unemp into three equal-width groups over the interval from 5 to 20, type

. generate unemp3 = autocode(unemp,3,5,20)
(2 missing values generated)

A list of the data shows how the new dummy (unemp2) and ordinal (unemp3) variables correspond to values of the original measurement variable unemp.

. list place unemp unemp2 unemp3

        place                   unemp   unemp2   unemp3
  1.    Newfoundland             19.6        1       20
  2.    Prince Edward Island     19.1        1       20
  3.    Nova Scotia              13.9        1       15
  4.    New Brunswick            13.8        1       15
  5.    Quebec                   13.2        1       15
  6.    Ontario                   9.3        0       10
  7.    Manitoba                  8.5        0       10
  8.    Saskatchewan                7        0       10
  9.    Alberta                   8.4        0       10
 10.    British Columbia          9.8        0       10
 11.    Yukon                       .        .        .
 12.    Northwest Territories       .        .        .
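Incidentally, a common Stata idiom creates such a dummy variable in a single command, letting a logical expression (which evaluates to 1 when true and 0 when false) do the work. The following sketch is equivalent to the two commands above, with the if qualifier again protecting the missing values:

. generate unemp2 = (unemp >= 12.26) if unemp < .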
i ^ewf cunclar.d - ' " v- Data Management 39 sort caselD Creating and saving unique case identification numbers that store the order of observations at an early stage of dataset development can greatly facilitate later data management. We can use explicit subscripts with variable names, to specify particular observation numbers. For example, the 6th observation in dataset canadal.dta (if we have not dropped or re-sorted anything) is Quebec. Consequently, pop[6] refers to Quebec's population, 7334 thousand. display pop[6] "334.2C02 Similarly, popfl2] is the Yukon's population: display pop[12] Explicit subscripting and the _n system variable have additional relevance when our data form a series. If we had the daily stock market price of a particular stock as a variable named price, for instance, then either price or, equivalently, price[ n] denotes the value of the nth observation or day. price[ _n-1 ] denotes the previous day's price, andprice[ _n+1 ] denotes the next. Thus, we might define a new variable difprice, which is equal to the change in price since the previous day: . generate difprice = price - price[_n-1] Chapter 13, on time series analysis, returns to this topic. Importing Data from Other Programs Previous sections illustrated how to enter and edit data by typing into the Data Editor. If our original data reside in an appropriately formatted spreadsheet, a shortcut can speed up this work: we might be able to copy and paste multi-column blocks of data (not incl uding column labels) directly from the spreadsheet into Stata's Data Editor. This requires some care and perhaps experimentation, because Stata will interpret any column containing non-numeric values as representing a string variable. Single columns (variables) of data could also be pasted into the Data Editor from a text or word processor document. Once data have been successfully pasted into Editor columns, we assign variable names, labels, and so on in the usual manner. These Data Editor methods are quick and easy, but for larger projects it is important to have tools that work directly with computer files created by other programs. Such files fall into two general categories: raw-data ASCII (text) files, which can be read into Stata with the appropriate Stata commands; and system files, which must be translated to Stata format by a special third-party program before Stata can read them. To illustrate ASCII file methods, we return to the Canadian data of Table 2.1. Suppose that, instead of typing these data into Stata's Data Editor, we typed them into our word processor, with at least one space between each value. String values must be in double quotes if they contain internal spaces, as does "Prince Edward Island". For other string values, quotes are optional. Word processors allow the option of saving documents as ASCII (text) files, a simpler and more universal type than the word processor's usual saved-file format. We can thus create an ASCII file named canada.raw that looks something like this: 40 Statistics with Stata Data Management 41 "Canada" 29606.1 10. ■' _> . 1 81.1 "Newfoundland" 575.4 13 . c f 3 . 9 7 9.8 "Prince Edwara Island11 _ 3c . 1 13 . 1 74.3 "Neva Scoria" 937.8 13 . g 7 4 .2 8 Z .4 "New Brunswick" ~6C . 1 1 3 . 3 74.8 80.6 "•Quebec" 7234 .2 13.2 c 81 .2 "Ontario" 11100.3 9. 75 . 81.1 "Manitoba" 1 137 .5 6. c 7 5 g 0 . 8 11 Saskatchewan" 1015 . & c 2 81 .8 "Alberta" 2"4^ 8.4 1 5 . 5 81 . 4 "British ColurrOoia" 3" 6 5 o _ e "5-8 81.4 "Yurion" 3C.1 . "1. 3 SC . "Northwest Territories " 65 . 
8 0.2 78 81.3 Note the use of periods, not blanks, to indicate missing values for the Yukon and Northwest Territories. If the dataset should have five variables, then for every observation, exactly five values (including periods for missing values) must exist. inf ile reads into memory an ASCII file, such as canada.raw, in which the values are separated by one or more whitespace characters — blanks, tabs, and newlines (carriage return, line feed, or both) — or by commas. Its basic form is infile variable-list using filename.raw With purely numeric data, the variable list could be omitted, in which case Stata assigns the names var'l, var2, var3, and so forth. On the other hand, we might want to give each variable a distinctive name. We also need to identify string variables individually. For canada.raw, the infile command might be infile str30 place pop unemp mlife flife using canada.raw, clear (13 o'eser v = ticr.s readl The infile variable list specifies variables in the order that they appear in the data file. The clear option drops any current data from memory before reading in the new file. If any string variables exist, their names must each be preceded by a str# statement. str30 , for example, informs Stata that the next-named variable {place) is a string variable with as many as 30 characters. Actually, none of the Canadian place names involve more than 21 characters, but we do not need to know that in advance. It is often easier to overestimate string variable lengths. Then, once data are in memory, use compress to ensure that no variable takesup more space than it needs. The compress command automatically changes all variables to their most memory-efficient storage type. compress ace was ?t r31 describe ■ntair.s Jits cds str21 1 3 9- cf menorv free! or age aispia1 type fern-at lace ■ariable label cat 21 s il if e lif & We can now proceed to label variables and data as described earlier. At any point, the commands save canadaO (or save canadaO, replace ) would save the new dataset in Stata format, as file canadaO.dta. The original raw-data file, canada.raw, remains unchanged on disk. If our variables have non-numeric values (for example, "male" and "female11) that we want to store as labeled numeric variables, then adding the option automatic will accomplish this. For example, we might read in raw survey data through this infile command: infile gender age income vote using survey.raw, automatic Spreadsheet and database programs commonly write ASCII files that have only one observation per line, with values separated by tabs or commas. To read these files into Stata, use insheet. Its general syntax resembles that of inf ile , with options telling Stata whether the data delimited by tabs, commas, or other characters. For example, assuming tab-delimited data, insheet variable-list using filename.raw, tab Or, assuming comma-delimited data with the first row of the file containing variable names (also comma-delimited), insheet variable-list using filename.raw, comma names Writh insheet we do not need to separately identify string variables. If we include no variable list, and do not have variable names in the file's first row, Stata automatically assigns the variable names varl, rar2, var3..... Errors will occur if some values in our ASCII file are not separated by tabs, commas, or some other delimiter as specified in the insheet command. 
Raw data files created by other statistical packages can be in "fixed-column" format, where the values are not necessarily delimited at all, but do occupy predefined column positions. Both infile and the more specialized command infix permit Stata to read such files. In the command syntax itself, or in a "data dictionary" existing in a separate file or as the first part of the data file, we have to specify exactly how the columns should be read. Here is a simple example. Data exist in an ASCII file named nfresour.raw: 198 6240 876416910 0 0 19 8 7 2 5 2 4 ~ 4 3 0 0 0i 1014 4 1? 332 513 5 63^4 81036 1969253 58964371 140 1990 3615731195 1 9 SI 793C001262 These data concern natural resource production in Newfoundland. The four variables occupy fixed column positions: columns 1-4 are the years (1986... 1991); columns 5-8 measure forestry production in thousands of cubic meters (2408...missing); columns 9-14 measure mine production in thousands of dollars (764.169...793,000); and columns 15-18 are the consumer price index relative to 1986 (1000... 1262). Notice that in fixed-column format, unlike space or tab-delimited files, blanks indicate missing values, and the raw data contain no decimal points. To read nfresour.raw into Stata, we specify each variable's column position: 42 Statistics with Stata . infix year 1-4 wood 5-8 mines 9-14 CPI 15-18 using nfresour.raw, clear (6 cbser';a::cns read) . list ! year ,,t £ j", r.ir.es CPI 1 1. 9 3 6 2 - "i 5 7 6 4 16 9 1 c 0 0 2 . 1 19 3 7 2 52 4 7 4 5 2 0C 5 44 3 . 1 19 2 8 2515 = 53"-! = 1086 4 . 1 9B 9 2 c 3 5 6 9 6 4 5 7 1140 5 . 1 9 9 C 3 615 7 3 11 95 6- 1------- 19 91 79300: 1262 More complicated fixed-column formats might require a data "dictionary."' Data dictionaries can be straightforward, but they offer many possible choices. Typing help infix or help infile2 obtains brief outlines of these commands. For more examples and explanation, consult the User's Guide and reference manuals. Stata also can load, write, or view data from OBDC (Open Database Connectivity) sources; see help obdc . What if we need to export data from Stata to some other, non-OBDC program? The outf ile command writes ASCII files to disk. A command such as the following will create a space-delimited ASCII file named canada6.ra\\\ containing whatever data were in memory: outfile using canada6 The infile , insheet , infix , and outfile commands just described all manipulate raw data in ASCII files. A second, very quick, possibility is to copy your data from Stata's Browser and paste this directly into a spreadsheet such as Excell. Often the best option, however, is to transfer data directly between the specialized system files saved by various spreadsheet, database, or statistical programs. Several third-party programs perform such translations. Star/Transfer, for example, will transfer data across many different formats including dBASE, Excel, FoxPro, Gauss, JMP, Lotus. MATLAB, Minitab, OSIRIS, Paradox, S-PIus, SAS, SPSS, SYSTAT. and Stata. It is available through Stata Corporation (www.stata.com) or from its maker. Circle Systems (www.stattransfer.com). Transfer programs prove indispensable for analysts working in multi-program environments or exchanging data with colleagues. Combining Two or More Stata Files_ We can combine Stata datasets in two general ways: append a second dataset that contains additional observations; or merge w ith other datasets that contain new variables or values. 
In keeping with this chapter's Canadian theme, we will illustrate these procedures using data on Newfoundland. File newfl.dta records the province's population for 1985 to 1989. I Data Management 43 . use newfl, clear (New: oundland 19 85-3 9; . describe Cental r.s car a _rcm C: ''■oata\r.ewf 1 .dta cbs : 5 Kev;f oundland 198 5-8 5 var s : 2 3 Col 2 CO5 10:49 size: 5 0 (99.9% of rr.errory free) storage display va1 variable r.ame type fcrrrat label variable label year 1 r. t - 31 . l g Ye ar pep f 1 c a z % 9 . C g Population o r t e a . list 19 3 5 19 3 6 19 3 7 19 3 = 193 5 58C-CC I 5 3 C 2 0 C I 5 6 3 2 0 C I 5 63 00C 5 7 0 Z 0 0 File newfl.dta has population and unemployment counts for some later use newf2 years: (Newf cur.dlar.d 19 9 0-95) describe Contains data obs : va r s : size: frort. C : " 64 { d =11a\newf 2 -dta 99.9% cf nenory free) Newfoundland 19 9 C-5 5 3 Jul 2 0 05 10:49 variable r.ame storage t yp e display value forma" label variable label year pep j c b 1 e s s int float 11 c a t * 9 . 0 g % 9 . C g * 9 . C a Year ?opulat i on Nuir.ber of people ur.enplcved Sorted b-: . list p 0 0 j obIess 1 . 1 19 9 0 5- 7 3 J C 0 4 2 0 C 0 i 2 . 1 19 51 3735o: 4 5 0 C 0 1 ■3 1 1992 5 7 5 fj 0 C 4 9 0 c j [ 4 . 1 19 5 3 3 6 4 4 0 C 49CC: 1 1 _994 53240C ocoo: 1 6 . ' Q y C 5 7 5 4 4 9 To combine these datasets, with newfl.dta already in memory, we use the append command: . append using newfl 44 Statistics with Stata list 420 0: : ?. 2 4 5 ř: 2 4 ■:: 13-35 5 6. ■:": 0 Because variable jobless occurs in newjl {1990 to 1995) but not in newfL its 1985 to 1989 values are missing in the combined dataset. We can now put the observations in order from earliest to latest and save these combined data as a new file. ne\\j3.dta: sort year . list yea r pop 19 5 5 1 5 3 ~ 0 1 1 95 z 6 o 3 _■ _ 1 ~i ' i 1 5 9 3 r - 2 4 0 C 4 2 0 0 2 1991 2 ~ 2 5 C 3 4 5. ü 1 :■ " c| "1 "3 - - 4 ~ O 1" 4 9 3 3 3 19-4 1 6 2 -', 0 3 5 2 ■0 save ne*rf3 append might be compared to lengthening a sheet of paper (that is. the dataset in memory) by taping a second sheet with new observations (rows) to its bottom, merge , in its simplest form, corresponds to "widening" our sheet of paper by taping a second sheet to its right side, thereby adding new variables (columns). For example, dataset ne\vf'4.dta contains further Newfoundland time series: the numbers of births and divorces over the years 1980 to 1994. Thus it has some observations in common with our earlier dataset /?evj3.dta, as well as one variable {year) in common, but it also has two new variables not present in newfS.dta. use newf4 (tie-A-: cun 3 5 a:\ e ". 3- 3 '3 - 9 4 i describe r 4 ..ita 3 3ul 1035 10:49 -9.33 c: r.err." rrv free; cntair.s data rrcrri ■Z-.--zis.ta' Data Management 45 Variable r.- hs sorted . list crag* i n t i r.t d _ s p 1 a ■ fernst 4. . g year bd r t h s diver res 1. 1 19 5 0 1 3 3 3 2 552 2 . 1 19 31 11313 3 1 1982 ■4 i 7 " 62 5 4 . 1 1 5 S 3 9 6 3 0 Ť- 11 5 . 1954 3 5 6 0 5 2 3 6 . i 19 a 2. 3 08 2 561 . 1 19 8 6 3 22C 51 0 1 3 . 1 9&7 7 e 5 £ 1 0 2 2 1 9 . 19&8 "39c &8 4 1 1 J . i 15 5 3 7 9 9 6 981 1 1 1 . 1 1990 7 3 5 4 9 7 3 1 2 . 1 1931 6 92 3 912 1 1 3 . ! 19 92 £ c c- 3 8€~- 1 2 4 . 1 993 63 6 0 931 i ' c 1994 62 9 5 933 ■a 1 ue abei "ariabi? 1 a.'o^. NuTT-ber cf births Mu.Tloer of diver;; We want to merge wir/3 with nen/4, matching observations according to year wherever possib e. To accomphsh this, both datasets must be sorted by the index variable (which in this example is year). 
We earlier issued a sort year command before saving neufi.dtc. so we now do the same w,th „ewf4.dta. Then we merge the two, specifying rra,-as the index variable sort year merge year using nawf3 describe 3c r. _ains -ata fron- r.e- ■'f 4 . e"~ a z-bs : 2 c v s r s : f size: 3 24 3 9 . s tora gs va ri abis e.ane type f c rma --------- ------------- --- ___ year ir.z i 9 . 0 c birte.s 1 n z i- Q divorces int 3 9 . 2q pep float 3 9 . 3g jobless float 9 . C g __me rge cyte ■-. 3 Og Sorted bv -lev.-f our. d_and 1 9 6 3 - 9 4 2 Jul 2305 12:49 va ri able lane 1 V e a r Kuncer of births N urr, ber of d i v trees Population llumber of people onenployed Note list däteset .las changed since 1, 46 Statistics with Stata birzr.s ;iv;r e e a pep jobless merge I - | -«3" - ^ 3 ■/ 2 55 1 1 1 1 1J 2 I 3 9 1 2 . 1934 2 2 6 0 625 7 11 S 9 2 i [ i T . 19 6 6 - n a r> 8 3 2 C 261 2. & 1: 2 '2 2> -3 j 3 . 19 2" ' j 3 'a 1 7 2 0 C 2 3 1 ?~ 2 5 7?4 jj 4 2 2 0 C' 3 i 12. 1 19 9 1 11. 19 2 4 6 9 2 9 6^95 9 3 0 933 5 7 3 5 3 0 5 7 5 6 0 0 5 8 4 4 C 2' 4 9 0 0 0 4 9 0 C 2: 5 i J 0 0 C 3 i 3 16. 1 192-5 5^5449 2 1 In this example, we simply used merge to add new variables to our data, matching observations. By default, whenever the same variables are found in both datasets, those of the "master" data (the file already in memory) are retained and those of the "using" data are ignored. The merge command has several options, however, that override this default. A command of the following form would allow any missing values in the master data to be replaced by corresponding nonmissing values found in the using data (here, newf5.dta): merge year using newx"5, update Or. a command such as the following causes any values from the master data to be replaced by nonmissing values from the using data, if the latter are different: . merge year using newf5, update replace Suppose that the values of an index variable occur more than once in the master data; for example, suppose that the year 1990 occurs twice. Then values from the using data with year = 1990 are matched with each occurrence of year= 1990 in the master data. You can use this capability for many purposes, such as combining background data on individual patients with data on any number of separate doctor visits they made. Although merge makes this and many other data-management tasks straightforward, analysts should look closely at the results to be certain that the command is accomplishing what they intend. As a diagnostic aid, merge automatically creates a new variable called _merge. Unless update was specified, merge codes have the following meanings: 1 Observation from the master dataset only. 2 Observation from the using dataset only. 3 Observation from both master and using data (using values ignored if different). Data Management 47 Tf the update option was specified, jnerge codes convey what happened: 1 Observation from the master dataset only. 2 Observation from the using dataset only. 3 Observation from both, master data agrees with using. 4 Observation from both, master data updated if missing. 5 Observation from both, master data replaced if different. Before performing another merge operation, it will be necessary to discard or rename this variable. For example, drop _merge Or. rename merge _mergel We can merge multiple datasets with a single merge command. For example, if newfS.dta through ne\\[f8.dta are four datasets, each sorted by the variable year, then merging all four with the master dataset could be accomplished as follows. 
merge year using newf5 new£6 newf7 newf8, update replace Other merge options include checks on whether the merging-variable values are unique, and the ability to specify which variables to keep for the final dataset. Type help merge for details. Transposing, Reshaping, or Collapsing Data_ Long after a dataset has been created, we might discover that for some analytical purposes it has the wrong organization. Fortunately, several commands facilitate drastic restructuring of datasets. We will illustrate these using data (growth 1.dta) on recent population growth in five eastern provinces of Canada. In these data, unlike our previous examples, province names are represented by a numerical variable with eight-character labels. use growthl, clear (Eisterr lar.sda ■zrover.i describe Cc ":ain; .lata fro:: 3: da-a' q rev/ a hi . baa ore: 2 eastern Canada grcv.eh vara: 5 2 Jul 2 2 3 5 1 3:4 8 siae: 1C5 (99.93 of rr.encry free) variable name type format label variable label grew 95 \ eat 48 Statistics with Stata Data Management 49 . list I provinc 2 crcw9 2 -;t w 93 arcw94 g rcw9 z 2. I t:ev, 3rcn 1 0 2.5 2.2 2.4 2 . I Kewfc-r.a 4.5 .2 - 3 "2.3 4. I Ontario 174.9 169.2 .2Z-.9 123.9 ■ 3. I Quebec 82.6 77.4 46.5 4".l | +----------------------------------------------+ In this organization, population growth for each year is stored as a separate variable. We could analyze changes in the mean or variation of population growth from year to year. On the other hand, given this organization, Stata could not readily draw a simple time plot of population growth against year, nor can Stata find the correlation between population growth in New Brunswick and Newfoundland. All the necessary information is here, but such analyses require different organizations of the data. One simple reorganization involves transposing variables and observations. In effect, the dataset rows become its columns, and vice versa. This is accomplished by the xpose command. The option clear is required with this command, because it always clears the present data from memory. Including the varname option creates an additional variable (named varname ) in the transposed dataset, containing original variable names as strings. xpose, clear varname describe Contains data cbs : 5 size: 160 i 9 9 . 91 ef rr.err.cry free) storage d i s p 1 a ariable n a rr e type f o r r-i a - 1 f 2 c a t 2 9 . 0 = 2 float * 9 . 0 g 3 cleat % 9 . 0 g 4 float 9 . 0 g S float % 5 . C g '.- a r n a Hi e a tr 3 ?: 9 5 Sorted by: Note: dataset has changed sir.ee last saved . list -A '.7 C. va r name 4 C provi_oc2 12 1 17-1 9 3 0.6 arc v,- 9 2 5 s : 6 9 1 77.4 grcw93 2 12 0 9 48.5 g r ow 9 4 3 9 163 a 4 7.1 arow95 Value labels are lost along the way, so provinces in the transposed dataset are indicated only by their numbers (1 = New Brunswick, 2 - Newfoundland, and so on). The second through last values in each column are the population gains for that province, in thousands. Thus, variable v / has a province identification number (1, meaning New Brunswick) in its first row, and New Brunswick's population growth values for 1992 to 1995 in its second through fifth rows. We can now find correlations between population growth in different provinces, for instance, by typing a correlate command with in 2/5 (second through fifth observations only) qualifier: . correlate vl-v5 in 2/5 (cos=4) vii i.:o c: :■- I C . 5 053 1 . 0 : 02 v 3 I 0 . 9 ~ 4 2 0.8573 1 . : 0 C 0 v4 ! 0 . 5 C 7u 0.4 3 23 9.620 4 1 . 
0 0 0 C v 5 0.6~26 0.93 62 3.804 9 C.6 7 6 5 2.0 C 0 0 The strongest correlation appears between the growth of neighboring maritime provinces New Brunswick (vl) and Nova Scotia (vi): r = .9742. Newfoundland's (v2) growth has a much weaker correlation with that of Ontario (v4): r = .4803. More sophisticated restructuring is possible through the reshape command. This command switches datasets between two basic configurations termed "wide" and "long." Dataset growth 1 .dta is initially in wide format. use growthl, clear ^ = :ern Canada growthl list 1 provir.o2 grcw92 grow55 grow94 qzow95 1 - New 3ron 2 Z 2 . i Newfound 4 . z 4 . Ontario 174.9 5 . Quebec 3 0.6 2.5 2.2 2.4| ■ 8 -3 -5.S 1 7 • 6 3.5 3.9| 369.1 12 0.9 163.5 ! ~ 7 . 4 4 3 . 5 4 7 .1 A reshape command switches this to long format, reshape long grow, i(provinc2) j(year) ote: - = 92 9 3 9 4 9 5) t a wi tie -> long rr.ber cf obs . rr.ber c f vari ables variable (4 values) 5 - > 2 2' 5 -> 3 -> year xi] variables: g r o w? 2 g r o w9 3 ... g r c w 9 5 -> grew Listing the data shows how they were reshaped. A sepby () option with the list command produces a table with horizontal lines visually separating the provinces, instead of every five observations (the default). 50 Statistics with Stata list, sepby(provincí) New Brun :; = w Brun I Kc-I No- 'j n i a r i1 16. label data "Eastern Canadian growth--long" label variable grow "Population growth in 1000s" save growth2 f i le C : '■ daoa\qrco, di: a = avea The reshape command above began by stating that we want to put the dataset in long form. Next, it named the new variable to be created, grow. The i (provinc2) option specified the observation identifier, or the variable whose unique values denote logical observations. In this example, each province forms a logical observation. The j (year) option specifies the sub-observation identifier, or the variable whose unique values (within each logical observation) denote sub-observations. Here, the sub-observations are years within each province. Figure2.1 shows a possible use for the long-format dataset. Withone graph command, we can now produce time plots comparing the population gains in New Brunswick, Newfoundland, and Nova Scotia (observations for which provincl < 4). The graph command on the following page calls for connected-line plots of grow (as v-axis variable) against year (x axis) if province! < 4, with horizontal lines at y = 0 (zero population growth), and separate plots for each value of province. Data Management 51 graph twoway connected grow year if provinc2 < 4, yline(O) ky (provinc2) -C o 1_ c ■B o m Q_ New Brun Nova Sco 92 93 94 Graphs by Eastern Canadian province Newfound Figure 2.1 95 year Declines in their fisheries during the early 1990s contributed to economic hardships in these three provinces. Growth slowed dramatically in New Brunswick and Nova Scotia, while Newfoundland (the most fisheries-dependent province) actually lost population. reshape works equally well in reverse, to switch data from "long" to "wide" format. Dataset growth3.dta serves as an example of long format. . use growth3, clear (eastern Canadian growth--long) . list, sepby(provinc2) 1 provir.c2 grow year 1 . N e w E r u n 1C 9 2 i. . I'isw Brun 2 . 5 93 3 . i New 3rur. 2 .2 94 ^ ■ 1 Mew 3rur. 2 . 4 95 5 . 1 Newfound 4 . 5 9 2 c . 1 Newfound . S 5 3 1 Newfound -3 & . Newfound -5.3 9 5 9. Nova Sec 2 2.1 92 1C . , Neva See 5 . 3 93 11 . 1 Neva Sco 3 . 5 94 12 . 1 Neva 5eo 3 . 9 95 13 . Ontario 174.9 1 4 . Ontario 169.1 93 IB . Ontario 1 2 C . 5 9 4 1 6 . 
O'tar.c 163.9 92 r 52 Statistics with Stata 17 . 1 Quebec 80 6 92 18 . 1 Quebec 77 4 93 19 . 1 Quebec 48 5 94 20 . \ Quebec 47 1 95 +-------------------------+ To convert this to wide format, we use reshape wide : . reshape wide grow, i(provinc2) j(year) (note: j = 92 93 94 95) Data long > wide Number of obs. 20 > 5 Number of variables 3 > 5 j variable (4 values) year • - > (dropped) xii variables: grow92 grow93 . . grow95 grow > . list + - \ provinc2 grow9 2 grow93 grow94 grow95 1 1. New Brun 10 2 . 5 2 . 2 2.4 2 . 1 Newfound 4 . 5 . 8 -3 -5.8 1 3 . 1 Nova Sco 12.1 5 . 8 3 . 5 3.9 1 4 . 1 Ontario 174.9 169.1 120.9 16 3.9 1 5 . 1 Quebec 80.6 77.4 48.5 47.1 1 + - Notice that we have recreated the organization of dataset growthl .dta. Another important tool for restructuring datasets is the collapse command, which creates an aggregated dataset of statistics (for example, means, medians, or sums). The long growth3 dataset has four observations for each province: use growth3, clear (Eastern Canadian growth--long) . list, sepby(provinc2) +-------------------------+ 1 provinc2 grow year 1. New Brun 10 92 2 . New Brun 2 .5 93 3 . New Brun 2 .2 94 4 . ( New Brun 2 .4 95 5 . \ Newfound 4 .5 92 6 . Newfound .8 93 7 . 1 Newfound -3 94 8 . 1 Newfound -5.8 95 9 . 1 Nova Sco 12.1 92 10 . 1 Nova Sco 5.8 93 11 . 1 Nova Sco 3 .5 94 12 . 1 Nova Sco 3.9 95 13 . 1 Ontario 174.9 92 14 . 1 Ontario 169.1 93 15 . 1 Ontario 120.9 94 16 . 1 Ontario 163.9 95 Data Management 53 17 . 1 Quebec 80 6 92 18 . 1 Quebec 77 4 93 19. 1 Quebec 48 5 94 20 . 1 Quebec 47 1 95 +-------------------------+ We might want to aggregate the different years into a mean growth rate for each province. In the collapsed dataset, each observation will correspond to one value of the by ( ) variable, that is, one province. . collapse (mean) grow, by(provinc2) . list provinc2 grow New Brun 4.275 Newfound -.8150001 Nova Sco 6.325 Ontario 157.2 Quebec 63 . 4 For a slightly more complicated example, suppose we had a dataset similar to growths.dta but also containing the variables births, deaths, and income. We want an aggregate dataset with each province's total numbers of births and deaths over these years, the mean income (to be named meaninc), and the median income (to be named medinc). If we do not specify a new variable name, as with grow in the previous example, or births and deaths, the collapsed variable takes on the same name as the old variable. . collapse (sum) births deaths (mean) meaninc = income (median) medinc = income, by(provinc2) collapse can create variables based on the following summary statistics: mean Means (the default; used if the type of statistic is not specified) sd Standard deviations sum Sums rawsum Sums ignoring optionally specified weight count Number of nonmissing observations max Maximums min Minimums median Medians pi 1st percentiles p2 2nd percentiles (and so forth to p99 ) iqr Interquartile ranges 54 Statistics with Stata Weighting Observations Stata understands four types of weighting: aweight Analytical weights, used in weighted least squares (WLS) regression and similar procedures. f weight Frequency weights, counting the number of duplicated observations. Frequency weights must be integers, iweight Importance weights, however you define "importance." pweight Probability or sampling weights, equal to the inverse of the probability that an observation is included due to sampling strategy. Researchers sometimes speak of "weighted data." 
This might mean that the original sampling scheme selected observations in a deliberately disproportionate way, as reflected by weights equal to [/(probability of selection). Appropriate use of pweight can compensate for disproportionate sampling in certain analyses. On the other hand, "weighted data" might mean something different — an aggregate dataset, perhaps constructed from a frequency table or cross-tabulation, with one or more variables indicating how many times a particular value or combination of values occurred. In that case, we need f weight. Not all types of weighting have been defined for all types of analyses. We cannot, for example, use pweight with the tabulate command. Using weights in any analysis requires a clear understanding of what we want weighting to accomplish in that particular analysis. The weights themselves can be any variable in the dataset. The following small dataset (nfschool.dta), containing results from a survey of 1,381 rural Newfoundland high school students, illustrates a simple application of frequency weighting. describe Contains data obs : vars : size: from C: 6 3 48 \data\nfschool.dta (99.9% of memory free) Newf.school/univer. (Seyfrit 93) 3 Jul 2005 10:50 variable name storage type di splay fo rma t value label variable label univers year count byte byte int %8 . Og %8 . Og %8 . Og ye s Expect to attend university? What year of school now? observed frequency Sorted by: list, sep(3) 1 univers year count 1. 1 no 10 210 2 . 1 no 11 260 3 . 1 no 12 274 4 . 1 ye s 10 224 5 . 1 yes 11 235 6 . 1 yes 12 178 Data Management 55 At first glance, the dataset seems to contain only 6 observations, and when we cross-tabulate whether students expect to attend a university (univers) by their current year in high school (year), we get a table with one observation per cell, tabulate univers year Expect to I attend | university | What year of school now? 10 11 12 Total no I yes I Total To understand these data, we need to apply frequency weights. The variable count gives frequencies: 210 of these students are tenth graders who said they did not expect to attend a university, 260 are eleventh graders who said no, and so on. Specifying [fweight = count] obtains a cross-tabulation showing responses of all 1,381 students. . tabulate univers year [fweight = count] Expect to I attend | university | What year of school now? •p l 10 11 12 1 Total no l 210 260 274 744 yes 1 224 235 178 637 Total 1 434 495 452 1, 381 Carrying I the analysis further, we might add options a skin ir a table with column percentages ( col), no cell frequencies (nof), and a %2 test of independence (chi2 ). This reveals a statistically significant relationship (P = .001). The percentage of students expecting to go to college declines with each year of high school. . tabulate univers year [fw = count], col nof chi2 Expect to I attend I university | I What year of school now? 10 11 12 Total no yes 48.39 51.61 52 .53 47.47 60.62 I 39.38 I 53 . 87 46 . 13 Total I 100.00 100.00 100.00 | 100.00 Pearson chi2(2) = 13.8967 Pr = 0.001 Survey data often reflect complex sampling designs, based on one or more of the following: disproportionate sampling — for example, oversampling particular subpopulations, in order to get enough cases to draw conclusions about them. clustering — for example, selecting voting precincts at random, and then sampling individuals within the selected precincts. 
56 Statistics with Stata stratification — for example, dividing precincts into "urban" and "rural" strata, and then sampling precincts and/or individuals within each stratum. Complex sampling designs require specialized analytical tools, pweights and Stata's ordinary analytical commands do not suffice. Stata's procedures for complex survey data include special tabulation, means, regression, logit, probit, tobit, and Poisson regression commands. Before applying these commands, users must first set up their data by identifying variables that indicate the PSUs (primary sampling units) or clusters, strata, finite population correction, and probability weights. This is accomplished through the svyset command. For example: svyset precinct [pweight=invPsel], strata(urb_rur) fpc(finite) For each observation in this example, the value of variable princinct identifies PSU or cluster. Values of urbrur identify the strata, finite gives the finite population correction, and invPsel gives the probability weight or inverse of the probability of selection. After the data have been svyset and saved, the survey analytical procedures are relatively straightforward. Commands are typically prefixed by svy: , as in svy: mean income or svy: regress income education experience gender The Survey Data Reference Manual contains full details and examples of Stata's extensive survey-analysis capabilities. For online guidance, type help svy and follow the links to particular commands. Creating Random Data and Random Samples_ The pseudo-random number function uniform() lies at the heart of Stata's ability to generate random data or to sample randomly from the data at hand. The Base Reference Manual (Functions) provides a technical description of this 32-bit pseudo-random generator. If we presently have data in memory, then a command such as the following creates a new variable named randnum, having apparently random 16-digit values over the interval [0,1) for each case in the data. . generate randnum = uniform() Alternatively, we might create a random dataset from scratch. Suppose we want to start a new dataset containing 10 random values. We first clear any other data from memory (if they were valuable, save them first). Next, set the number of observations desired for the new dataset. Explicitly setting the seed number makes it possible to later reproduce the same "random" results. Finally, we generate our random variable. clear . set obs 10 obs was 0, now 10 . set seed 12345 . generate randnum = uniform() Data Management 57 list 1 randnum 1. .309106 2 . .6852276 3 . .1277815 4 . .5617244 5 . .3134516 6 . .5047374 7 . . 7232868 8 . .4176817 9 . . 67 68828 10 . .3657581 In combination with Stata's algebraic, statistical, and special functions, uniform () can simulate values sampled from a variety of theoretical distributions. If we want newvar sampled from a uniform distribution over [0,428) instead of the usual [0,1), we type . generate newvar = 428 * uniform() These will still be 16-digit values. Perhaps we want only integers from 1 to 428 (inclusive): . generate newvar = 1 + trunc(428 * uniform()) To simulate 1,000 rolls of a six-sided die, type clear . set obs 1000 obs was 0, now 1000 . generate roll = 1 + trunc(6 * uniform()) . tabulate roll die Freq. Percent Cum. 1 1 171 17 10 17 10 2 1 164 16 40 33 50 3 1 150 15 00 48 50 4 1 170 17 00 65 50 5 t 169 16 90 82 40 6 1 176 17. 
60 100 00 Total I 1000 100.00 We might theoretically expect 16.67% ones, 16.67% twos, and so on, but in any one sample like these 1,000 "rolls," the observed percentages will vary randomly around their expected values. To simulate 1,000 rolls of a pair of six-sided dice, type . generate dice = 2 + trunc(6 * uniform()) + trunc(6 * uniform()) tabulate dice Freq. Percent dice | Cum. 2 1 26 2 60 2 60 3 1 62 6 20 8 80 4 1 78 7 80 16 60 5 1 120 12 00 28 60 6 1 153 15 30 43 90 7 1 149 14 90 58 80 8 1 146 14 . 60 73 40 58 Statistics with Stata If 9 96 9 60 83 00 10 1 88 8 80 91 80 11 1 53 5 30 97 10 12 1 29 2 90 100 00 Total [ 1000 100 00 We can use _n to begin an artificial dataset as well. The following commands create a new 5,000-observation dataset with one variable named index, containing values from 1 to 5,000. . set obs 5000 obs was 0, now 5000 generate index = summarize Variable I Obs index 5000 Mean 2500.5 Std. Dev. 1443.52 Min 1 Max 5000 It is possible to generate variables from a normal (Gaussian) distribution using uniform(). The following example creates a dataset with 2,000 observations and 2 variables, z from an N(0,1) population, and* fromN(500,75). clear . set obs 2000 obs was 0, now 2000 . generate z = invnormal(uniform()) generate x = 500 + 75*invnormal(uniform()) The actual sample means and standard deviations differ slightly from their theoretical values: summarize Variable I Obs Std. Dev. Min z I 2000 2 000 . 0375032 503.322 1 . 026784 75.68551 -3.536209 244.3384 4 .038878 743.1377 If z follows a normal distribution, v = e2 follows a lognormal distribution. To form a lognormal variable v based upon a standard normal z, generate v = exp(invnormal(uniform()) To form a lognormal variable w based on an N(100,15) distribution, generate w = exp (100 + 15*invnormal(uniform()) Taking logarithms, of course, normalizes a lognormal variable. To simulate y values drawn randomly from an exponential distribution with mean and standard deviation [x = o = 3, . generate y = -3 * In(uniform()) For other means and standard deviations, substitute other values for 3. XI follows a x2 distribution with one degree of freedom, which is the same as a squared standard normal: . generate XI = (invnormal(uniform())A2 By similar logic, X2 follows a %2 with two degrees of freedom: Data Management 59 . generate X2 = (invnormal(uniform()))A2 + (invnormal(uniform()))A2 Other statistical distributions, including t and F, can be simulated along the same lines. In addition, programs have been written for Stata to generate random samples following distributions such as binomial, Poisson, gamma, and inverse Gaussian. Although invnormal (uniform () ) can be adjusted to yield normal variates with particular correlations, a much easier way to do this is through the drawnorm command. To generate 5,000 observations from N(0,1), type clear drawnorm z, n(5000) . summ Variable I Obs 5000 Mean -.0005951 Std. Dev. Min 1.019788 -4.518918 Max 3.923464 Below we will create three further variables. Variable */ is from an N(0 1) population variable x2 is from N(100,15), and x3 is from W500.75V F,,rth. *( 1,17^1 xl x2 -----■ ".j x3 xl 1. 0 0.4 -0.8 x2 0.4 1 . 0 0.0 x3 -0 . 8 0.0 1.0 Ismfcmte ^ ^ ^ COrrdation matrix C> then using c in tue drawnorm command: . mat C = (1, •4, -. 8 \ -4, 1, 0 \ -.8, 0, 1) drawnorm xl ■ summarize xl x2 x3, -x3 means (0,100,500) sds(l,15,75) corr(C) Variable | Obs Mean Std. Dev. x2 1 x3 1 5000 5000 5000 .0024364 100.1826 500.7747 1 .01648 14.91325 76.93925 -3.478467 46.13897 211.5596 3 . 
5 98 916 150.7634 769.6074 . correlate xl 26. We could also select random samples of a particular size. To discard all but 90 randomly-selected observations from the dataset in memory, type sample 90, count The sections in Chapter 14 on bootstrapping and Monte Carlo simulations provide further examples of random sampling and random variable generation. Writing Programs for Data Management_ Data management on larger projects often involves repetitive or error-prone tasks that are best handled by writing specialized Stata programs. Advanced programming can become very technical, but we can also begin by writing simple programs that consist of nothing more than a sequence of Stata commands, typed and saved as an ASCII file. ASCII files can be created using your favorite word processor or text editor, which should offer "ASCII text file" among its options under File - Save As. An even easier way to create such text files is through Stata's Do-file Editor, which is brought up by clicking Window - Do-file Editor or the icon <^. Alternatively, bring up the Do-file Editor by typing the command doedit, or doedit filename iffilename exists. For example, using the Do-file Editor we might create a file named canada.do (which contains the commands to read in a raw data file named canada.raw), then label the dataset and its variables, compress it, and save it in Stata format. The commands in this file are identical to those seen earlier when we went through the example step by step. infile str30 place pop unemp mlife flife using canada.raw label data "Canadian dataset 1" label variable pop "Population in 1000s, 1995" label variable unemp "% 15+ population unemployed, 1995" label variable mlife "Male life expectancy years" label variable flife "Female life expectancy years" compress save canadal, replace Once this canada.do file has been written and saved, simply typing the following command causes Stata to read the file and run each command in turn: . do canada Data Management 61 Such batch-mode programs, termed "do-files," are usually saved with a .do extension. More elaborate programs (defined by do-files or "automatic do" files) can be stored in memory, and can call other programs in turn — creating new Stata commands and opening worlds of possibility for adventurous analysts. The Do-file Editor has several other features that you might find useful. Chapter 3 describes a simple way to use do-files in building graphs. For further information, see the Getting Started manual on Using the Do-file Editor. Stata ordinarily interprets the end of a command line as the end of that command. This is reasonable onscreen, where the line can be arbitrarily long, but does not work as well when we are typing commands in a text file. One way to avoid line-length problems is through the # de 1 imi t command, which can set some other character as the end-of-command delimiter. In the following example, we make a semicolon the delimiter; then type two long commands that do not end until a semicolon appears; and then finally reset the delimiter to its usual value, a carriage return ( cr ): #delimit ; infile str30 place pop unemp mlife flife births deaths marriage medinc mededuc using newcan.raw; order place pop births deaths marriage medinc mededuc unemp mlife flife; ♦delimit cr Stata normally pauses each time the Results window becomes full of information, and waits to proceed until we press any key (or ^j). Instead of pausing, we can ask Stata to continue scrolling until the output is complete. 
Typed in the Command window or as part of a program, the command set more off calls for continuous scrolling. This is convenient if our program produces much screen output that we don't want to see, or if it is writing to a log file that we will examine later. Typing . set more on returns to the usual mode of waiting for keyboard input before scrolling. Managing Memory When we use or File - Open a dataset, Stata reads the disk file and loads it into memory. Loading the data into memory permits rapid analysis, but it is only possible if the dataset can fit within the amount of memory currently allocated to Stata. If we try to open a dataset that is too large, we get an elaborate error message saving "no room to add more observations," and advising what to do next. use C : \data\gi>ank2 . dta (Scientific surveys off S. Newfoundland) no room to add more observations An attempt was made to increase the number of observations beyond what is currently possible. You have the following alternatives: 1. Store your variables more efficiently; see help compress. (Think of Stata's data area as the area of a rectangle; Stata can trade off width and length.) 2. Drop some variables or observations; see help drop. 62 Statistics with Stata Data Management 63 the set 3. Increase the amount of memory allocated to the data area using memory command; see help memory. r ( 9 0 1) ; Small Stata allocates a fixed amount of memory to data, and this limit cannot be changed. Intercooled Stata and Stata/SE versions are flexible, however. Default allocations equal 1 megabyte for Intercooled, and 10 megabytes for Stata/SE. If we have Intercooled or Stata/SE, running on a computer with enough physical memory, we can set Stata's memory allocation higher with the set memory command. To allocate 20 megabytes to data, type . set memory 20m Current memory allocation current settable value description memory usage (1M = 1024k) set maxvar set memory set matsize 5000 max. variables allowed 20M max. data space 400 max. RHS vars in models 1.733M 20.000M 1. 254M 22.987M If there are data already in memory, first type the command clear to remove them. To reset the memory allocation "permanently," so it will be the same next time we start up, type set memory 20m, permanently In the example given earlier, gbank2.dta is a 11.3-megabyte dataset that would not fit into the default allocation. Asking for a 20-megabyte allocation has now given us more than enough room for these data. Contains data from C:\data\gbank2.dta obs: 74,078 vars : size: 44 Spring scientific surveys NAFO 3KLNOPQ, 1971-93 2 Mar 2000 21:28 LI 333,934 (46.0% of memory free) va storage display riable name type format value label id rec_type vessel trip set rank assembla year month day set_type stratum division unit_are light w ind_di r wind_for sea bottom time_mid duration tow dist float byte byte int int int str7 byte byte byte byte int str2 str3 int byte byte byte byte int byte int %4 %4 %7 %4 %4 %4 Og Og Og Og Og Og Og Og Og Og Og variable label original case number %2s %3s %4 %4 .0g .0g .0g .0g .0g .0g .0g .0g Vessel Trip number Set number Year Month Day set_type Set type Stratum or line fished NAFO division Nfld. 
area grid map square Light conditions Wind direction Wind force Type of bottom Time (midpoint) Duration of set Distance towed gear op byte %4 Og Operation of gear depthcat byte %4 Og Category of depth min dept int %8 Og Depth (minumum) max dept int %8 Og Depth (maximum) bot dept int %8 Og Depth (bottom if MWT) temp sur int %8 Og Temperature (surface) tempcat byte %8 Og Category of temperature temp fs int %8 Og Temperature (fishing depth) lat float %9 Og Latitude (decimal) long float %9 Og Longitude (decima1) pos meth byte %4 Og gear int %8 Og Gear total byte %9 Og species int %8 Og Species number long %9 Og Number of individual fish weight double %9 Og Catch weight in kilograms latin str31 %31s Species -- Latin name common str27 %27s Species -- common name surtemp float %9 Og Surface temperature degrees C f i shtemp float %9 Og Fishing depth temperature C depth int %9 Og Mean trawl depth in meters ispecies byte %9 Og Indicator species Sorted by: id Dataset gbank2.dta contains 74,078 observations from scientific surveys offish populations on Newfoundland's Grand Banks, conducted over the years 1971 to 1993. When we describe the data (above), Stata reports "46.09% of memory free," meaningnot 46% of the computer's total resources, but 46% of the 20 megabytes we allocated for Stata data. It is usually advisable to ask for more memory than our data actually require. Many statistical and data-management operations consume additional memory, in part because they temporarily create new variables as they work. It is possible to set memory to values higher than the computer's available physical memory. In that case, Stata uses "virtual memory," which is really disk storage. Although virtual memory allows bypassing hardware limitations, it can be terribly slow. If you regularly work with datasets that push the limits of your computer, you might soon conclude that it is time to buy more memory. Type help limits to see a list of limitations in Stata, not only on dataset size but also other dimensions including matrix size, command lengths, lengths of names, and numbers of variables in commands. Some of these limitations can be adjusted by the user. Graphs Graphs appear in every chapter of this book — one indication of their value and integration with other analyses in Stata, Indeed, graphics have always been one of Stata's strong suits, and reason enough for many users to choose Stata over other packages. The graph command evolved incrementally from Stata versions 1 through 7. Stata version 8 marked a major step forward, however, graph underwent a fundamental redesign, expanding its capabilities for sophisticated, publication-quality analytical graphics. Output appearance and choices were much improved as well. With the new graph command syntax and defaults, or alternatively through the new menus, attractive (and publishable) basic graphs are quite easy to draw. Graphically ambitious users who visualize non-basic graphs will find their efforts supported by a truly impressive array of tools and options, described in the 500-page Graphics Reference Manual. In the much shorter space of this chapter, the spectrum from elementary to creative graphing will be covered taking an example- rather than syntax-oriented approach (see the Graphics Reference Manual ox help graph for thorough coverage of syntax). We begin by illustrating seven basic types of graphs. 
histograms two-variable scatterplots, line plots, and many others histogram graph twoway graph matrix graph box graph pie graph bar graph dot scatterplot matrices box plots pie charts bar charts dot plots For each of these basic types, there exist many options. That is especially true for the versatile twoway type. More specialized graphs such as symmetry plots, quantile plots, and quantile-normal plots exist as well, for examining details of variable distributions. A few examples of these, and also of graphs for industrial quality control, appear in this chapter. Type help graph^other for more details. Final ly, the chapter concludes with techniques particularly useful in building data-rich, self-contained graphics for publication. Such techniques include adding text to graphs, overlaying multiple twoway plots, retrieving and reformatting saved graphs, and combining multiple graphs into one. As our graphing commands grow more complicated, simple batch programs Graphs 65 (do-files) can help to write and re-use them. The full range of graphical choices goes far beyond what this book can cover, but the concluding examples point out a few of the possibilities. Later chapters supply further examples. The Graphics menu provides point-and-click access to most of these graphing procedures. A note to long-time Stata users: The graphical capabilities of Stata 8 and 9 outshine those of earlier versions. For analysts comfortable with old Stata, there is much new material to learn. Menus allow a quick entry, and the new graphics commands, like the old ones, follow a consistent logic that becomes clear with practice. Fortunately, the changeover need not be sudden. Version 7-style graphics remain available if needed. They have been moved to the command graph7 . For example, an old-version scatterplot would formerly have been drawn by the command graph income education which does not work in the newer Stata. Instead, the command . graph7 income education will reproduce the familiar old type of graph. The options of graph7 are similar to those of the old-style graph . To see an updated version of this same scatterplot, type the new graphics command graph twoway scatter income education Further examples of new commands appear in the next section, which should give a sense of what has changed (and what is familiar) with the redesigned graphical capabilities. Example Commands histogram y, frequency Draws histogram of variable}', showing frequencies on the vertical axis. histogram y, start(0) width(lO) norm fraction Draws histogram of y with bins 10 units wide, starting at 0. Adds a normal curve based on the sample mean and standard deviation, and shows fraction of the data on the vertical axis, histogram y, by(xr total) fraction In one figure, draws separate histograms ofy for each value of x, and also a "total" histogram for the sample as a whole. kdensity x, generate(xpoints xdensity) width(20) biweight Produces and graphs kernel density estimate of the distribution of .v. Two new variables are created: xpoints containing the* values at which the density is estimated, and xdensity with the density estimates themselves, width (20) specifies the halfwidth of the kernel, in units of the variable x. (If width () is not specified, the default follows a simple formula for "optimal.") The biweight option in this example calls for a biweight kernel, instead of the default epanechnikov . graph twoway scatter y x Displays a basic two-variable scatterplot of>' against x. 
Statistics with Stata Graphs 67 graph twoway lfit y x \\ scatter y x Visualizes the linear regression of y on x by overlaying two twoway graphs: the regression (linear fit or lfit ) line, and the y vs. x scatterplot To include a 95% confidence band for the regression line, replace lfit with If itci . graph twoway scatter y x, xlabel(0(10)100) ylabel(-3 (1)6, horizontal) Constructs scatterplot ofy vs. x, withx axis labeled at 0, 10,100. y axis is labeled at -3, -2, 6, with labels written horizontally instead of vertically (the default). graph twoway scatter y x, mlabel(country) Constructs scatterplot ofy vs. with data points (markers) labeled by the values of variable country. graph twoway scatter y xl, by(x2) In one figure, draws separate y vs. xl scatterplots for each value ofx2. graph twoway scatter y xl [fweight = population], msymbol(Oh) Draws a scatterplot of y vs. xl. Marker symbols are hollow circles (Oh), with their size (area) proportional to frequency-weight variable population. graph twoway connected y time A basic time plot ofy against time. Data points are shown connected by line segments. To include line segments but no data-point markers, use line instead of connected: . graph twoway line y time graph twoway line yl y2 time Draws a time plot (in this example, a line plot) with twoy variables that both have the same scale, and are graphed against an jc variable named time. graph twoway line yl time, yaxis(l) || line y2 time, yaxis (2) Draws a time plot with two y variables that have different scales, by overlaying two individual line plots. The left-handy axis, yaxis (1), gives the scale for W, while the right-hand v axis, yaxis (2), gives the scale for v2. graph matrix xl x2 x3 x4 y Constructs a scatterplot matrix, showing all possible scatterplot pairs among the variables listed. graph box yl y2 y3 Constructs box plots of variables y J, y2, andyi. graph box y, over(x) yline(.22) Constructs box plots ofy for each value of *, and draws a horizontal line aty = .22. graph pie a b c, pie Draws one pie chart with slices indicating the relative amounts of variables a, b, and c. The variables must have similar units. graph bar (sum) a b c Shows the sums of variables a, b, and c as side-by-side bars in a bar chart. To obtain means instead of sums, type graph bar (mean) a jb c . Other options include bars representing medians, percentiles, or counts of each variable. graph bar (mean) a, over(x) Draws a bar chart showing the mean of variable a at each value of variable x. . graph bar (asis) a b c, over(x) stack Draws a bar chart in which the values ("as is") of variables a, b, and c are stacked on top of one another, at each value of variable x. . graph dot (median) y, over(x) Draws a dot plot, in which dots along a horizontal scale mark the median value of v at each level ofx. Other options include means, percentiles, or counts of each variable. . qnorm y Draws a quantile-normal plot (normal probability plot) showing quantiles of v versus corresponding quantiles of a normal distribution. . rchart xl x2 x3 x4 x5, connect(l) Constructs a quality-control R chart graphing the range of values represented by variables xl-x5. Graph options, such as those controlling titles, labels, and tick marks on the axes are common across graph types wherever this makes sense. Moreover, the underlying logic of Stata's graph commands is consistent from one type to the next. These common elements are the key to gaining graph-building fluency, as the basics begin to fall into place. 
Histograms Histograms, displaying the distribution of measurement variables, are most easily produced with their own command histogram. For examples, we turn to states.dta, which contains selected environment and education measures on the 50 U.S. states plus the District of Columbia (data from the League of Conservation Voters 1991; National Center for Education Statistics 1992, 1993; World Resources Institute 1993). use states (U.S. states data 1990-91) describe Cont a ins dat a from c : \data\states dt a obs : 51 U.S. states data 1990-91 vars : 21 4 Jul 2005 12:07 size: 4,080 (99 . 9% of memory free) s torage display value variable name type forma t label variable label state s tr20 % 2 0s State region byte %9 .0g region Geographical region pop float %9 .0g 1990 population area float % 9 . Og Land area, square miles density float %7 . 2f People per square mile met ro float %5 . If Metropolitan area population, % waste float %5 . 2f Per capita solid waste, tons energy int %8 .0g Per capita energy consumed, Btu miles int %8 .0g Per capita miles/year, 1,000 toxic float lb . 2 f Per capita toxics released, lbs green float %5 . 2f Per capita greenhouse gas, tons house byte %8 Og House '91 environ, voting, ro senate byte %8 .0g Senate '91 environ, voting, % cs a t int %9 Og Mean composite SAT score vsa t int %8 Og Mean verbal SAT score 68 Statistics with Stata ms a t pe rcent expen se income high college int byte int long float float ■stí . Dg %9 . Og %9 . Og %10.Og % 9 . 0 g % 9 . 0 g Mean math SAT score % HS graduates taking SAT Per pupil expenditures prim&sec Median household income, 51,000 t adults HS diploma % adults college degree Sorted by Figure 3.1 shows a simple histogram of college, the percentage of a state's over-25 population with a bachelor's degree or higher. It was produced by the following command: histogram college, frequency title("Figure 3.1") Figure 3.1 Figure 3.1 20 25 % adults college degree Under the Prefs - Graph Preferences menus, we have the choice of several pre-designed "schemes" for the default colors and shading of our graphs. Custom schemes can be defined as well. The examples in this book employ the s2 mono (monochrome) scheme, which among other things calls for shaded margins around each graph. The s1 mono scheme does not have such margins. Experimenting with the different monochrome and color schemes helps to determine which works best for a particular purpose. A graph drawn and saved under one scheme can subsequently be retrieved and re-saved under a different one, as described later in this chapter. Options can be listed in any order following the comma in a graph command. Figure 3.1 illustrates two options: frequency (instead of density, the default) is shown on the vertical axis; and the title "Figure 3.1" appears over the graph. Once a graph is onscreen, menu choices provide the easiest way to print it, save it to disk, or cut and paste it into another program such as a word processor. Figure 3.1 reveals the positive skew of this distribution, with a mode above 15 and an outlier around 35. It is hard to describe the graph more specifically because the bars do not line up withx-axis tick marks. Figure 3.2 contains a version with several improvements (based on some quick experiments to find the right values): Graphs 69 1. The.v axis is labeled from 12 to 34, in increments of 2. 2. The y axis is labeled from 0 to 12, in increments of 2. 3. Tick marks are drawn on they axis from 1 to 13, in increments of 2. 4. The histogram's first bar (bin) starts at 12. 
5. The width of each bar (bin) is 2. . histogram college, frequency title ("Figure 3.2") xlabel(12(2)34) ylabel(0(2)12) ytick (1 (2)13) start(12) width(2) Figure 3.2 Figure 3.2 o c 12 14 16 18 20 22 24 26 28 % adults college degree 30 32 34 Figure 3.2 helps us to describe the distribution more specifically. For example, we now see that in 13 states, the percent with college degrees is between approximately 16 and 18. Other useful histogram options include: bin(#) Draw a histogram with # bins (bars). We can specify either bin(#) or, as inFigure3.2, start (#) and width (#)—but not both. percent Show percentages on the vertical axis, ylabel and ytick then refer to percentage values. Another possibility, frequency, is illustrated in Figure 3.2. We could also ask for fraction of the data. The default histogram shows density, meaning that bars are scaled so that the sum of their areas equals 1. gap (#) Leave a gap between bars. # is relative, 0 < # < 100; experiment to find a suitable value. addlabels Label the heights of histogram bars. A separate option, addlabopts , controls the how the labels look. discrete Specify discrete data, requiring one bar for each value of*. 70 Statistics with Stata norm kdensity Overlay a normal curve on the histogram, based on sample mean and standard deviation. Overlay a kcmal-dcnsity estimate on the histogram. The option kdenopts controls density computation; see help kdensity for details. With histograms or most other graphs, we can also override the defaults and specify our own titles for the horizontal and vertical axes. The option ytitle controls y-axis titles, and xtitle controls .v-axis titles. Figure 3.3 illustrates such titles, together with some other histogram options. Note the incremental buildup from basic (Figure 3.1) to more elaborate (Figure 3.3) graphs. This is the usual pattern of graph construction in Stata: we start simply, then experimentally add options to earlier commands retrieved from the Review window, as we work toward an image that most clearly presents our findings. Figure 3.3 actually is over-elaborate, but drawn here to show off multiple options. . histogram college, frequency title("Figure 3.3") ylabel (0 (2)12) ytick (1 (2)13) xlabel (12 (2)34) start(l2) width(2) addlabel norm gap(15) Figure 3.3 Figure 3.3 >,CO H (J 6 \ 4\ 4 12 14 16 18 20 22 24 26 28 30 32 34 % adults college degree Suppose we w ant to see how the distribution of college varies by region. The by option obtains a separate histogram for each value of region. Other options work as they do for single histograms. Figure 3.4 shows an example in which wc ask for percentages on the vertical axis, and the data grouped into 8 bins. Graphs 71 histogram college, by(region) percent bin (8) West South 10 15 Graphs by Geographical region N. East Figure 3.4 mm ■ 111 i Midwest 20 25 30 10 15 Vo adults college degree 20 25 30 Figure 3.5, below, contains a similar set of four regional graphs, but includes a fifth that shows the distribution for all regions combined. . histogram college, percent bin(8) by(region, total) West N. East r □ South Figure 3.5 Midwest Total sis -Pis P 10 15 20 25 30 10 15 20 25 30 % adults college degree Graphs by Geographical region I I 10 15 20 25 30 Axis labeling, tick marks, titles, and the by (varname) or by (varname, total) options work in a similar fashion with other Stata graphing commands, as seen in the following sections. 
fo 72 Statistics with Stata Graphs 73 Scatterplots Basic scatterplots are obtained through commands of the general form graph twoway scatter y x wherey is the vertical ory-axis variable, and.v the horizontal orx-axis one. For example, again using the states.dto dataset. we could plot waste (per capita solid wastes) against metro (percent population in metropolitan areas), with the result shown in Figure 3.6. Each point in Figure 3.6 represents one of the 50 U.S. states (or Washington DC). graph twoway scatter waste metro -I Figure 3.6 £ "to 1§ TO O. fl] O 1_ x[/~ flat, then vertical vertical, then flat Figure 3.15 (on the following page) repeats this stairstep plot ot TAC, but with some enhancements of axis labels and titles. The option xtitle("»> requests no x;axis title (because "year" is obvious). We added tick marks at two-year intervals to the * axis, abe ed they axis at intervals of 100, and printed y-axis labels horizontally instead of vertically (the default). Graphs 81 graph twoway line TAC year, connect (stairstep) xtitle( xtick(1960(2)2000) ytitle("Thousands of tons") ylabel (0 (100) 800, angle(horizontal)) xtitle("") clpattern (dash) 800- 700- 600- c o 500- o 400- o 300- 200- 100- 0- Figure 3.15 1960 1970 1980 1990 2000 Instead of letting Stata determine the line patterns (solid, dashed, etc.) in Figure 3.15, we used the clpattern (dash) option to call for a dashed line. Possible line pattern choices are listed in the table below (also see help linepatternstyle ). clpattern( ) Description solid dash dot dash__dot shortdash s hortdash_do t longdash longdash_dot blank formula solid line dashed line dotted line dash then dot short dash short dash followed by dot long dash long dash followed by dot invisible line for example, clpattern) or clpattern (— .) 82 Statistics with Stata Before move on to other examples and types, Figure 3.16 unites the three variables discussed in this section to create a single graphic showing the tragedy of the Northern Cod. Noichou the connect ( ). clpattern ( ), and legend ( ) options work in this three-variable context. h twoway line cod canada TAC year, connect(line line stairstep) clpattern(solid longdash dash) xtitle(»") xtick(1960(2)2000) ytitlet"Thousands of tons") ylabel(0 (100)800, angle (horizontal) ) xtitleC") legend(label (1 "all nations") label (2 "Canada") label (3 "TAC") position (2) ring (0) rows(3) ) graph twoway c. y all nations Canada TAC Figure 3.16 1960 1970 1980 1990 2000 Graphs 83 Connected-Line Plots In the line plots of the previous section, data points are invisible and we see only the connecting lines. The graph twoway connected command creates connected-line plots in which the data points are marked by scatterplot symbols. The marker-symbol options described earlier for graph twoway scatter, and also the line-connecting options described for graph twoway line, both apply to graph twoway connected as well. Figure 3.17 shows a default example, a connected-line time plot of the cod biomass variable (bio) from cod.dta. graph twoway connected bio year Figure 3.17 o (/) o w o in E o TO O E ^ 1960 1970 1980 Year 2000 The dataset contains only biomass values for 1978 through 1997, resulting in much empty space in Figure 3.17 . if qualifiers allow us to restrict the range of years. Figure 3.18, on the following page, does this. It also dresses up the image to show control of marker symbols, line patterns, axes, and legends. 
With cod landings and biomass both in the same image, we see that the biomass began its crash in the late 1980s, several years before a crisis was officially recognized. 84 Statistics with Stata T graph twoway connected bio cod year if year > 1977 t year < 1999, -symbol(T Oh) clpattern(dash solid) xlabel(1978(2)1996) xtick(1979 (2) 1997) ytitle("Thousands of tons") xtitlel ) ylabel (0 (500) 2500, angle(horizontal)) legenddabeKl "Estimated biomass") label(2 "Total landings ) position (2) rows (2) ring (0)) Figure 3.18 2500 2000 c o £ 1500 CO ■o c CD o 1000J 500 J t .a.-- Estimated biomass ___Total landings A / X 1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 Other Twoway Plot Types In addition to basic line plots and scatterplots, the graph twoway command encompasses a wide variety of other types. The following table lists the possibilities. graph twowav Description scatter scatterplot line line plot connected connected-line plot scatteri scatter with immediate arguments (data given in the command line) area line plot with shading bar twoway bar plot (different from graph bar ) spike twoway spike plot dropline dropline plot (spikes dropped vertically or horizontally to given value) dot twoway dot plot (different from graph dot) rarea range plot, shading the area between high and low values Graphs 85 rbar range plot with bars between high and low values rspike range plot with spikes between high and low values reap range plot with capped spikes rcapsym range plot with spikes capped with symbols rscatter range plot with scatterplot marker symbols rline range plot with lines rconnected range plot with lines and markers pespike paired-coordinate plot with spikes pecapsym paired-coordinate plot with spikes capped with symbols pcarrow paired-coordinate plot with arrows pebarrow paired-coordinate plot with arrows having two heads pescatter paired-coordinate plot with markers pci pespike with immediate arguments pcarrowi pcarrow with immediate arguments tsline time-series plot tsrline time-series range plot mband straight line segments connect the (x, y) cross-medians within bands mspline cubic spline curve connects the (x, y) cross-medians within bands lowess LOWESS (locally weighted scatterplot smoothing) curve lfit linear regression line qfit quadratic regression curve fpfit fractional polynomial plot lfitci linear regression line with confidence band qfitci quadratic regression curve with confidence band fpfitci fractional polynomial plot with confidence band function line plot of function histogram histogram plot kdensity kernel density plot The usual options to control line patterns, marker symbols, and so forth work where appropriate with all twoway commands. For more information about a particular command, type help twoway_mband, help twoway_f unction, etc. (using any of the names above). Note that graph twoway bar is a different command from graph bar . Similarly, graph twoway dot differs from graph dot . The twoway versions 86 Statistics with Stata Graphs 87 provide various methods for plotting a measurement y variable against a measurement x variable, analogous to a scatterplot or a line plot. The non-twoway versions, on the other hand, provide ways to plot summary statistics (such as means or medians) of one or more measurement v variables against categories of one or more .v variables. The twoway versions thus are comparatively specialized, although (as with all twoway plots) they can be overlaid with other twoway plots for more complex graphical effects. 
Many of these plot types are most useful in composite figures, constructed by overlaying two or more simple plots as described later in this chapter. Others produce nice stand-alone graphs. For example, Figure 3.19 shows an area plot of the Newfoundland cod landings.

   . graph twoway area cod canada year, ytitle("")

Figure 3.19 (legend: Total landings, 1000t; Canadian landings, 1000t)

The shading in area graphs and other types with shaded regions can be controlled through the option bcolor. Type help colorstyle for a list of the available colors, which include grayscales. The darkest gray, gs0, is actually black. The lightest gray, gs16, is white. Other values are in between. For example, Figure 3.20 shows a light-gray version of this graph.

   . graph twoway area cod canada year, ytitle("") bcolor(gs12 gs14)

Figure 3.20

Unusually cold atmosphere/ocean conditions played a secondary role in Newfoundland's fisheries disaster, which involved not only the Northern Cod but also other species and populations. For example, key fish species in the neighboring Gulf of St. Lawrence declined during this period as well (Hamilton, Haedrich and Duncan 2003). Dataset gulf.dta describes environment and Northern Gulf cod catches (raw data from DFO 2003).

Contains data from C:\data\gulf.dta
  obs:            56                    Gulf of St. Lawrence environment
                                          and cod fishery
 vars:             7                    10 Jul 2005 11:51
 size:         1,344  (99.9% of memory free)

              storage  display    value
variable name   type   format     label    variable label
----------------------------------------------------------------------
winter          int    %8.0g               Winter
minarea         float  %9.0g               Minimum ice area, 1000 km^2
maxarea         float  %9.0g               Maximum ice area, 1000 km^2
mindays         byte   %8.0g               Minimum ice days
maxdays         byte   %8.0g               Maximum ice days
cil             float  %9.0g               Cold Intermediate Layer
                                             temperature minimum, C
cod             float  %9.0g               N. Gulf cod catch, 1000 tons
----------------------------------------------------------------------
Sorted by:  winter

The maximum annual ice cover averaged 173,017 km^2 during these years.

   . summarize maxarea

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+--------------------------------------------------------
     maxarea |      38    173.0172    37.18623     47.8901   220.1905

Figure 3.21 uses this mean (173 thousand) as the base for a spike plot, in which spikes above and below the line show above- and below-average ice cover, respectively. The yline(173) option draws a horizontal line at 173.

   . graph twoway spike maxarea winter if winter > 1963, base(173)
        yline(173) ylabel(40(20)220, angle(horizontal))
        xlabel(1965(5)2000)

Figure 3.21

The base() format of Figure 3.21 emphasizes the succession of unusually harsh winters (above-average maximum ice cover) during the late 1980s and early 1990s, around the time of Newfoundland's fisheries crisis. We also see an earlier spell of mild winters in the early 1980s, and hints of a recent warming trend. A different view of the same data, in Figure 3.22, employs lowess regression to smooth the time series. The bandwidth option, bwidth(.4), specifies a curve based on smoothed data points that are calculated from weighted regressions within a moving band containing 40% of the sample. Lower bandwidths such as bwidth(.2), or 20% of the data, would give us a more jagged, less smoothed curve that more closely resembles the raw data. Higher bandwidths such as bwidth(.8), the default, will smooth more radically.
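To see the effect of bandwidth choice directly, we could overlay two lowess curves of the same series; a minimal sketch:

   . graph twoway lowess maxarea winter, bwidth(.2)
        || lowess maxarea winter, bwidth(.8)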
Regardless of the bandwidth chosen, smoothed points toward either extreme of the x values must be calculated from increasingly narrow bands, and therefore will show less smoothing. Chapter 8 contains more about lowess smoothing.

   . graph twoway lowess maxarea winter if winter > 1963, bwidth(.4)
        yline(173) ylabel(40(20)220, angle(horizontal))
        xlabel(1965(5)2000)

Figure 3.22

Range plots connect high and low y values at each level of x, using bars, spikes, or shaded areas. Daily stock market prices are often graphed in this way. Figure 3.23 shows a capped-spike range plot using the minimum and maximum ice cover variables from gulf.dta.

   . graph twoway rcap minarea maxarea winter if winter > 1963,
        ylabel(0(20)220, angle(horizontal)) ytitle("Ice area, 1000 km^2")
        xlabel(1965(5)2000)

Figure 3.23

   . graph bar … , legend(rows(4) order(4 3 2 1) position(11) ring(0)
        label(1 "Non-native") label(2 "Aleut") label(3 "Indian")
        label(4 "Eskimo")) stack ytitle(Population)
        ylabel(0(100000)300000) ytick(50000(100000)350000)

Figure 3.32 (stacked bars over community types: villages <1,000; towns 1,000-10,000; cities >10,000)

Figure 3.32 plots the same variables as the pie chart in Figure 3.27, but displays them quite differently. Whereas the pie charts show relative sizes (percentages) of ethnic groups within each community type, this bar chart shows their absolute sizes. Consequently, Figure 3.32 tells us something that Figure 3.27 could not: the majority of Alaska's Eskimo (Yupik and Inupiat) population lives in villages.

Dot Plots

Dot plots serve much the same purpose as bar charts: visually comparing statistical summaries of one or more measurement variables. The organization and Stata options for the two types of plot are broadly similar, including the choices of statistical summaries. To see a dot plot comparing the medians of variables x, y, z, and w, type

   . graph dot (median) x y z w

For a dot plot comparing the mean of y across categories of x, type

   . graph dot (mean) y, over(x)

Figure 3.33 shows a dot plot of male and female smoking rates by region, from statehealth.dta. The over option includes a suboption, sort(smokeM), which calls for the regions to be sorted in order of their mean values of smokeM; that is, from lowest to highest smoking rates. We also specify a solid triangle as the marker symbol for smokeM, and a hollow circle for smokeF.

   . graph dot (mean) smokeM smokeF, over(region, sort(smokeM))
        marker(1, msymbol(T)) marker(2, msymbol(Oh))

Figure 3.33

Although Figure 3.33 displays only eight means, it does so in a way that facilitates several comparisons. We see that smoking rates are generally higher for males; that among both sexes they are higher in the South and Midwest; and that regional variations are substantially greater for the male smoking rates. Bar charts could convey the same information, but one advantage of dot plots is their compactness. Dot plots (particularly when rows are sorted by the statistic of interest, as in Figure 3.33) remain easily readable even with a dozen or more rows.

Symmetry and Quantile Plots

Box plots, bar charts, and dot plots summarize measurement variable distributions, hiding individual data points to clarify overall patterns. Symmetry and quantile plots, on the other hand, include points for every observation in a distribution.
They are harder to read than summary graphs, but convey more detailed information.

A histogram of per-capita energy consumption in the 50 U.S. states (from states.dta) appears in Figure 3.34. The distribution includes a handful of very high-consumption states, which happen to be oil producers. A superimposed normal (Gaussian) curve indicates that energy has a lighter-than-normal left tail, and a heavier-than-normal right tail: the definition of positive skew.

   . histogram energy, start(100) width(100) xlabel(0(100)1000)
        frequency norm

Figure 3.34

Figure 3.35 depicts this distribution as a symmetry plot. It plots the distance of the i-th observation above the median (vertical) against the distance of the i-th observation below the median. All points would lie on the diagonal line if this distribution were symmetrical. Instead, we see that distances above the median grow steadily larger than corresponding distances below the median, a symptom of positive skew. Unlike Figure 3.34, Figure 3.35 also reveals that the energy-consumption distribution is approximately symmetrical near its center.

   . symplot energy

Figure 3.35

Quantile plots automatically calculate what fraction of the observations lie below each data value, and display the results graphically as in Figure 3.36. Quantile plots provide a graphic reference for someone who does not have the original data at hand. From well-labeled quantile plots, we can estimate order statistics such as the median (.5 quantile) or quartiles (.25 and .75 quantiles). The IQR equals the rise between the .25 and .75 quantiles. We could also read a quantile plot to estimate the fraction of observations falling below a given value.

   . quantile energy

Figure 3.36

Quantiles are values below which a certain fraction of the data lie. For example, a .3 quantile is that value higher than 30% of the data. If we sort n observations in ascending order, the i-th value forms the (i - .5)/n quantile. The following commands would calculate quantiles of the variable energy:

   . drop if energy >= .
   . sort energy
   . generate quant = (_n - .5)/_N

As mentioned in Chapter 2, _n and _N are Stata system variables, always unobtrusively present when there are data in memory. _n represents the current observation number, and _N the total number of observations.

Quantile-normal plots, also called normal probability plots, compare quantiles of a variable's distribution with quantiles of a theoretical normal distribution having the same mean and standard deviation. They allow visual inspection for departures from normality in every part of a distribution, which can help guide decisions regarding normality assumptions and efforts to find a normalizing transformation. Figure 3.37, a quantile-normal plot of energy, confirms the severe positive skew that we had already observed. The grid option calls for a set of lines marking the .05, .10, .25 (first quartile), .50 (median), .75 (third quartile), .90, and .95 quantiles of both distributions. The .05, .50, and .95 quantile values are printed along the top and right-hand axes.

   . qnorm energy, grid

Figure 3.37 (grid lines mark the 5, 10, 25, 50, 75, 90, and 95 percentiles)
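Plotting the hand-computed quant variable against energy mimics what the quantile command draws automatically; a minimal sketch:

   . graph twoway scatter energy quant, xlabel(0(.25)1)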
Quantile-quantile plots resemble quantile-normal plots, but they compare quantiles (ordered data points) of two empirical distributions instead of comparing one empirical distribution with a theoretical normal distribution. On the following page, Figure 3.38 shows a quantile-quantile plot of the mean math SAT score versus the mean verbal SAT score in the 50 states and the District of Columbia. If the two distributions were identical, we would see points along the diagonal line. Instead, data points form a straight line roughly parallel to the diagonal, indicating that the two variables have different means but similar shapes and standard deviations.

   . qqplot msat vsat

Figure 3.38 (Quantile-Quantile Plot)

Regression with Graphics (Hamilton 1992a) includes an introduction to reading quantile-based plots. Chambers et al. (1983) provide more details. Related Stata commands include pnorm (standard normal probability plot), pchi (chi-squared probability plot), and qchi (quantile-chi-squared plot).

Quality Control Graphs

Quality control charts help to monitor output from a repetitive process such as industrial production. Stata offers four basic types: c chart, p chart, R chart, and x (mean) chart. A fifth type, called Shewhart after the inventor of these methods, consists of vertically aligned x and R charts. Iman (1994) provides a brief introduction to R and x charts, including the tables used in calculating their control limits. The Base Reference Manual gives the command details and formulas used by Stata. Basic outlines of these commands are as follows:

   . cchart defects unit
        Constructs a c chart with the number of nonconformities or
        defects (defects) graphed against the unit number (unit). Upper
        and lower control limits, based on the assumption that the number
        of nonconformities per unit follows a Poisson distribution,
        appear as horizontal lines in the chart. Observations with values
        outside these limits are said to be "out of control."

   . pchart rejects unit ssize
        Constructs a p chart with the proportion of items rejected
        (rejects/ssize) graphed against the unit number (unit). Upper and
        lower control limit lines derive from a normal approximation,
        taking sample size (ssize) into account. If ssize varies across
        units, the control limits will vary too, unless we add the option
        stabilize.

   . rchart x1 x2 x3 x4 x5, connect(l)
        Constructs an R (range) chart using the replicated measurements
        in variables x1 through x5; that is, in this example, five
        replications per sample. Graphs the range within each sample
        against the sample number, and (optionally) connects successive
        ranges with line segments. Horizontal lines indicate the mean
        range and control limits. Control limits are estimated from the
        sample size if the process standard deviation is unknown. When
        the standard deviation is known, we can include this information
        in the command. For example, assuming sigma = 10,

   . rchart x1 x2 x3 x4 x5, connect(l) std(10)

   . xchart x1 x2 x3 x4 x5, connect(l)
        Constructs an x (mean) chart using the replicated measurements in
        variables x1 through x5. Graphs the mean within each sample
        against the sample number and connects successive means with line
        segments. The mean range is estimated from the mean of sample
        means and control limits from sample size, unless we override
        these defaults.
        For example, if we know that the process actually has mu = 50 and
        sigma = 10,

   . xchart x1 x2 x3 x4 x5, connect(l) mean(50) std(10)

        Alternatively, we could specify particular upper and lower
        control limits:

   . xchart x1 x2 x3 x4 x5, connect(l) mean(50) lower(40) upper(60)

   . shewhart x1 x2 x3 x4 x5, mean(50) std(10)
        In one figure, vertically aligns an x chart with an R chart.

To illustrate a p chart, we turn to the quality inspection data in quality1.dta.

Contains data from C:\data\quality1.dta
  obs:            16                    Quality control example 1
 vars:             3                    4 Jul 2005 12:07
 size:           112  (99.9% of memory free)

              storage  display    value
variable name   type   format     label    variable label
----------------------------------------------------------------------
day             byte   %9.0g               Day sampled
ssize           byte   %9.0g               Number of units sampled
rejects         byte   %9.0g               Number of units rejected
----------------------------------------------------------------------
Sorted by:

   . list in 1/5

        day   ssize   rejects
  1.     58      53        10
  2.      7      53        18
  3.     26      52        12
  4.     21      52        10
  5.      6      51        10

Note that sample size varies from unit to unit, and that the units (days) are not in order. pchart handles these complications automatically, creating the graph with changing control limits seen in Figure 3.39. (For constant control limits despite changing sample sizes, add the stabilize option.)

   . pchart rejects day ssize

Figure 3.39 (2 units are out of control)

Dataset quality2.dta, borrowed from Iman (1994:662), serves to illustrate rchart and xchart. Variables x1 through x4 represent repeated measurements from an industrial production process; 25 units with four replications each form the dataset.

Contains data from C:\data\quality2.dta
  obs:            25                    Quality control (Iman 1994:662)
 vars:             4                    4 Jul 2005 12:07
 size:           500  (99.9% of memory free)

              storage  display    value
variable name   type   format     label    variable label
----------------------------------------------------------------------
x1              float  %9.0g
x2              float  %9.0g
x3              float  %9.0g
x4              float  %9.0g
----------------------------------------------------------------------
Sorted by:

   . list in 1/5

         x1    x2    x3    x4
  1.    4.6     2     4   3.6
  2.    6.7   3.8   5.1   4.7
  3.    4.6   4.3   4.5   3.9
  4.    4.9     6   4.8   5.7
  5.    7.6   6.9   2.5   4.7

Figure 3.40, an R chart, graphs variation in the process range over the 25 units. rchart informs us that one unit's range is "out of control."

   . rchart x1 x2 x3 x4, connect(l)

Figure 3.40 (1 unit is out of control)

Figure 3.41, an x chart, shows variation in the process mean. None of these 25 means falls outside the control limits.

   . xchart x1 x2 x3 x4, connect(l)

Figure 3.41 (0 units are out of control)

Adding Text to Graphs

Titles, captions, and notes can be added to make graphs more self-explanatory. The default versions of titles and subtitles appear above the plot space; notes (which might document the data source, for instance) and captions appear below. These defaults can be overridden, of course. Type help title_options for more information about placement of titles, or help textbox_options for details concerning their content.

Figure 3.42 demonstrates the default versions of these four options in a scatterplot of the prevalence of smoking and college graduates among U.S. states, using statehealth.dta. Figure 3.42 also includes titles for both the left and right y axes, yaxis(1 2), and top and bottom x axes, xaxis(1 2). Subsequent ytitle and xtitle options refer to the second axes specifically, by including the axis(2) suboption. y axis 2 is not necessarily on the right, and x axis 2 is not necessarily on top, as we will see later; but these are their default positions.
   . graph twoway scatter smokeT college, yaxis(1 2) xaxis(1 2)
        title("This is the TITLE") subtitle("This is the SUBTITLE")
        caption("This is the CAPTION") note("This is the NOTE")
        ytitle("Percent adults smoking")
        ytitle("This is Y AXIS 2", axis(2))
        xtitle("Percent adults with Bachelor's degrees or higher")
        xtitle("This is X AXIS 2", axis(2))

Figure 3.42

Titles add text boxes outside of the plot space. We can also add text boxes at specified coordinates within the plot space. Several outliers stand out in this scatterplot. Upon investigation, they turn out to be Washington DC (highest college value, at far right), Utah (lowest smokeT value, at bottom center), and Nevada (highest smokeT value, at upper left). Text boxes provide a way for us to identify these observations within our graph, as demonstrated in Figure 3.43. The option text(15.5 22.5 "Utah") places the word "Utah" at position y = 15.5, x = 22.5 in the scatterplot, directly above Utah's data point. Similarly, we place the word "Nevada" at y = 33.5, x = 15, and draw a box (with small margins; see help marginstyle) around that state's name. Three lines of left-justified text are placed next to Washington DC (each line specified in its own set of quotation marks). Any text box or title can have multiple lines in this fashion; we specify each line individually in its own set of quotations, then specify justification or other suboptions. The "Nevada" box uses a default shaded background, whereas for the "Washington DC" box we chose a white background color (see help textbox_options and help colorstyle).

   . graph twoway scatter smokeT college, yaxis(1 2) xaxis(1 2)
        title("This is the TITLE") subtitle("This is the SUBTITLE")
        caption("This is the CAPTION") note("This is the NOTE")
        ytitle("Percent adults smoking")
        ytitle("This is Y AXIS 2", axis(2))
        xtitle("Percent adults with Bachelor's degrees or higher")
        xtitle("This is X AXIS 2", axis(2))
        text(15.5 22.5 "Utah")
        text(33.5 15 "Nevada", box margin(small))
        text(23.5 32 "Washington DC" "is not actually" "a state",
             box justification(left) margin(small) bfcolor(white))

Figure 3.43

   . graph twoway …  if winter > 1949, xtitle("") xlabel(1950(10)2000)
        xtick(1950(2)2002) legend(position(11) ring(0) rows(2)
        order(2 1) label(1 "Max ice area") label(2 "Min CIL temp"))
        note("Source: Hamilton, Haedrich and Duncan (2003); data from DFO (2003)")
        yscale(range(-1,3) axis(1)) …
        text(… "fisheries decline" "and collapse" …, axis(2) …)

Figure 3.47

The text box on the right in Figure 3.47 marks the late-1980s and early-1990s period when key fisheries including the Northern Gulf cod declined or collapsed. As the graph shows, the fisheries declines coincided with the most sustained cold and ice conditions on record.

To place cod catches in the same graph with temperature and ice, we need three independent vertical scales. Figure 3.48 involves three overlaid plots, with all y axes at left (default). The basic forms of the three component plots are as follows:
connected maxarea winter
     A connected-line plot of maxarea vs. winter, using y axis 3 (which
     will be leftmost in our final graph). The y axis scale ranges from
     -300 to +220, with no grid of horizontal lines. Its title is "Ice
     area, 1000 km^2". This title is placed in the "northwest" position,
     placement(nw).

line cil winter
     A line plot of cil vs. winter, using y axis 2. The y scale ranges
     from -4 to +3, with default labels.

connected cod winter
     A connected-line plot of cod vs. winter, using y axis 1. The title
     placement is "southwest", placement(sw).

Bringing these three component plots together, the full command for Figure 3.48 appears below. The y ranges for each of the overlaid plots were chosen by experimenting to find the "right" amount of vertical separation among the three series. Options applied to the whole graph restrict the analysis to years since 1959, specify legend and x axis labeling, and request vertical grid lines.

   . graph twoway connected maxarea winter, yaxis(3)
        yscale(range(-300,220) axis(3)) ylabel(50(50)200, nogrid axis(3))
        ytitle("Ice area, 1000 km^2", axis(3) placement(nw))
        clpattern(dash)
        || line cil winter, yaxis(2) yscale(range(-4,3) axis(2))
        ylabel(, nogrid axis(2))
        ytitle("CIL temperature, degrees C", axis(2)) clpattern(solid)
        || connected cod winter, yaxis(1) yscale(range(0,200) axis(1))
        ylabel(, nogrid axis(1))
        ytitle("Cod catch, 1000 tons", axis(1) placement(sw))
        || if winter > 1959,
        legend(ring(0) position(7) label(1 "Max ice area")
        label(2 "Min CIL temp") label(3 "Cod catch") rows(3))
        xtitle("") xlabel(1960(5)2000, grid)

Figure 3.48

Graphing with Do-Files

Complicated graphics like Figure 3.48 require graph commands that are many physical lines long (although Stata views the whole command as one logical line). Do-files, introduced in Chapter 2, help in writing such multi-line commands. They also make it easy to save the command for future re-use, in case we later want to modify the graph or draw it again. The following commands, typed into Stata's Do-file Editor and saved with the file name fig03_48.do, become a new do-file for drawing Figure 3.48. Typing

   . do fig03_48

then causes the do-file to execute, redrawing the graph and saving it in two formats.

   #delimit ;
   use c:\data\gulf.dta, clear ;
   graph twoway connected maxarea winter, yaxis(3)
        yscale(range(-300,220) axis(3))
        ylabel(50(50)200, nogrid axis(3))
        ytitle("Ice area, 1000 km^2", axis(3) placement(nw))
        clpattern(dash)
        || line cil winter, yaxis(2) yscale(range(-4,3) axis(2))
        ylabel(, nogrid axis(2))
        ytitle("CIL temperature, degrees C", axis(2))
        clpattern(solid)
        || connected cod winter, yaxis(1)
        yscale(range(0,200) axis(1)) ylabel(, nogrid axis(1))
        ytitle("Cod catch, 1000 tons", axis(1) placement(sw))
        || if winter > 1959,
        legend(ring(0) position(7) label(1 "Max ice area")
        label(2 "Min CIL temp") label(3 "Cod catch") rows(3))
        xtitle("") xlabel(1960(5)2000, grid)
        saving(c:\data\fig03_48.gph, replace) ;
   graph export c:\data\fig03_48.eps, replace ;
   #delimit cr

The first line of this do-file sets the semicolon (;) as the end-of-line delimiter. Thereafter, Stata does not consider a line finished until it encounters a semicolon. The second line simply retrieves the dataset (gulf.dta) needed to draw Figure 3.48; note the semicolon that finishes this line.
The long graph twoway command occupies the next 15 lines, but Stata treats this all as one logical line that ends with the semicolon after the saving() option. This option saves the graph in Stata's .gph format. Next, the graph export command creates a second version of the same graph in Encapsulated Postscript format, as indicated by the .eps suffix in the filename fig03_48.eps. (Type help graph_export to learn more about this command, which is particularly useful for writing programs or do-files that will create graphs repeatedly.) The do-file's final #delimit cr command re-sets a carriage return as the end-of-line delimiter, going back to Stata's usual mode. Although it is not visible on paper, the line #delimit cr must itself end with a carriage return (hit the Enter key), creating one last blank line at the end of the do-file.

Retrieving and Combining Graphs

Any graph saved in Stata's "live" .gph format can subsequently be retrieved into memory by the graph use command. For example, we could retrieve Figure 3.48 by typing

   . graph use fig03_48

Once the graph is in memory, it is displayed onscreen and can be printed or saved again with a different name or format. From a graph saved earlier in .gph format, we could subsequently save versions in other formats such as Postscript (.ps), Portable Network Graphics (.png), or Enhanced Windows Metafile (.emf). We also could change the color scheme, either through menus or directly in the graph use command. fig03_48.gph was saved in the s2 monochrome scheme, but we could see how it looks in the s1 color scheme by typing

   . graph use fig03_48, scheme(s1color)

Graphs saved on disk can also be combined by the graph combine command. This provides a way to bring multiple plots into the same image. For illustration, we return to the Gulf of St. Lawrence data shown earlier in Figure 3.48. The following commands draw three simple time plots (not shown), saving them with the names fig03_49a.gph, fig03_49b.gph, and fig03_49c.gph. The margin(medium) suboptions specify the margin width for title boxes within each plot.

   . graph twoway line maxarea winter if winter > 1964, xtitle("")
        xlabel(1965(5)2000, grid) ylabel(50(50)200, nogrid)
        title("Maximum winter ice area", position(4) ring(0) box
        margin(medium)) ytitle("1000 km^2") saving(fig03_49a)

   . graph twoway line cil winter if winter > 1964, xtitle("")
        xlabel(1965(5)2000, grid) ylabel(-1(.5)1.5, nogrid)
        title("Minimum CIL temperature", position(1) ring(0) box
        margin(medium)) ytitle("Degrees C") saving(fig03_49b)

   . graph twoway line cod winter if winter > 1964, xtitle("")
        xlabel(1965(5)2000, grid) ylabel(0(20)100, nogrid)
        title("Northern Gulf cod catch", position(1) ring(0) box
        margin(medium)) ytitle("1000 tons") saving(fig03_49c)

To combine these plots, we type the following command. Because the three plots have identical x scales, it makes sense to align the graphs vertically, in three rows. The imargin option specifies "very small" margins around the individual plots of Figure 3.49.

   . graph combine fig03_49a.gph fig03_49b.gph fig03_49c.gph,
        imargin(vsmall) rows(3)

Figure 3.49

Type help graph_combine for more information on this command. Options control details including the number of rows or columns, the size of text and markers (which otherwise become smaller as the number of plots increases), and the margins between individual plots. They can also specify whether the x or y axes of twoway plots have common scales, or assign all components a common color scheme. Titles can be added to the combined graph, which can be printed, saved, retrieved, or for that matter combined again in the usual ways.
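For example, xcommon forces identical x scales across the component plots, and iscale rescales their text and markers; a minimal sketch recombining the panels above:

   . graph combine fig03_49a.gph fig03_49b.gph fig03_49c.gph,
        rows(3) imargin(vsmall) xcommon iscale(1.1)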
Our final example illustrates several of these graph combine options, and a way to build graphs with unequal-sized components. Suppose we want a scatterplot similar to the smoking vs. college grads plot seen earlier in Figure 3.42, but with box plots of the y and x variables drawn beside their respective axes. Using statehealth.dta, we might first try to do this by drawing a vertical box plot of smokeT, a scatterplot of smokeT vs. college, and a horizontal box plot of college, and then combining the three plots into one image (not shown) with the following commands.

   . graph box smokeT, saving(wrong1)
   . graph twoway scatter smokeT college, saving(wrong2)
   . graph hbox college, saving(wrong3)
   . graph combine wrong1.gph wrong2.gph wrong3.gph

The combined graph produced by the commands above would look wrong, however. We would end up with two fat box plots, each the size of the whole scatterplot, and none of the axes aligned. For a more satisfactory version, we need to start by creating a thin vertical box plot of smokeT. The fxsize(20) option in the following command fixes the plot's x (horizontal) size at 20% of normal, resulting in a plot with normal height but only 20% width. Two empty caption lines are included for spacing reasons that will be apparent in the final graph.

   . graph box smokeT, fxsize(20) caption("" "") ytitle("")
        ylabel(none) ytick(15(5)35, grid) saving(fig03_50a)

For the second component, we create a straightforward scatterplot of smokeT vs. college.

   . graph twoway scatter smokeT college,
        ytitle("Percent adults smoking")
        xtitle("Percent adults with Bachelor's degrees or higher")
        ylabel(, grid) xlabel(, grid) saving(fig03_50b)

The third component is a thin horizontal box plot of college. This plot should have normal width, but a y (vertical) size fixed at 20% of normal. For spacing reasons, two empty left titles are included.

   . graph hbox college, fysize(20) l1title("") l2title("")
        ylabel(none) ytick(10(5)35, grid) ytitle("") saving(fig03_50c)

These three components come together in Figure 3.50. The graph combine command's cols(2) option arranges the plots in two columns, like a 2-by-2 table with one empty cell. The holes(3) option specifies that the empty cell should be the third one, so our three component graphs fill positions 1, 2, and 4. iscale(1.05) enlarges marker symbols and text by about 5%, for readability. The empty captions or titles we built into the original box plots compensate for the two lines of text (title and label) on each axis of the scatterplot, so the box plots align (although not quite perfectly) with the scatterplot axes.

   . graph combine fig03_50a.gph fig03_50b.gph fig03_50c.gph,
        cols(2) holes(3) iscale(1.05)

Figure 3.50

Summary Statistics and Tables

The summarize command finds simple descriptive statistics such as medians, means, and standard deviations of measurement variables. More flexible arrangements of summary statistics are available through the command tabstat.
For categorical or ordinal variables, tabulate obtains frequency distribution tables, cross-tabulations, assorted tests, and measures of association. tabulate can also construct one- or two-way tables of means and standard deviations across categories of other variables. A general table-making command, table, produces as many as seven-way tables in which the cells contain statistics such as frequencies, sums, means, or medians. Finally, we review further one-variable procedures including normality tests, transformations, and displays for exploratory data analysis (EDA).

Most of the analyses covered in this chapter can be accomplished either through the commands shown or through menu selections under Statistics - Summaries, tables & tests. In addition to such general-purpose analyses, Stata provides many tables of particular interest to epidemiologists. These are not described in this chapter, but can be viewed by typing help epitab. Selvin (1996) introduces the topic.

Example Commands

   . summarize y1 y2 y3
        Calculates simple summary statistics (means, standard deviations,
        minimum and maximum values, and numbers of observations) for the
        variables listed.

   . summarize y1 y2 y3, detail
        Obtains detailed summary statistics including percentiles,
        median, mean, standard deviation, variance, skewness, and
        kurtosis.

   . summarize y1 if x1 > 3 & x2 < .
        Finds summary statistics for y1 using only those observations for
        which variable x1 is greater than 3, and x2 is not missing.

   . summarize y1 [fweight = w], detail
        Calculates detailed summary statistics for y1 using the frequency
        weights in variable w.

   . tabstat y1, stats(mean sd skewness kurtosis n)
        Calculates only the specified summary statistics for variable y1.

   . tabstat y1, stats(min p5 p25 p50 p75 p95 max) by(x1)
        Calculates the specified summary statistics (minimum, 5th
        percentile, 25th percentile, etc.) for measurement variable y1,
        within categories of x1.

   . tabulate x1
        Displays a frequency distribution table for all nonmissing values
        of variable x1.

   . tabulate x1, sort miss
        Displays a frequency distribution of x1, including the missing
        values. Rows (values) are sorted from most to least frequent.

   . tab1 x1 x2 x3 x4
        Displays a series of frequency distribution tables, one for each
        of the variables listed.

   . tabulate x1 x2
        Displays a two-variable cross-tabulation with x1 as the row
        variable, and x2 as the columns.

   . tabulate x1 x2, chi2 nof column
        Produces a cross-tabulation and Pearson chi-squared test of
        independence. Does not show cell frequencies, but instead gives
        the column percentages in each cell.

   . tabulate x1 x2, missing row all
        Produces a cross-tabulation that includes missing values in the
        table and in the calculation of percentages. Calculates "all"
        available statistics (Pearson and likelihood-ratio chi-squared,
        Cramer's V, Goodman and Kruskal's gamma, and Kendall's tau-b).

   . tab2 x1 x2 x3 x4
        Performs all possible two-way cross-tabulations of the listed
        variables.

   . tabulate x1, summ(y)
        Produces a one-way table showing the mean, standard deviation,
        and frequency of y values within each category of x1.

   . tabulate x1 x2, summ(y) means
        Produces a two-way table showing the mean of y at each
        combination of x1 and x2 values.

   . by x3, sort: tabulate x1 x2, exact
        Creates a three-way cross-tabulation, with subtables for x1 (row)
        by x2 (column) at each value of x3. Calculates Fisher's exact
        test for each subtable. by varname, sort: works as a prefix for
        almost any Stata command where it makes sense. The sort option is
        unnecessary if the data already are sorted on varname.

   . table y x2 x3, by(x4 x5) contents(freq)
        Creates a five-way cross-tabulation of y (row) by x2 (column) by
        x3 (supercolumn), by x4 (superrow 1) by x5 (superrow 2). Cells
        contain frequencies.

   . table x1 x2, contents(mean y1 median y2)
        Creates a two-way table of x1 (row) by x2 (column). Cells contain
        the mean of y1 and the median of y2.

Summary Statistics for Measurement Variables

Dataset VTtown.dta contains information from residents of a town in Vermont. A survey was conducted soon after routine state testing had detected trace amounts of toxic chemicals in the town's water supply. Higher concentrations were found in several private wells and near the public schools. Worried citizens held meetings to discuss possible solutions to this problem.
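The listing below comes from describe. As a minimal sketch, the commands involved would be (the path matches the output that follows):

   . use c:\data\VTtown.dta, clear
   . describe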
table y x2 x3, by(x4 x5) contents(freq) Creates a five-way cross-tabulation, ofy (row) by.v2 (column) byx? (supercolumn), byx4 (superrow l) by x5 (superrow 2). Cells contain frequencies, table xl x2, contents(mean yl median y2) Creates a two-way table ofxl (row) by x2 (column). Cells contain the mean of yl and the median of y2. Summary Statistics for Measurement Variables_ Dataset VTtown.dta contains information from residents of a town in Vermont. A survey was conducted soon after routine state testing had detected trace amounts of toxic chemicals in the town's water supply. Higher concentrations were found in several private wells and near the public schools. Worried citizens held meetings to discuss possible solutions to this problem. 122 Statistics with Stata Co ntains data f r om C: \data \7Tto w r.. d t a obs : 1 5 3 v a r s : / 1, 6 S 3 (99 . 9 \ o ŕ merne ty tree) storage di s play v a 1 u e va r i able name t vpe fo rma t label gender ryte ,8 . Og sex ibl 1 i ved byte ''-,8. Og kids b y t e ;e. Og kidlbl edec byte l" 8 . Og meet i ngs byte -8 . Og kidlbl c o n t a r\ byte c 8 Og con t amib school byte t 8 Og c lose Sorted by: VT town survey (Hamilton 1985) 11 Jul 2005 18:05 ariable label Respondent's gender Years lived in town Have children <19 in town? Highest year school completed Attended meetings on pollution Be lieve own property/water contaminated School closing opinion To find the mean and standard deviation of the variable lived (years the respondent had lived in town), type summarize lived Min Max r i able 11 ved Obs 153 Me an 2 67 97 Std. Dev 16.95466 81 This table also gives the number of nonmissing observations and the variable's minimum and maximumvalues. Ifwe had simply typed summarize with no variable list, we would obtain means and standard deviations for every numerical variable in the dataset. To see more detailed summary statistics, type summarize lived, detail Yea r s 1i ved in town 1% C. ~ ■J c 1 0 1 2 5": 5 0 r 75> 9 0 \ J „I '-. q Q-* Percentile. 1 ,ma liest 1 1 15 29 42 6 8 1 Obs 153 1 Sum of Wgt . 153 Mean 19.26797 est Std. Dev. 16.95-3 66 65 6 5 Variance 287 .4 60 6 6 8 SkewnesE 1 . 2088 0 4 81 K u r t o s i s 4.025642 This summarize, detail output includes basic statistics plus the following: Percentiles: Notably the first quartile (25th percentile), median (50th percentile), and third quartile (75th percentile). Because many samples do not divide evenly into quarters or other standard fractions, these percentiles are approximations. Four smallest and four largest values, where outliers might show up. Summary Statistics and Tables 123 Sum of weights: Stata understands four types of weights: analytical weights ( aweight), frequency weights ( fweight ), importance weights ( iweight ), and sampling weights (pweight). Different procedures allow, and make sense with, different kinds of weights, summarize, detail , for example, permits aweight or fweight. For explanations see help weights. Variance: Standard deviation squared (more properly, standard deviation equals the square root of variance). Skewness: The direction and degree of asymmetry. A perfectly symmetrical distribution has skewness = 0. Positive skew (heavier right tail) results in skewness > 0; negative skew (heavier left tail) results in skewness < 0. Kurtosis: Tail weight. A normal (Gaussian) distribution is symmetrical and has kurtosis = 3. If a symmetrical distribution has heavier-than-normal tails (that is, is sharply peaked), it will have kurtosis > 3. 
The tabstat command provides a more flexible alternative to summarize. We can specify just which summary statistics we want to see. For example,

   . tabstat lived, stats(mean range skewness)

    variable |      mean     range  skewness
-------------+-------------------------------
       lived |  19.26797        80  1.208804

With a by(varname) option, tabstat constructs a table containing summary statistics for each value of varname. The following example contains means, standard deviations, medians, interquartile ranges, and numbers of nonmissing observations of lived, for each category of gender. The means and medians both indicate that, on average, the women in this sample had lived in town for fewer years than the men. Note that the median column is labeled "p50", meaning 50th percentile.

   . tabstat lived, stats(mean sd median iqr n) by(gender)

Summary for variables: lived
     by categories of: gender (Respondent's gender)

  gender |      mean        sd       p50       iqr         N
---------+----------------------------------------------------
    male |  23.48333  19.69125      19.5        28        60
  female |  16.54839  14.39463        13        19        93
---------+----------------------------------------------------
   Total |  19.26797  16.95466        15        24       153

Statistics available for the stats() option of tabstat include:

   mean        Mean
   count       Count of nonmissing observations
   n           Same as count
   sum         Sum
   max         Maximum
   min         Minimum
   range       Range = max - min
   sd          Standard deviation
   var         Variance
   cv          Coefficient of variation = sd/mean
   semean      Standard error of mean = sd/sqrt(n)
   skewness    Skewness
   kurtosis    Kurtosis
   median      Median (same as p50)
   p1          1st percentile (similarly, p5, p10, p25, p50, p75, p95,
               or p99)
   iqr         Interquartile range = p75 - p25
   q           Quartiles; equivalent to specifying p25 p50 p75

Further tabstat options give control over the table layout and labeling. Type help tabstat to see a complete list.

The statistics produced by summarize or tabstat describe the sample at hand. We might also want to draw inferences about the population, for example, by constructing a 99% confidence interval for the mean of lived:

   . ci lived, level(99)

    Variable |     Obs        Mean    Std. Err.    [99% Conf. Interval]
-------------+----------------------------------------------------------
       lived |     153    19.26797    1.370703     15.69241    22.84354

Based on this sample, we could be 99% confident that the population mean lies somewhere in the interval from 15.69 to 22.84 years. Here we used a level() option to specify a 99% confidence interval. If we omit this option, ci defaults to a 95% confidence interval. Other options allow ci to calculate exact confidence intervals for variables that follow binomial or Poisson distributions. A related command, cii, calculates normal, binomial, or Poisson confidence intervals directly from summary statistics, such as we might encounter in a published article. It does not require the raw data. Type help ci for details about both commands.

Exploratory Data Analysis

Statistician John Tukey invented a toolkit of methods for exploratory data analysis (EDA), which involves analyzing data in an exploratory and skeptical way, without making unneeded assumptions (see Tukey 1977; also Hoaglin, Mosteller, and Tukey 1983, 1985). Box plots, introduced in Chapter 3, are one of Tukey's best-known innovations. Another is the stem-and-leaf display, a graphical arrangement of ordered data values in which initial digits form the "stems" and following digits for each observation make up the "leaves."

   . stem lived
Stem-and-leaf plot for lived (Years lived in town)

  0* | 1111111222223333333344444444
  0. | 55555555555566666666777889999
  1* | 0000001122223333334
  1. | 55555567788899
  2* | 000000111112224444
  2. | 56778899
  3* | 00000124
  3. | 5555666789
  4* | 0012
  4. | 59
  5* | 00134
  5. | 556
  6* |
  6. | 5558
  7* |
  7. |
  8* | 1

stem automatically chose a double-stem version here, in which 1* denotes first digits of 1 and second digits of 0-4 (that is, respondents who had lived in town 10-14 years). 1. denotes first digits of 1 and second digits of 5 to 9 (15-19 years). We can control the number of lines per initial digit with the lines() option. For example, a five-stem version in which the 1* stem holds leaves of 0-1, 1t leaves of 2-3, 1f leaves of 4-5, 1s leaves of 6-7, and 1. leaves of 8-9 could be obtained by typing

   . stem lived, lines(5)

Type help stem for information about other options. Letter-value displays (lv) use order statistics to dissect a distribution.

   . lv lived

                         #  153       Years lived in town
----------------------------------------------------------------------
M        77               15      |       spread      pseudosigma
F        39        5      17        29        24         17.973
E        20        3      21        39        36         15.86391
D      10.5        2      27        52        50         16.2351
C       5.5        1    30.75     60.5      59.5         16.26523
B         3        1      33        65        64         15.15955
A         2        1    34.5        68        67         14.9762
Z       1.5        1    37.75     74.5      73.5         14.4113
           1       1      41        81        80         15.32737
----------------------------------------------------------------------
         inner fence       -31         65       # below    # above
                                                      0          5
         outer fence       -67        101            0          0

M denotes the median, and F the "fourths" (quartiles, using a different approximation than the quartile approximation used by summarize, detail and tabstat). E, D, C, ... denote cutoff points such that roughly 1/8, 1/16, 1/32, ... of the distribution remains outside in the tails. The second column of numbers gives the "depth," or distance from the nearest extreme, for each letter value. Within the center box, the middle column gives "midsummaries," which are averages of the two letter values. If midsummaries drift away from the median, as they do for lived, this tells us that the distribution grows more skewed as we move farther out into the tails. The "spreads" are differences between pairs of letter values. For instance, the spread between F's equals the approximate interquartile range. Finally, "pseudosigmas" in the right-hand column estimate what the standard deviation should be if these letter values described a Gaussian population. The F pseudosigma, sometimes called a "pseudo standard deviation" (PSD), provides a simple and outlier-resistant check for approximate normality in symmetrical distributions:

1. Comparing the mean with the median diagnoses overall skew:
      mean > median     positive skew
      mean = median     symmetry
      mean < median     negative skew

2. If the mean and median are similar, indicating symmetry, then a comparison between the standard deviation and the PSD helps to evaluate tail normality:
      standard deviation > PSD     heavier-than-normal tails
      standard deviation = PSD     normal tails
      standard deviation < PSD     lighter-than-normal tails

Outliers can be identified relative to "fences": mild outliers fall beyond the inner fences, 1.5 IQR below the lower fourth or above the upper fourth, and severe outliers fall beyond the outer fences, 3 IQR below or above the fourths. lv gives these cutoffs and the number of outliers of each type. Severe outliers, values beyond the outer fences, occur sparsely (about two per million) in normal populations. Monte Carlo simulations suggest that the presence of any severe outliers in samples of n = 15 to about 20,000 should be sufficient evidence to reject a normality hypothesis at alpha = .05 (Hamilton 1992b). Severe outliers create problems for many statistical techniques.

summarize, stem, and lv all confirm that lived has a positively skewed sample distribution, not at all resembling a theoretical normal curve. The next section introduces more formal normality tests, and transformations that can reduce a variable's skew.
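As a quick check along these lines, the PSD can also be computed from summarize's saved percentiles rather than lv's fourths; a minimal sketch (1.349 is the IQR of a standard normal distribution):

   . quietly summarize lived, detail
   . display "PSD = " (r(p75) - r(p25))/1.349
   . display "SD  = " r(sd)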
Normality Tests and Transformations

Many statistical procedures work best when applied to variables that follow normal distributions. The preceding section described exploratory methods to check for approximate normality, extending the graphical tools (histograms, box plots, symmetry plots, and quantile-normal plots) presented in Chapter 3. A skewness-kurtosis test, making use of the skewness and kurtosis statistics shown by summarize, detail, can more formally evaluate the null hypothesis that the sample at hand came from a normally-distributed population.

   . sktest lived

                  Skewness/Kurtosis tests for Normality
                                                      ------ joint ------
    Variable | Pr(Skewness)  Pr(Kurtosis)  adj chi2(2)      Prob>chi2
-------------+------------------------------------------------------------
       lived |        0.000         0.028        24.79         0.0000

sktest here rejects normality: lived appears significantly nonnormal in skewness (P = .000), kurtosis (P = .028), and in both statistics considered jointly (P = .0000). Stata rounds off displayed probabilities to three or four decimals; "0.0000" really means P < .00005. Other normality or log-normality tests include the Shapiro-Wilk W (swilk) and Shapiro-Francia W' (sfrancia) methods. Type help sktest to see the options.

Nonlinear transformations such as square roots and logarithms are often employed to change distributions' shapes, with the aim of making skewed distributions more symmetrical and perhaps more nearly normal. Transformations might also help linearize relationships between variables (Chapter 8). Table 4.1 shows a progression called the "ladder of powers" (Tukey 1977) that provides guidance for choosing transformations to change distributional shape.

The variable lived exhibits mild positive skew, so its square root might be more symmetrical. We could create a new variable equal to the square root of lived by typing

   . generate srlived = lived^.5

Instead of lived^.5, we could equally well have written sqrt(lived). Logarithms are another transformation that can reduce positive skew. To generate a new variable equal to the natural (base e) logarithm of lived, type

   . generate loglived = ln(lived)

In the ladder of powers and related transformation schemes such as Box-Cox, logarithms take the place of a "0" power. Their effect on distribution shape is intermediate between the .5 (square root) and -.5 (reciprocal root) transformations.

Table 4.1: Ladder of Powers

   Transformation               Formula                Effect
   --------------------------   -------------------    ----------------------
   cube                         new = old^3            reduce severe
                                                         negative skew
   square                       new = old^2            reduce mild
                                                         negative skew
   raw                          new = old              no change (raw data)
   square root                  new = old^.5           reduce mild
                                                         positive skew
   log_e (or log_10)            new = ln(old) or
                                new = log10(old)       reduce positive skew
   negative reciprocal root     new = -(old^-.5)       reduce severe
                                                         positive skew
   negative reciprocal          new = -(old^-1)        reduce very severe
                                                         positive skew
   negative reciprocal square   new = -(old^-2)        and so forth
   negative reciprocal cube     new = -(old^-3)

When old itself contains negative or zero values, it is necessary to add a constant before transformation. For example, if arrests measures the number of times a person has been arrested (0 for many people), then a suitable log transformation could be

   . generate larrests = ln(arrests + 1)
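To judge whether these re-expressions helped, sktest accepts several variables at once; a minimal sketch using the variables just created:

   . sktest lived srlived loglived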
The ladder command combines the ladder of powers with sktest tests for normality. It tries each power on the ladder, and reports whether the result is significantly nonnormal. This can be illustrated using the severely skewed variable energy, per capita energy consumption, from states.dta.

   . ladder energy

   Transformation        formula             chi2(2)     P(chi2)
   ---------------------------------------------------------------
   cube                  energy^3              53.74       0.000
   square                energy^2              45.53       0.000
   raw                   energy                33.25       0.000
   square-root           sqrt(energy)          25.03       0.000
   log                   log(energy)           15.88       0.000
   reciprocal root       1/sqrt(energy)         7.36       0.025
   reciprocal            1/energy               1.32       0.517
   reciprocal square     1/(energy^2)           4.13       0.127
   reciprocal cube       1/(energy^3)          11.56       0.003

It appears that the reciprocal transformation, 1/energy (or energy^-1), most closely resembles a normal distribution. Most of the other transformations (including the raw data) are significantly nonnormal. Figure 4.1 (produced by the gladder command) visually supports this conclusion by comparing histograms of each transformation to normal curves.

   . gladder energy

Figure 4.1 (histograms of per capita energy consumed, by transformation: cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic)

Figure 4.2 shows a corresponding set of quantile-normal plots for these ladder-of-powers transformations, obtained by the "quantile ladder" command qladder. To make the tiny plots more readable in this example, we scale the labels and marker symbols up by 25% with the scale(1.25) option. The axis labels (which would be crowded and unreadable) are suppressed by the options ylabel(none) xlabel(none).

   . qladder energy, scale(1.25) ylabel(none) xlabel(none)

Figure 4.2 (quantile-normal plots by transformation)

An alternative technique called Box-Cox transformation offers finer gradations between transformations, and automates the choice among them (easier for the analyst, but not always a good thing). The command bcskew0 finds a value of lambda (L) for the Box-Cox transformation

      y(L) = (y^L - 1)/L      L > 0 or L < 0
      y(L) = ln(y)            L = 0

such that y(L) has approximately 0 skewness. Applying this to energy, we obtain the transformed variable benergy:

   . bcskew0 benergy = energy, level(95)

       Transform |          L    [95% Conf. Interval]     Skewness
   --------------+-------------------------------------------------
   (energy^L-1)/L|   -1.24605    -2.0525    -.6163383      .000281
   (1 missing value generated)

That is, benergy = (energy^-1.246 - 1)/(-1.246) is the transformation that comes closest to symmetry (as defined by the skewness statistic). The Box-Cox parameter L = -1.246 is not far from our ladder-of-powers choice, the -1 power. The confidence interval for L, -2.0525 < L < -.6163, allows us to reject some other possibilities including logarithms (L = 0) or square roots (L = .5). Chapter 8 describes a Box-Cox approach to regression modeling.
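Because bcskew0 leaves benergy in the dataset, its near-zero skewness can be verified directly; a minimal sketch:

   . summarize benergy, detail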
Frequency Tables and Two-Way Cross-Tabulations

The methods described above apply to measurement variables. Categorical variables require other approaches, such as tabulation. Returning to the survey data in VTtown.dta, we could find the percentage of respondents who attended meetings concerning the pollution problem by tabulating the categorical variable meetings:

   . tabulate meetings

   Attended |
meetings on |
  pollution |      Freq.     Percent        Cum.
------------+-----------------------------------
         no |        106       69.28       69.28
        yes |         47       30.72      100.00
------------+-----------------------------------
      Total |        153      100.00

tabulate can produce frequency distributions for variables that have thousands of values. To construct a manageable frequency distribution table for a variable with many values, however, you might first want to group those values by applying generate with its recode or autocode options (see Chapter 2 or help generate).

tabulate followed by two variable names creates a two-way cross-tabulation. For example, here is a cross-tabulation of meetings by kids (whether the respondent has children under 19 living in town):

   . tabulate meetings kids

   Attended |
meetings on |  Have children <19 in town?
  pollution |        no        yes |     Total
------------+----------------------+----------
         no |        52         54 |       106
        yes |        11         36 |        47
------------+----------------------+----------
      Total |        63         90 |       153

The first-named variable forms the rows, and the second forms the columns in the resulting table. We see that only 11 of these 153 people were non-parents who attended the meetings. tabulate has a number of options that are useful with frequency tables:

all          Equivalent to the options chi2 lrchi2 gamma taub V. Not all
             of these options will be equally appropriate for a given
             table. gamma and taub assume that both variables have
             ordered categories, whereas chi2, lrchi2, and V do not.

cchi2        Displays the contribution to Pearson chi-squared in each
             cell of a two-way table.

cell         Shows total percentages for each cell.

chi2         Pearson chi-squared test of the hypothesis that the row and
             column variables are independent.

clrchi2      Displays the contribution to likelihood-ratio chi-squared in
             each cell of a two-way table.
To construct a simple frequency table of meetings, type table meetings, contents(freq) Attended nee t ings on poll;ticn For a two-way frequency table or cross-tabulation, type table meetings kids, contents(freq) Have Attended 1 children meeti ng s I •-19 in on 1 town? pollution I no yes n o 5 2 5 4 ye s 1 11 3 6 Summary Statistics and Tables 135 If we specify a third categorical variable, it forms the "supercolumns" of a three-way tabli . table meetings kids contam, contents(freq) P el iev e o w n I P- oper z ter At tended | vontdrr inat e d and Hav e meetings | children <1 g in town on | ---no --- ---yes pollution ] :.o yes n o yes no | 42 44 10 10 yes ] 4 20 7 1 6 More complicated tables require the by ( ) option, which allows up to four "supperrow" variables, table thus can produce up to seven-way tables: one row, one column, one supercolumn, and up to four superrows. Here is a four-way example: . table meetings kids contam, contents(freq) by(gender) Pesponden | t ' s [ gender j Believe own and | proper ty 'water Attended j contaminated and Have mee t i n gs | children <19 in town o n | --- no--- ---yes pollution | no yes no ye s ma le | n c | 1 8 18 5 J ye s | 2 7 3 6 female | no | 24 26 7 7 ye s I 2 1 3 4 10 The contents ( ) option of tabl contents(freq) contents(mean varname) contents(sd varname) contents(sum varname) contents(rawsum varname) contents(count varname) contents(n varname) contents(max varname) contents(min varname) contents(median varname) contents(iqr varname) specifies what statistics the table's cells contain: Frequency Mean of varname Standard deviation of varname Sum of varname Sums ignoring optionally specified weight Count of nonmissing observations of varname Same as count Maximum of varname Minimum of varname Median of varname Interquartile range (IQR) of varname 136 Statistics with Stata Summary Statistics and Tables 137 contents (pi vamame) 1st percentile of varname contents (p2 vamame) 2nd percentile of varname (so forth to p9 9 ) The next section illustrates several more of these options. Tables of Means, Medians, and Other Summary Statistics___ tabulate readily produces tables of means and standard deviations within categories of the tabulated variable. For example, to form a one-way table with means of lived within each category of meetings, type tabulate meetings, summ(lived) Meetings attenders appear to be relative newcomers, averaging 14.2 years in town, compared with 21.5 years for those who did not attend. We can also use tabulate to form a two-way table of means by typing tabulate meetings kids, sum(lived) means Means of Years lived in town Attended I meetings I Have children <19 on i in town ? pollution1, no ye. Total no [ 28.J07692 14.962963 | 21.5094 34 yes I 23.363636 11.416667 | 14.212766 Total I 27.444444 13.5 4 4444 I 19.267974 Both parents and nonparents among the meeting attenders tend to have lived fewer years in town, so the newcomer/oldtimer division noticed in the previous table is not a spurious reflection of the fact that parents with young children were more likely to attend. The means option used above called for a table containing only means. Otherwise we get a bulkier table with means, standard deviations, and frequencies in each cell. Chapter 5 describes statistical tests for hypotheses about subgroup means. Although it performs no tests, table nicely builds up to seven-way tables containing means, standard deviations, sums, medians, or other statistics (see the option list in previous section). 
Here is a one-way table showing means of lived within categories of meetings:

. table meetings, contents(mean lived)

------------------------
Attended   |
meetings   |
on         |
pollution  | mean(lived)
-----------+------------
        no |     21.5094
       yes |     14.2128
------------------------

A two-way table of means is a straightforward extension:

. table meetings kids, contents(mean lived)

----------------------------------
Attended   |
meetings   |   Have children <19
on         |       in town?
pollution  |       no         yes
-----------+----------------------
        no |  28.3077      14.963
       yes |  23.3636     11.4167
----------------------------------

Table cells can contain more than one statistic. Suppose we want a two-way table with both means and medians of the variable lived:

. table meetings kids, contents(mean lived median lived)

----------------------------------
Attended   |
meetings   |   Have children <19
on         |       in town?
pollution  |       no         yes
-----------+----------------------
        no |  28.3077      14.963
           |     27.5        12.5
           |
       yes |  23.3636     11.4167
           |       21           6
----------------------------------

The medians in the table above confirm our earlier conclusion based on means: the meeting attenders, both parents and nonparents, tended to have lived fewer years in town than their non-attending counterparts. Medians within each cell are less than the means, reflecting the positive skew (means pulled up by a few long-time residents) of the variable lived. The cell contents shown by table could be means, medians, sums, or other summary statistics for two or more different variables.

Using Frequency Weights

summarize, tabulate, table, and related commands can be used with frequency weights that indicate the number of replicated observations. For example, file sextab2.dta contains results from a British survey of sexual behavior (Johnson et al. 1992). It apparently has 48 observations:

. describe

Contains data from C:\data\sextab2.dta
  obs:            48                 British sex survey (Johnson 92)
 vars:             4                 11 Jul 2005 18:05
 size:           432 (99.9% of memory free)

              storage  display    value
variable name   type   format     label      variable label
------------------------------------------------------------------------
age             byte   %8.0g      age        Age
gender          byte   %8.0g      gender     Gender
lifepart        byte   %8.0g      partners   # heterosex partners lifetime
count           int    %8.0g                 Number of individuals
------------------------------------------------------------------------
Sorted by:  age  lifepart  gender

One variable, count, indicates the number of individuals with each combination of characteristics, so this small dataset actually contains information from over 18,000 respondents. For example, 405 respondents were male, ages 16 to 24, and reported having no heterosexual partners so far in their lives.

. list in 1/5

     +----------------------------------------+
     |   age   gender   lifepart        count |
     |----------------------------------------|
  1. | 16-24     male       none          405 |
  2. | 16-24   female       none          465 |
  3. | 16-24     male        one          606 |
  4. | 16-24   female        one         1194 |
  5. | 16-24     male        two              |
     +----------------------------------------+

We use count as a frequency weight to create a cross-tabulation:

. tabulate lifepart gender [fw = count]

          # |
  heterosex |
   partners |        Gender
   lifetime |      male     female |     Total
------------+----------------------+----------
       none |       544        586 |      1130
        one |      1734       4146 |      5880
        two |       887       1777 |      2664
        3-4 |      1542       1908 |      3450
        5-9 |      1630       1364 |      2994
        10+ |      2048        708 |      2756
------------+----------------------+----------
      Total |      8385      10489 |     18874

The usual tabulate options work as expected with frequency weights. Here is the same table showing column percentages instead of frequencies:

. tabulate lifepart gender [fweight = count], column nof

          # |
  heterosex |
   partners |        Gender
   lifetime |      male     female |     Total
------------+----------------------+----------
       none |      6.49       5.59 |      5.99
        one |     20.68      39.53 |     31.15
        two |     10.58      16.94 |     14.11
        3-4 |     18.39      18.19 |     18.28
        5-9 |     19.44      13.00 |     15.86
        10+ |     24.42       6.75 |     14.60
------------+----------------------+----------
      Total |    100.00     100.00 |    100.00
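Frequency weighting is equivalent to literally replicating observations. As a sketch (not an example from the text), we could expand the 48 aggregated records into 18,874 individual observations, after which the unweighted tabulate produces the same table:

. preserve
. expand count
. tabulate lifepart gender
. restore

expand count replaces each observation with count copies of itself, so this approach trades memory for convenience; with large surveys the [fweight = count] syntax is usually preferable.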
Other types of weights, such as probability or analytical weights, do not work as well with tabulate because their meanings are unclear with regard to the command's principal options.

A different application of frequency weights can be demonstrated with summarize. File college1.dta contains information on a random sample consisting of 11 U.S. colleges, drawn from Barron's Compact Guide to Colleges (1992).

. describe

Contains data from C:\data\college1.dta
  obs:            11                 Colleges sample 1 (Barron's 92)
 vars:             5                 11 Jul 2005 18:05
 size:           429 (99.9% of memory free)

              storage  display    value
variable name   type   format     label      variable label
------------------------------------------------------------------------
school          str28  %28s                  College or university
enroll          int    %8.0g                 Full-time students 1991
pctmale         byte   %8.0g                 Percent male 1991
msat            int    %8.0g                 Average math SAT
vsat            int    %8.0g                 Average verbal SAT
------------------------------------------------------------------------
Sorted by:

The variables include msat, the mean math Scholastic Aptitude Test score at each of the 11 schools.

. list school enroll msat

     +---------------------------------------------+
     | school                         enroll   msat |
     |---------------------------------------------|
     | Brown University                 5550    680 |
     | U. Scranton                      3821    554 |
     | U. North Carolina/Asheville      2035    540 |
     | Claremont College                 849    660 |
     | DePaul University                6197    547 |
     | Thomas Aquinas College            201    570 |
     | Davidson College                 1543    640 |
     | U. Michigan/Dearborn             3541    435 |
     | Mass. College of Art              961    482 |
     | Oberlin College                  2765    640 |
     | American University              5228    587 |
     +---------------------------------------------+

We can easily find the mean msat value among these 11 schools by typing

. summarize msat

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+-------------------------------------------------------
        msat |      11    580.4545    67.63189        482        680

This summary table gives each school's mean math SAT score the same weight. DePaul University, however, has 30 times as many students as Thomas Aquinas College. To take the different enrollments into account, we could weight by enroll:

. summarize msat [fweight = enroll]

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+-------------------------------------------------------
        msat |   32691     583.064    63.10665        482        680

Typing summarize msat [freq = enroll] would accomplish the same thing. The enrollment-weighted mean, unlike the unweighted mean, is equivalent to the mean for the 32,691 students at these colleges (assuming they all took the SAT). Note, however, that we could not say the same thing about the standard deviation, minimum, or maximum. Apart from the mean, most individual-level statistics cannot be calculated simply by weighting data that already are aggregated. Thus, we need to use weights with caution. They might make sense in the context of one particular analysis, but seldom do for the dataset as a whole, when many different kinds of analyses are needed.
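To see the arithmetic behind the weighted mean, we can reproduce it by hand as the enrollment-weighted total of msat divided by total enrollment. This is a minimal sketch using only the variables above; wtotal and sumwy are names invented here:

. generate double wtotal = msat * enroll
. quietly summarize wtotal
. scalar sumwy = r(sum)
. quietly summarize enroll
. display sumwy / r(sum)

The result, about 583.06, matches the mean reported by summarize msat [fweight = enroll].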
ANOVA and Other Comparison Methods

Analysis of variance (ANOVA) encompasses a set of methods for testing hypotheses about differences between means. Its applications range from simple analyses where we compare the means of y across categories of x, to more complicated situations with multiple categorical and measurement x variables. t tests for hypotheses regarding a single mean (one-sample) or a pair of means (two-sample) correspond to elementary forms of ANOVA. Rank-based "nonparametric" tests, including sign, Mann-Whitney, and Kruskal-Wallis, take a different approach to comparing distributions. These tests make weaker assumptions about measurement, distribution shape, and spread. Consequently, they remain valid under a wider range of conditions than ANOVA and its "parametric" relatives. Careful analysts sometimes use parametric and nonparametric tests together, checking to see whether both point toward similar conclusions. Further troubleshooting is called for when parametric and nonparametric results disagree.

anova is the first of Stata's model-fitting commands to be introduced in this book. Like the others, it has considerable flexibility, encompassing a wide variety of models. anova can fit one-way and N-way ANOVA or analysis of covariance (ANCOVA) for balanced and unbalanced designs, including designs with missing cells. It can also fit factorial, nested, mixed, or repeated-measures designs. One follow-up command, predict, calculates predicted values, several types of residuals, and assorted standard errors and diagnostic statistics after anova. Another follow-up command, test, obtains tests of user-specified null hypotheses. Both predict and test work similarly with other Stata model-fitting commands, such as regress (Chapter 6).

The following menu choices give access to most operations described in this chapter:
   Statistics - Summaries, tables, & tests - Classical tests of hypotheses
   Statistics - Summaries, tables, & tests - Nonparametric tests of hypotheses
   Statistics - ANOVA/MANOVA
   Statistics - General post-estimation - Obtain predictions, residuals, etc., after estimation
   Graphics - Overlaid twoway graphs

Example Commands

. anova y x1 x2
   Performs two-way ANOVA, testing for differences among the means of y across categories of x1 and x2.

. anova y x1 x2 x1*x2
   Performs a two-way factorial ANOVA, including both the main and interaction (x1*x2) effects of categorical variables x1 and x2.

. anova y x1 x2 x3 x1*x2 x1*x3 x2*x3 x1*x2*x3
   Performs a three-way factorial ANOVA, including the three-way interaction x1*x2*x3, as well as all two-way interactions and main effects.

. anova reading curriculum / teacher|curriculum
   Fits a nested model to test the effects of three types of curriculum on students' reading ability (reading). teacher is nested within curriculum (teacher|curriculum) because several different teachers were assigned to each curriculum. The Base Reference Manual provides other nested ANOVA examples, including a split-plot design.

. anova headache subject medication, repeated(medication)
   Fits a repeated-measures ANOVA model to test the effects of three types of headache medication (medication) on the severity of subjects' headaches (headache). The sample consists of 20 subjects who report suffering from frequent headaches. Each subject tried each of the three medications at separate times during the study.

. anova y x1 x2 x3 x4 x2*x3, continuous(x3 x4) regress
   Performs analysis of covariance (ANCOVA) with four independent variables, two of them (x1 and x2) categorical and two of them (x3 and x4) measurements. Includes the x2*x3 interaction, and shows results in the form of a regression table instead of the default ANOVA table.

. kwallis y, by(x)
   Performs a Kruskal-Wallis test of the null hypothesis that y has identical rank distributions across the k categories of x (k > 2).

. oneway y x
   Performs a one-way analysis of variance (ANOVA), testing for differences among the means of y across categories of x. The same analysis, with a different output table, is produced by anova y x.

. oneway y x, tabulate scheffe
   Performs one-way ANOVA, including a table of sample means and Scheffe multiple-comparison tests in the output.

. ranksum y, by(x)
   Performs a Wilcoxon rank-sum test (also known as a Mann-Whitney U test) of the null hypothesis that y has identical rank distributions for both categories of dichotomous variable x.
If we assume that both rank distributions possess the same shape, this amounts to a test for whether the two medians of y are equal.

. serrbar ymean se x, scale(2)
   Constructs a standard-error-bar plot from a dataset of means. Variable ymean holds the group means of y; se, the standard errors; and x, the values of categorical variable x. scale(2) asks for bars extending to plus or minus 2 standard errors around each mean (the default is plus or minus 1 standard error).

. signrank y1 = y2
   Performs a Wilcoxon matched-pairs signed-rank test for the equality of the rank distributions of y1 and y2. We could test whether the median of y1 differs from a constant such as 23.4 by typing the command signrank y1 = 23.4.

. signtest y1 = y2
   Tests the equality of the medians of y1 and y2 (assuming matched data; that is, both variables measured on the same sample of observations). Typing signtest y1 = 5 would perform a sign test of the null hypothesis that the median of y1 equals 5.

. ttest y = 5
   Performs a one-sample t test of the null hypothesis that the population mean of y equals 5.

. ttest y1 = y2
   Performs a one-sample (paired difference) t test of the null hypothesis that the population mean of y1 equals that of y2. The default form of this command assumes that the data are paired. With unpaired data (y1 and y2 are measured from two independent samples), add the option unpaired.

. ttest y, by(x) unequal
   Performs a two-sample t test of the null hypothesis that the population mean of y is the same for both categories of variable x. Does not assume that the populations have equal variances. (Without the unequal option, ttest does assume equal variances.)

One-Sample Tests

One-sample t tests have two seemingly different applications:
1. Testing whether a sample mean differs significantly from a hypothesized value.
2. Testing whether the means of y1 and y2, two variables measured over the same set of observations, differ significantly from each other. This is equivalent to testing whether the mean of a "difference score" variable, created by subtracting y2 from y1, equals zero (sketched after the data description below).

We use essentially the same formulas for either application, although the second starts with information on two variables instead of one.

The data in writing.dta were collected to evaluate a college writing course based on word processing (Nash and Schwartz 1987). Measures such as the number of sentences completed in timed writing were collected both before and after students took the course. The researchers wanted to know whether the post-course measures showed improvement.

. describe

Contains data from C:\data\writing.dta
  obs:            24                 Nash and Schwartz (1987)
 vars:             9                 12 Jul 2005 10:16
 size:           312 (99.9% of memory free)

              storage  display    value
variable name   type   format     label      variable label
------------------------------------------------------------------------
id              byte   %8.0g                 Student ID
preS            byte   %8.0g                 # of sentences (pre-test)
preP            byte   %8.0g                 # of paragraphs (pre-test)
preC            byte   %8.0g                 Coherence scale 0-2 (pre-test)
preE            byte   %8.0g                 Evidence scale 0-6 (pre-test)
postS           byte   %8.0g                 # of sentences (post-test)
postP           byte   %8.0g                 # of paragraphs (post-test)
postC           byte   %8.0g                 Coherence scale 0-2 (post-test)
postE           byte   %8.0g                 Evidence scale 0-6 (post-test)
------------------------------------------------------------------------
Sorted by:
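The equivalence described in application 2 can be sketched with these data. gainS below is an invented variable name, and the paired-test form of the same comparison appears later in this section:

. generate gainS = postS - preS
. ttest gainS = 0

Testing whether the mean of gainS equals zero gives exactly the same t statistic and probabilities as the paired test ttest postS = preS.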
Suppose that we knew that students in previous years were able to complete an average of 10 sentences. Before examining whether the students in writing.dta improved during the course, we might want to learn whether at the start of the course they were essentially like earlier students — in other words, whether their pre-test (preS) mean differs significantly from the mean of previous students (10). To see a one-sample t test of the null hypothesis that the mean equals 10, type

. ttest preS = 10

One-sample t test

 Variable |  Obs      Mean   Std. Err.  Std. Dev.  [95% Conf. Interval]
----------+-------------------------------------------------------------
     preS |   24  10.79167   .9402034   4.606037    8.846709   12.73663

Degrees of freedom: 23

                        Ho: mean(preS) = 10

     Ha: mean < 10          Ha: mean != 10           Ha: mean > 10
       t =  0.8420            t =  0.8420              t =  0.8420
   P < t =  0.7958        P > |t| =  0.4084        P > t =  0.2042

The notation P > t means "the probability of a greater value of t," that is, the one-tail test probability. The two-tail probability of a greater absolute t appears as P > |t| = .4084. Because this probability is high, we have no reason to reject the null hypothesis that the mean equals 10. Note that ttest automatically provides a 95% confidence interval for the mean. We could get a different confidence interval, such as 90%, by adding a level(90) option to this command.

A nonparametric counterpart, the sign test, employs the binomial distribution to test hypotheses about single medians. For example, we could test whether the median of preS equals 10. signtest gives us no reason to reject that null hypothesis either.

. signtest preS = 10

Sign test

        sign |    observed    expected
-------------+------------------------
    positive |          12          11
    negative |          10          11
        zero |           2           2
-------------+------------------------
         all |          24          24

One-sided tests:
  Ho: median of preS - 10 = 0 vs.
  Ha: median of preS - 10 > 0
      Pr(#positive >= 12) =
         Binomial(n = 22, x >= 12, p = 0.5) =  0.4159

  Ho: median of preS - 10 = 0 vs.
  Ha: median of preS - 10 < 0
      Pr(#negative >= 10) =
         Binomial(n = 22, x >= 10, p = 0.5) =  0.7383

Two-sided test:
  Ho: median of preS - 10 = 0 vs.
  Ha: median of preS - 10 != 0
      Pr(#positive >= 12 or #negative >= 12) =
         min(1, 2*Binomial(n = 22, x >= 12, p = 0.5)) =  0.8318

Like ttest, signtest includes right-tail, left-tail, and two-tail probabilities. Unlike the symmetrical t distributions used by ttest, however, the binomial distributions used by signtest have different left- and right-tail probabilities. In this example, only the two-tail probability matters because we were testing whether the writing.dta students "differ" from their predecessors.

Next, we can test for improvement during the course by testing the null hypothesis that the mean number of sentences completed before and after the course (that is, the means of preS and postS) are equal. The ttest command accomplishes this as well, finding a significant improvement.

. ttest postS = preS

Paired t test

 Variable |  Obs      Mean   Std. Err.  Std. Dev.  [95% Conf. Interval]
----------+-------------------------------------------------------------
    postS |   24    26.375   1.693779   8.297787    22.87115   29.87885
     preS |   24  10.79167   .9402034   4.606037    8.846708   12.73663
----------+-------------------------------------------------------------
     diff |   24  15.58333   1.383019   6.775382    12.72234   18.44433

            Ho: mean(postS - preS) = mean(diff) = 0

  Ha: mean(diff) < 0     Ha: mean(diff) != 0      Ha: mean(diff) > 0
     t =  11.2676           t =  11.2676             t =  11.2676
 P < t =   1.0000       P > |t| =  0.0000        P > t =   0.0000

Because we expect "improvement," not just "difference," between the preS and postS means, a one-tail test is appropriate. The displayed one-tail probability rounds off to four decimal places as zero ("0.0000" really means P < .00005). Students' mean sentence completion does significantly improve. Based on this sample, we are 95% confident that it improves by between 12.7 and 18.4 sentences.
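Returning to the sign test above, its one-sided probability can be reproduced directly. As a sketch (not an example from the text), using the binomial tail function found in recent versions of Stata:

. display binomialtail(22, 12, .5)
.41590595

binomialtail(n, k, p) returns the probability of observing k or more successes in n trials, matching the Pr(#positive >= 12) value of 0.4159 shown by signtest.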
t tests assume that variables follow a normal distribution. This assumption usually is not critical because the tests are moderately robust. When nonnormality involves severe outliers, however, or occurs in small samples, we might be safer turning to medians instead of means and employing a nonparametric test that does not assume normality. The Wilcoxon signed-rank test, for example, assumes only that the distributions are symmetrical and continuous. Applying a signed-rank test to these data yields essentially the same conclusion as ttest, that students' sentence completion significantly improved. Because both tests agree on this conclusion, we can assert it with more assurance.

. signrank postS = preS

Wilcoxon signed-rank test

        sign |      obs   sum ranks    expected
-------------+-----------------------------------
    positive |       24         300         150
    negative |        0           0         150
        zero |        0           0           0
-------------+-----------------------------------
         all |       24         300         300

unadjusted variance     1225.00
adjustment for ties       -1.63
adjustment for zeros       0.00
                        -------
adjusted variance       1223.38

Ho: postS = preS
             z =   4.289
    Prob > |z| =  0.0000

Two-Sample Tests

The remainder of this chapter draws examples from a survey of college undergraduates by Ward and Ault (1990) (student2.dta).

. describe

Contains data from C:\data\student2.dta
  obs:           243                 Student survey (Ward & Ault 1990)
 vars:            19                 12 Jul 2005 10:16
 size:         6,561 (99.9% of memory free)

              storage  display    value
variable name   type   format     label      variable label
------------------------------------------------------------------------
id              int    %8.0g                 Student ID
year            byte   %8.0g      year       Year in college
age             byte   %8.0g                 Age at last birthday
gender          byte   %9.0g      s          Gender (male)
major           byte   %8.0g                 Student major
relig           byte   %8.0g      v4         Religious preference
drink           byte   %9.0g                 33-point drinking scale
gpa             float  %9.0g                 Grade Point Average
grades          byte   %8.0g      grades     Guessed grades this semester
belong          byte   %8.0g      belong     Belong to fraternity/sorority
live            byte   %8.0g      v10        Where do you live?
miles           byte   %8.0g                 How many miles from campus?
study           byte   %8.0g                 Avg. hours/week studying
athlete         byte   %8.0g      yes        Are you a varsity athlete?
employed        byte   %8.0g      yes        Are you employed?
allnight        byte   %8.0g      allnight   How often study all night?
ditch           byte   %8.0g      times      How many class/month ditched?
hsdrink         byte   %9.0g                 High school drinking scale
aggress         byte   %9.0g                 Aggressive behavior scale
------------------------------------------------------------------------
Sorted by:  id
In this application its general syntax is ttest measurement, by (categorical). For example, . ttest drink, by(belong) Two-sample t test with equal variances Group I Obs Mean Std. Err Std. Dev. [95% C 0 n f. Int e rva1] member I 4 7 2 4.7234 . 7 124 518 4 . S8 4 323 2 3.28931 2 6 . 1 575 n or. it, em be I ] 96 1 7 .7 602 .457501? 6 . 4050 1 8 16.85792 18.662 49 combined j 243 19.107 .431224 6.722117 1 8 . 257 5 6 19.95643 d i f f . 6 . 9632 . 9978 608 4.997558 8 . 92 8842 D e qr e e 3 of f reedom : 241 Ho : mea n (membe r) - mean(nonmembe) = diff - 0 Ha : di ff < 0 Ha: diff != -■ 0 Ha: diff > 0 t - 6.9781 t = 6 978 1 t = 6 .978 1 P < t - 1.0000 P > 1 to i = 0 G G 0 0 P > t - 0 .0000 As the output notes, this t test rests on an equal-variances assumption. But the fraternity and sorority members' sample standard deviation appears somewhat lower — they are more alike than nonmembers in their reported drinking behavior. To perform a similar test without assuming equal variances, add the option unequal : ttest drink, by(belong) unequal Two-samp] e t test with unequal variances Group 1 Obs Mean Std. Err. Std. Dev. [95% Conf. Interval] membe r nonmembe 1 1 47 196 24.7234 1 7 . 7 602 .7124518 .4575013 4 . 88 4323 6.405018 23 . 2 8 931 16.85792 2 6 . 1 575 1 8 . 66249 combined 1 243 19.107 . 4 3122 4 6.722117 18.25756 19.95643 diff 1 6.9632 .84 66965 5.280627 8.645773 Satterthwaite's degrees of freedom: 88.22 Ho: mean(member) - mean(nonmembe) = diff = 0 Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 t - B.2240 t = 8.2240 t = 8.2240 P < t = 1.0000 P > ltl = 0.00 GO P > t = G.OOOO Adjusting for unequal variances does not alter our basic conclusion that members and nonmembers are significantly different. We can further check this conclusion by trying a nonparametric Mann-Whitney Utest, also known as a Wilcoxon rank-sum test. Assumingthat the rank distributions have similar shape, the rank-sum test here indicates that we can reject the null hypothesis of equal population medians. ranksum drink, by(belong) I '.vc - s arnp 1 e Wi 1 roxen rank-sun (Mann-Whitney) test belong I obs rank sum expected member , 47 8535 5^34 nonmember 1 196 21111 23912 combined i 24 3 29646 29646 unadjusted variance 187 310.67 adjustment for ties -472.30 adjusted variance 1 3 6 & 3 8 . 3 6 Ho: drink (bei ong = = mernbe r) = dr i n k (be 1 ong = = nonmember) z = 6.480 Prob > IzI = 0.00G0 One-Way Analysis of Variance (ANOVA)___ Analysis of variance (ANOVA) provides another way, more general than t tests, to test for differences among means. The simplest case, one-way ANOVA, tests whether the means of ■y differ across categories of.v. One-way ANOVA can be performed by a oneway command with the general form oneway measurement categorical. For example, oneway drink belong, tabulate Belong to | fraternity/ | Summary of 3 3-point drinking scale sorority I Mean Std. Dev. Freq. member I 24.723404 4.8843233 47 nonmember | 17.760204 6.4050179 196 Total/ 19.106996 6.7221166 243 Analysis of Variance Source SS df MS F Prob ^> F Between groups 1838.0S426 1 1838.08426 48.69 0.0000 Within groups 9097.13385 241 37.74 74433 Total 109 3 5.2181 242 45.1868 517 Bartlett•s test for equal variances: chi2 (1) = 4.8 378 Prob>chi2 = 0.028 The tabulate option produces a table of means and standard deviations in addition to the analysis of variance table itself. 
One-way ANOVA with a dichotomous.v variable is equivalent to a two-sample t test, and its F statistic equals the corresponding t statistic squared, oneway offers more options and processes faster, but it lacks ttest's unequal option for abandoning the equal-variances assumption. oneway formally tests the equal-variances assumption, using Bartlett's %2. A low Bartlett's probability implies that ANOVA's equal-variance assumption is implausible, in 150 Statistics with Stata which case we should not trust the ANOVA F test results. In the oneway drink belong example above, Bartlett's P = .028 casts doubt on the ANOVA's validity. ANOVA's real value lies not in two-sample comparisons, but in more complicated comparisons of three or more means. For example, we could test whether mean drinking behavior varies by year in college: oneway drink year, tabulate scheffe Year in college Summary of 33-point drinking scale Mean Std. Dev. Freq. Freshman Sophomore Juni or Senior 18.975 6.92260 33 21.169231 6.54 44 853 1 9 . 4 53333 6.286608 1 16 . 650^94 6 . 6409257 Total I 19.106996 6.7221166 24 3 Analysis of Variance SS df MS Between groups Within qtoups 666.200518 10269.0176 3 239 222 .0668 39 42.96 66008 F 5 . 17 Prob > F 0.0018 Total 10935.2181 242 45.1868 517 Bartlett's test for equal variances: chi2(3) = 0.5103 Prob>chi2 = Comparison of 33-point drinking scale by Year in college (Scheffe) .917 Row Mean-1 Col Mean 1 Freshman S ophomor Junior Sophomor I 2.19423 0.429 Junior 1 .478333 0.987 -1.7159 0.498 Senior | 1 -2.32421 0.382 -4.51844 0.002 -2 . 80254 0.103 We can reject the hypothesis of equal means (P ~ .0018), but not the hypothesis of equal variances (P = .917). The latter is ''good news" regarding the ANOVA's validity. The box plots in Figure 5.2 (next page) support this conclusion, showing similar variation within each category. This figure, which combines separate box plots and dot plots, shows that differences among medians and among means follow similar patterns. . graph hbox drink, over(year) ylabel (0 (5)35) saving(fig05^02a) . graph dot (mean) drink, over(year) ylabel(0(5)35, grid) marker(1, msymbol(S)) saving(fig05_Q2b) . graph combine fig05_02a . gph fig05__02h. gph, row (2) iscale(1.05) Freshman Sophomore Junior Senior ANOVA and Other Comparison Methods 151 Figure 5.2 10 .15 20 25 33-point drinking scale 30 35 Freshman Sophomore Junior Senior 0 10 15 20 , 25 30 35 mean or drink The scheffe option (Scheffe multiple-comparison test) produces a table showing the differences between each pair of means. The freshman mean equals 18.975 and the sophomore mean equals 21.16923, so the sophomore-freshman difference is 21.16923- 18.975 = 2.19423, not statistically distinguishable from zero (P = .429). Of the six contrasts in this table, only the senior-sophomore difference, 16.6508 - 21.1692 = -4.5184, is significant (P = .002). Thus, our overall conclusion that these four groups' means are not the same arises mainly from the contrast between seniors (the lightest drinkers) and sophomores (the heaviest). oneway offers three multiple-comparison options: scheffe , bonferroni , and sidak (see Base Reference Manual for definitions) . The Scheffe test remains valid under a wider variety of conditions, although it is sometimes less sensitive. The Kruskal-Wallis test (kwallis )> a ^-sample generalization of the two-sample rank-sum test, provides a nonparametric alternative to one-way ANOVA. It tests the null hypothesis of equal population medians. . 
kwallis drinJc, by(year) Test: Equality of populations 1 yea r Obs 1 Rank Sum 1 Fieshtnan 40 f 4 914.00 1 Sophomore 65 1 934 1.50 1 Junio r , 7 5 1 9300.50 1 Senior | 63 1 6090.00 chi-squa red = 14 453 with 3 probability - 0 0023 chi-squared with ties = 14.4 probability = 0 . 002 3 14.4 90 with 3 d.f 152 Statistics with Stata ANOVA and Other Comparison Methods 153 Here, the kwallis results (P = .0023) agree with our oneway findings of significant differences in drink by year in college. Kruskal-Wallis is generally safer than ANOVA if we have reason to doubt ANOVA's equal-variances or normality assumptions, or if we suspect problems caused by outliers, kwallis , like ranksum, makes the weaker assumption of similar-shaped distributions within each group. In principle, ranksum and kwallis should produce similar results when applied to two-sample comparisons, but in practice this is true only if the data contain no ties, ranksum incorporates an exact method for dealing with tics, which makes it preferable for two-sample problems. Two- and A/-Way Analysis of Variance One-way ANOVA examines how the means of measurement variable v vary across categories of one other variable .v. yV-way ANOVA generalizes this approach to deal with two or more categorical x variables. For example, we might consider how drinking behavior varies not only by fraternity or sorority membership, but also by gender. We start by examining a two-way table of means: table belong gender, contents(mean drink) row col Belong to f r a t e r n 11 y/scrorit y Gender F e m a 1 e (ma1e) Ma 1 e Total membe r nonmember 22.44444 I 6 . 5 1 7 2 4 26 . 1 9 1 3 7 9 3 .5625 24.7234 1 7 . 7 602 Total 17.31343 21 . 31193 19.107 It appears that in this sample, males drink more than females and members drink more than nonmembers. The member-nonmember difference appears similar among males and females. Stata's iV-way ANOVA command, anova , can test for significant differences among these means attributable to belonging to a fraternity or sorority, gender, or the interaction of belonging and gender (written belong*gender). anova drink belong gender belong*gender Number o± obs Root MSE = 5. 243 R-squared 96592 Adj R-squared = 0 . 0 . 2221 2123 Source I Part ial SS df MS F Pr ob > F Mode 1 | 2428.67237 3 80 9 . 5574 56 22.75 0 . 0000 be1ong gende r ue 1 onq * 'jen de r | 14 0 6.2366 | 408 . 5200 97 I 2 .7 8016612 1 1 1 140 6 . 2366 408 . 520097 3.78016612 39 . 51 11.48 0 .11 0 . 0 . 0 . 0 0 00 0008 74 48 Posidual 1 8 5 0 6.54574 23 9 35.5 9 22416 Tota 1 1 10935.2181 242 4 5.1868517 In this example of "two-way factorial ANOVA," the output shows significant main effects for belong (P = .0000) and gender (P = .0008), but their interaction contributes little to the model (P = .7448). This interaction cannot be distinguished from zero, so we might prefer to fit a simpler model without the interaction term (results not shown): anova drink belong gender To include any interaction term with anova , specify the variable names joined by *. Unless the number of observations with each combination of x values is the same (a condition called "balanced data"), it can be hard to interpret the main effects in a model that also includes interactions. This does not mean that the main effects in such models are unimportant, however. Regression analysis might help to make sense of complicated ANOVA results, as illustrated in the following section. Analysis of Covariance (ANCOVA)_ Analysis of Covariance (ANCOVA) extends F Model | 2927.0 3 087 3 975 . 67 6958 3 0.14 0 . 
CO 00 belong / 148 9.31999 1 1489.31999 4 6.01 0 . CO 00 gender | 4 05 . 1378 4 3 1 4 0 5.137843 12 . 52 0 . 0 0 05 gpa I I 4 0 7.0089 1 4 0 7.0089 12.57 0 . 0 0 05 Res idua1 1 6926.99206 2 14 3 2 . 3 6 9 1 2 18 Total 1 9854.022 9 4 217 4 5.4102439 From this analysis we know that a significant relationship exists between drink and gpa when we control for belong and gender. Beyond their F tests for statistical significance, however, ANOVA or ANCOVA ordinarily do not provide much descriptive information about how variables are related. Regression, with its explicit model and parameter estimates, does a better descriptive job. Because ANOVA and ANCOVA amount to special cases of regression, we could restate these analyses in regression form. Stata does so automatically if we add the regress option to anova . For instance, we might want to see regression output in order to understand results from the following ANCOVA. 154 Statistics with Stata anova drink belong gender belong*gender gpa, continuous(gpa) regress '] o v. rce i SS df MS Numbe r of obs 218 F( 4, 213) 22.57 Model | 2933.4 5 823 4 7 33 . 3 64 558 Prob > F = 0.0000 He ^ id jal 1 6920.56 4 7 213 32. 4 909141 R- squared = 0.2977 Ad j R-squared = 0.2845 Total I 9854.022 94 217 45. 4 102 4 3 9 Root MSE = 5.7001 dr i nk C oe f . s t d. Err. t P> 1 t I [95a Conf. Interval] cons 27.4767 6 2 .4 39962 11 .26 0.000 2 2.6672 32 . 28 633 belong 1 6 . 9 2 5384 1 . 28 6774 5 . 38 0.000 4.388 94 2 9 . 4 61 826 2 (dropped) gender 1 -7. . 62 90 57 8 91715 2 -2 . 95 0.004 -4.38 67 74 - .87134 07 2 (dropped} gpa -3.054633 85 934 98 -3 . 55 0.000 -4 . 7 4 8552 -1 . 3 60713 belong * gende r 1 1 - . 8 6561 58 1 . 94 62 1 1 -0.44 0 . 657 -4 . 701916 2.970685 1 2 (dr opped) 2 1 (d ropped) ? 2 (dropped) With the regress option, we get the anova output formatted as a regression table. The top part gives the same overall/7 test andtf2 as a standard ANOVA table. The bottom part describes the following regression: We construct a separate dummy variable {0,1} representing each category of each x variable, except for the highest categories, which are dropped. Interaction terms (if specified in the variable list) are constructed from the products of every possible combination of these dummy variables. Regress on all these dummy variables and interactions, and also on any continuous variables specified in the command line. The previous example therefore corresponds to a regression of drink on four x variables: 1. a dummy coded 1 = fraternity/sorority member, 0 otherwise (highest category of belong, nonmember, gets dropped); 2. a dummy coded 1 = female, 0 otherwise (highest category of gender, male, gets dropped); 3. the continuous variable gpa; 4. an interaction term coded 1 = sorority female, 0 otherwise. Interpret the individual dummy variables' regression coefficients as effects on predicted or mean v. For example, the coefficient on the first category of gender (female) equals -2.629057. This informs us that the mean drinking scale levels for females are about 2.63 points lower than those of males with the same grade point average and membership status. And we know that among students of the same gender and membership status, mean drinking scale values decline by 3.054633 with each one-point increase in grades. Note also that we have confidence intervals and individual / tests for each coefficient; there is much more information in the anova, regress output than in the ANOVA table alone. 
ANOVA and Other Comparison Methods 155 Predicted Values and Error-Bar Charts____ After anova , the followup command predict calculates predicted values, residuals, or standard errors and diagnostic statistics. One application for such statistics is in drawing graphical representations of the model's predictions, in the form of error-bar charts. For a simple illustration, we return to the one-way ANOVA of drink by year: anova drink year Number of obs - source Model year Res idual 24 3 R-squared - 0 . 0 609 Root MSE = 6.55489 Adj R-squared = 0.0491 Partial SS df MS F Prob > F 666.200518 3 222.066839 5.17 0.0018 666.200518 3 222.066839 5.17 0.0018 10269.0176 239 42.9666008 Total I 10935.2181 242 45.1868517 To calculate predicted means from the recent anova, type predict followed by anew variable name: predict drinkmean (option xb assumed; fitted values) label variable drinkmean "Mean drinking scale" With the stdp option, predict calculates standard errors of the predicted means: . predict SEdrink, $tdp Using these new variables, we apply the serrbar command to create an error-bar chart. The scale (2) option tells serrbar to draw error bars of plus and minus two standard errors, from drinkmean - l^SEdrink to drinkmean + l^SEdrink. In a serrbar command, the first-listed variable should be the means ox y variable; the second-listed, the standard error or standard deviation (depending on which you want to show); and the third-listed variable defines the x axis. The plot ( ) option for serrbar can specify a second plot to overlay on the standard-error bars. In Figure 5.3, we overlay a line plot that connects the drinkmean values with solid line segments. 156 Statistics with Stata s6rrbar ^n*...- S.-rin* solid) , legend F Number of obs Root MSE -ed 0.2503 0.22 80 df Model gender 166.482503 94 .3505 972 year I 19.0404045 aender*year I 24.1029759 7 23.7832147 1 94.3505972 3 6.34680149 3 8.03432529 11.21 44.47 2 . 99 3 .79 0 . 0 000 0 .0000 0 .0317 0 .0111 Residual , 4 98.538073 J_ 35 2 - 1214 3861 '"""T;;;rr_665.Ö2Ö577 242 2.74801891 ANOVA and Other Comparison Methods 157 predict aggmean option xb assumed; fitted values) label variable aggmean "Mean aggressive behavior scale predict SEagg, stdp gen agghigh = aggmean + 2 * SEagg gen agglow = aggmean - 2 * SEagg graph twoway connected aggmean year I I reap agghigh agglow year II , byigender, legend(off) note("")) ytitle("Mean aggressive behavior scale") Ü {/> o > Female Male Figure 5.4 3 4 1 Year in college Figure 5.4 built error-bar charts by overlaying two pairs of plots. The first pair are female and male connected-line plots, connecting the group means of aggress (which we calculated using predict, and saved as the variable aggmean). The second pair are female and male capped-spike range plots (twoway reap) in which the vertical spikes connecting variables agghigh (group means of aggress plus two standard errors) and agglow (group means of aggress minus two standard errors). The by (gender) option produced sub-plots for females and males. Notice that to suppress legends and notes in a graph that uses a by ( ) option, legend (off) and note("") must appear as suboptions within by ( ). The resulting error-bar chart (Figure 5.4) shows female means on the aggressive-behavior scale fluctuating at comparatively low levels during the four years of college. Male means are higher throughout, with a sophomore-year peak that resembles the pattern seen earlier for drinking (Figures 5.2 and 5.3). 
Thus, the relationship between aggress and year is different for males and females. This graph helps us to understand and explain the significant interaction effect. predict works the same way with regression analysis ( regress ) as it does with anova because the two share a common mathematical framework. A list of some other 158 Statistics with Stata arc in Chanter 6 and further examples using these options are given predict options appears in Chapter 6 and_ t P assumptions regarding in Chapter 7. The options include residuals that c^e ^ed ;0 J Cook^s D, and error distributions, and also a suite of ^°^.^JZLn, on mod I results. The models. Linear Regression Analysis Stata offers an exceptionally broad range of regression procedures. A partial list of the possibilities can be seen by typing help regress . This chapter introduces regress and related commands that perform simple and multiple ordinary least squares (OLS) regression. One followup command, predict, calculates predicted values, residuals, and diagnostic statistics such as leverage or Cook's D. Another followup command, test , performs tests of user-specified hypotheses, regress can accomplish other analyses including weighted least squares and two-stage least squares. Regression with dummy variables, interaction effects, polynomial terms, and stepwise variable selection are covered briefly in this chapter, along with a first look at residual analysis. The following menus access most of the operations discussed: Statistics - Linear regression and related - Linear regression Statistics - Linear regression and related - Regression diagnostics Statistics - General post-estimation - Obtain predictions, residuals, etc., after estimation Graphics - Overlaid twoway graphs Statistics - Cross-sectional time series Example Commands . regress y x Performs ordinary least squares (OLS) regression of variable y on one predictor, x. . regress y x if ethnic == 3 £ income > 50 Regresses y on x using only that subset of the data for which variable ethnic equals 3 and income is greater than 50. . predict yhat Generates a new variable (here arbitrarily named yhat) equal to the predicted values from the most recent regression. . predict e, resid Generates a new variable (here arbitrarily named e) equal to the residuals from the most recent regression. . graph twoway lfit y x \\ scatter y x Draws the simple regression line ( lfit or linear fit) with a scatterplot ofy vs. x. 160 Statistics with Stata graph twoway mspline yhat x \\ scatter y x Draws a simple regression line with a scatterplot of y vs. x by connecting (with a smooth cubic spline curve) the regression's predicted values (in this example named yhat). Note: There are many alternative ways to draw regression lines or curves in Stata. These alternatives include the twoway graph types mspline (illustrated above), mband. line. If it, If itci , qfit,and qfitci , each of which has its own advantages and options. Usually we combine (overlay) the regression line or curve with a scatterplot. Tf the scatterplot comes second in our graph twoway command, as in the example above, then scatterplot points will print on top of the regression line. Placing the scatterplot first in the command causes the line to print on top of the scatter. Examples throughout this and the following chapters illustrate some of these different possibilities. 
rvfplot Draws a residual versus fitted (predicted values) plot, automatically based on the most recent regression, graph twoway scatter e yhat, yline(O) Draws a residual versus predicted values plot using the variables e and yhat. regress y xl x2 x3 Performs multiple regression ofy on three predictor variables, xl, x2, andxJ. regress y xl x2 x3, robust Calculates robust (Huber/White) estimates of standard errors. See the User's Guide for details. The robust option works with many other model fitting commands as well. regress y xl x2 x3, beta Performs multiple regression and includes standardized regression coefficients ("beta weights") in the output table. correlate xl x2 x3 y Displays a matrix of Pearson correlations, using only observations with no missing values on all of the variables specified. Adding the option covariance produces a variance-covariance matrix instead of correlations. pwcorr xl x2 x3 y, sig Displays a matrix of Pearson correlations, using pairwise deletion of missing values and showing probabilities from t tests of H0:p = 0 on each correlation graph matrix xl x2 x3 y, half Draws a scatterplot matrix. Because their variable lists are the same, this example yields a scatterplot matrix having the same organization as the correlation matrix produced by the preceding pwcorr command. Listing the dependent (y) variable last creates a matrix in which the bottom row forms a series of y-versus-x plots. test xl x2 Performs an Ftest of the null hypothesis that coefficients onxl and x2 both equal zero in the most recent regression model. . xi : regress y xl x2 i.catvar*x2 Performs "expanded interaction" regression of y on predictors xl, x2, a set of dummy variables created automatically to represent categories ofcatvar, and a set of interaction terms equal to those dummy variables times measurement variable x2. help xi gives more details. Linear Regression Analysis 161 sw regress y xl x2 x3, pr(.05) Performs stepwise regression using backward elimination until all remaining predictors are significant at the .05 level. All listed predictors are entered on the first iteration. Thereafter, each iteration drops one predictor with the highest P value, until all predictors remaining have probabilities below the "probability to retain," pr (. 05). Options permit forward or hierarchical selection. Stepwise variants exist for many other model-fitting commands as well; type help sw for a list. regress y xl x2 x3 [aweight = w] Performs weighted least squares (WLS) regression ofy onx/, x2, andxi. Variable w holds the analytical weights, which work as if we had multiplied each variable and the constant by the square root of w, and then performed an ordinary regression. Analytical weights are often employed to correct for heteroskedasticity when they and x variables are means, rates, or proportions, and w is the number of individuals making up each aggregate observation (e.g., city or school) in the data. If they andx variables are individual-level, and the weights indicate numbers of replicated observations, then use frequency weights [fweight = wj instead. See help svy if the weights reflect design factors such as disproportionate sampling. regress yl y2 x (x z) regress y2 yl z (x z) Estimates the reciprocal effects of yl and y2, using instrumental variables x and z. 
The first parts of these commands specify the structural equations: yl = oc(l + a, v2 + a:x + e, y2 = p0 + ply2 + p2vv + e2 The parentheses in the commands enclose variables that are exogenous to all of the structural equations, regress accomplishes two-stage least squares (2SLS) in this example. svy: regress y xl x2 x3 Regressesyonpredictorsx/,x2, andx J, with appropriate adjustments for a complex survey sampling design. We assume that a svyset command has previously been used to set up the data, by specifying the strata, clusters, and sampling probabilities, help svy lists the many procedures available for working with complex survey data, help regress outlines the syntax of this particular command; follow references to the User's Guide and the Survey Data Reference Manual for details, xtreg y xl x2 x3 x4, re Fits a panel (cross-sectional time series) model with random effects by generalized least squares (GLS). An observation in panel data consists of information about unit / at time /, and there are multiple observations (times) for each unit. Before using xtreg , the variable identifying the units was specified by an iis Ci is") command, and the variable identifying time by tis ("/ is"). Once the data have been saved, these definitions are retained for future analysis by xtreg and other xt procedures, help xt lists available panel estimation procedures, help xtreg gives the syntax of this command and references to the printed documentation. If your data include many observations for each unit, a time-series approach could be more appropriate. Stata's time series procedures (introduced in Chapter 13) provide further tools for analyzing panel data. Consult the Longitudinal/Panel Data Reference Manual for a full description. 162 Statistics with Stata Linear Regression Analysis 163 xtmixed population year || city: year Assume that we have yearly data on population, for a number of different cities. The xtmixed population year partspecifiesa"fixed-effecf'model,similartoordinary regression, which describes the average trend in population. The | | city: year part specifies a "random-effects1' model, allowing unique intercepts and slopes (different starting points and growth rates) for each city. . xtmixed SAT grades prepcourse \ \ district: pctcollege f [ region: Fits a hierarchical (nested or multi-level) linear model predicting students's SAT scores as a function of the individual students1 grades and whether they took a preparation course; the percent college graduates among their school district's adults; and region of the country (region affecting v-intercept only). Seethe Longitudinal/Panel Data Reference Manualfor much more about the xtmixed command, which is new with Stata 9. The Regression Table_ File states.dta contains educational data on the U.S. states and District of Columbia: describe state csat expense percent income high college region storage display value variable name type format label variable label state s t r 2 0 -t 2 0 s State csat i nr. ';■ 9 . 0g Mean composite SAT score expense int. i 9 . Og Per pupil expenditures pcinusec percent byte :9.0g % HS graduates taking SAT income long ?10 . Og Median household income high float ?9.0g % adults HS diploma college float t 9 . u g % adults college degree region byte - 9.0g region Geographical region Political leaders occasionally use mean Scholastic Aptitude Test (SAT) scores to make pointed comparisons between the educational systems of different U.S. states. 
For example, some have raised the question of whether SAT scores are higher in states that spend more money on education. We might try to address this question by regressing mean composite SAT scores (csat) on per-pupil expenditures (expense). The appropriate Stata command has the form regress y x, where y is the predicted or dependent variable, and x the predictor or independent variable.

. regress csat expense

      Source |       SS         df       MS             Number of obs =      51
-------------+------------------------------           F(  1,    49) =   13.61
       Model |  48708.3001      1  48708.3001           Prob > F      =  0.0006
    Residual |   175306.21     49  3577.67775           R-squared     =  0.2174
-------------+------------------------------           Adj R-squared =  0.2015
       Total |   224014.51     50   4480.2902           Root MSE      =  59.814

------------------------------------------------------------------------------
        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |  -.0222756
       _cons |   1060.732
------------------------------------------------------------------------------

This regression tells an unexpected story: the more money a state spends on education, the lower its students' mean SAT scores. Any causal interpretation is premature at this point, but the regression table does convey information about the linear statistical relationship between csat and expense. At upper right it gives an overall F test, based on the sums of squares at the upper left. This F test evaluates the null hypothesis that coefficients on all x variables in the model (here there is only one x variable, expense) equal zero. The F statistic, 13.61 with 1 and 49 degrees of freedom, leads easily to rejection of this null hypothesis (P = .0006). Prob > F means "the probability of a greater F statistic" if we drew samples randomly from a population in which the null hypothesis is true.

At upper right, we also see the coefficient of determination, R-squared = .2174. Per-pupil expenditures explain about 22% of the variance in states' mean composite SAT scores. Adjusted R-squared, here .2015, takes into account the complexity of the model relative to the complexity of the data. This adjusted statistic is often more informative for research.

The lower half of the regression table gives the fitted model itself. We find coefficients (slope and y-intercept) in the first column, here yielding the prediction equation

   predicted csat = 1060.732 - .0222756*expense

The second column lists estimated standard errors of the coefficients. These are used to calculate t tests (columns 3-4) and confidence intervals (columns 5-6) for each regression coefficient. The t statistics (coefficients divided by their standard errors) test null hypotheses that the corresponding population coefficients equal zero. At the .05 significance level, we could reject this null hypothesis regarding both the coefficient on expense (P = .001) and the y-intercept (".000", really meaning P < .0005). Stata's modeling commands print 95% confidence intervals routinely, but we can request other levels by specifying the level() option, as shown in the following:

. regress csat expense, level(99)

Because these data do not represent a random sample from some larger population of U.S. states, hypothesis tests and confidence intervals lack their usual meanings. They are discussed in this chapter anyway for purposes of illustration.

The term _cons stands for the regression constant, usually set at one. Stata automatically includes a constant unless we tell it not to. The nocons option causes Stata to suppress the constant, performing regression through the origin. For example,

. regress y x, nocons

. regress y x1 x2 x3, nocons

In certain advanced applications, you might need to specify your own constant. If the "independent variables" include a user-supplied constant (named c, for example), employ the hascons option instead of nocons:

. regress y c x, hascons

Using nocons in this situation would result in a misleading F test and R-squared. Consult the Base Reference Manual or help regress for more about hascons.
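As a sketch of the user-supplied-constant case just described (c is an invented name), the following pair of commands would reproduce an ordinary regression of csat on expense while supplying the constant ourselves:

. generate c = 1
. regress csat c expense, hascons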
If the "independent variables" include a user-supplied constant (named c, for example), employ the hascons option instead of nocons : regress y c x, hascons Using nocons in this situation would result in a misleading ftest and/?2. Consult the Base Reference Manual or help regress for more about hascons . 164 Statistics with Stata Multiple Regression Multiple regression allows us to estimate how expense predicts csat, while adjusting for a number of other possible predictor variables. We can incorporate other predictors of csat simply by listing these variables in the command regress csat expense percent income high college ss df MS Model Residual 184 663.309 3935 1.2012 3 6932 . 6617 874 . 4 7 1 1 37 Total I 224014 . 51 44 80.2 902 Number of obs ^ 51 F( 5, 45) - 42.23 Prob > F = 0.0000 R-squared = 0.8243 Adj R-squared = 0.8048 Root MSE - 29.571 csat 1 Coe f . Std. Err. t P> 1 t 1 [95% Conf. Inte rva1] expense 1 . 003352 8 . 004 4709 0 75 0.457 -.005652 . 0123576 percent | -2 . 618177 .25384 91 -10 31 0.000 -3 . 12 94 55 -2 . 1068 98 income 1 . 000 1056 .0011661 0 09 0.928 - . 00224 3 . 0024542 high 1 1.63084 1 . 9 9224 7 1 64 0.107 - . 3 6764 7 3 . 62 9329 college | 2 . 0308 94 1.660118 1 22 0.228 -1 . 312756 5 . 374 544 cons 1 85 1 . 564 9 59.2 9228 14 36 0.000 732 .14 4 1 970 . 9857 This yields the multiple regression equation predicted csat = 851.56 + .00335 expense - 2.6\8percent + .000]income + l.63high + 2.03college Controlling for four other variables weakens the coefficient on expense from -.0223 to .00335, which is no longer statistically distinguishable fromzero. The unexpected negative relationship between expense and csat found in our earlier simple regression evidently can be explained by other predictors. Only the coefficient on percent (percentage of high school graduates taking the SAT) attains significance at the .05 level. We could interpret this "fourth-order partial regression coefficient'' (so called because its calculation adjusts for four other predictors) as follows. b2 = -2.618: Predicted mean SAT scores decline by 2.618 points, with each one-point increase in the percentage of high school graduates taking the SAT — if expense, income, high, and college do not change. Taken together, the five x variables in this model explain about 80% of the variance in states' mean composite SAT scores (R 2a = .8048). In contrast, our earlier simple regression with expense as the only predictor explained only 20% of the variance in csat. To obtain standardized regression coefficients ("beta weights") with any regression, add the beta option. Standardized coefficients are what we would see in a regression where all the variables had been transformed into standard scores (means 0, standard deviations 1). Linear Regression Analysis 165 regress csat expense percent income high college, beta Source | SS df MS Number of obs = 51 -------------¥------------------------------ F( 5/ 45) = 42.23 Model I 184663.309 5 36932.6617 Prob > F = 0.0000 Residual | 39351.2012 45 874.471137 R-squared = 0.8243 ------------------------------------------- Adj R-squared = 0.8048 Total I 224014.51 50 4480.2902 Root MSE = 29.571 csat I Coef. Std. Err. 
t P>/tI Beta expense | .0 033528 .0 04470 9 0.75 0.457 .070185 percent I -2.6181^7 .2538491 -10.31 0.000 -1.024538 i n come | .0 001056 .0011 6 61 0.09 0.928 .0101321 high I 1.630841 .992247 1.64 0.107 .1361672 college I 2.030894 1.660118 1.22 0.228 .1263952 _cons t 851.5649 59.29228 14.36 0.000 The standardized regression equation is predicted csat* = .07expense* - 1.0245percent* + .OMncome* + A36high* + A26coItege* where csat*, expense*, etc. denote these variables in standard-score form. We might interpret the standardized coefficient on percent, for example, as follows: b 2* = -1.0245: Predicted mean SAT scores decline by 1.0245 standard deviations, with each one-standard-deviation increase in the percentage of high school graduates taking the SAT — if expense, income, high, and college do not change. The F and t tests, R2, and other aspects of the regression remain the same. Predicted Values and Residuals After any regression, the predict command can obtain predicted values, residuals, and other case statistics. Suppose we have just done a regression of composite SAT scores on their strongest single predictor: regress csat percent Now, to create a new variable calledyhat containing predicted y values from this regression, type . predict yhat label variable yhat "Predicted mean SAT score" Through the resid option, we can also create another new variable containing the residuals, here named e: . predict e, resid label variable e "Residual" We might instead have obtained the same predicted y and residuals through two generate commands: . generate yhatO = _b[_cons] + _b[percent]^percent 166 Statistics with Stata Linear Regression Analysis 167 generate eO = csat - yhatO Stata temporarily remembers coefficients and other details from the recent regression. Thus Jb[varname ] refers to the coefficient on independent variable varname. _b[_cons] refers to the coefficient on _cons (usually, the y-intercept). These stored values are useful in programming and some advanced applications, but for most purposes, predict saves us the trouble of generating yhatO and eO "by hand" in this fashion. Residuals contain information about where the model fits poorly, and so are important for diagnostic or troubleshooting analysis. Such analysis might begin just by sorting and examining the residuals. Negative residuals occur when our model overpredicts the observed values. That is, in these states the mean SAT scores are lower than we would expect, based on what percentage of students took the test. To list the states with the five lowest residuals, type sort e list state percent csat yhat in 1/5 I state percent csat yhat e 1 . 1 South Carol i n a 58 832 894 3 3 33 -62 .3 3 3 3 2 . 1 West Virginia 17 926 986 0 953 -60. 0952 6 3 . 1 North Carolina 5 7 844 896 5714 -52 .5714 4 . 1 Texas 44 87 4 925 6666 -51 . 6 6 6 6 6 5 . Nevada 25 919 968 1905 -49. 1 904 9 The four lowest residuals belong to southern states, suggesting that we might be able to improve our model, or better understand variation in mean SAT scores, by somehow taking region into account. Positive residuals occur when actual y values are higher than predicted. Because the data already have been sorted by e, to list the five highest residuals we add the qualifier in -5/1 "■-5" in this qualifier means the 5th-from-last observation, and the letter "el" (note that this is not the number "I") stands for the last observations. The qualifiers in 47/1 or in 47/51 could accomplish the same thing. 
Residuals contain information about where the model fits poorly, and so are important for diagnostic or troubleshooting analysis. Such analysis might begin just by sorting and examining the residuals. Negative residuals occur when our model overpredicts the observed values. That is, in these states the mean SAT scores are lower than we would expect, based on what percentage of students took the test. To list the states with the five lowest residuals, type

. sort e
. list state percent csat yhat e in 1/5

     state            percent   csat       yhat           e
  1. South Carolina        58    832   894.3333   -62.33333
  2. West Virginia         17    926   986.0953   -60.09526
  3. North Carolina        57    844   896.5714    -52.5714
  4. Texas                 44    874   925.6666   -51.66666
  5. Nevada                25    919   968.1905   -49.19049

The four lowest residuals belong to southern states, suggesting that we might be able to improve our model, or better understand variation in mean SAT scores, by somehow taking region into account.
Positive residuals occur when actual y values are higher than predicted. Because the data already have been sorted by e, to list the five highest residuals we add the qualifier in -5/l. "-5" in this qualifier means the 5th-from-last observation, and the letter "l" (note that this is not the number "1") stands for the last observation. The qualifiers in 47/l or in 47/51 could accomplish the same thing.

. list state percent csat yhat e in -5/l

     state            percent   csat       yhat          e
 47. Massachusetts         79    896   847.3333   48.66673
 48. Connecticut           81    897   842.8571   54.14292
 49. North Dakota           6   1073   1010.714   62.28567
 50. New Hampshire         75    921   856.2856   64.71434
 51. Iowa                   5   1093   1012.952   80.04758

predict also derives other statistics from the most recently fitted model. Below are some predict options that can be used after anova or regress.

. predict new                 Predicted values of y; predict new, xb means the
                              same thing (referring to Xb, the vector of
                              predicted y values).
. predict new, cooksd         Cook's D influence measures.
. predict new, covratio       COVRATIO influence measures: effect of each
                              observation on the variance-covariance matrix
                              of estimates.
. predict DFx1, dfbeta(x1)    DFBETAs measuring each observation's influence
                              on the coefficient of predictor x1.
. predict new, dfits          DFITS influence measures.
. predict new, hat            Diagonal elements of hat matrix (leverage).
. predict new, resid          Residuals.
. predict new, rstandard      Standardized residuals.
. predict new, rstudent       Studentized (jackknifed) residuals.
. predict new, stdf           Standard errors of predicted individual y,
                              sometimes called the standard errors of forecast
                              or the standard errors of prediction.
. predict new, stdp           Standard errors of predicted mean y.
. predict new, stdr           Standard errors of residuals.
. predict new, welsch         Welsch's distance influence measures.

Further options obtain predicted probabilities and expected values; type help regress for a list. All predict options create case statistics, which are new variables (like predicted values and residuals) that have a value for each observation in the sample. When using predict, substitute a new variable name of your choosing for new in the commands shown above. For example, to obtain Cook's D influence measures, type

. predict D, cooksd

Or you can find hat matrix diagonals by typing

. predict h, hat

The names of variables created by predict (such as yhat, e, D, h) are arbitrary and are invented by the user. As with other elements of Stata commands, we could abbreviate the options to the minimum number of letters it takes to identify them uniquely. For example,

. predict e, resid

could be shortened to

. pre e, re
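Case statistics combine naturally with if qualifiers for quick outlier screening. As an illustrative sketch (rstu is an arbitrary name), we could flag states whose studentized residuals exceed 2 in absolute value:

. predict rstu, rstudent
. list state csat rstu if abs(rstu) > 2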
Basic Graphs for Regression

This section introduces some elementary graphs you can use to represent a regression model or examine its fit. Chapter 7 describes more specialized graphs that aid post-regression diagnostic work.
In simple regression, predicted values lie on the line defined by the regression equation. By plotting and connecting predicted values, we can make that line visible. The lfit (linear fit) command automatically draws a simple regression line.

. graph twoway lfit csat percent

Ordinarily, it is more interesting to overlay a scatterplot on the regression line, as done in Figure 6.1.

. graph twoway lfit csat percent || scatter csat percent
    || , ytitle("Mean composite SAT score") legend(off)

[Figure 6.1. Scatterplot of csat versus percent with the fitted regression line; y axis "Mean composite SAT score," x axis "% HS graduates taking SAT."]

We could draw the same Figure 6.1 graph "by hand" using the predicted values (yhat) generated after the regression, and a command of the form

. graph twoway mspline yhat percent, bands(50) || scatter csat percent
    || , legend(off) ytitle("Mean composite SAT score")

The second approach is more work, but offers greater flexibility for advanced applications such as conditional effect plots or nonlinear regression. Working directly with the predicted values also keeps the analyst closer to the data, and to what a regression model is doing. graph twoway mspline (a cubic spline curve fit to 50 cross-medians) simply draws a straight line when applied to linear predicted values, but will equally well draw a smooth curve in the case of nonlinear predicted values.
Residual-versus-predicted-values plots provide useful diagnostic tools (Figure 6.2). After any regression analysis (also after some other models, such as ANOVA) we can automatically draw a residual-versus-fitted (predicted values) plot just by typing

. rvfplot, yline(0)

[Figure 6.2. Residuals plotted against fitted values (850 to 1050), with a horizontal line at 0.]

The "by-hand" alternative for drawing Figure 6.2 would be

. graph twoway scatter e yhat, yline(0)

Figure 6.2 reveals that our present model overlooks an obvious pattern in the data. The residuals or prediction errors appear to be mostly positive at first, then mostly negative (where the model's predictions are too high), followed by mostly positive residuals again. Later sections will seek a model that better fits these data.
predict can generate two kinds of standard errors for the predicted y values, which have two different applications. These applications are sometimes distinguished by the names "confidence intervals" and "prediction intervals." A "confidence interval" in this context expresses our uncertainty in estimating the conditional mean of y at a given x value (or a given combination of x values, in multiple regression). Standard errors for this purpose are obtained through

. predict SE, stdp

Select an appropriate t value. With 49 degrees of freedom, for 95% confidence we should use t = 2.01, found by looking up the t distribution or simply by asking Stata:

. display invttail(49, .05/2)
2.0095752

Then the lower confidence limit is approximately

. generate low1 = yhat - 2.01*SE

and the upper confidence limit is

. generate high1 = yhat + 2.01*SE

Confidence bands in simple regression have an hourglass shape, narrowest at the mean of x. We could graph these using an overlaid twoway command such as the following.

. graph twoway mspline low1 percent, clpattern(dash) bands(50)
    || mspline high1 percent, clpattern(dash) bands(50)
    || mspline yhat percent, clpattern(solid) bands(50)
    || scatter csat percent
    || , legend(off) ytitle("Mean composite SAT score")

Shaded-area range plots (see help twoway_rarea) offer a different way to draw such graphs, shading the range between low1 and high1. Alternatively, lfitci can do this automatically, and take care of the confidence-band calculations, as illustrated in Figure 6.3. Note the stdp option, calling for a conditional-mean confidence band (actually, the default).

. graph twoway lfitci csat percent, stdp
    || scatter csat percent, msymbol(O)
    || , ytitle("Mean composite SAT score") legend(off)
      title("Confidence bands for conditional means (stdp)")

[Figure 6.3. Regression line with shaded conditional-mean confidence band, overlaid on the csat-percent scatterplot.]
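A "prediction interval" for individual y values is built the same way, but from the wider stdf standard errors listed earlier among the predict options. A sketch under the same t = 2.01 approximation (SEf, low2, and high2 are arbitrary names):

. predict SEf, stdf
. generate low2 = yhat - 2.01*SEf
. generate high2 = yhat + 2.01*SEf

Because stdf incorporates the residual variance as well as the uncertainty about the conditional mean, these bands are wider than the stdp confidence bands.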
In addition to Pearson correlations, Stata can also calculate several rank-based correlations. These can be employed to measure associations between ordinal variables, or as an outlier-resistant alternative to Pearson correlation for measurement variables.
To obtain the Spearman rank correlation between csat and expense, equivalent to the Pearson correlation if these variables were transformed into ranks, type

. spearman csat expense

 Number of obs =      51
Spearman's rho =  -0.4282

Test of Ho: csat and expense are independent
    Prob > |t| =    0.0017

Kendall's τa (tau-a) and τb (tau-b) rank correlations can be found easily for these data, although with larger datasets their calculation becomes slow:

. ktau csat expense

  Number of obs =       51
Kendall's tau-a =  -0.2925
Kendall's tau-b =  -0.2932
Kendall's score =     -373
    SE of score =  123.095   (corrected for ties)

Test of Ho: csat and expense are independent
     Prob > |z| =  0.0025  (continuity corrected)

For comparison, here is the Pearson correlation with its (unadjusted) P-value:

. pwcorr csat expense, sig

             |     csat  expense
-------------+------------------
        csat |   1.0000
             |
     expense |  -0.4663   1.0000
             |   0.0006

In this example, both spearman (-.4282) and pwcorr (-.4663) yield higher correlations than ktau (-.2925 or -.2932). All three agree that null hypotheses of no association can be rejected.

Hypothesis Tests

Two types of hypothesis tests appear in regress output tables. As with other common hypothesis tests, they begin from the assumption that observations in the sample at hand were drawn randomly and independently from an infinitely large population.
1. Overall F test: The F statistic at the upper right in the regression table evaluates the null hypothesis that in the population, coefficients on all the model's x variables equal zero.
2. Individual t tests: The third and fourth columns of the regression table contain t tests for each individual regression coefficient. These evaluate the null hypotheses that in the population, the coefficient on each particular x variable equals zero.
The t test probabilities are two-sided. For one-sided tests, divide these P-values in half.
In addition to these standard F and t tests, Stata can perform F tests of user-specified hypotheses. The test command refers back to the most recent model-fitting command such as anova or regress. For example, individual t tests from the following regression report that neither the percent of adults with at least high school diplomas (high) nor the percent with college degrees (college) has a statistically significant individual effect on composite SAT scores.

. regress csat expense percent income high college

Conceptually, however, both predictors reflect the level of education attained by a state's population, and for some purposes we might want to test the null hypothesis that both have zero effect. To do this, we begin by repeating the multiple regression quietly, because we do not need to see its full output again. Then use the test command:

. quietly regress csat expense percent income high college
. test high college

 ( 1)  high = 0.0
 ( 2)  college = 0.0

       F(  2,    45) =     3.32
            Prob > F =   0.0451

Unlike the individual null hypotheses, the joint hypothesis that coefficients on high and college both equal zero can reasonably be rejected (P = .0451). Such tests on subsets of coefficients are useful when we have several conceptually related predictors, or when individual coefficient estimates appear unreliable due to multicollinearity (Chapter 7).
test could duplicate the overall F test:

. test expense percent income high college

test could also duplicate the individual-coefficient tests:

. test expense
. test percent
. test income

and so forth. Applications of test more useful in advanced work include the following:
1. Test whether a coefficient equals a specified constant. For example, to test the null hypothesis that the coefficient on income equals 1 (H0: β3 = 1), instead of the usual null hypothesis that it equals 0 (H0: β3 = 0), type

. test income = 1

2. Test whether two coefficients are equal. For example, the following command evaluates the null hypothesis H0: β4 = β5:

. test high = college

3. Finally, test understands some algebraic expressions. We could request something like the following, which would test H0: β3 = (β4 + β5)/100:

. test income = (high + college)/100

Consult help test for more information and examples.
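test can also evaluate several such constraints at once. The following sketch, for example, jointly tests the equality of the high and college coefficients together with a zero constraint on income; each parenthesized equation contributes one constraint to the joint F test:

. quietly regress csat expense percent income high college
. test (high = college) (income = 0)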
Dummy Variables

Categorical variables can become predictors in a regression when they are expressed as one or more {0,1} dichotomies called "dummy variables." For example, we have reason to suspect that regional differences exist in states' mean SAT scores. The tabulate command will generate one dummy variable for each category of the tabulated variable if we add a gen (generate) option. Below, we create four dummy variables from the four-category variable region. The dummies are named reg1, reg2, reg3, and reg4. reg1 equals 1 for Western states and 0 for others; reg2 equals 1 for Northeastern states and 0 for others; and so forth.

. tabulate region, gen(reg)

Geographical |
      region |      Freq.     Percent        Cum.
-------------+-----------------------------------
        West |         13       26.00       26.00
     N. East |          9       18.00       44.00
       South |         16       32.00       76.00
     Midwest |         12       24.00      100.00
-------------+-----------------------------------
       Total |         50      100.00

. tabulate reg1

region==West |      Freq.     Percent        Cum.
-------------+-----------------------------------
           0 |         37       74.00       74.00
           1 |         13       26.00      100.00
-------------+-----------------------------------
       Total |         50      100.00

. tabulate reg2

  region==N. |
        East |      Freq.     Percent        Cum.
-------------+-----------------------------------
           0 |         41       82.00       82.00
           1 |          9       18.00      100.00
-------------+-----------------------------------
       Total |         50      100.00

. describe reg1-reg4

variable name   storage type   display format   value label   variable label
reg1            byte           %8.0g                          region==West
reg2            byte           %8.0g                          region==N. East
reg3            byte           %8.0g                          region==South
reg4            byte           %8.0g                          region==Midwest

Regressing csat on reg2, a {0,1} dichotomy, is equivalent to performing a two-sample t test of whether mean csat is the same across categories of reg2. That is, is the mean csat the same in the Northeast as in other U.S. states?

. regress csat reg2

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  1,    48) =    9.50
       Model |  35191.4017     1  35191.4017           Prob > F      =  0.0034
    Residual |  177769.978    48  3703.54121           R-squared     =  0.1652
-------------+------------------------------           Adj R-squared =  0.1479
       Total |   212961.38    49  4346.15061           Root MSE      =  60.857

        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        reg2 |   -69.0542   22.40167    -3.08   0.003    -114.0958   -24.01262
       _cons |   958.6098   9.504224   100.86   0.000     939.5003    977.7192

The dummy variable coefficient's t statistic (t = -3.08, P = .003) indicates that mean csat is significantly lower (b = -69.0542, about 69 points lower) among Northeastern states. We get exactly the same test result from a two-sample t test:

. ttest csat, by(reg2)

Two-sample t test with equal variances

 Group |  Obs       Mean   Std. Err.   Std. Dev.   [95% Conf. Interval]
-------+----------------------------------------------------------------
     0 |   41   958.6098
     1 |    9   889.5556    4.652077    13.95623    878.8278    900.2833
-------+----------------------------------------------------------------
  diff |         69.0542    22.40167                24.01262    114.0958

Degrees of freedom: 48

              Ho: mean(0) - mean(1) = diff = 0

     Ha: diff < 0            Ha: diff != 0           Ha: diff > 0
       t =  3.0825             t =  3.0825             t =  3.0825
   P < t =  0.9983         P > |t| =  0.0034        P > t =  0.0017
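tabulate, gen() is convenient, but a logical expression inside generate produces the same {0,1} coding directly. A sketch, assuming (as the region tabulation above suggests) that N. East is the second numeric code of region, with reg2b an arbitrary name:

. generate byte reg2b = (region == 2)    // assumes N. East carries numeric code 2
. tabulate reg2b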
Mean SAT scores might be lower in the Northeast, however, simply because a much larger share of Northeastern students take the test. We can check this with a multiple regression of csat on both reg2 and percent:

. regress csat reg2 percent

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  2,    47) =  107.18
       Model |  174664.983     2  87332.4916           Prob > F      =  0.0000
    Residual |  38296.3969    47  814.816955           R-squared     =  0.8202
-------------+------------------------------           Adj R-squared =  0.8125
       Total |   212961.38    49  4346.15061           Root MSE      =  28.545

        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        reg2 |   57.52437   14.28326     4.03   0.000     28.79015    86.25859
     percent |  -2.793009   .2134796   -13.08   0.000    -3.222475   -2.363544
       _cons |   1033.749     7.2701   142.19   0.000     1019.123    1048.374

The Northeastern region variable reg2 now has a statistically significant positive coefficient (b = 57.52437, P < .0005). The earlier negative relationship was misleading. Although mean SAT scores among Northeastern states really are lower, they are lower because higher percentages of students take this test in the Northeast. A smaller, more "elite" group of students, often less than 20% of high school seniors, take the SAT in many of the non-Northeast states. In all Northeastern states, however, large majorities (64% to 81%) do so. Once we adjust for differences in the percentages taking the test, SAT scores actually tend to be higher in the Northeast.
To understand dummy variable regression results, it can help to write out the regression equation, substituting zeroes and ones. For Northeastern states, the equation is approximately

    predicted csat = 1033.7 + 57.5 reg2 - 2.8 percent
                   = 1033.7 + 57.5 x 1 - 2.8 percent
                   = 1091.2 - 2.8 percent

For other states, the predicted csat is 57.5 points lower at any given level of percent:

    predicted csat = 1033.7 + 57.5 x 0 - 2.8 percent
                   = 1033.7 - 2.8 percent

Dummy variables in models such as this are termed "intercept dummy variables," because they describe a shift in the y-intercept or constant.
From a categorical variable with k categories we can define k dummy variables, but one of these will be redundant. Once we know a state's values on the West, Northeast, and Midwest dummy variables, for example, we can already guess its value on the South variable. For this reason, no more than k - 1 of the dummy variables (three, in the case of region) can be included in a regression. If we try to include all the possible dummies, Stata will automatically drop one because multicollinearity otherwise makes the calculation impossible.

. regress csat reg1 reg2 reg3 reg4 percent

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  4,    45) =   64.61
       Model |  181378.099     4  45344.5247           Prob > F      =  0.0000
    Residual |  31583.2811    45  701.850691           R-squared     =  0.8517
-------------+------------------------------           Adj R-squared =  0.8385
       Total |   212961.38    49  4346.15061           Root MSE      =  26.492

        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        reg1 |  -23.77315   11.12578    -2.14   0.038    -46.18162   -1.364676
        reg2 |   25.79985   16.96365     1.52   0.135    -8.366693    59.96639
        reg3 |  -33.29951   10.85443    -3.07   0.004    -55.16146   -11.43757
        reg4 |  (dropped)
     percent |  -2.546058   .2140196   -11.90   0.000    -2.977116   -2.115001
       _cons |   1047.638   8.273625   126.62   0.000     1030.974    1064.302

The model's fit, including R², F tests, predictions, and residuals, remains essentially the same regardless of which dummy variable we (or Stata) choose to omit. Interpretation of the coefficients, however, occurs with reference to that omitted category. In this example, the Midwest dummy variable (reg4) was omitted. The regression coefficients on reg1, reg2, and reg3 tell us that, at any given level of percent, the predicted mean SAT scores are approximately as follows: 23.8 points lower in the West (reg1 = 1) than in the Midwest; 25.8 points higher in the Northeast (reg2 = 1) than in the Midwest; and 33.3 points lower in the South (reg3 = 1) than in the Midwest. The West and South both differ significantly from the Midwest in this respect, but the Northeast does not.
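The claim that fit does not depend on which dummy is omitted is easy to verify by refitting with a different category dropped; a sketch:

. regress csat reg2 reg3 reg4 percent

R-squared, the overall F test, and the predicted values should match the table above; only the reference group for the coefficients changes (here it becomes the West).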
Instead, it "absorbs" the effect of a ^-category variable such as region. The model's fit, F test on the absorbed variable, and other key aspects of the results are the same as those we could obtain through explicit dummy variables. Note that areg does not provide estimates of the coefficients on individual dummy variables, however. 180 Statistics with Stata areg csat percent, absorb(region) Number of obs = 50 F( 1, 45) = 141.52 Prob > F = 0.0000 R-squared - 0 . 8517 Adj R-squared - 0.8385 Root MSE = 26.492 csat I Coef. Std.Err. t P>It| [95% Conf. Interval] percent I -2.546058 .2140196 -11.90 0.000 -2.977116 -2.115001 _cons | 1035.445 8.38689 123.46 0.000 1018.55 3 1052.3 37 region | F (3, 45) - 9.4 65 0.000 (4 categories) Although its output is less informative than regression with explicit dummy variables, areg does have two advantages. It speeds up exploratory work, providing quick feedback about whether a dummy variable approach is worthwhile. Secondly, when the variable of interest has many values, creating dummies for each of them could lead to too many variables or too large a model for our particular Stata configuration, areg thus works around the usual limitations on dataset and matrix size. Explicit dummy variables have other advantages, however, including ways to model interaction effects. Interaction terms called ""slope dummy variables" can be formed by multiplying a dummy times a measurement variable. For example, to model an interaction between Northeast/other region and percent, we create a slope dummy variable called reg2perc. . generate reg2perc = reg2 * percent (1 missing value generated) The new variable, reg2perc7 equals percent for Northeastern states and zero forall other states. We can include this interaction term among the regression predictors: regress csat reg2 percent reg2perc Source I SS df MS Number of obs - 50 -------------+------------------------------ F ( 3 , 46) = 82.27 Model ( 179506.19 3 59835.3968 Prob > F - 0.0000 Residual | 33455.1897 46 727.286733 R-squared - 0.8429 -------------+------------------------------ Adj R-squared - 0.8327 Total | 212961.38 49 4346.15061 Root MSE = 26.968 csat I Coef. Std. Err. t P>It | [95% Conf. Interval] reg2 I -2 41.3574 116. 62 78 -2. 07 0.044 -476.117 -6.597821 percent I -2.858829 .2032947 -14.06 0.000 -3.26804 -2.449618 reg2perc | 4.179666 1.620009 2.58 0.013 .9187559 7.440576 cons | 1035.519 6.902898 150.01 0.000 1021.624 1049.414 Linear Regression Analysis 181 predicted csat = 1035.5 - 24\Areg2 - 2.9percent + 4.2reg2perc = 1035.5 -241.4 x 1 - 2.9percent + 4.2 x 1 xpercent - 794.1 + \3percent For other states it is predicted csat = 1035.5 - 241.4 x 0 - 2.9percent + 4.2 x o x percent = 1035.5 - 2. ^percent An interaction implies that the effect of one variable changes, depending on the values of some other variable. From this regression, it appears that percent has a relatively weak and positive effect among Northeastern states, whereas its effect is stronger and negative among the rest. To visualize the results from a slope-and-intercept dummy variable regression, we have several graphing possibilities. Without even fitting the model, we could ask If it to do the work as follows, with the results seen in Figure 6.6. . 
To visualize the results from a slope-and-intercept dummy variable regression, we have several graphing possibilities. Without even fitting the model, we could ask lfit to do the work as follows, with the results seen in Figure 6.6.

. label define reg2 0 "other regions" 1 "Northeast"
. label values reg2 reg2
. graph twoway lfit csat percent || scatter csat percent
    || , by(reg2, legend(off) note("")) ytitle("Mean composite SAT score")

[Figure 6.6. Two panels, "other regions" and "Northeast": csat-percent scatterplots with separate linear fits; x axis "% HS graduates taking SAT."]

Alternatively, we could fit the regression model, calculate predicted values, and use those to make a more refined plot such as Figure 6.7. The bands(50) options with both mspline commands specify median splines based on 50 vertical bands, which is more than enough to cover the range of the data.

. quietly regress csat reg2 percent reg2perc
. predict yhat1
. graph twoway scatter csat percent if reg2 == 0
    || mspline yhat1 percent if reg2 == 0, bands(50)
    || scatter csat percent if reg2 == 1
    || mspline yhat1 percent if reg2 == 1, bands(50)
    || , legend(off) ytitle("Mean composite SAT score")

The final options supply a y-axis title and suppress the legend; in their default form, both would be crowded and unclear.

[Figure 6.7. Scatterplot with separate median-spline fit lines for Northeastern and other states.]

Figures 6.6 and 6.7 both show the striking difference, captured by our interaction effect, between Northeastern and other states. This raises the question of what other regional differences exist. Figure 6.8 explores this question by drawing a csat-percent scatterplot with a different marker symbol for each of the four regions. In this plot, the Midwestern states, with one exception (Indiana), seem to have their own steeply negative regional pattern at the left side of the graph. Southern states are the most heterogeneous group.

. graph twoway scatter csat percent if reg1 == 1
    || scatter csat percent if reg2 == 1, msymbol(Sh)
    || scatter csat percent if reg3 == 1, msymbol(T)
    || scatter csat percent if reg4 == 1

[Figure 6.8. csat versus percent with separate symbols for West, Northeast, South, and Midwest states; x axis "% HS graduates taking SAT."]
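Since region itself carries category labels, a small-multiples display offers one more view of the same question; a sketch:

. graph twoway scatter csat percent, by(region)

This draws one csat-percent panel per region, making each region's own pattern easy to compare.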
The xi (expand interactions) command simplifies the jobs of expanding multiple-category variables into sets of dummy and interaction variables, and including these as predictors in regression or other models. For example, in dataset student2.dta (introduced in Chapter 5) there is a four-category variable year, representing a student's year in college (freshman, sophomore, etc.). We could automatically create a set of three dummy variables by typing

. xi, prefix(ind) i.year

The three new dummy variables will be named indyear_2, indyear_3, and indyear_4. The prefix() option specified the prefix used in naming the new dummy variables. If we typed simply

. xi i.year

giving no prefix() option, the names _Iyear_2, _Iyear_3, and _Iyear_4 would be assigned (and any previously calculated variables with those names would be overwritten by the new variables). Typing

. drop _I*

employs the wildcard notation to drop all variables that have names beginning with _I.
By default, xi omits the lowest value of the categorical variable when creating dummies, but this can be controlled. Typing the command

. char _dta[omit] prevalent

will cause subsequent xi commands to automatically omit the most prevalent category (note the use of square brackets). char _dta[ ] preferences are saved with the data; to restore the default, type

. char _dta[omit]

Typing

. char year[omit] 3

would omit year 3. To restore the default, type

. char year[omit]

xi can also create interaction terms involving two categorical variables, or one categorical and one measurement variable. For example, we could create a set of interaction terms for year and gender by typing

. xi i.year*i.gender

From the four categories of year and the two categories of gender, this xi command creates seven new variables: four dummy variables and three interactions. Because their names all begin with _I, we can use the wildcard notation _I* to describe these variables:

. describe _I*

variable name   storage type   display format   value label   variable label
_Iyear_2        byte           %8.0g                          year==2
_Iyear_3        byte           %8.0g                          year==3
_Iyear_4        byte           %8.0g                          year==4
_Igender_1      byte           %8.0g                          gender==1
_IyeaXgen_2_1   byte           %8.0g                          year==2 & gender==1
_IyeaXgen_3_1   byte           %8.0g                          year==3 & gender==1
_IyeaXgen_4_1   byte           %8.0g                          year==4 & gender==1

To create interaction terms for categorical variable year and measurement variable drink (a 33-point drinking behavior scale), type

. xi i.year*drink

Six new variables result: three dummy variables for year, and three interaction terms representing each of the year dummies times drink. For example, for a sophomore student _Iyear_2 = 1 and _IyeaXdrink_2 = 1 x drink = drink. For a junior student, _Iyear_2 = 0 and _IyeaXdrink_2 = 0 x drink = 0; also _Iyear_3 = 1 and _IyeaXdrink_3 = 1 x drink = drink, and so forth.

. describe _Iyea*

variable name   storage type   display format   value label   variable label
_Iyear_2        byte           %8.0g                          year==2
_Iyear_3        byte           %8.0g                          year==3
_Iyear_4        byte           %8.0g                          year==4
_IyeaXdrink_2   float          %9.0g                          (year==2)*drink
_IyeaXdrink_3   float          %9.0g                          (year==3)*drink
_IyeaXdrink_4   float          %9.0g                          (year==4)*drink

The real convenience of xi comes from its ability to generate dummy variables and interactions automatically within a regression or other model-fitting command. For example, to regress variable gpa (student's college grade point average) on drink and a set of dummy variables for year, simply type

. xi: regress gpa drink i.year

This command automatically creates the necessary dummy variables, following the same rules described above.
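The prefixed command is equivalent to creating the dummies first and then listing them explicitly; a two-step sketch:

. xi i.year
. regress gpa drink _Iyear_2 _Iyear_3 _Iyear_4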
Similarly, to regress gpa on drink, year, and the interaction of drink and year, type

. xi: regress gpa drink i.year*drink

i.year            _Iyear_1-4          (naturally coded; _Iyear_1 omitted)
i.year*drink      _IyeaXdrink_#       (coded as above)

      Source |       SS       df       MS              Number of obs =     218
-------------+------------------------------           F(  7,   210) =    3.75
       Model |  5.08865901     7  .726951288           Prob > F      =  0.0007
    Residual |  40.6630801   210  .193633715           R-squared     =  0.1112
-------------+------------------------------           Adj R-squared =  0.0816
       Total |  45.7517391   217  .210837507           Root MSE      =  .44004

         gpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       drink |  -.0285369   .0140402    -2.03   0.043    -.0562146   -.0008591
    _Iyear_2 |  -.5889268    .314782    -1.86   0.065    -1.204464    .0366107
    _Iyear_3 |  -.2859424   .3044178    -0.94   0.349    -.8860487    .3141639
    _Iyear_4 |  -.2203783   .2939595    -0.75   0.454     -.799868    .3591114
       drink |  (dropped)
_IyeaXdri~2 |   .0199977   .0164436     1.22   0.225    -.0124179    .0524133
_IyeaXdri~3 |   .0108977    .016348     0.67   0.506    -.0213297     .043125
_IyeaXdri~4 |   .0104239    .016369     0.64   0.525    -.0218446    .0426925
       _cons |   3.432132   .2523984    13.60   0.000     2.934572    3.929691

The xi: command can be applied in the same way before many other model-fitting procedures such as logistic (Chapter 10). In general, it allows us to include predictor (right-hand-side) variables such as the following, without first creating the actual dummy variable or interaction terms.

i.catvar              Creates j-1 dummy variables representing the j
                      categories of catvar.
i.catvar1*i.catvar2   Creates j-1 dummy variables representing the j
                      categories of catvar1; k-1 dummy variables from the k
                      categories of catvar2; and (j-1)(k-1) interaction
                      variables (dummy x dummy).
i.catvar*measvar      Creates j-1 dummy variables representing the j
                      categories of catvar, and j-1 variables representing
                      interactions with the measurement variable
                      (dummy x measvar).

After any xi command, the new variables remain in the dataset.

Stepwise Regression

With the regional dummy variable terms we added earlier to the state-level data in states.dta, we have many possible predictors of csat. This results in an overly complicated model, with several coefficients statistically indistinguishable from zero.

. regress csat expense percent income college high reg1 reg2 reg2perc reg3

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  9,    40) =   49.51
       Model |  195420.517     9  21713.3908           Prob > F      =  0.0000
    Residual |   17540.863    40  438.521576           R-squared     =  0.9176
-------------+------------------------------           Adj R-squared =  0.8991
       Total |   212961.38    49  4346.15061           Root MSE      =  20.941

        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     expense |  -.0022508   .0041333    -0.54   0.589    -.0106045     .006103
     percent |   -2.93786   .2302596   -12.76   0.000    -3.403232   -2.472488
      income |  -.0004919   .0010255    -0.48   0.634    -.0025645    .0015806
     college |   3.900087   1.719409     2.27   0.029     .4250318    7.375142
        high |   2.175542   1.171767     1.86   0.071     -.192688    4.543771
        reg1 |  -33.78456   9.302983    -3.63   0.001    -52.58659   -14.98253
        reg2 |  -143.5149   101.1244    -1.42   0.164    -347.8949    60.86509
    reg2perc |   2.506616   1.404483     1.78   0.082    -.3319506    5.345183
        reg3 |  -8.799205   12.54658    -0.70   0.487    -34.15679    16.55838
       _cons |   839.2209   76.35942    10.99   0.000     684.8927     993.549

We might now try to simplify this model, dropping first that predictor with the highest t probability (income, P = .634), then refitting the model and deciding whether to drop something further. Through this process of backward elimination, we seek a more parsimonious model: one that is simpler but fits almost equally well. Ideally, this strategy is pursued with attention both to the statistical results and to the substantive or theoretical implications of keeping or discarding certain variables.
For analysts in a hurry, stepwise methods provide ways to automate the process of model selection. They work either by subtracting predictors from a complicated model, or by adding predictors to a simpler one, according to some pre-set statistical criteria. Stepwise methods cannot consider the substantive or theoretical implications of their choices, nor can they do much troubleshooting to evaluate possible weaknesses in the models produced at each step. Despite their drawbacks, stepwise methods meet certain practical needs and have been widely used.
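A single manual step of this backward elimination might look like the following sketch, refitting without income and then checking which remaining predictor has the highest P-value:

. regress csat expense percent college high reg1 reg2 reg2perc reg3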
For automatic backward elimination, we issue a sw regress command that includes all of our possible predictor variables, and a maximum P value required to retain them. Setting the P-to-retain criterion as pr(.05) ensures that only predictors having coefficients that are significantly different from zero at the .05 level will be kept in the model.

. sw regress csat expense percent income college high reg1 reg2 reg2perc reg3,
    pr(.05)

                      begin with full model
p = 0.6341 >= 0.0500  removing income
p = 0.5273 >= 0.0500  removing reg3
p = 0.4215 >= 0.0500  removing expense
p = 0.2107 >= 0.0500  removing reg2

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  5,    44) =   91.01
       Model |  194185.761     5  38837.1521           Prob > F      =  0.0000
    Residual |  18775.6194    44  426.718624           R-squared     =  0.9118
-------------+------------------------------           Adj R-squared =  0.9018
       Total |   212961.38    49  4346.15061           Root MSE      =  20.657

        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        reg1 |  -30.59218   8.479395    -3.61   0.001    -47.68128   -13.50309
     percent |  -3.119155   .1804553   -17.28   0.000    -3.482839   -2.755471
    reg2perc |   .5833272   .1545969     3.77   0.000     .2717577    .8948967
     college |   3.995495   1.359331     2.94   0.005     1.255944    6.735046
        high |   2.231294   .8178968     2.73   0.009     .5829313    3.879657
       _cons |    806.672   49.98744    16.14   0.000     705.9289    907.4151

sw regress dropped first income, then reg3, expense, and finally reg2 before settling on the final model. Although it has four fewer coefficients, this final model has almost the same R² (.9118 versus .9176) and a higher adjusted R² (.9018 versus .8991) compared with the earlier version.
If, instead of a P-to-retain, pr(.05), we specify a P-to-enter value such as pe(.05), then sw regress performs forward inclusion (starting with an "empty" or constant-only model) instead of backward elimination. Other stepwise options include hierarchical selection and locking certain predictors into the model. For example, the following command specifies that the first term (x1) should be locked into the model and not subject to possible removal:

. sw regress y x1 x2 x3, pr(.05) lockterm1

The following command calls for forward inclusion of any predictors found significant at the .10 level, but with variables x4, x5, and x6 treated as one unit, either entered or left out together:

. sw regress y x1 x2 x3 (x4 x5 x6), pe(.10)

The following command invokes hierarchical backward elimination with a P = .20 criterion:

. sw regress y x1 x2 x3 (x4 x5 x6) x7, pr(.20) hier

The hier option specifies that the terms are ordered: consider dropping the last term (x7) first, and stop if it is not dropped. If x7 is dropped, next consider the second-to-last term (x4 x5 x6), and so forth.
Many other Stata commands besides regress also have stepwise variants that work in a similar manner. Available stepwise procedures include the following:

sw clogit     Conditional (fixed-effects) logistic regression
sw cloglog    Maximum likelihood complementary log-log estimation
sw cnreg      Censored normal regression
sw glm        Generalized linear models
sw logistic   Logistic regression (odds)
sw logit      Logistic regression (coefficients)
sw nbreg      Negative binomial regression
sw ologit     Ordered logistic regression
sw oprobit    Ordered probit regression
sw poisson    Poisson regression
sw probit     Probit regression
sw qreg       Quantile regression
sw regress    OLS regression
sw stcox      Cox proportional hazard model regression
sw streg      Parametric survival-time model regression
sw tobit      Tobit regression

Type help sw for details about the stepwise options and logic.

Polynomial Regression

Earlier in this chapter, Figures 6.1 and 6.2 revealed an apparently curvilinear relationship between mean composite SAT scores (csat) and the percentage of high school seniors taking the test (percent). Figure 6.6 illustrated one way to model the upturn in SAT scores at high percent values: as a phenomenon peculiar to the Northeastern states. That interaction model fit reasonably well (adjusted R² = .8327). But Figure 6.9, a residuals-versus-predicted-values plot for the interaction model, still exhibits signs of trouble. Residuals appear to trend upwards at both high and low predicted values.

. quietly regress csat reg2 percent reg2perc
. rvfplot, yline(0)

[Figure 6.9. Residuals versus fitted values (850 to 1050) for the interaction model.]
Figure 6.6 illustrated one way to model the upturn in SAT scores at high percent values: as a phenomenon peculiar to the Northeastern states. That interaction model fit reasonably well (R2a = .8327). But Figure 6.9 (next page), a residuals versus predicted values plot for the interaction model, still exhibits signs of trouble. Residuals appear to trend upwards at both high and low predicted values. quietly regress csat reg2 percent reg2perc rvfplot, yline(O) Linear Regression Analysis 189 Figure 6.9 0 B * • # * t * i 850 900 950 Fitted values 1000 1050 Chapter 8 presents a variety of techniques for curvilinear and nonlinear regression. "Curvilinear regression*" here refers to intrinsically linear OLS regressions (for example, regress ) that include nonlinear transformations of the originály or x variables. Although curvilinear regression fits a curved model with respect to the original data, this model remains linear in the transformed variables. (Nonlinear regression, also discussed in Chapter 8, applies non-OLS methods to fit models that cannot be linearized through transformation.) One simple type of curvilinear regression, called polynomial regression, often succeeds in fitting U or inverted-U shaped curves. It includes as predictors both an independent variable and its square (andpossibly higher powers if necessary). Because the csctt-percent relationship appears somewhat U-shaped, we generate a new variable equal to percent squared, then include percent and percent2 as predictors of csat. Figure 6.10 graphs the resulting curve. . generate percent2 = percentA2 . regress csat percent percent2 df MS Model I 19 3 721.829 Residual | 302 92 . 68 06 2 96860.9146 48 631.097513 Total 224Ü 14.51 50 4480.2902 Number of obs ~ F( 2, 48) = Prob > F R-squared Adj R-squared = Root MSE 51 153.48 0.0000 0.8643 0.8591 25.122 csat ( Coe f . S td. Err. t P> 1 t 1 [95-i Conf. percent ( percen t2 1 _cons ) -6-11199 3 .0495819 1065.92 1 . 67 1 540 6 .0084179 9.285379 -9 . 10 5 . 89 114.80 0.000 0 .000 0 . 0 o o -7 . 4 6221 6 .0 326566 104 7.252 -4.76177 . 0 665072 1084.591 Linear Regression Analysis 191 190 Statistics with S tata predict yhat2 (option xb ass umed; fitted values) r>£>rcent, bands(50) ■raph twoway mspline yhat2 percent, I, scatter c«t parcMt------ ™osite SAT score") legend(off) ytitle("Mean compo. Figure 6.10 §4 o o o CO 0) m o Q. E 88 S05 chi2 7 0 5 . 54 0 . oooc tcrime 1 Coef. Std. Err. Z P> 1 2 1 [95;, C o n f . In terva1] t", * ' .,' unemp ( .1645266 .0381813 pop I .0558997 .0073437 cons [ -.7264381 .30152? 4 7 -2 31 0.000 6 1 0.000 4 1 0.016 .0896925 . 0 4 1 50 62 -1.31741 . 2 393 607 . 0702 931 -.1354659 sigma u 1 .34 4 584 37 sigma e I .42064667 rho | .40157462 (fraction of variance due to u i) The xtreg output table contains regression coefficients, standard errors, / tests, and confidence intervals that resemble those of an OLS regression. In this example we see that the coefficient on unemp (. ] 645) is positive and statistically significant. The predicted number of crimes increases by .1645 for each additional person unemployed, if population is held constant. Holding unemployment constant, predicted crimes increase by 5.59 with each 100-person increase in population. Echoing the individual-coefficient z tests, the Wald chi-square test at upper right (x2 = 705.54, df=2,P< .00005) allows us to reject the joint null hypothesis that the coefficients on unemp and pop are both zero. This output table gives further information related to the two error terms. 
At lower left in the table we find standard deviation of the common residuals u, standard deviation of the unique residuals F - Q.GQ00 Residual I 16 7 89.4069 47 357.221424 R-squared = 0.92 51 -------------+-------------------------.----- Adj R-squared = 0.9203 Total ! 224014.51 50 4480.2902 Root MSE - 18.90 csat | Coef. Std, Eir. t P>1t i [ 9 5% Co.nf . Interval] percent I -6.520312 .5095805 -12.80 0.000 -7.545455 -5.495168 percent2 | .0536555 .0063678 8.43 0.000 .0408452 .0664658 high | 2.986509 .4857502 6.15 0.000 2-009305 3.963712 _cons I 844.8207 36.63387 23.06 0.000 771.1228 918.5185 The regression equation is predicted csat = 844.82 - 6.52percent + .Q5percent2 + 2.99higk Regression Diagnostics 199 The scatterplot matrix in Figure 7.1 depicts interrelations among these four variables. As noted in Chapter 6, the squared term percent2 allows our regression model to fit the visibly curvilinear relationship between csat and percent. . graph matrix percent percent^ high csat, half msymbol(+) %HS graduates taking SAT 6000 J 4000 j 2000 -) 0 90 80 70 Figure 7.1 1000-j 900 800 percent2 10(D 2000 4000 600060 70 80 90 % adults HS diploma Mean composite SAT score Several post-regression hypothesis tests perform checks on the model specification. The omitted-variables test ovtest essentially regressesy on the x variables, and also the second, third, and fourth powers of predictedy (after standardizing y to have mean 0 and variance I), It then performs an F test of the null hypothesis that all three coefficients on those powers of y equal zero. If we reject this null hypothesis, further polynomial terms would improve the model. With the csat regression, we need not reject the null hypothesis, ovtest Ramsey RESET test using powers of the fitted value Ho: model has no omitted variables F(3 s of csat 4 4 ) Prob > F 1.48 0.2319 A heteroskedasticny test, bettest, tests the assumption of constant error variance by SÄ"?1 «andaniized res.duals are linearly related to % (see Cook and We.sberg 1994 for d,scuss,on and example). Results from the csat regression suggest that in this mstance we should reject the null hypothesis of constant variance ^ . hettest Cook-Weisberg test for heteroskedasticity using fitted valu Ho: Constant variance es of csat chi2(l) Prob > chi2 4 .86 0 . 0274 200 Statistics with Stata SKÄÄt-Ä--- DiagnosticJMots Chapter 6 demonstrated how predict can create new variables holding residual and predicted values after a regress command. To obtain these values from our regression of csat on percent, percent!, and high, we type the two commands: predict yhat3 . predict e3, resid The new variables named e3 (residues) andyhat3 (predicted values) could be displayed in a residual-versus-predicted graph by typing graph twoway scatter e3 yhat, yline(O) . The rvfplot (residual-versus-fitted) command obtains such graphs in a single step. The version in Figure 7.2 includes a horizontal line at 0 (the residual mean), which helps in reading such plots. rvfplot, yline(0) Figure 7.2 • o _ • • • • luals y o -cr: ——-■ * # 850 900 950 1000 Fitted values 1050 Figure 1.2 shows residuals symmetrically distributed around 0 (symmetry is consistent with the normal-errors assumption), and with no evidence of outliers or curvilinearity. The dispersion of the residuals appears somewhat greater for above-average predicted values of v, however, which is why hettest earlier reeded the constant-variance hypothesis. Residual-versus-fitted plots provide a one-graph overview of the regression residuals. 
For more detailed study, we can plot residuals against each predictor variable separately through a series of "residual-versus-predictor" commands. To graph the residuals against predictor/z/gA (not shown), type Regression Diagnostics 201 rvpplot high The one-variable graphs described in Chapter 3 can also be employed for residual analysis. For example, we could use box plots to check the residuals for outliers or skew, or quantile-normal plots to evaluate the assumption of normal errors. Added-variable plots are valuable diagnostic tools, known by different names including partial-regression leverage plots, adjusted partial residual plots, oradjusted variable plots. They depict the relationship between y and one x variable, adjusting for the effects of other x variables. If we regressed yon x2 andxJ, and likewise regressed*/ onx2 andxJ, then took the residuals from each regression and graphed these residuals in a scatterplot, we would obtain an added-variable plot for the relationship betweenj; andx/, adjusted forx2 andx3. An avplot command performs the necessary calculations automatically. We can draw the adjusted-variable plot for predictor high, for example, just by typing , avplot high Speeding the process further, we could type avplots to obtain a complete set of tiny added-variable plots with each of the predictor variables in the preceding regression. Figure 7.3 shows the results from the regression of csat on percent, percent2, and high. The lines drawn in added-variable plots have slopes equal to the corresponding partial regression coefficients. For example, the slope of the line at lower left in Figure 7.3 equals 2.99, which is the coefficient on high. avplots ra O -10 -5 0 5 e( percent | X ) coef = -6 5203116, se = .50958046, ( = -12.I Figure 7.3 10 -500 0 500 e( percent I X ) coef - .05365555, se = .00836777. t = 8.43 1000 -10 -5 0 5 e{ high I X ) coef = 2.9865088, se = .48575023, I = 6.15 10 Added-variable plots help to uncover observations exerting a disproportionate influence on the regression model. In simple regression with onex variable, ordinary scatterplots suffice for this purpose. In multiple regression, however, the signs of influence become more subtle. An observation with an unusual combination of values on several x variables might have high leverage, or potential to influence the regression, even though none of its individual x values 202 Statistics with Stata is unusual by itself. High-leverage observations show up in added-variable plots as points horizontally distant from the rest of the data. We see no such problems in Figure 7.3, however. If outliers appear, we might identify which observations these are by including observation labels for the markers in an added-variable plot. This is done using the mlabel ( ) option, just as with scatterplots. Figure 7.4 illustrates using state names (values of the string variable state) as labels. Although such labels tend to overprint each other where the data are dense, individual outliers remain more readable. avplot high, mlabel(state) « Iowa Figure 7.4 • North DtkOtegon New Hampshire . IlliAdfrryland * Mont; ^s°t? Alaska Tennessee • Rhode Island * • FlbS Kentucky LouisianlixaS Arkansas Alabama • NonrS@§r@itha • Wisconsin, • Delaware * Nebraska V,P83ifor™a J^^om ^^W«^^nfe^cCUtUSeUS • Idaho • Ohio " 'Arizona Indiana * Oklahoma • MicNgBQda Wyoming • Misai^jßffc Carolina • West Virginia -10 ■* District of Columbia 10 e( high | X ) coef = 2.9865088. 
se * .48575023.1 = 6.15 Component-plus-residual plots, produced by commands of the form cprplot xl, take a different approach to graphing multiple regression. The component-plus residual plot for variable xl graphs each observation's residual plus its component predicted from xl, e, against values of xl. Such plots might help diagnose nonlinearities and suggest alternative functional forms. An augmented component-plus-residual plot (Mallows 1986) works somewhat better, although both types often seem inconclusive. Figure 7.5 shows an augmented component-plus-residual plot from the regression of csat on percent, percent!, and high. acprplot high, lowess Regression Diagnostics 203 Figure 7.5 3 oo T3 iTi -to F = 0.0000 R-squared = 0.92 97 Adj R-squared - 0.9251 Root MSE - 18 . 186 csat 1 Coef . Std. Err. t P> 1 t 1 [95*. Conf . Interva1] percent i - 6 . 77870 6 . 504 42 1 7 -13 44 0.000 -7.79405 4 -5 . 763357 percent2 | . 05 63562 . 0 0 6250 9 9 02 0.000 .0437738 . 0 68 9387 high 1 3.281765 . 4 8 658 54 6 74 0.000 2 . 302319 4 .26121 cons 1 8 2 7.1159 36 .17138 22 87 0.000 754.30 67 899.9252 In the n = 50 (instead of n = 51) regression, all three coefficients strengthened a bit because we deleted an ill-fit observation. The general conclusions remain unchanged, however. Chambers et al. (1983) and Cook and Weisberg (1994) provide more detailed examples and explanations of diagnostic plots and other graphical methods for data analysis. Diagnostic Case Statistics Afterusing regress or anova, we can obtain a variety of diagnostic statistics through the predict command (see Chapter 6 or type help regress ). The variables created by predict are case statistics, meaning that they have values for each observation in the data. Diagnostic work usually begins by calculating the predicted values and residuals. There is some overlap in purpose among other predict statistics. Many attempt to measure how much each observation influences regression results. "Influencing regression results," however, could refer to several different things — effects on the _y-intercept, on a particular slope coefficient, on all the slope coefficients, or on the estimated standard errors, for example. Consequently, we have a variety of alternative case statistics designed to measure influence. Standardized and studentized residuals (rstandard and rstudent) help to identify outliers among the residuals -— observations that particularly contradict the regression model. Studentized residuals have the most straightforward interpretation. They correspond to the t statistic we would obtain by including in the regression a dummy predictor coded 1 for that observation and 0 for all others. Thus, they test whether a particular observation significantly shifts the>'-intercept. Hat matrix diagonals ( hat ) measure leverage, meaning the potential to influence regression coefficients. Observations possess high leverage when their jc values (or their combination of x values) are unusual. Several other statistics measure actual influence on coefficients. DFBETAs indicate by how many standard errors the coefficient onxl would change if observation / were dropped from the regression. These can be obtained fora single predictor, x7, in either of two ways: through the predict option dfbeta(xl) or through the command dfbeta . 
206 Statistics with Stata Regression Diagnostics 207 Cook's D ( cooksd ), Welsch's distance ( welsch ), and DFITS ( df its ), unlike DFBETA, all summarize how much observation / influences the regression model as a whole — or equivalently, how much observation / influences the set of predicted values. COVRATIO measures the influence of the ith observation on the estimated standard errors. Below we generate a full set of diagnostic statistics including DFBETAs for all three predictors. Note that predict supplies variable labels automatically for the variables it creates, but dfbeta does not. We begin by repeating our original regression to ensure that these post-regression diagnostics refer to the proper (n = 51) model. . quietly regress csat percent percent2 high . predict standard, rstandard . predict student, rstudent . predict hf hat . predict D, cooksd . predict DFITS, dfits . predict W, welsch . predict COVRATIO, covratio . dfbeta DFpercent DFpercent2 DFhigh DFbeta(percent) DFbeta(percent2) DFbeta(high) describe standard ~ DFhigh variable name standard student h D DFITS W COVRATIO DFpercent DFpercent2 DFhigh storage display type format float 19. Og float %9 . Og float %9 Og float 'o9 og float %9 Og float %9 Og float %9 Og float %9 • Og float %9 • Og float %9 .og value label variable label Standardized residual. Student!zed residuals Leve rage Cook's D Dfits Welsch distance Covratio summarize sti andard - DFhigh Mean Std. Dev. Min Max standard I student | h 1 D 1 51 51 51 51 - .0031359 -.00162 . 0784314 . 021 994 1 - .010734 8 1 .032723 .0373011 .0364003 .3064762 -2.182423 .0336437 .0000135 - . 8 96658 2 .3 36 977 .2151227 .1860992 . 7444486 W I COVRATIO 1 DFpercent I DFpercent2 I DFhigh | 51 51 51 51 51 - .089723 1 . 092452 . 000938 - . 00 10659 - .0012204 . 1316834 .1498813 . 1370372 .1747835 .7607449 - .5067295 - . 44077 1 - . 6 316988 1 . 360136 .5269799 . 4253 958 . 34 14 85 1 summarize shows us the minimum and maximum values of each statistic, so we can quickly check whether any are large enough to cause concern. For example, special tables could be used to determine whether the observation with the largest absolute studentized residual (student) constitutes a significant outlier. Alternatively, we could apply the Bonferroni inequality and t distribution table: max| student \ is significant at level a if (t \ is significant at a/n. In this example, we have max| student \ = 2.337 (Iowa) and n = 51. For Iowa to be a significant outlier (cause a significant shift in intercept) at a = .05, t = 2.337 must be significant at .05/51: display 0009803 9 05/51 Stata's ttail ( ) function can approximate the probability of | /1 > 2.337, given df=n-K-l -51-3-1 = 47; . display 2*ttail(47, 2.337) .02375133 The obtained P-value (P = .0238) is not below a/n = .00098, so Iowa is not a significant outlier at a = .05. Studentized residuals measure the ith observation's influence on the y-intercept. Cook's Z), DFITS, and Welsch's distance all measure the ith observation's influence on all coefficients in the model (or, equivalently, on all n predicted v values). To list the 5 most influential observations as measured by Cook's D, type sort D . list state yhat3 D DFITS W in -5/1 state North Dakota Wyoming Tennessee Iowa Utah yhat3 1036 .696 1017.005 974.6981 10 5 2.78 1067.712 D 0705921 0789454 .1117 18 1265392 1860992 DFITS 54 93086 5820746 6 9 9 2 3 4 3 744 4486 .896658 W 4.020527 4 . 2704 65 5 . 1 62398 5 . 
Cook's D (cooksd), Welsch's distance (welsch), and DFITS (dfits), unlike DFBETA, all summarize how much observation i influences the regression model as a whole, or equivalently, how much observation i influences the set of predicted values. COVRATIO measures the influence of the ith observation on the estimated standard errors. Below we generate a full set of diagnostic statistics, including DFBETAs for all three predictors. Note that predict supplies variable labels automatically for the variables it creates, but dfbeta does not. We begin by repeating our original regression to ensure that these post-regression diagnostics refer to the proper (n = 51) model.

. quietly regress csat percent percent2 high
. predict standard, rstandard
. predict student, rstudent
. predict h, hat
. predict D, cooksd
. predict DFITS, dfits
. predict W, welsch
. predict COVRATIO, covratio
. dfbeta

 DFpercent:   DFbeta(percent)
 DFpercent2:  DFbeta(percent2)
 DFhigh:      DFbeta(high)

. describe standard - DFhigh

variable name   storage type   display format   value label   variable label
standard        float          %9.0g                          Standardized residuals
student         float          %9.0g                          Studentized residuals
h               float          %9.0g                          Leverage
D               float          %9.0g                          Cook's D
DFITS           float          %9.0g                          Dfits
W               float          %9.0g                          Welsch distance
COVRATIO        float          %9.0g                          Covratio
DFpercent       float          %9.0g
DFpercent2      float          %9.0g
DFhigh          float          %9.0g

. summarize standard - DFhigh

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+---------------------------------------------------------
    standard |      51   -.0031359                -2.182423
     student |      51     -.00162    1.032723                2.336977
           h |      51    .0784314    .0373011    .0336437   .2151227
           D |      51    .0219941    .0364003    .0000135   .1860992
       DFITS |      51   -.0107348    .3064762    -.896658   .7444486
-------------+---------------------------------------------------------
           W |      51    -.089723                -6.854601    5.52468
    COVRATIO |      51    1.092452    .1316834    .7607449   1.360136
   DFpercent |      51     .000938    .1498813   -.5067295   .5269799
  DFpercent2 |      51   -.0010659    .1370372    -.440771   .4253958
      DFhigh |      51   -.0012204    .1747835   -.6316988    .341485

summarize shows us the minimum and maximum values of each statistic, so we can quickly check whether any are large enough to cause concern. For example, special tables could be used to determine whether the observation with the largest absolute studentized residual (student) constitutes a significant outlier. Alternatively, we could apply the Bonferroni inequality and a t distribution table: max|student| is significant at level α if |t| is significant at α/n. In this example, we have max|student| = 2.337 (Iowa) and n = 51. For Iowa to be a significant outlier (cause a significant shift in intercept) at α = .05, t = 2.337 must be significant at .05/51:

. display .05/51
.00098039

Stata's ttail() function can approximate the probability of |t| > 2.337, given df = n - K - 1 = 51 - 3 - 1 = 47:

. display 2*ttail(47, 2.337)
.02375133

The obtained P-value (P = .0238) is not below α/n = .00098, so Iowa is not a significant outlier at α = .05.
Studentized residuals measure the ith observation's influence on the y-intercept. Cook's D, DFITS, and Welsch's distance all measure the ith observation's influence on all coefficients in the model (or, equivalently, on all n predicted y values). To list the 5 most influential observations as measured by Cook's D, type

. sort D
. list state yhat3 D DFITS W in -5/l

     state              yhat3          D      DFITS          W
 47. North Dakota    1036.696   .0705921   .5493086   4.020527
 48. Wyoming         1017.005   .0789454   .5820746   4.270465
 49. Tennessee       974.6981    .111718   .6992343   5.162398
 50. Iowa             1052.78   .1265392   .7444486    5.52468
 51. Utah            1067.712   .1860992   -.896658  -6.854601

The in -5/l qualifier tells Stata to list only the fifth-from-last (-5) through last (lowercase letter "l") observations. Figure 7.7 shows one way to display influence graphically: symbols in a residual-versus-predicted plot are given sizes proportional to values of Cook's D, through the "analytical weight" option [aweight = D]. Five influential observations stand out, with large positive or negative residuals and high predicted csat values.
In this example, we have no substantive reason to discard any states, and even the most influential of them do not fundamentally change our conclusions. Using any fixed definition of what constitutes an "outlier," we are liable to see more of them in larger samples. For this reason, sample-size-adjusted cutoffs are sometimes recommended for identifying unusual observations. After fitting a regression model with K coefficients (including the constant) based on observations, we might look more closely at those observations for which any of the following are true: leverage h > 2K/n Cook's D> 4/n DFITS> lyfKJh Welsch's W> 3vT DFBETA > 21 v"n~ \COVRATIO- ) \ > 3K/n The reasoning behind these cutoffs, and the diagnostic statistics more generally, can be found in Cook and Weisberg(1982, 1994); Belsley, Kuh, and Welsch (1980); or Fox (1991). Multicollinearity If perfect multicollinearity (linear relationship) exists among the predictors, regression equations become unsolvable. Stata handles this by warning the user and then automatically dropping one of the offending predictors. High but not perfect multicollinearity causes more subtle problems. When we add a new x variable that is strongly related to x variables already in the model, symptoms of possible trouble include the following: 1. Substantially higher standard errors, with correspondingly lower t statistics. 2. Unexpected changes in coefficient magnitudes or signs. 3. Nonsignificant coefficients despite a high R2. Multiple regression attempts to estimate the independent effects of each x variable. There is little information for doing so, however, if one or more of the x variables does not have much independent variation. The symptoms listed above warn that coefficient estimates have become unreliable, and might shift drastically with small changes in the sample or model. Further troubleshooting is needed to determine whether multicollinearity really is at fault and, if so, what should be done about it. Multicollinearity cannot necessarily be detected, or ruled out, by examining a matrix of correlations between variables. A better assessment comes from regressing each x on all of the other x variables. Then we calculate 1 -R2 from this regression to see what fraction of the first Regression Diagnostics 211 x variable's variance is independent of the other jc variables. For example, about 97% of high's variance is independent of percent and percent2: . quietly regress high percent percent2 . display 1 - e(r2) .96942331 After regression, e(r2) holds the value of R 2 . Similar commands reveal that only 4% of percent's variance is independent of the other two predictor variables: . quietly regress percent high percent2 A* - e (r2) display 1 04010307 This finding about percent and percent2 is not surprising. In polynomial regression or regression with interaction terms, some x variables are calculated directly from other x variables. Although strictly speaking their relationship is nonlinear, it often is close enough to linear to raise problems of multicollinearity. The post-regression command vif , for variance inflation factor, performs similar calculations automatically. This gives a quick and straightforward check for multicollinearity. quietly regress csat percent percent2 high vif Variable | percent | percent2 [ high I Mean VIF J VI F 1/VIF 24.94 24.78 1.03 0 . 04 01 03 0 .04 0354 0 . 969423 16 . 
The post-regression command vif, for variance inflation factor, performs similar calculations automatically. This gives a quick and straightforward check for multicollinearity.

. quietly regress csat percent percent2 high
. vif

    Variable |       VIF       1/VIF
-------------+----------------------
     percent |     24.94    0.040103
    percent2 |     24.78    0.040354
        high |      1.03    0.969423
-------------+----------------------
    Mean VIF |     16.92

The 1/VIF column at right in a vif table gives values equal to 1 - R2 from the regression of each x on the other x variables, as can be seen by comparing the values for high (.969423) or percent (.040103) with our earlier display calculations. That is, 1/VIF (or 1 - R2) tells us what proportion of an x variable's variance is independent of all the other x variables. A low proportion, such as the .04 (4% independent variation) of percent and percent2, indicates potential trouble. Some analysts set a minimum level, called tolerance, for the 1/VIF value, and automatically exclude predictors that fall below their tolerance criterion.

The VIF column at center in a vif table reflects the degree to which other coefficients' variances (and standard errors) are increased due to the inclusion of that predictor. We see that high has virtually no impact on other variances, but percent and percent2 affect the variances substantially. VIF values provide guidance but not direct measurements of the increase in coefficient variances. The following commands show the impact directly by displaying standard error estimates for the coefficient on percent, when percent2 is and is not included in the model.

. quietly regress csat percent percent2 high
. display _se[percent]
.50958046

. quietly regress csat percent high
. display _se[percent]
.16162193

With percent2 included in the model, the standard error for percent is three times higher:

    .50958046 / .16162193 = 3.1529166

This corresponds to a tenfold increase in the coefficient's variance. How much variance inflation is too much? Chatterjee, Hadi, and Price (2000) suggest the following as guidelines for the presence of multicollinearity:
1. The largest VIF is greater than 10; or
2. the mean VIF is larger than 1.

With our largest VIFs close to 25, and the mean almost 17, the csat regression clearly meets both criteria. How troublesome the problem is, and what, if anything, should be done about it, are the next questions to consider.

Because percent and percent2 are closely related, we cannot estimate their separate effects with nearly as much precision as we could the effect of either predictor alone. That is why the standard error for the coefficient on percent increases threefold when we compare the regression of csat on percent and high to a polynomial regression of csat on percent, percent2, and high. Despite this loss of precision, however, we can still distinguish all the coefficients from zero. Moreover, the polynomial regression obtains a better prediction model. For these reasons, the multicollinearity in this regression does not necessarily pose a great problem, or require a solution. We could simply live with it as one feature of an otherwise acceptable model.

When solutions are needed, a simple trick called "centering" often succeeds in reducing multicollinearity in polynomial or interaction-effect models. Centering involves subtracting the mean from x variable values before generating polynomial or product terms. Subtracting the mean creates a new variable centered on zero and much less correlated with its own squared values. The resulting regression fits the same as an uncentered version. By reducing multicollinearity, centering often (but not always) yields more precise coefficient estimates with lower standard errors.
The commands below generate a centered version of percent named Cpercent, and then obtain squared values of Cpercent named Cpercent2.

. summarize percent

    Variable |     Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     percent |      51    35.76471    26.19281

. generate Cpercent = percent - r(mean)
. generate Cpercent2 = Cpercent^2
. correlate Cpercent Cpercent2 percent percent2 high csat

. regress csat Cpercent Cpercent2 high

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  3,    47) =  193.37
       Model |                  3                       Prob > F      =  0.0000
    Residual |                 47                       R-squared     =  0.9251
-------------+------------------------------           Adj R-squared =  0.9203
       Total |                 50                       Root MSE      =   18.90

        csat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    Cpercent |  -2.682362   .1119085   -23.97   0.000    -2.907493   -2.457231
   Cpercent2 |   .0536555   .0063678     8.43   0.000     .0408452    .0664659
        high |   2.986509    .485749     6.15   0.000     2.009306    3.963712
       _cons |   680.2552    37.8233    17.98   0.000     604.1646    756.3458

The model's fit is the same as in the uncentered version, but the standard error for the coefficient on Cpercent is much lower, and its t statistic is correspondingly larger. Thus, centering has improved the precision of this estimate. The VIF table now gives less cause for concern as well. Each of the three predictors has more than 80% independent variation, compared with 4% for percent and percent2 in the uncentered regression.

. vif

    Variable |       VIF       1/VIF
-------------+----------------------
    Cpercent |      1.20    0.831528
   Cpercent2 |      1.18    0.846991
        high |      1.03    0.969423
-------------+----------------------
    Mean VIF |      1.14

We can inspect the matrix of correlations between the estimated coefficients after regress, anova, or other model-fitting procedures by typing correlate, _coef

. correlate, _coef

             | Cpercent Cperce~2     high    _cons
-------------+------------------------------------
    Cpercent |   1.0000
   Cpercent2 |  -0.3893   1.0000
        high |  -0.1100   0.1040   1.0000
       _cons |   0.2105  -0.2151  -0.9912   1.0000

By adding the option covariance, we can see the coefficient variance-covariance matrix, from which standard errors are derived.

. correlate, _coef covariance

             | Cpercent Cperce~2     high    _cons
-------------+------------------------------------
    Cpercent |  .012524
   Cpercent2 | -.000277  .000041
        high | -.009239  .000322  .235953
       _cons |  .891126 -.051817 -18.2105   1430.6

Fitting Curves

Basic regression and correlation methods assume linear relationships. Linear models provide reasonable and simple approximations for many real phenomena, over a limited range of values. But analysts also encounter phenomena where linear approximations are too simple; these call for nonlinear alternatives. This chapter describes three broad approaches to modeling nonlinear or curvilinear relationships:
1. Nonparametric methods, including band regression and lowess smoothing.
2. Linear regression with transformed variables ("curvilinear regression"), including Box-Cox methods.
3. Nonlinear regression.

Nonparametric regression serves as an exploratory tool because it can summarize data patterns visually without requiring the analyst to specify a particular model in advance. Transformed variables extend the usefulness of linear parametric methods, such as OLS regression (regress), to encompass curvilinear relationships as well. Nonlinear regression, on the other hand, requires a different class of methods that can estimate parameters of intrinsically nonlinear models.

The following menu groups cover many of the operations discussed in this chapter. The final topic, nonlinear regression, requires a command-based approach.

    Graphics - Twoway
    Statistics - Nonparametric analysis - Lowess smoothing
    Data - Create or change variables - Create new variable
    Statistics - Linear regression and related

Example Commands

boxcox y x1 x2 x3, model(lhs)
    Finds maximum-likelihood estimates of the parameter lambda for a Box-Cox transformation of y, assuming that y is a linear function of x1, x2, and x3 plus Gaussian constant-variance errors. The model(lhs) option restricts transformation to the left-hand-side variable y. Other options could transform right-hand-side (x) variables by the same or different parameters, and control further details of the model. Type help boxcox for the syntax and a complete list of options. The Base Reference Manual gives technical details.
graph twoway mband y x, bands(10) || scatter y x
    Produces a y versus x scatterplot with line segments connecting the cross-medians (median x, median y points) within 10 equal-width vertical bands. This is one form of "band regression." Typing mspline in place of mband in this command would result in the cross-medians being connected by a smooth cubic spline curve instead of by line segments.

graph twoway lowess y x, bwidth(.4) || scatter y x
    Draws a lowess-smoothed curve with a scatterplot of y versus x. Lowess calculations use a bandwidth of .4 (40% of the data). In order to calculate and keep the smoothed values as a new variable, use the related command lowess.

lowess y x, bwidth(.3) gen(newvar)
    Draws a lowess-smoothed curve on a scatterplot of y versus x, using a bandwidth of .3 (30% of the data). Predicted values for this curve are saved as a variable named newvar. The lowess command offers more options than graph twoway lowess, including fitting methods and the ability to save predicted values. See help lowess for details.

nl exp2 y x
    Uses iterative nonlinear least squares to fit a 2-parameter exponential growth model, predicted y = b1*b2^x. The term exp2 refers to a separate program that specifies the model itself. You can write a program to define your own model, or use one of the common models (including exponential, logistic, and Gompertz) supplied with Stata. After nl, use predict to generate predicted values or residuals.

nl log4 y x, init(B0=5, B1=25, B2=.1, B3=50)
    Fits a 4-parameter logistic growth model (log4) of the form

        predicted y = b0 + b1/(1 + exp(-b2*(x - b3)))

    Sets initial parameter values for the iterative estimation process at b0 = 5, b1 = 25, b2 = .1, and b3 = 50.

regress lny x1 sqrtx2 invx3
    Performs curvilinear regression using the variables lny, sqrtx2, and invx3. These variables were previously generated by nonlinear transformations of the raw variables y, x2, and x3 through commands such as the following:
        . generate lny = ln(y)
        . generate sqrtx2 = sqrt(x2)
        . generate invx3 = 1/x3
    When, as in this example, the y variable was transformed, the predicted values generated by predict yhat, or residuals generated by predict e, resid, will also be in transformed units. For graphing or other purposes, we might want to return predicted values or residuals to raw-data units, using inverse transformations such as
        . replace yhat = exp(yhat)
Band Regression

Nonparametric regression methods generally do not yield an explicit regression equation. They are primarily graphic tools for displaying the relationship, possibly nonlinear, between y and x. Stata can draw a simple kind of nonparametric regression, band regression, onto any scatterplot or scatterplot matrix. For illustration, consider these sobering Cold War data (missile.dta) from MacKenzie (1990). The observations are 48 types of long-range nuclear missiles, deployed by the U.S. and Soviet Union during their arms race, 1958 to 1990:

Contains data from C:\data\missile.dta
  obs:            48                          Missiles (MacKenzie 1990)
 vars:             6                          16 Jul 2005 14:57
 size:         1,392 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
missile                                       Missile
country         byte   %8.0g       soviet     US or Soviet missile?
year            int    %8.0g                  Year of first deployment
type            byte   %8.0g       type       ICBM or submarine-launched?
range           int    %8.0g                  Range in nautical miles
CEP             float  %9.0g                  Circular Error Probable, miles

CEP, the "Circular Error Probable," measures a missile's accuracy as the radius within which 50% of its warheads should land. Year by year, scientists on both sides worked to improve accuracy (Figure 8.1).

. graph twoway mband CEP year, bands(8)
    || scatter CEP year
    || , ytitle("Circular Error Probable, miles") legend(off)

Figure 8.1  [band regression of CEP on year of first deployment, 1960-1990]

Figure 8.1 shows CEP declining (accuracy increasing) over time. The option bands(8) instructs graph twoway mband to divide the scatterplot into 8 equal-width vertical bands and draw line segments connecting the points (median x, median y) within each band. This curve traces how the median of CEP changes with year.

Nonparametric regression does not require the analyst to specify a relationship's functional form in advance. Instead, it allows us to explore the data with an "open mind." This process often uncovers interesting results, such as when we view trends in U.S. and Soviet missile accuracy separately (Figure 8.2). The by(country) option in the following command produces separate plots for each country, each with overlaid band-regression curve and scatterplot. Within the by() option are suboptions controlling the legend and note.

. graph twoway mband CEP year, bands(8)
    || scatter CEP year
    || , ytitle("Circular Error Probable, miles")
      by(country, legend(off) note(""))

Figure 8.2  [band regressions of CEP on year, separate panels for U.S. and U.S.S.R.]

The shapes of the two curves in Figure 8.2 differ substantially. U.S. missiles became much more accurate in the 1960s, permitting a shift to smaller warheads. Three or more small warheads would fit on the same size missile that formerly carried one large warhead. The accuracy of Soviet missiles improved more slowly, apparently stalling during the late 1960s to early 1970s, and remained a decade or so behind their American counterparts. To make up for this accuracy disadvantage, Soviet strategy emphasized larger rockets carrying high-yield warheads. Nonparametric regression can assist with a qualitative description of this sort or serve as a preliminary to fitting parametric models such as those described later.

We can add band regression curves to any scatterplot by overlaying an mband (or mspline) plot. Band regression's simplicity makes it a convenient exploratory tool, but it possesses one notable disadvantage — the bands have the same width across the range of x values, although some of these bands contain few or no observations. With normally distributed variables, for example, data density decreases toward the extremes. Consequently, the left and right endpoints of the band regression curve (which tend to dominate its appearance) often reflect just a few data points. The next section describes a more sophisticated, computation-intensive approach.
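Before moving on, it may help to see what mband is doing under the hood: nothing more than cross-medians within equal-width bands. The sketch below reproduces Figure 8.1's curve by hand; the names band, medCEP, and medyear are invented for this illustration, and the 1958 to 1990 limits come from the deployment years mentioned above.

. generate band = autocode(year, 8, 1958, 1990)   // 8 equal-width intervals
. egen medCEP = median(CEP), by(band)             // median y within each band
. egen medyear = median(year), by(band)           // median x within each band
. graph twoway connected medCEP medyear, sort
    || scatter CEP year
    || , legend(off) ytitle("Circular Error Probable, miles")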
Lowess Smoothing

The lowess and graph twoway lowess commands accomplish a form of nonparametric regression called lowess smoothing (for locally weighted scatterplot smoothing). In general the lowess command is more specialized and more powerful, with options that control details of the fitting process. graph twoway lowess has advantages of simplicity, and follows the familiar syntax of the graph twoway family. The following example uses graph twoway lowess to plot CEP against year for U.S. missiles only (country == 0).

. graph twoway lowess CEP year if country == 0, bwidth(.4)
    || scatter CEP year
    || , legend(off) ytitle("Circular Error Probable, miles")

Figure 8.3  [lowess-smoothed curve of CEP on year of first deployment, U.S. missiles]

A graph very similar to Figure 8.3 would result if we had typed instead

. lowess CEP year if country == 0, bwidth(.4)

Like Figure 8.2, Figure 8.3 shows U.S. missile accuracy improving rapidly during the 1960s and progressing at a more gradual rate in the 1970s and 1980s. Lowess-smoothed values of CEP are generated here with the name lsCEP. The bwidth(.4) option specifies the lowess bandwidth: the fraction of the sample used in smoothing each point. The default is bwidth(.8). The closer bandwidth is to 1, the greater the degree of smoothing.

Lowess predicted (smoothed) y values for n observations result from n weighted regressions. Let k represent the half-bandwidth, truncated to an integer. For each y_i, a smoothed value is obtained by weighted regression involving only those observations within the interval from j = max(1, i - k) through j = min(i + k, n). The jth observation within this interval receives weight w_j according to a tricube function:

    w_j = (1 - |u_j|^3)^3

where u_j = (x_j - x_i)/D, and D stands for the distance between x_i and its furthest neighbor within the interval. Weights equal 1 for x_j = x_i, but fall off to zero at the interval's boundaries. See Chambers et al. (1983) or Cleveland (1993) for more discussion and examples of lowess methods.

lowess options include the following.

mean          For running-mean smoothing. The default is running-line least
              squares smoothing.
noweight      Unweighted smoothing. The default is Cleveland's tricube
              weighting function.
bwidth()      Specifies the bandwidth. Centered subsets of approximately
              bwidth x n observations are used for smoothing, except towards
              the endpoints where smaller, uncentered bands are used. The
              default is bwidth(.8).
logit         Transforms smoothed values to logits.
adjust        Adjusts the mean of smoothed values to equal the mean of the
              original y variable; like logit, adjust is useful with
              dichotomous y.
gen(newvar)   Creates newvar containing smoothed values of y.
nograph       Suppresses displaying the graph.
plot()        Provides a way to add other plots to the generated graph; see
              help plot_option.
rlopts()      Affects the rendition of the reference line; see help
              cline_options.

Because it requires n weighted regressions, lowess smoothing proceeds slowly with large samples. In addition to smoothing scatterplots, lowess can be used for exploratory time series smoothing. The file ice.dta contains results from the Greenland Ice Sheet 2 (GISP2) project described in Mayewski, Holdsworth, and colleagues (1993) and Mayewski, Meeker, and colleagues (1993). Researchers extracted and chemically analyzed an ice core representing more than 100,000 years of climate history. ice.dta includes a small fraction of this information: measured non-sea salt sulfate concentration and an index of "Polar Circulation Intensity" since AD 1500.

Contains data from C:\data\ice.dta
  obs:           271                          Greenland ice (Mayewski 1995)
 vars:             3                          16 Jul 2005 14:57
 size:         5,962 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
year            int    %ty                    Year
sulfate         double %10.0g                 SO4 ion concentration, ppb
PCI             double %6.0g                  Polar Circulation Intensity

Sorted by: year

To retain more detail from this 271-point time series, we smooth with a relatively narrow bandwidth, only 5% of the sample. Figure 8.4 graphs the results.
The smoothed curve has been drawn with "thick" width, to visually distinguish it from the raw data. (Type help linewidthstyle for other choices of line width.)

. graph twoway lowess sulfate year, bwidth(.05) clwidth(thick)
    || line sulfate year, clpattern(solid)
    || , ytitle("SO4 ion concentration, ppb")
      legend(label(1 "lowess smoothed") label(2 "raw data"))

Figure 8.4  [lowess smoothed and raw SO4 series, 1600-2000]

Non-sea salt sulfate (SO4) reached the Greenland ice after being injected into the atmosphere, chiefly by volcanoes or the burning of fossil fuels such as coal and oil. Both the smoothed and raw curves in Figure 8.4 convey information. The smoothed curve shows oscillations around a slightly rising mean from 1500 through the early 1800s. After 1900, fossil fuels drive the smoothed curve upward, with temporary setbacks after 1929 (the Great Depression) and the early 1970s (combined effects of the U.S. Clean Air Act, 1970; the Arab oil embargo, 1973; and subsequent oil price hikes). Most of the sharp peaks of the raw data have been identified with known volcanic eruptions such as Iceland's Hekla (1970) or Alaska's Katmai (1912).

After smoothing time series data, it is often useful to study the smooth and rough (residual) series separately. The following commands create two new variables: lowess-smoothed values of sulfate (smooth) and the residuals or rough values (rough) calculated by subtracting the smoothed values from the raw data.

. lowess sulfate year, bwidth(.05) gen(smooth)
. label variable smooth "SO4 ion concentration (smoothed)"
. gen rough = sulfate - smooth
. label variable rough "SO4 ion concentration (rough)"

Figure 8.5 compares the smooth and rough time series in a pair of graphs annotated using the text() option, then combined.

. graph twoway line smooth year, ylabel(0(50)150) xtitle("")
    ytitle("Smoothed") text(20 1540 "Renaissance")
    text(20 1900 "Industrialization")
    text(90 1860 "Great Depression 1929")
    text(150 1935 "Oil Embargo 1973") saving(fig08_05a, replace)

. graph twoway line rough year, ylabel(0(50)150) xtitle("")
    ytitle("Rough") text(75 1630 "Awu 1640", orientation(vertical))
    text(120 1770 "Laki 1783", orientation(vertical))
    text(90 1805 "Tambora 1815", orientation(vertical))
    text(65 1902 "Katmai 1912", orientation(vertical))
    text(80 1960 "Hekla 1970", orientation(vertical))
    yline(0) saving(fig08_05b, replace)

. graph combine fig08_05a.gph fig08_05b.gph, rows(2)

Figure 8.5  [smoothed and rough SO4 series with annotated events and eruptions]

Regression with Transformed Variables — 1

By subjecting one or more variables to nonlinear transformation, and then including the transformed variable(s) in a linear regression, we implicitly fit a curvilinear model to the underlying data. Chapters 6 and 7 gave one example of this approach, polynomial regression, which incorporates second (and perhaps higher) powers of at least one x variable among the predictors. Logarithms also are used routinely in many fields. Other common transformations include those of the ladder of powers and Box-Cox transformations, introduced in Chapter 4.
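Since the ladder of powers figures in what follows, here is a minimal sketch of its most common rungs, using a hypothetical positive variable y (none of these names belong to the datasets in this chapter):

. generate ysq   = y^2      // square: one step up the ladder
. generate ysqrt = sqrt(y)  // square root: one step down
. generate ylog  = ln(y)    // log: a further step down
. generate yinv  = 1/y      // inverse: the strongest step down shown here

Steps down the ladder pull in a high outlier-prone tail; steps up do the opposite.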
Dataset tornado.dta provides a simple illustration involving U.S. tornados from 1916 to 1986 (from the Council on Environmental Quality, 1988).

Contains data from C:\data\tornado.dta
  obs:            71                          U.S. tornados 1916-1986
                                              (Council on Env. Quality 1988)
 vars:             4                          16 Jul 2005 14:57
 size:           994 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
year            int    %8.0g                  Year
tornado         int    %8.0g                  Number of tornados
lives           int    %8.0g                  Number of lives lost
avlost          float  %9.0g                  Average lives lost/tornado

Sorted by: year

The number of fatalities decreased over this period, while the number of recognized tornados increased, because of improvements in warnings and our ability to detect more tornados, even those that do little damage. Consequently, the average lives lost per tornado (avlost) declined with time, but a linear regression (Figure 8.6) does not well describe this trend. The scatter descends more steeply than the regression line at first, then levels off in the mid-1950s. The regression line actually predicts negative numbers of deaths in later years. Furthermore, average tornado deaths exhibit more variation in early years than later — evidence of heteroskedasticity.

. graph twoway scatter avlost year
    || lfit avlost year, clpattern(solid)
    || , ytitle("Average number of lives lost") xlabel(1920(10)1990)
      xtitle("") legend(off) ylabel(0(1)7) yline(0)

Figure 8.6  [average lives lost per tornado versus year, with linear fit]

The relationship becomes linear, and heteroskedasticity vanishes, if we work instead with logarithms of the average number of lives lost (Figure 8.7):

. generate loglost = ln(avlost)
. label variable loglost "ln(avlost)"
. regress loglost year

      Source |       SS       df       MS              Number of obs =      71
-------------+------------------------------           F(  1,    69) =  162.24
       Model |  115.895325     1  115.895325           Prob > F      =  0.0000
    Residual |  43.8807356    69   .63595269           R-squared     =  0.7254
-------------+------------------------------           Adj R-squared =  0.7214
       Total |   159.77606    70  2.28251515           Root MSE      =  .79747

     loglost |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        year |  -.0623418    .004618   -13.50   0.000    -.0715545    -.053129
       _cons |   120.5645   9.010312    13.38   0.000     102.5894    138.5395

. predict yhat2
(option xb assumed; fitted values)
. label variable yhat2 "ln(avlost) = 120.56 - .06year"

. graph twoway scatter loglost year
    || mspline yhat2 year, clpattern(solid) bands(50)
    || , ytitle("Natural log(average lives lost)")
      xlabel(1920(10)1990) xtitle("") legend(off) ylabel(-4(1)2) yline(0)

Figure 8.7  [log of average lives lost versus year, with fitted line]

The regression model is approximately

    predicted ln(avlost) = 120.56 - .06year

Because we regressed logarithms of lives lost on year, the model's predicted values are also measured in logarithmic units. Return these predicted values to their natural units (lives lost) by inverse transformation, in this case exponentiating (e to power) yhat2:

. replace yhat2 = exp(yhat2)
(71 real changes made)

Graphing these inverse-transformed predicted values reveals the curvilinear regression model, which we obtained by linear regression with a transformed y variable (Figure 8.8). Contrast Figures 8.7 and 8.8 with Figure 8.6 to see how transformation made the analysis both simpler and more realistic.

. graph twoway scatter avlost year
    || mspline yhat2 year, clpattern(solid) bands(50)
    || , ytitle("Average number of lives lost") xlabel(1920(10)1990)
      xtitle("") legend(off) ylabel(0(1)7) yline(0)

Figure 8.8  [average lives lost versus year, with back-transformed curvilinear fit]

The boxcox command employs maximum-likelihood methods to fit curvilinear models involving Box-Cox transformations (introduced in Chapter 4).
Fitting a model with Box-Cox transformation of the dependent variable (model(lhs) specifies left-hand side) to the tornado data, we obtain results quite similar to the model of Figures 8.7 and 8.8. The nolog option in the following command does not affect the model, but suppresses display of log likelihood after each iteration of the fitting process.

. boxcox avlost year, model(lhs) nolog

                                                  Number of obs   =        71
                                                  LR chi2(1)      =     92.28
Log likelihood = -7.7185533                       Prob > chi2     =     0.000

      avlost |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      /theta |  -.0560957   .0646726    -0.87   0.386    -.1828519    .0706605

Estimates of scale-variant parameters
-------------------------
             |      Coef.
-------------+-----------
Notrans      |
        year |  -.0661891
       _cons |    127.971
-------------+-----------
      /sigma |   .8301177
-------------------------

-----------------------------------------------------------------
   Test          Restricted      LR statistic         P-value
   H0:         log likelihood        chi2           Prob > chi2
-----------------------------------------------------------------
theta = -1       -84.923791         154.42             0.000
theta =  0       -8.0941678           0.75             0.386
theta =  1       -101.50385         187.57             0.000
-----------------------------------------------------------------

The boxcox output shows theta = -.056 as the optimal Box-Cox parameter for transforming avlost, in order to linearize its relationship with year. Therefore, the left-hand-side transformation is

    avlost^(theta) = (avlost^(-.056) - 1) / (-.056)

Box-Cox transformation by a parameter close to zero, such as -.056, produces results similar to the natural-logarithm transformation we applied earlier to this variable "by hand." It is therefore not surprising that the boxcox regression model

    predicted avlost^(theta) = 127.97 - .066year

resembles the earlier model (predicted ln(avlost) = 120.56 - .06year) drawn in Figures 8.7 and 8.8.

The boxcox procedure assumes normal, independent, and identically distributed errors. It does not select transformations with the aim of normalizing residuals, however. boxcox can fit several types of models, including multiple regressions in which some or all of the right-hand-side variables are transformed by a parameter different from the y-variable transformation. It cannot apply different transformations to each separate right-hand-side predictor. To do that, we return to a "by hand" curvilinear-regression approach, as illustrated in the next section.
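As a check on the near-equivalence just claimed, we can apply the estimated Box-Cox transformation by hand and compare it with the logarithmic version. This is a sketch, not from the text; bclost is an invented variable name, and the resulting correlation should be very nearly 1:

. generate bclost = (avlost^(-.056) - 1) / (-.056)   // Box-Cox, theta = -.056
. correlate bclost loglost                           // compare with ln(avlost)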
The skewed gnpcap and chldmort distributions also present potential leverage and influence problems. fraph matrix gnpcap chldmort birth, half Figure 8.9 Per capita GNP 1985 40 . 20 f 40 €s- 10000 ■V. f 20000 Child (1-4 mortality 1985 * • • r* *• 'i Crude birth rate/1000 people 20 40 Experimenting with ladder-of-powers transformations reveals that the log of gnpcap and the square root of chldmort have distributions more symmetrical, with fewer outliers or potential leverage points, than the raw variables. More importantly, these largely eliminate the nonlineanties: compare the raw-data scatterplots m Figure 8.9 with their transformed-variables counterparts in Figure 8.10, on the following page, generate loggnp = logl0(gnpcap) label variable loggnp »Log-10 of per cap GNP" generate srmort = sqrt(chldmort) label variable srmort "Square root child mortality graph matrix loggnp srmort birth, half Log-10 orDer cap GNP Figure 8.10 2 68 40 20 0 4 "V/ 1 « «... ■ im * • Square root child mortality 3 4 0 Crude birth rate/1000 peop/e We can now apply linear regression using the transformed variables: . regress birth loggnp srmort Source I SS df MS Number of obs = 109 -------------+------------------------------ F( 2, 106) - 198.06 Model I 15837.9603 2 7918.98016 Prob > F - 0.0000 Residual I 4236.18646 106 39.9828911 R-squared - 0.7889 -------------+------------------------------ Adj R-sqUared = 0.7849 Total I 20076.1468 108 185.890248 Root MSE = 6.3232 bi rth I Coef . Std. Er r . t P> \ t 1 19 51: Conf . Interval] loggnp [ -2.353739 1.686255 -1.40 0.166 -5.696903 .9894259 srmort I 5.577359 .533567 10.45 0.000 4.51951 6.635207 _cons I 26.19488 6.362687 4.12 0.000 13.58024 38.80953 Unlike the raw-data regression (not shown), this transformed-variables version finds that per capita gross national product does not significantly affect birth rate once we control for child mortality. The transformed-variables regression fits slightly better: R2^ = .7849 instead of .6715. (We can compare R :a across models here only because both have the same untransformed v variable.) Leverage plots would confirm that transformations have much reduced the curvilinearity of the raw-data regression. 230 Statistics with Stata Conditional Effect Plots Conditional effect plots trace the predicted values of>' as a function of one x variable, with other x variables held constant at arbitrary values such as their means, medians, quartiles, or extremes. Such plots help with interpreting results from transformed-variables regression. Continuing with the previous example, we can calculate predicted birth rates as a function of loggnp, with srmort held at its mean (2.49): generate yhatl = _b[_cons] + _b[loggnp]*loggnp + _b[srmort]*2.49 label variable yhatl "birth = f(gnpcap | srmort = 2.49) The _b[varname] terms refer to the regression coefficient on varname from this session's most recent regression. _b[_cons] is the ^-intercept or constant. For a conditional effect plot, graph yhatl (after inverse transformation if needed, although it is not needed here) against the untransformedx variable (Figure 8.11). Because conditional effect plots do not show the scatter of data, it can be useful to add reference lines such as the x variable's 10th and 90th percentiles, as shown in Figure 8.11. . graph twoway line yhatl gnpcap, sort xlabel(,grid) xline(230 10890) Figure 8.11 5000 10000 Per capita GNP 1985 15000 20000 Similarly, Figure 8.12 depicts predicted birth rates as a function of srmort, with loggnp held at its mean (3.09): . 
Similarly, Figure 8.12 depicts predicted birth rates as a function of srmort, with loggnp held at its mean (3.09):

. generate yhat2 = _b[_cons] + _b[loggnp]*3.09 + _b[srmort]*srmort
. label variable yhat2 "birth = f(chldmort | loggnp = 3.09)"
. graph twoway line yhat2 chldmort, sort xlabel(,grid) xline(0 27)

Figure 8.12  [predicted birth rate versus child mortality, loggnp held at its mean]

How can we compare the strength of different x variables' effects? Standardized regression coefficients (beta weights) are sometimes used for this purpose, but they imply a specialized definition of "strength" and can easily be misleading. A more substantively meaningful comparison might come from looking at conditional effect plots drawn with identical y scales. This can be accomplished easily by using graph combine, and specifying common y-axis scales, as done in Figure 8.13. The vertical distances traveled by the predicted values curve, particularly over the middle 80% of the x values (between 10th and 90th percentile lines), provide a visual comparison of effect magnitude.

. graph combine fig08_11.gph fig08_12.gph, ycommon cols(2) scale(1.25)

Figure 8.13  [conditional effect plots for per capita GNP and child mortality, common y scales]

Combining several conditional effect plots into one image with common vertical scales, as done in Figure 8.13, allows quick visual comparison of the strength of different effects. Figure 8.13 makes obvious how much stronger is the effect of child mortality on birth rates — as separate plots (Figures 8.11 and 8.12) did not.

Nonlinear Regression — 1

Variable transformations allow fitting some curvilinear relationships using the familiar techniques of intrinsically linear models. Intrinsically nonlinear models, on the other hand, require a different class of fitting techniques. The nl command performs nonlinear regression by iterative least squares. This section introduces it using a dataset of simple examples, nonlin.dta:

Contains data from C:\data\nonlin.dta
  obs:           100                          Nonlinear model examples
                                              (artificial data)
 vars:             5                          16 Jul 2005 14:51
 size:         2,100 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
x               byte   %9.0g                  Independent variable
y1              float  %9.0g                  y1 = 10 * 1.03^x + e
y2              float  %9.0g                  y2 = 10 * (1 - .95^x) + e
y3              float  %9.0g                  y3 = 5 + 25/(1 + exp(-.1*(x-50))) + e
y4              float  %9.0g                  y4 = 5 + 25*exp(-exp(-.1*(x-50))) + e

Sorted by: x

The nonlin.dta data are manufactured, with y variables defined as various nonlinear functions of x, plus random Gaussian errors. y1, for example, represents the exponential growth process y1 = 10 * 1.03^x.
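Although the commands that created nonlin.dta are not shown, a recipe along the following lines would produce similar data. The form of x and the error standard deviation are guesses; invnorm(uniform()) is the same Gaussian random-number idiom used for the artificial data in Chapter 9.

. clear
. set obs 100
. generate x = _n                                     // assumed: x = 1,...,100
. generate y1 = 10 * 1.03^x + 16*invnorm(uniform())   // error scale assumed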
Figure 8.14 graphs predicted values from the previous example, showing the close fit [R2 = .96) between model and data. . predict yhatl (option yhat assumed; fitted values) . graph twoway scatter yl x || line yhatl x, sort II , legend(off) ytitle("yl = 10 * 1.03Ax + e") xtitle(l,x") , Figure 8.14 n oo T- o IT) 20 40 60 80 100 The exp2 part of our nl exp2 yl x command specified a particular exponential growth function by calling a brief program named nlexpl.ado. Stata includes several such programs, defining the following functions: exp3 3-parameter exponential: y = b 0 + b , b 21 exp2 2-parameter exponential: y = b ( £ 2 T exp2a 2-parameter negative exponential: y-b^l - b2r) log4 4-parameter logistic; b0 starting level and (60+ b, ) asymptotic upper limit: y~b, + bj(\+zxp(-b2(x-by))) iog3 3-parameter logistic; 0 starting level and b x asymptotic upper limit: y = />,/(! +Q\p{-b2{x -by))) Fitting Curves 235 <2 234 Statistics with Stata gom4 4-parameter Gompertz; 6^ starting level and (b0 + b,) asymptotic upper limit: gom3 3-parameter Gompertz; 0 starting level and &, asymptotic upper limit: y = Z?! exp(~exp(- b2 (x - 63))) nonlin.dta contains examples corresponding to exp2 (y/), exp2a (y2), log4 (yj), and gom4 (y4) functions. Figure 8.15 shows curves fit by nl to y2,yi, and y4. Figure 8.15 TT 20 40 60 80 100 x 20 40 60 80 100 X .m. exp. g * .' version 1.1.3 12 j un19 program define nlexp2 vers ion 6 if "'1'"-"?" ( global S_2 "2~pars global S_l "bl b2" Approximate initial values by regress local exp "['e{wtype? ' ' e (wexp) ' ]" tempvar Y quietly ( 1 - ' '"(depvar) ' ) if e (sample) rowth curve, $S_E_dePv=bl*b2~ ion of log on X. feg -y. -2r-exP' if e(sample) = log( e •2' 'exp lobal bl = exp(_bl_consl) lobal b2 = exp(_b[ 2 1) g exit } replace ' 1 $bl* ($b2) A'2 This program finds some approximate initial values of the parameters to be estimated, storing these as "global macros" named bl and b2 . It then calculates an initial set of predicted values, as a "local macro'1 named 1 , employing the initial parameter estimates and the model equation: replace '1' = $bl * ($b2)A12' Subsequent iterations of nl will return to this line, calculating new predicted values (replacing the contents of macro 1 ) as they refine the parameter estimates bl and b2 . In Stata programs, the notation $bl means "the contents of global macro bl ." Similarly, the notation 11 ' means "the contents of local macro 1 ." Before attempting to write your own nonlinear function, examine nllog4.ado , nlgom4 . ado , and others as examples, and consult the manual or help nl for explanations. Chapter 14 contains further discussion of macros and other aspects of Stata programming. Nonlinear Regression — 2__ Our second example involves real data, and illustrates some steps that can help in research. Dataset lichen.dta concerns measurements of lichen growth observed on the Norwegian arctic island of Svalbard (from Werner 1990). These slow-growing symbionts are often used to date rock monuments and other deposits, so their growth rates interest scientists in several fields. Contains data from C: \data\lichen.dta obs : 11 Lichen growth (Werner 1 990) vars : 8 14 Jul 2005 14:57 size: 572 (99.9% of memory free) storage di splay value variable name type format label variable labe 1 locale str3l %31s Locality and feature point strl %9s Cont rol po i n t date int % 8 . 0 g Date age int % 8 . 0 g Age in years r short float %9 . Og Rhizocarpon short axis mm r long float % 9 . 0 g Rhizocarpon long axis mm pshort int %8 . 
Og P.minuscula short axis mm p long int % 8 . 0 g P.minuscula long axis mm Sorted by: Lichens characteristically exhibit aperiod of relatively fast early growth, gradually slowing, as suggested by the lowess-smoothed curve in Figure 8.16. 236 Statistics with Stata Fitting Curves 237 Figure 8.16 40 30 20 10 100 200 Age in years 300 400 Lichenometricians seek to summarize and compare such patterns by drawing growth curves. Their growth curves might not employ an explicit mathematical model, but we can fit one here to illustrate the process of nonlinear regression. Gompertz curves are asymmetrical S-curves, which have been widely used to model biological growth: y = b{ exp(- exp(- b2(x-by))) They might provide a reasonable model for lichen growth. If we intend to graph a nonlinear model, the data should contain a good range of closely spaced x values. Actual ages of the 11 lichen samples in lichen.dta range from 28 to 346 years. We can create 89 additional artificial observations, with "ages" from 0 to 352 in 4-year increments, by the following commands: . range newage 0 396 100 obs was 11, now 10 0 replace age = newage[_n-ll] if age >— (89 real changes made) The first command created a new variable, newage, with 100 values ranging from 0 to 396 in 4-year increments. In so doing, we also created 89 new artificial observations, with missing values on all variables except newage. The replace command substitutes the missing artificial-case age values with newage values, starting at 0. The first 15 observations in our data now look like this: list rlong age newage in 1/15 1 rlong age newage 1. 1 1 28 0 2 . 1 5 56 4 3 . 1 12 79 8 4 . 1 14 so 12 5 • 1 13 80 16 1 6 • 1 8 80 20 1 7 . J 7 89 24 1 8 . I 10 89 28 1 9 - 1 34 34 6 32 i 10 1 34 346 36 1 11 1 25.5 131 4 0 1 1 2 1 0 44 1 13 1 4 48 1 14 1 8 52 1 1 5 1 12 5 6 1 summarize rlong age newage Variable 1 Obs Mean r lo ng 1 11 14 . 86364 age I 1 00 17fj . 68 newage 1 100 198 Min 11.31391 104.7042 116 . 046 34 352 396 Wenowcould drop newage. Only the original 11 observations have nonmissing/Vo/zg values, so only they will enter into model estimation. Stata calculates predicted values for any observation with nonmissing x values, however. We can therefore obtain such predictions for both the 11 real observations and the 89 artificial ones, which will allow us to graph the regression curve accurately. Lichen growth starts with a size close to zero, so we chose the gom3 Gompertz function ratherthan gom4 (which incorporates a nonzero takeoff level, the parameter/^). Figure8.16 suggests an asymptotic upper limit somewhere near 34, suggesting that 34 should be a good guess or starting value of the parameter b j. Estimation of this model is accomplished by nl gom3 rlong age, init(Bl=34) nolog (obs = 11) Source | ss df MS Number of obs = 11 Model 1 Residual I 3 63 3 . 161 12 77 . 088881 5 3 12 8" 9 . 11 .05371 63611018 F( 3, 8} = Prob > F R-squared - 125 . 68 0.0000 0. 97 92 Total 1 3710.25 11 33 7 .2 954 55 Adj R-squared = Root MSE 0.9714 3.104208 3-parameter Gompertz function, rlong = b 1 * exp(- exp (- b2* Res. dev. (age-b3))) 52.63435 rlong ) Coef . Std. Err t P> 1 t I [95% Conf. Interval! bl 1 b2 1, b3 1 34.36637 .0217685 88 . 7 9701 2 . 2671 8 6 . 00 60806 5 . 63254 5 15.16 3 . 53 15.76 0 . 0 . 0 . 000 007 000 2 9.13823 . 007 74 65 7 5 . 80834 39.59451 .0357904 101.7857 (SE's, P values, CJ's, and correlations are asymptotic approximations) A nolog option suppresses displaying a log of iterations with the output. 
All three parameter estimates differ significantly from 1. Sr? B ■ft l 238 Statistics with Stata We obtain predicted values using predict, and graph these to see the regression curve. The yline option is used to display the lower and estimated upper limits (0 and 34.366) of this curve in Figure 8.17. . predict yhat (option yhat assumed; fitted values) , graph twoway scatter rlong age line yhat «ge, clpattern(solid) bands(50) ine(0 34. ytitleC'Rhizocarpon long axis, mm") I I rnsp. I , , iegend VU-(° 34.366)^^ xlabel(0 (100)400 , grid) Figure 8.17 200 Age in years 400 Especially when working with sparse data or a relatively complex model, nonlinear regression programs can be quite sensitive to their initial parameter estimates. The init option with nl permits researchers to suggest their own initial values if the default values supplied by an nlfunction program do not seem to work. Previous experience with similar data, or publications by other researchers, could help supply suitable initial values. Alternatively, we could estimate through trial and error by employing generate to calculate predicted values based on arbitrarily-chosen sets of parameter values and graph to compare the resulting predictions with the data. Robust Regression Stata's basic regress and anova commands perform ordinary least squares (OLS) regression. The popularity of OLS derives in part from its theoretical advantages given "ideal" data. If errors are normally, independently, and identically distributed (normal i.i.d.), then OLS is more efficient than any other unbiased estimator. The flip side of this statement often gets overlooked: if errors are not normal, or not i.i.d., then other unbiased estimators might outperform OLS. In fact, the efficiency of OLS degrades quickly in the face of heavy-tailed (outlier-prone) error distributions. Yet such distributions are common in many fields. OLS tends to track outliers, fitting them at the expense of the rest of the sample. Over the long run, this leads to greater sample-to-sample variation or inefficiency when samples often contain outliers. Robust regression methods aim to achieve almost the efficiency of OLS with ideal data and substantially better-than-OLS efficiency in non-ideal (for example, nonnormal errors) situations. "Robust regression" encompasses a variety of differenttechniques, each with advantages and drawbacks for dealing with problematic data. This chapter introduces two varieties of robust regression, rreg and qreg, and briefly compares their results with those of OLS ( regress ). rreg and qreg resist the pull of outliers, giving them better-than-OLS efficiency in the face of nonnormal, heavy-tailed error distributions. They share the OLS assumption that errors are independent and identically distributed, however. As a result, their standard errors, tests, and confidence intervals are not trustworthy in the presence of heteroskedasticity or correlated errors. To relax the assumption of independent, identically distributed errors when using regress or certain other modeling commands (although not rreg or qreg), Stata offers options that estimate robust standard errors. For clarity, this chapter focuses mostly on two-variable examples, but robust multiple regression or iV-way ANOVA are straightforward using the same commands. Chapter 14 returns to the topic of robustness, showing how we can use Monte Carlo experiments to evaluate competing statistical techniques. 
Several of the techniques described in this chapter are available through menu selections: Statistics - Nonparametric analysis - Quantile regression Statistics - Linear regression and related - Linear regression - Robust SE 240 Statistics with Stata Example Commands rreg y xl x2 x3 Performs robust regression of y on three predictors, using iteratively reweighted least squares with Huber and biweight functions tuned for 95% Gaussian efficiency. Given appropriately configured data, rreg can also obtain robust means, confidence intervals, difference of means tests, and ANOVA or ANCOVA. rreg y xl x2 x3, nolog tune (6) genwt(rweight) iterate(lO) Performs robust regression ofy on three predictors. The options shown above tell Stata not to print the iteration log, to use a tuning constant of 6 (which downweights outliers more steeply than the default 7), to generate a new variable (arbitrarily named rweight) holding the final-iteration robust weights for each observation, and to limit the maximum number of iterations to 10. qreg y xl x2 x3 Performs quantile regression, also known as least absolute value (LAV) or minimum Ll-norm regression, ofy on three predictors. By default, qreg models the conditional .5 quantile (approximate median) ofy as a linear function of the predictor variables, and thus provides "median regression." qreg y xl x2 x3, quantile(.25) Performs quantile regression modeling the conditional .25 quantile (first quartile) of y as a linear function of xl, x2, andxJ. bsqreg y xl x2 x3, rep (100) Performs quantile regression, with standard errors estimated by bootstrap data resampling with 100 repetitions (default is rep (20)). predict e, resid Calculates residual values (arbitrarily named e) after any regress, rreg, qreg, or bsqreg command. Similarly, predict yhat calculates the predicted values of y. Other predict options apply, with some restrictions. regress y xl x2 x3, robust Performs OLS regression of y on three predictors. Coefficient variances, and hence standard errors, are estimated by a robust method (Huber/White or sandwich) that does not assume identically distributed errors. With the cluster () option, one source of correlation among the errors can be accommodated as well. The User's Guide describes the reasoning behind these methods. Regression with Ideal Data To clarify the issue of robustness, we will explore the small (n = 20) contrived dataset robustl.dta: Contains data from C:\data\robustl.dta obs: 2 0 vars : size: 10 Robu st regression examples 1 (artificial data) 17 Jul 2005 09:35 (99.9% of memory free) f Robust Regression 241 variable n, storage display type format value label variable label el yi e2 y2 x3 e3 y3 e4 y4 float float float float float float float float float float %9 %9 Og Og Og Og Og Og og Og Og Normal X Normal errors yl - 10 + 2*x Normal y2 10 errors with 1 outlier + e2 Normal X with 1 leverage obs Normal errors with 1 extreme y3 - 10 + 2*x3 + e3 Skewed errors Y4 = 10 + 2*x + e4 Sorted by: The variables x and el each contain 20 random values from independent standard normal distributions, yl contains 20 values produced by the regression model: yl = 10 + 2x + e7 The commands that manufactured these first three variables are clear set obs 20 generate x = invnorm(uniform()) generate el = invnorm(uniform()) generate yl * 10 + 2*x + el With real data, coding mistakes and measurement errors sometimes create wildly incorrect values. To simulate this, we might shift the second observation's error from-0.89 to 19.89: . generate e2 = el . 
replace e2 - 19.89 in 2 . generate y2 = 10 + 2*x + e2 Similar manipulations produce the other variables in robustl.dta. yl and x present an ideal regression problem: the expected value of yl really is a linear function of x, and errors come from normal, independent, and identical distributions — because we defined them that way. OLS does a good job of estimating the true intercept (10) and slope (2), obtaining the line shown in Figure 9.1. . regress yl x Source df MS Model I 134.059351 Residual | 22 . 291 57 1 134.059351 !8 1.23842055 Total [ 156.350921 19 8.22899586 Number of obs F< 1, 18) Prob > F R-squared Adj R-squared Root MSE 20 108 . 25 0 . 0000 0 . 8574 0.84 95 1 . 1128 yl Coef Std. E. P> I t I [95% Conf. Interval] x I cons I 2 . 04 8057 9.963161 -1968465 ■ 24 99861 10.40 0.000 39.85 0.000 1.634498 9.43796 2.461616 10.48836 predict yhatlo 242 Statistics with Stata Robust Regression 243 e < to jraph twoway scatter yl x || line yhatlc x, clpattern(solid) sort ytitleC'yl = 10 + 2*x + el") legend (order (2 label(2 "OLS line") position(11) ring(0) cols(1)) Figure 9.1 An iteratively reweighted least squares (IRLS) procedure, rreg, obtains robust regression estimates. The first rreg iteration begins with OLS. Any observations so influential as to have Cook's D values greater than I are automatically set aside after this first step. Next, weights are calculated for each observation using a Huber function, which downweights observations that have larger residuals, and weighted least squares is performed. After several WLS iterations, the weight function shifts to a Tukey biweight (as suggested by Li 1985), tuned for 95% Gaussian efficiency (see Hamilton 1992a for details), rreg estimates standard errors and tests hypotheses using a pseudovalues method (Street, Carrol I and Ruppert 1988) that does not assume normality. . rreg yl x Huber iteration Huber iteration Biweight iteration Biweight iteration Biweight iteration maximum difference in weights maximum difference in weights - urn difference in weights - difference in weights = difference in weights - maxim maximum ma ximum Ro bust regression estimates . 3577 4 4 .021815 .144213 .013202 .002654 Number F( 1, Prob > 07 78 71 7 6 08 of obs 18) F 20 79 . 96 0.0000 Err . t X 1 2 047813 .22 9 004 9 8259 8 34 94 . 17 0.000 9 . 3251 61 10.54717 This 4 ideal data" example includes no serious outliers, so here rreg is unneeded. The rreg intercept and slope estimates resemble those obtained by regress (and are not far from the true values 10 and 2), but they have slightly larger estimated standard errors. Given normal i.i.d. errors, as in this example, rreg theoretically possesses about 95% of the efficiency of OLS. rreg and regress both belong to the family of Af-estimators (for maximum-likelihood). An alternative order-statistic strategy calledL-estimation fits quantiles ofy, rather than its expectation or mean. For example, we could model how the median (.5 quantile) ofy changes with x. qreg , an /./-type estimator, accomplishes such quantile regression and provides another method with good resistance to outliers: . qreg yl x Iteration 1: WLS sum of weighted deviations = 17.7 11531 Iteration 1: Iteration 2: sum of abs. sum of abs. weighted deviations -weighted deviations = 1 7 . 1 30001 1 6 . 858602 Median regres Raw sum of Min sum of s i on dev i at i ons de v i at i. ons 46.84 (about 1 6.858 6 10.4) Numbe r Pseu do of obs = R2 20 0 .6401 yi 1 Coef. Std. Err. t P>1t 1 [ 95% Conf. Interval] X cons 1 2.139896 1 9.65342 . 2590447 .3564108 8 . 
26 27.09 0.000 L 0 . 000 8 .595664 .904 628 2.684 129 10.40221 Although qreg obtains reasonable parameter estimates, its standard errors here exceed those of regress (OLS) and rreg. Given ideal data, qreg is the least efficient of these three estimators. The following sections view their performance with less ideal data. /Outliers The variable v2 is identical to v7, but with one outlier caused by the "wild" error of observation #2. OLS has little resistance to outliers, so this shift in observation #2 (at upper left in Figure 9.2) substantially changes the regress results: . regress y2 x Source | SS df MS Number of obs = 20 -------------+------------------------------ F ( 1 , 18 ) = 0 . 97 Model I 18.764271 1 18.764271 Prob > F = 0.3378 Residual I 348.233471 18 19.3463039 R-squared = 0.0511 -------------+------------------------------ Adj R-squared - -0.0016 Total I 366.997742 19 19.3156706 Root MSE = 4.3984 y2 | Coef. Std. Err. t P>ItI [95% Conf. Interval] x | .7662304 .7780232 0.98 0.338 -.8683356 2.400796 cons | 11.1579 .9880542 11.29 0.000 9.082078 13.23373 e to ft' \ »4 244 Statistics with Stata . predict yhat2o (option xb assumed; fitted values) . label variable yhat2o "OLS line (regress)" The outlier raises the OLS intercept (from 9.936 to 1 1.1579) and lessens the slope (from 2.048 to 0.766). R: has dropped from .8574 to .0511. Standard errors quadrupled, and the OLS slope (solid line in Figure 9.2) no longer significantly differs from zero. The outlier has little impact on rreg, however, as shown by the dashed line in Figure 9.2. The robust coefficients barely change, and remain close to the true parameters 10 and 2; nor do the robust standard errors increase much. rreg y2 x, nolog genwt{rwexght2) Robust regression estimates Number of obs -F( 1, I7) = Prob > F 19 63 . 01 0 .0000 x 1 1.979015 .2493146 32 . 59 0. 000 9 .360966 10 . 65 695 . predict yhat2r (option xb assumed; fitted values) . label variable yhat2r "robust regression (rreg)" . graph twoway scatter y2 x || line yhat2o x, clpattern(solid) sort I 1 line yhat2r x, clpattern(longdash) sort 1) , ytitle("y2 = 10 + 2*x + e2") legend(order(2 3) position(l) ring(O) cols(l) margin(sides)) ___Figure 9.2 OLS regression (regress) robust regression (rreg) Robust Regression 245 The nolog option above caused Stata not to print the iteration log. The genwt {rweight2) option saved robust weights as a variable named rweight2. . predict res±d2rr resid list y2 x resid2r rweight2 1 v2 X resid2r rwe igh 12 1. ] 5 37 -1 97 - . 740307 1 . 9464 4 4 65 2 . 1 26 19 -1 85 19.84221 3 . 1 5 93 -1 74 -.6354806 . 96037 073 4 . 1 8 58 -1 36 1.262494 .8493384 5 . 1 6 16 -1 07 -1 . 731 42 1 . 7257 631 6 . 1 9 80 -0 69 1.]56554 .87273631 7 . 1 8 12 -0 55 -.8005085 . 937583 91 8 . 1 10 40 -0 4 9 1.36075 . 82606386 9 . I 9 35 -0 42 . 17222 . 99712 388 10 . 1 11 16 0 33 .4979582 .97581674 11 . 1 11 40 0 44 . 5202664 . 97360863 12 . 1 13 26 0 69 1.885513 .68048066 13 . 1 10 88 0 78 - . 6725982 . 95572 833 14 . t 9 58 0 79 -1 . 99238 9 . 64 64491 8 15 . 1 12 41 1 26 -,0 925257 .99913568 16 . 1 14 14 1 27 1 . 61 7685 .7588707 3 1 7 . 1 12 66 1 47 - .258 1 1 8 9 . 99338589 18 . 1 12 74 1 61 -.4551811 .97957817 19 . 1 12 10 1 81 -.89098 3 9 . 9230704 1 20 . 1 14 19 2 12 -.0144787 .99997651 Residuals near zero produce weights near one; farther-out residuals get progressively lower weights. Observation #2 has been automatically set aside as too influential because of Cook's D > 1. 
rreg assigns its rweight2 as "missing," so this observation has no effect on the final estimates. The same final estimates, although not the correct standard errors or tests, could be obtained using regress with analytical weights (results not shown): regress y2 x [aweight = rweight2] Applied to the regression of y2 on x, qreg also resists the outlier's influence and performs better than regress , but not as well as rreg. qreg appears less efficient than rreg , and in this sample its coefficient estimates are slightly farther from the true values of 10 and 2. qreg y2 x, nolog Median regression Number of obs = 20 Raw surn of dev ia t i ons 56.68 (about 10 88) Min sum of devi a t lons 3 6 20036 Pseudo R2 = 0 . 3613 1 Coef. Std. Err. t P> 1 t 1 [95% Conf. Interval] X 1 1.821428 . 4 1 0594 4 4 44 0.000 .9588014 2 . 684 055 c ons 1 10.115 . 508852 6 19 88 0.000 9 . 04 594 1 1 1 . 1 8406 ES < 246 Statistics with Stata Monte Carlo researchers have also noticed that the standard errors calculated by qreg sometimes underestimate the true sample-to-sample variation, particularly with smaller samples. As an alternative, Stata provides the command bsqreg, which performs the same median or quantile regression as qreg , but employs bootstrapping (data resampling) to estimate the standard errors. The option rep ( ) controls the number of repetitions. Its default is rep (20), which is enough for exploratory work. Before reaching "final" conclusions, we might take the time to draw 200 or more bootstrap samples. Both qreg and bsqreg fit identical models. In the example below, bsqreg also obtains similar standard errors. Chapter 14 returns to the topic of bootstrapping. . bsqreg y2 x, rep (50) (fitting base model) (bootstrapping ..................................................) Median regression, bootstrap(50) SEs Number of obs = 20 Raw sum of deviations 56.68 (about 10.88) Min sum of deviations 36.20036 Pseudo R2 - 0.3613 y2 | Coef. Std.Err. t P>ItI [95%Conf. Interval] x | 1.821428 .4084728 4.46 0.000 .9632587 2.679598 cons | 10.115 .4774718 21.18 0.000 9.111869 11.11813 X Outliers (Leverage)_ rreg, qreg, and bsqreg deal comfortably withy-outliers, unless the observations with unusual y values have unusual x values (leverage) too. The y3 and x3 variables in robust.dta present an extreme example of leverage. Apart from the leverage observation (#2), these variables equal yl andjc. The high leverage of observation #2, combined with its exceptional y3 value, make it influential: regress and qreg both track this outlier, reporting that the "best-fitting" line has a negative slope (Figure 9.3). regress y3 x3 Source | SS df MS Number of obs = 20 ___----------+------------------------------ F ( lf 18) = 11 . 01 Model | 139.306724 1 139.306724 Prob > F - 0.0038 Residual | 227.691018 18 12.649501 R-squared = 0.3796 -------------+------------------------------ Adj R-squared = 0.3451 Total | 366.9 97 742 19 19.3156706 Root MSE - 3.5566 y3 I Coef. Std. Err. t P>It I [95% Conf. Interval] x3 | -.6212248 .1871973 -3.32 0.004 -1.014512 -.227938 _cons I 10.80931 .8063436 13.41 0.000 9.115244 12.50337 predict yhat3o label variable yhat3o "OLS regression (regress)" Robust Regression 247 qreg y3 x3, nolog Median regression Raw sum of deviations 56.68 (about 10 Min sum of deviations 56.19466 Number of obs Pseudo R2 20 0 . 0086 y3 Coef Std. Err. x3 I cons I .6222217 11.36533 .347103 1 - 419214 -1.79 P> I t I 0.090 [95% Conf. int erval] - 1 . 35 1 458 1070 1 4 f. 
Outliers (Leverage)

rreg, qreg, and bsqreg deal comfortably with y-outliers, unless the observations with unusual y values have unusual x values (leverage) too. The y3 and x3 variables in robust1.dta present an extreme example of leverage. Apart from the leverage observation (#2), these variables equal y1 and x. The high leverage of observation #2, combined with its exceptional y3 value, makes it influential: regress and qreg both track this outlier, reporting that the "best-fitting" line has a negative slope (Figure 9.3).

. regress y3 x3

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =   11.01
       Model |  139.306724     1  139.306724           Prob > F      =  0.0038
    Residual |  227.691018    18   12.649501           R-squared     =  0.3796
-------------+------------------------------           Adj R-squared =  0.3451
       Total |  366.997742    19  19.3156706           Root MSE      =  3.5566

          y3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x3 |  -.6212248   .1871973    -3.32   0.004    -1.014512     -.227938
       _cons |   10.80931   .8063436    13.41   0.000     9.115244    12.50337

. predict yhat3o
. label variable yhat3o "OLS regression (regress)"

. qreg y3 x3, nolog

Median regression                               Number of obs =      20
  Raw sum of deviations    56.68 (about 10.88)
  Min sum of deviations 56.19466                Pseudo R2     =  0.0086

          y3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x3 |  -.6222217    .347103    -1.79   0.090    -1.351458     .1070147
       _cons |   11.36533   1.419214     8.01   0.000     8.383676    14.34699

. predict yhat3q
. label variable yhat3q "median regression (qreg)"

. rreg y3 x3, nolog

Robust regression estimates                     Number of obs =      19
                                                F(  1,    17) =   63.01
                                                Prob > F      =  0.0000

          y3 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x3 |   1.979015   .2493146     7.94   0.000     1.453007    2.505023
       _cons |   10.00897   .3071265    32.59   0.000     9.360966    10.65695

. predict yhat3r
. label variable yhat3r "robust regression (rreg)"

. graph twoway scatter y3 x3
    || line yhat3o x3, clpattern(solid) sort
    || line yhat3r x3, clpattern(longdash) sort
    || line yhat3q x3, clpattern(shortdash) sort
    ||, ytitle("y3 = 10 + 2*x + e3") ylabel(-30(10)30)
       legend(order(4 3 2) position(5) ring(0) cols(1) margin(sides))

Figure 9.3. y3 = 10 + 2*x + e3 versus normal x with 1 leverage observation; lines show median regression (qreg), robust regression (rreg), and OLS regression (regress).

Figure 9.3 illustrates that regress and qreg are not robust against leverage (x-outliers). The rreg program, however, not only downweights large-residual observations (which by itself gives little protection against leverage), but also automatically sets aside observations with Cook's D (influence) statistics greater than 1. This happened when we regressed y3 on x3: rreg ignored the one influential observation and produced a more reasonable regression line with a positive slope, based on the remaining 19 observations. (Its estimates match those from rreg y2 x, because both fits rest on the same 19 non-outlying observations.)

Setting aside high-influence observations, as done by rreg, provides a simple but not foolproof way to deal with leverage. More comprehensive methods, termed bounded-influence regression, also exist and could be implemented in a Stata program. The examples in Figures 9.2 and 9.3 involve single outliers, but robust procedures can handle more. Too many severe outliers, or a cluster of similar outliers, might cause them to break down. But in such situations, which are often noticeable in diagnostic plots, the analyst must question whether fitting a linear model makes sense. It might be worthwhile to seek an explicit model for what is causing the outliers to be different. Monte Carlo experiments (illustrated in Chapter 14) confirm that estimators like rreg and qreg generally remain unbiased, with better-than-OLS efficiency, when applied to heavy-tailed (outlier-prone) but symmetrical error distributions. The next section illustrates what can happen when errors have asymmetrical distributions.

Asymmetrical Error Distributions

The variable e4 in robust1.dta has a skewed and outlier-filled distribution: e4 equals e1 (a standard normal variable) raised to the fourth power, and then adjusted to have 0 mean. These skewed errors, plus the linear relationship with x, define the variable y4 = 10 + 2x + e4.
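Variables like these are easy to manufacture; a minimal sketch of the kind of commands involved, assuming e1 and x already exist (the dataset's actual construction may differ in details such as the random-number seed):

. generate e4 = e1^4
. quietly summarize e4
. replace e4 = e4 - r(mean)       // center e4 on 0
. generate y4 = 10 + 2*x + e4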
Regardless of an error distribution's shape, OLS remains an unbiased estimator. Over the long run, its estimates should center on the true parameter values.

. regress y4 x

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =    6.97
       Model |  155.870383     1  155.870383           Prob > F      =  0.0166
    Residual |  402.341909    18  22.3523283           R-squared     =  0.2792
-------------+------------------------------           Adj R-squared =  0.2392
       Total |  558.212291    19  29.3795943           Root MSE      =  4.7278

          y4 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   2.208388   .8362862     2.64   0.017     .4514157     3.96536
       _cons |   9.975681   1.062046     9.39   0.000     7.744406    12.20696

The same is not true for most robust estimators. Unless errors are symmetrical, the median line fit by qreg, or the biweight line fit by rreg, does not theoretically coincide with the expected-y line estimated by regress. So long as the errors' skew reflects only a small fraction of their distribution, rreg might exhibit little bias. But when the entire distribution is skewed, as with e4, rreg will downweight mostly one side, resulting in noticeably biased y-intercept estimates.

. rreg y4 x, nolog

Robust regression estimates                     Number of obs =      20
                                                F(  1,    18) = 1319.29
                                                Prob > F      =  0.0000

          y4 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.952073   .0537435    36.32   0.000     1.839163    2.064984
       _cons |   7.476669   .0682518   109.55   0.000     7.333278    7.620061

Although the rreg y-intercept in Figure 9.4 is too low, the slope remains parallel to the OLS line and the true model. In fact, being less affected by outliers, the rreg slope (1.95) is closer to the true slope (2) and has a much smaller standard error than that of regress. This illustrates the tradeoff of using rreg or similar estimators with skewed errors: we risk getting biased estimates of the y-intercept, but can still expect unbiased and relatively precise estimates of other regression coefficients. In many applications, such coefficients are substantively more interesting than the y-intercept, making the tradeoff worthwhile. Moreover, the robust t and F tests, unlike those of OLS, do not assume normal errors.

Figure 9.4. y4 versus normal x: true model, OLS regression (regress), and robust regression (rreg) lines.

Robust Analysis of Variance

Contains data from C:\data\faculty.dta
  obs:           226                          College faculty salaries
 vars:             6                          17 Jul 2005 09:32
 size:         2,938 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------
rank            byte   %8.0g       rank       Academic rank
gender          byte   %8.0g       sex        Gender (dummy variable)
female          byte   %8.0g                  Gender (effect coded)
assoc           byte   %8.0g                  Assoc Professor (effect coded)
full            byte   %8.0g                  Full Professor (effect coded)
pay             float  %9.0g                  Annual salary
------------------------------------------------------------------------------
Sorted by:

Faculty salaries increase with rank. In this sample, men have higher average salaries:

. table gender rank, contents(mean pay)

----------+--------------------------------
Gender    |
(dummy    |          Academic rank
variable) |   Assist      Assoc       Full
----------+--------------------------------
     Male |    29280   38622.22    52084.9
   Female | 28711.04   38019.05      47190
----------+--------------------------------

An ordinary (OLS) analysis of variance indicates that both rank and gender significantly affect salary. Their interaction is not significant.

. anova pay rank gender rank*gender

                       Number of obs =     226     R-squared     =  0.7305
                       Root MSE      = 5108.21     Adj R-squared =  0.7244

            Source |  Partial SS    df       MS           F     Prob > F
       ------------+----------------------------------------------------
             Model |  1.5560e+10     5  3.1120e+09     119.26     0.0000
                   |
              rank |  7.6124e+09     2  3.8062e+09     145.87     0.0000
            gender |   127361829     1   127361829       4.88     0.0282
       rank*gender |  87997720.1     2  43998860.1       1.69     0.1876
                   |
          Residual |  5.7406e+09   220  26093824.5
       ------------+----------------------------------------------------
             Total |  2.1300e+10   225  94668810.3

But salary is not normally distributed, and the senior-rank averages reflect the influence of a few highly paid outliers. Suppose we want to check these results by performing a robust analysis of variance. We need effect-coded versions of the rank and gender variables, which this dataset also contains.

. tabulate gender female

    Gender |
    (dummy |  Gender (effect coded)
 variable) |        -1          1 |     Total
-----------+----------------------+----------
      Male |       149          0 |       149
    Female |         0         77 |        77
-----------+----------------------+----------
     Total |       149         77 |       226

. tabulate rank assoc

  Academic |  Assoc Professor (effect coded)
      rank |        -1          0          1 |     Total
-----------+---------------------------------+----------
    Assist |        64          0          0 |        64
     Assoc |         0          0        105 |       105
      Full |         0         57          0 |        57
-----------+---------------------------------+----------
     Total |        64         57        105 |       226

. tab rank full

  Academic |   Full Professor (effect coded)
      rank |        -1          0          1 |     Total
-----------+---------------------------------+----------
    Assist |        64          0          0 |        64
     Assoc |         0        105          0 |       105
      Full |         0          0         57 |        57
-----------+---------------------------------+----------
     Total |        64        105         57 |       226

If faculty.dta did not already have these effect-coded variables (female, assoc, and full), we could create them from gender and rank using a series of generate and replace statements, as sketched below.
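A minimal sketch of such statements, assuming rank is coded 1 = Assist, 2 = Assoc, 3 = Full and gender is coded 0 = male, 1 = female (the actual codes in faculty.dta might differ):

. generate female = -1
. replace female = 1 if gender == 1
. generate assoc = 0
. replace assoc = -1 if rank == 1
. replace assoc = 1 if rank == 2
. generate full = 0
. replace full = -1 if rank == 1
. replace full = 1 if rank == 3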
We also need two interaction terms representing female associate professors and female full professors:

. generate femassoc = female*assoc
. generate femfull = female*full

Males and assistant professors are "omitted categories" in this example. Now we can duplicate the previous ANOVA using regression:

. regress pay assoc full female femassoc femfull

      Source |       SS       df       MS              Number of obs =     226
-------------+------------------------------           F(  5,   220) =  119.26
       Model |  1.5560e+10     5  3.1120e+09           Prob > F      =  0.0000
    Residual |  5.7406e+09   220  26093824.5           R-squared     =  0.7305
-------------+------------------------------           Adj R-squared =  0.7244
       Total |  2.1300e+10   225  94668810.3           Root MSE      =  5108.2

         pay |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       assoc |  -663.8995   543.8499    -1.22   0.223    -1735.722    407.9229
        full |   10652.92   783.9227    13.59   0.000     9107.957    12197.88
      female |  -1011.174   457.6938    -2.21   0.028    -1913.199   -109.1483
    femassoc |   709.5864   543.8499     1.30   0.193    -362.2359    1781.409
     femfull |  -1436.277   783.9227    -1.83   0.068    -2981.236    108.6819
       _cons |   38984.53   457.6938    85.18   0.000     38082.51    39886.56

. test assoc full

 ( 1)  assoc = 0.0
 ( 2)  full = 0.0

       F(  2,   220) =  145.87
            Prob > F =  0.0000

. test female

 ( 1)  female = 0.0

       F(  1,   220) =    4.88
            Prob > F =  0.0282

. test femassoc femfull

 ( 1)  femassoc = 0.0
 ( 2)  femfull = 0.0

       F(  2,   220) =    1.69
            Prob > F =  0.1876

regress followed by the appropriate test commands obtains exactly the same R² and F test results that we found earlier using anova. Predicted values from this regression equal the mean salaries.

. predict predpay1
(option xb assumed; fitted values)
. label variable predpay1 "OLS predicted salary"

. table gender rank, contents(mean predpay1)

----------+--------------------------------
Gender    |
(dummy    |          Academic rank
variable) |   Assist      Assoc       Full
----------+--------------------------------
     Male |    29280   38622.22    52084.9
   Female | 28711.04   38019.05      47190
----------+--------------------------------

Predicted values (means), R², and F tests would also be the same regardless of which categories we chose to omit from the regression. Our "omitted categories," males and assistant professors, are not really absent. Their information is implied by the included categories: if a faculty member is not female, he must be male, and so forth. To perform a robust analysis of variance, apply rreg to this model:

. rreg pay assoc full female femassoc femfull, nolog

Robust regression estimates                     Number of obs =     226
                                                F(  5,   220) =  138.25
                                                Prob > F      =  0.0000

         pay |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       assoc |  -315.6463   458.1588    -0.69   0.492    -1218.588    587.2956
        full |   9765.296   660.4048    14.79   0.000     8463.767    11066.83
      female |  -749.4949   385.5778    -1.94   0.053    -1509.394    10.40395
    femassoc |   197.7833   458.1588     0.43   0.666    -705.1587    1100.725
     femfull |   -913.348   660.4048    -1.38   0.168    -2214.878    388.1815
       _cons |   38331.87   385.5778    99.41   0.000     37571.97    39091.77

. test assoc full

 ( 1)  assoc = 0.0
 ( 2)  full = 0.0

       F(  2,   220) =  182.67
            Prob > F =  0.0000

. test female

 ( 1)  female = 0.0

       F(  1,   220) =    3.78
            Prob > F =  0.0532

. test femassoc femfull

 ( 1)  femassoc = 0.0
 ( 2)  femfull = 0.0

       F(  2,   220) =    1.16
            Prob > F =  0.3144

rreg downweights several outliers, mainly highly paid male full professors.
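Which observations were downweighted could be checked directly by saving the robust weights, as we did earlier with genwt(); a minimal sketch (the weight-variable name is arbitrary):

. rreg pay assoc full female femassoc femfull, nolog genwt(rwpay)
. table gender rank, contents(mean rwpay)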
To see the robust means, again use predicted values:

. predict predpay2
(option xb assumed; fitted values)
. label variable predpay2 "Robust predicted salary"

. table gender rank, contents(mean predpay2)

----------+--------------------------------
Gender    |
(dummy    |          Academic rank
variable) |   Assist      Assoc       Full
----------+--------------------------------
     Male | 28916.15   38567.93   49760.01
   Female | 28846.29   37464.51   46434.32
----------+--------------------------------

The male-female salary gap among assistant and full professors appears smaller if we use robust means. It does not entirely vanish, however, and the gender gap among associate professors slightly widens. With effect coding and suitable interaction terms, regress can duplicate ANOVA exactly. rreg can do parallel analyses, testing for differences among robust means instead of ordinary means (as regress and anova do). Used in similar fashion, qreg opens the third possibility of testing for differences among medians. For comparison, here is a quantile regression version of the faculty pay analysis:

. qreg pay assoc full female femassoc femfull, nolog

Median regression                               Number of obs =     226
  Raw sum of deviations  1738010 (about 37360)
  Min sum of deviations   798870                Pseudo R2     =  0.5404

         pay |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       assoc |       -760   440.1693    -1.73   0.086    -1627.488    107.4881
        full |      10335   615.7735    16.78   0.000      9121.43    11548.57
      female |  -623.3333   365.1262    -1.71   0.089    -1342.926     96.2594
    femassoc |  -156.6667   440.1693    -0.36   0.722    -1024.155    710.8214
     femfull |  -691.6667   615.7735    -1.12   0.263    -1905.236    521.9031
       _cons |      38300   365.1262   104.90   0.000     37580.41    39019.59

. test assoc full

 ( 1)  assoc = 0.0
 ( 2)  full = 0.0

       F(  2,   220) =  208.94
            Prob > F =  0.0000

. test female

 ( 1)  female = 0.0

       F(  1,   220) =    2.91
            Prob > F =  0.0892

. test femassoc femfull

 ( 1)  femassoc = 0.0
 ( 2)  femfull = 0.0

       F(  2,   220) =    1.60
            Prob > F =  0.2039

. predict predpay3
(option xb assumed; fitted values)
. label variable predpay3 "Median predicted salary"

. table gender rank, contents(mean predpay3)

----------+--------------------------------
Gender    |
(dummy    |          Academic rank
variable) |   Assist      Assoc       Full
----------+--------------------------------
     Male |    28500      38320      49950
   Female |    28950      36760      47320
----------+--------------------------------

Predicted values from this quantile regression closely resemble the median salaries in each subgroup, as we can verify directly:

. table gender rank, contents(median pay)

----------+--------------------------------
Gender    |
(dummy    |          Academic rank
variable) |   Assist      Assoc       Full
----------+--------------------------------
     Male |    28500      38320      49950
   Female |    28950      36590      46530
----------+--------------------------------

qreg thus allows us to fit models analogous to N-way ANOVA or ANCOVA, but involving .5 quantiles or approximate medians instead of the usual means. In theory, .5 quantiles and medians are the same. In practice, quantiles are approximated from actual sample data values, whereas the median is calculated by averaging the two central values if a subgroup contains an even number of observations. The sample median and .5 quantile approximations then can be different, but in a way that does not much affect model interpretation.

Further rreg and qreg Applications

Diagnostic statistics and plots (Chapter 7) and nonlinear transformations (Chapter 8) extend the usefulness of robust procedures as they do in ordinary regression. With transformed variables, rreg or qreg fit curvilinear regression models. rreg can also robustly perform simpler types of analysis. To obtain a 90% confidence interval for the mean of a single variable, y, we could type either the usual confidence-interval command ci:
. ci y, level(90)

Or, we could get exactly the same mean and interval through a regression with no x variables:

. regress y, level(90)

Similarly, we can obtain a robust mean with 90% confidence interval by typing

. rreg y, level(90)

qreg could be used in the same way, but keep in mind the previous section's note about how a .5 quantile found by qreg might differ from a sample median. In any of these commands, the level() option specifies the desired degree of confidence. If we omit this option, Stata automatically displays a 95% confidence interval.

To compare two means, analysts typically employ a two-sample t test (ttest) or one-way analysis of variance (oneway or anova). As seen earlier, we can perform equivalent tests (yielding identical t and F statistics) with regression, for example, by regressing the measurement variable on a dummy variable (here called group) representing the two categories:

. regress y group

A robust version of this test results from typing the following command:

. rreg y group

qreg performs median regression by default, but it is actually a more general tool. It can fit linear models for any quantile of y, not just the median (.5 quantile). For example, commands such as the following analyze how the first quartile (.25 quantile) of y changes with x:

. qreg y x, quant(.25)

Assuming constant error variance, the slopes of the .25 and .75 quantile lines should be roughly the same. qreg thus could perform a check for heteroskedasticity or subtle kinds of nonlinearity.
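One way to carry out such a check is to estimate both quartile lines and compare their slopes. A minimal sketch using Stata's sqreg command, which fits several quantiles simultaneously with bootstrapped standard errors (the replication count is arbitrary):

. sqreg y x, quantiles(.25 .75) reps(100)
. test [q25]x = [q75]x

A significant difference between the two slopes would argue against the constant-error-variance assumption.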
Robust Estimates of Variance — 1

Both rreg and qreg tend to perform better than OLS (regress or anova) in the presence of outlier-prone, nonnormal errors. All of these procedures share the common assumption that errors follow independent and identical distributions, however. If the distributions of errors vary across x values or observations, then the standard errors calculated by anova, regress, rreg, or qreg probably will understate the true sample-to-sample variation, and yield unrealistically narrow confidence intervals.

regress and some other model-fitting commands (although not rreg or qreg) have an option that estimates standard errors without relying on the strong and sometimes implausible assumptions of independent, identically distributed errors. This option uses an approach derived independently by Huber, White, and others that is sometimes referred to as a sandwich estimator of variance. The artificial dataset robust2.dta provides a first example.

Contains data from C:\data\robust2.dta
  obs:           500                          Robust regression examples 2
                                                (artificial data)
 vars:            12                          17 Jul 2005 09:03
 size:        24,500 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------
x               float  %9.0g                  Standard normal x
e5              float  %9.0g                  Standard normal errors
y5              float  %9.0g                  y5 = 10 + 2*x + e5 (normal
                                                i.i.d. errors)
e6              float  %9.0g                  Contaminated normal errors:
                                                95% N(0,1), 5% N(0,10)
y6              float  %9.0g                  y6 = 10 + 2*x + e6 (contaminated
                                                normal errors)
e7              float  %9.0g                  Centered chi-square(1) errors
y7              float  %9.0g                  y7 = 10 + 2*x + e7 (skewed errors)
e8              float  %9.0g                  Normal errors, variance
                                                increases with x
y8              float  %9.0g                  y8 = 10 + 2*x + e8
                                                (heteroskedasticity)
group           byte   %9.0g
e9              float  %9.0g                  Normal errors, mean & variance
                                                increase with cluster
y9              float  %9.0g                  y9 = 10 + 2*x + e9
                                                (heteroskedasticity &
                                                correlated errors)
------------------------------------------------------------------------------
Sorted by:

The y8 errors are heteroskedastic: the spread around the regression line increases with x (Figure 9.5). Because the errors do not follow identical distributions, the standard errors, confidence intervals, and tests calculated by regress are misleading. rreg or qreg would face the same problem.

. regress y8 x

      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  1,   498) =  133.96
       Model |  1607.35658     1  1607.35658           Prob > F      =  0.0000
    Residual |  5975.19162   498  11.9983768           R-squared     =  0.2120
-------------+------------------------------           Adj R-squared =  0.2104
       Total |   7582.5482   499  15.1954874           Root MSE      =  3.4639

          y8 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.819032   .1571612    11.57   0.000     1.510251    2.127813
       _cons |   10.06642    .154919    64.98   0.000     9.762047     10.3708

Figure 9.5. y8 = 10 + 2*x + e8 versus standard normal x: the vertical spread of y8 increases with x.

More credible standard errors and confidence intervals for this OLS regression can be obtained by using the robust option:

. regress y8 x, robust

Regression with robust standard errors          Number of obs =     500
                                                F(  1,   498) =   83.80
                                                Prob > F      =  0.0000
                                                R-squared     =  0.2120
                                                Root MSE      =  3.4639

             |             Robust
          y8 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.819032   .1987122     9.15   0.000     1.428614    2.209449
       _cons |   10.06642   .1561846    64.45   0.000     9.759561    10.37328

Although the fitted model remains unchanged, the robust standard error for the slope is 27% larger (.199 vs. .157) than its nonrobust counterpart. With the robust option, the regression output does not show the usual ANOVA sums of squares because these no longer have their customary interpretation.

The rationale underlying these robust standard-error estimates is explained in the User's Guide. Briefly, we give up on the classical goal of estimating true population parameters (β's) for a model such as

    y_i = β0 + β1 x_i + ε_i

Instead, we pursue the less ambitious goal of simply estimating the sample-to-sample variation that our b coefficients might have, if we drew many random samples and applied OLS repeatedly to calculate b values for a model such as

    y_i = b0 + b1 x_i + e_i

We do not assume that these b estimates will converge on some "true" population parameter. Confidence intervals formed using the robust standard errors therefore lack the classical interpretation of having a certain likelihood (across repeated sampling) of containing the true value of β. Rather, the robust confidence intervals have a certain likelihood (across repeated sampling) of containing b, defined as the value upon which sample b estimates converge. Thus, we pay for relaxing the identically-distributed-errors assumption by settling for a less impressive conclusion.
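The sample-to-sample variation that these robust standard errors try to capture can also be estimated empirically by bootstrapping the OLS coefficients, a topic Chapter 14 develops. A minimal sketch using the bootstrap prefix (the replication count is arbitrary):

. bootstrap _b, reps(1000): regress y8 x

Under heteroskedasticity, bootstrap standard errors for the slope should roughly resemble the robust (sandwich) estimate, since both relax the identically-distributed-errors assumption.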
Robust Estimates of Variance — 2

Another robust-variance option, cluster, allows us to relax the independent-errors assumption in a limited way, when errors are correlated within subgroups or clusters of the data. The data in attract.dta describe an undergraduate social experiment that can be used for illustration. In this experiment, 51 college students were asked to individually rate the attractiveness, on a scale from 1 to 10, of photographs of unknown men and women. The rating exercise was repeated by each participant, given the same photos shuffled in random order, on four occasions during evening social events. Variable ratemale is the mean rating each participant gave to all the male photos in one sitting, and ratefem is the mean rating given to female photos. gender records the participant's (rater's) own gender, and bac his or her blood alcohol content at the time, measured by Breathalyzer.

Contains data from C:\data\attract.dta
  obs:           204                          Perceived attractiveness and
                                                drinking (D. C. Hamilton 2003)
 vars:             8                          18 Jul 2005 17:27
 size:         5,508 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------
id              byte   %9.0g                  Participant number
gender          byte   %9.0g       sex        Participant gender (female)
bac             float  %9.0g                  Blood alcohol content
genbac          float  %9.0g                  gender*bac interaction
relstat         byte   %9.0g       rel        Relationship status (single)
drinkfrq        float  %9.0g                  Days drinking in previous week
ratefem         float  %9.0g                  Rated attractiveness of females
ratemale        float  %9.0g                  Rated attractiveness of males
------------------------------------------------------------------------------
Sorted by:  id

Although the data contain 204 observations, these represent only 51 individual participants. It seems reasonable to assume that disturbances (unmeasured influences on the ratings) were correlated across the repetitions by each individual. Viewing each participant's four rating sessions as a cluster should yield more realistic standard error estimates.
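The interaction variable genbac already exists in attract.dta; had it not, a single generate command would create it:

. generate genbac = gender*bac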
Adding the option cluster(id) to a regression command, as seen below, obtains robust standard errors across clusters defined by id (individual participant).

. regress ratefem bac gender genbac, cluster(id)

Regression with robust standard errors          Number of obs =     204
                                                F(  3,    50) =    7.75
                                                Prob > F      =  0.0002
                                                R-squared     =  0.1264
Number of clusters (id) = 51                    Root MSE      =  1.1219

             |             Robust
     ratefem |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         bac |   2.896741   .8543378     3.39   0.001     1.180753    4.612729
      gender |  -.7299888   .3383096    -2.16   0.036    -1.409504   -.0504141
      genbac |   .2080538   1.708146     0.12   0.904    -3.222859    3.638967
       _cons |   6.486767    .229689    28.24   0.000     6.025423     6.94811

Blood alcohol content (bac) has a significant positive effect: as bac goes up, predicted attractiveness ratings of female photos increase as well. Gender (female) has a negative effect: female participants tended to rate female photos as somewhat less attractive (about .73 lower) than male participants did. The interaction of gender and bac is weak (.21). The intercept- and slope-dummy variable regression model, approximately

    predicted ratefem = 6.49 + 2.90bac - .73gender + .21genbac

can be reduced for male participants (gender = 0) to

    predicted ratefem = 6.49 + 2.90bac - (.73 × 0) + (.21 × 0 × bac)
                      = 6.49 + 2.90bac

and for female participants (gender = 1) to

    predicted ratefem = 6.49 + 2.90bac - (.73 × 1) + (.21 × 1 × bac)
                      = 6.49 + 2.90bac - .73 + .21bac
                      = 5.76 + 3.11bac

The slight difference between the effects of alcohol on males (2.90) and females (3.11) equals the interaction coefficient, .21.

Attractiveness ratings for photographs of males were likewise positively affected by blood alcohol content. Gender has a stronger effect on the ratings of male photos: female participants tended to give male photos much higher ratings than male participants did. For male-photo ratings, the gender*bac interaction is substantial (-4.36), although it falls short of the .05 significance level.

. regress ratemale bac gender genbac, cluster(id)

Regression with robust standard errors          Number of obs =     204
                                                F(  3,    50) =   10.96
                                                Prob > F      =  0.0000
                                                R-squared     =  0.3516
Number of clusters (id) = 51                    Root MSE      =  1.3931

             |             Robust
    ratemale |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         bac |   4.246042   2.261792     1.88   0.066    -.2969004    8.788985
      gender |   2.443216   .4529047     5.39   0.000      1.53353    3.352902
      genbac |  -4.364301   3.573689    -1.22   0.228    -11.54227    2.813663
       _cons |   3.628043   .2504253    14.49   0.000     3.125049    4.131037

OLS regression with robust standard errors, estimated by regress with the robust option, should not be confused with the robust regression estimated by rreg. Despite similar-sounding names, the two procedures are unrelated, and solve different problems.

The regression equation for ratings of male photos by male participants is approximately

    predicted ratemale = 3.63 + 4.25bac + (2.44 × 0) - (4.36 × 0 × bac)
                       = 3.63 + 4.25bac

and for ratings of male photos by female participants,

    predicted ratemale = 3.63 + 4.25bac + (2.44 × 1) - (4.36 × 1 × bac)
                       = 6.07 - 0.11bac

The difference between the substantial alcohol effect on male participants (4.25) and the near-zero alcohol effect on females (-0.11) equals the interaction coefficient, -4.36. In this sample, males' ratings of male photos increase steeply, and females' ratings of male photos remain virtually steady, as the rater's bac increases. Figure 9.6 visualizes these results in a graph.

Figure 9.6. Rated attractiveness of females (top row) and of males (bottom row) versus blood alcohol content, with separate panels for female and male raters.

We see positive rating-bac relationships across all subplots except for females rating males. The graphs also show other gender differences, including higher bac values among male participants.
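The command that drew Figure 9.6 is not shown in the text; a sketch of one way to produce a similar graph from the models above (new variable names arbitrary, and panel arrangement only approximate):

. quietly regress ratefem bac gender genbac, cluster(id)
. predict Pfem
. graph twoway scatter ratefem bac || line Pfem bac, sort ||, by(gender)

A parallel pair of predict and graph commands using ratemale would produce the lower panels.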
Logistic Regression

The regression and ANOVA methods described in Chapters 5 through 9 require measured dependent or y variables. Stata also offers a full range of techniques for modeling categorical, ordinal, and censored dependent variables. A list of some relevant commands follows. For more details on any of these, type help command.

binreg      Binomial regression (generalized linear models).
blogit      Logit estimation with grouped (blocked) data.
bprobit     Probit estimation with grouped (blocked) data.
clogit      Conditional fixed-effects logistic regression.
cloglog     Complementary log-log estimation.
cnreg       Censored-normal regression, assuming that y follows a Gaussian
            distribution but is censored at a point that might vary from
            observation to observation.
constraint  Defines, lists, and drops linear constraints.
dprobit     Probit regression giving changes in probabilities instead of
            coefficients.
glm         Generalized linear models. Includes options to model logistic,
            probit, or complementary log-log links. Allows the response
            variable to be binary or proportional for grouped data.
glogit      Logit regression for grouped data.
gprobit     Probit regression for grouped data.
heckprob    Probit estimation with selection.
hetprob     Heteroskedastic probit estimation.
intreg      Interval regression, where y is either point data, interval data,
            left-censored data, or right-censored data.
logistic    Logistic regression, giving odds ratios.
logit       Logistic regression, similar to logistic but giving coefficients
            instead of odds ratios.
mlogit      Multinomial logistic regression, with polytomous y variable.
nlogit      Nested logit estimation.
ologit      Logistic regression with ordinal y variable.
oprobit     Probit regression with ordinal y variable.
probit      Probit regression, with dichotomous y variable.
rologit     Rank-ordered logit model for rankings (also known as the
            Plackett-Luce model, exploded logit model, or choice-based
            conjoint analysis).
scobit      Skewed probit estimation.
svy: logit  Logistic regression with complex survey data. Survey (svy)
            versions of many other categorical-variable modeling commands
            also exist.
tobit       Tobit regression, assuming y follows a Gaussian distribution but
            is censored at a known, fixed point (see cnreg for a more general
            version).
xtcloglog   Random-effects and population-averaged cloglog models. Panel (xt)
            versions of logit, probit, and population-averaged generalized
            linear models (see help xtgee) also exist.

After most model-fitting commands, predict can calculate predicted values or probabilities. predict also obtains appropriate diagnostic statistics, such as those described for logistic regression in Hosmer and Lemeshow (2000). Specific predict options depend on the type of model just fitted. A different post-fitting command, predictnl, obtains nonlinear predictions and their confidence intervals (see help predictnl). Examples of several of these commands appear in the next section.

Most of the methods for modeling categorical dependent variables can be found under the following menus:

    Statistics - Binary outcomes
    Statistics - Ordinal outcomes
    Statistics - Categorical outcomes
    Statistics - Generalized linear models (GLM)
    Statistics - Cross-sectional time series
    Statistics - Linear regression and related - Censored regression

After the Example Commands section below, the remainder of this chapter concentrates on an important family of methods called logit or logistic regression. We review basic logit methods for dichotomous, ordinal, and polytomous dependent variables.

Example Commands

logistic y x1 x2 x3
    Performs logistic regression of {0,1} variable y on predictors x1, x2,
    and x3. Predictor variable effects are reported as odds ratios. A closely
    related command, logit y x1 x2 x3, performs essentially the same
    analysis, but reports effects as logit regression coefficients. The
    underlying models fit by logistic and logit are the same, so subsequent
    predictions or diagnostic tests will be identical.

lfit
    Presents a Pearson chi-squared goodness-of-fit test for the fitted
    logistic model: observed versus expected frequencies of y = 1, using
    cells defined by the covariate (x-variable) patterns. When a large number
    of x patterns exist, we might want to group them according to estimated
    probabilities. lfit, group(10) would perform the test with 10
    approximately equal-size groups.

lstat
    Presents classification statistics and classification table. lstat,
    lroc, and lsens (see below) are particularly useful when the point of
    analysis is classification. These commands all refer to the
    previously-fit logistic model.

lroc
    Graphs the receiver operating characteristic (ROC) curve, and calculates
    the area under the curve.

lsens
    Graphs both sensitivity and specificity versus the probability cutoff.

predict phat
    Generates a new variable (here arbitrarily named phat) equal to predicted
    probabilities that y = 1, based on the most recent logistic model.

predict dX2, dx2
    Generates a new variable (arbitrarily named dX2), the diagnostic
    statistic measuring change in Pearson chi-squared, from the most recent
    logistic analysis.

mlogit y x1 x2 x3, base(3) rrr nolog
    Performs multinomial logistic regression of multiple-category variable y
    on three x variables. Option base(3) specifies y = 3 as the base category
    for comparison; rrr calls for relative risk ratios instead of regression
    coefficients; and nolog suppresses display of the log likelihood on each
    iteration.
predict P2, outcome(2)
    Generates a new variable (arbitrarily named P2) representing the
    predicted probability that y = 2, based on the most recent mlogit
    analysis.

glm success x1 x2 x3, family(binomial trials) eform
    Performs a logistic regression via generalized linear modeling using
    tabulated rather than individual-observation data. The variable success
    gives the number of times that the outcome of interest occurred, and
    trials gives the number of times it could have occurred for each
    combination of the predictors x1, x2, and x3. That is, success/trials
    would equal the proportion of times that an outcome such as "patient
    recovers" occurred. The eform option asks for results in the form of odds
    ratios ("exponentiated form") rather than logit coefficients.

cnreg y x1 x2 x3, censored(cen)
    Performs censored-normal regression of measurement variable y on three
    predictors x1, x2, and x3. If an observation's true y value is unknown
    due to left or right censoring, it is replaced for this regression by the
    nearest y value at which censoring occurs. The censoring variable cen is
    a {-1,0,1} indicator of whether each observation's value of y has been
    left censored, not censored, or right censored.

Space Shuttle Data

Our main example for this chapter, shuttle.dta, involves data covering the first 25 flights of the U.S. space shuttle. These data contain evidence that, if properly analyzed, might have persuaded NASA officials not to launch Challenger on its last, fatal flight in 1986 (the 25th shuttle flight, designated STS 51-L). The data are drawn from the Report of the Presidential Commission on the Space Shuttle Challenger Accident (1986) and from Tufte (1997). Tufte's book contains an excellent discussion about data and analytical issues. His comments regarding specific shuttle flights are included as a string variable in these data.

Contains data from C:\data\shuttle.dta
  obs:            25                          First 25 space shuttle flights
 vars:             9                          20 Jul 2005 10:40
 size:         1,675 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------
flight          str8   %8s                    Flight
month           byte   %8.0g                  Month of launch
day             byte   %8.0g                  Day of launch
year            int    %8.0g                  Year of launch
date            int    %8.0g                  Date (days since 1/1/60)
distress        byte   %8.0g       d1         Thermal distress incidents
temp            byte   %8.0g                  Joint temperature, degrees F
damage          byte   %9.0g                  Damage severity index (Tufte 1997)
comments        str55  %55s                   Comments (Tufte 1997)
------------------------------------------------------------------------------
Sorted by:

. list flight-temp, sepby(year)

     +-----------------------------------------------------------+
     |   flight   month   day   year   date   distress   temp |
     |-----------------------------------------------------------|
  1. |    STS-1       4    12   1981   7772       none     66 |
  2. |    STS-2      11    12   1981   7986     1 or 2     70 |
     |-----------------------------------------------------------|
  3. |    STS-3       3    22   1982   8116       none     69 |
  4. |    STS-4       6    27   1982   8213          .     80 |
  5. |    STS-5      11    11   1982   8350       none     68 |
     |-----------------------------------------------------------|
  6. |    STS-6       4     4   1983   8494     1 or 2     67 |
  7. |    STS-7       6    18   1983   8569       none     72 |
  8. |    STS-8       8    30   1983   8642       none     73 |
  9. |    STS-9      11    28   1983   8732       none     70 |
     |-----------------------------------------------------------|
 10. | STS 41-B       2     3   1984   8799     1 or 2     57 |
 11. | STS 41-C       4     6   1984   8862     3 plus     63 |
 12. | STS 41-D       8    30   1984   9008     3 plus     70 |
 13. | STS 41-G      10     5   1984   9044       none     78 |
 14. | STS 51-A      11     8   1984   9078       none     67 |
     |-----------------------------------------------------------|
 15. | STS 51-C       1    24   1985   9155     3 plus     53 |
 16. | STS 51-D       4    12   1985   9233     3 plus     67 |
 17. | STS 51-B       4    29   1985   9250     3 plus     75 |
 18. | STS 51-G       6    17   1985   9299     3 plus     70 |
 19. | STS 51-F       7    29   1985   9341     1 or 2     81 |
 20. | STS 51-I       8    27   1985   9370     1 or 2     76 |
 21. | STS 51-J      10     3   1985   9407       none     79 |
 22. | STS 61-A      10    30   1985   9434     3 plus     75 |
 23. | STS 61-B      11    26   1985   9461     1 or 2     76 |
     |-----------------------------------------------------------|
 24. | STS 61-C       1    12   1986   9508     3 plus     58 |
 25. | STS 51-L       1    28   1986   9524          .     31 |
     +-----------------------------------------------------------+
I STS 51-L 12 28 1986 1986 9508 9 52 4 3 plus 58 I 31 ] - - - + This chapter examines three of the shuttle.dta variables: distress The number of "thermal distress incidents/' in which hot gas blow-through or charring damaged joint seals of a flight's booster rockets. Burn-through of a booster joint seal precipitated the Challenger disaster. Many previous flights had experienced less severe damage, so the joint seals were known to be a source of possible danger. temp The calculated joint temperature at launch time, in degrees Fahrenheit. Temperature depends largely on weather. Rubber O-rings sealing the booster rocket joints become less flexible when cold. date Date, measured in days elapsed since January 1, 1960 (an arbitrary starting point). date is generated from the month, day, and year of launch using the mdy (month-day-year to elapsed time; see help dates ) function: generate date = mdy(month, day, year) . label variable date "Date (days since 1/1/60) " Launch date matters because several changes over the course of the shuttle program might have made it riskier. Booster rocket walls were thinned to save weight and increase payloads, and joint seals were subjected to higher-pressure testing. Furthermore, the reusable shuttle hardware was aging. So we might ask, did the probability of booster joint damage (one or more distress incidents) increase with launch date? distress is a labeled numeric variable: tabulate distress Therma1 distress incidents F req Percent Cum none I 1 or 2 I 3 plus I 39.13 2 6.09 34.78 39 . 13 65 . 22 100 . 00 Total I 23 100.00 Ordinarily, tabulate displays the labels, but the nolabel option reveals that the underlying numerical codes are 0 - "none", 1 - "1 or 2", and 2 = "3 plus." . tabulate distress, nolabel Thermal distress inc idents ! 1 I Freq . Pe rcen t Cum. 0 1 2 1 ! 1 9 6 8 39. 13 26-09 34.78 39 . 13 65 .22 100 . 00 Total I 23 100.00 Logistic Regression 267 We can use these codes to create a new dummy variable, any, coded 0 for no distress and 1 for one or more distress incidents: generate any = distress (2 missing values generated) replace any = 1 if distress == 2 (8 real changes made) label variable any "Any thermal distress" To see what this accomplished, tabulate distress any I Any thermal distress I ! 0 ll Total Thermal distress in c i dent none | 1 or 2 I 3 plus I Total 0 6 8 1 4 23 Logistic regression models how a {0,1} dichotomy such as any depends on one or more x variables. The syntax of logit resembles that of regress and most other model-fitting commands, with the dependent variable listed first. logit any date, coef Iteration 0: log likelihood = -15 . 394543 Iteration 1: log 1i ke1ihood = -13 .01 923 Iteration 2: log 1i kelihood = -12 . 99114 6 Iteration 3: log likelihood = -12 . 991096 Logit estimates Number LR chi of obs 2(1) 23 4 . 81 Log like!iheod = -12 . 991096 Prob > Ps eudo chi2 R2 0.0283 0 . 1 561 any [ Coef. Std. Err z P> 1 z 1 [95% Conf. Inte rva1] date | _cons | . 0020907 .0010703 -1 8.1 31 1 6 9 . 5 172 17 1 . -1 . 95 91 0.051 0.057 -6.93e-06 -3 6 . 78456 .0041884 .5222 3 96 The logit iterative estimation procedure maximizes the logarithm of the likelihood function, shown at the output's top. At iteration 0, the log likelihood describes the fit of a model including only the constant. 
The last log likelihood describes the fit of the final model, I =-18.131 16+ .0020907^ [10.1] where L represents the predicted logit, or log odds, of any distress incidents: I = \n[P(any = 1) / P(any = 0)] [10.2] An overall y; test at the upper right evaluates the null hypothesis that all coefficients in the model, except the constant, equal zero, /- --2(ln^,-[n^f) [10.3] where In , is the initial or iteration 0 (model with constant only) log likelihood, and In f is the final iteration's log likelihood. Here, 268 Statistics with Stata f =-2[-l5.394543 -(-12.991096)] -4.81 The probability of a greater ^\ with 1 degree of freedom (the difference in complexity between initial and final models), is low enough (.0283) to reject the null hypothesis in this example. Consequently, date does have a significant effect. Less accurate, though convenient, tests are provided by the asymptotic r (standard normal) statistics displayed with logit results. With one predictor variable, that predictor's z statistic and the overall %: statistic test equivalent hypotheses, analogous to the usual / and F statistics in simple OLS regression. Unlike their OLS counterparts, the logit z approximation and yj tests sometimes disagree (they do here). The ^2 test has more general validity. Like Stata's other maximum-likelihood estimation procedures, logit displays a pseudo R ~ with its output: pseudo/?: = 1 -ln_i\/ln ^ [10.4] For this example. pseudo/?2 - 1 -(-12.991096)/ (-15.394543) = .1561 Although they provide a quick way to describe or compare the fit of different models for the same dependent variable, pseudo R 2 statistics lack the straightforward explained-variance interpretation of true R 2 in OLS regression. After logit,the predict command(withnooptions)obtainspredictedprobabilities, Phat^ 1 /(l +e L) [10.5] Graphed against date, these probabilities follow an S-shaped logistic curve as seen in Figure 10.1. I Logistic Regression 269 predict Phat label variable Phat "Predicted P(distress >= 1) graph twoway connected Phat date, sort Figure 10.1 7500 8000 8500 9000 Date {days since 1/1/60) 9500 The coefficient given by logit (.0020907) describes date's effect on the logit or log odds that any thermal distress incidents occur. Each additional day increases the predicted log odds of thermal distress by .0020907. Equivalently, we could say that each additional day multiplies predicted odds of thermal distress by e 0020907 - 1.0020929; each 100 days therefore multiplies the odds by (e 002mi ) m = 1.23. (e = 2.71828, the base number for natural logarithms.) Stata can make these calculations utilizing the Jo[varname] coefficients stored after any estimation: . display exp(_b[date]) 1.0020929 . display exp(_b[date])Al00 1 . 2325359 Or, we could simply include an or (odds ratio) option on the logit command line. An alternative way to obtain odds ratios employs the logistic command described in the next section, logistic fits exactly the same model as logit, but its default output table displays odds ratios rather than coefficients. 270 Statistics with Stata Using Logistic Regression Here is the same regression seen logistic any date Logit estimates Log likelihood - -12.991096 earlier, but using logistic instead of logit Number of obs LR chi2(1) Frob > chi2 Fseudo R2 23 4 . 81 0.0283 0.15 61 Std. Err. z .00 1 072 5 1 . 95 Note the identical log likelihoods and f statistics. Instead of coefficients (b), logistic displays odds ratios (e"). 
The numbers in the "Odds Ratio" column of the logistic output are amounts by which the odds favoring j = 1 are multiplied, with each 1 -unit increase in that x variable (if other x variables' values stay the same). After fitting a model, we can obtain a classification table and related statistics by typing . Istat Logistic model for any Classified I 16 7 Total I 14 23 Classified + it p True d defined as any + if predicted Pr(d) >- -5 Sensitivity Specificity Positive predictive value Negative predictive value Pr( +[ d] Pr { - I ~d: Pr ( d I +' Pr(~D| - 85.71' 55.56' 7 5.00' 71.43 False + rate for true -D Pr + ^. False - rate for true d Fr u False t rate for classified + ?r(~D\ + False _ rate for classified - Pr ( D| - 44 . 44 '■ 14.29: 2 5.00 28 . 57 Correctly classified 73 . 91 By default, lstat employs a probability of .5 as its cutoff (although we can change this by adding a cutoff ( ) option). Symbols in the classification table have the following meanings: D The event of interest did occur (that is, y = 1) for that observation. In this example, D indicates that thermal distress occurred. ~D The event of interest did not occur (that is, y = 0) for that observation. In this example, corresponds to flights having no thermal distress. Logistic Regression 271 + The model's predicted probability is greater than or equal to the cutoff point Since we used the default cutoff, + here indicates that the model predicts a .5 or higher probability of thermal distress. The predicted probability is less than the cutoff. Here, - means a predicted probability of thermal distress below .5. Thus for 12 flights, classifications are accurate in the sense that the model estimated at least a .5 probability of thermal distress, and distress did in fact occur. For 5 other flights, the model predicted less than a .5 probability, and distress did not occur. The overall "correctly classified" rate is therefore 12 + 5= 17 out of 23, or 73.91%. The table also gives conditional probabilities such as "sensitivity" or the percentage of observations with P > .5 given that thermal distress occurred (12 out of 14 or 85.71%). After logistic or logit, the followup command predict calculates various prediction and diagnostic statistics. Discussion of the diagnostic statistics can be found in Hosmer and Lemeshow (2000). Predicted probability that y = 1 Linear prediction (predicted log odds that v = 1) Standard error of the linear prediction AB influence statistic, analogous to Cook's D Deviance residual for yth x pattern, dj Change in Pearson %2, written as A%2 or Ax2P Change in deviance %29 written as AD or A%2 D Leverage of the yth x pattern, h. Assigns numbers to x patterns7y = 1,2,3 ... J Pearson residual for yth ,r pattern, r} Standardized Pearson residual predict newvar predict newvar, xb predict newvar, stdp predict newvar, dbeta predict newvar, deviance predict newvar, dx2 predict newvar, ddeviance predict newvar, hat predict newvar, number predict newvar, resid predict newvar, rstandard Statistics obtained by the dbeta , dx2 , ddeviance , and hat options do not measure the influence of individual observations, as their counterparts in ordinary regression do. Rather, these statistics measure the influence of "covariate patterns"; that is, the consequences of dropping all observations with that particular combination of x values. See Hosmer and Lemeshow (2000) for details. A later section of this chapter shows these statistics in use. Does booster joint temperature also affect the probability of any distress incidents? 
We could investigate by including temp as a second predictor variable . E I r; 272 Statistics with Stata logistic any date temp Logit estimates Log likelihood = -11.350748 Number of obs LR chi2(2) Prob > chi2 Pseudo R2 23 8 . 09 0.0175 0 . 2 627 ny I Odds Ratio Std. Err 1>lzl [95% Conf. Interval date I temp I 1.00297 8408309 0013675 0987887 2.17 0.030 ■1.48 0.140 1 . 000293 . 6678848 1 .005653 1 .058561 Tne «ation table indicates that including temperature as a pred.ctor proved our correct classification rate to 78.26%. lstat Logistic model for any T r ue Classified I 12 2 D I 3 1 Total 15 Total Classified + if 14 23 dieted Pr (D) >= True D defined as any Sensitivity Specificity Positive predictive value Negative predictive value Pr ( + I Pr ( - I Pr { D I Pr(-D I 7 1% 67% 00% ,00% False + rate for true ~D rate for true D rate for classified rate for classified False Fal se False Pr( +I~D) Pr( -I D) Pr(~Dl +) Pr ( DI -) 33 14 20 25 33% 2 9% Correctly classified 00% 2 6% According to the fitted model, each 1-degree increase in joint temperature multiplies the odds of booster joint damage by .84 (in other words, each 1-degree warming reduces the odds of damage by about 16%). Although this effect seems strong enough to cause concern, the asymptotic z test says that it is not statistically significant (z = -1.476, P = .140). A more definitive test, however, employs the likelihood-ratio x2- The lrtest command compares nested models estimated by maximum likelihood. First, estimate a "full" model containing all variables of interest, as done above with the logistic any date temp command. Next, type an estimates store command, giving a name (such as full) to identify this first model: . estimates store full Now estimate a reduced model, including only a subset of the x variables from the full model. (Such reduced models are said to be "nested.'1) Finally, a command such as lrtest Logistic Regression 273 full requests a test of the nested model against the previously storedfull model. For example (using the quietly prefix, because we already saw this output once), . quietly logistic any date . lrtest full likelihood-ratio test (As sumption: . nested in full) LR chi2 (1) Prob > chi2 = 0.0701 This lrtest command tests the recent (presumably nested) model against the model previously saved by estimates store . It employs a general test statistic for nested maximum-likelihood models, X2 =-2(1113!, - lna0) [10.6] where In cd 0 is the log likelihood for the first model (with all x variables), and In ^ 1 is the log likelihood for the second model (with a subset of those x variables). Compare the resulting test statistic to a %2 distribution with degrees of freedom equal to the difference in complexity (number of x variables dropped) between models 0 and 1. Type help lrtest for more about this command, which works with any of Stata's maximum-likelihood estimation procedures (logit, mlogit, stcox, and many others). The overall x2 statistic routinely given by logit or logistic output (equation [10.3]) is a special case of [10.6]. The previous lrtest example performed this calculation: X2 =-2H2.991096-(-1 1.350748)] = 3.28 with 1 degree of freedom, yielding P = .0701; the effect of temp is significant at a = .10. Given the small sample and fatal consequences of a Type II error, a =. 10 seems a more prudent cutoff than the usual a = .05. Conditional Effect Plots__ Conditional effect plots help in understanding what a logistic model implies aboutprobabilities. 
The idea behind such plots is to draw a curve showing how the model's prediction ofy changes as a function of one x variable, while holding all others variables constant at chosen values such as their means, quartiles, or extremes. For example, we could find the predicted probability of any thermal distress incidents as a function of temp, holding date constant at its 25th percentile. The 25th percentile of date, found by summarize date, detail, is 8569^ that is, June 18, 1983. . quietly logit any date temp . generate LI = _b[_cons] + _b[date]*8569 + _b[temp]* temp . generate Phatl = 1/(1 + exp(-Ll)) . label variable Phatl "P(distress >= 1 1 date = 8569)" LI is the predicted logit, and Phatl equals the corresponding predicted probability that distress > 1, calculated according to equation [ 10.5]. Similar steps find the predicted probability of any distress with date fixed at its 75th percentile (9341, or July 29, 1985): 274 Statistics with Stata . generate L2 = _b[_cons] + _b[date]*9341 + _b[temp]* temp . generate Phat2 = 1/(1 + exp(-£2)) . label variable Phat2 "P(distress >= 1 1 date = 9341)" We can now graph the relationship between temp and the probability of any distress, for the two levels of dale, as shown in Figure 10.2. Using median splines with many vertical bands (graph twoway mspline, bands (50)) produces smooth curves in this figure, approximating the smooth logistic functions. graph twoway mspline Phatl temp, bands(50) || mspline Phat2 temp, bands(50) || , ytitle("Probability of thermal distress") ylabel(0(.2)1, grid) xlabel(, grid) legend(label (1 "June 1983") label(2 "July 1985") rows (2) position(7) ring(0)) -4 Figure 10.2 (f) co (/) 0> U} T3 "TO CD CD o CO o CL CM June 1983 July 1985 30 40 50 60 70 Joint temperature, degrees F 80 Among earlier flights (date-8569, left curve), the probability of thermal distress goes from very low, at around 80° F, to near 1, below 50° F. Among later flights (date = 9341, right curve), however, the probability of any distress exceeds .5 even in warm weather, and climbs toward 1 on flights below 70° F. Note that Challenger's launch temperature, 31° F, places it at top left in Figure 10.2. This analysis predicts almost certain booster joint damage. Diagnostic Statistics and Plots As mentioned earlier, the logistic regression influence and diagnostic statistics obtained by predict refer not to individual observations, as do the OLS regression diagnostics of Chapter 7. Rather, logistic diagnostics refer to .r patterns. With the space shuttle data, however, each x pattern is unique — no two flights share the same combination of date and Logistic Regression 275 temp (naturally, because no two were launched the same day). Before using predict , we quietly refit the recent model, to be sure that model is what we think: quietly logistic any date temp . predict Phat3 (option p assumed; Pr (any) ) . label variable Phat3 "Predicted probability" . predict dX2, dx2 (2 missing values generated) label variable dX2 "Change in Pearson chi-squared" . predict dB, dbeta (2 missing values generated) label variable dB "Influence" . predict dD, ddeviance (2 missing values generated) label variable dD "Change in deviance" Hosmer and Lemeshow (2000) suggest plots that help in reading these diagnostics. To graph change in Pearson x2 versus probability of distress (Figure 10.3), type: . graph twoway scatter dX2 Phat3 0 Figure 10.3 o to CD cn c en 6 CM .4 .6 Predicted probability Two poorly fit x patterns, at upper right and left, stand out. 
We can identify these two flights (STS-2 and STS 51-A) if we include marker labels in the plot, as seen in Figure 10.4. > c < r D ft! 276 Statistics with Stata . graph twoway scatter dX2 Phat3, mlabel(flight) mlabsize(small) Figure 10.4 e STS-2 "D 0) CO to cr o (f) k-«J d) _c CD STS 51-A STS 51-J • ST#£TS-3 ...jSTS 41-G 61-C .4 .6 Predicted probability list flight any date temp dX2 Phat3 if dX2 > 5 | flight any date temp dX2 Phat3 2 . | STS-2 1 7 986 70 9 . 630337 . 10 91 805 4 . | STS-4 8213 80 . 0407 113 14 . I STS 51-A 0 9078 67 5.899742 .8400974 25 . I STS 51-L 9524 31 .9999012 Flight STS 51-A experienced no thermal distress, despite a late launch date and cool temperature (see Figure 10.2). The model predicts a .84 probability of distress for this flight. All points along the up-to-right curve in Figure 10.4 have any = 0, meaning no thermal distress. Atop the up-to-left (any= 1) curve, flight STS-2 experienced thermal distress despite being one of the earliest flights, and launched in slightly milder weather. The model predicts only a .109 probability of distress. (Because Stata considers missing values as "high" numbers, it lists the two missing-values flights, including Challenger, among those with dX2 > 5.) Similar findings result from plotting «*»*ol mlabposition(0> mlabel(flight) mlabsize(small) 1 ' o c «J ■>. "D _C d) CD S' .c O STS-2 Figure 10.5 STS 51-A STS 51-J STS3TS-9 STS 41-G J?1-B ~STS 61-C •4 6 Predicted probability dB measures an x pattern's influence in logistic regression, as Cook's D measures an individual observation's influence in OLS. For a logistic-regression analogue to the OLS diagnostic plot in Figure 7.7, we can make the plotting symbols proportional to influence as done in Figure 10.6. Figure 10.6 reveals that the two worst-fit observations are also the most influential. . graph twoway scatter dD Phat3 [aweight = dB], msymbol(oh) Figure 10.6 S3 "O O 64 o ■4 .6 Predicted probability Pi § Poorly fit and influential observations deserve special attention because they both contradict the main pattern of the data and pull model estimates in their contrary direction. Of course, simply removing such outliers allows a "better fit" with the remaining data — but this is circular reasoning. A more thoughtful reaction would be to investigate what makes the outliers unusual. Why did shuttle flight STS-2, but not STS 51-A, experience booster joint damage? Seeking an answer might lead investigators to previously overlooked variables or to otherwise respecify the model. Logistic Regression with Ordered-Category y_ logit and logistic fit only models that have two-category {0,1 }y variables. We need other methods for models in which y takes on more than two categories. For example, Ordered logistic regression, wherey is an ordinal (ordered-category) variable. The numerical values representing the categories do not matter, except that higher numbers mean "more." For example, the v categories might be {1 = "poor," 2 --"fair," 3 = "excellent"}. Multinomial logistic regression, where y has multiple but unordered categories such as {1 = "Democrat," 2 = "Republican," 3 = "undeclared"}. , logit (or logistic ), ologit, and mlogit all produce essentially the same estimates. We earlier simplified the three-category ordinal variable distress into a dichotomy, any. logit and logistic require {0,1} dependent variables, ologit, on the other hand, is designed for ordinal variables like distress that have more than two categories. 
The numerical codes representing these categories do not matter, so long as higher numerical values mean "more" of whatever is being measured. Recall that distress has categories 0 = "none," 1 ="1 or 2," and 2 ^"3 plus" incidents of booster-joint distress. Ordered logistic regression indicates that date and temp both affect distress, with the same signs (positive for date, negative for temp) seen in our earlier analyses: ologit distress date temp, nolog ologit mlogit If y is {0, Ordered logit est imat es Number of obs 23 LR chi2 (2) 12 . 32 Prob > chi2 0 .0021 Log likelihood = -1 8 .7 970 6 Pseudo R2 0.24 68 distress 1 Coef . S t d. Err. z P> 1 z I [95% Conf. Interval] date I . 00 328 6 . 00 1 2662 2 . 60 0.009 .0008043 . 0057 677 temp 1 .173375 2 . 083447 3 -2.08 0 .038 -.336929 - .0098215 cutl 1 1 6 . 428 1 3 9 . 5548 1 3 (Ancillary pa rame te rs) cut2 I 18 . 12227 9.722293 Likelihood-ratio tests are more accurate than the asymptotic z tests shown. First, have estimates store preserve in memory the results from the full model (with two predictors) just estimated. Arbitrarily, we can name this models. Logistic Regression 279 t estimates store A Next, fit a simpler model without temp, store its results as model B, and ask for a likelihood-ratio test of whether the fit of reduced model B differs significantly from that of the full model, model A: quietly ologit distress date estimates store B . lrtest B A likelihood-ratio test (Assumption: B nested in A) LR chi2 (1) Prob > chi 6 . 12 0.0133 The lrtest output notes its assumption that model B is nested in model A — meaning that the parameters estimated inB are a subset of those \nA, and that both models are estimated from the same pool of observations (which can be tricky when the data contain missing values). This likelihood-ratio test indicates that 5 's fit is significantly poorer. Because the presence of temp as a predictor in model A is the only difference, the likelihood-ratio test thus informs us that temp's contribution is significant. Similar steps find that date also has a significant effect. . quietly ologit distress temp estimates store C . lrtest C A likelihood-ratio test (Assumption: C nested in A) LR chi2 (1) Prob > chi2 =■■ 10.33 .00 13 The estimates store and lrtest commands provide flexible tools for comparing nested maximum-likelihood models. Type help lrtest and help estimates for details, including more advanced options. The ordered-logit model estimates a score, S, as a linear function of date and temp: S = .0032S6date~ M33152temp Predicted probabilities depend on the value of S, plus a logistically distributed disturbance u, relative to the estimated cut points: P(distress="nor\e") = P(S+u < _cut 1) - P(S+u < 16.42813) P(distress="\ or 2") = P(_cutl < S+u < _cut2) - P( 16.42813 < S+u < 18.12227) P(distress="3 plus") = P(_cut2<5+«) = P(18.12227 chi2 Pseudo R2 259 46 . 23 0.0000 0 . 08 63 RRR Std. Err other AK kot: ___f__ [95% Conf- Interval] 2.205882 .7304664 2.39 0.017 1.152687 4.221369 6.132946 33.991 88 leave AK I kotz | 14.4385 6.307555 6.11 0.000 (Out come life= = same is the comparison group) base (1) specifies that category 1 of 37 (life = "same") is the base category for comparison. The rrr option instructs mlogit to show relative risk ratios, which resemble the odds ratios given by logistic. 
When the categories of y cannot be ordered, multinomial logistic regression provides the appropriate method. Dataset NWarctic.dta contains survey data on 259 Alaskan students, including where each student expects to live most of his or her life (life: 1 = "same area," 2 = "other AK," 3 = "leave AK") and whether the student lives in Kotzebue (kotz = 1) or in a smaller village (kotz = 0). A cross-tabulation gives a first look at how expectations differ:

. tabulate life kotz

Expect to live |        kotz
 most of life? |        0         1 |    Total
---------------+--------------------+---------
          same |       75        17 |       92
      other AK |       80        40 |      120
      leave AK |       11        36 |       47
---------------+--------------------+---------
         Total |      166        93 |      259

. mlogit life kotz, nolog base(1) rrr

Multinomial logistic regression               Number of obs   =       259
                                              LR chi2(2)      =     46.23
                                              Prob > chi2     =    0.0000
                                              Pseudo R2       =    0.0863

------------------------------------------------------------------------------
        life |        RRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
other AK     |
        kotz |   2.205882   .7304664     2.39   0.017     1.152687    4.221369
-------------+----------------------------------------------------------------
leave AK     |
        kotz |    14.4385   6.307555     6.11   0.000     6.132946    33.99188
------------------------------------------------------------------------------
(Outcome life==same is the comparison group)

base(1) specifies that category 1 of life (life = "same") is the base category for comparison. The rrr option instructs mlogit to show relative risk ratios, which resemble the odds ratios given by logistic.

Referring back to the tabulate output, we can calculate that among Kotzebue students the odds favoring "leave Alaska" over "stay in the same area" are

P(leave AK) / P(same) = (36/93) / (17/93) = 2.1176471

Among other students the odds favoring "leave Alaska" over "same area" are

P(leave AK) / P(same) = (11/166) / (75/166) = .1466667

Thus, the odds favoring "leave Alaska" over "same area" are 14.4385 times higher for Kotzebue students than for others:

2.1176471 / .1466667 = 14.4385

This multiplier, a ratio of two odds, equals the relative risk ratio (14.4385) displayed by mlogit. In general, the relative risk ratio for category j of y and predictor xk equals the amount by which predicted odds favoring y = j (compared with y = base) are multiplied, per 1-unit increase in xk, other things being equal. In other words, the relative risk ratio rrr_jk is a multiplier such that, if all x variables except xk stay the same,

rrr_jk × P(y = j | xk) / P(y = base | xk)  =  P(y = j | xk + 1) / P(y = base | xk + 1)

ties is a continuous scale indicating the strength of students' social ties to family and community. We include ties as a second predictor:

. mlogit life kotz ties, nolog base(1) rrr

Multinomial logistic regression               Number of obs   =       259
                                              LR chi2(4)      =     91.96
                                              Prob > chi2     =    0.0000
Log likelihood = -221.77969                   Pseudo R2       =    0.1717

------------------------------------------------------------------------------
        life |        RRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
other AK     |
        kotz |   2.214184   .7724996     2.28   0.023     1.117483    4.387193
        ties |   .4802486   .0799184    -4.41   0.000     .3465911    .6654492
-------------+----------------------------------------------------------------
leave AK     |
        kotz |   14.84604   7.146824     5.60   0.000     5.778907    38.13955
        ties |    .230262    .059085    -5.72   0.000     .1392531      .38075
------------------------------------------------------------------------------
(Outcome life==same is the comparison group)

Asymptotic z tests here indicate that the four relative risk ratios, describing two x variables' effects, all differ significantly from 1.0. If a y variable has J categories, then mlogit models the effects of each predictor (x) variable with J - 1 relative risk ratios or coefficients, and hence also employs J - 1 z tests — evaluating two or more separate null hypotheses for each predictor. Likelihood-ratio tests evaluate the overall effect of each predictor. First, store the results from the full model, here given the name full.

. estimates store full

Then fit a simpler model with one of the x variables omitted, and perform a likelihood-ratio test. For example, to test the effect of ties, we repeat the regression with ties omitted:

. quietly mlogit life kotz
. estimates store no_ties
. lrtest no_ties full

likelihood-ratio test                         LR chi2(2)  =     45.73
(Assumption: no_ties nested in full)          Prob > chi2 =    0.0000

The effect of ties is clearly significant. Next, we run a similar test on the effect of kotz:

. quietly mlogit life ties
. estimates store no_kotz
. lrtest no_kotz full

likelihood-ratio test                         LR chi2(2)  =     39.05
(Assumption: no_kotz nested in full)          Prob > chi2 =    0.0000

If our data contained missing values, the three mlogit commands just shown might have analyzed three overlapping subsets of observations. The full model would use only observations with nonmissing life, kotz, and ties values; the kotz-only model would bring back in any observations missing just their ties values; and the ties-only model would bring back observations missing just kotz values. When this happens, Stata returns an error message saying "observations differ."
In such cases, the likelihood-ratio test would be invalid. Analysts must either screen observations with if qualifiers attached to modeling commands, such as

. mlogit life kotz ties, nolog base(1) rrr
. estimates store full
. quietly mlogit life kotz if ties < .
. estimates store no_ties
. lrtest no_ties full
. quietly mlogit life ties if kotz < .
. estimates store no_kotz
. lrtest no_kotz full

or simply drop all observations having missing values before proceeding:

. drop if life >= . | kotz >= . | ties >= .

Dataset NWarctic.dta has already been screened in this fashion to drop observations with missing values.

Both kotz and ties significantly predict life. What else can we say from this output? To interpret specific effects, recall that life = "same" is the base category. The relative risk ratios tell us that:

Odds that a student expects migration to elsewhere in Alaska rather than staying in the same area are 2.21 times greater (increase about 121%) among Kotzebue students (kotz = 1), adjusting for social ties to community.

Odds that a student expects to leave Alaska rather than stay in the same area are 14.85 times greater (increase about 1,385%) among Kotzebue students (kotz = 1), adjusting for social ties to community.

Odds that a student expects migration to elsewhere in Alaska rather than staying are multiplied by .48 (decrease about 52%) with each 1-unit increase in social ties (since ties is standardized, its units equal standard deviations), controlling for Kotzebue/village residence.

Odds that a student expects to leave Alaska rather than staying are multiplied by .23 (decrease about 77%) with each 1-unit increase in social ties, controlling for Kotzebue/village residence.

predict can calculate predicted probabilities from mlogit. The outcome(#) option specifies for which y category we want probabilities. For example, to get predicted probabilities that life = "leave AK" (category 3),

. quietly mlogit life kotz ties
. predict PleaveAK, outcome(3)
(option p assumed; predicted probability)
. label variable PleaveAK "P(life = 3 | kotz, ties)"

Tabulating predicted probabilities for each value of the dependent variable shows how the model fits:

. table life, contents(mean PleaveAK) row

Expect to live |
 most of life? |  mean(PleaveAK)
---------------+-----------------
          same |        .0811267
      other AK |        .1770225
      leave AK |        .3892264
---------------+-----------------
         Total |        .1814672

A minority of these students (47/259 = 18%) expect to leave Alaska. The model averages only a .39 probability of leaving Alaska even for those who actually chose this response — reflecting the fact that although our predictors have significant effects, most variation in migration plans remains unexplained.

Conditional effect plots help to visualize what a model implies regarding continuous predictors. We can draw them using estimated coefficients (not risk ratios) to calculate probabilities:

. mlogit life kotz ties, nolog base(1)
Multinomial logistic regression               Number of obs   =       259
                                              LR chi2(4)      =     91.96
                                              Prob > chi2     =    0.0000
Log likelihood = -221.77969                   Pseudo R2       =    0.1717

------------------------------------------------------------------------------
        life |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
other AK     |
        kotz |    .794884   .3488868     2.28   0.023     .1110784     1.47869
        ties |  -.7334513   .1664104    -4.41   0.000     -1.05961    -.407293
       _cons |    .206402   .1728053     1.19   0.232    -.1322902    .5450942
-------------+----------------------------------------------------------------
leave AK     |
        kotz |   2.697733   .4813959     5.60   0.000     1.754215    3.641252
        ties |  -1.468537   .2565991    -5.72   0.000    -1.971462   -.9656124
       _cons |  -2.115025   .3758163    -5.63   0.000    -2.851611   -1.378439
------------------------------------------------------------------------------
(Outcome life==same is the comparison group)

The following commands calculate predicted logits, and then the probabilities needed for conditional effect plots. L2villag represents the predicted logit of life = 2 (other Alaska) for village students. L3kotz is the predicted logit of life = 3 (leave Alaska) for Kotzebue students, and so forth:

. generate L2villag = .206402 + .794884*0 - .7334513*ties
. generate L2kotz = .206402 + .794884*1 - .7334513*ties
. generate L3villag = -2.115025 + 2.697733*0 - 1.468537*ties
. generate L3kotz = -2.115025 + 2.697733*1 - 1.468537*ties

Like other Stata modeling commands, mlogit saves coefficient estimates as macros. For example, [2]_b[kotz] refers to the coefficient on kotz in the model's second (life = 2) equation. Therefore, we could have generated the same predicted logits as follows. L2v will be identical to L2villag defined earlier, L3k the same as L3kotz, and so forth:

. generate L2v = [2]_b[_cons] + [2]_b[kotz]*0 + [2]_b[ties]*ties
. generate L2k = [2]_b[_cons] + [2]_b[kotz]*1 + [2]_b[ties]*ties
. generate L3v = [3]_b[_cons] + [3]_b[kotz]*0 + [3]_b[ties]*ties
. generate L3k = [3]_b[_cons] + [3]_b[kotz]*1 + [3]_b[ties]*ties

From either set of logits, we next calculate the predicted probabilities:

. generate P1villag = 1/(1 + exp(L2villag) + exp(L3villag))
. label variable P1villag "same area"
. generate P2villag = exp(L2villag)/(1 + exp(L2villag) + exp(L3villag))
. label variable P2villag "other Alaska"
. generate P3villag = exp(L3villag)/(1 + exp(L2villag) + exp(L3villag))
. label variable P3villag "leave Alaska"
. generate P1kotz = 1/(1 + exp(L2kotz) + exp(L3kotz))
. label variable P1kotz "same area"
. generate P2kotz = exp(L2kotz)/(1 + exp(L2kotz) + exp(L3kotz))
. label variable P2kotz "other Alaska"
. generate P3kotz = exp(L3kotz)/(1 + exp(L2kotz) + exp(L3kotz))
. label variable P3kotz "leave Alaska"

Figures 10.7 and 10.8 show conditional effect plots for village and Kotzebue students separately.

. graph twoway mspline P1villag ties, bands(50)
    || mspline P2villag ties, bands(50)
    || mspline P3villag ties, bands(50)
    || , xlabel(-3(1)3) ylabel(0(.2)1) yline(0 1) xline(0)
      legend(order(2 3 1) position(12) ring(0) label(1 "same area")
      label(2 "elsewhere Alaska") label(3 "leave Alaska") cols(1))
      ytitle("Probability")

Figure 10.7  Village students (x axis: Social ties to community scale)

. graph twoway mspline P1kotz ties, bands(50)
    || mspline P2kotz ties, bands(50)
    || mspline P3kotz ties, bands(50)
    || , xlabel(-3(1)3) ylabel(0(.2)1) yline(0 1) xline(0)
      legend(order(3 2 1) position(12) ring(0) label(1 "same area")
      label(2 "elsewhere Alaska") label(3 "leave Alaska") cols(1))
      ytitle("Probability")

Figure 10.8  Kotzebue students (x axis: Social ties to community scale)

The plots indicate that among village students, social ties increase the probability of staying rather than moving elsewhere in Alaska. Relatively few village students expect to leave Alaska. In contrast, among Kotzebue students, ties particularly affects the probability of leaving Alaska, rather than simply moving elsewhere in the state. Only if they feel very strong social ties do Kotzebue students tend to favor staying put.
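As a cross-check on these hand-built formulas, predict after mlogit will calculate the same three probabilities at each student's observed values of kotz and ties. A minimal sketch (the names P1 through P3 are arbitrary; their values should match P1villag, P2villag, and so forth for the corresponding students):

. quietly mlogit life kotz ties, base(1)
. predict P1 P2 P3
. summarize P1 P2 P3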
Survival and Event-Count Models

This chapter presents methods for analyzing event data. Survival analysis encompasses several related techniques that focus on times until the event of interest occurs. Although the event could be good or bad, by convention we refer to that event as a "failure." The time until failure is "survival time." Survival analysis is important in biomedical research, but it can be applied equally well to other fields from engineering to social science — for example, in modeling the time until an unemployed person gets a job, or a single person gets married. Stata offers a full range of survival analysis procedures, only a few of which are illustrated in this chapter.

We also look briefly at Poisson regression and its relatives. These methods focus not on survival times but, rather, on the rates or counts of events over a specified interval of time. Event-count methods include Poisson regression and negative binomial regression. Such models can be fit either through specialized commands or through the broader approach of generalized linear modeling (GLM).

Consult the Survival Analysis and Epidemiological Tables Reference Manual for more information about Stata's capabilities. Type help st to see an online overview. Selvin (1995) provides well-illustrated introductions to survival analysis and Poisson regression. I have borrowed (with permission) several of his examples. Other good introductions to survival analysis include the Stata-oriented volume by Cleves, Gould and Gutierrez (2004), a chapter in Rosner (1995), and comprehensive treatments by Hosmer and Lemeshow (1999) and Lee (1992). McCullagh and Nelder (1989) describe generalized linear models. Long (1997) has a chapter on regression models for count data (including Poisson and negative binomial), and also has some material on generalized linear models. An extensive and current treatment of generalized linear models is found in Hardin and Hilbe (2001).

Stata menu groups most relevant to this chapter include:
Statistics - Survival analysis
Graphics - Survival analysis graphs
Statistics - Count outcomes
Statistics - Generalized linear models (GLM)

Regarding epidemiological tables, not covered in this chapter, further information can be found by typing help epitab or exploring the menus for Statistics - Observational/Epi. analysis.

Example Commands

Most of Stata's survival-analysis (st*) commands require that the data have previously been identified as survival-time data by issuing an stset command (see below). stset need only be run once, and the data subsequently saved.

. stset timevar, failure(failvar)
Identifies single-record survival-time data. Variable timevar indicates the time elapsed before either a particular event (called a "failure") occurred, or the period of observation ended ("censoring"). Variable failvar indicates whether a failure (failvar = 1) or censoring (failvar = 0) occurred at timevar. The dataset contains only one record per individual. The dataset must be stset before any further st* commands will work. If we subsequently save the dataset, however, the stset definitions are saved as well. stset creates new variables named _st, _d, _t, and _t0 that encode information necessary for subsequent st* commands.

. stset timevar, failure(failvar) id(patient) enter(time start)
Identifies multiple-record survival-time data. In this example, the variable timevar indicates elapsed time before failure or censoring; failvar indicates whether failure (1) or censoring (0) occurred at this time; patient is an identification number.
The same individual might contribute more than one record to the data, but always has the same identification number. start records the time when each individual came under observation.

. stdes
Describes survival-time data, listing definitions set by stset and other characteristics of the data.

. stsum
Obtains summary statistics: the total time at risk, incidence rate, number of subjects, and percentiles of survival time.

. ctset time nfail ncensor nenter, by(ethnic sex)
Identifies count-time data. In this example, the variable time is a measure of time; nfail is the number of failures occurring at time. We also specified ncensor (number of censored observations at time) and nenter (number entering at time), although these can be optional. ethnic and sex are other categorical variables defining observations in these data.

. cttost
Converts count-time data, previously identified by a ctset command, into survival-time form that can be analyzed by st* commands.

. sts graph
Graphs the Kaplan-Meier survivor function. To visually compare two or more survivor functions, such as one for each value of the categorical variable sex, use the by() option:
. sts graph, by(sex)
To adjust, through Cox regression, for the effects of a continuous independent variable such as age, use the adjustfor() option:
. sts graph, by(sex) adjustfor(age)
Note: the by() and adjustfor() options work similarly with the other sts commands sts list, sts generate, and sts test.

. sts list
Lists the estimated Kaplan-Meier survivor (failure) function.

. sts test sex
Tests the equality of the Kaplan-Meier survivor function across categories of sex.

. sts generate survfunc = S
Creates a new variable arbitrarily named survfunc, containing the estimated Kaplan-Meier survivor function.

. stcox x1 x2 x3
Fits a Cox proportional hazard model, regressing time to failure on continuous or dummy-variable predictors x1-x3.

. stcox x1 x2 x3, strata(x4) basechazard(hazard) robust
Fits a Cox proportional hazard model, stratified by x4. Stores the group-specific baseline cumulative hazard function as a new variable named hazard. (Baseline survivor function estimates could be obtained through a basesurv(survive) option.) Obtains robust standard error estimates. See Chapter 9 or, for a more complete explanation of robust standard errors, consult the User's Guide.

. stphplot, by(sex)
Plots -ln(-ln(survival)) versus ln(analysis time) for each level of the categorical variable sex, from the previous stcox model. Roughly parallel curves support the Cox model assumption that the hazard ratio does not change with time. Other checks on the Cox assumptions are performed by the commands stcoxkm (compares Cox predicted curves with Kaplan-Meier observed survival curves) and stphtest (performs a test based on Schoenfeld residuals). See help stcox for syntax and options.

. streg x1 x2, dist(weibull)
Fits a Weibull-distribution model regression of time-to-failure on continuous or dummy-variable predictors x1 and x2.

. streg x1 x2 x3 x4, dist(exponential) robust
Fits an exponential-distribution model regression of time-to-failure on continuous or dummy predictors x1-x4. Obtains heteroskedasticity-robust standard error estimates. In addition to Weibull and exponential, other dist() specifications for streg include lognormal, log-logistic, Gompertz, or generalized gamma distributions. Type help streg for more information.
. stcurve, survival
After streg, plots the survival function from this model at mean values of all the x variables.

. stcurve, cumhaz at(x3=50, x4=0)
After streg, plots the cumulative hazard function from this model at mean values of x1 and x2, x3 set at 50, and x4 set at 0.

. poisson count x1 x2 x3, irr exposure(x4)
Performs Poisson regression of event-count variable count (assumed to follow a Poisson distribution) on continuous or dummy independent variables x1-x3. Independent-variable effects will be reported as incidence rate ratios (irr). The exposure() option identifies a variable indicating the amount of exposure, if this is not the same for all observations. Note: A Poisson model assumes that the event probability remains constant, regardless of how many times an event occurs for each observation. If the probability does not remain constant, we should consider using nbreg (negative binomial regression) or gnbreg (generalized negative binomial regression) instead.

. glm count x1 x2 x3, link(log) family(poisson) lnoffset(x4) eform
Performs the same regression specified in the poisson example above, but as a generalized linear model (GLM). glm can fit Poisson, negative binomial, logit, and many other types of models, depending on what link() (link function) and family() (distribution family) options we employ.

Survival-Time Data

Survival-time data contain, at a minimum, one variable measuring how much time elapsed before a certain event occurred to each observation. The literature often terms this event of interest a "failure," regardless of its substantive meaning. When failure has not occurred to an observation by the time data collection ends, that observation is said to be "censored." The stset command sets up a dataset for survival-time analysis by identifying which variable measures time and (if necessary) which variable is a dummy indicating whether the observation failed or was censored. The dataset can also contain any number of other measurement or categorical variables, and individuals (for example, medical patients) can be represented by more than one observation.

To illustrate the use of stset, we will begin with an example from Selvin (1995:453) concerning 51 individuals diagnosed with HIV. The data initially reside in a raw-data file (aids.raw) that looks like this:

 1    1   1   34
 2   17   1   42
 3   37   0   47
      (rows 4-50 omitted)
51   81   0   29

The first column values are case numbers (1, 2, 3, ..., 51). The second column tells how many months elapsed after the diagnosis, before that person either developed symptoms of AIDS or the study ended (1, 17, 37, ...). The third column holds a 1 if the individual developed AIDS symptoms (failure), or a 0 if no symptoms had appeared by the end of the study (censoring). The last column reports the individual's age at the time of diagnosis.

We can read the raw data into memory using infile, then label the variables and data and save in Stata format as file aids1.dta:

. infile case time aids age using aids.raw, clear
(51 observations read)
. label variable case "Case ID number"
. label variable time "Months since HIV diagnosis"
. label variable aids "Developed AIDS symptoms"
. label variable age "Age in years"
. label data "AIDS (Selvin 1995:453)"
. compress
case was float now byte
time was float now byte
aids was float now byte
age was float now byte
. save aids1
file c:\data\aids1.dta saved

The next step is to identify which variable measures time and which indicates failure or censoring.
Although not necessary with these single-record data, we can also note which variable holds individual case identification numbers. In an stset command, the first-named variable measures time. Subsequently, we identify with failure() the dummy representing whether an observation failed (1) or was censored (0). After using stset, we save the data again to preserve this information.

. stset time, failure(aids) id(case)

                id:  case
     failure event:  aids != 0 & aids < .
obs. time interval:  (time[_n-1], time]
 exit on or before:  failure

      51  total obs.
       0  exclusions
      51  obs. remaining, representing
      51  subjects
      25  failures in single failure-per-subject data
    3164  total analysis time at risk, at risk from t =     0
             earliest observed entry t =     0
                  last observed exit t =    97

. save, replace
file c:\data\aids1.dta saved

stdes yields a brief description of how our survival-time data are structured. In this simple example we have only one record per subject, so some of this information is unneeded.

. stdes

         failure _d:  aids
   analysis time _t:  time
                 id:  case

                                   |------------ per subject ------------|
Category            total     mean       min     median        max
no. of subjects        51
no. of records         51        1         1          1          1
(first) entry time               0         0          0          0
(final) exit time         62.03922         1         67         97
subjects with gap       0
time on gap if gap      0
time at risk         3164  62.03922         1         67         97
failures               25  .4901961         0          0          1

The stsum command obtains summary statistics. We have 25 failures out of 3,164 person-months, giving an incidence rate of 25/3164 = .0079014. The percentiles of survival time derive from a Kaplan-Meier survivor function (next section). This function estimates about a 25% chance of developing AIDS within 41 months after diagnosis, and 50% within 81 months. Over the observed range of the data (up to 97 months) the probability of AIDS does not reach 75%, so there is no 75th percentile given.

. stsum

         failure _d:  aids
   analysis time _t:  time
                 id:  case

         | time at risk   incidence rate   no. of subjects  |--- Survival time ---|
         |                                                  |   25%     50%    75%
total    |         3164         .0079014                51  |    41      81      .

If the data happen to include a grouping or categorical variable such as sex (0 = male, 1 = female), we could obtain summary statistics on survival time separately for each group by a command of the following form:

. stsum, by(sex)

Later sections describe more formal methods for comparing survival times from two or more groups.

Count-Time Data

Survival-time (st) datasets like aids1.dta contain information on individual people or things, with variables indicating the time at which failure or censoring occurred for each individual. A different type of dataset called count-time (ct) contains aggregate data, with variables counting the number of individuals that failed or were censored at time t. For example, diskdriv.dta contains hypothetical test information on 25 disk drives. All but 5 drives failed before testing ended at 1,200 hours.

Contains data from C:\data\diskdriv.dta
  obs:     6                          Count-time data on disk drives
 vars:     3                          21 Jul 2005 09:34
 size:    48 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
hours            int   %8.0g                  Hours of continuous operation
failures        byte   %8.0g                  Number of failures observed
censored        byte   %9.0g                  Number still working
--------------------------------------------------------------------------
Sorted by:

. list

     hours   failures   censored
1.     200          2          0
2.     400          3          0
3.     600          4          0
4.     800          8          0
5.    1000          3          0
6.    1200          0          5

To set up a count-time dataset, we specify the time variable, the number-of-failures variable, and the number-censored variable, in that order.
. ctset hours failures censored

dataset name:  C:\data\diskdriv.dta
        time:  hours
    no. fail:  failures
    no. lost:  censored
   no. enter:  --            (meaning all enter at time 0)

After ctset, the cttost command automatically converts our count-time data to survival-time format.

. cttost
(data are now st)

     failure event:  failures != 0 & failures < .
obs. time interval:  (0, hours]
 exit on or before:  failure
            weight:  [fweight=w]

       6  total obs.
       0  exclusions
       6  physical obs. remaining, equal to
      25  weighted obs., representing
      20  failures in single record/single failure data
   19400  total analysis time at risk, at risk from t =     0
             earliest observed entry t =     0
                  last observed exit t =  1200

. list

     hours   failures   w   _st   _d     _t   _t0
1.    1200          0   5     1    0   1200     0
2.     200          1   2     1    1    200     0
3.     400          1   3     1    1    400     0
4.     600          1   4     1    1    600     0
5.     800          1   8     1    1    800     0
6.    1000          1   3     1    1   1000     0

. stdes

         failure _d:  failures
   analysis time _t:  hours
             weight:  [fweight=w]

                                   |---------- unweighted ----------|
Category            total     mean       min     median        max
no. of subjects         6
no. of records          6        1         1          1          1
(first) entry time               0         0          0          0
(final) exit time              700       200        700       1200
subjects with gap       0
time on gap if gap      0
time at risk         4200      700       200        700       1200
failures                5  .8333333        0          1          1

. stsum

         failure _d:  failures
   analysis time _t:  hours
             weight:  [fweight=w]

         | time at risk   incidence rate   no. of subjects  |--- Survival time ---|
         |                                                  |   25%     50%    75%
total    |        19400         .0010309                25  |   600     800   1000

Kaplan-Meier Survivor Functions

Let n_t represent the number of observations that have not failed, and are not censored, at the beginning of time period t. d_t represents the number of failures that occur to these observations during time period t. The Kaplan-Meier estimator of surviving beyond time t is the product of survival probabilities in t and the preceding periods:

S(t) = Π (n_j - d_j) / n_j,   multiplying over all time periods j up to and including t     [11.1]

For example, in the AIDS data seen earlier, one of the 51 individuals developed symptoms only one month after diagnosis. No observations were censored this early, so the probability of "surviving" (meaning, not developing AIDS) beyond time = 1 is

S(1) = (51 - 1)/51 = .9804

A second patient developed symptoms at time = 2, and a third at time = 9:

S(2) = .9804 × (50 - 1)/50 = .9608
S(9) = .9608 × (49 - 1)/49 = .9412

Graphing S(t) against t produces a Kaplan-Meier survivor curve, like the one seen in Figure 11.1. Stata draws such graphs automatically with the sts graph command. For example,

. use aids, clear
(AIDS (Selvin 1995:453))
. sts graph

         failure _d:  aids
   analysis time _t:  time
                 id:  case

Figure 11.1  Kaplan-Meier survival estimate (x axis: analysis time, 0-100)
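To see these estimates numerically rather than graphically, sts list tabulates the Kaplan-Meier function, and sts generate saves it as a variable for further use. A minimal sketch (the at() values and the variable name KM are arbitrary):

. sts list, at(1 2 9)
. sts generate KM = S
. sort time
. list time KM in 1/5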
For a second example of survivor functions, we turn to data in smoking1.dta, adapted from Rosner (1995). The observations are 234 former smokers, attempting to quit. Most did not succeed. Variable days records how many days elapsed between quitting and starting up again. The study lasted one year, and variable smoking indicates whether an individual resumed smoking before the end of this study (smoking = 1, "failure") or not (smoking = 0, "censored"). With new data, we should begin by using stset to set the data up for survival-time analysis.

Contains data from C:\data\smoking1.dta
  obs:   234                          Smoking (Rosner 1995:607)
 vars:     8                          21 Jul 2005 09:35
 size: 3,744 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
id               int   %9.0g                  Case ID number
days             int   %9.0g                  Days abstinent
smoking         byte   %9.0g                  Resumed smoking
age             byte   %9.0g                  Age in years
sex             byte   %9.0g       sex        Sex (female)
cigs            byte   %9.0g                  Cigarettes per day
co               int   %9.0g                  Carbon monoxide x 10
minutes          int   %9.0g                  Minutes elapsed since last cig
--------------------------------------------------------------------------
Sorted by:

. stset days, failure(smoking)

     failure event:  smoking != 0 & smoking < .
obs. time interval:  (0, days]
 exit on or before:  failure

     234  total obs.
       0  exclusions
     234  obs. remaining, representing
     201  failures in single record/single failure data
   18946  total analysis time at risk, at risk from t =     0
             earliest observed entry t =     0
                  last observed exit t =   366

The study involved 110 men and 124 women. Incidence rates for both sexes appear to be similar:

. stsum, by(sex)

         failure _d:  smoking
   analysis time _t:  days

  sex    | time at risk   incidence rate   no. of subjects  |--- Survival time ---|
         |                                                  |   25%     50%    75%
Male     |         8813         .0105526               110  |     4      15     68
Female   |        10133         .0106582               124  |     4      15     91
total    |        18946         .0106091               234  |     4      15     73

Figure 11.2 confirms this similarity, showing little difference between the survivor functions of men and women; both groups resumed smoking at about the same rate. The survival probabilities of nonsmokers decline very steeply during the first 30 days after quitting. For either sex, there is less than a 15% chance of surviving beyond a full year.

. sts graph, by(sex)

         failure _d:  smoking
   analysis time _t:  days

Figure 11.2  Kaplan-Meier survival estimates, by sex (x axis: analysis time, 0-400)

We can also formally test for the equality of survivor functions using a log-rank test. Unsurprisingly, this test finds no significant difference (P = .6772) between the smoking recidivism of men and women.

. sts test sex

         failure _d:  smoking
   analysis time _t:  days

Log-rank test for equality of survivor functions

  sex    | Events observed   Events expected
---------+---------------------------------
Male     |              93             95.88
Female   |             108            105.12
---------+---------------------------------
Total    |             201            201.00

           chi2(1) =       0.17
           Pr>chi2 =     0.6772

Cox Proportional Hazard Models

Regression methods allow us to take survival analysis further and examine the effects of multiple continuous or categorical predictors. One widely used method known as Cox regression employs a proportional hazard model. The hazard rate for failure at time t is defined as

h(t) = (probability of failing between times t and t + Δt) / (Δt × probability of failing after time t)     [11.2]

We model this hazard rate as a function of the baseline hazard (h0) at time t, and the effects of one or more x variables,

h(t) = h0(t) × exp(β1x1 + β2x2 + ... + βkxk)     [11.3a]

or, equivalently,

ln[h(t)] = ln[h0(t)] + β1x1 + β2x2 + ... + βkxk     [11.3b]

"Baseline hazard" means the hazard for an observation with all x variables equal to 0. Cox regression estimates this hazard nonparametrically and obtains maximum-likelihood estimates of the β parameters in [11.3]. Stata's stcox procedure ordinarily reports hazard ratios, which are estimates of exp(β). These indicate proportional changes relative to the baseline hazard rate.
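The term "proportional hazards" follows directly from [11.3a]. Compare two observations whose x1 values differ by one unit, other variables equal:

h(t | x1 + 1) / h(t | x1) = exp(β1)

The baseline hazard h0(t) cancels out of this ratio, so the proportional effect of x1 is assumed constant across time. A coefficient of β1 = .08, for example, implies that the hazard is multiplied by exp(.08), or about 1.08, for each 1-unit increase in x1, at every t.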
Does age affect the onset of AIDS symptoms? Dataset aids.dta contains information that helps answer this question. Note that with stcox, unlike most other Stata model-fitting commands, we list only the independent variable(s). The survival-analysis dependent variables, time variables, and censoring variables are understood automatically with stset data.

. use aids
(AIDS (Selvin 1995:453))
. stcox age, nolog

         failure _d:  aids
   analysis time _t:  time
                 id:  case

Cox regression -- Breslow method for ties

No. of subjects =          51                 Number of obs   =        51
No. of failures =          25
Time at risk    =        3164
                                              LR chi2(1)      =      5.00
Log likelihood  =  -86.576295                 Prob > chi2     =    0.0254

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.084557   .0375362     2.33   0.020      1.01283    1.161363
------------------------------------------------------------------------------

We might interpret the estimated hazard ratio, 1.084557, with reference to two HIV-positive individuals whose ages are a and a + 1. The older person is 8.5% more likely to develop AIDS symptoms over a short period of time (that is, the ratio of their respective hazards is 1.084557). This ratio differs significantly (P = .020) from 1. If we wanted to state our findings for a five-year difference in age, we could raise the hazard ratio to the fifth power:

. display exp(_b[age])^5
1.5005865

Thus, the hazard of AIDS onset is about 50% higher when the second person is five years older than the first. Alternatively, we could learn the same thing (and obtain the new confidence interval) by repeating the regression after creating a new version of age measured in five-year units. The nolog and noshow options below suppress display of the iteration log and the st-dataset description.

. generate age5 = age/5
. label variable age5 "age in 5-year units"
. stcox age5, nolog noshow

Cox regression -- Breslow method for ties

No. of subjects =          51                 Number of obs   =        51
No. of failures =          25
Time at risk    =        3164
                                              LR chi2(1)      =      5.00
Log likelihood  =  -86.576295                 Prob > chi2     =    0.0254

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|
-------------+------------------------------------------
        age5 |   1.500587   .2619305     2.33   0.020
------------------------------------------------------------------------------

Like ordinary regression, Cox models can have more than one independent variable. Dataset heart.dta contains survival-time data from Selvin (1995) on 35 patients with very high cholesterol levels. Variable time gives the number of days each patient was under observation. coronary indicates whether a coronary event occurred during this time (coronary = 1) or not (coronary = 0). The data also include cholesterol levels and other factors thought to affect heart disease. File heart.dta was previously set up for survival-time analysis by an stset time, failure(coronary) command, so we can go directly to st analysis.

. describe patient - ab

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
patient         byte   %9.0g                  Patient ID number
time             int   %9.0g                  Time in days
coronary        byte   %9.0g                  Coronary event (1) or none (0)
weight           int   %9.0g                  Weight in pounds
sbp              int   %9.0g                  Systolic blood pressure
chol             int   %9.0g                  Cholesterol level
cigs            byte   %9.0g                  Cigarettes smoked per day
ab              byte   %9.0g                  Type A (1) or B (0) personality
--------------------------------------------------------------------------

. stdes

         failure _d:  coronary
   analysis time _t:  time

                                   |------------ per subject ------------|
Category            total     mean       min     median        max
no. of subjects        35
no. of records         35        1         1          1          1
(first) entry time               0         0          0          0
(final) exit time         2580.629       773       2875       3141
subjects with gap       0
time on gap if gap      0
time at risk        90322  2580.629       773       2875       3141
failures                8  .2285714         0          0          1

Cox regression finds that cholesterol level and cigarettes both significantly increase the hazard of a coronary event. Counterintuitively, weight appears to decrease the hazard. Systolic blood pressure and A/B personality do not have significant net effects.

. stcox weight sbp chol cigs ab, noshow nolog
Cox regression -- no ties

No. of subjects =          35                 Number of obs   =        35
No. of failures =           8
Time at risk    =       90322
                                              LR chi2(5)      =     13.97
Log likelihood  =  -17.263231                 Prob > chi2     =    0.0158

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   .9349336   .0305184    -2.06   0.039     .8769919    .9967034
         sbp |   1.012947   .0338061     0.39   0.700     .9488087    1.081421
        chol |   1.032142   .0139984     2.33   0.020     1.005067    1.059947
        cigs |   1.203335   .1071031     2.08   0.038     1.010707    1.432676
          ab |    3.04969   2.985616     1.14   0.255     .4476492    20.77655
------------------------------------------------------------------------------

After estimating the model, stcox can also generate new variables holding the estimated baseline cumulative hazard and survivor functions. Since "baseline" refers to a situation with all x variables equal to zero, however, we first need to recenter some variables so that 0 values make sense. A patient who weighs 0 pounds, or has 0 blood pressure, does not provide a useful comparison. Guided by the minimum values actually in our data, we might shift weight so that 0 indicates 120 pounds, sbp so that 0 indicates 100, and chol so that 0 indicates 340:

. summarize patient - ab

    Variable |   Obs        Mean    Std. Dev.       Min        Max
-------------+-----------------------------------------------------
     patient |    35          18    10.24695          1         35
        time |    35    2580.629    616.0796        773       3141
    coronary |    35    .2285714     .426043          0          1
      weight |    35    170.0857    23.55516        120        225
         sbp |    35    129.7143    14.28403        104        154
        chol |    35    369.2857    51.32284        343        645
        cigs |    35    17.14286    13.07702          0         40
          ab |    35    .5142857    .5070926          0          1

. replace weight = weight - 120
(35 real changes made)
. replace sbp = sbp - 100
(35 real changes made)
. replace chol = chol - 340
(35 real changes made)
. summarize patient - ab

    Variable |   Obs        Mean    Std. Dev.       Min        Max
-------------+-----------------------------------------------------
     patient |    35          18    10.24695          1         35
        time |    35    2580.629    616.0796        773       3141
    coronary |    35    .2285714     .426043          0          1
      weight |    35    50.08571    23.55516          0        105
         sbp |    35    29.71429    14.28403          4         54
        chol |    35    29.28571    51.32284          3        305
        cigs |    35    17.14286    13.07702          0         40
          ab |    35    .5142857    .5070926          0          1

Zero values for all the x variables now make more substantive sense. To create new variables holding the baseline survivor and cumulative hazard function estimates, we repeat the regression with basesurv() and basechaz() options:

. stcox weight sbp chol cigs ab, noshow nolog basesurv(survivor) basechaz(hazard)

Cox regression -- no ties

No. of subjects =          35                 Number of obs   =        35
No. of failures =           8
Time at risk    =       90322
                                              LR chi2(5)      =     13.97
Log likelihood  =  -17.263231                 Prob > chi2     =    0.0158

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |   .9349336   .0305184    -2.06   0.039     .8769919    .9967034
         sbp |   1.012947   .0338061     0.39   0.700     .9488087    1.081421
        chol |   1.032142   .0139984     2.33   0.020     1.005067    1.059947
        cigs |   1.203335   .1071031     2.08   0.038     1.010707    1.432676
          ab |    3.04969   2.985616     1.14   0.255     .4476492    20.77655
------------------------------------------------------------------------------

Note that recentering three x variables had no effect on the hazard ratios, standard errors, and so forth. The command created two new variables, arbitrarily named survivor and hazard. To graph the baseline survivor function, we plot survivor against time, connecting data points in a stairstep fashion, as seen in Figure 11.3.

. graph twoway line survivor time, connect(stairstep) sort

Figure 11.3  Baseline survivor function versus time in days

The baseline survivor function — which depicts survival probabilities for patients having "0" weight (120 pounds), "0" blood pressure (100), "0" cholesterol (340), 0 cigarettes per day, and a type B personality — declines with time. Although this decline looks precipitous at the right, notice that the probability really only falls from 1 to about .96. Given less favorable values of the predictor variables, the survival probabilities would fall much faster.
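The two new variables are related: in general, a survivor function is approximately the exponential of the negative cumulative hazard, S(t) ≈ exp(-H(t)). A quick consistency check on the estimates just saved (a sketch; chk is an arbitrary variable name):

. generate chk = exp(-hazard)
. summarize survivor chk

The two columns of summary statistics should be nearly, though not exactly, equal, because stcox estimates the two functions separately.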
The same baseline survivor-function graph could have been obtained another way, without stcox. The alternative, shown in Figure 11.4, employs an sts graph command with an adjustfor() option listing the predictor variables:

. sts graph, adjustfor(weight sbp chol cigs ab)

         failure _d:  coronary
   analysis time _t:  time

Figure 11.4  Survivor function adjusted for weight sbp chol cigs ab

Figure 11.4 follows the usual survivor-function convention of scaling the vertical axis from 0 to 1. We can similarly graph the baseline cumulative hazard function, which over this period climbs only to about .033, by plotting hazard against time:

. graph twoway connected hazard time, connect(stairstep) sort msymbol(Oh)

Figure 11.5  Baseline cumulative hazard versus time in days

Exponential and Weibull Regression

Cox regression estimates the baseline survivor function empirically without reference to any theoretical distribution. Several alternative "parametric" approaches begin instead from assumptions that survival times do follow a known theoretical distribution. Possible distribution families include the exponential, Weibull, lognormal, log-logistic, Gompertz, or generalized gamma. Models based on any of these can be fit through the streg command. Such models have the same general form as Cox regression (equations [11.2] and [11.3]), but define the baseline hazard h0(t) differently.  Two examples appear in this section.

If failures occur randomly, with a constant hazard, then survival times follow an exponential distribution and could be analyzed by exponential regression. Constant hazard means that the individuals studied do not "age," in the sense that they are no more or less likely to fail late in the period of observation than they were at its start. Over the long term, this assumption seems unjustified for machines or living organisms, but it might approximately hold if the period of observation covers a relatively small fraction of their life spans. An exponential model implies that logarithms of the survivor function, ln(S(t)), are linearly related to t.

A second common parametric approach, Weibull regression, is based on the more general Weibull distribution. This does not require failure rates to remain constant, but allows them to increase or decrease smoothly over time. The Weibull model implies that ln(-ln(S(t))) is a linear function of ln(t).

Graphs provide a useful diagnostic for the appropriateness of exponential or Weibull models. For example, returning to aids.dta, we construct a graph (Figure 11.6) of ln(S(t)) versus time, after first generating Kaplan-Meier estimates of the survivor function S(t). The y-axis labels in Figure 11.6 are given a fixed two-digit, one-decimal display format (%2.1f) and oriented horizontally, to improve their readability.

. use aids, clear
(AIDS (Selvin 1995:453))
. sts gen S = S
. generate logS = ln(S)
. graph twoway scatter logS time, ylabel(-.8(.1)0, format(%2.1f) angle(horizontal))

Figure 11.6  logS versus months since HIV diagnosis

The pattern in Figure 11.6 appears somewhat linear, encouraging us to try an exponential regression:

. streg age, dist(exponential) nolog noshow
Exponential regression -- log relative-hazard form

No. of subjects =          51                 Number of obs   =        51
No. of failures =          25
Time at risk    =        3164
                                              LR chi2(1)      =      4.34
Log likelihood  =  -59.996976                 Prob > chi2     =    0.0372

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.074414   .0349626     2.21   0.027     1.008028    1.145172
------------------------------------------------------------------------------

The hazard ratio (1.074) does not differ greatly from the Cox regression estimate seen earlier (1.085); the similarity reflects the degree of resemblance between the empirical and exponential baseline survivor functions. According to this exponential model, the hazard of an HIV-positive individual developing AIDS increases about 7.4% with each year of age.

After streg, the stcurve command draws a graph of the model's cumulative hazard, survival, or hazard functions. By default, stcurve draws these curves holding all x variables in the model at their means. We can specify other x values by using the at() option. The individuals in aids.dta ranged from 26 to 50 years old. We could graph the survival function at age = 26 by issuing a command such as

. stcurve, survival at(age=26)

A more informative graph uses the at1() and at2() options to show the survival curve at two different sets of x values, such as the low and high extremes of age:

. stcurve, survival at1(age=26) at2(age=50) connect(direct direct)

Figure 11.7  Exponential regression survival functions at age=26 and age=50

Figure 11.7 shows the predicted survival curve (for transition from HIV diagnosis to AIDS) falling more steeply among older patients. The significant age hazard ratio greater than 1 in our exponential regression table implied the same thing, but using stcurve with at1() and at2() values gives a strong visual interpretation of this effect. These options work in a similar manner with all three types of stcurve graphs:

stcurve, survival     Survival function.
stcurve, hazard       Hazard function.
stcurve, cumhaz       Cumulative hazard function.

Instead of the exponential distribution, streg can also fit survival models based on the Weibull distribution. A Weibull distribution might appear curvilinear in a plot of ln(S(t)) versus t, but it should be linear in a plot of ln(-ln(S(t))) versus ln(t), such as Figure 11.8. An exponential distribution, on the other hand, will appear linear in both plots and have a slope equal to 1 in the ln(-ln(S(t))) versus ln(t) plot. In fact, the data points in Figure 11.8 are not far from a line with slope 1, suggesting that our previous exponential model is adequate.

. generate loglogS = ln(-ln(S))
. generate logtime = ln(time)
. graph twoway scatter loglogS logtime, ylabel(,angle(horizontal))

Figure 11.8  loglogS versus logtime

Although we do not need the additional complexity of a Weibull model with these data, results are given below for illustration.

. streg age, dist(weibull) noshow nolog

Weibull regression -- log relative-hazard form

No. of subjects =          51                 Number of obs   =        51
No. of failures =          25
Time at risk    =        3164
                                              LR chi2(1)      =      4.68
Log likelihood  =  -59.778257                 Prob > chi2     =    0.0306

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.079477   .0363509     2.27   0.023     1.010531    1.153127
-------------+----------------------------------------------------------------
       /ln_p |   .1232638   .1820858     0.68   0.498    -.2336179    .4801454
-------------+----------------------------------------------------------------
           p |   1.131183   .2059723                      .7916643    1.616309
         1/p |   .8840305   .1609694                      .6186934    1.263162
------------------------------------------------------------------------------

The Weibull regression obtains a hazard ratio estimate (1.079) intermediate between our previous Cox and exponential results. The most noticeable difference from those earlier models is the presence of three new lines at the bottom of the table. These refer to the Weibull distribution shape parameter p.
Ap value of 1 corresponds to an exponential model: the hazard Survival and Event-Count Models 309 does not change with time, p > 1 indicates that the hazard increases with time;p < 1 indicates that the hazard decreases. A 95% confidence interval forp ranges from .79 to 1.62, so we have no reason to reject an exponential (p = 1) model here. Different, but mathematically equivalent, parameterizations of the Weibull model focus on ln(/?),/>, or \/p, so Stata provides all three, stcurve draws survival, hazard, or cumulative hazard functions after streg, dist (weibull) just as it does after streg, dist (exponential) or other streg models. Exponential or Weibull regression is preferable to Cox regression when survival times actually follow an exponential or Weibull distribution. When they do not, these models are misspecified and can yield misleading results. Cox regression, which makes no a priori assumptions about distribution shape, remains useful in a wider variety of situations. In addition to exponential and Weibull models, streg can fit models based on the Gompertz, Iognormal, log-logistic, or generalized gamma distributions. Type help streg, or consult the Survival Analysis and Epidemiological Tables Reference Manual, for syntax and a list of current options. Poisson Regression If events occur independently and with constant probability, then counts of events over a given period of time follow a Poisson distribution. Let r ■ represent the incidence rate: count of events number of times event could have occurred [11.4] The denominator in [ 11.4] is termed the "exposure" and is often measured in units such as person-years. We model the logarithm of incidence rate as a linear function of one or more predictor (x) variables: [11.5a] [11.5b] Equivalently, the model describes logs of expected event counts: \r\(expected count) = \n(exposure) + p0 + x l ■+ $2x2 + Assuming that a Poisson process underlies the events of interest, Poisson regression finds maximum-likelihood estimates of the p parameters. Data on radiation exposure and cancer deaths among workers at Oak Ridge National Laboratory provide an example. The 56 observations in dataset oakridge.dta represent 56 age/radiation-exposure categories (7 categories of age * 8 categories of radiation). For each combination, we know the number of deaths and the number of person-years of exposure. Contains data from C:\data\oakridge.dta obs: 5 6 vars : 4 size: 616 (99.9% of mentor y f r ee ) Radiation (Selvin 1 995 : 47 4 } 21 Jul 2005 09:34 storage display variable name type format value label var iable label age r ad byte byte % 9 . 0 g % 9 . 0 g ageg Age group Radiation exposure level 9 >1 w v; , a; 3)0 Statistics with Stata deaths pyears byte float h 9 . 0 g -:- 9 . G g Number of deaths Person-years o r t e d by: s u m m a r i z e Var i able age r ad 1 Ob s Mean Std. Dev. Min Max 1 5 6 4 2.0181 1 7 1 56 4 . 5 2 . 3 12 024 1 8 1 5 6 1 .8 392 36 3 . 1782 03 0 16 1 56 3807.679 104 55 . 91 23 71382 list in 1/6 1 age rad de a t h s p yea r s 1. ( chi2 Pseudo R2 14 . 87 0.0001 0.0420 deaths | IRR Std. Err. P> i z 1 [95í Conf. Interval] rad 1 pyears I 1.23 (expos 64 6 9 ur e) .0603551 4 . 35 0 . 000 1 . 12 3657 1. 360 606 For the regression above, we specified the event count (deaths) as the dependent variable and radiation (rad) as the independent variable. The Poisson "exposure" variable is pyears, or person-years in each category of rad. 
Data on radiation exposure and cancer deaths among workers at Oak Ridge National Laboratory provide an example. The 56 observations in dataset oakridge.dta represent 56 age/radiation-exposure categories (7 categories of age × 8 categories of radiation). For each combination, we know the number of deaths and the number of person-years of exposure.

Contains data from C:\data\oakridge.dta
  obs:    56                          Radiation (Selvin 1995:474)
 vars:     4                          21 Jul 2005 09:34
 size:   616 (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
--------------------------------------------------------------------------
age             byte   %9.0g       ageg       Age group
rad             byte   %9.0g                  Radiation exposure level
deaths          byte   %9.0g                  Number of deaths
pyears         float   %9.0g                  Person-years
--------------------------------------------------------------------------
Sorted by:

. summarize

    Variable |   Obs        Mean    Std. Dev.       Min        Max
-------------+-----------------------------------------------------
         age |    56           4      2.0181          1          7
         rad |    56         4.5    2.312024          1          8
      deaths |    56    1.839236    3.178203          0         16
      pyears |    56    3807.679    10455.91         23      71382

. poisson deaths rad, nolog exposure(pyears) irr

Poisson regression                            Number of obs   =        56
                                              LR chi2(1)      =     14.87
                                              Prob > chi2     =    0.0001
                                              Pseudo R2       =    0.0420

------------------------------------------------------------------------------
      deaths |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         rad |   1.236469   .0603551     4.35   0.000     1.123657    1.360606
      pyears | (exposure)
------------------------------------------------------------------------------

For the regression above, we specified the event count (deaths) as the dependent variable and radiation (rad) as the independent variable. The Poisson "exposure" variable is pyears, or person-years in each category of rad. The irr option calls for incidence rate ratios rather than regression coefficients in the results table — that is, we get estimates of exp(β) instead of β, the default. According to this incidence rate ratio, the death rate becomes 1.236 times higher (increases by 23.6%) with each increase in radiation category. Although that ratio is statistically significant, the fit is not impressive; the pseudo R2 (see equation [10.4]) is only .042. To perform a goodness-of-fit test, comparing the Poisson model's predictions with the observed counts, use the follow-up command poisgof:

. poisgof

Goodness-of-fit chi2  =   254.5475
Prob > chi2 (54)      =     0.0000

These goodness-of-fit test results (chi2 = 254.5, P < .00005) indicate that our model's predictions are significantly different from the actual counts — another sign that the model fits poorly. We obtain better results when we include age as a second predictor. Pseudo R2 then rises to .5966, and the goodness-of-fit test no longer leads us to reject our model.

. poisson deaths rad age, nolog exposure(pyears) irr

Poisson regression                            Number of obs   =        56
                                              LR chi2(2)      =    211.41
                                              Prob > chi2     =    0.0000
Log likelihood = -71.4653                     Pseudo R2       =    0.5966

------------------------------------------------------------------------------
      deaths |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         rad |   1.176673   .0593446     3.23   0.001     1.065924    1.298929
         age |   1.960034   .0997536    13.22   0.000     1.773955    2.165631
      pyears | (exposure)
------------------------------------------------------------------------------

. poisgof

Goodness-of-fit chi2  =   58.00534
Prob > chi2 (53)      =     0.2960
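Like hazard ratios, incidence rate ratios compound multiplicatively across larger differences. Under this model, categories three radiation levels apart differ by a factor of exp(3β), which we can compute from the estimate above (an arithmetic sketch):

. display 1.176673^3

The result, about 1.63, says that the expected death rate is roughly 63% higher three radiation categories up, adjusting for age.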
At these levels, the incidence rates are about four times higher. . poisson dea ths r2-r8 age, no log exposure (pyeaz-s) Poisson regression Number of obs LR chiľ(8) 56 215.44 Prob > chi2 0.0000 Log likelihood - -69.451814 Ps eudo R2 0.6080 deaths I IRR Std. Err. z P> 1 z 1 [95% Conf. Interval] r2 1 . 4 7 35 91 .426898 1 34 0 . 181 .8351884 2 . 599975 r 3 1 1.630683 . 6 6 59257 1 20 0 . 231 .7 32428 3 . 6305 87 r4 1 2.375967 1.088835 1 89 0 . 059 .967742 9 5.833389 r 5 1 .7278113 . 751S255 -0 31 0 . 758 . 0961 01 8 5.511957 r b [ 1.168477 1.20691 0 15 0 . 830 .15 4 319 5 8.84 7 472 r7 1 4 . 4 3372 9 3 . 337738 1 98 0 . 048 1 . 0 1 3863 19.38915 rS 1 3 . 89188 1.6409 7 8 3 2 2 0 . 001 1 . 7031 63 8.893267 age 1 1.961907 . 1 000652 13 21 0 . 000 1 . 77 52 67 2.168169 p y ears 1 (exposure) Radiation levels 7 and 8 seem to have similar effects, so we might simplify the model by combining them. First, we test whether their coefficients are significantly different. They are not: . test r7 = r8 ( 1) [deaths]r7 - [deaths]r8 = 0.0 chi2( 1) = Prob > ch.i 2 - 0.03 0 . 867 6 Survival and Event-Count Models 313 Next, generate a new dummy variable r78, which equals 1 if either rl or r8 equals 1; . generate r78 = (r7 | r8) Finally, substitute the new predictor for rl and r8 in the regression: . poisson deaths r2-r6 r78 age, irr ex(pyears) nolog Poisson regiessien Log 1ikelihood ■69. 465332 Number of obs LR chi2(7) Prot > chi2 Pseudo R2 56 215.41 0.0000 0.6079 deaths | Std. Err p>lzl [95, Conf. Interval] r2 r3 r4 r5 r6 r78 age pyea rs 1.47 36 0 2 1.630718 2 . 3760 65 .7278387 1.168507 3.980326 1.961722 (expos ure) .42 6901 3 .66 5 9381 1.08888 .7518533 1.206942 1 . 580024 .100043 1 . 34 1 . 20 1 . 89 -0 . 31 0 . 15 3 . 48 13.21 1 81 231 059 758 880 C 0 1 000 - 8 3 5 1 94 9 . 7 324 4 1 5 .9677823 - 0961 055 .1543236 1 .8282 1 4 1.775122 539996 630655 8 33 629 512165 84 '7 7 04 665833 167937 We could proceed to simplify the model further in this fashion. At each step, test helps to evaluate whether combining two dummy variables is justifiable. Generalized Linear Models Generalized linear models (GLM) have the form MH(>-)] = Po + P1-x1 + p2x2 + ... + p,xi, y~F [11.6] where g[ ] is the link function and F the distribution family. This general formulation encompasses many specific models. For example, \fg[ ] is the identity function andy follows a normal (Gaussian) distribution, we have a linear regression model: E(J) = Po + P, x, + %x2 + • • ■ + P***, y - Normal [11.7] tf g[ ] is the logit function and y follows a Bernoulli distribution, we have logit regression instead: logit[E(y)] = p(1 + p,x, + p,.r2 + . . . + p,x,, v~ Bernoulli [11.8] Because of its broad applications, GLM could have been introduced at several different points in this book. Its relevance to this chapter comes from the ability to fit event models. Poisson regression, for example, requires that g[) is the natural log function and that;- follows a Poisson distribution: ln[E(y)] = Po + p,.r, + P,x2 + . . . , y~~ Poisson [11.9] As might be expected with such a flexible method, Stata's glm command permits many different options. Users can specify not only the distribution family and link function, but also details of the variance estimation, fitting procedure, output, and offset. These options make glm a useful alternative even when applied to models for which a dedicated command (such as regress, logistic, or poisson ) alreadv exists 314 Statistics with Stata We might represent a "generic" glm command as follows: . 
glm y xl x2 x3, family(familyname) link(1inkname) lnoffset(exposure) eform jknife where family () specifies the y distribution family. link() the link function, and lnoffset() an "exposure" variable such as that needed for Poisson regression. The eform option asks for regression coefficients in exponentiated form, exp(p) rather than p. Standard errors are estimated through jackknife ( jknife ) calculations. Possible distribution families are Gaussian or normal (default) Inverse Gaussian Bernoulli binomial Poisson Negative binomial Gamma We can also specify a number or variable indicating the binomial denominator N (number of trials), or a number indicating the negative binomial variance and deviance functions, by declaring them in the family () option: family(binomial ft) family(binomial va z name) family(nbinomial #) family(gaussian) family(igaussian) family(binomial) family(poisson) family(nbinomial) family(gamma) Possible link functions are link(identity) link(log) link(logit) link(probit) link(cloglog) linMopower #) link(power #) link(nbinomial) link(loglog) link (logc) Identity (default) Log Logit Probit Complementary log-log Odds power Power Negative binomial Log-log Log-complement Coefficient variances or standard errors can be estimated in a variety of ways. A partial list of glm variance-estimating options is given below: opg Berndt, Hall, Hall, and Hausman "B-H-cubed" variance estimator. oim Observed information matrix variance estimator, robust Huber/White/sandwich estimator of variance, unbiased Unbiased sandwich estimator of variance Survival and Event-Count Models 31s nwest jknife jknifel bstrap Heteroskedasticity and autocorrelation-consistent variance estimator. Jackknife estimate of variance. One-step jackknife estimate of variance. Bootstrap estimate of variance. Thedefaultis 199 repetitions; specifysomeothernumberbyaddingthe bsrep(#) option. For a full list of options with some technical details, look up glm in the Base Reference Manual. A more in-depth treatment of GLM topics can be found in Hardin and Hilbe (2001). Chapter 6 began with the simple regression of mean composite SAT scores (csat) on per-pupil expenditures (expense) of the 50 U.S. states and District of Columbia (states.dta): regress csat expense We could tit the same model and obtain exactly the same estimates with the following command: . glm csat expense, link(identity) family(gaussian) Iteration 0: log likelihood = -279.99 369 Generalized linear Optimisation : mode1s ML : Newton-Raphson No, of CDS = Residual df 51 49 Deviance = Pear son = 175306. 1 7530 6. 2097 20 97 Scale param = (1/df) Deviance = (1 /'df) Pearson = 3577 . 678 3577 . 678 3577.678 Variance function: Link function Standard errors V{u) = 1 g (u) = u OIM [ Gaus s i an] [Identity] Log likelihood BIC -279.998 175298 6936 .34 6 AIC 1 1 . 05877 csat [ Coef . Std. Err. 2 P> 1 z | [95% Conf. Interval] expense | - .022275 6 _cons / 1060 .732 . 0060371 32 . 700 9 -3 . 69 32 . 44 0.000 - . 034 1082 0.000 996.6399 - . 0104431 1124.825 Because link (identity) and family (gaussian) are default options, we could actually have left them out of the previous glm command. The glm command can do more than just duplicate our regress results, however. For example, we could fit the same OLS model but obtain bootstrap standard errors: w 316 Statistics with Stata . 
glm csat expense, link(identity) family(gaussian) bstrap Iteration 0: log livelihood = -279.99869 Bootstrap iterations (19 9) ----+---}---+---2---+---3---+---4---+---5 .................................................. 50 .................................................. 10 0 .................................................. 150 Generalized linear models No. of obs - 51 Optimization : ML; Newton-Raphson Residual df ^ 4 9 Scale parani = 412 4.656 Leviance = 175306.2097 (1/df) Deviance - 3577.678 Fear;s on = 175306./0 97 (1/df) Pearson = 3577.673 Variance function: V" ( u) =1 [ Ga us s i a n 1 Link function : g(u) = u [Identity] Standard errors : Bootstrap Log likelihood =-279.998693 6 AIC = 11.05877 BIG = 175298.34 6 I Boo t st r ap csat | Coef. Std. Err. z P> I z I [95c Conf. Interval] expense I -.0222756 .0039284 -5.67 0.000 -.0299751 -.0145762 _cons I 1060.732 25.36566 4 1.82 0.000 1011.017 1110.448 The bootstrap standard errors reflect observed variation among coefficients estimated from 199 samples of n = 51 cases each, drawn by random sampling with replacement from the original w = 51 dataset. In this example, the bootstrap standard errors are less than the corresponding theoretical standard errors, and the resulting confidence intervals are narrower. Similarly, we could use glm to repeat the first logistic regression of Chapter 10. In the following example, we ask for jackknife standard errors and odds ratio or exponential-form f ef orm ) coefficients: Survival and Event-Count Models 317 glm any date, link(logit) family(bernoulli) eform jknife Iteration 0 Iteration 1 Iteration 2 log likelihood = -12.995268 log likelihood = -12.991098 log likelihood - -12.991096 Jackknife iterations (2 3) ----+--- 1 ---+--_ 2 --- + Generalized linear models Optimization : ML: Newton-Raphson Deviance Pearson 25.98219269 22.8885488 Variance function: V(u) = u*(l-u) Link function Standard errors Log likelihood BIC g(V) = In (u/ (1-u) ) Jackkn i fe -12.9 9109634 19.71120426 No. of obs Residual df Scale param (1/df) Deviance (1/df) Pearson [Bernoulli] [ Logi t ] AIC 23 21 1 1.2 37247 1.089931 1.303574 any 1 1 Odds Ratio Jackkni fe Std. Err. z P>1 z 1 [95% Conf. date 1 1.002093 .0015486 1 . 35 0.176 .9990623 The final poisson regression of the present chapter corresponds to this glm model: . glm deaths t2-z6 r78 age, link(log) family(poisson) lnoffset(pyears) eform Although glm can replicate the models fit by many specialized commands, and adds some new capabilities, the specialized commands have their own advantages including speed and customized options. A particular attraction of glm is its ability to fit models for which Stata has no specialized command. w V- \ i't Principal Components, Factor, and Cluster Analysis Principal components and factor analysis provide methods for simplification, combining many correlated variables into a smaller number of underlying dimensions. Along the way to achieving simplification, the analyst must choose from a daunting variety of options. If the data really do reflect distinct underlying dimensions, different options might nonetheless converge on similar results. Tn the absence of distinct underlying dimensions, however, different options often lead to divergent results. Experimenting with these options can tell us how stable a particular finding is, or how much it depends on arbitrary choices about the specific analytical technique. Stata accomplishes principal components and factor analysis with five basic commands: Principal components analysis. 
Extracts factors of several different types. Constructs a scree graph (plot of the eigenvalues) from the recent pea or factor. Performs orthogonal (uncorrelated factors) or oblique (correlated factors) rotation, after factor . Generates factor scores (composite variables) after pea , factor , or rotate. The composite variables generated by score can subsequently be saved, listed, graphed, or analyzed like any other Stata variable. Users who create composite variables by the older method of adding other variables together without doing factor analysis could assess their results by calculating an cc reliability coefficient: alpha Cronbach's a reliability Instead of combining variables, cluster analysis combines observations by finding non-overlapping, empirically-based typologies or groups. Cluster analysis methods are even more diverse, and less theoretical, than those of factor analysis. Stata's cluster command provides tools for performing cluster analysis, graphing the results, and forming new variables to identify the resulting groups. pea factor greigen rotate score 318 Principal Components, Factor, and Cluster Analysis 319 Methods described in this chapter can be accessed through the following Statistics - Other multivariate analysis Graphics - More statistical graphs Statistics - Cluster analysis menus: Example Commands . pea xl~x20 Obtains principal components of the variables xl through x20. . pea xl-x20, mineigen(1) Obtains principal components of the variables*/ through.r20. Retains components having eigenvalues greater than 1. factor xl-x20, ml factor(5J Performs maximum likelihood factor analysis ofthevariables.r/ through.r2fl. Retains only the first five factors. greigen Graphs eigenvalues versus factor or component number from the most recent factor command (also known as aktscree graph"), rotate, varimax factors(2) Performs orthogonal (varimax) rotation of the first two factors from the most recent factor command. rotate, promax factors(3) Performs oblique (promax) rotation of the first three factors from the most recent factor command. score fl f2 f3 Generates three new factor score variables named//, /2, and/3, based upon the most recent factor and rotate commands. alpha xl-xlO Calculates Cronbach's a reliability coefficient for a composite variable defined as the sum of xI-x/0. The sense of items entering negatively is ordinarily reversed. Options can override this default, or form a composite variable by adding together either the original variables or their standardized values. cluster centroidlinkage x y z w, L2 name(L2cent) Performs agglomerative cluster analysis with centroid linkage, using variables .t, v, z, and w. Euclidean distance (L2) measures dissimilarity among observations. Results from this cluster analysis are saved with the name L2cent. cluster tree, ylabel(0 { .5)3) outnumber(20) vertlabel Draws a cluster analysis tree graph or dendrogram showing results from the previous cluster analysis, cutnumber (20) specifies that the graph begins with only 20 clusters remaining, after some previous fusion of the most-similar observations. Labels are printed in a compact vertical fashion below the graph. cluster dendrogram does the same thing as cluster tree . 320 Statistics with Stata cluster generate ctype = groups(3), name(L2cent) Creates a new variable ctype (values of 1,2, or 3) that classifies each observation into one of the top three groups found by the cluster analysis named Ucent. 
Principal Components To illustrate basic principal components and factor analysis commands, we will use a small dataset describing the nine major planets of this solar system (from Beatty et ah 1981). The data include several variables in both raw and natural logarithm form. Logarithms are employed here to reduce skew and linearize relationships among the variables. Contains data from C:\data\planets.dta obs: 9 var s : size: 12 Solar system data 22 Jul 2005 09:49 441 (9 9.9~ of memory free storage display ame type format planet s tr7 %9s ds un float ?j9 . Og r adiu s float %9 . og rings byte •c8 . Og moons byte 's 8 . Og mas s float %9 . Og density float %9 . Og 1ogds un float 9 . og 1ogra d float %9 . og 1ogmoon s float %9 . Og 1ogma s s float %9 Og logdense float %9 . Og Sorted by ds un value label variable label PIanet Mean dist. sun, km*l0^6 Equatorial radius in km ringlbl Has rings? Numbe r of known moons Mass in kilograms Mean density, q/cm"3 natural log dsun natural log radius natural log (moons + 1) natural log mass natural log dense To extract initial factors or principal components, use the command factor followed by a variable list (variables in any order) and one of the following options: pcf Principal components factoring pf Principal factoring (default) ipf Principal factoring with iterated communalities ml Maximum-likelihood factoring Principal components are calculated through the specialized command pea . Type help pea or help factor to see options for these commands. Principal Components, Factor, and Cluster Analysis 321 To obtain principal components factors, type logdense, pcf factor rings logdsun (obs = 9) Fa ct o r 1 (principal component factors; 2 factors retained) Eigenvalue Difference Proportion Cumulative 4 . 62 3 65 1.16896 0 . 11232 0.05837 0 . 03 663 0.00006 3.454 69 1 . 05664 0 . 053 95 0 . 02174 0 . 03 657 0.7706 0.1948 0.0187 0.0097 0.0061 0.0000 0.7706 0.9654 0.9842 0.9939 1 .0000 1.0000 Var iabli Factor Loadings 1 2 Un i quen es s rings I logdsun | lograd I logmoons I logmass j logdense I 0 . 97 91 7 0 . 67105 0 . 92287 0.97 64 7 0.83377 ■0.84511 0.0772 0 -0 . 7 1093 0.37357 0 . 00028 0 . 54 463 0 . 4 7 053 0 . 03526 0 .04 427 0.00875 0 . 04 65 1 0 . 00821 0 . 0643 9 Only the first two components have eigenvalues greater than 1, and these two components explain over 96% of the six variables' combined variance. The unimportant 3rd through 6th principal components might safely be disregarded in subsequent analysis. Two factor options provide control over the number of factors extracted: factors (#) where # specifies the number of factors mineigen (#) where # specifies the minimum eigenvalue for retained factors The principal components factoring ( pcf ) procedure automatically drops factors with eigenvalues below 1, so . factor rings logdsun - logdense, pcf is equivalent to factor rings logdsun - logdense, pcf mineigen (1) In this example, we would also have obtained the same results by typing factor rings logdsun - logdense, pcf factors(2) To see a scree graph (plot of eigenvalues versus component or factor number) after any factor , use the greigen command. A horizontal line at eigenvalue = 1 in Figure 12.1 marks the usual cutoff for retaining principal components, and again emphasizes the unimportance in this example of components 3 through 6. 322 Statistics with Stata greigen, yline (1) Figure 12.1 3 4 Number Rotation Rotation further simplifies factor structure. 
After factoring, type rotate followed by one of these options: varimax Varimax orthogonal rotation, for uncorrelated factors or components (default). promax () Promax oblique rotation, allowing correlated factors or components. Choose a number (promax power) < 4; the higher the number, the greater the degree of interfactor correlation, promax (3) is the default. Two additional rotate options are factors () As it does with factor , this option specifies how many factors to retain, hörst Horst modification to varimax and promax rotation. Rotation can be performed following any factor analysis, whether it employed the pcf , pf, ipf ,or ml options. In this section, we will follow through on our pcf example. For orthogonal (default) rotation of the first two components found in the planetary data, we type . rotate (varimax rotation) Rotated Factor Loadings Variable I 1 2 Uniqueness rings 1ogdsun log rad 1ogmoons 1ogna ss 0.9717 3 0.2 5 804 0.58824 0.0678 4 ■ n R R 4 7 9 8 2 7 9 2 1 0 7 0 7 9 615 9 7794 0 99357 3 9085 0 3 52 6 04 427 0 0 8 7 5 04 651 0 0 821 0 6 4 3 9 Principal Components, Factor, and Cluster Analysis 323 This example accepts all the defaults: varimax rotation and the same number of factors retained in the last factor . We could have asked for the same rotation explicitly, with the following command: rotate, varimax factors (2) For oblique promax rotation (allowing correlated factors) of the most recent factoring, type rotate, promax V a r i a b 1 (pronax rotation) Rotated Factor Loadings 1 1 2 Uniqueness 1 ogds i:n lograd 1 o g rn o o n s lugmass 1 o g d e n s e 0.34 664 1-05196 0.00599 0.42747 0 . 2154 3 0.87150 7 6264 1 7270 9 9262 6 9 0 7 0 08534 1 6 9 2 2 0352 6 0 4 4 2 7 0 0 87 5 04 651 0082 1 0 6 4 3 9 By default, this example used a promax power of 3. We could have specified the promax power and desired number of factors explicitly: . rotate, promax(3) factors(2) promax (4) would permit further simplification of the loading matrix, at the cost of stronger interfactor correlations and less total variance explained. After promax rotation, rings, lognuL iogmoons, and logmass load most heavily on factor 2. This appears to be a 'large size/many satellites" dimension. logdsun and logdense load higher on factor 1, forming a "far out/low density" dimension. The next section shows how to create new variables representing these dimensions. Factor Scores Factor scores are linear composites, formed by standardizing each variable to zero mean and unit variance, and then weighting with factor score coefficients and summing for each factor, score performs these calculations automatically, using the most recent rotate or factor results. In the score command we supply names for the new variables, such as // and/2. . score fl f2 Variable (based or. rotated factors) Scoring Coefficients I 1 2 r i ngs -ogdsun 1og rad ] oqnoonFj logmass 1ogden se 1 2 67 4 48 769 0384 0 : o664 1 4 338 39127 22 099 0 9 68 9 3 0 608 1 ?bA 3 3 4 3 8 6 31 60 9 324 Statistics with Stata label variable fl "Far out/low density" . label variable f2 "Large size/many satellites" . list planet fl £2 | planet f 1 f 2 1 1. | Me r cury -1 . 25688 1 - . 9172 388 1 2 . 1 Venus -1 . 188757 -.5160229 1 3 . i Earth - 1 . 035 242 -.3939372 1 4 . ! Ma r 5 -.5970106 -.6799535 1 5 . ! Jupiter . 3841 08 5 1.342658 1 G . | Saturn .9259058 1.18 4 4 7 5 1 | Uranus . 93 47457 .7682409 1 8 . | Neptune .8161058 .647119 1 9 . 
I Pluto 1.0170 25 -1.43534 1 Being standardized variables, the new factor scores/7 and/2 have means (approximately) equal to zero and standard deviations equal to one: summarize fl £2 Variable I Obs Mean Std Max f 1 I f2 I 9 . 9 3 e - 0 9 •3 . 3le-09 ■1 . 256881 -1.43534 1.017025 1 . 342 658 Thus, the factor scores are measured in units of standard deviations from their means. Mercury, for example, is about 1.26 standard deviations below average on the far out/low density (fl) dimension because it is actually close to the sun and high density. Mercury is .92 standard deviations below average on the large size/many satellites (J7) dimension because it is small and has no satellites. Saturn, in contrast, is .93 and 1.18 standard deviations above average on these two dimensions. Promax rotation permits correlations between factor scores: . correlate fl f2 (obs=9) f 1 1.0000 0.4 974 1.0000 Scores on factor 1 have a moderate positive correlation with scores on factor 2: far out/low density planets are more likely also to be larger, with many satellites. If we employ varimax instead of promax rotation, we get uncorrected factor scores: quietly factor rings logdsun - logdense, pcf quietly rotate quietly score varlmaxl varlmax2 i Principal Components, Factor, and Cluster Analysis 325 correlate varimaxl varimax2 (obs = 9) I varimaxl varimax2 varimaxl 1 varimax2 I 1 . 0000 0 . 0000 1.00' Once created by score , factor scores can be treated like any other Stata variable — listed, analyzed, graphed, and so forth. Graphs of principal component factors sometimes help to identify multivariate outliers or clusters of observations that stand apart from the rest. For example, Figure 12.2 reveals three distinct types of planets. graph twoway scatter fl f2f yline{0) xline(O) mlabel(planet) mlabsixe(medsmall) ylabel{, angle(horizontal)) xlabel(-1.5(.5)1.5, grid) 1 • Pluto Mars Mercury -1.5 Figure 12.2 • Uranus • Saturn Neptune Jupiter ■1.5 • Earth • Venus -.5 0 .5 Large size/many satellites 1.5 The inner, rocky planets (such as Mercury, low on "far out/low density" factor 1; low also on "large size/many satellites1' factor 2) cluster together at the lower left. The outer gas giants have opposite characteristics, and cluster together at the upper right. Pluto, which physically resembles some outer-system moons, is unique among planets for being high on the "far out/low density" dimension, and at the same time low on the "large size/many satellites" dimension. This example employed rotation. Factor scores obtained by principal components without rotation are often used to analyze large datasets in physical-science fields such as climatology and remote sensing. In these applications, principal components are called "empirical orthogonal functions.1' The first empirical orthogonal function, or EOF1, equals the factor score for the first unrotated principal component. EOF2 is the score for the second principal component, and so forth. 326 Statistics with Stata Principal Factoring Principal factoring extracts principal components from a modified correlation matrix, in which the main diagonal consists of communality estimates instead of l's. The factor options pf and ipf both perform principal factoring. They differ in how communalities are estimated: pf Communality estimates equal R: from regressing each variable on all the others, ipf Iterative estimation of communalities. Whereas principal components analysis focuses on explaining the variables1 variance, principal factoring explains intervariable correlations. 
We apply principal factoring with iterated communalities ( ipf ) to the planetary data: factor rings logdsun - logdense, ipf (obs=9) Factor (iterated principal factors; 5 factors retained) Eigenvalue Differenee Proportion Cumulative -0 5 9 6 6 3 13 4 6 0 7 7 3 9 013 01 0 012 5 0 0 0 12 3.46817 1.05107 0.064 38 0.01176 0 . 0013 7 7903 1940 0133 0022 00 02 0 0 0 0 0 . 7 903 0 . 9 8 4 3 0-9 97 6 0.9998 1.0000 1.0000 Variable Factor Loadings 1 2 Uniqueness rings logdsun 1ograd 1ogmoons 1ogma ss 1cqdense 0.97599 0.65708 0. 92 67 0 0.96738 0.83783 0.84602 0 6 64 9 67054 37001 0 1 07 4 54576 0.48941 11374 14114 0 4 50 4 0 0 781 00 557 2 0 594 0.02065 0.04 471 0.04865 0.08593 0.0282 4 0.0 0 610 0.02234 0.0081 6 0.01662 0.01597 0.0071 4 0.00997 0291 6 096d3 00036 05 63 6 00069 0021 7 Only the first two factors have eigenvalues above 1. With pcf or pf factoring, we can simply disregard minor factors. Using ipf , however, we must decide how many factors to retain, and then repeat the analysis asking for exactly that many factors. Here we will retain two factors: . factor rings logdsun - logdense, ipf factor (2) chi2 -51.73, Prob > chi2 - 0.0000 0.0000 The ml output includes two %2 tests: J vs. no factors This tests whether the current model, with/factors, fits the observed correlation matrix significantly better than a no-factor model. A low probability indicates that the current model is a significant improvement over no factors. J vs. more factors This tests whether the current 7-factor model fits significantly worse than a more complicated, perfect-fit model. A low P-value suggests that the current model does not have enough factors. 328 Statistics with Stata The previous 1-factor example yields these results: 1 vs. no factors X 2 [6] = 62.02, P = 0.0000 (actually, meaning P < .00005). The 1-factor model significantly improves upon a no-factor model. 1 vs. more f a ctors X: [9] = 51.73, /> = 0.0000 chi2 0.0000 0.1513 rings 1ogdsun 1 o g r a d 1 o g m o o n s 1ogma ss 1oqdensc Now we find the following: 2 vs. no factors X1 [12] = 134.14, P = 0.0000 (actually, P < .00005). The 2-factor model significantly improves upon a no-factor model. 2 vs. more factors X2 [4] = 6.72, P = 0.1513. The 2-factor model \snot significantly worse than a perfect-fit model. These tests suggest that two factors provide an adequate model. Computational routines performing maximum-likelihood factor analysis often yield "improper solutions" — unrealistic results such as negative variance or zero uniqueness. When this happens (as it did in our 2-factor ml example), the x2 tests 'ac^ formal justification. Viewed descriptively, the tests can still provide informal guidance regarding the appropriate number of factors. Principal Components, Factor, and Cluster Analysis 329 Cluster Analysis — 1 Cluster analysis encompasses a variety of methods that divide observations into groups or clusters, based on their dissimilarities across a number of variables. It is most often used as an exploratory approach, for developing empirical typologies, rather than as a means of testing pre-specified hypotheses. Indeed, there exists little formal theory to guide hypothesis testing for the common clustering methods. The number of choices available at each step in the analysis is daunting, and all the more so because they can lead to many different results. This section provides no more than an entry point to begin cluster analysis. We review some basic ideas and illustrate them through a simple example. 
The following section considers a somewhat larger example. Stata's Multivariate Statistics Reference Manual introduces and defines the full range of choices available. Everitt et al. (2001) cover topics in more detail, including helpful comparisons among the many cluster-analysis methods. Clustering methods fall into two broad categories,partition and hierarchical. Partition methods break the observations into apre-set number of nonoverlapping groups. We have two ways to do this: cluster kmeans Kmeans cluster analysis User specifies the number of clusters (K) to create. Stata then finds these through an iterative process, assigning observations to the group with the closest mean. cluster kmedians ICmedians cluster analysis Similar to Kmeans, but with medians. Partition methods tend to be computationally simpler and faster than hierarchical methods. The necessity of declaring the exact number of clusters in advance is a disadvantage for exploratory work, however. Hierarchical methods, involve a process of smaller groups gradually fusing to form increasingly large ones. Stata takes an agglomerative approach in hierarchical cluster analysis: it starts out with each observation considered as its own separate "group." The closest two groups are merged, and this process continues until a specified stopping-point is reached, or all observations belong to one group, A graphical display called a dendrogram or tree diagram visualizes hierarchical clustering results. Several choices exist for the linkage method, which specifies what should be compared between groups that contain more than one observation: cluster singlelinkage Single linkage cluster analysis Computes the dissimilarity between two groups as the dissimilarity betweenthe closest pair of observations between the two groups. Although simple, this method has low resistance to outliers or measurement errors. Observations tend to join clusters one at a time, forming unbalanced, drawn-out groups in which members have little in common, but are linked by intermediate observations — a problem called chaining. cluster completelinkage Complete linkage cluster analysis Uses the farthest pair of observations between the two groups. Less sensitive to outliers than single linkage, but with the opposite tendency towards clumping many observations into tight, spatially compact clusters. cluster averagelinkage Average linkage cluster analysis Uses the average dissimilarity of observations between the two groups, yielding properties intermediate between single and complete linkage. Simulation studies report that this 330 Statistics with Stata works well for many situations and is reasonably robust (see Evcrittetal. 2001, and sources they cite). Commonly used in archaeology, cluster centroidlinfcage Centroid linkage cluster analysis Centroid linkage merges the groups whose means are closest (in contrast to average linkage which looks at the average distance between elements of the two groups). This method is subject to reversals — points where a fusion takes place at a lower level of dissimilarity than an earlier fusion. Reversals signal an unstable cluster structure, are difficult to interpret, and cannot be graphed by cluster tree, cluster waveragelinkage Weighted-average linkage cluster analysis cluster medianlinkage Median linkage cluster analysis. Weighted-average linkage and median linkage are variations on average linkage and centroid linkage, respectively. In both cases, the difference is in how groups of unequal size are treated when merged. 
In average linkage and centroid linkage, the number of elements of each group are factored into the computation, giving correspondingly larger influence to the larger group (because each observation carries the same weight). In weighted-average linkage and median linkage, the two groups are given equal weighting regardless of how many observations there are in each group. Median linkage, like centroid linkage, is subject to reversals, cluster wardslinkage Ward's linkage cluster analysis Joins the two groups that result in the minimum increase in the error sum of squares. Does well with groups that are multivariate normal and of similar size, but poorly when clusters have unequal numbers of observations. All clustering methods begin with some definition of dissimilarity (or similarity). Dissimilarity measures reflect the differentness or distance between two observations, across a specified set of variables. Generally, such measures are designed so that two identical observations have a dissimilarity of 0, and two maximally different observations have a dissimilarity of 1. Similarity measures reverse this scaling, so that identical observations have a similarity of 1. Stata's cluster options offer many choices of dissimilarity or similarity measures. For purposes of calculation, Stata internally transforms similarity to dissimilarity: dissimilarity = 1 - similarity The default dissimilarity measure is the Euclidean distance, option L2 (or Euclidean). This defines the distance between observations / and / as where jr*, is the value of variable xk for observation i,xkj the value of xk for observation/, and summation occurs over all the* variables considered. Other choices available for measuring the (dis)similarities between observations based on continuous variables include the squared Euclidean distance (L2squared), Lk(xki-xkj)2 the absolute-value distance ( Ll), maximum-value distance ( Linfinity ), and correlation coefficient similarity measure ( correlation ). Choices for dissimilarities or similarities based on binary variables include simple matching ( matching ), Jaccard binary similarity coefficent ( Jaccard), and many others. Type help cldis for a list and explanations. Principai Components, Factor, and Cluster Analysis 331 Earlier in this chapter, a principal components analysis of variables in planets.dta (Figure 12.2) identified three types of planets: inner rocky planets, outer gas giants, and in a class by itself, Pluto. Cluster analysis provides an alternative approach to the question of planet "types.*' Because variables such as number of moons (moons) and mass in kilograms (mass) are measured in incomparableunits, with hugely different variances, we should standardize in some way to avoid results dominated by the highest-variance items. A common, although not automatic, choice is standardization to zero mean and unit standard deviation. This is accomplished through the egen command (and using variables in log form, for the same reasons discussed earlier), summarize confirms that the new z variables have (near) zero means, and standard deviations equal to one. egen zrings = std(rings) egen zlogdsun = std(logdsun) . egen zlograd = std(lograd) egen zlogmoon = std(logmoons) egen zlogmass = std(logmass) . egen zlogdens = std(logdense) summ zrings - zlogdens Variable | Obs Mean Std. Dev Min Max z r ings ] 9 -1 99e- -08 1 - . 84 3274 1 1 . 054 0 93 zlogdsun | 9 -1 16e- -08 1 -1 . 39382 1 1 288 216 zlograd | zlogmoon | zlogmass | 9 -3 31e- -09 1 -] .34 71 1 372 7 51 9 0 1 -1 . 
207 296 1 175 84 9 9 -A 14e- 09 1 -1 . 74466 1 365 167 zlogdens j 1 . 32e-i 1.453143 1.128901 The "three types" conclusion suggested by our principal components analysis is robust, and could have been found through cluster analysis as well. For example, we might perform a hierarchical cluster analysis with average linkage, using Euclidean distance ( L2 ) as our dissimilarity measure. The option name (L2avg) gives the results from this particular analysis a name, so that we can refer to them in later commands. The results-naming feature is convenient when we need to try a number of cluster analyses and compare their outcomes. . cluster averagelinkage zrings zlogdsun zlograd zlogmoon zlogmass zlogdens, L2 name(L2avg) Nothing seems to happen, although we might notice that our dataset now contains three new variables with names based on L2avg. These new L2avg* variables are not directly of interest, but can be used unobtrusively by the cluster tree command to draw a cluster analysis tree or dendrogram visualizing the most recent hierarchical cluster analysis results (Figure 12.3). The label {planet) option here causes planet names (values of planet) to appear as labels below the tree. Typing cluster dendrogram instead of cluster tree would produce the same graph. w 332 Statistics with Stata . cluster tree, label(planet) ylabel (0(1)5) 5n Figure 12.3 Mercury Venus Earth Mars Pluto Jupiter Saturn Uranus Neptune Dendrogram for L2avg cluster analysis Dendrograms such as Figure 12.3 provide key interpretive tools for hierarchical cluster analysis. We can trace the agglomerative process from each observation its own cluster, at bottom, to all fused into one cluster, at top. Venus and Earth, and also Uranus and Neptune, are the least dissimilar or most alike pairs. They are fused first, forming the first two multi-observation clusters at a height (dissimilarity) below 1. Jupiter and Saturn, then Venus-Earth and Mars, then Venus-Earth-Mars and Mercury, and finally Jupiter-Saturn and Uranus-Neptune are fused in quick succession, all with dissimilarities around 1. At this point we have the same three groups suggested in Figure 12.2 by principal components: the inner rocky planets, the gas giants, and Pluto. The three clusters remain stable until, at much higher dissimilarity (above 3), Pluto fuses with the inner rocky planets. At a dissimilarity near 4, the final two clusters fuse. So, how many types of planets are there? The answer, as Figure 12.3 makes clear, is "it depends." How much dissimilarity do we want to accept within each type? The long vertical lines between the three-cluster stage and the two-cluster stage in the upper part of Figure 12.3 indicate that we have three fairly distinct types. We could reduce this to two types only by fusing an observation (Pluto) that is quite dissimilar to others in its group. We could expand it to five types only by drawing distinctions between several planet groups (e.g., Mercury-Mars and Earth-Venus) that by solar-system standards are not greatly dissimilar. Thus, the dendrogram makes a case for a three-type scheme. The cluster generate commandcreatesanewvariableindicatingthetypeorgroup to which each observation belongs. In this example, groups (3) calls for three groups. The name (L2avg) option specifies the particular results we named L2avg. This option is most useful when our session included multiple cluster analyses. 
Principal Components, Factor, and Cluster Analysis 333 cluster generate plantype = groups (3) , name(L2avg) label variable plantype "Planet type" list planet plantype planet pi an type Me r cu r y 1 Venu s 1 Earth 1 Mar s 1 Jupite r 3 Saturn 3 Uranu s 3 Neptune 3 Pluto The inner rocky planets have been coded as plantype = 1; the gas giants as plantype = 3; and Pluto, which resembles an outer-system moon more than it does other planets, is by itself as plantype = 2. The group designations as 1, 2, and 3 follow the left-to-right ordering of final clusters in the dendrogram (Figure 12.3). Once the data have been saved, our new typology could be used like any other categorical variable in subsequent analyses. These planetary data have a strong pattern of natural groups, which is why such different techniques as cluster analysis and principal components point towards similar conclusions. We could have chosen other dissimilarity measures and linkage methods for this example, and still arrived at much the same place. Complex or weakly patterned data, on the other hand, often yield quite different results depending on nuances of the methods used. The clusters found by one method might not prove replicable under others, or even with slightly different analytical decisions. Cluster Analysis — 2_ Discovering a simple, robust typology to describe the nine planets was straightforward. For a more challenging example, consider the cross-national data in nations.dta. This dataset contains living-conditions variables that might provide a basis for classifying countries into types. Contains data from C: \data\nat i o n s . d t a obs : 109 Data on 109 nations, ca. 1985 vars : 15 23 Jul 2005 18:37 size: 4,142 (99.9% of memory free) sto rage display value va r i able name type f o rmat label variable label country str8 %9s Count ry pop float \ 9 . 0 g 1985 population in millions birth byte 8 . 0 g Crude birth rate/1000 people death byte I 8 . 0 g Crude death rate/1000 people ch1dmo rt byte °c8 . Og Child (1-4 yr) mortality 1985 in fmo rt int I 8.0 g Infant (<1 yr) mortality 1985 li f e byte %Q . Og Life expectancy at birth 1985 334 Statistics with Stata food i n t t 8 Og Per capita daily calories 1985 e n e r q y i lit - 8 Og Per cap energy consumed, kg oil gnpcap i n t ; e Og Per capita GNP 1985 gnpg r o float 9 Og Annual GNP growth % 65-85 urbar: byte ■c 8 Og c> population urban 1985 schoc11 i r. t ■ 8 Og Primary enrollment z. age-group s c h o o 12 i nr. ^8 Og Secondary enroll % age-group s c h o o 1 3 byte í 8 Og Higher ed. enroll i age-group ■ o r ~ c d o y In Chapter 8, we saw that nonlinear transformations (logs or square roots) helped to normalize distributions and linearize relationships among some of these variables. Similar arguments for nonlinear transformations could apply to cluster analysis, but to keep our example simple, we will not pursue them here. Linear transformations to standardize the variables in some fashion remain important, however. Otherwise, the variablegnpcap, which ranges from about $100 to $19,000 (standard deviation $4,400) would overwhelm other variables such as life, which ranges from 40 to 78 years (standard deviation 11 years). In the previous section, we standardized planetary data by subtracting each variable's mean, then dividing by its standard deviation, so that the resulting z-scores all had standard deviations of one. In this section we take a different approach, range standardization, which also works well for cluster analysis. 
Range standardization involves dividing each variable by its range. There is no command to do this directly in Stata, but we can improvise one easily enough. The summarize, detail command calculates one-variable statistics, and afterwards unobtrusively stores the results in memory as macros (described in Chapter 14). Amacro named r(max) stores the variable's maximum, and r(min) stores its minimum. Thus, to generate new variable rpop, defined as a range-standardized version of pop (population), type the commands quietly summ pop, detail generate rpop = pop/(r(max) - r(min)) label variable rpop "Range-standardized population" Simi lar commands create range-standardized versions of other living-conditions variables; quietly summ birth, detail generate rbirth = birth/(r(max) - r(min)) label variable rbirth "Range-standardized bith rate" quietly summ infmort, detail generate rinf = infmort/(r(max) - r(min)) label variable rinf "Range-standardized infant mortality" and so forth, defining the 8 new variables listed below. These range-standardized variables all have ranges equal to 1. . describe rpop-rschoo!2 storage display value variable name t ype fo rma t label variable label rpop float 1 9.0g Range-standardized popu 1 a t i on rbirth float ■.9.0g Range-standardized bith rate rinf float i 9 . 0 g Range-standardized infant mo rt a1 it y r 1 i f e float -', 9.0g Range-standardized life i Principal Components, Factor, and Cluster Analysis 335 r f c o d re n e rgy r g n p c a p r s c h o o 1; float 9 . 0 g float , 9 . 0g float ;9.0g float 9 . 0 g expectancy Range-standardized food per capita Range-standardized energy per capita Range-standardized GNP per capita Range-standardized secondary school °: summarize rpop - rschool2 'a r iable !bs Me a n t d Min Max rpop rb:rth rinf r i i f e r food 10 9 109 10 9 10 9 108 .0 3744 93 .7452043 .4 051354 1.621922 1.230213 12 0 6 4 7 4 30 98672 2 913 8 2 5 .291343 ■ 0009622 .2272727 .035503 1.052632 1.000 96 2 1 . 2272 7 3 1 . 0 355 0 3 2 . 0 52 632 2644839 .7793776 1.77937 "energy i ■gnpcap | : c h c o 1 2 I 107 .159786 .2137914 .0018464 1 001846 109 .1666459 .2319276 .0057411 1 005741 104 .4574849 .2899882 .0196078 1 019608 After the variables of interest have been standardized, we can proceed with cluster analysis. As we divide more than 100 nations into "types,1" we have no reason to assume that each type will include a similar number of nations. Average linkage (used in our planetary example), along with some other methods, gives each observation the same weight. This tends to make larger clusters more influential as agglomeration proceeds. Weighted average and median linkage methods, on the other hand, give equal weight to each cluster regardless of how many observations it contains. Such methods consequently tend to work better for detecting clusters of unequal size. Median linkage, like centroid linkage, is subject to reversals (which will occur with these data), so the following example applies weighted average linkage. Absolute-value distance ( LI ) provides our dissimilarity measure. cluster waveragelinkage rpop - rschool2, LI name{LIwav) The full cluster analysis proves unmanageably large for a tree graph: cluster tree too many leaves; consider using the cutvalue() or cutn umbe r () options r (1 9 8 ) ; Following the error-message advice, Figure 12.4 employs a outnumber (100) option to forma dendrogram that starts with only 100 groups, afterthe first few fusions have taken place. 336 Statistics with Stata . 
cluster tree, ylabel (0 ( .5)3) outnumber(100) 3-, Figure 12.4 Dendrogram for L1wav cluster analysis The bottom labels in Figure 12.4 are unreadable, but we can trace the general flow of this clustering process. Most of the fusion takes place at dissimilarities below 1. Two nations at far right are unusual; they resist fusion until about 1.5, and then form a stable two-nation group quite different from all the rest. This is one of four clusters remaining above dissimilarities of 2. The first and second of these four final clusters (reading left to right) appear heterogeneous, formed through successive fusion of a number of somewhat distinct major subgroups. The third cluster, in contrast, appears more homogeneous. It combines many nations that fused into two subgroups at dissimilarities below 1, and then fused into one group at slightly above 1. Figure 12.5 gives another view of this analysis, this time using the outvalue (1) option to show only clusters with dissimilarities above 1. The vertlabel option, not really needed here, calls for the bottom labels (Gl, G2, etc.) to be printed vertically instead of horizontally. Principal Components, Factor, and Cluster Analysis 337 cluster tree, ylabe1(0(.5)3) cutvalue(l) vertlabel 3- 2.5 1.5 Figure 12.5 i 1- ? ! 9 » i i i k j Dendrogram for L1wav cluster analysis As Figure 12.5 shows, there are 11 groups remaining at dissimilarities above 1. For purposes of illustration, we will consider only the top four groups, which have dissimilarities above 2. cluster generate creates a categorical variable for the final four groups from the cluster analysis we named Llwav. cluster generate ctype = groups (4) , name(Llwav) label variable ctype "Country type" We could next examine which countries belong to which groups by typing . by ctype: list country A more compact list of the same information appears below. This list was produced by copying and pasting data from nations, dta into the Data Editor to form a separate, single-purpose dataset in which the columns are country types. 6 7 a 9 10 11 12 13 1 ctypel ctype2 ctype 3 ctype4 1 Algeria 1 Brazil 1 Burma 1 Chile 1 Colombi a Argentin Australi Austria Belgium Canada Banglade Ben i n Bolivia Botswana BurkFaso China Indi a 1 CostaRic 1 DomRep 1 Ecuador 1 Egypt 1 Indonesi Denmark Finland F ran ce Greece HongKon g Burundi Cameroon CenAfrRe ElSalvad Ethiopia Jama i ca Jordan Malaysia Hunga ry Ireland Israel Ghana Guatemal Guinea 338 Statistics with Stata 1 4 . Mau ritiu Italy 15 . Mexico Japan i 6 . Morocco Kuwait 1 7 . Par. aria Net her .1 a 18 . | Paraguay "NewZeala 19 . j Peru Norway 2 0 . I Philippi Poland 21 . I SauArabi Portugal 2 2 . | Sri Lanka S Korea 2 3 . | Syria S i ngapor 24 . i Thailand Spa i n 25 . | Tunisia Sweden 26 . | Turkey TrinToba 2 7 , | Uruguay U K 2 8 . | yenezuel U__S_A 2 9 . UnkrEmir 30 . 1 W German 3 1 . | Y ugoslav Haiti Hondu ras Ivo ryCoa Kenya Liber i a Madagasc Malawi Mauritan Moz ambiq Nepal N i c a r a g u Niger Niger i a Pakistan Papu aNG Rwanda Senega 1 3 3 34 . 35 . 3 6 37 38 39 40 Sie rraLe Somalia Sudan Tanzania Togo YemenAR YenenPDR Zaire Zambia Z imbabwe The two-nation cluster seen at far right in Figure 12.4 turns out to be type 4, China and India. The broad, homogeneous third cluster in Figure 12.4, type 3, contains a large group of the poorest nations, mainly in Africa. The relatively diverse type 2 contains nations with higher living conditions includingthe U.S., Europe, and Japan. Type 1, also diverse, contains nations with intermediate conditions. 
Whether this or some other typology is meaningful remains a substantive question, not a statistical one, and depends on the uses for which a typology is needed. Choosing different options in the steps of our cluster analysis would have returned different results. By experimenting with a variety of reasonable choices, we could gain a sense of which findings are most stable. -■■Aft -»«T .■ Time Series Analysis Stata's evolving time series capabilities are covered in the 350-page Time-Series Reference Manual. This chapter provides a brief introduction, beginning with two elementary and useful analytical tools: time plots and smoothing. We then move on to illustrate the use of correlograms, ARIMA models, and tests for stationarity and white noise. Further applications, notably periodograms and the flexible ARCH family of models, are left to the reader's explorations. A technical and thorough treatment of time series topics is found in Hamilton (1994). Other sources include Box, Jenkins, and Reinsel (1994), Chatfield (1996), Diggle (1990), Enders (1995), Johnston and DiNardo (1997), and Shumway (1988). Menus for time series operations come under the following headings: Statistics - Time series Statistics - Multivariate time series Statistics - Cross-sectional time series Graphics - Time series graphs Example Commands ac y, lags(8) level(95) generate(newvar) Graphs autocorrelations of variable^, with 95% confidence intervals (default), for lags 1 through 8. Stores the autocorrelations as the first 8 values of newvar. arch D.y, arch(l/3) ar(l) ma(l) Fits an ARCH (autoregressive conditional heteroskedasticity) model for first differences of y, including first- through third-order ARCH terms, and first-order AR and MA disturbances. arima y, arima(3,1,2) Fits a simple ARIMA(3,1,2) model. Possible options include several estimation strategies, linear constraints, and robust estimates of variance. arima y, arima(3,1,2) sarima(1,0,1,12) Fits ARIMA model including a multiplicative seasonal component with period 12. arima D.y xl Ll.xl x2, ar(l) ma(l 12) Regresses first differences ofy onx/, lag-1 values ofxl, andx2, including AR(1), MA(1), and MA(12) disturbances. 339 340 Statistics with Stata corrgram yr lags(8) Obtains autocorrelations, partial autocorrelations, and Q tests for lags 1 through 8. dfuller y Performs Dickey-Fuller unit root test for stationarity. dwstat After regress , calculates a Durbin-Watson statistic testing First-order autocorrelation. . egen newvar = ma(y), nomiss t(7) Generates newvar equal to the span-7 moving average of v, replacing the start and end values with shorter, uncentered averages. generate date = mdy(mouth,day,year) Creates variable date, equal to days since January 1, 1960, from the three variables month, day, and year. generate date = date(strádáte, "mdy") Creates variable date from the string variable str_date, where striate contains dates in month, day, year form such as "11/19/2001", "4/18/98", or "June 12, 1948". Type help dates for many other date functions and options. . generate newvar = L3 .y Generates newvar equal to lag-3 values ofy. . pac y, lags(8) yline(O) ciopts (bstyle(outline)) Graphs partial autocorrelations with confidence intervals and residual variance for lags I through 8. Draws a horizontal line atO; shows the confidence interval as an outline, instead of a shaded area (default). pergram y, generate(newvar) Draws the sample periodogram (spectral density function) of variable y and creates newvar equal to the raw periodogram values. 
prais y xl x2 Perfonns Prais-Winsten regression of y on xl and x2, correcting for first-order autoregressive errors, prais y xl x2, core does Cochrane-Orcutt instead. smooth 73 y, generate(newvar) Generates newvar equal to span-7 running medians ofy, re-smoothing by span-3 running medians. Compound smoothers such as "3RSSH" or "4253h,twice" are possible. Type help smooth , or help tssmooth ,for other smoothing and filters. tsset date, format(%d) Defines the dataset as a time series. Time is indicated by variable date, which is formatted as daily. For "panel" data with parallel time series for a number of different units, such as cities, tsset city year identifies both panel and time variables. Most of the commands in this chapter require that the data be tsset. tssmooth ma newvar = y, window(2 12) Applies a moving-average filter toy, generating newvar. The window(2 1 2) option finds a span-5 moving average by including 2 lagged values, the current observation, and 2 leading values in the calculation of each smoothed point. Type help tssmooth for a list of other possible filters including weighted moving averages, exponential or double exponential, Holt-Winters, and nonlinear. Time Series Analysis 341 tssmooth nl newvar = y, smoother(4253h,twice) Applies a nonlinear smoothing filter toy, generating newvar. The smoother (4253h, twice) option iteratively finds running medians of span 4,2,5, and 3, then applies Hanning, then repeats on the residuals, tssmooth nl , unlike other tssmooth procedures, cannot work around missing values, wntestq y, lags(15) Box-Pierce portmanteau Q test for white noise (also provided by corrgram). xcorr x y, lags(8) xline(O) Graphs cross-correlations between input (x) and output (v) variable for lags 1-8. xcorr x y, table gives a text version that includes the actual correlations (or include a generate (newvar) option to store the correlations as a variable). Smoothing_ Many time series exhibit rapid up-and-down fluctuations that make it difficult to discern underlying patterns. Smoothing such series breaks the data into two parts, one that varies gradually, and a second "rough" part containing the leftover rapid changes: data = smooth + rough Dataset MILwater.dta contains data on daily water consumption forthetownofMilford,New Hampshire over seven months from January through July 1983 (Hamilton 1985b). Contains data f r c rn MI a t e r . d t a obs: 212 Milford daily water use, 1/1/83 - 7/31/3 3 vars: 4 27 Jul 2005 12:41 size: 2, 12 0 (93.1-;. of memory free) storage display value variable nanic type format label variable label month byto :'-9.0g Month day byte t9.0g Date year int V 9 . 0 g Year water' int - 9,0g Water use m 1000 gallons Sorted by: Before further analysis, we need to convert the month, day, and year information into a single numerical index of time. Stata's mdy () function does this, creating an elapsed-date variable (named date here) indicating the number of days since January 1, 1960. generate date = mdy(month,day,year) . list in 1/5 rnon t h d a y yea r water date 1 1 1983 520 84 01 1 2 1983 600 84 02 3 1983 610 84 03 1 4 1983 590 84 04 1 3 1983 620 8 4 05 342 Statistics with Stata The January 1, 1960 reference date is an arbitrary default. We can provide more understandable formatting for date, and also set up our data for later analyses, by using the tsset (time series set) command to identify date as the time index variable and to specify the %d (daily) display option for this variable. . 
tsset date, format(%d) time variable: date, 01 janl 98 3 to 31jull983 . list in 1/5 month day year water date 1 1 1983 52 0 Dljanl983 1 2 1983 600 02j anl983 1 1983 610 0 3janl983 1 4 1983 5 90 04j anl983 1 5 1983 620 05janl983 Dates in the new date format, such as "05janl983", are more readable than the underlying numerical values such as "8405" (days since January I, J 960). If desired, we could use %d formatting to produce other formats, such as "05 Jan 1983" or "01/05/83". Stata offers a number of variable-definition, display-format, and dataset-format features that are important with time series. Many of these involve ways to input, convert, and display dates. Full descriptions of date functions are found in the Data Management Reference Manual and the User's Guide, or they can be explored within Stata by typing help dates . The labeled values of date appear in a graph of water against date, which shows day-to-day variation, as well as an upward trend in water use as summer arrives (Figure 13.1): . graph twoway line water date, ylabel(300(100)900) Figure 13.1 3 could be used. For time series ( tsset ) data, powerful smoothing tools are available through the tssmooth commands. All but tssmooth nl can handle missing values. tssmooth ma tssmooth exponential tssmooth dexponential tssmooth hwinters tssmooth shwinters tssmooth nl moving-average filters, unweighted or weighted single exponential filters double exponential filters nonseasonal Holt-Winters smoothing seasonal Holt-Winters smoothing nonlinear filters Type help tssmooth_exponential , help tssmooth_hwinters , etc. for the syntax of each command. Figure 13.2 graphs a simple 5-day moving average of Milford water use (water5), together with the raw data (water). This graph twoway command overlays a line plot of smoothed water5 values with a line plot of raw water values (thinner line). A'-axis labels markstart-of-month values chosen "by hand" (8401, 8432, etc.) to make the graph more readable. Readability is also improved by formatting the labels as %dmd (date fonnat, but only month followed by day). Compare Figure 13.2's labels with their default counterparts in Figure 13.1. tssmooth ma water5 = water, window(2 1 2) The smoother applied was (1/5) *[x (t-2) + x(t-l) + l*x(t) + x(t + l) + xft + 2)]; x(t> = water 344 Statistics with Stata graph twoway line water5 date, clwidth (thick) II line water date, clwidth(thin) clpattern(solid) I| , ylabel(300(100)900) xlabel(8401 8432 8460 8491 8521 8552 8582 8613, grid format(%dmd)) xtitle("") ytitle(Water use in 1000 gallons) legend(order(2 1) position(4) ring(0) rows (2) label (1 "5-day average") label(2 "daily water use ) ) o o Ü) co c o ra o 0)0 ON o o I§ „, CO 0)°, ra in o o -i co Figure 13.2 Jan1 Feb1 Marl Apr1 May1 Jun1 Jul1 Aug1 Moving averages share a drawback of other mean-based statistics: they have little resistance to outliers. Because outliers form prominent spikes in Figure 13.1, we might also try a different smoothing approach. The tssmooth nl command performs outlier-resistant nonlinear smoothing, employingmethods and a terminology described in Velleman and Hoaglin (1981) and Velleman (1982). For example, . tssmooth nl water5r = water, smoother(5) creates a new variable named water5r, holding the values of water after smoothing by running medians of span 5. Compound smoothers using running medians of different spans, in combination with "hanning'1 (%, !/2, and % -weighted moving averages of span 3) and other techniques, can be specified in Velleman's original notation. 
One compound smoother that seems particularly useful is called "4253h, twice." Applying this to water, we calculate smoothed variable water4r\ tssmooth nl water4r = water, smoother(4253h,twice) Figure 13.3 graphs new smoothed values, water4r. Compare Figure 13.3 with 13.2 to see how the 4253h, twice smoothing performs relative to a moving-average. Although both smoothers have similar spans, 4253h, twice does more to reduce the jagged variations. Time Series Analysis 345 Figure 13.3 daily water use 4253h, twice smooth Jan1 Feb1 Marl Apr1 May1 Jun1 Jul1 Aug1 Sometimes our goal in smoothing is to look for patterns in smoothed plots. With these particular data, however, the "rough" or residuals after smoothing actually hold more interest. We can calculate the rough as the difference between data and smooth, and then graph the results in another time plot, Figure 13.4. generate rough = ter - water4r . label variable rough "Residuals from 4253h, twice" . graph twoway line rough date, xlabel(8401 8432 8460 8491 8521 8552 8582 8613, grid format(%dmd)) xtitle("") Figure 13.4 £ o ^ o _C0 o ra ^7 "D (0 9 . Og Fylla Bank temp, at 0-40m f y 11 s a 1 float *9 . Og Fylla Bank salinity at 0-40m nuuktemp f 1 oat ^9 . Og Nuuk air temperature wNAO float l,.9 . Og Winter (Dec-Mar) Lisbon-Stykkisholmur NAO wAO float %9 . Og Winter (Dec-Mar) AO index rcodl float \9 . Og Division 1 cod catch, 1000t tshrimp 1 float S 9 . Og Division 1 shrimp catch, 1000t Scried by: year Before analyzing these time series, we tsset the dataset, which tells Stata that the variable year contains the time-sequence information, tsset year, yearly t-:mo variable: year, 1 950 to 2000 With a tsset dataset, two new qualifiers become available: tin (times in) and twithin (times within). To list Fylla temperatures and NAO values for the years 1950 through 1955, type . list year fylltemp wNAO if tin(1950,1955) year f y11temp wNAO 1950 1951 1952 2 . 1 1 . 9 1 . 6 1 . 4 . 7 6 . 83 Time Series Analysis 347 4.J1953 2.1 .18| 5.119 5 4 2.3 .131 I-------------------------I 6. | 1955 1.2 -2.52 | +----■---------------------+ The twithin qualifier works similarly, but excludes the two endpoints: . list year fylltemp wNAO if twithin(1950,1955) +-------------------------+. I year fylltemp wNAO I 2. I 1951 1.9 -1.26 | 3.11952 1.6 .831 4.11953 2.1 .181 5. | 1954 2.3 .13 I +-------------------------f We use tssmooth nl to define a new variables/y//4, containing 4253h, twice smoothed values offylltemp (data from Bueh 2000). tssmooth nl fy!14 = fylltemp, smoother(4253h, twice) Figure 13.5 graphs raw (fylltemp) and smoothed (fyll4) Fylla Bank temperatures. Raw temperatures are shown as spike-plot deviations from the mean (1.67 °C), so this graph emphasizes both decadal cycles and annual variations. . graph twoway spike fylltemp year, base(1.67) yline(1.67) || line fyll4 year, clpattern(solid) II , ytitle("Fylla Bank temperature, degrees C") ylabel (0 (1)3) xtitle("") xtick(1955(10)1995) legend(off) Figure 13.5 1950 1960 1970 1980 1990 2000 cooiľ^r^cíľ^ľr"115 eľhibit irregu,ar penods °f generaii>— ^^£^^34^^). *"» ~d ^d; these SUmmer sea w 348 Statistics with Stata Fylla Bank temperatures arc influenced by a large-scale atmospheric pattern called the North Atlantic Oscillation, or NAO. Figure 13.6 graphs smoothed temperatures together with smoothed values of the NAO (a new variable named wNA04). For this overlaid graph, temperature defines the left axis scale, yaxis(l) , and NAO the right, yaxis(2) . 
Further v-axis options specify whether they refer to axis 1 or 2. For example, a horizontal line drawn by ylinefO, axis(2) ) marks the zero point of the NAO index. On both axes, numerical labels are written horizontally. The legend appears at the 5 o'clock position inside the plot space, position(5) ring(O) . graph twoway line fyll4 year, yaxis(l) ylabel(0(1)3, angle(horizontal) nogrid axis(1)) ytitle("Fylla Bank temperature, degrees C", axis(1) ) || line wNA04 year, yaxis(2) ytitle ("Winter NAO index", axis<2)) ylabel(-3(1)3, angle(horizontal) axis(2)) yline(0, axis(2)) | | , xtitle (" ") xlabel (1950(10)2000, grid) xtick(1955 (5)1995) legend(label(1 "Fylla temperature") label(2 "NAO index") cols(l) position(5) ring(0)) Figure 13.6 1950 1960 1970 1990 2000 Overlaid plots provide a way to visually examine how several time series vary together. In Figure 13.6, we see evidence of a negative correlation: high-N AO periods correspond to low temperatures. The physical mechanism behind this correlation involves northerly winds that bring Arctic air and water to west Greenland during high-NAO phases. The negative temperature-NAO correlation became stronger during the later part of this time series, roughly the years 1973 to 1997. We will return to this relationship in later sections. Time Series Analysis 349 Lags, Leads, and Differences Time series analysis often involves lagged variables, or values from previous times. Lags can be specified by explicit subscripting. For example, the following command creates variable \vNAO_f, equal to the previous year's NAO value: . generate wNAO^l = wWAO[_n-l] (1 missing value generated) An alternative way to achieve the same thing, using tsset data, is with Stata's L. (lag) operator: . generate wNAO^l = L.wNAO (1 missing values generated) Lag operators are often simpler than an explicit-subscripting approach. More importantly, the lag operators also respect panel data. To generate lag 2 values, use . generate wNAO__2 = L2 . wNAO (2 missing values generated) list year wNAO wNAO_l wNAO__2 if tin (1 950 , 1 954 1 year wNAO wNAO 1 wNAO 2 | 1 . 1 1. 9 5 0 1 . 4 ri 1 19 51 -1.26 1 . 4 3 . 1 1952 . 83 -1.26 1.4 1 4 . I 19 5 3 . 18 . 8 3 -1.2 6 | 5 . 1 19 5 4 . 1 3 . 18 .83 I We could have obtained this same list without generating any new variables, by instead typing . list year wNAO L.wNAO L2.wNAO if tin (1950,1954) The L. operator is one of several that simplify the analysis of tsset datasets. Other time series operators are F. (lead), D. (difference), and S. (seasonal difference). These operators can be typed in upper or lowercase — for example, F2 . wNAO or f 2 . wNAO . Time Series Operators_ Lag V/ i ( Ll. means the same thing) 2-period lag v, 2 (similarly, L3.,etc. L (1/4). means Ll. through L4 .) Lead vv, ( Fl. means the same thing) 2-period Iead_y,h, (similarly, F3 ., etc.) Difference- v, , ( Dl. means the same thing) Second difference (y, ->■,_,) - (y, , ->', :) (similarly, D3 ., etc.) Seasonal differenceyt -yr (which is the same as D .) Second seasonal difference (v, -yt.2) (similarly, S3 ., etc.) In the case of seasonal differences, S12 . does not mean "12th difference," but rather a first difference at lag 12. For example, if we had monthly temperatures instead of yearly, we might L. L2 F. F2 D. D2 S. S2 350 Statistics with Stata want to calculate S12 . temp , which would be the differences between December 2000 temperature and December 1999 temperature, November 2000 temperatures and November 1999 temperature, and so forth. 
Lag operators can appear directly in most analytical commands. We could regress 1973-97 fylltemp on wNAO, including as additional predictors wNAO values from one, two, and three years previously, without first creating any new lagged variables.

. regress fylltemp wNAO L1.wNAO L2.wNAO L3.wNAO if tin(1973,1997)

      Source |       SS       df        MS             Number of obs =      25
-------------+-------------------------------          F(  4,    20) =    4.57
       Model |   3.1884913     4   .797122826          Prob > F      =  0.0088
    Residual |  3.48929123    20   .174464562          R-squared     =  0.4775
-------------+-------------------------------          Adj R-squared =  0.3730
       Total |  6.67778254    24   .278240939          Root MSE      =  .41769

    fylltemp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        wNAO |  -.1688424   .0412995    -4.09   0.001    -.2549917   -.0826931
     L1.wNAO |   .0043805   .0421436     0.10   0.918    -.0835294    .0922905
     L2.wNAO |  -.0472993    .050851    -0.93   0.363    -.1533725     .058774
     L3.wNAO |   .0264682   .0495416     0.53   0.599    -.0768738    .1298102
       _cons |   1.727913   .1213588    14.24   0.000     1.474763    1.981063

Equivalently, we could have typed

. regress fylltemp L(0/3).wNAO if tin(1973,1997)

The estimated model is

    predicted fylltemp = 1.728 - .169 wNAO_t + .004 wNAO_t-1 - .047 wNAO_t-2 + .026 wNAO_t-3

Coefficients on the lagged terms are not statistically significant; it appears that current (unlagged) values of wNAO_t provide the most parsimonious prediction. Indeed, if we re-estimate this model without the lagged terms, the adjusted R² rises from .37 to .43. Either model is very rough, however. A Durbin-Watson test for autocorrelated errors is inconclusive, but that is not reassuring given the small sample size.

. dwstat

Durbin-Watson d-statistic(  5,    25) =  1.423806

Autocorrelated errors, commonly encountered with time series, invalidate the usual OLS confidence intervals and tests. More suitable regression methods for time series are discussed later in this chapter.

Correlograms

Autocorrelation coefficients estimate the correlation between a variable and itself at particular lags. For example, first-order autocorrelation is the correlation between y_t and y_t-1. Second order refers to Corr[y_t, y_t-2], and so forth. A correlogram graphs correlation versus lags. Stata's corrgram command provides simple correlograms and related information. The maximum number of lags it shows can be limited by the data, by matsize, or to some arbitrary lower number that is set by specifying the lags() option:

. corrgram fylltemp, lags(9)

    LAG       AC       PAC        Q      Prob>Q    [Autocorrelation]  [Partial Autocor]
      1    0.4038    0.4141    8.8151    0.0030
      2    0.1996    0.0565    11.012    0.0041
      3    0.0788    0.0045    11.361    0.0099
      4    0.0071   -0.0556    11.364    0.0228
      5   -0.1623   -0.2232    12.912    0.0242
      6   -0.0733    0.0880    13.234    0.0395
      7    0.0490    0.1367    13.332    0.0633
      8   -0.1029   -0.2510    14.047    0.0805
      9   -0.2228   -0.2779    17.243    0.0450

Lags appear at the left side of the table, and are followed by columns for the autocorrelations (AC) and partial autocorrelations (PAC). For example, the correlation between fylltemp_t and fylltemp_t-2 is .1996, and the partial autocorrelation (adjusted for lag 1) is .0565. The Q statistics (Box-Pierce portmanteau) test a series of null hypotheses that all autocorrelations up to and including each lag are zero. Because the P-values seen here are mostly below .05, we can reject the null hypothesis, and conclude that fylltemp shows significant autocorrelation. If none of the Q statistics had been below .05, we might conclude instead that the series was "white noise" with no significant autocorrelation. At the right in this output are character-based plots of the autocorrelations and partial autocorrelations.
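As a quick cross-check (a sketch, not part of the original output), the lag-1 autocorrelation can be approximated by correlating the series with its own lag, since tsset data permit time-series operators in a variable list:

. correlate fylltemp L.fylltemp   // Pearson correlation of the series with its lag

The result should be close to, though not identical with, corrgram's .4038, because the two calculations use slightly different divisors.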
Inspection of such plots plays a role in the specification of time series models. More refined graphical autocorrelation plots can be obtained through the ac command:

. ac fylltemp, lags(9)

The resulting correlogram, Figure 13.7, includes a shaded area marking pointwise 95% confidence intervals. Correlations outside of these intervals are individually significant.

Figure 13.7 [autocorrelations of fylltemp; shaded area shows Bartlett's formula for MA(q) 95% confidence bands]

A similar command, pac, produces the graph of partial autocorrelations seen in Figure 13.8. Approximate confidence intervals (estimating the standard error as 1/sqrt(n)) also appear in Figure 13.8. The default plot produced by both ac and pac has the look shown in Figure 13.7. For Figure 13.8 we chose different options, drawing a baseline at zero correlation, and indicating the confidence interval as an outline instead of a shaded area.

. pac fylltemp, yline(0) lags(9) ciopts(bstyle(outline))

Figure 13.8 [partial autocorrelations of fylltemp; 95% confidence bands, se = 1/sqrt(n)]

Cross-correlograms help to explore relationships between two time series. Figure 13.9 shows the cross-correlogram of wNAO and fylltemp over 1973-97. The cross-correlation is substantial and negative at 0 lag, but is closer to zero at other positive or negative lags. This suggests that the relationship between the two series is "instantaneous" (in yearly data) rather than delayed or distributed over several years. Recall the nonsignificance of lagged predictors from our earlier OLS regression.

. xcorr wNAO fylltemp if tin(1973,1997), lags(9) xlabel(-9(1)9, grid)

Figure 13.9 [cross-correlogram of wNAO and fylltemp, lags -9 through 9]

If we list our input or independent variable first in the xcorr command, and the output or dependent variable second — as was done for Figure 13.9 — then positive lags denote correlations between input at time t and output at time t+1, t+2, etc. Thus, we see a positive correlation of .394 between winter NAO index and Fylla temperature four years later. The actual cross-correlation coefficients, and a text version of the cross-correlogram, can be obtained with the table option:

. xcorr wNAO fylltemp if tin(1973,1997), lags(9) table

    LAG      CORR
     -9   -0.0541
     -8   -0.0786
     -7    0.1040
     -6   -0.0261
     -5   -0.0230
     -4    0.3185
     -3    0.1212
     -2    0.0053
     -1   -0.0909
      0   -0.6740
      1   -0.1386
      2   -0.0865
      ...
      4    0.3940
      ...

ARIMA Models

Autoregressive integrated moving average (ARIMA) models for time series can be estimated through the arima command. This command encompasses simple autoregressive (AR), moving average (MA), or ARIMA models of any order. It also can estimate structural models that include one or more predictor variables and AR or MA errors. The general form of such structural models, in matrix notation, is

    y_t = x_t β + μ_t        [13.1]

where y_t is the vector of dependent-variable values at time t, x_t is a matrix of predictor-variable values (usually including a constant), and μ_t is a vector of disturbances. Those disturbances can be autoregressive or moving-average, of any order. For example, ARMA(1,1) disturbances are

    μ_t = ρ μ_t-1 + θ ε_t-1 + ε_t        [13.2]

where ρ is the first-order autocorrelation parameter, θ is the first-order moving-average parameter, and ε_t is a white-noise (normal i.i.d.) disturbance. arima fits simple models as a special case of [13.1] and [13.2],
with a constant (β0) replacing the structural term x_t β. Therefore, a simple ARMA(1,1) model becomes

    y_t = β0 + μ_t        [13.3]

Some sources present an alternative version. In the ARMA(1,1) case, they show y_t as a function of the previous y value (y_t-1) and the present (ε_t) and lagged (ε_t-1) disturbances:

    y_t = α + ρ y_t-1 + θ ε_t-1 + ε_t        [13.4]

Because in the simple structural model y_t = β0 + μ_t, equation [13.3] (Stata's version) is equivalent to [13.4], apart from rescaling the constant: α = (1-ρ)β0. Using arima, an ARMA(1,1) model (equation [13.3]) can be specified in either of two ways:

. arima y, ar(1) ma(1)

or

. arima y, arima(1,0,1)

The i in arima stands for "integrated," referring to models that also involve differencing. To fit an ARIMA(2,1,1) model, use

. arima y, arima(2,1,1)

or equivalently,

. arima D.y, ar(1 2) ma(1)

Either command specifies a model in which first differences of the dependent variable (y_t - y_t-1) are a function of first differences one and two lags previous (y_t-1 - y_t-2 and y_t-2 - y_t-3) and also of present and previous disturbances (ε_t and ε_t-1). To estimate a structural model in which y_t depends on two predictor variables, x (present and lagged values, x_t and x_t-1) and w (present values only, w_t), with ARIMA(1,0,1) errors, an appropriate command would be

. arima y x L.x w, arima(1,0,1)

Although seasonal differencing (e.g., S12.y) and/or seasonal lags (e.g., L12.x) can be included, as of this writing arima does not estimate multiplicative ARIMA(p,d,q)(P,D,Q)_s seasonal models.

A time series y is considered "stationary" if its mean and variance do not change with time, and if the covariance between y_t and y_t+u depends only on the lag u, and not on the particular values of t. ARIMA modeling assumes that our series is stationary, or can be made stationary through appropriate differencing or transformation. We can check this assumption informally by inspecting time plots for trends in level or variance. Formal statistical tests for "unit roots" (a nonstationary AR(1) process in which ρ = 1, also known as a "random walk") also help. Stata offers three unit root tests: pperron (Phillips-Perron), dfuller (augmented Dickey-Fuller), and dfgls (augmented Dickey-Fuller using GLS, generally a more powerful test than dfuller). Applied to Fylla Bank temperatures, a pperron test rejects the null hypothesis of a unit root (P < .01).

. pperron fylltemp, lag(3)

Phillips-Perron test for unit root                 Number of obs   =        50
                                                   Newey-West lags =         3

                  Test      ------- Interpolated Dickey-Fuller -------
               Statistic    1% Critical     5% Critical    10% Critical
                               Value           Value           Value
------------------------------------------------------------------------------
 Z(rho)         -29.440       -18.900         -13.300         -10.700
 Z(t)            -4.871        -3.580          -2.930          -2.600
------------------------------------------------------------------------------
MacKinnon approximate p-value for Z(t) = 0.0003

Similarly, a Dickey-Fuller GLS test evaluating the null hypothesis that fylltemp has a unit root (versus the alternative hypothesis that it is stationary with a possibly nonzero mean but no linear time trend) rejects this null hypothesis (P < .05). Both tests thus confirm the visual impression of stationarity given by Figure 13.5.

. dfgls fylltemp, notrend maxlag(3)

DF-GLS for fylltemp                                    Number of obs =      47

              DF-GLS mu      1% Critical     5% Critical    10% Critical
  [lags]    Test Statistic      Value           Value           Value
------------------------------------------------------------------------------
    3          -2.304          -2.620          -2.211          -1.913
    2          -2.479          -2.620          -2.238          -1.938
    1          -3.008          -2.620          -2.261          -1.959
Opt Lag (Ng-Perron seq t) =  0  [use maxlag(0)]
Min SC   = -.6735952 at lag  1 with RMSE .6578912
Min MAIC = -.2683716 at lag  2 with RMSE .6569351

For a stationary series, correlograms provide guidance about selecting a preliminary ARIMA model:

AR(p)       An autoregressive process of order p has autocorrelations that damp out gradually with increasing lag. Partial autocorrelations cut off after lag p.

MA(q)       A moving average process of order q has autocorrelations that cut off after lag q. Partial autocorrelations damp out gradually with increasing lag.

ARMA(p,q)   A mixed autoregressive-moving average process has autocorrelations and partial autocorrelations that damp out gradually with increasing lag.

Correlogram spikes at seasonal lags (for example, at 12, 24, 36 in monthly data) indicate a seasonal pattern. Identification of seasonal models follows similar guidelines, but applied to autocorrelations and partial autocorrelations at seasonal lags.

Figures 13.7-13.8 weakly suggest an AR(1) process, so we will try this as a simple model for fylltemp.

. arima fylltemp, arima(1,0,0) nolog

ARIMA regression

Sample:  1950 to 2000                           Number of obs   =         51
                                                Wald chi2(1)    =       7.53
Log likelihood = -48.6627                       Prob > chi2     =     0.0061

------------------------------------------------------------------------------
             |               OPG
    fylltemp |      Coef.  Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
fylltemp     |
       _cons |    1.68923   .1513096    11.16   0.000     1.392669    1.985792
-------------+----------------------------------------------------------------
ARMA         |
          ar |
         L1. |   .4095759   .1492491     2.74   0.006     .1170531    .7020987
-------------+----------------------------------------------------------------
      /sigma |    .627151   .0601859    10.42   0.000     .5091889    .7451131
------------------------------------------------------------------------------

After we fit an arima model, its coefficients and other results are saved temporarily in Stata's usual way. For example, to see the recent model's AR(1) coefficient and standard error, type

. display [ARMA]_b[L1.ar]
.4095759

. display [ARMA]_se[L1.ar]
.14924909

The AR(1) coefficient in this example is statistically distinguishable from zero (z = 2.74, p = .006), which gives one indication of model adequacy. A second test is whether the residuals appear to be uncorrelated "white noise." We can obtain residuals (also predicted values, and other case statistics) after arima through predict:

. predict fyllres, resid

. corrgram fyllres, lags(15)

    LAG       AC       PAC        Q      Prob>Q
      1        …         …         …     0.8987
      2    0.0467    0.0465    .13631    0.9341
      3    0.0386    0.0497    .22029    0.9742
      4    0.0413    0.0496    .31851    0.9886
      5   -0.1834   -0.2450    2.2955    0.8069
      6   -0.0498   -0.0602    2.4442    0.8747
      7    0.1532    0.2156    3.8852    0.7929
      8   -0.0567   -0.0726     4.087    0.8492
      9   -0.2055   -0.3232    6.8055    0.6574
     10   -0.1156   -0.2418    7.6865    0.6594
     11    0.1397    0.2794    9.0051    0.6214
     12   -0.0028    0.1606    9.0057    0.7024
     13    0.1091    0.0647    9.8519    0.7060
     14    0.1014   -0.0547    10.603    0.7169
     15   -0.0673   -0.2837    10.943    0.7566

corrgram's Q test finds no significant autocorrelation among residuals out to lag 15. We could obtain exactly the same result by requesting a wntestq (white-noise test Q statistic) for 15 lags.

. wntestq fyllres, lags(15)

Portmanteau test for white noise
 Portmanteau (Q) statistic =    10.9435
 Prob > chi2(15)           =     0.7566

By these criteria, our AR(1) or ARIMA(1,0,0) model appears adequate. More complicated versions, with MA or higher-order AR terms, do not offer much improvement in fit.

A similar AR(1) model fits fylltemp over just the years 1973-1997. During this period, however, information about the winter North Atlantic Oscillation (wNAO) significantly improves the predictions.
For this model, we include wNAO as a predictor but keep an AR(1) term to account for autocorrelation of errors.

. arima fylltemp wNAO if tin(1973,1997), ar(1) nolog

ARIMA regression

Sample:  1973 to 1997                           Number of obs   =         25
                                                Wald chi2(2)    =      12.73
Log likelihood = -10.3481                       Prob > chi2     =     0.0017

------------------------------------------------------------------------------
             |               OPG
    fylltemp |      Coef.  Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
fylltemp     |
        wNAO |  -.1736227   .0531688    -3.27   0.001    -.2778317   -.0694138
       _cons |   1.703462   .1348599    12.63   0.000     1.439141    1.967782
-------------+----------------------------------------------------------------
ARMA         |
          ar |
         L1. |   .2965222    .237438     1.25   0.212    -.1688478    .7618921
-------------+----------------------------------------------------------------
      /sigma |     .36536   .0654008     5.59   0.000     .2371767    .4935432
------------------------------------------------------------------------------

. predict fyllhat
(option xb assumed; predicted values)

. label variable fyllhat "predicted temperature"

. predict fyllres2, resid

. corrgram fyllres2, lags(9)

    LAG       AC       PAC        Q      Prob>Q
      1    0.1485    0.1529    1.1929    0.2747
      2   -0.1028   -0.1320    1.7762    0.4114
      3    0.0495    0.1182    1.9143    0.5904
      4    0.0887    0.0546    2.3672    0.6686
      5   -0.1690   -0.2334    4.0447    0.5430
      6   -0.0234    0.0722    4.0776    0.6662
      7    0.2658    0.3062    8.4168    0.2973
      8   -0.0726   -0.2236    8.7484    0.3640
      9   -0.1623   -0.0999    10.444    0.3157

wNAO has a significant, negative coefficient in this model. The AR(1) coefficient now is not statistically significant. If we dropped the AR term, however, our residuals would no longer pass corrgram's test for white noise.

Figure 13.10 graphs the predicted values, fyllhat, together with the observed temperature series fylltemp. The model does reasonably well in fitting the main warming/cooling episodes and a few of the minor variations. To have the y-axis labels displayed with the same number of decimal places (0.5, 1.0, 1.5, ... instead of .5, 1, 1.5, ...) in this graph, we specify their format as %2.1f.

. graph twoway line fylltemp year if tin(1973,1997)
    || line fyllhat year if tin(1973,1997)
    || , ylabel(.5(.5)2.5, angle(horizontal) format(%2.1f)) ytitle("Degrees C") xlabel(1975(5)1995, grid) xtitle("") legend(label(1 "observed temperature") label(2 "model prediction") position(5) ring(0) col(1))

Figure 13.10 [line plot of observed temperature and model prediction, 1975-1995]

A technique called Prais-Winsten regression (prais), which corrects for first-order autoregressive errors, can also be illustrated with this example.

. prais fylltemp wNAO if tin(1973,1997), nolog

Prais-Winsten AR(1) regression -- iterated estimates

      Source |       SS       df        MS             Number of obs =      25
-------------+-------------------------------          F(  1,    23) =   23.14
       Model |  3.35819258     1   3.35819258          Prob > F      =  0.0001
    Residual |  3.33743545    23   .145105889          R-squared     =  0.5016
-------------+-------------------------------          Adj R-squared =  0.4799
       Total |  6.69562803    24   .278984501          Root MSE      =  .38093

    fylltemp |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        wNAO |    -.17356    .037567    -4.62   0.000    -.2512733   -.0958468
       _cons |   1.703436   .1153695    14.77   0.000     1.464776    1.942096
-------------+----------------------------------------------------------------
         rho |   .2951576

Durbin-Watson statistic (original)     1.344998
Durbin-Watson statistic (transformed)  1.789412

prais is an older method, more specialized than arima. Its regression-based standard errors assume that rho (ρ) is known rather than estimated. Because that assumption is untrue, arima generally makes the safer choice. In this example, the transformed Durbin-Watson statistic (1.79, up from the original 1.34) suggests that no significant first-order autocorrelation remains after the Prais-Winsten correction.
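As a closing sketch (not in the original text), the two models' wNAO coefficients (-.1736227 from arima and -.17356 from prais, per the tables above) could be recalled from their stored estimates for a direct comparison:

. quietly arima fylltemp wNAO if tin(1973,1997), ar(1)
. display [fylltemp]_b[wNAO]    // structural-equation coefficient

. quietly prais fylltemp wNAO if tin(1973,1997)
. display _b[wNAO]              // prais stores single-equation results

Here [fylltemp] names the structural equation of the arima model, just as [ARMA] named its disturbance equation earlier in this chapter.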
Introduction to Programming

As mentioned in Chapters 2 and 3, we can create a simple type of program by writing any sequence of Stata commands in a text (ASCII) file. Stata's Do-file Editor (click on Window - Do-file Editor, or the Do-file Editor icon) provides a convenient way to do this. After saving the do-file, we enter Stata and type a command with the form

do filename

that tells Stata to read filename.do and execute whatever commands it contains. More sophisticated programs are possible as well, making use of Stata's built-in programming language. Many of the commands used in previous chapters actually involve programs written in Stata. These programs might have originated either from Stata Corporation or from users who wanted something beyond Stata's built-in features to accomplish a particular task. Stata programs can access all the existing features of Stata, call other programs that call other programs in turn, and use model-fitting aids including matrix algebra and maximum likelihood estimation. Whether our purposes are broad, such as adding new statistical techniques, or narrowly specialized, such as managing a particular database, our ability to write programs in Stata greatly extends what we can do.

Substantial books (Stata Programming Reference Manual; Mata Reference Manual; Maximum Likelihood Estimation with Stata) have been written about Stata programming. This engaging topic is also the focus of periodic NetCourses (see www.stata.com) and a section of the User's Guide. The present chapter has the modest aim of introducing a few basic tools and giving examples that show how these tools can be used.

Basic Concepts and Tools

Do-files

Do-files are ASCII (text) files, created by Stata's Do-file Editor, a word processor, or any other text editor. They are typically saved with a .do extension. The file can contain any sequence of legitimate Stata commands. In Stata, typing the following command causes Stata to read filename.do and execute the commands it contains:

do filename

Each command in filename.do, including the last, must end with a hard return — unless we have reset the delimiter to some other character, through a #delimit command. For example,

#delimit ;

This sets a semicolon as the end-of-line delimiter, so that Stata does not consider a line finished until it encounters a semicolon. Setting the semicolon as delimiter permits a single command to extend over more than one physical line. Later, we can reset "carriage return" as the usual end-of-line delimiter by typing the following command:

#delimit cr

Ado-files

Ado (automatic do) files are ASCII files containing sequences of Stata commands, much like do-files. The difference is that we need not type do filename in order to run an ado-file. Suppose we type the command

clear

As with any command, Stata reads this and checks whether an intrinsic command by this name exists. If a clear command does not exist as part of the base Stata executable (and, in fact, it does not), then Stata next searches in its usual "ado" directories, trying to find a file named clear.ado. If Stata finds such a file (as it should), it then executes whatever commands the file contains. Ado-files have the extension .ado. User-written programs commonly go in a directory named C:\ado\personal, whereas the hundreds of official Stata ado-files get installed in C:\stata\ado.
Type sysdir to see a list of the directories Stata currently uses. Type help sysdir or help adopath for advice on changing them.

The which command reveals whether a given command really is an intrinsic, hardcoded Stata command or one defined by an ado-file; and if it is an ado-file, where that file resides. For example, logit is a built-in command, but the logistic command is defined by an ado-file named logistic.ado:

. which logit
built-in command:  logit

. which logistic
C:\STATA\ado\base\l\logistic.ado
*! version 3.1.9  01oct2002

This distinction makes no difference to most users, because logit and logistic work with similar ease and syntax when called.

Programs

Both do-files and ado-files might be viewed as types of programs, but Stata uses the word "program" in a narrower sense, to mean a sequence of commands stored in memory and executed by typing a particular program name. Do-files, ado-files, or commands typed interactively can define such programs. The definition begins with a statement that names the program. For example, to create a program named count5, we start with

program count5
Such lines can therefore be used to insert comments and explanation into a program, or interactively during a Stata session. For example, * This entire line is a comment. Alternatively, we can include a comment within an executable line. The simplest way to do so is to place the comment after a double slash, / / (with at least one space before the double slash). For example, s umma ri ze in come education II this part is the comment A triple slash (also preceded by at least one space) indicates that what follows, to the end of the line, is a comment; but then the following physical line should be executed as a continuation of the first. For example, summarize income education occupation age will be executed as if we had typed /// this part is the comment summarize income education occupation age With or without comments, the triple slash provides an easy way to include long command lines inaprogram. For example, the following lines would be read as one table command, even though they are separated by a hard return. table gender kids school if contam==\, contents(mean lived III median lived count lived) If our program has more than a few long commands, however, the # delimit ; approach (described earlier; also see help delimit) might be easier to write and read. It is also possible to include comments in the middle of a command line, bracketed by / * and * / . For example, s umma ri z e income I * this is the comme nt */ education occupation If one line ends with / * , and the next begins with * / , then Stata skips over the line break and reads both lines as a single command -— another line-lengthening trick sometimes found in programs. Introduction to Programming 365 Looping There are a number of ways to create program loops. One simple method employs the forvalues command. For example, the following program counts from 1 to 5. * Program that counts from one to five program count5 version 8.0 forvalues i = 1/5 { display ' i ' } end By typing these commands, we define program coun 15 . Alternatively, we could use the Do-file Editor to save the same series of commands as an ASCII file named counts.do. Then, typing the following causes Stata to read the file: do count5 Either way, by defining program count5 we make this available as a new command: count5 The command forvalues i 1/5 { assigns to local macro i the consecutive integers from 1 through 5. The command display 1i1 shows the contents of this macro. The name i is arbitrary. A slightly different notation would allow us to count from 0 to 100 by fives (0, 5, 10, ..., 100): forvalues j = 0(5)100 { The steps between values need not be integers. To count from 4 to 5 by increments of .01 (4.00,4.01,4.02,... ,5.00), write forvalues k = 4 ( . 01)5 ( Any lines containing valid Stata commands, between the opening and closing curly brackets { }, will be executed repeatedly for each of the values specified. Note that nothing (on that line) follows the opening bracket, and that the closing bracket requires a line of its own. The foreach command takes a different approach. Instead of specifying a set of consecutive numerical values, we give a list of items for which iteration occurs. These items could be variables, files, strings, or numerical values. Type help foreach to see the syntax of this command. forvalues and foreach create loops that repeat for a pre-specified number of times. If we want looping to continue until some other condition is met, the while command is useful. 
A section of program with the following general form will repeatedly execute the commands within curly brackets, so long as expression evaluates to ''true11: 366 Statistics with Stata while expression { command A command B } command Z As in previous examples, the closing bracket } should be on its own separate line, not at the end of a command line. When expression evaluates to "false," the looping stops and Stata goes on to execute command Z. Parallel to our previous example, here is simple program that uses a while loop to display onscreen the iteration numbers from 1 through 6: * Program that counts from one to six program count6 version 8.0 local iterate = 1 while "iterate* <= 6 { display 'iterate1 local iterate = 'iterate1 +■ 1 end A second example of a while loop appears in the gossip.ado program described later in this chapter. The Programming Reference Manual contains more about programming loops. If . .. else The if and else commands tell a program to do one thing if an expression is true, and something else otherwise. They are set up as follows: if expression { command A command B else { command Z \ For example, the following program segment checks whether the content of local macro span is an odd number, and informs the user of the result. if mt ( ' span 1 / 2 ) != ('span' - l)/2 { display "span is NOT an odd number- else { display "span IS an odd number Arguments Programs define new commands. In some instances (as with the earlier example counts ), we STd our command to do exactly the same thing each time it is used^ ^ ho^ we —a n that is modified bv arguments such as variable names or options. There are Introduction to Programming 367 two ways we can tell Stata how to read and understand a command line that includes arguments. The simplest of these is the args command. The following do-file (iistresi.do) defines a program that performs a two-variable regression, and then lists the observations with the largest absolute residuals. # IDvariable * Perform simple regression and list observations with # * largest absolute residuals. * Iistresi Yvariable Xvariable program Iistresi, sortpreserve version 8.0 args Yvar Xvar number id quietly regress 'Yvar' 'Xvar' capture drop Yhat capture drop Resid capture drop Absres quietly predict Yhat quietly predict Resid, resid quietly gen Absres = gsort -Absres drop Absres list 'id' 'Yvar' Yhat ab s(Re s i d) Resid in l/'number end The line args Yvar Xvar number id tells Stata that the command listresid should be followed by four arguments. These arguments could be numbers, variable names, or other strings separated by spaces. The first argument becomes the contents of a local macro named Yvar , the second a local macro named Xvar , and so forth. The program then uses the contents of these macros in other commands, such as the regression: quietly regress 'Yvar' 'Xvar' The program calculates absolute residuals (Absres), and then uses the gsort command (followed by a minus sign before the variable name) to sort the data in high-to-low order, with missing values last: gsort -Absres The option sortpreserve on the command line makes this program "sort-stable11: it returns the data to their original order after the calculations are finished. Dataset nations.dta, seen previously in Chapter 8, contains variables indicating life expectancy (life), per capita daily calories (food), and country name (country) for 109 countries. We can open this file, and use it to demonstrate our new program. 
A do command runs do-ftlc Iistresi.do, thereby defining the program Iistresi : . do Iistresi.do Next, we use the newly-defmed Iistresi command, followed by its four arguments. The first argument specifies they variable, the second jt, the third how many observations to list, and the fourth gives the case identifier. In this example, our command asks for a list of observations that have the five largest absolute residuals. Iistresi life food 5 country 368 Statistics with Stata | country life Yhat Res i d 1. | Libya 60 76 . 690 1 -16 690 1 1 2 1 Bhutan 4 4 60 . 4 9 5 7 7 -1 6 4 9 57 7 3 . ] Panama 72 58 . 13118 1 3 86882 4 . [ Malawi 45 58 . 5 8 7 3 2 -13 58232 5 . | Ecuador 66 52 . 4 5305 13 54 695 Life expectancies are lower than predicted in Libya, Bhutan, and Malawi. Conversely, life expectancies in Panama and Ecuador are higher than predicted, based on food supplies. Syntax The syntax command provides a more complicated but also more powerful way to read a command line. The following do-file named listres2.do is similar to ourprevious example, but it uses syntax instead of args : * Perform simple or multiple regression and list * observations with # largest absolute res idua1s. * 1 i s t r e s 2 yvar xvarlist [if] fin], number (#) [id (varname) ] program listres2, sortpreserve version 8.0 syntax varlist(min=l) [if] [in], marksample touse quietly regress ""varlist* if capture drop Yhat capture drop Res id capture drop Absres quietly predict Yhat if 'touse' quietly predict Resid if "touse' quietly gen Absres = abs (Resid) gsort -Absres drop Absres list "id' '1' Yhat Resid in 1/'number' end listres2 has the same purpose as the earlier listresl: it performs regression, then lists observations with the largest absolute residuals. This newer version contains several improvements, however, made possible by the syntax command. It is not restricted to two-variable regression, as was listresl . Iistres2 will work with any number of predictor variables, including none (in which case, predicted values equal the mean of v, and residuals are deviations from the mean). Iistres2 permits optional if and in qualifiers. A variable identifying the observations is optional with listres2 , instead of being required as it was with listresl . For example, we could regress life expectancy on food and energy, while restricting our analysis to only those countries where per capita GNP is above 500 dollars: Number(integer) [Id(string)] touse' resid 1 do listres2.do listres2 life food energy if gnpcap > 500 I country life Yhat Resid | 1 - 1 YemenPDR 4 6 61.34964 - 15 . 34 964 t 2 . | YemenAR 4 5 59.85839 -14.85839 | 3 . | Libya 6 0 73.62 5 1 6 -13.62516 [ 4. | S_Africa 5 5 6 7.9146 -12.9146 | 5 . J HongKong 7 6 64.61 022 11.3 5 978 | 6. 1 Panama 72 61 .7 7788 10.22212 | Introduction to Programming 369 n (6) i{country) The syntax line in this example illustrates some general features of the command'. syntax varlist (min = l) [if] [in], Number (integer) [Id (string) ] The variable list for a lis tres2 command is required to contain at least one variable name ( varlist {min=l) ). Square brackets denote optional arguments — in this example, the if and in qualifiers, and also the id () option. Capitalization of initial letters for the options indicates the minimum abbreviation that can be used. 
Because the syntax line in our example specified Number (integer) Id (string), an actual command could be written: listres2 life food, number(6) id(country) Or, equivalently, listres2 life food, n(6) i(country) The contents of local macro number are required to be an integer, and id is a string (such as country, a variable's name). This example also illustrates the marksample command which marks the subsample (as qualified by i f and in ) to be used in subsequent analyses. The syntax of syntax is outlined in the Programming Manual. Experimentation and studying other programs help in gaining fluency with this command. Example Program: Moving Autocorrelation The preceding sections presented basic ideas and example short programs. In this section, we apply those ideas to a slightly longer program that defines a new statistical procedure. The procedure obtains moving autocorrelations through a time series, as proposed for ocean-atmosphere data by Topliss (2001). The following do-file, gossip.do, defines a program that makes available a new command called gossip. Comments, in lines that begin with * or in phrases setoff by / / , explain what the program is doing. Indentation of lines has no effect on the program's execution, but makes it easier for the programmer to read. capture program drop gossip // FOR WRITING & DEBUGGING; DELETE LATER program q o s s : p version 8.0 * 5 yn t ax require? user to specify two variables (Yvar and TIMEva r) , and * the span of the moving window. Optionally, the user can ask to generate 370 Statistics with Stata introduction to Programming 371 * a new variable holding autocorrelations, to draw a graph, or both, syntax varlist (min^l max = 2 numeric), SPan ( integer) [GENerate(string) GRaph] if int('span'//) != ('span' - 1) /2 { display as error "Span must be an odd integer" ) else { * The first variable in 'varlist' becomes Yvar, the second TIMEvar. tokenize "varlist1 local Yvar " 1 1 local TIMEvar '2' tempvar NEWVAR guietly gen "NEWVAR' = . local miss = 0 * spanlo and spanhi are local macros holding the observation number at the * low and high ends of a particular window. spanmid holds the observation * number at the center of this window. local spanlo = 0 local spanhi - "span" local spanmid = int("span'/2) while 'spanlo' <^ _N -'span' { local spanhi = 'span' + "spanlo1 local spanlo = 'span]o' + 1 local spanmid = "spanmid' + 1 * The next lines check whether missing values exist within the window. * If they do exist, then no autocorrelation is calculated and we * move on to the next window. Users are informed that this occurred. quietly summ 'Yvar' in "spanlo'/ "spanhi' if r (N) != span ' { local miss = 1 } * The value of NEWVAR in observation "spanmid1 is set equal to the first * row, first column (1,1) element of the row vector of autocorrelations * r (AC) saved by corrgram. else ( quietly corrgram "Yvar" In 'spanlo'/'spanhilag(l) quietly replace 'NEWVAR' = el(r(AC),l,l) in 'spanmid' if "'graph'" != "" ( * The following graph command illustrates the use of comment s to ca u s e * Stata to skip over line breaks, so it reads the next two lines as if * they were one. graph twoway spike 'NEWVAR' 'TIMEvar', yline(O) /// ytitle("First-order autocorrelations of "Yvar" (span 'span')") } if "miss' == 1 { display as error "Caution: missing values exist" } if ""generate'" != "" { rename "NEWVAR' "generate' label variable "generate' /// "First-order autocorrelations of "Yvar' (span 'span')" end As the comments note, gossip requires time series (tsset) data. 
From an existing time series variable, gossip calculates a second time series consisting of lag-1 autocorrelation coefficients within a moving window of observations — for example, a moving 9-year span. Dataset nao.dta contains North Atlantic climate time series that can be used for illustration: Contains data from C:\data\nac.dta obs: 159 North Atlantic Oscillation 5 mean air temperature at Stykkisholmur, Iceland vars: 5 1 Aug 2005 10:50 size: 3,498 (99.9% of memory free) storage display value variable name type format label variable label year int ^ t y Year wNAO float %9.0g Winter NAO wNA04 float %9.Og Winter NAO smoothed temp float -9 . Og Mean air temperature (C) temp 4 float 9 . 0 g Mean air te mp etature smoothed Sorted by: yea r The variable temp records annua] mean air temperatures at Stykkisholmur in west Iceland from 1841 to 1999. temp4 contains smoothed values of temp (see Chapter 13). Figure 14.1 graphs these two time series. To visually distinguish between raw {temp) and smoothed {temp4) variables, we connect the former with very thin lines, clwidth (vthin), and the latter with thick lines, clwidth (thick). Type help linewidthstyle for a list of other line-width choices. . graph twoway line temp year, clpattern(solid) clwidth(vthin) || line temp4 year, clpattern(solid) clwidth(thick) || , ytitle("Temperature, degrees C") legend(off) Figure 14.1 1850 1900 Year 1950 2000 To calculate and graph a series of autocorrelations of temp, within a moving window of Q years, we type the following commands. They produce the graph ^^^ul 372 Statistics with Stata do gossip.do gossip temp year, span(9) generate(autotemp) graph Figure 14.2 _ra o u_ to 1850 1900 1950 2000 Year In addition to drawing Figure 14.2, gossip created a new variable named autotemp: . describe autotemp storage display value variable name type format label variable label _________________ autotemp float 9 . 0q First-order autocorrelations of temp (span 9) list year temp autotemp in 1/10 1 year temp autotemp 1. 1 184 1 2 73 2 . 1 1842 4 34 3 1 18 4 3 2 97 4 . 1 1844 3 4 1 5 . 1 1845 3 62 -.2324 8 37 6 . 1 184 6 4 28 088 3 512 7 . 1 1847 4 . 45 -. 01 94 607 8 . 1 1848 2 . 3 2 . 01752 47 9 . 1 184 9 3 .27 - . 03303 10 . i 1850 3 . 2 3 .0181154 autotemp values are missing for the first four years (1841 to 1844). In 1845, the autotemp value (- 2324837) equals the lag-1 autocorrelation of temp over the 9-year span from 1841 to 1849. This is the same coefficient we would obtain by typing the following command: corrgram temp in 1/9, lag(l AC PAC 2325 -0.2398 Introduction to Programming 373 -i o i-i o i Prob>Q [Autocorrelation] [Partial Autocor] 66885 0.4135 In 1846, autotemp (-.0883512) equals the lag-1 autocorrelation of temp over the 9 years from 1842 to 1850, and so on through the data, autotemp values are missing for the last four years in the data (1996 to 1999), as they are for the first four. The pronounced Arctic warming of the 1920s, visible in the temperatures of Figure 14.1, manifests in Figure 14.2 as a period of consistently positive autocorrelations. A briefer period of positive autocorrelations in the 1960s coincides with a cooling climate. Topliss (2001) suggests interpretation of such autocorrelations as indicators of changing feedbacks in ocean-atmosphere systems. The do-file gossip.do was written incrementally, starting with input components such as the syntax statement and span macros, running the do-file to check how these work, and then adding other components. 
Not all of the trial runs produced satisfactory results. Typing the following command causes Stata to display programs line-by-line as they execute, so we can see exactly where an error occurs: set trace on Later, we can turn this feature off by typing set trace off gossip.do contains a first line, capture program drop gossip , that discards the program from memory before defining it again. This is helpful during the writing and debugging stage, when a previous version of our program might have been incomplete or incorrect. Such lines should be deleted once the program is mature, however. The next section describes further steps toward making gossip available as a regular Stata command. Ado-File_ Once we believe our do-file defines a program that we will want to use again, we can create an ado-file to make it available like any other Stata command. For the previous example, gossip.do, the change involves two steps: 1. With the Do-file Editor, delete the initial "DELETE LATER" line that had been inserted to streamline the program writing and debugging phase. We can also delete the comment lines. Doing so removes useful information, but it makes the program more compact and easier to read. 2. Save our modified file, renaming it to have an .ado extension (for example, gossip.ado), in a new directory. The recommended location is in C:\ado\personal; you might need to create this directory and subdirectory if they do not already exist. Other locations are possible, but review the User's Manual section on "Where does Stata look for ado-files?" before proceeding. Once this is done, we can use gossip as a regular command within Stata. A listing of gossip.ado follows. 374 Statistics with Stata Introduction to Programming 375 1 ! version 2.C * ! L. Hamilton, Statistics with Stata (7004) prog tram goss ip version 8.0 syntax "a r 11 s t ( m i n = 1 rna x = 2 numeric) , Span (integer) [GENerate (string) GRap i f int ( ' span 1 //.) ! -: ( span' - 1 ) /2 { display as error "Span must be an odd integer" J else { tokenizo ~va r1i s t' local Yvar '1' local TIMEvar '2' tempvar N EWVAR quietly gen ' NEWVAR' " . local miss = 0 local s p a n 1 o = 0 local spanhi = 'span' local spanmid = mt('span'/2) while 's p an1o' < = _N - s p a n ' | local spanhi = 'span' + 'spanio1 local spanio - "spanio1 + 1 local spannid = 'spanmid' + 1 quietly suran Vvar' in 'spanio'/ 'spanhi' if r(N) !- 'span' i local miss = 1 ) else \ quietly corrgram Yvar' in 'spanio' ''spanhi', lag(l) quietly replace ' NEWVAR 1 = el ( r (AC ) , 1, 1) m "spanrmd' if graph "' ! = "" { graph twoway spi>o ~NEWVAR' "TIMEvar•, yline(O) / ' f ytitle("First-order autocorrelations of Yvar' (span 'span')" J if 'miss' == 1 I display as error "Caution: missing values exist" i if "'generate'" !- "" { rename ' NEWVAR' ' generate' label variable 'generate' // / "First-order autocorrelations of Yvar' (span 'span')" end The program could be refined further to make it more flexible, elegant, and user-friendly. Note the inclusion ofcomments stating the source and "version 2.0" in the first two lines, which both begin *! . The comment refers to version 2.0 of gossip.ado, not Stata (an earlier version of gossip.ado appeared in a previous edition of this book). The Stata version suitable for this program is specified as 8.0 by the version command a few lines later. Although the *! comments do not affect how the program runs, they are visible to a which command: which gossip \ a d o \ p p r s o n a 1 \ q o s s i p. a d o version 2.0 L. 
Hamilton, Statistics with Stata (2004) Once gossip, ado has been saved in the C:\ado\personal directory, the command gossip could be used at any time. If we are following the steps in this chapter, which previously defined a preliminary version of gossip , then before running the new ado-file version we should drop the old definition from memory by typing program drop gossip We are now prepared to run the final, ado-file version. To see a graph of span-15 autocorrelations of variable wNAO from dataset nao.dta, for example, we would simply open nao.dta and type . gossip wNAO year, span (15) graph Help File Help files are an integral aspect of using Stata. For a user-written program such as gossip, ado, they become even more important because no documentation exists in the printed manuals. We can write a help file for gossip.ado by using Stata's Do-file Editor to create a text file named gossip.hip. This help file should be saved in the same ado-file directory (for example, C:\ado\personal) as gossip.ado. Any text file, saved in one of Stata's recognized ado-file directories with a name of the form filename.hip, will be displayed onscreen by Stata when we type help filename. For example, we might write the following in the Do-file Editor, and save it in directory C:\ado\personal as file gossip!.hip. Typing help gossipl at any time would then cause Stata to display the text. help for gossip L. Hamilton Moving first-order autocorrelations gossip yvar timevar, span(#) [ generate (newvar) graph ] Description calculates first-order autocorrelations of time series yvar, within a moving window of span #. For example, if we speci fy span (l) gen (new), then the first through 3rd values of new are missing. The 4th value of new equals the lag-1 autocorrelation of yvar across observations 1 through 1. The 5th value of new equals the lag-1 autocorrelation of yvar across observations 2 through 8, and so forth. The last 3 values of new are missing. See Topiiss (2001) for a rationale and applications of this statistic to atmosphere-ocean data. Statistics with Stata (2 004) discusses the gossip program itself. gossip requires tsset data. timevar is the time variable to be used for graphing. Opt ions span (#) specifies the width of the window for calculating autocorrelations. This option is required; # should be an odd integer. gen (newvar) creates a new variable holding the autocorrelation coefficients. 376 Statistics with Stata graph requests a spike plot of lag-1 autocorrelations vs. timevar. Examples . gossip water month, span(13) graph gossip water month, span (9) gen {autowater) gossip water month, span{17) gen (autowater) graph References Hamilton, Lawrence C. 2004. Statistics with Stata. Pacific Grove, CA: Duxbury. Topliss, Brenda J. 2001. "Climate variability I: A conceptual approach to ocean-atmosphere feedback." In Abstracts for AGU Chapman Conference, The North Atlantic Oscillation, Nov. 28 - Dec 1, 2000, Ourense, Spain. Nicer help files containing links, text formatting, dialog boxes, and other features can be designed using Stata Markup and Control Language (SMCL). All official Stata help files, as well as log files and onscreen results, employ SMCL. The following is an SMCL version of the help file for gossip. Once this file has been saved in C:\ado\personal with the file name gossip.hlp, typing help gossip will produce a readable and official-looking display. {smcl} (* laug2Q03}{...} {hline} help for {hi:gossipHright:(L. 
Hamilton)} {hiine} [title:Moving first-order autocorrelations J {p 8 12} (cmd-.gossip} (it:yvar timevar) { cmd : , } { cmdab : sp : an ] {cmd: (} {it:#}{cmd:)} [ [cmdab:gen:erate}{cmd:(}{it:newvar}{cmd:)} [cmdab:gr:aph} ] {title:Description} [pMcmd:gossipi calculates first-order autocorrelations of time series {it:yvar}, within a moving window of span {it:#}. For example, if we specify {cmd:span(}7(cmd:)} {cmd:gen(}{it:new}{cmd:)}, then the first through 3rd values of {it:new} are missing. The 4th value of {it:new} equals the lag-1 autocorrelation of {it:yvar} across observations 1 through 7. The 5th value of (it:new) equals the lag-1 autocorrelation observations 2 through 8, and so forth. The last are missing. See Topliss (2001) for a rationale this statistic to atmosphere-ocean data, stata.com/bookstore/sws.html":Statistics with Stata of {it:yvar} across 3 values of (it:new and applications of {browse "http://www (2004) discusses the {cmd:gossip} program itself.{p_end} {p}{cmd:gossip} requires (cmd:tsset} data. variable to be used for graphing.{p_end} it:timevar} is the time (title:Options] (p 0 4}{cmd:span (}{it:#} {cmd: calculating autocorrelations. an odd integer. specifies the width of the window for This option is required; {it: #} should be Introduction to Programming 377 {p 0 4]{cmd:gen(}{it:newvar}{cmd:)} creates a new variable holding the autocorrelation coefficients. {p 0 4}{cmd:graph} requests a spike plot of lag-1 autocorrelations vs. {it:timevar J . {title:Examples} (p 8 12}{inp:. gossip water month, span(13) graph)(p_end} {p 8 12}{inp:. gossip water month, span(9) gen (autowater) ) (p^end) (p 8 12}{inp:. gossip water month, span (11) gen (autowater) graph}(p end? {title:References} {p 0 4}Hamilton, Lawrence C. 2004. {browse "http://www.stata.com/bookstore/sws.html":Statistics with Stata}. Pacific Grove, CA: Duxbury.{p_end} {p 0 4}Topliss, Brenda J. 2001. "Climate variability I: A conceptual approach to ocean-atmosphere feedback." In Abstracts for AGU Chapman Conference, The North Atlantic Oscillation, Nov. 28 - Dec 1, 2000, Ourense, Spain, citation.{p_end} The help file begins with { smcl}. which tells Stata to process the file as SMCL. Curly brackets { } enclose SMCL codes, many of which have the form {command: text} or { command arguments : text}. The following examples illustrate how these codes are interpreted. Draw a horizontal line. Highlight the text "gossip". Display the text "Moving . . ." as a title. Right-justify the text "L. Hamilton". Format the following text as a paragraph, with the first line indented 8 columns and subsequent lines indented 12. Display the text "gossip" as a command. That is, show "gossip" with whatever colors and font attributes are presently defined as appropriate for a command. Display the text "yvar" in italics. Display "span" as a command, with the letters "sp" marked as the minimum abbreviation. Format the following text as a paragraph, until terminated by {hline } (h i :go s s i p} {title:Moving. . . } {right:L Hamilton} (P 8 12) {cmd:go s s i p} {it: yvar} [cmdab:sp:an} {p_end}. (browse "http://www.stata.com/bookstore/sws.html":Statistics. . . } Link the text "Statistics with Stata" to the web address (URL) http://www.stata.com/bookstore/sws.html. Clicking on the words "Statistics with Stata" should then launch your browser and connect it to this URL. The Programming Manual supplies details about using these and many other SMCL commands. 
378 Statistics with Stata Matrix Algebra Matrix algebra provides essential tools for statistical modeling. Stata's matrix commands and matrix programming language (Mata) are too diverse to describe adequately here; the subject requires its own reference manual (Mata Reference Manual), in addition to many pages in the Programming Reference Manual and User's Guide. Consult these sources for information about the Mata language, which is new with Stata 9. The examples in this section illustrate earlier matrix commands, which also still work (hence the placement of version 8 .0 commands at the start of each program). The built-in Stata command regres s performs ordinary least squares (OLS) regression, among other things. But as an exercise, we could write an OLS program ourselves, olsl.do (following) defines a primitive regression program that does nothing except calculate and display the vector of estimated regression coefficients according to the familiar OLS equation: b = (X'X) 1 X'y * A very simple program, "olsl" estimates linear regression * coefficients using ordinary least squares (OLS). program olsl version 8.0 * The syntax allows only for a variable list with one or more * numeric variables. syntax varlist(min=l numeric) * "tempname..." assigns names to temporary matrices to be used in this * program. When olsl has finished, these matrices will be dropped. tempname crossYX crossX crossY b * "matrix accum..." forms a cross-product matrix. The K variables in * varlist, and the N observations with nonmissing values on all K variables, * comprise an N row, K column data matrix we might call yX. * The cross product matrix crossYX equals the transpose of yX times yX. * Written algebraically: * crossYX = (yX)'yX quietly matrix accum 'crossYX' = 'varlist' * Matrix crossX extracts rows 2 through K, and columns 2 through K, * from crossYX: * crossX = X'X matrix 'crossX' = 'crossYX' [2. . . , 2. . .] * Column vector crossY extracts rows 2 through K, and column 1 from crossYX: * crossY =X'y matrix 'crossY' = 'crossYX' [2...,1] * The column vector b contains OLS regression coefficients, obtained by * the classic estimating equation: * b - inverse(X'X)X'y matrix " b' = syminv('crossX') * 'crossY' * Finally, we list the coefficient estimates, which are the contents of b. matrix list 'b1 end Comments explain each command in olsl.do. A comment-free version named ols2.do (following) gives a clearer view of the matrix commands: Introduction to Programming 379 program ols2 vers ion 8.0 syntax varlist(min=i numeric) tempname crossYX crossX crossY quietly matrix accum 'crossYX' matrix 'crossX' = 'crossYX'[2. matrix 'crossY' = "crossYX1[2.. matrix 'b' = syminv('crossX*) '■ matrix list 'b' end varlist 2. . .] 1] crossY' Neither olsl.do nor olsl.do make anv provision for \ n nr \* ™ it nr nnti^n. tu 1 A , '.r^viMonror in or if qualifiers, syntax errors, or options. They also do not calculate standard errors, confidence intervals or the othe ancillary statistics we usually want with regression. To see just what the^do^ompl sh we will analyze a small dataset on nuclear power plants (reactor, dta): cc0mP11^ we Contains data from c:\data\reactor.dt, obs : 5 va r s : size: 6 130 (99.9% of memory free) Reactor decommi (from Brown et al 1 Aug 2005 10:50 oning co 1 986) storage disci, variable name type format s t r 1 4 %14s de com byte %8 . Og capacity int ^8 . Og years byte 19 . Og start int %8 . Og close int % 8 . 
0 g : start value label variable labe 1 Reactor site Decommissioning cost, millions Generating capacity, megawatts Years m operation Year operations started Year operations closed The cost of decommissioning a reactor increases with its generating capacity and with the number of years in operation, as can be seen by using regress: . regress decom capacity years Source j ss df MS Number of obs 5 Model 1 Residual J 4 666.16571 24.6342883 2 2333.0828 6 2 12.317 1442 F( 2, 2) Prob > F R-squared 189.42 = 0.0053 = 0.9947 Total 1 4690.80 4 1 ] 72.70 Adj R-squared Root MSE = 0.9895 = 3.5 0 96 decom | Coef . Std. Err. t P> 1 t 1 years | _cons I . 1 7587 3 9 3 . 8 9 9314 -11 . 3 9963 .02 4 7774 .2643087 4.330311 7 . 10 1 4 . 75 -2 . 63 0.019 0.005 0.119 . 0692653 2 .7 6208 5 .2824825 5 . 036543 Our home-brewed program olsl.do yields exactly the same regression coefficients: 380 Statistics with Stata do ols2.do . ols2 decom capacity years 0 0 0 0 0 3 [ 3 , 1 ] decom capacity .1758739 years 3.899 3139 _con? -11.3P9633 Although its results are correct the minimalist ols2 program lacks many features we would want in a useful modeling command. The following ado-file, ols3.ado, defines an improved program named ols3 . This program permits in and if qualifiers, and optionally allows specification of the level for confidence intervals. It calculates and neatly displays regression coefficients in a table with their standard errors, / tests, and confidence intervals. 1 Introduction to Programming 381 use of a right single quote ( ' ) as the "matrix transpose" operator. We write the transpose of the coefficients vector (syminv('crossX') * ^crossY') as follows: (syminv ( 'cross.X' ) * 'crossY' ) ' The ols3 program is defined as e-class, indicating that this is a statistical model-estimation command: program o x s 3, eclass E-class programs store their results with e () designations. After the previous ols3 command, these have the following contents: ereturn list scalars: e ( N ) = 5 e ( d f r ) - 2 * ! vers i or. 2.0 laug2 0 03 *! Matrix demonst ra tion: more complete OLS regression program, program o 1 s 3 , eclass version 8.0 syntax varlist (min-1 numeric) [in] [rf] [, Level (integer $S_level) ma r ks ample tou s e tokenize " 'varlist ' " tempname crossYX crossX crossY b hat V quietly matrix accum 'crossYX' - 'varlist' if touse' local nobs = r(N) local df = "nobs' - (rowsof('crossYX') - 1) crossX' = 'crossYX' [2 ... ,2 ... ] crossY * = 'crossYX' [2. . .,1] b' = (syminv ('crossX') * "crossY')' hat' = ' b ' * "crossY' V = syminv ('crossX' ) * ("crossYX1 [1, 1] - "hat'[l,l])/ df ereturn post b' 'V', dot ( ' df ' ) obs ( 'nobs') depname ( '1') /// es ample ( 'touse') ereturn local depvar ereturn local cmd "ols3" if "level' 10 I "level' > 99 { display as error "level( ) must be between 10 and 99 inclusive exit 198 } ereturn d sp 1 a y , level ('level' ) Because ois3.ado is an ado-file, we can simply type ols3 as a command: mat r i :■: matrix mat rlx mat r i x mat r i x ols3 decom capacity years dec om | Coef . S t d. Err. t P> 1 t I [95% C o n f. Interval] capa c i ty 1 years 1 cons I . 1 7587 3« 3.8 9931 4 -11.39963 .02 4 777 4 .2643087 4.330311 7 1 4 -2 10 75 63 0.019 0 .005 0.119 . 0 692 65 3 2.762085 -30.03146 . 2 82 4 825 5.036543 7.23219 ols3.ado contains familiar elements including syntax and marksample commands, as well as matrix operations built upon those seen earlier in olsl.do and ols2.do. Note the mat rice s funct ions e ( cmd) e (depvar) e (b) e fV) ols3" decom1 1x3 3x3 e(samp 1e) display e(N) 5 . 
Although its results are correct, the minimalist ols2 program lacks many features we would want in a useful modeling command. The following ado-file, ols3.ado, defines an improved program named ols3. This program permits in and if qualifiers, and optionally allows specification of the level for confidence intervals. It calculates and neatly displays regression coefficients in a table with their standard errors, t tests, and confidence intervals.

*! version 2.0  1aug2003
*! Matrix demonstration: more complete OLS regression program.
program ols3, eclass
     version 8.0
     syntax varlist(min=1 numeric) [in] [if] [, Level(integer $S_level) ]
     marksample touse
     tokenize "`varlist'"
     tempname crossYX crossX crossY b hat V
     quietly matrix accum `crossYX' = `varlist' if `touse'
     local nobs = r(N)
     local df = `nobs' - (rowsof(`crossYX') - 1)
     matrix `crossX' = `crossYX'[2..., 2...]
     matrix `crossY' = `crossYX'[2..., 1]
     matrix `b' = (syminv(`crossX') * `crossY')'
     matrix `hat' = `b' * `crossY'
     matrix `V' = syminv(`crossX') * (`crossYX'[1,1] - `hat'[1,1]) / `df'
     ereturn post `b' `V', dof(`df') obs(`nobs') depname(`1') ///
          esample(`touse')
     ereturn local depvar "`1'"
     ereturn local cmd "ols3"
     if `level' < 10 | `level' > 99 {
          display as error "level() must be between 10 and 99 inclusive"
          exit 198
     }
     ereturn display, level(`level')
end

Because ols3.ado is an ado-file, we can simply type ols3 as a command:

. ols3 decom capacity years

------------------------------------------------------------------------------
       decom |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    capacity |   .1758739   .0247774     7.10   0.019     .0692653    .2824825
       years |   3.899314   .2643087    14.75   0.005     2.762085    5.036543
       _cons |  -11.39963   4.330311    -2.63   0.119    -30.03146     7.23219
------------------------------------------------------------------------------

ols3.ado contains familiar elements, including syntax and marksample commands, as well as matrix operations built upon those seen earlier in ols1.do and ols2.do. Note the use of a right single quote (') as the "matrix transpose" operator. We write the transpose of the coefficient vector syminv(`crossX') * `crossY' as follows:

     (syminv(`crossX') * `crossY')'

The ols3 program is defined as e-class, indicating that this is a statistical model-estimation command:

     program ols3, eclass

E-class programs store their results with e() designations. After the previous ols3 command, these have the following contents:

. ereturn list

scalars:
               e(N) =  5
            e(df_r) =  2

macros:
             e(cmd) : "ols3"
          e(depvar) : "decom"

matrices:
               e(b) :  1 x 3
               e(V) :  3 x 3

functions:
          e(sample)

. display e(N)
5

. matrix list e(b)

e(b)[1,3]
      capacity       years       _cons
y1    .1758739   3.8993139  -11.399633

. matrix list e(V)

symmetric e(V)[3,3]
            capacity       years       _cons
capacity   .00061392
   years  -.00216732    .0698591
   _cons  -.01492755    -.942626   18.751591

The e() results from e-class programs remain in memory until the next e-class command. In contrast, r-class programs such as summarize store their results with r() designations, and these remain in memory only until the next e- or r-class command. Several ereturn lines in ols3.ado save the e() results, then use these in the output display:

     ereturn post `b' `V', dof(`df') obs(`nobs') depname(`1') ///
          esample(`touse')

The above command sets the contents of e() results, including the coefficient vector (b) and the variance-covariance matrix (V). This makes all the post-estimation features detailed in help estimates and help postest available. Options specify the residual degrees of freedom (df), the number of observations used in estimation (nobs), the dependent variable name (`1', meaning the contents of the first macro obtained when we tokenize varlist), and the estimation sample marker (touse).

     ereturn local depvar "`1'"

This command sets the name of the dependent variable, macro 1 after tokenize `varlist', to be the contents of macro e(depvar).

     ereturn local cmd "ols3"

This sets the name of the command, ols3, as the contents of macro e(cmd).

     ereturn display, level(`level')

The ereturn display command displays the coefficient table based on our previous ereturn post. This table follows a standard Stata format; its first two columns contain coefficient estimates (from b) and their standard errors (square roots of diagonal elements from V). Further columns are t statistics (first column divided by second), two-tail t probabilities, and confidence intervals based on the level specified in the ols3 command line (or defaulting to 95%).
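Because ols3 posted b and V, Stata's usual post-estimation machinery works afterwards. As a quick illustration of our own (not an example from the text), the test command can draw on the posted results:

     . quietly ols3 decom capacity years
     . test capacity

The resulting F(1, 2) statistic is simply the square of the capacity coefficient's t statistic, about 50.4, with the same two-tail probability, 0.019.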
Bootstrapping

Bootstrapping refers to a process of repeatedly drawing random samples, with replacement, from the data at hand. Instead of trusting theory to describe the sampling distribution of an estimator, we approximate that distribution empirically. Drawing k bootstrap samples of size n (from an original sample also of size n) yields k new estimates. The distribution of these bootstrap estimates provides an empirical basis for estimating standard errors or confidence intervals (Efron and Tibshirani 1986; for an introduction, see Stine in Fox and Long 1990). Bootstrapping seems most attractive in situations where the statistic of interest is theoretically intractable, or where the usual theory regarding that statistic rests on untenable assumptions. Unlike Monte Carlo simulations, which fabricate their data, bootstrapping typically works from real data.

For illustration, we turn to islands.dta, containing area and biodiversity measures for eight Pacific island groups (from Cox and Moore 1993).

Contains data from c:\data\islands.dta
  obs:             8                          Pacific island biodiversity
                                              (Cox & Moore 1993)
 vars:             4                          1 Aug 2005 10:50
 size:           208  (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
island          str15  %15s                   Island group
area            float  %9.0g                  Land area, km^2
birds           byte   %8.0g                  Number of bird genera
plants          int    %8.0g                  Number flowering plant genera

Sorted by:

Suppose we wish to form a confidence interval for the mean number of bird genera. The usual confidence interval for a mean derives from a normality assumption. We might hesitate to make this assumption, however, given the skewed distribution that, even in this tiny sample (n = 8), almost leads us to reject normality:

. sktest birds

                   Skewness/Kurtosis tests for Normality
                                                 ------- joint ------
    Variable |  Pr(Skewness)   Pr(Kurtosis)  adj chi2(2)    Prob>chi2
-------------+-------------------------------------------------------
       birds |      0.079          0.181          4.75        0.0928

Bootstrapping provides a more empirical approach to forming confidence intervals. An r-class command, summarize with the detail option, unobtrusively stores its results as a series of macros. Some of these macros are:

     r(N)          Number of observations
     r(mean)       Mean
     r(sd)         Standard deviation
     r(Var)        Variance
     r(skewness)   Skewness
     r(p50)        50th percentile or median
     r(min)        Minimum
     r(max)        Maximum
     r(sum)        Sum
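For example (a quick check of ours, not shown in the original), a stored result can be displayed immediately after summarize runs:

     . quietly summarize birds, detail
     . display r(mean)
     47.625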
Stored results simplify the job of bootstrapping any statistic. To obtain bootstrap confidence intervals for the mean of birds, based on 1,000 resamplings, and save the results in new file boot1.dta, type the following command. The output includes a note warning about the potential problem of missing values, but that does not apply to these data.

. bs "summarize birds, detail" "r(mean)", rep(1000) saving(boot1)

command:      summarize birds , detail
statistic:    _bs_1      = r(mean)

Warning:  Since summarize is not an estimation command or does not set
          e(sample), bootstrap has no way to determine which observations
          are used in calculating the statistics and so assumes that all
          observations are used.  This means no observations will be
          excluded from the resampling due to missing values or other
          reasons.  If the assumption is not true, press Break, save the
          data, and drop the observations that are to be excluded.  Be sure
          the dataset in memory contains only the relevant data.

Bootstrap statistics                              Number of obs    =        8
                                                  Replications     =     1000

------------------------------------------------------------------------------
Variable |  Reps  Observed      Bias    Std. Err.  [95% Conf. Interval]
---------+--------------------------------------------------------------------
   _bs_1 |  1000    47.625   -.475875   12.39088    23.30986   71.94014   (N)
         |                                             25.75    74.8125   (P)
         |                                                27      78.25  (BC)
------------------------------------------------------------------------------
Note:  N   = normal
       P   = percentile
       BC  = bias-corrected

The bs command states in double quotes what analysis is to be bootstrapped ("summarize birds, detail"). Following this comes the statistic to be bootstrapped, likewise in its own double quotes ("r(mean)"). More than one statistic could be listed, each separated by a space. The example above specifies two options:

rep(1000)       Calls for 1,000 repetitions, or drawing 1,000 bootstrap samples.
saving(boot1)   Saves the 1,000 bootstrap means in a new dataset named boot1.dta.

The bs results table shows the number of repetitions performed and the "observed" (original-sample) value of the statistic being bootstrapped, in this case the mean birds value 47.625. The table also shows estimates of bias, standard error, and three types of confidence intervals. "Bias" here refers to the mean of the k bootstrap values of our statistic (for example, the mean of the 1,000 bootstrap means of birds), minus the observed statistic. The estimated standard error equals the standard deviation of the k bootstrap statistic values (for example, the standard deviation of the 1,000 bootstrap means of birds). This bootstrap standard error (12.39) is less than the conventional standard error (13.38) calculated by ci:

. ci birds

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
       birds |          8      47.625    13.38034        15.98552    79.26448

Normal-approximation (N) confidence intervals in the bs table are obtained as follows:

     observed sample statistic ± t × bootstrap standard error

where t is chosen from the theoretical t distribution with k - 1 degrees of freedom. Their use is recommended when the bootstrap distribution appears unbiased and approximately normal.
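As an arithmetic check (ours, not part of the original output), we can reproduce the N interval from the table. With k = 1,000 replications, the t distribution has 999 degrees of freedom, and invttail() supplies the critical value:

     . display 47.625 - invttail(999, .025)*12.39088
     . display 47.625 + invttail(999, .025)*12.39088

These return approximately 23.310 and 71.940, agreeing with the N interval shown above.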
Percentile (P) confidence intervals simply use percentiles of the bootstrap distribution (for a 95% interval, the 2.5th and 97.5th percentiles) as lower and upper bounds. These might be appropriate when the bootstrap distribution appears unbiased but nonnormal. The bias-corrected (BC) interval also employs percentiles of the bootstrap distribution, but chooses these percentiles following a normal-theory adjustment for the proportion of bootstrap values less than or equal to the observed statistic. When substantial bias exists (by one guideline, when bias exceeds 25% of one standard error), these intervals might be preferred.

Since we saved the bootstrap results in a file named boot1.dta, we can retrieve this and examine the bootstrap distribution more closely if desired. The saving(boot1) option created a dataset with 1,000 observations and a variable named _bs_1, holding the mean of each bootstrap sample.

Contains data from c:\data\boot1.dta
  obs:         1,000                          bs: summarize birds, detail
 vars:             1                          1 Aug 2005 15:10
 size:         4,000  (99.9% of memory free)

              storage  display     value
variable name   type   format      label      variable label
_bs_1           float  %9.0g                  r(mean)

Sorted by:

. summarize

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+----------------------------------------------------------
       _bs_1 |      1000    47.14912    12.39088     14.625         92

Note that the standard deviation of these 1,000 bootstrap means equals the bootstrap standard error (12.39088) shown earlier in the bs results table. The mean of the 1,000 means, minus the observed (original-sample) mean, equals the bias:

     47.14912 - 47.625 = -.47588

Figure 14.3 shows the distribution of these 1,000 sample means, with the original-sample mean (47.625) marked by a vertical line. The distribution exhibits mild positive skew, but is not far from a theoretical normal curve.

. histogram _bs_1, norm bcolor(gs10) xaxis(1 2) xline(47.625)
     xlabel(47.625, axis(2)) xtitle("", axis(2))

[Figure 14.3: histogram of the 1,000 bootstrap means of r(mean), with an overlaid normal curve; a vertical line marks 47.625]

Biologists have observed that biodiversity, or the number of different kinds of plants and animals, tends to increase with island size. In islands.dta, we have data to test this proposition with respect to birds and flowering plants. As expected, a strong linear relationship exists between birds and area:

. regress birds area

      Source |       SS       df       MS              Number of obs =       8
-------------+------------------------------           F(  1,     6) =  162.96
       Model |  9669.83255     1  9669.83255           Prob > F      =  0.0000
    Residual |  356.042449     6  59.3404082           R-squared     =  0.9645
-------------+------------------------------           Adj R-squared =  0.9586
       Total |   10025.875     7  1432.26786           Root MSE      =  7.7033

------------------------------------------------------------------------------
       birds |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        area |   .0026512   .0002077    12.77   0.000      .002143    .0031594
       _cons |   13.97169    3.79046     3.69   0.010     4.696773    23.24662
------------------------------------------------------------------------------

An e-class command, regress saves a set of e() results, as noted earlier in this chapter. It also creates or updates a set of system variables containing the model coefficients (_b[varname]) and standard errors (_se[varname]). To bootstrap the slope and y-intercept from the previous regression, saving the results in file boot2.dta, type

. bs "regress birds area" "_b[area] _b[_cons]", rep(1000) saving(boot2)

command:      regress birds area
statistics:   _bs_1      = _b[area]
              _bs_2      = _b[_cons]

Bootstrap statistics                              Number of obs    =        8
                                                  Replications     =     1000

------------------------------------------------------------------------------
Variable |  Reps  Observed      Bias    Std. Err.  [95% Conf. Interval]
---------+--------------------------------------------------------------------
   _bs_1 |  1000  .0026512  -.0000737   .0003345    .0019947   .0033077   (N)
         |                                          .0019759   .0029066   (P)
         |                                            .00199   .0029246  (BC)
---------+--------------------------------------------------------------------
   _bs_2 |  1000  13.97169   .6230986   3.637705    6.833275   21.11011   (N)
         |                                          7.891942   21.74494   (P)
         |                                          6.949539   19.73012  (BC)
------------------------------------------------------------------------------
Note:  N   = normal
       P   = percentile
       BC  = bias-corrected

The bootstrap distribution of coefficients on area is severely skewed (skewness = 4.12). Whereas the bootstrap distribution of means (Figure 14.3) appeared approximately normal, and produced bootstrap confidence intervals narrower than the theoretical confidence interval, in this regression example bootstrapping obtains larger standard errors and wider confidence intervals.

In a regression context, bs ordinarily performs what is called "data resampling," or resampling intact observations. An alternative procedure called "residual resampling" (resampling only the residuals) requires a bit more programming work; a sketch follows below. Two additional commands make such do-it-yourself bootstrapping easier:

bsample     Draws a sample with replacement from the existing data,
            replacing the data in memory.
bootstrap   Runs a user-defined program reps() times on bootstrap samples
            of size size().

The Base Reference Manual gives examples of programs for use with bootstrap.
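Here is a minimal sketch of residual resampling. It is ours, not an example from the Base Reference Manual, and the program and variable names (resboot, yhat, ystar, Bstar) are hypothetical. The idea: keep the original fitted values, attach to each a residual drawn at random with replacement, re-estimate, and let simulate collect the resulting coefficients.

     program resboot, rclass
          version 8.0
          use c:\data\islands, clear
          quietly regress birds area
          quietly predict yhat                // fitted values
          quietly predict e, resid            // original-sample residuals
          * each fitted value gets a randomly drawn residual, with replacement
          quietly generate ystar = yhat + e[1 + int(uniform()*_N)]
          quietly regress ystar area
          return scalar Bstar = _b[area]
     end

     . simulate "resboot" bstar = r(Bstar), reps(1000)

The standard deviation of the resulting bstar values then estimates the residual-resampling standard error of the slope.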
Once central has been defined, whether through a do-file, ado-file, or typing commands interactively, we can call this program with a simulate command. To create a new dataset containing means and medians ofx and >v from 5,000 random samples, type i 388 Statistics with Stata simulate "central" xmean = r(xmean) xmedian = r(xmedian) wmean = r(wmean} wmedian = r(wmedian) , reps (5000) :Vt)ua n.i : cen t r a 1 statistics: xmean = r(xnean) xmedian - rfxmedian) wme a n = r \ wne a n ) wmcd:an = r(wmedian) This command creates new variables xmean, xmedian, wmean, and wmedian, based on the r () results from each iteration of central . describe Contains data simulate: central ots : 5 , 0 0 0 v a r s : 4 1 Aug 2 0 05 17:50 size: 103,000 (99. 61--. of memo ry free) storage display va 1 ue variable name type f o rma t label variable label xmean float '.9 . Og r (xmean) xniedi a n float 19 .0g r ( xmedi an) wne an float ^9 .0g r (wmean) wmedian float '9 . Og r (wmedian) Sorter] by : summarize Variable | Obs Mean Std. Dev. Min Max xmean 1 5 0 0 0 - . OC U5915 .098^788 -.4112561 3 6 99 4 6 7 xmedian | 5 00 0 - . oc ) 1 5566 . 1246915 - . 4 647 84 8 4 74064 2 wmean 1 5 0 0 0 -.0001433 .24 70 823 -1.11406 8774 976 wmedian | 5 00 0 . OC 530762 . 1 30375 6 - . 4 584 52 1 5152998 The means of these means and medians, across 5,000 samples, are all close to 0 — consistent with our expectation that the sample mean and median should both provide unbiased estimates of the true population means (0) for* and w. Also as theory predicts, the mean exhibits less sample-to-sample variation than the median when applied to a normally distributed variable. The standard deviation of xmedian is .125, noticeably larger than the standard deviation of xmean (.099). When applied to the outlier-prone variable w, on the other hand, the opposite holds true: the standard deviation of wmedian is much lower than the standard deviation of wmean (.130 vs. .247). This Monte Carlo experiment demonstrates that the median remains a relatively stable measure of center despite wild outliers in the contaminated distribution, whereas the mean breaks down and varies much more from sample to sample. Figure 14.4 draws the comparison graphically, with box plots (and, incidentally, demonstrates how to control the shapes of box plot outlier-marker symbols). Introduction to Programming 389 graph box xmean xmedian wmean wmedian, yline(O) legend(col(4)) marker (1, msymbol( + )) marker(2, msymbol(Th)) marker(3, msymbol(Oh)) marker(4, msymbol(Sh)) Figure 14.4 s Our final example extends the inquiry to robust methods, bringing together several themes fromthisbook. Program regsim generates 100 observations ofx (standard normal) and two y variables, yl is a linear function ofx plus standard normal errors. y2 is also a linear function of .t, but adding contaminated normal errors. These variables permit us to explore how various regression methods behave in the presence of normal and nonnormal errors. Four methods are employed: ordinary least squares (regress), robust regression (rreg), quantile regression ( qreg ), and quantile regression with bootstrapped standard errors ( bsqreg , with 500 repetitions). Differences among these methods were discussed in Chapter 9. Program regsim applies each method to the regression ofyl onx and then to the regression of v2 on x. For this exercise, the program is defined by an ado-file, regsim.ado, saved in the C:\ado\personal directory. 
Figure 14.4 draws the comparison graphically, with box plots (and, incidentally, demonstrates how to control the shapes of box plot outlier-marker symbols).

. graph box xmean xmedian wmean wmedian, yline(0) legend(col(4))
     marker(1, msymbol(+)) marker(2, msymbol(Th)) marker(3, msymbol(Oh))
     marker(4, msymbol(Sh))

[Figure 14.4: box plots of xmean, xmedian, wmean, and wmedian, with a horizontal line at 0]

Our final example extends the inquiry to robust methods, bringing together several themes from this book. Program regsim generates 100 observations of x (standard normal) and two y variables. y1 is a linear function of x plus standard normal errors. y2 is also a linear function of x, but adding contaminated normal errors. These variables permit us to explore how various regression methods behave in the presence of normal and nonnormal errors. Four methods are employed: ordinary least squares (regress), robust regression (rreg), quantile regression (qreg), and quantile regression with bootstrapped standard errors (bsqreg, with 500 repetitions). Differences among these methods were discussed in Chapter 9. Program regsim applies each method to the regression of y1 on x and then to the regression of y2 on x. For this exercise, the program is defined by an ado-file, regsim.ado, saved in the C:\ado\personal directory.

program regsim, rclass
* Performs one iteration of a Monte Carlo simulation comparing
* OLS regression (regress) with robust (rreg) and quantile
* (qreg and bsqreg) regression.  Generates one n = 100 sample
* with x ~ N(0,1) and y variables defined by the models:
*    MODEL 1:  y1 = 2x + e1    e1 ~ N(0,1)
*    MODEL 2:  y2 = 2x + e2    e2 ~ N(0,1)  with p = .95
*                              e2 ~ N(0,10) with p = .05
* Bootstrap standard errors for qreg involve 500 repetitions.
     version 8.0
* global S1 holds the result names (handy when calling simulate)
     #delimit ;
     global S1 "b1 b1r se1r b1q se1q se1qb
          b2 b2r se2r b2q se2q se2qb" ;
     #delimit cr
     drop _all
     set obs 100
     generate x = invnorm(uniform())
     generate e = invnorm(uniform())
     generate y1 = 2*x + e
     reg y1 x
     return scalar B1 = _b[x]
     rreg y1 x, iterate(25)
     return scalar B1R = _b[x]
     return scalar SE1R = _se[x]
     qreg y1 x
     return scalar B1Q = _b[x]
     return scalar SE1Q = _se[x]
     bsqreg y1 x, reps(500)
     return scalar SE1QB = _se[x]
     replace e = 10*e if uniform() < .05
     generate y2 = 2*x + e
     reg y2 x
     return scalar B2 = _b[x]
     rreg y2 x, iterate(25)
     return scalar B2R = _b[x]
     return scalar SE2R = _se[x]
     qreg y2 x
     return scalar B2Q = _b[x]
     return scalar SE2Q = _se[x]
     bsqreg y2 x, reps(500)
     return scalar SE2QB = _se[x]
end

The r-class program stores coefficient or standard error estimates from eight regression analyses. These results have names such as

     r(B1)      coefficient from OLS regression of y1 on x
     r(B1R)     coefficient from robust regression of y1 on x
     r(SE1R)    standard error of the robust coefficient from model 1

and so forth. All the robust and quantile regressions involve multiple iterations: typically 5 to 10 iterations for rreg, about 5 for qreg, and several thousand for bsqreg, with its 500 bootstrap re-estimations of about 5 iterations each per sample. Thus, a single execution of regsim demands more than two thousand regressions. The following command calls for five repetitions.

. simulate "regsim" b1 = r(B1) b1r = r(B1R) se1r = r(SE1R) b1q = r(B1Q)
     se1q = r(SE1Q) se1qb = r(SE1QB) b2 = r(B2) b2r = r(B2R) se2r = r(SE2R)
     b2q = r(B2Q) se2q = r(SE2Q) se2qb = r(SE2QB), reps(5)
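One practical aside of ours, not from the text: because every call to uniform() advances Stata's random-number generator, a run like this is not reproducible unless the seed is set first (the seed value below is arbitrary):

     . set seed 20050801

Setting the same seed before simulate regenerates exactly the same samples and results.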
You might want to run a small simulation like this as a trial to get a sense of the time required on your computer. For research purposes, however, we would need a much larger experiment. Dataset regsim.dta contains results from an overnight experiment involving 5,000 repetitions of regsim, more than 10 million regressions in all. The regression coefficients and standard error estimates produced by this experiment are summarized below.

. describe

Contains data from c:\data\regsim.dta
  obs:         5,000                          Monte Carlo estimates of b in
                                              5000 samples of n=100
 vars:            12                          2 Aug 2005 08:17
 size:       240,000  (99.0% of memory free)

              storage  display     value
variable name   type   format      label      variable label
b1              float  %9.0g                  OLS b (normal errors)
b1r             float  %9.0g                  Robust b (normal errors)
se1r            float  %9.0g                  Robust SE[b] (normal errors)
b1q             float  %9.0g                  Quantile b (normal errors)
se1q            float  %9.0g                  Quantile SE[b] (normal errors)
se1qb           float  %9.0g                  Quantile bootstrap SE[b] (normal errors)
b2              float  %9.0g                  OLS b (contaminated errors)
b2r             float  %9.0g                  Robust b (contaminated errors)
se2r            float  %9.0g                  Robust SE[b] (contaminated errors)
b2q             float  %9.0g                  Quantile b (contaminated errors)
se2q            float  %9.0g                  Quantile SE[b] (contaminated errors)
se2qb           float  %9.0g                  Quantile bootstrap SE[b] (contaminated errors)

Sorted by:

. summarize

    Variable |     Obs        Mean    Std. Dev.        Min        Max
-------------+----------------------------------------------------------
          b1 |    5000    2.000828    .1020181    1.631245   2.404814
         b1r |    5000    2.000989    .1052277    1.603106   2.391946
        se1r |    5000    .1041399    .0109429    .0693786   .1515421
         b1q |    5000    2.001135    .1309186    1.471802   2.536621
        se1q |    5000    .1262578    .0281738    .0532731   .2371508
       se1qb |    5000    .1362755     .032673    .0510808     .29979
          b2 |    5000    2.006001    .2464688    .9001114   3.050552
         b2r |    5000    2.000399    .1092553    1.633241   2.411423
        se2r |    5000    .1081348    .0119274    .0743103   .1560973
         b2q |    5000    2.000701     .137111    1.471802   2.536621
        se2q |    5000    .1328431    .0299644    .0542015   .2594844
       se2qb |    5000    .1436366    .0346679    .0589409   .2006417

Figure 14.5 draws the distributions of coefficients as box plots. To make the plot more readable, we use the legend(symxsize(2) colgap(4)) options, which set the width of symbols and the gaps between columns within the legend at less than their default sizes. help legend_option and help relativesize supply further information about these options.

. graph box b1 b1r b1q b2 b2r b2q, ytitle("Estimates of slope (b=2)") yline(2)
     legend(row(1) symxsize(2) colgap(4) label(1 "OLS 1") label(2 "robust 1")
     label(3 "quantile 1") label(4 "OLS 2") label(5 "robust 2")
     label(6 "quantile 2"))

[Figure 14.5: box plots of the six sets of slope estimates, with a horizontal line at the true value b = 2]

All three regression methods (OLS, robust, and quantile) produced mean coefficient estimates for both models that are not significantly different from the true value, β = 2. This can be confirmed through t tests such as

. ttest b2r = 2

One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
     b2r |    5000    2.000399    .0015451    .1092553      1.99737   2.003428
------------------------------------------------------------------------------
Degrees of freedom:  4999

                            Ho:  mean(b2r) = 2

     Ha: mean < 2              Ha: mean != 2             Ha: mean > 2
       t =   0.2585              t =   0.2585              t =   0.2585
   P < t =   0.6020          P > |t| =   0.7960          P > t =   0.3980
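The same test can be applied to all six coefficient variables at once with a loop. This sketch is ours, not from the text; it relies on the r(t) and r(p) results that ttest stores:

     foreach v of varlist b1 b1r b1q b2 b2r b2q {
          quietly ttest `v' = 2
          display "`v':  t = " %7.4f r(t) "   P>|t| = " %6.4f r(p)
     }

None of the six p-values falls below .05, consistent with the conclusion that all three methods yield unbiased estimates.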
OLS was the best (most efficient) estimator with normal errors, but with contaminated errors it becomes the worst: quietly summarize b2 . global Varb2 = r(Var) . quietly summarize b2r . display 100*($Varb2/r(Var)) 51 7 . 200 57 . quietly summarize b2q . display 100*($Varb2/r(Var)) 3 2 8.3971 Outliers in the contaminated-errors mode/ cause OLS coefficient estimates to vary wildly from sample to sample, as can be seen in the fourth box plot of Figure 14.5. The variance of these OLS coefficients is more than five times greater than the variance of the corresponding robust coefficients, and more than three times greater than that of quantile coefficients. Put another way, both robust and quantile regression prove to be much more stable than OLS in the presence of outliers, yielding correspondingly lower standard errors and narrower confidence intervals. Robust regression outperforms quantile regression with both the normal-errors and the contaminated-errors models. 394 Statistics with Stata Figure 14.6 illustrates the comparison between OLS and robust regression with a scatterplot showing 5,000 pairs of regression coefficients. The OLS coefficients (vertical axis) vary much more widely around the true value, 2.0, than rreg coefficients (horizontal axis) do. . graph twoway scatter i>2 b2r, msymbol(p) ylabel(1(.5)3, grid) yline(2) xlabel (1 ( .5)3, grid) xline(2) Figure 14.6 O CN 1.5 2 . _ 2.5 Robust b (contaminated errors) The experiment also provides information about the estimated standard errors under each method and model. Mean estimated standard errors differ from the observed standard deviations of coefficients. Discrepancies for the robust standard errors are small — less than l%. For the theoretically-derived quantile standard errors the discrepancies appear abit larger, between 3 and 4%. The least satisfactory estimates appear to be the bootstrapped quantile standard errors obtained by bsqreg . Means of the bootstrap standard errors exceed the observed standard deviation of blq and blq by 4 to 5%. Bootstrapping apparently overestimated the sample-to-sample variation. Monte Carlo simulation has become a key method in modern statistical research, and it plays a growing role in statistical teaching as well. These examples demonstrate how readily Stata supports Monte Carlo work. References Barron's Educational Series. 1992. Barron's Compact Guide to Colleges, 8th ed. New York: Barron's Educational Series. Beatty, J. Kelly. Brian O'Leary and Andrew Chaikin (eds.). 1981. The New Solar System. Cambridge, MA: Sky. Belsley, D. A., E. Kuh and R. E. Welsch. 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: John Wiley & Sons. Box, G. E. P., G. M. Jenkins and G. C. Reinsel. 1994. Time Series Analysis: Forecasting and Control. 3rded. Englewood Cliffs, NJ: Prentice-Hall. Brown, Lester R., William U. Chandler, Christopher Flavin, Cynthia Pollock, Sandra PosteL Linda Starke and Edward C. Wolf. 1986. State of the World 19S6. New York: W.W.Norton. Buch, E. 2000. Oceanographic Investigations off West Greenland 1999. Copenhagen: Danish Meteorological Institute. CDC (Centers for Disease Control). 2003. Website: http://www.cdc.gov Chambers, John M., William S. Cleveland, Beat Kleiner and Paul A. Tukey (eds.). 1983. Graphical Methods for Data Analysis. Belmont, CA: Wadsworth. Chatfield, C. 1996. The Analysis of Time Series: An Introduction, 5th edition. London: Chapman & Hall. Chatterjee, S., A. S. Hadi and B. Price. 2000. 
Regression Analysis by Example, 3rd edition. New York: John Wiley & Sons.
Cleveland, William S. 1994. The Elements of Graphing Data. Monterey, CA: Wadsworth.
Cleves, Mario, William Gould and Roberto Gutierrez. 2004. An Introduction to Survival Analysis Using Stata, revised edition. College Station, TX: Stata Press.
Cook, R. Dennis and Sanford Weisberg. 1982. Residuals and Influence in Regression. New York: Chapman & Hall.
Cook, R. Dennis and Sanford Weisberg. 1994. An Introduction to Regression Graphics. New York: John Wiley & Sons.
Council on Environmental Quality. 1988. Environmental Quality 1987-1988. Washington, DC: Council on Environmental Quality.
Cox, C. Barry and Peter D. Moore. 1993. Biogeography: An Ecological and Evolutionary Approach. London: Blackwell Publishers.
Cryer, Jonathan B. and Robert B. Miller. 1994. Statistics for Business: Data Analysis and Modeling, 2nd edition. Belmont, CA: Duxbury Press.
Davis, Duane. 2000. Business Research for Decision Making, 5th edition. Belmont, CA: Duxbury Press.
DFO (Canadian Department of Fisheries and Oceans). 2003. Website: http://www.meds-sdmm.dfo-mpo.gc.ca/alphapro/zmp/climate/IceCoverage_e.shtml
Diggle, P. J. 1990. Time Series: A Biostatistical Introduction. Oxford: Oxford University Press.
Efron, Bradley and R. Tibshirani. 1986. "Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy." Statistical Science 1(1):54-77.
Enders, W. 2003. Applied Econometric Time Series, 2nd edition. New York: John Wiley & Sons.
Everitt, Brian S., Sabine Landau and Morven Leese. 2001. Cluster Analysis, 4th edition. London: Arnold.
Federal, Provincial and Territorial Advisory Commission on Population Health. 1996. Report on the Health of Canadians. Ottawa: Health Canada Communications.
Fox, John. 1991. Regression Diagnostics. Newbury Park, CA: Sage Publications.
Fox, John and J. Scott Long. 1990. Modern Methods of Data Analysis. Beverly Hills, CA: Sage Publications.
Frigge, Michael, David C. Hoaglin and Boris Iglewicz. 1989. "Some implementations of the boxplot." The American Statistician 43(1):50-54.
Gould, William, Jeffrey Pitblado and William Sribney. 2003. Maximum Likelihood Estimation with Stata, 2nd edition. College Station, TX: Stata Press.
Hamilton, Dave C. 2003. "The Effects of Alcohol on Perceived Attractiveness." Senior thesis. Claremont, CA: Claremont McKenna College.
Hamilton, James D. 1994. Time Series Analysis. Princeton, NJ: Princeton University Press.
Hamilton, Lawrence C. 1985a. "Concern about toxic wastes: Three demographic predictors." Sociological Perspectives 28(4):463-486.
Hamilton, Lawrence C. 1985b. "Who cares about water pollution? Opinions in a small-town crisis." Sociological Inquiry 55(2):170-181.
Hamilton, Lawrence C. 1992a. Regression with Graphics: A Second Course in Applied Statistics. Pacific Grove, CA: Brooks/Cole.
Hamilton, Lawrence C. 1992b. "Quartiles, outliers and normality: Some Monte Carlo results." Pp. 92-95 in Joseph Hilbe (ed.), Stata Technical Bulletin Reprints, Volume 1. College Station, TX: Stata Press.
Hamilton, Lawrence C. 1996. Data Analysis for Social Scientists. Belmont, CA: Duxbury Press.
Hamilton, Lawrence C., Benjamin C. Brown and Rasmus Ole Rasmussen. 2003. "Local dimensions of climatic change: West Greenland's cod-to-shrimp transition." Arctic 56(3):271-282.
Hamilton, Lawrence C., Richard L. Haedrich and Cynthia M. Duncan. 2003. "Above and below the water: Social/ecological transformation in northwest Newfoundland." Population and Environment 25(2):101-121.
Hamilton, Lawrence C. and Carole L. Seyfrit. 1993. "Town-village contrasts in Alaskan youth aspirations." Arctic 46(3):255-263.
Hardin, James and Joseph Hilbe. 2001. Generalized Linear Models and Extensions. College Station, TX: Stata Press.
Hoaglin, David C., Frederick Mosteller and John W. Tukey (eds.). 1983. Understanding Robust and Exploratory Data Analysis. New York: John Wiley & Sons.
Hoaglin, David C., Frederick Mosteller and John W. Tukey (eds.). 1985. Exploring Data Tables, Trends and Shape. New York: John Wiley & Sons.
Hosmer, David W., Jr. and Stanley Lemeshow. 1999. Applied Survival Analysis. New York: John Wiley & Sons.
Hosmer, David W., Jr. and Stanley Lemeshow. 2000. Applied Logistic Regression, 2nd edition. New York: John Wiley & Sons.
Howell, David C. 1999. Fundamental Statistics for the Behavioral Sciences, 4th edition. Belmont, CA: Duxbury Press.
Howell, David C. 2002. Statistical Methods for Psychology, 5th edition. Belmont, CA: Duxbury Press.
Iman, Ronald L. 1994. A Data-Based Approach to Statistics. Belmont, CA: Duxbury Press.
Jentoft, Svein and Trond Kristoffersen. 1989. "Fishermen's co-management: The case of the Lofoten fishery." Human Organization 48(4):355-365.
Johnson, Anne M., Jane Wadsworth, Kaye Wellings, Sally Bradshaw and Julia Field. 1992. "Sexual lifestyles and HIV risk." Nature 360(3 December):410-412.
Johnston, Jack and John DiNardo. 1997. Econometric Methods, 4th edition. New York: McGraw-Hill.
Keller, Gerald, Brian Warrack and Henry Bartel. 2003. Statistics for Management and Economics, abbreviated 6th edition. Belmont, CA: Duxbury Press.
League of Conservation Voters. 1990. The 1990 National Environmental Scorecard. Washington, DC: League of Conservation Voters.
Lee, Elisa T. 1992. Statistical Methods for Survival Data Analysis, 2nd edition. New York: John Wiley & Sons.
Li, Guoying. 1985. "Robust regression." Pp. 281-343 in D. C. Hoaglin, F. Mosteller and J. W. Tukey (eds.), Exploring Data Tables, Trends and Shape. New York: John Wiley & Sons.
Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage Publications.
Long, J. Scott and Jeremy Freese. 2003. Regression Models for Categorical Outcomes Using Stata, revised edition. College Station, TX: Stata Press.
MacKenzie, Donald. 1990. Inventing Accuracy: A Historical Sociology of Nuclear Missile Guidance. Cambridge, MA: MIT.
Mallows, C. L. 1986. "Augmented partial residuals." Technometrics 28:313-319.
Digest of Education Statistics 1992. Washington, DC: U.S. Government Printing Office. National Center for Education Statistics. 1993. Digest of Education Statistics 1993. Washington, DC: U.S. Government Printing Office. Newton, H. Joseph and Jane L. Harvill. 1997. StatConcepts: A Visual Tour of Statistical Ideas. Pacific Grove, CA: Duxbury Press. Pagano, Marcello and Kim Gauvreau. 2000. Principles of Biostatistics, 2nd edition. Belmont, CA: Duxbury Press. Rabe-Hesketh, Sophia and Brian Everitt. 2000. A Handbook of Statistical Analysis Using Stata, 2nd edition. Boca Raton, FL: Chapman & Hall. Report of the Presidential Commission on the Space Shuttle Challenger Accident. 1986. Washington, DC. Rosner, Bernard. 1995. Fundamentals of Biostatistics, 4th edition. Belmont, CA: Duxbury Press. Selvin, Steve. 1995. PracticalBiostatistical Methods. Belmont, CA: Duxbury Press. Selvin, Steve. 1996. Statistical Analysis of Epidemiologic Data, 2nd edition. New York: Oxford University. Seyfrit, Carole L.. 1993. Hibernia's Generation: Social Impacts of Oil Development on Adolescents in Newfoundland. St. John's: Institute of Social and Economic Research, Memorial University of Newfoundland. Shumway, R. H. 1988. Applied Statistical Time Series Analysis. Upper Saddle River, NJ: Prentice-Hall. Stata Corporation. 2005. Getting Started with Stata for Macintosh. College Station, TX: Stata Press. Stata Corporation. 2005. Getting Started with Stata for Unix. College Station, TX: Stata Press. Stata Corporation. 2005. Getting Started with Stata for Windows. College Station, TX: Stata Press. Stata Corporation. 2005. Mata Reference Manual. College Station, TX: Stata Press. Stata Corporation. 2005. Stata Base Reference Manual (3 volumes). College Station, TX: Stata Press. Stata Corporation. 2005. Stata Data Management Reference Manual. College Station, TX: Stata Press. Stata Corporation. 2005. Stata Graphics Reference Manual. College Station, TX: Stata Press. Stata Corporation. 2005. Stata Programming Reference Manual. College Station, TX: Stata Press. Stata Corporation. 2005. Stata LongitudinalPanel Data Reference Manual. College Station, TX: Stata Press. Stata Corporation. 2005. Stata Multivariate Statistics Reference Manual. College Station, TX: Stata Press. Stata Corporation. 2005. Stata Quick Reference and Index. College Station, TX: Stata Press. Stata Corporation. 2005. Stata Survey Data Reference Manual. College Station, TX: Stata Press. Stata Corporation. 2005. Stata Survival Analysis and Epidemiological Tables Reference Manual. College Station, TX: Stata Press. Stata Corporation. 2005. Stata Time-Series Reference Manual. College Station, TX: Stata Press. Stata Corporation. 2005. Stata User's Guide. College Station, TX: Stata Press. Stine, Robert and John Fox (eds.). 1997. Statistical Computing Environments for Social Research. Thousand Oaks, CA: Sage Publications. Topliss, Brenda J. 2001. "Climate variability 1: A conceptual approach to ocean-atmosphere feedback." In Abstracts for AGU Chapman Conference, The North Atlantic Oscillation, Nov. 28 - Dec. 1, 2000, Ourense, Spain. Tufte, Edward R. 1997. Visual Explanations: Images and Quantities, Evidence and Narratives. Cheshire, CT: Graphics Press. Tukey, John W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley. Velleman, Paul F. 1982. "Applied Nonlinear Smoothing," pp. 141-177 in Samuel Leinhardt (ed.) Sociological Methodology 1982. San Francisco: Jossey-Bass. Velleman, Paul F. and David C. Hoaglin. 1981. 
Applications, Basics and Computing of Exploratory Data Analysis. Boston: Wadsworth. Ward, Sally and Susan Ault. 1990. "AIDS knowledge, fear, and safe sex practices on campus." Sociology and Social Research 74(3): 158-1 61. Werner, Al. 1990. "Lichen growth rates for the northwest coast of Spitsbergen, Svalbard." Arctic and Alpine Research 22(2): 129-140. World Bank. 1987. World Development Report 1987. New York: Oxford University. World Resources Institute. 1993. The 1993 Information Please Environmental Almanac. Boston: Houghton Mifflin. Index A ac (autocorrelations), 339, 351-352 acprplot (augmented component-plus-residual plot), 197, 202-203 added-variable plot, 198, 201-202 ado-file (automatic do), 233-235, 362, 373-375 alpha (Cronbacrfs alpha reliability), 318-319 analysis of covariance (ANCOVA), 141-142, 153-154 analysis of variance (ANOVA) factorial, 142, 152-153, 156 interaction effects, 142, 152-154, 156-157 median, 253-255 JV-way, 152-153 one-way, 142, 155 predicted values, 155-158, 167 regression model, 153-154, 249-256 repeated-measures, 142 robust, 249-256 standard errors, 155-157, 167 three-way, 142 two-way, 142, 152-153, 156-157 anova, 142, 152-158, 167, 239 append, 13, 42^14 ARCH model (autoregressive conditional heteroskedasticity). 339 area plot, 86-87 args (arguments in program), 366-368 areg (absorb variables in regression), 179-180 ARIMA model (autoregressive integrated moving average), 339, 354-360 arithmetic operator, 26 artificial data, 14, 57-61,241, 387-394 ASCII (text) file read data, 13-14, 39-42 write data, 42 write results (log tile), 2-3, 6-7 autocode (create ordinal variables), 31, 37-38 autocorrelation, 339, 350-352, 357-358, 369-373 aweight (analytical weights), 54 axis label in graph, 66 angle, 81-82 format, 13, 24-25, 76, 305-306 grid, 113-115 suppress, 118, 129, 173 axis scale in graph, 66, 112-118 _b coefficients (regression), 230, 269, 273-274, 285, 356 band regression, 217-219 bar chart, 94-99, 147, 150-151 Bartlett's test for equal variances, 149-150 batch-mode program, 61 bcskewO (transform to reduce skew), 129 beta weight (standardized regression coefficient), 160, 164-165 Bonferroni multiple-comparison test correlation matrix, 172-173 one-way ANOVA, 150-151 bootstrap, 246, 315-316, 382-387, 389-394 box plot, 66, 90-91, 118-119, 147, 150-151,389, 392 401 402 Statistics with Stata Index 403 Box-Cox regression, 215, 226-227 transformation, 129 Box-Pierce Q test (white noise), 341,351, 354,357-358 browse (Data Browser), 13 bs (bootstrap), 382-387 bsqreg (quantile regression with bootstrap), 240,246,389-394 by prefix, 121, 133-134 C c chart (quality control), 105 caption in graph, (09-110 case identification number, 38-39 categorical variable, 35-39, 183-185 censored-normal regression, 264 centering to reduce multicollinearity. 212-214 chi-squared deviance (logistic regression), 271, 275-278 equal variances in ANOVA, 149-150 independence in cross-tabulation, 55, 130-133,281 likelihood-ratio in cross-tabulation, 130-131,281 likelihood-ratio in logistic regression, 267-268,270, 272-273,281 probability plot, 105 quantile plot, 105 ci (confidence interval), 124, 255 cii (immediate confidence interval), 124 classification tabic (logistic regression), 264, 270-272 clear (remove data from memory), 14-15, 23, 362 cluster analysis, 318-320, 329-338 coefficient of variation, 123-124 collapse, 52—53 color bar chart, 95-96 pie chart, 92 scatterplot symbols, 74 shaded regions, 86 combine data files, 14, 42^17 combine graphs. 
See graph combine comments in programs, 364, 369-370, 373-374 communality (factor analysis), 326 component-plus-residual plot, 197-198, 202-203 compress, 13, 40, 60-61 conditional effect plot, 230-232,273-274, 284-287 confidence interval binomial, 124 bootstrap, 383-384,386 mean, 124 regression coefficients, 163 regression line, 66, 85, 110-112, 160 robust mean, 255 Poisson, 124 constraint (linear constraints), 262 Cook and Weisberg heteroskedasticity test, 197 Cook's A 158, 167, 197,206-210 copy results, 4 correlation hypothesis test, 160, 172-173 Kendall's tau, 131, 174-175 matrix, 18,59, 160, 171-174 Pearson product-moment, 1, 18, 160, 171-173 regression coefficient estimates, 214 Spearman, 174 corrgram (autocorrelation), 339, 351, 357-358,373 count-time data, 293-295 covariance regression coefficient estimates, 167, 173, 197,214 variables, 160, 173 COVRATIO, 167, 197,206 Cox proportional hazard model, 290, 299-305 Cramer's F, 131 Cronbach's alpha, 318-319 cross-correlation, 353-354 cross-tabulation, 121, 130-136 ctset (define count-time data), 289, 293-294 cttost (convert count-time to survival-time data), 289, 294-295 cubic spline curve. See graph twoway mspline cv (coefficient of variation), 123-124 D Data Browser, 13 data dictionary, 41 Data Editor, 13, 15-16 data management, 12-63 database file, 41-42 date, 30, 266, 340-342 decode (numeric to string), 33-34 #delimit (end-of-line delimiter), 61, 116, 362 dendrogram, 319, 329, 331-337 describe (describe data), 3, 18 destring (string to numeric), 35 DFBETA, 158, 167, 197, 205-206, 208-210 DFITS, 167, 197,206,208-210 diagnostic statistics ANOVA, 158, 167 logistic regression, 271, 274-278 regression, 167, 196-214 Dickey-Fuller test, 340, 355-356 difference (time series), 349-350 display (show value onscreen), 31-32, 39, 211,269 display format, 13, 24-25, 359 do-file, 60-61, 115-116, 361-362, 367-373 Do-File Editor, 60, 361 dot plot, 67, 95,99-100, 150-151 drawnorm (normal variable), 13, 59 drop variable in memory, 22 data in memory, 14-15, 23, 40, 56 program in memory, 363, 373-375 dummy variable, 35-36, 176-185, 267 Durbin-Watson test, 158, 197, 350 dwstat (Durbin-Watson test), 197, 350 E e-class, 381,386 edit (Data Editor), 13, 15-16 effect coding for ANOVA, 250-251 efficiency of estimator, 393 egen, 33, 331, 340. 343 eigenvalue, 318-319, 321, 326 empirical orthogonal function (EOF), 325 Encapsulated Postscript (.eps) graph, 6, 116 encode (string to numeric), 13, 33-34 epidemiological tables, 288 error-bar plot, 143, 155-157 estimates store (hypothesis testing), 272-273, 278-279, 282-283 event-count model, 288, 290, 310-313 Exploratory Data Analysis (EDA), 124-126 exponential filter (time series), 343 exponential growth model, 216, 232-235 exponential regression (survival analysis), 305-307 F factor analysis, 318-328 factor rotation, 318-319,322-325 factor score, 318-319, 323-325 factorial ANOVA, 142, 152-153, 156 FAQs (frequently asked questions), 8 filter. 
343 Fisher's exact test in cross-tabulation, 131 fixed and random effects, 162 foreach, 365 format axis label in graph, 76, 305-306 input data, 40^11 numerical display, 13, 24-25, 359 forvalues, 365 frequency table, 130-133, 138-139 frequency weights, 54-55, 66, 73-74, 120, 123,138-140 function date, 30 mathematical, 27-28 probability, 28-30 special, 31 string, 31 fweight (frequency weights), 54-55, 73-74, 138-140 404 Statistics with Stata Index 405 generalized linear modeling (GLM), 264, 291, 313-317 generate, 13,23-26,37,39 gladder, 128 Gompertz growth model, 234-238 Goodman and Kruskal's gamma, 131 graph bar, 66-67, 94-99, 147 graph box, 66, 90-91, 118-119, 147,389, 392 graph combine, 117-119, 147, 150-151, 222, 231-232 graph dot, 67, 95, 99-100, 150-151 graph export, 116 graph hbar, 97-98 graph hbox, 91, 150-151 graph matrix, 66, 77, 173-174 graph pie, 66, 92-94 graph twoway all types, 84-85 overlays, 66, 85, 110-115, 344-345, 347-348 graph twoway area, 84, 85-86 graph twoway bar, 84 graph twoway connected, 5-6,50-51,66, 79-80, 83-84, 114-1 15,157, 192-193 graph twoway dot, 85 graph twoway lfit, 66, 74, 85, 110, 168, 181 graph twoway lfitci, 85, 110-112, 170-171 graph twoway line, 66, 77-82, 112-115, 117,221-222,242,244,247,344-345, 371 graph twoway lowess, 85, 88-89, 216, 219-221 graph twoway mband, 85, 216, 217-219 graph twoway mspline, 85, 182, 190, 218-219, 226, 287-288 graph twoway qfit, 85, 110, 190 graph twoway rarea, 84, 170 graph twoway rbar, 85 graph twoway reap, 85, 89, 157 graph twoway scatter, 65-66, 72-77, 181-182,277, 394 graph twoway spike, 84, 87-88, 347 graph use, 116 graph7, 65 gray scale, 86 greigen (graph eigenvalues), 318-319, 321-322 gsort (general sorting), 14 H hat matrix, 167,205-206,210 hazard function, 290, 302, 307, 309 help, 7 help file, 7, 375-377 heteroskedasticity, 161, 197, 199-200, 223-224,239,256-258, 290,315, 339 hettest (heteroskedasticity test), 197, 199-200 hierarchical linear models, 162 histogram, 65, 67-71, 385 Holt-Winters smoothing, 343 Huber/White robust standard errors, 160, 256-261 I if qualifier, 13, 14, 19-23,204-205,209 if„.else, 366 import data, 39^2 in qualifier, 14, 19-23, 166 incidence rate, 289-290, 293, 297, 309-310,312 inequality, 21 infile (read ASCII data), 13-14, 40^2 infix (read fixed-format data), 41^2 influence logistic regression, 271, 274-278 regression (OLS), 167, 196-198, 201, 204-208 robust regression, 248 insert graph into document, 6 table into document, 4 insheet (read spreadsheet data), 41^12 instrumental variables (2SLS), 161 interaction effect AN OVA, 142, 152-157, 250-253 regression, 160, 180-185, 211-212, 259-261 interquartile range (IQR), 53, 91, 95, 103, 123-124, 126, 135 iteratively reweighted least squares (1RLS), 242 iweight (importance weights), 54 J jackknife residuals, 167 standard errors, 314-317 Kaplan-Meier survivor function, 289-290, 295-298 keep (keep variable or observation), 23, 173 Kendall's tau, 131, 174-175 kernel density, 65, 70, 85 Kruskal-Wallistest, 142, 151-152 kurtosis, 122-124, 126-127 L-estimator, 243 label data, 18 label define, 26 label values, 25-26 label variable, 16, 18 ladder of powers, 127-129 lag (time series), 349-350 lead (time series), 349-350 legend in graph, 78,81,112,114-1 15, 157. 221,344 letter-value display, 125-126 leverage, 158, 159, 167, 196, 198, 201-206,210, 229, 246-248 leverage-vs.-squared-residuals plot, 198, 203-204 lfit (fit of logistic model), 264 likelihood-ratio chi-squared. 
See chi- squared line in graph pattern, 81-82, 84, 115 width, 221,344,371 line plot, 77-84 link function (GLM), 291, 313-317 list, 3^4, 14, 17, 19,49,54,265 log, 2-3 log file, 2-3, 6-7 logarithm, 27, 127-129, 223-229 logical operator, 20 logistic growth model, 216, 233-234 logistic regression, 262-287 logistic (logistic regression), 185,262-264, 269- 278 logit (logistic regression), 267-269 looping, 365-366 lowess smoothing, 88-89, 216,219-222 lroc (logistic ROC), 264 lrtest (likelihood-ratio test), 272-273, 278-279, 282-283 Isens (logistic sensitivity graph), 264 lstat (logistic classification table), 264, 270- 272 lvr2plot (leverage-vs.-squared-residuals plot), 198,203-204 M M-estimator, 243 macro, 235, 334, 363, 365, 367, 370, 387 Mann-Whitney C/test, 142, 148-149, 152 margin in graph, 110, 113, 117-118, 192-193 marker label in graph, 66, 75-76, 202, 204 marker symbol in graph, 66, 73-75, 84, 100, 183, 277 marksample, 368-369 matched-pairs test, 143, 145-146 matrix algebra, 378-382 mean, 122-124, 126,135-137, 139-140, 143-158, 387-389 median, 90-91, 122-124, 126, 135-137, 387-389 median regression. See quantile regression memory, 14, 61-63 merge, 14, 44-50 missing value, 13-16,21,37-38 406 Statistics with Stata Index 407 Monte Carlo, 126, 246, 387-394 moving average filter, 340, 343-344 time series model, 354-360 multicollinearity, 210-214 multinomial logistic regression, 264, 278, 280-287 multiple-comparison test correlation matrix, 172-173 one-way ANOVA, 150-151 N negative exponential growth model, 233 nolabel option, 32-34 nonlinear regression, 216, 232-238 nonlinear smoothing, 340-341, 343-346 normal distribution artificial data, 13, 59, 241 curve, 65 test for, 126-129 normal probability plot. See quantile-normal plot numerical variables, 16, 20, 122 O OBDC (Open Database Connectivity), 42 odds ratio. See logistic regression observation number, 38-39 omitted-variables test, 197, 199 one-sample t test, 143-146 one-way ANOVA, 149-152 open file, 2 order (order variables in data), 19 ordered logistic regression, 278-280 ordinal variable, 35-36 outfile (write ASCII data), 42 outlier, 126, 239-248, 344, 388-394 overlay two way graphs, 110-115 P p chart (quality control), 105-107 paired-difference test, 143, 145-146 panel data, 161, 191-195 partial autocorrelation, 339-340, 352 partial regression plot. 
See added-variable plot Pearson correlation, 5, 19, 160, 171-173 percentiles, 122-124, 136 periodogram, 340 Phillips-Perron test, 355 pie chart, 66, 92-94 placement (legend in graph), 114-115 poisgof (Poisson goodness of fit test), 310-311 Poisson regression, 290-291,309-313,317 polynomial regression, 188-191 Portable Network Graphics (.png) graph, 6, 116 Postscript (.ps or .eps) graph, 6, 116 Prais-Winsten regression, 340, 359-360 predict (predicted values, residuals, diagnostics) anova, 155-158, 167 arima, 357 logistic, 264,268-271,284 regress, 159, 165-167, 190, 196-197, 205-210,216, 233 principal components, 318-325 print graph, 6 print results, 4 probit regression, 262-263, 314 program, 362-363 promax rotation, 319, 322-325 pweight (probability or sampling weights), 54-56 pwcorr (pairwise Pearson correlation), 160, 172-173, 174-175 Q qladder, 128-129 quality-control graphs, 67, 105-108 quantile defined, 102 quantile plot, 102-103 quantile-normal plot, 67, 104 quantile-quantile plot, 104-105 regression, 239-256, 389-394 quartile, 91, 125-126 quietly, 175, 182, 188 R r chart (quality control), 67, 106, 108 r-class, 381,387, 390 Ramsey specification error test (RESET), 197 random data, 56-60, 241, 387-394 random number, 30, 56-59, 241 random sample, 14, 60 range (create data over range), 236 range plot, 89 range standardization, 334-335 rank, 32 rank-sum test, 142, 148-149, 152 real function, 35-36 regress (linear regression), 159-165, 239, 386, 389-394 regression absorb categorical variable, 179-180 beta weight (standardized regression coefficient), 160, 164-165 censored-normal, 264 confidence interval, 110-112, 163, 169-171 constant, 163 curvilinear, 189-191, 216, 223-232 diagnostics, 167, 196-214 dummy variable, 176-185 hypothesis test, 160, 175-176 instrumental variable, 161 line, 67, 110-112, 159-160, 168-171, 190, 242,244, 247 logistic, 262-287 multinomial logistic, 264, 278, 280-287 multiple, 164-165 no constant, 163 nonlinear, 232-238 ordered logistic, 278-280 ordinary least squares (OLS), 159-165 Poisson, 290-291, 309-313, 317 polynomial, 188-191 predicted value, 165-167, 169 probit, 262-263, 314 residual, 165-167, 169, 205-207 robust, 239-256, 389-394 robust standard errors, 256-261 stepwise, 161, 186-188 tobit, 188,263 transformed variables, 189-191, 216 223-232 two-stage least squares (2SLS), 161 weighted least squares (WLS), 161 245 relational operator, 20 relative risk ratio, 264, 281-284 rename, 16, 17 replace, 16,25-26,33 RESET (Ramsey test), 197 reshape, 49-52 residual, 159-160, 167, 200-208 residual-vs.-fitted (predicted values) plot, 160, 169, 188-191, 198,200 retrieve graph, 116 robust anova, 249-255 mean, 255 regression, 239-256 standard errors and variance, 256-261 ROC curve (receiver operating characteristic), 264 rotation (factor analysis), 318-319, 322-325 rough, 345 rreg (robust regression), 239-256, 389-394 rvfplot (residual-vs.-fitted plot), 160, 188-191, 198,200 rvpplot (residual-vs.-predictor plot S sample (draw random sample), 14, 60 sampling weights, 55-56 sandwich estimator of variance, 160, 256-261 SAS data files, 42 save (save dataset), 14, 16, 23 save graph, 6 saveold (save dataset in previous Stata format), 14 scatterplot. 
Also see graph twoway scatter axis labels, 66, 72 basic, 66-67 marker labels, 67, 74-75, 202-204 408 Statistics with Stata Index 409 marker symbols, 72-73, 1 19, 182-183 matrix, 66,77, 173-174 weighting, 66, 74-75, 207-208 with regression line, 66, 110-112, 159-160, 181-182 Scheffe multiple-comparison test, 150-151 score (factor scores), 318-319, 323-325 scree graph (eigenvalues), 318-319, 321-322 search, 8-9 seasonal difference (time series), 349-350 serrbar (standard-error bar plot), 143, 155-157 set memory, 14, 62-63 shading color, 86 intensity, 91 Shapiro-Francia test, 127 Shapiro-Wilk test, 127 shewart, 106 Sidak multiple-comparison test, 150, 172-173 sign test, 144-145 signed-rank test, 143 146 skewness, 122-124, 126-127 sktest (skewness-kurtosis test), 126-127, 383 slope dummy variable, 180 SMCL (Stata Markup and Control Language), 376-377 smoothing, 340-341, 343-346 sort, 14, 19, 21-22, 166 Spearman correlation, 174-175 spectral density, 340 spike plot, 84,87-88,347 spreadsheet data, 41-42 SPSS data files, 42 standard deviation, 122-124, 126, 135 standard error ANOVA, 155-157 bootstrap. See bootstrap mean, 124 regression prediction, 167, 169-171 robust (Huber/White), 160,256-261 standardized regression coefficient, 160, 164-165 standardized variable, 32, 331 Stat/Transfer, 42 Stata Journal, 10-11 Statalist online forum, 10 stationary time series, 340, 355-356 stcox (Cox hazard model), 290, 299-303 stcurve (survival analysis graphs), 290, 307 stdes (describe survival-time data), 289, 292-293 stem-and-leaf display, 124-125 stepwise regression, 161, 186-188 stphplot, 290 streg (survival-analysis regression). 290, 305-309 string to numeric, 32-35 string variable, 17, 40-41 sts generate (generate survivor function), 290 sts graph (graph survivor function), 289, 296, 298 sts list (list survivor function), 290 sts test (test survivor function), 290, 298 stset (define survival-time data), 289, 291-292,297 stsum (summarize survival-time data), 289, 293, 297 studentized residual. 167, 205, 207 subscript, 39-40, 343 summarize (summary statistics), 2, 17,20, 31-32, 90-91, 120-124, 383 sunflower plot, 74-75 survey sampling weights, 55-56, 161. 
263 survival analysis, 288-309 svy: regress (survey data regression), 161 svyset (survey data definition), 56 sw (stepwise model fitting), 186-188 symmetry plot, 100, 102 syntax (programming), 368-369 T nest correlation coefficient, 160, 172-173 means, 143-149 robust means, 255 unequal variance, 148 table, 121, 134-136, 152 tabstat, 120, 123-124 tabulate, 4, 15, 36-37, 56, 121, 130-133, 136 technical support, 9 test (hypothesis test for model), 160, 175-176,312 text in graph, 109-110, 113, 222 time plot, 77-84, 343-348 time series, 339-360 tin (times in), 346-347, 350, 359 title in graph, 109-110, 112-113 tobit regression, 188, 263 transfer data, 42 transform variable, 126-129,189-190,216 transpose data, 47-49 tree diagram, 319, 329, 331-337 tsset (define time series data), 340, 342, 346 tssmooth (time series smoothing), 340-341, 343-346 ttest, 143-149, 392 Tukey, John, 124 twithin (times within), 346-347 two-sample test, 146-149 two-stage least squares (2SLS), 161 u unequal variance invest, 143, 148-149 uniform (random number generator), 30, 56-58,241 unit root, 355-356 use, 2-3, 15 V variance, 122-124, 135, 214 variance inflation factor, 197, 211-212 varimax rotation, 319, 322-325 version, 364 W web site, 9 Weibull regression (survival analysis), 305, 307-399 weighted least squares (WLS), 161, 245 weights, 55-57,74-75,122-124,138-140, 161 Welsch's distance, 167, 206-210 which, 374 while, 365-366 white noise, 341, 351, 354, 357-358 Wilcoxon rank-sum test, 142, 148-149, 152 Wilcoxon signed-rank test, 143, 146 Windows metafile (.wmf or .emf) graph, 6, 116 wntest (Box-Pierce white noise Q test), 341 word processor insert Stata graph into, 6 insert Stata table into, 4 x axis in graph. See axis label in graph, axis scale in graph x-bar chart (quality control), 106-108 xcorr (cross-correlation), 353-354 xi (expanded interaction terms), 160, 183-185 xpose (transpose data), 48-49 xtmixed (multilevel mixed-effect models), 162 xtreg (panel data regression). 161, 191-195 y axis in graph. See axis label in graph, axis scale in graph z score (standardized variable), 32, 331