Descriptive Statistics and Graphs

In this chapter we look at how Stata produces descriptive, or univariate, statistics. These techniques are commonly used to explore the data before making decisions about further analysis and, when reporting analyses, to give the reader information about the nature and categories of the variables that have been used. Deciding which statistics to use to describe a variable largely depends on the level of measurement of the variable. Here we refer to nominal, ordinal and interval levels of measurement. More details are given in Box 5.1 if you need them.

Box 5.1: A refresher on levels of measurement

Level of measurement is one of the fundamental building blocks in quantitative data analysis. Almost every statistical process starts with recognizing the level of measurement of the variable(s) to be used, so it is vital that level of measurement is thoroughly understood.

Nominal

At the nominal level of measurement, numbers or other symbols are assigned to a set of categories for the purpose of naming, labelling or classifying the observations. Gender is an example of a nominal level variable. Using the numbers 1 and 2, for instance, we can classify our observations into the categories 'female' and 'male', with 1 representing female and 2 representing male. We could use any of a variety of words to represent the different categories of a nominal variable; however, when numbers are used to represent the different categories, we do not imply anything about the magnitude or quantitative difference between the categories. Other examples of nominal level variables are ethnicity, nationality, race and case/control.

Ordinal

Ordinal level variables assign numbers to rank-ordered categories ranging from lowest to highest. The classic ordinal level measure is a Likert scale that has categories of strongly disagree, disagree, neither agree nor disagree, agree, and strongly agree.
We can say that a person in the category 'strongly agree' agrees with the statement more than a person in the 'agree' category, but we do not know the magnitude of the differences between the categories; that is, we don't know how much more agreement there is when 'strongly agree' is compared to 'agree'. Another example of an ordinal level variable is age groups such as young, middle-age and old or even 16-40, 41-64, 65 and over.

Interval/ratio

For interval/ratio (usually just referred to as interval) level of measurement the categories (or values) of a variable can be rank-ordered and the differences between these categories (or values) are constant. Examples of variables measured at the interval/ratio level are age, income, height and weight. With all these variables we can compare values not only in terms of which is larger or smaller, but also in terms of how much larger or smaller one is compared with another.

In some discussions of levels of measurement you will see a distinction made between interval/ratio variables that have a natural zero point (where zero means the absence of the property) and those variables that have zero as an arbitrary point. For example, weight and length have a natural zero point whereas temperature in degrees Celsius or Fahrenheit has an arbitrary zero point. Variables with a natural zero point are also called ratio variables. In statistical practice, however, ratio variables are subjected to operations that treat them as interval and ignore their ratio properties. Therefore, in practice there is no distinction between these two types.

Discrete or continuous?

Here we enter a maze of terminology and its uses and abuses. Let us start with the least controversial aspect. Nominal and ordinal measures are also commonly referred to as categorical variables. They are always discrete. Provided that the categories of a nominal or ordinal variable are exhaustive (cover all potential categories) and mutually exclusive (cases can belong to only one category) then they are naturally discrete in that a case is assigned to a category with no other options.

Interval level measures can either be discrete or continuous. For example, number of children is interval level but discrete in that the answers can only be 0, 1, 2, 3, ... and not 0.6 or 1.34. Age is an interval level measure that is continuous in that someone can actually be one minute or even one second older than someone else. But here is the twist - even if it is a truly continuous phenomenon, we usually measure it in a discrete way. Age is usually in whole years at the last birthday: 12, 13, 14, etc. If you want to think a bit more deeply about it, think about time. Even if we recorded the duration of some event to the nearest tenth of a second - usually not the case in social and behavioural sciences - then it is still a discrete measure because an event could be 23.1657 seconds. So, let's accept that more often than not we measure continuous phenomena in a discrete way, and these discrete categories such as age last birthday have a standard unit of 1 year between them.

How many interval level variables that we commonly use have truly standard units? Age, number of children, time and income are relatively straightforward, but how about all these scales we create from questionnaire items or standard 'instruments' such as the General Health Questionnaire and many other measures of psychological well-being and quality of life? Are the scores from these scales truly interval in that the difference in psychological well-being, for example, is the same between those scoring 4 and those scoring 5 as between those scoring 13 and those scoring 14? Or is there any real difference between those scoring 4 or 5 anyway? It is very common (we do it regularly) to make the assumption that these scales have standard units so that they can be analysed as interval level measures.
In general, there is nothing wrong with making this assumption and we would be very restricted in the analysis we could do if we didn't, but it is worth revisiting the basic assumptions we often make in data analysis.

We conclude by summarizing the properties of the various levels of measurement in a table.

                Exhaustive   Mutually     Ordered   Standard   Natural
                             exclusive    (rank)    unit       zero
    Nominal     yes          yes          no        no         no
    Ordinal     yes          yes          yes       no         no
    Interval    yes          yes          yes       yes        no
    Ratio       yes          yes          yes       yes        yes

Adapted from Frankfort-Nachmias and Leon-Guerrero (2000: 13-14) and Bowling (2002: 144-7).

At the end of this chapter we examine some of the graphing abilities of Stata and cover some basic single-variable graphs.

FREQUENCY DISTRIBUTIONS

For frequency distributions, you use the tabulate command (which you can shorten to tab or ta). This command will show you the value labels associated with the variable, the frequency of each value, percentage frequency, and cumulative percentage frequency, for example:

    tab sex

    . tab sex

            sex |      Freq.     Percent        Cum.
    ------------+-----------------------------------
           male |      4,833       47.09       47.09
         female |      5,431       52.91      100.00
    ------------+-----------------------------------
          Total |     10,264      100.00

You will notice that this command gives the value labels, but not the numerical value associated with the label. In order to get that, you must add the option nolabel (nol) to the tab command:

    tab sex, nol

    . tab sex, nol

            sex |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |      4,833       47.09       47.09
              2 |      5,431       52.91      100.00
    ------------+-----------------------------------
          Total |     10,264      100.00

The option missing (or miss or m) gives you information on the missing values.

    ta hlstat, miss

    . ta hlstat, miss

    health over |
        last 12 |
         months |      Freq.     Percent        Cum.
    ------------+-----------------------------------
      excellent |      2,930       28.55       28.55
           good |      4,613       44.94       73.49
           fair |      1,853       18.05       91.54
           poor |        641        6.25       97.79
      very poor |        219        2.13       99.92
              . |          8        0.08      100.00
    ------------+-----------------------------------
          Total |     10,264      100.00

The frequency table produced by Stata has three columns of figures. The left-hand one labelled Freq. has the actual counts in each of the categories of the variable. The middle column labelled Percent gives the percentage of that category so that all categories add up to 100%. Note here that when you use the missing option the missing-values category contributes to the 100% total so that the percentage is of all cases in your data. If the missing option is not used then the percentage is of the number of cases who answered that item - sometimes less than 100% of the total sample. The hlstat variable has only eight missing cases, but it could be much higher so that the percentages of the non-missing categories would be inaccurate. See the example below with the jbsat1 variable. The right-hand column labelled Cum. gives the cumulative percentage of the categories from the top of the table. The same inclusion rule applies when using the missing option.

    . tab jbsat1, miss

                 job |
       satisfaction: |
           promotion |
           prospects |      Freq.     Percent        Cum.
    -----------------+-----------------------------------
       doesn't apply |        599        5.84        5.84
       not satisfied |        770        7.50       13.34
                   2 |        208        2.03       15.36
                   3 |        288        2.81       18.17
    not satis/dissat |      1,335       13.01       31.18
                   5 |        527        5.13       36.31
                   6 |        428        4.17       40.48
    completely satis |        985        9.60       50.08
                   . |      5,124       49.92      100.00
    -----------------+-----------------------------------
               Total |     10,264      100.00

    . tab jbsat1

                 job |
       satisfaction: |
           promotion |
           prospects |      Freq.     Percent        Cum.
    -----------------+-----------------------------------
       doesn't apply |        599       11.65       11.65
       not satisfied |        770       14.98       26.63
                   2 |        208        4.05       30.68
                   3 |        288        5.60       36.28
    not satis/dissat |      1,335       25.97       62.26
                   5 |        527       10.25       72.51
                   6 |        428        8.33       80.84
    completely satis |        985       19.16      100.00
    -----------------+-----------------------------------
               Total |      5,140      100.00

In the first table above the missing option is used. From this table you might (incorrectly) read that 9.6% are completely satisfied with their promotion prospects. However, when the missing cases are omitted from the second table above, 19.16% are completely satisfied with their promotion prospects.
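The gap between 9.6% and 19.16% is purely a matter of which denominator is used. The following is a minimal sketch, assuming the chapter's data set (with the jbsat1 variable) is in memory; the two display lines simply redo the division by hand from the counts in the tables above.

```stata
* A minimal sketch, assuming the chapter's data set is in memory
* Percentages out of all 10,264 cases, missing included
tab jbsat1, missing

* Percentages out of the 5,140 cases with a valid answer
tab jbsat1

* The 'completely satis' percentage under each denominator
display %5.2f 985/10264*100
display %5.2f 985/5140*100
```

The two display commands return 9.60 and 19.16, matching the Percent column of the first and second tables respectively.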
Almost half the total sample was not asked this item because they were not in employment at the time.

Two other options to use with the tab command are plot and sort. The plot option produces a basic bar chart of the relative frequencies of the variable categories. This chart is shown in the Results window and not through the graphing facility in Stata. The sort option rearranges the frequency table so that the categories are presented in descending order with the most frequent (mode) at the top of the table. These two options can be used together:

    tab hlstat, sort plot

    . tab hlstat, sort plot

    health over |
        last 12 |
         months |      Freq.
    ------------+------------+----------------------
           good |      4,613 |**********************
      excellent |      2,930 |****************
           fair |      1,853 |************
           poor |        641 |*******
      very poor |        219 |**
    ------------+------------+----------------------
          Total |     10,256

Compare this output with the frequency table given above for the variable hlstat. Using the plot option will mean that the percentages, category and cumulative, are not presented in the table, but using the sort option by itself produces the same style of table as earlier with the categories reordered.

    tab hlstat, sort

    . tab hlstat, sort

    health over |
        last 12 |
         months |      Freq.     Percent        Cum.
    ------------+-----------------------------------
           good |      4,613       44.98       44.98
      excellent |      2,930       28.57       73.55
           fair |      1,853       18.07       91.61
           poor |        641        6.25       97.86
      very poor |        219        2.14      100.00
    ------------+-----------------------------------
          Total |     10,256      100.00

The last two options we will cover here are nofreq and gen. The nofreq option tells Stata not to present the frequencies, which in a single-variable table means that no table is produced. This option has more uses in two-way tables (or crosstabulations) and is covered in the next chapter. The tab command combined with the gen option produces a series of dummy (or binary) variables - one for every category of the variable in the table.
After gen goes the name prefix (or stub) of the new series of dummy variables:

    tab hlstat, gen(hlth)

The table output is the same as using just the tab command but Stata has generated five new variables all starting with hlth and called hlth1, hlth2, hlth3, hlth4 and hlth5. These new variables are shown at the bottom of the list of variables in the Variables window. If you tab the new variable hlth3 you can see that the same number of cases (1,853) are given the value 1 as were in the third category (fair) on the hlstat variable:

    tab hlth3

    . tab hlth3

    hlstat==fair |      Freq.     Percent        Cum.
    -------------+-----------------------------------
               0 |      8,403       81.93       81.93
               1 |      1,853       18.07      100.00
    -------------+-----------------------------------
           Total |     10,256      100.00

Note that if cases are missing on the original variable (hlstat) then they will have missing values on these new dummy variables as well. If you want to use the tab command to create a series of dummy variables but do not want to produce a frequency table, then you could use the gen and nofreq options together:

    tab hlstat, gen(hlth) nofreq

This combination creates the new dummy variables but does not produce a frequency table in the Results window.

If you ask Stata to produce a frequency table of a variable that has a large number of categories/values then it will return an error message in red:

    too many values

The if qualifier works with almost all commands, including recoding and variable creation. Try:

    tab hlstat if sex==1

This command gives you the frequency distribution of health status for males. Don't forget that double equals signs are required for conditional commands like this. Alternatively, the bysort command can be combined with tab. For example, the command below will produce two frequency tables - one for the health status of males and one for the health status of females. The missing, nolabel, plot and sort options can still be used in this combination.

    bysort sex: tab hlstat

    . bysort sex: tab hlstat

    -> sex = male

    health over |
        last 12 |
         months |      Freq.     Percent        Cum.
    ------------+-----------------------------------
      excellent |      1,536       31.79       31.79
           good |      2,149       44.47       76.26
           fair |        808       16.72       92.98
           poor |        246        5.09       98.08
      very poor |         93        1.92      100.00
    ------------+-----------------------------------
          Total |      4,832      100.00

    -> sex = female

    health over |
        last 12 |
         months |      Freq.     Percent        Cum.
    ------------+-----------------------------------
      excellent |      1,394       25.70       25.70
           good |      2,464       45.43       71.13
           fair |      1,045       19.27       90.39
           poor |        395        7.28       97.68
      very poor |        126        2.32      100.00
    ------------+-----------------------------------
          Total |      5,424      100.00

If you want to produce a series of frequency tables you do not need to type in each tab command separately. The command tab1 will produce separate frequency tables for each of the variables in the list after the command. For example:

    tab1 sex hlstat

will produce a frequency table for sex followed by one for hlstat. The variable list can, of course, be much longer.

    . tab1 sex hlstat

    -> tabulation of sex

            sex |      Freq.     Percent        Cum.
    ------------+-----------------------------------
           male |      4,833       47.09       47.09
         female |      5,431       52.91      100.00
    ------------+-----------------------------------
          Total |     10,264      100.00

    -> tabulation of hlstat

    health over |
        last 12 |
         months |      Freq.     Percent        Cum.
    ------------+-----------------------------------
      excellent |      2,930       28.57       28.57
           good |      4,613       44.98       73.55
           fair |      1,853       18.07       91.61
           poor |        641        6.25       97.86
      very poor |        219        2.14      100.00
    ------------+-----------------------------------
          Total |     10,256      100.00

The options nolabel, missing, sort and nofreq can be used with the tab1 command as well. The gen option can also be used, but care needs to be taken: the category values from the original variables are used to name the new variables, so if the values are duplicated Stata will stop and send an error message, as it will not create two variables with the same name. Our advice is to use the gen option only with single-variable tabulations.

Frequency tables can also be produced by using the pull-down menus (see Box 5.2).
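The gen option and the if qualifier can be combined to check that the dummy variables were built as expected. This is a minimal sketch, assuming the chapter's data set is in memory and that no variables named hlth1 to hlth5 exist yet (Stata will refuse to create a variable whose name is already taken).

```stata
* A minimal sketch, assuming the chapter's data set is in memory
* and that hlth1-hlth5 have not been created already
tab hlstat, gen(hlth) nofreq

* Only the 'fair' category should be listed, with 1,853 cases
tab hlstat if hlth3==1

* Cases missing on hlstat are missing on the dummy as well
count if missing(hlth3)
```

Given the tables above, the count should match the eight cases with a missing value on hlstat.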
Box 5.2: Frequency tables using pull-down menus

Although we advocate quick progress to using do files and we use single interactive commands in the Command window in this and other chapters, Stata has the facility to use pull-down menus. You can also use these to produce descriptive statistics in frequency tables and summary statistics.

    Statistics -> Summaries, tables, and tests -> Tables -> One-way tables

This path shows that you need to first choose the Statistics main pull-down menu from the top toolbar then follow the sub-menus to One-way tables. This brings up the dialogue box called tabulate1 - One-way tables. This dialogue box has four tabs: Main, by/if/in, Weights and Advanced. On the Main tab, use the list of variables under the Categorical variable selector to choose the variable you want to put into a frequency table - in this case the variable sex.

The five option tick boxes (click in the box to choose) relate to the options:

    Treat missing values like other values = missing
    Do not display frequencies = nofreq
    Display numeric codes rather than value labels = nolabel
    Produce a bar chart of the relative frequencies = plot
    Display the table in descending order of frequency = sort

In the by/if/in tab you can specify conditions, in the same way as the bysort and if command combinations. Options in the Advanced tab let you create new dummy variables in the same way as the gen option.

SUMMARY STATISTICS

The command for common summary statistics is summarize, which can be shortened to sum or su. The command is followed by the variable or list of variables you wish to analyse. Be careful, because just typing sum in the Command window will produce summary statistics of all the variables in the data, which can result in pages and pages of output if you have a very large data set!
If you find you have done this - or made any other mistake where Stata ends up returning far too much information - you can always click on the Break button to stop the command.

The sum command produces basic descriptive statistics such as the number of observations (Obs), mean, standard deviation (Std. Dev.), minimum and maximum values. For example:

    sum ghqscale

    . sum ghqscale

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
        ghqscale |      9613    10.77125    4.914182          0         36

Additional statistics can be obtained by adding the detail option:

    sum ghqscale, detail

    . sum ghqscale, detail

               subjective wellbeing (ghq) 1: likert
    -------------------------------------------------------------
          Percentiles      Smallest
     1%            3              0
     5%            5              0
    10%            6              0       Obs                9613
    25%            7              0       Sum of Wgt.        9613

    50%           10                      Mean           10.77125
                            Largest       Std. Dev.      4.914182
    75%           13             36
    90%           17             36       Variance       24.14919
    95%           21             36       Skewness       1.366197
    99%           27             36       Kurtosis       5.713574

This expanded output includes information on the number of cases, mean and standard deviation. The minimum and maximum values can be seen with the percentile listing on the left. The 50th percentile (median) is also reported here along with the 25th and 75th percentiles for the inter-quartile range. The variance (standard deviation squared) is presented in the right-hand column along with statistics for the skewness and kurtosis of the variable (see Box 5.3).

Box 5.3: Skewness and kurtosis

The skewness statistic summarizes the degree and direction of asymmetry in the distribution of the variable. A symmetric distribution has a skewness statistic of 0. If the distribution is skewed to the left (or negatively skewed) the statistic has a negative value, and if the distribution is skewed to the right (or positively skewed) the statistic has a positive value. In the above example, the skewness has a value of 1.37 indicating that the distribution of the GHQ variable is skewed to the right. We will revisit this distribution later in the chapter when we look at graphing.
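The direction of skew can also be read from the results that summarize leaves behind after it runs. The sketch below assumes the chapter's data set is in memory; r(mean), r(p50) and r(skewness) are results stored by summarize when the detail option is used.

```stata
* A minimal sketch, assuming the chapter's data set is in memory
* quietly suppresses the table; the stored results are still available
quietly summarize ghqscale, detail

display "mean     = " r(mean)
display "median   = " r(p50)
display "skewness = " r(skewness)
```

For ghqscale the mean (10.77) sits above the median (10), which is what a positive skewness statistic leads you to expect.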
The kurtosis statistic is a summary of the shape of the distribution in relation to its peak and tails. A normal distribution has a kurtosis of 3. If the value is less than 3 then the distribution is 'flatter' than a normal distribution, with a lower peak and lighter or thinner tails. If the value is greater than 3 then the distribution is 'sharper' than a normal distribution, with a higher peak and heavier tails. In this example of the ghqscale variable the kurtosis statistic has a value of 5.71 indicating that it is more peaked than normal, with heavier tails.

Other software calculates skewness and kurtosis statistics differently, so you need to make sure you know how the statistics are calculated so you can interpret them correctly.

In Stata you can also use the sktest command to formally test the normality of a variable. Two other commands - swilk and sfrancia - are also available depending on the sample size. The sktest command tests the skewness and kurtosis of the variable with a null hypothesis that the variable is normally distributed. Our examination of the ghqscale variable shows that we would reject the null hypothesis on both its skewness and kurtosis. Note that the sktest command does not produce the value of the statistics, and you would need to run a summary statistics command to obtain the actual values.

    . sktest ghqscale

                                                    ------- joint ------
        Variable |  Pr(Skewness)   Pr(Kurtosis)  adj chi2(2)    Prob>chi2

The sum command and detail option can be used with more than one variable. Without the detail option, Stata produces a list of the variables with separating lines after every fifth variable. For example:

    sum ghqa-ghql

    . sum ghqa-ghql

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
            ghqa |      9728    2.162212    .5286127          1          4
            ghqb |      9728      1.7978     .778507          1          4
            ghqc |      9719    2.049079    .5911111          1          4
            ghqd |      9730    1.969476    .4843033          1          4
            ghqe |      9729    2.058999    .7992425          1          4
    -------------+--------------------------------------------------------
            ghqf |      9718    1.760136    .7077888          1          4
            ghqg |      9730    2.131449    .5690908          1          4
            ghqh |      9732     2.02949    .4746731          1          4
            ghqi |      9730    1.854265    .8214758          1          4
            ghqj |      9687    1.597399    .7404045          1          4
    -------------+--------------------------------------------------------
            ghqk |      9677    1.361992    .6329897          1          4
            ghql |      9684    2.011669    .5338565          1          4

The separating lines can be omitted by using the sep(0) option and changed to any other interval you wish; for example, for an interval of two variables use sep(2):

    sum ghqa-ghql, sep(2)

    . sum ghqa-ghql, sep(2)

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
            ghqa |      9728    2.162212    .5286127          1          4
            ghqb |      9728      1.7978     .778507          1          4
    -------------+--------------------------------------------------------
            ghqc |      9719    2.049079    .5911111          1          4
            ghqd |      9730    1.969476    .4843033          1          4
    -------------+--------------------------------------------------------
            ghqe |      9729    2.058999    .7992425          1          4
            ghqf |      9718    1.760136    .7077888          1          4
    -------------+--------------------------------------------------------
            ghqg |      9730    2.131449    .5690908          1          4
            ghqh |      9732     2.02949    .4746731          1          4
    -------------+--------------------------------------------------------
            ghqi |      9730    1.854265    .8214758          1          4
            ghqj |      9687    1.597399    .7404045          1          4
    -------------+--------------------------------------------------------
            ghqk |      9677    1.361992    .6329897          1          4
            ghql |      9684    2.011669    .5338565          1          4

As we have mentioned before, there are usually a few different ways of producing the statistics you want in Stata so there is some overlap with the following commands. The style of presentation differs, and for some commands this can be adjusted to suit your preferences.

If you wish to produce the standard error of the mean and confidence intervals instead of the standard deviation, minimum and maximum values you can use the ci command.

    ci ghqscale

    . ci ghqscale

        Variable |       Obs        Mean    Std. Err.     [95% Conf. Interval]
    -------------+-----------------------------------------------------------
        ghqscale |      9613    10.77125    .0501212       10.673     10.8695

The default output shows the number of cases (Obs) and mean as with the sum command, but then the standard error of the mean (Std. Err.) and 95% confidence interval are shown. The level of the confidence intervals can be chosen by using the level option. For example, if you wanted the 99% confidence intervals:

    ci ghqscale, level(99)

    . ci ghqscale, level(99)

        Variable |       Obs        Mean    Std. Err.     [99% Conf. Interval]
    -------------+-----------------------------------------------------------
        ghqscale |      9613    10.77125    .0501212     10.64212    10.90038

The mean command produces similar output to the default ci command and also has the level option. The other options for mean enable you to specify how the standard error is calculated using jackknife, bootstrap or clustering adjustments.

    . mean ghqscale

    Mean estimation                     Number of obs    =    9613

                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
        ghqscale |   10.77125   .0501212       10.673    10.86949

The command ameans will produce the arithmetic, geometric and harmonic means with confidence intervals for the variables listed.

    . ameans ghqscale

        Variable |    Type          Obs        Mean     [95% Conf. Interval]
    -------------+----------------------------------------------------------
        ghqscale | Arithmetic      9613    10.77125       10.673    10.8695
                 |  Geometric      9602    9.807562     9.716333   9.899682
                 |   Harmonic      9602

An extremely useful and flexible command for producing descriptive or summary statistics is tabstat. This command can also be used extensively when summarizing variables by categories of another variable, and this use is covered in Chapter 6. If you use just the tabstat command without any options then Stata simply returns the mean. However, the tabstat command can produce a large number of statistics. In the output below tabstat is used first on its own and then with a statistics option (shortened to s) that specifies the number of cases (n), mean (me), standard deviation (sd), minimum value (min) and maximum value (max). This second output is similar to that produced by the sum command.

    . tabstat ghqscale

        variable |      mean
    -------------+----------
        ghqscale |  10.77125

    . tabstat ghqscale, s(n me sd min max)

        variable |         N      mean        sd       min       max
    -------------+--------------------------------------------------
        ghqscale |      9613  10.77125  4.914182         0        36

The statistics or s option can be used to generate other summary statistics and the output will be in the order specified inside the parentheses of the option. In this example Stata has produced the number of cases, mean and standard deviation, followed by the standard error of the mean (sem), skewness (sk) and kurtosis (kur).
You may wish to compare this style of output with that produced by sum and the detail option.

    . tabstat ghqscale, s(n me sd sem sk kur)

        variable |         N      mean        sd  se(mean)  skewness  kurtosis
    -------------+------------------------------------------------------------
        ghqscale |      9613  10.77125  4.914182  .0501212  1.366197  5.713574

The default style of output of the tabstat command with only one variable is to put the requested statistics in one row as in the above example. However, if you are specifying a large number of statistics you may find it more convenient to have them in a column. This can be done by using a columns(variables) option which can be shortened to c(v). In the example below additional statistics have been requested. These are interquartile range (iqr) ... to produce the second output below.

    . tabstat ghqa-ghqc, s(n me sd min max)

       stats |      ghqa      ghqb      ghqc
    ---------+------------------------------
           N |      9728      9728      9719
        mean |  2.162212    1.7978  2.049079
          sd |  .5286127   .778507  .5911111
         min |         1         1         1
         max |         4         4         4

    . tabstat ghqa-ghqc, s(n me sd min max) c(s)

        variable |         N      mean        sd       min       max
    -------------+--------------------------------------------------
            ghqa |      9728  2.162212  .5286127         1         4
            ghqb |      9728    1.7978   .778507         1         4
            ghqc |      9719  2.049079  .5911111         1         4

All of the summary statistics commands - sum, mean, ci, ameans and tabstat - can be combined with bysort and if commands. For example, if you wanted the summary statistics for these three variables, but separately for men and women, then you could use the bysort command.

    . bysort sex: tabstat ghqa-ghqc, s(n me sk)

    -> sex = male

       stats |      ghqa      ghqb      ghqc
    ---------+------------------------------
           N |      4522      4523      4521
        mean |  2.125387  1.688702  2.040478
    skewness |  1.094095  .9064145  .8466639

    -> sex = female

       stats |      ghqa      ghqb      ghqc
    ---------+------------------------------
           N |      5206      5205      5198
        mean |  2.194199  1.892603   2.05656
    skewness |  1.134487  .6264102  1.016004

However, the tabstat command has a by option that you can use to specify output split by categories of another variable. The advantage of using the by option is that the summary statistics for the total sample are also produced.
The output below comes from the by option and you can see that the male and female summary statistics are given above the total sample statistics.

    . tabstat ghqa-ghqc, s(n me sk) by(sex)

    Summary statistics: N, mean, skewness
      by categories of: sex (sex)

         sex |      ghqa      ghqb      ghqc
    ---------+------------------------------
        male |      4522      4523      4521
             |  2.125387  1.688702  2.040478
             |  1.094095  .9064145  .8466639
    ---------+------------------------------
      female |      5206      5205      5198
             |  2.194199  1.892603   2.05656
             |  1.134487  .6264102  1.016004
    ---------+------------------------------
       Total |      9728      9728      9719
             |  2.162212    1.7978  2.049079
             |   1.10933  .7433614   .932651

Summary statistics can also be produced using the pull-down menus (see Box 5.4).

Box 5.4: Summary statistics using pull-down menus

The path:

    Statistics -> Summaries, tables, and tests -> Summary and descriptive statistics -> Summary statistics

takes you to a dialogue box called summarize - Summary statistics. It has three tabs. In the Main tab you choose the variables you want statistics for by either typing their names into the Variables box or by scrolling through the list of all variables and selecting the ones you want.

The equivalent path for the mean command is:

    Statistics -> Summaries, tables, and tests -> Summary and descriptive statistics -> Means

The variables are entered as before in the Model tab, with command combinations in the if/in/over tab. The SE/Cluster tab allows you to specify the type of standard error calculation appropriate for your data. The Reporting tab allows you to specify confidence intervals other than the default 95%.

This path takes you to a dialogue box equivalent to the ci command:

    Statistics -> Summaries, tables, and tests -> Summary and descriptive statistics -> Confidence intervals

GRAPHS

The graphing facilities in Stata have improved significantly since version 6 and now Stata is able to produce publication-quality graphics.
The downside is that the commands have become more complicated and sometimes extend to two or three lines (often more), and the graphics now have their own manual. Rather than try and cover all the command options and structure, experience has taught us that it is better to use the pull-down menus to introduce graphing in Stata, much as it goes against our aim to get you using do files as soon as possible. In version 9 there is a Graphics pull-down menu function called Easy graphs which was removed in version 10, but the functions are retained in other menus. Here we cover using the main graph functions in the Graphics pull-down menu in version 10. If you are using version 9, then see Box 5.5 for an introduction to the use of Easy graphs.

Box 5.5: Easy graphs in version 9

The path through the pull-down menus is:

    Graphics -> Easy graphs

This opens a full list of graphing options, including the following.

Pie charts

    Graphics -> Easy graphs -> Pie chart (by category)

This path will take you to a dialogue box called graph pie - Pie chart (by category) where you enter the variable name for which you want to graph the categories. The pie chart produced from the default settings can be a little messy and may need tidying up before it is ready for a report.

Box plots

    Graphics -> Easy graphs -> Box plot
    Graphics -> Easy graphs -> Horizontal box plot

The box plot can either be vertical or horizontal depending on its use and your preference.

Histograms

    Graphics -> Easy graphs -> Histogram

The pull-down menu takes you to a dialogue box called histogram - Histograms for continuous and categorical variables. In the Main tab you can either type in or scroll down the list of variables in the Variable box. On the right-hand side you select whether the variable chosen is either continuous or discrete.
If you want the histogram to display percentages instead of the default density scale on the Y axis you select the Options tab and select Percent in the Y axis box on the lower left-hand side. The level of information given is probably enough to judge the distribution but falls well short of report quality. Note that the default is that the bars still touch even for discrete variables, which many would consider to be technically incorrect.

For continuous variables, after typing in or selecting the variable select the Continuous data option in the Main tab. In the Options tab you can tick the Add normal density plot option in the Density plot options box on the right-hand side. This will show a normal distribution for the same mean and standard deviation as the variable so you can compare the actual distribution with an expected normal distribution.

The default graph format and colours are determined by the graph preferences through the pull-down menu path

    Edit -> Preferences -> Graph Preferences

This brings up a box that shows that the default scheme is s2 color and the default font is Arial. If you prefer your graphs to be formatted differently then you can change these. Stata has numerous schemes to choose from. Once set in this box, every future graph will be formatted in this way. However, each individual graph scheme and fonts can be changed in the tabs when you are creating the graph and you should check that what you are choosing is consistent with the general scheme.

When you create a graph it opens in a new window. You can leave this window open while you return to Stata (by clicking anywhere in the main Stata windows), but then the graph is not shown. On the toolbar there is an icon (it looks different in versions 9 and 10) that will bring the graph back to the front.
The icon is dimmed when there isn't a graph to show.

The biggest, and best, change in version 10 is the interactive graphics editor. To open the editor you need to right-click on the graph window and select Start Graph Editor, or click on the Graph Editor icon on the toolbar above the graph when it's first created (see Box 5.6). You can do this at any time a graph window is open.

The editor is very powerful and very intuitive. It allows you, for example, to easily add text and arrows to indicate more clearly what the graph is showing, as well as to format the graph, axes and titles. The Stata website says that you should get the hang of it in a few minutes as you just click on what you want to change; they are right. It's easy to get the hang of and makes many of the options available in the tabs unnecessary to start with, as you can change them in the Graph Editor. Try creating basic graphs and then editing them. For a detailed text on Stata graphics, see Mitchell (2008).

Box 5.6: Graphs and the Graph Editor

When you first create a graph it opens in a window called Stata Graph - Graph, as shown below for a histogram of the GHQ. If this graph is what you want to save or copy then you can do this from the pull-down menus or using the icons on the toolbar. If you wish to use the Graph Editor then click on the Graph Editor icon on the toolbar.

[Screenshot: the Stata Graph window showing a histogram of the GHQ; x axis: general health questionnaire: likert]

This opens the Graph Editor. Note here that you cannot do any data analysis in Stata while the Graph Editor is open.

To change part of the graph you can either click on it with the cursor or click on that part in the Object Browser. For example, if we wanted to change the x axis title we could (a) double-click on the title 'general health questionnaire: likert' or (b) click on the + symbol next to xaxis1 in the Object Browser on the right of the Editor. This expands to show the title in the tree; then double-click on title. Both of these will bring up a new window where you can change the text and its formatting. If you are not exactly sure what you want, we suggest you click on the Apply button, which will keep the window open while you see the changes in the graph. If you are happy with the changes then click the OK button to close this window.

[Screenshot: the window for editing the title]

After changing the X axis title and adding a main title, caption and some text, the final graph looks much more informative. You can see that, as the text box and arrow have been added, the tree in the Object Browser has expanded to include added text and added lines, which you can double-click on to edit. Above the graph on the left you can see a tab ...

[Screenshot: the Graphics pull-down menu in version 10, listing Bar chart, Dot chart, Pie chart, Histogram, Box plot, Scatterplot matrix, Distributional graphs, Smoothing and densities, Regression diagnostic plots, Time-series graphs, Panel-data line plots, Survival analysis graphs, ROC analysis, Multivariate analysis graphs, Quality control, More statistical graphs, Table of graphs, Manage graphs and Change scheme/size]

PIE CHARTS

Graphics -> Pie chart

This path will take you to a dialogue box called graph pie - Pie charts where, in the Main tab, you enter the variable name for which you want to graph the categories. Pie charts are most appropriate for variables that have relatively few categories as they are better displayed and interpreted. In this example we use the hlstat variable. We click on Graph by categories as we want to see the categories of the variable.

[Screenshot: the graph pie dialogue box]
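For readers who would rather keep a record in a do file, the dialogue box choices for the pie chart can be approximated with a single command. This is a minimal sketch rather than the exact command the dialogue box generates; it assumes the self-reported health status variable is named hlstat, and the title text is only an illustration.

```stata
* Sketch only: hlstat is assumed to be the self-reported health status variable
* over() draws one slice for each category of the variable
graph pie, over(hlstat) title("Self-reported health status")
```

Running the command opens the pie chart in a graph window, where it can be tidied up in the Graph Editor in the same way as a chart made from the menus.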
You can either type in the variable name or use the down arrow to scroll through the list of variables and select one. The if/in tab allows you to specify that the graph is to be based on a subset of cases. The Titles tab gives a number of options for titles and labels on the graph. The Options tab is where you can change the colour scheme.

The pie chart produced below is from these settings and some minor editing in the Graph Editor; it is still a little messy and would need tidying up before it is ready for a report.

[Figure: pie chart of self-reported health status]

BOX PLOTS

Graphics -> Box plot

Box plots (or box-and-whisker plots) are a good way to examine the distribution of a variable(s). The distribution is shown against an axis of the values of the variable. The box plot can either be vertical or horizontal depending on its use and your preferences.

The pull-down menu path takes you to a dialogue box titled graph box - Box plots, which has a Variables box where you can either type or scroll through and select variable(s). For now we concentrate on single-variable plots.

There are other tabs in this window. The Categories tab is where you specify categorical variables to produce box plots to compare, and this is covered in more detail in the next chapter. The if/in tab allows you to specify that the graph is to be based on a subset of cases. The Titles tab gives a number of options for titles and labels on the graph. The Options tab is where you can change the colour scheme and choose how to treat missing values.

[Screenshot: the graph box dialogue box]

Box plots are normally used for interval level (or continuous) variables and are not appropriate for ordinal or nominal level variables. In this example we want to graph the variable ghqscale so we enter this in the Variables box and select OK.

The box in the box plot shows the inter-quartile range; that is, the values from the 25th to the 75th percentile.
The line in the middle of the box is the median or 50th percentile. The whiskers show the lower and upper adjacent values. These are the furthest observations which are within 1.5 times the inter-quartile range of the lower and upper ends of the box. If you wish to check these figures you can see the earlier results from the su ghqscale, detail command.

[Figure: Box plot of GHQ scores]

In this example the median is 10, with the 25th percentile being 7 and the 75th percentile being 13. This makes the inter-quartile range 13 - 7 = 6. Therefore, the upper whisker can extend up to 1.5 x 6 = 9 from the 75th percentile, which is 13 + 9 = 22. All cases that have GHQ scores higher than 22 are potential outliers and are shown in the box plot as dots. The lower whisker extends to zero (the minimum value of the variable) because the 25th percentile is 7; subtracting 1.5 times the inter-quartile range gives 7 - 9 = -2, which is not a possible value.

From this box plot you might conclude that the ghqscale variable is slightly skewed to the right (or positively skewed) as there are a number of outliers above the upper whisker.

If you wish to produce a series of box plots for two or more variables then you can enter two or more variables in the Variables box on the Main tab. In the example below we have added the age variable to the ghqscale variable. Stata produces the two plots side by side. Note that Stata uses the variable labels as they are in the data set so good labels are useful, but with a quick use of the Graph Editor you can easily change them into more meaningful labels if necessary. You need to be careful when choosing variables to graph together because if one has a very small range compared to the other then the plot will be so compressed that judging its distribution will be difficult.

[Figure: side-by-side box plots of GHQ and Age]

HISTOGRAMS

Graphics -> Histogram

Histograms are the most common way of visually inspecting the distribution of a variable.
However, there is a minefield of discipline-specific terminology to negotiate when using histograms and/or bar charts (bar graphs). For bar charts in Stata, see Chapter 6. The main characteristic of a 'pure' histogram is that the area of the bars represents the value of the data, which is why the default scale for the Y axis in Stata graphs is density. The density means that the heights of the bars are adjusted so that the sum of their areas equals 1, as the width of the bars is the same for each category. As many users prefer to see the Y axis scaled in other ways, Stata offers three further options:

• Fraction: the heights of all the bars sum to 1.
• Frequency: the height of each bar is equal to the number of observations in the category.
• Percent: the heights of all the bars sum to 100.

One of the main distinctions Stata uses in constructing histograms is whether the variable is continuous (interval) or discrete (ordinal or nominal). For continuous variables Stata can automatically determine the grouping of the values to display in the bars of the chart, but for discrete variables one bar for every category is shown. The number of bars used for continuous variables is an option that we come to later.

In the first example, we use a discrete variable, hlstat. The pull-down menu takes you to a dialogue box called histogram - Histograms for continuous and categorical variables. In the Main tab you can either type in or scroll down the list of variables in the Variable box. To the right of this you select whether the data chosen is continuous or discrete. In this case we also want the histogram to display percentages instead of the default density scale on the Y axis, so we select Percent in the Y axis box on the lower left-hand side. We have specified that 'height labels' are added to the bars. As we have asked for the bars to represent percentages, percentages will be shown.
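The settings chosen so far for the discrete histogram (a discrete variable, a percentage scale and height labels) can also be written as a command. The sketch below is an approximation of what the dialogue box produces, not a copy of it. It assumes the health status variable is named hlstat, and the gap() value of 20 is an arbitrary choice used to keep the bars from touching.

```stata
* Sketch only: hlstat is assumed to be the discrete health status variable
* discrete  : one bar for every category of the variable
* percent   : scale the Y axis in percentages instead of density
* addlabels : print the height of each bar above it
* gap(20)   : narrow the bars so that they do not touch (20 is arbitrary)
histogram hlstat, discrete percent addlabels gap(20)
```

Replacing percent with fraction or frequency gives the other two Y axis scales listed above; leaving the option out returns the default density scale.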
You can see that the Bar properties button on the lower left has an * on it, which shows that we have asked for some of the bar characteristics to be changed from the default settings. These are to do with presentation, as we want the bars to be separated rather than touching, the latter being the default setting. These settings, followed by some editing in the Graph Editor, produce the histogram shown below.

[Figure: histogram titled 'Distribution of self-reported health status' (N = 10,256), with percentage labels on the bars; x axis: Self-reported health status; source note: British Household Panel Survey 1991]

We follow a similar process in this second example but use a continuous variable, ghqscale, as in the box plot examples above. This time, after typing in or selecting the variable, select the Continuous data option in the Main tab. We leave the Y axis as the default density scale but we tick the Add normal density plot option in the Density plots tab. This will show a normal distribution for the same mean and standard deviation as the variable so we can compare the actual distribution with an expected normal distribution.

[Screenshot: the histogram dialogue box]

These options will produce the histogram shown below, which will need further editing in order to be of report quality. As you can see from the actual distribution against the expected normal distribution, the variable ghqscale is slightly skewed to the right, confirming what we saw in the box plots.

[Figure: histogram of GHQ score on a density scale with a normal density curve]

Multiple graphs can also be combined into a single display (see Box 5.7).

Box 5.7: Saving and combining graphs

To save a graph for the first time use the File -> Save As pull-down menu. Then choose where you want to save it by browsing in the usual way. Once the graph has been saved any further changes can be saved by clicking on the disk icon on the toolbar.
Saved graphs can be opened by using the File -> Open graph pull-down menu in the main Stata window or by typing graph use "path/graphname" in the Command window.

Rather than displaying single graphs you may want to combine a set of graphs in one image for a report or publication. This is easily done in Stata by using the graph combine command. Below is an example of some of the graphs we have created in this chapter combined into one image using the command:

    graph combine pie.gph hbox.gph box3.gph ///
        histogram1.gph histogram.gph

There are options to the graph combine command to choose columns or rows and positions. The really nice thing is that the new combined image opens in the Graph Editor so you can edit the image even more to ensure it's exactly what you want.

DEMONSTRATION EXERCISE

In Chapter 3 we manipulated the individual level variables and saved a new data set called demodata1.dta. In Chapter 4 we merged a household level variable indicating the region of the country onto the individual level data and saved the data with a new name, demodata2.dta. In this chapter we analyse the variables we are going to use for their distribution, measures of central tendency and, for continuous variables, their normality.

First, we determine the level of measurement of all the variables:

    Variable    Level of measurement
    female      nominal
    age         interval
    agecat      ordinal
    marst2      nominal
    empstat     nominal
    numchd      ordinal
    region2     nominal
    d_ghq       nominal
    ghqscale    interval

Starting with the nominal and ordinal level variables, we simply tabulate these to examine the number of cases in each of the categories.

    tab1 female agecat marst2 empstat numchd ///
        region2 d_ghq

    -> tabulation of female

         female |
      indicator |      Freq.     Percent        Cum.
    ------------+-----------------------------------
           male |      3,914       47.95       47.95
         female |      4,249       52.05      100.00
    ------------+-----------------------------------
          Total |      8,163      100.00

    -> tabulation of agecat

            age |
     categories |      Freq.     Percent        Cum.
    ------------+-----------------------------------
    18-32 years |      2,956       36.21       36.21
    33-50 years |      3,336       40.87       77.08
    51-65 years |      1,871       22.92      100.00
    ------------+-----------------------------------
          Total |      8,163      100.00

    -> tabulation of marst2

        marital |
       status 4 |
     categories |      Freq.     Percent        Cum.
    ------------+-----------------------------------
         single |      1,619       19.83       19.83
        married |      5,786       70.88       90.71
        sep/div |        569        6.97       97.68
        widowed |        189        2.32      100.00
    ------------+-----------------------------------
          Total |      8,163      100.00

    -> tabulation of empstat

       employment |
           status |      Freq.     Percent        Cum.
    --------------+-----------------------------------
         employed |      5,575       70.89       70.89
       unemployed |        505        6.42       77.31
    longterm sick |        244        3.10       80.42
         studying |        224        2.85       83.27
      family care |        913       11.61       94.88
          retired |        403        5.12      100.00
    --------------+-----------------------------------
            Total |      7,864      100.00

    -> tabulation of numchd

        children 3 |
        categories |      Freq.     Percent        Cum.
    ---------------+-----------------------------------
              none |      5,182       63.48       63.48
        one or two |      2,443       29.93       93.41
     three or more |        538        6.59      100.00
    ---------------+-----------------------------------
             Total |      8,163      100.00

    -> tabulation of region2

              categories |      Freq.     Percent        Cum.
    ---------------------+-----------------------------------
                  London |        898       11.00       11.00
                   South |      2,504       30.67       41.68
                Midlands |      1,399       17.14       58.81
               Northwest |        849       10.40       69.21
     North and Northeast |      1,310       16.05       85.26
                   Wales |        423        5.18       90.44
                Scotland |        780        9.56      100.00
    ---------------------+-----------------------------------
                   Total |      8,163      100.00

    -> tabulation of d_ghq

          d_ghq |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      6,271       81.29       81.29
              1 |      1,443       18.71      100.00
    ------------+-----------------------------------
          Total |      7,714      100.00

For the interval level variables we have a number of options when inspecting them. We start with simple descriptive statistics.

    su age ghqscale

    . su age ghqscale

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
             age |      8163    39.32733    13.08993         18         65
        ghqscale |      7714    10.78727    4.945154          0         36

Next, we use the pull-down menus to produce histograms of the two variables with expected normal distributions in order to compare them and judge any skewness. The histograms below show that the distribution of the age variable is rather flat compared to the normal distribution, with more cases in the younger ages than the older ones.
The distribution of the GHQ variable (ghqscale) is more peaked than the expected normal distribution and slightly skewed to the right.

[Figure: histograms of age and of ghq 0-36, each with a normal density curve]

We now use the tabstat command to produce summary statistics for these two variables and then test their skewness and kurtosis using the sktest command (see Box 5.3).

    tabstat age ghqscale, s(sk kur)
    sktest age ghqscale

    . tabstat age ghqscale, s(sk kur)

       stats |       age  ghqscale
    ---------+--------------------
    skewness |  .2323762  1.364399

[Output partially legible: the kurtosis row and most of the sktest table cannot be read in the source; the legible sktest probabilities for age and ghqscale are 0.000.]

From the first part of the output you can see that the age variable has a skewness statistic of 0.22 which, being positive, indicates it is slightly skewed to the right, and a kurtosis statistic of 1.98 which, being less than 3, indicates that the distribution is flatter than normal. For the ghqscale variable the skewness value of 1.36 indicates that the distribution is skewed to the right and the kurtosis value of 5.66 indicates a distribution more peaked than normal.

Most of the techniques that we will use in this demonstration are robust to reasonable departures from normality such as these, but for the sake of this exercise we will consider what we might do if we wanted to transform the ghqscale variable to a distribution closer to normal. For distributions that are skewed to the right, one of the usual transformations used is the logarithmic (either natural logarithm or base 10). However, with this variable there is a problem in that it contains cases that have a score of zero. The logarithm of zero is undefined (it tends to minus infinity), so Stata returns a missing value for these cases. You can see this if we transform the variable as it stands.

    gen ln_ghq=ln(ghqscale)
    su ghqscale ln_ghq

    . gen ln_ghq=ln(ghqscale)
    (459 missing values generated)

    . su ghqscale ln_ghq

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
        ghqscale |      7714    10.78727    4.945154          0         36

[The ln_ghq row of this output is not legible in the source.]

The new variable (ln_ghq) has 10 fewer cases than the original variable (ghqscale). This is because there were ten cases with a value of zero in the ghqscale variable. This can be checked by either:

    tab ghqscale if ghqscale<2

or

    count if ghqscale==0

    . ta ghqscale if ghqscale<2

       ghq 0-36 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |         10       40.00       40.00
              1 |         15       60.00      100.00
    ------------+-----------------------------------
          Total |         25      100.00

    . count if ghqscale==0
       10

The new variable (ln_ghq) has a minimum value of zero because there are 15 cases in the original variable that have a value of 1, and the logarithm of 1 is zero. If we now graph the new variable we can see that it is closer to normal than the original variable.

[Figure: histogram of ln_ghq]

The summary statistics and test for normality produce the following output.

    tabstat ghqscale ln_ghq, s(sk kur)
    sktest ghqscale ln_ghq

    . tabstat ghqscale ln_ghq, s(sk kur)

       stats |  ghqscale    ln_ghq
    ---------+--------------------
    skewness |  1.364399  .2796359
    kurtosis |  6.661291  4.491426

    . sktest ghqscale ln_ghq

                       Skewness/Kurtosis tests for Normality
                                                     ------- joint ------
        Variable | Pr(Skewness)   Pr(Kurtosis)  adj chi2(2)    Prob>chi2
    -------------+-------------------------------------------------------
        ghqscale |      0.000         0.000

[The rest of this output, the passage introducing the second transformed variable ln_ghq_2, and the start of its output are not legible in the source; the legible probabilities in the sktest results for ghqscale, ln_ghq and ln_ghq_2 are all 0.000.]

At the start of this output you can see that Stata tells you that 449 missing values are generated in the new variable, which matches the number of cases that don't have a GHQ score in the original variable (ghqscale). This is checked by producing summary statistics where you can see that both the original variable (ghqscale) and the newly created variable (ln_ghq_2) have the same number of cases. The skewness and kurtosis statistics produced by the tabstat command indicate that the ln_ghq_2 variable is slightly less skewed than the ln_ghq variable but more peaked, as the kurtosis value is higher.
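The text refers to a second transformed variable, ln_ghq_2, which keeps the cases with a score of zero. The sketch below shows one common way of building such a variable. It rests on an assumption that the surviving text does not confirm: that a constant of 1 is added to ghqscale before the natural logarithm is taken. It also assumes that ln_ghq has already been created as shown earlier.

```stata
* Sketch only: adding 1 before logging is an assumption, not confirmed by the text
* Zero scores are kept because ln(0 + 1) = 0; only cases missing on ghqscale stay missing
gen ln_ghq_2 = ln(ghqscale + 1)

* Compare the number of cases in the original and transformed variables
su ghqscale ln_ghq ln_ghq_2

* Skewness and kurtosis, then the normality test, for all three variables
tabstat ghqscale ln_ghq ln_ghq_2, statistics(skewness kurtosis)
sktest ghqscale ln_ghq ln_ghq_2

* Histogram of the new variable with a normal curve overlaid
histogram ln_ghq_2, normal
```

Whatever constant is used, the shift changes the shape of the logged distribution a little, so the skewness and kurtosis of ln_ghq_2 should be checked rather than assumed to match those of ln_ghq.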
As with all transformations, there is a trade-off to be made between skewness and ease of interpretation. In this example we would probably use the ghqscale in its original form as the techniques are robust to reasonable departures from normality. The transformed GHQ score variables (either ln_ghq or ln_ghq_2) are closer to normal but the interpretation of statistics further on in this example is less intuitive. For example, a difference of means between men and women would be a difference in the logarithmic GHQ scores rather than the raw scores constructed from the items. For the rest of this demonstration we will use the untransformed ghqscale variable.