Structure of an epidemiological dataset
• In a typical epidemiological study, a range of
different information is collected on each
participant.
• A typical dataset then has:
◦ Rows (normally each participant has one row)
◦ Columns (normally each variable has one
column)
Example of a dataset
id age sex education married weight smoking
1 56 1 2 1 88 1
2 54 2 4 2 57 2
3 53 2 1 4 63 3
4 58 2 3 2 49 3
5 49 1 2 3 79 2
6 55 1 5 4 90 1
7 56 1 3 1 89 1
8 57 2 4 1 63 1
etc… etc… etc… etc… etc… etc… etc…
Inspection of data
• It is crucial that you become familiar with your
dataset:
◦ Structure
◦ Number of individuals
◦ Variables
◦ Range of values
◦ Definitions
◦ Outliers
◦ Missingness
• Always inspect the raw dataset before any
analysis
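A first inspection can be sketched in a few lines of Python. This is only an illustration using the eight example rows shown earlier; the helper name `summarise` is hypothetical, and a real dataset would be read from a file rather than typed in.

```python
# The eight example rows from the slide, one line per participant.
columns = ["id", "age", "sex", "education", "married", "weight", "smoking"]
raw = """1 56 1 2 1 88 1
2 54 2 4 2 57 2
3 53 2 1 4 63 3
4 58 2 3 2 49 3
5 49 1 2 3 79 2
6 55 1 5 4 90 1
7 56 1 3 1 89 1
8 57 2 4 1 63 1"""
rows = [dict(zip(columns, map(int, line.split()))) for line in raw.splitlines()]


def summarise(rows, columns):
    """Return {variable: (n_non_missing, n_missing, minimum, maximum)}."""
    summary = {}
    for col in columns:
        # A value of None would count as missing (none occur in this example).
        values = [r[col] for r in rows if r.get(col) is not None]
        summary[col] = (len(values), len(rows) - len(values), min(values), max(values))
    return summary


summary = summarise(rows, columns)
print("participants:", len(rows))
for col, (n, missing, lowest, highest) in summary.items():
    print(f"{col:10s} n={n} missing={missing} range={lowest}-{highest}")
```

Looking at the ranges is a quick way to spot impossible values (outliers) before any analysis.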
Measures of association
• Risk of disease, rate of disease in different
population groups
• Comparison of risks/rates
Constructing 2-way table
For binary health outcomes (Y/N), it is possible to construct
a 2x2 table and to estimate either relative or absolute measures
of risk
Disease
Exposure Yes No Total
Yes a b a+b
No c d c+d
Total a+c b+d a+b+c+d
Example
• A random sample of individuals were questioned
about their occupation, and their blood pressure (BP) was
measured. Based on their systolic (SBP) and diastolic (DBP)
measurements, they were classified as hypertensive or
non-hypertensive. Among 300 people in non-manual
jobs, there were 72 hypertensive individuals.
Among 240 people in manual jobs, there were 96
hypertensive individuals.
Constructing 2-way table
As a first step we need to organize our data in a formal way:
we construct a 2-way table
Hypertension
Yes No Total
Manual 96 144 240
Non-manual 72 228 300
Total 168 372 540
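The table above can be filled in from the four numbers given in the example (cases and group sizes). A minimal sketch, with a hypothetical helper name `two_by_two`:

```python
def two_by_two(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Build a 2x2 table with margins from case counts and group sizes."""
    a, b = cases_exposed, n_exposed - cases_exposed        # exposed: yes / no
    c, d = cases_unexposed, n_unexposed - cases_unexposed  # unexposed: yes / no
    return {
        "rows": [[a, b], [c, d]],
        "row_totals": [a + b, c + d],
        "col_totals": [a + c, b + d],
        "total": a + b + c + d,
    }


# Manual workers: 96 hypertensive out of 240; non-manual: 72 out of 300.
table = two_by_two(96, 240, 72, 300)
print(table)
```

The "No" cells are obtained by subtraction, for example 240 - 96 = 144 non-hypertensive manual workers.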
What does it mean when we speak
about an association between two
categorical variables?
• It means that knowing the value of one variable
tells us something about the value of the other
variable.
• Two variables are therefore said to be
associated if the distribution of one variable
varies according to the value of the other
variable.
• In our example, the two variables, occupation and
hypertension, are associated if the distribution of
hypertension varies between occupational groups.
• If the distribution of hypertension is the same in both
occupational groups, we can say that there is no
association between hypertension and occupational
category, because knowing the occupational
category of an individual will not tell us anything about
their hypertension status.
• Having constructed a two-way table, the next step is to
look at whether the distribution of one variable differs
according to the value of the other variable.
• We need to calculate either row or column percentages.
• Often, one variable can be regarded as the response
variable, while the other is the explanatory variable,
and this should help to decide which percentages are
shown.
• If the columns represent the explanatory variable, then
column percentages are more appropriate, and vice versa.
Constructing 2-way table
As a second step we calculate the proportion of hypertensive
individuals among manual workers, among non-manual workers and
in the whole sample
Hypertension
Yes No Total
Manual 96 (40.0%) 144 (60.0%) 240
Non-manual 72 (24.0%) 228 (76.0%) 300
Total 168 (31.1%) 372 (68.9%) 540
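The row percentages in this table can be reproduced with a short Python sketch (the function name `row_percentages` is just an illustrative choice):

```python
# Observed counts (hypertension yes, hypertension no) by occupation.
observed = {"Manual": (96, 144), "Non-manual": (72, 228)}


def row_percentages(yes, no):
    """Return the percentage of 'yes' and 'no' within one row."""
    total = yes + no
    return (100 * yes / total, 100 * no / total)


pct = {group: row_percentages(*counts) for group, counts in observed.items()}
overall = row_percentages(96 + 72, 144 + 228)

for group, (p_yes, p_no) in pct.items():
    print(f"{group:11s} yes={p_yes:.1f}% no={p_no:.1f}%")
print(f"{'Total':11s} yes={overall[0]:.1f}% no={overall[1]:.1f}%")
```

Row percentages are used here because occupation (the explanatory variable) defines the rows.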
The numbers in the four categories
in the 2-way table in the previous
slide are called
OBSERVED NUMBERS
• The data seem to suggest some association
between hypertension and occupation (40% of
manual workers with hypertension compared to
24% of non-manual workers with hypertension)
• The calculation and examination of such
percentages is an essential step in the analysis of a
two-way table, and should always be done before
starting formal significance tests.
Significance test for the association
• Although it seems that there is an association
in the table, the question is whether this may
be attributable to sampling variability
• Each of the percentages in the table is subject
to sampling error, and we need to assess
whether the differences between them may be
due to chance
• This is done by conducting a significance test
• The null hypothesis is “there is no
association between the two variables”
Expected numbers
• The significance test used is the chi-squared test
• This test compares the observed numbers in
each of the four categories of the contingency table
with the numbers that would be expected if there were
no difference in the proportion of hypertensive
individuals between the two occupational groups
Hypertension
Yes No Total
Manual 74.64 240
Non-manual 300
Total 168 (31.1%) 372 (68.9%) 540
• From the table above, the overall proportion of
hypertensive individuals is 168/540 (31.1%).
• If the null hypothesis were true, the expected number of
manual subjects with hypertension would be 31.1% of 240, which is
74.64
Expected numbers
Hypertension
Yes No Total
Manual 74.64 165.36 240
Non-manual 93.36 206.64 300
Total 168 (31.1%) 372 (68.9%) 540
• Expected numbers in the other cells of the table can be
calculated similarly, using the general formula:
$$\text{Expected number} = \frac{\text{Row total} \times \text{Column total}}{\text{Overall total}}$$
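The general formula can be applied to every cell at once. The sketch below is one possible way to do it in Python (the function name `expected_counts` is hypothetical). Note that it uses the unrounded proportion 168/540, so it gives 74.67 where the slide, working from the rounded 31.1%, shows 74.64.

```python
def expected_counts(observed):
    """Expected count for each cell: row total x column total / overall total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


# Rows: manual, non-manual; columns: hypertension yes, no.
expected = expected_counts([[96, 144], [72, 228]])
for row in expected:
    print([round(e, 2) for e in row])
```

The expected numbers in each row and column add up to the same totals as the observed numbers.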
Next step – compare observed and expected numbers
EXPECTED Hypertension
Yes No Total
Manual 74.64 165.36 240
Non-manual 93.36 206.64 300
Total 168 (31.1%) 372 (68.9%) 540
OBSERVED Hypertension
Yes No Total
Manual 96 144 240
Non-manual 72 228 300
Total 168 (31.1%) 372 (68.9%) 540
Chi-squared test (χ² test)
• Calculate (O−E)²/E for each cell and sum over
all cells:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
• In our example:
$$\chi^2 = \frac{(96-74.64)^2}{74.64} + \frac{(144-165.36)^2}{165.36} + \frac{(72-93.36)^2}{93.36} + \frac{(228-206.64)^2}{206.64} = 15.97$$
• If the χ² value is large then (O−E) is, in general,
large and the data do not support H0: we conclude that
there is an association
• If the χ² value is small then (O−E) is, in general,
small and the data are consistent with H0: no
association
• Large values of χ² suggest that the data are
inconsistent with the null hypothesis, and
therefore that there is an association
between the two variables.
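The calculation can be checked with a short Python sketch. It uses unrounded expected numbers, so it returns about 15.93 rather than the 15.97 obtained above from expected numbers based on the rounded 31.1%. The conclusion is unchanged.

```python
observed = [[96, 144], [72, 228]]  # rows: manual, non-manual; columns: yes, no

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / total  # expected number under H0
        chi2 += (o - e) ** 2 / e                   # contribution of this cell

print(f"chi-squared = {chi2:.2f}")
```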
Obtaining p-value
 Under H0: χ2 distribution
• The P-value is obtained by referring the
calculated value of χ² to tables of the chi-squared
distribution.
• The P-value in this case corresponds to the
value shown as α in the tables.
 The degrees of freedom are given by the
formula:
d.f. = (r – 1) x (c – 1)
• r = number of rows, c = number of columns
Back to our example:
Χ2 = 15.97
Table 2x2
d.f.=1
and from the table
P<0.001
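Instead of a printed table, the P-value for 1 degree of freedom can be computed directly. The sketch below relies on the fact that a chi-squared variable with 1 d.f. is the square of a standard normal variable; it does not cover other degrees of freedom, and the function name is an illustrative choice.

```python
import math


def chi2_p_value_1df(chi2):
    """Upper-tail P-value for a chi-squared statistic with 1 degree of freedom."""
    # With 1 d.f., P(X > chi2) = P(|Z| > sqrt(chi2)) = erfc(sqrt(chi2 / 2)),
    # where Z is a standard normal variable.
    return math.erfc(math.sqrt(chi2 / 2))


p = chi2_p_value_1df(15.97)
print(f"P = {p:.6f}")
print("P < 0.001:", p < 0.001)
```

As a sanity check, the familiar critical value 3.841 should give a P-value of about 0.05.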
Quick formula
There is a quick formula for
calculating the chi-squared test
for a 2x2 table
If we use the following
notation:
Total
a b e
c d f
Total g h N
the quicker formula for calculating the chi-squared test is
$$\chi^2 = \frac{N(ad - bc)^2}{efgh}$$
d.f.=1
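A minimal Python sketch of the quick formula, applied to the hypertension example (a=96, b=144, c=72, d=228). The function name `chi2_quick` is hypothetical. Because no rounding of expected numbers is involved, the result is about 15.93, slightly below the 15.97 obtained earlier from rounded expected numbers.

```python
def chi2_quick(a, b, c, d):
    """Chi-squared statistic for a 2x2 table: N(ad - bc)^2 / (efgh)."""
    e, f = a + b, c + d  # row totals
    g, h = a + c, b + d  # column totals
    n = e + f            # overall total
    return n * (a * d - b * c) ** 2 / (e * f * g * h)


chi2 = chi2_quick(96, 144, 72, 228)
print(f"chi-squared = {chi2:.2f}, d.f. = 1")
```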
SELF-STUDY
Larger tables (r x c tables)
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
d.f. = (r-1) x (c-1)
• Valid if less than 20% of expected numbers
are under 5 and none is less than 1
• If low expected numbers – combine either
rows or columns to overcome this problem
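The validity rule can be written as a small check on the expected numbers. This is a sketch only: the function name `chi2_valid` and the second, sparse table are hypothetical and serve just to illustrate a failing case.

```python
def chi2_valid(expected):
    """Rule of thumb: fewer than 20% of expected numbers under 5, none under 1."""
    cells = [e for row in expected for e in row]
    share_under_5 = sum(e < 5 for e in cells) / len(cells)
    return share_under_5 < 0.20 and min(cells) >= 1


# Expected numbers from the hypertension example (all well above 5).
print(chi2_valid([[74.67, 165.33], [93.33, 206.67]]))

# Hypothetical sparse table: one expected number is below 1.
print(chi2_valid([[0.5, 9.5], [4.5, 85.5]]))
```

When the check fails, rows or columns can be combined and the expected numbers recalculated.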
How to calculate the expected number
in a particular cell
$$\text{Expected number} = \frac{\text{Row total} \times \text{Column total}}{\text{Overall total}}$$
Interpretation of chi-square test
results:
Chi-squared tests in STATA
• We try to evaluate whether there is an association
between current smoking and age
• We have age grouped into 4 groups (30-39, 40-49,
50-59, 60-69)
• Smoking (variable smok) was coded 1=current
smokers, 0=non-smokers
Let’s check the proportion of smokers in each age category
. tab smok agegroup, col
| 30-39,40-49,50-59,60-69
1=yes 0=no | 30 40 50 60 | Total
-----------+--------------------------------------------+--------
0 | 337 357 490 491 | 1,675
| 54.71 56.31 72.38 78.81 | 65.69
-----------+--------------------------------------------+--------
1 | 279 277 187 132 | 875
| 45.29 43.69 27.62 21.19 | 34.31
-----------+--------------------------------------------+--------
Total | 616 634 677 623 | 2,550
| 100.00 100.00 100.00 100.00 | 100.00
Chi-squared test
. tab smok agegroup, col chi
| 30-39,40-49,50-59,60-69
1=yes 0=no | 30 40 50 60 | Total
-----------+--------------------------------------------+-------
0 | 337 357 490 491 | 1,675
| 54.71 56.31 72.38 78.81 | 65.69
-----------+--------------------------------------------+-------
1 | 279 277 187 132 | 875
| 45.29 43.69 27.62 21.19 | 34.31
-----------+--------------------------------------------+-------
Total | 616 634 677 623 | 2,550
| 100.00 100.00 100.00 100.00 | 100.00
Pearson chi2(3) = 118.7458 Pr = 0.000
(chi2(3): 3 degrees of freedom; 118.7458: chi-squared test value; Pr = 0.000: p<0.001)
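The same statistic can be recomputed by hand from the counts in the Stata table. The Python sketch below (with a hypothetical helper `pearson_chi2`) applies the general formula for r x c tables and should agree with the Pearson chi2(3) value reported by Stata.

```python
# Observed counts from the Stata table: rows are smok (0, 1),
# columns are the four age groups (30-39, 40-49, 50-59, 60-69).
observed = [
    [337, 357, 490, 491],  # non-smokers
    [279, 277, 187, 132],  # current smokers
]


def pearson_chi2(observed):
    """Return (chi-squared statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total
            chi2 += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


chi2, df = pearson_chi2(observed)
print(f"Pearson chi2({df}) = {chi2:.4f}")
```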
Measures of effect
• So far we have used the χ² test to test whether there is an
association, that is, whether proportions differ, using a 2x2
(or 2xn) table
• We are often interested in the relative difference
between two proportions rather than the absolute
difference
• The effect estimates that we present are then
ratios. There are three main measures we will use:
◦ Rate ratio
◦ Risk ratio
◦ Odds ratio
Relative measures of effect (relative risk)
We have 2 groups of individuals:
• An exposed group (with the risk factor
of interest) and an unexposed group
(without the factor of interest)
• We are interested in comparing the
amount of disease (mortality or other
health outcome) in the exposed group to
that in the unexposed group
Risk/rate
• Measures the strength of association
between the risk factor and disease
• Incidence rate or risk in exposed (r1)
• Incidence rate or risk in unexposed (r0)
Risk ratio
• we calculate the risk ratio (RR) as:
RR=r1/r0
Risk difference
• the absolute difference between two risks (or
rates)
RD = r1 – r0
Example: cohort study of oral
contraceptive use and heart attack
Myocardial infarction
Yes No Total
OC use
Yes 25 400 425
No 75 1500 1575
Total 100 1900 2000
Risk (exposed) = 25/425=0.059
Risk (unexposed) = 75/1575=0.048
Relative risk = 0.059/0.048 = 1.23
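The same calculation as a small Python sketch (the function name `risk_measures` is hypothetical). It also returns the risk difference. With unrounded risks the ratio is about 1.24; the 1.23 above comes from dividing the risks after rounding them to three decimals.

```python
def risk_measures(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Return risks in exposed (r1) and unexposed (r0), risk ratio and risk difference."""
    r1 = cases_exposed / n_exposed
    r0 = cases_unexposed / n_unexposed
    return {"r1": r1, "r0": r0, "RR": r1 / r0, "RD": r1 - r0}


# OC users: 25 MI cases out of 425; non-users: 75 out of 1575.
m = risk_measures(25, 425, 75, 1575)
print(f"r1 = {m['r1']:.3f}, r0 = {m['r0']:.3f}")
print(f"RR = {m['RR']:.2f}, RD = {m['RD']:.4f}")
```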
• We can also have different strata of
exposure. We may calculate ratio measures
for each stratum: we compare the measure of
frequency in each level with the measure of
frequency in the baseline (unexposed) level.
• Example: Death rates from CHD in smokers
and non-smokers by age
Age       Smokers rate  Non-smokers rate  Rate ratio
35-44     0.61          0.11              5.5
45-54     2.40          1.12              2.1
55-64     7.20          4.90              1.5
65-74     14.69         10.83             1.4
75-84     19.18         21.20             0.9
85+       35.93         32.66             1.1
ALL AGES  4.29          3.30              1.3
What can you say about this table?
The rate ratio decreases with increasing age. It may
suggest that the relative effect of smoking on the rate of CHD is
stronger at younger ages.
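The rate ratios in the table can be recomputed from the two rate columns. A minimal Python sketch, using the rates exactly as given in the table:

```python
# (smokers rate, non-smokers rate) by age group, copied from the table.
rates = {
    "35-44": (0.61, 0.11),
    "45-54": (2.40, 1.12),
    "55-64": (7.20, 4.90),
    "65-74": (14.69, 10.83),
    "75-84": (19.18, 21.20),
    "85+": (35.93, 32.66),
    "ALL AGES": (4.29, 3.30),
}

# Rate ratio = rate in smokers / rate in non-smokers (the baseline group).
ratios = {age: smokers / non_smokers for age, (smokers, non_smokers) in rates.items()}

for age, ratio in ratios.items():
    print(f"{age:9s} rate ratio = {ratio:.1f}")
```

A ratio below 1 (as in the 75-84 group) means the rate in smokers is lower than in non-smokers in that stratum.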