Lecture 4
Working with skewed, categorical and clustered data
R101: A practical guide to making R your everyday statistical tool (PSY532)

Programme (lecture and seminar)
•Skewed outcome variable: generalized linear modelling
•Categorical outcome variable:
^ Covered in readings (Baguley) but not in lecture
Red denotes version of generalized linear modelling
•Clustered outcome variable: zero-inflated (mixture) modelling; multilevel modelling
•For each analysis type: descriptive statistics, running the analysis, diagnostics, and
reporting/article examples

Logistic regression and multilevel modelling demonstrations will make use of a different data set

Skewed outcome variable


Generalized linear models
•A principled alternative to transformation when the outcome variable is constrained – e.g., when
the outcome variable is:
–a count of members in a category (this is where chi-squared tests and logistic regression come in,
and these will be covered later in the lecture)
–a count of occurrences in a calendar year or some other period of time (this is where Poisson
regression is useful)
–highly skewed, and therefore better represented by a Poisson or negative binomial distribution
(this is the case we will focus on in this section)
•If your outcome variable has a highly skewed distribution (e.g., life satisfaction scores or
people’s estimates of a count, as in our SS data), these models are worth fitting after the initial
ANOVA or regression to see if they fit the data better.
•Terminology: the ANOVA and regression techniques we have discussed so far are instances of general
linear modelling, a special case of generalized linear modelling
Reading: Baguley Ch 17, Gelman Ch 6

Use lecture: constrained Y – counts, binary

Y
X2
X1
Three components:
1.Additive combination of predictors
2.Random component/ family of distribution for Y (e.g., Poisson, negative binomial, binomial)
3.The link function (e.g., logarithm, logistic). Analysts typically use canonical (default) link
functions as listed in the R help file. Logarithmic function is canonical for Poisson and negative
binomial random components. Logistic function is canonical for binomial distributions.
General linear model
2.Normally distributed Y (~ means “distributed as”)
3.Identity link function (Y is the additive comb.)
implies...
Yi ~ Normal(mean = b2Xi2 + b1Xi1 + b0, variance = σ2 which does not depend on X1 or X2)
Additive combination of predictors

Sigma squared: residual sum of squares with degrees of freedom taken into account
(SSY−SSE)/(DFY−DFE)

Y
X2
X1
Three components:
1.Additive combination of predictors
2.Random component/ family of distribution for Y (e.g., Poisson, negative binomial, binomial)
3.The link function (e.g., logarithm, logistic). Analysts typically use canonical (default) link
functions as listed in the R help file. Logarithmic function is canonical for Poisson and negative
binomial random components. Logistic function is canonical for binomial distributions.
2.Normally distributed Y
3.Identity link function (Y is the additive comb.)
Yi ~ Poisson(mean = variance = exp(b2Xi2 + b1Xi1 + b0))
2.Poisson-distributed Y
3.Logarithmic link: hence the exponentiation
Additive combination of predictors
Yi ~ NegBin(mean = log(exposure parameter) . exp(b2Xi2 + b1Xi1 + b0), overdispersion = w)
Yi ~ Norm(mean = b2Xi2 + b1Xi1 + b0, variance = σ2)
2.Negative-binomial-distributed Y
3.Logarithmic link: hence the exponentiation

From the distribution of Yi, you obtain E(Yi|Xi) = b[2]X[i2] + b[1]X[i][1] + b[0]

•Hypothesis
•In the literature on short-term memory, the first few words in word lists are consistently found
to be remembered better than the other words. Thus, more wins should be remembered in the
descending condition relative to the others.
•
•Descriptive statistics
•Same as for ANCOVA performed in Lecture/Seminar 2
•
•Running the analysis
•glm in base package and, for negative binomial random components, glm.nb in MASS package. By
default, non-sequential sums of squares. Can use anova.glm function to obtain sequential sums of
squares.
•
•Diagnostics: Does the model fit well? Does it fit better than the original ANOVA/regression?
•Residual deviance: looking for lower deviance values
•AIC: the smaller this is, the better the model
•Cook’s distances: view plot to make sure there are not a few stand-out influential points. Cook’s
distances three times greater than the mean of the Cook’s distances are a cause of concern,
especially if there are only one or two (easily deletable) associated cases.
Generalized linear modelling to test Hypothesis 2 from our dataset
Go to script

Reporting the analysis – as in Results section
•Table 1 (or very clear graph) showing means and SDs of outcome variable across levels of the
categorical predictors. •In text: To test Hypothesis 2, a generalized linear model was fitted using
the glm.nb package in R Version 3.1.0 with percentage of remembered wins as the outcome variable,
success-slope and question wording as the predictors, and background beliefs (Drake Beliefs About
Chance total score) as a covariate. The analysis  (with Type II sums of squares) revealed
significant effects of success-slope (LR χ2 (3) = 11.56,  p=.01) and question wording (LR χ2 (1) =
48.95,  p<.001).  There was also a significant interaction (LR χ2 (3) = 15.90,  p=.001), together
with a significant effect of the covariate (LR χ2 (1) = 14.81,  p<.001). Planned comparisons of the
Descending condition’s mean to those of other groups revealed a significant difference between the
U-shaped and Descending groups (p = .01), and the Flat and Descending groups (p = .02).
•

Categorical outcome variable


Chi-square tests
•Used for determining whether two categorical variables are significantly associated, based on
counts displayed in a frequency (or “contingency”) table.
Reading: LSR Ch 12
•Logic: Compare observed frequencies to what would be expected under the null hypothesis in light
of your degrees of freedom.
•Effect size: Cramer’s V, ranging from 0 (no association) to 1 (perfect association)
•Assumptions:
-The two variables in the table are independent (otherwise, for a “repeated measures” design, use
McNemar test)
-The expected frequency in each cell is sufficiently large (a rough guide is that, if the expected
frequency in any cell is less than 5, use Fisher exact test)
Desc.
U-shaped
Asc.
Flat
Strategy “no”
74
72
57
65
Strategy “yes”
11
14
22
19

Explain why table is two-dimensional

•Hypothesis
•If people perceive themselves to be problem-solving (learning a strategy) in games of chance, the
illusion of natural control should be greater in the Ascending slope condition relative to the
Descending slope condition. This implies that the number of people saying “Yes, I had a strategy”
in response to the PostStrategyPresent question should be associated with success-slope condition
(SeqCond).
•
•Descriptive statistics
•Frequency table as on previous slide. xtabs in base package.
•
•Running the analysis
•associationTest in lsr package (fisher.test and mcnemar.test are in the stats package)
•
Chi-squared test of a version of Hypothesis 1 in our dataset
Go to script

Reporting the analysis
•Frequency table
•Text: A chi-square test indicated that the association between success-slope and reporting of a
strategy approached significance (χ2(3) = 6.84, p = .07, Cramer’s V = 0.14). This might reflect the
slightly higher incidence of strategy reports in the Ascending condition, and possibly the Flat
condition also. [Notice that it is useful to comment on what the association implies - where were
the differences?]

Logistic regression
•In its basic form, can be used only with a binary outcome variable (e.g., alive or dead; ill or
not). However, can include any number of categorical and continuous predictors.
•Uses: hypothesis tests and prediction. There is an example of each in the Study Materials Lecture
4 folder. For another good example of prediction, search Masaryk University Catalogue or Google
Scholar for: “Prediction of probable Alzheimer's disease in memory-impaired patients: A prospective
longitudinal study” (article available only in HTML). Here, we will focus on hypothesis testing.
•A generalized linear model:
–Binomial random component
–Logit link function
Reading: Baguley Ch 17

Show students all three articles.

Hypothesis from a new dataset: a nationally representative Czech survey on drug use and gambling
(CG1 in workspace)
•Survey questions
•What type of game/s have you played in the last 12 months? Select all that apply:
qslot machines
qonline slot machines
qvirtual gaming machines (e.g. virtual roulette)
qcasino games (e.g. roulette, cards, dice)
qcard tournaments outside of casinos (e.g. poker)
qsports and non-sports betting at betting offices/bookmakers
qonline betting at registered Czech operators
qother online betting (e.g. online poker, roulette)
qlotteries
qI did not play on any of these
•In the last 12 months, how often have you played any of the games listed above? Response options:
(0) only once, (1) less than once a month, (2) once a month, (3) several times a month (2-3 times),
(4) at least once a week (1-2 times), (5) several times a week (3-4 times), (6) every day or almost
every day (5-7 times per week)
•
Over 2000 people answered the survey but they were excluded from this analysis if they answered
lotteries only or did not play.

Hypothesis from a new dataset: a nationally representative Czech survey on drug-use and gambling
•Survey questions (cont.)
•Problem Gambling Severity Index (Ferris & Wynne, 2001)
•Thinking about the last 12 months...
–have you bet more than you could really afford to lose?
•0: never
•1: sometimes
•2: most of the time
•3: almost always
–have you needed to gamble with larger amounts of money to get the same feeling of excitement?
–did you go back another day to try to win back the money you lost?
–... 9 questions total
–Interpretation of total score:
•0-2: no risk
•3-7: at risk
•8 or more: pathological gambler
–

Hypothesis from a new dataset
•Research question: Is online gambling a particularly dangerous type of gambling?
•Past studies have found pathological gambling to be associated with a higher reported frequency of
play, and also with play on a wider variety of gambling types. If online gambling leads to an
increased probability of gambling pathology, respondents’ pathological gambling scores (scores on
the PGSI) should relate not only to the frequency and variety of gambling activity, but also to
whether at least one of the gambling activities was performed online.
•
•Analysis plan
•Hierarchical logistic regression with pathological gambling status (no risk vs. at
risk/pathological) as the binary dependent variable.
–Step 1: Frequency of play (ordered), ranging  from 0 (once only) to 6 (almost every day). Entered
first.
–Step 2: Variety of games played (categorical): one vs. more than one
–Step 3: Whether one of the played games was online (categorical): yes vs. no
–
•Descriptive statistics and running the analysis
•Means and frequency counts of outcome and predictors; then glm in base
•
•
Script

Hypothesis from a new dataset
•Hierarchical logistic regression
•Use the anova function at each step (Steps 1, 2 and 3 in our example) to assess the contribution
of each predictor to the model: is the reduction in deviance significant with the introduction of
the predictor at each step? The displayed analysis of deviance test will provide this information
through a deviance statistic and associated p-value, determined based on a chi-square distribution.
•
•This approach can have some problems when there are continuous predictors in the model. In such a
case, you can conclude that a predictor is significant if its inclusion in the model reduces the
deviance and AIC. See Baguley p. 683.
•

Hypothesis from a new dataset
•Interpretation of coefficients
•If performing a hierarchical logistic regression, it makes sense to do this for the final model,
with all the predictors included.
•Some helpful plots are also shown in the script.
•Three sources of information:
•coefficients expressed as log odds using summary(modelname): multiply coefficient by .25 to obtain
a measure of percentage change in probability of moving from 0 (absence) to 1 (presence) on the
outcome variable with each one unit change in the predictor (or with a shift from the reference
category to the listed category if the predictor is categorical, as in our example)
•Wald z-tests shown as part of summary(modelname): tell us whether the listed predictor is a
significant predictor in that its slope in the model is significantly greater than 0
•coefficients expressed as odds ratios using exp(modelname$coefficients): tell us the odds of
moving from 0 (absence) to 1 (presence) on the outcome variable as a proportion of the odds of
doing so when the predictor is one unit less or is at the reference level (see further explanation
on next slide and in the script)

Hypothesis from a new dataset
•Interpretation of coefficients (cont.):
•The odds ratio
•If equal to 1, the odds are the same, so there is no change in the outcome variable with changes
in the predictor
•If greater than 1, the odds increase with the predictor’s unit increase or change from reference
category
•If less than 1, the odds decrease with the predictor’s unit increase or change from reference
category
•
•A mixture of categorical and continuous predictors
•The log odds and odds ratios are likely to be much smaller for continuous predictors than for
categorical ones. This is because unit changes are generally smaller than entire category shifts.
To make different types of predictors comparable in the same model, it is therefore useful to
calculate the coefficient for more than one unit change (e.g., some meaningful number or two
standard deviations). This is just a case of multiplying the predictor by the chosen number of
units.
•
•
The difference between “odds” and “probabillity”
In a race, you may see the odds for your horse, Camilla, are 8 to 1, which are the odds AGAINST
winning. This means in nine races Camilla would be expected to win 1 and lose 8. In probability
terms, Camilla has a probability of winning of 1/9, or 0.111. But the odds of winning are 1/8, or
0.125. Odds are actually the ratio of two probabilities...
probability of event
1 – probability of event

Classification table should be included here too

•Diagnostics
•Collinearity: Use chi-square tests and/or ANOVAs to determine whether any predictors are related
to each other. Relationships between predictors make the odds ratios for individual coefficients
less interpretable, since each ratio expresses the effect of a unit change in the associated
predictor when all others remain constant. But what if the predictor can only change when another
factor changes? Of course, the same considerations apply in linear regression as well, but to a
lesser degree. An alternative model should be considered with one of the two related predictors
removed. We do this here (model4).
•Influential points: If any are detected, try running the regression without them.
•Sparse data: If you have relatively few people in one of the two categories of the outcome
variable, your model will perform little better (if not worse) than a model that places everyone
into one group (e.g., “no risk”).  This situation is very common, especially if you are dealing
with illnesses, which tend to affect the minority (e.g., Alzheimer’s, pathological gambling). Try
to make the counts in the categories of the outcome variable as equal as possible through data
collection or careful selection of the two categories in the existing data set (e.g., in our
analysis here we look only at 206 people who reported gambling in the preceding 12 months, not the
whole survey sample of 2000 people).
•No normality of residuals or homogeneity of variance assumption.
•

Reporting the analysis
•Frequency table and/or description based on descriptive statistics:
–197 people who reported gambling in the last 12 months answered all three relevant questions:
gambling types selection, playing frequency and pathological gambling (PGSI)
–Among these participants, the play frequency variable was restructured to contain the following
categories: once total (N = 40), less than once per month (N = 31), once per month (N = 30), 2-3
times per month (N = 43), and weekly (N = 53).
–108 of the 197 people did not report gambling online.
–etc.
•Table 1 (next slide): Slope coefficients, Wald tests, standard errors and odds ratios.
•Text: In a hierarchical logistic regression with PGSI category as the outcome variable, playing
frequency was entered into the model first, followed by game variety, and online experience in the
third step. Results are presented in Table 1. It can be seen that online experience did not account
for PGSI category membership over and above the marginally significant effects of playing frequency
and game variety. Since chi-square tests revealed all three predictors to be related, we conducted
a logistic regression that included only game variety and online experience and considered the
effect of their interaction. Likelihood ratio tests following a Type II Sums of Squares ordering
revealed the effect of online gambling to not be significant (LR χ2(1) = .38, p = .53), while game
variety emerged again as a marginally significant predictor (LR χ2(1) = 3.58, p = .06). There was
also no significant interaction effect (LR χ2(1) = .55, p = .46).

Table 1. Final model of a hierarchical logistic regression in which playing frequency, game variety
and online gambling experience were entered in that order as predictors of pathological gambling
risk
Slope
S.E.
Odds ratio
Intercept
-2.06
0.49
0.13
Less than once a month (reference: only once)
-0.77
0.88
0.46
Once a month (reference: only once)
0.11
0.70
1.12
2-3 times per month (reference: only once)
0.89
0.60
2.44
Weekly (reference: only once)
1.07*
0.59
2.93
Game variety (reference: one)
0.72^
0.42
2.06
Online gambling experience (reference: no)
-0.08
0.43
0.92
* p = .06, ^ p = .08 (Wald z-test)

Clustered outcome variable


Zero-inflated (mixture) modelling
Reading: Baguley Ch 17
•Useful when the data has a high proportion of 0s (e.g., “Not at all”; “Completely disagree”;
“Never played”)
•Two stages:
–Logistic regression to estimate the effect of one or more predictors on the outcome variable being
“0” or “other”
–Generalised linear modelling (Poisson, negative binomial) to estimate the effect of one or more
predictors (not necessarily the same as in the logistic regression) on the values of the outcome
variable in the “other” category. If trying a number of random components (e.g., Poisson and
negative binomial), choose the model with Log likelihood closer to zero.

Student used more rarely than Welch

Examples based on success-slope Hypothesis 1 extension
•The Extensions section in Assignment 2 answers (Study Materials)shows that many people in the SS
dataset scored zero on the illusion of natural control measure.The stem-and-leaf plot for
PostNaturalIoC with its many zeroes is shown on the next slide.
•The SS data also includes responses (0-10) expressing degree of agreement with whether “It was all
chance” is an accurate description of how goals were achieved in the soccer-themed gambling game
(Lecture 1 Slide 13). The histogram on the next slide shows that, as might be expected, many people
agreed fully (10) with this statement. 10s can be considered “zeroes”.
•In Assignment 2, we also discovered a significant interaction between prior beliefs and
success-slope in predicting the illusion of natural control, except that the assumptions for the
associated ANOVA were not met. Here, we use success-slope and prior beliefs as predictors in a
zero-inflated model.
•Research questions: To what extent are responses to “It was all chance” (PostListCHANCE) and the
illusion of natural control (PostNaturalIoC) influenced by success-slope (SeqCond), prior beliefs
(PreDBC_Total) and their interaction? In relation to success-slope, Hypothesis 1 predicts:
–Higher illusion of control in the Ascending condition, compared to Descending
–Less agreement with the “It was all chance” statement in Asc. condition

Descriptive statistics
•PostNaturalIoC stem-and-leaf plot (many zeroes visible)
•The decimal point is at the |
•0 | 00000000000000000000000000000000000000000001111333344444444444
•0 | 555666666666666888888899999999
•1 | 00000000111113333333333444444444
•1 | 555555566666688888889999999
•2 | 00000111113333334444444444
•2 | 55555555555566666668889999999
•3 | 00000111133333334444
•3 | 55566666688888889999
•4 | 000000001133444444444
•4 | 5555566666888999
•5 | 000001114444
•5 | 555566666888999
•6 | 0000344
•6 | 555699
•7 | 00013344
•7 | 9
•8 | 33
HistCHANCE.png
PostListCHANCE histogram (also many zeroes – i.e., 10s)

Running the analysis
•zeroinfl function in the pscl package
•Outcome variable must be in integer form, with a clearly defined zero (see script for examples:
with “It was all chance”, for example, we reverse and categorise the responses)
•Random components in the generalized linear models can be Poisson, negative binomial, etc.
•Interpretation:
–“Count model coefficients (negbin with log link):” to be interpreted as with generalized linear
models. For NaturalIoC, we see a significant effect of prior beliefs, but not success-slope.
–“Zero-inflation model coefficients (binomial with logit link):” To be interpreted as with logistic
regression. For “It was all chance”, we see a significant effect of prior beliefs on whether a
person fully agreed (0) with the statement.

Reporting the analysis
•Table 1: Descriptive statistics table showing not only means and standard deviations, but also the
frequency of zeroes across levels  of the predictor(s)
•Table 2: Model fit results
•Text: A zero-inflated mixture model with a Poisson random component for the count model was fitted
using the pscl package in R Version 3.1.0. Prior beliefs, success-slope, and their interaction were
the predictors for both the count and zero-inflated parts of the model. As the model fit results in
Table 2 show, the only observed significant effect was that of prior beliefs on non-zero
illusion-of-control scores.

Table 2. Estimated effects in the count and logistic regression parts of the zero-inflated model.
The Descending condition is the reference category for success-slope.
Estimate
S.E.
Count model
Intercept
-0.31
0.43
U-shaped minus Descending
0.12
0.57
Ascending minus Descending
0.10
0.60
Flat minus Descending
0.13
0.55
Prior beliefs
0.02*
0.01
Prior beliefs: U-shaped minus Descending
0.004
0.01
Prior beliefs: Ascending minus Descending
0.01
0.01
Prior beliefs: U-shaped minus Descending
0.001
0.01
Logistic regression
Intercept
0.51
1.49
U-shaped minus Descending
1.03
1.79
etc.
* p = .01 (Wald z-test)

Multilevel modelling
Reading: Gelman & Hill Ch 11-13
•Variables manipulated or gathered to represent a theoretically meaningful range in the population
(e.g., full range of success-slope conditions) are modelled as “fixed” effects.
•Categorical variables potentially influencing the outcome variable but showing variability just in
the sample (e.g., the range of prior beliefs in the sample) are modelled as “random” effects.
•Random variables can affect the intercept, the slope, or both.
•The resultant analysis is a compromise between “complete pooling” (investigation of a manipulated
variable only) and “no pooling” (investigation of only sample-specific effects). Regression
coefficients for the sample-specific variable(s) are pulled towards their mean (completely pooled)
level. This is known as “shrinkage”. If a cluster within the sample is very small (e.g., if a
school has only two respondents in a survey where there are evident school clusters), the
coefficient for that group is pulled further towards the mean to adjust for the uncertainty arising
from the small number of people in the cluster.

Student used more rarely than Welch

Gelman & Hill Chapter 12 p257
Shrinkage towards completely pooled regression slope (dotted line) across clusters (Minnesota
counties) with different sample sizes: Counties (the random variable) affects the intercept, but
not the slope of the regression line.  More shrinkage is evident with smaller sample size (N = 2),
and the intercepts in counties with larger sample sizes are generally closer to the completely
pooled line anyway (examples).

Illustration of random effects on:
•The intercept (example: Gelman text p. 259)
•The slope (example: Gelman text p. 284)
•Intercept and slope (example: Gelman text p. 279)
−can be thought of as an interaction between the fixed and random variables
x-axis: fixed variable
y-axis: outcome variable
separate lines: random variable

Example based on success-slope Hypothesis 1
•In Assignment 2 (see Assignment 2 answers in Study Materials), we discovered a significant
interaction between prior beliefs and success-slope in predicting the illusion of natural control,
except that the assumptions for the associated ANOVA were not met. Here, we split prior beliefs
into “low” and “high” categories and model them as a random variable (i.e., a variable whose range
is not representative of the population the study is investigating). The outcome variable is the
illusion of natural control (PostNaturalIoC), and the fixed predictor is success-slope.
•The descriptive statistics obtained through describeBy suggest that a varying intercept, varying
slope model could be appropriate. Indeed, this is also suggested by our finding in Assignment 2 of
an interaction between success-slope and prior beliefs. in the terminology of the lmer function in
the lme4 package, the notation for a varying intercept, varying slope model is:
–Hyp1mm <- lmer(PostNaturalIoC ~ SeqCond + (1 + SeqCond|CatPreDBC_IOC)
•After summary(Hyp1mm) reveals a prefect correlation between slopes and intercepts, we switch to a
simpler varying intercept model:
•Hyp1mm <- lmer(PostNaturalIoC ~ SeqCond + (1|CatPreDBC_IOC)
•The effect of success-slope was found to be significant in this model, with the assumptions of
normality and homogeneity of variance met.

Reporting the analysis
•References to articles using lme4: http://lme4.r-forge.r-project.org/bib/lme4bib.html
•Description of category boundaries for the newly-created prior belief categories; obtainable
through:
–describeBy(x = SS$PreDBC_IOC, group = SS$CatPreDBC_IOC)
•Table 1: Table of descriptive statistics based  on describeBy
•The fitting algorithm (ML, REML, FML) needs to be mentioned. The default (used here) is REML.
•Table 2: Table showing coefficients and random effects as on next slide.
•Text: A multilevel varying intercept model was fitted using REML in the lme4 package in R Version
3.1.0. Prior belief category (low vs. high) was the random variable, while success-slope was a
fixed predictor. Model coefficients are shown in Table 2. A Wald Chi-square test showed the effect
of success-slope to be significant (Wald χ2 (3) = 9.79,  p=.02).

Table 2. Estimated fixed and random effects in the multilevel model. The Descending success-slope
condition serves as the reference category in fixed effects.
Coefficient: estimated group difference
Wald CI (95%)
Intercept
1.47 (low prior illusion); 3.26 (high prior illusion)
U-shaped minus Descending
0.49
-0.07-1.04
Ascending minus Descending
0.90
0.34-1.47
Flat minus Descending
0.43
-0.13-1.0
Random effect variance estimate: 0.89

Reading
•Navarro, D. J. (2014). Learning statistics with R: A tutorial for psychology students and other
beginners. Available online: http://health.adelaide.edu.au/psychology/ccs/teaching/lsr/. Chapter
12.
•
•Baguley, T. Serious Stats: A Guide to Advanced Statistics for the Behavioural Sciences. Palgrave
Macmillan: UK. Chapter 17 “Modelling discrete outcomes” (pdf in Study Materials/Readings).
•
•Gelman, A., & Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models.
Cambridge University Press: New York. Chapters 11-13 (pdf in Study Materials/Readings).