Introduction to Bayesian Data Analysis using R and WinBUGS Dr. Pablo E. Verde Department of Mathematics and Statistics Masaryk University Czech Republic April 2013 pabloemilio.verde@uni-duesseldorf.de Dr. Pablo E. Verde 1 Overview of the course Day 1 Lecture 1: Introduction to Bayesian Inference Lecture 2: Bayesian analysis for single parameter models Lecture 3: Prior distributions: univariate Dr. Pablo E. Verde 2 Day 2 Lecture 4: Bayesian analysis for multiple parameter models with R Lecture 5: An introduction to WinBUGS Lecture 6: Multivariate models with WinBUGS Day 3 Lecture 7: An introduction to MCMC computations Lecture 8: Bayesian regression with WinBUGS Lecture 9: Introduction to Hierarchical Statistical modeling Dr. Pablo E. Verde 3 Learning objectives Understanding of the potential role of Bayesian methods for making inference about real-world problems Insight into modern computations techniques used for Bayesian analysis Learning Bayesian statistical analysis with R and WinBUGS An interest in using Bayesian methods in your own field of work Dr. Pablo E. Verde 4 Style and emphasis Immediately applicable methods rather than latest theory Attention to real problems: case studies Implementation in R and WinBUGS (although not a full tutorial) Focus on statistical modeling rather than running code, checking convergence etc. Make more emphasis to the complementary aspects of Bayesian Statistics to Classical Statistics rather than one vs. the other Dr. Pablo E. Verde 5 Recommended bibliography The BUGS Book: A Practical Introduction to Bayesian Analysis. David Lunn; Chris Jackson; Nicky Best; Andrew Thomas; David Spiegelhalter. CRC Press, October 3, 2012. Bayesian Data Analysis (Second edition). Andrew Gelman, John Carlin, Hal Stern and Donald Rubin. 2004 Chapman & Hall/CRC. Bayesian Computation with R (Second edition). Jim Albert. 2009. Springer Verlag. An introduction of Bayesian data analysis with R and BUGS: a simple worked example. Verde, P.E. Estadistica (2010), 62, pp. 21-44 Dr. Pablo E. Verde 6 Lecture 1: Introduction to Modern Bayesian Inference Lecture 1: Introduction to Modern Bayesian Inference ”I shall not assume the truth of Bayes’ axiom (...) theorems which are useless for scientific purposes.” -Ronald A. Fisher (1935) The Design of Experiments, page 6. Dr. Pablo E. Verde 7 Lecture 1: Introduction to Modern Bayesian Inference How did it all start? In 1763, Reverend Thomas Bayes wrote: PROBLEM. Given the number of times in which an unknown event has happened and faild: Requiered the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named. In modern language, given r ∼ Binomial(θ, n), what is Pr(θ1 < θ < θ2|r, n) ? Dr. Pablo E. Verde 8 Lecture 1: Introduction to Modern Bayesian Inference Example: surgical procedure Suppose a hospital is considering a new high-risk operation Experience in other hospitals indicates that the risk θ for each patient is around 10 % It would be surprising to be less than 3% or more than 20% We can directly express uncertainty about the patient risk θ with a probability distribution Dr. Pablo E. Verde 9 Lecture 1: Introduction to Modern Bayesian Inference 0 10 20 30 40 50 02468 Mortality rate (in %) Density(mortalityrate) Probability that mortality risk is greater than 15% is Pr(θ > 0.15) = 0.17 Dr. Pablo E. Verde 10 Lecture 1: Introduction to Modern Bayesian Inference Why a direct probability distribution? 
Tells us what we want: what are plausible values for the parameter of interest? No P-values: just calculate relevant tail areas No confidence intervals: just report central area that contains 95% of distribution Easy to make predictions (see later) Fits naturally into decision analysis / risk analysis / cost-effectiveness analysis There is a procedure for adapting the distribution in the light of additional evidence: i.e. Bayes theorem allows us to learn from experience Dr. Pablo E. Verde 11 Lecture 1: Introduction to Modern Bayesian Inference What about disadvantages? Requires the specification of what we thought before new evidence is taken into account: the prior distribution Explicit allowance for quantitative subjective judgment in the analysis Analysis may be more complex than a traditional approach Computation may be more difficult Currently no established standards for Bayesian reporting Dr. Pablo E. Verde 12 Lecture 1: Introduction to Modern Bayesian Inference Why does Bayesian statistics is popular today? Bayesian methods optimally combine multiple sources of information in a common model The computational revolution produced by the rediscovery of Markov chain Monte Carlo (MCMC) techniques in statistics Free available software implementation of MCMC (e.g. WinBUGS, JAGS, STAN, large number of packages in R, etc.) As a result, we can routinely construct sophisticated statistical models that may reflect the complexity for phenomena of interest Dr. Pablo E. Verde 13 Lecture 1: Introduction to Modern Bayesian Inference Bayes theorem for observable quantities Let A and B be events; then it is provable from axioms of probability theory that P(A|B) = P(B|A)P(A) P(B) P(A) is the marginal probability of A, i.e., prior to taking account of the information in B P(A|B) is the conditional probability of A given B, i.e., posterior to taking account of the value of B P(B|A) is the conditional probability of B given A P(B) is the marginal probability of B Sometimes is useful to work with P(B) = P(A)P(B|A) + P(¯A)P(B|¯A), which is curiously called ”extending the conversation” (Lindley 2006, pag. 68) Dr. Pablo E. Verde 14 Lecture 1: Introduction to Modern Bayesian Inference Example: Problems with statistical significance (Simon, 1994) Suppose that a priory only 10% of clinical trials are truly effective treatments Assume each trial is carried out with a design with enough sample size such that α = 5% and power 1 − β = 80% Question: What is the chance that the treatment is true effective given a significant test results? p(H1|”significant results”)? Dr. Pablo E. Verde 15 Lecture 1: Introduction to Modern Bayesian Inference Let A be the event that H1 is true, then p(H1) = 0.1 Let B be the event that the statistical test is significant We want p(A|B) = p(H1|”significant results”) We have: p(B|A) = p(”significant results”|H1) = 1 − β = 0.8 We have: p(B|A) = p(”significant results”|H0) = α = 0.05 Now, Bayes theorem says p(H1|”significant results”) = (1 − β) × 0.1 (1 − β) × 0.1 + α × 0.9 = 0.8 × 0.1 0.8 × 0.1 + 0.05 × 0.9 = 0.64 Answer: This says that if truly effective treatments are relatively rare, then a ”statistically significant” results stands a good chance of being a false positive. Dr. Pablo E. Verde 16 Lecture 1: Introduction to Modern Bayesian Inference Bayesian inference for unknown quantities Makes fundamental distinction between: Observable quantities y, i.e., data. 
Unknown quantities θ, that can be statistical parameters, missing data, predicted values, mismeasured data, indicators of variable selected, etc. Technically, in the Bayesian framework parameters are treated as values of random variables. Differences with classical statistical inference: In Bayesian inference, we make probability statements about model parameters In the classical framework, parameters are fixed non-random quantities and the probability statements concern the data Dr. Pablo E. Verde 17 Lecture 1: Introduction to Modern Bayesian Inference Bayesian Inference Suppose that we have observed some data y We want to make inference about unknown quantities θ: model parameters, missing data, predicted values, mismeasured data, etc. The Bayesian analysis starts like a classical statistical analysis by specifying the sampling model: p(y|θ) this is the likelihood function. From a Bayesian point of view, θ is unknown so should have a probability distribution reflecting our uncertainty about it before seeing the data. We need to specify a prior distribution p(θ) Together they define a full probability model: p(y, θ) = p(y|θ)p(θ) Dr. Pablo E. Verde 18 Lecture 1: Introduction to Modern Bayesian Inference Then we use the Bayes theorem to obtain the conditional probability distribution for unobserved quantities of interest given the data: p(θ|y) = p(θ)p(y|θ) p(θ)p(y|θ)dθ ∝ p(θ)p(y|θ) This is the posterior distribution for θ, posterior ∝ likelihood × prior. Dr. Pablo E. Verde 19 Lecture 1: Introduction to Modern Bayesian Inference Inference with binary data Example: Inference on proportions using a discrete prior Suppose I have 3 coins in my pocket: 1. A biased coin: p(heads) = 0.25 2. A fair coin: p(heads) = 0.5 3. A biased coin: p(heads) = 0.75 I randomly select one coin, I flip it once and it comes a head. What is the probability that I have chosen coin number 3 ? Dr. Pablo E. Verde 20 Lecture 1: Introduction to Modern Bayesian Inference Inference with binary data 1. Let y = 1 the event that I observed a head 2. Let θ denote the probability of a head: θ ∈ (0.25, 0.5, 0.75) 3. Prior: p(θ = 0.25) = p(θ = 0.5) = p(θ = 0.75) = 1/3 4. Sampling distribution: y|θ ∼ Binomial(θ, 1), with likelihood p(y|θ) = θy (1 − θ)(1−y) . Dr. Pablo E. Verde 21 Lecture 1: Introduction to Modern Bayesian Inference If we observe a single positive response (y = 1), how is our belief revised? Coin θ Prior Likelihood Likelihood × prior Posterior p(θ) p(y = 1|θ) p(y = 1|θ)p(θ) p(θ|y = 1) 1 0.25 0.33 0.25 0.0825 0.167 2 0.50 0.33 0.50 0.1650 0.333 3 0.75 0.33 0.75 0.2475 0.500 1.0 1.50 0.495 1.0 So, observing a head on a single flip of the coin means that there is now a 50% probability that the chance of heads is 0.75 and only a 16.7% that the chance of heads is 0.25. Note that if we normalize the likelihood p(y = 1|θ) we have exactly the same results. Dr. Pablo E. Verde 22 Lecture 1: Introduction to Modern Bayesian Inference Example: Three coins by simulation We can approximate the results of the three coins example with the following computer simulation algorithm: 1. Generate a random index i∗ from (1, 2, 3) with prior probability 1/3 for each component 2. Evaluate the probability of y = 1 at θ∗ = θ(i∗) 3. Generate a Binomial y∗ with probability θ∗ 4. Repeat 1 to 3 a large number of times 5. Average and normalize the outcomes of y∗ Dr. Pablo E. Verde 23 Lecture 1: Introduction to Modern Bayesian Inference Example: Three coins by simulation in R > coin.pri <- rep(1/3, 3) # prior for coins 1, 2, 3. 
> theta <- c(0.25, 0.5, 0.75) # pr. heads under 1, 2, 3. > sim <- 100000 # number of simulations > y <- rep(0, 3*sim) > dim(y) <- c(sim, 3) > for( i in 1:sim) { ind <- sample(c(1,2,3), size = 1, prob = coin.pri, replace = TRUE) y.new <- rbinom(1, 1, prob = theta[ind]) y[i, ind] <- y.new } > # average each class and normalize > apply(y, 2, mean) / sum( apply(y, 2, mean)) [1] 0.1661 0.3306 0.5034 Dr. Pablo E. Verde 24 Lecture 1: Introduction to Modern Bayesian Inference Monte Carlo error: How accurate are the empirical estimation by simulation? Suppose that we obtain empirical mean ˆE(g(X)) and variance ˆV = ˆVar(g(X)) based on T simulations. Then since ˆE is a sample mean based on T independent samples, it has true sample variance V = Var(g(X))/T which may be estimated by ˆV /T Hence ˆE has estimated standard error ˆV /T, which is known as Monte Carlo error. > th.post <- apply(y, 2, mean)/sum( apply(y, 2, mean)) > # monte carlo error > V <- th.post * (1- th.post)/sim > sqrt(V) [1] 0.001177 0.001488 0.001581 > Dr. Pablo E. Verde 25 Lecture 1: Introduction to Modern Bayesian Inference Posterior predictive distributions The predictive posterior distribution for a new observation ynew is given by p(ynew |y) = p(ynew |y, θ)p(θ|y)dθ. Assuming that past and future observations are conditionally independent given θ, this simplify to p(ynew |y) = p(ynew |θ)p(θ|y)dθ. For the discrete case of θ, integrals are replaced by sums: p(ynew |y) = θi p(ynew |θi )p(θi |y) where the p(θi |y) can be thought of as ”posterior weights”. Dr. Pablo E. Verde 26 Lecture 1: Introduction to Modern Bayesian Inference Example: Three coins continue ... Suppose we want to predict the probability that in the next toss is a head. We have: p(ynew = 1|y = 1) = θi θi , p(θi |y = 1) = (0.25 × 0.167) + (0.50 × 0.333) + (0.75 × 0.5) = 0.5833 By using the previous simulation results in R we have: > sum(theta * apply(y, 2, mean)/sum( apply(y, 2, mean))) [1] 0.5843 > Dr. Pablo E. Verde 27 Lecture 1: Introduction to Modern Bayesian Inference Sequential learning Suppose we obtain data y1 and form the posterior p(θ|y1) and then we obtain further data y2. The posterior based on y1, y2 is given by: p(θ|y1, y2) ∝ p(y2|θ) × p(θ|y1). Today’s posterior is tomorrow’s prior! The resultant posterior is the same as if we have obtained the data y1, y2 together: p(θ|y1, y2) ∝ p(y1, y2|θ) × p(θ). Dr. Pablo E. Verde 28 Lecture 1: Introduction to Modern Bayesian Inference Example: Three coins continue ... Now suppose that after observing y1 = 1 we observe y2 = 1, how is our belief revised? Coin θ Prior Likelihood Likelihood × prior Posterior p(θ) p(y = 1|θ) p(y = 1|θ)p(θ) p(θ|y = 1) 1 0.25 0.167 0.25 0.042 0.071 2 0.50 0.333 0.50 0.167 0.286 3 0.75 0.500 0.75 0.375 0.644 1.0 1.50 0.583 1.0 After observing a second head, there is now a 64.4% probability that the chance of heads is 0.75 and only a 7.1% that the chance of heads is 0.25. Dr. Pablo E. Verde 29 Lecture 1: Introduction to Modern Bayesian Inference A bit of philosophy: Probability ”for” and probability ”of” The prior distribution p(θ), expresses our uncertainty about θ before seeing the data, that could be objective or subjective The Bayesian inference allows the combination of different types of probabilities Subjective probability implied a mental construct where probabilities are used to express our uncertainty. 
This is why we use a pedantic: ”probability for an event...” In the classical setting probabilities are defined in terms of long run frequencies and are interpreted as physical properties of systems. In this way we use: ”probability of ...” In general we follow Bayesian Statistical Modeling, which is dynamic view of data analysis, which includes model building and model checking as well as statistical inference Dr. Pablo E. Verde 30 Lecture 1: Introduction to Modern Bayesian Inference Summary Bayesian statistics: Formal combination of external information with data model by the Bayesian rule Uses of subjective probability to make inductive inference Straightforward inference by manipulating probabilities No need of repeated sampling ideas Specification of prior or external evidence may be difficult or controversial Prediction and sequential learning is straightforward. Dr. Pablo E. Verde 31 Lecture 2: Bayesian inference for single parameter Lecture 2: Bayesian Inference for Single Parameter Models Dr. Pablo E. Verde 32 Lecture 2: Bayesian inference for single parameter Summary 1. Conjugate Analysis for: Binomial model Normal model known variance Normal model known mean Poisson model 2. Using R for: Graphical visualization of posteriors Direct Monte Carlo Simulation Methods Calculations of posteriors for functional parameters Dr. Pablo E. Verde 33 Lecture 2: Bayesian inference for single parameter Inference of proportions using a continuous prior Suppose we observe r positive responses out of n patients. Assuming patients are independent, with common unknown response rate θ, leads to a binomial likelihood p(r|n, θ) = n r θr (1 − θ)n−r ∝ θr (1 − θ)n−r We consider the response rate θ to be a continuous parameter, i.e., we need to give a continuous prior distribution. Suppose that before the data is observed we believe all values for θ are equally likely (very unrealistic!), then we give θ ∼ Unif(0, 1), i.e., p(θ) = 1. Posterior is then p(θ|r, n) ∝ θr (1 − θ)n−r × 1 this has a form of the kernel of a Beta(r + 1, n − r + 1). Dr. Pablo E. Verde 34 Lecture 2: Bayesian inference for single parameter To represent external evidence that some response rates are more plausible than others, it is mathematical convenient to use a Beta(a, b) prior distribution for θ p(θ) ∝ θa−1 (1 − θ)b−1 Combining this with the binomial likelihood gives a posterior distribution p(θ|r, n) ∝ p(r|θ, n)p(θ) ∝ θr (1 − θ)n−r = θr+a−1 (1 − θ)n−r+b−1 ∝ Beta(r + a, n − r + b). Dr. Pablo E. Verde 35 Lecture 2: Bayesian inference for single parameter When the prior and posterior come from the same family of distributions the prior is said to be conjugate to the likelihood A Beta(a, b) distribution has E(θ) = a/(a + b) var(θ) = ab/[(a + b)2 (a + b + 1)] Hence the posterior mean is E(θ|r, n) = (r + a)/(n + a + b) a and b are equivalent to observing a prior a − 1 successes in a + b − 2 trials, then it can be elicited. With fixed a and b, as r and n increase, E(θ|r, n) → r/n (the MLE). A Beta(1,1) is equivalent to Uniform(0,1). Dr. Pablo E. Verde 36 Lecture 2: Bayesian inference for single parameter Example: Shape of the Beta density function The Beta(a, b) prior is a flexible distribution. 
We can explore it shape by using the following lines in R: # Beta flexibility par(mfrow=c(2,2)) curve(dbeta(x, 1, 1), from =0, to=1, lty = 1, lwd=2, main="Beta(1, 1)", xlab=expression(theta)) curve(dbeta(x, 1/2, 1/2), from = 0, to = 1, lty = 2, lwd=2, col="red", main="Beta(1/2, 1/2)", xlab=expression(theta)) curve(dbeta(x, 2, 5), from = 0, to = 1, lty = 2, lwd=2, col="blue", main="Beta(2, 5)", xlab=expression(theta)) curve(dbeta(x, 2, 2), from = 0, to = 1, lty = 2, lwd=2, col="green", main="Beta(2, 2)", xlab=expression(theta)) par(mfrow=c(1,1)) Dr. Pablo E. Verde 37 Lecture 2: Bayesian inference for single parameter 0.0 0.2 0.4 0.6 0.8 1.0 0.60.81.01.21.4 Beta(1, 1) θ dbeta(x,1,1) 0.0 0.2 0.4 0.6 0.8 1.0 1.01.52.02.53.0 Beta(1/2, 1/2) θ dbeta(x,1/2,1/2) 0.0 0.2 0.4 0.6 0.8 1.0 0.00.51.01.52.02.5 Beta(2, 5) θ dbeta(x,2,5) 0.0 0.2 0.4 0.6 0.8 1.0 0.00.51.01.5 Beta(2, 2) θ dbeta(x,2,2) Dr. Pablo E. Verde 38 Lecture 2: Bayesian inference for single parameter Example: drug investigation Consider an early investigation of a new drug Experience with similar compounds has suggested that response rates between 0.2 and 0.6 could be feasible Interpret this as a distribution with mean=0.4 and standard deviation 0.1 A Beta(9.2, 13.8) distribution has these properties Suppose we treat n = 20 volunteers with the compound and observe r = 15 positive responses. Then we update the prior and the posterior is Beta(15 + 9.2, 5 + 13.8) Dr. Pablo E. Verde 39 Lecture 2: Bayesian inference for single parameter R script to perform the analysis: > par(mfrow=c(3,1)) # graphical output: 3 by 1 panels # draw a curve for a beta density > curve(dbeta(x,9.2, 13.8),from=0, to =1, xlab= "prob of success",main="Prior") # draw a curve for a binomial density > curve(dbinom(15, 20,x),from=0, to =1, col="blue", xlab="prob of sucess", main="Likelihood") # draw the posterior > curve(dbeta(x, 24.2, 18.8),from=0, to =1, col="red", xlab="prob of sucess", main="Posterior") > par(mfrow=c(1,1)) Dr. Pablo E. Verde 40 Lecture 2: Bayesian inference for single parameter 0.0 0.2 0.4 0.6 0.8 1.0 01234 Prior prob of success dbeta(x,9.2,13.8) 0.0 0.2 0.4 0.6 0.8 1.0 0.000.100.20 Likelihood prob of sucess dbinom(15,20,x) 0.0 0.2 0.4 0.6 0.8 1.0 024 Posterior prob of sucess dbeta(x,24.2,18.8) Dr. Pablo E. Verde 41 Lecture 2: Bayesian inference for single parameter Monte Carlo Simulation If we are able to sample values θ∗ from the posterior p(θ|r, n), then we can extend the inferential scope. For example, one important application of this simulation process is when we need to estimate the posterior of a functional parameter, e.g., the odds ratio: φ = f (θ) = θ 1 − θ For simple models we can directly simulate from the posterior density and use this values to empirically approximate the posterior quantiles. Dr. Pablo E. Verde 42 Lecture 2: Bayesian inference for single parameter Example: Posterior for the odds ratio The posterior of φ = θ 1 − θ is calculated in R as follows: > theta.star <- rbeta(20000, 24.2, 18.8) > odds.star <- theta.star/(1-theta.star) > quantile(odds.star, prob = c(0.05, 0.5, 0.75, 0.95)) 5% 50% 75% 95% 0.7730276 1.2852558 1.5784318 2.1592678 > hist(odds.star, breaks=100, xlab="odds ratio", freq=FALSE, xlim=c(0, 4)) > lines(density(odds.star), lwd =2, lty = 1, col ="red") Dr. Pablo E. Verde 43 Lecture 2: Bayesian inference for single parameter Histogram of odds.star odds ratio Density 0 1 2 3 4 0.00.20.40.60.81.0 Dr. Pablo E. 
Verde 44 Lecture 2: Bayesian inference for single parameter Making predictions for binary data Suppose that we want to make predictions from the model The predictive posterior distribution for rnew (the number of successes in m trials) follows a Beta-Binomial distribution with density: p(rnew ) = Γ(a + b) Γ(a)Γ(b) m rnew Γ(a + rnew )Γ(b + m − rnew ) Γ(a + b + m) In R: # Beta-binomial density betabin <- function(r,a,b,m) { gamma(a+b)/(gamma(a)*gamma(b)) * choose(m,r) * gamma(a+r)*gamma(b+m-r)/gamma(a+b+m) } Dr. Pablo E. Verde 45 Lecture 2: Bayesian inference for single parameter Example: drug investigation continue ... Suppose that we are interested in the predictive posterior of the number of success in the next 40 trials: Using the betabin function in R we have: # Beta-binomial distribution of the number of successes x in the next # 40 trials with mean 22.5 and standard deviation 4.3 x <- 0:40; px <- betabin(0:40, a=24.2,b=18.8,m=40) plot(x, px, type="h", xlab="number of successes out of 40", main="Predictive Posterior") Dr. Pablo E. Verde 46 Lecture 2: Bayesian inference for single parameter 0 10 20 30 40 0.000.020.040.060.08 Predictive Posterior number of successes out of 40 px Dr. Pablo E. Verde 47 Lecture 2: Bayesian inference for single parameter Example: drug investigation continue ... Suppose that we would consider continuing a development program if the drug managed to achieve at least a further 25 successes out of these 40 future trials In R we can calculate: # probability of at least 25 successes out of 40 further trials > sum(betabin(25:40,24.2,18.8,m=40)) [1] 0.3290134 > Dr. Pablo E. Verde 48 Lecture 2: Bayesian inference for single parameter Simulations for predictive data Instead of using the analytical approximation based on the Beta-Binomial distribution, we can simulate predicted observations in the following way: We simulate θ∗ 1, . . . , θ∗ B from the posterior Beta(24.2, 18.8) Then we simulate y∗ from a binomial distribution with rate θ∗ and n = 40. We tabulate and normalize predictive frequencies. Dr. Pablo E. Verde 49 Lecture 2: Bayesian inference for single parameter Simulations for predictive data In R notation > # Simulation of predictive data with a Beta-Binomial model ... > theta.star <- rbeta(10000, 24.2, 18.8) > y <- rbinom(10000, 40, theta.star) > freq.y <- table(y) > ys <- as.integer(names(freq.y)) > predprob <- freq.y/sum(freq.y) > plot(ys, predprob, type = "h", xlab = "y", + ylab = "Predictive Probability") > > # Probability of at least 25 out of future 40 trials. > sum(predprob[ys>24]) [1] 0.3244 Dr. Pablo E. Verde 50 Lecture 2: Bayesian inference for single parameter 10 15 20 25 30 35 0.000.020.040.060.08 y PredictiveProbability Dr. Pablo E. Verde 51 Lecture 2: Bayesian inference for single parameter Bayesian analysis for Normal data Known variance, unknown mean Suppose we have a sample of Normal data yi ∼ N(µ, σ2 ), i = 1, . . . , n where σ2 is known and µ is unknown. The conjugate prior of µ is µ ∼ N(µ0, τ2 ). It is convenient to write τ2 as σ2/n0, where n0 represents the ”effective number of observations” in the prior distribution. Dr. Pablo E. Verde 52 Lecture 2: Bayesian inference for single parameter Then the posterior distribution for µ is given by p(µ|y) = N n0µ0 + n¯y n0 + n , σ2 n0 + n Prior variance is based on an ”implicit” sample size n0 As n0 tends to 0, the distribution becomes ”flatter” Posterior mean is a weighted average of the prior mean µ0 and parameter estimate ¯x , weighted by their precisions, i.e., relative sample sizes. 
Posterior variance is based on an implicit sample size n0 and the data sample size n. Dr. Pablo E. Verde 53 Lecture 2: Bayesian inference for single parameter Alternative expressions for the posterior mean µn are : µn = wµ0 + (1 − w)¯y where w = n0 n + n0 , µn = µ0 + (¯y − µ0) n n + n0 , µn = ¯y − (¯y − µ0) n0 n + n0 . That shows ”shrinkage” towards prior mean. Dr. Pablo E. Verde 54 Lecture 2: Bayesian inference for single parameter Prediction Denoting the posterior mean and variance as µn and σ2 n = σ2/(n0 + n), the predictive posterior distribution for a new observation y∗ is p(y∗ |y) = p(y∗ |y, µ)p(µ|y)dµ which is generally equal to p(y∗ |y) = p(y∗ |µ)p(µ|y)dµ which can be shown to give p(y∗ |y) ∼ N(µn, σ2 n + σ2 ) The predictive posterior distribution is centered around the posterior mean µn with variance equal to sum of the posterior variance of µ plus the data variance. Dr. Pablo E. Verde 55 Lecture 2: Bayesian inference for single parameter Bayesian inference for Normal data Unknown variance, know mean Suppose we have sample of Normal data yi ∼ N(µ, σ2 ), i = 1, . . . , n, where µ is known and σ2 is unknown. It is convenient to change parameterization to the precision w = 1/σ2. The conjugate prior for w is then w ∼ Gamma(α, β), where p(w) ∝ wα−1 exp(−βw). Then σ2 is then said to have an inverse-gamma distribution. Dr. Pablo E. Verde 56 Lecture 2: Bayesian inference for single parameter The posterior distribution for w takes the form p(w|µ, y) ∝ wα−1 exp(−βw) × w n 2 exp − w 2 n i=1 (yi − µ)2 . Collecting terms gives p(w|µ, y) = Gamma α + n 2 , β + 1 2 n i=1 (yi − µ)2 . Dr. Pablo E. Verde 57 Lecture 2: Bayesian inference for single parameter Clearly we can think of α = n0/2, where n0 is the ”effective prior sample size”. Since n i=1(yi − µ)2/n estimate σ2 = 1/w, then we interpret 2β as representing n0 × prior estimate ofσ2 0. Alternative, we can write our conjugate prior as w ∼ Gamma n0 2 , n0σ2 0 2 , which can be seen as a scale χ2 n0 distribution. This is useful when assessing prior distributions for sample variances. Dr. Pablo E. Verde 58 Lecture 2: Bayesian inference for single parameter Bayesian inference with count data Suppose we have an independent sample of counts y1 . . . , yn which can be assumed to follow a Poisson distribution with unknown mean µti , where µ is the rate per unit t : p(y|µ) = i (µti )yi exp(−µti ) yi ! The kernel of the Poisson likelihood as a function of µ has the same form as a Gamma(a, b) prior for µ: p(µ) ∝ µa−1 exp(−bµ). Dr. Pablo E. Verde 59 Lecture 2: Bayesian inference for single parameter This implies the following posterior p(µ|y) ∝ p(µ)p(y|µ) ∝ µ(a−1) e−bµ n i=1 e−µti µy i ∝ µa+Yn−1 e−(b+Tn)µ = Gamma(a + Yn, b + Tn). where Yn = n i=1 yi and Tn = n i=1 ti . Dr. Pablo E. Verde 60 Lecture 2: Bayesian inference for single parameter The posterior mean is: E(µ|y) = a + Yn b + Tn = Yn Tn n n + b + a b 1 − n n + b . The posterior mean is a compromise between the prior mean a/b and the MLE Yn Tn . Thus b can be interpreted as an effective exposure and a/b as a prior estimate of the Poisson mean. Dr. Pablo E. Verde 61 Lecture 2: Bayesian inference for single parameter Example: Annual number of cases of haemolytic uraemic syndrome Henderson and Matthews, (1993) Annual numbers of cases were available from two a specialist center at Birmingham, from 1970 to 1988. For analysis we consider observed values from the first decade. 
year       1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
cases, xi     1    5    3    2    1    0    0    2    1    1
Because we are interested in counts of disease over time, a simple model is a Poisson process:
yi | µ ∼ Poisson(µ), i = 1, . . . , 10.
Dr. Pablo E. Verde 62 Lecture 2: Bayesian inference for single parameter
Similar data collected by another centre gave a mean rate of 2.3 with s.d. 2.79, i.e.,
prior mean a/b = 2.3, prior sd √a / b = 2.79.
Solving for a and b, this information can be translated into a Gamma(0.679, 0.295) prior distribution. Then
p(µ|y) = Gamma(Σi xi + 0.679, 10 + 0.295) = Gamma(16.679, 10.295)
E(µ|y) = 16.679/10.295 = 1.620;  sd(µ|y) = √16.679/10.295 = 0.396
Dr. Pablo E. Verde 63 Lecture 2: Bayesian inference for single parameter
[Figure: prior Gamma(0.679, 0.295) density and posterior Gamma(16.679, 10.295) density for µ]
Dr. Pablo E. Verde 64 Lecture 2: Bayesian inference for single parameter
The predictive posterior distribution for a new count y∗ is
y∗ | y ∼ Negative-Binomial(a + n¯y, b + n)
If we ask what is the probability of observing 7 or more cases in the next year, we have
> sum(dnbinom(6:17, prob=10.295/(1+10.295), size=16.679))
[1] 0.0096
>
Note: the observed values for the following years were 7, 11, 4, 7, 10, ..., indicating a possible structural break.
We can visualize the predictive density by
p <- 10.295/(1+10.295)
plot(0:10, dnbinom(0:10, prob=p, size=16.679), type="h", lwd=2, col="red", xlab="counts")
Dr. Pablo E. Verde 65 Lecture 2: Bayesian inference for single parameter
[Figure: predictive Negative-Binomial probabilities for counts 0 to 10]
Dr. Pablo E. Verde 66 Lecture 2: Bayesian inference for single parameter
Some comments
For all these examples, we see that
the posterior mean is a compromise between the prior mean and the MLE
the posterior s.d. is less than each of the prior s.d. and the s.e. (MLE)
As n → ∞,
the posterior mean → the MLE
the posterior s.d. → the s.e. (MLE)
the posterior does not depend on the prior.
These observations are generally true when the MLE exists and is unique.
Dr. Pablo E. Verde 67 Lecture 2: Bayesian inference for single parameter
Priors
When the posterior is in the same family as the prior we have what is known as conjugacy. Conjugate models have the advantage that:
Prior parameters can usually be interpreted as an implied prior sample size: n0
They involve simple mathematical models; examples include:
Distribution of y    Parameter                  Conjugate prior
Binomial             Prob. of success           Beta
Poisson              Mean                       Gamma
Exponential          Reciprocal of mean         Gamma
Normal               Mean (variance known)      Normal
Normal               Variance (mean known)      Inverse Gamma
Unfortunately conjugate priors only exist for a small catalog of likelihoods.
Dr. Pablo E. Verde 68 Lecture 2: Bayesian inference for single parameter
Practical Exercise: Conjugate inference for a binomial experiment
Drug investigation example from this Lecture
We treat n = 20 volunteers with a new compound and observe r = 15 positive responses. We use as prior θ ∼ Beta(9.2, 13.8):
1. What is the posterior mean and median for the response rate?
2. What are the 2.5th and 97.5th percentiles of the posterior?
3. What is the probability that the true response rate is greater than 0.6?
4. How is this value affected if a uniform prior is adopted?
5. Using the original Beta(9.2, 13.8) prior, suppose 40 more patients were entered into the study. What is the chance that at least 25 of them respond positively?
Dr. Pablo E.
Verde 69 Lecture 2: Bayesian inference for single parameter Solution # 1) Posteriors mean and median # The posterior mean is: (r+a) / (n + a+b) > (15 + 9.2) /( 20 + 9.2 + 13.8) [1] 0.5627907 # the posterior median is > qbeta(0.5, 24.2, 18.8 ) [1] 0.5637731 > # 2) Posterior percentiles: > qbeta(0.025, 24.2, 18.8 ) # 2.5% [1] 0.4142266 > qbeta(0.975, 24.2, 18.8 ) # 97.5% [1] 0.7058181 > Dr. Pablo E. Verde 70 Lecture 2: Bayesian inference for single parameter # 3) Posterior prob that the rate is greter than 0.6 > 1 - pbeta(0.6, 24.2, 18.8 ) [1] 0.3156323 # 4) Uniform prior is beta(1,1) then the posterior is beta( r+1, n-r+1) > 1 - pbeta(0.6, 15+1 ,20-15+1) [1] 0.9042598 # 5) Posterior probability of at least 25 successes out of 40 further trials > sum(betabin(25:40,24.2,18.8,m=40)) [1] 0.3290134 Dr. Pablo E. Verde 71 Lecture 3: Prior distributions Lecture 3: Priors Distributions Dr. Pablo E. Verde 72 Lecture 3: Prior distributions Summary Misunderstandings about prior distributions Non-informative Priors and Jeffreys invariance priors Sensitivity analysis and making priors predictions Adjustment of priors based on historical data and judgement Mixture of priors Dr. Pablo E. Verde 73 Lecture 3: Prior distributions Misunderstandings about prior distributions It is worth pointing out some misunderstanding regarding prior distributions: The name prior suggests a temporal relationship, however, this is misleading. The prior distribution models the uncertainty given by the external evidence. Cox (1999) The prior is not necessarily unique! In a recent article Lambert et. al. (2005) analyze the use of 13 different priors for the between study variance parameter in random-effects meta-analysis. There is no such thing as the‘correct’ prior. Bayesian analysis is regarded as transforming prior into posterior opinion, rather than producing ’the’ posterior distribution. The prior may not be completely specified. In Empirical Bayes inference priors have unknown parameters that are estimated from the data. Dr. Pablo E. Verde 74 Lecture 3: Prior distributions Priors can be overparametrized. Sometimes we intentionally overparametrized the priors in order to accelerate convergence of simulation methods, see Gelman, Carlin, Stern and Rubin (2004) Chapter 6. Inference may rely only on priors. There are situations where no further data are available to combine with our priors or there is no intention to update the priors. This is the typical case of risk analysis, sample size determination in experiments, simulation of complex process , etc. In these analytical scenarios priors are usually used to simulate hypothetical data and we refer to that prior predictive analysis. Finally, priors are not necessarily important! In many scientific applications, as the amount of data increases, the prior is overwhelmed by the likelihood and the influence of the prior disappears, see Box and Tiao (1973) (pag. 20-25). Dr. Pablo E. Verde 75 Lecture 3: Prior distributions Non-informative Priors: Classical Bayesian Perspective We may be interested to introduce an initial state of ”ignorance” in our Bayesian analysis. But representing ignorance raises formidable difficulties! There has been a long and complex search for various ”non-informative”, ”reference” or ”objective” priors during the last century. A sort of ”off-the-shelf objective prior” that can be applied in all circumstances. The synthesis of this search is that those magic priors do not exists, although useful guidance exists (Berger, 2006). Dr. Pablo E. 
Verde 76 Lecture 3: Prior distributions Problems with uniform priors for continuous parameters Example: uniform prior on proportions Let θ be the chance that a bias coin comes down heads, we assume θ ∼ Uniform(0, 1). Let φ = θ2 the chance that it coming down heads in both of the next 2 throws. Now, the density of φ is p(φ) = 1 2 √ φ , which corresponds to a Beta(0.5,1) distribution and is certainly not uniform! Dr. Pablo E. Verde 77 Lecture 3: Prior distributions Jeffreys’ invariance priors Consider a 1-to-1 transformation of θ : φ = g(θ) Transformation of variables: prior p(θ) is equivalent to prior on φ of p(φ) = p(θ) | dθ dφ | Jeffreys proposed defining a non-informative prior for θ as p(θ) ∝ I(θ)1/2 where I(θ) is Fisher information for θ I(θ) = −Ex|θ ∂2 log p(X|θ) ∂θ2 = Ex|θ ∂ log p(X|θ) ∂θ 2 . Dr. Pablo E. Verde 78 Lecture 3: Prior distributions Non-informative priors for proportions Data model is Binomial: we consider r successes from n trials r|θ ∼ Binomial(θ, n) we have log p(x|θ) = r log(θ) + (n − r) log(1 − θ) + C then I(θ) = n θ(1 − θ) . So the Jeffreys’ prior is p(θ) ∝ (θ(1 − θ))−1/2 , which is a Beta(1/2, 1/2). Dr. Pablo E. Verde 79 Lecture 3: Prior distributions Non-informative priors for location parameters A location parameters θ is such that p(θ|y) is a function of (y − θ) and so the distribution of (y − θ) is independent of θ. Example: data model is Normal with unknown mean θ and known variance v x1, x2, . . . , xn|θ ∼ Normal(θ, v) then we have log p(x|θ) = − (xi − θ)2 2v + C with I(θ) = n v . So the Jeffreys’ prior is p(θ) ∝ 1 which is the Uniform distribution for θ. Dr. Pablo E. Verde 80 Lecture 3: Prior distributions Non-informative priors for scale parameters A scale parameters θ is such that p(y|θ) is a function of 1/θf (y/θ) and so the distribution of (y/θ) is independent of θ. Example: data model is Normal with known mean m and unknown variance θ x1, x2, . . . , xn|θ ∼ Normal(m, θ) then we have log p(x|θ) = −n/2 log(θ) − s 2θ , where s = (xi − m)2, then I(θ) = n 2θ2 . So the Jeffreys’ prior on variance is p(θ) ∝ 1 θ This Jeffreys’ improper prior is approximated by a Gamma( , ) with → 0. Dr. Pablo E. Verde 81 Lecture 3: Prior distributions Priors for counts and rates Data model is Poisson: x|θ ∼ Poisson(θ), then we have log p(x|θ) = −θ + x log θ + C with I(θ) = 1/θ. So the Jeffreys’ prior is p(θ) ∝ θ−1/2 . This improper prior is approximated by a Gamma distribution with α = 1/2 and β → 0. Dr. Pablo E. Verde 82 Lecture 3: Prior distributions Comments Jeffreys’ rule They are invariant, whatever the scale we choose to measure the unknown parameter, the same prior results when the scale is transformed to any particular scale Some inconsistencies associated with Jeffreys’ priors have been discussed: Applying this rule to the normal case with both mean and variance parameters unknown does not lead to the same prior as applying separately the rule for the mean and the variance and assuming a priori independence between these parameters. Although Jeffreys’ rule is suggestive, it cannot be applied blindly. It should be thought of as guideline to consider particularly if there is no other obvious way of finding a prior distribution. Dr. Pablo E. Verde 83 Lecture 3: Prior distributions Predictive Prior Analysis In practice it is not necessary to adopt a full Bayesian approach. Sometimes is very useful to use Bayesian methods for some analyzes and classical approaches for others. 
Example: predictive distribution of the power in sample size calculations
A randomized trial is planned with n patients in each of two arms.
The response within each treatment arm is assumed to have between-patient standard deviation σ.
The treatment estimate ˆθ = ¯y1 − ¯y2 is approximately distributed as Normal(θ, 2σ²/n).
A trial designed to have two-sided Type I error α and Type II error β in detecting a true difference of θ in mean response between the groups will require a sample size per group of
n = (2σ²/θ²) (z1−β + z1−α/2)²,
Dr. Pablo E. Verde 84 Lecture 3: Prior distributions
Alternatively, for fixed n, the power of this experiment is
Power = Φ( √(nθ²/(2σ²)) − z1−α/2 ).
If we assume θ/σ = 0.5, α = 0.05 and β = 0.2, we have z1−β = 0.84 and z1−α/2 = 1.96; then the power of the trial is 80% and n = 63 in each trial arm.
However, we accept uncertainty about θ and σ and we wish to include this feature in the sample size and power calculations. We assume from previous studies that it is reasonable that θ ∼ N(0.5, 0.1²) and σ ∼ N(1, 0.3²)I(0, ∞).
Dr. Pablo E. Verde 85 Lecture 3: Prior distributions
Then
1. Simulate values θ∗ ∼ N(0.5, 0.1²) and σ∗ ∼ N(1, 0.3²) (subject to the constraint that σ is positive).
2. Substitute them in the formulae and generate n∗ and Power∗.
3. Use the histograms of n∗ and Power∗ as their corresponding predictive distributions.
4. In R we have
## predictive prior distribution for sample size and power
set.seed(123)
theta <- rnorm(10000, 0.5, 0.1)
sigma <- rnorm(10000, 1, 0.3)
sigma <- ifelse(sigma <0, -1*sigma, sigma)
n <- 2*sigma^2 /(theta^2)*(0.84 + 1.96)^2
pow <- pnorm( sqrt( 63/2) * theta /sigma - 1.96)
par(mfrow=c(1,2))
hist(n, xlim = c(0, 400), breaks=50)
hist(pow)
par(mfrow=c(1,1))
Dr. Pablo E. Verde 86 Lecture 3: Prior distributions
[Figure: histograms of the predictive distributions of the sample size n and of the power]
Dr. Pablo E. Verde 87 Lecture 3: Prior distributions
> round(quantile(n),2)
0% 25% 50% 75% 100%
0.00 37.09 62.03 99.45 1334.07
> round(quantile(pow),2)
0% 25% 50% 75% 100%
0.09 0.61 0.81 0.95 1.00
> sum(pow<0.7)/10000
[1] 0.3645
>
It is clear that there is huge uncertainty about the appropriate sample size. For n = 63 the median power is 81%, but a trial of 63 patients per group could be seriously underpowered: there is a 37% chance that the power is less than 70%.
Dr. Pablo E. Verde 88 Lecture 3: Prior distributions
Mixture of priors for proportions
We may want to express a more complex prior opinion that cannot be encapsulated by a single beta distribution.
A prior which is a mixture of beta distributions:
p(θ) = q p1(θ) + (1 − q) p2(θ)
where pi = Beta(ai , bi ). Now if we observe r successes out of n trials, the posterior is
p(θ|r, n) = q∗ p1(θ|r, n) + (1 − q∗) p2(θ|r, n)
where
pi (θ|r, n) ∝ pi (θ) p(r|θ, n)
q∗ = q p1(r|n) / [ q p1(r|n) + (1 − q) p2(r|n) ]
Dr. Pablo E. Verde 89 Lecture 3: Prior distributions
pi (r|n) is the beta-binomial predictive probability of r successes in n trials assuming θ has distribution Beta(ai , bi ).
The posterior is a mixture of beta posteriors, with mixture weights adapted to whichever prior provides the best prediction for the observed data.
In R:
# mixture of betas
mixbeta <- function(x,r,a1,b1,a2,b2,q,n) {
qstar <- q*betabin(r,a1,b1,n)/ (q*betabin(r,a1,b1,n)+(1-q)*betabin(r,a2,b2,n))
p1 <- dbeta(x,a1+r,n-r+b1)
p2 <- dbeta(x,a2+r,n-r+b2)
posterior <- qstar*p1 + (1-qstar)*p2
}
Dr. Pablo E. Verde 90 Lecture 3: Prior distributions
Example: drug investigation continue ...
We want to combine: an informative Beta(9.2, 13.8) a non-informative Beta(1, 1) we give 80 % of prior weight to the informative prior Suppose we treat n = 20 volunteers with the compound and observe r = 15 positive responses. Dr. Pablo E. Verde 91 Lecture 3: Prior distributions We can visualize this analysis in R as follows: par(mfrow=c(2,1)) # informative beta prior curve(dbeta(x,9.2,13.8),from=0, to=1,col="red", xlab="probability of response",main="priors") # mixture beta prior with 80% informative and 20% flat curve(0.8*dbeta(x, 9.2, 13.8)+0.2*dbeta(x, 1, 1),from=0, to=1,col="blue",add=T) # beta posterior curve(dbeta(x,24.2,18.8),from=0,to=1,col="red", xlab="probability of response", main="posteriors") # posterior from a mixture prior curve(mixbeta(x, r=15, a1=9.2, b1=13.8, a2=1, b2=1, q=.8, 20), from=0,to=1, col="blue",add=T) par(mfrow=c(1,1)) Dr. Pablo E. Verde 92 Lecture 3: Prior distributions 0.0 0.2 0.4 0.6 0.8 1.0 01234 priors probability of response dbeta(x,9.2,13.8) 0.0 0.2 0.4 0.6 0.8 1.0 012345 posteriors probability of response dbeta(x,24.2,18.8) Dr. Pablo E. Verde 93 Lecture 3: Prior distributions Further topics on priors: Adjustment of priors based on historical data and judgement 1. Power priors (Ibrahim and Shen, 2000) 2. Bias modelling Hirearchical priors for large dimensional problems (e.g regression with more predictors than observations) Summary: The need for priors distributions should not be an embarrassment It is reasonamble that the prior should influence the analysis, as long as the influence is recognized and justified Importance of transparency and sensitivity analysis Dr. Pablo E. Verde 94 Lecture 3: Prior distributions Practical Exercise: Conjugate mixture priors inference for a binomial experiment Suppose that most drugs (95%) are assumed to come from the stated Beta(9.2, 13.8) prior, but there is a small chance that the drug might be a ”winner”. Winners are assumed to have a prior distribution with mean 0.8 and standard deviation 0.1. Dr. Pablo E. Verde 95 Lecture 3: Prior distributions 1. What Beta distribution might represent the ”winners” prior? ( Hint: Use the formulas of the mean and the variance for a Beta(a, b) distribution from lecture 2. This can be re-arranged to a = m(m(1 − m)/v − 1) and b = (1 − m)(m(1 − m)/v − 1), where m and v are the mean and the variance of a Beta(a,b) distribution). 2. Enter this mixture prior model into the R code for mixtures and plot it. 3. Calculate the posterior mixture model based on 15 success out of 20 trials and plot it. 4. What is the chance that the drug is a winner? Dr. Pablo E. Verde 96 Lecture 3: Prior distributions Solution # 1) # to workout the a and b for the winner drug prior use the realtionship # of the mean and variance of the beta distribution and its parameters > m <- 0.8 > v <- 0.1^2 > a <- m*(m*(1-m)/v -1) > a [1] 12 > b <- (1-m)*(m*(1-m)/v -1) > b [1] 3 # 2) plot for the mixture prior curve(dbeta(x, 12, 3)*0.05 + dbeta(x, 9.2, 13.8)*0.95, from = 0, to =1, main = "prior", xlab = "rate") Dr. Pablo E. 
Verde 97 Lecture 3: Prior distributions # 3) plot for the mixture posterior with 15 success out of 20 trials # mixture posterior of betas mixbeta <- function(x,r,a1,b1,a2,b2,q,n) { qstar <- q*betabin(r,a1,b1,n)/(q*betabin(r,a1,b1,n)+ (1-q)*betabin(r,a2,b2,n)) p1 <- dbeta(x,a1+r,n-r+b1) p2 <- dbeta(x,a2+r,n-r+b2) posterior <- qstar*p1 + (1-qstar)*p2 } # then the mixture beta has parameters: # a = 12, b = 3 with probability 5% # a = 9.2, b = 13.8 with probabilty 95% # r = 15, n = 20 curve(mixbeta(x, r = 15, a1 = 12, b1 = 3, a2 = 9.2, b2 = 13.8, q = 0.05, n = 20), from = 0, to = 1, main = "posterior", xlab = "rate") Dr. Pablo E. Verde 98 Lecture 3: Prior distributions # 4) The probabilty that the drug is a winner is given by the posterior # weight to belong to the winner class qstar <- 0.05*betabin(15, 12 , 3 ,20 )/(0.05*betabin(15, 12 ,3 ,20) +(1-0.05)*betabin(15, 9.2, 13.8, 20 )) > qstar [1] 0.3951545 Dr. Pablo E. Verde 99 Lecture 4 - Multiparameters with R Lecture 4: Introduction to Multiparameter Models Dr. Pablo E. Verde 100 Lecture 4 - Multiparameters with R Summary Introduction to multiparameter inference Normal model with unknown mean and variance: standard non-informative priors Using R for predictive model checking Multinomial Model conjugate analysis and its application in R Comparing classical and Bayesian multiparameter models Dr. Pablo E. Verde 101 Lecture 4 - Multiparameters with R Introduction The basic ideas is similar to one parameter models: We have a model for the observed data which defines a likelihood p(y|θ) on the vector parameter θ We specify a joint prior distribution p(θ) for the possible values of θ We use Bayes’ rule to obtain the posterior of θ, p(θ|y) ∝ p(θ) × p(y|θ) Problems: In many applications is difficut to specify priors on multivariate θ Computations could be very difficult, they involve multivariate integration Dr. Pablo E. Verde 102 Lecture 4 - Multiparameters with R Example: Normal with unknown mean µ and variance σ2 In this example we use marathontimes in the R package LearnBayes The obervations are the mean times (in minutes) for men running Marathon with age classes between 20 to 29 years old: > library(LearnBayes) > data(marathontimes) > marathontimes$time [1] 182 201 221 234 237 251 261 266 267 273 286 291 292 ... > Dr. Pablo E. Verde 103 Lecture 4 - Multiparameters with R We assume a Normal model for the data given the mean µ and the variance σ2: y1, . . . , y20|µ, σ2 ∼ N(µ, σ2 ) We use a non-informative Jeffreys’ prior assuming independence between location and scale: p(µ, σ) ∝ 1 σ , Then posterior density for the mean and the variance p(µ, σ|y) is called the Normal-χ2 distribution This posterior delivers same results as classical analysis: E(µ|y) = ¯y, E(σ2 |y) = S2 The posterior p(µ|y) is proportional to a t-distribution with n − 1 df, and so on Direct simulation from these posteriors is straghtforwared Dr. Pablo E. Verde 104 Lecture 4 - Multiparameters with R The Normal-χ2 posterior is implemented in the function normchis2post in LearnBayes, which computes the logarithm of the joint posterior density of (µ, σ2) We can visualize the α% levels contourns with the function mycontour. 
The arguments include the log-density to plot, the rectangular area (xlo, xhi, ylo, yhi) and the data:
> mycontour(normchi2post, c(220,330,500,9000), time, xlab = "mean", ylab = "var")
We can simulate directly from the marginal posteriors of µ and σ²:
> SS <- sum((time - mean(time))^2)
> n <- length(time)
> sigma2 <- SS/rchisq(1000, n - 1)
> mu <- rnorm(1000, mean = mean(time), sd = sqrt(sigma2)/sqrt(n))
> points(mu, sigma2, col="blue")
> quantile(mu, c(0.025, 0.975)) # 95% posterior interval for mu
2.5% 97.5%
253.1121 301.1133
Dr. Pablo E. Verde 105 Lecture 4 - Multiparameters with R
[Figure: contour plot of the joint posterior density of the mean and variance, with the 1000 simulated draws overlaid]
Dr. Pablo E. Verde 106 Lecture 4 - Multiparameters with R
Predictive model checking
Is the statistical model consistent with the data?
We answer this question by simulating predictive y∗ values from the model and comparing some data features with the predictive data.
We start by comparing histograms of the original data and of simulated data sets of the same sample size:
y.star <- rnorm(1000, mean = mu, sd = sqrt(sigma2)) # predictive data
par(mfrow=c(3,4))
hist(time, breaks=10, xlim = c(150, 400), main="Original Data", col="green")
for(i in 1:11){
  y.sim <- sample(y.star, 20)
  hist(y.sim, breaks=10, xlim=c(150, 400), main="Simulated Data", col="blue")
}
Dr. Pablo E. Verde 107 Lecture 4 - Multiparameters with R
[Figure: histogram of the original data and histograms of 11 simulated data sets of size 20]
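As a quick aside not on the original slides, the posterior draws mu and sigma2 generated above can be used to verify the earlier claim that this noninformative analysis reproduces the classical results; a minimal sketch:
# The simulation-based 95% posterior interval for mu should essentially
# agree with the classical t-based confidence interval, and sqrt(sigma2)
# gives draws from the posterior of the standard deviation sigma.
quantile(mu, c(0.025, 0.975))             # Bayesian 95% interval for mu
t.test(time)$conf.int                     # classical 95% t interval
quantile(sqrt(sigma2), c(0.025, 0.975))   # 95% posterior interval for sigma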
Dr. Pablo E. Verde 108 Lecture 4 - Multiparameters with R
Predictive model checking
We define the following features comparing the observed data y with the predictive data y∗:
T∗1 = min(y∗), T∗2 = max(y∗), T∗3 = q75(y∗) − q25(y∗)
and, for asymmetry,
T∗4 = |y∗(18) − µ∗| − |y∗(2) − µ∗|
where the 18th and 2nd order statistics approximate the 90% and 10% quantiles respectively.
These measures are compared with the corresponding values based on the observed data:
T1 = min(y), T2 = max(y), T3 = q75(y) − q25(y) and T4 = |y(18) − µ∗| − |y(2) − µ∗|.
Dr. Pablo E. Verde 109 Lecture 4 - Multiparameters with R
In R we have:
# Analysis of the minimum, maximum, variability and asymmetry
min.y <- max.y <- asy1 <- asy2 <- inter.q <- rep(0,1000)
time <- sort(time)
for (b in 1:1000){
  y.sim <- sample(y.star, 20)
  mu.star <- sample(mu, 20)
  min.y[b] <- min(y.sim)
  max.y[b] <- max(y.sim)
  y.sim <- sort(y.sim)
  asy1[b] <- abs(y.sim[18] - mean(mu.star)) - abs(y.sim[2] - mean(mu.star))
  asy2[b] <- abs(time[18] - mean(mu.star)) - abs(time[2] - mean(mu.star))
  inter.q[b] <- quantile(y.sim, prob=0.75) - quantile(y.sim, prob=0.25)
}
Dr. Pablo E. Verde 110 Lecture 4 - Multiparameters with R
To display these quantities:
par(mfrow =c(2,2))
hist(min.y, breaks = 50, col="blue", main = "Minimum y*")
abline(v=min(time), lwd=3, lty=2)
hist(max.y, breaks = 50, col="red", main = "Maximum y*")
abline(v=max(time), lwd=3, lty=2)
hist(inter.q, breaks = 50, col="magenta", main = "Variability y*")
abline(v=quantile(time,prob=0.75)-quantile(time,prob=0.25), lwd=3,lty=2)
plot(asy1, asy2, main ="Asymmetry", xlab="Asymmetry predicted data",
     ylab ="Asymmetry original data")
abline(a=0, b=1, lwd=2)
par(mfrow=c(1,1))
This analysis shows that the data and the model are compatible and the deviations can easily be explained by sampling variation.
Dr. Pablo E. Verde 111 Lecture 4 - Multiparameters with R
[Figure: histograms of the predictive minimum, maximum and variability, with the observed values marked by dashed lines, and a scatter plot of predicted versus observed asymmetry]
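A compact numerical complement to these graphical checks, not shown in the original slides, is to turn each checking statistic into a posterior predictive tail-area probability using the vectors min.y, max.y and inter.q computed above; values very close to 0 or 1 would signal a conflict between model and data. A minimal sketch (the object obs.iqr is introduced here only for readability):
# Posterior predictive p-values: the proportion of simulated data sets whose
# checking statistic is at least as extreme as the observed one
obs.iqr <- quantile(time, prob=0.75) - quantile(time, prob=0.25)
mean(min.y <= min(time))    # predictive minimum as small as the observed minimum
mean(max.y >= max(time))    # predictive maximum as large as the observed maximum
mean(inter.q >= obs.iqr)    # predictive interquartile range as large as observed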
Dr. Pablo E. Verde 112 Lecture 4 - Multiparameters with R
Multinomial distribution with unknown cell probabilities
In this model we observe n independent trials, each of which has p possible outcomes with associated probabilities θ = (θ1, . . . , θp). Letting y = (y1, . . . , yp) be the number of times each outcome is observed, we have
y|θ ∼ Multinomial(n, θ),
with likelihood
p(y|θ) = ( (Σi yi)! / ∏i yi! ) ∏i θi^yi,   with Σi yi = n and Σi θi = 1.
Dr. Pablo E. Verde 113 Lecture 4 - Multiparameters with R
The kernel of this likelihood is proportional to a Dirichlet distribution. If a priori we assume
θ ∼ Dirichlet(a1, . . . , ap), ai > 0,
then
p(θ) ∝ ∏i θi^(ai−1)
and the posterior for θ is
p(θ|y) ∝ ∏i θi^(yi+ai−1),
so taking αi = ai + yi, the posterior for θ is
p(θ|y) = Dirichlet(α1, . . . , αp).
Dr. Pablo E. Verde 114 Lecture 4 - Multiparameters with R
Some comments
The prior distribution is equivalent to a likelihood resulting from Σi ai observations, with ai observations in the i-th category.
A uniform density is obtained by setting ai = 1 for all i. This distribution assigns equal probability to each θi.
In general it is very difficult to work with this distribution in practice.
Dr. Pablo E. Verde 114 Lecture 4 - Multiparameters with R

Some comments

The prior distribution is equivalent to a likelihood resulting from Σi ai observations, with ai observations in the ith category.
A uniform density is obtained by setting ai = 1 for all i. This distribution assigns equal probability to each θi.
In general it is very difficult to work with this distribution in practice.
Proportions are in general not independent, and modeling prior information in this context is very difficult.
For contingency tables we are going to use a surrogate Poisson model for the multinomial distribution.
Moments and properties of the Dirichlet distribution are given in Appendix A of Gelman et al.

Dr. Pablo E. Verde 115 Lecture 4 - Multiparameters with R

Example: A model for a two by two contingency table

Intervention:        New      Control
Death                θ1,1     θ1,2
No death             θ2,1     θ2,2

with N subjects in total.

Data model:      p(y|θ) ∝ Π(j=1..2) Π(i=1..2) θi,j^yi,j
Prior model:     p(θ) ∝ Π(j=1..2) Π(i=1..2) θi,j^(ai,j − 1)
Posterior model: p(θ|y) ∝ Π(j=1..2) Π(i=1..2) θi,j^(yi,j + ai,j − 1)

Dr. Pablo E. Verde 116 Lecture 4 - Multiparameters with R

We are interested in making inference for the odds ratio:

Ψ = (θ1,1 θ2,2) / (θ1,2 θ2,1).

We use direct simulation methods:

Simulate a large number of values for the vector θ from its posterior.
For each simulated value calculate Ψ*.
Inference for Ψ is based on the histogram of Ψ*.

Dr. Pablo E. Verde 117 Lecture 4 - Multiparameters with R

Example: GREAT trial (Spiegelhalter et al., p. 69)

Intervention: Thrombolytic therapy after myocardial infarction, given at home by general practitioners.
Aim of study: to compare a new drug treatment, given at home as soon as possible after a myocardial infarction, with placebo.
Outcome measure: Thirty-day mortality rate under each treatment, with the benefit of the new treatment measured by the odds ratio, i.e., the ratio of the odds of death following the new treatment to the odds of death on the conventional one: OR < 1 therefore favors the new treatment.
Prospective Bayesian analysis: NO. It was carried out after the trial reported its results.

Dr. Pablo E. Verde 118 Lecture 4 - Multiparameters with R

Example: GREAT trial (continued)

Intervention:     New     Control
Death              13          23     36
No death          150         125    275
                  163         148    311

We use a Dirichlet prior with parameters a1,1 = a1,2 = a2,1 = a2,2 = 1, which corresponds to a uniform distribution for (θ1,1, θ1,2, θ2,1, θ2,2).

Dr. Pablo E. Verde 119 Lecture 4 - Multiparameters with R

We can simulate from a Dirichlet distribution with the function rdirichlet() from the package LearnBayes:

> library(LearnBayes)
> draws <- rdirichlet(10000, c(13,23,150,125) )
> odds <- draws[,1]*draws[,4]/(draws[,2]*draws[,3])
> hist(odds, breaks = 100, xlab="Odds Ratio", freq = FALSE)
> lines(density(odds), lty = 2, lwd = 2, col = "blue")
> abline(v = quantile(odds, prob=c(0.025, 0.5, 0.975)), lty = 3, col = "red")
>
> quantile(odds, prob=c(0.025, 0.5, 0.975))
     2.5%       50%     97.5%
0.2157422 0.4649921 0.9698272

Dr. Pablo E. Verde 120 Lecture 4 - Multiparameters with R

[Figure: histogram and density of the simulated posterior odds ratios, with the 2.5%, 50% and 97.5% quantiles marked, together with the implied prior density of the odds ratio.]

Dr. Pablo E. Verde 121 Lecture 4 - Multiparameters with R

Now suppose that we want to calculate the "p-value" of H0: Ψ ≥ 1 vs. H1: Ψ < 1. Then,

> sum(odds > 1)/10000
[1] 0.0187

It is interesting to compare these results with the exact Fisher test:

> fisher.test(matrix(c(13, 23, 150, 125), nrow=2, byrow=TRUE), alternative="less")
Fisher's Exact Test for Count Data
p-value = 0.02817

Thus, the uniform prior on (θ1,1, θ1,2, θ2,1, θ2,2) does not correspond to a standard statistical analysis.
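The "prior odds" curve shown in the histogram of the odds ratio above can be reproduced by pushing draws from the Dirichlet(1, 1, 1, 1) prior through the same odds ratio formula. A short sketch (assuming LearnBayes is still loaded; the object names are illustrative):

# Sketch: implied prior for the odds ratio under the uniform Dirichlet prior
prior.draws <- rdirichlet(10000, c(1, 1, 1, 1))
prior.odds  <- prior.draws[,1]*prior.draws[,4] / (prior.draws[,2]*prior.draws[,3])
quantile(prior.odds, prob = c(0.025, 0.5, 0.975))  # median near 1, but very dispersed
mean(prior.odds >= 1)                              # prior Pr(Psi >= 1) is about 0.5

This makes explicit what the uniform prior on the cell probabilities implies on the odds ratio scale.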
Dr. Pablo E. Verde 122 Lecture 4 - Multiparameters with R

Now, if we work with a prior with parameters a1,1 = 2, a1,2 = 1, a2,1 = 1 and a2,2 = 2, with density

p(θ1,1, θ1,2, θ2,1, θ2,2) ∝ θ1,1 θ2,2,

we get the following results:

> # comparison with the Fisher exact test
> param <- c(2,1,1,2)   # parameters of the Dirichlet
> draws <- rdirichlet(10000, c(13,23,150,125) + param )
> odds <- draws[,1]*draws[,4]/(draws[,2]*draws[,3])
> sum(odds>1)/10000
[1] 0.0277

This shows (empirically) that the Fisher test is NOT based on a non-informative prior. Some weak information about association is implied in the test.
However, there is a difference in interpretation between the Bayesian and sampling theory results.

Dr. Pablo E. Verde 123 Lecture 4 - Multiparameters with R

[Figure: perspective plot of the Dirichlet prior density f(θ1,1, θ2,2) with a1,1 = 2, a1,2 = 1, a2,1 = 1, a2,2 = 2.]

Dr. Pablo E. Verde 124 Lecture 4 - Multiparameters with R

Practical Exercise: Normal data with unknown mean and variance

Review the Bayesian analysis presented in this lecture.
Generate posterior predictive data and check model fitness.
Now, change the first observation of the marathon times and generate an outlier as follows:

> library(LearnBayes)
> data(marathontimes)
> marathontimes$time
[1] 182 201 221 234 237 251 261 266 267 273 286 291 292 ...
> # generate an outlier in the first observation:
data1 <- marathontimes$time
data1[1] <- 82

Make the same analysis for the contaminated data.
What is the effect of this outlier on the analysis?

Dr. Pablo E. Verde 125 Lecture 5 - Introduction to WinBUGS

Lecture 5: Introduction to WinBUGS

Dr. Pablo E. Verde 126 Lecture 5 - Introduction to WinBUGS

Summary

Introduction to BUGS
The BUGS language
Some simple examples
Making predictions
Connecting WinBUGS with R

Dr. Pablo E. Verde 127 Lecture 5 - Introduction to WinBUGS

Introduction to BUGS

The BUGS project began at the Medical Research Council Biostatistics Unit in Cambridge in 1989, before the classic Gibbs sampling paper by Gelfand and Smith in 1990. An excellent review and future directions of the BUGS project are given in Lunn et al. (2009).
BUGS stands for Bayesian inference Using Gibbs Sampling, reflecting the basic computational techniques originally adopted.
BUGS has been just one part of the tremendous growth in the application of Bayesian ideas over the last 20 years.
At this time BUGS has over 30,000 registered users worldwide, and an active on-line community comprising over 8,000 members.
Typing "WinBUGS" in Google generates over 100,000 hits; in Google Scholar over 5,000; and searching in Statistics in Medicine on-line gives about 100 hits.

Dr. Pablo E. Verde 128 Lecture 5 - Introduction to WinBUGS

Think different ...

The modelling philosophy of BUGS was strongly influenced by developments in artificial intelligence in the 1980's. In particular the focus was on expert systems, where a basic principle was to separate:

Knowledge base: the assumed model for the world; makes use of a declarative form of programming; structures described using a series of local relationships
Inference engine: used to draw conclusions in specific circumstances

This approach forces one to think first about the model used to describe the problem at hand.
The declarative programming approach sometimes confuses statistical analysts used to working with procedural or functional statistical languages (SAS, SPSS, R, etc.).

Dr. Pablo E. Verde 129 Lecture 5 - Introduction to WinBUGS

The BUGS language:

Language for specifying complex Bayesian models.
Constructs an object-oriented internal representation of the model graph by identifying parents and children. This is done with a DAG (Directed Acyclic Graph).
Builds up an arbitrarily complex model through specification of local structure.
Simulation from full conditionals using Gibbs sampling.
The current version is WinBUGS 1.4.3; it runs in Windows and incorporates the DoodleBUGS graphical model editor and a script language for running in batch mode.
WinBUGS is freely available from http://www.mrc-bsu.cam.ac.uk/bugs
An open source version of BUGS called OpenBUGS is under development, and includes versions of BUGS that run under LINUX (LinBUGS) and that can be run directly from R (BRugs). See http://www.rni.helsinki.fi/openbugs

Dr. Pablo E. Verde 130 Lecture 5 - Introduction to WinBUGS

Example: Drug

In n = 20 patients we observed y = 15 positive responses.

y ∼ Bin(θ, n)

and we assume a conjugate prior for θ:

θ ∼ Beta(a, b)

Of course, we know the posterior distribution is

θ|y ∼ Beta(a + y, b + n − y)

and no simulation is necessary. But just to illustrate WinBUGS ...

Dr. Pablo E. Verde 131 Lecture 5 - Introduction to WinBUGS

Dr. Pablo E. Verde 132 Lecture 5 - Introduction to WinBUGS

Directed Acyclic Graph (DAG) representation:

Oval nodes represent random variables
Rectangular nodes represent constants
Arrows represent parent-child relationships

Dr. Pablo E. Verde 133 Lecture 5 - Introduction to WinBUGS

1. Write BUGS code to specify the model ...

model {
  y ~ dbin(theta, n)
  theta ~ dbeta(a, b)
}

2. Check the model syntax ...
3. Load the data ... list(n = 20, y = 15, a = 3, b = 2) ... and compile.
4. Then load initial values ... list(theta = 0.5)
5. Select the nodes (variables) to monitor (just one in this case) ...
6. Set the trace for all selected nodes (*) to monitor the MCMC simulations ...
7. Using the Update tool, select 10,000 simulations ...
8. Results ...

Dr. Pablo E. Verde 134 Lecture 5 - Introduction to WinBUGS

Example: Drug (continued)

Now we assume a different prior: a logit transform

φ = log( θ / (1 − θ) ),  so that  −∞ < φ < ∞.

We assume a normal prior for φ:

φ ∼ Normal(µ, τ)

for suitable mean µ and precision τ = 1/σ².
This is a non-conjugate prior with no simple form for the posterior.
Straightforward in WinBUGS!

Dr. Pablo E. Verde 135 Lecture 5 - Introduction to WinBUGS

Double arrows represent a logical node or a mathematical functional relationship.

Dr. Pablo E. Verde 136 Lecture 5 - Introduction to WinBUGS

The BUGS code for the model is ...

model {
  y ~ dbin(theta, n)
  logit(theta) <- phi
  phi ~ dnorm(0, 0.001)
}

Dr. Pablo E. Verde 137 Lecture 5 - Introduction to WinBUGS

Making predictions

It is important to be able to predict unobserved quantities for:
'filling-in' missing or censored data
model checking - are predictions 'similar' to the observed data?
making predictions!

Easy in MCMC/WinBUGS: just specify a stochastic node without a data value and it is automatically predicted.
Provides automatic imputation of missing data.
The easiest case is where there is no data at all! Just 'forward sampling' from the prior to make a Monte Carlo analysis.

Dr. Pablo E. Verde 138 Lecture 5 - Introduction to WinBUGS

Example: making predictions

The BUGS code for the model is ...

model {
  y ~ dbin(theta, n)
  logit(theta) <- phi
  phi ~ dnorm(mu, tau)
  y.pred ~ dbin(theta, n)   # defines a stochastic node for predictions
}

Dr. Pablo E. Verde 139 Lecture 5 - Introduction to WinBUGS

The prediction node y.pred is conditionally independent of the data node y given the rate θ.
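As a quick cross-check outside WinBUGS (not part of the original example), the same predictive idea can be run by forward simulation in R when the conjugate Beta(a, b) prior of the first Drug model is used, since then θ|y ∼ Beta(a + y, b + n − y):

# Sketch: Monte Carlo posterior and predictive draws for the conjugate Drug model
n <- 20; y <- 15; a <- 3; b <- 2                  # data and Beta(a, b) prior as above
theta  <- rbeta(10000, a + y, b + n - y)          # posterior draws of theta
y.pred <- rbinom(10000, size = n, prob = theta)   # predictive draws for a new set of n patients
quantile(theta, c(0.025, 0.5, 0.975))
table(y.pred) / 10000                             # estimated predictive distribution of y.pred

The WinBUGS results for the non-conjugate logit-normal prior can be compared against this conjugate benchmark.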
Dr. Pablo E. Verde 140 Lecture 5 - Introduction to WinBUGS

Summary: Running WinBUGS

1. Open the Specification tool and Update from the Model menu, and Samples from the Inference menu.
2. Highlight the model by double-click. Click on Check model.
3. Highlight the start of the data. Click on Load data.
4. Click on Compile.
5. Highlight the start of the initial values. Click on Load inits.
6. Click on Gen Inits if the model needs further initial values.
7. Click on Update to burn in.
8. Type the nodes to be monitored into the Sample Monitor, and click set after each.
9. Perform more updates.
10. Type * into the Sample Monitor, and click stats, etc. to see results on all monitored nodes.

Dr. Pablo E. Verde 141 Lecture 5 - Introduction to WinBUGS

The R2WinBUGS package in R

The R package R2WinBUGS offers a versatile approach for making MCMC computations within WinBUGS and returning them to R.
Let's see a step by step example of linking R and WinBUGS with R2WinBUGS.

Example: non-conjugate prior inference for a binomial experiment

First save the following file with WinBUGS code in your working directory ...

# WinBUGS code: binary problem, non-conjugate analysis
model {
  y ~ dbin(theta, n)
  logit(theta) <- phi
  phi ~ dnorm(0, 0.001)
  y.pred ~ dbin(theta, n)   # making predictions
}

Dr. Pablo E. Verde 142 Lecture 5 - Introduction to WinBUGS

Then in the R console ...

# load R2WinBUGS package
library(R2WinBUGS)
# set up the WinBUGS directory and your working directory
bugsdir <- "C:/Programme/WinBUGS14"
workdir <- getwd()
# define your data nodes
n <- 20
y <- 15
data1 <- list("n", "y")

Dr. Pablo E. Verde 143 Lecture 5 - Introduction to WinBUGS

# define parameters of interest
par1 <- c("theta", "y.pred")
# run the bugs() function:
m1 <- bugs(data1, inits = NULL, par1, "model1.txt",
           n.chains = 1, n.iter = 2000, n.thin = 1,
           bugs.directory = bugsdir,
           working.directory = getwd(), debug = TRUE)

Dr. Pablo E. Verde 144 Lecture 5 - Introduction to WinBUGS

> print(m1, digits.summary = 3)
Inference for Bugs model at "model1.txt", fit using WinBUGS,
1 chains, each with 2000 iterations (first 1000 discarded)
n.sims = 1000 iterations saved
           mean    sd  2.5%    25%    50%    75%  97.5%
theta     0.749 0.096 0.543  0.690  0.760  0.819  0.914
y.pred   14.962 2.733 9.000 13.000 15.000 17.000 19.000
deviance  4.277 1.584 3.198  3.286  3.667  4.551  9.014

DIC info (using the rule, pD = Dbar-Dhat)
pD = 1.1 and DIC = 5.3
DIC is an estimate of expected predictive error (lower deviance is better).

Dr. Pablo E. Verde 145 Lecture 5 - Introduction to WinBUGS

The R object m1 generated in this analysis belongs to the class bugs and can be further manipulated, transformed, etc.

> class(m1)
[1] "bugs"
> names(m1)
 [1] "n.chains"        "n.iter"          "n.burnin"
 [4] "n.thin"          "n.keep"          "n.sims"
 [7] "sims.array"      "sims.list"       "sims.matrix"
[10] "summary"         "mean"            "sd"
[13] "median"          "root.short"      "long.short"
[16] "dimension.short" "indexes.short"   "last.values"
[19] "isDIC"           "DICbyR"          "pD"
[22] "DIC"             "model.file"      "program"

Dr. Pablo E. Verde 146 Lecture 5 - Introduction to WinBUGS

The object m1 is a list in R, so we can extract elements of the list by using the "$" operator, for example:

> m1$pD
[1] 1.065
> m1$n.chains
[1] 1
> m1$sims.array[1:10, , "theta"]
 [1] 0.8389 0.9124 0.7971 0.8682 0.7025 0.7696 0.8417 0.6782 0.6146
[10] 0.8647
> m1$sims.array[1:10, , "y.pred"]
 [1] 15 19 19 20 15 20 18 17 10 17
> theta <- m1$sims.array[1:1000, , "theta"]
> hist(theta, breaks = 50, prob = TRUE)
> lines(density(theta), col = "blue", lwd = 2)

Dr. Pablo E. Verde 147 Lecture 5 - Introduction to WinBUGS

[Figure: histogram and density estimate of the posterior draws of theta.]
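The predictive node can be summarised in the same way as theta. A short sketch (not in the original slides) using the m1 object above:

# Sketch: posterior predictive distribution of y.pred from the bugs object m1
y.pred <- m1$sims.array[, , "y.pred"]
table(y.pred) / length(y.pred)    # estimated predictive probabilities
mean(y.pred >= 15)                # chance that a replicate trial gives 15 or more responses
hist(y.pred, breaks = seq(-0.5, 20.5, 1), main = "Predictive distribution of y.pred")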
Dr. Pablo E. Verde 148 Lecture 5 - Introduction to WinBUGS

Some aspects of the BUGS language

<- represents logical dependence, e.g. m <- a + b*x
~ represents stochastic dependence, e.g. r ~ dunif(a, b)

Can use arrays and loops:

model{
  ....
  for (i in 1:N){
    r[i] ~ dbin(p[i], n[i])
  }
  ...
}

Dr. Pablo E. Verde 149 Lecture 5 - Introduction to WinBUGS

A for loop is represented as a "plate" in the DAG of the model.

Dr. Pablo E. Verde 150 Lecture 5 - Introduction to WinBUGS

Some aspects of the BUGS language

Some functions can appear on the left-hand side of an expression, e.g.
logit(p[i]) <- a + b*x[i]
log(m[i]) <- c + d*y[i]

mean(p[]) to take the mean of a whole array, mean(p[m:n]) to take the mean of elements m to n. Similarly for sum(p[]).

dnorm(0, 1)I(0, ) means the random variable will be restricted to the range (0, ∞).

Dr. Pablo E. Verde 151 Lecture 5 - Introduction to WinBUGS

Functions in the BUGS language

p <- step(0.05 - x) = 1 if x ≤ 0.05, 0 otherwise. Hence monitoring p and recording its mean will give the probability that x ≤ 0.05. This is useful to calculate Bayesian p-values.
p <- equals(x, 0.7) = 1 if x = 0.7, 0 otherwise.
tau <- 1/pow(s, 2) sets τ = 1/s².
s <- 1/sqrt(tau) sets s = 1/√τ.
p[i,k] <- inprod(pi[], Lambda[i,]) sets p_i,k = Σ_j π_j Λ_i,j.
See 'Model Specification/Logical nodes' in the manual for the full syntax.

Dr. Pablo E. Verde 152 Lecture 5 - Introduction to WinBUGS

Data transformations

Although transformations of data can always be carried out before using WinBUGS, it is convenient to be able to try various transformations of dependent variables within a model description.
For example, we may wish to try both y and sqrt(y) as dependent variables without creating a separate variable z = sqrt(y) in the data file.
The BUGS language therefore permits the following type of structure to occur:

for (i in 1:N) {
  z[i] <- sqrt(y[i])
  z[i] ~ dnorm(mu, tau)
}

Strictly speaking, this goes against the declarative structure of the model specification.

Dr. Pablo E. Verde 153 Lecture 5 - Introduction to WinBUGS

Some common distributions

Binomial: r ~ dbin(p, n)
Normal: x ~ dnorm(mu, tau)
Poisson: r ~ dpois(lambda)
Uniform: x ~ dunif(a, b)
Gamma: x ~ dgamma(a, b)

Note: The normal distribution is parameterized in terms of its mean and precision = 1/variance = 1/σ².
Functions cannot be used as arguments in distributions. You need to create new nodes.

Dr. Pablo E. Verde 154 Lecture 5 - Introduction to WinBUGS

The WinBUGS data formats

WinBUGS accepts data files in:

1. Rectangular format

n[] r[]
50 2
....
20 4
END

2. R list format:

list(N = 12, n = c(50,12,...), r = c(2, 1,...))

Dr. Pablo E. Verde 155 Lecture 5 - Introduction to WinBUGS

Double indexing: specifying categorical explanatory variables

y[] x[]
12 1    # subject with x = level 1
34 2    # subject with x = level 2
...

for( i in 1:N) {
  y[i] ~ dnorm(mu[i], tau)
  mu[i] <- alpha + beta[x[i]]
}
alpha ~ dunif(-100, 100)
beta[1] <- 0               # alias first level of beta
beta[2] ~ dunif(-100, 100)
beta[3] ~ dunif(-100, 100)
tau ~ dgamma(0.1, 0.1)

Dr. Pablo E. Verde 156 Lecture 5 - Introduction to WinBUGS

Practical Exercise: Survival data

Aitchison & Dunsmore (1975) give data on survival times (in weeks) of 20 patients after treatment for a particular carcinoma. The data set is

list(survtime = c(25, 45, 238, 194, 16, 23, 30, 16, 22, 123, 51, 412, 45, 162, 14, 72, 5, 43, 45, 91), N = 20)

Dr. Pablo E. Verde 157 Lecture 5 - Introduction to WinBUGS
1. Write a WinBUGS script for the following model:

   yi = log(survtime[i])
   yi | µ, τ ∼ N(µ, τ),  τ = 1/σ²

   with non-informative priors µ ∼ N(0, 10^-3) and τ ∼ Gamma(10^-3, 10^-3).

2. Use as initial values list(mu = 0, tau = 1).
3. Draw a DAG for this model.
4. Run the model with 10,000 iterations and discard the first 5,000. Analyze convergence visually.
5. Estimate the probability that a new patient has a survival time of more than 150 weeks.

Dr. Pablo E. Verde 158 Lecture 5 - Introduction to WinBUGS

Solution

# WinBUGS script:
model{
  for (i in 1:N) {
    y[i] <- log(survtime[i])      # data transform
    y[i] ~ dnorm(mu, tau)         # sampling model
  }
  sigma <- 1/sqrt(tau)            # standard deviation
  mu ~ dnorm(0, 1.0E-3)           # prior for mu
  tau ~ dgamma(1.0E-3, 1.0E-3)    # prior for tau
  y.new ~ dnorm(mu, tau)          # predicted data
  y.dif <- step(y.new - log(150)) # Pr(survival > 150)
}

Dr. Pablo E. Verde 159 Lecture 5 - Introduction to WinBUGS

Solution

Dr. Pablo E. Verde 160 Lecture 6 - Multivariate Models with WinBUGS

Lecture 6: Multivariate Models with WinBUGS

Dr. Pablo E. Verde 161 Lecture 6 - Multivariate Models with WinBUGS

Multivariate Normal Distribution

Multivariate Normal for a p-dimensional vector y:

y ∼ Normal_p(µ, Σ)

The conjugate prior for µ is also a MVN. WinBUGS follows the parametrization using the precision matrix Ω = Σ⁻¹.

In BUGS notation we have:

y[1:p] ~ dmnorm(mu[], Omega[,])
mu[1:p] ~ dmnorm(mu.prior[], Omega.prior[])
Sigma[1:p, 1:p] <- inverse(Omega[,])

Dr. Pablo E. Verde 162 Lecture 6 - Multivariate Models with WinBUGS

Priors on the precision matrix of multivariate normals

The conjugate prior is the Wishart distribution, which is analogous to the Gamma or χ².
It arises in classical statistics as the distribution of the sum-of-squares and products matrix in multivariate normal sampling.
The Wishart distribution Wp(k, R) for a symmetric positive definite p × p matrix Ω has density

p(Ω) ∝ |R|^(k/2) |Ω|^((k−p−1)/2) exp( −½ tr(RΩ) ),

defined for a real scalar k > p − 1 and a symmetric positive definite matrix R.

Dr. Pablo E. Verde 163 Lecture 6 - Multivariate Models with WinBUGS

Some characteristics

When p = 1, W1(k, R) ≡ Gamma(k/2, R/2) ≡ χ²_k / R.
The expectation of Wp(k, R) is E[Ω] = k R⁻¹.
The Jeffreys prior is p(Ω) ∝ |Σ|^(−(p+1)/2), equivalent to k → 0. This is not currently implemented in WinBUGS.
For a "weakly informative" prior we can set R/k to be a rough prior guess at the unknown true covariance matrix; taking k = p indicates a minimal "effective prior sample size".

Dr. Pablo E. Verde 164 Lecture 6 - Multivariate Models with WinBUGS

Known problems

Every element must have the same precision.
Incomprehensible! It is a good idea to make some prior predictions in practice to understand what we are doing.
Gelman et al. (2004), page 483, outline alternative strategies.
If you do not use Wishart priors for precision matrices in BUGS, you need to make sure that the covariance matrix at each iteration is positive-definite, otherwise it may crash.

Dr. Pablo E. Verde 165 Lecture 6 - Multivariate Models with WinBUGS

Exercise 1: Correlation analysis with missing data

The following is an artificial data set from the book Tools for Statistical Inference (Tanner, 1994, page 63):

> Y
      [,1] [,2]
 [1,]    1    1
 [2,]    1   -1
 [3,]   -1    1
 [4,]   -1   -1
 [5,]    2   NA
 [6,]    2   NA
 [7,]   -2   NA
 [8,]   -2   NA
...

Dr. Pablo E. Verde 166 Lecture 6 - Multivariate Models with WinBUGS

The following script in WinBUGS implements a Bayesian analysis for this problem.
model {
  for (i in 1:12){
    Y[i, 1:2] ~ dmnorm(mu[], tau[,])
  }
  mu[1] <- 0
  mu[2] <- 0
  tau[1:2, 1:2] ~ dwish(R[,], 2)
  R[1, 1] <- 0.001
  R[1, 2] <- 0
  R[2, 1] <- 0
  R[2, 2] <- 0.001
  Sigma2[1:2, 1:2] <- inverse(tau[,])
  rho <- Sigma2[1, 2] / sqrt(Sigma2[1, 1] * Sigma2[2, 2])
}

Dr. Pablo E. Verde 167 Lecture 6 - Multivariate Models with WinBUGS

[Figure: posterior distribution P(ρ|data).]

Using posterior predictive values in each iteration, WinBUGS imputes the missing values.

Dr. Pablo E. Verde 168 Lecture 6 - Multivariate Models with WinBUGS

Multinomial model: non-conjugate analysis

In some applications the cell probabilities of contingency tables can be made a function of more basic parameters.

Example: population genetics

                      Offspring genotype
Maternal genotype     AA      AB      BB
AA                    427      95       -
AB                    108     161      71
BB                      -      64      74

Dr. Pablo E. Verde 169 Lecture 6 - Multivariate Models with WinBUGS

The model equations are given by the following table:

                      Offspring genotype
Maternal genotype     AA                   AB           BB
AA                    (1 − σ)p + σ         (1 − σ)q     -
AB                    (1 − σ)p/2 + σ/4     1/2          (1 − σ)q/2 + σ/4
BB                    -                    (1 − σ)p     (1 − σ)q + σ

where p is the frequency of A in outcross pollen, σ is the rate of self-fertilization and q = 1 − p.

Dr. Pablo E. Verde 170 Lecture 6 - Multivariate Models with WinBUGS

To implement this model in WinBUGS, we equate the cell probabilities with the required functions:

model{
  XAA[1] <- (1-sigma)*p + sigma;      XAA[2] <- (1-sigma)*q;   XAA[3] <- 0
  XAB[1] <- (1-sigma)*p/2 + sigma/4;  XAB[2] <- 0.5;           XAB[3] <- (1-sigma)*q/2 + sigma/4
  XBB[1] <- 0;                        XBB[2] <- (1-sigma)*p;   XBB[3] <- (1-sigma)*q + sigma
  KAA <- sum(NAA[]);  KAB <- sum(NAB[]);  KBB <- sum(NBB[])
  NAA[1:3] ~ dmulti(XAA[], KAA)
  NAB[1:3] ~ dmulti(XAB[], KAB)
  NBB[1:3] ~ dmulti(XBB[], KBB)
  p ~ dunif(0, 1)       # uniform prior for p
  sigma ~ dunif(0, 1)   # uniform prior for sigma
  q <- 1 - p
}

list(NAA = c(427, 95, 0), NAB = c(108, 161, 71), NBB = c(0, 64, 74))

Dr. Pablo E. Verde 171 Lecture 6 - Multivariate Models with WinBUGS

> print(m.popgen, digits = 3)
Inference for Bugs model at "popgen.bug", fit using WinBUGS,
2 chains, each with 10000 iterations (first 5000 discarded)
n.sims = 10000 iterations saved
           mean    sd   2.5%    25%    50%    75%  97.5%  Rhat n.eff
p         0.705 0.024  0.655  0.690  0.706  0.721  0.749 1.001 10000
sigma     0.371 0.042  0.288  0.343  0.371  0.400  0.451 1.001 10000
deviance 27.151 1.972 25.210 25.730 26.540 27.960 32.410 1.001 10000

For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat = 1).

DIC info (using the rule, pD = Dbar-Dhat)
pD = 2.0 and DIC = 29.1
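For completeness, the m.popgen object summarised above could be produced from R with a call along the following lines. This is a sketch only: it assumes the model code has been saved as popgen.bug (the file name in the output above), that bugsdir points to the WinBUGS installation as in Lecture 5, and the R object names are illustrative.

# Sketch: running the population genetics model via R2WinBUGS
library(R2WinBUGS)
NAA <- c(427, 95, 0); NAB <- c(108, 161, 71); NBB <- c(0, 64, 74)
data.pg <- list("NAA", "NAB", "NBB")
par.pg  <- c("p", "sigma")
m.popgen <- bugs(data.pg, inits = NULL, par.pg, "popgen.bug",
                 n.chains = 2, n.iter = 10000,
                 bugs.directory = bugsdir, working.directory = getwd())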
Dr. Pablo E. Verde 172 Lecture 6 - Multivariate Models with WinBUGS

[Figure: posterior draws of p and sigma from the population genetics model: marginal histograms and joint scatter plot.]

Dr. Pablo E. Verde 173 Lecture 7 - MCMC-Short-version

Lecture 7: Bayesian Computations with MCMC Methods

Dr. Pablo E. Verde 174 Lecture 7 - MCMC-Short-version

Summary

Introduction to discrete Markov chains.
Construction of Markov chain Monte Carlo algorithms.
Gibbs sampling methods.
Metropolis and Metropolis-Hastings methods.
Issues in applications of MCMC (convergence checking, proposal distributions, etc.)

Dr. Pablo E. Verde 175 Lecture 7 - MCMC-Short-version

Why is computation important?

Bayesian inference centers around the posterior distribution

p(θ|y) ∝ p(y|θ) × p(θ),

where θ may be a vector of thousands of components!
p(y|θ) and p(θ) are usually available in closed form, but p(θ|y) is usually not analytically tractable.

Also, we may want to:
obtain marginal posteriors, p(θ1|y) = ∫ p(θ|y) dθ(−1)
calculate properties of p(θ1|y), such as the mean E(θ1|y) = ∫ θ1 p(θ|y) dθ(−1), tail areas, etc.

We see that numerical integration becomes crucial in Bayesian inference!

Dr. Pablo E. Verde 176 Lecture 7 - MCMC-Short-version

General Monte Carlo integration

If we have algorithms for sampling from arbitrary (typically high-dimensional) posterior distributions, we can use Monte Carlo methods for Bayesian inference.

Suppose we can draw samples from the joint posterior distribution for θ, i.e.

θ(1), θ(2), . . . , θ(N) ∼ p(θ|y).

Then Monte Carlo integration gives

E(g(θ)) = ∫ g(θ) p(θ|y) dθ ≈ (1/N) Σ(i=1..N) g(θ(i)).

Theorems exist which prove convergence even if the sample is dependent, i.e.

(1/N) Σ(i=1..N) g(θ(i)) → E(g(θ)) as N → ∞.

Dr. Pablo E. Verde 177 Lecture 7 - MCMC-Short-version

Markov Chain Monte Carlo (MCMC)

Independent sampling from p(θ|y) may be very difficult in high dimensions.
Alternative strategy based on dependent sampling:
We know p(θ|y) up to a normalizing constant.
Then we design a Markov chain which has p(θ|y) as its stationary distribution.
A sequence of random variables θ(0), θ(1), θ(2), . . . forms a Markov chain if θ(i+1) ∼ p(θ|θ(i)), i.e. conditional on the value of θ(i), θ(i+1) is independent of θ(i−1), . . . , θ(0).

Dr. Pablo E. Verde 178 Lecture 7 - MCMC-Short-version

Run the chain until it appears to have settled down to equilibrium, say

θ(k), θ(k+1), . . . , θ(K) ∼ p(θ|y).

Use these sampled values to empirically estimate the posterior of θ, say p̂(θ|y).

Problem: how do we design a Markov chain with p(θ|y) as its unique stationary distribution?

Dr. Pablo E. Verde 179 Lecture 7 - MCMC-Short-version

Answer: This is surprisingly easy and several standard recipes are available.

Metropolis et al. (1953) showed how to do it.
Hastings (1970) generalized the Metropolis algorithm.
Geman and Geman (1984) introduced the Gibbs sampling algorithm.
Gelfand and Smith (1990) popularized Gibbs sampling in statistics.
See Gilks, Richardson and Spiegelhalter (1996) for a gentle introduction and many worked examples. Robert and Casella (2004) is a more detailed theoretical reference.

Dr. Pablo E. Verde 180 Lecture 7 - MCMC-Short-version

The Gibbs sampler

Let our vector of unknowns θ consist of k sub-components θ = (θ1, . . . , θk).
1. Choose starting values θ1^(0), . . . , θk^(0).
2. Sample from

   θ1^(1) ∼ p(θ1 | θ2^(0), θ3^(0), . . . , θk^(0), y)
   θ2^(1) ∼ p(θ2 | θ1^(1), θ3^(0), . . . , θk^(0), y)
   . . .
   θk^(1) ∼ p(θk | θ1^(1), θ2^(1), . . . , θ(k−1)^(1), y)

3. Repeat step 2 many 1000s of times. Eventually we obtain samples from p(θ|y).

Dr. Pablo E. Verde 181 Lecture 7 - MCMC-Short-version

Example: Normal distribution with unknown mean and variance

Suppose that the observations y1, . . . , yn ∼ N(µ, τ⁻¹).
We assume that the yi are conditionally independent given µ and the precision τ, and that µ and τ are themselves independent.
We put conjugate priors on µ and τ:

µ ∼ N(µ0, φ0⁻¹),   τ ∼ Gamma(a, b).

Then the full conditionals for µ and τ are:

p(µ|τ, y) = N( (µ0 φ0 + n ȳ τ) / (φ0 + n τ),  1 / (φ0 + n τ) )
p(τ|µ, y) = Gamma( a + n/2,  b + ½ Σ (yi − µ)² )

Dr. Pablo E. Verde 182 Lecture 7 - MCMC-Short-version

Programming the Gibbs sampler in R

Example: Normal distribution with unknown mean and variance

gibbsNormal <- function(y, mu1, tau1, N, mu0, phi0, a, b) {
  mu <- numeric(N+1)      # N = number of iterations
  mu[1] <- mu1            # the initial value for mu
  tau <- numeric(N+1)
  tau[1] <- tau1          # the initial value for tau
  n <- length(y); ybar <- mean(y)
  for(i in 2:(N+1)) {
    # generate samples from the full conditionals
    mu[i] <- rnorm(1, (mu0*phi0 + n*ybar*tau[i-1]) / (phi0 + n*tau[i-1]),
                   1/sqrt(phi0 + n*tau[i-1]))
    tau[i] <- rgamma(1, (n + 2*a)/2, (sum((y - mu[i])^2) + 2*b)/2)
  }
  cbind(mu, tau)          # return the simulated draws
}
Dr. Pablo E. Verde 183 Lecture 7 - MCMC-Short-version

[Figure] Four independent sequences of the Gibbs sampler for a normal distribution with unknown mean and variance. Left panel: first 10 steps. Right panel: last 500 iterations in each chain.

Dr. Pablo E. Verde 184 Lecture 7 - MCMC-Short-version

[Figure] Traces of one sequence of Gibbs simulations. Upper panels: 500 iterations from the posterior of µ. Lower panels: 500 iterations from the posterior of τ.
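Plots like the ones on these slides can be reproduced by calling the gibbsNormal() function above. A small usage sketch with simulated data and arbitrary hyperparameters (illustrative values only, not the data behind the figures):

# Sketch: running gibbsNormal() and inspecting the output
set.seed(123)
y <- rnorm(30, mean = 150, sd = 20)       # illustrative data
out <- gibbsNormal(y, mu1 = mean(y), tau1 = 1/var(y), N = 500,
                   mu0 = 0, phi0 = 1e-3, a = 1e-3, b = 1e-3)
par(mfrow = c(2, 2))
plot(out[, "mu"],  type = "l", ylab = "mu")
hist(out[, "mu"],  main = "Posterior of mu",  xlab = "mu")
plot(out[, "tau"], type = "l", ylab = "tau")
hist(out[, "tau"], main = "Posterior of tau", xlab = "tau")
par(mfrow = c(1, 1))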
Dr. Pablo E. Verde 185 Lecture 7 - MCMC-Short-version

General properties of an MCMC algorithm

Reversibility: The key property is reversibility or detailed balance, i.e., a balance in the flow of transition probabilities between the states of the Markov chain.
A Markov chain which is reversible has a stationary distribution.
Thus we build a kernel or transition matrix P such that it ensures reversibility:

p(x) P(y|x) = p(y) P(x|y),

where P(y|x) represents the flow of probability x → y.

Dr. Pablo E. Verde 186 Lecture 7 - MCMC-Short-version

Irreducibility: An irreducible chain is one in which, for any point θ(k) in the parameter space, it is possible to move from θ(k) to any other point θ(l) in a finite number of steps. This guarantees that the chain can visit all possible values of θ irrespective of the starting value θ(0).

Aperiodicity: A chain is aperiodic if it does not exhibit periodic behavior: if R1, R2, . . . , Rk are disjoint regions in the parameter space, the chain does not cycle around them.

To sample from p(θ|y) we construct a Markov chain with transition matrix P which satisfies reversibility, irreducibility and aperiodicity.

Dr. Pablo E. Verde 187 Lecture 7 - MCMC-Short-version

Metropolis algorithm

The algorithm proceeds as follows:

Start with a preliminary guess, say θ0.
At iteration t sample a proposal θt ∼ P(θt|θt−1). The jumping distribution must be symmetric, satisfying the condition P(θa|θb) = P(θb|θa) for all θa and θb.
If p(θt|y) > p(θt−1|y) then accept θt.
If not, flip a coin with probability of heads r = p(θt|y) / p(θt−1|y); if it comes up heads, accept θt. If the coin toss comes up tails, stay at θt−1.

The algorithm continues until we are sampling from p(θ|y). The coin tosses allow it to move to less plausible values of θ, and keep it from getting stuck in local maxima.

Dr. Pablo E. Verde 188 Lecture 7 - MCMC-Short-version

Some samplers for the Metropolis algorithm

The following are commonly implemented samplers for the proposal distribution P(θt|θt−1):

Random walk sampler: observations are generated by θt = θt−1 + zt with zt ∼ f. There are many common choices for f, including the uniform on the unit disc, a multivariate normal or a t-distribution.
The independent sampler: the candidate observation is sampled independently of the current state of the chain.
The Gibbs sampler: the Gibbs transition can be regarded as a special case of a Metropolis transition.
The STAN software implements samplers based on Hamiltonian dynamics.

Dr. Pablo E. Verde 189 Lecture 7 - MCMC-Short-version

Metropolis in R

The package MCMCpack implements Metropolis sampling for a large number of statistical models.
The current implementation uses a random walk Metropolis with a multivariate normal proposal density.
The proposal density is centered at the current θ(t), with variance-covariance matrix given by the user or automatically calculated proportional to the Hessian of the posterior evaluated at its mode.

Dr. Pablo E. Verde 190 Lecture 7 - MCMC-Short-version

Example: an interesting bivariate distribution

It is well known that the pair of marginal distributions does not uniquely determine a bivariate distribution. For example, a bivariate distribution with normal marginal distributions need not be jointly normal (Feller 1966, p. 69).
In contrast, the conditional distribution functions uniquely determine a joint density function (Arnold and Press 1989).
A natural question arises: must a bivariate distribution with normal conditionals be jointly normal?

Dr. Pablo E. Verde 191 Lecture 7 - MCMC-Short-version

The answer is NO!
Gelman and Meng (1991) introduced an interesting distribution where the joint distribution is non-normal:

f(x, y) ∝ exp( −½ (A x²y² + x² + y² − 2Bxy − 2C1x − 2C2y) ).

But it has conditional distributions which are normal:

x|y ∼ N( (By + C1)/(Ay² + 1), 1/(Ay² + 1) ),
y|x ∼ N( (Bx + C2)/(Ax² + 1), 1/(Ax² + 1) ).

The question is: how do we sample from f(x, y)?

Dr. Pablo E. Verde 192 Lecture 7 - MCMC-Short-version

Here we show an example where the parameters are A = 1, B = 0, C1 = 3 and C2 = 3.

[Figure: perspective plot of the Gelman-Meng distribution with A = 1, B = 0, C1 = 3 and C2 = 3.]

Dr. Pablo E. Verde 193 Lecture 7 - MCMC-Short-version

[Figure: contour plot of the Gelman-Meng distribution with A = 1, B = 0, C1 = 3 and C2 = 3.]

Dr. Pablo E. Verde 194 Lecture 7 - MCMC-Short-version

We implement a Metropolis sampling algorithm based on a random walk with bivariate normals as proposals.

## Gelman and Meng (1991) kernel function (on the log scale):
f.g.m <- function(xy, A = 1, B = 0, C1 = 3, C2 = 3) {
  x <- xy[1]; y <- xy[2]
  r <- -.5 * (A * x^2 * y^2 + x^2 + y^2 - 2 * B * x * y - 2 * C1 * x - 2 * C2 * y)
  as.vector(r)
}

## Metropolis sampling with MCMCpack
res <- MCMCmetrop1R(f.g.m, theta.init = c(0, 1), mcmc = 10000, burnin = 5000,
                    logfun = TRUE)   # f.g.m returns the log of the kernel

> @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The Metropolis acceptance rate was 0.47580
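The same log-kernel can also be explored with a hand-coded random walk Metropolis, which makes the algorithm described a few slides back explicit. This is a sketch, not part of the original material; the proposal standard deviation of 1 and the number of iterations are arbitrary choices.

# Sketch: random walk Metropolis for the Gelman-Meng log-kernel f.g.m
set.seed(42)
n.iter <- 15000
theta  <- matrix(NA, n.iter, 2)
theta[1, ] <- c(0, 1)                            # starting value
for (t in 2:n.iter) {
  prop  <- theta[t - 1, ] + rnorm(2, sd = 1)     # symmetric bivariate normal proposal
  log.r <- f.g.m(prop) - f.g.m(theta[t - 1, ])   # log acceptance ratio
  theta[t, ] <- if (log(runif(1)) < log.r) prop else theta[t - 1, ]
}
draws <- theta[-(1:5000), ]                      # discard burn-in
colMeans(draws)                                  # compare with the MCMCpack results in res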
Dr. Pablo E. Verde 195 Lecture 7 - MCMC-Short-version

[Figure: scatter plot of the simulated draws over the contours of the Gelman-Meng distribution.]

Dr. Pablo E. Verde 196 Lecture 7 - MCMC-Short-version

[Figure: trace plots of the simulated values of x and y against iteration number.]

Dr. Pablo E. Verde 197 Lecture 7 - MCMC-Short-version

Performance of MCMC methods

There are three main issues to consider:

Convergence: How quickly does the distribution of θ(t) approach p(θ|y)?
Efficiency: How well are functionals of p(θ|y) estimated from θ(t)?
Simplicity: How convenient is the method to use?

In general, computer effort should be measured in seconds, not iterations!

Dr. Pablo E. Verde 198 Lecture 7 - MCMC-Short-version

Checking convergence

Convergence is to the target distribution, not to a single value.
Once convergence is reached, samples should look like a random scatter about a stable mean value.
One approach is to run many long chains with widely differing starting values.
Plot traces of the simulated chains.
Plot the ACF, etc.
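In R these checks can be automated with the coda package (an extra package, not used in the original slides). A small sketch, assuming two chains of simulated values are available as the vectors or matrices chain1 and chain2 (hypothetical names, e.g. two runs of the samplers above from different starting values):

# Sketch: convergence diagnostics with the coda package
library(coda)
chains <- mcmc.list(mcmc(chain1), mcmc(chain2))
traceplot(chains)       # traces should overlap as a random scatter about a stable level
autocorr.plot(chains)   # autocorrelations should decay quickly
gelman.diag(chains)     # potential scale reduction factor; close to 1 at convergence
effectiveSize(chains)   # effective number of independent draws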
Dr. Pablo E. Verde 199 Lecture 8 - Regression Models

Lecture 8: Bayesian Regression Models

Dr. Pablo E. Verde 200 Lecture 8 - Regression Models

Modeling Examples

Multiple linear regression: model diagnostics and variable selection
ANOVA models: an alternative to multiple testing
Logistic regression: combining multiple sources of information in risk analysis

Dr. Pablo E. Verde 201 Lecture 8 - Regression Models

Introduction

Standard (and non-standard) regression models can be easily formulated within a Bayesian framework:

Specify a probability distribution for the data
Specify the form of the relationship between response and explanatory variables
Specify a prior distribution for the regression coefficients and any other unknown (nuisance) parameters

Dr. Pablo E. Verde 202 Lecture 8 - Regression Models

Some advantages of a Bayesian formulation in regression modeling include:

Easy to include parameter restrictions and other relevant prior knowledge
It is simple to extend to non-linear regression
We can easily robustify a model
Easy to make inference about functions of regression parameters and/or predictions
We can handle missing data and covariate measurement error
Transparent variable selection by prior modeling of the regression coefficients

Dr. Pablo E. Verde 203 Lecture 8 - Regression Models

Linear regression

Example: Stack loss data, WinBUGS Volume 1

In this example, we illustrate a more complete regression analysis including outlier checking, model adequacy and variable selection methods.
This is a very often analyzed data set from Brownlee (1965, p. 454):

21 daily responses of stack loss yi, the amount of ammonia escaping from industrial chimneys
Covariates: air flow x1, temperature x2 and acid concentration x3
Transformed covariates: z_k,i = (x_k,i − x̄_k)/sd(x_k) for k = 1, 2, 3.

Dr. Pablo E. Verde 204 Lecture 8 - Regression Models

[Figure: pairwise scatter plots of the response Y and the covariates X1, X2 and X3.]

Dr. Pablo E. Verde 205 Lecture 8 - Regression Models

Model 1 specification:

yi ∼ Normal(µi, τ), τ = 1/σ², i = 1, . . . , 21
µi = β0 + β1 z1,i + β2 z2,i + β3 z3,i
σ ∼ Uniform(0.01, 100)
βk ∼ Normal(0, 0.001), k = 0, . . . , 3.

Dr. Pablo E. Verde 206 Lecture 8 - Regression Models

In this model we also calculate some diagnostic quantities. By analogy with classical regression analysis, we calculate the standardized residuals as

di = (yi − E(yi|z)) / √Var(yi|z).

In a Bayesian setting these residuals have posterior distributions as well. They can be used similarly to residual analysis in classical statistics.

Dr. Pablo E. Verde 207 Lecture 8 - Regression Models

We specify the model in WinBUGS as:

model{
  # Model 1
  for (i in 1:N) {
    y[i] ~ dnorm(mu[i], tau)
    mu[i] <- beta0 + beta[1]*z[i,1] + beta[2]*z[i,2] + beta[3]*z[i,3]
    stres[i] <- (y[i] - mu[i]) / sigma
  }
  # Priors
  beta0 ~ dnorm(0, 0.001)
  for (j in 1:p) { beta[j] ~ dnorm(0, 0.001) }   # coefficients independent
  tau <- 1/(sigma*sigma)
  sigma ~ dunif(0.01, 100)   # Gelman's prior
}

Dr. Pablo E. Verde 208 Lecture 8 - Regression Models

We run the model in WinBUGS and look at the possible outliers.
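The data are available in R as the built-in data frame stackloss (the Brownlee data), which is convenient both for building the standardized covariate matrix z used above and for a quick classical fit to compare against. A sketch; the object names y, z, N and p match the WinBUGS code, while data.sl is an arbitrary name for the list that could be passed via R2WinBUGS:

# Sketch: preparing the stack loss data and a classical fit for reference
data(stackloss)
y <- stackloss$stack.loss
z <- scale(stackloss[, c("Air.Flow", "Water.Temp", "Acid.Conc.")])   # standardized covariates
N <- nrow(stackloss); p <- 3
summary(lm(y ~ z))                                       # classical least squares fit
data.sl <- list(y = y, z = as.matrix(z), N = N, p = p)   # data list for WinBUGS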
Dr. Pablo E. Verde 209 Lecture 8 - Regression Models

As an alternative model we replace the Normal likelihood by a t-distribution with ν = 4 degrees of freedom. The idea is to have a model that is more robust against outliers in the y's.

Model 2:

yi ∼ t(µi, τ, ν), τ = 1/σ², i = 1, . . . , 21
µi = β0 + β1 z1,i + β2 z2,i + β3 z3,i
σ ∼ Uniform(0.01, 100)
βk ∼ Normal(0, 0.001), k = 0, . . . , 3.

Dr. Pablo E. Verde 210 Lecture 8 - Regression Models

In the WinBUGS code we modify the model section by commenting out the normal model and adding one line with the model based on the t-distribution, and we run the MCMC again.

#y[i] ~ dnorm(mu[i], tau)
y[i] ~ dt(mu[i], tau, 4)

#DIC Normal
         Dbar    Dhat    pD     DIC
Y     110.537 105.800 4.736 115.273
total 110.537 105.800 4.736 115.273

#DIC t
         Dbar    Dhat    pD     DIC
Y     109.043 104.103 4.940 113.983
total 109.043 104.103 4.940 113.983

A very modest difference between the two models.

Dr. Pablo E. Verde 211 Lecture 8 - Regression Models

Variable Selection

We modify the structure of the distribution of the regression coefficients by adding a common distribution with unknown variance. This is called a ridge regression model, where the βs are assumed exchangeable.

Model 3 specification:

yi ∼ Normal(µi, τ), τ = 1/σ², i = 1, . . . , 21
µi = β0 + β1 z1,i + β2 z2,i + β3 z3,i
σ ∼ Uniform(0.01, 100)
βk ∼ Normal(0, φ), φ = 1/σβ², k = 0, . . . , 3
σβ ∼ Uniform(0.01, 100).

Dr. Pablo E. Verde 212 Lecture 8 - Regression Models

In WinBUGS:

for (j in 1:p) {
  # beta[j] ~ dnorm(0, 0.001)   # coefs independent
  beta[j] ~ dnorm(0, phi)       # coefs exchangeable (ridge regression)
}
phi <- 1/(sigmaB*sigmaB)
sigmaB ~ dunif(0.01, 100)       # Gelman's prior

Dr. Pablo E. Verde 213 Lecture 8 - Regression Models

Dr. Pablo E. Verde 214 Lecture 8 - Regression Models

We see that β3 is not a relevant variable in the model.
Now we try another variable selection procedure, by modifying the distribution of the regression coefficients.

Model 5 specification:

yi ∼ Normal(µi, σ²), i = 1, . . . , 21
µi = β0 + β1 π1 z1,i + β2 π2 z2,i + β3 π3 z3,i
σ ∼ Uniform(0.01, 100)
βk ∼ Normal(0, 100), k = 0, . . . , 3
π1 ∼ Bernoulli(0.5), π2 ∼ Bernoulli(0.5), π3 ∼ Bernoulli(0.5).

Dr. Pablo E. Verde 215 Lecture 8 - Regression Models

In WinBUGS we change the equation of µi:

# mean equation model
mu[i] <- beta0 + beta[1]*pi[1]*z[i,1] + beta[2]*pi[2]*z[i,2] + beta[3]*pi[3]*z[i,3]

# priors
for (j in 1:p) {
  beta[j] ~ dnorm(0, 0.001)
  pi[j] ~ dbern(0.5)
}

We run the model for 40,000 iterations and discard the first 20,000. The results are:

pi[1] 0.9987 0.0367   <- X1 very important
pi[2] 0.8361 0.3702   <- X2 not so clear-cut
pi[3] 0.0425 0.2018   <- X3 definitely not important

Dr. Pablo E. Verde 216 Lecture 8 - Regression Models

Dr. Pablo E. Verde 217 Lecture 8 - Regression Models

ANOVA and Experimental Design

Example: Speed of light measurements by Michelson

This is a classical data set available in R and corresponds to measurements of the speed of light in air, made between 5th June and 2nd July, 1879.
The data consist of five experiments, each consisting of 20 consecutive runs.
The response is the speed of light in km/s, less 299000. The currently accepted value, on this scale of measurement, is 734.5.

We start by comparing the 5 experiments with boxplots:

> library(MASS)
> data(michelson)
> attach(michelson)
> plot(Expt, Speed, main="Speed of Light Data", xlab="Experiment No.", ylab="Speed")

Dr. Pablo E. Verde 218 Lecture 8 - Regression Models

[Figure: boxplots of the speed of light measurements for the five experiments.]

Dr. Pablo E. Verde 219 Lecture 8 - Regression Models

ANOVA model: Bayesian analysis in WinBUGS

The Bayesian approach to the ANOVA model is similar to the regression model, with the mean equation following the group means parametrization. So we have nj observations in J groups, with n = Σ(j=1..J) nj.
Model 1:

yi,j ∼ Normal(µj, τ), τ = 1/σ², i = 1, . . . , nj,
µj = µ + αj, j = 1, . . . , J,
αj ∼ Normal(0, 0.001), j = 2, . . . , J,
µ ∼ Normal(0, 0.001),
σ ∼ Uniform(0.01, 100).

Dr. Pablo E. Verde 220 Lecture 8 - Regression Models

The WinBUGS implementation of Model 1 is:

model{
  for(i in 1:n){
    y[i] ~ dnorm(mu[i], tau)
    mu[i] <- mu0 + alpha[a[i]]
  }
  # Corner constraint parametrization
  # alpha[1] <- 0
  # Sum-to-zero parametrization
  alpha[1] <- -sum(alpha[2:J])
  # group means:
  for(i in 1:J){ m.g[i] <- mu0 + alpha[i] }

Dr. Pablo E. Verde 221 Lecture 8 - Regression Models

  # mean differences:
  d[1] <- (m.g[2] - m.g[1])
  d[2] <- (m.g[3] - m.g[1])
  ...
  # Priors
  mu0 ~ dnorm(0, 0.0001)
  for(j in 2:J){ alpha[j] ~ dnorm(0, 0.0001) }
  tau <- 1/(sigma*sigma)
  sigma ~ dunif(0.01, 100)   # Gelman's prior
}

Dr. Pablo E. Verde 222 Lecture 8 - Regression Models

One serious problem of ANOVA modeling is the lack of variance homogeneity between groups. We can extend our Bayesian setup to include this data feature as follows:

Model 2:

yi,j ∼ Normal(µj, τj), τj = 1/σj², i = 1, . . . , nj,
µj = µ + αj, j = 1, . . . , J,
αj ∼ Normal(0, 0.001), j = 2, . . . , J,
µ ∼ Normal(0, 0.001),
σj ∼ Uniform(0.01, 100), j = 1, . . . , J.

This model includes a structural dispersion sub-model for each treatment group.

Dr. Pablo E. Verde 223 Lecture 8 - Regression Models

The WinBUGS implementation of Model 2 changes to:

{
  for(i in 1:n){
    #y[i] ~ dnorm(mu[i], tau)
    y[i] ~ dnorm(mu[i], tau[i])
    mu[i] <- mu0 + alpha[a[i]]
    tau[i] <- gamma[a[i]]
  }
  # Priors for the group dispersion model
  for(j in 1:J){
    gamma[j] ~ dgamma(0.001, 0.001)
    sigma2[j] <- 1/gamma[j]
    sigma[j] <- pow(sigma2[j], 0.5)
  }

Dr. Pablo E. Verde 224 Lecture 8 - Regression Models

For models with more than 3 groups the estimation based on simple means can be improved by adding an exchangeability structure to the αj's (Stein 1956, Lindley 1962, Casella and Berger, p. 574). This assumes a sub-model for the αj's in the following way:

Model 3:

yi,j ∼ Normal(µj, τj), τj = 1/σj², i = 1, . . . , nj,
µj = µ + αj, j = 1, . . . , J,
αj ∼ Normal(0, φ), φ = 1/σα², j = 2, . . . , J,
µ ∼ Normal(0, 0.001),
σα ∼ Uniform(0.01, 100),
σj ∼ Uniform(0.01, 100), j = 1, . . . , J.

Dr. Pablo E. Verde 225 Lecture 8 - Regression Models

The WinBUGS implementation of Model 3 changes to:

# Priors
mu0 ~ dnorm(0, 0.0001)
for(j in 2:J){ alpha[j] ~ dnorm(0, xi) }
xi <- 1/(sigma.alpha*sigma.alpha)
sigma.alpha ~ dunif(0.01, 100)   # Gelman's prior

Dr. Pablo E. Verde 226 Lecture 8 - Regression Models

One alternative to exchangeability between the αj's is to use a mixture model to investigate the structure of the data, for example:

Model 4:

yi,j ∼ Normal(µj, τj), τj = 1/σj², i = 1, . . . , nj,
µj = µ + αj πj, j = 1, . . . , J,
αj ∼ Normal(0, 0.001), j = 2, . . . , J,
µ ∼ Normal(0, 0.001),
πj ∼ Bernoulli(0.5), j = 2, . . . , J,
σj ∼ Uniform(0.01, 100), j = 1, . . . , J.

Dr. Pablo E. Verde 227 Lecture 8 - Regression Models

The WinBUGS implementation of Model 4 changes to:

# group means mixture model:
for(i in 1:J){
  m.g[i] <- mu0 + alpha[i]*pi[i]
  pi[i] ~ dbern(0.5)
}

The next slides present some graphical results of the main features of each model. The full WinBUGS script for this analysis is anova.odc; this includes some other calculations for residual analysis as well.
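The michelson data can be passed to these ANOVA models from R in the usual R2WinBUGS way. A sketch, assuming the model code has been saved as a plain-text file, here called anova-model.txt (a hypothetical name; anova.odc above is the full WinBUGS document), that bugsdir is set as in Lecture 5, and with an illustrative choice of monitored nodes:

# Sketch: preparing the Michelson data and running the ANOVA model via R2WinBUGS
library(MASS); library(R2WinBUGS)
data(michelson)
y <- michelson$Speed
a <- as.numeric(michelson$Expt)       # group indicator 1, ..., 5
n <- length(y); J <- nlevels(michelson$Expt)
data.an <- list(y = y, a = a, n = n, J = J)
par.an  <- c("m.g", "d", "sigma")
m.anova <- bugs(data.an, inits = NULL, par.an, "anova-model.txt",
                n.chains = 1, n.iter = 20000,
                bugs.directory = bugsdir, working.directory = getwd())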
ANOVA: classical model

[Figure: posterior distributions of the mean differences between all groups, µk − µl (k ≠ l), Model 1.]

ANOVA: variance heterogeneity model

[Figure: posterior distributions of the σj, Model 2.]

ANOVA: variance heterogeneity model

[Figure: posterior distributions of the mean differences between all groups, µk − µl (k ≠ l), Model 2.]

ANOVA: exchangeability model

[Figure: posterior distributions of the mean differences between all groups under exchangeability, Model 3.]

ANOVA: mixture models

[Figure: posterior distributions of the group means under the mixture distribution, Model 4.]

Logistic regression: combining multiple sources of information in risk analysis

Example: The Challenger O-Ring Data: A Bayesian Risk Analysis

Historical background

The Space Shuttle Challenger's final mission was on January 28th 1986, on an unusually cold morning (31 F / -0.5 C).
It disintegrated 73 seconds into its flight after an O-ring seal in its right solid rocket booster failed.
The uncontrolled flame caused structural failure of the external tank, and the resulting aerodynamic forces broke up the orbiter.
All seven members of the crew were killed.

Technical background

[Figure slide.]

The Challenger's O-Ring Thermal-Distress Data

Example: Regression modelling with a change point

On the night of January 27, 1986, the night before the space shuttle Challenger accident, there was a three-hour teleconference among people at Morton Thiokol (manufacturer of the solid rocket motor), Marshall Space Center and Kennedy Space Center. The discussion focused on the forecast of a 31 F temperature at launch time the next morning, and on the effect of low temperature on O-ring performance.

The available data from previous shuttle flights consist of the occurrence of thermal distress (yes, no) and the temperature at launch time. Let's take a look at these data and fit a logistic regression in R:

> # distress: yes = 1, no = 0
> y = c(1,1,1, 1,0,0,0, 0,0,0,0, 0,1,1,0, 0,0,1,0, 0,0,0,0)
> # temperature at launch time
> x = c(53,57,58, 63,66, 67,67, 67,68, 69,70, 70,70, 70,72, 73,75, 75,76, 76,78, 79,81)
> # number of previous flights
> N = 23
> summary(f.b1 <- glm(y ~ x, family=binomial))
...
AIC: 24.315
> summary(f.b2 <- update(f.b1, family=binomial(link=cloglog)))
...
AIC: 23.531

[Figure: probability of at least one O-ring incident vs. temperature in F, with the observed data.]

These data have been analyzed several times in the literature. I found that fitting a change point model improves on previously published analyses. The model is:

Pr(yi = 1) = logit⁻¹(β0 + β1 zi), where zi = 1 if xi ≥ K and 0 otherwise.

In this model K is unknown and represents the temperature above which the probability of distress is lowest. The parameter β1 represents the reduction of risk at temperatures greater than K.
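As a quick plausibility check before the full Bayesian fit (this sketch is not part of the original slides), one can profile K in R by refitting the logistic regression over a grid of candidate change points and comparing deviances:

> # Profile the change point K over a grid (hypothetical helper analysis)
> y  <- c(1,1,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0)
> x  <- c(53,57,58,63,66,67,67,67,68,69,70,70,70,70,72,73,75,75,76,76,78,79,81)
> Ks <- 60:75
> dev <- sapply(Ks, function(K) deviance(glm(y ~ I(x >= K), family = binomial)))
> Ks[which.min(dev)]   # candidate K with the smallest deviance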
This model is implemented in WinBUGS as follows:

model {
  for(i in 1 : N) {
    y[i] ~ dbern(p[i])
    logit(p[i]) <- beta0 + beta1 * z[i]
  }
  for(i in 1:N){ z[i] <- step(x[i] - K) }   # step = 0 below K, 1 from K upwards
  # priors
  K ~ dunif(53, 81)
  beta0 ~ dnorm(0.0, 0.01)
  beta1 ~ dnorm(0.0, 0.01)
}

We run the model with R2WinBUGS:

> data.b <- c("x", "y", "N")
> par.b  <- c("beta0", "beta1", "K")
>
> chall.1 <- bugs(data.b, inits=NULL, par.b, "challenger-1.txt",
+                n.chains=1, n.iter=20000, n.thin=1,
+                bugs.directory = bugsdir,
+                working.directory = getwd(), clearWD=TRUE, debug=TRUE)
> print(chall.1)

           mean    sd   2.5%    25%    50%    75%   97.5%
beta0       6.2   3.9   -0.2    3.3    5.8    8.4    15.3
beta1      -8.0   3.9  -17.2  -10.2   -7.6   -5.1    -2.1
K          63.8   2.9   58.2   63.1   64.1   65.1    67.0
deviance   19.1   2.8   16.6   16.9   18.0   20.4    26.9

pD = 2.4 and DIC = 21.5

[Figure: posterior histogram of K; fitted probability of at least one O-ring incident vs. temperature in F.]

Example: Probabilistic Risk Assessment

[Figure slide.]

A catastrophic failure of a field joint is expected when all four events occur during the operation of the solid rocket boosters:

pfield = pa × pb × pc × pd

Since there are 6 field joints per launch, and assuming that all 6 joint failures are independent, the probability of at least one failure is

Risk = 1 − (1 − pfield)⁶

pa is calculated from the logistic regression model as

pa = Pr(O-Ring thermal distress at 31 F).

The available data for the other events were very sparse:

Event b: 2 in 7 flights
Event c: 1 in 2 flights
Event d: 0 in 1 flight

We model the probabilities of events b, c and d with a beta-binomial model with uninformative priors Beta(0.5, 0.5). (A small numerical sketch of this risk calculation is given at the end of this example.)

In WinBUGS we modify the model file "challenger-1.txt" by adding these risk calculations. We also add the calculation of the risk at temperature K or greater. To run the model in R we add the data for events b, c, and d:

# Risk analysis
y.b <- 2; y.c <- 1; y.d <- 0
data.r <- c("x", "y", "N", "y.b", "y.c", "y.d")
par.r  <- c("beta0", "beta1", "K", "p.cat", "p.catK")
chall.2 <- bugs(data.r, inits=NULL, par.r, "challenger-1.txt",
                n.chains = 1, ...

> print(chall.2, 3)
           mean     sd    2.5%     25%     50%     75%   97.5%
K        63.726  2.653  58.190  63.090  64.120  65.120  66.010
p.cat     0.173  0.204   0.000   0.019   0.088   0.258   0.736
p.catK    0.038  0.061   0.000   0.003   0.014   0.047   0.213

Summary results of our analysis:

The risk of a catastrophic failure at a launch temperature of 31 F is 17%.
The risk of a catastrophic failure at temperatures greater than 63 F is 3.8%.
The odds ratio of the risk between these two temperatures is 9.8.

One interesting historical note: the Rogers Commission, appointed by President Reagan to investigate the causes of the accident, pointed out:

... a mistake in the analysis of the thermal distress data was that the flights with zero incidents were left off because it was felt that these flights did not contribute any information about the temperature effect. (p. 145)
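The following R lines sketch that risk calculation numerically. They are not from the original slides: p.a below is only a placeholder distribution standing in for the MCMC draws of Pr(thermal distress at 31 F) from the change-point model, while the Beta posteriors for events b, c and d follow directly from the Beta(0.5, 0.5) priors and the sparse counts given above.

> # Monte Carlo sketch of the field-joint risk calculation (placeholder p.a draws)
> set.seed(1)
> M   <- 10000
> p.a <- rbeta(M, 2, 8)               # placeholder for posterior draws of p_a
> p.b <- rbeta(M, 0.5 + 2, 0.5 + 5)   # event b: 2 in 7 flights, Beta(0.5, 0.5) prior
> p.c <- rbeta(M, 0.5 + 1, 0.5 + 1)   # event c: 1 in 2 flights
> p.d <- rbeta(M, 0.5 + 0, 0.5 + 1)   # event d: 0 in 1 flight
> p.field <- p.a * p.b * p.c * p.d    # all four events occur at one field joint
> risk    <- 1 - (1 - p.field)^6      # at least one of the 6 field joints fails
> mean(risk); quantile(risk, c(0.025, 0.975))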
Practical Exercise: a nonconjugate nonlinear model

Volume 2 of the WinBUGS help: Dugongs

Originally, Carlin and Gelfand (1991) consider data on length yi and age xi measurements for 27 dugongs (sea cows) and use the following nonlinear growth curve, which has no inflection point and an asymptote as xi tends to infinity:

yi ∼ Normal(µi, σ²)
µi = α − β γ^xi,

where α, β > 0 and γ ∈ (0, 1).

Practical

Open the file in WinBUGS and run the model.
Change the last observation 2.57 to 2.17 and run the model again. Do you see different results?
Modify the WinBUGS code as:

# y[i] ~ dnorm(mu[i], tau)
y[i] ~ dt(mu[i], tau, df)   # fit a robust t distribution
...
df <- 5

Run the model with this change and compare the results.

Lecture 9: Introduction to Hierarchical Models

Learning from the information of the others ...

During the previous lectures we have seen several examples where we performed a single statistical analysis. Now suppose that, instead of a single analysis, you have several of them, each giving results independently of the others.

One might ask: should we combine these independent results in a single global analysis?

The general answer is: YES!!

Hierarchical models give us the statistical framework to combine multiple sources of information in a single analysis. Informally, by combining several individual results, each individual analysis is improved by the information we have from the others. This information is summarized in an empirical prior distribution.

From the classical point of view, Charles Stein demonstrated in the 1950s a fundamental theoretical result which shows the benefit of combining individual results in a single analysis.

Charles Stein (born 22nd of March 1920) ... and he is still working every day!

Hierarchical Models

It has become common practice to construct statistical models which reflect the underlying complexity of the problem under study: different patterns of heterogeneity, dependence, mis-measurement, missing data, etc.

Statistical models which reflect the complexity of the data involve multiple parameters. Examples are:

"Study effects" in meta-analysis
"Subject effects" in growth curve models
"Identification of hidden processes" in a sequence of observations
"Relative risks of disease outcomes" in different areas/time periods
"Frailty effects" in correlated survival data
...

Statistical Inference for Multiple Parameters

How do we make inference on multiple parameters {θ1, . . . , θN} measured in N units (persons, centers, areas, ...) which are related or connected by the structure of the problem?

We can identify three different scenarios (a small numerical illustration follows this list):

1. Identical parameters: all the θ's are identical, i.e. all the data can be pooled and the individual units ignored.
2. Independent parameters: all the θ's are entirely unrelated, i.e. the results from each unit can be analyzed independently.
3. Exchangeable parameters: the θ's are assumed to be 'similar' in the sense that the "labels" convey no information.
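The toy R lines below (hypothetical numbers, not from the slides) illustrate scenarios 1 and 2 for binomial data from a few units; scenario 3 is the hierarchical model developed in the rest of this lecture:

> # Toy illustration: complete pooling vs. no pooling (hypothetical counts)
> r <- c(2, 7, 3, 5)        # successes in each of 4 units
> n <- c(10, 20, 15, 12)    # sample sizes
> sum(r) / sum(n)           # scenario 1: identical parameters (pool everything)
> round(r / n, 2)           # scenario 2: independent parameters (analyze separately)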
Exchangeability

The assumption of exchangeable units is mathematically equivalent to assuming that the θ's are drawn at random from some population distribution.

"Exchangeability" expresses formally the idea that we find no systematic reason to distinguish the individual random variables X1, . . . , Xn. We say that X1, . . . , Xn are exchangeable if the probability that we assign to any set of potential outcomes, p(x1, . . . , xn), is unaffected by permutations of the labels attached to the variables.

Example: Suppose that X1, X2, X3 are the outcomes of three patients enrolled in a clinical trial, where Xi = 1 indicates a positive reaction to the treatment and Xi = 0 no reaction. We may judge

p(X1 = 1, X2 = 0, X3 = 1) = p(X2 = 1, X1 = 0, X3 = 1) = p(X1 = 1, X3 = 0, X2 = 1)

i.e. the probability of getting 2 positive outcomes is NOT affected by which particular patients the positive outcomes come from.

Note that this is a very strong assumption. In reality we may expect patients to behave in different ways; for example, some may fail to comply with the treatment.

Note that "exchangeability" does not mean we believe that X1, X2, . . . , Xn are independent!

Exchangeability and Hierarchical Models

Suppose yij is the outcome for individual j in unit i, with unit-specific parameter θi.

The assumption of partial exchangeability of individuals within units is represented by

yij ∼ p(yij | θi)
θi ∼ p(θi | φ)

The assumption of exchangeability of units can be represented by

θi ∼ p(θi | φ)
φ ∼ p(φ)

where p(φ) can be considered as a common prior for all units, but with unknown parameters.

Assuming that θ1, . . . , θN are drawn from some common prior distribution whose parameters are unknown is known as a hierarchical or multilevel model.

Bayesian statistical inference is based on

p(θ1, . . . , θN, φ | y) ∝ p(φ) p(θ1, . . . , θN | φ) p(y | θ1, . . . , θN, φ)

The dimension of (θ1, . . . , θN) can be very large in practice. Empirical Bayes techniques omit p(φ); they are useful in practice, but they may bias variability estimates by reusing y to estimate φ.

Exchangeability: some further comments

Note that there does not need to be any actual sampling; perhaps these N units are the only ones that exist. This is very common in meta-analysis.
The probability structure is a consequence of the belief in exchangeability rather than a physical randomization mechanism.
We emphasize that an assumption of exchangeability is a judgement based on our knowledge of the context.

Hierarchical models and shrinkage

Suppose that in each unit we observe a response yi assumed to have a Normal likelihood

yi ∼ N(θi, s²i)

The unit means θi are assumed to be exchangeable and to have a Normal distribution

θi ∼ N(µ, σ²)

where µ and σ² are "hyper-parameters", for the moment assumed known.

After observing yi, Bayes theorem gives

θi | yi ∼ N(wi µ + (1 − wi) yi, (1 − wi) s²i)

where wi = s²i / (s²i + σ²) is the weight given to the prior mean.
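A small numerical check of this shrinkage formula (the numbers are hypothetical, chosen only for illustration):

> # Shrinkage estimates for three units with known hyper-parameters
> y      <- c(0.5, 1.8, -0.3)     # observed unit means y_i
> s2     <- c(0.40, 0.10, 0.90)   # within-unit variances s_i^2
> mu     <- 1.0                   # prior (population) mean
> sigma2 <- 0.25                  # between-unit variance
> w      <- s2 / (s2 + sigma2)    # weight on the prior mean
> post.mean <- w * mu + (1 - w) * y
> post.var  <- (1 - w) * s2
> round(cbind(y, w, post.mean, post.var), 3)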
An exchangeable model therefore leads to inferences for each unit that have narrower intervals than if the units were assumed independent, but are shrunk towards the prior mean response.

wi controls the "shrinkage" of the estimate towards µ, and the reduction in the width of the interval for θi.
The shrinkage wi depends on the precision of the individual unit i relative to the variability between units.

Example: Toxoplasmosis data

These data present the relation between rainfall and the proportion of people with toxoplasmosis for 34 cities in El Salvador (Efron, 1986).
The question is how to combine the results of 34 different cities in a single model.
The data have been used to illustrate a non-linear relationship between the amount of rain and toxoplasmosis prevalence.
We illustrate a series of hierarchical models; first, some preliminary visualization.

[Figure: "Toxoplasmosis Rates: El Salvador, 34 cities", observed rate for each city.]

[Figure: toxoplasmosis rate vs. amount of rain (mm), and toxoplasmosis rate vs. sample size, for the 34 cities.]

Pooled approach: To start, we ignore the possible (complex) relation between rainfall and proportion and fit the same beta-binomial model to all the cities:

∏_{i=1}^{34} p(ri | π, ni) = ∏_{i=1}^{34} Binomial(ni, π)

p(π) = Beta(a, b)

p(π | r, n) ∝ [ ∏_{i=1}^{34} p(ri | π, ni) ] p(π) = Beta( a + Σi ri , b + Σi (ni − ri) )

(A short R sketch of this conjugate calculation is given just before the WinBUGS code below.)

Some comments:

Is it reasonable to assume a common probability π of toxoplasmosis for each city?
The beta-binomial model assumes that each outcome is independent and identically distributed according to the binomial probability distribution with parameter π.
Does this model adequately describe the random variation in outcomes for each city?
Are the city rates more variable than our model assumes?

Question: How do we model the excess of variation in the data?

Modeling the excess of variation

Model 1: Combining information with a Generalized Linear Mixed Model (GLMM)

ri ∼ Binomial(ni, πi)
logit(πi) ∼ N(µ, σ²)
µ ∼ N(0, 100)
σ ∼ Uniform(0.01, 10)

This model combines information between cities in a single model.

Model 2: Modeling each city rate independently

ri ∼ Binomial(ni, πi)
logit(πi) ∼ N(0, 100).

This model does NOT use information between cities. Each rate πi is estimated independently.
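As referenced above, the pooled posterior is available in closed form. A minimal R sketch follows; the Beta(1, 1) prior is an assumption made only for illustration, and the r and n vectors are those used in the WinBUGS data list below.

> # Conjugate pooled beta-binomial posterior for the common rate pi
> r <- c(2,3,1,3,2,3,2,7,3,8,7,0,15,4,0,6,0,33,4,5,2,0,8,41,24,7,46,9,23,53,8,3,1,23)
> n <- c(4,20,5,10,2,5,8,19,6,10,24,1,30,22,1,11,1,54,9,18,12,1,11,77,51,16,82,13,43,75,13,10,6,37)
> a <- 1; b <- 1                                 # assumed Beta(1, 1) prior
> post.a <- a + sum(r); post.b <- b + sum(n - r)
> qbeta(c(0.025, 0.5, 0.975), post.a, post.b)    # posterior quantiles for pi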
In WinBUGS we run both models simultaneously as follows:

model{
  for( i in 1 : I ) {
    r[i] ~ dbin(p[i,1], n[i])
    logit(p[i,1]) <- a[i]
    a[i] ~ dnorm(alpha1, tau1)     # Exchangeable

    r2[i] ~ dbin(p[i,2], n[i])     # Copy of r[i]
    logit(p[i,2]) <- b[i]
    b[i] ~ dnorm(0, 0.01)          # Independent
  }
  tau1 <- 1/(sigma1*sigma1)
  # Priors
  sigma1 ~ dunif(0.01, 5)
  alpha1 ~ dnorm(0, 0.01)
}

# data
list(I=34,
     r  = c(2,3,1,3,2,3,2,7,3,8,7,0,15,4,0,6,0,33,4,5,2,0,8,41,24,7,46,9,23,53,8,3,1,23),
     r2 = c(2,3,1,3,2,3,2,7,3,8,7,0,15,4,0,6,0,33,4,5,2,0,8,41,24,7,46,9,23,53,8,3,1,23),
     n  = c(4,20,5,10,2,5,8,19,6,10,24,1,30,22,1,11,1,54,9,18,12,1,11,77,51,16,82,13,43,75,13,10,6,37))

[Figure: "Toxoplasmosis Rates: El Salvador, 34 cities", exchangeable and independent estimates for each city.]

[Figure: shrinkage effect, ML vs. Bayes estimates of the city rates.]

Results:

DIC for Model 1 (GLMM) = 145.670
DIC for Model 2 (independent) = 176.504

Clearly Model 1, which combines information between cities, is the winner.
The use of Model 1 has the effect of reducing the variability between cities.
The use of Model 1 gives better results at the level of the individual city.

Clear conclusion: we should take advantage of modeling multiple results simultaneously!

Example: Toxoplasmosis data, continued ...

Two important questions are:

What is the influence of the amount of rain on toxoplasmosis prevalence?
What is the influence of the number of participants in each town?

Model 3:

ri ∼ Binomial(ni, πi)
logit(πi) = ai + β1 × (xi,1 − x̄1) + β2 × (xi,2 − x̄2)
ai ∼ N(α, σ²)
σ ∼ Uniform(0.01, 10)
α, β1, β2 ∼ N(0, 100)

In WinBUGS:

model {
  for(i in 1:N) { log.n[i] <- log(n[i]) }
  for( i in 1 : N ) {
    r[i] ~ dbin(p[i], n[i])
    a[i] ~ dnorm(alpha, tau)
    logit(p[i]) <- a[i] + beta1 * (rain[i] - mean(rain[]))
                        + beta2 * (log.n[i] - mean(log.n[]))
  }
  # Priors
  tau <- 1/(sigma*sigma)
  sigma ~ dunif(0.01, 5)
  alpha ~ dnorm(0, 0.01)
  beta1 ~ dnorm(0, 0.01)
  beta2 ~ dnorm(0, 0.01)
}

[Figure: posterior densities of β1 and β2.]

The posterior of β1 clearly indicates no linear influence of the amount of rain. The posterior of β2 shows a clear trend.

Example: Ranking of the eighteen baseball players (Efron and Morris, 1977)

How can we rank the players' ability? Who was the best player of the 1970 season?
We use the ranking function in WinBUGS to answer this question.

Name        hits/AB   Observed avg (ML)   "TRUTH"   James-Stein
Clemente     18/45         .400             .346        .290
Robinson     17/45         .378             .298        .286
...            ...          ...              ...         ...
Total squared error         .077                         .022

[Figure: ML vs. Bayes batting averages.]
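The posterior ranks shown on the next slides are computed inside WinBUGS with the rank(.) function; the same quantity can be obtained in R from a matrix of MCMC draws, as in this sketch (the draws here are simulated placeholders, not the actual model output):

> # Posterior ranks from MCMC draws (placeholder draws for illustration)
> set.seed(42)
> draws <- matrix(rbeta(1000 * 18, 50, 150), ncol = 18)   # iterations x players
> ranks <- t(apply(draws, 1, rank))    # rank of each player at every iteration
> round(colMeans(ranks), 1)            # posterior mean rank per player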
Posteriors: Prob(Hits)

[Figure: posterior distributions of the hitting probability for the eighteen players (Clemente, Robinson, Howard, Johnstone, Berry, Spencer, Kessinger, Alvarado, Santo, Swoboda, Unser, Williams, Scott, Petrocelli, Rodriguez, Campaneris, Munson, Alvis).]

model {
  for( i in 1 : N ) {
    r[i] ~ dbin(p[i], n[i])
    logit(p[i]) <- b[i]
    b[i] ~ dnorm(mu, tau)
    p.rank[i] <- rank(p[], i)
  }
  # hyper-priors
  mu ~ dnorm(0.0, 1.0E-6)
  sigma ~ dunif(0, 100)
  tau <- 1/(sigma*sigma)
}

# Data
list(r = c(145, 144, 160, 76, 128, 140, 167, 41, 148, 57, 83, 79, 142, 152, 52, 168, 137, 21),
     n = c(412, 471, 566, 320, 463, 511, 631, 183, 555, 245, 322, 315, 480, 583, 231, 603, 453, 115),
     N = 18)

[Figure: rank distributions for Robinson, Howard and Clemente.]

[Figure: left panel, posterior distributions of the ranks; right panel, posterior distributions of the performances.]

Summary on hierarchical models

Hierarchical models allow us to "borrow strength" across units.
The posterior distribution of θi for each unit borrows information from the likelihood contributions of all the other units; estimation is more efficient.
MCMC allows considerable flexibility in the choice of the random-effects distribution (not restricted to normal random errors).
MCMC makes it possible to answer difficult questions, e.g. ranking estimation of random effects.
It is easy to extend to more complicated models (e.g. non-linear repeated measurements, etc.).

Summary of the course

Bayesian statistics: a new interpretation of probability as a subjective mental construct.
Relationship with classical statistics.
Uses of the prior distribution: sensitivity analysis, prior predictions, model building, etc.
Making inference using MCMC.
Thinking differently about regression models.
Different modeling tools: DAGs, DIC and WinBUGS.
Introduction to hierarchical modeling.

Finally

Nice to see that most people attended and survived the course!!
Hope that everybody will now be an enthusiastic modern Bayesian ;-)
Thank you very much ... Muchas gracias !!!