Introduction to Bayesian Data Analysis using R and WinBUGS Dr. Pablo E. Verde Department of Mathematics and Statistics Masaryk University Czech Republic April 2013 pabloemilio.verde@uni-duesseldorf.de Dr. Pablo E. Verde 1 Overview of the course Day 1 Lecture 1: Introduction to Bayesian Inference Lecture 2: Bayesian analysis for single parameter models Lecture 3: Prior distributions: univariate Dr. Pablo E. Verde 2 Day 2 Lecture 4: Bayesian analysis for multiple parameter models with R Lecture 5: An introduction to WinBUGS Lecture 6: Multivariate models with WinBUGS Day 3 Lecture 7: An introduction to MCMC computations Lecture 8: Bayesian regression with WinBUGS Lecture 9: Introduction to Hierarchical Statistical modeling Dr. Pablo E. Verde 3 Learning objectives Understanding of the potential role of Bayesian methods for making inference about real-world problems Insight into modern computations techniques used for Bayesian analysis Learning Bayesian statistical analysis with R and WinBUGS An interest in using Bayesian methods in your own field of work Dr. Pablo E. Verde 4 Style and emphasis Immediately applicable methods rather than latest theory Attention to real problems: case studies Implementation in R and WinBUGS (although not a full tutorial) Focus on statistical modeling rather than running code, checking convergence etc. Make more emphasis to the complementary aspects of Bayesian Statistics to Classical Statistics rather than one vs. the other Dr. Pablo E. Verde 5 Recommended bibliography The BUGS Book: A Practical Introduction to Bayesian Analysis. David Lunn; Chris Jackson; Nicky Best; Andrew Thomas; David Spiegelhalter. CRC Press, October 3, 2012. Bayesian Data Analysis (Second edition). Andrew Gelman, John Carlin, Hal Stern and Donald Rubin. 2004 Chapman & Hall/CRC. Bayesian Computation with R (Second edition). Jim Albert. 2009. Springer Verlag. An introduction of Bayesian data analysis with R and BUGS: a simple worked example. Verde, P.E. Estadistica (2010), 62, pp. 21-44 Dr. Pablo E. Verde 6 Lecture 1: Introduction to Modern Bayesian Inference Lecture 1: Introduction to Modern Bayesian Inference ”I shall not assume the truth of Bayes’ axiom (...) theorems which are useless for scientific purposes.” -Ronald A. Fisher (1935) The Design of Experiments, page 6. Dr. Pablo E. Verde 7 Lecture 1: Introduction to Modern Bayesian Inference How did it all start? In 1763, Reverend Thomas Bayes wrote: PROBLEM. Given the number of times in which an unknown event has happened and faild: Requiered the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named. In modern language, given r ∼ Binomial(θ, n), what is Pr(θ1 < θ < θ2|r, n) ? Dr. Pablo E. Verde 8 Lecture 1: Introduction to Modern Bayesian Inference Example: surgical procedure Suppose a hospital is considering a new high-risk operation Experience in other hospitals indicates that the risk θ for each patient is around 10 % It would be surprising to be less than 3% or more than 20% We can directly express uncertainty about the patient risk θ with a probability distribution Dr. Pablo E. Verde 9 Lecture 1: Introduction to Modern Bayesian Inference 0 10 20 30 40 50 02468 Mortality rate (in %) Density(mortalityrate) Probability that mortality risk is greater than 15% is Pr(θ > 0.15) = 0.17 Dr. Pablo E. Verde 10 Lecture 1: Introduction to Modern Bayesian Inference Why a direct probability distribution? 
Tells us what we want: what are plausible values for the parameter of interest? No P-values: just calculate relevant tail areas No confidence intervals: just report central area that contains 95% of distribution Easy to make predictions (see later) Fits naturally into decision analysis / risk analysis / cost-effectiveness analysis There is a procedure for adapting the distribution in the light of additional evidence: i.e. Bayes theorem allows us to learn from experience Dr. Pablo E. Verde 11 Lecture 1: Introduction to Modern Bayesian Inference What about disadvantages? Requires the specification of what we thought before new evidence is taken into account: the prior distribution Explicit allowance for quantitative subjective judgment in the analysis Analysis may be more complex than a traditional approach Computation may be more difficult Currently no established standards for Bayesian reporting Dr. Pablo E. Verde 12 Lecture 1: Introduction to Modern Bayesian Inference Why does Bayesian statistics is popular today? Bayesian methods optimally combine multiple sources of information in a common model The computational revolution produced by the rediscovery of Markov chain Monte Carlo (MCMC) techniques in statistics Free available software implementation of MCMC (e.g. WinBUGS, JAGS, STAN, large number of packages in R, etc.) As a result, we can routinely construct sophisticated statistical models that may reflect the complexity for phenomena of interest Dr. Pablo E. Verde 13 Lecture 1: Introduction to Modern Bayesian Inference Bayes theorem for observable quantities Let A and B be events; then it is provable from axioms of probability theory that P(A|B) = P(B|A)P(A) P(B) P(A) is the marginal probability of A, i.e., prior to taking account of the information in B P(A|B) is the conditional probability of A given B, i.e., posterior to taking account of the value of B P(B|A) is the conditional probability of B given A P(B) is the marginal probability of B Sometimes is useful to work with P(B) = P(A)P(B|A) + P(¯A)P(B|¯A), which is curiously called ”extending the conversation” (Lindley 2006, pag. 68) Dr. Pablo E. Verde 14 Lecture 1: Introduction to Modern Bayesian Inference Example: Problems with statistical significance (Simon, 1994) Suppose that a priory only 10% of clinical trials are truly effective treatments Assume each trial is carried out with a design with enough sample size such that α = 5% and power 1 − β = 80% Question: What is the chance that the treatment is true effective given a significant test results? p(H1|”significant results”)? Dr. Pablo E. Verde 15 Lecture 1: Introduction to Modern Bayesian Inference Let A be the event that H1 is true, then p(H1) = 0.1 Let B be the event that the statistical test is significant We want p(A|B) = p(H1|”significant results”) We have: p(B|A) = p(”significant results”|H1) = 1 − β = 0.8 We have: p(B|A) = p(”significant results”|H0) = α = 0.05 Now, Bayes theorem says p(H1|”significant results”) = (1 − β) × 0.1 (1 − β) × 0.1 + α × 0.9 = 0.8 × 0.1 0.8 × 0.1 + 0.05 × 0.9 = 0.64 Answer: This says that if truly effective treatments are relatively rare, then a ”statistically significant” results stands a good chance of being a false positive. Dr. Pablo E. Verde 16 Lecture 1: Introduction to Modern Bayesian Inference Bayesian inference for unknown quantities Makes fundamental distinction between: Observable quantities y, i.e., data. 
Unknown quantities θ, that can be statistical parameters, missing data, predicted values, mismeasured data, indicators of variable selected, etc. Technically, in the Bayesian framework parameters are treated as values of random variables. Differences with classical statistical inference: In Bayesian inference, we make probability statements about model parameters In the classical framework, parameters are fixed non-random quantities and the probability statements concern the data Dr. Pablo E. Verde 17 Lecture 1: Introduction to Modern Bayesian Inference Bayesian Inference Suppose that we have observed some data y We want to make inference about unknown quantities θ: model parameters, missing data, predicted values, mismeasured data, etc. The Bayesian analysis starts like a classical statistical analysis by specifying the sampling model: p(y|θ) this is the likelihood function. From a Bayesian point of view, θ is unknown so should have a probability distribution reflecting our uncertainty about it before seeing the data. We need to specify a prior distribution p(θ) Together they define a full probability model: p(y, θ) = p(y|θ)p(θ) Dr. Pablo E. Verde 18 Lecture 1: Introduction to Modern Bayesian Inference Then we use the Bayes theorem to obtain the conditional probability distribution for unobserved quantities of interest given the data: p(θ|y) = p(θ)p(y|θ) p(θ)p(y|θ)dθ ∝ p(θ)p(y|θ) This is the posterior distribution for θ, posterior ∝ likelihood × prior. Dr. Pablo E. Verde 19 Lecture 1: Introduction to Modern Bayesian Inference Inference with binary data Example: Inference on proportions using a discrete prior Suppose I have 3 coins in my pocket: 1. A biased coin: p(heads) = 0.25 2. A fair coin: p(heads) = 0.5 3. A biased coin: p(heads) = 0.75 I randomly select one coin, I flip it once and it comes a head. What is the probability that I have chosen coin number 3 ? Dr. Pablo E. Verde 20 Lecture 1: Introduction to Modern Bayesian Inference Inference with binary data 1. Let y = 1 the event that I observed a head 2. Let θ denote the probability of a head: θ ∈ (0.25, 0.5, 0.75) 3. Prior: p(θ = 0.25) = p(θ = 0.5) = p(θ = 0.75) = 1/3 4. Sampling distribution: y|θ ∼ Binomial(θ, 1), with likelihood p(y|θ) = θy (1 − θ)(1−y) . Dr. Pablo E. Verde 21 Lecture 1: Introduction to Modern Bayesian Inference If we observe a single positive response (y = 1), how is our belief revised? Coin θ Prior Likelihood Likelihood × prior Posterior p(θ) p(y = 1|θ) p(y = 1|θ)p(θ) p(θ|y = 1) 1 0.25 0.33 0.25 0.0825 0.167 2 0.50 0.33 0.50 0.1650 0.333 3 0.75 0.33 0.75 0.2475 0.500 1.0 1.50 0.495 1.0 So, observing a head on a single flip of the coin means that there is now a 50% probability that the chance of heads is 0.75 and only a 16.7% that the chance of heads is 0.25. Note that if we normalize the likelihood p(y = 1|θ) we have exactly the same results. Dr. Pablo E. Verde 22 Lecture 1: Introduction to Modern Bayesian Inference Example: Three coins by simulation We can approximate the results of the three coins example with the following computer simulation algorithm: 1. Generate a random index i∗ from (1, 2, 3) with prior probability 1/3 for each component 2. Evaluate the probability of y = 1 at θ∗ = θ(i∗) 3. Generate a Binomial y∗ with probability θ∗ 4. Repeat 1 to 3 a large number of times 5. Average and normalize the outcomes of y∗ Dr. Pablo E. Verde 23 Lecture 1: Introduction to Modern Bayesian Inference Example: Three coins by simulation in R > coin.pri <- rep(1/3, 3) # prior for coins 1, 2, 3. 
> theta <- c(0.25, 0.5, 0.75) # pr. heads under 1, 2, 3. > sim <- 100000 # number of simulations > y <- rep(0, 3*sim) > dim(y) <- c(sim, 3) > for( i in 1:sim) { ind <- sample(c(1,2,3), size = 1, prob = coin.pri, replace = TRUE) y.new <- rbinom(1, 1, prob = theta[ind]) y[i, ind] <- y.new } > # average each class and normalize > apply(y, 2, mean) / sum( apply(y, 2, mean)) [1] 0.1661 0.3306 0.5034 Dr. Pablo E. Verde 24 Lecture 1: Introduction to Modern Bayesian Inference Monte Carlo error: How accurate are the empirical estimation by simulation? Suppose that we obtain empirical mean ˆE(g(X)) and variance ˆV = ˆVar(g(X)) based on T simulations. Then since ˆE is a sample mean based on T independent samples, it has true sample variance V = Var(g(X))/T which may be estimated by ˆV /T Hence ˆE has estimated standard error ˆV /T, which is known as Monte Carlo error. > th.post <- apply(y, 2, mean)/sum( apply(y, 2, mean)) > # monte carlo error > V <- th.post * (1- th.post)/sim > sqrt(V) [1] 0.001177 0.001488 0.001581 > Dr. Pablo E. Verde 25 Lecture 1: Introduction to Modern Bayesian Inference Posterior predictive distributions The predictive posterior distribution for a new observation ynew is given by p(ynew |y) = p(ynew |y, θ)p(θ|y)dθ. Assuming that past and future observations are conditionally independent given θ, this simplify to p(ynew |y) = p(ynew |θ)p(θ|y)dθ. For the discrete case of θ, integrals are replaced by sums: p(ynew |y) = θi p(ynew |θi )p(θi |y) where the p(θi |y) can be thought of as ”posterior weights”. Dr. Pablo E. Verde 26 Lecture 1: Introduction to Modern Bayesian Inference Example: Three coins continue ... Suppose we want to predict the probability that in the next toss is a head. We have: p(ynew = 1|y = 1) = θi θi , p(θi |y = 1) = (0.25 × 0.167) + (0.50 × 0.333) + (0.75 × 0.5) = 0.5833 By using the previous simulation results in R we have: > sum(theta * apply(y, 2, mean)/sum( apply(y, 2, mean))) [1] 0.5843 > Dr. Pablo E. Verde 27 Lecture 1: Introduction to Modern Bayesian Inference Sequential learning Suppose we obtain data y1 and form the posterior p(θ|y1) and then we obtain further data y2. The posterior based on y1, y2 is given by: p(θ|y1, y2) ∝ p(y2|θ) × p(θ|y1). Today’s posterior is tomorrow’s prior! The resultant posterior is the same as if we have obtained the data y1, y2 together: p(θ|y1, y2) ∝ p(y1, y2|θ) × p(θ). Dr. Pablo E. Verde 28 Lecture 1: Introduction to Modern Bayesian Inference Example: Three coins continue ... Now suppose that after observing y1 = 1 we observe y2 = 1, how is our belief revised? Coin θ Prior Likelihood Likelihood × prior Posterior p(θ) p(y = 1|θ) p(y = 1|θ)p(θ) p(θ|y = 1) 1 0.25 0.167 0.25 0.042 0.071 2 0.50 0.333 0.50 0.167 0.286 3 0.75 0.500 0.75 0.375 0.644 1.0 1.50 0.583 1.0 After observing a second head, there is now a 64.4% probability that the chance of heads is 0.75 and only a 7.1% that the chance of heads is 0.25. Dr. Pablo E. Verde 29 Lecture 1: Introduction to Modern Bayesian Inference A bit of philosophy: Probability ”for” and probability ”of” The prior distribution p(θ), expresses our uncertainty about θ before seeing the data, that could be objective or subjective The Bayesian inference allows the combination of different types of probabilities Subjective probability implied a mental construct where probabilities are used to express our uncertainty. 
This is why we use a pedantic: ”probability for an event...” In the classical setting probabilities are defined in terms of long run frequencies and are interpreted as physical properties of systems. In this way we use: ”probability of ...” In general we follow Bayesian Statistical Modeling, which is dynamic view of data analysis, which includes model building and model checking as well as statistical inference Dr. Pablo E. Verde 30 Lecture 1: Introduction to Modern Bayesian Inference Summary Bayesian statistics: Formal combination of external information with data model by the Bayesian rule Uses of subjective probability to make inductive inference Straightforward inference by manipulating probabilities No need of repeated sampling ideas Specification of prior or external evidence may be difficult or controversial Prediction and sequential learning is straightforward. Dr. Pablo E. Verde 31 Lecture 2: Bayesian inference for single parameter Lecture 2: Bayesian Inference for Single Parameter Models Dr. Pablo E. Verde 32 Lecture 2: Bayesian inference for single parameter Summary 1. Conjugate Analysis for: Binomial model Normal model known variance Normal model known mean Poisson model 2. Using R for: Graphical visualization of posteriors Direct Monte Carlo Simulation Methods Calculations of posteriors for functional parameters Dr. Pablo E. Verde 33 Lecture 2: Bayesian inference for single parameter Inference of proportions using a continuous prior Suppose we observe r positive responses out of n patients. Assuming patients are independent, with common unknown response rate θ, leads to a binomial likelihood p(r|n, θ) = n r θr (1 − θ)n−r ∝ θr (1 − θ)n−r We consider the response rate θ to be a continuous parameter, i.e., we need to give a continuous prior distribution. Suppose that before the data is observed we believe all values for θ are equally likely (very unrealistic!), then we give θ ∼ Unif(0, 1), i.e., p(θ) = 1. Posterior is then p(θ|r, n) ∝ θr (1 − θ)n−r × 1 this has a form of the kernel of a Beta(r + 1, n − r + 1). Dr. Pablo E. Verde 34 Lecture 2: Bayesian inference for single parameter To represent external evidence that some response rates are more plausible than others, it is mathematical convenient to use a Beta(a, b) prior distribution for θ p(θ) ∝ θa−1 (1 − θ)b−1 Combining this with the binomial likelihood gives a posterior distribution p(θ|r, n) ∝ p(r|θ, n)p(θ) ∝ θr (1 − θ)n−r = θr+a−1 (1 − θ)n−r+b−1 ∝ Beta(r + a, n − r + b). Dr. Pablo E. Verde 35 Lecture 2: Bayesian inference for single parameter When the prior and posterior come from the same family of distributions the prior is said to be conjugate to the likelihood A Beta(a, b) distribution has E(θ) = a/(a + b) var(θ) = ab/[(a + b)2 (a + b + 1)] Hence the posterior mean is E(θ|r, n) = (r + a)/(n + a + b) a and b are equivalent to observing a prior a − 1 successes in a + b − 2 trials, then it can be elicited. With fixed a and b, as r and n increase, E(θ|r, n) → r/n (the MLE). A Beta(1,1) is equivalent to Uniform(0,1). Dr. Pablo E. Verde 36 Lecture 2: Bayesian inference for single parameter Example: Shape of the Beta density function The Beta(a, b) prior is a flexible distribution. 
We can explore it shape by using the following lines in R: # Beta flexibility par(mfrow=c(2,2)) curve(dbeta(x, 1, 1), from =0, to=1, lty = 1, lwd=2, main="Beta(1, 1)", xlab=expression(theta)) curve(dbeta(x, 1/2, 1/2), from = 0, to = 1, lty = 2, lwd=2, col="red", main="Beta(1/2, 1/2)", xlab=expression(theta)) curve(dbeta(x, 2, 5), from = 0, to = 1, lty = 2, lwd=2, col="blue", main="Beta(2, 5)", xlab=expression(theta)) curve(dbeta(x, 2, 2), from = 0, to = 1, lty = 2, lwd=2, col="green", main="Beta(2, 2)", xlab=expression(theta)) par(mfrow=c(1,1)) Dr. Pablo E. Verde 37 Lecture 2: Bayesian inference for single parameter 0.0 0.2 0.4 0.6 0.8 1.0 0.60.81.01.21.4 Beta(1, 1) θ dbeta(x,1,1) 0.0 0.2 0.4 0.6 0.8 1.0 1.01.52.02.53.0 Beta(1/2, 1/2) θ dbeta(x,1/2,1/2) 0.0 0.2 0.4 0.6 0.8 1.0 0.00.51.01.52.02.5 Beta(2, 5) θ dbeta(x,2,5) 0.0 0.2 0.4 0.6 0.8 1.0 0.00.51.01.5 Beta(2, 2) θ dbeta(x,2,2) Dr. Pablo E. Verde 38 Lecture 2: Bayesian inference for single parameter Example: drug investigation Consider an early investigation of a new drug Experience with similar compounds has suggested that response rates between 0.2 and 0.6 could be feasible Interpret this as a distribution with mean=0.4 and standard deviation 0.1 A Beta(9.2, 13.8) distribution has these properties Suppose we treat n = 20 volunteers with the compound and observe r = 15 positive responses. Then we update the prior and the posterior is Beta(15 + 9.2, 5 + 13.8) Dr. Pablo E. Verde 39 Lecture 2: Bayesian inference for single parameter R script to perform the analysis: > par(mfrow=c(3,1)) # graphical output: 3 by 1 panels # draw a curve for a beta density > curve(dbeta(x,9.2, 13.8),from=0, to =1, xlab= "prob of success",main="Prior") # draw a curve for a binomial density > curve(dbinom(15, 20,x),from=0, to =1, col="blue", xlab="prob of sucess", main="Likelihood") # draw the posterior > curve(dbeta(x, 24.2, 18.8),from=0, to =1, col="red", xlab="prob of sucess", main="Posterior") > par(mfrow=c(1,1)) Dr. Pablo E. Verde 40 Lecture 2: Bayesian inference for single parameter 0.0 0.2 0.4 0.6 0.8 1.0 01234 Prior prob of success dbeta(x,9.2,13.8) 0.0 0.2 0.4 0.6 0.8 1.0 0.000.100.20 Likelihood prob of sucess dbinom(15,20,x) 0.0 0.2 0.4 0.6 0.8 1.0 024 Posterior prob of sucess dbeta(x,24.2,18.8) Dr. Pablo E. Verde 41 Lecture 2: Bayesian inference for single parameter Monte Carlo Simulation If we are able to sample values θ∗ from the posterior p(θ|r, n), then we can extend the inferential scope. For example, one important application of this simulation process is when we need to estimate the posterior of a functional parameter, e.g., the odds ratio: φ = f (θ) = θ 1 − θ For simple models we can directly simulate from the posterior density and use this values to empirically approximate the posterior quantiles. Dr. Pablo E. Verde 42 Lecture 2: Bayesian inference for single parameter Example: Posterior for the odds ratio The posterior of φ = θ 1 − θ is calculated in R as follows: > theta.star <- rbeta(20000, 24.2, 18.8) > odds.star <- theta.star/(1-theta.star) > quantile(odds.star, prob = c(0.05, 0.5, 0.75, 0.95)) 5% 50% 75% 95% 0.7730276 1.2852558 1.5784318 2.1592678 > hist(odds.star, breaks=100, xlab="odds ratio", freq=FALSE, xlim=c(0, 4)) > lines(density(odds.star), lwd =2, lty = 1, col ="red") Dr. Pablo E. Verde 43 Lecture 2: Bayesian inference for single parameter Histogram of odds.star odds ratio Density 0 1 2 3 4 0.00.20.40.60.81.0 Dr. Pablo E. 
Verde 44 Lecture 2: Bayesian inference for single parameter Making predictions for binary data Suppose that we want to make predictions from the model The predictive posterior distribution for rnew (the number of successes in m trials) follows a Beta-Binomial distribution with density: p(rnew ) = Γ(a + b) Γ(a)Γ(b) m rnew Γ(a + rnew )Γ(b + m − rnew ) Γ(a + b + m) In R: # Beta-binomial density betabin <- function(r,a,b,m) { gamma(a+b)/(gamma(a)*gamma(b)) * choose(m,r) * gamma(a+r)*gamma(b+m-r)/gamma(a+b+m) } Dr. Pablo E. Verde 45 Lecture 2: Bayesian inference for single parameter Example: drug investigation continue ... Suppose that we are interested in the predictive posterior of the number of success in the next 40 trials: Using the betabin function in R we have: # Beta-binomial distribution of the number of successes x in the next # 40 trials with mean 22.5 and standard deviation 4.3 x <- 0:40; px <- betabin(0:40, a=24.2,b=18.8,m=40) plot(x, px, type="h", xlab="number of successes out of 40", main="Predictive Posterior") Dr. Pablo E. Verde 46 Lecture 2: Bayesian inference for single parameter 0 10 20 30 40 0.000.020.040.060.08 Predictive Posterior number of successes out of 40 px Dr. Pablo E. Verde 47 Lecture 2: Bayesian inference for single parameter Example: drug investigation continue ... Suppose that we would consider continuing a development program if the drug managed to achieve at least a further 25 successes out of these 40 future trials In R we can calculate: # probability of at least 25 successes out of 40 further trials > sum(betabin(25:40,24.2,18.8,m=40)) [1] 0.3290134 > Dr. Pablo E. Verde 48 Lecture 2: Bayesian inference for single parameter Simulations for predictive data Instead of using the analytical approximation based on the Beta-Binomial distribution, we can simulate predicted observations in the following way: We simulate θ∗ 1, . . . , θ∗ B from the posterior Beta(24.2, 18.8) Then we simulate y∗ from a binomial distribution with rate θ∗ and n = 40. We tabulate and normalize predictive frequencies. Dr. Pablo E. Verde 49 Lecture 2: Bayesian inference for single parameter Simulations for predictive data In R notation > # Simulation of predictive data with a Beta-Binomial model ... > theta.star <- rbeta(10000, 24.2, 18.8) > y <- rbinom(10000, 40, theta.star) > freq.y <- table(y) > ys <- as.integer(names(freq.y)) > predprob <- freq.y/sum(freq.y) > plot(ys, predprob, type = "h", xlab = "y", + ylab = "Predictive Probability") > > # Probability of at least 25 out of future 40 trials. > sum(predprob[ys>24]) [1] 0.3244 Dr. Pablo E. Verde 50 Lecture 2: Bayesian inference for single parameter 10 15 20 25 30 35 0.000.020.040.060.08 y PredictiveProbability Dr. Pablo E. Verde 51 Lecture 2: Bayesian inference for single parameter Bayesian analysis for Normal data Known variance, unknown mean Suppose we have a sample of Normal data yi ∼ N(µ, σ2 ), i = 1, . . . , n where σ2 is known and µ is unknown. The conjugate prior of µ is µ ∼ N(µ0, τ2 ). It is convenient to write τ2 as σ2/n0, where n0 represents the ”effective number of observations” in the prior distribution. Dr. Pablo E. Verde 52 Lecture 2: Bayesian inference for single parameter Then the posterior distribution for µ is given by p(µ|y) = N n0µ0 + n¯y n0 + n , σ2 n0 + n Prior variance is based on an ”implicit” sample size n0 As n0 tends to 0, the distribution becomes ”flatter” Posterior mean is a weighted average of the prior mean µ0 and parameter estimate ¯x , weighted by their precisions, i.e., relative sample sizes. 
Posterior variance is based on an implicit sample size n0 and the data sample size n. Dr. Pablo E. Verde 53 Lecture 2: Bayesian inference for single parameter Alternative expressions for the posterior mean µn are : µn = wµ0 + (1 − w)¯y where w = n0 n + n0 , µn = µ0 + (¯y − µ0) n n + n0 , µn = ¯y − (¯y − µ0) n0 n + n0 . That shows ”shrinkage” towards prior mean. Dr. Pablo E. Verde 54 Lecture 2: Bayesian inference for single parameter Prediction Denoting the posterior mean and variance as µn and σ2 n = σ2/(n0 + n), the predictive posterior distribution for a new observation y∗ is p(y∗ |y) = p(y∗ |y, µ)p(µ|y)dµ which is generally equal to p(y∗ |y) = p(y∗ |µ)p(µ|y)dµ which can be shown to give p(y∗ |y) ∼ N(µn, σ2 n + σ2 ) The predictive posterior distribution is centered around the posterior mean µn with variance equal to sum of the posterior variance of µ plus the data variance. Dr. Pablo E. Verde 55 Lecture 2: Bayesian inference for single parameter Bayesian inference for Normal data Unknown variance, know mean Suppose we have sample of Normal data yi ∼ N(µ, σ2 ), i = 1, . . . , n, where µ is known and σ2 is unknown. It is convenient to change parameterization to the precision w = 1/σ2. The conjugate prior for w is then w ∼ Gamma(α, β), where p(w) ∝ wα−1 exp(−βw). Then σ2 is then said to have an inverse-gamma distribution. Dr. Pablo E. Verde 56 Lecture 2: Bayesian inference for single parameter The posterior distribution for w takes the form p(w|µ, y) ∝ wα−1 exp(−βw) × w n 2 exp − w 2 n i=1 (yi − µ)2 . Collecting terms gives p(w|µ, y) = Gamma α + n 2 , β + 1 2 n i=1 (yi − µ)2 . Dr. Pablo E. Verde 57 Lecture 2: Bayesian inference for single parameter Clearly we can think of α = n0/2, where n0 is the ”effective prior sample size”. Since n i=1(yi − µ)2/n estimate σ2 = 1/w, then we interpret 2β as representing n0 × prior estimate ofσ2 0. Alternative, we can write our conjugate prior as w ∼ Gamma n0 2 , n0σ2 0 2 , which can be seen as a scale χ2 n0 distribution. This is useful when assessing prior distributions for sample variances. Dr. Pablo E. Verde 58 Lecture 2: Bayesian inference for single parameter Bayesian inference with count data Suppose we have an independent sample of counts y1 . . . , yn which can be assumed to follow a Poisson distribution with unknown mean µti , where µ is the rate per unit t : p(y|µ) = i (µti )yi exp(−µti ) yi ! The kernel of the Poisson likelihood as a function of µ has the same form as a Gamma(a, b) prior for µ: p(µ) ∝ µa−1 exp(−bµ). Dr. Pablo E. Verde 59 Lecture 2: Bayesian inference for single parameter This implies the following posterior p(µ|y) ∝ p(µ)p(y|µ) ∝ µ(a−1) e−bµ n i=1 e−µti µy i ∝ µa+Yn−1 e−(b+Tn)µ = Gamma(a + Yn, b + Tn). where Yn = n i=1 yi and Tn = n i=1 ti . Dr. Pablo E. Verde 60 Lecture 2: Bayesian inference for single parameter The posterior mean is: E(µ|y) = a + Yn b + Tn = Yn Tn n n + b + a b 1 − n n + b . The posterior mean is a compromise between the prior mean a/b and the MLE Yn Tn . Thus b can be interpreted as an effective exposure and a/b as a prior estimate of the Poisson mean. Dr. Pablo E. Verde 61 Lecture 2: Bayesian inference for single parameter Example: Annual number of cases of haemolytic uraemic syndrome Henderson and Matthews, (1993) Annual numbers of cases were available from two a specialist center at Birmingham, from 1970 to 1988. For analysis we consider observed values from the first decade. 
year       1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
cases, xi     1    5    3    2    1    0    0    2    1    1
Because we are interested in counts of disease over time, a simple model is a Poisson process:
yi | µ ∼ Poisson(µ), i = 1, . . . , 10.
Dr. Pablo E. Verde 62 Lecture 2: Bayesian inference for single parameter
Similar data collected by another centre gave a mean rate of 2.3 with s.d. 2.79, i.e.,
prior mean a/b = 2.3, prior sd √a / b = 2.79.
Solving for a and b, this information can be translated into a Gamma(0.679, 0.295) prior distribution. Then
p(µ|y) = Gamma(Σi xi + 0.679, 10 + 0.295) = Gamma(16.679, 10.295)
E(µ|y) = 16.679/10.295 = 1.620;  sd(µ|y) = √16.679/10.295 = 0.396
Dr. Pablo E. Verde 63 Lecture 2: Bayesian inference for single parameter
[Figure: prior Gamma(0.679, 0.295) density and posterior Gamma(16.679, 10.295) density for µ]
Dr. Pablo E. Verde 64 Lecture 2: Bayesian inference for single parameter
The predictive posterior distribution for a new count y∗ is
y∗ | y ∼ Negative-Binomial(a + n¯y, b + n)
If we ask what is the probability of observing 7 or more cases in the next year, we have
> sum(dnbinom(6:17, prob=10.295/(1+10.295), size=16.679))
[1] 0.0096
>
Note: the observed values for the following years were 7, 11, 4, 7, 10, ..., indicating a possible structural break.
We can visualize the predictive density by
p <- 10.295/(1+10.295)
plot(0:10, dnbinom(0:10, prob=p, size=16.679), type="h", lwd=2, col="red", xlab="counts")
Dr. Pablo E. Verde 65 Lecture 2: Bayesian inference for single parameter
[Figure: predictive Negative-Binomial probabilities for counts 0 to 10]
Dr. Pablo E. Verde 66 Lecture 2: Bayesian inference for single parameter
Some comments
For all these examples, we see that
the posterior mean is a compromise between the prior mean and the MLE
the posterior s.d. is less than each of the prior s.d. and the s.e. (MLE)
As n → ∞,
the posterior mean → the MLE
the posterior s.d. → the s.e. (MLE)
the posterior does not depend on the prior.
These observations are generally true when the MLE exists and is unique.
Dr. Pablo E. Verde 67 Lecture 2: Bayesian inference for single parameter
Priors
When the posterior is in the same family as the prior we have what is known as conjugacy. Conjugate models have the advantage that:
Prior parameters can usually be interpreted as an implied prior sample size: n0
They involve simple mathematical models; examples include:
Distribution of y    Parameter                  Conjugate prior
Binomial             Prob. of success           Beta
Poisson              Mean                       Gamma
Exponential          Reciprocal of mean         Gamma
Normal               Mean (variance known)      Normal
Normal               Variance (mean known)      Inverse Gamma
Unfortunately conjugate priors only exist for a small catalog of likelihoods.
Dr. Pablo E. Verde 68 Lecture 2: Bayesian inference for single parameter
Practical Exercise: Conjugate inference for a binomial experiment
Drug investigation example from this Lecture
We treat n = 20 volunteers with a new compound and observe r = 15 positive responses. We use as prior θ ∼ Beta(9.2, 13.8):
1. What is the posterior mean and median for the response rate?
2. What are the 2.5th and 97.5th percentiles of the posterior?
3. What is the probability that the true response rate is greater than 0.6?
4. How is this value affected if a uniform prior is adopted?
5. Using the original Beta(9.2, 13.8) prior, suppose 40 more patients were entered into the study. What is the chance that at least 25 of them respond positively?
Dr. Pablo E.
Verde 69 Lecture 2: Bayesian inference for single parameter Solution # 1) Posteriors mean and median # The posterior mean is: (r+a) / (n + a+b) > (15 + 9.2) /( 20 + 9.2 + 13.8) [1] 0.5627907 # the posterior median is > qbeta(0.5, 24.2, 18.8 ) [1] 0.5637731 > # 2) Posterior percentiles: > qbeta(0.025, 24.2, 18.8 ) # 2.5% [1] 0.4142266 > qbeta(0.975, 24.2, 18.8 ) # 97.5% [1] 0.7058181 > Dr. Pablo E. Verde 70 Lecture 2: Bayesian inference for single parameter # 3) Posterior prob that the rate is greter than 0.6 > 1 - pbeta(0.6, 24.2, 18.8 ) [1] 0.3156323 # 4) Uniform prior is beta(1,1) then the posterior is beta( r+1, n-r+1) > 1 - pbeta(0.6, 15+1 ,20-15+1) [1] 0.9042598 # 5) Posterior probability of at least 25 successes out of 40 further trials > sum(betabin(25:40,24.2,18.8,m=40)) [1] 0.3290134 Dr. Pablo E. Verde 71 Lecture 3: Prior distributions Lecture 3: Priors Distributions Dr. Pablo E. Verde 72 Lecture 3: Prior distributions Summary Misunderstandings about prior distributions Non-informative Priors and Jeffreys invariance priors Sensitivity analysis and making priors predictions Adjustment of priors based on historical data and judgement Mixture of priors Dr. Pablo E. Verde 73 Lecture 3: Prior distributions Misunderstandings about prior distributions It is worth pointing out some misunderstanding regarding prior distributions: The name prior suggests a temporal relationship, however, this is misleading. The prior distribution models the uncertainty given by the external evidence. Cox (1999) The prior is not necessarily unique! In a recent article Lambert et. al. (2005) analyze the use of 13 different priors for the between study variance parameter in random-effects meta-analysis. There is no such thing as the‘correct’ prior. Bayesian analysis is regarded as transforming prior into posterior opinion, rather than producing ’the’ posterior distribution. The prior may not be completely specified. In Empirical Bayes inference priors have unknown parameters that are estimated from the data. Dr. Pablo E. Verde 74 Lecture 3: Prior distributions Priors can be overparametrized. Sometimes we intentionally overparametrized the priors in order to accelerate convergence of simulation methods, see Gelman, Carlin, Stern and Rubin (2004) Chapter 6. Inference may rely only on priors. There are situations where no further data are available to combine with our priors or there is no intention to update the priors. This is the typical case of risk analysis, sample size determination in experiments, simulation of complex process , etc. In these analytical scenarios priors are usually used to simulate hypothetical data and we refer to that prior predictive analysis. Finally, priors are not necessarily important! In many scientific applications, as the amount of data increases, the prior is overwhelmed by the likelihood and the influence of the prior disappears, see Box and Tiao (1973) (pag. 20-25). Dr. Pablo E. Verde 75 Lecture 3: Prior distributions Non-informative Priors: Classical Bayesian Perspective We may be interested to introduce an initial state of ”ignorance” in our Bayesian analysis. But representing ignorance raises formidable difficulties! There has been a long and complex search for various ”non-informative”, ”reference” or ”objective” priors during the last century. A sort of ”off-the-shelf objective prior” that can be applied in all circumstances. The synthesis of this search is that those magic priors do not exists, although useful guidance exists (Berger, 2006). Dr. Pablo E. 
Verde 76 Lecture 3: Prior distributions Problems with uniform priors for continuous parameters Example: uniform prior on proportions Let θ be the chance that a bias coin comes down heads, we assume θ ∼ Uniform(0, 1). Let φ = θ2 the chance that it coming down heads in both of the next 2 throws. Now, the density of φ is p(φ) = 1 2 √ φ , which corresponds to a Beta(0.5,1) distribution and is certainly not uniform! Dr. Pablo E. Verde 77 Lecture 3: Prior distributions Jeffreys’ invariance priors Consider a 1-to-1 transformation of θ : φ = g(θ) Transformation of variables: prior p(θ) is equivalent to prior on φ of p(φ) = p(θ) | dθ dφ | Jeffreys proposed defining a non-informative prior for θ as p(θ) ∝ I(θ)1/2 where I(θ) is Fisher information for θ I(θ) = −Ex|θ ∂2 log p(X|θ) ∂θ2 = Ex|θ ∂ log p(X|θ) ∂θ 2 . Dr. Pablo E. Verde 78 Lecture 3: Prior distributions Non-informative priors for proportions Data model is Binomial: we consider r successes from n trials r|θ ∼ Binomial(θ, n) we have log p(x|θ) = r log(θ) + (n − r) log(1 − θ) + C then I(θ) = n θ(1 − θ) . So the Jeffreys’ prior is p(θ) ∝ (θ(1 − θ))−1/2 , which is a Beta(1/2, 1/2). Dr. Pablo E. Verde 79 Lecture 3: Prior distributions Non-informative priors for location parameters A location parameters θ is such that p(θ|y) is a function of (y − θ) and so the distribution of (y − θ) is independent of θ. Example: data model is Normal with unknown mean θ and known variance v x1, x2, . . . , xn|θ ∼ Normal(θ, v) then we have log p(x|θ) = − (xi − θ)2 2v + C with I(θ) = n v . So the Jeffreys’ prior is p(θ) ∝ 1 which is the Uniform distribution for θ. Dr. Pablo E. Verde 80 Lecture 3: Prior distributions Non-informative priors for scale parameters A scale parameters θ is such that p(y|θ) is a function of 1/θf (y/θ) and so the distribution of (y/θ) is independent of θ. Example: data model is Normal with known mean m and unknown variance θ x1, x2, . . . , xn|θ ∼ Normal(m, θ) then we have log p(x|θ) = −n/2 log(θ) − s 2θ , where s = (xi − m)2, then I(θ) = n 2θ2 . So the Jeffreys’ prior on variance is p(θ) ∝ 1 θ This Jeffreys’ improper prior is approximated by a Gamma( , ) with → 0. Dr. Pablo E. Verde 81 Lecture 3: Prior distributions Priors for counts and rates Data model is Poisson: x|θ ∼ Poisson(θ), then we have log p(x|θ) = −θ + x log θ + C with I(θ) = 1/θ. So the Jeffreys’ prior is p(θ) ∝ θ−1/2 . This improper prior is approximated by a Gamma distribution with α = 1/2 and β → 0. Dr. Pablo E. Verde 82 Lecture 3: Prior distributions Comments Jeffreys’ rule They are invariant, whatever the scale we choose to measure the unknown parameter, the same prior results when the scale is transformed to any particular scale Some inconsistencies associated with Jeffreys’ priors have been discussed: Applying this rule to the normal case with both mean and variance parameters unknown does not lead to the same prior as applying separately the rule for the mean and the variance and assuming a priori independence between these parameters. Although Jeffreys’ rule is suggestive, it cannot be applied blindly. It should be thought of as guideline to consider particularly if there is no other obvious way of finding a prior distribution. Dr. Pablo E. Verde 83 Lecture 3: Prior distributions Predictive Prior Analysis In practice it is not necessary to adopt a full Bayesian approach. Sometimes is very useful to use Bayesian methods for some analyzes and classical approaches for others. 
Example: predictive distribution of the power in sample size calculations
A randomized trial is planned with n patients in each of two arms.
The response within each treatment arm is assumed to have between-patient standard deviation σ.
The treatment estimate ˆθ = ¯y1 − ¯y2 is approximately distributed as Normal(θ, 2σ²/n).
A trial designed to have two-sided Type I error α and Type II error β in detecting a true difference of θ in mean response between the groups will require a sample size per group of
n = (2σ²/θ²) (z1−β + z1−α/2)²,
Dr. Pablo E. Verde 84 Lecture 3: Prior distributions
Alternatively, for fixed n, the power of this experiment is
Power = Φ( √(nθ²/(2σ²)) − z1−α/2 ).
If we assume θ/σ = 0.5, α = 0.05 and β = 0.2, we have z1−β = 0.84 and z1−α/2 = 1.96; then the power of the trial is 80% and n = 63 in each trial arm.
However, we accept uncertainty about θ and σ and we wish to include this feature in the sample size and power calculations. We assume from previous studies that it is reasonable that θ ∼ N(0.5, 0.1²) and σ ∼ N(1, 0.3²)I(0, ∞).
Dr. Pablo E. Verde 85 Lecture 3: Prior distributions
Then
1. Simulate values θ∗ ∼ N(0.5, 0.1²) and σ∗ ∼ N(1, 0.3²) (subject to the constraint that σ is positive).
2. Substitute them in the formulae and generate n∗ and Power∗.
3. Use the histograms of n∗ and Power∗ as their corresponding predictive distributions.
4. In R we have
## predictive prior distribution for sample size and power
set.seed(123)
theta <- rnorm(10000, 0.5, 0.1)
sigma <- rnorm(10000, 1, 0.3)
sigma <- ifelse(sigma <0, -1*sigma, sigma)
n <- 2*sigma^2 /(theta^2)*(0.84 + 1.96)^2
pow <- pnorm( sqrt( 63/2) * theta /sigma - 1.96)
par(mfrow=c(1,2))
hist(n, xlim = c(0, 400), breaks=50)
hist(pow)
par(mfrow=c(1,1))
Dr. Pablo E. Verde 86 Lecture 3: Prior distributions
[Figure: histograms of the predictive distributions of the sample size n and of the power]
Dr. Pablo E. Verde 87 Lecture 3: Prior distributions
> round(quantile(n),2)
0% 25% 50% 75% 100%
0.00 37.09 62.03 99.45 1334.07
> round(quantile(pow),2)
0% 25% 50% 75% 100%
0.09 0.61 0.81 0.95 1.00
> sum(pow<0.7)/10000
[1] 0.3645
>
It is clear that there is huge uncertainty about the appropriate sample size. For n = 63 the median power is 81%, but a trial of 63 patients per group could be seriously underpowered: there is a 37% chance that the power is less than 70%.
Dr. Pablo E. Verde 88 Lecture 3: Prior distributions
Mixture of priors for proportions
We may want to express a more complex prior opinion that cannot be encapsulated by a single beta distribution.
A prior which is a mixture of beta distributions:
p(θ) = q p1(θ) + (1 − q) p2(θ)
where pi = Beta(ai , bi ). Now if we observe r successes out of n trials, the posterior is
p(θ|r, n) = q∗ p1(θ|r, n) + (1 − q∗) p2(θ|r, n)
where
pi (θ|r, n) ∝ pi (θ) p(r|θ, n)
q∗ = q p1(r|n) / [ q p1(r|n) + (1 − q) p2(r|n) ]
Dr. Pablo E. Verde 89 Lecture 3: Prior distributions
pi (r|n) is the beta-binomial predictive probability of r successes in n trials assuming θ has distribution Beta(ai , bi ).
The posterior is a mixture of beta posteriors, with mixture weights adapted to whichever prior provides the best prediction for the observed data.
In R:
# mixture of betas
mixbeta <- function(x,r,a1,b1,a2,b2,q,n) {
qstar <- q*betabin(r,a1,b1,n)/ (q*betabin(r,a1,b1,n)+(1-q)*betabin(r,a2,b2,n))
p1 <- dbeta(x,a1+r,n-r+b1)
p2 <- dbeta(x,a2+r,n-r+b2)
posterior <- qstar*p1 + (1-qstar)*p2
}
Dr. Pablo E. Verde 90 Lecture 3: Prior distributions
Example: drug investigation continue ...
We want to combine: an informative Beta(9.2, 13.8) a non-informative Beta(1, 1) we give 80 % of prior weight to the informative prior Suppose we treat n = 20 volunteers with the compound and observe r = 15 positive responses. Dr. Pablo E. Verde 91 Lecture 3: Prior distributions We can visualize this analysis in R as follows: par(mfrow=c(2,1)) # informative beta prior curve(dbeta(x,9.2,13.8),from=0, to=1,col="red", xlab="probability of response",main="priors") # mixture beta prior with 80% informative and 20% flat curve(0.8*dbeta(x, 9.2, 13.8)+0.2*dbeta(x, 1, 1),from=0, to=1,col="blue",add=T) # beta posterior curve(dbeta(x,24.2,18.8),from=0,to=1,col="red", xlab="probability of response", main="posteriors") # posterior from a mixture prior curve(mixbeta(x, r=15, a1=9.2, b1=13.8, a2=1, b2=1, q=.8, 20), from=0,to=1, col="blue",add=T) par(mfrow=c(1,1)) Dr. Pablo E. Verde 92 Lecture 3: Prior distributions 0.0 0.2 0.4 0.6 0.8 1.0 01234 priors probability of response dbeta(x,9.2,13.8) 0.0 0.2 0.4 0.6 0.8 1.0 012345 posteriors probability of response dbeta(x,24.2,18.8) Dr. Pablo E. Verde 93 Lecture 3: Prior distributions Further topics on priors: Adjustment of priors based on historical data and judgement 1. Power priors (Ibrahim and Shen, 2000) 2. Bias modelling Hirearchical priors for large dimensional problems (e.g regression with more predictors than observations) Summary: The need for priors distributions should not be an embarrassment It is reasonamble that the prior should influence the analysis, as long as the influence is recognized and justified Importance of transparency and sensitivity analysis Dr. Pablo E. Verde 94 Lecture 3: Prior distributions Practical Exercise: Conjugate mixture priors inference for a binomial experiment Suppose that most drugs (95%) are assumed to come from the stated Beta(9.2, 13.8) prior, but there is a small chance that the drug might be a ”winner”. Winners are assumed to have a prior distribution with mean 0.8 and standard deviation 0.1. Dr. Pablo E. Verde 95 Lecture 3: Prior distributions 1. What Beta distribution might represent the ”winners” prior? ( Hint: Use the formulas of the mean and the variance for a Beta(a, b) distribution from lecture 2. This can be re-arranged to a = m(m(1 − m)/v − 1) and b = (1 − m)(m(1 − m)/v − 1), where m and v are the mean and the variance of a Beta(a,b) distribution). 2. Enter this mixture prior model into the R code for mixtures and plot it. 3. Calculate the posterior mixture model based on 15 success out of 20 trials and plot it. 4. What is the chance that the drug is a winner? Dr. Pablo E. Verde 96 Lecture 3: Prior distributions Solution # 1) # to workout the a and b for the winner drug prior use the realtionship # of the mean and variance of the beta distribution and its parameters > m <- 0.8 > v <- 0.1^2 > a <- m*(m*(1-m)/v -1) > a [1] 12 > b <- (1-m)*(m*(1-m)/v -1) > b [1] 3 # 2) plot for the mixture prior curve(dbeta(x, 12, 3)*0.05 + dbeta(x, 9.2, 13.8)*0.95, from = 0, to =1, main = "prior", xlab = "rate") Dr. Pablo E. 
Verde 97 Lecture 3: Prior distributions # 3) plot for the mixture posterior with 15 success out of 20 trials # mixture posterior of betas mixbeta <- function(x,r,a1,b1,a2,b2,q,n) { qstar <- q*betabin(r,a1,b1,n)/(q*betabin(r,a1,b1,n)+ (1-q)*betabin(r,a2,b2,n)) p1 <- dbeta(x,a1+r,n-r+b1) p2 <- dbeta(x,a2+r,n-r+b2) posterior <- qstar*p1 + (1-qstar)*p2 } # then the mixture beta has parameters: # a = 12, b = 3 with probability 5% # a = 9.2, b = 13.8 with probabilty 95% # r = 15, n = 20 curve(mixbeta(x, r = 15, a1 = 12, b1 = 3, a2 = 9.2, b2 = 13.8, q = 0.05, n = 20), from = 0, to = 1, main = "posterior", xlab = "rate") Dr. Pablo E. Verde 98 Lecture 3: Prior distributions # 4) The probabilty that the drug is a winner is given by the posterior # weight to belong to the winner class qstar <- 0.05*betabin(15, 12 , 3 ,20 )/(0.05*betabin(15, 12 ,3 ,20) +(1-0.05)*betabin(15, 9.2, 13.8, 20 )) > qstar [1] 0.3951545 Dr. Pablo E. Verde 99 Lecture 4 - Multiparameters with R Lecture 4: Introduction to Multiparameter Models Dr. Pablo E. Verde 100 Lecture 4 - Multiparameters with R Summary Introduction to multiparameter inference Normal model with unknown mean and variance: standard non-informative priors Using R for predictive model checking Multinomial Model conjugate analysis and its application in R Comparing classical and Bayesian multiparameter models Dr. Pablo E. Verde 101 Lecture 4 - Multiparameters with R Introduction The basic ideas is similar to one parameter models: We have a model for the observed data which defines a likelihood p(y|θ) on the vector parameter θ We specify a joint prior distribution p(θ) for the possible values of θ We use Bayes’ rule to obtain the posterior of θ, p(θ|y) ∝ p(θ) × p(y|θ) Problems: In many applications is difficut to specify priors on multivariate θ Computations could be very difficult, they involve multivariate integration Dr. Pablo E. Verde 102 Lecture 4 - Multiparameters with R Example: Normal with unknown mean µ and variance σ2 In this example we use marathontimes in the R package LearnBayes The obervations are the mean times (in minutes) for men running Marathon with age classes between 20 to 29 years old: > library(LearnBayes) > data(marathontimes) > marathontimes$time [1] 182 201 221 234 237 251 261 266 267 273 286 291 292 ... > Dr. Pablo E. Verde 103 Lecture 4 - Multiparameters with R We assume a Normal model for the data given the mean µ and the variance σ2: y1, . . . , y20|µ, σ2 ∼ N(µ, σ2 ) We use a non-informative Jeffreys’ prior assuming independence between location and scale: p(µ, σ) ∝ 1 σ , Then posterior density for the mean and the variance p(µ, σ|y) is called the Normal-χ2 distribution This posterior delivers same results as classical analysis: E(µ|y) = ¯y, E(σ2 |y) = S2 The posterior p(µ|y) is proportional to a t-distribution with n − 1 df, and so on Direct simulation from these posteriors is straghtforwared Dr. Pablo E. Verde 104 Lecture 4 - Multiparameters with R The Normal-χ2 posterior is implemented in the function normchis2post in LearnBayes, which computes the logarithm of the joint posterior density of (µ, σ2) We can visualize the α% levels contourns with the function mycontour. 
The arguments include the log-density to plot, the rectangular area (xlo, xhi, ylo, yhi) and the data:
> mycontour(normchi2post, c(220,330,500,9000), time, xlab = "mean", ylab = "var")
We can simulate directly from the marginal posteriors of µ and σ²:
> SS <- sum((time - mean(time))^2)
> n <- length(time)
> sigma2 <- SS/rchisq(1000, n - 1)
> mu <- rnorm(1000, mean = mean(time), sd = sqrt(sigma2)/sqrt(n))
> points(mu, sigma2, col="blue")
> quantile(mu, c(0.025, 0.975)) # 95% posterior interval for mu
2.5% 97.5%
253.1121 301.1133
Dr. Pablo E. Verde 105 Lecture 4 - Multiparameters with R
[Figure: contour plot of the joint posterior density of the mean and variance, with the 1000 simulated draws overlaid]
Dr. Pablo E. Verde 106 Lecture 4 - Multiparameters with R
Predictive model checking
Is the statistical model consistent with the data?
We answer this question by simulating predictive y∗ values from the model and comparing some data features with the predictive data.
We start by comparing histograms of the original data and of simulated data sets of the same sample size:
y.star <- rnorm(1000, mean = mu, sd = sqrt(sigma2)) # predictive data
par(mfrow=c(3,4))
hist(time, breaks=10, xlim = c(150, 400), main="Original Data", col="green")
for(i in 1:11){
  y.sim <- sample(y.star, 20)
  hist(y.sim, breaks=10, xlim=c(150, 400), main="Simulated Data", col="blue")
}
Dr. Pablo E. Verde 107 Lecture 4 - Multiparameters with R
[Figure: histogram of the original data and histograms of 11 simulated data sets of size 20]
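As a quick aside not on the original slides, the posterior draws mu and sigma2 generated above can be used to verify the earlier claim that this noninformative analysis reproduces the classical results; a minimal sketch:
# The simulation-based 95% posterior interval for mu should essentially
# agree with the classical t-based confidence interval, and sqrt(sigma2)
# gives draws from the posterior of the standard deviation sigma.
quantile(mu, c(0.025, 0.975))             # Bayesian 95% interval for mu
t.test(time)$conf.int                     # classical 95% t interval
quantile(sqrt(sigma2), c(0.025, 0.975))   # 95% posterior interval for sigma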
Dr. Pablo E. Verde 108 Lecture 4 - Multiparameters with R
Predictive model checking
We define the following features comparing the observed data y with the predictive data y∗:
T∗1 = min(y∗), T∗2 = max(y∗), T∗3 = q75(y∗) − q25(y∗)
and, for asymmetry,
T∗4 = |y∗(18) − µ∗| − |y∗(2) − µ∗|
where the 18th and 2nd order statistics approximate the 90% and 10% quantiles respectively.
These measures are compared with the corresponding values based on the observed data:
T1 = min(y), T2 = max(y), T3 = q75(y) − q25(y) and T4 = |y(18) − µ∗| − |y(2) − µ∗|.
Dr. Pablo E. Verde 109 Lecture 4 - Multiparameters with R
In R we have:
# Analysis of the minimum, maximum, variability and asymmetry
min.y <- max.y <- asy1 <- asy2 <- inter.q <- rep(0,1000)
time <- sort(time)
for (b in 1:1000){
  y.sim <- sample(y.star, 20)
  mu.star <- sample(mu, 20)
  min.y[b] <- min(y.sim)
  max.y[b] <- max(y.sim)
  y.sim <- sort(y.sim)
  asy1[b] <- abs(y.sim[18] - mean(mu.star)) - abs(y.sim[2] - mean(mu.star))
  asy2[b] <- abs(time[18] - mean(mu.star)) - abs(time[2] - mean(mu.star))
  inter.q[b] <- quantile(y.sim, prob=0.75) - quantile(y.sim, prob=0.25)
}
Dr. Pablo E. Verde 110 Lecture 4 - Multiparameters with R
To display these quantities:
par(mfrow =c(2,2))
hist(min.y, breaks = 50, col="blue", main = "Minimum y*")
abline(v=min(time), lwd=3, lty=2)
hist(max.y, breaks = 50, col="red", main = "Maximum y*")
abline(v=max(time), lwd=3, lty=2)
hist(inter.q, breaks = 50, col="magenta", main = "Variability y*")
abline(v=quantile(time,prob=0.75)-quantile(time,prob=0.25), lwd=3,lty=2)
plot(asy1, asy2, main ="Asymmetry", xlab="Asymmetry predicted data",
     ylab ="Asymmetry original data")
abline(a=0, b=1, lwd=2)
par(mfrow=c(1,1))
This analysis shows that the data and the model are compatible and the deviations can easily be explained by sampling variation.
Dr. Pablo E. Verde 111 Lecture 4 - Multiparameters with R
[Figure: histograms of the predictive minimum, maximum and variability, with the observed values marked by dashed lines, and a scatter plot of predicted versus observed asymmetry]
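A compact numerical complement to these graphical checks, not shown in the original slides, is to turn each checking statistic into a posterior predictive tail-area probability using the vectors min.y, max.y and inter.q computed above; values very close to 0 or 1 would signal a conflict between model and data. A minimal sketch (the object obs.iqr is introduced here only for readability):
# Posterior predictive p-values: the proportion of simulated data sets whose
# checking statistic is at least as extreme as the observed one
obs.iqr <- quantile(time, prob=0.75) - quantile(time, prob=0.25)
mean(min.y <= min(time))    # predictive minimum as small as the observed minimum
mean(max.y >= max(time))    # predictive maximum as large as the observed maximum
mean(inter.q >= obs.iqr)    # predictive interquartile range as large as observed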
Dr. Pablo E. Verde 112 Lecture 4 - Multiparameters with R
Multinomial distribution with unknown cell probabilities
In this model we observe n independent trials, each of which has p possible outcomes with associated probabilities θ = (θ1, . . . , θp). Letting y = (y1, . . . , yp) be the number of times each outcome is observed, we have
y|θ ∼ Multinomial(n, θ),
with likelihood
p(y|θ) = ( (Σi yi)! / ∏i yi! ) ∏i θi^yi,   with Σi yi = n and Σi θi = 1.
Dr. Pablo E. Verde 113 Lecture 4 - Multiparameters with R
The kernel of this likelihood is proportional to a Dirichlet distribution. If a priori we assume
θ ∼ Dirichlet(a1, . . . , ap), ai > 0,
then
p(θ) ∝ ∏i θi^(ai−1)
and the posterior for θ is
p(θ|y) ∝ ∏i θi^(yi+ai−1),
so taking αi = ai + yi, the posterior for θ is
p(θ|y) = Dirichlet(α1, . . . , αp).
Dr. Pablo E. Verde 114 Lecture 4 - Multiparameters with R
Some comments
The prior distribution is equivalent to a likelihood resulting from Σi ai observations, with ai observations in the i-th category.
A uniform density is obtained by setting ai = 1 for all i. This distribution assigns equal probability to each θi.
In general it is very difficult to work with this distribution in practice.
Dr. Pablo E. Verde 114 Lecture 4 - Multiparameters with R

Some comments

The prior distribution is equivalent to a likelihood resulting from Σi ai observations, with ai observations in the ith category.
A uniform density is obtained by setting ai = 1 for all i. This distribution assigns equal probability to each θi.
In general it is very difficult to work with this distribution in practice.
Proportions are in general not independent, and modeling prior information in this context is very difficult.
For contingency tables we are going to use a surrogate Poisson model for the multinomial distribution.
Moments and properties of the Dirichlet distribution are given in Appendix A of Gelman et al.

Dr. Pablo E. Verde 115 Lecture 4 - Multiparameters with R

Example: A model for a two by two contingency table

Intervention:        New      Control
Death                θ1,1     θ1,2
No death             θ2,1     θ2,2

with N subjects in total.

Data model:      p(y|θ) ∝ Π(j=1..2) Π(i=1..2) θi,j^yi,j
Prior model:     p(θ) ∝ Π(j=1..2) Π(i=1..2) θi,j^(ai,j − 1)
Posterior model: p(θ|y) ∝ Π(j=1..2) Π(i=1..2) θi,j^(yi,j + ai,j − 1)

Dr. Pablo E. Verde 116 Lecture 4 - Multiparameters with R

We are interested in making inference for the odds ratio:

Ψ = (θ1,1 θ2,2) / (θ1,2 θ2,1).

We use direct simulation methods:

Simulate a large number of values for the vector θ from its posterior.
For each simulated value calculate Ψ*.
Inference for Ψ is based on the histogram of Ψ*.

Dr. Pablo E. Verde 117 Lecture 4 - Multiparameters with R

Example: GREAT trial (Spiegelhalter et al., p. 69)

Intervention: Thrombolytic therapy after myocardial infarction, given at home by general practitioners.
Aim of study: to compare a new drug treatment, given at home as soon as possible after a myocardial infarction, with placebo.
Outcome measure: Thirty-day mortality rate under each treatment, with the benefit of the new treatment measured by the odds ratio, i.e., the ratio of the odds of death following the new treatment to the odds of death on the conventional one: OR < 1 therefore favors the new treatment.
Prospective Bayesian analysis: NO. It was carried out after the trial reported its results.

Dr. Pablo E. Verde 118 Lecture 4 - Multiparameters with R

Example: GREAT trial (continued)

Intervention:     New     Control
Death              13          23     36
No death          150         125    275
                  163         148    311

We use a Dirichlet prior with parameters a1,1 = a1,2 = a2,1 = a2,2 = 1, which corresponds to a uniform distribution for (θ1,1, θ1,2, θ2,1, θ2,2).

Dr. Pablo E. Verde 119 Lecture 4 - Multiparameters with R

We can simulate from a Dirichlet distribution with the function rdirichlet() from the package LearnBayes:

> library(LearnBayes)
> draws <- rdirichlet(10000, c(13,23,150,125) )
> odds <- draws[,1]*draws[,4]/(draws[,2]*draws[,3])
> hist(odds, breaks = 100, xlab="Odds Ratio", freq = FALSE)
> lines(density(odds), lty = 2, lwd = 2, col = "blue")
> abline(v = quantile(odds, prob=c(0.025, 0.5, 0.975)), lty = 3, col = "red")
>
> quantile(odds, prob=c(0.025, 0.5, 0.975))
     2.5%       50%     97.5%
0.2157422 0.4649921 0.9698272

Dr. Pablo E. Verde 120 Lecture 4 - Multiparameters with R

[Figure: histogram and density of the simulated posterior odds ratios, with the 2.5%, 50% and 97.5% quantiles marked, together with the implied prior density of the odds ratio.]

Dr. Pablo E. Verde 121 Lecture 4 - Multiparameters with R

Now suppose that we want to calculate the "p-value" of H0: Ψ ≥ 1 vs. H1: Ψ < 1. Then,

> sum(odds > 1)/10000
[1] 0.0187

It is interesting to compare these results with the exact Fisher test:

> fisher.test(matrix(c(13, 23, 150, 125), nrow=2, byrow=TRUE), alternative="less")
Fisher's Exact Test for Count Data
p-value = 0.02817

Thus, the uniform prior on (θ1,1, θ1,2, θ2,1, θ2,2) does not correspond to a standard statistical analysis.
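The "prior odds" curve shown in the histogram of the odds ratio above can be reproduced by pushing draws from the Dirichlet(1, 1, 1, 1) prior through the same odds ratio formula. A short sketch (assuming LearnBayes is still loaded; the object names are illustrative):

# Sketch: implied prior for the odds ratio under the uniform Dirichlet prior
prior.draws <- rdirichlet(10000, c(1, 1, 1, 1))
prior.odds  <- prior.draws[,1]*prior.draws[,4] / (prior.draws[,2]*prior.draws[,3])
quantile(prior.odds, prob = c(0.025, 0.5, 0.975))  # median near 1, but very dispersed
mean(prior.odds >= 1)                              # prior Pr(Psi >= 1) is about 0.5

This makes explicit what the uniform prior on the cell probabilities implies on the odds ratio scale.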
Dr. Pablo E. Verde 122 Lecture 4 - Multiparameters with R

Now, if we work with a prior with parameters a1,1 = 2, a1,2 = 1, a2,1 = 1 and a2,2 = 2, with density

p(θ1,1, θ1,2, θ2,1, θ2,2) ∝ θ1,1 θ2,2,

we get the following results:

> # comparison with the Fisher exact test
> param <- c(2,1,1,2)   # parameters of the Dirichlet
> draws <- rdirichlet(10000, c(13,23,150,125) + param )
> odds <- draws[,1]*draws[,4]/(draws[,2]*draws[,3])
> sum(odds>1)/10000
[1] 0.0277

This shows (empirically) that the Fisher test is NOT based on a non-informative prior. Some weak information about association is implied in the test.
However, there is a difference in interpretation between the Bayesian and sampling theory results.

Dr. Pablo E. Verde 123 Lecture 4 - Multiparameters with R

[Figure: perspective plot of the Dirichlet prior density f(θ1,1, θ2,2) with a1,1 = 2, a1,2 = 1, a2,1 = 1, a2,2 = 2.]

Dr. Pablo E. Verde 124 Lecture 4 - Multiparameters with R

Practical Exercise: Normal data with unknown mean and variance

Review the Bayesian analysis presented in this lecture.
Generate posterior predictive data and check model fitness.
Now, change the first observation of the marathon times and generate an outlier as follows:

> library(LearnBayes)
> data(marathontimes)
> marathontimes$time
[1] 182 201 221 234 237 251 261 266 267 273 286 291 292 ...
> # generate an outlier in the first observation:
data1 <- marathontimes$time
data1[1] <- 82

Make the same analysis for the contaminated data.
What is the effect of this outlier on the analysis?

Dr. Pablo E. Verde 125 Lecture 5 - Introduction to WinBUGS

Lecture 5: Introduction to WinBUGS

Dr. Pablo E. Verde 126 Lecture 5 - Introduction to WinBUGS

Summary

Introduction to BUGS
The BUGS language
Some simple examples
Making predictions
Connecting WinBUGS with R

Dr. Pablo E. Verde 127 Lecture 5 - Introduction to WinBUGS

Introduction to BUGS

The BUGS project began at the Medical Research Council Biostatistics Unit in Cambridge in 1989, before the classic Gibbs sampling paper by Gelfand and Smith in 1990. An excellent review and future directions of the BUGS project are given in Lunn et al. (2009).
BUGS stands for Bayesian inference Using Gibbs Sampling, reflecting the basic computational techniques originally adopted.
BUGS has been just one part of the tremendous growth in the application of Bayesian ideas over the last 20 years.
At this time BUGS has over 30,000 registered users worldwide, and an active on-line community comprising over 8,000 members.
Typing "WinBUGS" in Google generates over 100,000 hits; in Google Scholar over 5,000; and searching in Statistics in Medicine on-line gives about 100 hits.

Dr. Pablo E. Verde 128 Lecture 5 - Introduction to WinBUGS

Think different ...

The modelling philosophy of BUGS was strongly influenced by developments in artificial intelligence in the 1980's. In particular the focus was on expert systems, where a basic principle was to separate:

Knowledge base: the assumed model for the world; makes use of a declarative form of programming; structures described using a series of local relationships
Inference engine: used to draw conclusions in specific circumstances

This approach forces one to think first about the model used to describe the problem at hand.
The declarative programming approach sometimes confuses statistical analysts used to working with procedural or functional statistical languages (SAS, SPSS, R, etc.).

Dr. Pablo E. Verde 129 Lecture 5 - Introduction to WinBUGS

The BUGS language:

Language for specifying complex Bayesian models.
Constructs an object-oriented internal representation of the model graph by identifying parents and children. This is done with a DAG (Directed Acyclic Graph).
Builds up an arbitrarily complex model through specification of local structure.
Simulation from full conditionals using Gibbs sampling.
The current version is WinBUGS 1.4.3; it runs in Windows and incorporates the DoodleBUGS graphical model editor and a script language for running in batch mode.
WinBUGS is freely available from http://www.mrc-bsu.cam.ac.uk/bugs
An open source version of BUGS called OpenBUGS is under development, and includes versions of BUGS that run under LINUX (LinBUGS) and that can be run directly from R (BRugs). See http://www.rni.helsinki.fi/openbugs

Dr. Pablo E. Verde 130 Lecture 5 - Introduction to WinBUGS

Example: Drug

In n = 20 patients we observed y = 15 positive responses.

y ∼ Bin(θ, n)

and we assume a conjugate prior for θ:

θ ∼ Beta(a, b)

Of course, we know the posterior distribution is

θ|y ∼ Beta(a + y, b + n − y)

and no simulation is necessary. But just to illustrate WinBUGS ...

Dr. Pablo E. Verde 131 Lecture 5 - Introduction to WinBUGS

Dr. Pablo E. Verde 132 Lecture 5 - Introduction to WinBUGS

Directed Acyclic Graph (DAG) representation:

Oval nodes represent random variables
Rectangular nodes represent constants
Arrows represent parent-child relationships

Dr. Pablo E. Verde 133 Lecture 5 - Introduction to WinBUGS

1. Write BUGS code to specify the model ...

model {
  y ~ dbin(theta, n)
  theta ~ dbeta(a, b)
}

2. Check the model syntax ...
3. Load the data ... list(n = 20, y = 15, a = 3, b = 2) ... and compile.
4. Then load initial values ... list(theta = 0.5)
5. Select the nodes (variables) to monitor (just one in this case) ...
6. Set the trace for all selected nodes (*) to monitor the MCMC simulations ...
7. Using the Update tool, select 10,000 simulations ...
8. Results ...

Dr. Pablo E. Verde 134 Lecture 5 - Introduction to WinBUGS

Example: Drug (continued)

Now we assume a different prior: a logit transform

φ = log( θ / (1 − θ) ),  so that  −∞ < φ < ∞.

We assume a normal prior for φ:

φ ∼ Normal(µ, τ)

for suitable mean µ and precision τ = 1/σ².
This is a non-conjugate prior with no simple form for the posterior.
Straightforward in WinBUGS!

Dr. Pablo E. Verde 135 Lecture 5 - Introduction to WinBUGS

Double arrows represent a logical node or a mathematical functional relationship.

Dr. Pablo E. Verde 136 Lecture 5 - Introduction to WinBUGS

The BUGS code for the model is ...

model {
  y ~ dbin(theta, n)
  logit(theta) <- phi
  phi ~ dnorm(0, 0.001)
}

Dr. Pablo E. Verde 137 Lecture 5 - Introduction to WinBUGS

Making predictions

It is important to be able to predict unobserved quantities for:
'filling-in' missing or censored data
model checking - are predictions 'similar' to the observed data?
making predictions!

Easy in MCMC/WinBUGS: just specify a stochastic node without a data value and it is automatically predicted.
Provides automatic imputation of missing data.
The easiest case is where there is no data at all! Just 'forward sampling' from the prior to make a Monte Carlo analysis.

Dr. Pablo E. Verde 138 Lecture 5 - Introduction to WinBUGS

Example: making predictions

The BUGS code for the model is ...

model {
  y ~ dbin(theta, n)
  logit(theta) <- phi
  phi ~ dnorm(mu, tau)
  y.pred ~ dbin(theta, n)   # defines a stochastic node for predictions
}

Dr. Pablo E. Verde 139 Lecture 5 - Introduction to WinBUGS

The prediction node y.pred is conditionally independent of the data node y given the rate θ.
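As a quick cross-check outside WinBUGS (not part of the original example), the same predictive idea can be run by forward simulation in R when the conjugate Beta(a, b) prior of the first Drug model is used, since then θ|y ∼ Beta(a + y, b + n − y):

# Sketch: Monte Carlo posterior and predictive draws for the conjugate Drug model
n <- 20; y <- 15; a <- 3; b <- 2                  # data and Beta(a, b) prior as above
theta  <- rbeta(10000, a + y, b + n - y)          # posterior draws of theta
y.pred <- rbinom(10000, size = n, prob = theta)   # predictive draws for a new set of n patients
quantile(theta, c(0.025, 0.5, 0.975))
table(y.pred) / 10000                             # estimated predictive distribution of y.pred

The WinBUGS results for the non-conjugate logit-normal prior can be compared against this conjugate benchmark.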
Dr. Pablo E. Verde 140 Lecture 5 - Introduction to WinBUGS

Summary: Running WinBUGS

1. Open the Specification tool and Update from the Model menu, and Samples from the Inference menu.
2. Highlight the model by double-click. Click on Check model.
3. Highlight the start of the data. Click on Load data.
4. Click on Compile.
5. Highlight the start of the initial values. Click on Load inits.
6. Click on Gen Inits if the model needs further initial values.
7. Click on Update to burn in.
8. Type the nodes to be monitored into the Sample Monitor, and click set after each.
9. Perform more updates.
10. Type * into the Sample Monitor, and click stats, etc. to see results on all monitored nodes.

Dr. Pablo E. Verde 141 Lecture 5 - Introduction to WinBUGS

The R2WinBUGS package in R

The R package R2WinBUGS offers a versatile approach for making MCMC computations within WinBUGS and returning them to R.
Let's see a step by step example of linking R and WinBUGS with R2WinBUGS.

Example: non-conjugate prior inference for a binomial experiment

First save the following file with WinBUGS code in your working directory ...

# WinBUGS code: binary problem, non-conjugate analysis
model {
  y ~ dbin(theta, n)
  logit(theta) <- phi
  phi ~ dnorm(0, 0.001)
  y.pred ~ dbin(theta, n)   # making predictions
}

Dr. Pablo E. Verde 142 Lecture 5 - Introduction to WinBUGS

Then in the R console ...

# load R2WinBUGS package
library(R2WinBUGS)
# set up the WinBUGS directory and your working directory
bugsdir <- "C:/Programme/WinBUGS14"
workdir <- getwd()
# define your data nodes
n <- 20
y <- 15
data1 <- list("n", "y")

Dr. Pablo E. Verde 143 Lecture 5 - Introduction to WinBUGS

# define parameters of interest
par1 <- c("theta", "y.pred")
# run the bugs() function:
m1 <- bugs(data1, inits = NULL, par1, "model1.txt",
           n.chains = 1, n.iter = 2000, n.thin = 1,
           bugs.directory = bugsdir,
           working.directory = getwd(), debug = TRUE)

Dr. Pablo E. Verde 144 Lecture 5 - Introduction to WinBUGS

> print(m1, digits.summary = 3)
Inference for Bugs model at "model1.txt", fit using WinBUGS,
1 chains, each with 2000 iterations (first 1000 discarded)
n.sims = 1000 iterations saved
           mean    sd  2.5%    25%    50%    75%  97.5%
theta     0.749 0.096 0.543  0.690  0.760  0.819  0.914
y.pred   14.962 2.733 9.000 13.000 15.000 17.000 19.000
deviance  4.277 1.584 3.198  3.286  3.667  4.551  9.014

DIC info (using the rule, pD = Dbar-Dhat)
pD = 1.1 and DIC = 5.3
DIC is an estimate of expected predictive error (lower deviance is better).

Dr. Pablo E. Verde 145 Lecture 5 - Introduction to WinBUGS

The R object m1 generated in this analysis belongs to the class bugs and can be further manipulated, transformed, etc.

> class(m1)
[1] "bugs"
> names(m1)
 [1] "n.chains"        "n.iter"          "n.burnin"
 [4] "n.thin"          "n.keep"          "n.sims"
 [7] "sims.array"      "sims.list"       "sims.matrix"
[10] "summary"         "mean"            "sd"
[13] "median"          "root.short"      "long.short"
[16] "dimension.short" "indexes.short"   "last.values"
[19] "isDIC"           "DICbyR"          "pD"
[22] "DIC"             "model.file"      "program"

Dr. Pablo E. Verde 146 Lecture 5 - Introduction to WinBUGS

The object m1 is a list in R, so we can extract elements of the list by using the "$" operator, for example:

> m1$pD
[1] 1.065
> m1$n.chains
[1] 1
> m1$sims.array[1:10, , "theta"]
 [1] 0.8389 0.9124 0.7971 0.8682 0.7025 0.7696 0.8417 0.6782 0.6146
[10] 0.8647
> m1$sims.array[1:10, , "y.pred"]
 [1] 15 19 19 20 15 20 18 17 10 17
> theta <- m1$sims.array[1:1000, , "theta"]
> hist(theta, breaks = 50, prob = TRUE)
> lines(density(theta), col = "blue", lwd = 2)

Dr. Pablo E. Verde 147 Lecture 5 - Introduction to WinBUGS

[Figure: histogram and density estimate of the posterior draws of theta.]
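The predictive node can be summarised in the same way as theta. A short sketch (not in the original slides) using the m1 object above:

# Sketch: posterior predictive distribution of y.pred from the bugs object m1
y.pred <- m1$sims.array[, , "y.pred"]
table(y.pred) / length(y.pred)    # estimated predictive probabilities
mean(y.pred >= 15)                # chance that a replicate trial gives 15 or more responses
hist(y.pred, breaks = seq(-0.5, 20.5, 1), main = "Predictive distribution of y.pred")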
Dr. Pablo E. Verde 148 Lecture 5 - Introduction to WinBUGS

Some aspects of the BUGS language

<- represents logical dependence, e.g. m <- a + b*x
~ represents stochastic dependence, e.g. r ~ dunif(a, b)

Can use arrays and loops:

model{
  ....
  for (i in 1:N){
    r[i] ~ dbin(p[i], n[i])
  }
  ...
}

Dr. Pablo E. Verde 149 Lecture 5 - Introduction to WinBUGS

A for loop is represented as a "plate" in the DAG of the model.

Dr. Pablo E. Verde 150 Lecture 5 - Introduction to WinBUGS

Some aspects of the BUGS language

Some functions can appear on the left-hand side of an expression, e.g.
logit(p[i]) <- a + b*x[i]
log(m[i]) <- c + d*y[i]

mean(p[]) to take the mean of a whole array, mean(p[m:n]) to take the mean of elements m to n. Similarly for sum(p[]).

dnorm(0, 1)I(0, ) means the random variable will be restricted to the range (0, ∞).

Dr. Pablo E. Verde 151 Lecture 5 - Introduction to WinBUGS

Functions in the BUGS language

p <- step(0.05 - x) = 1 if x ≤ 0.05, 0 otherwise. Hence monitoring p and recording its mean will give the probability that x ≤ 0.05. This is useful to calculate Bayesian p-values.
p <- equals(x, 0.7) = 1 if x = 0.7, 0 otherwise.
tau <- 1/pow(s, 2) sets τ = 1/s².
s <- 1/sqrt(tau) sets s = 1/√τ.
p[i,k] <- inprod(pi[], Lambda[i,]) sets p_i,k = Σ_j π_j Λ_i,j.
See 'Model Specification/Logical nodes' in the manual for the full syntax.

Dr. Pablo E. Verde 152 Lecture 5 - Introduction to WinBUGS

Data transformations

Although transformations of data can always be carried out before using WinBUGS, it is convenient to be able to try various transformations of dependent variables within a model description.
For example, we may wish to try both y and sqrt(y) as dependent variables without creating a separate variable z = sqrt(y) in the data file.
The BUGS language therefore permits the following type of structure to occur:

for (i in 1:N) {
  z[i] <- sqrt(y[i])
  z[i] ~ dnorm(mu, tau)
}

Strictly speaking, this goes against the declarative structure of the model specification.

Dr. Pablo E. Verde 153 Lecture 5 - Introduction to WinBUGS

Some common distributions

Binomial: r ~ dbin(p, n)
Normal: x ~ dnorm(mu, tau)
Poisson: r ~ dpois(lambda)
Uniform: x ~ dunif(a, b)
Gamma: x ~ dgamma(a, b)

Note: The normal distribution is parameterized in terms of its mean and precision = 1/variance = 1/σ².
Functions cannot be used as arguments in distributions. You need to create new nodes.

Dr. Pablo E. Verde 154 Lecture 5 - Introduction to WinBUGS

The WinBUGS data formats

WinBUGS accepts data files in:

1. Rectangular format

n[] r[]
50 2
....
20 4
END

2. R list format:

list(N = 12, n = c(50,12,...), r = c(2, 1,...))

Dr. Pablo E. Verde 155 Lecture 5 - Introduction to WinBUGS

Double indexing: specifying categorical explanatory variables

y[] x[]
12 1    # subject with x = level 1
34 2    # subject with x = level 2
...

for( i in 1:N) {
  y[i] ~ dnorm(mu[i], tau)
  mu[i] <- alpha + beta[x[i]]
}
alpha ~ dunif(-100, 100)
beta[1] <- 0               # alias first level of beta
beta[2] ~ dunif(-100, 100)
beta[3] ~ dunif(-100, 100)
tau ~ dgamma(0.1, 0.1)

Dr. Pablo E. Verde 156 Lecture 5 - Introduction to WinBUGS

Practical Exercise: Survival data

Aitchison & Dunsmore (1975) give data on survival times (in weeks) of 20 patients after treatment for a particular carcinoma. The data set is

list(survtime = c(25, 45, 238, 194, 16, 23, 30, 16, 22, 123, 51, 412, 45, 162, 14, 72, 5, 43, 45, 91), N = 20)

Dr. Pablo E. Verde 157 Lecture 5 - Introduction to WinBUGS
1. Write a WinBUGS script for the following model:

   yi = log(survtime[i])
   yi | µ, τ ∼ N(µ, τ),  τ = 1/σ²

   with non-informative priors µ ∼ N(0, 10^-3) and τ ∼ Gamma(10^-3, 10^-3).

2. Use as initial values list(mu = 0, tau = 1).
3. Draw a DAG for this model.
4. Run the model with 10,000 iterations and discard the first 5,000. Analyze convergence visually.
5. Estimate the probability that a new patient has a survival time of more than 150 weeks.

Dr. Pablo E. Verde 158 Lecture 5 - Introduction to WinBUGS

Solution

# WinBUGS script:
model{
  for (i in 1:N) {
    y[i] <- log(survtime[i])      # data transform
    y[i] ~ dnorm(mu, tau)         # sampling model
  }
  sigma <- 1/sqrt(tau)            # standard deviation
  mu ~ dnorm(0, 1.0E-3)           # prior for mu
  tau ~ dgamma(1.0E-3, 1.0E-3)    # prior for tau
  y.new ~ dnorm(mu, tau)          # predicted data
  y.dif <- step(y.new - log(150)) # Pr(survival > 150)
}

Dr. Pablo E. Verde 159 Lecture 5 - Introduction to WinBUGS

Solution

Dr. Pablo E. Verde 160 Lecture 6 - Multivariate Models with WinBUGS

Lecture 6: Multivariate Models with WinBUGS

Dr. Pablo E. Verde 161 Lecture 6 - Multivariate Models with WinBUGS

Multivariate Normal Distribution

Multivariate Normal for a p-dimensional vector y:

y ∼ Normal_p(µ, Σ)

The conjugate prior for µ is also a MVN. WinBUGS follows the parametrization using the precision matrix Ω = Σ⁻¹.

In BUGS notation we have:

y[1:p] ~ dmnorm(mu[], Omega[,])
mu[1:p] ~ dmnorm(mu.prior[], Omega.prior[])
Sigma[1:p, 1:p] <- inverse(Omega[,])

Dr. Pablo E. Verde 162 Lecture 6 - Multivariate Models with WinBUGS

Priors on the precision matrix of multivariate normals

The conjugate prior is the Wishart distribution, which is analogous to the Gamma or χ².
It arises in classical statistics as the distribution of the sum-of-squares and products matrix in multivariate normal sampling.
The Wishart distribution Wp(k, R) for a symmetric positive definite p × p matrix Ω has density

p(Ω) ∝ |R|^(k/2) |Ω|^((k−p−1)/2) exp( −½ tr(RΩ) ),

defined for a real scalar k > p − 1 and a symmetric positive definite matrix R.

Dr. Pablo E. Verde 163 Lecture 6 - Multivariate Models with WinBUGS

Some characteristics

When p = 1, W1(k, R) ≡ Gamma(k/2, R/2) ≡ χ²_k / R.
The expectation of Wp(k, R) is E[Ω] = k R⁻¹.
The Jeffreys prior is p(Ω) ∝ |Σ|^(−(p+1)/2), equivalent to k → 0. This is not currently implemented in WinBUGS.
For a "weakly informative" prior we can set R/k to be a rough prior guess at the unknown true covariance matrix; taking k = p indicates a minimal "effective prior sample size".

Dr. Pablo E. Verde 164 Lecture 6 - Multivariate Models with WinBUGS

Known problems

Every element must have the same precision.
Incomprehensible! It is a good idea to make some prior predictions in practice to understand what we are doing.
Gelman et al. (2004), page 483, outline alternative strategies.
If you do not use Wishart priors for precision matrices in BUGS, you need to make sure that the covariance matrix at each iteration is positive-definite, otherwise it may crash.

Dr. Pablo E. Verde 165 Lecture 6 - Multivariate Models with WinBUGS

Exercise 1: Correlation analysis with missing data

The following is an artificial data set from the book Tools for Statistical Inference (Tanner, 1994, page 63):

> Y
      [,1] [,2]
 [1,]    1    1
 [2,]    1   -1
 [3,]   -1    1
 [4,]   -1   -1
 [5,]    2   NA
 [6,]    2   NA
 [7,]   -2   NA
 [8,]   -2   NA
...

Dr. Pablo E. Verde 166 Lecture 6 - Multivariate Models with WinBUGS

The following script in WinBUGS implements a Bayesian analysis for this problem.
model {
  for (i in 1:12){
    Y[i, 1:2] ~ dmnorm(mu[], tau[,])
  }
  mu[1] <- 0
  mu[2] <- 0
  tau[1:2, 1:2] ~ dwish(R[,], 2)
  R[1, 1] <- 0.001
  R[1, 2] <- 0
  R[2, 1] <- 0
  R[2, 2] <- 0.001
  Sigma2[1:2, 1:2] <- inverse(tau[,])
  rho <- Sigma2[1, 2] / sqrt(Sigma2[1, 1] * Sigma2[2, 2])
}

Dr. Pablo E. Verde 167 Lecture 6 - Multivariate Models with WinBUGS

[Figure: posterior distribution P(ρ|data).]

Using posterior predictive values in each iteration, WinBUGS imputes the missing values.

Dr. Pablo E. Verde 168 Lecture 6 - Multivariate Models with WinBUGS

Multinomial model: non-conjugate analysis

In some applications the cell probabilities of contingency tables can be made a function of more basic parameters.

Example: population genetics

                      Offspring genotype
Maternal genotype     AA      AB      BB
AA                    427      95       -
AB                    108     161      71
BB                      -      64      74

Dr. Pablo E. Verde 169 Lecture 6 - Multivariate Models with WinBUGS

The model equations are given by the following table:

                      Offspring genotype
Maternal genotype     AA                   AB           BB
AA                    (1 − σ)p + σ         (1 − σ)q     -
AB                    (1 − σ)p/2 + σ/4     1/2          (1 − σ)q/2 + σ/4
BB                    -                    (1 − σ)p     (1 − σ)q + σ

where p is the frequency of A in outcross pollen, σ is the rate of self-fertilization and q = 1 − p.

Dr. Pablo E. Verde 170 Lecture 6 - Multivariate Models with WinBUGS

To implement this model in WinBUGS, we equate the cell probabilities with the required functions:

model{
  XAA[1] <- (1-sigma)*p + sigma;      XAA[2] <- (1-sigma)*q;   XAA[3] <- 0
  XAB[1] <- (1-sigma)*p/2 + sigma/4;  XAB[2] <- 0.5;           XAB[3] <- (1-sigma)*q/2 + sigma/4
  XBB[1] <- 0;                        XBB[2] <- (1-sigma)*p;   XBB[3] <- (1-sigma)*q + sigma
  KAA <- sum(NAA[]);  KAB <- sum(NAB[]);  KBB <- sum(NBB[])
  NAA[1:3] ~ dmulti(XAA[], KAA)
  NAB[1:3] ~ dmulti(XAB[], KAB)
  NBB[1:3] ~ dmulti(XBB[], KBB)
  p ~ dunif(0, 1)       # uniform prior for p
  sigma ~ dunif(0, 1)   # uniform prior for sigma
  q <- 1 - p
}

list(NAA = c(427, 95, 0), NAB = c(108, 161, 71), NBB = c(0, 64, 74))

Dr. Pablo E. Verde 171 Lecture 6 - Multivariate Models with WinBUGS

> print(m.popgen, digits = 3)
Inference for Bugs model at "popgen.bug", fit using WinBUGS,
2 chains, each with 10000 iterations (first 5000 discarded)
n.sims = 10000 iterations saved
           mean    sd   2.5%    25%    50%    75%  97.5%  Rhat n.eff
p         0.705 0.024  0.655  0.690  0.706  0.721  0.749 1.001 10000
sigma     0.371 0.042  0.288  0.343  0.371  0.400  0.451 1.001 10000
deviance 27.151 1.972 25.210 25.730 26.540 27.960 32.410 1.001 10000

For each parameter, n.eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor (at convergence, Rhat = 1).

DIC info (using the rule, pD = Dbar-Dhat)
pD = 2.0 and DIC = 29.1
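For completeness, the m.popgen object summarised above could be produced from R with a call along the following lines. This is a sketch only: it assumes the model code has been saved as popgen.bug (the file name in the output above), that bugsdir points to the WinBUGS installation as in Lecture 5, and the R object names are illustrative.

# Sketch: running the population genetics model via R2WinBUGS
library(R2WinBUGS)
NAA <- c(427, 95, 0); NAB <- c(108, 161, 71); NBB <- c(0, 64, 74)
data.pg <- list("NAA", "NAB", "NBB")
par.pg  <- c("p", "sigma")
m.popgen <- bugs(data.pg, inits = NULL, par.pg, "popgen.bug",
                 n.chains = 2, n.iter = 10000,
                 bugs.directory = bugsdir, working.directory = getwd())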
Dr. Pablo E. Verde 172 Lecture 6 - Multivariate Models with WinBUGS

[Figure: posterior draws of p and sigma from the population genetics model: marginal histograms and joint scatter plot.]

Dr. Pablo E. Verde 173 Lecture 7 - MCMC-Short-version

Lecture 7: Bayesian Computations with MCMC Methods

Dr. Pablo E. Verde 174 Lecture 7 - MCMC-Short-version

Summary

Introduction to discrete Markov chains.
Construction of Markov chain Monte Carlo algorithms.
Gibbs sampling methods.
Metropolis and Metropolis-Hastings methods.
Issues in applications of MCMC (convergence checking, proposal distributions, etc.)

Dr. Pablo E. Verde 175 Lecture 7 - MCMC-Short-version

Why is computation important?

Bayesian inference centers around the posterior distribution

p(θ|y) ∝ p(y|θ) × p(θ),

where θ may be a vector of thousands of components!
p(y|θ) and p(θ) are usually available in closed form, but p(θ|y) is usually not analytically tractable.

Also, we may want to:
obtain marginal posteriors, p(θ1|y) = ∫ p(θ|y) dθ(−1)
calculate properties of p(θ1|y), such as the mean E(θ1|y) = ∫ θ1 p(θ|y) dθ(−1), tail areas, etc.

We see that numerical integration becomes crucial in Bayesian inference!

Dr. Pablo E. Verde 176 Lecture 7 - MCMC-Short-version

General Monte Carlo integration

If we have algorithms for sampling from arbitrary (typically high-dimensional) posterior distributions, we can use Monte Carlo methods for Bayesian inference.

Suppose we can draw samples from the joint posterior distribution for θ, i.e.

θ(1), θ(2), . . . , θ(N) ∼ p(θ|y).

Then Monte Carlo integration gives

E(g(θ)) = ∫ g(θ) p(θ|y) dθ ≈ (1/N) Σ(i=1..N) g(θ(i)).

Theorems exist which prove convergence even if the sample is dependent, i.e.

(1/N) Σ(i=1..N) g(θ(i)) → E(g(θ)) as N → ∞.

Dr. Pablo E. Verde 177 Lecture 7 - MCMC-Short-version

Markov Chain Monte Carlo (MCMC)

Independent sampling from p(θ|y) may be very difficult in high dimensions.
Alternative strategy based on dependent sampling:
We know p(θ|y) up to a normalizing constant.
Then we design a Markov chain which has p(θ|y) as its stationary distribution.
A sequence of random variables θ(0), θ(1), θ(2), . . . forms a Markov chain if θ(i+1) ∼ p(θ|θ(i)), i.e. conditional on the value of θ(i), θ(i+1) is independent of θ(i−1), . . . , θ(0).

Dr. Pablo E. Verde 178 Lecture 7 - MCMC-Short-version

Run the chain until it appears to have settled down to equilibrium, say

θ(k), θ(k+1), . . . , θ(K) ∼ p(θ|y).

Use these sampled values to empirically estimate the posterior of θ, say p̂(θ|y).

Problem: how do we design a Markov chain with p(θ|y) as its unique stationary distribution?

Dr. Pablo E. Verde 179 Lecture 7 - MCMC-Short-version

Answer: This is surprisingly easy and several standard recipes are available.

Metropolis et al. (1953) showed how to do it.
Hastings (1970) generalized the Metropolis algorithm.
Geman and Geman (1984) introduced the Gibbs sampling algorithm.
Gelfand and Smith (1990) popularized Gibbs sampling in statistics.
See Gilks, Richardson and Spiegelhalter (1996) for a gentle introduction and many worked examples. Robert and Casella (2004) is a more detailed theoretical reference.

Dr. Pablo E. Verde 180 Lecture 7 - MCMC-Short-version

The Gibbs sampler

Let our vector of unknowns θ consist of k sub-components θ = (θ1, . . . , θk).
1. Choose starting values θ1^(0), . . . , θk^(0).
2. Sample from

   θ1^(1) ∼ p(θ1 | θ2^(0), θ3^(0), . . . , θk^(0), y)
   θ2^(1) ∼ p(θ2 | θ1^(1), θ3^(0), . . . , θk^(0), y)
   . . .
   θk^(1) ∼ p(θk | θ1^(1), θ2^(1), . . . , θ(k−1)^(1), y)

3. Repeat step 2 many 1000s of times. Eventually we obtain samples from p(θ|y).

Dr. Pablo E. Verde 181 Lecture 7 - MCMC-Short-version

Example: Normal distribution with unknown mean and variance

Suppose that the observations y1, . . . , yn ∼ N(µ, τ⁻¹).
We assume that the yi are conditionally independent given µ and the precision τ, and that µ and τ are themselves independent.
We put conjugate priors on µ and τ:

µ ∼ N(µ0, φ0⁻¹),   τ ∼ Gamma(a, b).

Then the full conditionals for µ and τ are:

p(µ|τ, y) = N( (µ0 φ0 + n ȳ τ) / (φ0 + n τ),  1 / (φ0 + n τ) )
p(τ|µ, y) = Gamma( a + n/2,  b + ½ Σ (yi − µ)² )

Dr. Pablo E. Verde 182 Lecture 7 - MCMC-Short-version

Programming the Gibbs sampler in R

Example: Normal distribution with unknown mean and variance

gibbsNormal <- function(y, mu1, tau1, N, mu0, phi0, a, b) {
  mu <- numeric(N+1)      # N = number of iterations
  mu[1] <- mu1            # the initial value for mu
  tau <- numeric(N+1)
  tau[1] <- tau1          # the initial value for tau
  n <- length(y); ybar <- mean(y)
  for(i in 2:(N+1)) {
    # generate samples from the full conditionals
    mu[i] <- rnorm(1, (mu0*phi0 + n*ybar*tau[i-1]) / (phi0 + n*tau[i-1]),
                   1/sqrt(phi0 + n*tau[i-1]))
    tau[i] <- rgamma(1, (n + 2*a)/2, (sum((y - mu[i])^2) + 2*b)/2)
  }
  cbind(mu, tau)          # return the simulated draws
}
Dr. Pablo E. Verde 183 Lecture 7 - MCMC-Short-version

[Figure] Four independent sequences of the Gibbs sampler for a normal distribution with unknown mean and variance. Left panel: first 10 steps. Right panel: last 500 iterations in each chain.

Dr. Pablo E. Verde 184 Lecture 7 - MCMC-Short-version

[Figure] Traces of one sequence of Gibbs simulations. Upper panels: 500 iterations from the posterior of µ. Lower panels: 500 iterations from the posterior of τ.
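Plots like the ones on these slides can be reproduced by calling the gibbsNormal() function above. A small usage sketch with simulated data and arbitrary hyperparameters (illustrative values only, not the data behind the figures):

# Sketch: running gibbsNormal() and inspecting the output
set.seed(123)
y <- rnorm(30, mean = 150, sd = 20)       # illustrative data
out <- gibbsNormal(y, mu1 = mean(y), tau1 = 1/var(y), N = 500,
                   mu0 = 0, phi0 = 1e-3, a = 1e-3, b = 1e-3)
par(mfrow = c(2, 2))
plot(out[, "mu"],  type = "l", ylab = "mu")
hist(out[, "mu"],  main = "Posterior of mu",  xlab = "mu")
plot(out[, "tau"], type = "l", ylab = "tau")
hist(out[, "tau"], main = "Posterior of tau", xlab = "tau")
par(mfrow = c(1, 1))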
Dr. Pablo E. Verde 185 Lecture 7 - MCMC-Short-version

General properties of an MCMC algorithm

Reversibility: The key property is reversibility or detailed balance, i.e., a balance in the flow of transition probabilities between the states of the Markov chain.
A Markov chain which is reversible has a stationary distribution.
Thus we build a kernel or transition matrix P such that it ensures reversibility:

p(x) P(y|x) = p(y) P(x|y),

where P(y|x) represents the flow of probability x → y.

Dr. Pablo E. Verde 186 Lecture 7 - MCMC-Short-version

Irreducibility: An irreducible chain is one in which, for any point θ(k) in the parameter space, it is possible to move from θ(k) to any other point θ(l) in a finite number of steps. This guarantees that the chain can visit all possible values of θ irrespective of the starting value θ(0).

Aperiodicity: A chain is aperiodic if it does not exhibit periodic behavior: if R1, R2, . . . , Rk are disjoint regions in the parameter space, the chain does not cycle around them.

To sample from p(θ|y) we construct a Markov chain with transition matrix P which satisfies reversibility, irreducibility and aperiodicity.

Dr. Pablo E. Verde 187 Lecture 7 - MCMC-Short-version

Metropolis algorithm

The algorithm proceeds as follows:

Start with a preliminary guess, say θ0.
At iteration t sample a proposal θt ∼ P(θt|θt−1). The jumping distribution must be symmetric, satisfying the condition P(θa|θb) = P(θb|θa) for all θa and θb.
If p(θt|y) > p(θt−1|y) then accept θt.
If not, flip a coin with probability of heads r = p(θt|y) / p(θt−1|y); if it comes up heads, accept θt. If the coin toss comes up tails, stay at θt−1.

The algorithm continues until we are sampling from p(θ|y). The coin tosses allow it to move to less plausible values of θ, and keep it from getting stuck in local maxima.

Dr. Pablo E. Verde 188 Lecture 7 - MCMC-Short-version

Some samplers for the Metropolis algorithm

The following are commonly implemented samplers for the proposal distribution P(θt|θt−1):

Random walk sampler: observations are generated by θt = θt−1 + zt with zt ∼ f. There are many common choices for f, including the uniform on the unit disc, a multivariate normal or a t-distribution.
The independent sampler: the candidate observation is sampled independently of the current state of the chain.
The Gibbs sampler: the Gibbs transition can be regarded as a special case of a Metropolis transition.
The STAN software implements samplers based on Hamiltonian dynamics.

Dr. Pablo E. Verde 189 Lecture 7 - MCMC-Short-version

Metropolis in R

The package MCMCpack implements Metropolis sampling for a large number of statistical models.
The current implementation uses a random walk Metropolis with a multivariate normal proposal density.
The proposal density is centered at the current θ(t), with variance-covariance matrix given by the user or automatically calculated proportional to the Hessian of the posterior evaluated at its mode.

Dr. Pablo E. Verde 190 Lecture 7 - MCMC-Short-version

Example: an interesting bivariate distribution

It is well known that the pair of marginal distributions does not uniquely determine a bivariate distribution. For example, a bivariate distribution with normal marginal distributions need not be jointly normal (Feller 1966, p. 69).
In contrast, the conditional distribution functions uniquely determine a joint density function (Arnold and Press 1989).
A natural question arises: must a bivariate distribution with normal conditionals be jointly normal?

Dr. Pablo E. Verde 191 Lecture 7 - MCMC-Short-version

The answer is NO!
Gelman and Meng (1991) introduced an interesting distribution where the joint distribution is non-normal:

f(x, y) ∝ exp( −½ (A x²y² + x² + y² − 2Bxy − 2C1x − 2C2y) ).

But it has conditional distributions which are normal:

x|y ∼ N( (By + C1)/(Ay² + 1), 1/(Ay² + 1) ),
y|x ∼ N( (Bx + C2)/(Ax² + 1), 1/(Ax² + 1) ).

The question is: how do we sample from f(x, y)?

Dr. Pablo E. Verde 192 Lecture 7 - MCMC-Short-version

Here we show an example where the parameters are A = 1, B = 0, C1 = 3 and C2 = 3.

[Figure: perspective plot of the Gelman-Meng distribution with A = 1, B = 0, C1 = 3 and C2 = 3.]

Dr. Pablo E. Verde 193 Lecture 7 - MCMC-Short-version

[Figure: contour plot of the Gelman-Meng distribution with A = 1, B = 0, C1 = 3 and C2 = 3.]

Dr. Pablo E. Verde 194 Lecture 7 - MCMC-Short-version

We implement a Metropolis sampling algorithm based on a random walk with bivariate normals as proposals.

## Gelman and Meng (1991) kernel function (on the log scale):
f.g.m <- function(xy, A = 1, B = 0, C1 = 3, C2 = 3) {
  x <- xy[1]; y <- xy[2]
  r <- -.5 * (A * x^2 * y^2 + x^2 + y^2 - 2 * B * x * y - 2 * C1 * x - 2 * C2 * y)
  as.vector(r)
}

## Metropolis sampling with MCMCpack
res <- MCMCmetrop1R(f.g.m, theta.init = c(0, 1), mcmc = 10000, burnin = 5000,
                    logfun = TRUE)   # f.g.m returns the log of the kernel

> @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The Metropolis acceptance rate was 0.47580
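The same log-kernel can also be explored with a hand-coded random walk Metropolis, which makes the algorithm described a few slides back explicit. This is a sketch, not part of the original material; the proposal standard deviation of 1 and the number of iterations are arbitrary choices.

# Sketch: random walk Metropolis for the Gelman-Meng log-kernel f.g.m
set.seed(42)
n.iter <- 15000
theta  <- matrix(NA, n.iter, 2)
theta[1, ] <- c(0, 1)                            # starting value
for (t in 2:n.iter) {
  prop  <- theta[t - 1, ] + rnorm(2, sd = 1)     # symmetric bivariate normal proposal
  log.r <- f.g.m(prop) - f.g.m(theta[t - 1, ])   # log acceptance ratio
  theta[t, ] <- if (log(runif(1)) < log.r) prop else theta[t - 1, ]
}
draws <- theta[-(1:5000), ]                      # discard burn-in
colMeans(draws)                                  # compare with the MCMCpack results in res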
Dr. Pablo E. Verde 195 Lecture 7 - MCMC-Short-version

[Figure: scatter plot of the simulated draws over the contours of the Gelman-Meng distribution.]

Dr. Pablo E. Verde 196 Lecture 7 - MCMC-Short-version

[Figure: trace plots of the simulated values of x and y against iteration number.]

Dr. Pablo E. Verde 197 Lecture 7 - MCMC-Short-version

Performance of MCMC methods

There are three main issues to consider:

Convergence: How quickly does the distribution of θ(t) approach p(θ|y)?
Efficiency: How well are functionals of p(θ|y) estimated from θ(t)?
Simplicity: How convenient is the method to use?

In general, computer effort should be measured in seconds, not iterations!

Dr. Pablo E. Verde 198 Lecture 7 - MCMC-Short-version

Checking convergence

Convergence is to the target distribution, not to a single value.
Once convergence is reached, samples should look like a random scatter about a stable mean value.
One approach is to run many long chains with widely differing starting values.
Plot traces of the simulated chains.
Plot the ACF, etc.
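In R these checks can be automated with the coda package (an extra package, not used in the original slides). A small sketch, assuming two chains of simulated values are available as the vectors or matrices chain1 and chain2 (hypothetical names, e.g. two runs of the samplers above from different starting values):

# Sketch: convergence diagnostics with the coda package
library(coda)
chains <- mcmc.list(mcmc(chain1), mcmc(chain2))
traceplot(chains)       # traces should overlap as a random scatter about a stable level
autocorr.plot(chains)   # autocorrelations should decay quickly
gelman.diag(chains)     # potential scale reduction factor; close to 1 at convergence
effectiveSize(chains)   # effective number of independent draws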
Dr. Pablo E. Verde 199 Lecture 8 - Regression Models

Lecture 8: Bayesian Regression Models

Dr. Pablo E. Verde 200 Lecture 8 - Regression Models

Modeling Examples

Multiple linear regression: model diagnostics and variable selection
ANOVA models: an alternative to multiple testing
Logistic regression: combining multiple sources of information in risk analysis

Dr. Pablo E. Verde 201 Lecture 8 - Regression Models

Introduction

Standard (and non-standard) regression models can be easily formulated within a Bayesian framework:

Specify a probability distribution for the data
Specify the form of the relationship between response and explanatory variables
Specify a prior distribution for the regression coefficients and any other unknown (nuisance) parameters

Dr. Pablo E. Verde 202 Lecture 8 - Regression Models

Some advantages of a Bayesian formulation in regression modeling include:

Easy to include parameter restrictions and other relevant prior knowledge
It is simple to extend to non-linear regression
We can easily robustify a model
Easy to make inference about functions of regression parameters and/or predictions
We can handle missing data and covariate measurement error
Transparent variable selection by prior modeling of the regression coefficients

Dr. Pablo E. Verde 203 Lecture 8 - Regression Models

Linear regression

Example: Stack loss data, WinBUGS Volume 1

In this example, we illustrate a more complete regression analysis including outlier checking, model adequacy and variable selection methods.
This is a very often analyzed data set from Brownlee (1965, p. 454):

21 daily responses of stack loss yi, the amount of ammonia escaping from industrial chimneys
Covariates: air flow x1, temperature x2 and acid concentration x3
Transformed covariates: z_k,i = (x_k,i − x̄_k)/sd(x_k) for k = 1, 2, 3.

Dr. Pablo E. Verde 204 Lecture 8 - Regression Models

[Figure: pairwise scatter plots of the response Y and the covariates X1, X2 and X3.]

Dr. Pablo E. Verde 205 Lecture 8 - Regression Models

Model 1 specification:

yi ∼ Normal(µi, τ), τ = 1/σ², i = 1, . . . , 21
µi = β0 + β1 z1,i + β2 z2,i + β3 z3,i
σ ∼ Uniform(0.01, 100)
βk ∼ Normal(0, 0.001), k = 0, . . . , 3.

Dr. Pablo E. Verde 206 Lecture 8 - Regression Models

In this model we also calculate some diagnostic quantities. By analogy with classical regression analysis, we calculate the standardized residuals as

di = (yi − E(yi|z)) / √Var(yi|z).

In a Bayesian setting these residuals have posterior distributions as well. They can be used similarly to residual analysis in classical statistics.

Dr. Pablo E. Verde 207 Lecture 8 - Regression Models

We specify the model in WinBUGS as:

model{
  # Model 1
  for (i in 1:N) {
    y[i] ~ dnorm(mu[i], tau)
    mu[i] <- beta0 + beta[1]*z[i,1] + beta[2]*z[i,2] + beta[3]*z[i,3]
    stres[i] <- (y[i] - mu[i]) / sigma
  }
  # Priors
  beta0 ~ dnorm(0, 0.001)
  for (j in 1:p) { beta[j] ~ dnorm(0, 0.001) }   # coefficients independent
  tau <- 1/(sigma*sigma)
  sigma ~ dunif(0.01, 100)   # Gelman's prior
}

Dr. Pablo E. Verde 208 Lecture 8 - Regression Models

We run the model in WinBUGS and look at the possible outliers.
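The data are available in R as the built-in data frame stackloss (the Brownlee data), which is convenient both for building the standardized covariate matrix z used above and for a quick classical fit to compare against. A sketch; the object names y, z, N and p match the WinBUGS code, while data.sl is an arbitrary name for the list that could be passed via R2WinBUGS:

# Sketch: preparing the stack loss data and a classical fit for reference
data(stackloss)
y <- stackloss$stack.loss
z <- scale(stackloss[, c("Air.Flow", "Water.Temp", "Acid.Conc.")])   # standardized covariates
N <- nrow(stackloss); p <- 3
summary(lm(y ~ z))                                       # classical least squares fit
data.sl <- list(y = y, z = as.matrix(z), N = N, p = p)   # data list for WinBUGS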
Dr. Pablo E. Verde 209 Lecture 8 - Regression Models

As an alternative model we replace the Normal likelihood by a t-distribution with ν = 4 degrees of freedom. The idea is to have a model that is more robust against outliers in the y's.

Model 2:

yi ∼ t(µi, τ, ν), τ = 1/σ², i = 1, . . . , 21
µi = β0 + β1 z1,i + β2 z2,i + β3 z3,i
σ ∼ Uniform(0.01, 100)
βk ∼ Normal(0, 0.001), k = 0, . . . , 3.

Dr. Pablo E. Verde 210 Lecture 8 - Regression Models

In the WinBUGS code we modify the model section by commenting out the normal model and adding one line with the model based on the t-distribution, and we run the MCMC again.

#y[i] ~ dnorm(mu[i], tau)
y[i] ~ dt(mu[i], tau, 4)

#DIC Normal
         Dbar    Dhat    pD     DIC
Y     110.537 105.800 4.736 115.273
total 110.537 105.800 4.736 115.273

#DIC t
         Dbar    Dhat    pD     DIC
Y     109.043 104.103 4.940 113.983
total 109.043 104.103 4.940 113.983

A very modest difference between the two models.

Dr. Pablo E. Verde 211 Lecture 8 - Regression Models

Variable Selection

We modify the structure of the distribution of the regression coefficients by adding a common distribution with unknown variance. This is called a ridge regression model, where the βs are assumed exchangeable.

Model 3 specification:

yi ∼ Normal(µi, τ), τ = 1/σ², i = 1, . . . , 21
µi = β0 + β1 z1,i + β2 z2,i + β3 z3,i
σ ∼ Uniform(0.01, 100)
βk ∼ Normal(0, φ), φ = 1/σβ², k = 0, . . . , 3
σβ ∼ Uniform(0.01, 100).

Dr. Pablo E. Verde 212 Lecture 8 - Regression Models

In WinBUGS:

for (j in 1:p) {
  # beta[j] ~ dnorm(0, 0.001)   # coefs independent
  beta[j] ~ dnorm(0, phi)       # coefs exchangeable (ridge regression)
}
phi <- 1/(sigmaB*sigmaB)
sigmaB ~ dunif(0.01, 100)       # Gelman's prior

Dr. Pablo E. Verde 213 Lecture 8 - Regression Models

Dr. Pablo E. Verde 214 Lecture 8 - Regression Models

We see that β3 is not a relevant variable in the model.
Now we try another variable selection procedure, by modifying the distribution of the regression coefficients.

Model 5 specification:

yi ∼ Normal(µi, σ²), i = 1, . . . , 21
µi = β0 + β1 π1 z1,i + β2 π2 z2,i + β3 π3 z3,i
σ ∼ Uniform(0.01, 100)
βk ∼ Normal(0, 100), k = 0, . . . , 3
π1 ∼ Bernoulli(0.5), π2 ∼ Bernoulli(0.5), π3 ∼ Bernoulli(0.5).

Dr. Pablo E. Verde 215 Lecture 8 - Regression Models

In WinBUGS we change the equation of µi:

# mean equation model
mu[i] <- beta0 + beta[1]*pi[1]*z[i,1] + beta[2]*pi[2]*z[i,2] + beta[3]*pi[3]*z[i,3]

# priors
for (j in 1:p) {
  beta[j] ~ dnorm(0, 0.001)
  pi[j] ~ dbern(0.5)
}

We run the model for 40,000 iterations and discard the first 20,000. The results are:

pi[1] 0.9987 0.0367   <- X1 very important
pi[2] 0.8361 0.3702   <- X2 not so clear-cut
pi[3] 0.0425 0.2018   <- X3 definitely not important

Dr. Pablo E. Verde 216 Lecture 8 - Regression Models

Dr. Pablo E. Verde 217 Lecture 8 - Regression Models

ANOVA and Experimental Design

Example: Speed of light measurements by Michelson

This is a classical data set available in R and corresponds to measurements of the speed of light in air, made between 5th June and 2nd July, 1879.
The data consist of five experiments, each consisting of 20 consecutive runs.
The response is the speed of light in km/s, less 299000. The currently accepted value, on this scale of measurement, is 734.5.

We start by comparing the 5 experiments with boxplots:

> library(MASS)
> data(michelson)
> attach(michelson)
> plot(Expt, Speed, main="Speed of Light Data", xlab="Experiment No.", ylab="Speed")

Dr. Pablo E. Verde 218 Lecture 8 - Regression Models

[Figure: boxplots of the speed of light measurements for the five experiments.]

Dr. Pablo E. Verde 219 Lecture 8 - Regression Models

ANOVA model: Bayesian analysis in WinBUGS

The Bayesian approach to the ANOVA model is similar to the regression model, with the mean equation following the group means parametrization. So we have nj observations in J groups, with n = Σ(j=1..J) nj.
Model 1:

yi,j ∼ Normal(µj, τ), τ = 1/σ², i = 1, . . . , nj,
µj = µ + αj, j = 1, . . . , J,
αj ∼ Normal(0, 0.001), j = 2, . . . , J,
µ ∼ Normal(0, 0.001),
σ ∼ Uniform(0.01, 100).

Dr. Pablo E. Verde 220 Lecture 8 - Regression Models

The WinBUGS implementation of Model 1 is:

model{
  for(i in 1:n){
    y[i] ~ dnorm(mu[i], tau)
    mu[i] <- mu0 + alpha[a[i]]
  }
  # Corner constraint parametrization
  # alpha[1] <- 0
  # Sum-to-zero parametrization
  alpha[1] <- -sum(alpha[2:J])
  # group means:
  for(i in 1:J){ m.g[i] <- mu0 + alpha[i] }

Dr. Pablo E. Verde 221 Lecture 8 - Regression Models

  # mean differences:
  d[1] <- (m.g[2] - m.g[1])
  d[2] <- (m.g[3] - m.g[1])
  ...
  # Priors
  mu0 ~ dnorm(0, 0.0001)
  for(j in 2:J){ alpha[j] ~ dnorm(0, 0.0001) }
  tau <- 1/(sigma*sigma)
  sigma ~ dunif(0.01, 100)   # Gelman's prior
}

Dr. Pablo E. Verde 222 Lecture 8 - Regression Models

One serious problem of ANOVA modeling is the lack of variance homogeneity between groups. We can extend our Bayesian setup to include this data feature as follows:

Model 2:

yi,j ∼ Normal(µj, τj), τj = 1/σj², i = 1, . . . , nj,
µj = µ + αj, j = 1, . . . , J,
αj ∼ Normal(0, 0.001), j = 2, . . . , J,
µ ∼ Normal(0, 0.001),
σj ∼ Uniform(0.01, 100), j = 1, . . . , J.

This model includes a structural dispersion sub-model for each treatment group.

Dr. Pablo E. Verde 223 Lecture 8 - Regression Models

The WinBUGS implementation of Model 2 changes to:

{
  for(i in 1:n){
    #y[i] ~ dnorm(mu[i], tau)
    y[i] ~ dnorm(mu[i], tau[i])
    mu[i] <- mu0 + alpha[a[i]]
    tau[i] <- gamma[a[i]]
  }
  # Priors for the group dispersion model
  for(j in 1:J){
    gamma[j] ~ dgamma(0.001, 0.001)
    sigma2[j] <- 1/gamma[j]
    sigma[j] <- pow(sigma2[j], 0.5)
  }

Dr. Pablo E. Verde 224 Lecture 8 - Regression Models

For models with more than 3 groups the estimation based on simple means can be improved by adding an exchangeability structure to the αj's (Stein 1956, Lindley 1962, Casella and Berger, p. 574). This assumes a sub-model for the αj's in the following way:

Model 3:

yi,j ∼ Normal(µj, τj), τj = 1/σj², i = 1, . . . , nj,
µj = µ + αj, j = 1, . . . , J,
αj ∼ Normal(0, φ), φ = 1/σα², j = 2, . . . , J,
µ ∼ Normal(0, 0.001),
σα ∼ Uniform(0.01, 100),
σj ∼ Uniform(0.01, 100), j = 1, . . . , J.

Dr. Pablo E. Verde 225 Lecture 8 - Regression Models

The WinBUGS implementation of Model 3 changes to:

# Priors
mu0 ~ dnorm(0, 0.0001)
for(j in 2:J){ alpha[j] ~ dnorm(0, xi) }
xi <- 1/(sigma.alpha*sigma.alpha)
sigma.alpha ~ dunif(0.01, 100)   # Gelman's prior

Dr. Pablo E. Verde 226 Lecture 8 - Regression Models

One alternative to exchangeability between the αj's is to use a mixture model to investigate the structure of the data, for example:

Model 4:

yi,j ∼ Normal(µj, τj), τj = 1/σj², i = 1, . . . , nj,
µj = µ + αj πj, j = 1, . . . , J,
αj ∼ Normal(0, 0.001), j = 2, . . . , J,
µ ∼ Normal(0, 0.001),
πj ∼ Bernoulli(0.5), j = 2, . . . , J,
σj ∼ Uniform(0.01, 100), j = 1, . . . , J.

Dr. Pablo E. Verde 227 Lecture 8 - Regression Models

The WinBUGS implementation of Model 4 changes to:

# group means mixture model:
for(i in 1:J){
  m.g[i] <- mu0 + alpha[i]*pi[i]
  pi[i] ~ dbern(0.5)
}

The next slides present some graphical results of the main features of each model. The full WinBUGS script for this analysis is anova.odc; this includes some other calculations for residual analysis as well.
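The michelson data can be passed to these ANOVA models from R in the usual R2WinBUGS way. A sketch, assuming the model code has been saved as a plain-text file, here called anova-model.txt (a hypothetical name; anova.odc above is the full WinBUGS document), that bugsdir is set as in Lecture 5, and with an illustrative choice of monitored nodes:

# Sketch: preparing the Michelson data and running the ANOVA model via R2WinBUGS
library(MASS); library(R2WinBUGS)
data(michelson)
y <- michelson$Speed
a <- as.numeric(michelson$Expt)       # group indicator 1, ..., 5
n <- length(y); J <- nlevels(michelson$Expt)
data.an <- list(y = y, a = a, n = n, J = J)
par.an  <- c("m.g", "d", "sigma")
m.anova <- bugs(data.an, inits = NULL, par.an, "anova-model.txt",
                n.chains = 1, n.iter = 20000,
                bugs.directory = bugsdir, working.directory = getwd())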
ANOVA: classical model

[Figure: posterior distributions of the mean differences between all groups, µk − µl (k ≠ l), Model 1.]

ANOVA: variance heterogeneity model

[Figure: posterior distributions of the σj, Model 2.]

ANOVA: variance heterogeneity model

[Figure: posterior distributions of the mean differences between all groups, µk − µl (k ≠ l), Model 2.]

ANOVA: exchangeability model

[Figure: posterior distributions of the mean differences between all groups under exchangeability, Model 3.]

ANOVA: mixture models

[Figure: posterior distributions of the group means under the mixture distribution, Model 4.]

Logistic regression: combining multiple sources of information in risk analysis

Example: The Challenger O-Ring Data: A Bayesian Risk Analysis

Historical background

The Space Shuttle Challenger's final mission was on January 28th 1986, on an unusually cold morning (31 F / -0.5 C).
It disintegrated 73 seconds into its flight after an O-ring seal in its right solid rocket booster failed.
The uncontrolled flame caused structural failure of the external tank, and the resulting aerodynamic forces broke up the orbiter.
All seven members of the crew were killed.

Technical background

[Figure slide.]

The Challenger's O-Ring Thermal-Distress Data

Example: Regression modelling with a change point

On the night of January 27, 1986, the night before the space shuttle Challenger accident, there was a three-hour teleconference among people at Morton Thiokol (manufacturer of the solid rocket motor), Marshall Space Center and Kennedy Space Center. The discussion focused on the forecast of a 31 F temperature at launch time the next morning, and on the effect of low temperature on O-ring performance.

The available data from previous shuttle flights consist of the occurrence of thermal distress (yes, no) and the temperature at launch time. Let's take a look at these data and fit a logistic regression in R:

> # distress: yes = 1, no = 0
> y = c(1,1,1, 1,0,0,0, 0,0,0,0, 0,1,1,0, 0,0,1,0, 0,0,0,0)
> # temperature at launch time
> x = c(53,57,58, 63,66, 67,67, 67,68, 69,70, 70,70, 70,72, 73,75, 75,76, 76,78, 79,81)
> # number of previous flights
> N = 23
> summary(f.b1 <- glm(y ~ x, family=binomial))
...
AIC: 24.315
> summary(f.b2 <- update(f.b1, family=binomial(link=cloglog)))
...
AIC: 23.531

[Figure: probability of at least one O-ring incident vs. temperature in F, with the observed data.]

These data have been analyzed several times in the literature. I found that fitting a change point model improves on previously published analyses. The model is:

Pr(yi = 1) = logit⁻¹(β0 + β1 zi), where zi = 1 if xi ≥ K and 0 otherwise.

In this model K is unknown and represents the temperature above which the probability of distress is lowest. The parameter β1 represents the reduction of risk at temperatures greater than K.
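As a quick plausibility check before the full Bayesian fit (this sketch is not part of the original slides), one can profile K in R by refitting the logistic regression over a grid of candidate change points and comparing deviances:

> # Profile the change point K over a grid (hypothetical helper analysis)
> y  <- c(1,1,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0)
> x  <- c(53,57,58,63,66,67,67,67,68,69,70,70,70,70,72,73,75,75,76,76,78,79,81)
> Ks <- 60:75
> dev <- sapply(Ks, function(K) deviance(glm(y ~ I(x >= K), family = binomial)))
> Ks[which.min(dev)]   # candidate K with the smallest deviance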
This model is implemented in WinBUGS as follows:

model {
  for(i in 1 : N) {
    y[i] ~ dbern(p[i])
    logit(p[i]) <- beta0 + beta1 * z[i]
  }
  for(i in 1:N){ z[i] <- step(x[i] - K) }   # step = 0 below K, 1 from K upwards
  # priors
  K ~ dunif(53, 81)
  beta0 ~ dnorm(0.0, 0.01)
  beta1 ~ dnorm(0.0, 0.01)
}

We run the model with R2WinBUGS:

> data.b <- c("x", "y", "N")
> par.b  <- c("beta0", "beta1", "K")
>
> chall.1 <- bugs(data.b, inits=NULL, par.b, "challenger-1.txt",
+                n.chains=1, n.iter=20000, n.thin=1,
+                bugs.directory = bugsdir,
+                working.directory = getwd(), clearWD=TRUE, debug=TRUE)
> print(chall.1)

           mean    sd   2.5%    25%    50%    75%   97.5%
beta0       6.2   3.9   -0.2    3.3    5.8    8.4    15.3
beta1      -8.0   3.9  -17.2  -10.2   -7.6   -5.1    -2.1
K          63.8   2.9   58.2   63.1   64.1   65.1    67.0
deviance   19.1   2.8   16.6   16.9   18.0   20.4    26.9

pD = 2.4 and DIC = 21.5

[Figure: posterior histogram of K; fitted probability of at least one O-ring incident vs. temperature in F.]

Example: Probabilistic Risk Assessment

[Figure slide.]

A catastrophic failure of a field joint is expected when all four events occur during the operation of the solid rocket boosters:

pfield = pa × pb × pc × pd

Since there are 6 field joints per launch, and assuming that all 6 joint failures are independent, the probability of at least one failure is

Risk = 1 − (1 − pfield)⁶

pa is calculated from the logistic regression model as

pa = Pr(O-Ring thermal distress at 31 F).

The available data for the other events were very sparse:

Event b: 2 in 7 flights
Event c: 1 in 2 flights
Event d: 0 in 1 flight

We model the probabilities of events b, c and d with a beta-binomial model with uninformative priors Beta(0.5, 0.5). (A small numerical sketch of this risk calculation is given at the end of this example.)

In WinBUGS we modify the model file "challenger-1.txt" by adding these risk calculations. We also add the calculation of the risk at temperature K or greater. To run the model in R we add the data for events b, c, and d:

# Risk analysis
y.b <- 2; y.c <- 1; y.d <- 0
data.r <- c("x", "y", "N", "y.b", "y.c", "y.d")
par.r  <- c("beta0", "beta1", "K", "p.cat", "p.catK")
chall.2 <- bugs(data.r, inits=NULL, par.r, "challenger-1.txt",
                n.chains = 1, ...

> print(chall.2, 3)
           mean     sd    2.5%     25%     50%     75%   97.5%
K        63.726  2.653  58.190  63.090  64.120  65.120  66.010
p.cat     0.173  0.204   0.000   0.019   0.088   0.258   0.736
p.catK    0.038  0.061   0.000   0.003   0.014   0.047   0.213

Summary results of our analysis:

The risk of a catastrophic failure at a launch temperature of 31 F is 17%.
The risk of a catastrophic failure at temperatures greater than 63 F is 3.8%.
The odds ratio of the risk between these two temperatures is 9.8.

One interesting historical note: the Rogers Commission, appointed by President Reagan to investigate the causes of the accident, pointed out:

... a mistake in the analysis of the thermal distress data was that the flights with zero incidents were left off because it was felt that these flights did not contribute any information about the temperature effect. (p. 145)
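The following R lines sketch that risk calculation numerically. They are not from the original slides: p.a below is only a placeholder distribution standing in for the MCMC draws of Pr(thermal distress at 31 F) from the change-point model, while the Beta posteriors for events b, c and d follow directly from the Beta(0.5, 0.5) priors and the sparse counts given above.

> # Monte Carlo sketch of the field-joint risk calculation (placeholder p.a draws)
> set.seed(1)
> M   <- 10000
> p.a <- rbeta(M, 2, 8)               # placeholder for posterior draws of p_a
> p.b <- rbeta(M, 0.5 + 2, 0.5 + 5)   # event b: 2 in 7 flights, Beta(0.5, 0.5) prior
> p.c <- rbeta(M, 0.5 + 1, 0.5 + 1)   # event c: 1 in 2 flights
> p.d <- rbeta(M, 0.5 + 0, 0.5 + 1)   # event d: 0 in 1 flight
> p.field <- p.a * p.b * p.c * p.d    # all four events occur at one field joint
> risk    <- 1 - (1 - p.field)^6      # at least one of the 6 field joints fails
> mean(risk); quantile(risk, c(0.025, 0.975))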
Practical Exercise: a nonconjugate nonlinear model

Volume 2 of the WinBUGS help: Dugongs

Originally, Carlin and Gelfand (1991) consider data on length yi and age xi measurements for 27 dugongs (sea cows) and use the following nonlinear growth curve, which has no inflection point and an asymptote as xi tends to infinity:

yi ∼ Normal(µi, σ²)
µi = α − β γ^xi,

where α, β > 0 and γ ∈ (0, 1).

Practical

Open the file in WinBUGS and run the model.
Change the last observation 2.57 to 2.17 and run the model again. Do you see different results?
Modify the WinBUGS code as:

# y[i] ~ dnorm(mu[i], tau)
y[i] ~ dt(mu[i], tau, df)   # fit a robust t distribution
...
df <- 5

Run the model with this change and compare the results.

Lecture 9: Introduction to Hierarchical Models

Learning from the information of the others ...

During the previous lectures we have seen several examples where we performed a single statistical analysis. Now suppose that, instead of a single analysis, you have several of them, each giving results independently of the others.

One might ask: should we combine these independent results in a single global analysis?

The general answer is: YES!!

Hierarchical models give us the statistical framework to combine multiple sources of information in a single analysis. Informally, by combining several individual results, each individual analysis is improved by the information we have from the others. This information is summarized in an empirical prior distribution.

From the classical point of view, Charles Stein demonstrated in the 1950s a fundamental theoretical result which shows the benefit of combining individual results in a single analysis.

Charles Stein (born 22nd of March 1920) ... and he is still working every day!

Hierarchical Models

It has become common practice to construct statistical models which reflect the underlying complexity of the problem under study: different patterns of heterogeneity, dependence, mis-measurement, missing data, etc.

Statistical models which reflect the complexity of the data involve multiple parameters. Examples are:

"Study effects" in meta-analysis
"Subject effects" in growth curve models
"Identification of hidden processes" in a sequence of observations
"Relative risks of disease outcomes" in different areas/time periods
"Frailty effects" in correlated survival data
...

Statistical Inference for Multiple Parameters

How do we make inference on multiple parameters {θ1, . . . , θN} measured in N units (persons, centers, areas, ...) which are related or connected by the structure of the problem?

We can identify three different scenarios (a small numerical illustration follows this list):

1. Identical parameters: all the θ's are identical, i.e. all the data can be pooled and the individual units ignored.
2. Independent parameters: all the θ's are entirely unrelated, i.e. the results from each unit can be analyzed independently.
3. Exchangeable parameters: the θ's are assumed to be 'similar' in the sense that the "labels" convey no information.
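The toy R lines below (hypothetical numbers, not from the slides) illustrate scenarios 1 and 2 for binomial data from a few units; scenario 3 is the hierarchical model developed in the rest of this lecture:

> # Toy illustration: complete pooling vs. no pooling (hypothetical counts)
> r <- c(2, 7, 3, 5)        # successes in each of 4 units
> n <- c(10, 20, 15, 12)    # sample sizes
> sum(r) / sum(n)           # scenario 1: identical parameters (pool everything)
> round(r / n, 2)           # scenario 2: independent parameters (analyze separately)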
Exchangeability

The assumption of exchangeable units is mathematically equivalent to assuming that the θ's are drawn at random from some population distribution.

"Exchangeability" expresses formally the idea that we find no systematic reason to distinguish the individual random variables X1, . . . , Xn. We say that X1, . . . , Xn are exchangeable if the probability that we assign to any set of potential outcomes, p(x1, . . . , xn), is unaffected by permutations of the labels attached to the variables.

Example: Suppose that X1, X2, X3 are the outcomes of three patients enrolled in a clinical trial, where Xi = 1 indicates a positive reaction to the treatment and Xi = 0 no reaction. We may judge

p(X1 = 1, X2 = 0, X3 = 1) = p(X2 = 1, X1 = 0, X3 = 1) = p(X1 = 1, X3 = 0, X2 = 1)

i.e. the probability of getting 2 positive outcomes is NOT affected by which particular patients the positive outcomes come from.

Note that this is a very strong assumption. In reality we may expect patients to behave in different ways; for example, some may fail to comply with the treatment.

Note that "exchangeability" does not mean we believe that X1, X2, . . . , Xn are independent!

Exchangeability and Hierarchical Models

Suppose yij is the outcome for individual j in unit i, with unit-specific parameter θi.

The assumption of partial exchangeability of individuals within units is represented by

yij ∼ p(yij | θi)
θi ∼ p(θi | φ)

The assumption of exchangeability of units can be represented by

θi ∼ p(θi | φ)
φ ∼ p(φ)

where p(φ) can be considered as a common prior for all units, but with unknown parameters.

Assuming that θ1, . . . , θN are drawn from some common prior distribution whose parameters are unknown is known as a hierarchical or multilevel model.

Bayesian statistical inference is based on

p(θ1, . . . , θN, φ | y) ∝ p(φ) p(θ1, . . . , θN | φ) p(y | θ1, . . . , θN, φ)

The dimension of (θ1, . . . , θN) can be very large in practice. Empirical Bayes techniques omit p(φ); they are useful in practice, but they may bias variability estimates by reusing y to estimate φ.

Exchangeability: some further comments

Note that there does not need to be any actual sampling; perhaps these N units are the only ones that exist. This is very common in meta-analysis.
The probability structure is a consequence of the belief in exchangeability rather than a physical randomization mechanism.
We emphasize that an assumption of exchangeability is a judgement based on our knowledge of the context.

Hierarchical models and shrinkage

Suppose that in each unit we observe a response yi assumed to have a Normal likelihood

yi ∼ N(θi, s²i)

The unit means θi are assumed to be exchangeable and to have a Normal distribution

θi ∼ N(µ, σ²)

where µ and σ² are "hyper-parameters", for the moment assumed known.

After observing yi, Bayes theorem gives

θi | yi ∼ N(wi µ + (1 − wi) yi, (1 − wi) s²i)

where wi = s²i / (s²i + σ²) is the weight given to the prior mean.
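A small numerical check of this shrinkage formula (the numbers are hypothetical, chosen only for illustration):

> # Shrinkage estimates for three units with known hyper-parameters
> y      <- c(0.5, 1.8, -0.3)     # observed unit means y_i
> s2     <- c(0.40, 0.10, 0.90)   # within-unit variances s_i^2
> mu     <- 1.0                   # prior (population) mean
> sigma2 <- 0.25                  # between-unit variance
> w      <- s2 / (s2 + sigma2)    # weight on the prior mean
> post.mean <- w * mu + (1 - w) * y
> post.var  <- (1 - w) * s2
> round(cbind(y, w, post.mean, post.var), 3)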
An exchangeable model therefore leads to inferences for each unit that have narrower intervals than if the units were assumed independent, but are shrunk towards the prior mean response.

wi controls the "shrinkage" of the estimate towards µ, and the reduction in the width of the interval for θi.
The shrinkage wi depends on the precision of the individual unit i relative to the variability between units.

Example: Toxoplasmosis data

These data present the relation between rainfall and the proportion of people with toxoplasmosis for 34 cities in El Salvador (Efron, 1986).
The question is how to combine the results of 34 different cities in a single model.
The data have been used to illustrate a non-linear relationship between the amount of rain and toxoplasmosis prevalence.
We illustrate a series of hierarchical models; first, some preliminary visualization.

[Figure: "Toxoplasmosis Rates: El Salvador, 34 cities", observed rate for each city.]

[Figure: toxoplasmosis rate vs. amount of rain (mm), and toxoplasmosis rate vs. sample size, for the 34 cities.]

Pooled approach: To start, we ignore the possible (complex) relation between rainfall and proportion and fit the same beta-binomial model to all the cities:

∏_{i=1}^{34} p(ri | π, ni) = ∏_{i=1}^{34} Binomial(ni, π)

p(π) = Beta(a, b)

p(π | r, n) ∝ [ ∏_{i=1}^{34} p(ri | π, ni) ] p(π) = Beta( a + Σi ri , b + Σi (ni − ri) )

(A short R sketch of this conjugate calculation is given just before the WinBUGS code below.)

Some comments:

Is it reasonable to assume a common probability π of toxoplasmosis for each city?
The beta-binomial model assumes that each outcome is independent and identically distributed according to the binomial probability distribution with parameter π.
Does this model adequately describe the random variation in outcomes for each city?
Are the city rates more variable than our model assumes?

Question: How do we model the excess of variation in the data?

Modeling the excess of variation

Model 1: Combining information with a Generalized Linear Mixed Model (GLMM)

ri ∼ Binomial(ni, πi)
logit(πi) ∼ N(µ, σ²)
µ ∼ N(0, 100)
σ ∼ Uniform(0.01, 10)

This model combines information between cities in a single model.

Model 2: Modeling each city rate independently

ri ∼ Binomial(ni, πi)
logit(πi) ∼ N(0, 100).

This model does NOT use information between cities. Each rate πi is estimated independently.
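As referenced above, the pooled posterior is available in closed form. A minimal R sketch follows; the Beta(1, 1) prior is an assumption made only for illustration, and the r and n vectors are those used in the WinBUGS data list below.

> # Conjugate pooled beta-binomial posterior for the common rate pi
> r <- c(2,3,1,3,2,3,2,7,3,8,7,0,15,4,0,6,0,33,4,5,2,0,8,41,24,7,46,9,23,53,8,3,1,23)
> n <- c(4,20,5,10,2,5,8,19,6,10,24,1,30,22,1,11,1,54,9,18,12,1,11,77,51,16,82,13,43,75,13,10,6,37)
> a <- 1; b <- 1                                 # assumed Beta(1, 1) prior
> post.a <- a + sum(r); post.b <- b + sum(n - r)
> qbeta(c(0.025, 0.5, 0.975), post.a, post.b)    # posterior quantiles for pi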
In WinBUGS we run both models simultaneously as follows:

model{
  for( i in 1 : I ) {
    r[i] ~ dbin(p[i,1], n[i])
    logit(p[i,1]) <- a[i]
    a[i] ~ dnorm(alpha1, tau1)     # Exchangeable

    r2[i] ~ dbin(p[i,2], n[i])     # Copy of r[i]
    logit(p[i,2]) <- b[i]
    b[i] ~ dnorm(0, 0.01)          # Independent
  }
  tau1 <- 1/(sigma1*sigma1)
  # Priors
  sigma1 ~ dunif(0.01, 5)
  alpha1 ~ dnorm(0, 0.01)
}

# data
list(I=34,
     r  = c(2,3,1,3,2,3,2,7,3,8,7,0,15,4,0,6,0,33,4,5,2,0,8,41,24,7,46,9,23,53,8,3,1,23),
     r2 = c(2,3,1,3,2,3,2,7,3,8,7,0,15,4,0,6,0,33,4,5,2,0,8,41,24,7,46,9,23,53,8,3,1,23),
     n  = c(4,20,5,10,2,5,8,19,6,10,24,1,30,22,1,11,1,54,9,18,12,1,11,77,51,16,82,13,43,75,13,10,6,37))

[Figure: "Toxoplasmosis Rates: El Salvador, 34 cities", exchangeable and independent estimates for each city.]

[Figure: shrinkage effect, ML vs. Bayes estimates of the city rates.]

Results:

DIC for Model 1 (GLMM) = 145.670
DIC for Model 2 (independent) = 176.504

Clearly Model 1, which combines information between cities, is the winner.
The use of Model 1 has the effect of reducing the variability between cities.
The use of Model 1 gives better results at the level of the individual city.

Clear conclusion: we should take advantage of modeling multiple results simultaneously!

Example: Toxoplasmosis data, continued ...

Two important questions are:

What is the influence of the amount of rain on toxoplasmosis prevalence?
What is the influence of the number of participants in each town?

Model 3:

ri ∼ Binomial(ni, πi)
logit(πi) = ai + β1 × (xi,1 − x̄1) + β2 × (xi,2 − x̄2)
ai ∼ N(α, σ²)
σ ∼ Uniform(0.01, 10)
α, β1, β2 ∼ N(0, 100)

In WinBUGS:

model {
  for(i in 1:N) { log.n[i] <- log(n[i]) }
  for( i in 1 : N ) {
    r[i] ~ dbin(p[i], n[i])
    a[i] ~ dnorm(alpha, tau)
    logit(p[i]) <- a[i] + beta1 * (rain[i] - mean(rain[]))
                        + beta2 * (log.n[i] - mean(log.n[]))
  }
  # Priors
  tau <- 1/(sigma*sigma)
  sigma ~ dunif(0.01, 5)
  alpha ~ dnorm(0, 0.01)
  beta1 ~ dnorm(0, 0.01)
  beta2 ~ dnorm(0, 0.01)
}

[Figure: posterior densities of β1 and β2.]

The posterior of β1 clearly indicates no linear influence of the amount of rain. The posterior of β2 shows a clear trend.

Example: Ranking of the eighteen baseball players (Efron and Morris, 1977)

How can we rank the players' ability? Who was the best player of the 1970 season?
We use the ranking function in WinBUGS to answer this question.

Name        hits/AB   Observed avg (ML)   "TRUTH"   James-Stein
Clemente     18/45         .400             .346        .290
Robinson     17/45         .378             .298        .286
...            ...          ...              ...         ...
Total squared error         .077                         .022

[Figure: ML vs. Bayes batting averages.]
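The posterior ranks shown on the next slides are computed inside WinBUGS with the rank(.) function; the same quantity can be obtained in R from a matrix of MCMC draws, as in this sketch (the draws here are simulated placeholders, not the actual model output):

> # Posterior ranks from MCMC draws (placeholder draws for illustration)
> set.seed(42)
> draws <- matrix(rbeta(1000 * 18, 50, 150), ncol = 18)   # iterations x players
> ranks <- t(apply(draws, 1, rank))    # rank of each player at every iteration
> round(colMeans(ranks), 1)            # posterior mean rank per player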
Posteriors: Prob(Hits)

[Figure: posterior distributions of the hitting probability for the eighteen players (Clemente, Robinson, Howard, Johnstone, Berry, Spencer, Kessinger, Alvarado, Santo, Swoboda, Unser, Williams, Scott, Petrocelli, Rodriguez, Campaneris, Munson, Alvis).]

model {
  for( i in 1 : N ) {
    r[i] ~ dbin(p[i], n[i])
    logit(p[i]) <- b[i]
    b[i] ~ dnorm(mu, tau)
    p.rank[i] <- rank(p[], i)
  }
  # hyper-priors
  mu ~ dnorm(0.0, 1.0E-6)
  sigma ~ dunif(0, 100)
  tau <- 1/(sigma*sigma)
}

# Data
list(r = c(145, 144, 160, 76, 128, 140, 167, 41, 148, 57, 83, 79, 142, 152, 52, 168, 137, 21),
     n = c(412, 471, 566, 320, 463, 511, 631, 183, 555, 245, 322, 315, 480, 583, 231, 603, 453, 115),
     N = 18)

[Figure: rank distributions for Robinson, Howard and Clemente.]

[Figure: left panel, posterior distributions of the ranks; right panel, posterior distributions of the performances.]

Summary on hierarchical models

Hierarchical models allow us to "borrow strength" across units.
The posterior distribution of θi for each unit borrows information from the likelihood contributions of all the other units; estimation is more efficient.
MCMC allows considerable flexibility in the choice of the random-effects distribution (not restricted to normal random errors).
MCMC makes it possible to answer difficult questions, e.g. ranking estimation of random effects.
It is easy to extend to more complicated models (e.g. non-linear repeated measurements, etc.).

Summary of the course

Bayesian statistics: a new interpretation of probability as a subjective mental construct.
Relationship with classical statistics.
Uses of the prior distribution: sensitivity analysis, prior predictions, model building, etc.
Making inference using MCMC.
Thinking differently about regression models.
Different modeling tools: DAGs, DIC and WinBUGS.
Introduction to hierarchical modeling.

Finally

Nice to see that most people attended and survived the course!!
Hope that everybody will now be an enthusiastic modern Bayesian ;-)
Thank you very much ... Muchas gracias !!!