5 Bayesian Statistics
5.1 Overview
Today we will:
- Introduce Bayesian concepts in the context of the binomial proportion
Important concepts from Bayesian statistics are:
- The parameter is considered random.
- The prior is used to incorporate prior information into the result.
- The posterior is used to interpret the results after observing the data.
- Interpretation of Bayesian results is about where the parameter is likely to be.
5.2 Beginning of Bayesian Inference
- Probabilities in frequentist and evidentialist philosophies represent what would happen upon repeated samplings of the data
- In Bayesian statistics the probability represents a subjective belief about the unknown world
- In all philosophies the state of the world is expressed in the form of parameters.
- Bayesian statistics is about updating our belief about the world using Bayes theorem
- If \(H\) is my current hypothesis about the state of the world, expressed in the form of parameters, and \(X\) is the observed data, then \[ \mathbb{P}(H \mid X) = \frac{\mathbb{P}(X \mid H) \mathbb{P}(H)}{\mathbb{P}(X)} \]
- In Bayesian statistics, the state of the world is treated as a random variable, thus we have \(P(H)\)
- \(P(H)\) is subjective
- \(\mathbb{P}(X) = \sum_{h} \mathbb{P}(X \mid H=h) \mathbb{P}(H=h)\), in practice it is usually an integral.
Definitions
- Prior \(\mathbb{P}(H)\)
- Likelihood \(\mathbb{P}(X \mid H)\)
- Posterior \(\mathbb{P}(H \mid X)\)
5.2.1 COVID-19 dog example
- Scent dog detection example Jendrny et al, 2020
- These are the estimates of accuracy for the dogs identifying nCov presence
| | Positive Test | Negative Test |
|---|---|---|
| nCov Absent | 4% | 96% |
| nCov Present | 83% | 17% |
- Let’s say a dog barks at an airport visitor. Bayesian statistics tries to understand the probability that the person is actually nCov positive using Bayes theorem \[ \mathbb{P}(\text{nCov} | + ) = \frac{\mathbb{P}(+ \mid \text{nCov} ) \mathbb{P}(\text{nCov}) }{\mathbb{P}(+ \mid \text{nCov} ) \mathbb{P}(\text{nCov}) + \mathbb{P}(+ \mid \text{no nCov} ) \mathbb{P}(\text{no nCov})} \]
- The dog indicating positive is the data point; whether the person actually has nCov is the parameter.
- The probability someone comes into the airport with nCov is unknown.
- We can specify this probability from prior data.
- The number of current cases is 9,165,744 (Worldometer).
- The US population is approximately 333,000,000 (US population clock).
- Thus we can approximate \(\mathbb{P}(\text{nCov}) = 9,165,744/333,000,000 = 0.0275\).
- We can then compute the probability that someone is sick, given that the dog barked (the R snippet below repeats this calculation) \[ \mathbb{P}(\text{nCov} | + ) = \frac{0.83 \times 0.0275 }{0.83 \times 0.0275 + 0.04 \times (1-0.0275)} \approx 0.37 \]
- There’s less than 50% chance the person actually has nCov given the test is positive.
- This is because the false positive rate is high enough and COVID is rare enough that the majority of people who test positive are false positives.
- Obviously, a decision still needs to be made about what to do with the person.
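A quick numeric check of the calculation above (a minimal sketch; the prevalence value is the rough approximation from the bullets, so treat the numbers as illustrative):
Code
# Bayes theorem for the dog detection example (values from the table above)
sens  = 0.83                    # P(+ | nCov)
fpr   = 0.04                    # P(+ | no nCov)
prior = 9165744 / 333000000     # approximate P(nCov)
post  = sens * prior / (sens * prior + fpr * (1 - prior))
post                            # approximately 0.37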
Fundamental difference in interpretation:
- Parameter is “random.”
- Probability represents our belief about the unknown value.
5.2.2 Bayesian analysis overview
- Decide a reasonable model for the data.
- Choose a reasonable prior for the parameters in the model.
- Use Bayes theorem to compute the posterior, \(P(H \mid X)\).
- Do stuff with \(P(H \mid X)\):
- Create intervals.
- Compute Bayes Factors.
- Compute point estimates (most likely values for unknown parameters).
5.3 Binomial proportion
5.3.1 Examples in the NSDUH data set
- Let’s consider binomial data from the NSDUH data set again.
- psilcy: "yes"/"no" for having ever taken psilocybin, a psychedelic (and recreational) drug recently used in clinical trials for psychiatric treatment.
- psyanyflag: "yes"/"no" for having ever taken a psychiatric medication.
- What is a reasonable model for these? Bernoulli
- What is a reasonable prior for the parameter? Flat? Skewed left? Skewed right? Centered? – Will depend on the drug type.
5.3.2 Flat prior for binomial proportions
- Binomial proportion with flat prior \[ \begin{align*} \mathbb{P}(H) & = \mathbb{P}(\pi) = 1, \text{ for $\pi \in [0,1]$} \\ \mathbb{P}(X \mid H) & = \pi^{\sum_{i=1} x_i}(1-\pi)^{n-\sum_{i=1} x_i} \\ \mathbb{P}(X) & = \int_0^1 \pi^{\sum_{i=1} x_i}(1-\pi)^{n-\sum_{i=1} x_i} d\pi = B\left(\sum_{i=1} x_i+1, n - \sum_{i=1} x_i+1\right) \end{align*} \]
We can then compute the distribution of the parameter conditional on the data \[ \mathbb{P}(\pi \mid X) = \frac{\mathbb{P}(X \mid \pi) \times \mathbb{P}(\pi)}{\mathbb{P}(X)} = \frac{\pi^{\sum_{i=1} x_i}(1-\pi)^{n-\sum_{i=1} x_i}}{B\left(\sum_{i=1} x_i+1, n - \sum_{i=1} x_i+1\right)} \]
This is a Beta distribution \(\pi \mid x \sim \text{Beta}(\sum_{i=1} x_i+1, n-\sum_{i=1} x_i+1)\)
Evaluated at \(n=0\), we get the flat prior… If we collect no data, we just get our prior hypothesis back.
This prior is called a flat prior.
It implies that we do not have a prior preference for any particular value of the parameter. (Every value of the parameter is equally probable).
Below, the posterior is evaluated for a single sample that is either one (\(S=1\)) or zero (\(S=0\)).
Code
cols = palette()  # plotting colors (assumed defined earlier in these notes; the default palette is used here as a stand-in)
pi = ppoints(100)
# posterior with no data (n = 0): this is just the flat prior
n = 0
S = 0
prob = dbeta(pi, S+1, n -S + 1)
plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,3), xlim=c(0,1), col=cols[1] )
n = 1
S = 1
prob = dbeta(pi, S+1, n -S + 1)
points(pi, prob, type='l', col=cols[2] )
n = 1
S = 0
prob = dbeta(pi, S+1, n -S + 1)
points(pi, prob, type='l', col=cols[3] )
legend('topright', legend=c('n=0, S=0', 'n=1, S=1', 'n=1, S=0'), col=cols[1:3], lty=1, bty='n')
- Plot for a few random samples for the question about psychiatric drug use.
Code
set.seed(1234)
puf = readRDS('../datasets/nsduh/puf.rds')
ns = c(2, 5, 10, 50, 100)
pi = ppoints(1000)
puf$psyanyflag = ifelse(puf$psyanyflag=='yes', 1, 0)
# add one sampled respondent at a time and plot the running posterior
S = 0
for(n in 1:max(ns)){
S = S + sample(puf$psyanyflag, size = 1)
prob = dbeta(pi, S+1, n -S + 1)
if(n==ns[1]){
plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,15), xlim=c(0,1), col=cols[1], main='Any psychiatric medicine' )
} else if(n %in% ns) {
points(pi, prob, type='l', col=cols[which(ns %in% n)] )
}
}
legend('topright', legend=ns, col=cols[1:length(ns)], lty=1, bty='n')
- Same plots for psilocybin drug use
Code
puf$psilcy = ifelse(puf$psilcy=='yes', 1, 0)
S = 0
for(n in 1:max(ns)){
S = S + sample(puf$psilcy, size = 1)
prob = dbeta(pi, S+1, n -S + 1)
if(n==ns[1]){
plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,15), xlim=c(0,1), col=cols[1], main='Psilocybin' )
} else if(n %in% ns) {
points(pi, prob, type='l', col=cols[which(ns %in% n)] )
}
}
legend('topright', legend=ns, col=cols[1:length(ns)], lty=1, bty='n')
- What are some things you notice about the plots?
- What happens as the sample size gets larger?
5.3.3 Triangle shaped priors
- How about if we think the parameter \(\pi\) lies toward the lower end of the interval (prior to looking at the data)?
- We could use the prior \(p(\pi) = 2(1-\pi)\), which is a \(\text{Beta}(1, 2)\) density.
- Going through the math again gives us the posterior \[ \pi \mid x \sim \text{Beta}(\sum_{i=1} x_i+1, n-\sum_{i=1} x_i+2) \]
Code
set.seed(1234)
S=0
for(n in 1:max(ns)){
S = S + sample(puf$psyanyflag, size = 1)
prob = dbeta(pi, S+1, n -S + 2)
if(n==ns[1]){
plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,15), xlim=c(0,1), col=cols[1], main='Any psychiatric medicine' )
} else if(n %in% ns) {
points(pi, prob, type='l', col=cols[which(ns %in% n)] )
}
}
legend('topright', legend=ns, col=cols[1:length(ns)], lty=1, bty='n')
For the psilocybin data:
Code
S = 0
for(n in 1:max(ns)){
S = S + sample(puf$psilcy, size = 1)
prob = dbeta(pi, S+1, n -S + 2)
if(n==ns[1]){
plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,15), xlim=c(0,1), col=cols[1], main='Psilocybin' )
} else if(n %in% ns) {
points(pi, prob, type='l', col=cols[which(ns %in% n)] )
}
}
legend('topright', legend=ns, col=cols[1:length(ns)], lty=1, bty='n')
5.3.4 Posterior inference using Bayesian statistics
- Because we have the full posterior distribution of \(\pi\), we can do whatever we want with it, e.g. reporting mean and variance.
5.3.4.1 The posterior mode
- A point estimate for the parameters can be obtained from the posterior mode of the parameter (sort of like maximum likelihood).
- The posterior mode is helpful to compare to the frequentist maximum likelihood estimator.
- The mode of a \(\text{Beta}(a, b)\) distribution is the value where the density is maximized; it is (from Wikipedia) \[ \frac{a-1}{a+b-2}. \] For the posterior with the uniform prior, the mode is \[ \frac{1 +\sum_{i=1}x_i - 1}{2 + n - 2 } = \frac{\sum_{i=1}x_i}{n}. \]
- When \(\alpha,\beta=1\) the prior is uniform on \((0,1)\), \[ p(\pi \mid x) \propto p(x \mid \pi) p(\pi) = \mathbb{P}(X=x \mid \pi) \times 1 = L(\pi \mid x), \] so the mode in this case is the MLE! (A short numerical check follows.)
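A short check of this claim (a minimal sketch; the counts S and n are hypothetical):
Code
# Posterior mode under the flat prior equals the MLE (the sample proportion)
n = 50; S = 12
postMode = (S + 1 - 1) / (n + 2 - 2)   # (a-1)/(a+b-2) with a = S+1, b = n-S+1
mle = S / n
c(postMode = postMode, mle = mle)      # both equal 0.24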
5.3.4.2 Bayesian credible intervals
- Bayesians generally frown upon just reporting the mean and instead report quantiles of the posterior.
- Posterior interval: \[ \mathbb{P}(\ell(x) < \pi < u(x) \mid x) = \int_{\ell(x)}^{u(x)} f(\pi \mid x)\, d\pi = 1-\alpha \]
- Bayesian intervals are called credible intervals.
- They show where the parameter is likely to be given the data.
- Distinction: \(\pi\) is random!!!!
- Since it follows a beta distribution, we can get quantiles using R.
Code
set.seed(111)
# S - the number of events / "yes" answers
# n - the sample size
flatPriorBinom = function(S, n, probs=c(0.025, 0.975)){
qs = qbeta(probs, S+1, n -S + 1)
names(qs) = paste('quantile', probs, sep='_')
c(mode=S/n, mean = (S+1)/(n + 2), qs)
}
n = 100
pi = ppoints(1000)
X = sample(puf$psilcy, size = n)
res = flatPriorBinom(sum(X), n)
S = sum(X)
prob = dbeta(pi, S+1, n -S + 1)
poly = data.frame(x=c(pi,rev(pi)), y=c(prob, rep(0, length(pi))))
poly = poly[ poly$x>= res['quantile_0.025'] & poly$x<=res['quantile_0.975'], ]
plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,15), xlim=c(0,0.5), col=cols[1] )
polygon(poly$x, poly$y, border = NA, col=cols[2])
res
## mode mean quantile_0.025 quantile_0.975
## 0.090 0.098 0.049 0.162
- Amazing part of Bayes is the interpretation!
- Parameter lies in the interval with 95% probability.
- Represents subjective probability.
- Dependent on the prior.
5.3.5 Introducing other notation
- Think of \(H\) as a parameter value or vector
- Prior depends on other parameters that are unknown (We need to set them) \[ \begin{align*} \text{likelihood } p(x \mid \theta) \\ \text{prior } p(\theta \mid \gamma) \\ \text{posterior } p(\theta \mid x, \gamma) \end{align*} \]
5.3.6 Conjugate prior
\[ \begin{align*} p(\pi \mid \alpha, \beta) & = \frac{ \pi^{\alpha-1 }(1-\pi)^{\beta-1}}{B(\alpha, \beta)} \\ p(x \mid \pi) & = \pi^{\sum_{i=1} x_i}(1-\pi)^{n-\sum_{i=1} x_i} \\ p(x) & = \text{ not going to worry about this here.} \end{align*} \]
\[ p(\pi \mid x) \propto \frac{ \pi^{\alpha-1 }(1-\pi)^{\beta-1}}{B(\alpha, \beta)} \times \pi^{\sum_{i=1} x_i}(1-\pi)^{n-\sum_{i=1} x_i} \]
\[ \pi \mid x \sim \text{Beta}(\alpha +\sum_{i=1}x_i, \beta + n-\sum_{i=1} x_i ) \]
- Conjugate priors are priors whose posterior is in the same distributional family as the prior (with updated parameters that depend on the data); the Beta prior above is conjugate for the binomial likelihood.
- The uniform distribution is technically also a conjugate prior, with \(\alpha = \beta = 1\).
- The Jeffreys prior for the binomial likelihood is the one you used for the homework in Module 4, \(\alpha = \beta = 1/2\). It is also considered an uninformative prior.
- The uniform prior can be thought of as adding one observation to each category ("yes" and "no"), since the posterior is \(\text{Beta}(\sum_{i=1} x_i + 1, n - \sum_{i=1} x_i + 1)\); see the short sketch below.
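A brief sketch of the conjugate update (the counts here are hypothetical; only the Beta parameters change):
Code
# Conjugate Beta-Binomial update: add the data counts to the prior parameters
alpha = 1; beta = 1                 # uniform prior: one pseudo-observation per category
n = 20; S = 6                       # hypothetical data: 6 "yes" out of 20
postAlpha = alpha + S
postBeta  = beta + n - S
qbeta(c(0.025, 0.975), postAlpha, postBeta)   # 95% credible interval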
5.3.7 Interpretation of these priors
- A point estimate for the parameters can be obtained from the posterior mode of the parameter (sort of like maximum likelihood).
- The mode of a \(\text{Beta}(a, b)\) distribution is the value where the density is maximized; it is \[ \frac{a-1}{a+b-2}. \] For the posterior, the mode is \[ \frac{\alpha +\sum_{i=1}x_i - 1}{\beta +\alpha + n - 2 }. \]
- When \(\alpha,\beta=1\) the prior is uniform on \((0,1)\), \[ p(\pi \mid x) \propto p(x \mid \pi) p(\pi) = \mathbb{P}(X=x \mid \pi) \times 1 = L(\pi \mid x), \] so the mode in this case is the MLE!
- When \(\alpha,\beta = 1/2\) (the Jeffreys prior), the mode is equivalent to subtracting 1/2 from each count (users and non-users); see the short comparison below.
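A short comparison of the two modes (a minimal sketch with hypothetical counts):
Code
# Posterior modes under the uniform (alpha = beta = 1) and Jeffreys (1/2) priors
n = 40; S = 4
c(uniform  = (1 + S - 1) / (2 + n - 2),      # = S/n, the MLE
  jeffreys = (1/2 + S - 1) / (1 + n - 2))    # = (S - 1/2)/(n - 1)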
5.3.8 Plot the posteriors in the results
- Compare Jeffreys and Uniform priors
- Before looking at the posteriors associated with each prior distribution, which posterior do you think will be closer to 1/2?
Code
set.seed(1234)
n = 50
psil = sample(puf$psilcy, size = n)
S = sum(psil)
pi = ppoints(1000)
alpha = beta = 1
plot(pi, dbeta(pi, shape1=alpha+S, shape2=beta+n-S), type='l', col=cols[1], ylab='P(pi|X)', main='Posterior distribution')
alpha = beta = 1/2
points(pi, dbeta(pi, shape1=alpha+S, shape2=beta+n-S), type='l', col=cols[2])
legend('topright', legend=paste0('alpha,beta=', c(1,1/2)), bty='n', lty=1, col=cols[1:2])
5.3.9 Informative priors
- Incorporate subjectivity into our analysis
- What type of data are each of these priors appropriate for?
- A disease prevalence?
- A coin flip?
- Which priors are more “informative” (they are more certain about where the parameter should be)?
- Which priors do you think will be more responsive to the data (i.e. the posterior will be more influenced by the data)?
- Formula reminder for posterior mode: \[ \frac{\alpha +\sum_{i=1}x_i - 1}{\beta +\alpha + n - 2 }. \]
Code
params = data.frame(alpha=c(1,5,10,5,50), beta=c(1,5,5,20,50))
alpha = beta = 1
pi = ppoints(1000)
pOfpi = dbeta(pi, shape1=alpha, shape2=beta)
plot(pi, pOfpi, col=cols[1], ylab='P(pi|X)', type='n', ylim=c(0,20), main='Beta Priors')
for(ind in 1:nrow(params)){
alpha = params$alpha[ind]
beta = params$beta[ind]
pOfpi = dbeta(pi, shape1=alpha, shape2=beta)
points(pi, pOfpi, col=cols[ind], type='l')
}
legend('topright', bty='n', legend=paste0('(alpha, beta)=', '(', params$alpha, ',', params$beta, ')'), lty=1, col=cols[1:nrow(params)] )
5.3.10 Posteriors with small and large sample sizes
5.3.10.1 For the psychiatric medicine data
Code
params = data.frame(alpha=c(1,5,10,5,50), beta=c(1,5,5,20,50))
alpha = beta = 1
pi = ppoints(1000)
pOfpi = dbeta(pi, shape1=alpha, shape2=beta)
ns = c(20, 200, 2000)
for(n in ns){
psych = sample(puf$psyanyflag, size = n)
S = sum(psych)
if(n<=200) ymax=30 else ymax=70
plot(pi, pOfpi, col=cols[1], ylab='P(pi|X)', type='n', ylim=c(0,ymax), main=paste0('Beta Posterior, n=',n))
for(ind in 1:nrow(params)){
alpha = params$alpha[ind]
beta = params$beta[ind]
pOfpi = dbeta(pi, shape1=alpha+S, shape2=beta+n-S)
points(pi, pOfpi, col=cols[ind], type='l')
}
legend('topright', bty='n', legend=paste0('(alpha, beta)=', '(', params$alpha, ',', params$beta, ')'), lty=1, col=cols[1:nrow(params)] )
}
5.3.10.2 For the psilocybin data
Code
params = data.frame(alpha=c(1,5,10,5,50), beta=c(1,5,5,20,50))
alpha = beta = 1
pi = ppoints(1000)
pOfpi = dbeta(pi, shape1=alpha, shape2=beta)
ns = c(20, 200, 2000)
for(n in ns){
psil = sample(puf$psilcy, size = n)
S = sum(psil)
if(n<=200) ymax=30 else ymax=70
plot(pi, pOfpi, col=cols[1], ylab='P(pi|X)', type='n', ylim=c(0,ymax), main=paste0('Beta Posterior, n=',n))
for(ind in 1:nrow(params)){
alpha = params$alpha[ind]
beta = params$beta[ind]
pOfpi = dbeta(pi, shape1=alpha+S, shape2=beta+n-S)
points(pi, pOfpi, col=cols[ind], type='l')
}
legend('topright', bty='n', legend=paste0('(alpha, beta)=', '(', params$alpha, ',', params$beta, ')'), lty=1, col=cols[1:nrow(params)] )
}
Take home points:
- Bayesian approaches are not a magic bullet for small samples
- In large samples the prior doesn't matter much: as \(n\to \infty\), the prior parameters \(\alpha\) and \(\beta\) are small relative to the contribution of the data in \[ \text{Beta}(\alpha +\sum_{i=1}x_i, \beta + n-\sum_{i=1} x_i ). \] A brief numeric illustration follows.
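A brief numeric illustration of this point (a minimal sketch; the counts are hypothetical, with roughly 10% "yes" responses):
Code
# Posterior mean (alpha + S) / (alpha + beta + n) under two priors, small vs large n
priors = list(c(1, 1), c(50, 50))
for (n in c(20, 2000)) {
  S = round(0.1 * n)   # hypothetical: about 10% "yes"
  print(sapply(priors, function(p) (p[1] + S) / (p[1] + p[2] + n)))
}
# the two posterior means are far apart at n = 20 and much closer at n = 2000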
5.4 Assessing frequentist probabilities for Bayesian methods
Binomial proportion
After a Bayesian analysis we can get a posterior interval (PI) \[ p(\ell(x, \alpha) < \pi < u(x, \alpha) \mid x ) = 1-\alpha \]
What is the interpretation of this interval?
Although the interpretation is conditional on the observed data, we still have an interval that we can evaluate across data sets in the frequentist sense.
Code
# number of simulations
nsim = 1000
# sample sizes
ns = round(seq(10, 5000, length.out=10))
# parameter
pis = c(0.001,0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99,0.999)
# Desired coverage probability (1-alpha)
alpha = 0.05
# Function to compute Bayesian Posterior Intervals for pi
# S - sum of the X_i's (the number of successes / "yes" answers)
# alpha and beta are prior parameter values
bayesBinomPI = function(S, n, alpha=1/2, beta=1/2, probs=c(0.025, 0.975)){
qs = t(apply(cbind(alpha, beta), 1, function(row) qbeta(probs, S+row[1], n -S + row[2]) ))
colnames(qs) = paste('quantile', probs, sep='_')
data.frame(alpha=alpha, beta=beta, mode=(S+alpha-1)/(n + alpha + beta - 2), qs)
}
# enter some more priors
priors = data.frame('alpha' = c(1/2, 1, 50, 10), 'beta' = c(1/2, 1, 50, 50))
# creating blank output table
results = expand.grid(pi=pis , n=ns)
priornames = paste(paste0(names(priors)[1], priors[,1]), paste0(names(priors)[2], priors[,2]), sep='_')
colNames = paste(rep(priornames, each=2),
rep(c('Coverage', 'Width'), nrow(priors)))
results[, colNames ] = NA
for(paramSetting in 1:nrow(results)){
pi = results[paramSetting,'pi']
n = results[paramSetting, 'n']
simres = matrix(NA, ncol=ncol(results)-2, nrow=nsim, dimnames=list(NULL, names(results)[-c(1,2)]) )
for(sim in 1:nsim){
# Computes the number of events/ "yes"
S = sum(rbinom(n, 1, pi))
# Computes the posterior interval for all parameter values
PIs = bayesBinomPI(S, n, alpha=priors[,'alpha'], beta=priors[,'beta'])
# Get whether the interval covered the parameter
simres[sim, paste(priornames, 'Coverage')] = (PIs[,4]<=pi & PIs[,5]>=pi)
# Get the width of the interval
simres[sim, paste(priornames, 'Width')] = PIs[,5] - PIs[,4]
}
results[paramSetting, -c(1,2)] = colMeans(simres)
}
### Plot the results
layout(matrix(1:length(ns), nrow=2, byrow = TRUE) )
for(n in ns){
subres = results[ results$n==n, ]
# create the plots
plot(subres$pi, subres[,paste(priornames[1], 'Coverage')], xlab='pi', ylab='Coverage', type='n', ylim=c(0,1), main=paste0('n=', n))
columnNames = paste(priornames, 'Coverage')
trash = sapply(1:length(columnNames), function(ind) points(subres$pi, subres[,columnNames[ind]], type='b', col=cols[ind] ) )
abline(h=0.95, lty=2)
}
legend('bottomright', bg='white', fill=cols[1:length(columnNames)], legend=columnNames)
5.5 Extra stuff
5.5.1 Bayesian continuous mean analysis
- Preliminary info: Writing one-sample and two-sample models as linear models.
5.5.2 One sample mean
- I am using the brms package, which relies on Stan to do MCMC.
- The package needs to compile Stan code for each analysis, which takes about 30 seconds.
- The MCMC part takes fractions of a second (it's super fast).
- My approach here for the "noninformative" prior is to choose a very large prior variance.
Code
set.seed(122)
library(brms)
n = 100
adni = readRDS('../..//datasets/adni/hippocampus.rds')
adni = adni[sample(nrow(adni), n),]
adni$diff = adni$LEFTHIPPO - adni$RIGHTHIPPO
# Prior at zero
# not very informative
noninform = set_prior("normal(0,1000)", class = "Intercept")
# more informative
inform = set_prior("normal(0,2)", class = "Intercept")
# very informative
veryInform = set_prior("normal(0,0.25)", class = "Intercept")
# defaults to flat prior - should be close to the t-test
bayesNonInformResults = brm(diff ~ 1, data=adni, prior = noninform)
bayesInformResults = brm(diff ~ 1, data=adni, prior = inform)
bayesVeryInformResults = brm(diff ~ 1, data=adni, prior = veryInform)
ttestResults = t.test(adni$diff)
5.5.3 Bayesian two sample normal means analysis
- \(\mathbb{E}(Y_i \mid \text{F}) = \mu_0 = \beta_0\)
- \(\mathbb{E}(Y_i \mid \text{M}) = \mu_1 = \beta_0 + \beta_1\)
- Implies \(\beta_1 = \mu_1 - \mu_0\).
Code
library(brms)
# mf: data frame with C4 CSA and Sex columns, one row per participant
# (assumed loaded earlier, e.g., from the tpmf.rdata file used later in these notes)
# removing spaces from names
names(mf) = gsub(' ', '_', names(mf))
# Prior at zero
# not very informative
noninform = c(set_prior("normal(0,1000)", class = "Intercept"), set_prior("normal(0,1000)", class='b') )
# more informative
inform = c(set_prior("normal(0,2)", class='b'))
# very informative
veryInform = c(set_prior("normal(0,0.25)", class='b'))
bayesModels = list(
# defaults to flat prior - should be close to the t-test
bayesNoPriorResults = brm(C4_CSA ~ Sex, data=mf),
# our own noninformative prior
bayesNonInformResults = brm(C4_CSA ~ Sex, data=mf, prior = noninform),
bayesInformResults = brm(C4_CSA ~ Sex, data=mf, prior = inform),
bayesVeryInformResults = brm(C4_CSA ~ Sex, data=mf, prior = veryInform)
)
lapply(bayesModels, prior_summary)
ttestResults = t.test(mf$C4_CSA~mf$Sex)
lapply(bayesModels, function(x) summary(x)$fixed)
Today we will:
- Evaluate the Bayesian binomial proportion using simulations
- Explore Bayesian concepts in the context of a single mean
- The distinction from the binomial proportion is that we now have a normal distribution with two parameters (mean and variance)
5.5.4 Normal mean with known variance
- For a normal mean with known variance, the normal distribution is also the conjugate prior for the mean
\[ \begin{align*} p(\mu \mid \nu, \tau^2) & = (2\pi\tau^2)^{-1/2}\exp\left\{-\frac{1}{2\tau^2}(\mu - \nu)^2\right\} \\ p(x \mid \mu) & = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i - \mu)^2\right\}\\ p(x) & = \text{ don't need to worry about this.} \end{align*} \]
- You’ll derive the posterior for your homework and explore how the prior affects the results
- I’ll do it here too.
- The posterior is \[ \mu \mid x, \nu, \tau^2 \sim N\left\{\frac{\tau^2}{\sigma^2/n + \tau^2} \bar x + \frac{\sigma^2/n}{\sigma^2/n + \tau^2} \nu, \left( \frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1} \right\} \]
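A minimal sketch of this update (the prior and data summaries below are hypothetical values chosen only for illustration):
Code
# Posterior for a normal mean with known variance (conjugate normal prior)
nu = 0; tau2 = 4        # prior mean and variance (hypothetical)
sigma2 = 1; n = 25      # known data variance and sample size (hypothetical)
xbar = 0.3              # observed sample mean (hypothetical)
w = tau2 / (sigma2 / n + tau2)            # weight on the sample mean
postMean = w * xbar + (1 - w) * nu
postVar  = 1 / (n / sigma2 + 1 / tau2)
c(mean = postMean, var = postVar)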
5.5.5 Unknown mean, unknown variance uninformative priors
5.5.5.1 With noninformative priors
- The flat prior is proportional to 1 (an improper prior).
- It is the limit of a normal prior as the variance goes to infinity.
5.5.5.2 Marginalizing out variance to get t-distribution with uninformative priors
\[ \begin{align*} p(x\mid \mu, \sigma^2) & = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i - \mu)^2\right\}\\ p(\mu) & \propto 1 \\ p(\sigma^2) & \propto 1/\sigma^2 \end{align*} \]
Then the marginal posterior of \(\mu\) is \[ \begin{align*} p(\mu \mid x) & \propto \int_{\sigma^2} p(x\mid \mu,\sigma^2)p(\mu)p(\sigma^2)\, d\sigma^2 \\ & = \int_{\sigma^2}(2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2\right\} \times 1/\sigma^2 \, d\sigma^2 \\ & \propto \int_{\sigma^2}(\sigma^2)^{-(n+2)/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2\right\} d\sigma^2 \\ & = \Gamma(n/2)\times \left(\frac{1}{2}\sum_{i=1}^n(x_i - \mu)^2\right)^{-n/2} \\ & \propto \left(\tilde\sigma^2\left[1+ \frac{n(\bar x - \mu)^2}{\tilde\sigma^2} \right] \right)^{-(n-1 + 1)/2}, \end{align*} \] where \(\tilde\sigma^2 = \sum_{i=1}^n(x_i - \bar x)^2\).
- \(\sigma^2\) is a nuisance parameter; we can integrate it out (this is called marginalizing with respect to \(\sigma^2\)).
- This gives us the marginal posterior distribution of \(\mu\).
Take home points of this
- We get the t-distribution: the marginal posterior of \(\mu\) is a scaled and shifted t-distribution with \(n-1\) degrees of freedom.
- We can therefore interpret the t confidence interval in a Bayesian framework, assuming a flat prior on \(\mu\) and a prior proportional to \(1/\sigma^2\) for \(\sigma^2\) (see the numerical check below).
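A numerical check of this correspondence (a minimal sketch using simulated data): under the flat prior on \(\mu\) and the \(1/\sigma^2\) prior, the central 95% posterior interval for \(\mu\) is \(\bar x \pm t_{n-1,\,0.975}\, s/\sqrt{n}\), which is exactly the frequentist t interval.
Code
set.seed(1)
x = rnorm(30, mean = 2, sd = 3)     # simulated data
n = length(x); xbar = mean(x); s = sd(x)
bayes = xbar + qt(c(0.025, 0.975), df = n - 1) * s / sqrt(n)
freq  = t.test(x)$conf.int
rbind(bayes = bayes, freq = as.numeric(freq))   # the endpoints agree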
5.5.6 Informative priors for the normal likelihood
5.5.6.2 Preliminaries
Inverse gamma distribution: if \(\sigma^2 \sim \text{IG}(\alpha, \beta)\), it has PDF \[ f(s) = \frac{\beta^\alpha}{\Gamma(\alpha)}s^{-\alpha-1}e^{-\beta/s}, \] where \(\Gamma(\alpha)\) is the gamma function; the factor \(\beta^\alpha/\Gamma(\alpha)\) is just a normalizing constant so that the density integrates to 1.
For the normal with known mean and unknown variance, the inverse gamma is the conjugate prior (for the variance) \[ \begin{align*} p(\sigma^2 \mid \alpha, \beta) & = \frac{\beta^\alpha}{\Gamma(\alpha)}(\sigma^2)^{-\alpha-1} e^{-\beta/\sigma^2} \\ p(x\mid \mu, \sigma^2) & = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i - \mu)^2\right\} \\ p(x \mid \alpha, \beta) & = \text{ don't need to worry about this.} \end{align*} \]
\[ \begin{align*} p(\sigma^2\mid x) & \propto p(x\mid \sigma^2) p(\sigma^2 \mid \alpha, \beta) \\ & = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i - \mu)^2\right\} \times \frac{\beta^\alpha}{\Gamma(\alpha)}(\sigma^2)^{-\alpha-1} e^{-\beta/\sigma^2} \\ & \propto (\sigma^2)^{-\alpha-n/2-1}\exp\left\{ -\frac{\frac{1}{2}\sum_{i=1}^n(X_i - \mu)^2 + \beta}{\sigma^2}\right\} \\ &\propto IG\left(\alpha+n/2, \beta + 1/2\sum_{i=1}^n(X_i - \mu)^2 \right) \end{align*} \]
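A minimal sketch of this \(\text{IG}(\alpha+n/2,\ \beta + \frac{1}{2}\sum_{i=1}^n(X_i-\mu)^2)\) posterior, using hypothetical data with a known mean and hypothetical prior parameters. It relies on the fact that \(\sigma^2 \sim \text{IG}(a, b)\) is equivalent to \(1/\sigma^2 \sim \text{Gamma}(a, \text{rate}=b)\), so base R's qgamma suffices:
Code
set.seed(2)
mu = 0
x = rnorm(40, mean = mu, sd = 2)     # hypothetical data with known mean mu
alpha = 2; beta = 2                  # hypothetical prior parameters
a = alpha + length(x) / 2            # posterior shape
b = beta + sum((x - mu)^2) / 2       # posterior rate
# posterior median and 95% credible interval for sigma^2
c(median = 1 / qgamma(0.5, shape = a, rate = b),
  lower  = 1 / qgamma(0.975, shape = a, rate = b),
  upper  = 1 / qgamma(0.025, shape = a, rate = b))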
5.5.6.3 Unknown mean, unknown variance informative priors
- Review about definition of dependent/independent random variables
- Conditional on \(x\), \(\sigma^2\) and \(\mu\) are dependent
- More advanced Bayesian approaches are needed for this (Markov chain Monte Carlo)
5.5.6.4 Bayesian inference in practice
- Performing inference in R.
Aside:
- Using a linear model to perform a one-sample t-test (see the sketch below).
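A minimal sketch of the aside (assuming the adni data frame with the diff column from the one-sample example above is still loaded): the intercept-only linear model reproduces the one-sample t-test.
Code
# lm(y ~ 1) estimates the mean; its estimate, SE, and t statistic match t.test(y)
summary(lm(diff ~ 1, data = adni))$coefficients
t.test(adni$diff)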
5.5.6.5 Comparing two means using Bayesian statistics in R
Code
library(brms)
load('~/Box Sync/teaching/PMB/datasets/smith_cervical_sc/tpmf.rdata')
tp$diff = tp$`C4 CSA_tp2` - tp$`C4 CSA_tp1`
# Prior at zero
# not very informative
noninform = set_prior("normal(0,500)", class = "Intercept")
# more informative
inform = set_prior("normal(0,2)", class = "Intercept")
# very informative
veryInform = set_prior("normal(0,0.25)", class = "Intercept")
# defaults to flat prior - basically the same as a t-test
bayesNonInformResults = brm(diff ~ 1, data=tp, prior = noninform )
bayesInformResults = brm(diff ~ 1, data=tp, prior = inform)
bayesVeryInformResults = brm(diff ~ 1, data=tp, prior = veryInform)
ttestResults = t.test(tp$diff)
5.5.7 Software
- There is good software to do Bayesian inference for complicated models
- Stan, INLA, BUGS – I think all can be called from R and Python.