5 Bayesian Statistics

5.1 Overview

Today we will:

  • Introduce Bayesian concepts in the context of the binomial proportion

Important concepts from Bayesian statistics are:

  • The parameter is considered random.
  • The prior is used to incorporate prior information into the result.
  • The posterior is used to interpret the results after observing the data.
  • Interpretation of Bayesian results is about where the parameter is likely to be.

5.2 Beginning of Bayesian Inference

  • Probabilities in frequentist and evidentialist philosophies represent what would happen upon repeated samplings of the data
  • In Bayesian statistics the probability represents a subjective belief about the unknown world
  • In all philosophies the state of the world is expressed in the form of parameters.
  • Bayesian statistics is about updating our belief of the world using Bayes theorem
  • \(H\) is my current hypothesis about the state of the world, expressed in the form of parameters, \(X\) is the observed data, then \[ \mathbb{P}(H \mid X) = \frac{\mathbb{P}(X \mid H) \mathbb{P}(H)}{\mathbb{P}(X)} \]
  • In Bayesian statistics, the state of the world is treated as a random variable, thus we have \(P(H)\)
  • \(P(H)\) is subjective
  • \(\mathbb{P}(X) = \sum_{h} \mathbb{P}(X \mid H=h) \mathbb{P}(H=h)\), in practice it is usually an integral.

Definitions

  • Prior \(\mathbb{P}(H)\)
  • Likelihood \(\mathbb{P}(X \mid H)\)
  • Posterior \(\mathbb{P}(H \mid X)\)

5.2.1 COVID-19 dog example

  • Scent dog detection example (Jendrny et al., 2020)
  • These are the estimates of accuracy for the dogs identifying nCov presence
                   Positive Test   Negative Test
    nCov Absent         4%              96%
    nCov Present       83%              17%
  • Let’s say a dog barks at an airport visitor. Bayesian statistics tries to understand the probability that the person is actually nCov positive using Bayes theorem \[ \mathbb{P}(\text{nCov} | + ) = \frac{\mathbb{P}(+ \mid \text{nCov} ) \mathbb{P}(\text{nCov}) }{\mathbb{P}(+ \mid \text{nCov} ) \mathbb{P}(\text{nCov}) + \mathbb{P}(+ \mid \text{no nCov} ) \mathbb{P}(\text{no nCov})} \]
  • The dog's positive indication (the “test” result) is the data point; whether the person actually has nCov is the parameter.
  • The probability someone comes into the airport with nCov is unknown.
  • We can specify this probability from prior data.
    • The number of current cases is 9,165,744 (Worldometer).
    • The US population is approximately 333,000,000 (US population clock).
    • Thus we can approximate \(\mathbb{P}(\text{nCov}) = 9,165,744/333,000,000 = 0.0275\).
  • We can then compute the probability that someone is sick, given that the dog barked (checked numerically in the code after this list) \[ \mathbb{P}(\text{nCov} | + ) = \frac{0.83 \times 0.0275 }{0.83 \times 0.0275 + 0.04 \times (1-0.0275)} \approx 0.37 \]
  • There’s less than 50% chance the person actually has nCov given the test is positive.
  • This is because the false positive rate is high enough, and nCov is rare enough, that the majority of people who test positive are false positives.
  • Obviously, a decision still needs to be made about what to do with the person.
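  • A quick numerical check of this calculation in R, using the prevalence and dog accuracy values above:
Code
p_pos_given_ncov   = 0.83            # dog sensitivity (from the table above)
p_pos_given_noncov = 0.04            # dog false positive rate
p_ncov = 9165744 / 333000000         # prior probability of nCov from the case counts

# Bayes theorem: P(nCov | +)
round((p_pos_given_ncov * p_ncov) /
  (p_pos_given_ncov * p_ncov + p_pos_given_noncov * (1 - p_ncov)), 2)
## [1] 0.37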

Fundamental difference in interpretation:

  • Parameter is “random.”
  • Probability represents our belief about the unknown value.

5.2.2 Bayesian analysis overview

  1. Decide a reasonable model for the data.
  2. Choose a reasonable prior for the parameters in the model.
  3. Use Bayes theorem to compute the posterior, \(P(H \mid X)\).
  4. Do stuff with \(P(H \mid X)\):
    • Create intervals.
    • Compute Bayes Factors.
    • Compute point estimates (most likely values for unknown parameters).

5.3 Binomial proportion

5.3.1 Examples in the NSDUH data set

  • Let’s consider binomial data from the NSDUH data set again.
    • psilcy - “yes”/“no” for having ever taken psilocybin, a psychedelic (and recreational) drug recently used in clinical trials for psychiatric treatment.
    • psyanyflag - “yes”/“no” for having ever taken a psychiatric medication.
  • What is a reasonable model for these? Bernoulli
  • What is a reasonable prior for the parameter? Flat? Skewed left? Skewed right? Centered? – Will depend on the drug type.

5.3.2 Flat prior for binomial proportions

  • Binomial proportion with flat prior \[ \begin{align*} \mathbb{P}(H) & = \mathbb{P}(\pi) = 1, \text{ for $\pi \in [0,1]$} \\ \mathbb{P}(X \mid H) & = \pi^{\sum_{i=1} x_i}(1-\pi)^{n-\sum_{i=1} x_i} \\ \mathbb{P}(X) & = \int_0^1 \pi^{\sum_{i=1} x_i}(1-\pi)^{n-\sum_{i=1} x_i} d\pi = B\left(\sum_{i=1} x_i+1, n - \sum_{i=1} x_i+1\right) \end{align*} \]

We can then compute the distribution of the parameter conditional on the data \[ \mathbb{P}(\pi \mid X) = \frac{\mathbb{P}(X \mid \pi) \times \mathbb{P}(\pi)}{\mathbb{P}(X)} = \frac{\pi^{\sum_{i=1} x_i}(1-\pi)^{n-\sum_{i=1} x_i}}{B\left(\sum_{i=1} x_i+1, n - \sum_{i=1} x_i+1\right)} \]

  • This is a Beta distribution \(\pi \mid x \sim \text{Beta}(\sum_{i=1} x_i+1, n-\sum_{i=1} x_i+1)\)

  • Evaluated at \(n=0\), we get the flat prior… If we collect no data, we just get our prior hypothesis back.

  • This prior is called a flat prior.

  • It implies that we do not have a prior preference for any particular value of the parameter. (Every value of the parameter is equally probable).

  • Below we evaluate the posterior for a single observation that is either a one or a zero.

Code
# cols is a color palette used throughout these notes; define a simple one if needed
if(!exists('cols')) cols = 1:8
pi = ppoints(100)
# n = 0, S = 0: no data, so the posterior is just the flat prior
n = 0
S = 0
prob = dbeta(pi, S+1, n -S + 1)
plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,3), xlim=c(0,1), col=cols[1] )

n = 1
S = 1
prob = dbeta(pi, S+1, n -S + 1)
points(pi, prob, type='l', col=cols[2] )

n = 1
S = 0
prob = dbeta(pi, S+1, n -S + 1)
points(pi, prob, type='l', col=cols[3] )
legend('topright', legend=c('n=0, S=0', 'n=1, S=1', 'n=1, S=0'), col=cols[1:3], lty=1, bty='n')

  • Plot for a few random samples for the question about psychiatric drug use.
Code
set.seed(1234)
puf = readRDS('../datasets/nsduh/puf.rds')
ns = c(2, 5, 10, 50, 100)
pi = ppoints(1000)
puf$psyanyflag = ifelse(puf$psyanyflag=='yes', 1, 0)
S=0
for(n in 1:max(ns)){
  S = S + sample(puf$psyanyflag, size = 1)
  prob = dbeta(pi, S+1, n -S + 1)
  if(n==ns[1]){
    plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,15), xlim=c(0,1), col=cols[1], main='Any psychiatric medicine' )
  } else if(n %in% ns) {
    points(pi, prob, type='l', col=cols[which(ns %in% n)] )
  }
}
legend('topright', legend=ns, col=cols[1:length(ns)], lty=1, bty='n')

  • Same plots for psilocybin drug use
Code
puf$psilcy = ifelse(puf$psilcy=='yes', 1, 0)
S = 0
for(n in 1:max(ns)){
  S = S + sample(puf$psilcy, size = 1)
  prob = dbeta(pi, S+1, n -S + 1)
  if(n==ns[1]){
    plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,15), xlim=c(0,1), col=cols[1], main='Psilocybin' )
  } else if(n %in% ns) {
    points(pi, prob, type='l', col=cols[which(ns %in% n)] )
  }
}
legend('topright', legend=ns, col=cols[1:length(ns)], lty=1, bty='n')

  • What are some things you notice about the plots?
  • What happens as the sample size gets larger?

5.3.3 Triangle shaped priors

  • How about if we think the parameter \(\pi\) lies toward the lower end of the interval (prior to looking at the data)?
  • We could use a prior \(2(1-\pi)\).
  • Going through the math again gives us the posterior \[ \pi \mid x \sim \text{Beta}(\sum_{i=1} x_i+1, n-\sum_{i=1} x_i+2) \]
Code
set.seed(1234)
S=0
for(n in 1:max(ns)){
  S = S + sample(puf$psyanyflag, size = 1)
  prob = dbeta(pi, S+1, n -S + 2)
  if(n==ns[1]){
    plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,15), xlim=c(0,1), col=cols[1], main='Any psychiatric medicine' )
  } else if(n %in% ns) {
    points(pi, prob, type='l', col=cols[which(ns %in% n)] )
  }
}
legend('topright', legend=ns, col=cols[1:length(ns)], lty=1, bty='n')

For the psilocybin data:

Code
S = 0
for(n in 1:max(ns)){
  S = S + sample(puf$psilcy, size = 1)
  prob = dbeta(pi, S+1, n -S + 2)
  if(n==ns[1]){
    plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,15), xlim=c(0,1), col=cols[1], main='Psilocybin' )
  } else if(n %in% ns) {
    points(pi, prob, type='l', col=cols[which(ns %in% n)] )
  }
}
legend('topright', legend=ns, col=cols[1:length(ns)], lty=1, bty='n')

5.3.4 Posterior inference using Bayesian statistics

  • Because we have the full posterior distribution of \(\pi\), we can do whatever we want with it, e.g. reporting mean and variance.

5.3.4.1 The posterior mode

  • A point estimate for the parameter can be obtained from the posterior mode (sort of like maximum likelihood).
  • The posterior mode is helpful to compare to the frequentist maximum likelihood estimator.
  • The mode of a \(\text{Beta}(a, b)\) distribution is the value where the density is maximized; it is (from Wikipedia) \[ \frac{a-1}{a+b-2}. \] For the posterior with the uniform prior, the mode is \[ \frac{1 +\sum_{i=1}x_i - 1}{2 + n - 2 } = \frac{\sum_{i=1}x_i}{n}. \]
  • When \(\alpha,\beta=1\) the prior is uniform on \((0,1)\), \[ p(\pi \mid x) \propto p(x \mid \pi) p(\pi) = \mathbb{P}(X=x \mid \pi) \times 1 = L(\pi \mid x) \] the mode in this case is the MLE!
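  • A minimal sketch in R illustrating this, assuming a small simulated Bernoulli sample (hypothetical true \(\pi = 0.2\)):
Code
set.seed(1)
n = 30
x = rbinom(n, 1, 0.2)   # simulated Bernoulli data
S = sum(x)

# MLE of pi
mle = S / n
# Posterior mode of Beta(S + 1, n - S + 1): (a - 1)/(a + b - 2) with a = S + 1, b = n - S + 1
a = S + 1; b = n - S + 1
postMode = (a - 1) / (a + b - 2)

c(MLE = mle, posteriorMode = postMode)   # identical under the flat prior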

5.3.4.2 Bayesian credible intervals

  • Bayesians generally frown upon just reporting the mean and report quantiles instead.
  • Posterior interval \[ \mathbb{P}(\ell(x) < \pi < u(x) \mid x) = \int_\ell^u f(\pi \mid x) d\pi = 1-\alpha \]
  • Bayesian intervals are called Credible intervals.
  • They show where the parameter is likely to be given the data.
  • Distinction: \(\pi\) is random!!!!
  • Since it follows a beta distribution, we can get quantiles using R.
Code
set.seed(111)
# S - the number of events / "yes" answers
# n - the sample size
flatPriorBinom = function(S, n, probs=c(0.025, 0.975)){
  qs =  qbeta(probs, S+1, n -S + 1)
  names(qs) = paste('quantile', probs, sep='_')
  c(mode=S/n, mean = (S+1)/(n + 2), qs)
}



n = 100
pi = ppoints(1000)
X = sample(puf$psilcy, size = n)
res = flatPriorBinom(sum(X), n)
S = sum(X)
prob = dbeta(pi, S+1, n -S + 1)
poly = data.frame(x=c(pi,rev(pi)), y=c(prob, rep(0, length(pi))))
poly = poly[ poly$x>= res['quantile_0.025'] & poly$x<=res['quantile_0.975'], ]
plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,15), xlim=c(0,0.5), col=cols[1] )
polygon(poly$x, poly$y, border = NA, col=cols[2])

Code
round(flatPriorBinom(sum(X), n), 3)
##           mode           mean quantile_0.025 quantile_0.975 
##          0.090          0.098          0.049          0.162
Code
# True value of the parameter in the "population"
mean(puf$psilcy)
## [1] 0.07784242
  • Amazing part of Bayes is the interpretation!
  • Parameter lies in the interval with 95% probability.
  • Represents subjective probability.
  • Dependent on the prior.

5.3.5 Introducing other notation

  • Think of \(H\) as a parameter value or vector
  • Prior depends on other parameters that are unknown (We need to set them) \[ \begin{align*} \text{likelihood } p(x \mid \theta) \\ \text{prior } p(\theta \mid \gamma) \\ \text{posterior } p(\theta \mid x, \gamma) \end{align*} \]

5.3.6 Conjugate prior

\[ \begin{align*} p(\pi \mid \alpha, \beta) & = \frac{ \pi^{\alpha-1 }(1-\pi)^{\beta-1}}{B(\alpha, \beta)} \\ p(x \mid \pi) & = \pi^{\sum_{i=1} x_i}(1-\pi)^{n-\sum_{i=1} x_i} \\ p(x) & = \text{ not going to worry about this here.} \end{align*} \]

\[ p(\pi \mid x) \propto \frac{ \pi^{\alpha-1 }(1-\pi)^{\beta-1}}{B(\alpha, \beta)} \times \pi^{\sum_{i=1} x_i}(1-\pi)^{n-\sum_{i=1} x_i} \]

\[ \pi \mid x \sim \text{Beta}(\alpha +\sum_{i=1}x_i, \beta + n-\sum_{i=1} x_i ) \]

  • Conjugate priors are priors whose posterior is in the same distributional family as the prior (but with different parameters that depend on the data); the Beta prior above is a conjugate prior for the binomial likelihood.
  • Uniform distribution is technically also a conjugate prior with \(\alpha = \beta = 1\).
  • Jeffreys prior for binomial likelihood is the one you used for the Homework in Module 4, \(\alpha = \beta = 1/2\). Also considered an uninformative prior.
Code
pi = ppoints(100)
prob = dbeta(pi, 1/2, 1/2)
plot(pi, prob, ylab='P(pi)', type='l', ylim=c(0,5), xlim=c(0,1), col=cols[1], main='Jeffreys prior' )

  • Prior that adds one observation to each category
Code
pi = ppoints(100)
prob = dbeta(pi, 1+1/2, 1+1/2)
plot(pi, prob, ylab='P(pi)', type='l', ylim=c(0,5), xlim=c(0,1), col=cols[1], main='Add-one-half prior' )
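  • As a quick numerical check of the conjugate update \(\text{Beta}(\alpha +\sum_{i=1}x_i, \beta + n-\sum_{i=1} x_i )\), here is a minimal sketch (assuming simulated Bernoulli data and an arbitrary Beta prior) comparing the closed-form posterior to the prior times likelihood normalized on a grid:
Code
set.seed(2)
n = 25
x = rbinom(n, 1, 0.3)    # simulated data (hypothetical true pi = 0.3)
S = sum(x)
alpha = 2; beta = 5      # an arbitrary Beta prior

pi_grid = ppoints(2000)
# Unnormalized posterior: prior density times likelihood
unnorm = dbeta(pi_grid, alpha, beta) * pi_grid^S * (1 - pi_grid)^(n - S)
# Normalize numerically on the grid
gridPost = unnorm / sum(unnorm) / diff(pi_grid[1:2])
# Closed-form conjugate posterior
closedForm = dbeta(pi_grid, alpha + S, beta + n - S)

max(abs(gridPost - closedForm))   # small (grid approximation error only)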

5.3.7 Interpretation of these priors

  • A point estimate for the parameter can be obtained from the posterior mode (sort of like maximum likelihood).
  • The mode of a \(\text{Beta}(a, b)\) distribution is the value where the density is maximized; it is (from Wikipedia) \[ \frac{a-1}{a+b-2}. \] For the posterior, the mode is \[ \frac{\alpha +\sum_{i=1}x_i - 1}{\beta +\alpha + n - 2 }. \]
  • When \(\alpha,\beta=1\) the prior is uniform on \((0,1)\), \[ p(\pi \mid x) \propto p(x \mid \pi) p(\pi) = \mathbb{P}(X=x \mid \pi) \times 1 = L(\pi \mid x) \] the mode in this case is the MLE!
  • When \(\alpha,\beta = 1/2\) it is equivalent to subtracting 1/2 from the counts of psilocybin users and non-users.

5.3.8 Plot the posteriors in the results

  • Compare Jeffreys and Uniform priors
  • Before looking at the posteriors associated with each prior distribution, which posterior do you think will be closer to 1/2?
Code
set.seed(1234)
n = 50
psil = sample(puf$psilcy, size = n)
S = sum(psil)
pi = ppoints(1000)
alpha = beta = 1
plot(pi, dbeta(pi, shape1=alpha+S, shape2=beta+n-S), type='l', col=cols[1], ylab='P(pi|X)', main='Posterior distribution')
alpha = beta = 1/2
points(pi, dbeta(pi, shape1=alpha+S, shape2=beta+n-S), type='l', col=cols[2])
legend('topright', legend=paste0('alpha,beta=', c(1,1/2)), bty='n', lty=1, col=cols[1:2])

5.3.9 Informative priors

  • Incorporate subjectivity into our analysis
  • What type of data are each of these priors appropriate for?
    • A disease prevalence?
    • A coin flip?
  • Which priors are more “informative” (they are more certain about where the parameter should be)?
  • Which priors do you think will be more responsive to the data (i.e. the posterior will be more influenced by the data)?
  • Formula reminder for posterior mode: \[ \frac{\alpha +\sum_{i=1}x_i - 1}{\beta +\alpha + n - 2 }. \]
Code
params = data.frame(alpha=c(1,5,10,5,50), beta=c(1,5,5,20,50))
alpha = beta = 1
pi = ppoints(1000)
pOfpi = dbeta(pi, shape1=alpha, shape2=beta)
plot(pi, pOfpi, col=cols[1], ylab='P(pi|X)', type='n', ylim=c(0,20), main='Beta Priors')
for(ind in 1:nrow(params)){
  alpha = params$alpha[ind]
  beta  = params$beta[ind]
  pOfpi = dbeta(pi, shape1=alpha, shape2=beta)
  points(pi, pOfpi, col=cols[ind], type='l')
}
legend('topright', bty='n', legend=paste0('(alpha, beta)=', '(', params$alpha, ',', params$beta, ')'), lty=1, col=cols[1:nrow(params)] )
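  • Using the posterior-mode formula above, the sketch below (assuming the puf data and the params priors from the earlier code chunks) computes the posterior mode under each prior for one small psilocybin sample:
Code
set.seed(1234)
n = 50
psil = sample(puf$psilcy, size = n)
S = sum(psil)

# Posterior mode (alpha + S - 1)/(alpha + beta + n - 2) under each prior
data.frame(alpha = params$alpha, beta = params$beta,
           postMode = round((params$alpha + S - 1) / (params$alpha + params$beta + n - 2), 3),
           mle = S / n)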

5.3.10 Posteriors with small and large sample sizes

5.3.10.1 For the psychiatric medicine data

Code
params = data.frame(alpha=c(1,5,10,5,50), beta=c(1,5,5,20,50))
alpha = beta = 1
pi = ppoints(1000)
pOfpi = dbeta(pi, shape1=alpha, shape2=beta)

ns = c(20, 200, 2000)
for(n in ns){
  psych = sample(puf$psyanyflag, size = n)
  S = sum(psych)
  if(n<=200) ymax=30 else ymax=70
  plot(pi, pOfpi, col=cols[1], ylab='P(pi|X)', type='n', ylim=c(0,ymax), main=paste0('Beta Posterior, n=',n))
  for(ind in 1:nrow(params)){
    alpha = params$alpha[ind]
    beta  = params$beta[ind]
    pOfpi = dbeta(pi, shape1=alpha+S, shape2=beta+n-S)
    points(pi, pOfpi, col=cols[ind], type='l')
  }
  legend('topright', bty='n', legend=paste0('(alpha, beta)=', '(', params$alpha, ',', params$beta, ')'), lty=1, col=cols[1:nrow(params)] )
}

5.3.10.2 For the psilocybin data

Code
params = data.frame(alpha=c(1,5,10,5,50), beta=c(1,5,5,20,50))
alpha = beta = 1
pi = ppoints(1000)
pOfpi = dbeta(pi, shape1=alpha, shape2=beta)

ns = c(20, 200, 2000)
for(n in ns){
  psil = sample(puf$psilcy, size = n)
  S = sum(psil)
  if(n<=200) ymax=30 else ymax=70
  plot(pi, pOfpi, col=cols[1], ylab='P(pi|X)', type='n', ylim=c(0,ymax), main=paste0('Beta Posterior, n=',n))
  for(ind in 1:nrow(params)){
    alpha = params$alpha[ind]
    beta  = params$beta[ind]
    pOfpi = dbeta(pi, shape1=alpha+S, shape2=beta+n-S)
    points(pi, pOfpi, col=cols[ind], type='l')
  }
  legend('topright', bty='n', legend=paste0('(alpha, beta)=', '(', params$alpha, ',', params$beta, ')'), lty=1, col=cols[1:nrow(params)] )
}

Take home points:

  • Bayesian approaches are not a magic bullet for small samples
  • In large samples the prior doesn’t matter much: as \(n\to \infty\) the prior parameters \(\alpha\) and \(\beta\) are small relative to the contribution of the data. \[ \text{Beta}(\alpha +\sum_{i=1}x_i, \beta + n-\sum_{i=1} x_i ) \]
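  • A small numerical illustration of the prior washing out: the posterior mean \((\alpha + \sum_{i=1}x_i)/(\alpha+\beta+n)\) under a fairly strong Beta(50, 50) prior, for increasing \(n\) (assuming a hypothetical true \(\pi = 0.1\)):
Code
set.seed(1)
alpha = beta = 50      # a fairly strong prior centered at 1/2
true_pi = 0.1
for(n in c(20, 200, 2000, 20000)){
  S = sum(rbinom(n, 1, true_pi))
  postMean = (alpha + S) / (alpha + beta + n)
  cat('n =', n, ': posterior mean =', round(postMean, 3), ', MLE =', round(S/n, 3), '\n')
}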

5.4 Assessing frequentist probabilities for Bayesian methods

  • Binomial proportion

  • After a Bayesian analysis we can get a PI \[ p(\ell(x, \alpha) < \pi < u(x, \alpha) \mid x ) = 1-\alpha \]

  • What is the interpretation of this interval?

  • Although the interpretation is conditional on the observed data, we still have an interval that we can evaluate across data sets in the frequentist sense.

Code
# number of simulations
nsim = 1000
# sample sizes
ns = round(seq(10, 5000, length.out=10))
# parameter
pis = c(0.001,0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99,0.999)

# Desired coverage probability (1-alpha)
alpha = 0.05

# Function to compute Bayesian Posterior Intervals for pi
# S - sum of the X_i's (the number of successes / "yes" answers)
# alpha and beta are prior parameter values 
bayesBinomPI = function(S, n, alpha=1/2, beta=1/2, probs=c(0.025, 0.975)){
  qs =  t(apply(cbind(alpha, beta), 1, function(row) qbeta(probs, S+row[1], n -S + row[2]) ))
  colnames(qs) = paste('quantile', probs, sep='_')
  data.frame(alpha=alpha, beta=beta, mode=(S+alpha-1)/(n + alpha + beta - 2), qs)
}

# enter some more priors
priors = data.frame('alpha' = c(1/2, 1, 50, 10), 'beta' = c(1/2, 1, 50, 50))

# creating blank output table
results = expand.grid(pi=pis , n=ns)
priornames = paste(paste0(names(priors)[1], priors[,1]), paste0(names(priors)[2], priors[,2]), sep='_')
colNames = paste(rep(priornames, each=2), 
                 rep(c('Coverage', 'Width'), nrow(priors)))
results[, colNames ] = NA



for(paramSetting in 1:nrow(results)){
  pi = results[paramSetting,'pi']
  n = results[paramSetting, 'n']
  
  #matrix(rlnorm(n * nsim, meanlog = mu, sdlog=sqrt(sigmaSq)), nrow=n)
  simres = matrix(NA, ncol=ncol(results)-2, nrow=nsim, dimnames=list(NULL, names(results)[-c(1,2)]) )
  for(sim in 1:nsim){
    # Computes the number of events/ "yes"
    S = sum(rbinom(n, 1, pi))
    # Computes the posterior interval for all parameter values
    PIs = bayesBinomPI(S, n, alpha=priors[,'alpha'], beta=priors[,'beta'])
    
    # Get whether the interval covered the parameter
    simres[sim, paste(priornames, 'Coverage')] = (PIs[,4]<=pi & PIs[,5]>=pi)
    # Get the width of the interval
    simres[sim, paste(priornames, 'Width')] = PIs[,5] - PIs[,4]
  }
  results[paramSetting, -c(1,2)] = colMeans(simres)
}


### Plot the results
layout(matrix(1:length(ns), nrow=2, byrow = TRUE) )
for(n in ns){
  subres = results[ results$n==n, ]
  # create the plots
  plot(subres$pi, subres[,paste(priornames[1], 'Coverage')], xlab='pi', ylab='Coverage', type='n', ylim=c(0,1), main=paste0('n=', n))
  columnNames = paste(priornames, 'Coverage')
  trash = sapply(1:length(columnNames), function(ind) points(subres$pi, subres[,columnNames[ind]], type='b', col=cols[ind] ) )
  abline(h=0.95, lty=2)
}
legend('bottomright', bg='white', fill=cols[1:length(columnNames)], legend=columnNames)

5.5 Extra stuff

5.5.1 Bayesian continuous mean analysis

  • Preliminary info: Writing one-sample and two-sample models as linear models.

5.5.2 One sample mean

  • I am using the brms package, which relies on Stan to do MCMC
  • The package needs to compile Stan code for each analysis, which takes about 30 seconds
    • The MCMC part takes fractions of a second (it’s super fast)
  • My approach here to a “noninformative” prior is to choose a very large prior variance
Code
set.seed(122)
library(brms)
n = 100
adni = readRDS('../..//datasets/adni/hippocampus.rds')
adni = adni[sample(nrow(adni), n),]
adni$diff = adni$LEFTHIPPO - adni$RIGHTHIPPO
# Prior at zero
# not very informative
noninform = set_prior("normal(0,1000)", class = "Intercept")
# more informative 
inform = set_prior("normal(0,2)", class = "Intercept")
# very informative 
veryInform = set_prior("normal(0,0.25)", class = "Intercept")

# very wide (essentially flat) prior - should be close to the t-test
bayesNonInformResults = brm(diff ~ 1, data=adni, prior = noninform)
bayesInformResults = brm(diff ~ 1, data=adni, prior = inform)
bayesVeryInformResults = brm(diff ~ 1, data=adni, prior = veryInform)

ttestResults = t.test(adni$diff)

5.5.3 Bayesian two sample normal means analysis

  • \(\mathbb{E}Y_i \mid \text{F} = \mu_0 = \beta_0\)
  • \(\mathbb{E}Y_i \mid \text{M} = \mu_1 = \beta_0 + \beta_1\)
  • Implies \(\beta_1 = \mu_1 - \mu_0\).
Code
library(brms)

# mf: data set with Sex and C4 CSA columns (assumed to be loaded earlier in the notes)
# removing spaces from names
names(mf) = gsub(' ', '_', names(mf))
# Prior at zero
# not very informative
noninform = c(set_prior("normal(0,1000)", class = "Intercept"), set_prior("normal(0,1000)", class='b') )
# more informative 
inform = c(set_prior("normal(0,2)", class='b'))
# very informative 
veryInform = c(set_prior("normal(0,0.25)", class='b'))


bayesModels = list(
  # defaults to flat prior - should be close to the t-test
  bayesNoPriorResults = brm(C4_CSA ~ Sex, data=mf),
  # our own noninformative prior
  bayesNonInformResults = brm(C4_CSA ~ Sex, data=mf, prior = noninform),
  bayesInformResults = brm(C4_CSA ~ Sex, data=mf, prior = inform),
  bayesVeryInformResults = brm(C4_CSA ~ Sex, data=mf, prior = veryInform)
)

lapply(bayesModels, prior_summary)


ttestResults = t.test(mf$C4_CSA~mf$Sex)
lapply(bayesModels, function(x) summary(x)$fixed)

Today we will:

  • Evaluate the Bayesian binomial proportion using simulations
  • Explore Bayesian concepts in the context of a single mean
  • The distinction from the binomial proportion is that we now have a normal distribution and two parameters (mean and variance)

5.5.4 Normal mean with known variance

  • For the normal mean with known variance, the normal distribution is also the conjugate prior for the mean

\[ \begin{align*} p(\mu \mid \nu, \tau^2) & = (2\pi\tau^2)^{-1/2}\exp\left\{-\frac{1}{2\tau^2}(\mu - \nu)^2\right\} \\ p(x \mid \mu) & = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i - \mu)^2\right\}\\ p(x) & = \text{ don't need to worry about this.} \end{align*} \]

  • You’ll derive the posterior for your homework and explore how the prior affects the results
  • I’ll do it here too.
  • The posterior is \[ \mu \mid x, \nu, \tau^2 \sim N\left\{\frac{\tau^2}{\sigma^2/n + \tau^2} \bar x + \frac{\sigma^2/n}{\sigma^2/n + \tau^2} \nu, \left( \frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1} \right\} \]
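  • A minimal sketch (assuming simulated normal data with a known \(\sigma^2\)) comparing this closed-form posterior mean to a grid approximation of prior \(\times\) likelihood:
Code
set.seed(3)
n = 20
sigma2 = 4               # known variance (an assumption for this sketch)
nu = 0; tau2 = 1         # prior mean and variance
x = rnorm(n, mean = 1, sd = sqrt(sigma2))
xbar = mean(x)

# Closed-form posterior mean and variance from the formula above
postVar  = 1 / (n / sigma2 + 1 / tau2)
postMean = (tau2 / (sigma2/n + tau2)) * xbar + ((sigma2/n) / (sigma2/n + tau2)) * nu

# Grid approximation of the posterior for comparison
mu_grid = seq(-3, 4, length.out = 5000)
logpost = dnorm(mu_grid, nu, sqrt(tau2), log = TRUE) +
  sapply(mu_grid, function(m) sum(dnorm(x, m, sqrt(sigma2), log = TRUE)))
post = exp(logpost - max(logpost)); post = post / sum(post)

c(closedForm = postMean, grid = sum(mu_grid * post))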

5.5.5 Unknown mean, unknown variance uninformative priors

5.5.5.1 With noninformative priors

  • Flat prior is proportional to 1 (an improper prior)
  • Equivalently, the limit of a normal prior as its variance goes to infinity

5.5.5.2 Marginalizing out variance to get t-distribution with uninformative priors

\[ \begin{align*} p(x\mid \mu, \sigma^2) & = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i - \mu)^2\right\}\\ p(\mu) & \propto 1 \\ p(\sigma^2) & \propto 1/\sigma^2 \end{align*} \]

Then the posterior, marginalized over \(\sigma^2\), is \[ \begin{align*} p(\mu \mid x) = \int_{\sigma^2} p(\mu, \sigma^2 \mid x)\, d\sigma^2 & \propto \int_{\sigma^2} p(x\mid \mu,\sigma^2)p(\mu)p(\sigma^2)\, d\sigma^2 \\ & = \int_{\sigma^2}(2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2\right\} \times 1/\sigma^2 \, d\sigma^2 \\ & \propto \int_{\sigma^2}(\sigma^2)^{-(n+2)/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2\right\} d\sigma^2 \\ & = \Gamma(n/2)\times \left(\frac{1}{2}\sum_{i=1}^n(x_i - \mu)^2\right)^{-n/2} \\ & \propto \left(\tilde\sigma^2\left[1+ \frac{n(\bar x - \mu)^2}{\tilde\sigma^2} \right] \right)^{-(n-1 + 1)/2}, \end{align*} \] where \(\tilde\sigma^2 = \sum_{i=1}^n(x_i - \bar x)^2\).

  • \(\sigma^2\) is a nuisance parameter, we can integrate it out (called marginalizing with respect to \(\sigma^2\)).
  • This will give us the marginal distribution of \(\mu\).

Take home points of this

  • We get the t-distribution.
  • We can interpret the t confidence interval in a Bayesian framework, assuming a flat prior on \(\mu\) and a prior proportional to \(1/\sigma^2\) for \(\sigma^2\).
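  • A minimal sketch (assuming simulated normal data) showing that under these priors the posterior quantiles of \(\mu\) reproduce the usual t confidence interval:
Code
set.seed(4)
n = 15
x = rnorm(n, mean = 2, sd = 3)
xbar = mean(x); s = sd(x)

# Under the flat prior on mu and 1/sigma^2 prior on sigma^2, the marginal posterior of
# mu is xbar + (s/sqrt(n)) * t_{n-1}, so the 95% credible interval is
credInt = xbar + qt(c(0.025, 0.975), df = n - 1) * s / sqrt(n)

# The frequentist 95% t confidence interval is numerically identical
rbind(credible = credInt, t_interval = as.numeric(t.test(x)$conf.int))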

5.5.6 Informative priors for the normal likelihood

5.5.6.1 Known mean, unknown variance

5.5.6.2 Preliminaries

  • Inverse gamma distribution: If \(\sigma^2 \sim \text{IG}(\alpha, \beta)\) it has PDF \[ f(s) = \frac{\beta^\alpha}{\Gamma(\alpha)}s^{-\alpha-1}e^{-\beta/s} \] \(\Gamma(\alpha)\) is the gamma function (it’s just a normalizing constant so that the integral is equal to 1).

  • For the normal with known mean and unknown variance, the inverse gamma is the conjugate prior (for the variance) \[ \begin{align*} p(\sigma^2 \mid \alpha, \beta) & = \frac{\beta^\alpha}{\Gamma(\alpha)}(\sigma^2)^{-\alpha-1} e^{-\beta/\sigma^2} \\ p(x\mid \mu, \sigma^2) & = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i - \mu)^2\right\} \\ p(x \mid \alpha, \beta) & = \text{ don't need to worry about this.} \end{align*} \]

\[ \begin{align*} p(\sigma^2\mid x) & \propto p(x\mid \sigma^2) p(\sigma^2 \mid \alpha, \beta) \\ & = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i - \mu)^2\right\} \times \frac{\beta^\alpha}{\Gamma(\alpha)}(\sigma^2)^{-\alpha-1} e^{-\beta/\sigma^2} \\ & \propto (\sigma^2)^{-\alpha-n/2-1}\exp\left\{ -\frac{\frac{1}{2}\sum_{i=1}^n(X_i - \mu)^2 + \beta}{\sigma^2}\right\} \\ &\propto IG\left(\alpha+n/2, \beta + 1/2\sum_{i=1}^n(X_i - \mu)^2 \right) \end{align*} \]
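  • A minimal sketch (assuming simulated data with a known mean and an arbitrary IG prior) computing the conjugate IG posterior parameters and comparing the posterior mean of \(\sigma^2\) to the value used in the simulation:
Code
set.seed(5)
n = 200
mu = 0                  # known mean (an assumption for this sketch)
sigma2_true = 9
x = rnorm(n, mu, sqrt(sigma2_true))

alpha = 3; beta = 3     # an arbitrary IG prior
# Conjugate update from the derivation above
alphaPost = alpha + n/2
betaPost  = beta + sum((x - mu)^2)/2

# The mean of an IG(a, b) distribution is b/(a - 1)
c(posteriorMean = betaPost/(alphaPost - 1), truth = sigma2_true)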

5.5.6.3 Unknown mean, unknown variance informative priors

  • Review about definition of dependent/independent random variables
  • Conditional on \(x\), \(\sigma^2\) and \(\mu\) are dependent
  • More advanced Bayesian approaches are needed for this (Markov chain Monte Carlo)

5.5.6.4 Bayesian inference in practice

  • Performing inference in R.

Aside:

  • Using a linear model to perform a one-sample t-test (see the sketch below)
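  • A minimal sketch, assuming a hypothetical simulated vector y, showing that an intercept-only linear model reproduces the one-sample t-test:
Code
set.seed(6)
y = rnorm(30, mean = 0.5)       # a hypothetical data vector
coef(summary(lm(y ~ 1)))        # intercept-only linear model
t.test(y)                       # same estimate, t statistic, and p-value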

5.5.6.5 Comparing two means using Bayesian statistics in R

Code
library(brms)
load('~/Box Sync/teaching/PMB/datasets/smith_cervical_sc/tpmf.rdata')
tp$diff = tp$`C4 CSA_tp2` - tp$`C4 CSA_tp1`


# Prior at zero
# not very informative
noninform = set_prior("normal(0,500)", class = "Intercept")
# more informative 
inform = set_prior("normal(0,2)", class = "Intercept")
# very informative 
veryInform = set_prior("normal(0,0.25)", class = "Intercept")

# very wide (essentially flat) prior - basically the same as a t-test
bayesNonInformResults = brm(diff ~ 1, data=tp, prior = noninform )
bayesInformResults = brm(diff ~ 1, data=tp, prior = inform)
bayesVeryInformResults = brm(diff ~ 1, data=tp, prior = veryInform)

ttestResults = t.test(tp$diff)

5.5.7 Software

  • There is good software to do Bayesian inference for complicated models
  • Stan, INLA, BUGS – I think all can be called from R and Python.

5.5.8 Flat priors in general

  • When can we use them?
  • Edwards et al. describe 3 assumptions.
  • Distinction between \(\mathbb{P}(H \mid X)\) and \(\mathbb{P}(X \mid H)\) as a function of \(H\). (The latter is not necessarily a PDF/PMF and does not have the same form as a function of \(X\) and of \(H\).)
  • Improper prior

5.5.9 Informative priors

  • For penalization, e.g. regularized regression
  • Horseshoe priors, Lasso priors
  • Random effects models