5 Bayesian Statistics

5.1 Overview

Today we will:

  • Introduce Bayesian concepts in the context of the binomial proportion

Important concepts from Bayesian statistics are:

  • The parameter is considered random.
  • The prior is used to incorporate prior information into the result.
  • The posterior is used to interpret the results after observing the data.
  • Interpretation of Bayesian results is about where the parameter is likely to be.

5.2 Beginning of Bayesian Inference

  • Probabilities in frequentist and evidentialist philosophies represent what would happen upon repeated samplings of the data
  • In Bayesian statistics the probability represents a subjective belief about the unknown world
  • In all philosophies the state of the world is expressed in the form of parameters.
  • Bayesian statistics is about updating our belief of the world using Bayes theorem
  • \(H\) is my current hypothesis about the state of the world, expressed in the form of parameters, \(X\) is the observed data, then \[ \mathbb{P}(H \mid X) = \frac{\mathbb{P}(X \mid H) \mathbb{P}(H)}{\mathbb{P}(X)} \]
  • In Bayesian statistics, the state of the world is treated as a random variable, thus we have \(P(H)\)
  • \(P(H)\) is subjective
  • \(\mathbb{P}(X) = \sum_{h} \mathbb{P}(X \mid H=h) \mathbb{P}(H=h)\), in practice it is usually an integral.

Definitions

  • Prior \(\mathbb{P}(H)\)
  • Likelihood \(\mathbb{P}(X \mid H)\)
  • Posterior \(\mathbb{P}(H \mid X)\)

5.2.1 COVID-19 dog example

  • Scent dog detection example (Jendrny et al., 2020)
  • These are the estimates of accuracy for the dogs identifying nCov presence
                   Positive Test   Negative Test
    nCov Absent         4%              96%
    nCov Present       83%              17%
  • Let’s say a dog barks at an airport visitor. Bayesian statistics tries to understand the probability that the person is actually nCov positive using Bayes theorem \[ \mathbb{P}(\text{nCov} | + ) = \frac{\mathbb{P}(+ \mid \text{nCov} ) \mathbb{P}(\text{nCov}) }{\mathbb{P}(+ \mid \text{nCov} ) \mathbb{P}(\text{nCov}) + \mathbb{P}(+ \mid \text{no nCov} ) \mathbb{P}(\text{no nCov})} \]
  • The dog's positive indication (the “test” result) is the data point; whether the person actually has nCov is the parameter.
  • The probability someone comes into the airport with nCov is unknown.
  • We can specify this probability from prior data.
    • The number of current cases is 9,165,744 (Worldometer).
    • The US population is approximately 333,000,000 (US population clock).
    • Thus we can approximate \(\mathbb{P}(\text{nCov}) = 9,165,744/333,000,000 = 0.0275\).
  • We can then compute the probability that someone is sick, given that the dog barked (checked numerically in the code after this list) \[ \mathbb{P}(\text{nCov} | + ) = \frac{0.83 \times 0.0275 }{0.83 \times 0.0275 + 0.04 \times (1-0.0275)} \approx 0.37 \]
  • There’s less than 50% chance the person actually has nCov given the test is positive.
  • This is because the false positive rate is high enough, and nCov is rare enough, that the majority of people who test positive are false positives.
  • Obviously, a decision still needs to be made about what to do with the person.
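  • A quick numerical check of this calculation in R, using the prevalence and dog accuracy values above:
Code
p_pos_given_ncov   = 0.83            # dog sensitivity (from the table above)
p_pos_given_noncov = 0.04            # dog false positive rate
p_ncov = 9165744 / 333000000         # prior probability of nCov from the case counts

# Bayes theorem: P(nCov | +)
round((p_pos_given_ncov * p_ncov) /
  (p_pos_given_ncov * p_ncov + p_pos_given_noncov * (1 - p_ncov)), 2)
## [1] 0.37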

Fundamental difference in interpretation:

  • Parameter is “random.”
  • Probability represents our belief about the unknown value.

5.2.2 Bayesian analysis overview

  1. Decide a reasonable model for the data.
  2. Choose a reasonable prior for the parameters in the model.
  3. Use Bayes theorem to compute the posterior, \(P(H \mid X)\).
  4. Do stuff with \(P(H \mid X)\):
    • Create intervals.
    • Compute Bayes Factors.
    • Compute point estimates (most likely values for unknown parameters).

5.3 Binomial proportion

5.3.1 Examples in the NSDUH data set

  • Let’s consider binomial data from the NSDUH data set again.
    • psilcy - “yes”/“no” for having ever taken psilocybin, a psychedelic (and recreational) drug recently used in clinical trials for psychiatric treatment.
    • psyanyflag - “yes”/“no” for having ever taken a psychiatric medication.
  • What is a reasonable model for these? Bernoulli
  • What is a reasonable prior for the parameter? Flat? Skewed left? Skewed right? Centered? – Will depend on the drug type.

5.3.2 Flat prior for binomial proportions

  • Binomial proportion with flat prior \[ \begin{align*} \mathbb{P}(H) & = \mathbb{P}(\pi) = 1, \text{ for $\pi \in [0,1]$} \\ \mathbb{P}(X \mid H) & = \pi^{\sum_{i=1} x_i}(1-\pi)^{n-\sum_{i=1} x_i} \\ \mathbb{P}(X) & = \int_0^1 \pi^{\sum_{i=1} x_i}(1-\pi)^{n-\sum_{i=1} x_i} d\pi = B\left(\sum_{i=1} x_i+1, n - \sum_{i=1} x_i+1\right) \end{align*} \]

We can then compute the distribution of the parameter conditional on the data \[ \mathbb{P}(\pi \mid X) = \frac{\mathbb{P}(X \mid \pi) \times \mathbb{P}(\pi)}{\mathbb{P}(X)} = \frac{\pi^{\sum_{i=1} x_i}(1-\pi)^{n-\sum_{i=1} x_i}}{B\left(\sum_{i=1} x_i+1, n - \sum_{i=1} x_i+1\right)} \]

  • This is a Beta distribution \(\pi \mid x \sim \text{Beta}(\sum_{i=1} x_i+1, n-\sum_{i=1} x_i+1)\)

  • Evaluated at \(n=0\), we get the flat prior… If we collect no data, we just get our prior hypothesis back.

  • This prior is called a flat prior.

  • It implies that we do not have a prior preference for any particular value of the parameter. (Every value of the parameter is equally probable).

  • Below we evaluate the posterior for a single observation that is either a one or a zero.

Code
# cols is a color palette used throughout these notes; define a simple one if needed
if(!exists('cols')) cols = 1:8
pi = ppoints(100)
# n = 0, S = 0: no data, so the posterior is just the flat prior
n = 0
S = 0
prob = dbeta(pi, S+1, n -S + 1)
plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,3), xlim=c(0,1), col=cols[1] )

n = 1
S = 1
prob = dbeta(pi, S+1, n -S + 1)
points(pi, prob, type='l', col=cols[2] )

n = 1
S = 0
prob = dbeta(pi, S+1, n -S + 1)
points(pi, prob, type='l', col=cols[3] )
legend('topright', legend=c('n=0, S=0', 'n=1, S=1', 'n=1, S=0'), col=cols[1:3], lty=1, bty='n')

  • Plot for a few random samples for the question about psychiatric drug use.
Code
set.seed(1234)
puf = readRDS('../datasets/nsduh/puf.rds')
ns = c(2, 5, 10, 50, 100)
pi = ppoints(1000)
puf$psyanyflag = ifelse(puf$psyanyflag=='yes', 1, 0)
S=0
for(n in 1:max(ns)){
  S = S + sample(puf$psyanyflag, size = 1)
  prob = dbeta(pi, S+1, n -S + 1)
  if(n==ns[1]){
    plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,15), xlim=c(0,1), col=cols[1], main='Any psychiatric medicine' )
  } else if(n %in% ns) {
    points(pi, prob, type='l', col=cols[which(ns %in% n)] )
  }
}
legend('topright', legend=ns, col=cols[1:length(ns)], lty=1, bty='n')

  • Same plots for psilocybin drug use
Code
puf$psilcy = ifelse(puf$psilcy=='yes', 1, 0)
S = 0
for(n in 1:max(ns)){
  S = S + sample(puf$psilcy, size = 1)
  prob = dbeta(pi, S+1, n -S + 1)
  if(n==ns[1]){
    plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,15), xlim=c(0,1), col=cols[1], main='Psilocybin' )
  } else if(n %in% ns) {
    points(pi, prob, type='l', col=cols[which(ns %in% n)] )
  }
}
legend('topright', legend=ns, col=cols[1:length(ns)], lty=1, bty='n')

  • What are some things you notice about the plots?
  • What happens as the sample size gets larger?

5.3.3 Triangle shaped priors

  • How about if we think the parameter \(\pi\) lies toward the lower end of the interval (prior to looking at the data)?
  • We could use a prior \(2(1-\pi)\).
  • Going through the math again gives us the posterior \[ \pi \mid x \sim \text{Beta}(\sum_{i=1} x_i+1, n-\sum_{i=1} x_i+2) \]
Code
set.seed(1234)
S=0
for(n in 1:max(ns)){
  S = S + sample(puf$psyanyflag, size = 1)
  prob = dbeta(pi, S+1, n -S + 2)
  if(n==ns[1]){
    plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,15), xlim=c(0,1), col=cols[1], main='Any psychiatric medicine' )
  } else if(n %in% ns) {
    points(pi, prob, type='l', col=cols[which(ns %in% n)] )
  }
}
legend('topright', legend=ns, col=cols[1:length(ns)], lty=1, bty='n')

For the psilocybin data:

Code
S = 0
for(n in 1:max(ns)){
  S = S + sample(puf$psilcy, size = 1)
  prob = dbeta(pi, S+1, n -S + 2)
  if(n==ns[1]){
    plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,15), xlim=c(0,1), col=cols[1], main='Psilocybin' )
  } else if(n %in% ns) {
    points(pi, prob, type='l', col=cols[which(ns %in% n)] )
  }
}
legend('topright', legend=ns, col=cols[1:length(ns)], lty=1, bty='n')

5.3.4 Posterior inference using Bayesian statistics

  • Because we have the full posterior distribution of \(\pi\), we can do whatever we want with it, e.g. reporting mean and variance.

5.3.4.1 The posterior mode

  • A point estimate for the parameter can be obtained from the posterior mode (sort of like maximum likelihood).
  • The posterior mode is helpful to compare to the frequentist maximum likelihood estimator.
  • The mode of a \(\text{Beta}(a, b)\) distribution is the value where the density is maximized; it is (from Wikipedia) \[ \frac{a-1}{a+b-2}. \] For the posterior with the uniform prior, the mode is \[ \frac{1 +\sum_{i=1}x_i - 1}{2 + n - 2 } = \frac{\sum_{i=1}x_i}{n}. \]
  • When \(\alpha,\beta=1\) the prior is uniform on \((0,1)\), \[ p(\pi \mid x) \propto p(x \mid \pi) p(\pi) = \mathbb{P}(X=x \mid \pi) \times 1 = L(\pi \mid x) \] the mode in this case is the MLE!
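  • A minimal sketch in R illustrating this, assuming a small simulated Bernoulli sample (hypothetical true \(\pi = 0.2\)):
Code
set.seed(1)
n = 30
x = rbinom(n, 1, 0.2)   # simulated Bernoulli data
S = sum(x)

# MLE of pi
mle = S / n
# Posterior mode of Beta(S + 1, n - S + 1): (a - 1)/(a + b - 2) with a = S + 1, b = n - S + 1
a = S + 1; b = n - S + 1
postMode = (a - 1) / (a + b - 2)

c(MLE = mle, posteriorMode = postMode)   # identical under the flat prior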

5.3.4.2 Bayesian credible intervals

  • Bayesians generally frown upon just reporting the mean and report quantiles instead.
  • Posterior interval \[ \mathbb{P}(\ell(x) < \pi < u(x) \mid x) = \int_\ell^u f(\pi \mid x) d\pi = 1-\alpha \]
  • Bayesian intervals are called Credible intervals.
  • They show where the parameter is likely to be given the data.
  • Distinction: \(\pi\) is random!!!!
  • Since it follows a beta distribution, we can get quantiles using R.
Code
set.seed(111)
# S - the number of events / "yes" answers
# n - the sample size
flatPriorBinom = function(S, n, probs=c(0.025, 0.975)){
  qs =  qbeta(probs, S+1, n -S + 1)
  names(qs) = paste('quantile', probs, sep='_')
  c(mode=S/n, mean = (S+1)/(n + 2), qs)
}



n = 100
pi = ppoints(1000)
X = sample(puf$psilcy, size = n)
res = flatPriorBinom(sum(X), n)
S = sum(X)
prob = dbeta(pi, S+1, n -S + 1)
poly = data.frame(x=c(pi,rev(pi)), y=c(prob, rep(0, length(pi))))
poly = poly[ poly$x>= res['quantile_0.025'] & poly$x<=res['quantile_0.975'], ]
plot(pi, prob, ylab='P(pi| x)', type='l', ylim=c(0,15), xlim=c(0,0.5), col=cols[1] )
polygon(poly$x, poly$y, border = NA, col=cols[2])

Code
round(flatPriorBinom(sum(X), n), 3)
##           mode           mean quantile_0.025 quantile_0.975 
##          0.090          0.098          0.049          0.162
Code
# True value of the parameter in the "population"
mean(puf$psilcy)
## [1] 0.07784242
  • Amazing part of Bayes is the interpretation!
  • Parameter lies in the interval with 95% probability.
  • Represents subjective probability.
  • Dependent on the prior.

5.3.5 Introducing other notation

  • Think of \(H\) as a parameter value or vector
  • Prior depends on other parameters that are unknown (We need to set them) \[ \begin{align*} \text{likelihood } p(x \mid \theta) \\ \text{prior } p(\theta \mid \gamma) \\ \text{posterior } p(\theta \mid x, \gamma) \end{align*} \]

5.3.6 Conjugate prior

\[ \begin{align*} p(\pi \mid \alpha, \beta) & = \frac{ \pi^{\alpha-1 }(1-\pi)^{\beta-1}}{B(\alpha, \beta)} \\ p(x \mid \pi) & = \pi^{\sum_{i=1} x_i}(1-\pi)^{n-\sum_{i=1} x_i} \\ p(x) & = \text{ not going to worry about this here.} \end{align*} \]

\[ p(\pi \mid x) \propto \frac{ \pi^{\alpha-1 }(1-\pi)^{\beta-1}}{B(\alpha, \beta)} \times \pi^{\sum_{i=1} x_i}(1-\pi)^{n-\sum_{i=1} x_i} \]

\[ \pi \mid x \sim \text{Beta}(\alpha +\sum_{i=1}x_i, \beta + n-\sum_{i=1} x_i ) \]

  • Conjugate priors are priors whose posterior is in the same distributional family as the prior (but with different parameters that depend on the data); the Beta prior above is a conjugate prior for the binomial likelihood.
  • Uniform distribution is technically also a conjugate prior with \(\alpha = \beta = 1\).
  • Jeffreys prior for binomial likelihood is the one you used for the Homework in Module 4, \(\alpha = \beta = 1/2\). Also considered an uninformative prior.
Code
pi = ppoints(100)
prob = dbeta(pi, 1/2, 1/2)
plot(pi, prob, ylab='P(pi)', type='l', ylim=c(0,5), xlim=c(0,1), col=cols[1], main='Jeffreys prior' )

  • Prior that adds one observation to each category
Code
pi = ppoints(100)
prob = dbeta(pi, 1+1/2, 1+1/2)
plot(pi, prob, ylab='P(pi)', type='l', ylim=c(0,5), xlim=c(0,1), col=cols[1], main='Add-one-half prior' )
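  • As a quick numerical check of the conjugate update \(\text{Beta}(\alpha +\sum_{i=1}x_i, \beta + n-\sum_{i=1} x_i )\), here is a minimal sketch (assuming simulated Bernoulli data and an arbitrary Beta prior) comparing the closed-form posterior to the prior times likelihood normalized on a grid:
Code
set.seed(2)
n = 25
x = rbinom(n, 1, 0.3)    # simulated data (hypothetical true pi = 0.3)
S = sum(x)
alpha = 2; beta = 5      # an arbitrary Beta prior

pi_grid = ppoints(2000)
# Unnormalized posterior: prior density times likelihood
unnorm = dbeta(pi_grid, alpha, beta) * pi_grid^S * (1 - pi_grid)^(n - S)
# Normalize numerically on the grid
gridPost = unnorm / sum(unnorm) / diff(pi_grid[1:2])
# Closed-form conjugate posterior
closedForm = dbeta(pi_grid, alpha + S, beta + n - S)

max(abs(gridPost - closedForm))   # small (grid approximation error only)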

5.3.7 Interpretation of these priors

  • A point estimate for the parameter can be obtained from the posterior mode (sort of like maximum likelihood).
  • The mode of a \(\text{Beta}(a, b)\) distribution is the value where the density is maximized; it is (from Wikipedia) \[ \frac{a-1}{a+b-2}. \] For the posterior, the mode is \[ \frac{\alpha +\sum_{i=1}x_i - 1}{\beta +\alpha + n - 2 }. \]
  • When \(\alpha,\beta=1\) the prior is uniform on \((0,1)\), \[ p(\pi \mid x) \propto p(x \mid \pi) p(\pi) = \mathbb{P}(X=x \mid \pi) \times 1 = L(\pi \mid x) \] the mode in this case is the MLE!
  • When \(\alpha,\beta = 1/2\) it is equivalent to subtracting 1/2 from the counts of psilocybin users and non-users.

5.3.8 Plot the posteriors in the results

  • Compare Jeffreys and Uniform priors
  • Before looking at the posteriors associated with each prior distribution, which posterior do you think will be closer to 1/2?
Code
set.seed(1234)
n = 50
psil = sample(puf$psilcy, size = n)
S = sum(psil)
pi = ppoints(1000)
alpha = beta = 1
plot(pi, dbeta(pi, shape1=alpha+S, shape2=beta+n-S), type='l', col=cols[1], ylab='P(pi|X)', main='Posterior distribution')
alpha = beta = 1/2
points(pi, dbeta(pi, shape1=alpha+S, shape2=beta+n-S), type='l', col=cols[2])
legend('topright', legend=paste0('alpha,beta=', c(1,1/2)), bty='n', lty=1, col=cols[1:2])

5.3.9 Informative priors

  • Incorporate subjectivity into our analysis
  • What type of data are each of these priors appropriate for?
    • A disease prevalence?
    • A coin flip?
  • Which priors are more “informative” (they are more certain about where the parameter should be)?
  • Which priors do you think will be more responsive to the data (i.e. the posterior will be more influenced by the data)?
  • Formula reminder for posterior mode: \[ \frac{\alpha +\sum_{i=1}x_i - 1}{\beta +\alpha + n - 2 }. \]
Code
params = data.frame(alpha=c(1,5,10,5,50), beta=c(1,5,5,20,50))
alpha = beta = 1
pi = ppoints(1000)
pOfpi = dbeta(pi, shape1=alpha, shape2=beta)
plot(pi, pOfpi, col=cols[1], ylab='P(pi|X)', type='n', ylim=c(0,20), main='Beta Priors')
for(ind in 1:nrow(params)){
  alpha = params$alpha[ind]
  beta  = params$beta[ind]
  pOfpi = dbeta(pi, shape1=alpha, shape2=beta)
  points(pi, pOfpi, col=cols[ind], type='l')
}
legend('topright', bty='n', legend=paste0('(alpha, beta)=', '(', params$alpha, ',', params$beta, ')'), lty=1, col=cols[1:nrow(params)] )
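  • Using the posterior-mode formula above, the sketch below (assuming the puf data and the params priors from the earlier code chunks) computes the posterior mode under each prior for one small psilocybin sample:
Code
set.seed(1234)
n = 50
psil = sample(puf$psilcy, size = n)
S = sum(psil)

# Posterior mode (alpha + S - 1)/(alpha + beta + n - 2) under each prior
data.frame(alpha = params$alpha, beta = params$beta,
           postMode = round((params$alpha + S - 1) / (params$alpha + params$beta + n - 2), 3),
           mle = S / n)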

5.3.10 Posteriors with small and large sample sizes

5.3.10.1 For the psychiatric medicine data

Code
params = data.frame(alpha=c(1,5,10,5,50), beta=c(1,5,5,20,50))
alpha = beta = 1
pi = ppoints(1000)
pOfpi = dbeta(pi, shape1=alpha, shape2=beta)

ns = c(20, 200, 2000)
for(n in ns){
  psych = sample(puf$psyanyflag, size = n)
  S = sum(psych)
  if(n<=200) ymax=30 else ymax=70
  plot(pi, pOfpi, col=cols[1], ylab='P(pi|X)', type='n', ylim=c(0,ymax), main=paste0('Beta Posterior, n=',n))
  for(ind in 1:nrow(params)){
    alpha = params$alpha[ind]
    beta  = params$beta[ind]
    pOfpi = dbeta(pi, shape1=alpha+S, shape2=beta+n-S)
    points(pi, pOfpi, col=cols[ind], type='l')
  }
  legend('topright', bty='n', legend=paste0('(alpha, beta)=', '(', params$alpha, ',', params$beta, ')'), lty=1, col=cols[1:nrow(params)] )
}

5.3.10.2 For the psilocybin data

Code
params = data.frame(alpha=c(1,5,10,5,50), beta=c(1,5,5,20,50))
alpha = beta = 1
pi = ppoints(1000)
pOfpi = dbeta(pi, shape1=alpha, shape2=beta)

ns = c(20, 200, 2000)
for(n in ns){
  psil = sample(puf$psilcy, size = n)
  S = sum(psil)
  if(n<=200) ymax=30 else ymax=70
  plot(pi, pOfpi, col=cols[1], ylab='P(pi|X)', type='n', ylim=c(0,ymax), main=paste0('Beta Posterior, n=',n))
  for(ind in 1:nrow(params)){
    alpha = params$alpha[ind]
    beta  = params$beta[ind]
    pOfpi = dbeta(pi, shape1=alpha+S, shape2=beta+n-S)
    points(pi, pOfpi, col=cols[ind], type='l')
  }
  legend('topright', bty='n', legend=paste0('(alpha, beta)=', '(', params$alpha, ',', params$beta, ')'), lty=1, col=cols[1:nrow(params)] )
}

Take home points:

  • Bayesian approaches are not a magic bullet for small samples
  • In large samples the prior doesn’t matter much: as \(n\to \infty\) the prior parameters \(\alpha\) and \(\beta\) are small relative to the contribution of the data. \[ \text{Beta}(\alpha +\sum_{i=1}x_i, \beta + n-\sum_{i=1} x_i ) \]
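  • A small numerical illustration of the prior washing out: the posterior mean \((\alpha + \sum_{i=1}x_i)/(\alpha+\beta+n)\) under a fairly strong Beta(50, 50) prior, for increasing \(n\) (assuming a hypothetical true \(\pi = 0.1\)):
Code
set.seed(1)
alpha = beta = 50      # a fairly strong prior centered at 1/2
true_pi = 0.1
for(n in c(20, 200, 2000, 20000)){
  S = sum(rbinom(n, 1, true_pi))
  postMean = (alpha + S) / (alpha + beta + n)
  cat('n =', n, ': posterior mean =', round(postMean, 3), ', MLE =', round(S/n, 3), '\n')
}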

5.4 Assessing frequentist probabilities for Bayesian methods

  • Binomial proportion

  • After a Bayesian analysis we can get a PI \[ p(\ell(x, \alpha) < \pi < u(x, \alpha) \mid x ) = 1-\alpha \]

  • What is the interpretation of this interval?

  • Although the interpretation is conditional on the observed data, we still have an interval that we can evaluate across data sets in the frequentist sense.

Code
# number of simulations
nsim = 1000
# sample sizes
ns = round(seq(10, 5000, length.out=10))
# parameter
pis = c(0.001,0.01,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.99,0.999)

# Desired coverage probability (1-alpha)
alpha = 0.05

# Function to compute Bayesian Posterior Intervals for pi
# S - sum of the X_i's (the number of successes / "yes" answers)
# alpha and beta are prior parameter values 
bayesBinomPI = function(S, n, alpha=1/2, beta=1/2, probs=c(0.025, 0.975)){
  qs =  t(apply(cbind(alpha, beta), 1, function(row) qbeta(probs, S+row[1], n -S + row[2]) ))
  colnames(qs) = paste('quantile', probs, sep='_')
  data.frame(alpha=alpha, beta=beta, mode=(S+alpha-1)/(n + alpha + beta - 2), qs)
}

# enter some more priors
priors = data.frame('alpha' = c(1/2, 1, 50, 10), 'beta' = c(1/2, 1, 50, 50))

# creating blank output table
results = expand.grid(pi=pis , n=ns)
priornames = paste(paste0(names(priors)[1], priors[,1]), paste0(names(priors)[2], priors[,2]), sep='_')
colNames = paste(rep(priornames, each=2), 
                 rep(c('Coverage', 'Width'), nrow(priors)))
results[, colNames ] = NA



for(paramSetting in 1:nrow(results)){
  pi = results[paramSetting,'pi']
  n = results[paramSetting, 'n']
  
  #matrix(rlnorm(n * nsim, meanlog = mu, sdlog=sqrt(sigmaSq)), nrow=n)
  simres = matrix(NA, ncol=ncol(results)-2, nrow=nsim, dimnames=list(NULL, names(results)[-c(1,2)]) )
  for(sim in 1:nsim){
    # Computes the number of events/ "yes"
    S = sum(rbinom(n, 1, pi))
    # Computes the posterior interval for all parameter values
    PIs = bayesBinomPI(S, n, alpha=priors[,'alpha'], beta=priors[,'beta'])
    
    # Get whether the interval covered the parameter
    simres[sim, paste(priornames, 'Coverage')] = (PIs[,4]<=pi & PIs[,5]>=pi)
    # Get the width of the interval
    simres[sim, paste(priornames, 'Width')] = PIs[,5] - PIs[,4]
  }
  results[paramSetting, -c(1,2)] = colMeans(simres)
}


### Plot the results
layout(matrix(1:length(ns), nrow=2, byrow = TRUE) )
for(n in ns){
  subres = results[ results$n==n, ]
  # create the plots
  plot(subres$pi, subres[,paste(priornames[1], 'Coverage')], xlab='pi', ylab='Coverage', type='n', ylim=c(0,1), main=paste0('n=', n))
  columnNames = paste(priornames, 'Coverage')
  trash = sapply(1:length(columnNames), function(ind) points(subres$pi, subres[,columnNames[ind]], type='b', col=cols[ind] ) )
  abline(h=0.95, lty=2)
}
legend('bottomright', bg='white', fill=cols[1:length(columnNames)], legend=columnNames)

5.5 Extra stuff

5.5.1 Bayesian continuous mean analysis

  • Preliminary info: Writing one-sample and two-sample models as linear models.

5.5.2 One sample mean

  • I am using the brms package, which relies on Stan to do MCMC
  • The package needs to compile Stan code for each analysis, which takes about 30 seconds
    • The MCMC part takes fractions of a second (it’s super fast)
  • My approach here to a “noninformative” prior is to choose a very large prior variance
Code
set.seed(122)
library(brms)
n = 100
adni = readRDS('../..//datasets/adni/hippocampus.rds')
adni = adni[sample(nrow(adni), n),]
adni$diff = adni$LEFTHIPPO - adni$RIGHTHIPPO
# Prior at zero
# not very informative
noninform = set_prior("normal(0,1000)", class = "Intercept")
# more informative 
inform = set_prior("normal(0,2)", class = "Intercept")
# very informative 
veryInform = set_prior("normal(0,0.25)", class = "Intercept")

# very wide (essentially flat) prior - should be close to the t-test
bayesNonInformResults = brm(diff ~ 1, data=adni, prior = noninform)
bayesInformResults = brm(diff ~ 1, data=adni, prior = inform)
bayesVeryInformResults = brm(diff ~ 1, data=adni, prior = veryInform)

ttestResults = t.test(adni$diff)

5.5.3 Bayesian two sample normal means analysis

  • \(\mathbb{E}Y_i \mid \text{F} = \mu_0 = \beta_0\)
  • \(\mathbb{E}Y_i \mid \text{M} = \mu_1 = \beta_0 + \beta_1\)
  • Implies \(\beta_1 = \mu_1 - \mu_0\).
Code
library(brms)

# mf: data set with Sex and C4 CSA columns (assumed to be loaded earlier in the notes)
# removing spaces from names
names(mf) = gsub(' ', '_', names(mf))
# Prior at zero
# not very informative
noninform = c(set_prior("normal(0,1000)", class = "Intercept"), set_prior("normal(0,1000)", class='b') )
# more informative 
inform = c(set_prior("normal(0,2)", class='b'))
# very informative 
veryInform = c(set_prior("normal(0,0.25)", class='b'))


bayesModels = list(
  # defaults to flat prior - should be close to the t-test
  bayesNoPriorResults = brm(C4_CSA ~ Sex, data=mf),
  # our own noninformative prior
  bayesNonInformResults = brm(C4_CSA ~ Sex, data=mf, prior = noninform),
  bayesInformResults = brm(C4_CSA ~ Sex, data=mf, prior = inform),
  bayesVeryInformResults = brm(C4_CSA ~ Sex, data=mf, prior = veryInform)
)

lapply(bayesModels, prior_summary)


ttestResults = t.test(mf$C4_CSA~mf$Sex)
lapply(bayesModels, function(x) summary(x)$fixed)

Today we will:

  • Evaluate the Bayesian binomial proportion using simulations
  • Explore Bayesian concepts in the context of a single mean
  • The distinction from the binomial proportion is that we now have a normal distribution and two parameters (mean and variance)

5.5.4 Normal mean with known variance

  • For the normal mean with known variance, the normal distribution is also the conjugate prior for the mean

\[ \begin{align*} p(\mu \mid \nu, \tau^2) & = (2\pi\tau^2)^{-1/2}\exp\left\{-\frac{1}{2\tau^2}(\mu - \nu)^2\right\} \\ p(x \mid \mu) & = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i - \mu)^2\right\}\\ p(x) & = \text{ don't need to worry about this.} \end{align*} \]

  • You’ll derive the posterior for your homework and explore how the prior affects the results
  • I’ll do it here too.
  • The posterior is \[ \mu \mid x, \nu, \tau^2 \sim N\left\{\frac{\tau^2}{\sigma^2/n + \tau^2} \bar x + \frac{\sigma^2/n}{\sigma^2/n + \tau^2} \nu, \left( \frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1} \right\} \]
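  • A minimal sketch (assuming simulated normal data with a known \(\sigma^2\)) comparing this closed-form posterior mean to a grid approximation of prior \(\times\) likelihood:
Code
set.seed(3)
n = 20
sigma2 = 4               # known variance (an assumption for this sketch)
nu = 0; tau2 = 1         # prior mean and variance
x = rnorm(n, mean = 1, sd = sqrt(sigma2))
xbar = mean(x)

# Closed-form posterior mean and variance from the formula above
postVar  = 1 / (n / sigma2 + 1 / tau2)
postMean = (tau2 / (sigma2/n + tau2)) * xbar + ((sigma2/n) / (sigma2/n + tau2)) * nu

# Grid approximation of the posterior for comparison
mu_grid = seq(-3, 4, length.out = 5000)
logpost = dnorm(mu_grid, nu, sqrt(tau2), log = TRUE) +
  sapply(mu_grid, function(m) sum(dnorm(x, m, sqrt(sigma2), log = TRUE)))
post = exp(logpost - max(logpost)); post = post / sum(post)

c(closedForm = postMean, grid = sum(mu_grid * post))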

5.5.5 Unknown mean, unknown variance uninformative priors

5.5.5.1 With noninformative priors

  • Flat prior is proportional to 1 (an improper prior)
  • Equivalently, the limit of a normal prior as its variance goes to infinity

5.5.5.2 Marginalizing out variance to get t-distribution with uninformative priors

\[ \begin{align*} p(x\mid \mu, \sigma^2) & = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i - \mu)^2\right\}\\ p(\mu) & \propto 1 \\ p(\sigma^2) & \propto 1/\sigma^2 \end{align*} \]

Then the posterior, marginalized over \(\sigma^2\), is \[ \begin{align*} p(\mu \mid x) = \int_{\sigma^2} p(\mu, \sigma^2 \mid x)\, d\sigma^2 & \propto \int_{\sigma^2} p(x\mid \mu,\sigma^2)p(\mu)p(\sigma^2)\, d\sigma^2 \\ & = \int_{\sigma^2}(2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2\right\} \times 1/\sigma^2 \, d\sigma^2 \\ & \propto \int_{\sigma^2}(\sigma^2)^{-(n+2)/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2\right\} d\sigma^2 \\ & = \Gamma(n/2)\times \left(\frac{1}{2}\sum_{i=1}^n(x_i - \mu)^2\right)^{-n/2} \\ & \propto \left(\tilde\sigma^2\left[1+ \frac{n(\bar x - \mu)^2}{\tilde\sigma^2} \right] \right)^{-(n-1 + 1)/2}, \end{align*} \] where \(\tilde\sigma^2 = \sum_{i=1}^n(x_i - \bar x)^2\).

  • \(\sigma^2\) is a nuisance parameter, we can integrate it out (called marginalizing with respect to \(\sigma^2\)).
  • This will give us the marginal distribution of \(\mu\).

Take home points of this

  • We get the t-distribution.
  • We can interpret the t confidence interval in a Bayesian framework, assuming a flat prior on \(\mu\) and a prior proportional to \(1/\sigma^2\) for \(\sigma^2\).
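  • A minimal sketch (assuming simulated normal data) showing that under these priors the posterior quantiles of \(\mu\) reproduce the usual t confidence interval:
Code
set.seed(4)
n = 15
x = rnorm(n, mean = 2, sd = 3)
xbar = mean(x); s = sd(x)

# Under the flat prior on mu and 1/sigma^2 prior on sigma^2, the marginal posterior of
# mu is xbar + (s/sqrt(n)) * t_{n-1}, so the 95% credible interval is
credInt = xbar + qt(c(0.025, 0.975), df = n - 1) * s / sqrt(n)

# The frequentist 95% t confidence interval is numerically identical
rbind(credible = credInt, t_interval = as.numeric(t.test(x)$conf.int))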

5.5.6 Informative priors for the normal likelihood

5.5.6.1 Known mean, unknown variance

5.5.6.2 Preliminaries

  • Inverse gamma distribution: If \(\sigma^2 \sim \text{IG}(\alpha, \beta)\) it has PDF \[ f(s) = \frac{\beta^\alpha}{\Gamma(\alpha)}s^{-\alpha-1}e^{-\beta/s} \] \(\Gamma(\alpha)\) is the gamma function (it’s just a normalizing constant so that the integral is equal to 1).

  • For the normal with known mean and unknown variance, the inverse gamma is the conjugate prior (for the variance) \[ \begin{align*} p(\sigma^2 \mid \alpha, \beta) & = \frac{\beta^\alpha}{\Gamma(\alpha)}(\sigma^2)^{-\alpha-1} e^{-\beta/\sigma^2} \\ p(x\mid \mu, \sigma^2) & = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i - \mu)^2\right\} \\ p(x \mid \alpha, \beta) & = \text{ don't need to worry about this.} \end{align*} \]

\[ \begin{align*} p(\sigma^2\mid x) & \propto p(x\mid \sigma^2) p(\sigma^2 \mid \alpha, \beta) \\ & = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n(X_i - \mu)^2\right\} \times \frac{\beta^\alpha}{\Gamma(\alpha)}(\sigma^2)^{-\alpha-1} e^{-\beta/\sigma^2} \\ & \propto (\sigma^2)^{-\alpha-n/2-1}\exp\left\{ -\frac{\frac{1}{2}\sum_{i=1}^n(X_i - \mu)^2 + \beta}{\sigma^2}\right\} \\ &\propto IG\left(\alpha+n/2, \beta + 1/2\sum_{i=1}^n(X_i - \mu)^2 \right) \end{align*} \]
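  • A minimal sketch (assuming simulated data with a known mean and an arbitrary IG prior) computing the conjugate IG posterior parameters and comparing the posterior mean of \(\sigma^2\) to the value used in the simulation:
Code
set.seed(5)
n = 200
mu = 0                  # known mean (an assumption for this sketch)
sigma2_true = 9
x = rnorm(n, mu, sqrt(sigma2_true))

alpha = 3; beta = 3     # an arbitrary IG prior
# Conjugate update from the derivation above
alphaPost = alpha + n/2
betaPost  = beta + sum((x - mu)^2)/2

# The mean of an IG(a, b) distribution is b/(a - 1)
c(posteriorMean = betaPost/(alphaPost - 1), truth = sigma2_true)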

5.5.6.3 Unknown mean, unknown variance informative priors

  • Review about definition of dependent/independent random variables
  • Conditional on \(x\), \(\sigma^2\) and \(\mu\) are dependent
  • More advanced Bayesian approaches are needed for this (Markov chain Monte Carlo)

5.5.6.4 Bayesian inference in practice

  • Performing inference in R.

Aside:

  • Using a linear model to perform a one-sample t-test (see the sketch below)
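  • A minimal sketch, assuming a hypothetical simulated vector y, showing that an intercept-only linear model reproduces the one-sample t-test:
Code
set.seed(6)
y = rnorm(30, mean = 0.5)       # a hypothetical data vector
coef(summary(lm(y ~ 1)))        # intercept-only linear model
t.test(y)                       # same estimate, t statistic, and p-value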

5.5.6.5 Comparing two means using Bayesian statistics in R

Code
library(brms)
load('~/Box Sync/teaching/PMB/datasets/smith_cervical_sc/tpmf.rdata')
tp$diff = tp$`C4 CSA_tp2` - tp$`C4 CSA_tp1`


# Prior at zero
# not very informative
noninform = set_prior("normal(0,500)", class = "Intercept")
# more informative 
inform = set_prior("normal(0,2)", class = "Intercept")
# very informative 
veryInform = set_prior("normal(0,0.25)", class = "Intercept")

# very wide (essentially flat) prior - basically the same as a t-test
bayesNonInformResults = brm(diff ~ 1, data=tp, prior = noninform )
bayesInformResults = brm(diff ~ 1, data=tp, prior = inform)
bayesVeryInformResults = brm(diff ~ 1, data=tp, prior = veryInform)

ttestResults = t.test(tp$diff)

5.5.7 Software

  • There is good software to do Bayesian inference for complicated models
  • Stan, INLA, BUGS – I think all can be called from R and Python.

5.5.8 Flat priors in general

  • When can we use them?
  • Edwards et al. describe 3 assumptions.
  • Distinction between \(\mathbb{P}(H \mid X)\) and \(\mathbb{P}(X \mid H)\) as a function of \(H\). (The latter is not necessarily a PDF/PMF and does not have the same form as a function of \(X\) and of \(H\).)
  • Improper prior

5.5.9 Informative priors

  • For penalization, e.g. regularized regression
  • Horseshoe priors, Lasso priors
  • Random effects models