Class 5

Author

Simon Vandekar

Objectives:

  1. Learn to construct and interpret a confidence interval and test statistic
  2. Learn to compute and interpret p-values

RECAP!

Constructing a confidence interval

To construct a basic confidence interval, we need a few things:

  1. The Central Limit Theorem
  2. The mean and variance of the parameter estimator
  3. Estimates of the parameter and variance of the parameter

Preliminary: The normal distribution

Things about the normal distribution:

  • The standard normal distribution is N(0,1); a standard normal random variable is often denoted with a Z, i.e., Z \sim N(0,1).
  • PDF often denoted by \phi(z).
  • CDF often denoted by \Phi(z).
  • For Y \sim N(\mu, \sigma^2), (Y-\mu)/\sigma \sim N(0, 1) (often called Z-normalization).
  • \mathbb{P}(\lvert Z \rvert\le 1.96) = \Phi(1.96) - \Phi(-1.96) \approx 0.95.
  • \mathbb{P}( Z \le 1.64) = \Phi(1.64) \approx 0.95.
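These facts can be checked in R, where pnorm is the standard normal CDF \Phi and qnorm is its inverse:

```r
# P(|Z| <= 1.96) = Phi(1.96) - Phi(-1.96), approximately 0.95
pnorm(1.96) - pnorm(-1.96)
# P(Z <= 1.64) = Phi(1.64), approximately 0.95
pnorm(1.64)
# qnorm inverts the CDF: the 97.5th percentile is approximately 1.96
qnorm(0.975)
```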

The Central Limit Theorem

The Central Limit Theorem (Durrett, pg. 124): Let X_1, X_2, \ldots be iid with \mathbb{E} X_i = \mu and \text{Var}(X_i) = \sigma^2 \in (0, \infty).

If \bar X_n = n^{-1} \sum_{i=1}^n X_i, then n^{1/2}(\bar X_n - \mu)/\sigma \to_D X, where X \sim N(0,1).

Comments:

  • We need the variance to be finite (a stronger assumption than for the LLN)
  • In words, the Central Limit Theorem means that no matter what the distribution of the data is, the sample mean will be approximately normally distributed in large samples.
  • More specifically, \sqrt{n}(\bar X_n - \mu)/\sigma will be approximately standard normal

Conceptual overview

  • This illustrates the CLT numerically using some example data to answer the question, “What is the proportion of people in the United States who smoke cigarettes?”
  • We’re again pretending that the NSDUH dataset is the entire population of the US.
Code
# read in data
nsduh = readRDS('nsduh/puf.rds')
# ever tried cigarettes indicator
triedCigs = nsduh$cigmon
# make it a Bernoulli random variable
triedCigs = ifelse(triedCigs=='yes', 1, 0)

mu = mean(triedCigs)
sigma = sqrt(var(triedCigs))

ns = c(5, 10, 50, 100, 200, 300, 400, 500)

# for each sample size, we perform 100 studies
nstudies = 100

layout(matrix(1:8, nrow=2, byrow=TRUE))
for(n in ns){
  # Each person in the class performs a study of smoking with sample size n
  studies = lapply(1:nstudies, function(study) sample(triedCigs, size=n))
  
  # get the mean for each person's study
  studyMeans = sapply(studies, mean)
  stdMeans = sqrt(n)*(studyMeans - mu)/sigma
  # histogram of the study means
  hist(stdMeans, xlim = c(-3,3), breaks=10, main=paste0('n=', n))
}

Constructing confidence intervals

  • We can use the normal distribution to compute confidence intervals

  • A confidence interval is an interval obtained from a random sample that contains the true value of the parameter with a given probability: P\{L(X_1, \ldots, X_n) \le p < U(X_1, \ldots, X_n) \} = 1-\alpha, for a given value \alpha \in (0,1).

  • Note what things are random here (the end points). The parameter is fixed.

  • It comes from the fact that \sqrt{n}(\bar X - \mu)/\sigma \sim N(0,1) (approximately).

  • The interpretation is that the procedure that creates the CI captures the true parameter in 100(1-\alpha)% of repeated samples.

Dataset-to-dataset variability

Code
mu = 1
sigma = 0.8
n = 50
nsim = 100
alpha = 0.05
CIs = data.frame(mean=rep(NA, nsim), lower=rep(NA, nsim), upper=rep(NA, nsim))
for(sim in 1:nsim){
  # draw a random sample
  X = rnorm(n, mean=mu, sd=sigma)
  stdev = sd(X)
  # construct the confidence interval
  CIs[sim, ] = c(mean(X), mean(X) + c(-1,1)*qnorm(1-alpha/2)*stdev/sqrt(n))
}
CIs$gotcha = ifelse(CIs$lower<=mu & CIs$upper>=mu, 1, 2)
# range(c(CIs$lower, CIs$upper))
plot(1:nsim, CIs$mean, pch=19, ylim = c(0,2), main=paste(nsim, 'Studies'), col=CIs$gotcha, ylab='Estimated means and CIs', xlab='Studies' )
segments(x0=1:nsim, y0=CIs$lower, y1=CIs$upper, col=CIs$gotcha)
abline(h=mu, lty=2 )
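As a check on the coverage claim, the simulation above can be re-run self-contained (with a larger nsim and a fixed seed, both added here for stability) and the fraction of intervals that capture mu computed; it should be close to 1 - alpha = 0.95:

```r
set.seed(1)
mu = 1; sigma = 0.8; n = 50; nsim = 2000; alpha = 0.05
covered = replicate(nsim, {
  # draw a random sample and construct the confidence interval
  X = rnorm(n, mean=mu, sd=sigma)
  CI = mean(X) + c(-1,1)*qnorm(1-alpha/2)*sd(X)/sqrt(n)
  # did this interval capture the true mean?
  CI[1] <= mu & CI[2] >= mu
})
mean(covered)  # close to 0.95
```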

Computing a confidence interval in real data

  • We can compute confidence intervals more generally (not just for proportions or means)

Examples

Code
library(RESI)
Code
var = 'bmi'
mu = mean(insurance[,var])
CI =  mu + qnorm(c(0.025, 0.975)) * sd(insurance[,var])/sqrt(length(insurance[,var]))
hist(insurance[,var], main=var, xlab=var)
abline(v=mu)
abline(v=CI, lty=2)

Code
knitr::kable(round(data.frame(mean=mu, LCI=CI[1], UCI=CI[2]), 2))
mean LCI UCI
31 30 31
Code
library(RESI)
var = 'children'
mu = mean(insurance[,var])
CI = mu + qnorm(c(0.025, 0.975)) * sd(insurance[,var])/sqrt(length(insurance[,var]))
hist(insurance[,var], main=var, xlab=var)
abline(v=mu)
abline(v=CI, lty=2)

Code
knitr::kable(round(data.frame(mean=mu, LCI=CI[1], UCI=CI[2]), 2) )
mean LCI UCI
1.09 1.03 1.16

Reporting confidence intervals in research

  • They are most often plotted, but can also be reported in tables/text, such as mean [L, U].
  • We’re talking about estimating means, but CIs can be used to study associations of multiple variables.
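A quick sketch of the mean [L, U] reporting format, using the values from the children example above:

```r
# estimate and CI endpoints from the children example
est = 1.09; L = 1.03; U = 1.16
# format as "mean [L, U]" for reporting in text
sprintf('%.2f [%.2f, %.2f]', est, L, U)
```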

Difference in task activation

Confidence intervals for continuous variables.

Regression table studying reduction in pain following a treatment.

Hypothesis testing

  • Confidence intervals are useful for answering where/what kinds of questions (e.g., what is the value of the parameter?).
  • Hypothesis tests are for asking yes/no types of questions like
    • “Do two or more treatments differ in their disease rate?”
    • “Is age associated with brain size?”

For example we might ask “Are insurance charges larger in smokers than nonsmokers?”

  • In this case, the parameter that we’re interested in is the difference in charges between smokers and nonsmokers.

  • \delta = \mu_S - \mu_N

  • We didn’t show this, but often, the CLT implies T = \frac{\hat \delta - \delta}{\sqrt{\mathrm{Var}(\hat \delta)}} \sim N(0,1)

  • Hypothesis testing requires specifying “null” and “alternative” hypotheses

    • Null is the uninteresting result – smokers and nonsmokers have the same mean charges
    • Null is often denoted H_0: \delta = 0 (no difference in charges)

Code
library(RESI)

t.test(charges~smoker, data=insurance)

    Welch Two Sample t-test

data:  charges by smoker
t = -32.752, df = 311.85, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -25034.71 -22197.21
sample estimates:
mean in group 0 mean in group 1 
       8434.268       32050.232 
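The Welch t statistic printed by t.test can be reproduced by hand from the formula for T above. This is a sketch on simulated data (the group sizes, means, and SDs below are made up for illustration, not the real insurance values):

```r
set.seed(2)
smoker = rnorm(100, mean=32000, sd=11000)
nonsmoker = rnorm(300, mean=8400, sd=6000)
# estimated difference in mean charges
deltaHat = mean(smoker) - mean(nonsmoker)
# standard error of the difference: sqrt of the sum of the variances of the two means
seDelta = sqrt(var(smoker)/length(smoker) + var(nonsmoker)/length(nonsmoker))
# test statistic under H0: delta = 0
Tstat = deltaHat / seDelta
# two-sided p-value using the normal approximation
pval = 2 * pnorm(-abs(Tstat))
```

The statistic Tstat matches the one t.test(smoker, nonsmoker) reports; t.test computes the p-value from a t distribution rather than the normal approximation used here.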

Some common statistical jargon so far

Types of variables

  • Categorical variable - a variable that takes discrete values, such as diagnosis or sex.
  • Continuous variable - a variable that can take any values within a range, such as age, blood pressure, brain volume.
  • Ordinal variable - a categorical variable that can be ordered, such as age category, Likert scales, Number of days, Number of occurrences.
  • Discrete random variable - takes countably many values, such as integers.
  • Continuous random variable - takes real values.

Distribution/Density features

  • Mode/Non-modal/Unimodal/Bimodal - a mode is a peak in a density.

    • Non-modal - no peak
    • Unimodal - one peak
    • Bimodal - two peaks
  • Quantile - the pth quantile, q_p, divides the distribution such that a proportion p of the data are below q_p.

    • Median - 50th quantile of a distribution
  • Skew/Skewness - the amount of asymmetry of a distribution

    • Symmetric distribution - a distribution with no skew
  • Kurtosis - the heavy-tailedness of a distribution.

  • Probability Density/Mass Function - function, f_X, describing how likely each value is; for a discrete variable, f_X(x) = \mathbb{P}(X=x)

  • Cumulative Distribution Function - function, F_X that represents F_X(x) = \mathbb{P}(X\le x)

  • Parameter – Target unknown feature of a population (nonrandom)

  • Estimate – Value computed from observed data to approximate the parameter (nonrandom)

  • Estimator – A function of a random sample used to approximate the parameter (random)
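The parameter/estimator/estimate distinction can be illustrated in a couple of lines; the N(2, 1) population below is a made-up example:

```r
set.seed(3)
mu = 2                  # parameter: a fixed, nonrandom feature of the population
X = rnorm(25, mean=mu)  # a random sample from the population
estimate = mean(X)      # estimate: a fixed number computed from this sample
# the estimator is the rule mean(); applied to a new random sample,
# it would produce a different (random) value
```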