1 Probability

1.1 Probability review

1.1.1 Objectives

In this meeting we will:

  • Introductions
  • Review some probability notation (e.g. PMF, Expectations)
  • Use Bernoulli random variables to study proportions
  • Define estimators and learn to compute quantities that assess them

1.1.2 Administrative stuff

  1. Introductions
    • How did you get interested in statistics?
    • What do you do for fun?
  2. Syllabus

1.1.3 Types of data

  • Continuous
  • Categorical
  • Ordinal

Examples:

  • Brain volume
  • Diagnosis
  • Symptom rating scale (1-7)
  • Coin flip
  • Proportion of heads in \(N\) flips
  • Proportion of time spent sleeping each week

1.1.4 Random variables

Random variables are the key player in statistics and are often used to describe the process by which data arise.

1.1.4.1 Probability notation

  • Blackboard P denotes probability: \(\mathbb{P}(X=x)\) means the probability that the random variable \(X\) is equal to the nonrandom value \(x\).
  • \(\sim\) “is distributed as”
  • Probability mass function (PMF; loosely, a function that assigns probabilities to the values a given random variable can take)
  • Continuous/Discrete random variables
  • A random variable does not have a value in itself… We don’t usually talk about \(X=0.5\), but \(\mathbb{P}(X=0.5)\).
1.1.4.1.1 Bernoulli example
  • Something that takes only two values can be described with a Bernoulli random variable.
    • Coin flip
    • Diagnosis
  • For \(X\sim \text{Be}(\pi)\): X is distributed as Bernoulli, with parameter \(\pi \in (0,1)\), e.g. \(\pi=0.25\).
  • \(\mathbb{P}(X = x) = \pi^x(1-\pi)^{(1-x)}\) for \(x\in\{0,1\}\). That is the probability mass function (PMF) of a Bernoulli random variable; see the short code sketch after this list.
  • Common notation: \(f_X(x)\) is the PMF of the random variable \(X\). In this case \(f_X(x):=\pi^x(1-\pi)^{(1-x)}\).
  • \(:=\) notation means “is defined by.”
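
A minimal R sketch of this PMF (using \(\pi = 0.25\) purely as an illustration), exploiting the fact that a Bernoulli is a binomial with size 1:

Code
pi0 = 0.25                               # illustrative value of the parameter
dbinom(c(0, 1), size = 1, prob = pi0)    # Bernoulli PMF at x = 0 and x = 1: 0.75, 0.25
pi0^c(0, 1) * (1 - pi0)^(1 - c(0, 1))    # same values from the formula above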

1.1.4.2 Probability axioms

We have an intuitive understanding of the basic axioms of probability.

  • An event \(E\) is something that can occur, e.g. heads or tails, \(\{0,1\}\).
  • Let \(\Omega = \{0,1\}\) be the set of all possible outcomes (the union of all possible events)

The (Kolmogorov) axioms are

  1. \(\mathbb{P}(X\in E) \ge 0\) – probabilities are nonnegative.
  2. \(\mathbb{P}(X\in \Omega) = 1\) – probabilities sum to 1.
  3. For disjoint sets \(E_1 \cap E_2 = \varnothing\), \(\mathbb{P}(E_1 \cup E_2) = \mathbb{P}(E_1) + \mathbb{P}(E_2)\).

1.1.4.3 Why use probabilities for data?

Risk of dying

1.1.4.4 Multinomial example

  • A die is an example of a multinomial distribution
Code
die = data.frame('X'=1:6, 'P(X)'=paste0('$\\pi_', 1:6, '$'), check.names = FALSE)
knitr::kable(die)
X P(X)
1 \(\pi_1\)
2 \(\pi_2\)
3 \(\pi_3\)
4 \(\pi_4\)
5 \(\pi_5\)
6 \(\pi_6\)
  • Can also think about statements like \(\mathbb{P}(X\le x)\)
  • This is a sum over probabilities corresponding to possible values of \(X\).
  • The unknown probabilities \(\pi_1,\ldots,\pi_6\) are called parameters

e.g. for a fair die and \(x = 2\): \[ \sum_{i=1}^2 \mathbb{P}(X=i) = 1/3 \]
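
A small code sketch of this cumulative sum, assuming a fair die (\(\pi_i = 1/6\) for all \(i\)):

Code
pmf = rep(1/6, 6)      # PMF of a fair six-sided die
cdf = cumsum(pmf)      # P(X <= x) for x = 1, ..., 6
cdf[2]                 # P(X <= 2) = 1/3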

1.1.4.5 Wordle Multinomial example

  • Multinomial can be used for distributions of other things
  • E.g. Wordle scores. The probability of getting the Wordle in a certain number of guesses.
  • Also has 6 possible values (or one more, if we count not getting it at all).
  • Same family of distribution as the die, except with different parameters
\(Y=\)Number of tries \(P(Y)\)
1 0
2 0.03
3 0.28
4 0.40
5 0.18
6 0.11
7 ??
Code
wordle = c(0, 2, 17, 25, 11, 7)   # counts of games solved in 1, ..., 6 guesses
#wordle/sum(wordle)               # empirical probabilities
#round(wordle/sum(wordle), 2)     # rounded empirical probabilities (compare with the table)
Wordle results

1.1.4.6 Normal example

  • The normal distribution has two parameters, \(\mu\) and \(\sigma\); its CDF is \[ \mathbb{P}(X\le x) = \int_{-\infty}^x (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{1}{2\sigma^2}(z-\mu)^2 \right\}dz \]

  • Probabilities of sets: \(\mathbb{P}(X \in E)\), where, e.g., \(E = (-\infty, -1.96) \cup (1.96, \infty)\) (computed in the code sketch at the end of this subsection).

  • What is the probability that \(X=x\) exactly?

Examples:

  1. Multinomial – after rolling a die, it has to land on one of the six faces. Normal – the full real line \(E=(-\infty,\infty)\) has probability 1.
  2. If the sets are not disjoint, then we would add some probabilities twice, e.g., the multinomial events \(\{1,2,3\}\) and \(\{3,4\}\) overlap at 3.
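
As a sketch of the set-probability bullet above, here is the two-tailed set \(E = (-\infty, -1.96) \cup (1.96, \infty)\) for a standard normal; because the two tails are disjoint, axiom 3 lets us add their probabilities:

Code
# P(X in E) for X ~ N(0, 1) and E = (-Inf, -1.96) U (1.96, Inf)
pnorm(-1.96) + (1 - pnorm(1.96))   # approximately 0.05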

1.2 CDFs, PDFs, quantile functions

1.2.1 Objectives

  • Introduce CDFs, PDFs, and quantile functions
  • Introduce Expectations (means)

1.2.2 PDFs and CDFs (and quantile functions)

1.2.2.1 PDFs

The PDF (probability density function) is the derivative of the CDF and is often denoted with a lowercase letter, \(f(x)\). For discrete random variables the PDF is called the PMF (probability mass function).

Figure 1.1: Corresponding density functions.

1.2.2.2 PDFs continued

The PDF can conceptually be thought of as \(\mathbb{P}(X=x)\); for PMFs, that’s exactly what it is. But for continuous random variables, the probability that they take any particular value is equal to zero. These PDFs are functions with parameters: \[ \begin{align*} \text{Normal: }& f(x; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2 \right\} \\ \text{Gamma: }& f(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x} \\ \text{Poisson: }& f(x; \lambda) = \frac{\lambda^x\ e^{-\lambda}}{x!} \\ \end{align*} \]

The CDF (below) and PDF are useful because we never know what values the random variable will take, so we can integrate the PDF or use the CDF to figure out where they are most likely to fall.

1.2.2.3 CDFs

The cumulative distribution function (CDF), \(F(x)\), for a random variable \(X\) is a function that satisfies \[ F(x) = \mathbb{P}(X \le x). \] They are usually functions of parameters. Here are some examples.

Figure 1.2: Some distribution functions.

1.2.2.3.1 Human Connectome Project Brain volume example
A brain.
  • Human Connectome Project data
  • Study designed to understand how regions of the brain are interconnected
Code
hcp = read.csv('../datasets/hcp/hcp.csv')
mu = mean(hcp$FS_TotCort_GM_Vol, na.rm=T)
sigma = sd(hcp$FS_TotCort_GM_Vol, na.rm=T)
mu
## [1] 510019.7
Code
sigma
## [1] 53940.92
Code
x = seq(mu - 3*sigma, mu + 3*sigma, length.out=1000)
fOfX = dnorm(x, mean = mu, sd=sigma)
plot(x, fOfX, type='l', xlab='Brain volume', ylab='f(Brain Volume)')

Code
hist(hcp$FS_TotCort_GM_Vol, xlab='Brain Volume', xlim=c(mu - 3*sigma, mu + 3*sigma))

For discrete random variables the derivative of the CDF does not exist because the CDF is a step function, but the probability mass function is the amount the CDF jumps up at that location. Heuristically, we can define it as \[ f(x) = F(x) - F(x - \Delta x), \] for an infinitesimal value \(\Delta x\).

1.2.2.3.2 Wordle CDF example
\(Y=\)Number of tries \(P(Y|\text{Simon})\) \(P(Y|\text{Lillie})\)
1 0 0
2 0.03 0.05
3 0.28 0.24
4 0.40 0.44
5 0.18 0.19
6 0.11 0.08
Code
wordle1 = c(0, 2, 17, 25, 11, 7)
wordle2 = c(0, 2, 9, 16, 7, 3)
wordle1 = cumsum(wordle1/sum(wordle1))
wordle2 = cumsum(wordle2/sum(wordle2))
x = 0:7
plot(x, c(0,wordle1, 1), main='Wordle CDFs', type='s', ylab='P(X<=x)')
points(x, c(0,wordle2, 1), type='s', col='red')

Wordle results
1.2.2.3.3 Human Connectome Project CDF example (continued)
  • We can compute probabilities that people lie within a given region using the CDF.
Code
x = seq(mu - 3*sigma, mu + 3*sigma, length.out=1000)
fOfX = dnorm(x, mean = mu, sd=sigma)
plot(x, fOfX, type='l', xlab='Brain volume', ylab='f(Brain Volume)')

  • Probability someone is between \(450,000\mathrm{mm}^3\) and \(550,000\mathrm{mm}^3\)
  • Probability someone is less than \(400,000\mathrm{mm}^3\).
  • Probability someone is greater than \(600,000\mathrm{mm}^3\)
Code
# center
pnorm(550000, mean=mu, sd=sigma) - pnorm(450000, mean=mu, sd=sigma)
## [1] 0.6377898
Code
# lower tail
pnorm(400000, mean=mu, sd=sigma)
## [1] 0.0206934
Code
# upper tail
1-pnorm(600000, mean=mu, sd=sigma)
## [1] 0.0476453

1.2.2.4 Quantile functions

The quantile function is the inverse of the CDF.

  • For a given probability \(p\), it spits out a value \(Q(p)\) such that \(F(Q(p)) = p\).
  • The interpretation is that for a given probability \(p\), there is probability \(p\) that the random variable will fall below \(Q(p)\).
  • Sometimes, it’s possible to find the quantile function explicitly, but many times it isn’t.
  • We can also call these “percentiles” if we express \(p\) as a percentage (multiply by 100).
1.2.2.4.1 HCP example
  • 5% of people have a brain volume larger than what value?
  • What is the middle value (median; .5 quantile) of brain volumes?
  • What is interquartile range (difference between .75 quantile and .25 quantile)?
Code
# 95 percentile of brain volume
qnorm(1-0.05, mean=mu, sd=sigma)
## [1] 598744.6
Code
# Median
qnorm(0.5, mean=mu, sd=sigma)
## [1] 510019.7
Code
# IQR
qnorm(0.75, mean=mu, sd=sigma) - qnorm(0.25, mean=mu, sd=sigma)
## [1] 72765.19

1.2.3 Expectations

1.2.3.1 Overview

Many things we care about for a random variable are expectations. The expectation is an “operator” on a random variable, meaning it takes a function of a random variable, \(g(X)\), for a (somewhat) arbitrary function \(g\), and is defined by \[ \mathbb{E} g(X) = \int_{\mathcal{X}} g(x) p(x) dx, \] where \(p(x)\) is the density function of \(X\) and the \(\mathcal{X}\) subscript denotes that we are integrating over the domain of values \(X\) can take.

1.2.3.2 Great expectations

The mean is the most common expectation \[ \mathbb{E} X = \int_{\mathcal{X}} x p(x) dx. \]

  • The integral notation is in the sense of “real analysis” type integrals that can refer to sums or integrals. This is to emphasize that the definition is the same with continuous or discrete random variables.

Another great expectation is the variance \[ \text{Var}(X) = \mathbb{E} (X - \mathbb{E}X)^2 = \mathbb{E}X^2 - (\mathbb{E}X)^2 \] Properties of the variance are homework questions.

In general, \[ \mathbb{E}X^k \] is called the \(k\)th moment of \(X\).

1.2.3.3 Properties of expectations

  • \(\mathbb{E}(a X + b) = a \mathbb{E}X + b\)
  • \(\mathbb{E}\{ \mathbb{E}X\} = \mathbb{E}X\) (\(\mathbb{E}X\) is a constant)

1.2.3.4 Expected value for Bernoulli

  • For the Bernoulli \(X\sim \text{Be}(p)\) the expectation is \[ \mathbb{E}X = \sum_{i=0}^1 x_i f_X(x_i) = 1 \times p + 0 \times (1-p) = p \]
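
A quick simulation sketch checking this, with an arbitrary illustrative value \(p = 0.25\):

Code
set.seed(1)
p = 0.25
x = rbinom(10000, size = 1, prob = p)   # 10,000 Bernoulli(p) draws
mean(x)                                 # sample mean, should be close to p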

1.2.3.5 Expected value for Multinomial

\(Y\) \(P_1(Y)\) \(P_2(Y)\) \(P_3(Y)\)
1 1/4 .1 .7
2 1/4 .2 .1
3 1/4 .3 .1
4 1/4 .4 .1

Each column is a different example distribution for \(Y\).
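
A short sketch computing \(\mathbb{E}Y = \sum_y y\,P(Y=y)\) for each of the three example distributions in the table:

Code
y = 1:4
p1 = rep(1/4, 4); p2 = c(.1, .2, .3, .4); p3 = c(.7, .1, .1, .1)
c(sum(y * p1), sum(y * p2), sum(y * p3))   # expected values: 2.5, 3.0, 1.6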

1.2.3.6 Expected value for Wordle score

\(Y=\)Number of tries \(P(Y)\)
1 0
2 0.03
3 0.28
4 0.40
5 0.18
6 0.11
Code
wordle = c(0, 2, 17, 25, 11, 7)
#wordle/sum(wordle)
#round(wordle/sum(wordle), 2)
Wordle results
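
A sketch of the expected Wordle score using the counts above (the failure category is ignored, as in the table):

Code
wordle = c(0, 2, 17, 25, 11, 7)   # counts of games solved in 1, ..., 6 guesses
probs = wordle / sum(wordle)      # empirical PMF
sum(1:6 * probs)                  # expected number of guesses, about 4.06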

1.2.3.7 Expected value for Poisson random variable

  • Poisson random variables are useful for modeling skewed counts.
  • E.g. number of drinks of alcohol in a week.
  • E.g. score on a depression inventory or other psychological scale.

\[ f(x; \lambda) = \frac{\lambda^x\ e^{-\lambda}}{x!} \]

  • The expectation and variance of a Poisson distribution are both equal to \(\lambda\) (checked by simulation below).
Code
x = seq(0, 15)
plot(x, dpois(x, lambda=5), main='Poisson(5)/Poisson(0.5)', type='p', ylab='P(X=x)', ylim=c(0,1))
points(x, dpois(x, lambda=0.5), type='p', col='red')
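
A quick simulation sketch checking the claim above that the mean and variance of a Poisson are both \(\lambda\) (here \(\lambda = 5\), matching the plot):

Code
set.seed(1)
x = rpois(10000, lambda = 5)   # 10,000 Poisson(5) draws
mean(x)                        # close to 5
var(x)                         # also close to 5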

1.2.3.8 Expected value for standard Normal distribution

Start with the standardized normal distribution \[\begin{align*} \mathbb{E}X & = \int_{-\infty}^\infty x \left(2\pi \right)^{-1/2} \exp\left\{-\frac{1}{2} x^2 \right\} dx\\ & = \left(2\pi \right)^{-1/2} \int_{0}^\infty \exp\{-u\} du - \left(2\pi \right)^{-1/2} \int_{0}^\infty \exp\{-u\} du, \end{align*}\] where \(u = x^2/2\) (split the integral at zero and substitute in each half).

  • This equals zero.

1.2.3.9 Expected value for Normal distribution

Start with the standardized normal distribution and use change of variables.

  • For \(X\sim N(0,1)\), let \(Y = X\sigma + \mu\).
  • What is \(\mathbb{E}Y\)?
  • Then \[\begin{align*} \mathbb{P}(Y<z) = \mathbb{P}(X\sigma+\mu<z) = \mathbb{P}(X< (z - \mu)/\sigma) \\ = \int_{-\infty}^{(z-\mu)/\sigma} \left(2\pi \right)^{-1/2} \exp\left\{-\frac{1}{2} x^2 \right\} dx\\ = \int_{-\infty}^{z} \left(2\pi \sigma^2\right)^{-1/2} \exp\left\{-\frac{1}{2\sigma^2} (w-\mu)^2 \right\} dw, \end{align*}\] using the substitution \(w = x\sigma + \mu\).
  • This is the CDF of a \(N(\mu, \sigma^2)\), so \(Y\sim N(\mu, \sigma^2)\) and \(\mathbb{E}Y = \sigma\mathbb{E}X + \mu = \mu\).

1.2.3.10 Properties of variance

  • \(\mathrm{Var}(aX +b) = a^2 \mathrm{Var}(X)\)

1.3 Multivariate random variables

1.3.0.1 Overview

Let \(W = (X, Y) \in \mathcal{B}\) be a multivariate random variable. \(X\) and \(Y\) can be continuous or discrete. For example \(X,Y\) can be multivariate normal.

1.3.0.2 Example

For \(X\in\{1,2,3\}\) and \(Y \in \{1,2\}\) (matching the table produced below)

  • Joint probability table
  • Marginal probabilities
  • Conditional probabilities (intuitively)

Draw the 3 different tables.

Code
probs = data.frame( X=rep(1:3, each=2), Y = rep(c(1,2), 3))
probs$p = paste0('$p_{', probs$X, probs$Y, '}$')
tabout = t(table(probs$X, probs$Y))
tabout[,] = probs$p
tabout = t(tabout)
knitr::kable(tabout, row.names = TRUE, escape = FALSE )
1 2
1 \(p_{11}\) \(p_{12}\)
2 \(p_{21}\) \(p_{22}\)
3 \(p_{31}\) \(p_{32}\)

1.3.0.3 Example: Umbrella and raining

  • A common example is whether someone brings an umbrella with them
It’s not raining It’s raining
I don’t bring umbrella 0.05 0.17
I bring umbrella 0.35 0.43

Another example

It’s not precipitating It’s raining It’s snowing
I don’t bring umbrella 0.03 0.17 0.02
I bring umbrella 0.35 0.40 0.03

1.3.1 Distribution functions

The distribution function for multivariate random variables is often written in terms of subsets \(B \subset \mathcal{B}\) \[ F_W(B) = \mathbb{P}(W \in B). \]

If \(W\) has a density then we can also write for values \(x\) and \(y\) (in the appropriate domain) \[ \mathbb{P}(W \in B) = \int_{B} f_W(x, y)dx dy, \] where \(f_W\) is the PDF or PMF of \(W = (X,Y)\).

1.3.2 Marginal probability

  • The marginal probability, \(\mathbb{P}(Y=y)\) is the probability for \(Y\) ignoring whatever is going on with \(X\).
  • This can be found by integrating out the \(X\) variable. \[ f_Y(y) = \int_\mathcal{X} f_{X,Y}(x, y)dx \]
  • Use the joint multinomial table above to demonstrate this; a code sketch with the umbrella table follows.
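
A sketch of marginalizing by summing, using the umbrella/rain joint table from earlier (values copied from that table):

Code
# joint PMF with umbrella choice on the rows and rain on the columns
umbrella = matrix(c(0.05, 0.17,
                    0.35, 0.43), nrow = 2, byrow = TRUE,
                  dimnames = list(c('no umbrella', 'umbrella'),
                                  c('not raining', 'raining')))
rowSums(umbrella)   # marginal distribution of umbrella choice
colSums(umbrella)   # marginal distribution of rain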

1.3.3 Conditional distribution

Conditional distributions can be written in terms of the underlying probability space. For sets \(A, B \subset \mathcal{B}\) \[ \mathbb{P}(A \mid B ) = \frac{\mathbb{P}( A \cap B)}{\mathbb{P}(B)} \]
Venn diagram of sets

Conceptually, the denominator adjusts the numerator for the fact that we are only considering events that include B occurring. This makes the conditional probabilities sum to 1.

For multivariate random variables the definition is similar \[ \mathbb{P}(Y=y \mid X=x) = \frac{\mathbb{P}(X=x, Y=y)}{\mathbb{P}(X=x)} \]

  • Can think of the Venn diagram in terms of the joint random variable falling into the set.
  • Probability I bring an umbrella given that it’s raining (sketched in code below).
  • Probability it’s raining given that it’s not precipitating (conditioning on itself).
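
A sketch of the umbrella-given-rain conditional probability, with the joint table redefined so the chunk stands on its own:

Code
umbrella = matrix(c(0.05, 0.17,
                    0.35, 0.43), nrow = 2, byrow = TRUE,
                  dimnames = list(c('no umbrella', 'umbrella'),
                                  c('not raining', 'raining')))
# P(umbrella | raining) = P(umbrella, raining) / P(raining)
umbrella['umbrella', 'raining'] / sum(umbrella[, 'raining'])   # 0.43 / 0.60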

1.3.4 Examples

1.3.4.1 Insurance dataset children and region

  • The insurance dataset is in the RESI R package, which you can install with the following code
Code
install.packages('RESI')
  • There are some cool multivariate distributions in this data set.
  • Consider the probability of having a certain number of children and living in a certain area.
Code
library(RESI) # contains insurance dataset
childByRegion = table(insurance$region, insurance$children)
childByRegion = childByRegion/sum(childByRegion)
knitr::kable(round(childByRegion, 2))
0 1 2 3 4 5
northeast 0.11 0.06 0.04 0.03 0.01 0.00
northwest 0.10 0.06 0.05 0.03 0.00 0.00
southeast 0.12 0.07 0.05 0.03 0.00 0.00
southwest 0.10 0.06 0.04 0.03 0.01 0.01
  • Let \(X\) be a random variable denoting number of children

    • Is it categorical/continuous/ordinal?
  • Let \(Y\) be a random variable denoting region

    • Is it categorical/continuous/ordinal?
  • The joint distribution is a function of both variables: \(\mathbb{P}(X=x, Y=y) = f(x,y)\).

  • Questions about regional differences and number of children can be answered from the table (a code sketch follows this list)

  • What is the probability of having zero kids and living in the east?

  • What is the probability of having zero kids?
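
A sketch answering these from the childByRegion table, assuming the chunk above that creates it has been run:

Code
# joint probability of zero kids and living in the east (northeast or southeast)
childByRegion['northeast', '0'] + childByRegion['southeast', '0']
# marginal probability of zero kids (sum over all regions)
sum(childByRegion[, '0'])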

1.3.4.2 Insurance data multivariate distribution of age and charges

Code
library(RESI)
name='charges'
hist(insurance[,name], xlab=name, main=name, freq=FALSE)
lines(density(insurance[,name], adjust=1.5), lwd = 2) 

  • Relationship between age and charges in the insurance data
Code
plot(insurance[,'age'], insurance$charges, ylab='Charges', xlab='Age', main=paste('Age', 'and', 'Charges') )

  • The scatter plot can be thought of as a dataset version of a joint density \(f_{X,Y}(x,y)\)
  • We can get the conditional histograms for a few different age groups
Code
# `cols` is a color palette assumed to be defined in the setup of these notes;
# when running this chunk on its own, something like `cols = palette.colors(8)` works
par(mar=c(8,2.8,1.8,.2), mgp=c(6,.7,0))
nbin = 15
insCat = with(insurance, data.frame(age=cut(age, breaks = seq(min(age), max(age), length.out=7), include.lowest=T), charges=cut(charges/1000, breaks = seq(min(charges/1000), max(charges/1000), length.out=nbin), include.lowest=T) ) )
condTab = do.call(cbind, by(insCat$charges, insCat$age, function(x){res=table(x); res/sum(res)}))
barplot(condTab, beside=TRUE, col=rep(cols[1:ncol(condTab)], each=nbin-1),
        names.arg = rep(rownames(condTab), ncol(condTab) ), las=2, xlab='Charges (in thousands)', main='Conditional frequencies')
legend('topright', fill=cols[1:ncol(condTab)], legend=colnames(condTab), bty='n')

Code
par(mar=c(8,2.8,1.8,.2), mgp=c(6,.7,0))
nbin = 15
jointTab = do.call(cbind, by(insCat$charges, insCat$age, function(x){res=table(x); res}))
jointTab = jointTab/sum(jointTab)
barplot(jointTab, beside=TRUE, col=rep(cols[1:ncol(jointTab)], each=nbin-1),
        names.arg = rep(rownames(jointTab), ncol(jointTab) ), las=2, xlab='Charges (in thousands)', main='Joint frequencies')
legend('topright', fill=cols[1:ncol(jointTab)], legend=colnames(jointTab), bty='n')

Code
par(mar=c(8,2.8,1.8,.2), mgp=c(6,.7,0))
nbin = 15
test = barplot(t(jointTab), beside=FALSE, col=cols[rep(1:ncol(jointTab), nbin-1)], las=2, xlab='Charges (in thousands)', main='Marginal frequencies', space=0)
legend('topright', fill=cols[1:ncol(jointTab)], legend=colnames(jointTab), bty='n')

1.3.5 Conditional means and variances

  • Once you have conditional probabilities, you can do all the same stuff with it that you would do with a regular random variable
  • For example, \(\mathbb{E}(Y \mid X=x) = \sum_{y} y f_{Y\mid X}(y \mid x)\)

Conditional expectation is just an expectation with respect to a conditional distribution. \[ \begin{align*} \mathbb{E}(Y \mid X=x) & = \int_{\mathcal{Y}} yf_{Y\mid X}(y\mid x)dy \\ & = \int_{\mathcal{Y}} yf_{Y,X}(y, x)/f_X(x)dy \end{align*} \]

Replacing the fixed value \(x\) with the random variable \(X\) gives \(\mathbb{E}(Y\mid X)\), which can still be thought of as random with respect to the conditioning variable (see the code sketch after the table below).

Marginal distribution of \(X\): \(P(X=1)=.1\), \(P(X=2) = .7\), \(P(X=3)=.2\)

\(Y\) \(P(Y|X=1)\) \(P(Y| X=2)\) \(P(Y| X=3)\)
1 1/4 .1 .7
2 1/4 .2 .1
3 1/4 .3 .1
4 1/4 .4 .1
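
A sketch computing \(\mathbb{E}(Y \mid X=x)\) for each column of the table above:

Code
y = 1:4
pYgivenX = cbind(rep(1/4, 4), c(.1, .2, .3, .4), c(.7, .1, .1, .1))   # columns: X = 1, 2, 3
colSums(y * pYgivenX)   # E(Y | X = x) for x = 1, 2, 3: 2.5, 3.0, 1.6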

1.3.6 Properties of expectation for two random variables

Expectation is linear because it is an integral. For example, for constants \(a,b\) and random variables \(X,Y\) \[ \mathbb{E}(aX + bY) = a \mathbb{E} X + b\mathbb{E}Y \]

1.3.6.1 Example: discrete and continuous joint distribution

  • Diagnosis and hippocampus volume in Alzheimer’s disease.
  • \(P(DX)\) is multinomial, \(P(Hipp\mid DX)\) assumed to be normal.
Code
adni = readRDS('../datasets/adni/hippocampus.rds')
tab = round(do.call(rbind, by(adni$LEFTHIPPO, adni$DX, function(x) c(mean(x), var(x)) )), 0)
pX = table(adni$DX); pX = as.data.frame(round(pX/sum(pX),2)); names(pX) = c('X', 'P(X)')
colnames(tab) = c('Mean', 'Variance')

knitr::kable(pX, row.names = FALSE, escape = FALSE )
X P(X)
AD 0.26
HC 0.29
MCI 0.45
Code
knitr::kable(tab, row.names = TRUE, escape = FALSE )
Mean Variance
AD 1549 109794
HC 2122 89241
MCI 1803 131030

1.3.7 Independence

  • Two random variables are independent if their joint distribution factors: \(\mathbb{P}(X=x, Y=y) = \mathbb{P}(X=x)\mathbb{P}(Y=y)\) for all possible values of \(x\) and \(y\).
  • Intuitively, this means that fixing one variable doesn’t affect the distribution of the other.
  • Another way to express it is that \(\mathbb{P}(X=x\mid Y=y) = \mathbb{P}(X=x)\).
  • The children-by-region table above is pretty close to independent.
  • The conditional probabilities there are \(\mathbb{P}(\text{N kids} | \text{Region})\).
  • This means that two variables are independent if their CDFs or PDFs factor!

If the table above were perfectly independent, it might look like this:

Code
rs = rowSums(childByRegion)
cs = colSums(childByRegion)
indepTab = outer(rs, cs)
knitr::kable(round(indepTab, 3))
0 1 2 3 4 5
northeast 0.104 0.059 0.043 0.028 0.005 0.003
northwest 0.104 0.059 0.044 0.029 0.005 0.003
southeast 0.117 0.066 0.049 0.032 0.005 0.004
southwest 0.104 0.059 0.044 0.029 0.005 0.003
Code
barplot(indepTab, main='Joint probabilities', ylab='Probability', xlab='Number of children', beside=T, col=cols[1:nrow(indepTab)])
legend('topright', fill=cols[1:nrow(indepTab)], legend=rownames(indepTab), bty='n' )

Code
condTab = sweep(indepTab, 1, rowSums(childByRegion), FUN = '/')
barplot(condTab, main='Conditional probabilities', ylab='Probability', xlab='Number of children', beside=T, col=cols[1:nrow(condTab)])
legend('topright', fill=cols[1:nrow(condTab)], legend=rownames(condTab), bty='n' )

  • Are charges and smoking independent?

1.3.8 Covariance and dependence

Expanding the product and using linearity of expectation gives the other common formula: \[ \text{Cov}(X, Y) = \mathbb{E}(X-\mathbb{E}X)(Y-\mathbb{E}Y) = \mathbb{E}(XY) - \mathbb{E}X\,\mathbb{E}Y \]

Correlation is \[ \begin{align*} \text{Cor}(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} \in [-1,1] \end{align*} \]

  • Nonzero covariance implies dependence, but dependence does not imply nonzero covariance (homework question).
  • Independence implies zero covariance.
  • The covariance between two random variables can be zero even though they are dependent.
  • Covariance summarizes the linear dependence between two random variables.

1.3.9 Pearson’s correlation

  • Sampling two measurements on each subject: \(X_i\), \(Y_i\).

Correlation is a normalized covariance \[ \begin{align*} \rho = \text{Cor}(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} \end{align*} \]

  • It takes values between -1 and 1.
  • Negative values mean the two variables have a negative (approximately linear) relationship.
  • Positive values mean they have a positive (approximately linear) relationship.
  • Correlation is not slope.
    • A slope of zero does imply zero correlation, however.
correlation

We can estimate the correlation with

\[ \begin{align*} \hat\rho = \frac{\widehat{\text{Cov}}(X,Y)}{\sqrt{\widehat{\text{Var}}(X)\widehat{\text{Var}}(Y)}} = \frac{\sum_{i=1}^n(X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i=1}^n(X_i - \bar X)^2\sum_{i=1}^n(Y_i - \bar Y)^2}} \end{align*} \]

1.3.9.1 Examples

Code
plot(insurance[,'age'], insurance$charges, ylab='Charges', xlab='Age', main=paste('Age', 'and', 'Charges') )

Code
cor(insurance[,'age'], insurance$charges)
## [1] 0.2990082

1.3.9.2 Examples

  • Dependent but uncorrelated
Code
par(mfrow=c(2,1))
X = rnorm(100)
Y = X^2
plot(X, Y, main='Uncorrelated dependent no noise')
abline(lm(Y ~ X))
cor(X,Y)
## [1] 0.1224623
Code
X = rnorm(100)
Y = X^2 + rnorm(100)
plot(X, Y, main='Uncorrelated dependent with noise')
abline(lm(Y ~ X))

Code
cor(X,Y)
## [1] -0.1366219

1.3.10 Properties of variances of sums

Derive the variance of a sum (a worked expansion follows):

  • \(\text{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\text{Cov}(X, Y)\)
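
The expansion behind this identity (standard algebra, written out for reference): \[ \begin{align*} \text{Var}(X + Y) & = \mathbb{E}\left\{(X - \mathbb{E}X) + (Y - \mathbb{E}Y)\right\}^2 \\ & = \mathbb{E}(X-\mathbb{E}X)^2 + \mathbb{E}(Y-\mathbb{E}Y)^2 + 2\,\mathbb{E}(X-\mathbb{E}X)(Y-\mathbb{E}Y) \\ & = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y). \end{align*} \]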

1.3.11 Law of total expectation

The law of total expectation is relatively simple, but super handy. \[ \mathbb{E}\{\mathbb{E}(X \mid Y)\} = \mathbb{E}X \] If you are trying to take the mean of a complicated random variable, you can first take its conditional mean given part of it, and then take the expectation over the part you conditioned on.

  • Example: continue from the last table (see the code sketch after this list)

  • Example: Use diagnosis and hippocampal volume above (also sketched below)

  • Example from Rice:
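
Two short sketches: the first continues the conditional table above, the second uses the rounded diagnosis/hippocampal summaries printed earlier (values copied from those tables, so results are approximate):

Code
# continue from the conditional table: E(Y) = sum_x P(X = x) E(Y | X = x)
pX = c(.1, .7, .2)
condMeans = c(2.5, 3.0, 1.6)
sum(pX * condMeans)                 # 2.67

# diagnosis and hippocampal volume: E(Hipp) = sum_dx P(DX = dx) E(Hipp | DX = dx)
pDX = c(AD = 0.26, HC = 0.29, MCI = 0.45)
meanHipp = c(AD = 1549, HC = 2122, MCI = 1803)
sum(pDX * meanHipp)                 # approximately 1829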

1.3.12 Law of total variance

The law of total variance states that \[ \text{Var}(Y) = \mathbb{E}\text{Var}(Y \mid X) + \text{Var}\{ \mathbb{E}(Y \mid X)\}. \]

  • Example: continue from the last table
  • Example: Use diagnosis and hippocampal volume above (see the code sketch below)
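
A sketch of the same decomposition using the rounded diagnosis/hippocampal summaries above (approximate because of rounding):

Code
pDX = c(AD = 0.26, HC = 0.29, MCI = 0.45)
meanHipp = c(AD = 1549, HC = 2122, MCI = 1803)
varHipp = c(AD = 109794, HC = 89241, MCI = 131030)
# Var(Y) = E Var(Y | X) + Var{ E(Y | X) }
meanOfVars = sum(pDX * varHipp)                              # E Var(Y | X)
varOfMeans = sum(pDX * meanHipp^2) - sum(pDX * meanHipp)^2   # Var{ E(Y | X) }
meanOfVars + varOfMeans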

1.4 Parameters, estimates, and estimators

1.4.1 Objectives

  • Talk about parameters, estimates, and estimators
  • Talk about how to assess estimators by comparing their variance and bias

1.4.2 Example: healthy sleep

Startsleeping.org
  • For this example let’s look at the sleep questionnaire from the HCP
  • What proportion of people get healthy sleep (assuming 7-9 hours counts as healthy)?
Code
hcp = read.csv('../datasets/hcp/hcp.csv')
hist(hcp$PSQI_AmtSleep, main='Nightly sleep', xlab='Hours')

Code
hcp$healthySleep = ifelse(hcp$PSQI_AmtSleep<=9 & hcp$PSQI_AmtSleep>=7, 1, 0)

1.4.3 Statistics is a way to learn about the world from a data set

All three statistical philosophies we will discuss view research questions as trying to learn about an unknown parameter.

For now, let’s work with the question, “What is the proportion of people in the United States who get a healthy amount of sleep?”

  1. What is my parameter of interest (target parameter)?
    • Descriptively – proportion of people who get a healthy amount of sleep
    • Mathematically – call this value \(p\)
  2. What is my population of interest?
    • Descriptively – The United States population
    • Quantitatively two possible things
      • Distribution defined by the model below in point 4. \(\text{Be}(p)\)
      • Unknown statistical distribution that gives rise to whether someone reports between 7-9 hours of sleep.
  3. What is a data point?
    • An answer from an individual in the population to the survey question. Coded as 1/0 if they do/don’t get 7-9 hours of sleep.
  4. What model can I use to describe how I get a data point?
    • Most obvious one is \(X_i \sim \text{Be}(p)\), where \(p\) is the parameter in point 1.

1.4.4 A random sample

A random sample is a collection of independent random variables that represent potential data points.

Let \(X_i \sim \text{Be}(p)\) for \(i=1,\ldots, n\).

Our assumption about the population that \(X_i\) is drawn from connects the data to the parameter of interest.

Given our data, we have a parameter and estimate of the parameter. What are they?

1.4.5 Estimates

Note, an estimate is a function of the observed sample, and it is nonrandom (why is it nonrandom?). We often use lowercase letters to make that clear: \[ \bar x = n^{-1} \sum_{i=1}^n x_i. \]
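
A sketch computing this estimate for the sleep example, assuming the hcp data frame and healthySleep variable created in the chunk above:

Code
# observed proportion of respondents reporting 7-9 hours of sleep
xbar = mean(hcp$healthySleep, na.rm = TRUE)
xbar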

1.4.6 Estimators

An estimator is a function of a random sample. The goal (usually) is to estimate a parameter of interest (here, \(p\)). Let’s consider the estimator \[ \hat p = \bar X = n^{-1} \sum_{i=1}^n X_i, \] which we know is pretty reasonable since \(\mathbb{E}\bar X = p\).

  • We can study the properties of estimators to learn about how they behave.
  • If we like an estimator we can compute an estimate using our sample.
  • We can use features of our estimator to make probabilistic statements about what might happen if we repeat our study (dataset-to-dataset variability).
  • This perspective is considered to be the Frequentist philosophy of Statistics.

1.4.6.1 Properties of estimators

Some common metrics are

  • Bias: let \(p\) denote our target parameter, bias is defined as \[ \mathbb{E}(\hat p - p). \]

  • Variance: this is the same as for other random variables \[ \mathbb{E}(\hat p - \mathbb{E}\hat p)^2 \]

  • Mean Squared Error: this combines variance and bias: \[ \mathbb{E}(\hat p - p)^2 \]

  • Consistency: this describes what happens to the estimator as \(n\to\infty\) \[ \hat p \to_P p \]

  • The arrow is convergence in probability. More on this later.

  • Asymptotically unbiased: a weaker statement about what happens to the estimator as \(n\to\infty\) \[ \mathbb{E}\hat p \to p \]

1.4.6.2 Bias, variance, and MSE of the proportion estimator

  • The bias of the proportion estimator (used to estimate the proportion of people who get enough sleep) is: \[ \mathbb{E}( \hat p - p) = 0 \]
  • The variance of the estimator is \[ \begin{align*} \text{Var}(\hat p) & = n^{-2} \sum_{i=1}^n\text{Var}(X_i) \\ & = n^{-1} p(1-p) \end{align*} \]
  • The MSE is the same as the variance since \(\hat p\) is unbiased.

We will come back to this example to construct confidence intervals later.

1.4.7 HCP: example of bias, variance, and MSE

  • A simulation illustrating bias and variance for the sleep-proportion example, using an assumed true proportion \(p = 0.6\)
Code
# Show what bias and variance mean using simulations (sleep proportion example)
nstudies = 100000

p = 0.6

n = 100
phats = rep(NA, nstudies)
for(study in 1:nstudies){
  # random sample
  x = rbinom(n = n, prob = p, size=1)
  # estimate phat
  phats[study] = mean(x)
}

# mean of phat across the simulations
mean(phats)
## [1] 0.5998757
Code
hist(phats)

Code
# variance of phat across the simulations
var(phats)
## [1] 0.002404844
Code
# what the variance should be equal to
p * (1-p)/n
## [1] 0.0024

1.4.8 Parameters, Estimators, Estimates

To reiterate:

  • Parameter – Target unknown feature of a population (nonrandom)
  • Estimate – Value computed from observed data to approximate the parameter (nonrandom)
  • Estimator – A function of a random sample to approximate the parameter

In statistics, probability is used to define and describe the behavior of estimators.

  • Often, the distribution of an estimator makes it too hard to find its bias, variance, and MSE mathematically.
  • In this case, we use simulations to estimate the bias, variance, and MSE of an estimator.
  • Important concepts for this session:
    • Simulations can be used to estimate the bias and variance when they are too hard to find mathematically.
    • Sometimes, the parameter of interest is not a distributional parameter.

1.4.9 Another example: multiplexed immunofluorescence (mIF) microscopy

  • mIF imaging uses antibody markers to identify protein in tissue samples.
  • The protein Beta-catenin is known to be higher in tumor cells.
beta catenin image

1.4.10 mIF imaging in a statistical framework

Suppose we want to ask the question, “What is the mean concentration of Beta-catenin in the cells from a given tissue sample?”

  1. What is my parameter of interest (target parameter)?
    • Mean cellular concentration of Beta-catenin \(\mathbb{E}X_i\)
  2. What is my population of interest?
    • All possible cells from the given tumor (potentially)?
    • All possible cells from similar tumors?
  3. What is a data point?
    • A single cell
  4. What model can I use to describe how I get a data point?
    • That’s a good question. Here is the histogram of the data
Code
bc = readRDS('../datasets/betaCatenin.rds')
hist(bc, main='Beta Catenin Histogram', xlab='Marker count')

  • The Beta-catenin concentration across cells could be modeled with a Gamma distribution.

1.4.11 A Gamma model for Beta-catenin concentration

  • Previously, we defined the parameter of interest \(p\) based on what we were interested in for the sleep problem, and it also happened to be the parameter of the Bernoulli distribution.

  • In this example that is not the case.

  • Let \(X_i\) denote the Beta-catenin concentration of a randomly drawn cell from the tissue image.

  • Assume \(X_i \sim \text{Gamma}(\alpha, \beta)\).

  • Let’s consider three parameters

    1. \(\mathbb{E}X_i = \alpha/\beta\)
    2. \(\alpha\) (shape parameter)
    3. \(\beta\) (rate parameter)

Let’s assess estimators for each of these parameters.

1.4.12 MxIF example: Estimators

I used method of moments (which you’ll learn in another class) to obtain estimators for these parameters. The estimators are \[ \begin{align*} \hat \mu & = \bar X = n^{-1} \sum_{i=1}^n X_i \\ \tilde \alpha & = \frac{\bar X^2}{\left(\overline{X^2} - \bar X^2\right)} \\ \tilde \beta & = \frac{\bar X}{\left(\overline{X^2} - \bar X^2\right)} \end{align*} \]

  • \(\tilde \alpha\) and \(\tilde \beta\) are complicated functions of random variables involving ratios.
  • It might be hard to find their bias. It will definitely be hard to find their variance.
  • Instead, let’s use simulations to assess the bias, variance, and MSE of these estimators.

1.4.13 Concept of simulation study

  • The bias, variance, and MSE are defined with respect to the distribution of the statistic across repeated samples of the data.
  • If I can repeat the experiment in R multiple times, then I can see multiple random versions of the estimator.
  • A simulation is an experiment to study what happens across experiments.

1.4.14 Simulation study

Code
set.seed(100)
# number of simulations
nsim = 10000
# sample sizes
ns = c(10, 25, 50, 100, 200, 500)
# values of alpha to consider
alphas = c(0.5, 5)
# values of beta to consider
betas = c(0.25, 3)

# presults = data.frame(p=ps, biasmuHat = rep(NA, np), biassigmaSqHat=rep(NA, np), biassigmaSqHat2=rep(NA, np),
#                       varmuHat=rep(NA, np), varsigmaSqHat=rep(NA, np), varsigmaSqHat2=rep(NA, np))

presults = expand.grid(alpha = alphas, beta=betas, n=ns)
colNames = paste(rep(c('muHat', 'alphaTilde', 'betaTilde'), each=3),
                rep(c('Bias', 'Variance', 'MSE'), 3))
presults[, colNames ] = NA

alphaind = 1; betaind = 2; nind = 3
# loops through the parameter values
for(alphaind in 1:length(alphas) ){
  for(betaind in 1:length(betas)){
    
      # data generating distribution parameters (for the gamma distribution)
      alpha = alphas[alphaind]
      beta = betas[betaind]
      
    for(nind in 1:length(ns)){
      # get n for this simulation setting
      n = ns[nind]
      cat(which(presults$n==n & presults$alpha==alpha & presults$beta == beta), '\t')
      
      # loops through the simulations
      # each simulation is one realization of the world
      results = data.frame(muHat = rep(NA, nsim), alphaTilde=rep(NA, nsim), betaTilde=rep(NA, nsim) )
      for(sim in 1:nsim){
        
        # sample a data set of size n from a random variable whose distribution is determined by alpha and beta
        x = rgamma(n, shape=alpha, rate=beta)
        
        # compute estimators
        muHat = mean(x)
        alphaTilde = muHat^2 / var(x)
        betaTilde = muHat / var(x)
        #sigmaSqHat = sum((x-mean(x))^2)/(length(x)-1)
        
        results[sim, c('muHat', 'alphaTilde', 'betaTilde') ] = c(muHat, alphaTilde, betaTilde)
      } # end of the nsim loop
      
      # saving the results for each value of p
      #presults[pind, c('biasmuHat', 'biassigmaSqHat', 'biasSigmaSqHat2')] = colMeans(results) - c(p, p*(1-p), p*(1-p))
      #presults[pind, c('varmuHat', 'varsigmaSqHat', 'varSigmaSqHat2')] = diag(var(results))
      
      # Bias variance MSE for muHat
      presults[ which(presults$n==n & presults$alpha==alpha & presults$beta == beta), c('muHat Bias', 'muHat Variance', 'muHat MSE')] = c(mean(results$muHat) - alpha/beta,
                                                                var(results$muHat),
                                                                (mean(results$muHat) - alpha/beta)^2 + var(results$muHat) )
      presults[ which(presults$n==n & presults$alpha==alpha & presults$beta == beta), c('alphaTilde Bias', 'alphaTilde Variance', 'alphaTilde MSE')] = c(mean(results$alphaTilde) - alpha,
                                                                var(results$alphaTilde),
                                                                (mean(results$alphaTilde) - alpha)^2 + var(results$alphaTilde) )
      presults[ which(presults$n==n & presults$alpha==alpha & presults$beta == beta), c('betaTilde Bias', 'betaTilde Variance', 'betaTilde MSE')] = c(mean(results$betaTilde) - beta,
                                                                var(results$betaTilde),
                                                                (mean(results$betaTilde) - beta)^2 + var(results$betaTilde) )
    } # loop through ns
  } # loop through the betas
} # loop through the alphas
## 1    5   9   13  17  21  3   7   11  15  19  23  2   6   10  14  18  22  4   8   12  16  20  24  
Code
subres = presults[ presults$alpha==0.5 & presults$beta ==0.25,]
layout(matrix(1:6, nrow=2, byrow=TRUE))

# Bias plots
plot(subres$n, subres[, 'muHat Bias'], xlab='Sample size', ylab='Bias', main='Bias of muHat', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'alphaTilde Bias'], xlab='Sample size', ylab='Bias', main='Bias of alphaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'betaTilde Bias'], xlab='Sample size', ylab='Bias', main='Bias of betaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'muHat Variance'], xlab='Sample size', ylab='Variance', main='Variance of muHat', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'alphaTilde Variance'], xlab='Sample size', ylab='Variance', main='Variance of alphaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'betaTilde Variance'], xlab='Sample size', ylab='Variance', main='Variance of betaTilde', type='b')
abline(h=0, lty=2, lwd=2)

Code
subres = presults[ presults$alpha==5 & presults$beta ==3,]
layout(matrix(1:6, nrow=2, byrow=TRUE))

# Bias plots
plot(subres$n, subres[, 'muHat Bias'], xlab='Sample size', ylab='Bias', main='Bias of muHat', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'alphaTilde Bias'], xlab='Sample size', ylab='Bias', main='Bias of alphaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'betaTilde Bias'], xlab='Sample size', ylab='Bias', main='Bias of betaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'muHat Variance'], xlab='Sample size', ylab='Variance', main='Variance of muHat', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'alphaTilde Variance'], xlab='Sample size', ylab='Variance', main='Variance of alphaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'betaTilde Variance'], xlab='Sample size', ylab='Variance', main='Variance of betaTilde', type='b')
abline(h=0, lty=2, lwd=2)

  • Percent bias instead of raw bias
Code
subres = presults[ presults$alpha==0.5 & presults$beta ==0.25,]
layout(matrix(1:6, nrow=2, byrow=TRUE))

# Bias plots
plot(subres$n, subres[, 'muHat Bias']/(subres[,'alpha']/subres[,'beta'])*100, xlab='Sample size', ylab='Percent bias', main='Percent bias of muHat', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'alphaTilde Bias']/subres[,'alpha']*100, xlab='Sample size', ylab='Percent bias', main='Percent bias of alphaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'betaTilde Bias']/subres[,'beta']*100, xlab='Sample size', ylab='Percent bias', main='Percent bias of betaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, sqrt(subres[, 'muHat Variance'])/(subres[,'alpha']/subres[,'beta'])*100, xlab='Sample size', ylab='Percentage', main='Percent SE of muHat', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, sqrt(subres[, 'alphaTilde Variance'])/subres[,'alpha']*100, xlab='Sample size', ylab='Percentage', main='percent SE of alphaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, sqrt(subres[, 'betaTilde Variance'])/subres[,'beta']*100, xlab='Sample size', ylab='Percentage', main='percent SE of betaTilde', type='b')
abline(h=0, lty=2, lwd=2)

1.4.15 Assessing estimators in the homework

  • I left more interesting problems for you to do in the homework.
  • There, you will compare two estimators for the same parameter, and pick which one you think is better.

1.4.16 MxIF example: Estimate

The estimates in the real data are \(\hat\mu=\)3985.4, \(\tilde\alpha=\)8.7, and \(\tilde\beta=\)0.002.

Code
bc = readRDS('../datasets/betaCatenin.rds')
alphaTilde = mean(bc)^2/var(bc)
betaTilde = mean(bc)/var(bc)


x = seq(from=min(bc), to=max(bc), length.out=1000)
y = dgamma(x, shape=alphaTilde, rate=betaTilde)
hist(bc, main='Beta Catenin Histogram', freq=FALSE, xlab='Marker count', ylim=range(y))
points(x, y, type='l')