1 Probability
1.1 Probability review
1.1.1 Objectives
In this meeting we will:
- Introductions
- Review some probability notation (e.g. PMF, Expectations)
- Use Bernoulli random variables to study proportions
- Define Estimators and learn to compute values assess estimators
1.1.2 Administrative stuff
- Introductions
- How did you get interested in statistics?
- What do you do for fun?
- Syllabus
1.1.3 Types of data
- Continuous
- Categorical
- Ordinal
Examples:
- Brain volume
- Diagnosis
- Symptom rating scale (1-7)
- Coin flip
- Proportion of heads in \(N\) flips
- Proportion of time spent sleeping each week
1.1.4 Random variables
Random variables are the key player in statistics and are often used to describe the process by which data arise.
1.1.4.1 Probability notation
- Blackboard P is probability, \(\mathbb{P}(X=x)\), means the probability the random variable \(X\) is equal to the nonrandom value \(x\).
- \(\sim\) “is distributed as”
- Probability mass function (PMF; loose definition is it assigns probabilities to values for a given random variable)
- Continuous/Discrete random variables
- A random variable does not have a value in itself… We don’t usually talk about \(X=0.5\), but \(\mathbb{P}(X=0.5)\).
1.1.4.1.1 Bernoulli example
- Something that takes only two values can be described with a Bernoulli random variable.
- Coin flip
- Diagnosis
- For \(X\sim \text{Be}(\pi)\): X is distributed as Bernoulli, with parameter \(\pi \in (0,1)\), e.g. \(\pi=0.25\).
- \(\mathbb{P}(X = x) = \pi^x(1-\pi)^{(1-x)}\). That is the probability mass function (PMF) of a Bernoulli random variable
- Common notation is \(f_X(x)\) is the PMF of the random variable \(X\). In this case \(f_X(x):=\pi^x(1-\pi)^{(1-x)}\).
- \(:=\) notation means “is defined by.”
1.1.4.2 Probability axioms
We have an intuitive understanding of the basic axioms of probability.
- An event \(E\) is something that can occur, e.g. head or tails \(\{0,1\}\).
- Let \(\Omega = {0,1}\) be union of all possible events
The (Kolmogorov) axioms are
- \(\mathbb{P}(X\in E) \ge 0\) – probability is positive.
- \(\mathbb{P}(X\in \Omega) = 1\) – probabilities sum to 1.
- For disjoint sets \(E_1 \cap E_2 = \varnothing\), \(\mathbb{P}(E_1 \cup E_2) = \mathbb{P}(E_1) + \mathbb{P}(E_2)\).
1.1.4.3 Why use probabilities for data?
STOPPED HERE
- Intuitive and communicable explanation?
- Table below from Chess in the air
1.1.4.4 Multinomial example
- A die is an example of a multinomial distribution
Code
X | P(X) |
---|---|
1 | \(\pi_1\) |
2 | \(\pi_2\) |
3 | \(\pi_3\) |
4 | \(\pi_4\) |
5 | \(\pi_5\) |
6 | \(\pi_6\) |
- Can also think about statements like \(\mathbb{P}(X\le x)\)
- This is a sum over probabilities corresponding to possible values of \(X\).
- Missing probabilities are called parameters
e.g. x = 2 \[ \sum_{i=1}^2 \mathbb{P}(X=i) = 1/3 \]
1.1.4.5 Wordle Multinomial example
- Multinomial can be used for distributions of other things
- E.g. Wordle scores. The probability of getting the Wordle in a certain number of guesses.
- Also 6 (or one more) possible values.
- Same family of distribution as the die, except with different parameters
\(Y=\)Number of tries | \(P(Y)\) |
---|---|
1 | 0 |
2 | 0.03 |
3 | 0.28 |
4 | 0.40 |
5 | 0.18 |
6 | 0.11 |
7 | ?? |
1.1.4.6 Normal example
The normal density has two parameters \(\mu, \sigma\) is \[ \mathbb{P}(X\le x) = \int_{-\infty}^x (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{1}{2\sigma^2}(z-\mu)^2 \right\}dz \]
Probabilities of sets \(\mathbb{P}(X \in E)\), where \(E = (-\infty, -1.96) \cup (1.96, \infty)\).
Probability \(X=x\)?
Examples:
- Multinomial – after a rolling a die, it has to land on one of them. Normal, \(E=(-\infty,\infty)\).
- If not disjoint, then adding some probabilities twice. Multinomial \(\{1,2,3 \}, \{4,3\}\)
1.2 CDFs, PDFs, quantile functions
1.2.2 PDFs and CDFs (and quantile functions)
1.2.2.1 PDFs
The PDF (probability density function) is the derivative of the CDF and often denoted with a lower case letter \(f(x)\). For discrete random variables the PDF is call the PMF (probability mass function).
1.2.2.2 PDFs continued
The PDF can conceptually be thought of as \(\mathbb{P}(X=x)\), for PMFs, that’s exactly what it is. But, for continuous random variables the probability they take any value is equal to zero. These PDFs are functions with parameters: \[ \begin{align*} \text{Normal: }& f(x; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2 \right\} \\ \text{Gamma: }& f(x; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x} \\ \text{Poisson: }& f(x; \lambda) = \frac{\lambda^k\ e^{-\lambda}}{k!} \\ \end{align*} \]
The CDF (below) and PDF are useful because we never know what values the random variable will take, so we can integrate the PDF or use the CDF to figure out where they are most likely to fall.
1.2.2.3 CDFs
The cumulative distribution function (CDF), \(F(x)\), for a random variable \(X\) is a function that satisfies \[ F(x) = \mathbb{P}(X \le x). \] They are usually functions of parameters. Here are some examples.
1.2.2.3.1 Human Connectome Project Brain volume example
- Human Connectome Project data
- Study designed to understand how regions of the brain are interconnected
Code
## [1] 510019.7
## [1] 53940.92
Code
For discrete random variables the derivative of the CDF does not exist because it is a step function, but the probability mass function is the amount the CDF jumps up at that location, heuristically we can define it as \[ f(x) = F(x+\Delta x) - F(x), \] for an infinitesimal value \(\Delta x\).
1.2.2.3.2 Wordle CDF example
\(Y=\)Number of tries | \(P(Y|\text{Simon})\) | \(P(Y|\text{Lillie})\) |
---|---|---|
1 | 0 | 0 |
2 | 0.03 | 0.05 |
3 | 0.28 | 0.24 |
4 | 0.40 | 0.44 |
5 | 0.18 | 0.19 |
6 | 0.11 | 0.08 |
Code
1.2.2.3.3 Human Connectome Project CDF example (continued)
- We can compute probabilities that people lie within a given region using the CDF.
Code
- Probability someone is between \(450,000\mathrm{mm}^3\) and \(450,000\mathrm{mm}^3\)
- Probability someone is less than \(400,000\mathrm{mm}^3\).
- probability someone is greater than\(500,000\mathrm{mm}^3\)
## [1] 0.6377898
## [1] 0.0206934
## [1] 0.0476453
1.2.2.4 Quantile functions
The quantile function is the inverse of the CDF.
- For a given probability \(p\), it spits out a value \(Q(p)\) such that that \(F(Q(p)) = p\).
- The interpretation is that for a given probability \(p\), There’s a \(p\) percent probability that the random variable will be below \(Q(p)\).
- Sometimes, it’s possible to find the quantile function explicitly, but many times it isn’t.
- Also can call these “percentiles” if you multiply them by 100.
1.2.3 Expectations
1.2.3.1 Overview
Many things we care about about a random variable are expectations. The expectation is an “operator” on a random variable, meaning it takes function of a random variable \(g(X)\) for a (somewhat) arbitrary function \(g\) and is defined by \[ \mathbb{E} g(X) = \int_{\mathcal{X}} g(x) p(x) dx, \] where \(p(x)\) is the density function of \(X\) and the \(\mathcal{X}\) subscript is to denote that we are integrating over the domain of values \(X\) can take.
1.2.3.2 Great expectations
The mean is the most common expectation \[ \mathbb{E} X = \int_{\mathcal{X}} x p(x) dx. \]
- The integral notation is in the sense of “real analysis” type integrals that can refer to sums or integrals. This is to emphasize that the definition is the same with continuous or discrete random variables.
Another great expectation is the variance \[ \text{Var}(X) = \mathbb{E} (X - \mathbb{E}X)^2 = \mathbb{E}X^2 - (\mathbb{E}X)^2 \] Properties of the variance are homework questions.
In general, \[ \mathbb{E}X^k \] is call the \(k\)th moment of \(X\).
1.2.3.3 Properties of expectations
- \(\mathbb{E}a X + b = a \mathbb{E}X + b\)
- \(\mathbb{E}\{ \mathbb{E}X\} = \mathbb{E}X\) (\(\mathbb{E}X\) is a constant)
1.2.3.4 Expected value for Bernoulli
- For the Bernoulli \(X\sim \text{Be}(p)\) the expectation is \[ \mathbb{E}X = \sum_{i=0}^1 x_i f_X(x_i) = 1 \times p + 0 \times (1-p) = p \]
1.2.3.5 Expected value for Multinomial
\(Y\) | \(P(Y)\) | \(P(Y)\) | \(P(Y)\) |
---|---|---|---|
1 | 1/4 | .1 | .7 |
2 | 1/4 | .2 | .1 |
3 | 1/4 | .3 | .1 |
4 | 1/4 | .4 | .1 |
1.2.3.6 Expected value for Wordle score
\(Y=\)Number of tries | \(P(Y)\) |
---|---|
1 | 0 |
2 | 0.03 |
3 | 0.28 |
4 | 0.40 |
5 | 0.18 |
6 | 0.11 |
1.2.3.7 Expected value for Poisson random variable
- Poisson random variables are useful for modeling skewed counts.
- E.g. number of drinks of alcohol in a week.
- E.g. score on a depression inventory or other psychology/scale.
\[ f(x; \lambda) = \frac{\lambda^x\ e^{-\lambda}}{x!} \]
- Expectation and Variance of Poisson distribution
1.2.3.8 Expected value for standard Normal distribution
Start with the standardized normal distribution \[\begin{align*} \mathbb{E}X & = \int_{-\infty}^\infty x \left(2\pi \right)^{-1/2} \exp\left\{-\frac{1}{2} x^2 \right\}\\ & = \left(2\pi \right)^{-1/2} \int_{0}^\infty \exp\{-u\} du - \left(2\pi \right)^{-1/2} \int_{0}^\infty \exp\{-u\} du \end{align*}\] where \(u = x^2/2\).
- This equals zero.
1.2.3.9 Expected value for Normal distribution
Start with the standardized normal distribution and use change of variables.
- For \(X\sim N(0,1)\), let \(Y = X\sigma + \mu\).
- What is \(\mathbb{E}Y\)?
- Then \[\begin{align*} \mathbb{P}(Y<z) = \mathbb{P}((X\sigma+\mu)<z) = \mathbb{P}(X< (z - \mu)/\sigma) \\ = \int_{-\infty}^{(z-\mu)/\sigma} \left(2\pi \right)^{-1/2} \exp\left\{-\frac{1}{2} x^2 \right\} dx\\ = \int_{-\infty}^{y} \left(2\pi \sigma^2\right)^{-1/2} \exp\left\{-\frac{1}{2\sigma^2} (w-\mu)^2 \right\} dw \end{align*}\]
- This is the CDF a \(N(\mu, \sigma^2)\).
1.3 Multivariate random variables
1.3.0.1 Overview
Let \(W = (X, Y) \in \mathcal{B}\) be a multivariate random variable. \(X\) and \(Y\) can be continuous or discrete. For example \(X,Y\) can be multivariate normal.
1.3.0.2 Example
For \(X\in\{1,2\}\) and \(Y \in \{1,2,3\}\)
- Joint probability table
- Marginal probabilities
- Conditional probabilities (intuitively)
Draw the 3 different tables.
Code
1 | 2 | |
---|---|---|
1 | \(p_{11}\) | \(p_{12}\) |
2 | \(p_{21}\) | \(p_{22}\) |
3 | \(p_{31}\) | \(p_{32}\) |
1.3.0.3 Example: Umbrella and raining
- A common example is whether someone brings an umbrella with them
It’s not raining | It’s raining | |
---|---|---|
I don’t bring umbrella | 0.05 | 0.17 |
I bring umbrella | 0.35 | 0.43 |
Another example
It’s not precipitating | It’s raining | It’s snowing | |
---|---|---|---|
I don’t bring umbrella | 0.03 | 0.17 | 0.02 |
I bring umbrella | 0.35 | 0.40 | 0.03 |
1.3.1 Distribution functions
The distribution function for multivariate random variables is often written in terms of subsets \(B \subset \mathcal{B}\) \[ F_W(B) = \mathbb{P}(W \in B). \]
If \(W\) has a density then we can also write for values \(x\) and \(y\) (in the appropriate domain) \[ \mathbb{P}(W \in B) = \int_{B} f_W(x, y)dx dy, \] where \(f_W\) is the PDF or PMF of \(W = (X,Y)\).
1.3.2 Marginal probability
- The marginal probability, \(\mathbb{P}(Y=y)\) is the probability for \(Y\) ignoring whatever is going on with \(X\).
- This can be found by integrating out the \(X\) variable. \[ f_Y(y) = \int_\mathcal{X} f_{X,Y}(x, y)dx \]
- Joint multinomial above to demonstrate this.
1.3.3 Conditional distribution
Conditional distributions can be written in terms of the underlying probability space. For sets \(A, B \subset \mathcal{B}\) \[ \mathbb{P}(A \mid B ) = \frac{\mathbb{P}( A \cap B)}{\mathbb{P}(B)} \]Conceptually, the denominator adjusts the numerator for the fact that we are only considering events that include B occurring. This makes the conditional probabilities sum to 1.
For multivariate random variables the definition is similar \[ \mathbb{P}(Y=y \mid X=x) = \frac{\mathbb{P}(X=x, Y=y)}{\mathbb{P}(X=x)} \]
- Can think of the Venn diagram in terms of the joint random variable falling into the set.
- Probability I bring and umbrella given that it’s raining
- Probability it’s raining given it’s not precipitating (Conditioning on itself)
1.3.4 Examples
1.3.4.1 Insurance dataset children and region
- The insurance dataset is in the
RESI
R package, which you can install with the following code
- There are some cool multivariate distributions in this data set.
- Consider the probability of having a certain number of children living in a certain area.
Code
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
northeast | 0.11 | 0.06 | 0.04 | 0.03 | 0.01 | 0.00 |
northwest | 0.10 | 0.06 | 0.05 | 0.03 | 0.00 | 0.00 |
southeast | 0.12 | 0.07 | 0.05 | 0.03 | 0.00 | 0.00 |
southwest | 0.10 | 0.06 | 0.04 | 0.03 | 0.01 | 0.01 |
Let \(X\) be a random variable denoting number of children
- Is it categorical/continuous/ordinal?
Let \(Y\) be a random variable denoting region
- Is it categorical/continuous/ordinal?
joint distribution is a function of both variables \(\mathbb{P}(X=x, Y=y) = f(x,y)\).
Questions about regional differences and number of children can be determined from the table
What is the probability of having zero kids and living in the east?
What is the probability of having zero kids?
1.3.4.2 Insurance data multivariate distribution of age and charges
Code
- Relationship between age and charges in the insurance data
Code
- The scatter plot can be thought of as a dataset version of a joint density \(f_{X,Y}(x,y)\)
- We can get the conditional histograms for a few different age groups
Code
par(mar=c(8,2.8,1.8,.2), mgp=c(6,.7,0))
nbin = 15
insCat = with(insurance, data.frame(age=cut(age, breaks = seq(min(age), max(age), length.out=7), include.lowest=T), charges=cut(charges/1000, breaks = seq(min(charges/1000), max(charges/1000), length.out=nbin), include.lowest=T) ) )
condTab = do.call(cbind, by(insCat$charges, insCat$age, function(x){res=table(x); res/sum(res)}))
barplot(condTab, beside=TRUE, col=rep(cols[1:ncol(condTab)], each=nbin-1),
names.arg = rep(rownames(condTab), ncol(condTab) ), las=2, xlab='Charges (in thousands)', main='Conditional frequencies')
legend('topright', fill=cols[1:ncol(condTab)], legend=colnames(condTab), bty='n')
Code
par(mar=c(8,2.8,1.8,.2), mgp=c(6,.7,0))
nbin = 15
jointTab = do.call(cbind, by(insCat$charges, insCat$age, function(x){res=table(x); res}))
jointTab = jointTab/sum(jointTab)
barplot(jointTab, beside=TRUE, col=rep(cols[1:ncol(jointTab)], each=nbin-1),
names.arg = rep(rownames(jointTab), ncol(jointTab) ), las=2, xlab='Charges (in thousands)', main='Joint frequencies')
legend('topright', fill=cols[1:ncol(jointTab)], legend=colnames(jointTab), bty='n')
1.3.5 Conditional means and variances
- Once you have conditional probabilities, you can do all the same stuff with it that you would do with a regular random variable
- For example, \(\mathbb{E}(Y \mid X) = \sum_{y} y f_{y\mid x}(y \mid x)\)
Conditional expectation is just an expectation with respect to a conditional distribution. \[ \begin{align*} \mathbb{E}(Y \mid X=x) & = \int_{\mathcal{Y}} yf_{Y\mid X=x}(y\mid x)dy \\ & = \int_{\mathcal{Y}} yf_{Y,X}(y, x)/f(x)dy \\ & = \int_{\mathcal{Y}} yf_{Y,X}(y, X)/f(X)dy \end{align*} \]
It can still be thought of as random with respect to the conditioning variable.
\(P(X=1)=.1\) | \(P(X=2) = .7\) | \(P(X=3)=.2\) | |
---|---|---|---|
\(Y\) | \(P(Y|X=1)\) | \(P(Y| X=2)\) | \(P(Y| X=3)\) |
1 | 1/4 | .1 | .7 |
2 | 1/4 | .2 | .1 |
3 | 1/4 | .3 | .1 |
4 | 1/4 | .4 | .1 |
1.3.6 Properties of expectation for two random variables
Expectation is linear because it is an integral. For example, for constants \(a,b\) and random variables \(X,Y\) \[ \mathbb{E}(aX + bY) = a \mathbb{E} X + b\mathbb{E}Y \]
1.3.6.1 Example: discrete and continuous joint distribution
- Diagnosis and hippocampus volume in Alzheimer’s disease.
- \(P(DX)\) is multinomial, \(P(Hipp\mid DX)\) assumed to be normal.
Code
adni = readRDS('../datasets/adni/hippocampus.rds')
tab = round(do.call(rbind, by(adni$LEFTHIPPO, adni$DX, function(x) c(mean(x), var(x)) )), 0)
pX = table(adni$DX); pX = as.data.frame(round(pX/sum(pX),2)); names(pX) = c('X', 'P(X)')
colnames(tab) = c('Mean', 'Variance')
knitr::kable(pX, row.names = FALSE, escape = FALSE )
X | P(X) |
---|---|
AD | 0.26 |
HC | 0.29 |
MCI | 0.45 |
Mean | Variance | |
---|---|---|
AD | 1549 | 109794 |
HC | 2122 | 89241 |
MCI | 1803 | 131030 |
1.3.7 Independence
- Two random variables are independent if they factor \(\mathbb{P}(X=x, Y=y) = \mathbb{P}(X=x)\mathbb{P}(Y=y)\) for all possible values of \(x\) and \(y\).
- Intuitively, this means that fixing one variable doesn’t affect the distribution of the other.
- Another way to express it is that \(\mathbb{P}(X=x\mid Y=y) = \mathbb{P}(X=x)\).
- The table is pretty close to independent.
- Conditional probabilities are \(\mathbb{P}(\text{N kids} | \text{Region})\).
- This means that they are independent if their CDFs or PDFs factor!
If the table above were perfectly independent, it might look like this:
Code
0 | 1 | 2 | 3 | 4 | 5 | |
---|---|---|---|---|---|---|
northeast | 0.104 | 0.059 | 0.043 | 0.028 | 0.005 | 0.003 |
northwest | 0.104 | 0.059 | 0.044 | 0.029 | 0.005 | 0.003 |
southeast | 0.117 | 0.066 | 0.049 | 0.032 | 0.005 | 0.004 |
southwest | 0.104 | 0.059 | 0.044 | 0.029 | 0.005 | 0.003 |
Code
Code
- Are charges and smoking independent?
1.3.8 Covariance and dependence
Expand this and derive the other formula \[ \text{Cov}(X, Y) = \mathbb{E}(X-\mathbb{E}X)(Y-\mathbb{E}Y) \]
Correlation is \[ \begin{align*} \text{Cor}(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} \in [-1,1] \end{align*} \]
- Covariance implies dependence, but dependence does not imply covariance (homework question).
- Independence implies zero covariance, but dependence does not imply nonzero covariance.
- The covariance can be zero between two random variables, but they can still be dependent.
- Covariance summarizes the linear dependence between two random variables.
1.3.9 Pearson’s correlation
- Sampling two measurements on each subject \(X_i\) \(Y_i\).
Correlation is a normalized variance \[ \begin{align*} \rho = \text{Cor}(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}} \end{align*} \]
- It takes values between -1 and 1.
- negative values means they two variables have a negative (approximately) linear relationship
- positive means they have a positive (approximately) linear relationship
- Correlation is not slope.
- A slope of zero implies no correlation
Can estimate correlation with
\[ \begin{align*} \hat\rho = \frac{\widehat{\text{Cov}}(X,Y)}{\sqrt{\widehat{\text{Var}}(X)\widehat{\text{Var}}(Y)}} = \frac{\sum_{i=1}^n(X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i=1}^n(X_i - \bar X)^2\sum_{i=1}^n(Y_i - \bar Y)^2}} \end{align*} \]
1.3.10 Properties of variances of sums
Derive the variance of this
- \(\text{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\text{Cov}(X, Y)\)
1.3.11 Law of total expectation
The law of total expectation is relatively simple, but super handy. \[ \mathbb{E}\{\mathbb{E}(X \mid Y)\} = \mathbb{E}X \] If you are trying to take a mean of a complicated random variable, then you can condition on part of it first and then take the expectation of the part you conditioned on.
Example: continue from last table
Example: Use diagnosis and hippocampal volume above
Example from Rice:
1.4 Parameters, estimates, and estimators
1.4.1 Objectives
- Talk about parameters, estimates, and estimators
- Talk about how to assess estimators by comparing their variance and bias
1.4.2 Example: healthy sleep
- For this example let’s look at the sleep questionnaire from the HCP
- What proportion of people get healthy sleep (assuming 7-9 hours).
Code
1.4.3 Statistics is a way to learn about the world from a data set
All three statistical philosophies we will discuss view research questions as trying to learn about an unknown parameter.
For now, let’s work with the question, “What is the proportion of people in the united states who get a healthy amount of sleep?”
- What is my parameter of interest (target parameter)?
- Descriptively – proportion of people who get a healthy amount of sleep
- Mathematically – call this value \(p\)
- What is my population of interest?
- Descriptively – The United States population
- Quantitatively two possible things
- Distribution defined by the model below in point 4. \(\text{Be}(p)\)
- Unknown statistical distribution that gives rise to whether someone reports between 7-9 hours of sleep.
- What is a data point?
- An answer from an individual in the population to the survey question. Coded as 1/0 if they do/don’t get 7-9 hours of sleep.
- What model can I use to describe how I get a data point?
- Most obvious one is \(X_i \sim \text{Be}(p)\), where \(p\) is the parameter in point 1.
1.4.4 A random sample
A random sample is a collection of independent random variables that represent potential data points.
Let \(X_i \sim \text{Be}(p)\) for \(i=1,\ldots, n\).
Our assumption about the population \(X_i\) is drawn from connects the data to the parameter of interest.
Given our data, we have a parameter and estimate of the parameter. What are they?
1.4.5 Estimates
Note, an estimate is a function of the observed sample and it is nonrandom (Why is it non random?). We often use lowercase letters to make that clear \[ \bar x = n^{-1} \sum_{i=1}^n x_i. \]
1.4.6 Estimators
An estimator is a function of a random sample. The goal (usually) is to estimate a parameter of interest (here, \(p\)). Let’s consider the estimator \[ \hat p = \bar X = n^{-1} \sum_{i=1}^n X_i, \] which we know is pretty reasonable since \(\mathbb{E}\bar X = p\).
- We can study the properties of estimators to learn about how they behave.
- If we like an estimator we can compute an estimate using our sample.
- We can use features of our estimator to make probabilistic statements about what might happen if we repeat our study. Dataset-to-dataset variability
- This perspective is considered to be the Frequentist philosophy of Statistics.
1.4.6.1 Properties of estimators
Some common metrics are
Bias: let \(p\) denote our target parameter, bias is defined as \[ \mathbb{E}(\hat p - p). \]
Variance: this is the same as for other random variables \[ \mathbb{E}(\hat p - \mathbb{E}\hat p)^2 \]
Mean Squared Error: this combines variance and bias: \[ \mathbb{E}(\hat p - p)^2 \]
Consistency: this describes what happens to the estimator as \(n\to\infty\) \[ \hat p \to_P p \]
The arrow is convergence in probability. More on this later.
Asymptotically unbiased: is a less strong statement of what happens to the estimator as \(n\to\infty\) \[ \mathbb{E}\hat p \to p \]
1.4.6.2 Bias, variance, and MSE of the proportion estimator
- The bias of the proportion estimator (used to estimate the proportion of people who get enough sleep) is: \[ \mathbb{E}( \hat p - p) = 0 \]
- The variance of the estimator is \[ \begin{align*} \text{Var}(\hat p) & = n^{-2} \sum_{i=1}^n\text{Var}(X_i) \\ & = n^{-1} p(1-p) \end{align*} \]
- The MSE is the same as the variance since \(\hat p\) is unbiased.
We will come back to this example to construct confidence intervals later.
1.4.7 HCP: example of bias, variance, and MSE
- In HCP data
Code
## [1] 0.5998757
## [1] 0.002404844
## [1] 0.0024
1.4.8 Parameters, Estimators, Estimates
To reiterate:
- Parameter – Target unknown feature of a population (nonrandom)
- Estimate – Value computed from observed data to approximate the parameter (nonrandom)
- Estimator – A function of a random sample to approximate the parameter
In statistics, probability is used to define and describe the behavior of estimators.
- Often, the distribution of an estimator is makes it too hard to find the bias, variance, and MSE of the estimator.
- In this case, we use simulations to estimate the bias, variance, and MSE of an estimator.
- Important concepts for this session:
- Simulations can be used to estimate the bias and variance when it is too hard to find mathematically.
- Sometimes, the parameter of interest is not distributional parameter.
1.4.9 Another example: multiplexed immunofluorescence (mIF) microscopy
- mIF imaging uses antibody markers to identify protein in tissue samples.
- The protein Beta-catenin is known to be higher in tumor cells.
1.4.10 mIF imaging in a statistical framework
Suppose we want to ask the question, “What is the mean concentration of Beta-catenin in the cells from a given tissue sample?”
- What is my parameter of interest (target parameter)?
- Mean cellular concentration of Beta-catenin \(\mathbb{E}X_i\)
- What is my population of interest?
- All possible cells from the given tumor (potentially)?
- All possible cells from similar tumors?
- What is a data point?
- A single cell
- What model can I use to describe how I get a data point?
- That’s a good question. Here is the histogram of the data
Code
+ The Beta-catenin concentration across cells could be modeled with a Gamma distribution.
1.4.11 A Gamma model for Beta-catenin concentration
Previously, we defined the parameter of interest \(p\) based on what we were interested in the sleep problem and it also happened to be the parameter of the Bernoulli distribution.
In this example that is not the case.
Let \(X_i\) denote a randomly drawn cell from the tissue image.
Assume \(X_i \sim \text{Gamma}(\alpha, \beta)\).
Let’s consider three parameters
- \(\mathbb{E}X_i = \alpha/\beta\)
- \(\alpha\) (shape parameter)
- \(\beta\) (rate parameter)
Let’s assess estimators for each of these parameters.
1.4.12 MxIF example: Estimators
I used method of moments (which you’ll learn in another class) to obtain estimators for these parameters. The estimators are \[ \begin{align*} \hat \mu & = \bar X = n^{-1} \sum_{i=1}^n X_i \\ \tilde \alpha & = \frac{\bar X^2}{\left(\overline{X^2} - \bar X^2\right)} \\ \tilde \beta & = \frac{\bar X}{\left(\overline{X^2} - \bar X^2\right)} \end{align*} \]
- \(\tilde \alpha\) and \(\tilde \beta\) are complicated functions of random variables involving ratios.
- It might be hard to find their bias. It will definitely be hard to find their variance.
- Instead, let’s use simulations to assess the bias, variance, and MSE of these estimators.
1.4.13 Concept of simulation study
- The bias, variance, and MSE are defined with respect to the distribution of the statistic across repeated samples of the data.
- If I can repeat the experiment in
R
multiple times, then I can see multiple random versions of the estimator. - A simulation is an experiment to study what happens across experiments.
1.4.14 Simulation study
Code
set.seed(100)
# number simulation
nsim = 10000
# sample sizes
ns = c(10, 25, 50, 100, 200, 500)
# values of alpha to consider
alphas = c(0.5, 5)
# values of beta to consider
betas = c(0.25, 3)
# presults = data.frame(p=ps, biasmuHat = rep(NA, np), biassigmaSqHat=rep(NA, np), biassigmaSqHat2=rep(NA, np),
# varmuHat=rep(NA, np), varsigmaSqHat=rep(NA, np), varsigmaSqHat2=rep(NA, np))
presults = expand.grid(alpha = alphas, beta=betas, n=ns)
colNames = paste(rep(c('muHat', 'alphaTilde', 'betaTilde'), each=3),
rep(c('Bias', 'Variance', 'MSE'), 3))
presults[, colNames ] = NA
alphaind = 1; betaind = 2; nind = 3
# loops through the parameter values
for(alphaind in 1:length(alphas) ){
for(betaind in 1:length(betas)){
# data generating distribution parameters (for the gamma distribution)
alpha = alphas[alphaind]
beta = betas[betaind]
for(nind in 1:length(ns)){
# get n for this simulation setting
n = ns[nind]
cat(which(presults$n==n & presults$alpha==alpha & presults$beta == beta), '\t')
# loops through the simulations
# each simulation is one realization of the world
results = data.frame(muHat = rep(NA, nsim), alphaTilde=rep(NA, nsim), betaTilde=rep(NA, nsim) )
for(sim in 1:nsim){
# sample a data set of size n from a random variable whose distribution is determined by alpha and beta
x = rgamma(n, shape=alpha, rate=beta)
# compute estimators
muHat = mean(x)
alphaTilde = muHat^2 / var(x)
betaTilde = muHat / var(x)
#sigmaSqHat = sum((x-mean(x))^2)/(length(x)-1)
results[sim, c('muHat', 'alphaTilde', 'betaTilde') ] = c(muHat, alphaTilde, betaTilde)
} # end of the nsim loop
# saving the results for each value of p
#presults[pind, c('biasmuHat', 'biassigmaSqHat', 'biasSigmaSqHat2')] = colMeans(results) - c(p, p*(1-p), p*(1-p))
#presults[pind, c('varmuHat', 'varsigmaSqHat', 'varSigmaSqHat2')] = diag(var(results))
# Bias variance MSE for muHat
presults[ which(presults$n==n & presults$alpha==alpha & presults$beta == beta), c('muHat Bias', 'muHat Variance', 'muHat MSE')] = c(mean(results$muHat) - alpha/beta,
var(results$muHat),
(mean(results$muHat) - alpha/beta)^2 + var(results$muHat) )
presults[ which(presults$n==n & presults$alpha==alpha & presults$beta == beta), c('alphaTilde Bias', 'alphaTilde Variance', 'alphaTilde MSE')] = c(mean(results$alphaTilde) - alpha,
var(results$alphaTilde),
(mean(results$alphaTilde) - alpha)^2 + var(results$alphaTilde) )
presults[ which(presults$n==n & presults$alpha==alpha & presults$beta == beta), c('betaTilde Bias', 'betaTilde Variance', 'betaTilde MSE')] = c(mean(results$betaTilde) - beta,
var(results$betaTilde),
(mean(results$betaTilde) - beta)^2 + var(results$betaTilde) )
} # loop through ns
} # loop through the betas
} # loop through the alphas
## 1 5 9 13 17 21 3 7 11 15 19 23 2 6 10 14 18 22 4 8 12 16 20 24
Code
subres = presults[ presults$alpha==0.5 & presults$beta ==0.25,]
layout(matrix(1:6, nrow=2, byrow=TRUE))
# Bias plots
plot(subres$n, subres[, 'muHat Bias'], xlab='Sample size', ylab='Bias', main='Bias of muHat', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'alphaTilde Bias'], xlab='Sample size', ylab='Bias', main='Bias of alphaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'betaTilde Bias'], xlab='Sample size', ylab='Bias', main='Bias of betaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'muHat Variance'], xlab='Sample size', ylab='Variance', main='Variance of muHat', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'alphaTilde Variance'], xlab='Sample size', ylab='Variance', main='Variance of alphaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'betaTilde Variance'], xlab='Sample size', ylab='Variance', main='Variance of betaTilde', type='b')
abline(h=0, lty=2, lwd=2)
Code
subres = presults[ presults$alpha==5 & presults$beta ==3,]
layout(matrix(1:6, nrow=2, byrow=TRUE))
# Bias plots
plot(subres$n, subres[, 'muHat Bias'], xlab='Sample size', ylab='Bias', main='Bias of muHat', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'alphaTilde Bias'], xlab='Sample size', ylab='Bias', main='Bias of alphaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'betaTilde Bias'], xlab='Sample size', ylab='Bias', main='Bias of betaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'muHat Variance'], xlab='Sample size', ylab='Variance', main='Variance of muHat', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'alphaTilde Variance'], xlab='Sample size', ylab='Variance', main='Variance of alphaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'betaTilde Variance'], xlab='Sample size', ylab='Variance', main='Variance of betaTilde', type='b')
abline(h=0, lty=2, lwd=2)
- Percent bias instead of raw bias
Code
subres = presults[ presults$alpha==0.5 & presults$beta ==0.25,]
layout(matrix(1:6, nrow=2, byrow=TRUE))
# Bias plots
plot(subres$n, subres[, 'muHat Bias']/(subres[,'alpha']/subres[,'beta']), xlab='Sample size', ylab='Bias', main='Percent bias of muHat', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'alphaTilde Bias']/subres[,'alpha']*100, xlab='Sample size', ylab='Bias', main='Percent bias of alphaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, subres[, 'betaTilde Bias']/subres[,'beta']*100, xlab='Sample size', ylab='Bias', main='Percent bias of betaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, sqrt(subres[, 'muHat Variance'])/(subres[,'alpha']/subres[,'beta'])*100, xlab='Sample size', ylab='Percentage', main='Percent SE of muHat', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, sqrt(subres[, 'alphaTilde Variance'])/subres[,'alpha']*100, xlab='Sample size', ylab='Percentage', main='percent SE of alphaTilde', type='b')
abline(h=0, lty=2, lwd=2)
plot(subres$n, sqrt(subres[, 'betaTilde Variance'])/subres[,'beta']*100, xlab='Sample size', ylab='Percentage', main='percent SE of betaTilde', type='b')
abline(h=0, lty=2, lwd=2)
1.4.15 Assessing estimators in the homework
- I left more interesting problems for you to do in the homework.
- There, you will compare two estimators for the same parameter, and pick which one you think is better.
1.4.16 MxIF example: Estimate
The estimates in the real data are \(\hat\mu=\)3985.4
, \(\tilde\alpha=\)8.7
, and \(\tilde\beta=\)0.002
.
Code
bc = readRDS('../datasets/betaCatenin.rds')
alphaTilde = mean(bc)^2/var(bc)
betaTilde = mean(bc)/var(bc)
x = seq(from=min(bc), to=max(bc), length.out=1000)
y = dgamma(x, shape=alphaTilde, rate=betaTilde)
hist(bc, main='Beta Catenin Histogram', freq=FALSE, xlab='Marker count', ylim=range(y))
points(x, y, type='l')