Class 2

Author

Simon Vandekar

Objectives:

  1. Learn some basic concepts about probabilities
  2. Learn to use probability to think about data
  3. But first, we will recap some statistica jargon and practice using it to describe the data.

Some common statistical jargon so far

Types of variables

  • Categorical variable - a variable that takes discrete values, such as diagnosis, sex,
  • Continuous variable - a variable that can take any values within a range, such as age, blood pressure, brain volume.
  • Ordinal variable - a categorical variable that can be ordered, such as age category, Likert scales, Number of days, Number of occurrences.

Distribution/Density features

  • Mode/ Non-modal/Unimodal/ Bimodal - a peak in a density.
    • Non-modal - no peak
    • Unimodal - one peak
    • Bimodal - two peaks
  • Quantile - the pth quantile, q_p divides the distribution such that p% of the data are below q_p.
    • Median - 50th quantile of a distribution
  • Skew/Skewness - the amount of asymmetry of a distribution
    • Symmetric distribution - a distribution with no skew
  • Kurtosis - the heavy tailed-ness of a distribution.

Bivariate distribution features

  • Correlation/correlated - two variables are linearly dependent
  • Dependent - two or more variables are related to each other in some way
  • Independent - two variables that are not related to each other

Practice (15 min)

Below are some questions related to some of the material we’ve covered so far. We’ll look at an insurance dataset with the amount of charges as a function of health and demographic features.

Insurance companies are often interested in predicting charges so that they can know who high-risk individuals are based on easily measurable demographic information.

  • For each of the variables in the table below describe which are continuous, ordinal, or categorical. They might be one or more of these things.
    • Age -
    • Sex -
    • BMI (body mass index) -
    • Children -
    • Smoker -
    • Region -
    • Charges -
Code
library(RESI)
Registered S3 method overwritten by 'clubSandwich':
  method    from    
  bread.mlm sandwich
Code
library(reactable)
reactable(insurance)

The table below summarizes the variables in the dataset for nonsmokers smokers. For each of the variables in the table below, comment on the skewness and whether the variables look similar or different between smokers and nonsmokers.

Code
insurance$smoker = factor(insurance$smoker, labels = c('Nonsmoker', 'Smoker'))
label(insurance$smoker) = 'Smoking Status'
table1(~ age + bmi + children + sex + region + charges | smoker, data=insurance)
Nonsmoker
(N=1064)
Smoker
(N=274)
Overall
(N=1338)
age
Mean (SD) 39.4 (14.1) 38.5 (13.9) 39.2 (14.0)
Median [Min, Max] 40.0 [18.0, 64.0] 38.0 [18.0, 64.0] 39.0 [18.0, 64.0]
bmi
Mean (SD) 30.7 (6.04) 30.7 (6.32) 30.7 (6.10)
Median [Min, Max] 30.4 [16.0, 53.1] 30.4 [17.2, 52.6] 30.4 [16.0, 53.1]
children
Mean (SD) 1.09 (1.22) 1.11 (1.16) 1.09 (1.21)
Median [Min, Max] 1.00 [0, 5.00] 1.00 [0, 5.00] 1.00 [0, 5.00]
sex
female 547 (51.4%) 115 (42.0%) 662 (49.5%)
male 517 (48.6%) 159 (58.0%) 676 (50.5%)
region
northeast 257 (24.2%) 67 (24.5%) 324 (24.2%)
northwest 267 (25.1%) 58 (21.2%) 325 (24.3%)
southeast 273 (25.7%) 91 (33.2%) 364 (27.2%)
southwest 267 (25.1%) 58 (21.2%) 325 (24.3%)
charges
Mean (SD) 8430 (5990) 32100 (11500) 13300 (12100)
Median [Min, Max] 7350 [1120, 36900] 34500 [12800, 63800] 9380 [1120, 63800]

The plots below summarize each of the variables separately using histograms with densities.

  • For each of the histograms below comment on:
    • Symmetry
    • Skewness
    • Mode(s)
Code
par(mfrow=c(2,2))
invisible(sapply(names(insurance)[sapply(insurance, is.numeric)], function(name){ hist(insurance[,name], xlab=name, main=name, freq=FALSE); lines(density(insurance[,name], adjust=1.5), lwd = 2) } ))

For each of the plots comment on the correlation and dependence of the variables with the number of charges.

Code
par(mfrow=c(2,3))
xvars = names(insurance)[1:6]
insurance$sex = factor(insurance$sex)
insurance$region = factor(insurance$region)
invisible(sapply(xvars, function(xvar){
  if(!is.factor(xvar)){
  plot(insurance[,xvar], insurance$charges, ylab='Charges', xlab=xvar, main=paste(xvar, 'and', 'Charges') )
  # remove missing values for lowess fit
  subset = na.omit(insurance[, c(xvar, 'charges')])
  lines(lowess( subset[,xvar], subset$charges), col='blue', lwd=2)
  } else {
   barplot(as.formula(paste('charges ~', xvar)), data=insurance)
  }
}))

Practice with plotting

  • It looks like for some of the plots above, there are multiple modes or clusters. Which ones? What could be driving these different clusters?
  • Create a scatter plot coloring the points for age and charges and BMI and charges by smoking status, to see if smokers fall in a particular cluster.

Why probability

  • Probability is the language of statistics.
  • It is how we can understand data and evaluate the quantitative methods we use.

Important tools in probability

Dependence/Independence Discrete random variables Continuous random variables Expected values/Averages CDF, PDF

Coin flips and diagnoses

  • Something that takes only two values can be described with a Bernoulli random variable.
    • yes/no categories, e.g. smoking
    • Diagnosis: schizophrenia or not, COVID+ or not
Bernoulli values x probabilities
0 (nonsmoker) 1-p
1 (smoker) p
  • What is a sensible value for p for the different Bernoulli variables above?

Random variables and probability notation

  • For X\sim \text{Be}(p): X is distributed as (\sim means “is distributed as”) Bernoulli, with parameter p \in (0,1), e.g. p=0.25.
  • Blackboard or capital P is probability, \mathbb{P}(X=x), means the probability the random variable X is equal to the nonrandom value x.
  • \mathbb{P}(X = x) = p^x(1-p)^{(1-x)}. That is the probability mass function (PMF) of a Bernoulli random variable
  • Common notation is f_X(x) is the PMF of the random variable X. In this case f_X(x):=p^x(1-p)^{(1-x)}.
  • := notation means “is defined by.”
  • A random variable does not have a value in itself… We don’t usually talk about X=0.5, but \mathbb{P}(X=0.5).

Some questions:

  • For X\sim \text{Be}(p), what is the probability that X is 100?
  • What is the probability X is 0.5?
  • What is the probability X is 0?

Random sample to estimate smoking proportion

  • Let’s say the smoking status of each person in the insurance dataset is a random sample from X_i \sim \text{Be}(p), for i=1, \ldots, 1338.
  • What is a reasonable value for p?
  • How can we estimate what p is?

Multiple groups, Multinomial distribution

  • The multinomial can be used for distributions with multiple categories.
    • Die rolls, Wordle number of guesses
    • Likert scales (e.g. PANSS in schizophrenia), Tanner staging (puberty; 1-5), Cancer staging (0-4)
  • Number of children.
  • Region.
  • Let’s fill in this table together
y=Number of children \mathbb{P}(Y=y)
0 ___
1 ___
2 ___
3 ___
4 ___
5 ___
6+ ___??
  • What is the probability someone in the US has more than 5 children?
  • What is the estimated probability someone in the US has more than 5 children?
  • Write out the table for region and estimate the probabilities from the dataset.

Probability density function (PDF)

  • The probability density function f_X(x) for a random variable X can be thought of as \mathbb{P}(X=x).
  • It is called the probability mass function (PMF) for discrete random variables.

Density functions.

Number of children probability density function (PDF)

  • Let’s draw the estimated PDF for smoking status.
  • Let’s draw the estimated PDF for the number of children.
  • Let’s draw the estimated PDF for region.

Body mass index

From the insurance dataset

  • BMI is a continuous random variable, \mathrm{BMI} = \mathrm{weight}/\mathrm{height}^2

  • Here’s a plot from the insurance dataset.

  • It looks approximately normally distributed (not quite).

  • I’ve drawn the normal density on top

Code
library(RESI)
lims = hist(insurance$bmi, freq = FALSE, ylim=c(0, .08), main='BMI Histogram', xlab='BMI')
x = seq(min(lims$mids), max(lims$mids), length.out=100)
lines(x, dnorm(x, mean = mean(insurance$bmi),sd=sd(insurance$bmi)))