6 Categorical data

This chapter catches some of the things I think you will need that we haven’t covered yet.

  • Contingency tables
  • Nonparametric tests

6.1 Contingency tables

Today we will:

  • Introduce Pearson’s chi-squared test for two-way tables
    • Extend to larger tables
  • Case control sampling
  • Cohort study sampling
  • Introduce contingency table concepts
    • Risk difference
    • Relative Risk
    • Odds ratio
  • A contingency table is an array of counts associated with one or more categorical or ordinal variables.
  • The number of dimensions of the table depends on the number of variables and the size of each dimension depends on the number of categories.
Code
set.seed(1333)
puf = readRDS('../datasets/nsduh/puf.rds')
knitr::kable(matrix(c('a', 'b', 'c', 'd'), nrow=2, dimnames=list(c('X=0', 'X=1'), c('Y=0', 'Y=1') ) ) ) 
      Y=0   Y=1
X=0   a     c
X=1   b     d

6.1.1 Example: Marijuana and LSD use

Code
mjlsdtab = with(puf, table(mrjflag, lsdflag))
pander::pander(descr::CrossTable(mjlsdtab, prop.r = FALSE, prop.c = FALSE, prop.chisq = FALSE))

Cell contents: N, Total(%)

             lsdflag
mrjflag            no        yes    Total
-----------------------------------------
no              33033         75    33108
             58.0575%    0.1318%
-----------------------------------------
yes             19281       4508    23789
             33.8876%    7.9231%
-----------------------------------------
Total           52314       4583    56897

6.2 Cohort sampling

Code
# Cohort sampling
n = 200 # number in each group
dat = puf[sample(nrow(puf), size = n), c('mrjflag', 'lsdflag')]
Cohortmjlsdtab = with(dat, table(mrjflag, lsdflag))
pander::pander(descr::CrossTable(Cohortmjlsdtab, prop.r = TRUE, prop.t=FALSE, prop.c = FALSE, prop.chisq = FALSE))

Cell contents: N, Row(%)

             lsdflag
mrjflag            no        yes      Total
-------------------------------------------
no                117          0        117
            100.0000%    0.0000%   58.5000%
-------------------------------------------
yes                63         20         83
             75.9036%   24.0964%   41.5000%
-------------------------------------------
Total             180         20        200
  • The way proportions are computed matters.
    • If we are sampling from a cohort study, then no margins are fixed. The cells can be represented by joint or conditional probabilities.
    • If we are sampling using a case-control study, then a single margin is fixed. The cells should be represented with conditional probabilities.
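To make the joint-versus-conditional distinction concrete, here is a small Python sketch (the notes use R; Python is used here only for illustration) applied to the cohort-sampled table above:

```python
# Cohort-sampled 2x2 table from above: rows = mrjflag (no, yes),
# columns = lsdflag (no, yes)
table = [[117, 0],
         [63, 20]]
N = sum(sum(row) for row in table)

# Joint proportions: cell count / grand total
# (meaningful under cohort sampling, where no margin is fixed)
joint = [[c / N for c in row] for row in table]

# Row-conditional proportions: cell count / row total
conditional = [[c / sum(row) for c in row] for row in table]

print(joint)        # estimates of P(row = j, column = k)
print(conditional)  # estimates of P(column = k | row = j)
```

The conditional proportions here reproduce the Row(%) entries in the table above (e.g. 20/83 ≈ 24.1%).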

6.3 Case-control sampling

  • In case-control sampling, the data are sampled conditional on the outcome.
  • For example, I might sample patients and controls in equal proportions, regardless of how common the outcome is.
  • Case-control studies sample conditional on one variable (e.g. diagnosis, or MJ use).
Code
# case control sampling
n = 200 # number in each group
mrjflag = factor(c(sample(puf$mrjflag[ puf$lsdflag=='no'], n), # samples mrj use for 200 people who have never used LSD
                                  sample(puf$mrjflag[ puf$lsdflag=='yes'], n) ), # samples mrj use for 200 people who have used LSD
                                levels = c('no', 'yes'), labels=c('no', 'yes') )
dat = data.frame(lsdflag=rep(c('no', 'yes'), each=n),
                 mrjflag=mrjflag )
CCmjlsdtab = with(dat, table(mrjflag, lsdflag))
pander::pander(descr::CrossTable(CCmjlsdtab, prop.r = FALSE, prop.t=FALSE, prop.c = TRUE, prop.chisq = FALSE))

Cell contents: N, Column(%)

             lsdflag
mrjflag            no        yes    Total
-----------------------------------------
no                126          2      128
                  63%         1%
-----------------------------------------
yes                74        198      272
                  37%        99%
-----------------------------------------
Total             200        200      400
                  50%        50%

6.4 Design considerations for case-control versus cohort samples

  • If the disease outcome is rare, it’s probably better to use a case-control design to ensure that there are enough cases.
  • Case-control designs don’t allow you to estimate the risk difference or ratio of the disease, because you’ve fixed the number of patient and control subjects.
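The odds ratio, by contrast, remains estimable under case-control sampling. A Python sketch using the case-control and full-sample tables from above; note that the case-control estimate is noisy here because one cell contains only 2 observations:

```python
def odds_ratio(t):
    # cross-product ratio for a 2x2 table [[a, b], [c, d]]
    (a, b), (c, d) = t
    return (a * d) / (b * c)

full = [[33033, 75], [19281, 4508]]  # full table (rows mrjflag, cols lsdflag)
cc   = [[126, 2], [74, 198]]         # case-control sample (200 per lsdflag column)

or_full = odds_ratio(full)  # ~103
or_cc = odds_ratio(cc)      # same target quantity, but a noisy estimate

# The naive conditional "risk" P(LSD | MJ) is distorted by the design:
risk_full = 4508 / (19281 + 4508)  # ~0.19 in the full sample
risk_cc = 198 / (74 + 198)         # ~0.73 in the case-control sample
```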

6.5 Probability models for the two designs

6.5.1 Binomial conditional on \(Y\).

  • Let \(Y_i\) denote the marijuana use yes/no for subject \(i\) and \(X_i\) denote the LSD use for \(i=1,\ldots, n\).
  • Conditioning on the \(Y_i\), we can reorder the subjects so that \(Y_i=0\) for all \(i=1, \ldots, n_0\), and \(Y_i=1\) for all \(i=n_0 + 1, \ldots, n\).
  • This allows us to write this like we did for the two means problem \[ \begin{align*} X_i & \sim \mathrm{Bern}(\pi_0) \text{ for $i=1,\ldots,n_0$}\\ X_i & \sim \mathrm{Bern}(\pi_1)\text{ for $i=n_0+1,\ldots,n_0+n_1$} \end{align*} \]
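A quick simulation of this two-group Bernoulli model, with made-up group sizes and probabilities (a Python sketch):

```python
import random

random.seed(1333)
n0, n1 = 5000, 5000    # hypothetical group sizes
pi0, pi1 = 0.05, 0.20  # hypothetical Bernoulli probabilities

# X_i ~ Bern(pi0) for the first n0 subjects, Bern(pi1) for the rest
x0 = [1 if random.random() < pi0 else 0 for _ in range(n0)]
x1 = [1 if random.random() < pi1 else 0 for _ in range(n1)]

# The group sample proportions estimate pi0 and pi1
phat0 = sum(x0) / n0
phat1 = sum(x1) / n1
```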

6.5.2 Multinomial where probabilities depend on shape of array.

  • The probability someone falls into a given cell can be modeled with a multinomial distribution.
  • Let \(x_i = [x_{i1}, x_{i2}, x_{i3}, x_{i4}]\), where \(x_{ij}\in\{0,1\}\) and exactly one of the elements \(x_{ij}=1\). \[ \mathbb{P}(X_i=[x_{i1}, x_{i2}, x_{i3}, x_{i4}] ) = \pi_1^{x_{i1}}\pi_2^{x_{i2}}\pi_3^{x_{i3}} \pi_4^{x_{i4}} \] and \(\sum_j \pi_j = 1\)
  • The multinomial probability describes how likely someone is to fall into the cell with a given combination of yes/no answers.
  • We have to fix an ordering for how the \(\pi_j\) correspond to the elements of the table.
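For a one-hot indicator vector, the multinomial probability simply picks out the \(\pi_j\) of the occupied cell. A Python sketch, with cell probabilities loosely based on the joint percentages in the table above (the specific values are made up for illustration):

```python
# Hypothetical cell probabilities for the 4 cells of a 2x2 table,
# in a fixed order, summing to 1
pi = [0.58, 0.0013, 0.339, 0.0797]
assert abs(sum(pi) - 1) < 1e-12

def cell_prob(x, pi):
    # P(X_i = x) = prod_j pi_j ** x_j for a one-hot indicator vector x
    assert sum(x) == 1
    p = 1.0
    for xj, pj in zip(x, pi):
        p *= pj ** xj
    return p

# A subject falling in the third cell has probability pi[2]
p3 = cell_prob([0, 0, 1, 0], pi)
```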

6.5.3 Independence hypothesis

  • Conditional probabilities are equal \[ \pi_0 = \pi_1 \]
  • Product of marginals \[ P(X_i=j, Y_i = k) = P(X_i =j)P(Y_i=k) \]
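For example, applying the product-of-marginals rule to the case-control table above reproduces the expected counts of 64 and 136 used by the chi-squared test (a Python sketch):

```python
# Case-control mrjflag x lsdflag table from above
table = [[126, 2], [74, 198]]
N = 400
row_marg = [128 / N, 272 / N]  # marginal P(mrjflag = no), P(mrjflag = yes)
col_marg = [200 / N, 200 / N]  # marginal P(lsdflag = no), P(lsdflag = yes)

# Null cell probabilities: product of the marginals
null_probs = [[r * c for c in col_marg] for r in row_marg]

# Expected counts under independence: N * p_ij
expected = [[N * p for p in row] for row in null_probs]
```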

6.6 Pearson’s chi-squared test

  • Pearson’s chi-squared test is a way to test this hypothesis
  • The degrees of freedom depend on the model; the sum runs over the \(k\) cells of the table, with expected counts \(E_i = Np_i\) \[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = N \sum_{i=1}^{k} \frac{\left(O_i/N - p_i\right)^2}{p_i} \]
  • Derivation of the test statistic distribution uses linear algebra

6.6.1 Using the chi-squared test

  • Specify the null hypothesis of independence for the case-control and cohort designs
  • \(H_0 : \mathbb{P}(X_{ij}=x_{ij}) = \text{product of marginal probabilities}\)
  • Implies independence of the two variables.
Code
# how should I compute the null probabilities for this?
cs = colSums(CCmjlsdtab)
#chisq.test(CCmjlsdtab, p = nullprobs)
CC = chisq.test(CCmjlsdtab)

#pander::pander(descr::CrossTable(CCmjlsdtab, chisq=TRUE, expected=TRUE, prop.t=FALSE, prop.c = FALSE,prop.chisq=FALSE))
descr::CrossTable(CCmjlsdtab, chisq=TRUE, expected=TRUE, prop.t=FALSE, prop.c = FALSE,prop.chisq=FALSE)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## |              Expected N | 
## |           N / Row Total | 
## |-------------------------|
## 
## ================================
##            lsdflag
## mrjflag       no     yes   Total
## --------------------------------
## no           126       2     128
##               64      64        
##            0.984   0.016   0.320
## --------------------------------
## yes           74     198     272
##              136     136        
##            0.272   0.728   0.680
## --------------------------------
## Total        200     200     400
## ================================
## 
## Statistics for All Table Factors
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 = 176.6544      d.f. = 1      p <2e-16 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 = 173.8166      d.f. = 1      p <2e-16
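The statistic above can be verified by hand from the observed counts and the margin-based expected counts; a Python sketch (the notes use R's chisq.test):

```python
# Observed case-control table (rows mrjflag, cols lsdflag) from above
obs = [[126, 2], [74, 198]]
N = sum(sum(r) for r in obs)
row_tot = [sum(r) for r in obs]
col_tot = [sum(c) for c in zip(*obs)]

# Expected counts under independence: row total * column total / N
exp_ = [[rt * ct / N for ct in col_tot] for rt in row_tot]

# Pearson chi-squared: sum over cells of (O - E)^2 / E
chi2 = sum((obs[i][j] - exp_[i][j]) ** 2 / exp_[i][j]
           for i in range(2) for j in range(2))
# chi2 matches the 176.6544 reported by descr::CrossTable above
```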
Code
# how should I compute the null probabilities for this?
cs = colSums(Cohortmjlsdtab)
rs = rowSums(Cohortmjlsdtab)
cohort = chisq.test(Cohortmjlsdtab)

# renders it in a nice output, but removes test
#pander::pander(descr::CrossTable(Cohortmjlsdtab, chisq=TRUE, expected=TRUE, prop.t=FALSE, prop.c = FALSE,prop.chisq=FALSE))
descr::CrossTable(Cohortmjlsdtab, chisq=TRUE, expected=TRUE, prop.t=FALSE, prop.c = FALSE,prop.chisq=FALSE)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## |              Expected N | 
## |           N / Row Total | 
## |-------------------------|
## 
## ================================
##            lsdflag
## mrjflag       no     yes   Total
## --------------------------------
## no           117       0     117
##            105.3    11.7        
##            1.000   0.000   0.585
## --------------------------------
## yes           63      20      83
##             74.7     8.3        
##            0.759   0.241   0.415
## --------------------------------
## Total        180      20     200
## ================================
## 
## Statistics for All Table Factors
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 = 31.3253      d.f. = 1      p = 2.182e-08 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 = 28.70513      d.f. = 1      p = 8.427e-08

6.6.2 Age by LSD use

  • Test is appropriate for arbitrary number of categories.
  • Uses normal (chi-squared) approximation so requires sufficient numbers in each cell.
    • Might not be very accurate with extremely rare events.
Code
puf$age = as.character(puf$age)
# looking at subset of ages 18 and older
agelsdtab = with(puf[puf$age!='12-17',], table(age, lsdflag))
descr::CrossTable(agelsdtab, chisq=TRUE, expected=TRUE, prop.t=FALSE, prop.c = FALSE,prop.chisq=FALSE)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## |              Expected N | 
## |           N / Row Total | 
## |-------------------------|
## 
## =======================================
##                lsdflag
## age                 no      yes   Total
## ---------------------------------------
## 18-25            12543     1117   13660
##                12254.1   1405.9        
##                  0.918    0.082   0.320
## ---------------------------------------
## 26-34             7907      844    8751
##                 7850.3    900.7        
##                  0.904    0.096   0.205
## ---------------------------------------
## 35-49             9830     1531   11361
##                10191.7   1169.3        
##                  0.865    0.135   0.267
## ---------------------------------------
## 50 or older       7958      895    8853
##                 7941.8    911.2        
##                  0.899    0.101   0.208
## ---------------------------------------
## Total            38238     4387   42625
## =======================================
## 
## Statistics for All Table Factors
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 = 195.204      d.f. = 3      p <2e-16
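The same by-hand computation extends to this 4×2 table; a Python sketch recovering the statistic from the printed counts:

```python
# Observed age x lsdflag counts from the output above
obs = [[12543, 1117],
       [7907, 844],
       [9830, 1531],
       [7958, 895]]
N = sum(sum(r) for r in obs)
row_tot = [sum(r) for r in obs]
col_tot = [sum(c) for c in zip(*obs)]

# Expected counts from the margins; Pearson chi-squared with
# df = (4 - 1) * (2 - 1) = 3
chi2 = sum((o - rt * ct / N) ** 2 / (rt * ct / N)
           for r, rt in zip(obs, row_tot)
           for o, ct in zip(r, col_tot))
# chi2 matches the 195.204 reported above
```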

6.7 Categorical versus ordinal variables

  • Test above is not necessarily the best option. Why?
  • Categorical vs. ordinal variables
  • Cochran-Armitage trend test would be a more efficient way to test this because age is ordinal
  • Anything more complicated than that will require logistic regression (a linear model)
    • E.g., testing whether a linear age trend is sufficient to completely model the differences across age groups

6.8 Contingency table concepts

6.8.1 Association metrics for two binomial proportions

  • There are a few different ways of summarizing categorical differences. Usually described in reference to a treatment/exposure and a disease/outcome. In the example above marijuana use is the “exposure” and LSD use is the “disease”.

    • Risk difference
    • Relative Risk
    • Odds ratio
  • All three parameters are based on the conditional probability of the disease/outcome given the treatment/exposure. Let \(\mathbb{P}(Y_i = 1 \mid X_i=0) = \pi_0\) and \(\mathbb{P}(Y_i = 1 \mid X_i=1) = \pi_1\). \[ \begin{align*} RD & = \pi_1 - \pi_0\\ RR & = \frac{\pi_1}{\pi_0}\\ OR & = \frac{\pi_1/(1-\pi_1)}{\pi_0/(1-\pi_0)} \end{align*} \]

  • What is \(H_0\) expressed as each of these parameters?
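Plugging the full marijuana-by-LSD table from the start of the chapter into these definitions gives estimates of all three parameters (a Python sketch; here MJ use is the exposure and LSD use is the outcome):

```python
# Full-sample counts from the first cross-table in the chapter
mj_lsd, mj_nolsd = 4508, 19281
nomj_lsd, nomj_nolsd = 75, 33033

pi1 = mj_lsd / (mj_lsd + mj_nolsd)        # P(LSD | MJ)
pi0 = nomj_lsd / (nomj_lsd + nomj_nolsd)  # P(LSD | no MJ)

rd = pi1 - pi0                               # risk difference
rr = pi1 / pi0                               # relative risk
odr = (pi1 / (1 - pi1)) / (pi0 / (1 - pi0))  # odds ratio
```

Note that because LSD use is rare among those who never used marijuana, the odds ratio (~103) and relative risk (~84) are both far from 1 while the risk difference is modest (~0.19).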

6.8.2 Confidence intervals for these parameters

  • They are all based on normal approximations.
  • You will derive an estimator and test statistic for the first one.
  • What are intuitive estimators for the other two using the notation below?
        LSD          No LSD
MJ      \(n_{11}\)   \(n_{12}\)
No MJ   \(n_{21}\)   \(n_{22}\)

\[ \begin{align*} \widehat{RD} & = \hat\pi_1 - \hat\pi_0 = \\ \widehat{RR} & = \frac{\hat\pi_1}{\hat\pi_0} = \frac{n_{11}(n_{21}+n_{22})}{(n_{11}+n_{12})n_{21}}\\ \widehat{OR} & = \frac{n_{11}n_{22}}{n_{12}n_{21}} \end{align*} \]

Confidence intervals using these estimators \[ \begin{align*} \widehat{RD} \pm z_{1-\alpha/2} \times SE(\widehat{RD}) \\ \exp\left\{ \log\widehat{RR}\pm z_{1-\alpha/2} \times SE(\log\widehat{RR}) \right \} \\ \exp\left\{\log \widehat{OR} \pm z_{1-\alpha/2} \times SE(\log\widehat{OR}) \right\} \end{align*} \]

Where \[ \begin{align*} SE(\widehat{RD}) & = \sqrt{\dfrac{n_{11}n_{12}}{(n_{11} + n_{12})^3}+\dfrac{n_{21}n_{22}}{(n_{21}+n_{22})^3}} \\ SE(\log\widehat{RR}) &= \sqrt{\frac{1}{n_{11}} - \frac{1}{n_{11}+n_{12}} + \frac{1}{n_{21}}- \frac{1}{n_{21}+n_{22}}} \\ SE(\log\widehat{OR}) & = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}}+\frac{1}{n_{22}} } \end{align*} \]

  • Where \(SE(\log\widehat{RR}) = \sqrt{\text{Var}(\log\widehat{RR})}\)
  • The formulas for the variances of \(\log\widehat{RR}\) and \(\log\widehat{OR}\) are obtained using the delta method.
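As a worked example of the odds-ratio interval, a Python sketch using the full-sample counts from the chapter's first table and \(z_{1-\alpha/2} \approx 1.96\):

```python
from math import exp, log, sqrt

# Full-sample counts in the notation of the table above:
# n11 = MJ & LSD, n12 = MJ & no LSD, n21 = no MJ & LSD, n22 = no MJ & no LSD
n11, n12, n21, n22 = 4508, 19281, 75, 33033

or_hat = (n11 * n22) / (n12 * n21)
se_log_or = sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)

z = 1.96  # approximate z_{1 - alpha/2} for alpha = 0.05
lo = exp(log(or_hat) - z * se_log_or)
hi = exp(log(or_hat) + z * se_log_or)
# The 1/n21 = 1/75 term dominates the SE, so the interval is
# wide on the OR scale even with ~57,000 subjects
```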

6.9 Things not talked about here

  • McNemar’s test – testing marginal homogeneity with paired/matched binary data
  • Simpson’s paradox – stratified tables can contradict the marginal table.
  • Logistic regression (Can answer all these questions and more in a regression framework)

6.9.1 Other resources

  • Agresti’s Categorical Data Analysis is a standard text often used to teach this material.