6 Categorical data

This chapter catches some of the things I think you will need that we haven’t covered yet.

  • Contingency tables
  • Nonparametric tests

6.1 Contingency tables

Today we will:

  • Introduce Pearson’s chi-squared test for two-way tables
    • Extend to larger tables
  • Case control sampling
  • Cohort study sampling
  • Introduce contingency table concepts
    • Risk difference
    • Relative Risk
    • Odds ratio
  • A contingency table is an array of counts associated with one or more categorical or ordinal variables.
  • The number of dimensions of the table depends on the number of variables and the size of each dimension depends on the number of categories.
Code
set.seed(1333)
puf = readRDS('../datasets/nsduh/puf.rds')
knitr::kable(matrix(c('a', 'b', 'c', 'd'), nrow=2, dimnames=list(c('X=0', 'X=1'), c('Y=0', 'Y=1') ) ) ) 
      Y=0   Y=1
X=0   a     c
X=1   b     d

6.1.1 Example: Marijuana and LSD use

Code
mjlsdtab = with(puf, table(mrjflag, lsdflag))
pander::pander(descr::CrossTable(mjlsdtab, prop.r = FALSE, prop.c = FALSE, prop.chisq = FALSE))

Cell contents: N, Total(%)

             lsdflag
mrjflag            no        yes    Total
-----------------------------------------
no              33033         75    33108
             58.0575%    0.1318%
-----------------------------------------
yes             19281       4508    23789
             33.8876%    7.9231%
-----------------------------------------
Total           52314       4583    56897

6.2 Cohort sampling

Code
# Cohort sampling
n = 200 # number in each group
dat = puf[sample(nrow(puf), size = n), c('mrjflag', 'lsdflag')]
Cohortmjlsdtab = with(dat, table(mrjflag, lsdflag))
pander::pander(descr::CrossTable(Cohortmjlsdtab, prop.r = TRUE, prop.t=FALSE, prop.c = FALSE, prop.chisq = FALSE))

Cell contents: N, Row(%)

             lsdflag
mrjflag            no        yes      Total
-------------------------------------------
no                117          0        117
            100.0000%    0.0000%   58.5000%
-------------------------------------------
yes                63         20         83
             75.9036%   24.0964%   41.5000%
-------------------------------------------
Total             180         20        200
  • The way proportions are computed matters.
    • If we are sampling from a cohort study, then no margins are fixed. The cells can be represented by joint or conditional probabilities.
    • If we are sampling using a case-control study, then a single margin is fixed. The cells should be represented with conditional probabilities.
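To make the joint-versus-conditional distinction concrete, here is a small Python sketch (the notes use R; Python is used here only for illustration) applied to the cohort-sampled table above:

```python
# Cohort-sampled 2x2 table from above: rows = mrjflag (no, yes),
# columns = lsdflag (no, yes)
table = [[117, 0],
         [63, 20]]
N = sum(sum(row) for row in table)

# Joint proportions: cell count / grand total
# (meaningful under cohort sampling, where no margin is fixed)
joint = [[c / N for c in row] for row in table]

# Row-conditional proportions: cell count / row total
conditional = [[c / sum(row) for c in row] for row in table]

print(joint)        # estimates of P(row = j, column = k)
print(conditional)  # estimates of P(column = k | row = j)
```

The conditional proportions here reproduce the Row(%) entries in the table above (e.g. 20/83 ≈ 24.1%).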

6.3 Case-control sampling

  • In case-control sampling, the data are sampled conditional on the outcome.
  • For example, I might sample patients and controls in equal proportions, regardless of how common the outcome is.
  • Case-control studies sample conditional on one variable (e.g. diagnosis, or MJ use).
Code
# case control sampling
n = 200 # number in each group
mrjflag = factor(c(sample(puf$mrjflag[ puf$lsdflag=='no'], n), # samples mrj use for 200 people who have never used LSD
                                  sample(puf$mrjflag[ puf$lsdflag=='yes'], n) ), # samples mrj use for 200 people who have used LSD
                                levels = c('no', 'yes'), labels=c('no', 'yes') )
dat = data.frame(lsdflag=rep(c('no', 'yes'), each=n),
                 mrjflag=mrjflag )
CCmjlsdtab = with(dat, table(mrjflag, lsdflag))
pander::pander(descr::CrossTable(CCmjlsdtab, prop.r = FALSE, prop.t=FALSE, prop.c = TRUE, prop.chisq = FALSE))

Cell contents: N, Column(%)

             lsdflag
mrjflag            no        yes    Total
-----------------------------------------
no                126          2      128
                  63%         1%
-----------------------------------------
yes                74        198      272
                  37%        99%
-----------------------------------------
Total             200        200      400
                  50%        50%

6.4 Design considerations for case-control versus cohort samples

  • If the disease outcome is rare, it’s probably better to use a case-control design to ensure that there are enough cases.
  • Case-control designs don’t allow you to estimate the risk difference or ratio of the disease, because you’ve fixed the number of patient and control subjects.
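The odds ratio, by contrast, remains estimable under case-control sampling. A Python sketch using the case-control and full-sample tables from above; note that the case-control estimate is noisy here because one cell contains only 2 observations:

```python
def odds_ratio(t):
    # cross-product ratio for a 2x2 table [[a, b], [c, d]]
    (a, b), (c, d) = t
    return (a * d) / (b * c)

full = [[33033, 75], [19281, 4508]]  # full table (rows mrjflag, cols lsdflag)
cc   = [[126, 2], [74, 198]]         # case-control sample (200 per lsdflag column)

or_full = odds_ratio(full)  # ~103
or_cc = odds_ratio(cc)      # same target quantity, but a noisy estimate

# The naive conditional "risk" P(LSD | MJ) is distorted by the design:
risk_full = 4508 / (19281 + 4508)  # ~0.19 in the full sample
risk_cc = 198 / (74 + 198)         # ~0.73 in the case-control sample
```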

6.5 Probability models for the two designs

6.5.1 Binomial conditional on \(Y\).

  • Let \(Y_i\) denote the marijuana use yes/no for subject \(i\) and \(X_i\) denote the LSD use for \(i=1,\ldots, n\).
  • Conditioning on the \(Y_i\), we can reorder the subjects so that \(Y_i=0\) for all \(i=1, \ldots, n_0\), and \(Y_i=1\) for all \(i=n_0 + 1, \ldots, n\).
  • This allows us to write this like we did for the two means problem \[ \begin{align*} X_i & \sim \mathrm{Bern}(\pi_0) \text{ for $i=1,\ldots,n_0$}\\ X_i & \sim \mathrm{Bern}(\pi_1)\text{ for $i=n_0+1,\ldots,n_0+n_1$} \end{align*} \]
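A quick simulation of this two-group Bernoulli model, with made-up group sizes and probabilities (a Python sketch):

```python
import random

random.seed(1333)
n0, n1 = 5000, 5000    # hypothetical group sizes
pi0, pi1 = 0.05, 0.20  # hypothetical Bernoulli probabilities

# X_i ~ Bern(pi0) for the first n0 subjects, Bern(pi1) for the rest
x0 = [1 if random.random() < pi0 else 0 for _ in range(n0)]
x1 = [1 if random.random() < pi1 else 0 for _ in range(n1)]

# The group sample proportions estimate pi0 and pi1
phat0 = sum(x0) / n0
phat1 = sum(x1) / n1
```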

6.5.2 Multinomial where probabilities depend on shape of array.

  • The probability someone falls into a given cell can be modeled with a multinomial distribution.
  • Let \(x_i = [x_{i1}, x_{i2}, x_{i3}, x_{i4}]\), where \(x_{ij}\in\{0,1\}\) and exactly one of the elements \(x_{ij}=1\). \[ \mathbb{P}(X_i=[x_{i1}, x_{i2}, x_{i3}, x_{i4}] ) = \pi_1^{x_{i1}}\pi_2^{x_{i2}}\pi_3^{x_{i3}} \pi_4^{x_{i4}} \] and \(\sum_j \pi_j = 1\)
  • The multinomial probability describes how likely someone is to fall into the cell with a given combination of yes/no answers.
  • We have to fix an ordering for how the \(\pi_j\) correspond to the elements of the table.
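For a one-hot indicator vector, the multinomial probability simply picks out the \(\pi_j\) of the occupied cell. A Python sketch, with cell probabilities loosely based on the joint percentages in the table above (the specific values are made up for illustration):

```python
# Hypothetical cell probabilities for the 4 cells of a 2x2 table,
# in a fixed order, summing to 1
pi = [0.58, 0.0013, 0.339, 0.0797]
assert abs(sum(pi) - 1) < 1e-12

def cell_prob(x, pi):
    # P(X_i = x) = prod_j pi_j ** x_j for a one-hot indicator vector x
    assert sum(x) == 1
    p = 1.0
    for xj, pj in zip(x, pi):
        p *= pj ** xj
    return p

# A subject falling in the third cell has probability pi[2]
p3 = cell_prob([0, 0, 1, 0], pi)
```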

6.5.3 Independence hypothesis

  • Conditional probabilities are equal \[ \pi_0 = \pi_1 \]
  • Product of marginals \[ P(X_i=j, Y_i = k) = P(X_i =j)P(Y_i=k) \]
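For example, applying the product-of-marginals rule to the case-control table above reproduces the expected counts of 64 and 136 used by the chi-squared test (a Python sketch):

```python
# Case-control mrjflag x lsdflag table from above
table = [[126, 2], [74, 198]]
N = 400
row_marg = [128 / N, 272 / N]  # marginal P(mrjflag = no), P(mrjflag = yes)
col_marg = [200 / N, 200 / N]  # marginal P(lsdflag = no), P(lsdflag = yes)

# Null cell probabilities: product of the marginals
null_probs = [[r * c for c in col_marg] for r in row_marg]

# Expected counts under independence: N * p_ij
expected = [[N * p for p in row] for row in null_probs]
```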

6.6 Pearson’s chi-squared test

  • Pearson’s chi-squared test is a way to test this hypothesis
  • The degrees of freedom depend on the model; the sum runs over the \(k\) cells of the table, with expected counts \(E_i = Np_i\) \[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = N \sum_{i=1}^{k} \frac{\left(O_i/N - p_i\right)^2}{p_i} \]
  • Derivation of the test statistic distribution uses linear algebra

6.6.1 Using the chi-squared test

  • Specify the null hypothesis of independence for the case-control and cohort designs
  • \(H_0 : \mathbb{P}(X_{ij}=x_{ij}) = \text{product of marginal probabilities}\)
  • Implies independence of the two variables.
Code
# how should I compute the null probabilities for this?
cs = colSums(CCmjlsdtab)
#chisq.test(CCmjlsdtab, p = nullprobs)
CC = chisq.test(CCmjlsdtab)

#pander::pander(descr::CrossTable(CCmjlsdtab, chisq=TRUE, expected=TRUE, prop.t=FALSE, prop.c = FALSE,prop.chisq=FALSE))
descr::CrossTable(CCmjlsdtab, chisq=TRUE, expected=TRUE, prop.t=FALSE, prop.c = FALSE,prop.chisq=FALSE)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## |              Expected N | 
## |           N / Row Total | 
## |-------------------------|
## 
## ================================
##            lsdflag
## mrjflag       no     yes   Total
## --------------------------------
## no           126       2     128
##               64      64        
##            0.984   0.016   0.320
## --------------------------------
## yes           74     198     272
##              136     136        
##            0.272   0.728   0.680
## --------------------------------
## Total        200     200     400
## ================================
## 
## Statistics for All Table Factors
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 = 176.6544      d.f. = 1      p <2e-16 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 = 173.8166      d.f. = 1      p <2e-16
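The statistic above can be verified by hand from the observed counts and the margin-based expected counts; a Python sketch (the notes use R's chisq.test):

```python
# Observed case-control table (rows mrjflag, cols lsdflag) from above
obs = [[126, 2], [74, 198]]
N = sum(sum(r) for r in obs)
row_tot = [sum(r) for r in obs]
col_tot = [sum(c) for c in zip(*obs)]

# Expected counts under independence: row total * column total / N
exp_ = [[rt * ct / N for ct in col_tot] for rt in row_tot]

# Pearson chi-squared: sum over cells of (O - E)^2 / E
chi2 = sum((obs[i][j] - exp_[i][j]) ** 2 / exp_[i][j]
           for i in range(2) for j in range(2))
# chi2 matches the 176.6544 reported by descr::CrossTable above
```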
Code
# how should I compute the null probabilities for this?
cs = colSums(Cohortmjlsdtab)
rs = rowSums(Cohortmjlsdtab)
cohort = chisq.test(Cohortmjlsdtab)

# renders it in a nice output, but removes test
#pander::pander(descr::CrossTable(Cohortmjlsdtab, chisq=TRUE, expected=TRUE, prop.t=FALSE, prop.c = FALSE,prop.chisq=FALSE))
descr::CrossTable(Cohortmjlsdtab, chisq=TRUE, expected=TRUE, prop.t=FALSE, prop.c = FALSE,prop.chisq=FALSE)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## |              Expected N | 
## |           N / Row Total | 
## |-------------------------|
## 
## ================================
##            lsdflag
## mrjflag       no     yes   Total
## --------------------------------
## no           117       0     117
##            105.3    11.7        
##            1.000   0.000   0.585
## --------------------------------
## yes           63      20      83
##             74.7     8.3        
##            0.759   0.241   0.415
## --------------------------------
## Total        180      20     200
## ================================
## 
## Statistics for All Table Factors
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 = 31.3253      d.f. = 1      p = 2.182e-08 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 = 28.70513      d.f. = 1      p = 8.427e-08

6.6.2 Age by LSD use

  • Test is appropriate for arbitrary number of categories.
  • Uses normal (chi-squared) approximation so requires sufficient numbers in each cell.
    • Might not be very accurate with extremely rare events.
Code
puf$age = as.character(puf$age)
# looking at subset of ages 18 and older
agelsdtab = with(puf[puf$age!='12-17',], table(age, lsdflag))
descr::CrossTable(agelsdtab, chisq=TRUE, expected=TRUE, prop.t=FALSE, prop.c = FALSE,prop.chisq=FALSE)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## |              Expected N | 
## |           N / Row Total | 
## |-------------------------|
## 
## =======================================
##                lsdflag
## age                 no      yes   Total
## ---------------------------------------
## 18-25            12543     1117   13660
##                12254.1   1405.9        
##                  0.918    0.082   0.320
## ---------------------------------------
## 26-34             7907      844    8751
##                 7850.3    900.7        
##                  0.904    0.096   0.205
## ---------------------------------------
## 35-49             9830     1531   11361
##                10191.7   1169.3        
##                  0.865    0.135   0.267
## ---------------------------------------
## 50 or older       7958      895    8853
##                 7941.8    911.2        
##                  0.899    0.101   0.208
## ---------------------------------------
## Total            38238     4387   42625
## =======================================
## 
## Statistics for All Table Factors
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 = 195.204      d.f. = 3      p <2e-16
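The same by-hand computation extends to this 4×2 table; a Python sketch recovering the statistic from the printed counts:

```python
# Observed age x lsdflag counts from the output above
obs = [[12543, 1117],
       [7907, 844],
       [9830, 1531],
       [7958, 895]]
N = sum(sum(r) for r in obs)
row_tot = [sum(r) for r in obs]
col_tot = [sum(c) for c in zip(*obs)]

# Expected counts from the margins; Pearson chi-squared with
# df = (4 - 1) * (2 - 1) = 3
chi2 = sum((o - rt * ct / N) ** 2 / (rt * ct / N)
           for r, rt in zip(obs, row_tot)
           for o, ct in zip(r, col_tot))
# chi2 matches the 195.204 reported above
```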

6.7 Categorical versus ordinal variables

  • Test above is not necessarily the best option. Why?
  • Categorical vs. ordinal variables
  • Cochran-Armitage trend test would be a more efficient way to test this because age is ordinal
  • Anything more complicated than that will require logistic regression (a linear model)
    • E.g., testing whether a linear age trend is sufficient to completely model the differences across age groups

6.8 Contingency table concepts

6.8.1 Association metrics for two binomial proportions

  • There are a few different ways of summarizing categorical differences. Usually described in reference to a treatment/exposure and a disease/outcome. In the example above marijuana use is the “exposure” and LSD use is the “disease”.

    • Risk difference
    • Relative Risk
    • Odds ratio
  • All three parameters are based on the conditional probability of the disease/outcome given the treatment/exposure. Let \(\mathbb{P}(Y_i = 1 \mid X_i=0) = \pi_0\) and \(\mathbb{P}(Y_i = 1 \mid X_i=1) = \pi_1\). \[ \begin{align*} RD & = \pi_1 - \pi_0\\ RR & = \frac{\pi_1}{\pi_0}\\ OR & = \frac{\pi_1/(1-\pi_1)}{\pi_0/(1-\pi_0)} \end{align*} \]

  • What is \(H_0\) expressed as each of these parameters?
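Plugging the full marijuana-by-LSD table from the start of the chapter into these definitions gives estimates of all three parameters (a Python sketch; here MJ use is the exposure and LSD use is the outcome):

```python
# Full-sample counts from the first cross-table in the chapter
mj_lsd, mj_nolsd = 4508, 19281
nomj_lsd, nomj_nolsd = 75, 33033

pi1 = mj_lsd / (mj_lsd + mj_nolsd)        # P(LSD | MJ)
pi0 = nomj_lsd / (nomj_lsd + nomj_nolsd)  # P(LSD | no MJ)

rd = pi1 - pi0                               # risk difference
rr = pi1 / pi0                               # relative risk
odr = (pi1 / (1 - pi1)) / (pi0 / (1 - pi0))  # odds ratio
```

Note that because LSD use is rare among those who never used marijuana, the odds ratio (~103) and relative risk (~84) are both far from 1 while the risk difference is modest (~0.19).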

6.8.2 Confidence intervals for these parameters

  • They are all based on normal approximations.
  • You will derive an estimator and test statistic for the first one.
  • What are intuitive estimators for the other two using the notation below?
        LSD          No LSD
MJ      \(n_{11}\)   \(n_{12}\)
No MJ   \(n_{21}\)   \(n_{22}\)

\[ \begin{align*} \widehat{RD} & = \hat\pi_1 - \hat\pi_0 = \\ \widehat{RR} & = \frac{\hat\pi_1}{\hat\pi_0} = \frac{n_{11}(n_{21}+n_{22})}{(n_{11}+n_{12})n_{21}}\\ \widehat{OR} & = \frac{n_{11}n_{22}}{n_{12}n_{21}} \end{align*} \]

Confidence intervals using these estimators \[ \begin{align*} \widehat{RD} \pm z_{1-\alpha/2} \times SE(\widehat{RD}) \\ \exp\left\{ \log\widehat{RR}\pm z_{1-\alpha/2} \times SE(\log\widehat{RR}) \right \} \\ \exp\left\{\log \widehat{OR} \pm z_{1-\alpha/2} \times SE(\log\widehat{OR}) \right\} \end{align*} \]

Where \[ \begin{align*} SE(\widehat{RD}) & = \sqrt{\dfrac{n_{11}n_{12}}{(n_{11} + n_{12})^3}+\dfrac{n_{21}n_{22}}{(n_{21}+n_{22})^3}} \\ SE(\log\widehat{RR}) &= \sqrt{\frac{1}{n_{11}} - \frac{1}{n_{11}+n_{12}} + \frac{1}{n_{21}}- \frac{1}{n_{21}+n_{22}}} \\ SE(\log\widehat{OR}) & = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}}+\frac{1}{n_{22}} } \end{align*} \]

  • Where \(SE(\log\widehat{RR}) = \sqrt{\text{Var}(\log\widehat{RR})}\)
  • The formulas for the variances of \(\log\widehat{RR}\) and \(\log\widehat{OR}\) are obtained using the delta method.
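As a worked example of the odds-ratio interval, a Python sketch using the full-sample counts from the chapter's first table and \(z_{1-\alpha/2} \approx 1.96\):

```python
from math import exp, log, sqrt

# Full-sample counts in the notation of the table above:
# n11 = MJ & LSD, n12 = MJ & no LSD, n21 = no MJ & LSD, n22 = no MJ & no LSD
n11, n12, n21, n22 = 4508, 19281, 75, 33033

or_hat = (n11 * n22) / (n12 * n21)
se_log_or = sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)

z = 1.96  # approximate z_{1 - alpha/2} for alpha = 0.05
lo = exp(log(or_hat) - z * se_log_or)
hi = exp(log(or_hat) + z * se_log_or)
# The 1/n21 = 1/75 term dominates the SE, so the interval is
# wide on the OR scale even with ~57,000 subjects
```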

6.9 Things not talked about here

  • McNemar’s test – testing marginal homogeneity with paired/matched binary data
  • Simpson’s paradox – stratified tables can contradict the marginal table.
  • Logistic regression (Can answer all these questions and more in a regression framework)

6.9.1 Other resources

  • Agresti’s Categorical Data Analysis is a standard text often used to teach this material.