6 Categorical data
This chapter catches some of the topics I think you will need that we haven’t covered yet.
- Contingency tables
- Nonparametric tests
6.1 Contingency tables
Today we will:
- Introduce Pearson’s chi-squared test for two-way tables
- Extend to larger tables
- Case control sampling
- Cohort study sampling
- Introduce contingency table concepts
- Risk difference
- Relative Risk
- Odds ratio
- A contingency table is an array of counts associated with one or more categorical or ordinal variables.
- The number of dimensions of the table depends on the number of variables and the size of each dimension depends on the number of categories.
Code
| | Y=0 | Y=1 |
|---|---|---|
| X=0 | a | c |
| X=1 | b | d |
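As a minimal sketch of where such a table comes from, the cross-tabulation of two binary variables can be built with `table()`. The vectors `x` and `y` below are made-up data, used only for illustration.

```r
# Toy example: a 2x2 contingency table from two binary variables
# (x and y are hypothetical data, not from the NSDUH puf dataset)
x = factor(c('no', 'no', 'yes', 'yes', 'no', 'yes'), levels = c('no', 'yes'))
y = factor(c('no', 'yes', 'no', 'yes', 'yes', 'yes'), levels = c('no', 'yes'))
tab = table(x, y)
tab
```

Each cell of `tab` counts the subjects with that combination of `x` and `y` values, matching the a/b/c/d layout above.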
6.2 Cohort sampling
Code
| mrjflag | lsdflag: no | yes | Total |
|---|---|---|---|
| no (N, Row %) | 117 (100.0000%) | 0 (0.0000%) | 117 (58.5000%) |
| yes (N, Row %) | 63 (75.9036%) | 20 (24.0964%) | 83 (41.5000%) |
| Total | 180 | 20 | 200 |
- The way proportions are computed matters.
- If we are sampling from a cohort study, then no margins are fixed. The cells can be represented by joint or conditional probabilities.
- If we are sampling using a case-control study, then one margin is fixed by design, so the cells should be represented with probabilities conditional on that margin.
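The distinction can be sketched with `prop.table()`, using the counts from the cohort table above. `prop.table(tab)` gives joint probabilities (valid under cohort sampling), while `prop.table(tab, 1)` gives probabilities conditional on the rows.

```r
# Counts from the cohort table above (mrjflag rows, lsdflag columns)
tab = matrix(c(117, 63, 0, 20), nrow = 2,
             dimnames = list(mrjflag = c('no', 'yes'), lsdflag = c('no', 'yes')))
joint = prop.table(tab)      # joint probabilities: meaningful under cohort sampling
byrow = prop.table(tab, 1)   # P(lsdflag | mrjflag): row-conditional probabilities
```

For example, `byrow['yes', 'yes']` is 20/83, matching the 24.0964% row percentage in the table.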
6.3 Case-control sampling
- In case-control sampling, the data are sampled conditional on the outcome.
- For example, we might sample patients and controls in equal proportions.
- Case-control studies sample conditional on one variable (e.g. diagnosis, or MJ use).
Code
# case control sampling
n = 200 # number in each group
mrjflag = factor(c(sample(puf$mrjflag[puf$lsdflag == 'no'], n),   # samples mrj use for 200 people who have never used LSD
                   sample(puf$mrjflag[puf$lsdflag == 'yes'], n)), # samples mrj use for 200 people who have used LSD
                 levels = c('no', 'yes'), labels = c('no', 'yes'))
dat = data.frame(lsdflag = rep(c('no', 'yes'), each = n),
                 mrjflag = mrjflag)
CCmjlsdtab = with(dat, table(mrjflag, lsdflag))
pander::pander(descr::CrossTable(CCmjlsdtab, prop.r = FALSE, prop.t=FALSE, prop.c = TRUE, prop.chisq = FALSE))
| mrjflag | lsdflag: no | yes | Total |
|---|---|---|---|
| no (N, Column %) | 126 (63%) | 2 (1%) | 128 |
| yes (N, Column %) | 74 (37%) | 198 (99%) | 272 |
| Total | 200 (50%) | 200 (50%) | 400 |
6.4 Design considerations for case-control versus cohort samples
- If the disease outcome is rare, it’s probably better to use a case-control design to ensure that there are enough cases.
- Case-control designs don’t allow you to estimate the risk difference or ratio of the disease, because you’ve fixed the number of patient and control subjects.
6.5 Probability models for the two designs
6.5.1 Binomial conditional on \(Y\).
- Let \(Y_i\) denote the marijuana use (yes/no) for subject \(i\) and \(X_i\) denote the LSD use for \(i=1,\ldots, n\).
- Conditional on the \(Y_i\), we can order the subjects so that \(Y_i=0\) for all \(i=1, \ldots, n_0\), and \(Y_i=1\) for all \(i=n_0 + 1, \ldots, n\).
- This allows us to write this like we did for the two means problem \[ \begin{align*} X_i & \sim \mathrm{Bern}(\pi_0) \text{ for $i=1,\ldots,n_0$}\\ X_i & \sim \mathrm{Bern}(\pi_1)\text{ for $i=n_0+1,\ldots,n_0+n_1$} \end{align*} \]
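A minimal simulation sketch of this two-binomial model, with made-up group sizes and probabilities (the values of `n0`, `n1`, and the two Bernoulli probabilities are assumptions, not from the data):

```r
# Two-binomial model: X | group ~ Bern(pi_g), one Bernoulli per group
set.seed(1)
n0 = 150; n1 = 150                  # assumed group sizes
x0 = rbinom(n0, 1, 0.3)             # assumed pi_0 = 0.3
x1 = rbinom(n1, 1, 0.5)             # assumed pi_1 = 0.5
# compare the two proportions (chi-squared based, 1 df)
tst = prop.test(c(sum(x0), sum(x1)), c(n0, n1))
```

`prop.test` here is just one convenient way to compare the two Bernoulli proportions; it is equivalent to the chi-squared test on the resulting 2x2 table.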
6.5.2 Multinomial where probabilities depend on shape of array.
- The probability someone falls into a given cell can be modeled with a multinomial distribution.
- Let \(x_i = [x_{i1}, x_{i2}, x_{i3}, x_{i4}]\), where \(x_{ij}\in\{0,1\}\) and exactly one of the elements \(x_{ij}=1\). \[ \mathbb{P}(X_i=[x_{i1}, x_{i2}, x_{i3}, x_{i4}] ) = \pi_1^{x_{i1}}\pi_2^{x_{i2}}\pi_3^{x_{i3}} \pi_4^{x_{i4}} \] where \(\sum_j \pi_j = 1\).
- The multinomial probability describes how likely someone falls into the cell with a given combination of yes/no answers.
- We have to fix an ordering that maps each \(\pi_j\) to a cell of the table.
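Under this model, the four cell counts of a 2x2 table are a single multinomial draw. A small sketch (the probabilities in `pis` and the cell ordering are assumptions for illustration):

```r
# One multinomial draw of 2x2 cell counts
# (cell order assumed to run column-major over the table)
pis = c(0.4, 0.2, 0.1, 0.3)               # assumed cell probabilities, sum to 1
set.seed(2)
counts = rmultinom(1, size = 200, prob = pis)
tab = matrix(counts, nrow = 2)            # fold the 4 counts back into a 2x2 table
```

The total `sum(tab)` always equals the multinomial `size`, unlike the two-binomial formulation where only the group sizes are fixed.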
6.6 Pearson’s chi-squared test
- Pearson’s chi-squared test is a way to test hypotheses about the cell probabilities \(\pi_j\)
- The degrees of freedom depend on the model \[ \chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} = N \sum_{i=1}^n \frac{\left(O_i/N - p_i\right)^2}{p_i} \] where the sum runs over the \(n\) cells, \(O_i\) is the observed count, \(E_i = Np_i\) is the expected count under \(H_0\), and \(N\) is the total sample size.
- Derivation of the test statistic distribution uses linear algebra
6.6.1 Using the chi-squared test
- Specify the null hypothesis of independence for the case-control and cohort designs
- \(H_0 : \mathbb{P}(X_{ij}=x_{ij}) = \text{product of marginal probabilities}\)
- Implies independence of the two variables.
Code
# how should I compute the null probabilities for this?
cs = colSums(CCmjlsdtab)
#chisq.test(CCmjlsdtab, p = nullprobs)
CC = chisq.test(CCmjlsdtab)
#pander::pander(descr::CrossTable(CCmjlsdtab, chisq=TRUE, expected=TRUE, prop.t=FALSE, prop.c = FALSE,prop.chisq=FALSE))
descr::CrossTable(CCmjlsdtab, chisq=TRUE, expected=TRUE, prop.t=FALSE, prop.c = FALSE,prop.chisq=FALSE)
## Cell Contents
## |-------------------------|
## | N |
## | Expected N |
## | N / Row Total |
## |-------------------------|
##
## ================================
## lsdflag
## mrjflag no yes Total
## --------------------------------
## no 126 2 128
## 64 64
## 0.984 0.016 0.320
## --------------------------------
## yes 74 198 272
## 136 136
## 0.272 0.728 0.680
## --------------------------------
## Total 200 200 400
## ================================
##
## Statistics for All Table Factors
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 176.6544 d.f. = 1 p <2e-16
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 173.8166 d.f. = 1 p <2e-16
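The expected counts and the uncorrected statistic in the output above can be reproduced by hand from the observed table, using the product-of-margins form of the null probabilities:

```r
# Reproduce the uncorrected Pearson chi-squared statistic by hand
O = matrix(c(126, 74, 2, 198), nrow = 2)       # observed counts (case-control table)
E = outer(rowSums(O), colSums(O)) / sum(O)     # expected counts under independence
X2 = sum((O - E)^2 / E)                        # 176.6544, matching chisq.test(correct=FALSE)
```

This matches the `Chi^2 = 176.6544` reported above; the Yates-corrected version shrinks each \(|O_i - E_i|\) by 0.5 before squaring.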
Code
# how should I compute the null probabilities for this?
cs = colSums(Cohortmjlsdtab)
rs = rowSums(Cohortmjlsdtab)
cohort = chisq.test(Cohortmjlsdtab)
# renders it in a nice output, but removes test
#pander::pander(descr::CrossTable(CCmjlsdtab, chisq=TRUE, expected=TRUE, prop.t=FALSE, prop.c = FALSE,prop.chisq=FALSE))
descr::CrossTable(Cohortmjlsdtab, chisq=TRUE, expected=TRUE, prop.t=FALSE, prop.c = FALSE,prop.chisq=FALSE)
6.6.2 Age by LSD use
- The test is appropriate for an arbitrary number of categories.
- Uses normal (chi-squared) approximation so requires sufficient numbers in each cell.
- Might not be very accurate with extremely rare events.
Code
## Cell Contents
## |-------------------------|
## | N |
## | Expected N |
## | N / Row Total |
## |-------------------------|
##
## =======================================
## lsdflag
## age no yes Total
## ---------------------------------------
## 18-25 12543 1117 13660
## 12254.1 1405.9
## 0.918 0.082 0.320
## ---------------------------------------
## 26-34 7907 844 8751
## 7850.3 900.7
## 0.904 0.096 0.205
## ---------------------------------------
## 35-49 9830 1531 11361
## 10191.7 1169.3
## 0.865 0.135 0.267
## ---------------------------------------
## 50 or older 7958 895 8853
## 7941.8 911.2
## 0.899 0.101 0.208
## ---------------------------------------
## Total 38238 4387 42625
## =======================================
##
## Statistics for All Table Factors
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 195.204 d.f. = 3 p <2e-16
6.7 Categorical versus ordinal variables
- The test above is not necessarily the best option. Why?
- Categorical vs. ordinal variables
- The Cochran-Armitage trend test would be a more efficient way to test this because age is ordinal
- Anything more complicated than that will require logistic regression (a linear model)
- E.g. testing whether an age trend is sufficient to completely model the differences across age groups
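The trend test is available in base R as `stats::prop.trend.test`. A sketch using the LSD-by-age counts from the table above (the equally spaced scores `1:4` for the age groups are an assumption, the default choice):

```r
# Cochran-Armitage trend test for LSD use across ordered age groups
lsd_yes = c(1117, 844, 1531, 895)     # LSD "yes" counts by age group, from the table
n_age   = c(13660, 8751, 11361, 8853) # age-group totals, from the table
trend = prop.trend.test(lsd_yes, n_age, score = 1:4)
```

Unlike the 3-df chi-squared test above, this uses a single degree of freedom for a linear trend in the proportions, which gives it more power when the trend assumption holds.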
6.8 Contingency table concepts
6.8.1 Association metrics for two binomial proportions
There are a few different ways of summarizing categorical differences. They are usually described in reference to a treatment/exposure and a disease/outcome. In the example above, marijuana use is the “exposure” and LSD use is the “disease”.
- Risk difference
- Relative Risk
- Odds ratio
All of these parameters are based on the conditional probability of the disease/outcome given the treatment/exposure. Let \(\mathbb{P}(Y_i = 1 \mid X_i=0) = \pi_0\) and \(\mathbb{P}(Y_i = 1 \mid X_i=1) = \pi_1\). Then \[ \begin{align*} RD & = \pi_1 - \pi_0\\ RR & = \frac{\pi_1}{\pi_0}\\ OR & = \frac{\pi_1/(1-\pi_1)}{\pi_0/(1-\pi_0)} \end{align*} \]
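The arithmetic for the three parameters, using illustrative values of \(\pi_0\) and \(\pi_1\) (these probabilities are assumed, not taken from the data):

```r
# Risk difference, relative risk, and odds ratio for assumed probabilities
pi0 = 0.2; pi1 = 0.4                       # assumed conditional probabilities
RD = pi1 - pi0                             # risk difference
RR = pi1 / pi0                             # relative risk
OR = (pi1 / (1 - pi1)) / (pi0 / (1 - pi0)) # odds ratio
```

Note that the OR (8/3) is larger than the RR (2) here; the two only approximately agree when the outcome is rare (\(\pi_0, \pi_1\) both small).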
What is \(H_0\) expressed as each of these parameters?
6.8.2 Confidence intervals for these parameters
- They are all based on normal approximations.
- You will derive an estimator and test statistic for the first one.
- What are intuitive estimators for the other two using the notation below?
| | LSD | No LSD |
|---|---|---|
| MJ | \(n_{11}\) | \(n_{12}\) |
| No MJ | \(n_{21}\) | \(n_{22}\) |
\[ \begin{align*} \widehat{RD} & = \hat\pi_1 - \hat\pi_0 = \\ \widehat{RR} & = \frac{\hat\pi_1}{\hat\pi_0} = \frac{n_{11}(n_{21}+n_{22})}{(n_{11}+n_{12})n_{21}}\\ \widehat{OR} & = \frac{n_{11}n_{22}}{n_{12}n_{21}} \end{align*} \]
Confidence intervals using these estimators \[ \begin{align*} \widehat{RD} \pm z_{1-\alpha/2} \times SE(\widehat{RD}) \\ \exp\left\{ \log\widehat{RR}\pm z_{1-\alpha/2} \times SE(\log\widehat{RR}) \right \} \\ \exp\left\{\log \widehat{OR} \pm z_{1-\alpha/2} \times SE(\log\widehat{OR}) \right\} \end{align*} \]
Where \[ \begin{align*} SE(\widehat{RD}) & = \sqrt{\dfrac{n_{11}n_{12}}{(n_{11} + n_{12})^3}+\dfrac{n_{21}n_{22}}{(n_{21}+n_{22})^3}} \\ SE(\log\widehat{RR}) &= \sqrt{\frac{1}{n_{11}} - \frac{1}{n_{11}+n_{12}} + \frac{1}{n_{21}}- \frac{1}{n_{21}+n_{22}}} \\ SE(\log\widehat{OR}) & = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}}+\frac{1}{n_{22}} } \end{align*} \]
- Where \(SE(\log\widehat{RR}) = \sqrt{\text{Var}(\log\widehat{RR})}\)
- The formulas for the variances of \(\log\widehat{RR}\) and \(\log\widehat{OR}\) are obtained using the Delta method.
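The odds-ratio interval can be computed directly from 2x2 counts. A sketch using the case-control table, mapped into the \(n_{11},\ldots,n_{22}\) layout above (MJ rows, LSD columns), since the OR is the one parameter that remains estimable under case-control sampling:

```r
# Wald CI for the odds ratio from 2x2 counts
# (counts from the case-control table, in the MJ-by-LSD layout above)
n11 = 198; n12 = 74; n21 = 2; n22 = 126
ORhat = (n11 * n22) / (n12 * n21)
se = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)   # SE(log OR-hat)
ci = exp(log(ORhat) + c(-1, 1) * qnorm(0.975) * se)
```

The interval is computed on the log scale and exponentiated, so it is asymmetric around \(\widehat{OR}\) but respects the positivity of the parameter.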
6.9 Things not talked about here
- McNemar’s test – testing marginal homogeneity for paired binary outcomes in a two-way table
- Simpson’s paradox – when stratified tables contradict the marginal table
- Logistic regression (Can answer all these questions and more in a regression framework)
6.9.1 Other resources
- Agresti’s Categorical Data Analysis textbook is often used to teach categorical data analysis.