--- title: "Stat 5421 Lecture Notes: To Accompany Agresti Ch 9" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" output: bookdown::html_document2: number_sections: false md_extensions: -tex_math_single_backslash mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML bookdown::pdf_document2: number_sections: false md_extensions: -tex_math_single_backslash linkcolor: blue urlcolor: blue --- # Section 9.1 Two-Way Tables ## Section 9.1.1 Only Main Effects The *independence* or *homogeneity of proportions* model has mean-value parameters $$ \mu_{i j} = \alpha_i \beta_j $$ or canonical parameters \begin{equation} \theta_{i j} = \alpha_i + \beta_j (\#eq:two-way-independence) \end{equation} where the alphas and betas are different in the two equations (this causes no confusion, since if we look at parameters at all, these are usually canonical parameters). ## Section 9.1.3 Interactions The *saturated* model has completely arbitrary parameters. The mean value parameters $\mu_{i j}$ have no specified structure. The canonical parameters $\theta_{i j}$ have no specified structure. So this is the largest model. It has the most (arbitrarily variable) parameters. ## Section 9.1.5 Hierarchical Models The hierarchical model principle says that, if you have an interaction term of a certain order, then you have all lower-order interactions of the same variables. For the saturated model for a two-way table this says \begin{equation} \theta_{i j} = \alpha_i + \beta_j + \gamma_{i j} (\#eq:two-way-saturated) \end{equation} (but this just says $\theta_{i j}$ can be anything. This is not an identifiable parameterization. ## Identifiability A statistical model (in general, not just in categorical data analysis) is *identifiable* if there is only one (vector) parameter value corresponding to a probability distribution. In exponential family models ([Notes on Exponential Families, Section 6.2](http://www.stat.umn.edu/geyer/5421/notes/expfam.pdf#page=28)) every different mean vector corresponds to a different distribution, but different canonical parameter vectors can correspond to the same distribution. If we are assuming *Poisson sampling* ([Notes on Exponential Families, Part I, section 7](http://www.stat.umn.edu/geyer/5421/notes/expfam.pdf#page=46)) then the canonical parameters are identifiable if and only if the mean-value parameters are (because the link function (componentwise log) is invertible). Clearly the specification for the two-way independence model \@ref(eq:two-way-independence) is not identifiable because one can add a constant to all of the alphas and subtract the same constant from all the betas and get the same result $$ (\alpha_i + c) + (\beta_j - c) = \alpha_i + \beta_j $$ Hence \@ref(eq:two-way-saturated) is also not identifiable. But we don't need to worry about identifiability when fitting models because the computer automatically takes care of it. 
For example, consider the data in Table 3.8 in Agresti
```{r data.fetal.alcohol.syndrome}
counts <- c(17066, 14464, 788, 126, 37, 48, 38, 5, 1, 1)
drinks <- rep(c("0", "< 1", "1-2", "3-5", ">= 6"), times = 2)
malformation <- rep(c("Absent", "Present"), each = 5)
data.frame(drinks, malformation, counts)
```

We can test independence using R function `chisq.test`
```{r foo}
foo <- xtabs(counts ~ malformation + drinks)
foo
chisq.test(foo)
```
But it warns "Chi-squared approximation may be incorrect", so try
```{r foo.bar}
chisq.test(foo, simulate.p.value = TRUE)
```

But we can also use R function `glm` and its helper functions to do the job.
```{r glm}
gout.indep <- glm(counts ~ malformation + drinks, family = poisson)
summary(gout.indep)
gout.sat <- glm(counts ~ malformation * drinks, family = poisson)
summary(gout.sat)
anova(gout.indep, gout.sat, test = "LRT")
anova(gout.indep, gout.sat, test = "Rao")
```
```{r glm.hide}
p.value.lrt <- anova(gout.indep, gout.sat, test = "LRT")[2, "Pr(>Chi)"]
p.value.rao <- anova(gout.indep, gout.sat, test = "Rao")[2, "Pr(>Chi)"]
p.value.chisq <- suppressWarnings(chisq.test(foo)$p.value)
p.value.lrt
p.value.rao
p.value.chisq
```

We notice a number of things.

 * The $P$-value for the Rao test (`r round(p.value.rao, 5)`) is exactly equal to that output for the chi-square test because the Pearson chi-square test is a special case of the Rao test.

 * The likelihood ratio test disagrees quite strongly with the Rao test. Of course, these tests are asymptotically equivalent, but here the sample size is not "large". The total number of subjects ($`r sum(foo)`$) is large, but the expected counts in some cells of the contingency table are quite small, a lot less than what the rule of thumb says is necessary for valid asymptotics (at least five in each cell), so maybe that accounts for the difference.
```{r foo.expected}
suppressWarnings(chisq.test(foo)$expected)
```

 * The computer (R function `glm`) knows how to deal with identifiability. It adds an intercept, so it is really using the formula
$$
   \theta_{i j} = \alpha + \beta_i + \gamma_j
$$
for the independence model and
$$
   \theta_{i j} = \alpha + \beta_i + \gamma_j + \delta_{i j}
$$
for the saturated model. Then it "drops" (sets equal to zero) one of the betas, one of the gammas, and all interaction terms corresponding to the dropped beta and the dropped gamma.
```{r gout.indep.coef}
names(coef(gout.indep))
```
It keeps the "intercept" ($\alpha$ in our notation just above). It drops one of the coefficients for the factor `malformation` (and keeps the other one). It drops one of the coefficients for the factor `drinks` (and keeps the other four). And this gives it an identifiable model.
```{r gout.sat.coef}
names(coef(gout.sat))
```
It has all the parameters of the independence model plus a lot of "interaction" parameters (the ones whose names contain colons). But it drops all the "interaction" parameters that involve parameters that were "dropped" from the independence model (it contains no "interaction" terms containing `malformationAbsent` or `drinks< 1`). And this gives it an identifiable model.
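As a check on the claim that the Pearson chi-square statistic is the Rao statistic here, we can compute both test statistics by hand from the fitted cell means of the independence model. A minimal sketch (it assumes the objects created in the chunks above are still in the R global environment):
```{r statistics.by.hand}
mu.hat.indep <- predict(gout.indep, type = "response")
# Pearson chi-square statistic, which is the Rao (score) statistic here
sum((counts - mu.hat.indep)^2 / mu.hat.indep)
# likelihood ratio test statistic (difference in deviances)
2 * sum(counts * log(counts / mu.hat.indep))
```
These should match the test statistics in the `anova` output above.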
# Section 9.2 Three-Way Tables

Three-way tables are just like two-way tables except the data have three subscripts for the three dimensions of the table. The models are analogs of the models for two-way tables.

## Section 9.2.1 Only Main Effects

The *independence* or *homogeneity of proportions* model has mean-value parameters
$$
   \mu_{i j k} = \alpha_i \beta_j \gamma_k
$$
or canonical parameters
\begin{equation}
   \theta_{i j k} = \alpha_i + \beta_j + \gamma_k
   (\#eq:three-way-independence)
\end{equation}
where the alphas, betas, and gammas are different in the two equations (this causes no confusion, since if we look at parameters at all, these are usually canonical parameters).

## The Saturated Model

As before, the *saturated* model has completely arbitrary parameters. The mean-value parameters $\mu_{i j k}$ have no specified structure. The canonical parameters $\theta_{i j k}$ have no specified structure. So this is the largest model. It has the most (arbitrarily variable) parameters.

## Example: High School Student Survey

This is Table 9.3 in Agresti and can be found in R package `CatDataAnalysis`.
```{r high.school}
# clean up R global environment
rm(list = ls())
library(CatDataAnalysis)
data(table_9.3)
names(table_9.3)
foo <- xtabs(count ~ a + c + m, data = table_9.3)
foo
```
Since `a`, `c`, and `m` are categorical, we should perhaps make them factors. But since they each have only two values, it does not matter whether we make them factors or not. If they had more than two values, then it would be essential to make them factors.

We can make these data look nicer by copying the names out of the book
```{r high.school.dimnames}
dimnames(foo) <- list(alcohol = c("yes", "no"),
    cigarettes = c("yes", "no"), marijuana = c("yes", "no"))
aperm(foo, c(2, 3, 1))
```
We needed R function `aperm`, which permutes the dimensions of an array, to present the data in the same order as in the book. Of course, the order doesn't matter and neither do the `dimnames`. That is just to match the book.

Now R function `chisq.test` is useless, but R function `glm` has no problems. Our test of independence or homogeneity of proportions is the same, just with three "predictors" rather than two (the "predictors" are in scare quotes because they are just the `dimnames`)
```{r high.school.test}
gout.indep <- glm(count ~ a + c + m, data = table_9.3, family = poisson)
gout.sat <- glm(count ~ a * c * m, data = table_9.3, family = poisson)
anova(gout.indep, gout.sat, test = "LRT")
anova(gout.indep, gout.sat, test = "Rao")
```
Clearly, the independence model does not fit the data. So statistics says these variables (alcohol use, cigarette use, and marijuana use) have some association. No real surprise, but the data do confirm what just about everybody assumes.
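As a sanity check on the mean-value parameterization, the maximum likelihood fitted cell means under the independence model are the total count times the product of the one-dimensional marginal proportions. A minimal sketch verifying this against `glm` (it assumes the objects from the chunk above are still in the R global environment):
```{r high.school.margins}
mu.hat <- predict(gout.indep, type = "response")
n <- sum(table_9.3$count)
# one-dimensional marginal proportions
p.a <- xtabs(count ~ a, data = table_9.3) / n
p.c <- xtabs(count ~ c, data = table_9.3) / n
p.m <- xtabs(count ~ m, data = table_9.3) / n
# fitted means under independence: n times the product of the marginals
mu.check <- n * as.vector(p.a[as.character(table_9.3$a)]) *
    as.vector(p.c[as.character(table_9.3$c)]) *
    as.vector(p.m[as.character(table_9.3$m)])
# agreement up to the convergence tolerance of the iterative glm fit
all.equal(mu.check, as.vector(mu.hat), tolerance = 1e-6)
```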
## Interactions

Our saturated model has the three-way interaction `a * c * m` but there can also be two-way interactions. The model with all two-way interactions would have canonical parameter
$$
   \theta_{i j k} = \alpha_{i j} + \beta_{i k} + \gamma_{j k}
$$
(as before, the terms with alphas, betas, and gammas need not match up with previous uses of these Greek letters). And people who insist on the hierarchical principle would write
$$
   \theta_{i j k} = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \zeta_{i k} + \eta_{j k}
$$
but both formulas specify the same model and neither is identifiable.

So let's look at that model.
```{r high.school.all.two.way}
gout.all.two.way <- glm(count ~ (a + c + m)^2, data = table_9.3,
    family = poisson)
anova(gout.indep, gout.all.two.way, gout.sat, test = "LRT")
```
```{r high.school.p.values}
foompter <- anova(gout.indep, gout.all.two.way, gout.sat, test = "LRT")
foompter.p.values <- foompter[-1, "Pr(>Chi)"]
foompter.p.values
```
This says the two-way interactions model (model 2) fits as well as the three-way interactions model (model 3, the saturated model) because $P = `r round(foompter.p.values[2], 4)`$ is not statistically significant (not even close). And it says the two-way interactions model (model 2) fits much better than the main effects only model (model 1, the independence model) because $P < 2 \times 10^{-16}$ is highly statistically significant (by anyone's standards).

We should not be overly impressed by very, very small $P$-values because asymptotic approximation gives only small absolute errors rather than small relative errors. All this very, very small $P$-value says is that an exact calculation (if we could do it, which we cannot) would also give something small, say $P < 0.001$. But there is no reason to believe the $P < 2 \times 10^{-16}$ that R prints is correct to within even several orders of magnitude.

## Many Models

There are many hierarchical models here, because we need not include all of the main effects or all two-way interactions. The hierarchical principle says that if we have a two-way interaction, then we must have the corresponding main effects, but that leaves (writing the three-way interaction term as $\iota_{i j k}$ to avoid a clash with the canonical parameter $\theta_{i j k}$)
\begin{align*}
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \zeta_{i k} + \eta_{j k} + \iota_{i j k} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \zeta_{i k} + \eta_{j k} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \zeta_{i k} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \eta_{j k} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \zeta_{i k} + \eta_{j k} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \zeta_{i k} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \eta_{j k} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \varepsilon_{i j} \\
   \theta_{i j k} & = \alpha + \beta_i + \delta_k + \zeta_{i k} \\
   \theta_{i j k} & = \alpha + \gamma_j + \delta_k + \eta_{j k} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j \\
   \theta_{i j k} & = \alpha + \beta_i + \delta_k \\
   \theta_{i j k} & = \alpha + \gamma_j + \delta_k \\
   \theta_{i j k} & = \alpha + \beta_i \\
   \theta_{i j k} & = \alpha + \gamma_j \\
   \theta_{i j k} & = \alpha + \delta_k \\
   \theta_{i j k} & = \alpha
\end{align*}
Of course, we already know many of these models don't fit the data. We saw that the main effects only model doesn't fit. So none of its submodels can fit either. But this is a complete listing of all of the hierarchical models (the first line being the saturated model).
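To connect this list to R model formulas, here is one of the intermediate models fit directly: the model $\theta_{i j k} = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \eta_{j k}$, which says alcohol and marijuana are conditionally independent given cigarettes. This is only a sketch of how such a fit would go (it assumes the objects from the chunks above are still in the R global environment).
```{r high.school.cond.indep}
# main effects plus the a:c and c:m interactions only
gout.cond.indep <- glm(count ~ a * c + c * m, data = table_9.3,
    family = poisson)
# test against the all-two-way-interactions model, that is, test
# whether the a:m interaction is needed
anova(gout.cond.indep, gout.all.two.way, test = "LRT")
```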
How can we choose among all these models? There is a methodology for that, but it is getting a bit ahead of ourselves. There will be a handout about it. But for now, we just show it without much explanation.
```{r high.school.glmbb}
library(glmbb)
out <- glmbb(count ~ a * c * m, data = table_9.3, family = poisson)
summary(out)
```
This says that none of the models that do not have all three two-way interactions fit the data according to the criterion: none have AIC less than the AIC for the best model ("best" according to AIC) plus 10. We don't know what AIC is yet, but we will learn about it in a later handout. For now, just consider it a criterion of goodness of fit.

If we want to see how bad some of those non-fitting models were, we can increase the cutoff.
```{r high.school.glmbb.too}
library(glmbb)
out <- glmbb(count ~ a * c * m, data = table_9.3, family = poisson,
    cutoff = 90)
summary(out)
```
That is a huge jump up to the next best fitting model, which (take my word for it for now) does not fit the data well at all.

# Section 9.4 Four-Way and Beyond

And so on and so forth. Higher-order tables are just more complicated, but they present no new issues.

## Example: Seat-Belt Use

These data are given in Table 9.8 in Agresti. They are apparently not in R package `CatDataAnalysis`.
```{r seat.belt,error=TRUE}
# clean up R global environment
rm(list = ls())
count <- c(7287, 11587, 3246, 6134, 10381, 10969, 6123, 6693,
    996, 759, 973, 757, 812, 380, 1084, 513)
injury <- gl(2, 8, 16, labels = c("No", "Yes"))
gender <- gl(2, 4, 16, labels = c("Female", "Male"))
location <- gl(2, 2, 16, labels = c("Urban", "Rural"))
seat.belt <- gl(2, 1, 16, labels = c("No", "Yes"))
data.frame(gender, location, seat.belt, injury, count)
xtabs(count ~ seat.belt + injury + location + gender)
```
Looks OK.
```{r seat.belt.glmbb}
out <- glmbb(count ~ seat.belt * injury * location * gender,
    family = "poisson")
summary(out)
```
Now there are several models that fit these data fairly well. We will leave it at that for now.
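As one concrete follow-up (not in Agresti's presentation of these data), we could fit the all-two-way-interactions model directly, just as in the three-way example, and test it against the saturated model. A minimal sketch (it assumes the objects from the chunk above are still in the R global environment):
```{r seat.belt.two.way}
gout.two.way <- glm(count ~ (seat.belt + injury + location + gender)^2,
    family = poisson)
gout.sat <- glm(count ~ seat.belt * injury * location * gender,
    family = poisson)
# test whether any interactions above two-way are needed
anova(gout.two.way, gout.sat, test = "LRT")
```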
# Parametric Bootstrap

## Assuming Poisson Sampling

The method of R function `anova` for objects of class `"glm"` produced by R function `glm` has no optional argument `simulate.p.value`. So here is how we could do that for our first example.
```{r parametric.bootstrap}
# clean up R global environment
rm(list = ls())
# re-establish stuff done in first section
counts <- c(17066, 14464, 788, 126, 37, 48, 38, 5, 1, 1)
drinks <- rep(c("0", "< 1", "1-2", "3-5", ">= 6"), times = 2)
malformation <- rep(c("Absent", "Present"), each = 5)
gout.indep <- glm(counts ~ malformation + drinks, family = poisson)
gout.sat <- glm(counts ~ malformation * drinks, family = poisson)
aout <- anova(gout.indep, gout.sat, test = "LRT")
```
This gets us back to where we were before we removed everything produced in our original analysis of these data.

Now we need the estimated mean values (expected cell counts) and the $P$-value for the test
```{r parametric.bootstrap.estimates}
p.value.hat <- aout[2, "Pr(>Chi)"]
p.value.hat
mu.hat <- predict(gout.indep, type = "response")
mu.hat
```
Now we can do the simulation
```{r parametric.bootstrap.bootstrap.estimates,cache=TRUE}
# set random number generator seed for reproducibility
set.seed(42)
nboot <- 999
p.value.star <- double(nboot)
for (iboot in 1:nboot) {
    # simulate new data from fitted model
    counts.star <- rpois(length(mu.hat), lambda = mu.hat)
    gout.indep.star <- glm(counts.star ~ malformation + drinks,
        family = poisson)
    gout.sat.star <- glm(counts.star ~ malformation * drinks,
        family = poisson)
    aout.star <- anova(gout.indep.star, gout.sat.star, test = "LRT")
    p.value.star[iboot] <- aout.star[2, "Pr(>Chi)"]
}
all.p.values <- c(p.value.star, p.value.hat)
mean(all.p.values <= p.value.hat)
sd(all.p.values <= p.value.hat) / sqrt(nboot)
```
The code above is a bit tricky, so here is the explanation.

 * Each time through the loop we simulate new data based on our best estimate of the model under the null hypothesis. We are assuming Poisson sampling, so we use R function `rpois` to simulate the data. This function vectorizes, so it works with `mu.hat` being a vector.

 * The next three statements do exactly the same as the original analysis except that we use the simulated data rather than the real data.

 * Then we extract the $P$-value from the result of R function `anova` and store it for future use.

 * After the loop terminates, we have `nboot` simulations from the distribution of the $P$-value under the null hypothesis. We also have one more (not simulated) $P$-value that has the same distribution under the null hypothesis as the simulated $P$-values. This is `p.value.hat`, the value for the real data.

 * The estimated $P$-value is then the fraction of the time the bootstrap simulated $P$-values `p.value.star` are at least as extreme as the $P$-value for the actual data `p.value.hat`.

 * The last statement calculates the Monte Carlo standard error, which says how far off our simulation is from what we would get if we did an infinite number of simulations.

The idea is that, when the asymptotics cannot be trusted, all we can say about a $P$-value is that it can be used as a test statistic. We take lower values as stronger evidence against the null hypothesis, but we don't know that it is correctly calibrated: we don't think that we would have $P \le \alpha$ with probability $\alpha$ for all $\alpha$ under the null hypothesis. So we need to simulate the distribution of this test statistic to compute a valid $P$-value. And that is what this does. Call the observed value of this test statistic $\widehat{P}$ and the simulated value of this test statistic $P^*$; then $\Pr(P^* \le \widehat{P})$ is the valid $P$-value.

The trick of including the observed $P$-value among the simulated $P$-values guarantees that we cannot get a spurious bootstrap $P$-value of zero by making the Monte Carlo sample size too small. The smallest bootstrap $P$-value we can get by this procedure is `1 / (nboot + 1)`. This trick also comes with an argument that it gives an exact randomized test for any significance level that is a multiple of `1 / (nboot + 1)`. Under the null hypothesis, the bootstrap $P$-value has the discrete uniform distribution on the numbers `seq(1, nboot + 1) / (nboot + 1)`.
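To spell that last point out, the bootstrap $P$-value is a rank statistic: it is the rank of $\widehat{P}$ among all `nboot + 1` values, divided by `nboot + 1`. A minimal check (it assumes the objects from the chunk above are still in the R global environment):
```{r parametric.bootstrap.rank}
# rank of the observed P-value among all nboot + 1 P-values
# (ties.method = "max" counts the values <= the observed one,
# so this agrees with mean(all.p.values <= p.value.hat) above)
rank(all.p.values, ties.method = "max")[nboot + 1] / (nboot + 1)
```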
## Assuming Multinomial Sampling

When we simulate, we are not using asymptotics, and we are using exact sampling distributions. Thus Poisson sampling and multinomial sampling are not equivalent (their equivalence is an asymptotic result). So we can do the same simulation under multinomial sampling.
```{r parametric.bootstrap.bootstrap.estimates.too,cache=TRUE}
n <- sum(counts)
p.hat <- mu.hat / n
p.value.star <- double(nboot)
for (iboot in 1:nboot) {
    # simulate new data from fitted model
    counts.star <- rmultinom(1, n, prob = p.hat)
    gout.indep.star <- glm(counts.star ~ malformation + drinks,
        family = poisson)
    gout.sat.star <- glm(counts.star ~ malformation * drinks,
        family = poisson)
    aout.star <- anova(gout.indep.star, gout.sat.star, test = "LRT")
    p.value.star[iboot] <- aout.star[2, "Pr(>Chi)"]
}
all.p.values <- c(p.value.star, p.value.hat)
mean(all.p.values <= p.value.hat)
sd(all.p.values <= p.value.hat) / sqrt(nboot)
```
The only difference between this simulation and the one before is the line
```
counts.star <- rmultinom(1, n, prob = p.hat)
```
and the lines defining `n` to be the multinomial sample size and `p.hat` to be the estimated multinomial parameter vector, so that we can use them in the statement just above.
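One way to see the difference between the two sampling schemes in action: multinomial simulation keeps the table total fixed at `n`, while Poisson simulation does not. A minimal sketch (it assumes the objects from the chunks above are still in the R global environment):
```{r poisson.versus.multinomial}
set.seed(42)
# totals of Poisson-simulated tables vary around n
range(replicate(100, sum(rpois(length(mu.hat), lambda = mu.hat))))
# totals of multinomial-simulated tables are always exactly n
range(replicate(100, sum(rmultinom(1, n, prob = p.hat))))
```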