--- title: "Stat 5421 Lecture Notes: To Accompany Agresti Ch 9" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" output: bookdown::html_document2: number_sections: false md_extensions: -tex_math_single_backslash mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML bookdown::pdf_document2: number_sections: false md_extensions: -tex_math_single_backslash linkcolor: blue urlcolor: blue --- # Section 9.1 Two-Way Tables ## Section 9.1.1 Only Main Effects The *independence* or *homogeneity of proportions* model has mean-value parameters $$ \mu_{i j} = \alpha_i \beta_j $$ or canonical parameters \begin{equation} \theta_{i j} = \alpha_i + \beta_j (\#eq:two-way-independence) \end{equation} where the alphas and betas are different in the two equations (this causes no confusion, since if we look at parameters at all, these are usually canonical parameters). ## Section 9.1.3 Interactions The *saturated* model has completely arbitrary parameters. The mean value parameters $\mu_{i j}$ have no specified structure. The canonical parameters $\theta_{i j}$ have no specified structure. So this is the largest model. It has the most (arbitrarily variable) parameters. ## Section 9.1.5 Hierarchical Models The hierarchical model principle says that, if you have an interaction term of a certain order, then you have all lower-order interactions of the same variables. For the saturated model for a two-way table this says \begin{equation} \theta_{i j} = \alpha_i + \beta_j + \gamma_{i j} (\#eq:two-way-saturated) \end{equation} (but this just says $\theta_{i j}$ can be anything. This is not an identifiable parameterization. ## Identifiability A statistical model (in general, not just in categorical data analysis) is *identifiable* if there is only one (vector) parameter value corresponding to a probability distribution. In exponential family models ([Notes on Exponential Families, Section 6.2](http://www.stat.umn.edu/geyer/5421/notes/expfam.pdf#page=28)) every different mean vector corresponds to a different distribution, but different canonical parameter vectors can correspond to the same distribution. If we are assuming *Poisson sampling* ([Notes on Exponential Families, Part I, section 7](http://www.stat.umn.edu/geyer/5421/notes/expfam.pdf#page=46)) then the canonical parameters are identifiable if and only if the mean-value parameters are (because the link function (componentwise log) is invertible). Clearly the specification for the two-way independence model \@ref(eq:two-way-independence) is not identifiable because one can add a constant to all of the alphas and subtract the same constant from all the betas and get the same result $$ (\alpha_i + c) + (\beta_j - c) = \alpha_i + \beta_j $$ Hence \@ref(eq:two-way-saturated) is also not identifiable. But we don't need to worry about identifiability when fitting models because the computer automatically takes care of it. 
For example, consider the data in Table 3.8 in Agresti
```{r data.fetal.alcohol.syndrome}
counts <- c(17066, 14464, 788, 126, 37, 48, 38, 5, 1, 1)
drinks <- rep(c("0", "< 1", "1-2", "3-5", ">= 6"), times = 2)
malformation <- rep(c("Absent", "Present"), each = 5)
data.frame(drinks, malformation, counts)
```

We can test independence using R function `chisq.test`
```{r foo}
foo <- xtabs(counts ~ malformation + drinks)
foo
chisq.test(foo)
```
But it warns "Chi-squared approximation may be incorrect", so try
```{r foo.bar}
chisq.test(foo, simulate.p.value = TRUE)
```

But we can also use R function `glm` and its helper functions to do the job.
```{r glm}
gout.indep <- glm(counts ~ malformation + drinks, family = poisson)
summary(gout.indep)
gout.sat <- glm(counts ~ malformation * drinks, family = poisson)
summary(gout.sat)
anova(gout.indep, gout.sat, test = "LRT")
anova(gout.indep, gout.sat, test = "Rao")
```
```{r glm.hide}
p.value.lrt <- anova(gout.indep, gout.sat, test = "LRT")[2, "Pr(>Chi)"]
p.value.rao <- anova(gout.indep, gout.sat, test = "Rao")[2, "Pr(>Chi)"]
p.value.chisq <- suppressWarnings(chisq.test(foo)$p.value)
p.value.lrt
p.value.rao
p.value.chisq
```

We notice a number of things.

 * The $P$-value for the Rao test (`r round(p.value.rao, 5)`) is exactly equal to that output for the chi-square test because the Pearson chi-square test is a special case of the Rao test.

 * The likelihood ratio test disagrees quite strongly with the Rao test. Of course, these tests are asymptotically equivalent, but here the sample size is not "large". The total number of subjects ($`r sum(foo)`$) is large, but the expected counts in some cells of the contingency table are quite small, a lot less than what the rule of thumb says is necessary for valid asymptotics (at least five in each cell), so maybe that accounts for the difference.
```{r foo.expected}
suppressWarnings(chisq.test(foo)$expected)
```

 * The computer (R function `glm`) knows how to deal with identifiability. It adds an intercept, so it is really using the formula
$$
   \theta_{i j} = \alpha + \beta_i + \gamma_j
$$
for the independence model and
$$
   \theta_{i j} = \alpha + \beta_i + \gamma_j + \delta_{i j}
$$
for the saturated model. Then it "drops" (sets equal to zero) one of the betas, one of the gammas, and all interaction terms corresponding to the dropped beta and the dropped gamma.
```{r gout.indep.coef}
names(coef(gout.indep))
```
It keeps the "intercept" ($\alpha$ in our notation just above). It drops one of the coefficients for the factor `malformation` (and keeps the other one). It drops one of the coefficients for the factor `drinks` (and keeps the other four). And this gives it an identifiable model.
```{r gout.sat.coef}
names(coef(gout.sat))
```
It has all the parameters of the independence model plus a lot of "interaction" parameters (the ones whose names contain colons). But it drops all the "interaction" parameters that involve parameters that were "dropped" from the independence model (it contains no "interaction" terms containing `malformationAbsent` or `drinks< 1`). And this gives it an identifiable model.
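As a check on the claim that the Pearson chi-square statistic is the Rao statistic here, we can compute both test statistics by hand from the fitted cell means of the independence model. A minimal sketch (it assumes the objects created in the chunks above are still in the R global environment):
```{r statistics.by.hand}
mu.hat.indep <- predict(gout.indep, type = "response")
# Pearson chi-square statistic, which is the Rao (score) statistic here
sum((counts - mu.hat.indep)^2 / mu.hat.indep)
# likelihood ratio test statistic (difference in deviances)
2 * sum(counts * log(counts / mu.hat.indep))
```
These should match the test statistics in the `anova` output above.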
# Section 9.2 Three-Way Tables

Three-way tables are just like two-way tables except the data have three subscripts for the three dimensions of the table. The models are analogs of the models for two-way tables.

## Section 9.2.1 Only Main Effects

The *independence* or *homogeneity of proportions* model has mean-value parameters
$$
   \mu_{i j k} = \alpha_i \beta_j \gamma_k
$$
or canonical parameters
\begin{equation}
   \theta_{i j k} = \alpha_i + \beta_j + \gamma_k
   (\#eq:three-way-independence)
\end{equation}
where the alphas, betas, and gammas are different in the two equations (this causes no confusion, since if we look at parameters at all, these are usually canonical parameters).

## The Saturated Model

As before, the *saturated* model has completely arbitrary parameters. The mean-value parameters $\mu_{i j k}$ have no specified structure. The canonical parameters $\theta_{i j k}$ have no specified structure. So this is the largest model. It has the most (arbitrarily variable) parameters.

## Example: High School Student Survey

This is Table 9.3 in Agresti and can be found in R package `CatDataAnalysis`.
```{r high.school}
# clean up R global environment
rm(list = ls())
library(CatDataAnalysis)
data(table_9.3)
names(table_9.3)
foo <- xtabs(count ~ a + c + m, data = table_9.3)
foo
```
Since `a`, `c`, and `m` are categorical, we should perhaps make them factors. But since they each have only two values, it does not matter whether we make them factors or not. If they had more than two values, then it would be essential to make them factors.

We can make these data look nicer by copying the names out of the book
```{r high.school.dimnames}
dimnames(foo) <- list(alcohol = c("yes", "no"),
    cigarettes = c("yes", "no"), marijuana = c("yes", "no"))
aperm(foo, c(2, 3, 1))
```
We needed R function `aperm`, which permutes the dimensions of an array, to present the data in the same order as in the book. Of course, the order doesn't matter and neither do the `dimnames`. That is just to match the book.

Now R function `chisq.test` is useless, but R function `glm` has no problems. Our test of independence or homogeneity of proportions is the same, just with three "predictors" rather than two (the "predictors" are in scare quotes because they are just the `dimnames`)
```{r high.school.test}
gout.indep <- glm(count ~ a + c + m, data = table_9.3, family = poisson)
gout.sat <- glm(count ~ a * c * m, data = table_9.3, family = poisson)
anova(gout.indep, gout.sat, test = "LRT")
anova(gout.indep, gout.sat, test = "Rao")
```
Clearly, the independence model does not fit the data. So statistics says these variables (alcohol use, cigarette use, and marijuana use) have some association. No real surprise, but the data do confirm what just about everybody assumes.
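As a sanity check on the mean-value parameterization, the maximum likelihood fitted cell means under the independence model are the total count times the product of the one-dimensional marginal proportions. A minimal sketch verifying this against `glm` (it assumes the objects from the chunk above are still in the R global environment):
```{r high.school.margins}
mu.hat <- predict(gout.indep, type = "response")
n <- sum(table_9.3$count)
# one-dimensional marginal proportions
p.a <- xtabs(count ~ a, data = table_9.3) / n
p.c <- xtabs(count ~ c, data = table_9.3) / n
p.m <- xtabs(count ~ m, data = table_9.3) / n
# fitted means under independence: n times the product of the marginals
mu.check <- n * as.vector(p.a[as.character(table_9.3$a)]) *
    as.vector(p.c[as.character(table_9.3$c)]) *
    as.vector(p.m[as.character(table_9.3$m)])
# agreement up to the convergence tolerance of the iterative glm fit
all.equal(mu.check, as.vector(mu.hat), tolerance = 1e-6)
```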
## Interactions

Our saturated model has the three-way interaction `a * c * m` but there can also be two-way interactions. The model with all two-way interactions would have canonical parameter
$$
   \theta_{i j k} = \alpha_{i j} + \beta_{i k} + \gamma_{j k}
$$
(as before, the terms with alphas, betas, and gammas need not match up with previous uses of these Greek letters). And people who insist on the hierarchical principle would write
$$
   \theta_{i j k} = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \zeta_{i k} + \eta_{j k}
$$
but both formulas specify the same model and neither is identifiable.

So let's look at that model.
```{r high.school.all.two.way}
gout.all.two.way <- glm(count ~ (a + c + m)^2, data = table_9.3,
    family = poisson)
anova(gout.indep, gout.all.two.way, gout.sat, test = "LRT")
```
```{r high.school.p.values}
foompter <- anova(gout.indep, gout.all.two.way, gout.sat, test = "LRT")
foompter.p.values <- foompter[-1, "Pr(>Chi)"]
foompter.p.values
```
This says the two-way interactions model (model 2) fits as well as the three-way interactions model (model 3, the saturated model) because $P = `r round(foompter.p.values[2], 4)`$ is not statistically significant (not even close). And it says the two-way interactions model (model 2) fits much better than the main effects only model (model 1, the independence model) because $P < 2 \times 10^{-16}$ is highly statistically significant (by anyone's standards).

We should not be overly impressed by very, very small $P$-values because asymptotic approximation gives only small absolute errors rather than small relative errors. All this very, very small $P$-value says is that an exact calculation (if we could do it, which we cannot) would also give something small, say $P < 0.001$. But there is no reason to believe the $P < 2 \times 10^{-16}$ that R prints is correct to within even several orders of magnitude.

## Many Models

There are many hierarchical models here, because we need not include all of the main effects or all two-way interactions. The hierarchical principle says that if we have a two-way interaction, then we must have the corresponding main effects, but that leaves (writing the three-way interaction term as $\iota_{i j k}$ to avoid a clash with the canonical parameter $\theta_{i j k}$)
\begin{align*}
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \zeta_{i k} + \eta_{j k} + \iota_{i j k} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \zeta_{i k} + \eta_{j k} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \zeta_{i k} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \eta_{j k} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \zeta_{i k} + \eta_{j k} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \zeta_{i k} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k + \eta_{j k} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \varepsilon_{i j} \\
   \theta_{i j k} & = \alpha + \beta_i + \delta_k + \zeta_{i k} \\
   \theta_{i j k} & = \alpha + \gamma_j + \delta_k + \eta_{j k} \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j + \delta_k \\
   \theta_{i j k} & = \alpha + \beta_i + \gamma_j \\
   \theta_{i j k} & = \alpha + \beta_i + \delta_k \\
   \theta_{i j k} & = \alpha + \gamma_j + \delta_k \\
   \theta_{i j k} & = \alpha + \beta_i \\
   \theta_{i j k} & = \alpha + \gamma_j \\
   \theta_{i j k} & = \alpha + \delta_k \\
   \theta_{i j k} & = \alpha
\end{align*}
Of course, we already know many of these models don't fit the data. We saw that the main effects only model doesn't fit. So none of its submodels can fit either. But this is a complete listing of all of the hierarchical models (the first line being the saturated model).
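To connect this list to R model formulas, here is one of the intermediate models fit directly: the model $\theta_{i j k} = \alpha + \beta_i + \gamma_j + \delta_k + \varepsilon_{i j} + \eta_{j k}$, which says alcohol and marijuana are conditionally independent given cigarettes. This is only a sketch of how such a fit would go (it assumes the objects from the chunks above are still in the R global environment).
```{r high.school.cond.indep}
# main effects plus the a:c and c:m interactions only
gout.cond.indep <- glm(count ~ a * c + c * m, data = table_9.3,
    family = poisson)
# test against the all-two-way-interactions model, that is, test
# whether the a:m interaction is needed
anova(gout.cond.indep, gout.all.two.way, test = "LRT")
```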
How can we choose among all these models? There is a methodology for that, but it is getting a bit ahead of ourselves. There will be a handout about it. But for now, we just show it without much explanation.
```{r high.school.glmbb}
library(glmbb)
out <- glmbb(count ~ a * c * m, data = table_9.3, family = poisson)
summary(out)
```
This says that none of the models that do not have all three two-way interactions fit the data according to the criterion: none have AIC less than the AIC for the best model ("best" according to AIC) plus 10. We don't know what AIC is yet, but we will learn about it in a later handout. For now, just consider it a criterion of goodness of fit.

If we want to see how bad some of those non-fitting models were, we can increase the cutoff.
```{r high.school.glmbb.too}
library(glmbb)
out <- glmbb(count ~ a * c * m, data = table_9.3, family = poisson,
    cutoff = 90)
summary(out)
```
That is a huge jump up to the next best fitting model, which (take my word for it for now) does not fit the data well at all.

# Section 9.4 Four-Way and Beyond

And so on and so forth. Higher-order tables are just more complicated, but they present no new issues.

## Example: Seat-Belt Use

These data are given in Table 9.8 in Agresti. They are apparently not in R package `CatDataAnalysis`.
```{r seat.belt,error=TRUE}
# clean up R global environment
rm(list = ls())
count <- c(7287, 11587, 3246, 6134, 10381, 10969, 6123, 6693,
    996, 759, 973, 757, 812, 380, 1084, 513)
injury <- gl(2, 8, 16, labels = c("No", "Yes"))
gender <- gl(2, 4, 16, labels = c("Female", "Male"))
location <- gl(2, 2, 16, labels = c("Urban", "Rural"))
seat.belt <- gl(2, 1, 16, labels = c("No", "Yes"))
data.frame(gender, location, seat.belt, injury, count)
xtabs(count ~ seat.belt + injury + location + gender)
```
Looks OK.
```{r seat.belt.glmbb}
out <- glmbb(count ~ seat.belt * injury * location * gender,
    family = "poisson")
summary(out)
```
Now there are several models that fit these data fairly well. We will leave it at that for now.
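As one concrete follow-up (not in Agresti's presentation of these data), we could fit the all-two-way-interactions model directly, just as in the three-way example, and test it against the saturated model. A minimal sketch (it assumes the objects from the chunk above are still in the R global environment):
```{r seat.belt.two.way}
gout.two.way <- glm(count ~ (seat.belt + injury + location + gender)^2,
    family = poisson)
gout.sat <- glm(count ~ seat.belt * injury * location * gender,
    family = poisson)
# test whether any interactions above two-way are needed
anova(gout.two.way, gout.sat, test = "LRT")
```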
# Parametric Bootstrap

## Assuming Poisson Sampling

The method of R function `anova` for objects of class `"glm"` produced by R function `glm` has no optional argument `simulate.p.value`. So here is how we could do that for our first example.
```{r parametric.bootstrap}
# clean up R global environment
rm(list = ls())
# re-establish stuff done in first section
counts <- c(17066, 14464, 788, 126, 37, 48, 38, 5, 1, 1)
drinks <- rep(c("0", "< 1", "1-2", "3-5", ">= 6"), times = 2)
malformation <- rep(c("Absent", "Present"), each = 5)
gout.indep <- glm(counts ~ malformation + drinks, family = poisson)
gout.sat <- glm(counts ~ malformation * drinks, family = poisson)
aout <- anova(gout.indep, gout.sat, test = "LRT")
```
This gets us back to where we were before we removed everything produced in our original analysis of these data.

Now we need the estimated mean values (expected cell counts) and the $P$-value for the test
```{r parametric.bootstrap.estimates}
p.value.hat <- aout[2, "Pr(>Chi)"]
p.value.hat
mu.hat <- predict(gout.indep, type = "response")
mu.hat
```
Now we can do the simulation
```{r parametric.bootstrap.bootstrap.estimates,cache=TRUE}
# set random number generator seed for reproducibility
set.seed(42)
nboot <- 999
p.value.star <- double(nboot)
for (iboot in 1:nboot) {
    # simulate new data from fitted model
    counts.star <- rpois(length(mu.hat), lambda = mu.hat)
    gout.indep.star <- glm(counts.star ~ malformation + drinks,
        family = poisson)
    gout.sat.star <- glm(counts.star ~ malformation * drinks,
        family = poisson)
    aout.star <- anova(gout.indep.star, gout.sat.star, test = "LRT")
    p.value.star[iboot] <- aout.star[2, "Pr(>Chi)"]
}
all.p.values <- c(p.value.star, p.value.hat)
mean(all.p.values <= p.value.hat)
sd(all.p.values <= p.value.hat) / sqrt(nboot)
```
The code above is a bit tricky, so here is the explanation.

 * Each time through the loop we simulate new data based on our best estimate of the model under the null hypothesis. We are assuming Poisson sampling, so we use R function `rpois` to simulate the data. This function vectorizes, so it works with `mu.hat` being a vector.

 * The next three statements do exactly the same as the original analysis except that we use the simulated data rather than the real data.

 * Then we extract the $P$-value from the result of R function `anova` and store it for future use.

 * After the loop terminates, we have `nboot` simulations from the distribution of the $P$-value under the null hypothesis. We also have one more (not simulated) $P$-value that has the same distribution under the null hypothesis as the simulated $P$-values. This is `p.value.hat`, the value for the real data.

 * The estimated $P$-value is then the fraction of the time the bootstrap simulated $P$-values `p.value.star` are at least as extreme as the $P$-value for the actual data `p.value.hat`.

 * The last statement calculates the Monte Carlo standard error, which says how far off our simulation is from what we would get if we did an infinite number of simulations.

The idea is that, when the asymptotics cannot be trusted, all we can say about a $P$-value is that it can be used as a test statistic. We take lower values as stronger evidence against the null hypothesis, but we don't know that it is correctly calibrated: we don't think that we would have $P \le \alpha$ with probability $\alpha$ for all $\alpha$ under the null hypothesis. So we need to simulate the distribution of this test statistic to compute a valid $P$-value. And that is what this does. Call the observed value of this test statistic $\widehat{P}$ and the simulated value of this test statistic $P^*$; then $\Pr(P^* \le \widehat{P})$ is the valid $P$-value.

The trick of including the observed $P$-value among the simulated $P$-values guarantees that we cannot get a spurious bootstrap $P$-value of zero by making the Monte Carlo sample size too small. The smallest bootstrap $P$-value we can get by this procedure is `1 / (nboot + 1)`. This trick also comes with an argument that it gives an exact randomized test for any significance level that is a multiple of `1 / (nboot + 1)`. Under the null hypothesis, the bootstrap $P$-value has the discrete uniform distribution on the numbers `seq(1, nboot + 1) / (nboot + 1)`.
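To spell that last point out, the bootstrap $P$-value is a rank statistic: it is the rank of $\widehat{P}$ among all `nboot + 1` values, divided by `nboot + 1`. A minimal check (it assumes the objects from the chunk above are still in the R global environment):
```{r parametric.bootstrap.rank}
# rank of the observed P-value among all nboot + 1 P-values
# (ties.method = "max" counts the values <= the observed one,
# so this agrees with mean(all.p.values <= p.value.hat) above)
rank(all.p.values, ties.method = "max")[nboot + 1] / (nboot + 1)
```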
## Assuming Multinomial Sampling

When we simulate, we are not using asymptotics, and we are using exact sampling distributions. Thus Poisson sampling and multinomial sampling are not equivalent (their equivalence is an asymptotic result). So we can do the same simulation under multinomial sampling.
```{r parametric.bootstrap.bootstrap.estimates.too,cache=TRUE}
n <- sum(counts)
p.hat <- mu.hat / n
p.value.star <- double(nboot)
for (iboot in 1:nboot) {
    # simulate new data from fitted model
    counts.star <- rmultinom(1, n, prob = p.hat)
    gout.indep.star <- glm(counts.star ~ malformation + drinks,
        family = poisson)
    gout.sat.star <- glm(counts.star ~ malformation * drinks,
        family = poisson)
    aout.star <- anova(gout.indep.star, gout.sat.star, test = "LRT")
    p.value.star[iboot] <- aout.star[2, "Pr(>Chi)"]
}
all.p.values <- c(p.value.star, p.value.hat)
mean(all.p.values <= p.value.hat)
sd(all.p.values <= p.value.hat) / sqrt(nboot)
```
The only difference between this simulation and the one before is the line
```
counts.star <- rmultinom(1, n, prob = p.hat)
```
and the lines defining `n` to be the multinomial sample size and `p.hat` to be the estimated multinomial parameter vector, so that we can use them in the statement just above.
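One way to see the difference between the two sampling schemes in action: multinomial simulation keeps the table total fixed at `n`, while Poisson simulation does not. A minimal sketch (it assumes the objects from the chunks above are still in the R global environment):
```{r poisson.versus.multinomial}
set.seed(42)
# totals of Poisson-simulated tables vary around n
range(replicate(100, sum(rpois(length(mu.hat), lambda = mu.hat))))
# totals of multinomial-simulated tables are always exactly n
range(replicate(100, sum(rmultinom(1, n, prob = p.hat))))
```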