--- title: "Stat 5421 Lecture Notes: To Accompany Agresti Ch 4, Addendum" author: "Charles J. Geyer" date: "`r format(Sys.time(), '%B %d, %Y')`" output: html_document: md_extensions: -tex_math_single_backslash mathjax: https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML pdf_document: default --- # Intercept Makes No Difference When Any Predictor Is Categorical We use for an example data from the examples for R function `glm` ```{r glm.example} ## Dobson (1990) Page 93: Randomized Controlled Trial : counts <- c(18,17,15,20,10,20,25,13,12) outcome <- gl(3,1,9) treatment <- gl(3,3) data.frame(treatment, outcome, counts) # showing data glm.D93 <- glm(counts ~ outcome + treatment, family = poisson()) summary(glm.D93) ``` Note that `outcome` and `treatment` are factors (the terminology R uses for categorical variables) ```{r glm.example.types} class(outcome) class(treatment) ``` and they each have three levels ```{r glm.example.nlevels} nlevels(outcome) nlevels(treatment) ``` So how does a categorical variable correspond to a regression model in which it is used. For each categorical variable, the regression model needs *dummy variables*, one for each category (R calls them *levels* of the *factor* variable). We can see these dummy variables as follows ```{r glm.example.dummies} model.matrix(~ 0 + outcome) model.matrix(~ 0 + treatment) ``` Each dummy variable indicates the cases (components of the response vector, rows of the model matrix) which are in that category (level). Note that the dummy variables corresponding to a categorical variable add up to the vector all of whose components are equal to one, which is what R calls the *intercept* dummy variable. ```{r glm.example.dummy.sums} rowSums(model.matrix(~ 0 + outcome)) rowSums(model.matrix(~ 0 + treatment)) model.matrix(counts ~ 1) ``` Thus we would not have an identifiable model if we kept all of the dummy variables corresponding to a categorical variable and kept an "intercept" dummy variable. The default behavior in R is to * keep an "intercept" dummy variable and * drop one of the dummy variables for each categorical variable (factor). This gives an identifiable model. ```{r glm.example.default.model.matrix} model.matrix(counts ~ outcome + treatment) ``` The dummy variables `outcome1` and `treatment1` are omitted. But we can tell R to omit the intercept. ```{r glm.example.no.intercept.model.matrix} model.matrix(counts ~ 0 + outcome + treatment) ``` Now the behavior is * omit an "intercept" dummy variable and * drop one of the dummy variables for each categorical variable (factor) **except for one** categorical variable (for which we keep all of its dummy variables) Note — this is very important — that these two recipes give the *same* model. * Leaving out the intercept produces a *different* model *when all of the predictor variables are quantitative*. * Leaving out the intercept produces the *same* model *when at least one of the predictor variables is categorical (qualitative, factor)*. ```{r glm.example.no.intercept} glm.D93.no.intercept <- glm(counts ~ 0 + outcome + treatment, family = poisson()) summary(glm.D93.no.intercept) ``` Note that the regression coefficients (estimates) are different from the ones with intercept shown above. But the estimated cell means are the same, hence the *both specify the same statistical model*. These are different parameterizations of the same model. ```{r glm.example.no.expected.values} mu <- predict(glm.D93, type = "response") mu.no.intercept <- predict(glm.D93.no.intercept, type = "response") all.equal(mu, mu.no.intercept) ``` # R Function `drop1` R function `drop1` is useful for finding out the effect of dropping terms from a model, especially when there are categorical variables, in which case one wants to drop all of the dummy variables for a categorical variable or drop none of them. ```{r glm.example.drop1} drop1(glm.D93, test = "LRT") ``` # R Function `add1` There is also an R function `add1` ```{r glm.example.add1} glm.D93.intercept.only <- glm(counts ~ 1, family = poisson()) add1(glm.D93.intercept.only, scope = ~ outcome + treatment, test = "LRT") ``` # R Function `anova` But one can also do these tests using R function `anova` ```{r glm.example.anova} glm.D93.outcome.only <- glm(counts ~ outcome, family = poisson()) glm.D93.treatment.only <- glm(counts ~ treatment, family = poisson()) # compare with results of drop1 anova(glm.D93.outcome.only, glm.D93, test = "LRT") anova(glm.D93.treatment.only, glm.D93, test = "LRT") # compare with results of add1 anova(glm.D93.intercept.only, glm.D93.outcome.only, test = "LRT") anova(glm.D93.intercept.only, glm.D93.treatment.only, test = "LRT") ```