Copyright © January 17, 2021 by Nathaniel E. Helwig

Sections 1-3 summarize parts of
Wood, S. N. (2017). Generalized linear models (chapter 3). Generalized additive models: An introduction with R, Second Edition. CRC Press.

1 The theory of GLMs


Generalized linear models (GLMs) are flexible extensions of linear models that can be used to fit regression models to non-Gaussian data. The basic form of a GLM is \[ g(\mu_i) = \mathbf{X}_i^\top \boldsymbol\beta = \beta_0 + \sum_{j = 1}^p x_{ij} \beta_j \] where \(\mu_i = \mathbb{E}(Y_i)\) is the expected value of the response \(Y_i\) given the predictors, \(g(\cdot)\) is a smooth and monotonic link function that connects \(\mu_i\) to the predictors, \(\mathbf{X}_i^\top = (x_{i0}, x_{i1}, \ldots, x_{ip})\) is the \(i\)-th observation’s known predictor vector with \(x_{i0} = 1\), and \(\boldsymbol\beta = (\beta_0, \beta_1, \ldots, \beta_p)^\top\) is the unknown vector of regression coefficients. The right hand portion of the above equation is often denoted by \[ \eta_i = \beta_0 + \sum_{j = 1}^p x_{ij} \beta_j \] which is referred to as the linear predictor for the \(i\)-th observation. Since the link function is assumed to be invertible, we can write the expected value as \(\mu_i = g^{-1}(\eta_i)\), where \(g^{-1}(\cdot)\) is the inverse of the link function.

The exponential family of distributions

In GLMs, the response variable \(Y_i\) is assumed to follow an exponential family distribution, which has the form \[ f_\theta(y) = \exp\left[ \{y \theta - b(\theta)\}/a(\phi) + c(y, \phi) \right] \] where \(a(\cdot)\), \(b(\cdot)\), and \(c(\cdot)\) are known functions defining the particular distribution, \(\phi\) is nuisance parameter called the scale or dispersion parameter, and \(\theta\) is the canonical parameter of the distribution.

Some properties of exponential family distributed variables:

  1. log-likelihood: \(l(\theta) = \{y \theta - b(\theta) \} / a(\phi) + c(y, \phi)\)

  2. expected value: \(\mathbb{E}(Y) = \mu = b'(\theta)\) where \(b'(\cdot)\) is the first derivative of \(b(\cdot)\)

  3. variance: \(\mathrm{var}(Y) = b''(\theta) a(\phi)\) where \(b''(\cdot)\) is the second derivative of \(b(\cdot)\)

A helpful table summarizing several common exponential family distributions:

Reproduction of Table 3.1 from Wood (2017).

If the link function in the GLM is the canonical link function (see table), then the canonical parameter is equal to the linear predictor, i.e., \(\theta_i = \eta_i\). However, the canonical link function is not always appropriate for a given sample of data. In more general cases, the canonical parameter can be written as \(\theta_i = g_c(\mu_i)\) where \(g_c(\cdot)\) is the canonical link function and \(\mu_i = g^{-1}(\eta_i)\) with \(g^{-1}(\cdot)\) denoting the inverse of the chosen link function.

Fitting generalized linear models

Given \(n\) independent observations, the likelihood function has the form \[ L(\boldsymbol\beta) = \prod_{i=1}^n f_{\theta_i}(y_i) \] Note that the canonical parameter is subscripted with \(i\) because each observation can have a unique canonical parameter, which depends on the common \(\boldsymbol\beta\) vector.

The log-likelihood function has the form \[ l(\boldsymbol\beta) = \log(L(\boldsymbol\beta)) = \sum_{i=1}^n \{y_i \theta_i - b_i(\theta_i) \} / a_i(\phi) + c_i(y_i, \phi) \] where the \(a(\cdot)\), \(b(\cdot)\), and \(c(\cdot)\) functions are now subscripted with \(i\) because they may differ across observations (e.g., by a known weight). In many cases, we can write \(a_i(\phi) = \phi / \omega_i\) where \(\omega_i\) is a known weight (typically \(\omega_i = 1\) for all \(i\)), so the log-likelihood can be simplified to \[ l(\boldsymbol\beta) = \sum_{i=1}^n \omega_i \{y_i \theta_i - b_i(\theta_i) \} / \phi + c_i(y_i, \phi) \] where \(c_i(y_i, \phi)\) does not depend on \(\boldsymbol\beta\) (so it can be dropped from the equation).

The maximum likelihood estimate of \(\boldsymbol\beta\) is defined as \[ \hat{\boldsymbol\beta} = \underset{\boldsymbol\beta}{\arg\!\max} \ l(\boldsymbol\beta) \] which the value of the coefficient vector that maximizes the likelihood function. For Gaussian response variables, there exists a closed form solution for the coefficient vector. However, for other exponential family response variables, an iterative approach is used to numerically obtain the MLE \(\hat{\boldsymbol\beta}\). Obtaining the MLEs boils down to an iteratively reweighted least squares (IRLS) problem, which typically converges rather quickly for well specified models.

Large sample distribution of \(\hat{\boldsymbol\beta}\)

From the asymptotic normality of MLEs, we know that as \(n \rightarrow \infty\) the coefficient estimates are multivariate normal \[ \hat{\boldsymbol\beta} \sim N(\boldsymbol\beta, \mathcal{I}^{-1}) \] where \(\mathcal{I} = \phi^{-1} \mathbf{X}^\top \mathbf{W} \mathbf{X}\) is the Fisher information matrix. Note that \(\mathbf{X}\) is the design matrix, and \(\mathbf{W}\) are the weights at the convergence of the IRLS problem.

Likelihood ratio tests

Suppose that we have two nested models, \(M_0\) and \(M_1\), and consider testing \[ \begin{split} H_0: g(\mu_i) = \mathbf{X}_{i0}^\top \boldsymbol\beta_0 \\ H_1: g(\mu_i) = \mathbf{X}_{i1}^\top \boldsymbol\beta_1 \\ \end{split} \] where \(\mathbf{X}_{i0} \subset \mathbf{X}_{i1}\) (i.e., the predictors in model \(M_0\) are a subset of those in model \(M_1\)). We can test this hypothesis using a likelihood ratio test (LRT), which has the form \[ 2\{l(\hat{\boldsymbol\beta}_1) - l(\hat{\boldsymbol\beta}_0) \} \sim \chi_{p_1 - p_0}^2 \] where \(p_0\) and \(p_1\) are the number of parameters in each model (with \(p_0 < p_1\)). Remember, this test is a large sample result.


The concept of deviance in a GLM is used to quantify lack of fit, similar to how the residual sum of squares is used for Gaussian linear models. The deviance of a fit model is defined as \[ D = 2\{l(\hat{\boldsymbol\beta}_\max) - l(\hat{\boldsymbol\beta}) \} \phi \] where \(l(\hat{\boldsymbol\beta}_\max)\) is the maximized log-likelihood of the saturated model (i.e., the model with \(n\) parameters), and \(l(\hat{\boldsymbol\beta})\) is the maximized log-likelihood for the model of interest. Note that the saturated model defines \(\hat{\mu_i} = y_i\).

The scaled deviance divides the deviance by the scale (or dispersion) parameter \(\phi\), i.e., \[ D^* = D / \phi \] which produces a scale-free estimate of the deviance. The previously discussed LRT can be formulated as \[ D_0^* - D_1^* \sim \chi_{p_1 - p_0}^2 \] where \(D_0^*\) and \(D_1^*\) are the scaled deviances for models \(M_0\) and \(M_1\), respectively.

Model comparison with unknown \(\phi\)

Using the scaled deviance (or LRT) requires the dispersion parameter \(\phi\), which is unknown in some cases. Under the null hypothesis that model \(M_0\) holds, we have that \[ D_0^* - D_1^* \sim \chi_{p_1 - p_0}^2 \qquad \mbox{and} \qquad D_1^* \sim \chi_{n - p_1}^2 \] which results in the large sample \(F\) approximation \[ F = \frac{(D_0 - D_1) / (p_1 - p_0)}{D_1 / (n - p_1)} \sim F_{p_1 - p_0, n - p_1} \] Note that the above \(F\) statistic can be calculated using the unscaled deviances, given that the \(\phi\) parameter cancels out from the numerator and denominator of the \(F\) statistic.


Akaike’s Information Criterion (AIC) can be used to compare non-nested models. The AIC is defined as \[ \mathrm{AIC} = -2 l(\hat{\boldsymbol\beta}) + 2p \] where \(l(\hat{\boldsymbol\beta})\) is the maximized log-likelihood and \(p\) is the number of parameters. When comparing multiple models, the model with the smallest AIC is preferred.

Estimating \(\phi\)

Various different approaches can be used to estimate the dispersion parameter:

  1. Deviance: \(\hat{\phi}_D = \hat{D} / (n - p)\)

  2. Pearson: \(\hat{\phi}_P = \hat{X}^2 / (n - p)\) where \(\hat{X}^2 = \sum_{i = 1}^n (y_i - \hat{\mu}_i)^2 / V(\hat{\mu}_i)\)

  3. Fletcher: \(\hat{\phi}_F = \hat{\phi}_P / (1 + \bar{s})\) where \(\bar{s} = (1/n) \sum_{i = 1}^n V'(\hat{\mu}_i) (y_i - \hat{\mu}_i) / V(\hat{\mu}_i)\)

The deviance approach can under-estimate \(\phi\) for count data with small mean values. The Pearson estimate is less biased, but can be unstable. The Fletcher approach overcomes theses issues.


Similarly, there are various different ways to define residuals in GLMs:

  1. Deviance: \(\hat{\epsilon}_i^d = \mathrm{sign}(y_i - \hat{\mu}_i) \sqrt{d_i}\) where \(\hat{D} = \sum_{i = 1}^n d_i\)

  2. Pearson: \(\hat{\epsilon}_i^p = (y_i - \hat{\mu}_i) / \sqrt{V(\hat{\mu}_i)}\) where \(\hat{X}^2 = \sum_{i = 1}^n (\hat{\epsilon}_i^p)^2\)

Note that these two residuals relate to the two (corresponding) approaches used to estimate the dispersion parameter.


For some data, an exponential family distribution will not be appropriate. For example, suppose we have count data (like for a Poisson response), but the variance of the data is not equal to the mean (which the Poisson assumes is the case). To fit GLMs to such data, some more flexible approach is needed. The idea of quasi-likelihood estimation is to use a different function (than the likelihood function) to obtain the parameter estimates. Unlike likelihood estimation (which requires specifying the distribution of the response), quasi-likelihood estimation only requires specifying the mean and variance functions.

Tweedie and negative binomial distributions

The Tweedie and negative binomial distributions are not exponential family distributions, but can be treated as exponential family distributions if an additional parameter is known. The Tweedie distribution is for non-negative real numbers (like Gamma). The negative binomial distribution is for count data (like Poisson).

The variance functions are

  1. Tweedie: \(V(\mu) = \mu^p\) where \(1 < p < 2\) is the extra parameter

  2. NegBin: \(V(\mu) = \mu + \mu^2 / \theta\) where \(\theta > 0\) is the extra parameter

The Tweedie distribution can be useful for data with point masses at zero.

The negative binomial approaches the Poisson distribution as \(\theta \rightarrow \infty\).

2 Example 1: Logistic Regression

Data overview

The level of the blood enzyme creatinine kinase (CK) is thought to be relevant for early diagnosis of heart attacks. The dataset below gives the CK levels and heart attack outcomes (i.e., counts) for \(n = 360\) patients from a study by Smith (1967). Note that ck is the CK level, ha is the number of patients that had a heart attack, and ok is the number of patients who did not have a heart attack.

ck <- c(20, 60, 100, 140, 180, 220, 260, 300, 340, 380, 420, 460)
ha <- c(2, 13, 30, 30, 21, 19, 18, 13, 19, 15, 7, 8)
ok <- c(88, 26, 8, 5, 0, 1, 1, 1, 1, 0, 0, 0)
heart <- data.frame(ck = ck, ha = ha, ok = ok)
##     ck ha ok
## 1   20  2 88
## 2   60 13 26
## 3  100 30  8
## 4  140 30  5
## 5  180 21  0
## 6  220 19  1
## 7  260 18  1
## 8  300 13  1
## 9  340 19  1
## 10 380 15  0
## 11 420  7  0
## 12 460  8  0

We might begin with some simple visualizations of the data trends:

plot(ck, ha / (ha + ok), xlab = "CK level", ylab = "Probability of Heart Attack")

Note that the probability of having a heart attack seems to increase (in a non-linear fashion) as a function of the CK level.

Simple logistic regression

Let \(p_i\) denote the probability of having a heart attack, and let \(\mu_i = p_i N_i\) denote the expected number of heart attacks for the \(i\)-th CK level (where \(N_i\) is the number of patients at the \(i\)-th CK level). We will begin by considering a simple logistic regression model, which assumes that \[ g(\mu_i) = \log\left(\frac{\mu_i}{N_i - \mu_i}\right) = \log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_i \] where \(x_i\) is the \(i\)-th CK level. This is the “logit” link function, which takes the (natural) logarithm of the odds of success.

Side note: the odds of success is defined as \(p / (1 - p)\), where \(p\) is the probability of the event occurring. So, if someone says “the odds are 4:1”, this means that there is an 80% chance of the event occurring (\(p = 4/5 \rightarrow p/(1-p) = 4\)).

To fit the model, we will use the glm function, which works similar to the lm function, except that you have to specify the form of the exponential family. In this case, we will use the binomial family.

mod.0 <- glm(cbind(ha, ok) ~ ck, family = binomial, data = heart)
## Call:  glm(formula = cbind(ha, ok) ~ ck, family = binomial, data = heart)
## Coefficients:
## (Intercept)           ck  
##    -2.75836      0.03124  
## Degrees of Freedom: 11 Total (i.e. Null);  10 Residual
## Null Deviance:       271.7 
## Residual Deviance: 36.93     AIC: 62.33

The glm object can be summarized just like an lm object:

## Call:
## glm(formula = cbind(ha, ok) ~ ck, family = binomial, data = heart)
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -3.08184  -1.93008   0.01652   0.41772   2.60362  
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.758358   0.336696  -8.192 2.56e-16 ***
## ck           0.031244   0.003619   8.633  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
##     Null deviance: 271.712  on 11  degrees of freedom
## Residual deviance:  36.929  on 10  degrees of freedom
## AIC: 62.334
## Number of Fisher Scoring iterations: 6

We can also obtain basic diagnostic plots, which suggest some issues:

par(mfrow = c(2,2))