Chapter 5 Likelihood and Predictive Procedures

The likelihood for a set of data, a model, and a set of parameters for the model is simply the probability (or probability density) for those data in that model with those parameters. For a variety of reasons, we generally work with the (natural) log of the likelihood rather than the likelihood itself. We can use log likelihood to compare models and to create confidence regions for parameters.

The R function logLik() (don’t blame me, I didn’t name it) is the basic tool for computing the log likelihood of a model fitted to data. We will use it explicitly and implicitly when we do likelihood procedures on data.
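To see concretely what logLik() returns, here is a minimal sketch using a small made-up vector y (hypothetical data, not from the text): for a normal model fit by lm(), the log likelihood is the sum of normal log densities evaluated at the maximum likelihood estimates. Note that the ML estimate of the standard deviation divides by n, not n-1.

y <- c(1.2, -0.3, 0.8, 0.1, -0.5)      # hypothetical data for illustration
muhat <- mean(y)                       # ML estimate of the mean
sigmahat <- sqrt(mean((y - muhat)^2))  # ML estimate of sigma (divide by n, not n-1)
sum(dnorm(y, muhat, sigmahat, log = TRUE))  # matches logLik(lm(y ~ 1))
logLik(lm(y ~ 1))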

5.1 Models

We have seen “formulas” of the form response ~ predictor in a couple of use cases already. The most common models are linear models, where the responses are assumed to be independent and follow a normal (Gaussian) distribution with constant variance and means determined by the predictors in the formula. The predictors are most commonly grouping variables (factors), in which case the model fits separate means in the different groups, or continuous numeric variables, in which case the mean is a linear combination of coefficients times the numeric values of the predictors (aka linear regression).

By default, formulas will include an intercept, or constant value. This is indicated in the formula by a 1 (a coefficient times the constant value 1 just yields a constant coefficient contribution to the mean). A model of the form response ~ 1 just fits the single mean. If predictor is a single numeric variable, then a formula of the form response ~ predictor or response ~ 1 + predictor produces the same simple linear regression, because the 1+ part is added automatically. On the other hand, a formula of the form response ~ predictor - 1 or response ~ 0 + predictor leaves the intercept out of the model.

The formula response ~ 0 just fits the mean as zero, period. If you want to fit a model with a different mean, say 5, you could subtract 5 from the data and fit a mean of 0, as in response - 5 ~ 0.
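Here is a small sketch of these formula variants, using hypothetical vectors y and x (the names are made up for illustration):

y <- rnorm(10); x <- rnorm(10)  # hypothetical data
coef(lm(y ~ x))      # intercept added automatically
coef(lm(y ~ 1 + x))  # same model, with the intercept written explicitly
coef(lm(y ~ 0 + x))  # slope only, no intercept
coef(lm(y - 5 ~ 0))  # no parameters at all; the mean is fixed at 5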

We will demonstrate using the data from the RunStitch example in cfcdae, which we have used before. Each worker run-stitches collars using two different setups: the standard setup and an ergonomic setup. The two runs are made in random order for each worker, and the interest is in any difference in average speed between the two setups.

Load the RunStitch data from the package and create differences between Standard and Ergonomic.

library(cfcdae)
data(RunStitch)
differences <- RunStitch[, "Standard"] - RunStitch[, "Ergonomic"]
head(differences)
[1]  1.03 -0.04  0.26  0.30 -0.97  0.04

5.2 Likelihood and Likelihood Ratio Test

The likelihood ratio test (LRT) compares a statistic to a chisquare distribution. The statistic is twice the difference in log likelihoods between the model with the parameters set to the values that maximize the likelihood and the model with the parameters fixed at their null values. The chisquare distribution has degrees of freedom equal to the difference in the number of parameters between the two models (that is, the number of parameters the null hypothesis restricts).
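In symbols, writing \(\ell\) for the log likelihood, \(\hat\theta\) for the maximum likelihood estimate, and \(\theta_0\) for the null value, the statistic is
\[
\mathrm{LRT} = 2\left[\ell(\hat\theta) - \ell(\theta_0)\right],
\]
which is compared to a chisquare with degrees of freedom equal to the number of parameters restricted under the null.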

The p-values from this test are approximate, with the approximation getting better as the sample size increases. In modest-sized samples and simple models such as this one, you would typically use a t-test.

Here are the two models:

null0model <- lm(differences ~ 0)        # null: mean difference fixed at 0
nullp5model <- lm(differences - .5 ~ 0)  # null: mean difference fixed at .5
fullmodel <- lm(differences ~ 1)         # full: mean difference estimated from the data

The log likelihoods are:

null0Like <- logLik(null0model)
nullp5Like <- logLik(nullp5model)
fullLike <- logLik(fullmodel)
null0Like;nullp5Like;fullLike
'log Lik.' -29.98793 (df=1)
'log Lik.' -32.37791 (df=1)
'log Lik.' -28.88137 (df=2)

Twice the differences between the full-model and null-model log likelihoods are:

LRT0 <- 2*(fullLike-null0Like)
LRT0
'log Lik.' 2.213124 (df=2)
LRTp5 <- 2*(fullLike-nullp5Like)
LRTp5
'log Lik.' 6.993092 (df=2)
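The 'log Lik.' tag and the (df=2) in the printed results are just attributes of the logLik class that get carried along by the arithmetic; they do not affect the values. If you want a plain number, strip them with as.numeric():

as.numeric(LRT0)  # drops the logLik attributes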

The models differ by one parameter (the null fixes the mean, while the full model fits it), so we should compare these statistics to a chisquare with one degree of freedom, using the upper tail area to get the p-value.

pchisq(LRT0,1,lower.tail=FALSE)
'log Lik.' 0.1368413 (df=2)
pchisq(LRTp5,1,lower.tail=FALSE)
'log Lik.' 0.008182487 (df=2)

The p-value is about .14 for the null of 0, so there is no evidence against the null that the mean difference is 0. However, the p-value for the null of .5 is less than .01, so there is some evidence against that null.

5.3 Likelihood Confidence Interval

A likelihood confidence interval with coverage \(1-\alpha\) contains exactly those null values of the parameter that would not be rejected by the LRT at level \(\alpha\). This process is not automated in R for this simple model, so we need to do a little work.

Create a function that computes the log likelihood for a given null value:

myLogLike <- function(null) {
  # log likelihood of the model with the mean fixed at the value `null`
  logLik(lm(differences - null ~ 0))
}

Next we create a dense grid of potential null mean values and evaluate the log likelihood at each potential null. The grid used here is probably more dense than necessary, and the denser the grid, the longer it takes to do the work.

x <- seq(-.2,.6,.001) # steps of .001
loglike <- x
for(i in 1:length(x)) {
  loglike[i] <- myLogLike(x[i])
}
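As an aside, the loop can be written more compactly with sapply(), which produces the same vector:

loglike <- sapply(x, myLogLike)  # equivalent to the loop above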

Those null values that do not get rejected by the LRT are in the interval. A null value is not rejected if its log likelihood is above a cutoff: the maximized log likelihood minus one half of the 99th percentile of the chisquare distribution with one degree of freedom (for 99% coverage). Note that the LRT statistic compared to the chisquare is twice a difference of log likelihoods, so the chisquare percentile is halved when we work on the log likelihood scale. This gives us the interval, which is -.139 to .490.
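In symbols, the set of retained null values is
\[
\left\{ \mu_0 : \ell(\mu_0) \ge \ell(\hat\mu) - \tfrac{1}{2}\,\chi^2_{1,\,0.99} \right\}.
\]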

cutoff <- logLik(fullmodel) - qchisq(.99,1)/2
range(x[loglike>cutoff])
[1] -0.139  0.490

It’s also nice to see a plot of the log likelihood with a horizontal line at the cutoff.

plot(x,loglike,xlab="Mean",ylab="log likelihood",type='l')
abline(h=cutoff)
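If you like, you can also mark the interval endpoints with dashed vertical lines (an optional embellishment, not part of the basic recipe):

abline(v = range(x[loglike > cutoff]), lty = 2)  # dashed lines at the interval ends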

5.4 AIC, AICc, and BIC

Information criteria such as AIC and BIC attempt to select a model that will do a good job of predicting future data. The idea is that you compute AIC or BIC for a selection of potential models for your data and choose the one that has the lowest IC.

Both criteria take the form of minus twice the log likelihood plus a multiple of the number of parameters in the model. AIC uses twice the number of parameters, so it does not penalize additional parameters very much. BIC uses the log of the sample size times the number of parameters, so it can penalize additional parameters quite a bit. AICc is a small-sample correction of AIC.
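With \(\ell\) the maximized log likelihood, \(p\) the number of parameters, and \(n\) the sample size, the standard forms are
\[
\mathrm{AIC} = -2\ell + 2p, \qquad
\mathrm{BIC} = -2\ell + p \log n, \qquad
\mathrm{AICc} = \mathrm{AIC} + \frac{2p(p+1)}{n-p-1}.
\]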

If the true model is among the models you are comparing, BIC will eventually find it in large sample sizes. However, BIC may not work well when all the models are approximate. Conversely, AIC can pick out a reasonable model even when all of the models are approximate, but it will not home in on the true model if the true model is among the selection group, even in large sample sizes.

Let’s try these on the differences data. BIC prefers the null model.

BIC(null0model)
[1] 63.37706
BIC(nullp5model)
[1] 68.15703
BIC(fullmodel)
[1] 64.56513

AICc prefers the null model (barely). Note that AICc() is not part of base R; it comes from an add-on package (for example, the MuMIn and AICcmodavg packages each provide one).

AICc(null0model)
[1] 62.11872
AICc(nullp5model)
[1] 66.89868
AICc(fullmodel)
[1] 62.20718

And AIC slightly prefers the full model.

AIC(null0model)
[1] 61.97586
AIC(fullmodel)
[1] 61.76274
AIC(nullp5model)
[1] 66.75583
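As a check, AIC can be computed directly from the log likelihood and its df attribute:

-2*as.numeric(fullLike) + 2*attr(fullLike, "df")  # 61.76274, matching AIC(fullmodel)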

The model assuming a mean of .5 is never really in the running.