Chapter 5 Likelihood and Predictive Procedures
The likelihood for a set of data, a model, and a set of parameters for the model is simply the probability (or probability density) for those data in that model with those parameters. For a variety of reasons, we generally work with the (natural) log of the likelihood rather than the likelihood itself. We can use log likelihood to compare models and to create confidence regions for parameters.
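In symbols: for independent observations \(y_1, \ldots, y_n\) with density or probability function \(f(y; \theta)\), the likelihood is the product of the individual contributions, so the log likelihood is the sum
\[
\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(y_i; \theta).
\]
(This display is just to fix notation; everything below works with \(\ell\) through R rather than by hand.)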
The R function logLik() (don’t blame me, I didn’t name it) is the basic tool for computing the log likelihood of a model fitted to data. We will use it explicitly and implicitly when we do likelihood procedures on data.
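As a minimal sketch of what logLik() returns, here is a made-up example (the data and names here are hypothetical, not part of the RunStitch example below):
set.seed(1)
y <- rnorm(10)     # ten made-up observations
fit <- lm(y ~ 1)   # model with a single mean
logLik(fit)        # log likelihood; df counts the parameters (mean and variance)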
5.1 Models
We have seen “formulas” of the form response ~ predictor in a couple of use cases already. The most common models are linear models, where the responses are assumed to be independent and to follow a normal (Gaussian) distribution with constant variance and with means determined by the predictors in the formula. The predictors are most commonly either grouping variables (factors), in which case the means are the means of the different groups, or continuous numeric variables, in which case the mean is determined as coefficients times the numeric values of the predictors (aka linear regression).
By default, formulas will include an intercept, or constant value. This is indicated in the formula by a 1 (a coefficient times the constant value 1 just yields a constant contribution to the mean). A model of the form response ~ 1 just fits the single mean. If predictor is a single numeric variable, then a formula of the form response ~ predictor or response ~ 1 + predictor produces the same simple linear regression, because the 1 + part is added automatically. On the other hand, a formula of the form response ~ predictor - 1 or response ~ 0 + predictor leaves the intercept out of the model. The formula response ~ 0 just fits the mean as zero, period. If you want to fit a model with a different mean, say 5, you could subtract 5 from the data and fit a mean of 0, as in response - 5 ~ 0.
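Here is a quick sketch of those formula variants on made-up vectors y and x (hypothetical data, just to show the syntax):
set.seed(2)
x <- 1:20
y <- 3 + 0.5*x + rnorm(20)  # made-up response
lm(y ~ 1)        # single mean only
lm(y ~ x)        # simple linear regression; same as y ~ 1 + x
lm(y ~ 0 + x)    # no intercept; same as y ~ x - 1
lm(y - 5 ~ 0)    # fixes the mean of y at 5 (mean 0 after subtracting 5)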
We will demonstrate using the data from the RunStitch example in cfcdae, which we have used before. Each worker run stitches collars using two different setups: the conventional setup and an ergonomic setup. The two runs are made in random order for each worker, and the interest is in any difference in average speed between the two setups.
Load the RunStitch data from the package and create the differences between Standard and Ergonomic.
library(cfcdae)   # package containing the RunStitch data
data(RunStitch)
differences <- RunStitch[,"Standard"] - RunStitch[,"Ergonomic"]
head(differences)
[1] 1.03 -0.04 0.26 0.30 -0.97 0.04
5.2 Likelihood and Likelihood Ratio Test
The likelihood ratio test (LRT) compares a statistic to a chisquare distribution. The statistic is twice the difference in log likelihoods between the model at the null value of the parameters and the model with the parameters set to the values that maximize the likelihood. The chisquare distribution has degrees of freedom equal to the difference in number of parameters between the two models (that is, how many parameters does the null hypothesis restrict).
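In symbols, if \(\hat\theta\) maximizes the likelihood and \(\theta_0\) is the null value that restricts \(d\) parameters, the statistic is
\[
\mathrm{LRT} = 2\{\ell(\hat\theta) - \ell(\theta_0)\},
\]
which is compared to a \(\chi^2_d\) distribution.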
The p-values from this test are approximate, with the approximation getting better as the sample size increases. In modest-sized samples and simple models such as this one, you would typically use a t-test instead.
Here are the three models: the null with mean zero, the null with mean .5, and the full model with a freely fitted mean.
null0model <- lm(differences ~ 0)
nullp5model <- lm(differences - .5 ~ 0)
fullmodel <- lm(differences ~ 1)
The log likelihoods are:
null0Like <- logLik(null0model)
nullp5Like <- logLik(nullp5model)
fullLike <- logLik(fullmodel)
null0Like; nullp5Like; fullLike
'log Lik.' -29.98793 (df=1)
'log Lik.' -32.37791 (df=1)
'log Lik.' -28.88137 (df=2)
and twice their differences from the full model are:
LRT0 <- 2*(fullLike - null0Like)
LRT0
'log Lik.' 2.213124 (df=2)
LRTp5 <- 2*(fullLike - nullp5Like)
LRTp5
'log Lik.' 6.993092 (df=2)
The models differ by one parameter (set the mean to zero or let it be fit), so we should compare this to a chisquare with one degree of freedom, using the upper tail area to get the p-value.
pchisq(LRT0,1,lower.tail=FALSE)
'log Lik.' 0.1368413 (df=2)
pchisq(LRTp5,1,lower.tail=FALSE)
'log Lik.' 0.008182487 (df=2)
The p-value is about .14 for the null of 0, so there is no evidence against the null that the mean difference is 0. However, the p-value for the null of .5 is less than .01, so there is some evidence against that null.
5.3 Likelihood Confidence Interval
The values of a parameter in a likelihood confidence interval with coverage 1-\(\alpha\) are those null values that would not be rejected by the LRT at level \(\alpha\). This process is not automated in R for this simple model, so we need to do a little work.
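In symbols, the coverage 1-\(\alpha\) interval is the set of null values
\[
\left\{\theta_0 : \ell(\theta_0) \ge \ell(\hat\theta) - \tfrac{1}{2}\,\chi^2_{1,1-\alpha}\right\};
\]
the factor of one half appears because the LRT uses twice the log likelihood.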
Create a function that computes the log likelihood for a given null value:
myLogLike <- function(null) {
  logLik(lm(differences - null ~ 0))
}
Next we create a dense grid of potential null mean values and evaluate the log likelihood at each potential null. The grid used here is probably more dense than necessary, and the denser the grid, the longer it takes to do the work.
x <- seq(-.2, .6, .001)  # steps of .001
loglike <- x             # allocate a vector of the same length as x
for(i in 1:length(x)) {
  loglike[i] <- myLogLike(x[i])
}
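The same computation can be written in one line with sapply(); this is a stylistic alternative, not something the analysis requires:
loglike <- sapply(x, myLogLike)   # same values as the loop above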
Those null values that do not get rejected by the LRT are in the interval. A null value does not get rejected if its log likelihood is bigger than a cutoff. The cutoff is the maximized log likelihood minus one half of the 99th percentile of the chisquare distribution with one degree of freedom (for 99% coverage). Note that the LRT, which is compared to the chisquare, uses twice the log likelihood, so we use .5 times the chisquare percentile when working with the log likelihood directly. This gives us the interval, which is -.139 to .490.
cutoff <- logLik(fullmodel) - qchisq(.99, 1)/2
range(x[loglike > cutoff])
[1] -0.139 0.490
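The grid only locates the endpoints to within .001. If you want them more precisely, one option (not used in the text) is to solve for where the log likelihood crosses the cutoff using uniroot(), bracketing each endpoint on either side of the sample mean:
f <- function(m) as.numeric(myLogLike(m) - cutoff)  # zero at the interval endpoints
lower <- uniroot(f, c(-.2, mean(differences)))$root
upper <- uniroot(f, c(mean(differences), .6))$root
c(lower, upper)   # should agree with -.139 and .490 to grid accuracy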
It’s also nice to see a plot of the log likelihood with a horizontal line at the cutoff.
plot(x,loglike,xlab="Mean",ylab="log likelihood",type='l')
abline(h=cutoff)
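If you also want the interval marked on the plot, dashed vertical lines at the endpoints are a simple optional addition:
abline(v=range(x[loglike>cutoff]), lty=2)   # dashed lines at the interval endpoints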
5.4 AIC, AICc, and BIC
Information criteria such as AIC and BIC attempt to select a model that will do a good job of predicting future data. The idea is that you compute AIC or BIC for a selection of potential models for your data and choose the one that has the lowest IC.
Both criteria take the form of minus 2 times the log likelihood plus a multiple of the number of parameters in the model. AIC uses 2 times the number of parameters, so it does not penalize additional parameters very much. BIC uses log of the sample size times the number of parameters, so it can penalize additional parameters quite a bit. AICc is a small sample correction for AIC.
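With \(\ell\) the maximized log likelihood, \(p\) the number of parameters, and \(n\) the sample size, the standard forms are
\[
\mathrm{AIC} = -2\ell + 2p, \qquad
\mathrm{BIC} = -2\ell + p \log n, \qquad
\mathrm{AICc} = \mathrm{AIC} + \frac{2p(p+1)}{n-p-1}.
\]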
If the true model is among the models you are comparing, BIC will eventually find it in large sample sizes. However, BIC may not work well when all the models are approximate. Conversely, AIC can pick out a reasonable model even when all of the models are approximate, but it will not home in on the true model if the true model is among the selection group, even in large sample sizes.
Let’s try these on the differences data. BIC prefers the mean-zero null model.
BIC(null0model)
[1] 63.37706
BIC(nullp5model)
[1] 68.15703
BIC(fullmodel)
[1] 64.56513
AICc prefers the mean-zero null model (barely). Note that AICc() is not in base R; it comes from an add-on package (for example, MuMIn provides one).
AICc(null0model)
[1] 62.11872
AICc(nullp5model)
[1] 66.89868
AICc(fullmodel)
[1] 62.20718
And AIC slightly prefers the full model.
AIC(null0model)
[1] 61.97586
AIC(fullmodel)
[1] 61.76274
AIC(nullp5model)
[1] 66.75583
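As a check on the definitions above, the AIC for the full model can be reproduced by hand from its log likelihood and its two parameters (the mean and the error variance):
-2*as.numeric(fullLike) + 2*2
[1] 61.76274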
The model assuming a mean of .5 is never really in the running.