Chapter 15 More about Models in R
If we’re going to fit models in R, we need to be able to specify
them. In R terminology, a model is specified (in part) by a “formula”.
A formula takes the form response ~ predictors, where the ~ can be
thought of as “is modeled by,” and predictors is one or more “terms” comprising
variable names in various combinations.
In basic usage, the formula is the model for the mean structure;
in more advanced usage, aspects of variation can also be modeled via the formula.
Multiple predictors can be used in a formula by “adding” them, as in
response ~ pred1 + pred2 + pred3. Redundant predictors are ignored, so
response ~ pred1 + pred2 + pred1 is the same as response ~ pred1 + pred2.
You can also remove a previous term by “subtracting” it. Thus
response ~ pred1 + pred2 - pred1 is the same as response ~ pred2.
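One way to see how R simplifies a formula is to ask for the term labels it keeps. The sketch below uses terms() from base R; the helper name term_labels is made up here for illustration, and the variable names are the placeholders from the text, so no data are needed.

```r
# Hypothetical helper: list the terms R retains after simplifying a formula
term_labels <- function(f) attr(terms(f), "term.labels")

term_labels(response ~ pred1 + pred2 + pred3)
term_labels(response ~ pred1 + pred2 + pred1)  # the repeated pred1 is dropped
term_labels(response ~ pred1 + pred2 - pred1)  # pred1 is removed, pred2 remains
```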
15.1 Variables in a formula
Variables that go into terms in formulas include the following types:
1
This represents an overall constant to be fitted. It might be useful to think of 1 as a predictor that always takes the value 1; we estimate a multiplier (coefficient) for this predictor and the product gets added to all of the estimated means (fitted values). 1 often has an interpretation as an intercept or overall mean. If you have other terms in your formula, an implicit 1 term is automatically added as the first term, but you can prevent this by including - 1 in your formula.
0
A 0 in a formula tells R not to add the automatic 1 term. Putting 0 in a model does the same thing as - 1.
x (a quantitative vector)
When you have a term that is a quantitative vector, R estimates a coefficient for that vector and adds the product to the fitted values.
Xmat (a quantitative matrix)
When you have a term that is a quantitative matrix, R estimates a separate coefficient for each column of the matrix. The sum of each column multiplied by its coefficient is added to the fitted values.
grp (a factor variable)
A factor variable indicates groups rather than quantities. Including a factor variable in a formula allows R to add a different coefficient for each of the groups indicated by the factor. In the simplest usage, this tells R to fit a different mean value for each group.
Internally, R converts the factor variable to a matrix predictor, with the multiple columns allowing the estimation of different means in different groups. One complication of factor variables is that there are many ways to represent a factor as a matrix predictor. You need to know which representation you are using in order to understand the coefficients. A second complication is that the representation can also depend on what terms precede a given factor predictor in the model. Thus 0 + grp and 1 + grp lead to different representations of grp.
For factors, there is arbitrariness in the way the factor is represented, but the ultimate fitted/predicted values are the same regardless of which representation you use.
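As a minimal sketch of this point, the code below fits the same one-factor model with and without the intercept term. The data frame d and its values are made up for illustration, and the column layout shown assumes R's default treatment contrasts for unordered factors.

```r
# Toy data (invented for illustration): three groups, two observations each
d <- data.frame(
  y   = c(1, 2, 4, 6, 9, 11),
  grp = factor(c("a", "a", "b", "b", "c", "c"))
)

# With the intercept: an intercept column plus indicators for the later groups
colnames(model.matrix(~ 1 + grp, data = d))

# Without the intercept: one indicator column per group
colnames(model.matrix(~ 0 + grp, data = d))

fit1 <- lm(y ~ 1 + grp, data = d)
fit0 <- lm(y ~ 0 + grp, data = d)

coef(fit1)  # first group's mean, then differences from the first group
coef(fit0)  # one mean per group

# The coefficients differ, but the fitted values agree
all.equal(unname(fitted(fit1)), unname(fitted(fit0)))
```

Reading the two sets of coefficients side by side shows why you need to know the representation before interpreting a coefficient.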
You can use most functions within formulas. For example, you could predict
the log of y by a multiple of the sine of x by using the formula
log(y) ~ sin(x). One important exception is that you cannot use exponentiation
via ^ on the predictor side of the formula. For example, y ~ x^2
will not yield \(x^2\) as a predictor.
To use these more fragile manipulations, you must protect your term by wrapping
it in I().
Thus, y ~ I(x^2) will create \(x^2\) and use it as the predictor, and you could
create a polynomial regression of order 3 via y ~ x + I(x^2) + I(x^3).
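A quick way to check what a formula actually produces is to look at the column names of its model matrix. The vectors x and y below are made up for illustration.

```r
# Toy data (invented for illustration)
x <- 1:10
y <- 2 + 0.5 * x^2

# Unprotected ^ is treated as formula notation, so only x itself appears
cols_plain <- colnames(model.matrix(y ~ x^2))
cols_plain

# Wrapping in I() makes R evaluate x^2 arithmetically
cols_protected <- colnames(model.matrix(y ~ I(x^2)))
cols_protected

# An ordinary cubic polynomial
cols_cubic <- colnames(model.matrix(y ~ x + I(x^2) + I(x^3)))
cols_cubic
```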
There is a function poly(x, ord) that can be used in a model to create a matrix
predictor whose columns are orthogonal polynomials of x up to order ord.
An orthogonal polynomial
of order 3 will make the same predictions as an ordinary polynomial of order 3, but
the orthogonal polynomials are more stable in computations and provide
uncorrelated coefficient estimates (unlike ordinary polynomials, which can have highly correlated
coefficient estimates). On the down side, they are not as easy to understand.
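The comparison can be sketched with simulated data. The data-generating values below are arbitrary choices for illustration; cov2cor() and vcov() are from base R's stats package.

```r
# Simulated data (arbitrary values, for illustration only)
set.seed(42)
x <- seq(1, 10, length.out = 20)
y <- 1 + 2 * x - 0.3 * x^2 + 0.02 * x^3 + rnorm(20, sd = 0.5)

fit_raw  <- lm(y ~ x + I(x^2) + I(x^3))  # ordinary cubic polynomial
fit_orth <- lm(y ~ poly(x, 3))           # orthogonal cubic polynomial

# Same fitted values from both parameterizations
all.equal(unname(fitted(fit_raw)), unname(fitted(fit_orth)))

# Correlations among the coefficient estimates
cor_raw  <- cov2cor(vcov(fit_raw))
cor_orth <- cov2cor(vcov(fit_orth))
round(cor_raw, 2)   # off-diagonal entries are typically far from zero
round(cor_orth, 2)  # off-diagonal entries are zero up to rounding
```

If you do want raw powers from poly(), it accepts raw = TRUE, which gives up the orthogonality in exchange for coefficients on the familiar scale.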
15.2 Interactions of predictors
Predictors can be combined by forming their interaction. This is represented using a colon,
so the interaction of pred1 and pred2 is pred1:pred2. The interaction of two
quantitative predictors is another quantitative predictor that is just the element-wise product
of the two predictors. The interaction of a matrix predictor and a quantitative predictor
is a new matrix predictor, with each column being the element-wise product of one column of the
matrix predictor and the quantitative predictor. The interaction of two matrix
predictors is another matrix predictor with one column, formed as an element-wise product,
for each combination of a column
from the first predictor and a column from the second predictor.
The interaction of two or more grouping factors will allow the model to fit a separate
mean for each combination of those factors. The interaction of a grouping factor and
a quantitative predictor z will allow the model to fit a different coefficient (slope)
for z for every level of the grouping factor.
As with a single factor, there are many possibilities for how the factors, their
interactions, and the slopes are represented, but the different representations do not
change what the ultimate fitted values will be.
Be aware: the interaction of any predictor with itself is just itself. Thus, a:a
is just a.
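These rules can be checked by inspecting model matrices directly. The data frame d and its column names below are made up for illustration; the intercept is suppressed with 0 so that only the interaction columns appear.

```r
# Toy data (invented for illustration)
d <- data.frame(
  x1 = c(1, 2, 3, 4),
  x2 = c(2, 0, 1, 5),
  g  = factor(c("a", "a", "b", "b")),
  h  = factor(c("u", "v", "u", "v"))
)

# Two quantitative predictors: the interaction is their element-wise product
mm_xx <- model.matrix(~ 0 + x1:x2, data = d)
all.equal(unname(mm_xx[, "x1:x2"]), d$x1 * d$x2)

# Factor with a quantitative predictor: one slope column per level of g
mm_gx <- model.matrix(~ 0 + g:x1, data = d)
colnames(mm_gx)

# Two factors: one column per combination of levels
mm_gh <- model.matrix(~ 0 + g:h, data = d)
colnames(mm_gh)

# A predictor interacted with itself collapses to the predictor
self_term <- attr(terms(~ x1:x1), "term.labels")
self_term
```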
15.3 Shortcuts
R has some shortcuts that make building large, complex models easier to type.
- Star notation: When you combine predictors with a *, you get predictors and all of their interactions. Thus a*b*c is equivalent to a + b + c + a:b + a:c + b:c + a:b:c.
- Parentheses: Interaction with a group of terms in parentheses distributes the interaction across all of the terms in the parentheses. Thus (a+b+c):d is the same as a:d + b:d + c:d.
- Exponential notation: When a group of terms is “raised to a power”, the model includes the terms and all of their interactions up to the power. Thus (a+b+c+d)^2 is equivalent to a + b + c + d + a:b + a:c + a:d + b:c + b:d + c:d. This kind of notation explains why you cannot directly raise quantitative predictors to a power within the formula; a^4 becomes a:a:a:a which just reduces to a.
- Slash (nesting) notation: The form a/b expands to a + a:b. Similarly, a/b/c expands to a + a:b + a:b:c.
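Each shortcut can be expanded without fitting anything by asking terms() for the term labels. The helper name term_labels is made up here for illustration, and a, b, c, d, and y are just placeholder names, so no data are needed.

```r
# Hypothetical helper: list the terms a formula expands to
term_labels <- function(f) attr(terms(f), "term.labels")

star    <- term_labels(y ~ a * b * c)          # main effects and all interactions
parens  <- term_labels(y ~ (a + b + c):d)      # interaction distributed over the group
squared <- term_labels(y ~ (a + b + c + d)^2)  # main effects and two-way interactions
power   <- term_labels(y ~ a^4)                # reduces to a single term
nested  <- term_labels(y ~ a / b / c)          # nesting expansion

star
parens
squared
power
nested
```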