Chapter 15 More about Models in R

If we’re going to fit models in R, we need to be able to specify them. In R terminology, a model is specified (in part) by a “formula”. A formula takes the form response ~ predictors, where the ~ can be thought of as “is modeled by,” and predictors is one or more “terms” comprising variable names in various combinations. In basic usage, the formula is the model for the mean structure; in more advanced usage, aspects of variation can also be modeled via the formula.

Multiple predictors can be used in a formula by “adding” them, as in response ~ pred1 + pred2 + pred3. Redundant predictors are ignored, so response ~ pred1 + pred2 + pred1 is the same as response ~ pred1 + pred2. You can also remove a previous term by “subtracting” it. Thus response ~ pred1 + pred2 - pred1 is the same as response ~ pred2.
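The adding and subtracting rules can be checked without fitting anything. The following is a minimal sketch: labels_of is a helper name introduced here for illustration, and response, pred1, and pred2 are placeholder names, not real data. terms() only parses the formula, so no data are needed.

```r
# Hypothetical helper: list the terms R keeps after simplifying a formula.
labels_of <- function(f) attr(terms(f), "term.labels")

# A repeated term is kept only once.
labels_of(response ~ pred1 + pred2 + pred1)

# "Subtracting" a term removes it from the formula.
labels_of(response ~ pred1 + pred2 - pred1)
```

The first call should list pred1 and pred2 once each; the second should list only pred2.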

15.1 Variables in a formula

Variables that go into terms in formulas include the following types:

  • 1 This represents an overall constant to be fitted. It might be useful to think of 1 as a predictor that always takes the value 1; we estimate a multiplier (coefficient) for this predictor and the product gets added to all of the estimated means (fitted values).
    1 often has an interpretation as an intercept or overall mean. If you have other terms in your formula, an implicit 1 term is automatically added as the first term, but you can prevent this by including - 1 in your formula.
  • 0 A 0 in a formula tells R not to add the automatic 1 term. Putting 0 in a formula does the same thing as - 1.
  • x (a quantitative vector) When you have a term that is a quantitative vector, R estimates a coefficient for that vector and adds the product to the fitted values.
  • Xmat (a quantitative matrix) When you have a term that is a quantitative matrix, R estimates a separate coefficient for each column of the matrix. The sum of each column multiplied by its coefficient is added to the fitted values.
  • grp (a factor variable) A factor variable indicates groups rather than quantities. Including a factor variable in a formula allows R to add a different coefficient for each of the groups indicated by the factor. In the simplest usage, this tells R to fit a different mean value for each group.
    Internally, R converts the factor variable to a matrix predictor, with the multiple columns allowing the estimation of different means in different groups. One complication of factor variables is that there are many ways to represent a factor as a matrix predictor. You need to know which representation you are using in order to understand the coefficients. A second complication is that the representation can also depend on what terms precede a given factor predictor in the model. Thus 0+grp and 1+grp lead to different representations of grp.
    For factors, there is arbitrariness in the way the factor is represented, but the ultimate fitted/predicted values are the same regardless of which representation you use.
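To see how a factor is turned into a matrix predictor, it can help to look at the design matrix directly. The sketch below uses a small made-up data frame (the values of y and grp are invented for illustration) and assumes R's default treatment contrasts. It compares the 1+grp and 0+grp representations and checks that the fitted values agree.

```r
# Made-up data for illustration only.
d <- data.frame(
  y   = c(1.2, 1.9, 3.1, 3.8, 5.2, 6.1),
  grp = factor(c("a", "a", "b", "b", "c", "c"))
)

# With the intercept: under the default contrasts, a column of 1s plus
# indicator columns for groups "b" and "c".
X1 <- model.matrix(~ 1 + grp, data = d)
colnames(X1)

# Without the intercept: one indicator column per group.
X0 <- model.matrix(~ 0 + grp, data = d)
colnames(X0)

fit1 <- lm(y ~ 1 + grp, data = d)
fit0 <- lm(y ~ 0 + grp, data = d)

coef(fit1)  # first group's mean, then differences from it (default contrasts)
coef(fit0)  # the three group means

# Different coefficients, same fitted values.
all.equal(unname(fitted(fit1)), unname(fitted(fit0)))
```

The coefficients differ between the two fits because the columns differ, but both models span the same three group means.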

You can use most functions within formulas. For example, you could model the log of y as a constant plus a multiple of the sine of x by using the formula log(y) ~ sin(x). One important exception is that arithmetic operators such as + and ^ have special meanings on the predictor side of a formula, so they do not perform ordinary arithmetic there. In particular, y ~ x^2 will not yield \(x^2\) as a predictor. To use arithmetic of this kind, you must protect your term by wrapping it in I(). Thus, y ~ I(x^2) will create \(x^2\) and use it as the predictor, and you could create a polynomial regression of order 3 via y ~ x + I(x^2) + I(x^3).
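The difference between x^2 and I(x^2) shows up in the column names of the design matrix. This is a small sketch with invented data; the data frame d and its values are assumptions made for the example.

```r
# Made-up data for illustration only.
d <- data.frame(
  x = 1:6,
  y = c(1.1, 3.8, 9.3, 15.7, 25.4, 35.6)
)

# Unprotected: ^ is treated as formula notation, so the predictor is just x.
cols_plain <- colnames(model.matrix(y ~ x^2, data = d))
cols_plain

# Protected with I(): the squared values are computed and used as the predictor.
X_sq <- model.matrix(y ~ I(x^2), data = d)
colnames(X_sq)
X_sq[, "I(x^2)"]
```

The unprotected formula should give columns for the intercept and x only, while the protected one should give a column holding the squares of x.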

There is a function poly(x,ord) that can be used in a model to create a matrix predictor whose columns are orthogonal polynomials of x up to order ord. An orthogonal polynomial of order 3 will make the same predictions as an ordinary polynomial of order 3, but the orthogonal polynomials are more stable in computations and provide uncorrelated coefficient estimates (unlike ordinary polynomials, which can have highly correlated coefficient estimates). On the downside, the coefficients of orthogonal polynomials are not as easy to interpret.
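The claims about poly() can be checked numerically. The sketch below fits a cubic both ways to invented data (the values in d are made up for illustration) and compares the fitted values and the correlations among the coefficient estimates.

```r
# Made-up data for illustration only.
d <- data.frame(
  x = 1:10,
  y = c(1.0, 3.2, 2.9, 5.1, 4.8, 7.9, 7.1, 10.4, 9.6, 13.0)
)

fit_raw  <- lm(y ~ x + I(x^2) + I(x^3), data = d)  # ordinary polynomial
fit_orth <- lm(y ~ poly(x, 3), data = d)           # orthogonal polynomial

# Both parameterizations give the same fitted values.
all.equal(unname(fitted(fit_raw)), unname(fitted(fit_orth)))

# Correlations among the coefficient estimates under each parameterization.
raw_cor  <- cov2cor(vcov(fit_raw))
orth_cor <- cov2cor(vcov(fit_orth))
round(raw_cor, 3)
round(orth_cor, 3)
```

The off-diagonal entries of orth_cor should be zero up to rounding error, while those of raw_cor are typically far from zero.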

15.2 Interactions of predictors

Predictors can be combined by forming their interaction. This is represented using a colon, so the interaction of pred1 and pred2 is pred1:pred2. The interaction of two quantitative predictors is another quantitative predictor that is just the element-wise product of the two predictors. The interaction of a matrix predictor and a quantitative predictor is a new matrix predictor, with each column being the element-wise product of one column of the matrix predictor and the quantitative predictor. The interaction of two matrix predictors is another matrix predictor with one column for each combination of a column from the first predictor and a column from the second predictor, each formed as the element-wise product of that pair of columns.
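The element-wise product description can be verified from the design matrix. In this sketch the variables u, v, and the two-column matrix M are invented for illustration.

```r
# Made-up quantitative predictors for illustration only.
d <- data.frame(u = c(1, 2, 3, 4), v = c(0.5, -1, 2, 3))

# Interaction of two quantitative predictors.
X <- model.matrix(~ u + v + u:v, data = d)
all.equal(unname(X[, "u:v"]), d$u * d$v)

# Interaction of a two-column matrix predictor with a quantitative predictor.
d$M <- cbind(m1 = c(1, 0, 2, 1), m2 = c(3, 1, 0, 2))
XM <- model.matrix(~ 0 + M:v, data = d)
ncol(XM)  # one column per column of M
all.equal(unname(XM[, 1]), d$M[, "m1"] * d$v)
all.equal(unname(XM[, 2]), d$M[, "m2"] * d$v)
```

Each comparison should report that the interaction column matches the product computed by hand.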

The interaction of two or more grouping factors will allow the model to fit a separate mean for each combination of those factors. The interaction of a grouping factor and a quantitative predictor z will allow the model to fit a different coefficient (slope) for z for every level of the grouping factor. As with a single factor, there are many possibilities for how the factors, their interactions, and the slopes are represented, but the different representations do not change what the ultimate fitted values will be.
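As an illustration of the factor-by-quantitative case, the sketch below fits a separate intercept and slope for each group in one model and compares the result with a fit to one group alone. The data are invented, and the particular formula 0 + grp + grp:z is just one of the many possible representations.

```r
# Made-up data for illustration only.
d <- data.frame(
  grp = factor(rep(c("a", "b"), each = 4)),
  z   = rep(1:4, times = 2),
  y   = c(1.1, 2.0, 2.9, 4.2, 0.8, 3.1, 4.9, 7.2)
)

# One model with a separate intercept and a separate slope for z in each group.
fit <- lm(y ~ 0 + grp + grp:z, data = d)
coef(fit)

# The same line for group "a", fitted to that group's data alone.
fit_a <- lm(y ~ z, data = subset(d, grp == "a"))
coef(fit_a)

# A different representation of the same model gives the same fitted values.
fit_alt <- lm(y ~ grp * z, data = d)
all.equal(unname(fitted(fit)), unname(fitted(fit_alt)))
```

The intercept and slope for group "a" in the combined model should match those from the separate fit.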

Be aware: the interaction of any predictor with itself is just itself. Thus, a:a is just a.

15.3 Shortcuts

R has some shortcuts that make building large, complex models easier to type.

  • Star notation When you combine predictors with a *, you get predictors and all of their interactions. Thus a*b*c is equivalent to a + b + c + a:b + a:c + b:c + a:b:c.
  • Parentheses Interaction with a group of terms in parentheses distributes the interaction across all of the terms in the parentheses. Thus (a+b+c):d is the same as a:d + b:d + c:d.
  • Exponential notation When a group of terms is “raised to a power”, the model includes the terms and all of their interactions up to the power. Thus (a+b+c+d)^2 is equivalent to a + b + c + d + a:b + a:c + a:d + b:c + b:d + c:d. This kind of notation explains why you cannot directly raise quantitative predictors to a power within the formula; a^4 becomes a:a:a:a, which just reduces to a.
  • Slash (nesting) notation The form a/b expands to a + a:b. Similarly, a/b/c expands to a + a:b + a:b:c.
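The expansions above can be checked by asking R for the terms of each formula. This is a minimal sketch: labels_of and same_terms are helper names introduced here for illustration, and y, a, b, c, and d are placeholder names, so no data are needed.

```r
# Hypothetical helpers: list a formula's terms, and compare two formulas' terms.
labels_of  <- function(f) attr(terms(f), "term.labels")
same_terms <- function(f1, f2) setequal(labels_of(f1), labels_of(f2))

# Star notation
same_terms(y ~ a*b*c, y ~ a + b + c + a:b + a:c + b:c + a:b:c)

# Parentheses
same_terms(y ~ (a + b + c):d, y ~ a:d + b:d + c:d)

# Exponential notation
same_terms(y ~ (a + b + c + d)^2,
           y ~ a + b + c + d + a:b + a:c + a:d + b:c + b:d + c:d)

# Slash (nesting) notation
same_terms(y ~ a/b/c, y ~ a + a:b + a:b:c)

# A predictor interacted with itself, or raised to a power, is just itself.
labels_of(y ~ a:a)
labels_of(y ~ a^4)
```

Each same_terms() call should return TRUE, and the last two calls should each list the single term a.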