Chapter 19 More about Factors
You can turn a vector into a factor by either
factor(vector)
or as.factor(vector)
. For most
usages, it doesn’t matter which you use. as.factor()
can be slightly faster (but not enough to make a difference in most usage).
factor()
has optional
arguments that let you reorder the levels, change factor labels, exclude some levels,
and so on. If myfactor
is already a factor, as.factor(myfactor)
leaves it
absolutely alone, but factor(myfactor)
could make some subtle changes having to
do with empty levels. But again, in most situations it doesn’t
matter.
If you make something into a factor, R puts the levels in ascending order (numerical, alphabetical, etc).
You can override that default order by using the levels
option in factor()
. Note below that
the two factors print the same, but the levels
attributes are reordered.
> junk <- as.factor(c(22,22,33,33,15,26));junk
[1] 22 22 33 33 15 26
Levels: 15 22 26 33
> junk2 <- factor(c(22,22,33,33,15,26),levels=c(26,33,22,15));junk2
[1] 22 22 33 33 15 26
Levels: 26 33 22 15
You can also turn character strings into a factor, and R will sort the levels alphabetically (unless you specify some other order as we just showed). Note that levels are case sensitive.
> junk3<-factor(c("red","blue","yellow","Red","white","blue"));junk3
[1] red blue yellow Red white blue
Levels: blue red Red white yellow
If you try to make a factor numeric,
what it gives you are the category numbers, not
the original numeric values (assuming
they were numeric). In junk
22 was category 2,
but 22 is category 3 in
junk2
.
> as.numeric(junk)
[1] 2 2 4 4 1 3
> as.numeric(junk2)
[1] 3 3 2 2 4 1
You can’t do arithmetic with factors.
> junk+10
Warning in Ops.factor(junk, 10): '+' not meaningful for factors
[1] NA NA NA NA NA NA
You can convert a factor to numeric and do arithmetic, but you are working with the group indicators, not the original numbers that created the groups (15, 22, etc.).
> as.numeric(junk)+10
[1] 12 12 14 14 11 13
You can retrieve the levels of a factor, but they show up as character data (they’re labels).
> levels(junk)
[1] "15" "22" "26" "33"
Because the levels here are character versions of numbers, we can coerce them to numbers and then do arithmetic with them if we wish.
> as.numeric(levels(junk))
[1] 15 22 26 33
> as.numeric(levels(junk))+10
[1] 25 32 36 43
Of course, that doesn’t work if the levels are not versions of numbers.
> levels(junk3)
[1] "blue" "red" "Red" "white" "yellow"
> as.numeric(levels(junk3))
Warning: NAs introduced by coercion
[1] NA NA NA NA NA
We can retrieve the levels of the factor (which show up as character), and then convert them to numeric. If we then subscript with the numeric version of the factor we can recover the original numeric data. So you can get the original back, but it’s often easier to just have two variables, one a factor and one not a factor.
as.numeric(levels(junk))[as.numeric(junk)]
19.1 Patterned data
If you need to type
in factor levels, you can save a lot of time by
learning to use the rep()
function (repeat function). Here 1 gets repeated 8 times, 2 gets repeated 8 times,
and finally 5 gets repeated six times.
> tb <- factor(rep(1:5,c(8,8,8,7,6)))
> tb
[1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 5
Levels: 1 2 3 4 5
For more regular patterns, you can
use rep()
with the each
and length
parameters.
> rep(1:4,each=3,length=24)
[1] 1 1 1 2 2 2 3 3 3 4 4 4 1 1 1 2 2 2 3 3 3 4 4 4
Here is a way to get counts (replications) in different treatment groups.
> library(cfcdae)
> data(ResinLifetimes)
> replications(~temp,data=ResinLifetimes)
$temp
temp
175 194 213 231 250
8 8 8 7 6
19.2 Splitting and applying
You can split a data vector by a factor, and the result will be a list with named components corresponding to the different levels of the factor. Each element of the list will be a vector of data with the corresponding level of the factor.
> attach(ResinLifetimes)
The following object is masked _by_ .GlobalEnv:
logTime
The following objects are masked from ResinLifetimes (pos = 4):
logTime, temp, temp.z, Time
> split(logTime,temp)
Warning in split.default(logTime, temp): data length is not a multiple of split
variable
$`175`
[1] 3.14159
$`194`
numeric(0)
$`213`
numeric(0)
$`231`
numeric(0)
$`250`
numeric(0)
Splitting
is useful in several ways. First, some functions operate directly
on lists. For example, boxplot()
can take a list argument and
make separate boxplots for each component of the list.
Second, the functions lapply()
and sapply()
allow you to
execute a function for each of the components of a list.
lapply(mylist,function.name)
applies the function function.name()
separately to the data in each component of mylist
, and then
collects the output in a new list.
In this example, we get summary
statistics
for the data in each treatment group.
> lapply(split(logTime,temp), summary)
Warning in split.default(logTime, temp): data length is not a multiple of split
variable
$`175`
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.142 3.142 3.142 3.142 3.142 3.142
$`194`
Min. 1st Qu. Median Mean 3rd Qu. Max.
$`213`
Min. 1st Qu. Median Mean 3rd Qu. Max.
$`231`
Min. 1st Qu. Median Mean 3rd Qu. Max.
$`250`
Min. 1st Qu. Median Mean 3rd Qu. Max.
When the
output of each function application is the same shape, sapply()
will simplify the output into a compact matrix form.
This would not work
if we were applying the sort()
function instead of
summary()
, because the
different groups had different lengths, and so the
sorted groups would have different lengths after sorting and would
not fit into a matrix.
> sapply(split(logTime,temp), summary)
Warning in split.default(logTime, temp): data length is not a multiple of split
variable
175 194 213 231 250
Min. 3.14159 NA NA NA NA
1st Qu. 3.14159 NA NA NA NA
Median 3.14159 NA NA NA NA
Mean 3.14159 NaN NaN NaN NaN
3rd Qu. 3.14159 NA NA NA NA
Max. 3.14159 NA NA NA NA
Finally, if the output of your function is just a scalar value,
tapply(y,a,function.name)
is a shortcut for sapply(split(y,a),function.name)
.
19.3 More on factor representation
Fair warning: this is a more technical/advanced topic, and it can get to be a bit of tough sledding. You might not need all the details.
Factor representation is basically the outcome of how a factor is parameterized. For \(g\) groups, there are \(g\) group means \(\mu_i\). It is also possible to write the means as \(\mu_i = \mu + \alpha_i\). The parameter \(\mu\) is some kind of reference value, and the treatment effects \(\alpha_i\) describe how group \(i\) differs from the reference value. Each different way of defining the reference value yields a different set of treatment effects. Similarly, each different way of defining the reference value leads to a different way of representing the factor as a matrix predictor. Because we are using \(g+1\) parameters, and there are only \(g\) groups to fit, there is redundancy. Once we have determined what we mean by the reference value \(\mu\), only \(g-1\) of the treatment effects can vary freely; the last one must be constrained to make the means line up. This means that if we include the overall mean \(\mu\) in the formula, we only need \(g-1\) columns in the matrix predictor representation of the fact. Mathematically, all these different representations (parameterizations or ways of defining the reference value) are interchangeable. They all lead to the same fitted values.
We will explain the basic issues using a data set of size 6. FactorA
has three levels that appear as (1,1,2,2,3,3) in the data. Factor B
has two levels that appear as (1,2,1,2,1,2) in the data. Covariate
z
is numeric with values (1,2,3,4,5,6).
> A <- factor(c(1,1,2,2,3,3))
> B <- factor(c(1,2,1,2,1,2))
> z <- 1:6
~ 0 + A
. In this case, there is no overall constant in the model (you
are not fitting a reference value, called (Intercept)
in R), so
R assumes that you want to use treatment means \(\mu_i\).
In effect, the reference value is \(\mu = 0\) (because the intercept is not in the model)
and \(\mu_i = \alpha_i\) (the treatment effects are the treatment means).
When there is no overall constant
preceding A
in the formula, factor A
will be represented by three columns of dummy (0 or 1)
variables, each indicating one of the groups (you can ignore the attributes [attr
stuff] in the output).
> model.matrix(~ 0 + A)
A1 A2 A3
1 1 0 0
2 1 0 0
3 0 1 0
4 0 1 0
5 0 0 1
6 0 0 1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$A
[1] "contr.sum"
D.A
for dummy A. There is an analogous matrix for B
, which we will call D.B
.
> model.matrix(~ 0 + B)
B1 B2
1 1 0
2 0 1
3 1 0
4 0 1
5 1 0
6 0 1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$B
[1] "contr.sum"
19.3.1 Contrasts
For the formula ~ 1 + A
(or just ~ A
), the factor variable follows the intercept in the formula.
In this case, the intercept is represented by a column of all ones, and A
is represented by two
columns. (In general, for a factor with \(g\) groups the representation of the factor will have \(g-1\) columns.) There
are infinitely many choices for those two columns for A
, but a handful of options are common.
D.A
and uses the remaining \(g-1\) columns.
> model.matrix(~A,contrasts=list(A=contr.treatment))
(Intercept) A2 A3
1 1 0 0
2 1 0 0
3 1 1 0
4 1 1 0
5 1 0 1
6 1 0 1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$A
2 3
1 0 0
2 1 0
3 0 1
In this setting, the coefficient of the column of all ones will be the mean of group
1 of A (that is, \(\mu = \mu_1\) or \(\alpha_1=0\)), the coefficient of column two (A2
) will be the difference group 2 mean minus group 1 mean (\(\alpha_2 = \mu_2 - \mu_1\)),
and the coefficient of column three (A3
) will be the difference group 3 mean minus group 1 mean
(\(\alpha_3 = \mu_3 - \mu_1\)).
~ B
has two columns: the column of all ones and the second column
of D.B
.
> model.matrix(~B,contrasts=list(B=contr.treatment))
(Intercept) B2
1 1 0
2 1 1
3 1 0
4 1 1
5 1 0
6 1 1
attr(,"assign")
[1] 0 1
attr(,"contrasts")
attr(,"contrasts")$B
2
1 0
2 1
In this model, the intercept represents the mean of group 1 of B, and the coefficient of the second column represents the difference between the mean of B group 2 and B group 1.
Note that the meaning of the intercept changes between these two models; in the first example it was the mean in group 1 of factor A, and in the second example it was the mean of group 1 of factor B.
Another common set of contrasts is the “sum” contrasts. When you loadcfcdae
, it changes the default contrasts to be “sum” contrasts.
The sum contrasts force the treatment
effects to add to zero. Equivalently, the sum contrast forces \(\mu\) to be the
average of the \(\mu_i\) values. Because treatment effects add to zero with
these contrasts, the last treatment effect (here \(\alpha_3\))
is minus the sum of the others (here \(\alpha_3 = -\alpha_1 -\alpha_2)\).
The representing matrix looks like:
> model.matrix(~A,contrasts=list(A=contr.sum))
(Intercept) A1 A2
1 1 1 0
2 1 1 0
3 1 0 1
4 1 0 1
5 1 -1 -1
6 1 -1 -1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$A
[,1] [,2]
1 1 0
2 0 1
3 -1 -1
When you are in group 1, you get \(\mu + \alpha_1\), and you get \(\mu + \alpha_2\) when you are in group 2. In group 3, you get \(\mu - \alpha_1 - \alpha_2\), which is the same as \(\mu + \alpha_3\) when the effects add to zero. Of course, you get an analogous matrix for factor B, making use of the fact that \(\beta_2 = -\beta_1\).
Notice that the meaning of \(\mu\) is the same in both ~A
and ~B
when you
use the sum to zero contrasts. (In unbalanced data, the meaning of \(\mu\)
would change slightly between the two models.)
A
in dummy form and B
in contrast form:
> model.matrix(~ 0 + A + B,contrasts=c(A=contr.sum,B=contr.sum))
A1 A2 A3 B1
1 1 0 0 1
2 1 0 0 -1
3 0 1 0 1
4 0 1 0 -1
5 0 0 1 1
6 0 0 1 -1
attr(,"assign")
[1] 1 1 1 2
attr(,"contrasts")
attr(,"contrasts")$A
[,1] [,2]
1 1 0
2 0 1
3 -1 -1
attr(,"contrasts")$B
[,1]
1 1
2 -1
z
.
> model.matrix(~z)
(Intercept) z
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 1 6
attr(,"assign")
[1] 0 1
19.3.2 Once more, with interaction
Consider first the simple hierarchical situation. Here, any interaction in a formula is preceded by any terms it includes. Thus,a:b
must follow both a
and b
in the formula, and a:b:c
must follow a
, b
, c
, a:b
, a:c
, and b:c
in the formula.
In this case, the matrix that represents an interaction is the matrix that has all the
products of one column from each of the matrices representing the interaction’s constituent factors.
> model.matrix(~ A + B + A:B, contrasts=c(A=contr.sum,B=contr.sum))
(Intercept) A1 A2 B1 A1:B1 A2:B1
1 1 1 0 1 1 0
2 1 1 0 -1 -1 0
3 1 0 1 1 0 1
4 1 0 1 -1 0 -1
5 1 -1 -1 1 -1 -1
6 1 -1 -1 -1 1 1
attr(,"assign")
[1] 0 1 1 2 3 3
attr(,"contrasts")
attr(,"contrasts")$A
[,1] [,2]
1 1 0
2 0 1
3 -1 -1
attr(,"contrasts")$B
[,1]
1 1
2 -1
> model.matrix(~ A + z + A:z,contrasts=c(A=contr.sum))
(Intercept) A1 A2 z A1:z A2:z
1 1 1 0 1 1 0
2 1 1 0 2 2 0
3 1 0 1 3 0 3
4 1 0 1 4 0 4
5 1 -1 -1 5 -5 -5
6 1 -1 -1 6 -6 -6
attr(,"assign")
[1] 0 1 1 2 3 3
attr(,"contrasts")
attr(,"contrasts")$A
[,1] [,2]
1 1 0
2 0 1
3 -1 -1
In this example, the coefficient of z
is an average slope,
and the coefficients of A:z
adjust the slope up or down for the first
two groups of A
. The coefficient for the last group of A
is set so
that the three sum to zero.
The process is rather less obvious for formulas that do not follow hierarchy, and it’s probably best to leave that for another time.