Chapter 2 Getting your data
2.1 Variables in R
R supports several variable types including numeric (e.g., 1, -7, 3.14159), logical (e.g., TRUE or FALSE), and character (e.g., “Moe”, “Larry”, “Curly”), and we will see and use multiple types of variables. Most obviously, the responses we measure in experiments are generally numeric variables. (There could be contexts where the response is expressed in some other fashion, for example, “Success” or “Failure”, but these are often turned into numeric via 0 or 1.)
Perhaps the most common use for logical variables is subsetting other variables. For example, suppose we only want to use the data that were collected on a Tuesday. Then we might have an argument in a command like this: subset = (DayOfWeek == "Tuesday"). Or we might want to eliminate missing values via: newy <- y[!is.na(y)]. We will see this off and on in examples.
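As a small sketch of both idioms, assuming made-up vectors y and DayOfWeek (neither comes from a data set in this chapter):

```r
# Hypothetical data: five responses, one of them missing
y <- c(4.1, NA, 3.8, 5.0, 4.4)
DayOfWeek <- c("Monday", "Tuesday", "Tuesday", "Friday", "Tuesday")

# A comparison returns a logical vector, one TRUE/FALSE per element
isTuesday <- DayOfWeek == "Tuesday"

# Indexing with a logical vector keeps only the positions that are TRUE
tuesdayY <- y[isTuesday]

# Drop the missing values, wherever they occur
newy <- y[!is.na(y)]
```

Note that tuesdayY still contains the NA, because the missing response was recorded on a Tuesday; the two kinds of subsetting are independent of each other.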
Character data most often arise as labels for kinds of treatments (called levels of treatments). Thus we might have results for different kinds of acids, and the variable AcidType records the type of acid used in a particular unit (“sulfuric”, “hydrochloric”, “hydrofluoric”).
Many kinds of data can be turned into factors. Factors indicate grouping. If you make a factor out of numeric data, for example factor(c(1,4,3,2,5,4)), the result is not numeric. That is, 1 in the factor we just made no longer represents a number. Instead, it represents a group that has the label 1. If you have data from multiple groups, say 3 groups, you can represent that as three different numeric variables, or you can represent that as one numeric variable holding all of the data and a factor with three levels indicating which group each value belongs to.
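The following sketch illustrates the point that a factor holds labels rather than numbers. The vectors f2, resp, and group are hypothetical values chosen only for illustration.

```r
f <- factor(c(1, 4, 3, 2, 5, 4))
is.numeric(f)   # FALSE: the values are now group labels
levels(f)       # the labels, stored as character strings

# Hypothetical factor whose labels differ from their positions
f2 <- factor(c(10, 40, 30, 40))
codes <- as.numeric(f2)                  # internal level codes, not the labels
values <- as.numeric(as.character(f2))   # the original numbers, recovered via the labels

# One numeric variable plus a three-level factor (hypothetical values)
resp <- c(5.1, 4.9, 6.2, 6.0, 7.3, 7.1)
group <- factor(c("A", "A", "B", "B", "C", "C"))
groupMeans <- tapply(resp, group, mean)  # one mean per level of group
```

Applying as.numeric() directly to a factor returns the level codes, which is a common source of silent errors when a numeric column has accidentally become a factor.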
If you have multiple variables that all have the same length and the same type, you can collect them into a matrix format. We often think of that as cases (rows) by variables (columns). However, you cannot mix data types in a matrix or combine numeric and factor variables.
If you have multiple variables that all have the same length, you can generally collect them into a data frame, which is sort of a pseudo-matrix. It still looks like cases (rows) by variables (columns), but the variables are allowed to have different types. For example, you could combine a numeric response variable and a factor variable indicating treatment type into a data frame.
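To see the difference in practice, here is a minimal sketch with hypothetical variables resp and trt. Binding a numeric and a character variable into a matrix does not fail; R silently converts everything to character, which is rarely what you want.

```r
# Hypothetical response values and treatment labels
resp <- c(4.3, 4.6, 5.3, 5.7)
trt <- c("day15", "day15", "day28", "day28")

# cbind() builds a matrix; mixing types coerces every entry to character
m <- cbind(resp, trt)
mode(m)   # "character"

# A data frame keeps each column's own type
d <- data.frame(resp = resp, trt = factor(trt))
str(d)    # shows a numeric column and a factor column
```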
2.2 Data Sets in cfcdae
All of the data from FCDAE are available in cfcdae as data frames.
For example, if you want the runstitching data, you can give the commands:
> library(cfcdae)
> data(RunStitch)
You only need the library() command once each session, and we will not show library(cfcdae) in the future.
RunStitch itself is a data frame with two columns named Standard and Ergonomic; here are the first few values:
> head(RunStitch)
  Standard Ergonomic
1     4.90      3.87
2     4.50      4.54
3     4.86      4.60
4     5.57      5.27
5     4.62      5.59
6     4.65      4.61
> RunStitch$Standard # take the component named Standard
[1] 4.90 4.50 4.86 5.57 4.62 4.65 4.62 6.39 4.36 4.91 4.70 4.77 4.75 4.60 5.06
[16] 5.51 4.66 4.95 4.75 4.67 5.06 4.44 4.46 5.43 4.83 5.05 5.78 5.10 4.68 6.06
> RunStitch[,"Standard"] # take the column named Standard
[1] 4.90 4.50 4.86 5.57 4.62 4.65 4.62 6.39 4.36 4.91 4.70 4.77 4.75 4.60 5.06
[16] 5.51 4.66 4.95 4.75 4.67 5.06 4.44 4.46 5.43 4.83 5.05 5.78 5.10 4.68 6.06
> RunStitch[,1] # take the first column
[1] 4.90 4.50 4.86 5.57 4.62 4.65 4.62 6.39 4.36 4.91 4.70 4.77 4.75 4.60 5.06
[16] 5.51 4.66 4.95 4.75 4.67 5.06 4.44 4.46 5.43 4.83 5.05 5.78 5.10 4.68 6.06
> with(RunStitch, Standard) # look for Standard in RunStitch before looking elsewhere
[1] 4.90 4.50 4.86 5.57 4.62 4.65 4.62 6.39 4.36 4.91 4.70 4.77 4.75 4.60 5.06
[16] 5.51 4.66 4.95 4.75 4.67 5.06 4.44 4.46 5.43 4.83 5.05 5.78 5.10 4.68 6.06
> attach(RunStitch)
> Standard
[1] 4.90 4.50 4.86 5.57 4.62 4.65 4.62 6.39 4.36 4.91 4.70 4.77 4.75 4.60 5.06
[16] 5.51 4.66 4.95 4.75 4.67 5.06 4.44 4.46 5.43 4.83 5.05 5.78 5.10 4.68 6.06
However, use “attach()” with care, as you can confuse yourself mightily. You could also have a variable called “Standard” that was outside of the data frame, and you have to keep track of which one you are using at any time.
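One way to sidestep that ambiguity is with(), which looks inside the data frame first. The sketch below uses a hypothetical stand-in data frame RunDemo (so that it runs without cfcdae) and a workspace variable that deliberately shares the name Standard.

```r
# Hypothetical stand-in for a data frame with a Standard column
RunDemo <- data.frame(Standard = c(4.90, 4.50, 4.86))

# A different variable with the same name, outside the data frame
Standard <- c(0, 0, 0)

inside <- with(RunDemo, Standard)   # the column from RunDemo
outside <- Standard                 # the workspace variable
```

Because with() only affects the single expression it wraps, there is no lingering state to keep track of afterwards.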
The most common way to use data in a data frame is that many R functions
have an optional “data=dataframename” argument. If you use that data=dataframe
argument, then the variables inside the data frame are generally available
without needing to reference the data frame.
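As an illustration of the data= argument, here is a sketch using a hypothetical data frame acidData (the values are invented). Both aggregate() and lm() accept a formula together with data=, and the names in the formula are looked up inside the data frame.

```r
# Hypothetical data frame: a numeric response and a treatment factor
acidData <- data.frame(
  yield = c(10.2, 9.8, 11.5, 11.1, 8.9, 9.3),
  AcidType = factor(rep(c("sulfuric", "hydrochloric", "hydrofluoric"), each = 2))
)

# yield and AcidType are found inside acidData; no $ or attach() needed
means <- aggregate(yield ~ AcidType, data = acidData, FUN = mean)
fit <- lm(yield ~ AcidType, data = acidData)
```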
2.3 Typing in data
You can type data into R if you need to, but try to avoid it if you can. Typing is an extremely easy way to introduce bad data.
R stores data as scalars (a single number), vectors (a list of numbers), matrices (a table of numbers), and other ways. The function c() takes its arguments and puts them together into a vector; I think of it as a shortcut for “combine” or “concatenate” or something like that.
The form <- is assignment; it means take whatever is on the right and assign it to the variable whose name is given on the left. We want to input phosphorus values for 15 day old plants. This command combines 4 numbers into a vector and then assigns that to a variable named day15.
> day15 <- c(4.3,4.6,4.8,5.4)
You can also use an equals sign instead of the assignment arrow (this is standard in many programming languages, but it does lead to statements like y=y+1 that are valid code yet look wrong as mathematics). You can also extend a command over more than one line (but some GUI front ends can make it a bit challenging to do). You can make your lines as long as you like.
I recommend using the <- form of assignment, as the equals sign version is also used for setting function parameters.
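A short sketch of why the two uses of the equals sign can be confused; the variable names here are arbitrary:

```r
x <- c(1, 2, 3)   # assignment with the arrow
y = c(4, 5, 6)    # assignment with equals also works at the top level

# Inside a function call, = names an argument; it does not assign.
# Here x = ... supplies mean()'s argument x and leaves the workspace x alone.
avg <- mean(x = c(10, 20, 30))
```

After these commands, x in the workspace is still 1, 2, 3, even though the call to mean() mentioned x.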
These are the phosphorus values for the 28 day plants.
> day28 <- c(5.3,5.7,6.0,
+ 6.3)
If we wanted to make a matrix containing both the 15 and 28 day data with days as columns, we could first put the data in one long vector and then turn that vector into a matrix.
> alldata <- c(4.3, 4.6, 4.8, 5.4, 5.3, 5.7, 6.0, 6.3)
> alldata <- c(day15, day28) # gives the same thing
> matrixdata <- matrix(alldata, nrow=4)
> matrixdata
     [,1] [,2]
[1,]  4.3  5.3
[2,]  4.6  5.7
[3,]  4.8  6.0
[4,]  5.4  6.3
> cbind(day15, day28) # bind columns if you already had columns
     day15 day28
[1,]   4.3   5.3
[2,]   4.6   5.7
[3,]   4.8   6.0
[4,]   5.4   6.3
By default, matrix() puts the data into the matrix down columns, but you can ask it to put data in by rows (note: you need to put the data in the correct order).
> alldata2 <- c(4.3, 5.3, 4.6, 5.7, 4.8, 6.0, 5.4, 6.3)
> rowmatrix <- matrix(alldata2, nrow=4, byrow=TRUE)
> rowmatrix
     [,1] [,2]
[1,]  4.3  5.3
[2,]  4.6  5.7
[3,]  4.8  6.0
[4,]  5.4  6.3
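When typing data in two different orders like this, it is worth checking that both routes produce the same matrix. A minimal sketch, where the names byCol and byRow are my own rather than from the text:

```r
day15 <- c(4.3, 4.6, 4.8, 5.4)
day28 <- c(5.3, 5.7, 6.0, 6.3)

# Fill down columns (the default) from the column-ordered vector
byCol <- matrix(c(day15, day28), nrow = 4)

# Fill across rows from the row-ordered vector
byRow <- matrix(c(4.3, 5.3, 4.6, 5.7, 4.8, 6.0, 5.4, 6.3), nrow = 4, byrow = TRUE)

same <- identical(byCol, byRow)   # TRUE if no value was mistyped or misplaced

# Column names make the matrix easier to read and to index
colnames(byCol) <- c("day15", "day28")
byCol[2, "day28"]
```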
2.4 Generating Data
Sometimes we can generate numbers programmatically. For example, if we want 100 random draws from a normal distribution with mean 10 and standard deviation 3, we can do
> set.seed(12345)
> y <- rnorm(100, mean=10, sd=3)
The set.seed() call sets the starting point (seed) of the random number generator. If you set the same seed before each call, you get the same normals every time. If the seed is something else, you will get different normals. There are corresponding functions for many other distributions (e.g., rbinom, rgamma, rt).
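The effect of the seed can be checked directly. In this sketch the names y1, y2, y3, and b are arbitrary, and the rbinom() parameters are just example values.

```r
set.seed(12345)
y1 <- rnorm(100, mean = 10, sd = 3)

set.seed(12345)                        # reset to the same starting point
y2 <- rnorm(100, mean = 10, sd = 3)
same <- identical(y1, y2)              # TRUE: same seed, same draws

y3 <- rnorm(100, mean = 10, sd = 3)    # no reset, so the stream continues
differs <- !identical(y1, y3)

# Other distributions follow the same pattern, e.g. 5 binomial draws
b <- rbinom(5, size = 10, prob = 0.5)
```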
Regular sequences of numbers can be generated with the seq() function.
> seq(1, 6, by=1/3) # 1 to 6 in steps of 1/3
[1] 1.000000 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000 3.333333
[9] 3.666667 4.000000 4.333333 4.666667 5.000000 5.333333 5.666667 6.000000
> seq(1, 6) # steps of 1 is the default
[1] 1 2 3 4 5 6
> 1:6 # short cut
[1] 1 2 3 4 5 6
Repeated patterns can be generated with the rep() function.
> rep(1:3, 2) # repeat the complete sequence twice
[1] 1 2 3 1 2 3
> rep(1:3, c(2, 2, 4)) # repeat first two twice, last one four times
[1] 1 1 2 2 3 3 3 3
> rep(1:3, each=3, length=18) # repeat each 3 times, then repeat all of that to length 18
[1] 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3
This will be very useful when setting up variables to indicate grouping.
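For instance, the phosphorus data from Section 2.3 could be stored as one numeric variable plus a grouping factor built with rep(). This is only a sketch; the names phos, day, and plantData are assumptions of mine.

```r
day15 <- c(4.3, 4.6, 4.8, 5.4)
day28 <- c(5.3, 5.7, 6.0, 6.3)

# All responses in one vector, 15 day values first
phos <- c(day15, day28)

# Matching labels: 15 four times, then 28 four times, stored as a factor
day <- factor(rep(c(15, 28), each = 4))

plantData <- data.frame(phos = phos, day = day)
table(plantData$day)   # four observations in each group
```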
2.5 Reading data from files
We sometimes have data on external files that we would like to read in. Common formats include plain text, .csv, or .Rda. The RStudio file menu has an item to import from additional formats, including SAS, SPSS, Stata, and Excel formats.
- If you just have a bunch of numbers in a plain text file, you can use scan("filename") to read the data in. You will need to assign the output to a vector.
- If you have a plain text file that is formatted as a matrix, you can read the data in using read.table("filename"). The return value is a data frame. If the first line contains column names, you should use read.table("filename", header=TRUE).
- For a .csv file, the function read.csv("filename") does what you want. Again, this returns a data frame. Unlike read.table(), read.csv() uses header=TRUE by default, so specify header=FALSE if the first row contains data rather than column names.
- For a .Rda file (one or more R variables saved in the R format), you can use load("filename"). This directly creates the saved variables.
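The four readers can be tried out without any existing files by first writing some data to temporary files. The following is a sketch under that assumption: the data frame demo and all of the file names are hypothetical, created with tempfile().

```r
# Hypothetical data to write out and read back
demo <- data.frame(day15 = c(4.3, 4.6, 4.8, 5.4),
                   day28 = c(5.3, 5.7, 6.0, 6.3))

numFile <- tempfile(fileext = ".txt")
txtFile <- tempfile(fileext = ".txt")
csvFile <- tempfile(fileext = ".csv")
rdaFile <- tempfile(fileext = ".Rda")

writeLines("4.3 4.6 4.8 5.4", numFile)                         # bare numbers
write.table(demo, txtFile, row.names = FALSE, quote = FALSE)   # text table with header
write.csv(demo, csvFile, row.names = FALSE)                    # csv with header
save(demo, file = rdaFile)                                     # R format

fromScan <- scan(numFile)                         # numeric vector
fromTable <- read.table(txtFile, header = TRUE)   # data frame
fromCsv <- read.csv(csvFile)                      # data frame; header is read by default

rm(demo)        # remove the original ...
load(rdaFile)   # ... and load() recreates it under its saved name
```

Note that load() does not return the data; it restores the variables under the names they had when saved, so no assignment is needed.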