Chapter 2 Getting your data

2.1 Variables in R

R supports several variable types including numeric (e.g., 1, -7, 3.14159), logical (e.g., TRUE or FALSE), and character (e.g., “Moe,” “Larry,” “Curly”), and we will see and use multiple types of variables. Most obviously, the responses we measure in experiments are generally numeric variables. (There could be contexts where the response is expressed in some other fashion, for example, “Success” or “Failure,” but these are often turned into numeric via 0 or 1.)

Perhaps the most common use for logical variables is subsetting other variables. For example, suppose we only want to use the data that were collected on a Tuesday. Then we might have an argument in a command like this: subset = DayOfWeek == "Tuesday". Or we might want to eliminate missing values via: newy <- y[!is.na(y)]. We will see this off and on in examples.
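
As a small sketch of logical subsetting, the commands below build a logical vector and use it to pick out values. The vectors y and DayOfWeek here are made-up illustrations, not data from the book.

```r
# Hypothetical data: five responses (one missing) and the day each was collected
y <- c(4.9, NA, 5.2, 4.4, 6.1)
DayOfWeek <- c("Monday", "Tuesday", "Tuesday", "Friday", "Tuesday")

is_tuesday <- DayOfWeek == "Tuesday"   # logical vector, one TRUE/FALSE per value
tuesday_y <- y[is_tuesday]             # keeps the Tuesday values, including the NA

newy <- y[!is.na(y)]                   # drop the missing values

# Conditions can be combined elementwise with &
clean_tuesday <- y[is_tuesday & !is.na(y)]
clean_tuesday
```

Note that subsetting by day alone keeps the missing value; combining the two conditions removes it.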

Character data most often arise as labels for kinds of treatments (called levels of treatments). Thus we might have results for different kinds of acids, and the variable AcidType records the type of acid used in a particular unit (“sulfuric,” “hydrochloric,” “hydrofluoric”).

Many kinds of data can be turned into factors. Factors indicate grouping. If you make a factor out of numeric data, for example factor(c(1,4,3,2,5,4)), the result is not numeric. That is, 1 in the factor we just made no longer represents a number. Instead, it represents a group that has the label 1. If you have data from multiple groups, say 3 groups, you can represent that as three different numeric variables, or you can represent that as one numeric variable holding all of the data and a factor with three levels indicating which group each value belongs to.
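
The two representations of grouped data can be sketched as follows. The group values and the names g1, g2, g3, values, and group are invented for illustration.

```r
# A factor made from numbers is no longer numeric
f <- factor(c(1, 4, 3, 2, 5, 4))
is.numeric(f)    # FALSE
levels(f)        # the group labels, stored as character strings

# Representation 1: three separate numeric variables (hypothetical values)
g1 <- c(10.1, 9.8)
g2 <- c(12.3, 11.9)
g3 <- c(8.7, 9.1)

# Representation 2: one numeric variable plus a factor with three levels
values <- c(g1, g2, g3)
group <- factor(rep(c("g1", "g2", "g3"), each = 2))

# The factor lets us work group by group, for example computing group means
group_means <- tapply(values, group, mean)
group_means
```

The second representation is usually the more convenient one, because most modeling functions expect a response variable and a grouping factor.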

If you have multiple variables that all have the same length and the same type, you can collect them into a matrix. We often think of a matrix as cases (rows) by variables (columns). However, a matrix cannot mix data types, so you cannot, for example, keep numeric and factor variables together in one matrix.

If you have multiple variables that all have the same length, you can generally collect them into a data frame, which is a sort of pseudo-matrix. It still looks like cases (rows) by variables (columns), but the variables are allowed to have different types. For example, you could combine a numeric response variable and a factor variable indicating treatment type into a data frame.
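
To see the difference between a matrix and a data frame, here is a minimal sketch. The variable names response and label, and their values, are assumptions made up for the example.

```r
# Hypothetical numeric response and character treatment labels
response <- c(4.3, 4.6, 5.3, 5.7)
label <- c("day15", "day15", "day28", "day28")

# Binding them into a matrix forces a single type: everything becomes character
m <- cbind(response, label)
is.character(m)            # TRUE, the numbers are now strings

# A data frame keeps each column's own type
d <- data.frame(response = response, treatment = factor(label))
is.numeric(d$response)     # TRUE
is.factor(d$treatment)     # TRUE
str(d)
```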

2.2 Data Sets in cfcdae

All of the data from FCDAE are available in cfcdae as data frames. For example, if you want the runstitching data, you can give the commands:

library(cfcdae)
data(RunStitch)

You only need the library() command once each session, and we will not show library(cfcdae) in the future.

RunStitch itself is a data frame with two columns named Standard and Ergonomic; here are the first few values:

head(RunStitch)
  Standard Ergonomic
1     4.90      3.87
2     4.50      4.54
3     4.86      4.60
4     5.57      5.27
5     4.62      5.59
6     4.65      4.61

You can access the data in Standard or Ergonomic in a number of ways including:

RunStitch$Standard # take the component named Standard
 [1] 4.90 4.50 4.86 5.57 4.62 4.65 4.62 6.39 4.36 4.91 4.70 4.77 4.75 4.60 5.06
[16] 5.51 4.66 4.95 4.75 4.67 5.06 4.44 4.46 5.43 4.83 5.05 5.78 5.10 4.68 6.06
RunStitch[,"Standard"] # take the column named Standard
 [1] 4.90 4.50 4.86 5.57 4.62 4.65 4.62 6.39 4.36 4.91 4.70 4.77 4.75 4.60 5.06
[16] 5.51 4.66 4.95 4.75 4.67 5.06 4.44 4.46 5.43 4.83 5.05 5.78 5.10 4.68 6.06
RunStitch[,1] # take the first column
 [1] 4.90 4.50 4.86 5.57 4.62 4.65 4.62 6.39 4.36 4.91 4.70 4.77 4.75 4.60 5.06
[16] 5.51 4.66 4.95 4.75 4.67 5.06 4.44 4.46 5.43 4.83 5.05 5.78 5.10 4.68 6.06
with(RunStitch,Standard) # look for Standard in RunStitch before looking elsewhere
 [1] 4.90 4.50 4.86 5.57 4.62 4.65 4.62 6.39 4.36 4.91 4.70 4.77 4.75 4.60 5.06
[16] 5.51 4.66 4.95 4.75 4.67 5.06 4.44 4.46 5.43 4.83 5.05 5.78 5.10 4.68 6.06

You can “attach” a data frame, and it will be automatically searched for variables.

attach(RunStitch)
Standard
 [1] 4.90 4.50 4.86 5.57 4.62 4.65 4.62 6.39 4.36 4.91 4.70 4.77 4.75 4.60 5.06
[16] 5.51 4.66 4.95 4.75 4.67 5.06 4.44 4.46 5.43 4.83 5.05 5.78 5.10 4.68 6.06

However, use “attach()” with care, as you can confuse yourself mightily. You could also have a variable called “Standard” that was outside of the data frame, and you have to keep track of which one you are using at any time.
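
Here is a sketch of how that confusion can arise. It uses a small made-up data frame df rather than RunStitch, and a made-up workspace variable that happens to share the name Standard. An attached data frame is searched after the workspace, so the workspace copy wins.

```r
# Hypothetical data frame with a column named Standard
df <- data.frame(Standard = c(1, 2, 3))

# A different variable with the same name in the workspace
Standard <- c(100, 200)

attach(df)                      # R notes that Standard is masked by the workspace
from_workspace <- Standard      # this is the workspace copy, not the column
detach(df)                      # always detach when you are done

from_frame <- with(df, Standard)   # with() looks in the data frame first

from_workspace
from_frame
```

Using with() or the $ form makes it explicit which Standard you mean.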

The most common way to use data in a data frame is through the optional “data=dataframename” argument that many R functions accept. If you supply that argument, then the variables inside the data frame are generally available by name without needing to reference the data frame.
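
As an illustration, the sketch below uses the data= argument with two base R functions, lm() and aggregate(). The data frame plants and its column names are hypothetical; the values are the phosphorus measurements typed in by hand in Section 2.3.

```r
# Hypothetical data frame: phosphorus values and the age (in days) of each plant
plants <- data.frame(
  phosphorus = c(4.3, 4.6, 4.8, 5.4, 5.3, 5.7, 6.0, 6.3),
  day = factor(rep(c("15", "28"), each = 4))
)

# With data=plants we can write phosphorus and day instead of
# plants$phosphorus and plants$day
fit <- lm(phosphorus ~ day, data = plants)
coef(fit)

# The same argument works in many other functions
aggregate(phosphorus ~ day, data = plants, FUN = mean)
```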

2.3 Typing in data

You can type data into R if you need to, but try to avoid it if you can. Typing is an extremely easy way to introduce bad data.

R stores data as scalars (a single number), vectors (a list of numbers), matrices (a table of numbers), and other ways. The function c() takes its arguments and puts them together into a vector; I think of it as a shortcut for “combine” or “concatenate” or something like that.

The form <- is assignment; it means take whatever is on the right and assign it to the variable whose name is given on the left. We want to input phosphorus values for 15-day-old plants. This command combines 4 numbers into a vector and then assigns that to a variable named day15.

day15 <- c(4.3,4.6,4.8,5.4)

You can also use an equals sign instead of the assignment arrow. This is standard in many programming languages, but it leads to statements such as y=y+1 that are valid code yet make no sense when read as mathematical equations. You can also extend a command over more than one line (although some GUI front ends can make that a bit challenging), and you can make your lines as long as you like.

I recommend using the <- form of assignment, because the equals sign is also used to set arguments in function calls.
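
A short sketch of the two roles of the equals sign may help. The variable names here are arbitrary.

```r
x <- c(1, 2, 3)      # assignment with the arrow
y = 5                # assignment with the equals sign also works
y = y + 1            # valid code: take the old y, add 1, store it back in y

# Inside a function call, x= names an argument of mean();
# it does not assign anything to the workspace variable x
m <- mean(x = c(10, 20, 30))

m    # 20
x    # still 1 2 3
y    # 6
```

Reserving = for arguments and <- for assignment keeps the two meanings visually distinct.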

These are the phosphorus values for the 28-day-old plants, entered with a command that extends over two lines.

day28 <- c(5.3,5.7,6.0,
6.3)

If we wanted to make a matrix containing both the 15 and 28 day data with days as columns, we could first put the data in one long vector and then turn that vector into a matrix.

alldata <- c(4.3,4.6,4.8,5.4,5.3,5.7,6.0,6.3)
alldata <- c(day15,day28) # gives the same thing
matrixdata <- matrix(alldata,nrow=4)
matrixdata
     [,1] [,2]
[1,]  4.3  5.3
[2,]  4.6  5.7
[3,]  4.8  6.0
[4,]  5.4  6.3
cbind(day15,day28) # bind columns if you already had columns
     day15 day28
[1,]   4.3   5.3
[2,]   4.6   5.7
[3,]   4.8   6.0
[4,]   5.4   6.3

By default, matrix() fills the matrix down columns, but you can ask it to fill by rows instead (note that the data vector must then be in row-by-row order).

alldata2 <- c(4.3,5.3,4.6,5.7,4.8,6.0,5.4,6.3)
rowmatrix <- matrix(alldata2,nrow=4,byrow=TRUE)
rowmatrix
     [,1] [,2]
[1,]  4.3  5.3
[2,]  4.6  5.7
[3,]  4.8  6.0
[4,]  5.4  6.3
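
As a quick check that the two fill orders give the same matrix, here is a sketch that builds the row-ordered vector from the column vectors instead of typing it again. The names bycol, interleaved, byrow, and same are my own.

```r
day15 <- c(4.3, 4.6, 4.8, 5.4)
day28 <- c(5.3, 5.7, 6.0, 6.3)

# Fill down columns (the default): all of day15, then all of day28
bycol <- matrix(c(day15, day28), nrow = 4)

# Interleave the two vectors to get row-by-row order: 4.3 5.3 4.6 5.7 ...
interleaved <- as.vector(rbind(day15, day28))
byrow <- matrix(interleaved, nrow = 4, byrow = TRUE)

same <- identical(bycol, byrow)   # the two matrices agree
same

# Column names make the matrix easier to read
colnames(bycol) <- c("day15", "day28")
bycol
```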

2.4 Generating Data

Sometimes we can generate numbers programmatically. For example, if we want 100 random draws from a normal distribution with mean 10 and standard deviation 3, we can do

set.seed(12345)
y <- rnorm(100,mean=10,sd=3)

The set.seed() call sets the starting point (seed) of the random number generator. If you set the same seed before each call, you get the same normals every time. If the seed is something else, you will get different normals. There are corresponding functions for many other distributions (e.g., rbinom, rgamma, and rt).
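
The effect of the seed can be checked directly, as in this sketch. The second seed value, 54321, is an arbitrary choice of mine.

```r
set.seed(12345)
y1 <- rnorm(100, mean = 10, sd = 3)

set.seed(12345)                       # same seed again
y2 <- rnorm(100, mean = 10, sd = 3)
identical(y1, y2)                     # TRUE: exactly the same draws

set.seed(54321)                       # a different seed
y3 <- rnorm(100, mean = 10, sd = 3)
identical(y1, y3)                     # FALSE: different draws

# Other distributions follow the same pattern
b <- rbinom(5, size = 10, prob = 0.5) # 5 binomial draws, each a count from 0 to 10
b
```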

You can generate sequences using the seq() function.

seq(1,6,by=1/3) # 1 to 6 in steps of 1/3
 [1] 1.000000 1.333333 1.666667 2.000000 2.333333 2.666667 3.000000 3.333333
 [9] 3.666667 4.000000 4.333333 4.666667 5.000000 5.333333 5.666667 6.000000
seq(1,6) # steps of 1 is the default
[1] 1 2 3 4 5 6
1:6 # short cut
[1] 1 2 3 4 5 6

You can get repeats by using the rep() function.

rep(1:3,2) # repeat the complete sequence twice
[1] 1 2 3 1 2 3
rep(1:3,c(2,2,4)) # repeat first two twice, last one four times
[1] 1 1 2 2 3 3 3 3
rep(1:3,each=3,length=18) # repeat each 3 times, then repeat all of that to length 18
 [1] 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3

This will be very useful when setting up variables to indicate grouping.
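
For example, a grouping factor for three treatments with four units each might be set up as in this sketch. The treatment labels A, B, and C are hypothetical. The base R function gl() is a shortcut that generates the same kind of factor.

```r
# Hypothetical design: treatments A, B, C with 4 units each
trt <- factor(rep(c("A", "B", "C"), each = 4))
trt
table(trt)                 # 4 units in each group

# gl(n, k) generates a factor with n levels, each repeated k times
trt2 <- gl(3, 4, labels = c("A", "B", "C"))
all(as.character(trt) == as.character(trt2))   # TRUE: same labels in the same order
```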

2.5 Reading data from files

We sometimes have data in external files that we would like to read in. Common formats include plain text, .csv, and .Rda. The RStudio file menu has an item to import from additional formats, including SAS, SPSS, Stata, and Excel formats.

  • If you just have a bunch of numbers in a plain text file, you can use scan("filename") to read the data in. The result is a numeric vector, and you will need to assign it to a variable.
  • If you have a plain text file that is formatted as a matrix, you can read the data in using read.table("filename"). The return value is a data frame. If the first line contains column names, you should use read.table("filename",header=TRUE).
  • For a .csv file, the function read.csv("filename") does what you want. Again, this returns a data frame.
  • For a .Rda file (one or more R variables saved in the R format), you can use load("filename"). This directly creates the saved variables.
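
The four functions above can be tried out without any existing files, as in the following sketch. It first writes small temporary files and then reads them back. The data frame d, the file contents, and the temporary file names are all made up for the example.

```r
# Hypothetical data to write out and read back
d <- data.frame(day15 = c(4.3, 4.6, 4.8, 5.4), day28 = c(5.3, 5.7, 6.0, 6.3))

# Plain text file of numbers: scan() returns a numeric vector
nums <- tempfile(fileext = ".txt")
writeLines("4.3 4.6 4.8 5.4", nums)
from_scan <- scan(nums)

# Plain text table with column names in the first line
txt <- tempfile(fileext = ".txt")
write.table(d, txt, row.names = FALSE, quote = FALSE)
from_table <- read.table(txt, header = TRUE)

# .csv file
csv <- tempfile(fileext = ".csv")
write.csv(d, csv, row.names = FALSE)
from_csv <- read.csv(csv)

# .Rda file: load() recreates the saved variable under its original name
rda <- tempfile(fileext = ".Rda")
original <- d
save(d, file = rda)
rm(d)
load(rda)          # d exists again
```

In real use you would replace the temporary file names with the path to your own file.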