Citation:

Helwig, N. E. (2019, Mar 8). Reproducible data wrangling and modeling with R. Department of Psychology Open and Reproducible Science Brown Bag Series. University of Minnesota. Obtained from http://users.stat.umn.edu/~helwig/talks/ReproducibleCode.html

and

Helwig, N. E., & Ruprecht, M. R. (2017). Age, gender, and self-esteem: a sociocultural look through a nonparametric lens. Archives of Scientific Psychology, 5(1), 19-31. doi: https://doi.org/10.1037/arc0000032

Source code: http://users.stat.umn.edu/~helwig/talks/ReproducibleCode.Rmd

1 Introduction

1.1 Motivation

Going from collected data to published paper often involves several steps that apply different “operations” to the data. Keeping a detailed record of each operation can be a daunting task—but is imperative for reproducible research. In this document, I demonstrate how the entire “data wrangling and modeling” process can be automated using reproducible R code. The ideas (and some R code) used in this demonstration have been adapted from Helwig and Ruprecht (2017).

1.2 Why R?

R is a free and open-source software environment for statistics, and is the lingua franca for statisticians and data scientists.

  • Free = installing and using R costs you nothing (i.e., no initial cost or yearly fee)

  • Open-source = the R source code is available to you (i.e., not a black box)

R seamlessly documents each operation applied to a dataset, which makes reproducible research easy and enjoyable.

  • Customizable functions and scripts streamline process

  • Data processing, analysis, and reporting (all-in-one)

Available from The R Project for Statistical Computing: http://www.r-project.org/

1.3 Overview

To demonstrate the power of R for reproducible research, this document provides a fully worked example covering all aspects of the “wrangling and modeling” process. The demonstration will cover data processing and cleaning, data visualization, data tabulation (i.e., making tables), and data analysis. For each topic, I will provide reproducible code to conduct the “operation” and save (or export) the result. Aside from the Generalized Additive Models (GAMs) fit in Section 5.5, all of the “operations” conducted in this document are done using pure R, i.e., without any add-on packages.

2 Data Loading and Preprocessing

2.1 Data Overview

For the example, we will use the “RSE” (Rosenberg Self-Esteem) Data from http://openpsychometrics.org/_rawdata/RSE.zip

Updated: 2/15/2014

Number of Subjects: 47,974

Number of Variables: 14

The first 10 variables are the items of the RSE:

  • Q1: I feel that I am a person of worth, at least on an equal plane with others.
  • Q2: I feel that I have a number of good qualities.
  • Q3: All in all, I am inclined to feel that I am a failure.
  • Q4: I am able to do things as well as most other people.
  • Q5: I feel I do not have much to be proud of.
  • Q6: I take a positive attitude toward myself.
  • Q7: On the whole, I am satisfied with myself.
  • Q8: I wish I could have more respect for myself.
  • Q9: I certainly feel useless at times.
  • Q10: At times I think I am no good at all.

From the helpfile, the dataset also contains…

  • gender: (1 = male, 2 = female, 3 = other; 0 = none chosen)
  • age: free response (0 = response that could not be converted to integer)
  • source: How the user came to the page (HTTP referer). 1 = front page of personality test website, 2 = Google search, 3 = other.
  • country: Inferred from technical information using MaxMind GeoLite.

Remark 1: In the dataset helpfile, the levels of “gender” are incorrectly labeled (1 = male, 3 = female, 3 = other; 0 = none chosen).

2.2 Download Data

The data are stored as a “data.csv” file, which is within the “RSE.zip” file linked above.

Remark 2: Although the file has a “.csv” extension, it is not a CSV (comma-separated values) file. In reality, it is a TSV (tab-separated values) file in disguise.

First, we need to download and unzip the file. This can be done manually (boo!) or via reproducible R code (woo-hoo!).

# Step 1: define destination file (i.e., where to save the data)
destfile <- paste0(getwd(), "/RSE.zip")

# Step 2: download the data file (to the destination file)
download.file(url = "http://openpsychometrics.org/_rawdata/RSE.zip",
              destfile = destfile)

# Step 3: unzip the "RSE.zip" file
unzip(destfile)

Warning: the unzip function may not work on Windows machines.

Next, we will load the data into R using the read.table function

datapath <- paste0(getwd(), "/RSE/data.csv")
rse <- read.table(file = datapath, header = TRUE)

Note that the header = TRUE option is included because the “data.csv” file contains a header, i.e., variable names as the first row.

2.3 Look at the Data

Let’s take a look at the first 6 (by default) rows of data:

head(rse)
##   Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 gender age source country
## 1  3  3  1  4  3  4  3  2  3   3      1  40      1      US
## 2  4  4  1  3  1  3  3  2  3   2      1  36      1      US
## 3  2  3  2  3  3  3  2  3  3   3      2  22      1      US
## 4  4  3  2  3  2  3  2  3  3   3      1  31      1      US
## 5  4  4  1  4  1  4  4  1  1   1      1  30      1      EU
## 6  4  4  1  3  1  3  4  2  2   1      2  25      1      CA

Remark 3: The numbers (on the left) are the rownames, which are not a part of the data.

To confirm this, let’s check the data dimensions:

dim(rse)
## [1] 47974    14

We have 47,974 rows (subjects) and 14 columns (variables).

2.4 Recode the Missing Gender and Age

From the data helpfile, we know missing values for gender and age are coded as 0.

First, let’s determine how many missing values we have for each variable:

table(rse$gender)    # table all values
## 
##     0     1     2     3 
##   462 17801 29182   529
sum(rse$age == 0)    # count number of zeros
## [1] 568

We will recode the missing values as NA (Not Available), which is the “Missing Value” indicator in R.

# recode gender missing values
indx <- which(rse$gender == 0)
rse$gender[indx] <- NA

# recode age missing values
indx <- which(rse$age == 0)
rse$age[indx] <- NA

2.5 Recode the Missing RSE Items

From the data helpfile, we know missing values for Q1Q10 are coded as 0.

To recode the missing values as NA, we can use a “for loop” to streamline the process

for(j in 1:10){
  indx <- which(rse[,j] == 0)
  rse[indx,j] <- NA
}

Note that the for loop redefines the zeros as NA’s for columns \(j = 1, \ldots, 10\), which correspond to the RSE items Q1, …, Q10.

2.6 Recode Levels of Gender

From the helpfile, the gender variable is coded using integers (1 = male, 2 = female, 3 = other). Since R can handle factor variables (see ?factor), we can recode the levels to be more intuitive.

To recode the levels, we just need to redefine the factor:

rse$gender <- factor(rse$gender, levels = c(1, 2, 3),
                     labels = c("male", "female", "other"))
table(rse$gender)
## 
##   male female  other 
##  17801  29182    529

Note that the 1’s are now “male”, the 2’s are now “female”, and the 3’s are now “other”. Also note that the missing data (i.e., NA values) are excluded from the table now.

2.7 Recode Negatively Worded RSE Items

Note that five of the items of the RSE are worded in the negative direction, such that endorsement of the item indicates lower levels of self-esteem (i.e., items 3, 5, 8, 9, 10).

The (non-missing) RSE items are coded with the integers 1 (= strongly disagree), 2 (= disagree), 3 (= agree), and 4 (= strongly agree).

To recode the negatively worded items, we need to convert…

  • 1 (= strongly disagree) to 4 (= strongly agree)
  • 2 (= disagree) to 3 (= agree)
  • 3 (= agree) to 2 (= disagree)
  • 4 (= strongly agree) to 1 (= strongly disagree)

Algebraically, this conversion can be represented as \[y = 5 - x\] where \(x\) is the negatively worded item and \(y\) is the converted score.

To implement this conversion in R, we just write out the algebraic equation:

indx <- c(3, 5, 8, 9, 10)
rse[,indx] <- 5 - rse[,indx]

2.8 Define RSE Total Score

After the conversion, all of the items are worded in the positive direction. This makes it possible to define the RSE Total score, which is the summation of the 10 item scores.

We can add the total score to our dataframe by applying the rowSums function to the first 10 columns of our dataframe:

rse$total <- rowSums(rse[,1:10])
head(rse)
##   Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 gender age source country total
## 1  3  3  4  4  2  4  3  3  2   2   male  40      1      US    30
## 2  4  4  4  3  4  3  3  3  2   3   male  36      1      US    33
## 3  2  3  3  3  2  3  2  2  2   2 female  22      1      US    24
## 4  4  3  3  3  3  3  2  2  2   2   male  31      1      US    27
## 5  4  4  4  4  4  4  4  4  4   4   male  30      1      EU    40
## 6  4  4  4  3  4  3  4  3  3   4 female  25      1      CA    36

Note that our dataframe now has 15 columns, with the total score appended at the end.

2.9 Check Age Range

Before any analysis work, let’s check the range of the ages

range(rse$age, na.rm = TRUE)
## [1]          1 2147483647

The minimum reported subject age is 1 (year old) and the maximum reported age is 2,147,483,647 (years old). It seems that we have some miscoded and/or corrupted data…

To check the extent of the problem, let’s look at the proportion of subjects with ages outside of the range 10 to 80 years old:

mean(rse$age < 10, na.rm = TRUE)
## [1] 0.001307851
mean(rse$age > 80, na.rm = TRUE)
## [1] 0.001265663

Note that about 0.1% of the subjects have reported ages below 10 years old. Similarly, about 0.1% of the subjects have reported ages greater than 80 years old.

2.10 Subset Data: 10-80 year olds

Let’s subset the dataset to only include the 10-80 year old subjects:

rse1080 <- subset(rse, age >= 10 & age <= 80)

Note that we have created a new dataset that contains only the subjects with ages in the range 10 to 80 years old:

dim(rse1080)
## [1] 47284    15
head(rse1080)
##   Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 gender age source country total
## 1  3  3  4  4  2  4  3  3  2   2   male  40      1      US    30
## 2  4  4  4  3  4  3  3  3  2   3   male  36      1      US    33
## 3  2  3  3  3  2  3  2  2  2   2 female  22      1      US    24
## 4  4  3  3  3  3  3  2  2  2   2   male  31      1      US    27
## 5  4  4  4  4  4  4  4  4  4   4   male  30      1      EU    40
## 6  4  4  4  3  4  3  4  3  3   4 female  25      1      CA    36
range(rse1080$age, na.rm = TRUE)
## [1] 10 80

2.11 Remove Missing Data

As a final preprocessing step, we will remove subjects who are missing data for the RSE total score, the gender, or the age.

navars <- c("gender", "age", "total")
nacheck <- function(x) any(is.na(x))
indx <- apply(rse1080[, navars], 1, nacheck)
rse1080 <- rse1080[!indx,]
nrow(rse1080)
## [1] 45728

3 Data Visualization

3.1 Age by Gender

Let’s look at histograms of the ages:

hist(rse1080$age, xlab = "Age",
     main = "Histogram of Age")

Suppose that we want to look at separate histograms for each gender. To do this, we will first create an indexing vector for each gender:

indxM <- which(rse1080$gender == "male")
indxF <- which(rse1080$gender == "female")
indxO <- which(rse1080$gender == "other")
length(indxM) + length(indxF) + length(indxO)
## [1] 45728

Note that the vector indxM contains the row numbers corresponding to the male subjects, indxF contains the row numbers corresponding to the female subjects, and indxO contains the row numbers corresponding to the other subjects.

Now we can plot separate histograms of age for each gender by indexing the appropriate elements of the age variable:

par(mfrow = c(1,3))
hist(rse1080$age[indxM], xlab = "Age",
     main = "Histogram of Age (males)", col = "blue")
hist(rse1080$age[indxF], xlab = "Age",
     main = "Histogram of Age  (females)", col = "red")
hist(rse1080$age[indxO], xlab = "Age",
     main = "Histogram of Age (other)", col = "purple")

To visualize all of the distributions together, we could use a kernel density estimate (KDE):

# KDE for males
kde <- density(rse1080$age[indxM], na.rm = TRUE)
plot(kde, xlab = "Age", ylab = "Density", 
     main = "Kernel Density Estimates of Age by Gender",
     col = "blue", lwd = 2, xlim = c(10, 80), ylim = c(0, 0.12))

# KDE for females
kde <- density(rse1080$age[indxF], na.rm = TRUE)
lines(kde, col = "red", lty = 2, lwd = 2)

# KDE for others
kde <- density(rse1080$age[indxO], na.rm = TRUE)
lines(kde, col = "purple", lty = 3, lwd = 2)

# add legend
legend("topright", legend = c("male", "female", "other"),
       lty = 1:3, lwd = 2, col = c("blue", "red", "purple"), bty = "n")