Citation:
Helwig, N. E. (2019, Mar 8). Reproducible data wrangling and modeling with R. Department of Psychology Open and Reproducible Science Brown Bag Series. University of Minnesota. Obtained from http://users.stat.umn.edu/~helwig/talks/ReproducibleCode.html
and
Helwig, N. E., & Ruprecht, M. R. (2017). Age, gender, and self-esteem: a sociocultural look through a nonparametric lens. Archives of Scientific Psychology, 5(1), 19-31. doi: https://doi.org/10.1037/arc0000032
Source code: http://users.stat.umn.edu/~helwig/talks/ReproducibleCode.Rmd
Going from collected data to published paper often involves several steps that apply different “operations” to the data. Keeping a detailed record of each operation can be a daunting task—but is imperative for reproducible research. In this document, I demonstrate how the entire “data wrangling and modeling” process can be automated using reproducible R code. The ideas (and some R code) used in this demonstration have been adapted from Helwig and Ruprecht (2017).
R is a free and open-source software environment for statistics, and is the lingua franca for statisticians and data scientists.
Free = installing and using R costs you nothing (i.e., no initial cost or yearly fee)
Open-source = the R source code is available to you (i.e., not a black box)
R seamlessly documents each operation applied to a dataset, which makes reproducible research easy and enjoyable.
Customizable functions and scripts streamline process
Data processing, analysis, and reporting (all-in-one)
Available from The R Project for Statistical Computing: http://www.r-project.org/
To demonstrate the power of R for reproducible research, this document provides a fully worked example covering all aspects of the “wrangling and modeling” process. The demonstration will cover data processing and cleaning, data visualization, data tabulation (i.e., making tables), and data analysis. For each topic, I will provide reproducible code to conduct the “operation” and save (or export) the result. Aside from the Generalized Additive Models (GAMs) fit in Section 5.5, all of the “operations” conducted in this document are done using pure R, i.e., without any add-on packages.
For the example, we will use the “RSE” (Rosenberg Self-Esteem) Data from http://openpsychometrics.org/_rawdata/RSE.zip
Updated: 2/15/2014
Number of Subjects: 47,974
Number of Variables: 14
The first 10 variables are the items of the RSE:
From the helpfile, the dataset also contains…
Remark 1: In the dataset helpfile, the levels of “gender” are incorrectly labeled; the correct coding is 1 = male, 2 = female, 3 = other (0 = none chosen).
The data are stored as a “data.csv” file, which is within the “RSE.zip” file linked above.
Remark 2: Although the file has a “.csv” extension, it is not a CSV (comma-separated values) file. In reality, it is a TSV (tab-separated values) file in disguise.
First, we need to download and unzip the file. This can be done manually (boo!) or via reproducible R code (woo-hoo!).
# Step 1: define destination file (i.e., where to save the data)
destfile <- paste0(getwd(), "/RSE.zip")
# Step 2: download the data file (to the destination file)
download.file(url = "http://openpsychometrics.org/_rawdata/RSE.zip",
destfile = destfile)
# Step 3: unzip the "RSE.zip" file
unzip(destfile)
Warning: the unzip function may not work on Windows machines.
Next, we will load the data into R using the read.table function:
datapath <- paste0(getwd(), "/RSE/data.csv")
rse <- read.table(file = datapath, header = TRUE)
Note that the header = TRUE option is included because the “data.csv” file contains a header, i.e., variable names as the first row.
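Given Remark 2, it can be safer to pass the tab separator explicitly rather than rely on read.table’s default whitespace splitting; here is a minimal sketch using a small temporary file (hypothetical values, not the RSE data):

```r
# Write a tiny tab-separated file with a ".csv" extension, mimicking
# the RSE "data.csv" situation, then read it with an explicit sep:
tmp <- tempfile(fileext = ".csv")
writeLines(c("Q1\tQ2\tgender", "3\t4\t1", "2\t0\t2"), tmp)
demo <- read.table(file = tmp, header = TRUE, sep = "\t")
dim(demo)  # 2 rows, 3 columns
```

An explicit sep = "\t" protects you if any field ever contains an embedded space, which the default whitespace splitting would silently break on.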
Let’s take a look at the first 6 (by default) rows of data:
head(rse)
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 gender age source country
## 1 3 3 1 4 3 4 3 2 3 3 1 40 1 US
## 2 4 4 1 3 1 3 3 2 3 2 1 36 1 US
## 3 2 3 2 3 3 3 2 3 3 3 2 22 1 US
## 4 4 3 2 3 2 3 2 3 3 3 1 31 1 US
## 5 4 4 1 4 1 4 4 1 1 1 1 30 1 EU
## 6 4 4 1 3 1 3 4 2 2 1 2 25 1 CA
Remark 3: The numbers (on the left) are the rownames, which are not part of the data.
To confirm this, let’s check the data dimensions:
dim(rse)
## [1] 47974 14
We have 47,974 rows (subjects) and 14 columns (variables).
From the data helpfile, we know missing values for gender and age are coded as 0.
First, let’s determine how many missing values we have for each variable:
table(rse$gender) # table all values
##
## 0 1 2 3
## 462 17801 29182 529
sum(rse$age == 0) # count number of zeros
## [1] 568
We will recode the missing values as NA (Not Available), which is the “Missing Value” indicator in R.
# recode gender missing values
indx <- which(rse$gender == 0)
rse$gender[indx] <- NA
# recode age missing values
indx <- which(rse$age == 0)
rse$age[indx] <- NA
From the data helpfile, we know missing values for Q1–Q10 are coded as 0.
To recode the missing values as NA, we can use a “for loop” to streamline the process:
for(j in 1:10){
indx <- which(rse[,j] == 0)
rse[indx,j] <- NA
}
Note that the for loop redefines the zeros as NA’s for columns \(j = 1, \ldots, 10\), which correspond to the RSE items Q1, …, Q10.
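As an aside, the same recoding can be done without a loop via logical-matrix indexing; a minimal sketch on a toy data frame (hypothetical values, not the RSE items):

```r
# Toy data frame standing in for the first 10 item columns of rse:
toy <- data.frame(Q1 = c(3, 0, 2), Q2 = c(0, 4, 1))
items <- toy[, 1:2]
items[items == 0] <- NA   # logical-matrix indexing replaces every zero at once
toy[, 1:2] <- items
toy
```

Both approaches give identical results; the loop arguably reads more clearly, while the vectorized form scales better in longer scripts.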
From the helpfile, the gender variable is coded using integers (1 = male, 2 = female, 3 = other). Since R can handle factor variables (see ?factor), we can recode the levels to be more intuitive.
To recode the levels, we just need to redefine the factor:
rse$gender <- factor(rse$gender, levels = c(1, 2, 3),
labels = c("male", "female", "other"))
table(rse$gender)
##
## male female other
## 17801 29182 529
Note that the 1’s are now “male”, the 2’s are now “female”, and the 3’s are now “other”. Also note that the missing data (i.e., NA values) are now excluded from the table.
Note that five of the items of the RSE are worded in the negative direction, such that endorsement of the item indicates lower levels of self-esteem (i.e., items 3, 5, 8, 9, 10).
The (non-missing) RSE items are coded with the integers 1 (= strongly disagree), 2 (= disagree), 3 (= agree), and 4 (= strongly agree).
To recode the negatively worded items, we need to convert the scores so that larger values indicate higher levels of self-esteem.
Algebraically, this conversion can be represented as \[y = 5 - x\] where \(x\) is the negatively worded item and \(y\) is the converted score.
To implement this conversion in R, we just write out the algebraic equation:
indx <- c(3, 5, 8, 9, 10)
rse[,indx] <- 5 - rse[,indx]
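A quick sanity check (on a toy vector, not the real data) confirms that the transformation flips the scale as intended:

```r
# Reverse-scoring y = 5 - x maps 1 <-> 4 and 2 <-> 3 on the 1-4 scale:
x <- c(1, 2, 3, 4)
y <- 5 - x
y                    # 4 3 2 1
identical(5 - y, x)  # TRUE: applying the conversion twice recovers x
```

The second line is also a useful guard against accidentally re-running the recoding chunk: the conversion is its own inverse, so running it twice silently un-reverses the items.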
After the conversion, all of the items are worded in the positive direction. This makes it possible to define the RSE Total score, which is the summation of the 10 item scores.
We can add the total score to our dataframe by applying the rowSums function to the first 10 columns of our dataframe:
rse$total <- rowSums(rse[,1:10])
head(rse)
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 gender age source country total
## 1 3 3 4 4 2 4 3 3 2 2 male 40 1 US 30
## 2 4 4 4 3 4 3 3 3 2 3 male 36 1 US 33
## 3 2 3 3 3 2 3 2 2 2 2 female 22 1 US 24
## 4 4 3 3 3 3 3 2 2 2 2 male 31 1 US 27
## 5 4 4 4 4 4 4 4 4 4 4 male 30 1 EU 40
## 6 4 4 4 3 4 3 4 3 3 4 female 25 1 CA 36
Note that our dataframe now has 15 columns, with the total score appended at the end.
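One caveat worth knowing: rowSums propagates missing values, so any subject with an unanswered item receives an NA total (arguably the right behavior for a scale score). A toy sketch:

```r
# Toy item scores: subject 2 skipped the second item
toy <- data.frame(Q1 = c(4, 3), Q2 = c(2, NA))
rowSums(toy)                # 6 NA -- incomplete row gets an NA total
rowSums(toy, na.rm = TRUE)  # 6 3  -- sums only the observed items instead
```

Using na.rm = TRUE would quietly treat skipped items as zeros in the total, so leaving the default (and removing incomplete cases later, as done below) is the safer choice here.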
Before doing any analysis, let’s check the range of the reported ages:
range(rse$age, na.rm = TRUE)
## [1] 1 2147483647
The minimum reported subject age is 1 (year old) and the maximum reported age is 2,147,483,647 (years old). It seems that we have some miscoded and/or corrupted data…
To check the extent of the problem, let’s look at the proportion of subjects with ages outside of the range 10 to 80 years old:
mean(rse$age < 10, na.rm = TRUE)
## [1] 0.001307851
mean(rse$age > 80, na.rm = TRUE)
## [1] 0.001265663
Note that about 0.13% of the subjects have reported ages below 10 years old. Similarly, about 0.13% of the subjects have reported ages greater than 80 years old.
Let’s subset the dataset to only include the 10-80 year old subjects:
rse1080 <- subset(rse, age >= 10 & age <= 80)
Note that we have created a new dataset that contains only the subjects with ages in the range 10 to 80 years old:
dim(rse1080)
## [1] 47284 15
head(rse1080)
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 gender age source country total
## 1 3 3 4 4 2 4 3 3 2 2 male 40 1 US 30
## 2 4 4 4 3 4 3 3 3 2 3 male 36 1 US 33
## 3 2 3 3 3 2 3 2 2 2 2 female 22 1 US 24
## 4 4 3 3 3 3 3 2 2 2 2 male 31 1 US 27
## 5 4 4 4 4 4 4 4 4 4 4 male 30 1 EU 40
## 6 4 4 4 3 4 3 4 3 3 4 female 25 1 CA 36
range(rse1080$age, na.rm = TRUE)
## [1] 10 80
As a final preprocessing step, we will remove subjects who are missing data for the RSE total score, the gender, or the age.
navars <- c("gender", "age", "total")
nacheck <- function(x) any(is.na(x))
indx <- apply(rse1080[, navars], 1, nacheck)
rse1080 <- rse1080[!indx,]
nrow(rse1080)
## [1] 45728
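As an aside, base R’s complete.cases function performs the same row-wise missingness check in a single call; a sketch on a toy data frame (hypothetical values):

```r
toy <- data.frame(gender = c("male", NA, "female"),
                  age    = c(25, 30, NA),
                  total  = c(28, 31, 22))
# TRUE for rows with no NA in any of the checked columns
keep <- complete.cases(toy[, c("gender", "age", "total")])
toy[keep, ]  # only the first row survives
```

This is equivalent to the apply/nacheck approach above, just more concise and slightly faster on large data.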
Let’s look at a histogram of the ages:
hist(rse1080$age, xlab = "Age",
main = "Histogram of Age")
Suppose that we want to look at separate histograms for each gender. To do this, we will first create an indexing vector for each gender:
indxM <- which(rse1080$gender == "male")
indxF <- which(rse1080$gender == "female")
indxO <- which(rse1080$gender == "other")
length(indxM) + length(indxF) + length(indxO)
## [1] 45728
Note that the vector indxM contains the row numbers corresponding to the male subjects, indxF the row numbers for the female subjects, and indxO the row numbers for the other subjects.
Now we can plot separate histograms of age for each gender by indexing the appropriate elements of the age variable:
par(mfrow = c(1,3))
hist(rse1080$age[indxM], xlab = "Age",
main = "Histogram of Age (males)", col = "blue")
hist(rse1080$age[indxF], xlab = "Age",
main = "Histogram of Age (females)", col = "red")
hist(rse1080$age[indxO], xlab = "Age",
main = "Histogram of Age (other)", col = "purple")
To visualize all of the distributions together, we could use a kernel density estimate (KDE):
# KDE for males
kde <- density(rse1080$age[indxM], na.rm = TRUE)
plot(kde, xlab = "Age", ylab = "Density",
main = "Kernel Density Estimates of Age by Gender",
col = "blue", lwd = 2, xlim = c(10, 80), ylim = c(0, 0.12))
# KDE for females
kde <- density(rse1080$age[indxF], na.rm = TRUE)
lines(kde, col = "red", lty = 2, lwd = 2)
# KDE for others
kde <- density(rse1080$age[indxO], na.rm = TRUE)
lines(kde, col = "purple", lty = 3, lwd = 2)
# add legend
legend("topright", legend = c("male", "female", "other"),
lty = 1:3, lwd = 2, col = c("blue", "red", "purple"), bty = "n")
Or maybe you would prefer a boxplot:
boxplot(age ~ gender, data = rse1080,
xlab = "Gender", ylab = "Age",
col = c("blue", "red", "purple"),
main = "Boxplots of Age by Gender")
Let’s look at a histogram of the self-esteem total score:
hist(rse1080$total, xlab = "Self-Esteem",
main = "Histogram of Self-Esteem")
Or we could look at separate histograms of the total score for each gender:
par(mfrow = c(1,3))
hist(rse1080$total[indxM],
xlab = "Self-Esteem", col = "blue",
main = "Histogram of Self-Esteem (males)")
hist(rse1080$total[indxF],
xlab = "Self-Esteem", col = "red",
main = "Histogram of Self-Esteem (females)")
hist(rse1080$total[indxO],
xlab = "Self-Esteem", col = "purple",
main = "Histogram of Self-Esteem (other)")
As before, we could use a KDE to visualize all of the distributions together:
# KDE for males
kde <- density(rse1080$total[indxM], na.rm = TRUE)
plot(kde, xlab = "Self-Esteem", ylab = "Density",
main = "Kernel Density Estimates of Self-Esteem by Gender",
col = "blue", lwd = 2, xlim = c(10, 40), ylim = c(0, 0.06))
# KDE for females
kde <- density(rse1080$total[indxF], na.rm = TRUE)
lines(kde, col = "red", lty = 2, lwd = 2)
# KDE for others
kde <- density(rse1080$total[indxO], na.rm = TRUE)
lines(kde, col = "purple", lty = 3, lwd = 2)
# add legend
legend("topright", legend = c("male", "female", "other"),
lty = 1:3, lwd = 2, col = c("blue", "red", "purple"), bty = "n")
Or a boxplot would work too:
boxplot(total ~ gender, data = rse1080,
xlab = "Gender", ylab = "Self-Esteem",
col = c("blue", "red", "purple"),
main = "Boxplots of Self-Esteem by Gender")
To visualize the relationship between self-esteem and age, we will use a smoothing spline, which is a type of nonparametric regression.
First, we will fit a smoothing spline predicting self-esteem from age ignoring gender:
ss <- with(rse1080, smooth.spline(age, total))
plot(ss, t = "l", lwd = 2, ylim = c(20, 35),
xlab = "Age", ylab = "Self-Esteem",
main = "Self-Esteem by Age")
Next, we will fit a smoothing spline predicting self-esteem from age separately for each gender:
# smoothing spline for males
ssM <- with(rse1080[indxM,], smooth.spline(age, total))
plot(ssM, t = "l", ylim = c(15, 35), lwd = 2, col = "blue",
xlab = "Age", ylab = "Self-Esteem",
main = "Self-Esteem by Age and Gender")
# smoothing spline for females
ssF <- with(rse1080[indxF,], smooth.spline(age, total))
lines(ssF, lty = 2, lwd = 2, col = "red")
# smoothing spline for others
ssO <- with(rse1080[indxO,], smooth.spline(age, total))
lines(ssO, lty = 3, lwd = 2, col = "purple")
# add legend
legend("bottomright", legend = c("male", "female", "other"),
lty = 1:3, lwd = 2, col = c("blue", "red", "purple"), bty = "n")
For a publication, it may be necessary to summarize information in tables. As an example, suppose that we want to create a table summarizing the mean and standard deviation of the ages for each gender.
# number of subjects for each gender
nsubj <- with(rse1080, tapply(age, gender, length))
# mean of age for each gender
ageMN <- with(rse1080, tapply(age, gender, mean))
# standard deviation of age for each gender
ageSD <- with(rse1080, tapply(age, gender, sd))
# combine (columnwise) into table
tab.demo <- cbind(n = nsubj, Mean = ageMN, SD = ageSD)
tab.demo
## n Mean SD
## male 17139 27.75238 12.667289
## female 28097 25.98765 12.056678
## other 492 20.85366 8.592549
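As an aside, the aggregate function can produce the same kind of summary in a single call, returning a data frame directly; a sketch on toy data (hypothetical values):

```r
toy <- data.frame(gender = factor(c("male", "male", "female", "female")),
                  age    = c(20, 30, 25, 35))
# One call computes n, mean, and SD per gender; the stats land in a
# matrix-valued "age" column of the result.
tab <- aggregate(age ~ gender, data = toy,
                 FUN = function(x) c(n = length(x), Mean = mean(x), SD = sd(x)))
tab
```

Whether tapply-plus-cbind or aggregate reads better is a matter of taste; the tapply version used in this document makes each summary statistic an explicit, named step.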
Now we have a table that we could easily export from R (e.g., via write.csv) and insert into some publication. For example…
write.csv(round(tab.demo, 2), file = paste0(getwd(), "/tab_demo.csv"))
Note that I used the round function to round the results to 2 decimal places, as is typically required for publications.
Now suppose that we want to create a table summarizing gender differences in self-esteem. This can be done using (almost) the exact same code as before.
# mean of total for each gender
selfMN <- with(rse1080, tapply(total, gender, mean))
# standard deviation of total for each gender
selfSD <- with(rse1080, tapply(total, gender, sd))
# combine (columnwise) into table
tab.self <- cbind(n = nsubj, Mean = selfMN, SD = selfSD)
# write and print
write.csv(round(tab.self, 2), file = paste0(getwd(), "/tab_self.csv"))
round(tab.self, 2)
## n Mean SD
## male 17139 27.32 6.95
## female 28097 25.73 6.92
## other 492 22.49 7.20
To save some space, we can combine the previous results into a single table:
tab.comb <- cbind(tab.demo, tab.self[,-1])
colnames(tab.comb) <- c("n", "Age.MN", "Age.SD",
"SelfEsteem.MN", "SelfEsteem.SD")
write.csv(round(tab.comb, 2), file = paste0(getwd(), "/tab_comb.csv"))
round(tab.comb, 2)
## n Age.MN Age.SD SelfEsteem.MN SelfEsteem.SD
## male 17139 27.75 12.67 27.32 6.95
## female 28097 25.99 12.06 25.73 6.92
## other 492 20.85 8.59 22.49 7.20
Suppose that we want to calculate the summary statistics separately for different age groups.
Consider binning the ages into 9 different groups: 10-12, 13-17, 18-22, 23-29, 30-39, 40-49, 50-59, 60-69, 70-80.
As a first step, we need to define the “age bin” variable. This involves defining the “breaks” that specify the bin limits, and looping through each bin’s break values to find the ages within the bin.
binlab <- c("10-12", "13-17", "18-22", "23-29",
"30-39", "40-49", "50-59", "60-69", "70-80")
breaks <- c(0, 12, 17, 22, 29, 39, 49, 59, 69, 80)
agebin <- rep(0, nrow(rse1080))
for(j in 2:length(breaks)){
indx <- with(rse1080, which(age <= breaks[j] & age > breaks[j-1]))
agebin[indx] <- j - 1
}
table(agebin)
## agebin
## 1 2 3 4 5 6 7 8 9
## 334 10476 13226 8240 5939 4038 2545 808 122
Let’s check to make sure that the binning worked:
nrow(rse1080)
## [1] 45728
sum(table(agebin))
## [1] 45728
Ok, good: each subject is only in 1 bin, which is what we want.
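As an aside, base R’s cut function performs the same binning in one line; its default intervals are right-closed, matching the age <= breaks[j] & age > breaks[j-1] logic above. A sketch on a few toy ages:

```r
binlab <- c("10-12", "13-17", "18-22", "23-29",
            "30-39", "40-49", "50-59", "60-69", "70-80")
breaks <- c(0, 12, 17, 22, 29, 39, 49, 59, 69, 80)
ages   <- c(10, 12, 13, 22, 23, 80)
# cut assigns each age to its (breaks[j-1], breaks[j]] interval
agebin <- cut(ages, breaks = breaks, labels = binlab)
table(agebin)
```

Unlike the manual loop, cut returns a labeled factor directly, so the binlab names come along for free.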
Now we can table the results by (binned) age via an extension of our previous code:
# number of subjects in each age bin
nsubj <- with(rse1080, tapply(total, agebin, length))
# mean of total for each age bin
selfbinMN <- with(rse1080, tapply(total, agebin, mean))
# standard deviation of total for each age bin
selfbinSD <- with(rse1080, tapply(total, agebin, sd))
# combine (rowwise)
tab.bin <- rbind(Mean = selfbinMN, SD = selfbinSD, N = nsubj)
colnames(tab.bin) <- binlab
# write and print
write.csv(round(tab.bin, 2), file = paste0(getwd(), "/tab_abin.csv"))
round(tab.bin, 1)
## 10-12 13-17 18-22 23-29 30-39 40-49 50-59 60-69 70-80
## Mean 26.9 24.0 25.7 26.9 27.5 28.2 29.3 30.9 32.0
## SD 7.5 7.1 6.8 6.6 6.6 6.6 6.5 6.0 5.3
## N 334.0 10476.0 13226.0 8240.0 5939.0 4038.0 2545.0 808.0 122.0
Note that we can (and should!) plot the results in the table to confirm things look reasonable:
agebrk <- c(10, 12, 17, 22, 29, 39, 49, 59, 69, 80)
agemid <- agebrk[1:9] + (agebrk[2:10] - agebrk[1:9])/2
plot(agemid, tab.bin[1,], t = "b", lwd = 2,
pch = 0, col = "black",
xlim = c(10, 80), ylim = c(20, 35),
xlab = "Age", ylab = "Self-Esteem",
main = "Average Self-Esteem by (Binned) Age")
If we want to get fancy, we could add 95% confidence intervals at each point:
plot(agemid, tab.bin[1,], t = "b", lwd = 2,
pch = 0, col = "black",
xlim = c(10, 80), ylim = c(20, 35),
xlab = "Age", ylab = "Self-Esteem",
main = "Average Self-Esteem by (Binned) Age with 95% CI")
stderr <- tab.bin[2,] / sqrt(tab.bin[3,])
for(j in 1:9){
ci.lo <- tab.bin[1,j] - 2 * stderr[j]
ci.up <- tab.bin[1,j] + 2 * stderr[j]
lines(x = rep(agemid[j], 2), y = c(ci.lo, ci.up), lwd = 2)
}
Note that the SE’s are very small (because \(n\) is rather large), so the 95% CI at each point is narrow.
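As an aside, base R’s arrows function (with angle = 90 and code = 3) draws the same interval whiskers in one vectorized call, avoiding the loop; a sketch with toy means and standard errors (hypothetical values):

```r
x  <- 1:5
mn <- c(24, 26, 27, 29, 31)
se <- c(0.5, 0.25, 0.25, 0.5, 0.75)
plot(x, mn, pch = 0, ylim = c(20, 35),
     xlab = "Bin", ylab = "Self-Esteem",
     main = "Toy Means with 95% CI")
# angle = 90 flattens the arrowheads into caps; code = 3 caps both ends
arrows(x0 = x, y0 = mn - 2 * se, x1 = x, y1 = mn + 2 * se,
       angle = 90, code = 3, length = 0.05, lwd = 2)
```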
Here is how to do the same thing with polygons instead of lines:
plot(agemid, tab.bin[1,], t = "n", lwd = 2,
pch = 0, col = "black",
xlim = c(10, 80), ylim = c(20, 35),
xlab = "Age", ylab = "Self-Esteem",
main = "Average Self-Esteem by (Binned) Age with 95% CI")
ci.lo <- tab.bin[1,] - 2 * stderr
ci.up <- tab.bin[1,] + 2 * stderr
xpoly <- c(agemid, rev(agemid))
ypoly <- c(ci.lo, rev(ci.up))
trangray <- rgb(t(col2rgb("gray") / 255), alpha = 0.5)
polygon(xpoly, ypoly, col = trangray, border = NA)
lines(agemid, tab.bin[1,], pch = 0, col = "black", t = "b", lwd = 2)
We can also table the results by gender and age via an extension of our previous code:
# number of subjects for each gender*age
nsubj <- with(rse1080, tapply(total, list(gender, agebin), length))
colnames(nsubj) <- binlab
# mean of total for each gender*age
selfageMN <- with(rse1080, tapply(total, list(gender, agebin), mean))
colnames(selfageMN) <- binlab
# standard deviation of total for each gender*age
selfageSD <- with(rse1080, tapply(total, list(gender, agebin), sd))
colnames(selfageSD) <- binlab
# combine (rowwise) into table for males
tab.selfageM <- rbind(Mean = selfageMN[1,], SD = selfageSD[1,], N = nsubj[1,])
rownames(tab.selfageM) <- paste0("male.", rownames(tab.selfageM))
# combine (rowwise) into table for females
tab.selfageF <- rbind(Mean = selfageMN[2,], SD = selfageSD[2,], N = nsubj[2,])
rownames(tab.selfageF) <- paste0("female.", rownames(tab.selfageF))
# combine (rowwise) into table for other
tab.selfageO <- rbind(Mean = selfageMN[3,], SD = selfageSD[3,], N = nsubj[3,])
rownames(tab.selfageO) <- paste0("other.", rownames(tab.selfageO))
# combine (rowwise) into single table
tab.selfage <- rbind(tab.selfageM, tab.selfageF, tab.selfageO)
write.csv(round(tab.selfage, 2), file = paste0(getwd(), "/tab_selfage.csv"))
round(tab.selfage, 1)
## 10-12 13-17 18-22 23-29 30-39 40-49 50-59 60-69 70-80
## male.Mean 28.6 26.5 26.5 27.3 27.8 28.2 29.3 30.7 32.2
## male.SD 6.3 7.3 6.9 6.8 6.7 6.7 6.5 6.0 5.3
## male.N 84.0 3126.0 4998.0 3356.0 2421.0 1638.0 1062.0 387.0 67.0
## female.Mean 26.5 23.0 25.2 26.6 27.3 28.1 29.3 31.1 31.7
## female.SD 7.7 6.8 6.6 6.5 6.5 6.5 6.5 5.9 5.2
## female.N 248.0 7149.0 8058.0 4815.0 3489.0 2387.0 1480.0 418.0 53.0
## other.Mean 13.5 21.0 22.0 25.3 26.5 26.0 29.7 25.7 32.5
## other.SD 4.9 6.6 6.7 6.9 8.6 7.6 11.9 14.0 10.6
## other.N 2.0 201.0 170.0 69.0 29.0 13.0 3.0 3.0 2.0
As before, we can (and should!) plot the results in the table to confirm things look reasonable:
plot(agemid, tab.selfageM[1,], t = "b", lwd = 2, col = "blue",
xlim = c(10, 80), ylim = c(10, 35),
xlab = "Age", ylab = "Self-Esteem",
main = "Average Self-Esteem by (Binned) Age and Gender")
lines(agemid, tab.selfageF[1,], t = "b",
pch = 2, lty = 2, lwd = 2, col = "red")
lines(agemid, tab.selfageO[1,], t = "b",
pch = 3, lty = 3, lwd = 2, col = "purple")
legend("bottomright", legend = c("male", "female", "other"),
pch = 1:3, lty = 1:3, lwd = 2,
col = c("blue", "red", "purple"), bty = "n")
Suppose that we want to test the null hypothesis that males and females have the same mean level of self-esteem, versus the alternative hypothesis that males have higher mean levels of self-esteem compared to females.
To test the null hypothesis \(H_0: \mu_M = \mu_F\) versus the alternative hypothesis \(H_1: \mu_M > \mu_F\), we can use the t.test function in R:
tt.self <- t.test(x = rse1080$total[indxM],
y = rse1080$total[indxF],
alternative = "greater")
tt.self
##
## Welch Two Sample t-test
##
## data: rse1080$total[indxM] and rse1080$total[indxF]
## t = 23.592, df = 36095, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## 1.475395 Inf
## sample estimates:
## mean of x mean of y
## 27.31758 25.73161
Note that the output object tt.self contains all of the information needed for inference.
names(tt.self)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
Each of the (named) elements of tt.self contains a piece of information that can be extracted via the $ symbol. For example:
tt.self$statistic
## t
## 23.59227
Instead of copy-pasting the information from the R printout, you should combine the needed information into a table, and export the table (similar to what we did before). For example:
tt.tab <- with(tt.self, data.frame(H1 = alternative,
t = round(statistic, 2),
df = round(parameter, 2),
p = round(p.value, 4),
row.names = 1))
write.csv(tt.tab, file = paste0(getwd(), "/tab_ttest.csv"))
tt.tab
## H1 t df p
## 1 greater 23.59 36094.86 0
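With roughly 46,000 subjects, even trivially small mean differences will be statistically significant, so it can be worth reporting a standardized effect size alongside the t-test. Here is a sketch of a pooled-SD Cohen’s d (a hand-rolled helper, not part of base R), demonstrated on simulated data rather than the RSE scores:

```r
# Pooled-standard-deviation Cohen's d for two independent samples
cohens_d <- function(x, y) {
  nx <- length(x); ny <- length(y)
  sp <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / sp
}
set.seed(1)
g1 <- rnorm(100, mean = 27.3, sd = 7)  # simulated "male" totals
g2 <- rnorm(100, mean = 25.7, sd = 7)  # simulated "female" totals
round(cohens_d(g1, g2), 2)
```

An effect size like this could be appended as an extra column of tt.tab before exporting.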
Suppose that we want to fit a simple linear regression model predicting self-esteem from age (ignoring gender).
Linear models are fit via the lm function in R, which uses formula syntax to express the model form:
age.mod <- lm(total ~ age, data = rse1080)
age.mod
##
## Call:
## lm(formula = total ~ age, data = rse1080)
##
## Coefficients:
## (Intercept) age
## 22.8093 0.1309
Note that summarizing the fitted model (via the summary function) provides much more detailed information about the fit.
age.mod.sum <- summary(age.mod)
age.mod.sum
##
## Call:
## lm(formula = total ~ age, data = rse1080)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.843 -5.035 -0.166 4.965 15.751
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.809298 0.075714 301.25 <2e-16 ***
## age 0.130928 0.002584 50.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.797 on 45726 degrees of freedom
## Multiple R-squared: 0.05316, Adjusted R-squared: 0.05314
## F-statistic: 2567 on 1 and 45726 DF, p-value: < 2.2e-16
In particular, you may be interested in exporting the coefficients table:
write.csv(round(age.mod.sum$coef, 4),
file = paste0(getwd(), "/tab_coef.csv"))
To plot the estimated regression line, we can use the abline function:
plot(agemid, tab.bin[1,], lwd = 2, col = "black",
xlim = c(10, 80), ylim = c(20, 35),
xlab = "Age", ylab = "Self-Esteem",
main = "Self-Esteem by Age")
abline(age.mod$coef, lwd = 2)
lines(ss, lty = 2, lwd = 2)
legend("bottomright", legend = c("Avg", "Linear", "Smooth"),
pch = c(1, NA, NA), lty = c(NA, 1, 2), lwd = 2, bty = "n")
Note that the binned averages and the smoothing spline solution are added for reference. Also note that the simple linear regression line does not seem to be doing a good job of capturing the non-linear trends, particularly at the younger ages.
Suppose that we want to fit a multiple linear regression model predicting self-esteem from age and gender.
Again, we will use the lm function to fit the model (because this is still a linear model).
agegen.mod <- lm(total ~ age * gender, data = rse1080)
agegen.mod.sum <- summary(agegen.mod)
agegen.mod.sum
##
## Call:
## lm(formula = total ~ age * gender, data = rse1080)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.3299 -4.9463 -0.1338 4.8974 19.0509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.149459 0.124012 202.799 < 2e-16 ***
## age 0.078124 0.004065 19.218 < 2e-16 ***
## genderfemale -3.505293 0.156560 -22.390 < 2e-16 ***
## genderother -7.351869 0.808013 -9.099 < 2e-16 ***
## age:genderfemale 0.079160 0.005259 15.054 < 2e-16 ***
## age:genderother 0.146982 0.035638 4.124 3.73e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.741 on 45722 degrees of freedom
## Multiple R-squared: 0.06884, Adjusted R-squared: 0.06874
## F-statistic: 676.1 on 5 and 45722 DF, p-value: < 2.2e-16
To plot the estimated regression lines, we can still use the abline function. But this time we have to make sure that we input the appropriate coefficients for each line…
# plot males
plot(agemid, tab.selfageM[1,], lwd = 2, col = "blue",
xlim = c(10, 80), ylim = c(10, 35),
xlab = "Age", ylab = "Self-Esteem",
main = "Average Self-Esteem by Age and Gender")
abline(a = agegen.mod$coef[1],
b = agegen.mod$coef[2],
lwd = 2, col = "blue")
# add females
points(agemid, tab.selfageF[1,],
pch = 2, lwd = 2, col = "red")
abline(a = agegen.mod$coef[1] + agegen.mod$coef[3],
b = agegen.mod$coef[2] + agegen.mod$coef[5],
lty = 2, lwd = 2, col = "red")
# add others
points(agemid, tab.selfageO[1,],
pch = 3, lwd = 2, col = "purple")
abline(a = agegen.mod$coef[1] + agegen.mod$coef[4],
b = agegen.mod$coef[2] + agegen.mod$coef[6],
lty = 3, lwd = 2, col = "purple")
# add legend
legend("bottomright", legend = c("male", "female", "other"),
pch = 1:3, lty = 1:3, lwd = 2,
col = c("blue", "red", "purple"), bty = "n")
Note that the linear regression lines do not seem to be doing a good job of capturing the non-linear trends for each gender.
Suppose that we want to fit a similar model, but allowing for polynomial effects of age. Again, we will use the lm function to fit the model (because this is still a linear model), but this time we will use the poly function to create polynomials of degree 3, i.e., we are including linear, quadratic, and cubic effects of age.
poly.mod <- lm(total ~ poly(age, degree = 3) * gender, data = rse1080)
poly.mod.sum <- summary(poly.mod)
Let’s plot the results:
# plot males
plot(agemid, tab.selfageM[1,], lwd = 2, col = "blue",
xlim = c(10, 80), ylim = c(10, 35),
xlab = "Age", ylab = "Self-Esteem",
main = "Average Self-Esteem by Age and Gender")
newdata <- data.frame(age = 10:80, gender = "male")
fit <- predict(poly.mod, newdata = newdata)
lines(newdata$age, fit, lwd = 2, col = "blue")
# add females
points(agemid, tab.selfageF[1,],
pch = 2, lwd = 2, col = "red")
newdata <- data.frame(age = 10:80, gender = "female")
fit <- predict(poly.mod, newdata = newdata)
lines(newdata$age, fit, lty = 2, lwd = 2, col = "red")
# add others
points(agemid, tab.selfageO[1,],
pch = 3, lwd = 2, col = "purple")
newdata <- data.frame(age = 10:80, gender = "other")
fit <- predict(poly.mod, newdata = newdata)
lines(newdata$age, fit, lty = 3, lwd = 2, col = "purple")
# add legend
legend("bottomright", legend = c("male", "female", "other"),
pch = 1:3, lty = 1:3, lwd = 2,
col = c("blue", "red", "purple"), bty = "n")
Note that even after including the quadratic and cubic effects of age, we are still not doing a good job of capturing the nonlinear trends in the data…
To capture the nonlinear relationships in the data, we will use the mgcv package to fit a Generalized Additive Model (GAM). Note that a GAM is similar to a Smoothing Spline ANOVA (SSANOVA) model, which was used by Helwig and Ruprecht (2017) to analyze these data.
To fit the GAM, we will use the gam function, which works somewhat like the lm function.
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-27. For overview type 'help("mgcv-package")'.
gmod <- gam(total ~ gender + s(age, k = 20, by = gender), data = rse1080)
Note that the s() function indicates that we want a smooth effect (i.e., nonparametric basis) for the age effect. The k option tells the function to use 20 “knot” values, and the by option tells the function that the smooth effect of age should be allowed to differ for each gender.
We can summarize the gam object much like we summarized the lm objects:
summary(gmod)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## total ~ gender + s(age, k = 20, by = gender)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.23985 0.05176 526.293 <2e-16 ***
## genderfemale -1.37750 0.06547 -21.039 <2e-16 ***
## genderother -3.35136 0.37569 -8.921 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age):gendermale 10.189 12.34 32.72 < 2e-16 ***
## s(age):genderfemale 18.308 18.81 146.86 < 2e-16 ***
## s(age):genderother 2.211 2.76 17.46 5.83e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0795 Deviance explained = 8.01%
## GCV = 44.952 Scale est. = 44.919 n = 45728
Note that the summary provides some (pseudo) significance testing information, which is not too relevant for our current purpose.
What we really want is a visualization of the function estimates (along with the uncertainty of the estimates), which we can obtain by plotting the results:
# plot males
newdata <- data.frame(age = 10:80, gender = "male")
fit <- predict(gmod, newdata = newdata, se.fit = TRUE)
plot(newdata$age, fit$fit, t = "n", lwd = 2, col = "blue",
xlim = c(10, 80), ylim = c(15, 35),
xlab = "Age", ylab = "Self-Esteem",
main = "Average Self-Esteem by Age and Gender")
ci.lo <- fit$fit - 2 * fit$se.fit
ci.up <- fit$fit + 2 * fit$se.fit
xpoly <- c(newdata$age, rev(newdata$age))
ypoly <- c(ci.lo, rev(ci.up))
polygon(xpoly, ypoly,
col = rgb(t(col2rgb("cyan") / 255), alpha = 0.5),
border = NA)
lines(newdata$age, fit$fit, lwd = 2, col = "blue")
# add females
newdata <- data.frame(age = 10:80, gender = "female")
fit <- predict(gmod, newdata = newdata, se.fit = TRUE)
ci.lo <- fit$fit - 2 * fit$se.fit
ci.up <- fit$fit + 2 * fit$se.fit
xpoly <- c(newdata$age, rev(newdata$age))
ypoly <- c(ci.lo, rev(ci.up))
polygon(xpoly, ypoly,
col = rgb(t(col2rgb("pink") / 255), alpha = 0.5),
border = NA)
lines(newdata$age, fit$fit, lty = 2, lwd = 2, col = "red")
# add others
newdata <- data.frame(age = 10:80, gender = "other")
fit <- predict(gmod, newdata = newdata, se.fit = TRUE)
ci.lo <- fit$fit - 2 * fit$se.fit
ci.up <- fit$fit + 2 * fit$se.fit
xpoly <- c(newdata$age, rev(newdata$age))
ypoly <- c(ci.lo, rev(ci.up))
polygon(xpoly, ypoly,
col = rgb(t(col2rgb("lavender") / 255), alpha = 0.5),
border = NA)
lines(newdata$age, fit$fit, lty = 3, lwd = 2, col = "purple")
# add legend
legend("bottomright", legend = c("male", "female", "other"),
lty = 1:3, lwd = 2, col = c("blue", "red", "purple"), bty = "n")
# export figure (to pdf file)
dev.copy2pdf(file = paste0(getwd(), "/fig_gam.pdf"))
## quartz_off_screen
## 2
As a final step, we may want to create an R script file (i.e., .R extension) to replicate our analyses. To convert this (.RMD) file into an R script file, we can use the following code
knitr::purl("ReproducibleCode.Rmd", documentation = 0)
which calls the purl function from the knitr package. This will produce an R script file containing all of the executable R code (and comments) in this document.