Chapter 3 Getting your data

3.1 Two-Sample Procedures

3.2 Standard t-test

The two-sample t-test is the typical method used to do tests regarding the means of two groups. In R, this is the t.test(x,y) function. This command does a two-sample t-test between the sets of data in x and y.
The confidence interval it generates is for the mean of x minus the mean of y.

Note that by default R uses an unpooled estimate of variance (the Welch version with fractional degrees of freedom) and a two-sided alternative. You can also get a pooled estimate and upper or lower alternatives (i.e., x has greater mean or lesser mean) by using the appropriate optional arguments.

The unpooled (Welch) version is generally the better option, but note that ANOVA is a generalization of the pooled (equal variances) version.

Consider the data on breaking strength for notched and unnotched boards in the data set NotchedBoards. We would like to investigate the null hypothesis that unnotched boards of thickness .625 inch have the same strength as notched boards of thickness .75 inch with a 1 inch wide notch cut in the center to thickness .625 inch.

First read in the data and create two vectors for the two different groups.

data(NotchedBoards)
unnotched <- NotchedBoards$strength[NotchedBoards$shape=="uniform"]
notched <- NotchedBoards$strength[NotchedBoards$shape=="notched"]
unnotched
 [1] 243 229 305 395 210 311 289 269 282 399 222 331 369
notched
 [1] 215 202 273 292 253 247 350 246 352 398 267 331 342

Now do the two-sided test (and confidence interval). Then use the option that forces the group variances to be equal. Also, jump ahead to an Analysis of Variance approach; for two groups its p-value agrees with the equal variances t-test. The p-values are large in every case, providing no evidence against the null of equal means.

t.test(unnotched,notched)

    Welch Two Sample t-test

data:  unnotched and notched
t = 0.27353, df = 23.911, p-value = 0.7868
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -43.31049  56.54126
sample estimates:
mean of x mean of y 
 296.4615  289.8462 
t.test(unnotched,notched,var.equal=TRUE)

    Two Sample t-test

data:  unnotched and notched
t = 0.27353, df = 24, p-value = 0.7868
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -43.30063  56.53140
sample estimates:
mean of x mean of y 
 296.4615  289.8462 
anova(lm(strength~shape,data=NotchedBoards)) # preview
Analysis of Variance Table

Response: strength
          Df Sum Sq Mean Sq F value Pr(>F)
shape      1    284   284.5  0.0748 0.7868
Residuals 24  91249  3802.0               
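
As a check on the output above, here is a minimal sketch that rebuilds the Welch statistic, its fractional degrees of freedom, and the confidence interval by hand, and confirms that the squared pooled t statistic equals the ANOVA F. The two vectors are typed in from the printout above; the data frame boards is a hypothetical stand-in for NotchedBoards, built only so the block runs on its own.

```r
# Data copied from the printout above
unnotched <- c(243, 229, 305, 395, 210, 311, 289, 269, 282, 399, 222, 331, 369)
notched   <- c(215, 202, 273, 292, 253, 247, 350, 246, 352, 398, 267, 331, 342)

n1 <- length(unnotched)
n2 <- length(notched)
v1 <- var(unnotched) / n1   # squared standard error of the unnotched mean
v2 <- var(notched) / n2     # squared standard error of the notched mean

# Welch statistic: difference in means over its unpooled standard error
diff_means <- mean(unnotched) - mean(notched)
se_diff    <- sqrt(v1 + v2)
t_welch    <- diff_means / se_diff

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# 95% confidence interval for mean(unnotched) - mean(notched)
ci_welch <- diff_means + c(-1, 1) * qt(0.975, df_welch) * se_diff

# Compare with t.test()
fit <- t.test(unnotched, notched)
all.equal(t_welch, unname(fit$statistic))
all.equal(df_welch, unname(fit$parameter))
all.equal(ci_welch, as.numeric(fit$conf.int))

# For two groups, the pooled t statistic squared equals the ANOVA F
# (boards is a hypothetical stand-in for NotchedBoards)
boards <- data.frame(strength = c(unnotched, notched),
                     shape = rep(c("uniform", "notched"), each = 13))
t_pooled <- unname(t.test(unnotched, notched, var.equal = TRUE)$statistic)
f_anova  <- anova(lm(strength ~ shape, data = boards))[["F value"]][1]
all.equal(t_pooled^2, f_anova)
```

Because the two groups have the same size, the pooled and unpooled standard errors coincide, which is why both t-tests report the same t value and differ only slightly in degrees of freedom.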

One reasonable belief might be that the notched boards would be stronger than the unnotched boards, because while they have the same minimum thickness as the unnotched boards, their average thickness is greater. The one-sided test below uses the alternative that the unnotched mean is greater than the notched mean. In t.test(x,y), alternative="greater" means that the mean of x exceeds the mean of y, so testing the belief above directly would use alternative="less" instead. The p-value for the "greater" alternative is smaller than for the two-sided test, but it is still quite large.

t.test(unnotched,notched,alternative="greater")

    Welch Two Sample t-test

data:  unnotched and notched
t = 0.27353, df = 23.911, p-value = 0.3934
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -34.76902       Inf
sample estimates:
mean of x mean of y 
 296.4615  289.8462 

There is also a “formula” version of t.test(). The formula takes the form of response ~ predictor, where in our case the predictor is a grouping variable with two levels. You get the same results with a little less fuss.

t.test(strength ~ shape,data=NotchedBoards)

    Welch Two Sample t-test

data:  strength by shape
t = -0.27353, df = 23.911, p-value = 0.7868
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -56.54126  43.31049
sample estimates:
mean in group notched mean in group uniform 
             289.8462              296.4615 
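
Note that the formula version reports notched minus uniform, so the t value and interval have the opposite sign from the t.test(unnotched,notched) results. The sketch below illustrates why, under the assumption that the grouping variable is a factor whose levels are in alphabetical order (consistent with the output above, but not checked against the real data set). The data frame boards and the column shape2 are hypothetical names used only for this illustration.

```r
# Hypothetical stand-in for NotchedBoards, built from the printed vectors
unnotched <- c(243, 229, 305, 395, 210, 311, 289, 269, 282, 399, 222, 331, 369)
notched   <- c(215, 202, 273, 292, 253, 247, 350, 246, 352, 398, 267, 331, 342)
boards <- data.frame(strength = c(unnotched, notched),
                     shape = factor(rep(c("uniform", "notched"), each = 13)))

# factor() sorts levels alphabetically, so "notched" is the first group
levels(boards$shape)

# Default level order: the difference is notched - uniform
stat_default <- unname(t.test(strength ~ shape, data = boards)$statistic)

# Put "uniform" first to get uniform - notched, matching t.test(unnotched, notched)
boards$shape2 <- factor(boards$shape, levels = c("uniform", "notched"))
stat_reordered <- unname(t.test(strength ~ shape2, data = boards)$statistic)

c(default = stat_default, reordered = stat_reordered)
```

The two-sided p-value is unaffected by the order of the levels, but a one-sided alternative is, so it is worth checking levels() before using alternative="greater" or "less" with a formula.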

3.3 Digression on computing percent points and quantiles

pt() gives you the cumulative probability (area to the left) for Student’s t distribution; the lower.tail=FALSE option gives you the upper tail probability. The first argument is the t value, the second is the degrees of freedom. The lower tail and upper tail values are below, and of course they add to 1. The second line with twice the smaller tail gives the two-sided p-value.

pt(-.27353,23.911);pt(-.27353,23.911,lower.tail=FALSE)
[1] 0.3933973
[1] 0.6066027
2*pt(-abs(.27353),23.911)
[1] 0.7867947

In general in R, pFOO(q,params) gives you the cumulative probability up to q for distribution FOO, qFOO(p,params) gives you the quantile that gives you cumulative probability p, and rFOO(n,params) gives you a random sample of size n from distribution FOO. Thus, we have pt, pnorm, pf, pchisq, pbinom, and many others as well as their q and r forms.
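
A short sketch of the p/q/r pattern, using the t value and degrees of freedom from the Welch test above. The seed and the sample size in the last step are arbitrary choices made only for the illustration.

```r
# p and q functions are inverses of each other
p_lower <- pt(-0.27353, df = 23.911)   # area to the left of the t value
q_back  <- qt(p_lower, df = 23.911)    # recovers the t value, -0.27353

# Critical value for a 95% two-sided interval on 23.911 degrees of freedom
t_crit <- qt(0.975, df = 23.911)

# The same pattern for the standard normal distribution
z_crit <- qnorm(0.975)
p_norm <- pnorm(z_crit)                # back to 0.975

# r functions draw random samples (seed chosen arbitrarily)
set.seed(1)
draws <- rt(5, df = 23.911)

c(p_lower = p_lower, q_back = q_back, t_crit = t_crit, z_crit = z_crit)
```

The t critical value is a little larger than the normal one because the t distribution has heavier tails.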

3.4 Randomization (permutation) two-sample test

The randomization version of the two-sample t-test can be done with the function permTS(); this function comes from the perm package. To use it, you need to install the perm package onto your computer once (although you may need to redo this every time you update R), and then load it into every R session you want to use it in. You can use the functions as shown here (the first command does the install, duh, and the second one is the one you need to do every time you want to use it), but it is usually easier to use the package menu commands in RStudio to do the install.

install.packages("perm",repos="https://cloud.r-project.org");library(perm)
Installing package into '/Users/gary/Library/R/4.0/library'
(as 'lib' is unspecified)

The downloaded binary packages are in
    /var/folders/_6/3018nw2s6x1_vm4fszmrz7t80000gp/T//RtmpurDCu2/downloaded_packages

We are going to do randomization tests, which rely on randomization. The “random” numbers in R are produced by an algorithm that starts with a “seed” value. If you want to be able to reproduce exact values, you need to seed (start) the random number generator in the same place. I do that here so that you can reproduce the results I get in the demo. In general, R will seed its own random numbers so that they’re different every time.

set.seed(654321)
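
A minimal sketch of what seeding buys you: resetting the seed replays the same stream of random numbers. The use of sample(26) here is just a convenient example of a random draw (26 is the total number of boards); the variable names are illustrative.

```r
set.seed(654321)
first <- sample(26)      # a random shuffle of 1..26

set.seed(654321)
second <- sample(26)     # same seed, so the same shuffle

identical(first, second) # TRUE

third <- sample(26)      # no reseed: continues the stream from where it left off
```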

The permTS() function does the two-sample randomization (permutation) t-test. By default it does a two-sided alternative.

We see that the x mean is greater than the y mean, and that the probability that a randomization leads to a difference of means as large or larger than 6.62 in absolute value is about 78%. Note that this is quite close to the t-test p-value of 0.7868.

For small data sets, the p-value can be computed exactly. For larger data sets, it’s too much work to compute the exact p-value, so the function uses an approximation; the header of the output below shows that the asymptotic approximation was used for these data.

permTS(unnotched,notched)

    Permutation Test using Asymptotic Approximation

data:  unnotched and notched
Z = 0.27874, p-value = 0.7804
alternative hypothesis: true mean unnotched - mean notched is not equal to 0
sample estimates:
mean unnotched - mean notched 
                     6.615385 
set.seed(654321) # try again
permTS(strength~shape,data=NotchedBoards) # same results

    Permutation Test using Asymptotic Approximation

data:  strength by shape
Z = -0.27874, p-value = 0.7804
alternative hypothesis: true mean shape=notched - mean shape=uniform is not equal to 0
sample estimates:
mean shape=notched - mean shape=uniform 
                              -6.615385 

We may also specify different alternatives, for example,

permTS(unnotched,notched,alternative="greater")

    Permutation Test using Asymptotic Approximation

data:  unnotched and notched
Z = 0.27874, p-value = 0.3902
alternative hypothesis: true mean unnotched - mean notched is greater than 0
sample estimates:
mean unnotched - mean notched 
                     6.615385
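
To make the idea behind the randomization test concrete, here is a minimal base R sketch that approximates the two-sided permutation p-value by Monte Carlo: shuffle the group labels many times and count how often the shuffled difference in means is at least as extreme as the observed one. This is an illustration of the principle, not a description of how permTS() computes its result; the number of shuffles (n_perm) and the variable names are assumptions made for the example.

```r
# Data copied from the printout earlier in the chapter
unnotched <- c(243, 229, 305, 395, 210, 311, 289, 269, 282, 399, 222, 331, 369)
notched   <- c(215, 202, 273, 292, 253, 247, 350, 246, 352, 398, 267, 331, 342)

set.seed(654321)                      # make the shuffles reproducible

pooled   <- c(unnotched, notched)
n1       <- length(unnotched)
observed <- mean(unnotched) - mean(notched)

n_perm <- 9999                        # number of random relabelings (arbitrary)
perm_diffs <- replicate(n_perm, {
  idx <- sample(length(pooled), n1)   # indices assigned to the "unnotched" label
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided p-value; the +1 counts the observed labeling itself,
# and the small tolerance guards against floating point ties
p_mc <- (sum(abs(perm_diffs) >= abs(observed) - 1e-8) + 1) / (n_perm + 1)
p_mc
```

With this many shuffles the Monte Carlo p-value should land near the asymptotic value reported by permTS(), although it will not match it exactly.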