Statistics 5601 (Geyer, Spring 2006) Examples: One-Way Layout

General Instructions
Kruskal-Wallis
Jonckheere-Terpstra
Isotonic Regression

General Instructions

To do each example, just click the "Submit" button. You do not have to type in any R instructions or specify a dataset. That's already done for you.

Kruskal-Wallis

Example 6.1 in Hollander and Wolfe.

Comments

The second analysis done by the aov function is the usual parametric procedure: one-way ANOVA. It produces P = 0.5866 for comparison with the Kruskal-Wallis P-value.

R knows that the predictor variable status is categorical because it is not numeric. If the predictor variable is numeric, R has no way to know whether it is meant to be categorical. The kruskal.test function still treats it as categorical, but the aov function treats it as numeric.

Suppose, for example, we were doing the example used for the two procedures for ordered alternatives below, which loads data from the URL

http://www.stat.umn.edu/geyer/s06/5601/hwdata/t6-6.txt

which has predictor variable information and response variable number. Then

kruskal.test(number ~ information)

will do the right thing (a Kruskal-Wallis test in which the three values of information are treated as denoting treatments). But we need

information <- factor(information) 
out <- aov(number ~ information)
summary(out)

to have aov do the right thing. The factor function tells R that the variable is to be treated as categorical (R calls a categorical variable a factor).
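
The difference shows up in the degrees of freedom. The following is a minimal sketch on made-up data (the vectors y and dose are hypothetical and are not from Hollander and Wolfe): with a numeric predictor aov fits a straight line, with a factor it fits one mean per level, and kruskal.test gives the same statistic either way.

```r
# Hypothetical data: three levels coded numerically, four observations each.
dose <- rep(c(1, 2, 3), each = 4)
y <- c(4.1, 5.0, 3.8, 4.6, 5.9, 6.3, 5.1, 6.0, 6.2, 7.4, 6.8, 7.1)

# Numeric predictor: aov fits a linear trend in dose (1 degree of freedom).
out.numeric <- aov(y ~ dose)
df.numeric <- summary(out.numeric)[[1]][["Df"]][1]

# Factor predictor: aov fits a separate mean per level (3 - 1 = 2 degrees of freedom).
out.factor <- aov(y ~ factor(dose))
df.factor <- summary(out.factor)[[1]][["Df"]][1]

print(c(numeric = df.numeric, factor = df.factor))

# kruskal.test treats dose as a grouping variable in both cases.
kw.numeric <- kruskal.test(y ~ dose)
kw.factor <- kruskal.test(y ~ factor(dose))
print(c(kw.numeric$statistic, kw.factor$statistic))
```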

Jonckheere-Terpstra

Unfortunately, R doesn't have this procedure. So we'll have to do it by hand (in R).

Example 6.2 in Hollander and Wolfe.

Summary

Upper-tailed Jonckheere-Terpstra test
Test statistic: J = 79
Sample sizes: n₁ = 6, n₂ = 6, and n₃ = 6
Monte Carlo approximation to P-value: P = 0.0231
Monte Carlo standard error of P-value: 0.0015

Comment

Rather than use a large-sample approximation on what are really small sample sizes, we do a Monte Carlo calculation of the P-value (that is, we compute it by simulating the null distribution of the test statistic).

The Monte Carlo calculation is the loop

for (i in 1:nsim) {
    datsim <- sample(dat, length(dat))
    jsim[i] <- jkstat(datsim, grp)
}

This does nsim simulations of the null distribution of the test statistic. The first line of the body of the loop generates a new simulated data set datsim which is a permutation of the actual data (same numbers, just assigned to different groups). The second line of the body of the loop calculates the value of the test statistic for the simulated data and stores it for future use.
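
The function jkstat is not shown on this page. The following is a minimal sketch of what such a function might look like, assuming the usual definition of J as the sum, over ordered pairs of groups, of Mann-Whitney counts with ties counted as one half. The data vector and group labels are made up for illustration.

```r
# Assumed stand-in for jkstat; not the course's own code.
jkstat <- function(x, g) {
    samples <- split(x, as.ordered(g))
    k <- length(samples)
    stopifnot(k >= 2)
    j <- 0
    for (u in 1:(k - 1)) {
        for (v in (u + 1):k) {
            # All pairwise differences: group-u value minus group-v value.
            d <- outer(samples[[u]], samples[[v]], "-")
            # Count pairs where the earlier group is smaller; ties count 1/2.
            j <- j + sum(d < 0) + sum(d == 0) / 2
        }
    }
    j
}

# Hypothetical data: three ordered groups of three observations each.
dat <- c(1, 3, 2, 4, 6, 5, 7, 9, 8)
grp <- rep(1:3, each = 3)
jstat <- jkstat(dat, grp)

# The permutation loop from the text, run on the hypothetical data.
set.seed(42)
nsim <- 999
jsim <- double(nsim)
for (i in 1:nsim) {
    datsim <- sample(dat, length(dat))
    jsim[i] <- jkstat(datsim, grp)
}
print(jstat)
print(summary(jsim))
```

Because the group labels grp stay fixed while the data values are shuffled, each pass through the loop draws one random assignment of the observed values to groups, which is exactly the null hypothesis of no group effect.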

After the loop has finished, jsim is a vector of length nsim that consists of independent, identically distributed random variables having the distribution of the test statistic J under the null hypothesis. And

phat <- mean(jsim >= jstat)

approximates the P-value, which is Pr(J ≥ j).

The slightly more tricky code

(nsim * phat + 1) / (nsim + 1)

includes the observed value in the numerator and denominator. As explained in class, this ensures that if α is a multiple of 1 / (nsim + 1), then Pr(P ≤ α) is indeed α, despite the Monte Carlo.

Despite having an exact Monte Carlo test (exact meaning level α really means level α), there is some interest in the randomness in the reported P-value. Hence the next-to-last line calculates its standard error.
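
As a sketch of these last steps, the code below computes the adjusted P-value and a standard error from a vector of simulated statistics. The standard error formula is the one used in the isotonic regression code in the next section; the simulated values and the observed statistic here are hypothetical stand-ins, not output of the example.

```r
# Hypothetical stand-ins for the simulated null statistics and the observed value.
set.seed(1)
nsim <- 999
jsim <- rbinom(nsim, 100, 0.5)
jstat <- 58

# Raw Monte Carlo estimate of Pr(J >= j).
phat <- mean(jsim >= jstat)

# Adjusted P-value: the observed statistic is counted as one more simulation.
pval <- (nsim * phat + 1) / (nsim + 1)

# phat is a binomial proportion, so its standard error is about
# sqrt(phat * (1 - phat) / nsim); pval rescales phat by nsim / (nsim + 1).
se <- nsim / (nsim + 1) * sqrt(phat * (1 - phat) / nsim)

print(c(pval = pval, se = se))
```

The adjusted P-value can never be zero: its smallest possible value is 1 / (nsim + 1), which occurs when no simulated statistic is as large as the observed one.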

The last line reports the time the calculation takes: the first number is the elapsed time in seconds.

Isotonic Regression

This is the normal-theory competitor to Jonckheere-Terpstra. Unfortunately, R doesn't have this procedure. So we'll have to do it by hand (in R).

Example 6.2 in Hollander and Wolfe.

dat <- number
grp <- as.ordered(information)
mu0 <- mean(dat)
sig0 <- sd(dat)
n <- length(dat)
ss.null <- sum((dat - mu0)^2)
xbar <- sapply(split(dat, grp), mean)
nbar <- sapply(split(dat, grp), length)
k <- length(xbar)
pava <- function(x, w) {
    if (any(w <= 0))
        stop("weights must be positive")
    if (length(x) != length(w))
        stop("arguments not same length")
    n <- length(x)
    design <- diag(n)
    repeat {
        out <- lm(x ~ design + 0, weights = w)
        mu <- coefficients(out)
        dmu <- diff(mu)
        if (all(dmu >= 0)) break
        j <- min(seq(along = dmu)[dmu < 0])
        design[ , j] <- design[ , j] + design[ , j + 1]
        design <- design[ , - (j + 1), drop = FALSE]
    }
    return(as.numeric(design %*% mu))
}
test.stat <- function(x, w) {
    mu <- pava(x, w)
    mu0 <- sum(w * x) / sum(w)
    ss.alt <- sum(w * (x - mu)^2)
    ss.null <- sum(w * (x - mu0)^2)
    return(ss.null - ss.alt)
}
print(tstat <- test.stat(xbar, nbar))
nsim <- 1e3 - 1
tsim <- double(nsim)
for (i in 1:nsim) {
    xbarsim <- rnorm(k, mu0, sig0 / sqrt(nbar))
    tsim[i] <- test.stat(xbarsim, nbar)
}
phat <- mean(tsim >= tstat)
(nsim * phat + 1) / (nsim + 1)
nsim / (nsim + 1) * sqrt(phat * (1 - phat) / nsim)
cat("Calculation took", proc.time()[1], "seconds\n")

Summary

Upper-tailed isotonic regression test
Test statistic: T = 52.78
Sample sizes: n₁ = 6, n₂ = 6, and n₃ = 6
Monte Carlo approximation to P-value: P = 0.036
Monte Carlo standard error of P-value: 0.0019

Comment

The result quoted in the summary uses 10 times the Monte Carlo sample size (nsim) entered in the example form above. It was run off-line (in R rather than Rweb).

The R function pava implements the pool adjacent violators algorithm, which does isotonic regression. In the famous words of a UNIX source code comment, "you are not expected to understand this."
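
For readers who do want to see the idea, here is a more direct sketch of pool adjacent violators that keeps an explicit list of blocks rather than refitting a linear model. It is not the course's pava function; the name pava.sketch and the input vector are hypothetical. With equal weights its answer can be checked against isoreg from the stats package.

```r
# A sketch of pool adjacent violators using explicit blocks (assumed helper).
pava.sketch <- function(x, w = rep(1, length(x))) {
    stopifnot(length(x) == length(w), all(w > 0))
    # Each block has a pooled value, a total weight, and a count of the
    # original points it covers.
    val <- numeric(0)
    wt <- numeric(0)
    cnt <- integer(0)
    for (i in seq_along(x)) {
        # Start a new block holding the next point.
        val <- c(val, x[i])
        wt <- c(wt, w[i])
        cnt <- c(cnt, 1L)
        m <- length(val)
        # Merge the last two blocks while they are out of order.
        while (m > 1 && val[m - 1] > val[m]) {
            val[m - 1] <- (wt[m - 1] * val[m - 1] + wt[m] * val[m]) /
                (wt[m - 1] + wt[m])
            wt[m - 1] <- wt[m - 1] + wt[m]
            cnt[m - 1] <- cnt[m - 1] + cnt[m]
            val <- val[-m]
            wt <- wt[-m]
            cnt <- cnt[-m]
            m <- m - 1
        }
    }
    # Expand the blocks back to one fitted value per original point.
    rep(val, times = cnt)
}

x <- c(1, 3, 2, 4, 3.5, 6)   # hypothetical group means
fit <- pava.sketch(x)
print(fit)
print(isoreg(x)$yf)          # unweighted isotonic fit from the stats package
```

Whenever two adjacent fitted values are out of order, they are replaced by their weighted average, and the process repeats until the fit is nondecreasing.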

The main lesson here is that the normal theory test (isotonic regression) does more or less the same as the nonparametric Jonckheere-Terpstra test. This is no surprise since the data are fairly normal looking.
