University of Minnesota, Twin Cities School of Statistics Stat 5601 Rweb Computing Examples

Stat 5601 (Geyer) Examples (Kolmogorov-Smirnov Tests)

General Instructions
Theory
One-Sample Tests
The Corresponding Confidence Interval
The Corresponding Point Estimate
Two-Sample Tests

General Instructions

To do each example, just click the "Submit" button. You do not have to type in any R instructions or specify a dataset. That's already done for you.

Theory

(Cumulative) Distribution Functions

The (cumulative) distribution function of a (real-valued) random variable X is the function F defined by

F(x) = pr(X ≤ x), - ∞ < x < ∞

Because F(x) is a probability, it is necessarily between zero and one.
Because the event X ≤ x increases as x increases, F is a nondecreasing function.
Because the event X ≤ x decreases to the empty set as x goes to minus infinity,
lim_{x → - ∞} F(x) = 0.
Because the event X ≤ x increases to the whole real line as x goes to plus infinity,
lim_{x → + ∞} F(x) = 1.
If the support of X is not the whole real line, then all of the increase of F takes place on the support, that is, if a ≤ X ≤ b with probability one, then F(x) = 0 for x < a and F(b) = 1 (and also F(a) = 0 when X is continuous).
Other properties of the distribution function depend on whether X is discrete or continuous.
If X is a continuous random variable, then
- F is a continuous function and is strictly increasing on the support of X.
If X is a discrete random variable, then
- F is a discontinuous function.
- The discontinuities (jumps) of F occur at the atoms of X (the points having nonzero probability).
- The height of the jump gives the probability of the atom, that is,
  pr(X = x) = F(x) - F(x - ε)
  whenever ε is small enough so that there are no other jumps between x - ε and x.
- F is constant (its graph is horizontal) between jumps.
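
The jump property can be checked numerically. The following is a minimal sketch, assuming a Binomial(10, 0.3) distribution and the atom x = 4 (both chosen here purely for illustration).

```r
# Binomial(10, 0.3) is discrete with atoms at 0, 1, ..., 10 (illustrative choice)
n.trials <- 10
p.success <- 0.3
x <- 4
eps <- 0.5  # small enough that no other atom lies between x - eps and x

# height of the jump of F at x
jump <- pbinom(x, n.trials, p.success) - pbinom(x - eps, n.trials, p.success)
# probability of the atom at x
atom <- dbinom(x, n.trials, p.success)
print(c(jump = jump, atom = atom))

# F is constant between jumps: same value anywhere strictly between 4 and 5
flat <- pbinom(4.2, n.trials, p.success) == pbinom(4.9, n.trials, p.success)
print(flat)
```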

Empirical (Cumulative) Distribution Functions

The empirical distribution function F_n is just the distribution function of the empirical distribution, which puts probability 1 / n at each data point of a sample of size n.

If x_(i) are the order statistics, then the empirical distribution function jumps from (i - 1) / n to i / n at the point x_(i) and is constant except for the jumps at the order statistics.
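
A small check of the jump description, sketched under the assumption of a made-up sample of five distinct values (the data are not from the course).

```r
# made-up data, distinct values so that every jump has height 1 / n
x <- c(2.1, 0.4, 3.7, 1.5, 2.9)
n <- length(x)
Fn <- ecdf(x)
x.sorted <- sort(x)  # the order statistics x_(1), ..., x_(n)

# value at each order statistic: i / n
at.jump <- Fn(x.sorted)
# value just to the left of each order statistic: (i - 1) / n
before.jump <- Fn(x.sorted - 1e-8)
print(rbind(before.jump, at.jump))
```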

Distribution Function Examples

The R function ecdf (in the stats package in current versions of R, formerly in the stepfun library) produces empirical (cumulative) distribution functions. The R functions of the form p followed by a distribution name (pnorm, pbinom, etc.) produce theoretical distribution functions.
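
A minimal sketch of such a comparison follows, assuming a simulated standard normal sample. The seed, the sample size of 100, and the plotting options are illustrative choices made here, not taken from the course example.

```r
set.seed(42)   # arbitrary seed, for reproducibility only
n <- 100       # illustrative sample size
x <- rnorm(n)  # simulated standard normal data

Fn <- ecdf(x)  # empirical distribution function (a step function)

# empirical distribution function as a step function
plot(Fn, do.points = FALSE, verticals = TRUE,
     main = "Empirical and theoretical distribution functions")
# theoretical standard normal distribution function, dashed
curve(pnorm(x), add = TRUE, lty = 2)

# largest vertical distance between the two, evaluated at the data points
max.dist <- max(abs(Fn(x) - pnorm(x)))
print(max.dist)
```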

Comments

If you increase the sample size n the empirical distribution function will get closer to the theoretical distribution function.

If you change the theoretical distribution function from standard normal to something else, the empirical and theoretical distribution functions will still be close to each other, just different. For example, try standard exponential (rexp replaces rnorm and pexp replaces pnorm).

The Asymptotic Distribution (Brownian Bridge)

As everywhere else in statistics, there is asymptotic normality. Here it is a bit trickier than usual because the objects of interest are not scalar-valued random variables, nor even vector-valued random variables, but function-valued random variables F_n. This can be thought of as an infinite-dimensional random vector because it has an infinite number of coordinates F_n(x) for each of the (infinitely many) values of x.

But we won't bother with those technicalities. Suffice it to say that

√n [ F_n(x) − F(x) ]

converges to a Gaussian stochastic process called the Brownian bridge in the special case that the true population distribution is Uniform(0, 1). "Gaussian" here refers to the normal distribution. More about this in class.

This result would have no place in a course on nonparametrics if it were peculiar to the uniform distribution. But the non-uniform case is not much different. It will be described presently.

We can see what the Brownian bridge looks like by just taking a very large sample size (large enough for the asymptotics to work).

If you repeat the plot over and over, you will see many different realizations of this random function.
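
One way to draw such a realization is sketched below, assuming a Uniform(0, 1) sample of size 10000 and a grid of 1001 points (both arbitrary choices made here).

```r
set.seed(1)    # arbitrary seed; change or remove it to see other realizations
n <- 10000     # "very large" sample size, an illustrative choice
u <- runif(n)  # Uniform(0, 1) sample, so F(t) = t on (0, 1)
Fn <- ecdf(u)

t.grid <- seq(0, 1, length.out = 1001)
b <- sqrt(n) * (Fn(t.grid) - t.grid)  # approximate Brownian bridge on the grid

plot(t.grid, b, type = "l", xlab = "t", ylab = "sqrt(n) * (Fn(t) - t)")
abline(h = 0, lty = 2)
```

The path is pinned to zero at both ends, because F_n and F agree at t = 0 and t = 1.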

The non-uniform case has to do with a curious fact from theoretical statistics. If X is any continuous random variable and F is its distribution function, then F(X) is a Uniform(0, 1) random variable. This means

- Any continuous random variable X can be mapped to a Uniform(0, 1) random variable U (by the transformation F) and vice versa (by the transformation F^-1).
- More importantly for the subject of Kolmogorov-Smirnov tests, the distribution of √n ( F_n(x) − F(x) ) is the same for all continuous population distributions except for a transformation of the x-axis. If we base our procedures only on the vertical distance between F_n(x) and F(x) and ignore horizontal distances (which are transformed), then our procedure will be truly nonparametric.
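
The invariance of vertical distances can be seen in a small experiment. This is a sketch assuming a standard exponential sample of size 50 (an arbitrary choice). Note that ks.test reports the supremum without the √n factor.

```r
set.seed(2)  # arbitrary seed
n <- 50      # illustrative sample size
x <- rexp(n) # standard exponential sample
u <- pexp(x) # F(X): should behave like a Uniform(0, 1) sample

# largest vertical distance between empirical and hypothesized distribution
# functions, on the original scale and on the transformed (uniform) scale
d.x <- unname(ks.test(x, "pexp")$statistic)
d.u <- unname(ks.test(u, "punif")$statistic)
print(c(original = d.x, transformed = d.u))
```

The two statistics agree because the transformation F moves points horizontally only.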

Suprema over the Brownian Bridge

The distributions of both one-sided and two-sided suprema over the Brownian bridge are known. Define

D⁺ = sup_{0 < t < 1} B(t)

where B(t) is the Brownian bridge. Since the Brownian bridge is a random function, D⁺ is a random variable. The distribution of this random variable is known. It has distribution function

F_D⁺(x) = 1 - exp(− 2 x²)

The Brownian bridge is symmetric with respect to being turned upside down (in distribution). Thus the statistic D⁻ defined by replacing sup with inf in the definition of D⁺ is, after a change of sign, distributed like D⁺: − D⁻ has the same distribution as D⁺.

Similarly, define the two-sided supremum

D = sup_{0 < t < 1} | B(t) |

where B(t) is the Brownian bridge. Since the Brownian bridge is a random function, D is a random variable. The distribution of this random variable is also known. It has distribution function

F_D(x) = 1 + 2 ∑_{k = 1}^∞ (− 1)^k exp(− 2 k² x²)

Although this involves an infinite series, the series converges extremely rapidly. Usually a few terms suffice for very high accuracy.
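
Both distribution functions are easy to code. The sketch below uses hypothetical function names (pks.one.sided, pks.two.sided) and truncates the series at a default of 10 terms, an assumption made here. The truncated series should not be trusted for x very near zero, where convergence is slow.

```r
# distribution function of D+ (valid for x >= 0); hypothetical helper name
pks.one.sided <- function(x) 1 - exp(-2 * x^2)

# distribution function of D, with the infinite series truncated at `terms`
pks.two.sided <- function(x, terms = 10) {
    k <- seq_len(terms)
    sapply(x, function(xi) 1 + 2 * sum((-1)^k * exp(-2 * k^2 * xi^2)))
}

# three terms already agree with ten terms to many decimal places
few <- pks.two.sided(1.358099, terms = 3)
many <- pks.two.sided(1.358099, terms = 10)
print(c(few = few, many = many), digits = 10)
```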

One-Sample Tests

The one-sample Kolmogorov-Smirnov test is based on the test statistic

D⁺_n = sup_{-∞ < x < +∞} √n [ F_n(x) − F(x) ]

for an upper-tailed test, on the test statistic D⁻_n, defined by replacing sup with inf in the formula above, for a lower-tailed test, or on the test statistic

D_n = sup_{-∞ < x < +∞} √n | F_n(x) − F(x) |

for a two-tailed test. Usually, we want a two-tailed test.

Because the distribution F hypothesized under the null hypothesis must be completely specified (no free parameters whatsoever), this test is fairly useless in practice, and Hollander and Wolfe do not cover it.

As we shall see when we get to the bootstrap, the test can be used with free parameters to be estimated in the null distribution, but that takes us out of Hollander and Wolfe and into Efron and Tibshirani. So we will put that off.

For now we just do a toy example using the R function ks.test (see its on-line help).

As can be seen by trying out the example, the test is not very powerful even for large sample sizes if the distributions are not too different. Try different sample sizes and degrees of freedom for the t.
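
A toy example of this kind might look like the following sketch. It assumes data simulated from a t distribution and a standard normal null hypothesis. The seed, the sample size of 100, and the 5 degrees of freedom are illustrative values chosen here.

```r
set.seed(3)          # arbitrary seed
n <- 100             # try other sample sizes
df <- 5              # try other degrees of freedom for the t
x <- rt(n, df = df)  # data actually come from a t distribution

# two-tailed test of the null hypothesis that the data are standard normal
out <- ks.test(x, "pnorm")
print(out)
```

Note that the statistic reported by ks.test is the supremum of | F_n(x) − F(x) | without the √n factor.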

The Corresponding Confidence Interval

As we said, one-sample Kolmogorov-Smirnov tests are fairly useless from an applied point of view (however theoretically important). But the dual confidence interval is of use. It gives a confidence band for the whole distribution function (Section 11.5 in Hollander and Wolfe).

The programmer who wrote the ks.test function for R didn't bother with the confidence interval. So we are on our own again. We (like Hollander and Wolfe) will only do the two-sided interval. The one-sided is similar. Just use the distribution of D⁺ instead of the distribution of D.

Example 11.6 in Hollander and Wolfe.

Comments.

The step function is the empirical distribution function.
The dashed lines on either side mark a 95% confidence band for the true population distribution function. (The probability that the true population distribution function lies entirely within the confidence band gets closer and closer to 0.95 as the sample size goes to infinity.)
The first half of the code (above the blank line) could be replaced by
```
crit.val <- 1.358099
```
if there were no interest in confidence levels other than 95%.
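
A sketch of how such a band could be computed is given below. The data are simulated placeholders, not the data of Example 11.6, and the helper name pks and the 10-term truncation of the series are assumptions made here. The first half finds the critical value from the asymptotic distribution of D, and the second half draws the band.

```r
conf.level <- 0.95
# distribution function of D (series truncated at 10 terms); hypothetical helper
pks <- function(x, terms = 10) {
    k <- seq_len(terms)
    1 + 2 * sum((-1)^k * exp(-2 * k^2 * x^2))
}
crit.val <- uniroot(function(x) pks(x) - conf.level,
                    lower = 0.5, upper = 3, tol = 1e-10)$root

set.seed(4)
x <- rnorm(30)  # placeholder data, NOT the data from Hollander and Wolfe
n <- length(x)
Fn <- ecdf(x)
x.sorted <- sort(x)
half.width <- crit.val / sqrt(n)
# a distribution function lies in [0, 1], so clip the band there
upper <- pmin(Fn(x.sorted) + half.width, 1)
lower <- pmax(Fn(x.sorted) - half.width, 0)

plot(Fn, do.points = FALSE, verticals = TRUE, main = "95% confidence band")
lines(x.sorted, upper, type = "s", lty = 2)
lines(x.sorted, lower, type = "s", lty = 2)
```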

The Corresponding Point Estimate

Procedures always come in threes: a hypothesis test, the dual confidence interval, and the corresponding point estimate. What is the point estimator here?

Two-Sample Tests

The difference of two independent Brownian bridges is a rescaled Brownian bridge (vertical axis expanded by √2). The obvious statistic for comparing two empirical distribution functions F_m and G_n, which is

sup_{-∞ < x < +∞} | F_m(x) − G_n(x) |

is distributed, for large m and n, approximately as the supremum of the absolute value of a Brownian bridge with the vertical axis expanded by (1 / m + 1 / n)^{1 / 2}, because F_m has variance proportional to 1 / m and G_n has variance proportional to 1 / n.

Thus

(1 / m + 1 / n)^{− 1 / 2} sup_{-∞ < x < +∞} | F_m(x) − G_n(x) |

has the distribution of D, the two-sided supremum of the standard Brownian bridge, for its asymptotic distribution.

But we don't actually need to know this ourselves. It is buried in the code for ks.test.
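
For illustration, here is a minimal sketch of a two-sample test, assuming two simulated normal samples that differ by a location shift of 0.5. The sample sizes, shift, and seed are arbitrary choices made here. It also computes the statistic by hand to show what ks.test reports.

```r
set.seed(5)              # arbitrary seed
m <- 40                  # illustrative sample sizes
n <- 60
x <- rnorm(m)
y <- rnorm(n, mean = 0.5)  # shifted alternative

out <- ks.test(x, y)     # two-sided, two-sample test
print(out)

# the same statistic by hand: the supremum is attained at a data point
Fm <- ecdf(x)
Gn <- ecdf(y)
pooled <- c(x, y)
d.hand <- max(abs(Fm(pooled) - Gn(pooled)))

# rescaled version, comparable with the distribution of D
d.scaled <- d.hand / sqrt(1 / m + 1 / n)
print(c(by.hand = d.hand, rescaled = d.scaled))
```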