The (cumulative) distribution function of a (real-valued) random variable X is the function F defined by
    F(x) = Pr(X ≤ x),  −∞ < x < ∞.
The empirical distribution function Fn is just the distribution function of the empirical distribution, which puts probability 1 / n at each data point of a sample of size n.
If x(i) are the order statistics and all of the order statistics are distinct, then the empirical distribution function jumps from (i − 1) / n to i / n at the point x(i) and is constant except for the jumps at the order statistics.
If exactly k order statistics x(i), …, x(i + k − 1) are tied at some value, then the empirical distribution function jumps from (i − 1) / n to (i + k − 1) / n at that point.
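A tiny check of the jump sizes, using a made-up sample of size n = 5 in which three values are tied (so i = 2 and k = 3 in the notation above). The data are purely illustrative.

```r
# Made-up sample with three tied values (n = 5)
x <- c(1, 2, 2, 2, 5)
Fn <- ecdf(x)

# Just below the tied value: (i - 1) / n with i = 2
Fn(1.999)
# At the tied value: (i + k - 1) / n with i = 2, k = 3
Fn(2)
```

The first call should give 1 / 5 and the second 4 / 5.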
The R function ecdf produces empirical (cumulative) distribution functions. The R functions of the form p followed by a distribution name (pnorm, pbinom, etc.) produce theoretical distribution functions.
If you increase the sample size n, the empirical distribution function will get closer to the theoretical distribution function.
If you change the theoretical distribution function from standard normal to something else, the empirical and theoretical distribution functions will still be close to each other, just different. For example, try standard exponential (rexp replaces rnorm and pexp replaces pnorm).
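A minimal sketch of such a comparison, assuming a simulated standard normal sample. The seed, the sample size, and the plot title are arbitrary choices made here for illustration.

```r
set.seed(42)          # arbitrary seed, only for reproducibility
n <- 100              # sample size; try larger values
x <- rnorm(n)         # use rexp(n) for the exponential variant

# Empirical distribution function (step function)
plot(ecdf(x), main = "Empirical and theoretical distribution functions")
# Theoretical distribution function overlaid; use pexp(x) with rexp
curve(pnorm(x), add = TRUE, col = "red")
```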
As everywhere else in statistics, the law of large numbers holds. In fact, for fixed x this is just the usual law of large numbers because the empirical distribution function Fn(x) is a sample proportion (the proportion of Xi that are less than or equal to x) that estimates the true population proportion F(x). Thus the statement that
    Fn(x) → F(x), as n → ∞,
is just the ordinary law of large numbers (the convergence here is either in probability or almost sure).
But much more is true. In fact, the convergence is actually uniform:
    sup_x | Fn(x) − F(x) | → 0, as n → ∞
(a fact known as the Glivenko-Cantelli theorem in advanced probability theory).
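The uniform convergence can be watched numerically. The sketch below assumes standard normal data and uses a helper, here called sup.dist (a name invented for this illustration), that computes the supremum exactly by checking both sides of each jump.

```r
# Exact sup_x |Fn(x) - F(x)| for continuous F: the supremum is attained
# at, or just before, one of the order statistics.
sup.dist <- function(x, pfun) {
    n <- length(x)
    p <- pfun(sort(x))
    i <- seq_len(n)
    max(pmax(i / n - p, p - (i - 1) / n))
}

set.seed(42)
for (n in 10^(1:5)) {
    x <- rnorm(n)
    cat("n =", n, " sup distance =", sup.dist(x, pnorm), "\n")
}
```

The printed distances should shrink roughly like 1 / √n.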
As everywhere else in statistics, there is also asymptotic normality. In fact, as noted above, for fixed x this is just the usual central limit theorem because Fn(x) is a sample proportion. Thus the statement that
    √n (Fn(x) − F(x)) → Normal(0, p (1 − p)), as n → ∞,
where p = F(x), is just the ordinary central limit theorem (the convergence here is convergence in distribution).
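A small simulation can make the fixed-x statement concrete. This sketch assumes standard normal data and an arbitrarily chosen point x0 = 0.5. It compares the simulated variance of √n (Fn(x0) − F(x0)) with p (1 − p).

```r
set.seed(42)
n <- 500         # sample size
nsim <- 2000     # number of simulated samples
x0 <- 0.5        # the fixed point x (an arbitrary choice)
p <- pnorm(x0)   # p = F(x0)

# Fn(x0) is the proportion of the sample less than or equal to x0
z <- replicate(nsim, sqrt(n) * (mean(rnorm(n) <= x0) - p))

var(z)           # simulated variance
p * (1 - p)      # variance predicted by the central limit theorem
```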
But much more is true. In fact, the convergence is actually uniform in a sense that we can't even start to explain at this level, because the objects of interest are not scalar-valued random variables, nor even vector-valued random variables, but function-valued random variables Fn. This can be thought of as an infinite-dimensional random vector because it has an infinite number of coordinates Fn(x) for each of the (infinitely many) values of x.
But we won't bother with those technicalities. Suffice it to say that
    √n (Fn(x) − F(x)), considered as a random function of x,
converges to a Gaussian stochastic process called the Brownian bridge in the special case that the true population distribution is Uniform(0, 1).
The term Gaussian here refers to the normal distribution; more about this in class.
This result would have no place in a course on nonparametrics if it were peculiar to the uniform distribution. But the non-uniform case is not much different. It will be described presently.
We can see what the Brownian bridge looks like by just taking a very large sample size (large enough for the asymptotics to work).
If you repeat the plot over and over, you will see many different realizations of this random function.
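A minimal sketch of such a plot, assuming a Uniform(0, 1) sample of size 10,000 (the seed and the sample size are arbitrary choices). Removing the set.seed line and rerunning gives a different realization each time.

```r
set.seed(42)
n <- 1e4
u <- sort(runif(n))

# sqrt(n) * (Fn(t) - t) evaluated at the jump points t = u[i], where Fn = i / n
b <- sqrt(n) * (seq_len(n) / n - u)

# The process is pinned to zero at both ends of the unit interval
plot(c(0, u, 1), c(0, b, 0), type = "l", xlab = "t", ylab = "B(t)")
abline(h = 0, lty = "dashed")
```

Joining the points with line segments ignores the individual jumps, which are too small to see at this sample size.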
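The non-uniform case has to do with a curious fact from theoretical statistics. If X is any continuous random variable and F is its distribution function, then F(X) is a Uniform(0, 1) random variable. This means that, for continuous F, the process √n (Fn(x) − F(x)) converges to B(F(x)), the Brownian bridge with its horizontal axis transformed by F, so suprema over x have the same asymptotic distributions as in the uniform case.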
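The curious fact is easy to check by simulation. This sketch assumes a standard exponential sample (any continuous distribution would do) and plots the empirical distribution function of F(X) against the Uniform(0, 1) distribution function, which is the diagonal line.

```r
set.seed(42)
x <- rexp(1000)      # sample from a continuous distribution
u <- pexp(x)         # F(X), with F the distribution function of X

# If F(X) is Uniform(0, 1), its empirical c.d.f. should hug the diagonal
plot(ecdf(u), main = "Empirical c.d.f. of F(X)")
abline(0, 1, col = "red")
```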
The distributions of both one-sided and two-sided suprema over the Brownian bridge are known. Define
    D+ = sup B(t), the supremum being over 0 < t < 1,
where B(t) is the Brownian bridge. Since the Brownian bridge is a random function, D+ is a random variable. The distribution of this random variable is known. It has distribution function
    Pr(D+ ≤ x) = 1 − exp(−2 x²),  x ≥ 0.
The Brownian bridge is symmetric with respect to being turned upside down (in distribution). Thus the statistic D− defined by replacing sup with inf in the definition of D+ has, apart from a change of sign, the same distribution as D+.
Similarly, we can define the two-sided supremum
    D = sup | B(t) |, the supremum being over 0 < t < 1,
where B(t) is the Brownian bridge. Since the Brownian bridge is a random function, D is a random variable. The distribution of this random variable is also known. It has distribution function
    Pr(D ≤ x) = 1 − 2 Σ (−1)^(k − 1) exp(−2 k² x²),  x > 0, the sum running over k = 1, 2, ….
Although this involves an infinite series, the series is extremely rapidly converging. Usually a few terms suffice for very high accuracy.
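The two distribution functions are simple to code. The sketch below uses the standard closed forms, 1 − exp(−2 x²) for the one-sided supremum and the alternating series 1 − 2 Σ (−1)^(k − 1) exp(−2 k² x²) for the two-sided one. The function names pks.one and pks.two and the truncation point kmax are choices made here, not part of any R package.

```r
# Distribution function of D+ (one-sided supremum)
pks.one <- function(x) ifelse(x <= 0, 0, 1 - exp(-2 * x^2))

# Distribution function of D (two-sided supremum), series truncated at
# kmax terms; adequate when x is not very close to zero
pks.two <- function(x, kmax = 100) {
    k <- seq_len(kmax)
    sapply(x, function(xi) {
        if (xi <= 0) return(0)
        1 - 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * xi^2))
    })
}

pks.two(1.358099)              # very nearly 0.95
pks.two(1.358099, kmax = 3)    # three terms already agree closely
```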
The one-sample Kolmogorov-Smirnov test is based on the test statistic
    D+n = sup_x [Fn(x) − F(x)]
for an upper-tailed test, or on the test statistic D−n defined by replacing sup with inf in the formula above for a lower-tailed test, or on the test statistic
    Dn = sup_x | Fn(x) − F(x) |
for a two-tailed test. Multiplied by √n, D+n and Dn have the asymptotic distributions of D+ and D given above. Usually, we want a two-tailed test.
Because the distribution F hypothesized under the null hypothesis must be completely specified (no free parameters whatsoever), this test is fairly useless, and Hollander and Wolfe do not cover it. However, a very closely related test, the Lilliefors test, covered below, is useful.
For now we just do a toy example using the R function ks.test.
As can be seen by trying out the example, the test is not very powerful even for large sample sizes if the distributions are not too different. Try different sample sizes and degrees of freedom for the t.
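A minimal sketch of such a toy example, assuming data simulated from a t distribution and a standard normal null hypothesis. The seed, sample size, and degrees of freedom are arbitrary starting values to experiment with.

```r
set.seed(42)
n <- 100      # sample size; try other values
df <- 5       # degrees of freedom of the t distribution; try other values
x <- rt(n, df)

# Null hypothesis: x is a sample from the standard normal distribution
ks.test(x, "pnorm")
```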
As we said, one-sample Kolmogorov-Smirnov tests are fairly useless from an applied point of view (however theoretically important). But the dual confidence interval is of use. It gives a confidence band for the whole distribution function (Section 11.5 in Hollander and Wolfe).
The programmer who wrote the ks.test function for R didn't bother with the confidence interval. So we are on our own again. We (like Hollander and Wolfe) will only do the two-sided interval. The one-sided is similar. Just use the distribution of D+ instead of the distribution of D.
The result is an asymptotic 95% confidence band for the true population distribution function. (The probability that the true population distribution function lies entirely within the confidence band gets closer and closer to 0.95 as the sample size goes to infinity.)
The critical value could simply be set with crit.val <- 1.358099 if there was no interest in confidence levels other than 95%.
The ylab = expression(F[n](x)) argument to the plot function makes the y-axis label Fn(x) with n a subscript. Many more such effects are possible and are described by help(plotmath).
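A sketch of how such a band might be computed, under stated assumptions: the data are simulated stand-ins, the critical value is found by solving Pr(D ≤ x) = 0.95 numerically with uniroot, and the helper name pks.two and the plot title are inventions of this illustration. The band is Fn(x) ± crit.val / √n, clipped to the interval from 0 to 1.

```r
set.seed(42)
x <- rnorm(50)            # stand-in data, purely for illustration
n <- length(x)
conf.level <- 0.95

# Distribution function of D, series truncated at 100 terms
pks.two <- function(d, k = 1:100) 1 - 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * d^2))

# Critical value: the conf.level quantile of D (about 1.358099 for 95%)
crit.val <- uniroot(function(d) pks.two(d) - conf.level, c(0.5, 3), tol = 1e-10)$root

Fn <- ecdf(x)
xs <- sort(x)
lower <- pmax(Fn(xs) - crit.val / sqrt(n), 0)
upper <- pmin(Fn(xs) + crit.val / sqrt(n), 1)

plot(Fn, ylab = expression(F[n](x)), main = "Empirical c.d.f. with confidence band")
# Band drawn as step functions over the range of the data only
lines(xs, lower, type = "s", lty = "dashed")
lines(xs, upper, type = "s", lty = "dashed")
```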
Procedures always come in threes: a hypothesis test, the dual confidence interval, and the corresponding point estimate. What is the point estimator here?
The difference of two independent Brownian bridges is a rescaled Brownian bridge (vertical axis expanded by √2). The obvious statistic for comparing two empirical distribution functions Fm and Gn, which is
    Fm(x) − Gn(x),
has (when the two population distributions are the same) an asymptotic distribution that is a Brownian bridge with the vertical axis expanded by (1 / m + 1 / n)^(1 / 2), because Fm has variance proportional to 1 / m and Gn has variance proportional to 1 / n.
Thus
    (1 / m + 1 / n)^(−1 / 2) (Fm(x) − Gn(x))
has the standard Brownian bridge for its asymptotic distribution.
But we don't actually need to know this ourselves. It is buried in the code for ks.test.
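To see what ks.test is doing in the two-sample case, the statistic can be computed by hand and compared. This sketch assumes two simulated normal samples (sizes and seed are arbitrary) and evaluates both empirical distribution functions at the pooled data points, where the supremum of their absolute difference must occur.

```r
set.seed(42)
m <- 60
n <- 80
x <- rnorm(m)
y <- rnorm(n)

# sup_x |Fm(x) - Gn(x)| is attained at one of the pooled data points
z <- sort(c(x, y))
d <- max(abs(ecdf(x)(z) - ecdf(y)(z)))

out <- ks.test(x, y)
c(by.hand = d, ks.test = unname(out$statistic))

# Rescaled statistic, to be compared with the distribution of D
d / sqrt(1 / m + 1 / n)
```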
It won't bother those with no previous exposure to the R ks.test function, but it came as a shock to me that the meaning of alternative = "less" changed since the last time I taught the course. It now means
    The possible values "two.sided", "less" and "greater" of alternative specify the null hypothesis that the true distribution function of x is equal to, not less than or not greater than the hypothesized distribution function (one-sample case) or the distribution function of y (two-sample case), respectively.
But if the distribution function of x is less than that of y, the median of x is greater than that of y. So the applied meaning of alternative is just the opposite of what it is for wilcox.test. If you want
    wilcox.test(x, y, alternative = "less")
its competitor is
    ks.test(x, y, alternative = "greater")
No real problem as long as you are aware of this issue. (A big problem if you forget!)
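A quick simulated check of the correspondence, assuming two normal samples in which x tends to be smaller than y (the sample sizes, shift, and seed are arbitrary choices). Both calls in the first pair should give small P-values, and both calls in the second pair should give large ones.

```r
set.seed(42)
x <- rnorm(100)               # x tends to be smaller than y
y <- rnorm(100, mean = 1)

# Matching alternatives: both say "x tends to be smaller than y"
p.wilcox <- wilcox.test(x, y, alternative = "less")$p.value
p.ks <- ks.test(x, y, alternative = "greater")$p.value
c(wilcox = p.wilcox, ks = p.ks)

# The opposite direction in both tests
p.wilcox.other <- wilcox.test(x, y, alternative = "greater")$p.value
p.ks.other <- ks.test(x, y, alternative = "less")$p.value
c(wilcox = p.wilcox.other, ks = p.ks.other)
```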
The one-sample Kolmogorov-Smirnov test isn't very useful in practice because it requires a simple null hypothesis, that is, the distribution must be completely specified with all parameters known.
What you want to do is test with unknown parameters. You would like the null hypothesis to be all normal distributions (and the alternative all non-normal distributions) or something like that. What you want to do is something like this, a Kolmogorov-Smirnov test with estimated parameters.
A WARNING is in order here: estimating the parameters changes the null distribution of the test statistic. The null distribution is generally not known when parameters are estimated and is not the same as when parameters are known.
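For concreteness, the naive procedure might look like the following sketch, which assumes simulated stand-in data and plugs the sample mean and standard deviation into ks.test. The statistic it reports is fine; the P-value is the part that cannot be trusted.

```r
set.seed(42)
x <- rnorm(50, mean = 10, sd = 3)   # stand-in data, purely for illustration

# WARNING: the P-value reported here is not valid, because the mean and
# standard deviation were estimated from the same data being tested.
ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
```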
Fortunately, when we have a computer, we can approximate the null distribution of the test statistic by simulation.
There is random error in this calculation from the simulation. However, because of the trick of adding 1 to the numerator and denominator in calculating the P-value, it can be used straight without regard for the randomness. Under the null hypothesis the probability Pr(P ≤ k / nsim) is exactly k / nsim when both the randomness in the data and the randomness in the simulation are taken into account.
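A sketch of the simulation, under stated assumptions: the data are simulated stand-ins, the null family is normal, and the helper name ks.stat is invented here. In this sketch nsim counts only the simulated samples, so the observed data enter through the added 1 and the possible P-values are multiples of 1 / (nsim + 1).

```r
set.seed(42)
x <- rnorm(50, mean = 10, sd = 3)   # stand-in data, purely for illustration
n <- length(x)
nsim <- 999                         # number of simulated samples

# Two-sided KS statistic with mean and sd estimated from the sample itself
ks.stat <- function(x) {
    n <- length(x)
    p <- pnorm(sort(x), mean = mean(x), sd = sd(x))
    i <- seq_len(n)
    max(pmax(i / n - p, p - (i - 1) / n))
}

dhat <- ks.stat(x)

# The statistic is invariant under location-scale changes, so simulating
# from the standard normal covers every normal distribution
dsim <- replicate(nsim, ks.stat(rnorm(n)))

# Add 1 to numerator and denominator: the observed statistic counts as
# one more draw from the null distribution
p.value <- (sum(dsim >= dhat) + 1) / (nsim + 1)
p.value
```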
The name Lilliefors test only applies to this procedure of using the Kolmogorov-Smirnov test statistic with estimated null distribution when the null distribution is assumed to be normal. In this case, the test is exact because the test statistic and the normal family of distributions are invariant under location-scale transformations.
If the same procedure were used with another family of distributions that was not a location-scale family, then the test would not be exact. It would be a special case of the parametric bootstrap, which we will eventually cover.