Copyright January 04, 2021 by Nathaniel E. Helwig

1 Introduction

1.1 Motivation and Goals

Nonparametric bootstrap sampling offers a robust alternative to classic (parametric) methods for statistical inference. Unlike classic statistical inference methods, which depend on parametric assumptions and/or large sample approximations for valid inference, the nonparametric bootstrap uses computationally intensive methods to provide valid inferential results under a wide collection of data generating conditions.

The default install of R comes with the boot package, which is a collection of bootstrap functions that were originally designed for S (the predecessor of R). The bootstrap package, which is written by the creator of the bootstrap (Brad Efron), is another popular package for nonparametric bootstrapping. Both the boot and bootstrap packages have several good features as well as some undesirable characteristics. This document focuses on the np.boot function in the nptest R package, which combines the strengths of the bootstrap functionalities in the boot and bootstrap packages.

1.2 Sampling with Replacement in R

The nonparametric bootstrap involves randomly sampling data with replacement to form a “new” sample of data, which is referred to as a bootstrap sample.

Given a vector \(\mathbf{x} = (x_1, \ldots, x_n)^\top\) of length \(n\), sampling with replacement involves forming a new vector \[ \mathbf{x}^* = (x_1^*, \ldots, x_n^*)^\top \] where each \(x_i^*\) is independently sampled from \(\mathbf{x}\) with equal probability given to each observation, i.e., \(P(x_i^* = x_j) = 1/n\) for all \(i,j\).

In R, sampling with replacement can be conducted using the sample() function with the replace = TRUE argument.

# generate data
x <- letters[1:5]
x
## [1] "a" "b" "c" "d" "e"
# set random seed (for reproducibility)
set.seed(1)

# without replacement (default)
sample(x)
## [1] "a" "d" "c" "e" "b"
# with replacement
sample(x, replace = TRUE)
## [1] "e" "c" "b" "c" "c"

Note that sampling without replacement is equivalent to permuting the original observations, so each original observation appears once in the resampled vector. In contrast, sampling with replacement produces a new vector that could contain anywhere from 0 to \(n\) occurrences of each original observation.

1.3 Parametric Statistics Primer

Parameters and Statistics

Inferential statistical methods involve specifying some population of interest, and using a sample of data to infer things about the population. Let \(X\) denote the random variable of interest, which is assumed to have some distribution function \(F(x) = P(X < x)\). Suppose that the distribution function depends on the parameter \(\theta = t(F)\), where \(t()\) is some function of \(F\). From the Frequentist perspective, the parameter \(\theta\) is assumed to be some unknown constant that describes the probabilistic nature of the population.

Suppose we have a random sample of data \(\mathbf{x} = (x_1, \ldots, x_n)^\top\) from the population \(F\), where each \(x_i\) is an independent and identically distributed (iid) realization of the random variable \(X\). Any function of the sample of data is a statistic, which is itself a random variable. For example, suppose that we calculate some estimate of the parameter such as \(\hat{\theta} = s(\mathbf{x})\), where \(s()\) is some function of \(\mathbf{x}\). The estimate \(\hat{\theta}\) is a random variable, given that it is a function of the random sample of data \(\mathbf{x}\).

Sampling Distributions

The statistic \(\hat{\theta} = s(\mathbf{x})\) is a random variable that has some probability distribution, which will be denoted by \(G(\theta) = P(\hat{\theta} < \theta)\). The distribution \(G\) is referred to as the sampling distribution of the statistic \(\hat{\theta}\), given that it describes the probabilistic nature of the statistic \(\hat{\theta}\) given a random sample of \(n\) observations from the population distribution \(F\).

The form of the sampling distribution \(G\) will depend on three things:

  1. the original data generating distribution \(F\),
  2. the function \(s()\) used to calculate the statistic,
  3. the sample size \(n\).

For certain combinations of \(F\), \(s\), and \(n\), the sampling distribution will be known exactly. However, in many situations, the sampling distribution of \(\hat{\theta}\) is only known asymptotically, i.e., as \(n \rightarrow \infty\). And, for non-standard statistics, the sampling distribution may not even be known asymptotically.

Example 1: Mean

Suppose that the parameter is the population mean \(\theta = E(X)\), and the statistic is the sample mean \(\hat{\theta} = s(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n x_i\).

If the data generating distribution is Gaussian, i.e., \(X \sim N(\theta, \sigma^2)\), then the sampling distribution is Gaussian, i.e., \(\hat{\theta} \sim N(\theta, \sigma^2 / n)\).

## loop through different n values
par(mfrow = c(2, 2), mar = c(4, 4, 2, 2))
for(n in c(5, 10, 25, 50)){
  
  # generate 10000 means calculated from N(0,1) data
  set.seed(1)
  x <- replicate(10000, mean(rnorm(n)))

  # plot histogram 
  hist(x, freq = FALSE, main = paste0("n = ", n), breaks = 20)

  # add sampling distribution
  xseq <- seq(-1, 1, length.out = 1000)
  lines(xseq, dnorm(xseq, sd = 1 / sqrt(n)))
  
}

If the data generating distribution is some non-Gaussian distribution, then the sampling distribution of \(\hat{\theta}\) is asymptotically Gaussian, i.e., \(G(\theta) \rightarrow N(\theta, \sigma^2 / n)\) as \(n \rightarrow \infty\).

## loop through different n values
par(mfrow = c(2, 2), mar = c(4, 4, 2, 2))
for(n in c(5, 10, 25, 50)){
  
  # generate 10000 means calculated from uniform data
  set.seed(1)
  x <- replicate(10000, mean(runif(n, min = -sqrt(3), max = sqrt(3))))

  # plot histogram 
  hist(x, freq = FALSE, main = paste0("n = ", n), breaks = 20)

  # add sampling distribution
  xseq <- seq(-1, 1, length.out = 1000)
  lines(xseq, dnorm(xseq, sd = 1 / sqrt(n)))
  
}

Example 2: Median

Suppose that the parameter is the population median \(P(X < \theta) = 1/2\), and the statistic is the sample median \(\hat{\theta} = \mathrm{median}(\mathbf{x})\).

As \(n \rightarrow \infty\), the sampling distribution of \(\hat{\theta}\) approaches a normal distribution with mean \(\theta\) and variance \(\frac{1}{4 n f(\theta)^2}\), where \(f(x) = \frac{d}{dx} F(x)\) denotes the data-generating probability density function.

## loop through different n values
par(mfrow = c(2, 2), mar = c(4, 4, 2, 2))
for(n in c(5, 10, 25, 50)){
  
  # generate 10000 medians calculated from uniform data
  set.seed(1)
  x <- replicate(10000, median(runif(n, min = -sqrt(3), max = sqrt(3))))

  # plot histogram 
  hist(x, freq = FALSE, main = paste0("n = ", n), breaks = 20)

  # add sampling distribution
  xseq <- seq(-2, 2, length.out = 1000)
  asymp.se <- 1 / sqrt(4 * n * dunif(0, min = -sqrt(3), max = sqrt(3))^2)
  lines(xseq, dnorm(xseq, sd = asymp.se))
  
}

1.4 Need for Nonparametric Bootstrap

The previous example with the median reveals the need for the nonparametric bootstrap. For many statistics, there is no theoretical result for the exact sampling distribution, so statistical inference may only be possible using large sample approximations. And, in many cases (such as with the sample median), even the asymptotic distribution requires knowledge about the data generating distribution that is not readily obtainable in real data situations. For example, the asymptotic variance of the median requires knowledge of \(f(\theta)\), which would rarely be known in any real data situation. Thus, for statistical inference about a generic parameter (i.e., any parameter that is not the mean), some method is needed that can produce a reasonable approximation of the sampling distribution.

2 Nonparametric Bootstrap Basics

2.1 Bootstrapping Procedure

Suppose that we have an observed sample of data \(\mathbf{x} = (x_1, \ldots, x_n)^\top\) with \(x_i \stackrel{\mathrm{iid}}{\sim}F\) for some distribution \(F\). Furthermore, suppose that \(\theta = t(F)\) is the parameter of interest, and \(\hat{\theta} = s(\mathbf{x})\) is the statistic used to estimate \(\theta\) from the sample of data.

The nonparametric bootstrap procedure is a rather simple idea:

  1. Independently sample \(x_i^*\) with replacement from \(\{x_1, \ldots, x_n\}\) for \(i = 1, \ldots, n\)

  2. Calculate the statistic \(\hat{\theta}{}^* = s(\mathbf{x}^*)\) where \(\mathbf{x}^* = (x_1^*, \ldots, x_n^*)^\top\) is the resampled data

  3. Repeat steps 1-2 a total of \(R\) times to form the bootstrap distribution of \(\hat{\theta}\)

The bootstrap distribution consists of \(R\) estimates of \(\theta\) plus the original estimate \(\hat{\theta}\), i.e., the bootstrap distribution is \(\{\hat{\theta}, \hat{\theta}{}_1^*, \ldots, \hat{\theta}{}_R^*\}\). This bootstrap distribution can be used as a surrogate for the sampling distribution of \(\hat{\theta}\) for the purpose of statistical inference. As will be demonstrated in the following sections, the bootstrap distribution can be used for estimating properties of \(\hat{\theta}\) (e.g., standard error and bias), as well as for forming confidence intervals for \(\theta\).

Note: the number of replications should be rather large, e.g., \(R \geq 9999\), to ensure that any calculations from the bootstrap distribution are not subject to substantial Monte Carlo error. The np.boot function (in the nptest package) uses a default of \(R = 9999\) resamples to form the bootstrap distribution, but it may be desirable to increase this value. See the mcse() function (in the nptest package) for information about how the number of resamples \(R\) relates to the Monte Carlo standard error of the result.
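
To make the three steps concrete, here is a minimal hand-rolled sketch (not using np.boot) that forms the bootstrap distribution of the sample median; the data and variable names are purely illustrative.

# minimal sketch: bootstrap distribution of the sample median
set.seed(1)
x <- rnorm(100)                  # observed sample
theta.hat <- median(x)           # original estimate
R <- 9999                        # number of bootstrap resamples

# steps 1-2 repeated R times: resample with replacement, recompute the statistic
boot.reps <- replicate(R, median(sample(x, replace = TRUE)))

# bootstrap distribution: original estimate plus the R bootstrap replicates
boot.dist <- c(theta.hat, boot.reps)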

2.2 Empirical Distribution

Definition

Suppose that we have an observed sample of data \(\mathbf{x} = (x_1, \ldots, x_n)^\top\) with \(x_i \stackrel{\mathrm{iid}}{\sim}F\) for some distribution \(F\). The empirical cumulative distribution function (ECDF) uses the sample of data to estimate the unknown CDF \(F\). The ECDF is defined as \[ \hat{F}_n(x) = \hat{P}_n(X \leq x) = \frac{1}{n} \sum_{i = 1}^n I(x_i \leq x) \] where \(I()\) is an indicator function. Note that the ECDF assigns probability \(1/n\) to each observation \(x_i\) for \(i \in \{1,\ldots, n\}\), which implies that \[ \hat{P}_n(A) = \frac{1}{n} \sum_{i = 1}^n I(x_i \in A) \] for any set \(A\) in the sample space of \(X\).

Since the ECDF is a proportion estimate from an iid sample of observations, we have that \[ E(\hat{F}_n(x)) = F(x) \] which reveals that the ECDF is unbiased, and \[ \mathrm{Var}(\hat{F}_n(x)) = \frac{1}{n} F(x) (1 - F(x)) \] which reveals that the variance of the ECDF decreases as \(n\) increases. Furthermore, the Glivenko-Cantelli theorem reveals that as \(n \rightarrow \infty\), we have that \[ \sup_{x \in \mathbb{R}} | \hat{F}_n(x) - F(x) | \stackrel{as}{\rightarrow} 0 \] where the notation \(\stackrel{as}{\rightarrow}\) denotes almost sure convergence.
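
As a quick numerical check of these properties, the following sketch simulates the ECDF at the single point \(x_0 = 0\) for standard normal data, where \(F(0) = 0.5\); the simulation settings are illustrative.

# sketch: mean and variance of the ECDF at x0 = 0 for N(0,1) data
set.seed(1)
n <- 50
x0 <- 0
Fhat <- replicate(10000, mean(rnorm(n) <= x0))   # ECDF evaluated at x0
mean(Fhat)     # close to F(x0) = 0.5 (unbiasedness)
var(Fhat)      # close to F(x0) * (1 - F(x0)) / n = 0.005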

Example 1: Normal Distribution

set.seed(1)
par(mfrow = c(2,2), mar = c(4, 4, 2, 2))
n <- c(50, 100, 500, 1000)
xseq <- seq(-4, 4, length.out = 100)
for(j in 1:4){
  x <- rnorm(n[j])
  plot(ecdf(x), main = paste("n = ",n[j]))
  lines(xseq, pnorm(xseq), col = "blue")
}

Example 2: Uniform Distribution

set.seed(1)
par(mfrow = c(2,2), mar = c(4, 4, 2, 2))
n <- c(50, 100, 500, 1000)
xseq <- seq(-2, 2, length.out = 100)
for(j in 1:4){
  x <- runif(n[j], min = -sqrt(3), max = sqrt(3))
  plot(ecdf(x), main = paste("n = ",n[j]))
  lines(xseq, punif(xseq, min = -sqrt(3), max = sqrt(3)), col = "blue")
}

Example 3: Bivariate Data

Consider the following data on average LSAT scores and GPAs for \(n = 15\) law schools. Note that these data represent a random sample of \(n = 15\) law schools from a total collection of \(N = 82\) American law schools that participated in the study.

The data are listed in Table 3.1 of An Introduction to the Bootstrap (Efron & Tibshirani, 1993).

Suppose we want to estimate the proportion of schools that are “safety schools”, i.e., those schools that have an average LSAT score below 600 and an average GPA below 3. Defining \(A = \{(y,z): 0 < y < 600, \ 0 < z < 3.00\}\), we can use the ECDF idea to estimate this proportion: \[ \hat{P}_{15}(A) = \frac{1}{15} \sum_{i=1}^{15} I( (y_{i},z_{i}) \in A) = 5/15 = 1/3 \]

2.3 Plug-In Principle

Definition

Suppose that we have an observed sample of data \(\mathbf{x} = (x_1, \ldots, x_n)^\top\) with \(x_i \stackrel{\mathrm{iid}}{\sim}F\) for some distribution \(F\). Furthermore, suppose that \(\theta = t(F)\) is the parameter of interest. As a reminder, the notation \(\theta = t(F)\) denotes that the parameter is a function \(t()\) of the distribution function. The plug-in estimate of the parameter \(\theta\) is defined as \[ \hat{\theta}_n = t(\hat{F}_n) \] which applies the parameter defining function \(t()\) to the ECDF in place of the CDF. Note that the plug-in estimate \(\hat{\theta}_n\) does not necessarily have to be equivalent to the estimate of the parameter \(\hat{\theta} = s(\mathbf{x})\) that would typically be used in practice.

Example 1: Mean

As a first example, suppose that the parameter is the population mean, i.e., \(\theta = E_F(X) = \int x f(x) dx\), where \(E_F()\) denotes the expectation with respect to \(F\). The plug-in estimate of \(\theta\) has the form \[ \hat{\theta}_n = E_{\hat{F}_n}(X) = \sum_{i = 1}^n x_i \hat{f}_i = \frac{1}{n} \sum_{i = 1}^n x_i = \bar{x} \] which calculates the expectation with respect to \(\hat{F}_n\) in place of \(F\). Note that \(\hat{f}_i = 1/n\) is the probability that the ECDF assigns to the \(i\)-th observation. In this case, the plug-in estimate is the same as the estimate \(\hat{\theta} = s(\mathbf{x}) = \bar{x}\) that would typically be used in practice.

Example 2: Variance

As a second example, suppose that the parameter is the population variance, i.e., \[ \theta = \mathrm{Var}(X) = E_F\left( (X - \mu)^2 \right) = E_F\left( X^2 \right) - \left( E_F(X) \right)^2 \] where \(\mu = E_F(X)\) is the expected value of \(X\). The plug-in estimate of the variance has the form \[ \hat{\theta}_n = E_{\hat{F}_n} \left(X^2 \right) - \left( E_{\hat{F}_n}(X) \right)^2 = \frac{1}{n} \sum_{i = 1}^n x_i^2 - \left( \frac{1}{n} \sum_{i = 1}^n x_i \right)^2 = \frac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})^2 \] which replaces the expectation operator \(E_F()\) with the empirical expectation operator \(E_{\hat{F}_n}()\). In this case, the plug-in estimate of the variance is equivalent to the maximum likelihood estimator of the variance (with a divisor of \(n\)). Note that this differs from the unbiased estimator \[ \hat{\theta} = \frac{1}{n-1} \sum_{i = 1}^n (x_i - \bar{x})^2 \] which is typically preferred over the maximum likelihood estimator \(\hat{\theta}_n\).
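
The following short sketch illustrates the distinction with simulated data; note that R's var() uses the divisor \(n - 1\), so rescaling it by \((n-1)/n\) recovers the plug-in estimate.

# sketch: plug-in variance (divisor n) versus unbiased variance (divisor n - 1)
set.seed(1)
x <- rnorm(25)
n <- length(x)
plugin.var <- mean((x - mean(x))^2)      # plug-in (maximum likelihood) estimate
unbiased.var <- var(x)                   # unbiased estimate with divisor n - 1
plugin.var
unbiased.var
unbiased.var * (n - 1) / n               # rescaled var(x) equals the plug-in estimate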

2.4 Logic of the Bootstrap

For conducting statistical inference with unknown probability distributions, the nonparametric bootstrap treats the ECDF \(\hat{F}_n\) as if it were the true CDF \(F\). In other words, the nonparametric bootstrap treats the observed sample \(\mathbf{x} = (x_1, \ldots, x_n)^\top\) as if it were the true population. As a reminder, the sampling distribution of \(\hat{\theta} = s(\mathbf{x})\) is the distribution of the \(\hat{\theta}\) values that would be observed if the statistic function \(s()\) was applied to a large number of independent samples of data from the population. The nonparametric bootstrap approximates the sampling distribution by applying the statistic function \(s()\) to a large number of independent samples of data from the observed sample. As a result, the nonparametric bootstrap can ONLY approximate the sampling distribution of a statistic well if the ECDF is a good estimate of the unknown data generating CDF. This will be the case if the sample size \(n\) is large, but is not guaranteed (or even likely) for small samples of data.

See Figure 8.1 of An Introduction to the Bootstrap (Efron & Tibshirani, 1993).

3 Nonparametric Bootstrap Applications

3.1 Assessing the Quality of Estimators

Variance and Standard Error

The variance of an estimator \(\hat{\theta} = s(\mathbf{x})\) for some parameter \(\theta = t(F)\) is defined as \[ \mathrm{Var}(\hat{\theta}) = E_F\left( (\hat{\theta} - E_F(\hat{\theta}))^2 \right) = E_F\left( \hat{\theta}{}^2 \right) - \left( E_F(\hat{\theta}) \right)^2 \] where \(E_F()\) denotes that the expectation is calculated with respect to the data generating distribution \(F\). The nonparametric bootstrap can be used to estimate the variance (or standard error) of an estimator \(\hat{\theta} = s(\mathbf{x})\). Given the bootstrap distribution \(\{\hat{\theta}, \hat{\theta}{}_1^*, \ldots, \hat{\theta}{}_R^* \}\), the bootstrap estimate of the variance has the form \[ \widehat{\mathrm{Var}}(\hat{\theta}) = \frac{1}{R} \sum_{r = 0}^R \left( \hat{\theta}{}_r^* - \bar{\theta}{}^* \right)^2 \] where \(\hat{\theta}{}_0^* = \hat{\theta}\) is the estimate of \(\theta\) from the observed sample and \(\bar{\theta}{}^* = \frac{1}{R+1} \sum_{r = 0}^R \hat{\theta}{}_r^*\) is the sample mean of the bootstrap distribution. Note that the estimated variance \(\widehat{\mathrm{Var}}(\hat{\theta})\) is the sample variance of the bootstrap distribution. The corresponding estimate of the standard error has the form \[ \widehat{\mathrm{SE}}(\hat{\theta}) = \sqrt{ \widehat{\mathrm{Var}}(\hat{\theta}) } = \sqrt{ \frac{1}{R} \sum_{r = 0}^R \left( \hat{\theta}{}_r^* - \bar{\theta}{}^* \right)^2 } \] which is the sample standard deviation of the bootstrap distribution. Note that as the number of bootstrap samples gets infinitely large, i.e., as \(R \rightarrow \infty\), the bootstrap estimate of the standard error converges to the plug-in estimate of the standard error of \(\hat{\theta}\).

See Figure 6.1 of An Introduction to the Bootstrap (Efron & Tibshirani, 1993).

Bias

The bias of an estimator \(\hat{\theta} = s(\mathbf{x})\) for some parameter \(\theta = t(F)\) is defined as \[ \mathrm{Bias}(\hat{\theta}) = E_F(\hat{\theta}) - \theta \] where \(E_F()\) denotes that the expectation is calculated with respect to the data generating distribution \(F\). The plug-in estimate of the bias uses the ECDF \(\hat{F}_n\) instead of the unknown true CDF \(F\). Specifically, the plug-in estimate of the bias has the form \[ \widehat{\mathrm{Bias}}_n(\hat{\theta}) = E_{\hat{F}_n}(s(\mathbf{x})) - \hat{\theta}_n \] where \(E_{\hat{F}_n}()\) is the expectation calculated with respect to \(\hat{F}_n\) and \(\hat{\theta}_n = t(\hat{F}_n)\) is the plug-in estimate of \(\theta\). This implies that the bootstrap estimate of the bias can be calculated as \[ \widehat{\mathrm{Bias}}(\hat{\theta}) = \bar{\theta}{}^* - \hat{\theta} \] where \(\bar{\theta}{}^* = \frac{1}{R+1} \sum_{r = 0}^R \hat{\theta}{}_r^*\) is the sample mean of the bootstrap distribution. Note that this assumes that \(\hat{\theta} = t(\hat{F}_n)\), i.e., that the statistic is the plug-in estimate of \(\theta\). If \(\hat{\theta}\) is not the plug-in estimate, then \(\hat{\theta}\) should be replaced by \(t(\hat{F}_n)\) in the definition of \(\widehat{\mathrm{Bias}}(\hat{\theta})\).

Mean Squared Error

The mean squared error (MSE) of an estimator \(\hat{\theta} = s(\mathbf{x})\) for some parameter \(\theta = t(F)\) is defined as \[ \mathrm{MSE}(\hat{\theta}) = E_F\left( (\hat{\theta} - \theta)^2 \right) = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}(\hat{\theta})^2 \] where \(E_F()\) denotes that the expectation is calculated with respect to the data generating distribution \(F\). The bootstrap estimate of the MSE involves calculating the expectation using the ECDF \(\hat{F}_n\) in place of the unknown true CDF \(F\). Note that since the MSE can be decomposed into an additive function of the variance and bias, the bootstrap estimate of the MSE simply involves adding the bootstrap estimate of the variance and (squared) bias.
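
Putting these pieces together, here is a minimal hand-rolled sketch (not using np.boot) that computes the bootstrap estimates of the standard error, bias, and MSE for the sample median; the variable names are illustrative.

# sketch: bootstrap estimates of SE, bias, and MSE for the sample median
set.seed(1)
x <- rnorm(50)
theta.hat <- median(x)
R <- 9999
boot.dist <- c(theta.hat, replicate(R, median(sample(x, replace = TRUE))))

boot.se <- sd(boot.dist)                    # bootstrap standard error
boot.bias <- mean(boot.dist) - theta.hat    # bootstrap bias (median is the plug-in estimate)
boot.mse <- boot.se^2 + boot.bias^2         # variance plus squared bias
c(SE = boot.se, Bias = boot.bias, MSE = boot.mse)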

3.2 Using the np.boot Function

The np.boot function (in the nptest package) can be used to implement the nonparametric bootstrap.

# load 'nptest' R package
library(nptest)
## Loading required package: parallel

Usage:

np.boot(x, statistic, ..., R = 9999, level = c(0.9, 0.95, 0.99),
                method = c("norm", "basic", "perc", "stud", "bca")[-4],
                sdfun = NULL, sdrep = 99, jackknife = NULL,
                parallel = FALSE, cl = NULL, boot.dist = TRUE)

The only two required inputs are the data x and the statistic function statistic. The input ... may also be needed if the statistic function requires additional arguments and/or if the data are multivariate. As will be demonstrated in the following subsections, the input x will differ depending on whether the data are univariate (i.e., a vector) or multivariate (i.e., a matrix or data frame).

3.3 Univariate Data Examples

Overview

Univariate samples of data have the form \(\mathbf{x} = (x_1, \ldots, x_n)^\top\) where each \(x_i \stackrel{\mathrm{iid}}{\sim}F\) for some distribution \(F\). For univariate data, using the np.boot function is rather simple, given that it just requires inputting the data vector x and the statistic function statistic (and possibly additional arguments passed through the ... argument). In the following examples, I will demonstrate how to use the np.boot function with univariate data using predefined functions in R for the statistic function. However, it should be noted that the statistic argument can be a user-defined function, which makes it possible to calculate any statistic that is of interest.
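
For instance, here is a minimal sketch that passes a user-defined statistic (a 20% trimmed mean) to np.boot; the function name trim20 is purely illustrative.

# sketch: np.boot with a user-defined statistic (20% trimmed mean)
# (assumes the nptest package has been loaded, as above)
trim20 <- function(x) mean(x, trim = 0.2)

# generate data and bootstrap the trimmed mean
set.seed(1)
x <- rnorm(100)
npbs <- np.boot(x = x, statistic = trim20)
npbs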

Example 1: Univariate Statistic (Median)

For this example, we will generate \(n = 100\) observations from a standard normal distribution, and use the median as the parameter/statistic of interest. Note that the true (population) median is zero. Since the median is a univariate statistic, the bootstrap distribution will be a vector of length \(R+1\) containing the bootstrap replicates of the median.

# generate 100 standard normal observations
set.seed(1)
n <- 100
x <- rnorm(n)

# nonparametric bootstrap
npbs <- np.boot(x = x, statistic = median)
npbs
## 
## Nonparametric Bootstrap of Univariate Statistic
## using R = 9999 bootstrap replicates
## 
##   t0: 0.1139
##   SE: 0.1394
## Bias: 0.0185 
## 
## BCa Confidence Intervals:
##       lower  upper
## 90% -0.0566 0.3411
## 95% -0.0811 0.3673
## 99% -0.1351 0.3940
# check t0, SE, and bias
median(x)                          # t0
## [1] 0.1139092
sd(npbs$boot.dist)                 # SE
## [1] 0.1393567
mean(npbs$boot.dist) - npbs$t0     # Bias
## [1] 0.01845341
# bootstrap distribution
hist(npbs$boot.dist, xlab = "Statistic", main = "Bootstrap Distribution")
box()
abline(v = npbs$t0, lty = 2, col = "red")
legend("topleft", "t0", lty = 2, col = "red", bty = "n")

Example 2: Multivariate Statistic (Quartiles)

For this example, we will generate \(n = 100\) observations from a standard normal distribution, and use the quartiles as the parameters/statistics of interest. Note that the true (population) quartiles are Q1 = qnorm(0.25) = -0.6744898, Q2 = qnorm(0.5) = 0, and Q3 = qnorm(0.75) = 0.6744898. Since the quartiles are a multivariate statistic, the bootstrap distribution will be a matrix of dimension \((R+1) \times 3\), where each column contains the bootstrap replicates of the corresponding quartile.

# generate 100 standard normal observations
set.seed(1)
n <- 100
x <- rnorm(n)

# nonparametric bootstrap (using ... to enter 'probs' argument)
npbs <- np.boot(x = x, statistic = quantile, 
                probs = c(0.25, 0.5, 0.75))
npbs
## 
## Nonparametric Bootstrap of Multivariate Statistic
## using R = 9999 bootstrap replicates
## 
##           25%    50%     75%
##   t0: -0.4942 0.1139  0.6915
##   SE:  0.1172 0.1394  0.0933
## Bias:  0.0058 0.0185 -0.0170
## 
## 95% BCa Confidence Intervals:
##           25%     50%    75%
## lower -0.6941 -0.0811 0.5047
## upper -0.2534  0.3673 0.8811
# check t0, SE, and bias
quantile(x, probs = c(0.25, 0.5, 0.75))     # t0
##        25%        50%        75% 
## -0.4942425  0.1139092  0.6915454
apply(npbs$boot.dist, 2, sd)                # SE
##        25%        50%        75% 
## 0.11724637 0.13935672 0.09333269
colMeans(npbs$boot.dist) - npbs$t0          # Bias
##          25%          50%          75% 
##  0.005771337  0.018453409 -0.017043706
# check cov
npbs$cov
##             25%         50%         75%
## 25% 0.013746712 0.009523836 0.003761904
## 50% 0.009523836 0.019420295 0.007248275
## 75% 0.003761904 0.007248275 0.008710991
cov(npbs$boot.dist)
##             25%         50%         75%
## 25% 0.013746712 0.009523836 0.003761904
## 50% 0.009523836 0.019420295 0.007248275
## 75% 0.003761904 0.007248275 0.008710991
# bootstrap distribution
par(mfrow = c(1,3))
for(j in 1:3){
  hist(npbs$boot.dist[,j], xlab = "Statistic", 
     main = paste0("Bootstrap Distribution", ": Q", j))
  box()
  abline(v = npbs$t0[j], lty = 2, col = "red")
  legend("topright", paste0("t0[",j,"]"), lty = 2, col = "red", bty = "n")
}

3.4 Multivariate Data Examples

Overview

Multivariate samples of data have the form \(\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]^\top\) where \(\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})^\top\) is the \(i\)-th observation’s vector of length \(p\). This implies that \(\mathbf{X}\) is a data matrix (or data frame) of dimension \(n \times p\), where \(n\) is the number of observations and \(p\) is the number of variables. It is assumed that \(\mathbf{x}_i \stackrel{\mathrm{iid}}{\sim}F\), where \(F\) is some \(p\)-variate distribution.

For multivariate data, the np.boot function uses syntax similar to the bootstrap functions in the bootstrap package (which are designed to be more flexible than those in the boot package). In this case, the x input should be the integers \(1,\ldots,n\), and the data matrix \(\mathbf{X}\) should be passed using the ... argument to the np.boot function. As will be demonstrated in the examples, this approach makes it possible to flexibly use the np.boot function with a variety of user-defined statistic functions.

Example 1: Univariate Statistic (Correlation)

For this example, we will generate \(n = 100\) observations from a bivariate normal distribution, and use the correlation as the parameter/statistic of interest. Note that the true (population) correlation is \(\rho = 1/2\). Since the correlation is a univariate statistic, the bootstrap distribution will be a vector of length \(R+1\) containing the bootstrap replicates of the correlation coefficient.

# correlation matrix square root (with rho = 0.5)
rho <- 0.5
val <- c(sqrt(1 + rho), sqrt(1 - rho))
corsqrt <- matrix(c(val[1], -val[2], val), 2, 2) / sqrt(2)

# generate 100 bivariate observations (with rho = 0.5)
n <- 100
set.seed(1)
data <- cbind(rnorm(n), rnorm(n)) %*% corsqrt

# define statistic function
statfun <- function(x, data) cor(data[x,1], data[x,2])

# nonparametric bootstrap
npbs <- np.boot(x = 1:n, statistic = statfun, data = data)
npbs
## 
## Nonparametric Bootstrap of Univariate Statistic
## using R = 9999 bootstrap replicates
## 
##   t0: 0.4502
##   SE: 0.0851
## Bias: -0.0025 
## 
## BCa Confidence Intervals:
##      lower  upper
## 90% 0.2974 0.5780
## 95% 0.2660 0.6011
## 99% 0.2046 0.6468
# check t0, SE, and bias
statfun(1:n, data)                 # t0
## [1] 0.4502206
sd(npbs$boot.dist)                 # SE
## [1] 0.08514082
mean(npbs$boot.dist) - npbs$t0     # Bias
## [1] -0.002499469
# bootstrap distribution
hist(npbs$boot.dist, xlab = "Statistic", main = "Bootstrap Distribution")
box()
abline(v = npbs$t0, lty = 2, col = "red")
legend("topleft", "t0", lty = 2, col = "red", bty = "n")

Example 2: Multivariate Statistic (Variances and Covariance)

For this example, we will generate \(n = 100\) observations from a bivariate normal distribution, and use the variances and covariance as the parameters/statistics of interest. Note that the true (population) variances are \(\sigma_1^2 = \sigma_2^2 = 1\) and the true (population) covariance is \(\sigma_{12} = 1/2\). Since the lower-triangle of the covariance matrix is a multivariate statistic, the bootstrap distribution will be a matrix of dimension \((R+1) \times 3\), where each column contains the bootstrap replicates of the corresponding variance/covariance parameter.

# correlation matrix square root (with rho = 0.5)
rho <- 0.5
val <- c(sqrt(1 + rho), sqrt(1 - rho))
corsqrt <- matrix(c(val[1], -val[2], val), 2, 2) / sqrt(2)

# generate 100 bivariate observations (with rho = 0.5)
n <- 100
set.seed(1)
data <- cbind(rnorm(n), rnorm(n)) %*% corsqrt

# define statistic function
statfun <- function(x, data) {
  cmat <- cov(data[x,])
  ltri <- lower.tri(cmat, diag = TRUE)
  cvec <- cmat[ltri]
  names(cvec) <- c("var(x1)", "cov(x1,x2)", "var(x2)")
  cvec
}

# nonparametric bootstrap
npbs <- np.boot(x = 1:n, statistic = statfun, data = data)
npbs
## 
## Nonparametric Bootstrap of Multivariate Statistic
## using R = 9999 bootstrap replicates
## 
##       var(x1) cov(x1,x2) var(x2)
##   t0:  0.8352     0.3757  0.8337
##   SE:  0.1054     0.0957  0.1076
## Bias: -0.0071    -0.0026 -0.0072
## 
## 95% BCa Confidence Intervals:
##       var(x1) cov(x1,x2) var(x2)
## lower  0.6503     0.2088  0.6563
## upper  1.0696     0.5910  1.0924
# check t0, SE, and bias
statfun(1:n, data)                          # t0
##    var(x1) cov(x1,x2)    var(x2) 
##  0.8351955  0.3756885  0.8337138
apply(npbs$boot.dist, 2, sd)                # SE
##    var(x1) cov(x1,x2)    var(x2) 
## 0.10542177 0.09574697 0.10758354
colMeans(npbs$boot.dist) - npbs$t0          # Bias
##      var(x1)   cov(x1,x2)      var(x2) 
## -0.007087959 -0.002574341 -0.007184504
# check cov
npbs$cov
##                var(x1)  cov(x1,x2)     var(x2)
## var(x1)    0.011113749 0.005895733 0.003562446
## cov(x1,x2) 0.005895733 0.009167483 0.007013532
## var(x2)    0.003562446 0.007013532 0.011574218
cov(npbs$boot.dist)
##                var(x1)  cov(x1,x2)     var(x2)
## var(x1)    0.011113749 0.005895733 0.003562446
## cov(x1,x2) 0.005895733 0.009167483 0.007013532
## var(x2)    0.003562446 0.007013532 0.011574218
# bootstrap distribution
labs <- c(expression(paste("Bootstrap Distribution: ", sigma[1]^2)),
          expression(paste("Bootstrap Distribution: ", sigma[12])),
          expression(paste("Bootstrap Distribution: ", sigma[2]^2)))
par(mfrow = c(1,3))
for(j in 1:3){
  hist(npbs$boot.dist[,j], xlab = "Statistic", main = labs[j]) 
  box()
  abline(v = npbs$t0[j], lty = 2, col = "red")
  legend("topright", paste0("t0[",j,"]"), lty = 2, col = "red", bty = "n")
}

4 Bootstrapping Regression Models

4.1 Some Regression Basics

Assumed Model (Scalar Form)

Consider the linear regression model of the form \[ y_i = \beta_0 + \sum_{j = 1}^p \beta_j x_{ij} + \epsilon_i \] for \(i = 1, \ldots, n\), where \(y_i\) is the response variable for the \(i\)-th observation, \(x_{ij}\) is the \(j\)-th predictor for the \(i\)-th observation, \(\beta_0\) is the regression intercept parameter, \(\beta_j\) is the regression slope for the \(j\)-th predictor, and \(\epsilon_i \stackrel{\mathrm{iid}}{\sim}(0, \sigma^2)\) is the error term for the \(i\)-th observation. This implies that the conditional mean of \(y_i\) given \(\mathbf{x}_i = (x_{i1}, \ldots, x_{ip})^\top\) has the form \[ E(y_i | \mathbf{x}_i) = \beta_0 + \sum_{j = 1}^p \beta_j x_{ij} \] and the conditional variance of \(y_i\) given \(\mathbf{x}_i\) is \(\mathrm{Var}(y_i | \mathbf{x}_i) = \sigma^2\).

Assumed Model (Matrix Form)

In matrix form, the linear regression model has the form \[ \mathbf{y} = \mathbf{X} \boldsymbol\beta + \boldsymbol\epsilon \] where \(\mathbf{y} = (y_1, \ldots, y_n)^\top\) is the observed response vector, \(\mathbf{X}\) is the \(n \times p+1\) design matrix with \(i\)-th row equal to \((1, x_{i1}, \ldots, x_{ip})\), \(\boldsymbol\beta = (\beta_0, \beta_1, \ldots, \beta_p)^\top\) is the regression coefficient vector, and \(\boldsymbol\epsilon = (\epsilon_1, \ldots, \epsilon_n)^\top\) is the error vector. Writing out the vectors and matrices, the regression model has the form \[ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \\ \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \] The assumptions from the previous section imply that \(E(\mathbf{y} | \mathbf{X}) = \mathbf{X} \boldsymbol\beta\) and \(\mathrm{Cov}(\mathbf{y} | \mathbf{X}) = \sigma^2 \mathbf{I}_n\), where \(\mathbf{I}_n\) denotes the identity matrix of order \(n\).

Coefficient Estimates and Fitted Values

The ordinary least squares coefficient estimates are the coefficients that minimize the least squares loss function \[ \sum_{i = 1}^n \left(y_i - \beta_0 - \sum_{j = 1}^p \beta_j x_{ij} \right)^2 \] which can be written more compactly in matrix form such as \[ \| \mathbf{y} - \mathbf{X} \boldsymbol\beta \|^2 \] where \(\| \cdot \|\) denotes the Euclidean norm. It is well known that the ordinary least squares coefficient estimates have the form \[ \hat{\boldsymbol\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \] and the corresponding fitted values have the form \[ \hat{\mathbf{y}} = \mathbf{X} \hat{\boldsymbol\beta} = \mathbf{H} \mathbf{y} \] where \(\mathbf{H} = \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top\) is the “hat matrix” for the linear model.
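
As a quick sanity check of these formulas, the following sketch compares the matrix expressions to the output of lm() on simulated data; the data generating values are illustrative.

# sketch: check the OLS formulas against lm() on simulated data
set.seed(1)
n <- 20
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)

X <- cbind(1, x)                                  # design matrix
beta.hat <- solve(crossprod(X), crossprod(X, y))  # (X'X)^{-1} X'y
H <- X %*% solve(crossprod(X)) %*% t(X)           # hat matrix
cbind(beta.hat, coef(lm(y ~ x)))                  # same coefficient estimates
max(abs(H %*% y - fitted(lm(y ~ x))))             # fitted values agree (numerically zero)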

Statistical Inference

If the model assumptions are correct, the expectation of the estimated coefficient vector has the form \[ \begin{split} E\left( \hat{\boldsymbol\beta} \right) &= E\left( (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \right) \\ &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top E( \mathbf{y} ) \\ &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{X} \boldsymbol\beta \\ &= \boldsymbol\beta \end{split} \] which reveals that the coefficient estimates are unbiased. The covariance matrix of the coefficients have the form \[ \begin{split} \mathrm{Cov}\left( \hat{\boldsymbol\beta} \right) &= \mathrm{Cov}\left( (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \right) \\ &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathrm{Cov}( \mathbf{y} ) \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \\ &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top ( \sigma^2 \mathbf{I}_n ) \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1} \end{split} \] Similar arguments can be used to show that the expected value and covariance matrix for the fitted values have the form \[ E\left(\hat{\mathbf{y}}\right) = \mathbf{X} \boldsymbol\beta \qquad \mbox{and} \qquad \mathrm{Cov}\left(\hat{\mathbf{y}}\right) = \sigma^2 \mathbf{H} \] which is due to the fact that \(\hat{\mathbf{y}} = \mathbf{X} \hat{\boldsymbol\beta}\).
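
A similar check can be done for the covariance matrix of the coefficients: with \(\sigma^2\) replaced by its usual unbiased estimate, \(\sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}\) matches the covariance matrix returned by vcov() for an lm fit. A minimal sketch with simulated data:

# sketch: sigma^2 (X'X)^{-1} versus the covariance matrix from lm()
set.seed(1)
n <- 50
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)

X <- cbind(1, x)
fit <- lm(y ~ x)
sigma2.hat <- sum(residuals(fit)^2) / (n - 2)   # unbiased estimate of sigma^2
sigma2.hat * solve(crossprod(X))                # sigma^2 (X'X)^{-1}
vcov(fit)                                       # same values (up to rounding)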

Need for the Nonparametric Bootstrap

The expectations and covariance matrices derived in the previous section depend on the assumption that the model is correctly specified, i.e., that

  1. \(E(y_i | \mathbf{x}_i) = \beta_0 + \sum_{j = 1}^p \beta_j x_{ij}\)

  2. \(\mathrm{Var}(y_i | \mathbf{x}_i) = \sigma^2\)

  3. the \(\epsilon_i\) are independent of one another

Note that even these assumptions are not enough to conduct statistical inference on the coefficient estimates and/or fitted values. If we add the assumption of normality, i.e., that \(\epsilon_i \stackrel{\mathrm{iid}}{\sim}N(0, \sigma^2)\), then the sampling distribution of the coefficient estimates is known to be Gaussian, i.e., \(\hat{\boldsymbol\beta} \sim N(\boldsymbol\beta, \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1})\). This is the approach that is commonly used for inference in regression models, e.g., in the lm function in R. However, in practice, the normality assumption may not be reasonable, which implies that a more general (nonparametric) approach is needed for inference in regression.

In the following subsections, I will cover two different ways that the nonparametric bootstrap can be used for inference in regression. In both of the examples, the statistic of interest is assumed to be the regression coefficient vector \(\boldsymbol\beta\). However, it should be noted that this does not have to be the case. In some applications, it may be more useful to use some function of the coefficient vector as the statistic. Also, it is worth noting that for both of the examples, it is assumed that the linear regression model is the focus. The first method (bootstrapping residuals) is specifically designed for this linear model, but the second method (bootstrapping cases) could be applied to more general regression scenarios, e.g., generalized and/or nonparametric regression models.

4.2 Method 1: Bootstrapping Residuals

Overview

As its name implies, the bootstrapping residuals approach involves resampling (with replacement) the residuals from a fit regression model to form many bootstrap replicates of the coefficient vector. Note that the residuals are defined as \[ \hat{\epsilon}_i = y_i - \hat{y}_i \qquad \longleftrightarrow \qquad \hat{\boldsymbol\epsilon} = \mathbf{y} - \hat{\mathbf{y}} \] where \(\hat{y}_i\) is the \(i\)-th observation’s fitted value, i.e., the \(i\)-th element of \(\hat{\mathbf{y}}\). The procedure for bootstrapping the residuals is as follows:

  1. Independently sample \(\hat{\epsilon}_i^*\) with replacement from \(\{\hat{\epsilon}_1, \ldots, \hat{\epsilon}_n\}\) and define \(y_i^* = \hat{y}_i + \hat{\epsilon}_i^*\) for \(i = 1, \ldots, n\)

  2. Calculate the statistic \(\hat{\boldsymbol\beta}{}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}^*\) where \(\mathbf{y}^* = (y_1^*, \ldots, y_n^*)^\top\) is the resampled response vector.

  3. Repeat steps 1-2 a total of \(R\) times to form the bootstrap distribution of \(\hat{\boldsymbol\beta}\).

Note that this approach requires the model assumptions to be correct. More specifically, for the bootstrapping residuals method to provide reasonable estimates of standard errors and bias, it requires the three assumptions listed in the previous subsection. If the model mean structure and/or error variance structure are misspecified (which is likely the case with real data), then the bootstrapping residuals approach may produce invalid results.

As will be discussed in the next subsection, the bootstrapping cases approach requires fewer assumptions to produce valid results. So it is reasonable to ask: then why would one ever consider using the bootstrapping residuals approach? This is a reasonable question, particularly given modern computing power. Note that the benefit of the bootstrapping residuals approach is the computational efficiency of the method. Since the design matrix \(\mathbf{X}\) remains unchanged across the bootstrap replicates, it is possible to initialize the \((\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top\) matrix, which is needed for calculating the bootstrap replicates of the statistic. However, unless the sample size and/or number of predictors is quite large, the bootstrapping residuals approach does not result in substantial gains in computation time.

Example 1: Correctly Specified Model

In this example, I will generate \(n = 100\) observations from a simple linear regression model where the error terms satisfy \(\epsilon_i \stackrel{\mathrm{iid}}{\sim}N(0, \sigma^2)\). Note that the true regression coefficients are \(\beta_0 = 0\) and \(\beta_1 = 1\), and the error variance is \(\sigma^2 = 1\).

# generate 100 observations
n <- 100
set.seed(1)
x <- seq(0, 1, length.out = n)
y <- x + rnorm(n)

# prepare data
xmat <- cbind(1, x)
xinv <- solve(crossprod(xmat)) %*% t(xmat)
fit <- xmat %*% xinv %*% y
data <- list(fit = fit, resid = y - fit, xinv = xinv, x = x)

# define statistic function
statfun <- function(x, data) {
  ynew <- data$fit + data$resid[x]
  coef <- as.numeric(data$xinv %*% ynew)
  names(coef) <- c("(Intercept)", "x")
  coef
}

# define jackknife function
jackfun <- function(x, data){
  ynew <- data$fit[x] + data$resid[x]
  xmat <- cbind(1, data$x[x])
  xinv <- solve(crossprod(xmat)) %*% t(xmat)
  coef <- as.numeric(xinv %*% ynew)
  names(coef) <- c("(Intercept)", "x")
  coef
}

# nonparametric bootstrap
npbs <- np.boot(x = 1:n, statistic = statfun, data = data, 
                jackknife = jackfun)
npbs
## 
## Nonparametric Bootstrap of Multivariate Statistic
## using R = 9999 bootstrap replicates
## 
##       (Intercept)       x
##   t0:      0.1312  0.9553
##   SE:      0.1779  0.3062
## Bias:      0.0039 -0.0043
## 
## 95% BCa Confidence Intervals:
##       (Intercept)      x
## lower     -0.2311 0.3622
## upper      0.4657 1.5718

Example 2: Heteroskedastic Errors

This example is the same as the previous example, except that now the error terms have different variances, i.e., \(\epsilon_i \stackrel{\mathrm{ind}}{\sim}N(0, \sigma_i^2)\) where \(\sigma_i^2 = 1 + 4 (x_i - 0.5)^2\). Note that \(\sigma_i^2 \geq 1\) for all \(i = 1,\ldots,n\), so the error variance is larger in this example.

# generate 100 observations
n <- 100
set.seed(1)
x <- seq(0, 1, length.out = n)
y <- x + rnorm(n, sd = sqrt(1 + 4 * (x - 0.5)^2))

# prepare data
xmat <- cbind(1, x)
xinv <- solve(crossprod(xmat)) %*% t(xmat)
fit <- xmat %*% xinv %*% y
data <- list(fit = fit, resid = y - fit, xinv = xinv, x = x)

# define statistic function
statfun <- function(x, data) {
  ynew <- data$fit + data$resid[x]
  coef <- as.numeric(data$xinv %*% ynew)
  names(coef) <- c("(Intercept)", "x")
  coef
}

# define jackknife function
jackfun <- function(x, data){
  ynew <- data$fit[x] + data$resid[x]
  xmat <- cbind(1, data$x[x])
  xinv <- solve(crossprod(xmat)) %*% t(xmat)
  coef <- as.numeric(xinv %*% ynew)
  names(coef) <- c("(Intercept)", "x")
  coef
}

# nonparametric bootstrap
npbs <- np.boot(x = 1:n, statistic = statfun, data = data, 
                jackknife = jackfun)
npbs
## 
## Nonparametric Bootstrap of Multivariate Statistic
## using R = 9999 bootstrap replicates
## 
##       (Intercept)       x
##   t0:      0.1585  0.9275
##   SE:      0.2061  0.3550
## Bias:      0.0047 -0.0054
## 
## 95% BCa Confidence Intervals:
##       (Intercept)      x
## lower     -0.2562 0.2373
## upper      0.5460 1.6410

4.3 Method 2: Bootstrapping Cases

Overview

The bootstrapping cases approach is similar to the “Multivariate Data Examples” that were covered in Section 3.4. In this case, the multivariate data are assumed to have the form \(\mathbf{Z} = [\mathbf{z}_1, \ldots, \mathbf{z}_n]^\top\) where \(\mathbf{z}_i = (y_i, x_{i1}, \ldots, x_{ip})^\top\) is the \(i\)-th observation’s vector of length \(p + 1\). This implies that \(\mathbf{Z}\) is a data matrix (or data frame) of dimension \(n \times (p+1)\), where \(n\) is the number of observations and \(p + 1\) is the total number of variables (\(p\) predictors plus 1 response). It is assumed that \(\mathbf{z}_i \stackrel{\mathrm{iid}}{\sim}F\), where \(F\) is some multivariate distribution.

For the bootstrapping cases method, the procedure is as follows:

  1. Independently sample \(\mathbf{z}_i^*\) with replacement from \(\{\mathbf{z}_1, \ldots, \mathbf{z}_n\}\) for \(i = 1, \ldots, n\).

  2. Calculate the statistic \(\hat{\boldsymbol\beta}^* = (\mathbf{X}_*^\top \mathbf{X}_*)^{-1} \mathbf{X}_*^\top \mathbf{y}^*\) where the \(i\)-th row of \(\mathbf{X}_*\) is defined as \(\mathbf{x}_i^* = (1, x_{i1}^*, \ldots, x_{ip}^*)^\top\) and \(\mathbf{y}^* = (y_1^*, \ldots, y_n^*)^\top\).

  3. Repeat steps 1-2 a total of \(R\) times to form the bootstrap distribution of \(\hat{\boldsymbol\beta}\).

Note that this approach does not require the regression model to be correct. Instead, it just requires that the rows of the matrix \(\mathbf{Z}\) are independently sampled from some multivariate distribution. The primary downside of the bootstrapping cases approach is that it is more computationally costly than bootstrapping residuals. This is because the design matrix \(\mathbf{X}_*\) will be different for each bootstrap sample, which implies that the \((\mathbf{X}_*^\top \mathbf{X}_*)^{-1} \mathbf{X}_*^\top\) matrix will need to be recalculated for each bootstrap replicate of the statistic. Furthermore, caution is needed when using the bootstrapping cases approach with categorical and/or skewed predictors, because some bootstrap samples may exclude levels of categorical predictors and/or extreme values of skewed predictors, which can produce misleading results.

Example 1: Correctly Specified Model

In this example, I will generate \(n = 100\) observations from a simple linear regression model where the error terms satisfy \(\epsilon_i \stackrel{\mathrm{iid}}{\sim}N(0, \sigma^2)\). Note that the true regression coefficients are \(\beta_0 = 0\) and \(\beta_1 = 1\), and the error variance is \(\sigma^2 = 1\).

# generate 100 observations
n <- 100
set.seed(1)
x <- seq(0, 1, length.out = n)
y <- x + rnorm(n)
data <- data.frame(x = x, y = y)

# define statistic function
statfun <- function(x, data) {
  xmat <- cbind(1, data$x[x])
  xinv <- solve(crossprod(xmat)) %*% t(xmat)
  coef <- as.numeric(xinv %*% data$y[x])
  names(coef) <- c("(Intercept)", "x")
  coef
}

# nonparametric bootstrap
npbs <- np.boot(x = 1:n, statistic = statfun, data = data)
npbs
## 
## Nonparametric Bootstrap of Multivariate Statistic
## using R = 9999 bootstrap replicates
## 
##       (Intercept)      x
##   t0:      0.1312 0.9553
##   SE:      0.1758 0.3094
## Bias:     -0.0007 0.0054
## 
## 95% BCa Confidence Intervals:
##       (Intercept)      x
## lower     -0.2276 0.3575
## upper      0.4650 1.5587

Example 2: Heteroskedastic Errors

This example is the same as the previous example, except that now the error terms have different variances, i.e., \(\epsilon_i \stackrel{\mathrm{ind}}{\sim}N(0, \sigma_i^2)\) where \(\sigma_i^2 = 1 + 4 (x_i - 0.5)^2\). Note that \(\sigma_i^2 \geq 1\) for all \(i = 1,\ldots,n\), so the error variance is larger in this example.

# generate 100 observations
n <- 100
set.seed(1)
x <- seq(0, 1, length.out = n)
y <- x + rnorm(n, sd = sqrt(1 + 4 * (x - 0.5)^2))
data <- data.frame(x = x, y = y)

# define statistic function
statfun <- function(x, data) {
  xmat <- cbind(1, data$x[x])
  xinv <- solve(crossprod(xmat)) %*% t(xmat)
  coef <- as.numeric(xinv %*% data$y[x])
  names(coef) <- c("(Intercept)", "x")
  coef
}

# nonparametric bootstrap
npbs <- np.boot(x = 1:n, statistic = statfun, data = data)
npbs
## 
## Nonparametric Bootstrap of Multivariate Statistic
## using R = 9999 bootstrap replicates
## 
##       (Intercept)      x
##   t0:      0.1585 0.9275
##   SE:      0.2188 0.3920
## Bias:     -0.0010 0.0067
## 
## 95% BCa Confidence Intervals:
##       (Intercept)      x
## lower     -0.2867 0.1619
## upper      0.5730 1.6831

4.4 Example: SAT and College GPA

Data and Hypotheses

We will use the SAT and College GPA example from http://onlinestatbook.com/2/case_studies/sat.html

The dataset contains information from \(n = 105\) students who graduated from a state university with a B.S. degree in computer science.

  • high_GPA = High school grade point average

  • math_SAT = Math SAT score

  • verb_SAT = Verbal SAT score

  • comp_GPA = Computer science grade point average

  • univ_GPA = Overall university grade point average

R code to read in and look at the data

# read-in data
sat <- read.table("http://onlinestatbook.com/2/case_studies/data/sat.txt", 
                  header = TRUE)
head(sat)
##   high_GPA math_SAT verb_SAT comp_GPA univ_GPA
## 1     3.45      643      589     3.76     3.52
## 2     2.78      558      512     2.87     2.91
## 3     2.52      583      503     2.54     2.40
## 4     3.67      685      602     3.83     3.47
## 5     3.24      592      538     3.29     3.47
## 6     2.10      562      486     2.64     2.37
# plot data
plot(sat$high_GPA, sat$univ_GPA, 
     xlab = "High school GPA", ylab = "University GPA")
abline(lm(univ_GPA ~ high_GPA, data = sat))

Is University GPA Linearly Related to High School GPA?

Consider the simple linear regression model \[ Y = \beta_0 + X \beta_1 + \epsilon \] where \(Y\) is the University GPA and \(X\) is the high school GPA.

Bootstrapping Residuals:

# prepare data
xmat <- cbind(1, sat$high_GPA)
xinv <- solve(crossprod(xmat)) %*% t(xmat)
fit <- xmat %*% xinv %*% sat$univ_GPA
data <- list(fit = fit, resid = sat$univ_GPA - fit, xinv = xinv, x = sat$high_GPA)

# define statistic function
statfun <- function(x, data) {
  ynew <- data$fit + data$resid[x]
  coef <- as.numeric(data$xinv %*% ynew)
  names(coef) <- c("(Intercept)", "high_GPA")
  coef
}

# define jackknife function
jackfun <- function(x, data){
  ynew <- data$fit[x] + data$resid[x]
  xmat <- cbind(1, data$x[x])
  xinv <- solve(crossprod(xmat)) %*% t(xmat)
  coef <- as.numeric(xinv %*% ynew)
  names(coef) <- c("(Intercept)", "high_GPA")
  coef
}

# nonparametric bootstrap
npbs <- np.boot(x = 1:nrow(sat), statistic = statfun, 
                data = data, jackknife = jackfun)
npbs
## 
## Nonparametric Bootstrap of Multivariate Statistic
## using R = 9999 bootstrap replicates
## 
##       (Intercept) high_GPA
##   t0:      1.0968   0.6748
##   SE:      0.1658   0.0531
## Bias:      0.0000  -0.0001
## 
## 95% BCa Confidence Intervals:
##       (Intercept) high_GPA
## lower      0.7938   0.5643
## upper      1.4392   0.7710

Bootstrapping Cases:

# define data
data <- data.frame(x = sat$high_GPA, y = sat$univ_GPA)

# define statistic function
statfun <- function(x, data) {
  xmat <- cbind(1, data$x[x])
  xinv <- solve(crossprod(xmat)) %*% t(xmat)
  coef <- as.numeric(xinv %*% data$y[x])
  names(coef) <- c("(Intercept)", "high_GPA")
  coef
}

# nonparametric bootstrap
npbs <- np.boot(x = 1:nrow(sat), statistic = statfun, data = data)
npbs
## 
## Nonparametric Bootstrap of Multivariate Statistic
## using R = 9999 bootstrap replicates
## 
##       (Intercept) high_GPA
##   t0:      1.0968   0.6748
##   SE:      0.2080   0.0607
## Bias:     -0.0038   0.0012
## 
## 95% BCa Confidence Intervals:
##       (Intercept) high_GPA
## lower      0.7100   0.5479
## upper      1.5293   0.7879

Note that the bootstrapping cases approach produces larger estimates of the standard errors for both the intercept and slope. In this case, the bootstrapping cases approach should be preferred because the data appear to have heteroskedastic errors.

5 Bootstrap Confidence Intervals

5.1 Definition and Interpretation

Definition of a Confidence Interval

Suppose that we have an independent and identically distributed (iid) sample of data \(x_1, \ldots, x_n\) where \(x_i \stackrel{\mathrm{iid}}{\sim}F\) for some distribution \(F\). Furthermore, suppose that the distribution \(F\) depends on some parameter \(\theta = t(F)\), and we have formed an estimate of the parameter, denoted by \(\hat{\theta} = s(\mathbf{x})\), where \(\mathbf{x} = (x_1, \ldots, x_n)^\top\) is the sample of data. As a reminder, the notation \(\theta = t(F)\) denotes that the parameter is a function of the distribution, and the notation \(\hat{\theta} = s(\mathbf{x})\) denotes that the estimate is a function of the sample.

Given a confidence level \(1 - \alpha \in (0, 1)\), the probabilistic statement \[ P\left( a(\hat{\theta}) < \theta < b(\hat{\theta}) \right) = 1 - \alpha \] defines a \(100 (1 - \alpha)\)% confidence interval for the unknown parameter \(\theta\). The confidence interval endpoints \(a(\cdot)\) and \(b(\cdot)\) are functions of the estimate \(\hat{\theta}\), which is a random variable. Thus, the \(100 (1 - \alpha)\)% confidence interval provides a range of values depending on \(\hat{\theta}\) such that the probability of \(\theta\) being within the interval is \(1 - \alpha\).

Example 1: Confidence Interval for \(\mu\)

Suppose that \(x_i \stackrel{\mathrm{iid}}{\sim}N(\mu, \sigma^2)\) for \(i = 1,\ldots,n\) and we want to form a confidence interval for \(\mu\). As an estimate of \(\mu\), we will use the sample mean \(\bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i\). Assuming that \(x_i \stackrel{\mathrm{iid}}{\sim}N(\mu, \sigma^2)\), we know that \(\bar{x} \sim N(\mu, \sigma^2/n)\), which implies that \(\sqrt{n}(\bar{x} - \mu)/\sigma \sim N(0,1)\). As a result, we have that \[ P\left( z_{\alpha/2} < \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} < z_{1 - \alpha/2} \right) = 1 - \alpha \] where \(z_{\alpha} = \Phi^{-1}(\alpha)\) with \(\Phi^{-1}(\cdot)\) denoting the quantile function for the standard normal distribution. Rearranging the terms inside the above probability statement gives \[ \begin{split} 1 - \alpha &= P\left( z_{\alpha/2} \sigma / \sqrt{n} < \bar{x} - \mu < z_{1 - \alpha/2} \sigma / \sqrt{n} \right) \\ &= P\left( z_{\alpha/2} \sigma / \sqrt{n} - \bar{x} < - \mu < z_{1 - \alpha/2} \sigma / \sqrt{n} - \bar{x} \right) \\ &= P\left( \bar{x} - z_{\alpha/2} \sigma / \sqrt{n} > \mu > \bar{x} - z_{1 - \alpha/2} \sigma / \sqrt{n} \right) \end{split} \] which implies that a \(100 (1 - \alpha)\)% confidence interval for \(\mu\) defines \(a(\bar{x}) = \bar{x} - z_{1 - \alpha/2} \sigma / \sqrt{n}\) and \(b(\bar{x}) = \bar{x} - z_{\alpha/2} \sigma / \sqrt{n}\). Note that since \(-z_{\alpha/2} = z_{1 - \alpha/2}\) we can write the two endpoints of the confidence interval as \[ \bar{x} \pm z_{1 - \alpha/2} \mathrm{SE}(\bar{x}) \] where \(\mathrm{SE}(\bar{x}) = \sigma / \sqrt{n}\) is the standard error of the sample mean. In practice, it is typical to form a 90% confidence interval (i.e., \(\alpha = 0.1\)), which corresponds to \(z_{0.95} \approx 1.65\), a 95% confidence interval (i.e., \(\alpha = 0.05\)), which corresponds to \(z_{0.975} \approx 1.96\), or a 99% confidence interval (i.e., \(\alpha = 0.01\)), which corresponds to \(z_{0.995} = 2.58\).
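
A minimal numeric sketch of this interval with simulated data (treating \(\sigma = 1\) as known), along with the standard normal quantiles mentioned above:

# sketch: 95% normal-theory CI for mu with known sigma = 1
set.seed(1)
n <- 25
x <- rnorm(n)                                  # true mu = 0, sigma = 1
mean(x) + c(-1, 1) * qnorm(0.975) / sqrt(n)    # xbar +/- z_{0.975} * sigma / sqrt(n)

# standard normal quantiles for 90%, 95%, and 99% intervals
qnorm(c(0.95, 0.975, 0.995))                   # approximately 1.645, 1.960, 2.576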

Example 2: Confidence Interval for \(\sigma^2\)

Suppose that \(x_i \stackrel{\mathrm{iid}}{\sim}N(\mu, \sigma^2)\) for \(i = 1,\ldots,n\) and we want to form a confidence interval for \(\sigma^2\). As an estimate of \(\sigma^2\), we will use the sample variance \(s^2 = \frac{1}{n-1} \sum_{i = 1}^n (x_i - \bar{x})^2\). Assuming that \(x_i \stackrel{\mathrm{iid}}{\sim}N(\mu, \sigma^2)\), we know that \((n-1) s^2 / \sigma^2 \sim \chi_{n-1}^2\). As a result, we have that \[ P\left( q_{n-1; \alpha/2} < (n-1) \frac{s^2}{\sigma^2} < q_{n-1; 1 - \alpha/2} \right) = 1 - \alpha \] where \(q_{n-1; \alpha} = Q_{n-1}(\alpha)\) with \(Q_{n-1}(\cdot)\) denoting the quantile function for the \(\chi_{n-1}^2\) distribution. Rearranging the terms inside the above probability statement gives \[ \begin{split} 1 - \alpha &= P\left( \frac{q_{n-1; \alpha/2}}{n-1} < \frac{s^2}{\sigma^2} < \frac{q_{n-1; 1 - \alpha/2}}{n-1} \right) \\ &= P\left( \frac{q_{n-1; \alpha/2}}{s^2 (n-1)} < \frac{1}{\sigma^2} < \frac{q_{n-1; 1 - \alpha/2}}{s^2 (n-1)} \right) \\ &= P\left( \frac{s^2 (n-1) }{q_{n-1; \alpha/2}} > \sigma^2 > \frac{s^2 (n-1) }{ q_{n-1; 1 - \alpha/2} } \right) \end{split} \] which implies that a \(100 (1 - \alpha)\)% confidence interval for \(\sigma^2\) defines \(a(s^2) = (n-1) s^2 / q_{n-1; 1 - \alpha/2}\) and \(b(s^2) = (n-1) s^2 / q_{n-1; \alpha/2}\). In this case, the quantile values \(q_{n-1; 1 - \alpha/2}\) and \(q_{n-1; \alpha/2}\) depend on the sample size \(n\), so there are not standard quantile values (such as 1.65, 1.96, and 2.58). But it is still typical to use a 90%, 95%, or 99% confidence interval.
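
A minimal sketch of the corresponding chi-square interval for \(\sigma^2\) with simulated data (true \(\sigma^2 = 1\)):

# sketch: 95% chi-square confidence interval for sigma^2
set.seed(1)
n <- 30
x <- rnorm(n)
s2 <- var(x)                                         # unbiased sample variance
(n - 1) * s2 / qchisq(c(0.975, 0.025), df = n - 1)   # lower and upper endpoints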

Interpretation of a Confidence Interval

Confidence intervals are often misinterpreted. The most common misinterpretation of a confidence interval is that there is a \(100 (1 - \alpha)\)% chance that \(a(\hat{\theta}) < \theta < b(\hat{\theta})\) for a given estimate \(\hat{\theta}\). Note that this interpretation is incorrect because for any given estimate \(\hat{\theta}\) and corresponding confidence interval \([a(\hat{\theta}), b(\hat{\theta})]\), the inequality statement \(a(\hat{\theta}) < \theta < b(\hat{\theta})\) is either true or false.

The correct interpretation of a confidence interval is as follows. Suppose that we repeat our experiment a large number of independent times, i.e., we collect \(R\) independent samples of data each of size \(n\). Let \(\hat{\theta}_r\) denote the estimate of \(\theta\) for the \(r\)-th sample of data, and let \([a(\hat{\theta}_r), b(\hat{\theta}_r)]\) denote the confidence interval formed from \(\hat{\theta}_r\) (for \(r = 1,\ldots,R\)). As the number of replications \(R \rightarrow \infty\), we have that \[ \frac{1}{R} \sum_{r = 1}^R I\left( a(\hat{\theta}_r) < \theta < b(\hat{\theta}_r) \right) \rightarrow 1 - \alpha \] where \(I(\cdot)\) is an indicator function, i.e., \(I\left( \cdot \right) = 1\) if the inequality statement is true, and \(I\left( \cdot \right) = 0\) otherwise. Note that for the \(r\)-th replication, the true parameter \(\theta\) is either in the interval or not, i.e., \(I(\cdot)\) is either equal to 1 or 0.

For the example of forming a confidence interval for \(\mu\) with \(x_i \stackrel{\mathrm{iid}}{\sim}N(\mu, \sigma^2)\), here is a simple demonstration of forming a 95% confidence interval using \(R = 10000\) replications with \(n = 25\) observations. Note that \(\mu = 0\) and \(\sigma^2 = 1\) in the below example.

R <- 10000
n <- 25
set.seed(1)
xbar <- replicate(R, mean(rnorm(n)))
ci.lo <- xbar - qnorm(.975) / sqrt(n)     # 95% CI lower bound
ci.up <- xbar - qnorm(.025) / sqrt(n)     # 95% CI upper bound
ci.in <- (ci.lo <= 0) & (0 <= ci.up)
summary(ci.in)
##    Mode   FALSE    TRUE 
## logical     499    9501
mean(ci.in)
## [1] 0.9501

5.2 Characteristics of CIs

Width and Shape

The width of the confidence interval refers to the distance between the upper and lower bounds, i.e., \(\mbox{width} = b(\hat{\theta}) - a(\hat{\theta})\). Note that if two different confidence interval procedures have accurate coverage rates (see next subsection), then the narrower confidence interval (i.e., the one with the smaller width) should be preferred. This is because the width of the confidence interval relates to its efficiency, such that wider intervals are less efficient.

The shape of the confidence interval refers to the ratio of the distance between the upper bound and the estimate relative to the distance between the estimate and the lower bound, i.e., \(\mbox{shape} = [b(\hat{\theta}) - \hat{\theta}] / [\hat{\theta} - a(\hat{\theta})]\). If \(\mbox{shape} > 1\), this indicates that the confidence interval is wider on the right side, and if \(\mbox{shape} < 1\), this indicates that the confidence interval is wider on the left side. Note that if \(\mbox{shape} = 1\), then the confidence interval is symmetric around the estimate \(\hat{\theta}\), which is typically the case for confidence intervals of mean parameters.

IMPORTANT: The purpose of any method for forming a confidence interval is to produce endpoints \(a(\hat{\theta})\) and \(b(\hat{\theta})\) that provide accurate coverage rates. Thus, the coverage rate is the most important aspect of any procedure for forming a confidence interval. If the coverage rate is not accurate, then the width and shape do not really matter. Consequently, the width and shape of a confidence interval are typically only discussed after verifying that the procedure produces accurate coverage rates.

Coverage Rate

Given a procedure for forming a confidence interval, the coverage rate refers to the proportion of times that the parameter \(\theta\) is included within the interval using some specified number of replications \(R \gg 1\). For example, the observed coverage rate in the sample mean example was \(0.9501\) using \(R = 10000\) replications. If the coverage rate is above the nominal level of \(1 - \alpha\), then the confidence interval procedure is said to be conservative. In contrast, if the coverage rate is below the nominal level, then the confidence interval procedure is said to be liberal. Finally, the coverage rate is said to be “accurate” if it is approximately equal to the nominal rate.

To understand differences in confidence interval accuracy, it is helpful to introduce the concept of “big oh” notation. Specifically, the notation \[ f(x) = O(g(x)) \] should be read as “\(f(x)\) is big oh of \(g(x)\)”. This means that there exist constants \(h > 0\) and \(x_0\) such that \[ |f(x)| \leq h |g(x)| \] for all \(x \geq x_0\). In other words, for sufficiently large \(x\), the magnitude of \(f(x)\) is at most \(h\) times the magnitude of \(g(x)\). Thus, if \(f(x) = O(g(x))\), then the magnitude of the function \(g()\) bounds (from above) the magnitude of the function \(f()\), up to a constant. For example, \(3/\sqrt{x} + 2/x = O(x^{-1/2})\), since the \(1/\sqrt{x}\) term dominates for large \(x\).

The nonparametric bootstrap can be used to form a variety of different types of confidence intervals. These different confidence interval methods can be classified into two general types, which are distinguished by their level of accuracy. A confidence interval procedure is first-order accurate if the non-coverage probability on each side is \(O(n^{-1/2})\), i.e., \[ P(\theta < a(\hat{\theta})) = \alpha/2 + h_a / \sqrt{n} \qquad \mbox{and} \qquad P(\theta > b(\hat{\theta})) = \alpha/2 + h_b / \sqrt{n} \] where \(h_a\) and \(h_b\) are constants. In contrast, a confidence interval procedure is second-order accurate if the non-coverage probability on each side is \(O(n^{-1})\), i.e., \[ P(\theta < a(\hat{\theta})) = \alpha/2 + h_a / n \qquad \mbox{and} \qquad P(\theta > b(\hat{\theta})) = \alpha/2 + h_b / n \] where \(h_a\) and \(h_b\) are constants.

IMPORTANT: Second-order accurate confidence intervals are preferred over first-order accurate intervals, given that the coverage rate converges to the nominal rate faster as \(n \rightarrow \infty\). However, for any given sample of data, there is no guarantee which confidence interval method will perform best. In the long run, the second-order accurate intervals should produce better results, but this does not guarantee that they will produce better results for every sample of data. Thus, in practice, it may be useful to compare several different confidence interval methods to get a sense of the uncertainty in the parameter estimate.

5.3 Bootstrap CI Methods

Normal Approximation

The normal approximation confidence interval uses the classic confidence interval formula (for the mean), but replaces the standard error with the bootstrap estimate of the standard error. Specifically, the normal approximation interval has the form \[ \hat{\theta} \pm Z_{1 - \alpha/2} \widehat{\mathrm{SE}}(\hat{\theta}) \] where \(Z_{1 - \alpha/2}\) is the quantile of the standard normal distribution that cuts off \(\alpha/2\) in the upper tail (e.g., \(Z_{1 - \alpha/2} \approx 1.96\) for a 95% interval), and \(\widehat{\mathrm{SE}}(\hat{\theta})\) is the bootstrap estimate of the standard error of \(\hat{\theta}\). This interval doesn’t have any real benefit over using the classic confidence interval formula, but it can be applied in cases where the standard error of \(\hat{\theta}\) is unknown or difficult to derive. Note that the np.boot function uses the bias corrected version, which subtracts the bootstrap estimate of the bias from \(\hat{\theta}\) before forming the interval.
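
As a rough illustration (not the np.boot implementation), the bias-corrected normal approximation interval can be computed by hand from a set of bootstrap replicates of the sample mean:

# minimal sketch: (bias-corrected) normal approximation interval for the mean
set.seed(1)
x <- rchisq(25, df = 1)                       # illustrative sample
theta.hat <- mean(x)                          # observed estimate
boot.dist <- replicate(9999, mean(sample(x, replace = TRUE)))

bias <- mean(boot.dist) - theta.hat           # bootstrap estimate of bias
se <- sd(boot.dist)                           # bootstrap estimate of SE
(theta.hat - bias) + c(-1, 1) * qnorm(0.975) * se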

Benefits:

  • Simple to form and easy to understand

  • Can be applied to situations where \(\mathrm{SE}(\hat{\theta})\) is unknown

Drawbacks:

  • Only first-order accurate

  • Tends to be too narrow because \(\widehat{\mathrm{SE}}(\hat{\theta})\) is too small

  • Still too narrow when \(t_{n - 1; 1 - \alpha/2}\) is used in place of \(Z_{1 - \alpha/2}\)

  • Performs poorly when the sampling distribution of \(\hat{\theta}\) is highly skewed

Basic (Reverse Percentile)

The “basic” (or reverse percentile) interval is based on the distribution of \(\hat{\delta} = \hat{\theta} - \theta\). Suppose that we know the distribution of \(\hat{\delta}\), and let \(q_{\alpha/2}\) and \(q_{1 - \alpha/2}\) denote the quantiles that cut off \(\alpha/2\) in each tail of the distribution. Then we would form a \(100 (1 - \alpha)\)% confidence interval for \(\theta\) using the logic \[ 1 - \alpha = P(q_{\alpha/2} < \hat{\delta} < q_{1 - \alpha/2}) = P(\hat{\theta} - q_{1 - \alpha/2} < \theta < \hat{\theta} - q_{\alpha/2}) \] which has lower bound \(a(\hat{\theta}) = \hat{\theta} - q_{1 - \alpha/2}\) and upper bound \(b(\hat{\theta}) = \hat{\theta} - q_{\alpha/2}\). In practice, the distribution of \(\hat{\delta}\) will be unknown, so we can use the bootstrap distribution in its place.

Specifically, consider using \(\hat{\delta}{}^* = \hat{\theta}{}^* - \hat{\theta}\), where \(\hat{\theta}{}^*\) denotes a bootstrap replicate of the statistic. To estimate the confidence interval, we can use \(q_{\alpha/2}^*\) and \(q_{1 - \alpha/2}^*\), which are the quantiles of the bootstrap distribution of \(\hat{\delta}{}^*\). This implies that we can write \[ 1 - \alpha = P(q_{\alpha/2}^* < \hat{\delta}{}^* < q_{1 - \alpha/2}^*) = P(\hat{\theta}{}^* - q_{1 - \alpha/2}^* < \hat{\theta} < \hat{\theta}{}^* - q_{\alpha/2}^*) \] and, noting that \(q_{\alpha}^* = Q_{\alpha}^* - \hat{\theta}\), the \(100 (1 - \alpha)\)% confidence interval for \(\theta\) has the form \[ [2\hat{\theta} - Q_{1 - \alpha/2}^*, 2\hat{\theta} - Q_{\alpha/2}^*] \] where \(Q_{\alpha/2}^*\) and \(Q_{1 - \alpha/2}^*\) are the quantiles of the bootstrap distribution of \(\hat{\theta}\).
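
As a rough illustration (again, not the np.boot implementation), the basic interval for the sample mean can be computed by hand as:

# minimal sketch: basic (reverse percentile) interval for the mean
set.seed(1)
x <- rchisq(25, df = 1)                       # illustrative sample
theta.hat <- mean(x)
boot.dist <- replicate(9999, mean(sample(x, replace = TRUE)))

# lower bound uses the upper bootstrap quantile, and vice versa
# (the printed names refer to the quantiles used, not the CI level)
2 * theta.hat - quantile(boot.dist, probs = c(0.975, 0.025))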

Benefits:

  • Simple to form and easy to understand

  • Can work well for the median

Drawbacks:

  • Only first-order accurate

  • Tends to be too narrow (for small \(n\))

  • Not range preserving or transformation invariant

  • Skewed in the wrong direction (for skewed distributions)

Percentile

The percentile method defines the \(100 (1 - \alpha)\)% confidence interval for \(\theta\) as \[ [Q_{\alpha/2}^*, Q_{1 - \alpha/2}^*] \] where \(Q_{\alpha/2}^*\) and \(Q_{1 - \alpha/2}^*\) denote the quantiles of the bootstrap distribution of \(\hat{\theta}\). Note that this approach simply uses the bootstrap distribution as if it were the sampling distribution of \(\hat{\theta}\).
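
An illustrative sketch of the percentile interval for the sample mean:

# minimal sketch: percentile interval for the mean
set.seed(1)
x <- rchisq(25, df = 1)                       # illustrative sample
boot.dist <- replicate(9999, mean(sample(x, replace = TRUE)))
quantile(boot.dist, probs = c(0.025, 0.975))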

Benefits:

  • Simple to form and easy to understand

  • Range preserving and transformation invariant

  • Advantages over normal and basic for skewed data

Drawbacks:

  • Only first-order accurate

  • Tends to be too narrow (for small \(n\))

  • Only does a partial skewness correction

Studentized (Bootstrap-t)

One of the better methods for calculating bootstrap confidence intervals is referred to as the “studentized” (or bootstrap-\(t\)) method. This approach calculates a \(t\)-like statistic for each bootstrap replicate \[ t_r^* = \frac{\hat{\theta}{}_r^* - \hat{\theta}}{\widehat{\mathrm{SE}}(\hat{\theta}{}_r^*) } \] where \(\widehat{\mathrm{SE}}(\hat{\theta}{}_r^*)\) is an estimate of the standard error of \(\hat{\theta}{}_r^*\). In most cases, the standard error of \(\hat{\theta}{}_r^*\) is unknown, so this method typically requires a bootstrap within the bootstrap: for each bootstrap sample \(\mathbf{x}_r^* = (x_{1r}^*, \ldots, x_{nr}^*)^\top\), an inner round of bootstrap resampling is used to estimate \(\widehat{\mathrm{SE}}(\hat{\theta}{}_r^*)\). In this inner bootstrap, the \(r\)-th bootstrap sample \(\mathbf{x}_r^*\) is treated as the observed data \(\mathbf{x}\), and the \(r\)-th replicate \(\hat{\theta}{}_r^*\) is treated as the observed estimate \(\hat{\theta}\). The bootstrap \(t\) statistic distribution \(\{t_r^*\}_{r = 1}^R\) is then used to form the \(100 (1 - \alpha)\)% confidence interval, such that \[ a(\hat{\theta}) = \hat{\theta} - q_t^*(1 - \alpha/2) \widehat{\mathrm{SE}}(\hat{\theta}) \] and \[ b(\hat{\theta}) = \hat{\theta} - q_t^*(\alpha/2) \widehat{\mathrm{SE}}(\hat{\theta}) \] where \(q_t^*(\cdot)\) are the sample quantiles of the bootstrap \(t\) statistic distribution \(\{t_r^*\}_{r = 1}^R\).
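
As a rough illustration for the sample mean, where the standard error formula \(s/\sqrt{n}\) is known (so no inner bootstrap is needed), the studentized interval can be computed by hand as:

# minimal sketch: studentized (bootstrap-t) interval for the mean
set.seed(1)
n <- 25
x <- rchisq(n, df = 1)                        # illustrative sample
theta.hat <- mean(x)
se.hat <- sd(x) / sqrt(n)                     # SE of the observed estimate

boot <- replicate(9999, {
  xs <- sample(x, replace = TRUE)
  c(est = mean(xs), se = sd(xs) / sqrt(n))    # replicate estimate and its SE
})
tstar <- (boot["est", ] - theta.hat) / boot["se", ]

# lower bound uses the upper quantile of t*, and vice versa
theta.hat - quantile(tstar, probs = c(0.975, 0.025)) * se.hat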

Benefits:

  • Second-order accurate (better than normal, basic, and percentile)

  • Simple idea with intuitive procedure

  • Works well for location parameters

Drawbacks:

  • Not range preserving or transformation invariant

  • Computationally costly when \(\widehat{\mathrm{SE}}(\hat{\theta}{}_r^*)\) requires inner bootstrap

  • Doesn’t tend to work as well for correlation/association measures

Bias Corrected and Accelerated

Another good method for forming bootstrap confidence intervals is the “bias corrected and accelerated” (BCa) method. This method is similar to (but better than) the percentile method, given that it uses a corrected version of the bootstrap quantiles \(Q(\cdot)\). More specifically, the BCa method defines the endpoints of the \(100 (1 - \alpha)\)% confidence interval as \(a(\hat{\theta}) = Q(\alpha_1)\) and \(b(\hat{\theta}) = Q(\alpha_2)\), where \(\alpha_1\) and \(\alpha_2\) are the corrected probabilities that are input into the sample quantile function (in place of \(\alpha/2\) and \(1 - \alpha/2\)). The corrected probabilities have the form \[ \begin{split} \alpha_1 &= \Phi\left( z_0 + \frac{z_0 + z_{\alpha/2}}{1 - \gamma (z_0 + z_{\alpha/2})} \right) \\ \alpha_2 &= \Phi\left( z_0 + \frac{z_0 + z_{1 - \alpha/2}}{1 - \gamma (z_0 + z_{1 - \alpha/2})} \right) \\ \end{split} \] where \(\Phi(\cdot)\) is the CDF for the standard normal distribution, \(z_{\alpha} = \Phi^{-1}(\alpha)\) is the \(\alpha\)-th quantile of the standard normal distribution, \(z_0\) is the bias correction factor, and \(\gamma\) is the acceleration parameter. If \(z_0 = \gamma = 0\), then \(\alpha_1 = \alpha/2\) and \(\alpha_2 = 1 - \alpha/2\), which reveals that the BCa method is equivalent to the percentile method when no corrections are needed.

In practice, the bias correction and acceleration factors can be estimated as \[ \hat{z}_0 = \Phi^{-1}\left( \frac{1}{R+1} \sum_{r = 0}^R I( \hat{\theta}{}_r^* < \hat{\theta}) \right) \qquad \mbox{and} \qquad \hat{\gamma} = \frac{\sum_{i = 1}^n \left( \hat{\theta}_{(\cdot)} - \hat{\theta}_{(i)} \right)^3}{6 \left[\sum_{i = 1}^n \left( \hat{\theta}_{(\cdot)} - \hat{\theta}_{(i)} \right)^2\right]^{3/2}} \] where \(\hat{\theta}_{(i)}\) is the estimate of \(\theta\) that would be obtained if \(x_i\) was excluded from the sample (which is referred to as the “jackknife estimate”) and \(\hat{\theta}_{(\cdot)} = \frac{1}{n} \sum_{i = 1}^n \hat{\theta}_{(i)}\) is the average of the jackknife estimates. The bias correction estimate \(\hat{z}_0\) quantifies the median bias of the bootstrap distribution, which relates to the difference between \(\mathrm{median}(\hat{\theta}{}_r^*)\) and \(\hat{\theta}\). The acceleration estimate \(\hat{\gamma}\) quantifies the rate of change of the standard error of \(\hat{\theta}\) with respect to the true parameter \(\theta\). Note that the bias correction estimate \(\hat{z}_0\) is simple to compute, whereas the acceleration estimate \(\hat{\gamma}\) is more computationally costly. However, assuming that \(n \ll R\), estimating \(\hat{\gamma}\) is less costly than estimating \(\widehat{\mathrm{SE}}(\hat{\theta}{}_r^*)\), which often makes the BCa method less computationally costly than the bootstrap-\(t\) method.
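
As a rough illustration of these corrections for the sample mean (not the np.boot internals; the \(r = 0\) term of the sum for \(\hat{z}_0\) is zero and is omitted):

# minimal sketch: BCa interval for the mean (alpha = 0.05)
set.seed(1)
n <- 25
x <- rchisq(n, df = 1)                        # illustrative sample
theta.hat <- mean(x)
boot.dist <- replicate(9999, mean(sample(x, replace = TRUE)))
R <- length(boot.dist)

# bias correction factor (counts replicates below the observed estimate)
z0 <- qnorm(sum(boot.dist < theta.hat) / (R + 1))

# acceleration from the jackknife (leave-one-out) estimates
jack <- sapply(1:n, function(i) mean(x[-i]))
d <- mean(jack) - jack
gam <- sum(d^3) / (6 * sum(d^2)^(3/2))

# corrected probabilities and BCa interval
za <- qnorm(c(0.025, 0.975))
probs <- pnorm(z0 + (z0 + za) / (1 - gam * (z0 + za)))
quantile(boot.dist, probs = probs)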

Benefits:

  • Second-order accurate (better than normal, basic, and percentile)

  • Range preserving and transformation invariant

  • Works well for a variety of parameters

Drawbacks:

  • Requires estimation of acceleration and bias-correction

  • Less intuitive than other methods (b/c of acceleration and bias parameters)

5.4 Simulation Study

Simulation Design

A simple simulation study was conducted to explore the quality of the different confidence interval methods. The simulation involved generating \(n \in \{10, 25, 50, 100, 200\}\) observations from a \(\chi_1^2\) distribution, and then using the nonparametric bootstrap to form a confidence interval for the mean \(\mu\) or the median \(\theta\). As a reminder, the mean of a chi square distribution is the degrees of freedom parameter \(k\), which in this case is \(\mu = 1\). The median of a chi square distribution with one degree of freedom is \(\theta \approx 0.4549364\).

Simulation Analyses

For each sample size, I generated 10,000 independent samples of data of size \(n\), and then I used the nonparametric bootstrap methods to form 95% confidence intervals. The nonparametric bootstrap was implemented using the np.boot function in the nptest R package. The coverage rate was defined as the proportion of the 10,000 replications where the given confidence interval method contained the true parameter (\(\mu = 1\) for the mean or \(\theta = 0.4549364\) for the median).

Simulation Results for Mean

The simulation results for the mean parameter are summarized in the below figure. The left subplot shows the coverage rate for each method across the different sample sizes. The right subplot shows the average runtime for each method across the different sample sizes.

Simulation results for the mean parameter: coverage rate (left) and runtime (right).

For the coverage rates, first note that all of the methods do better as the sample size increases, which is not surprising. As a reminder, all of these methods produce consistent results, so we should expect all of the methods to perform better as \(n\) increases. Comparing the different methods, note that…

  • The studentized (bootstrap-t) method does best, which is expected. For the mean statistic, there is a well-known formula for the standard error, i.e., \(\mathrm{SE}(\bar{x}) = \sigma / \sqrt{n}\), where \(\sigma\) is the standard deviation, so the studentized method doesn’t require the inner bootstrap resampling to estimate the standard error of the statistic for each replicate.

  • The BCa method performs second best, which is also not too surprising, given that the BCa method is second-order accurate.

  • The percentile method performs the best out of the “less accurate” (first-order accurate) methods, but it is not much better than the normal approximation. The basic method performs the worst, which is not too surprising (because of the previously mentioned skewness issue).

For the runtimes, note that the studentized method is noticeably more costly than the other methods, which all have similar runtimes. However, when using the mean statistic, all of the methods are relatively efficient, so the bootstrap (with \(R = 9999\) resamples) can be computed in less than half a second regardless of the chosen confidence interval method.

Simulation Results for Median

The simulation results for the median parameter are summarized in the below figure. The left subplot shows the coverage rate for each method across the different sample sizes. The right subplot shows the average runtime for each method across the different sample sizes.

Simulation results for the median parameter: coverage rate (left) and runtime (right).

For the coverage rates, first note that all of the methods do better as the sample size increases (for \(n \geq 25\)), which is not surprising. As a reminder, all of these methods produce consistent results, so we should expect all of the methods to perform better as \(n\) increases. Comparing the different methods, note that…

  • The studentized (bootstrap-t) method seems to show overly optimistic performance for \(n = 10\), and for \(n \geq 25\) the coverage rate gradually approaches the nominal rate of \(0.95\). Note that the studentized method should display more stable (and better) results if the number of inner bootstrap replicates were increased from the default value of 99, which was used in this simulation.

  • The BCa method performs quite well and seems to converge to the nominal level for \(n \geq 100\) observations. Again, this is not too surprising, given that the BCa method is second-order accurate.

  • The percentile method performs surprisingly well and gives accurate coverage rates for \(n \geq 25\). But this won’t necessarily be the case for other statistics and/or data generating distributions.

  • The normal approximation performs noticeably worse than the BCa and percentile methods, which is not too surprising. For distributions that are highly skewed, the sample size \(n\) may need to be quite large for the normal approximation to work well.

  • The basic method performs the worst of all methods, which is expected. As a reminder, the basic method does not work well for skewed distributions.

For the runtimes, note that the studentized method is noticeably more costly than the other methods, which all have similar runtimes. For the median statistic (which requires an inner bootstrap), the studentized method takes 5-8 seconds, which is substantially more costly than it was for the mean statistic (~ 0.1-0.4 seconds). This demonstrates the benefit of the BCa method, which can typically provide accurate coverage rates at a fraction of the computational cost of the studentized (bootstrap-t) method.

Simulation R Code

WARNING: the simulation for the median parameter takes a few days to run.

#########***#########   INITIALIZATIONS   #########***#########

# load nptest package (ver 1.0-2)
library(nptest)

# define cluster for parallel computing
library(parallel)   # provides makeCluster() and detectCores()
cl <- makeCluster(detectCores())



#########***#########   MEAN PARAMETER   #########***#########

# simulate from chisq(df = 1)      mean = 1
theta <- 1

# simulation design
methods <- c("norm", "basic", "perc", "stud", "bca")
longmeth <- c("normal", "basic", "percent", "student", "bca")
nobs <- c(10, 25, 50, 100, 200)
nrep <- 1e4

# initialize to store results for all n
coverage <- runtimes <- data.frame(normal = rep(NA, 5),
                                   basic = rep(NA, 5), 
                                   percent = rep(NA, 5),
                                   student = rep(NA, 5),
                                   bca = rep(NA, 5))

# standard error function (for studentized)
sdfun <- function(x) sd(x) / sqrt(length(x))

# loop through nobs
for(n in 1:length(nobs)){
  
  # initialize to store results for given n
  cover <- rtime <- data.frame(normal = rep(NA, nrep),
                               basic = rep(NA, nrep), 
                               percent = rep(NA, nrep),
                               student = rep(NA, nrep),
                               bca = rep(NA, nrep))
  
  # loop through replications
  for(i in 1:nrep){
    
    # print progress
    cat("mean:   n =", nobs[n], "    rep =", i, "\n")
    
    # set seed and sample data
    set.seed(i)
    x <- rchisq(nobs[n], df = 1)
    
    # bootstrap confidence intervals
    for(j in 1:length(methods)){
      set.seed(12345)
      tic <- proc.time()
      bs <- np.boot(x, mean, level = 0.95, method = methods[j], 
                    sdfun = sdfun, parallel = TRUE, cl = cl)
      toc <- proc.time() - tic
      ci <- bs[[longmeth[j]]]
      cover[i,j] <- (ci[1] < theta) & (ci[2] > theta)
      rtime[i,j] <- toc[3]
    } # end for(j in 1:length(methods))
    
  } # end for(i in 1:nrep)
  
  # record coverage rate and average runtime
  coverage[n,] <- colMeans(cover)
  runtimes[n,] <- colMeans(rtime)
  
} # end for(n in nobs)

# save results
write.csv(coverage, file = "mean-coverage.csv", row.names = FALSE)
write.csv(runtimes, file = "mean-runtimes.csv", row.names = FALSE)

# plot results
pdf(file = "mean-results.pdf", w = 8, h = 4)
par(mar = c(5.1, 4.6, 1.1, 2.1), oma = c(0, 0, 2, 0))
layout(matrix(1:3, 1, 3), widths = c(0.45, 0.45, 0.1))

# coverage rate
plot(1:length(nobs), coverage[,1], type = "l",
     axes = FALSE, ylim = c(0.78, 1),
     xlab = "Sample Size (n)", ylab = "Coverage Rate", lwd = 2,
     cex.lab = 1.5)
axis(1, at = 1:length(nobs), labels = nobs, cex.axis = 1.25)
axis(2, cex.axis = 1.25)
abline(h = 0.95, lty = 3)
box()
for(k in 2:ncol(coverage)){
  lines(1:length(nobs), coverage[,k], lty = k, col = k, lwd = 2)
}

# runtimes
plot(1:length(nobs), runtimes[,1], type = "l",
     axes = FALSE, ylim = c(0, 0.5),
     xlab = "Sample Size (n)", ylab = "Runtime (sec)", lwd = 2,
     cex.lab = 1.5)
axis(1, at = 1:length(nobs), labels = nobs, cex.axis = 1.25)
axis(2, cex.axis = 1.25)
box()
for(k in 2:ncol(runtimes)){
  lines(1:length(nobs), runtimes[,k], lty = k, col = k, lwd = 2)
}

# legend
par(mar = rep(0, 4))
plot(1, 1, type = "n", axes = FALSE)
legend("center", legend = colnames(coverage), lwd = 2,
       lty = 1:ncol(coverage), col = 1:ncol(coverage), bty = "n")

# title
mtext("Simulation Results for Mean", outer = TRUE, cex = 1.5)

dev.off()



#########***#########   MEDIAN PARAMETER   #########***#########

# simulate from chisq(df = 1)      median ~= 0.4549364
theta <- qchisq(p = 0.5, df = 1)

# simulation design
methods <- c("norm", "basic", "perc", "stud", "bca")
longmeth <- c("normal", "basic", "percent", "student", "bca")
nobs <- c(10, 25, 50, 100, 200)
nrep <- 1e4

# initialize to store results for all n
coverage <- runtimes <- data.frame(normal = rep(NA, 5),
                                   basic = rep(NA, 5), 
                                   percent = rep(NA, 5),
                                   student = rep(NA, 5),
                                   bca = rep(NA, 5))

# loop through nobs
for(n in 1:length(nobs)){
  
  # initialize to store results for given n
  cover <- rtime <- data.frame(normal = rep(NA, nrep),
                               basic = rep(NA, nrep), 
                               percent = rep(NA, nrep),
                               student = rep(NA, nrep),
                               bca = rep(NA, nrep))
  
  # loop through replications
  for(i in 1:nrep){
    
    # print progress
    cat("mean:   n =", nobs[n], "    rep =", i, "\n")
    
    # set seed and sample data
    set.seed(i)
    x <- rchisq(nobs[n], df = 1)
    
    # bootstrap confidence intervals
    for(j in 1:length(methods)){
      set.seed(12345)
      tic <- proc.time()
      bs <- np.boot(x, median, level = 0.95, method = methods[j], 
                    parallel = TRUE, cl = cl)
      toc <- proc.time() - tic
      ci <- bs[[longmeth[j]]]
      cover[i,j] <- (ci[1] < theta) & (ci[2] > theta)
      rtime[i,j] <- toc[3]
    } # end for(j in 1:length(methods))
    
  } # end for(i in 1:nrep)
  
  # record coverage rate and average runtime
  coverage[n,] <- colMeans(cover)
  runtimes[n,] <- colMeans(rtime)
  
} # end for(n in nobs)

# save results
write.csv(coverage, file = "median-coverage.csv", row.names = FALSE)
write.csv(runtimes, file = "median-runtimes.csv", row.names = FALSE)

# plot results
pdf(file = "median-results.pdf", w = 8, h = 4)
par(mar = c(5.1, 4.6, 1.1, 2.1), oma = c(0, 0, 2, 0))
layout(matrix(1:3, 1, 3), widths = c(0.45, 0.45, 0.1))

# coverage rate
plot(1:length(nobs), coverage[,1], type = "l",
     axes = FALSE, ylim = c(0.72, 1),
     xlab = "Sample Size (n)", ylab = "Coverage Rate", lwd = 2,
     cex.lab = 1.5)
axis(1, at = 1:length(nobs), labels = nobs, cex.axis = 1.25)
axis(2, cex.axis = 1.25)
abline(h = 0.95, lty = 3)
box()
for(k in 2:ncol(coverage)){
  lines(1:length(nobs), coverage[,k], lty = k, col = k, lwd = 2)
}

# runtimes
plot(1:length(nobs), runtimes[,1], type = "l",
     axes = FALSE, ylim = c(0, 8),
     xlab = "Sample Size (n)", ylab = "Runtime (sec)", lwd = 2,
     cex.lab = 1.5)
axis(1, at = 1:length(nobs), labels = nobs, cex.axis = 1.25)
axis(2, cex.axis = 1.25)
box()
for(k in 2:ncol(runtimes)){
  lines(1:length(nobs), runtimes[,k], lty = k, col = k, lwd = 2)
}

# legend
par(mar = rep(0, 4))
plot(1, 1, type = "n", axes = FALSE)
legend("center", legend = colnames(coverage), lwd = 2,
       lty = 1:ncol(coverage), col = 1:ncol(coverage), bty = "n")

# title
mtext("Simulation Results for Median", outer = TRUE, cex = 1.5)

dev.off()

# stop cluster
stopCluster(cl)