Copyright © January 04, 2021 by Nathaniel E. Helwig

1 Introduction

1.1 Motivation and Goals

Nonparametric regression offers a flexible alternative to classic (parametric) regression methods. Whereas parametric methods assume that the regression relationship has a known form that depends on a finite number of unknown parameters, nonparametric regression models attempt to learn the form of the regression relationship from a sample of data.

All nonparametric regression models involve finding some balance between fitting the observed sample of data (model fit) and “smoothing” the function estimate (model parsimony). Typically, this balance is determined using some form of cross-validation, which attempts to find a function estimate that does well for predicting new data. As a result, nonparametric regression models can be useful for discovering relationships between variables, as well as for developing generalizable prediction rules.

In these notes, you will learn how to fit and tune simple nonparametric regression models. You will also learn how to use a nonparametric regression model for visualizing relationships in data and forming predictions for new data.

1.2 Simple Smoothers in R

These notes cover three classic methods for “simple” nonparametric regression: local averaging, local regression, and kernel regression. Note that by “simple”, I mean that there is a single (continuous) predictor. Smoothing splines, as well as extensions for multiple and generalized regression, will be covered in another set of notes.

Local Averaging (Friedman’s “Super Smoother”):
supsmu(x, y, wt =, span = "cv", periodic = FALSE, bass = 0, trace = FALSE)

Local Regression (Cleveland’s LOESS):
loess(formula, data, weights, subset, na.action, model = FALSE,
            span = 0.75, enp.target, degree = 2,
            parametric = FALSE, drop.square = FALSE, normalize = TRUE,
            family = c("gaussian", "symmetric"),
            method = c("loess", "model.frame"),
            control = loess.control(...), ...)

Kernel Regression (Nadaraya and Watson’s Kernel Smoother):
ksmooth(x, y, kernel = c("box", "normal"), bandwidth = 0.5,
               range.x = range(x),
               n.points = max(100L, length(x)), x.points)

1.3 Need for NP Regression

Anscombe’s (1973) quartet was originally devised to emphasize the need for using graphics with regression models, but it also provides a nice motivation for nonparametric regression.

# look at the data
anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
# fit four regression models
lm1 <- lm(y1 ~ x1, data = anscombe)
lm2 <- lm(y2 ~ x2, data = anscombe)
lm3 <- lm(y3 ~ x3, data = anscombe)
lm4 <- lm(y4 ~ x4, data = anscombe)

# print coefficients
coef(lm1)
## (Intercept)          x1 
##   3.0000909   0.5000909
coef(lm2)
## (Intercept)          x2 
##    3.000909    0.500000
coef(lm3)
## (Intercept)          x3 
##   3.0024545   0.4997273
coef(lm4)
## (Intercept)          x4 
##   3.0017273   0.4999091
# check R-squared values
summary(lm1)$r.squared
## [1] 0.6665425
summary(lm2)$r.squared
## [1] 0.666242
summary(lm3)$r.squared
## [1] 0.666324
summary(lm4)$r.squared
## [1] 0.6667073
# plot results
mods <- list(lm1, lm2, lm3, lm4)
par(mfrow = c(2, 2), mar = c(5, 5, 4, 2) + 0.1)
for(j in 1:4){
  xseq <- seq(4, 15, length.out = 20) + ifelse(j == 4, 4, 0)
  coefs <- coef(mods[[j]])
  title <- substitute("Dataset " * jj * ":   " * hat(y) * " = 3 + 0.5" * italic(x),
                      list(jj = j))
  plot(xseq, coefs[1] + coefs[2] * xseq, type = "l", ylim = c(0, 15),
     xlab = expression(italic(x)), ylab = expression(hat(italic(y))),
     main = title, lwd = 2, col = "blue", cex.lab = 1.25, cex.main = 1.5)
  text(6 + ifelse(j == 4, 4, 0), 13, expression(R^2 * " = 0.67"))
}

# plot results with data
mods <- list(lm1, lm2, lm3, lm4)
par(mfrow = c(2, 2), mar = c(5, 5, 4, 2) + 0.1)
for(j in 1:4){
  xseq <- seq(4, 15, length.out = 20) + ifelse(j == 4, 4, 0)
  coefs <- coef(mods[[j]])
  title <- substitute("Dataset " * jj * ":   " * hat(y) * " = 3 + 0.5" * italic(x),
                      list(jj = j))
  plot(xseq, coefs[1] + coefs[2] * xseq, type = "l", ylim = c(0, 15),
     xlab = expression(italic(x)), ylab = expression(hat(italic(y))),
     main = title, lwd = 2, col = "blue", cex.lab = 1.25, cex.main = 1.5)
  points(anscombe[,c(j, j + 4)], pch = 19, col = "red")
}

Points of the story:

  1. Just because a model fits “good” doesn’t mean that it is a good model.

  2. Parametric regression models may not reveal “true” relationships in data.

1.4 Model and Assumptions

Suppose we observe \(n\) independent pairs of points \(\{ (x_i, y_i) \}_{i = 1}^n\) and assume without loss of generality (WLOG) that \(x_1 \leq x_2 \leq \cdots \leq x_n\). Furthermore, suppose that \(X\) and \(Y\) have a functional relationship of the form \[\begin{equation} y_i = f(x_i) + \epsilon_i \end{equation}\] where \(f(\cdot)\) is some unknown smooth function, and the \(\epsilon_i\) are iid error terms with mean zero and variance \(\sigma^2\). Note that the assumption of homogeneous error variance is not actually needed for these methods; however, this assumption is often useful for tuning and inference. The assumption of mean zero error terms is necessary, given that it implies that \(E(y_i | x_i) = f(x_i)\), i.e., that the unknown smooth function \(f(\cdot)\) describes the conditional mean of \(Y\) given \(X\). Nonparametric regression estimators (also known as “smoothers”) attempt to estimate the unknown function \(f(\cdot)\) from a sample of noisy data.

2 Local Averaging

2.1 Definition

To estimate the unknown function \(f(\cdot)\), Friedman (1984) proposed the simple and powerful idea of local averaging. The local averaging estimate of the function \(f(\cdot)\) evaluated at the point \(x\) is constructed by taking an average of the \(y_i\) values corresponding to \(x_i\) values that are nearby the given \(x\). More formally, the local averaging estimate can be written as \[\begin{equation} \hat{f}(x) = \frac{ \sum_{i = 1}^n y_i w_i(x) }{ \sum_{i = 1}^n w_i(x) } \end{equation}\] where the weights are defined such that \(w_i(x) = 1\) if the point \(x_i\) is nearby the point \(x\), and \(w_i(x) = 0\) otherwise. Note that the weights are a function of \(x\), i.e., there will be a different set of weights (defining the nearby points) for each \(x\).
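
To make the definition concrete, below is a minimal sketch of the local averaging estimate at a single point, using a fixed window of half-width delta (the same simplification used in the figures below). The helper name locavg.fit is just for illustration; it is not part of the supsmu function.

# minimal sketch: local averaging estimate of f(x0) with 0/1 weights
# for the x values that fall within a window of half-width delta around x0
locavg.fit <- function(x0, x, y, delta = 0.05){
  w <- as.numeric(abs(x - x0) <= delta)
  sum(w * y) / sum(w)
}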

2.2 How Close is Nearby?

To formalize the concept of nearby, Friedman proposed using the smallest symmetric window around \(x_i\) that contains \(sn\) observations, where \(s \in (0, 1]\). The parameter \(s\) is called the span parameter, which controls the smoothness of the function estimate. As \(s \rightarrow 0\) the function estimate becomes more jagged (i.e., less smooth), and as \(s \rightarrow 1\) the function estimate becomes more smooth. The goal is to find the value of \(s\) that provides a “reasonable” estimate of the unknown smooth function \(f(\cdot)\).

2.3 Visual Intuition

To get an understanding of the local averaging estimator, we will look at how the estimate is formed for the point \(x^* = 0.3\) using different values of the span parameter.

# define function and data
set.seed(1)
n <- 101
x <- seq(0, 1, length.out = n)
fx <- sin(2 * pi * x)
y <- fx + rnorm(n, sd = 0.5)

# define x* and color for window
xstar <- 0.3
cols <- rgb(190/255, 190/255, 190/255, alpha = 0.5)

# set-up 2 x 2 subplot
par(mfrow = c(2,2))

# loop through spans (0.1, 0.2, 0.3, 0.4)
for(s in c(0.1, 0.2, 0.3, 0.4)){
  
  # plot data and true function
  plot(x, y, main = paste0("span = ", s), ylim = c(-2.5, 2.5),
       cex.lab = 1.5, cex.axis = 1.25)
  lines(x, fx, col = "blue", lwd = 2)
  
  # plot window
  window <- c(xstar - s / 2, xstar + s / 2)
  rect(window[1], -3, window[2], 3, col = cols)
  
  # define weights
  w <- rep(0, n)
  w[x >= window[1] & x <= window[2]] <- 1
  
  # plot estimate
  ystar <- sum(y * w) / sum(w)
  points(xstar, ystar, pch = 17, col = "red", cex = 1)
  
  # add legend
  legend("topright", legend = c("data", "truth"),
         pch = c(1, NA), lty = c(NA, 1), col = c("black", "blue"), bty = "n")
  legend("bottomright", legend = c("estimate", "window"),
         pch = c(17, 15), col = c("red", "gray"), bty = "n")
  
}

Local averaging estimate of \(f(0.3)\) with different span values.

Note that the estimate (denoted by the red triangle) is calculated by averaging all of the \(y_i\) values that are within the local window (denoted by the gray box). The above example shows the window for the point \(x^* = 0.3\), but the same idea applies to other points. To form the function estimate at a different \(x\) value, we would just slide the averaging window down the x-axis to be centered at the new \(x\) value. As a result, the local averaging estimator is sometimes referred to as a “moving average” estimator.
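
To turn the pointwise estimate into a full curve, we can simply slide the window across a grid of evaluation points. Below is a minimal sketch that reuses the illustrative locavg.fit helper from above and the simulated data (x, y) from the previous code block; a span of s = 0.2 corresponds to a window half-width of 0.1.

# evaluate the moving average on a grid to form the full curve (span = 0.2)
xg <- seq(0, 1, length.out = 200)
fhat <- sapply(xg, locavg.fit, x = x, y = y, delta = 0.2 / 2)
plot(x, y)
lines(xg, fhat, col = "red", lwd = 2)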

2.4 Selecting the Span

Typically the value of \(s\) is determined using ordinary (leave-one-out) cross-validation. Let \(\hat{y}_{(i)}(s)\) denote the local averaging estimate of \(f(\cdot)\) at the point \(x_i\) obtained by holding out the \(i\)-th pair \((x_i, y_i)\). Note that the notation \(\hat{y}_{(i)}(s)\) denotes that the estimate is a function of the span parameter \(s\). The ordinary cross-validation method seeks to find the value of \(s\) that minimizes \[ \frac{1}{n} \sum_{i = 1}^n \left(y_i - \hat{y}_{(i)}(s) \right)^2 \] which minimizes the (squared) leave-one-out prediction errors. This is the default method used by the supsmu function for selecting the span parameter.
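
To make the span selection concrete, below is a rough sketch of ordinary (leave-one-out) cross-validation for the simple fixed-window moving average, evaluated over a grid of candidate spans. This is purely illustrative (the helper name cv.span and the grid of spans are arbitrary choices); supsmu uses its own, more sophisticated variable-span implementation.

# sketch: leave-one-out CV for the fixed-window moving average
cv.span <- function(x, y, spans = seq(0.1, 0.9, by = 0.1)){
  n <- length(y)
  cv <- sapply(spans, function(s){
    delta <- s * diff(range(x)) / 2
    yhat <- sapply(1:n, function(i){
      w <- as.numeric(abs(x - x[i]) <= delta)
      w[i] <- 0                        # hold out the i-th observation
      sum(w * y) / sum(w)
    })
    mean((y - yhat)^2)
  })
  spans[which.min(cv)]                 # span minimizing the LOO prediction error
}
cv.span(x, y)                          # using the simulated data from above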

3 Local Regression

LOWESS: LOcally WEighted Scatterplot Smoothing
LOESS: LOcal regrESSion

3.1 Definition

To estimate the unknown function \(f(\cdot)\), Cleveland (1979) proposed the idea of local regression. The local regression estimate of the function \(f(\cdot)\) evaluated at the point \(x\) is constructed by fitting a weighted least squares regression model to the \(y_i\) values corresponding to \(x_i\) values that are nearby the given \(x\). More formally, the local regression estimate can be written as \[\begin{equation} \hat{f}(x) = \hat{\beta}_0^{(x)} + \hat{\beta}_1^{(x)} x \end{equation}\] where \(\hat{\beta}_0^{(x)}\) and \(\hat{\beta}_1^{(x)}\) are the minimizers of the weighted least squares problem \[\begin{equation} \sum_{i=1}^n w_i(x) (y_i - \beta_0 - \beta_1 x_i)^2 . \end{equation}\] Note that \(\hat{\beta}_0^{(x)}\) and \(\hat{\beta}_1^{(x)}\) are functions of \(x\), given that the weights \(\{w_i(x)\}_{i = 1}^n\) are functions of \(x\).

In this case, the weights are defined such that \(w_i(x)\) is a non-increasing function of \(| x - x_i |\) that takes nonzero values only when \(x_i\) is nearby \(x\). A popular weight function is the tricube function \[\begin{equation} w_i(x) = \left\{ \begin{array}{ll} (1 - |x - x_i|^3)^3 & \mathrm{if} \ |x - x_i| < \delta_i \\ 0 & \mathrm{otherwise} \end{array} \right. \end{equation}\] where \(\delta_i > 0\) is some scalar that is used to determine which points are close enough to the given \(x\) to receive a non-zero weight.

The above formulation is for local linear regression, where a simple linear regression model is fit in the neighborhood of \(x\). The idea can easily be extended to include higher-order polynomial terms. For example, the local quadratic regression estimate has the form \[\begin{equation} \hat{f}(x) = \hat{\beta}_0^{(x)} + \hat{\beta}_1^{(x)} x + \hat{\beta}_2^{(x)} x^2 \end{equation}\] where \(\{ \hat{\beta}_0^{(x)}, \hat{\beta}_1^{(x)}, \hat{\beta}_2^{(x)} \}\) are the minimizers of the weighted least squares problem \[\begin{equation} \sum_{i=1}^n w_i(x) (y_i - \beta_0 - \beta_1 x_i - \beta_2 x_i^2)^2 \end{equation}\] and the weights \(w_i(x)\) are defined as previously described.
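
For intuition, the local polynomial fit at a single point can also be computed with R's built-in weighted least squares machinery. Below is a minimal sketch using lm() with the tricube weights defined above; the helper name locreg.fit is just for illustration (the loess function uses its own compiled implementation), and degree = 1 gives local linear while degree = 2 gives local quadratic regression.

# sketch: local polynomial regression estimate of f(x0) via weighted least squares
locreg.fit <- function(x0, x, y, delta = 0.1, degree = 1){
  w <- ifelse(abs(x - x0) < delta, (1 - abs(x - x0)^3)^3, 0)   # tricube weights
  fit <- lm(y ~ poly(x, degree, raw = TRUE), weights = w)      # weighted LS fit
  unname(predict(fit, newdata = data.frame(x = x0)))           # evaluate at x0
}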

3.2 How Close is Nearby?

Cleveland proposed defining \(\delta_i\) such that the weights have non-zero values for \(s n\) observations, where \(s \in (0, 1]\) is the span parameter. Assuming that the \(x_i\) are uniformly spread between \(a = \min_i(x_i)\) and \(b = \max_i(x_i)\), we have that \(\delta_i \approx s (b - a) / 2\) for points \(x_i\) that are sufficiently far from the boundary points \(a\) and \(b\); in the example below, where \(a = 0\) and \(b = 1\), this gives \(\delta_i = s / 2\).

# define tricube weight function
tricube <- function(x, delta = 0.1) {
  ifelse(abs(x) < delta, (1 - abs(x)^3)^3, 0)
}

# plot tricube weight function
xstar <- 0.3
plot(x, tricube(x - xstar, 0.1 / 2), t = "l", ylab = "W(x)",
     cex.lab = 1.5, cex.axis = 1.25)
deltas <- c(0.2, 0.3, 0.4)
for(k in 1:3){
  lines(x, tricube(x - xstar, deltas[k] / 2), lty = k + 1)
}
legend("topright", legend = c(expression(delta * " = 0.05"),
                              expression(delta * " = 0.10"),
                              expression(delta * " = 0.15"),
                              expression(delta * " = 0.20")),
       lty = 1:4, bty = "n")

Tricube Function. Weights for local regression problem at \(x = 0.3\) with different delta values.

3.3 Visual Intuition

To get an understanding of the local regression estimator, we will look at how the estimate is formed for the point \(x^* = 0.3\) using different values of the span parameter. We will start by looking at the LOESS estimator using local linear regression, and then we will look at the LOESS estimator using local quadratic regression.

Local Linear Regression

# define function and data
set.seed(1)
n <- 101
x <- seq(0, 1, length.out = n)
fx <- sin(2 * pi * x)
y <- fx + rnorm(n, sd = 0.5)

# define x* and color for window
xstar <- 0.3
cols <- rgb(190/255,190/255,190/255,alpha=0.5)

# set-up 2 x 2 subplot
par(mfrow = c(2,2))

# loop through spans (0.1, 0.2, 0.3, 0.4)
for(s in c(0.1, 0.2, 0.3, 0.4)){
  
  # plot data and true function
  plot(x, y, main = paste0("span = ", s), ylim = c(-2.5, 2.5),
       cex.lab = 1.5, cex.axis = 1.25)
  lines(x, fx, col = "blue", lwd = 2)
  
  # plot window
  window <- c(xstar - s / 2, xstar + s / 2)
  rect(window[1], -3, window[2], 3, col = cols)
  
  # define weights
  w <- tricube(x - xstar, delta = s / 2)
  
  # plot estimate
  X.w <- sqrt(w) * cbind(1, x)
  y.w <- sqrt(w) * y
  beta <- solve(crossprod(X.w)) %*% crossprod(X.w, y.w)
  ystar <- as.numeric(cbind(1, xstar) %*% beta)
  points(xstar, ystar, pch = 17, col = "red", cex = 1)
  
  # add regression line
  abline(beta, lty = 3)
  
  # add legend
  legend("topright", legend = c("data", "truth"),
         pch = c(1, NA), lty = c(NA, 1), col = c("black", "blue"), bty = "n")
  legend("bottomright", legend = c("estimate", "window"),
         pch = c(17, 15), col = c("red", "gray"), bty = "n")
  
}

Local linear regression estimate of \(f(0.3)\) with different span values.

Note that the estimate (denoted by the red triangle) is calculated by fitting a weighted linear regression model to the \(y_i\) values that are within the local window (denoted by the gray box). The above example shows the window for the point \(x^* = 0.3\), but the same idea applies to other points. To form the function estimate at a different \(x\) value, we would just slide the window down the x-axis to be centered at the new \(x\) value, and redefine the weights accordingly.

Local Quadratic Regression

# define function and data
set.seed(1)
n <- 101
x <- seq(0, 1, length.out = n)
fx <- sin(2 * pi * x)
y <- fx + rnorm(n, sd = 0.5)

# define x* and color for window
xstar <- 0.3
cols <- rgb(190/255,190/255,190/255,alpha=0.5)

# set-up 2 x 2 subplot
par(mfrow = c(2,2))

# loop through spans (0.1, 0.2, 0.3, 0.4)
for(s in c(0.1, 0.2, 0.3, 0.4)){
  
  # plot data and true function
  plot(x, y, main = paste0("span = ", s), ylim = c(-2.5, 2.5),
       cex.lab = 1.5, cex.axis = 1.25)
  lines(x, fx, col = "blue", lwd = 2)
  
  # plot window
  window <- c(xstar - s / 2, xstar + s / 2)
  rect(window[1], -3, window[2], 3, col = cols)
  
  # define weights
  w <- tricube(x - xstar, delta = s / 2)
  
  # plot estimate
  X <- cbind(1, x - 0.5, (x - 0.5)^2)
  X.w <- sqrt(w) * X
  y.w <- sqrt(w) * y
  beta <- solve(crossprod(X.w)) %*% crossprod(X.w, y.w)
  ystar <- as.numeric(cbind(1, xstar - 0.5, (xstar - 0.5)^2) %*% beta)
  points(xstar, ystar, pch = 17, col = "red", cex = 1)
  
  # add regression line
  lines(x, X %*% beta, lty = 3)
  
  # add legend
  legend("topright", legend = c("data", "truth"),
         pch = c(1, NA), lty = c(NA, 1), col = c("black", "blue"), bty = "n")
  legend("bottomright", legend = c("estimate", "window"),
         pch = c(17, 15), col = c("red", "gray"), bty = "n")
  
}

Local quadratic regression estimate of \(f(0.3)\) with different span values.

Note that the estimate (denoted by the red triangle) is calculated by fitting a weighted quadratic regression model to the \(y_i\) values that are within the local window (denoted by the gray box). The above example shows the window for the point \(x^* = 0.3\), but the same idea applies to other points. To form the function estimate at a different \(x\) value, we would just slide the window down the x-axis to be centered at the new \(x\) value, and redefine the weights accordingly.

3.4 Selecting the Span

As you may have noticed, the loess function does not offer any data-driven method for selecting the span parameter. As a result, many users of the loess function (unknowingly?) use the default span value of \(s = 0.75\), which has no guarantee of producing a reasonable estimate of the mean function \(f(\cdot)\).

Instead of using an arbitrary value of the span parameter, it is preferable to use a cross-validated estimate. Note that the ordinary cross-validation criterion (from the previous section) can be more efficiently written as \[ \mathrm{OCV}(s) = \frac{1}{n} \sum_{i = 1}^n \left( \frac{y_i - \hat{y}_i(s)}{1 - h_{ii}(s)} \right)^2 \] where \(h_{ii}(s)\) is the \(i\)-th diagonal element of the “hat matrix” \(\mathbf{H}_s\), i.e., the matrix that linearly transforms \(\mathbf{y} = (y_1, \ldots, y_n)^\top\) into the vector of fitted values \(\hat{\mathbf{y}}_s = (\hat{y}_1(s), \ldots, \hat{y}_n(s))^\top\) via \(\hat{\mathbf{y}}_s = \mathbf{H}_s \mathbf{y}\).
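
The leave-one-out shortcut above avoids refitting the model \(n\) times. For an ordinary least squares fit the identity is exact (the quantities \((y_i - \hat{y}_i)/(1 - h_{ii})\) are the classic PRESS residuals), which gives a quick way to sanity check the formula; the sketch below uses simulated data (xx, yy) and hatvalues() to extract the \(h_{ii}\).

# sanity check of the leave-one-out identity for an ordinary least squares fit
set.seed(0)
xx <- runif(50)
yy <- 3 + 0.5 * xx + rnorm(50, sd = 0.5)
fit <- lm(yy ~ xx)
# brute-force leave-one-out prediction errors
loo <- sapply(1:50, function(i){
  yy[i] - predict(lm(yy ~ xx, subset = -i), newdata = data.frame(xx = xx[i]))
})
# shortcut using the hat values
shortcut <- residuals(fit) / (1 - hatvalues(fit))
max(abs(loo - shortcut))   # effectively zero (numerical precision)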

The OCV criterion can be interpreted as a weighted version of the (mean) squared error loss function, where the weights have the form \(w_i = (1 - h_{ii}(s))^{-2}\). Note that the leverage scores \(h_{ii}(s)\), which range between 0 and 1, can differ substantially across observations, which implies that the OCV criterion can assign rather different weights to different observations. The GCV criterion improves upon the OCV criterion by replacing each \(h_{ii}(s)\) with the average leverage \(\nu_s / n\), giving \[ \mathrm{GCV}(s) = \frac{\frac{1}{n} \sum_{i = 1}^n \left( y_i - \hat{y}_i(s) \right)^2 }{(1 - \nu_s / n)^2} \] where \(\nu_s = \sum_{i = 1}^n h_{ii}(s)\) is the trace of the hat matrix, which serves as an estimate of the degrees of freedom of the function estimate. The function below offers a simple wrapper around the loess function that uses the GCV criterion to tune the span parameter.

loess.gcv <- function(x, y){
  # loess fit with the span selected by minimizing the GCV criterion
  nobs <- length(y)
  # sort the data by x so that the fitted values can be plotted with lines()
  xs <- sort(x, index.return = TRUE)
  x <- xs$x
  y <- y[xs$ix]
  # GCV criterion as a function of the span
  tune.loess <- function(s){
    lo <- loess(y ~ x, span = s)
    mean((lo$fitted - y)^2) / (1 - lo$trace.hat/nobs)^2
  }
  # minimize GCV over the span, then refit at the optimal span
  os <- optimize(tune.loess, interval = c(.01, 99))$minimum
  lo <- loess(y ~ x, span = os)
  list(x = x, y = lo$fitted, df = lo$trace.hat, span = os)
}

4 Kernel Regression

4.1 Definition

Kernel regression is an extension of the idea of kernel density estimation to regression problems. Nadaraya (1964) and Watson (1964) independently proposed estimating \(f(\cdot)\) using a kernel-weighted linear combination of the response values, i.e., \[\begin{equation} \hat{f}(x) = \sum_{i = 1}^n y_i w_i(x) \end{equation}\] where the weights are defined as \[\begin{equation} w_i(x) = \frac{ K( [x - x_i] / h ) }{ \sum_{j = 1}^n K( [x - x_j] / h ) } \end{equation}\] with \(K(\cdot)\) denoting some known kernel function (typically the Gaussian kernel), and \(h > 0\) denoting the bandwidth parameter. Note that the bandwidth parameter controls how much information from nearby \((x_i, y_i)\) points is used to estimate the function \(f(\cdot)\) at the point \(x\).
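
Written in code, the Nadaraya-Watson estimate at a single point is just a normalized weighted average with kernel weights. Below is a minimal sketch using a Gaussian kernel via dnorm, with h playing the role of the kernel standard deviation; the helper name nw.fit is just for illustration (the ksmooth function uses a different bandwidth parameterization, as discussed in Section 4.4).

# sketch: Nadaraya-Watson estimate of f(x0) with a Gaussian kernel of sd h
nw.fit <- function(x0, x, y, h = 0.05){
  w <- dnorm((x0 - x) / h)       # kernel weights (unnormalized)
  sum(w * y) / sum(w)            # normalized weighted average of the y values
}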

4.2 How Close is Nearby?

Remember that the bandwidth parameter relates to the standard deviation of the kernel function, which is centered at each \(x_i\) value. Thus, the kernel regression estimate of the function \(f(\cdot)\) evaluated at the point \(x\) is constructed by taking a linear combination of the \(y_i\) values, where the weight given to each \(y_i\) is relatively large if \(x_i\) is nearby \(x\) and relatively small otherwise. As the bandwidth parameter \(h \rightarrow 0\), the kernel regression estimate of \(f(x)\) only uses information from \(y_i\) values whose \(x_i\) values are very close to \(x\). In contrast, as \(h \rightarrow \infty\), the kernel regression estimate of \(f(x)\) also uses information from \(y_i\) values whose \(x_i\) values are relatively far from \(x\). In both cases, the kernel regression estimate gives larger weight to the \(y_i\) values whose \(x_i\) values are closer to the given point \(x\).

# plot Gaussian kernel function
xstar <- 0.3
plot(x, dnorm((x - xstar) / (1 / 60)), 
     t = "l", ylab = "K([x - 0.3] / h)", cex.lab = 1.5, 
     cex.axis = 1.25, ylim = c(0, 0.4))
for(k in 2:4){
  lines(x, dnorm((x - xstar) / (k / 60)) , lty = k)
}
legend("topright", legend = c("h = 1/60", "h = 2/60", "h = 3/60", "h = 4/60"),
       lty = 1:4, bty = "n")

Bandwidths. Weights for Gaussian kernel regression problem at \(x = 0.3\) with different bandwidths.

4.3 Visual Intuition

To get an understanding of the kernel regression estimator, we will look at how the estimate is formed for the point \(x^* = 0.3\) using different values of the bandwidth parameter.

# define function and data
set.seed(1)
n <- 101
x <- seq(0, 1, length.out = n)
fx <- sin(2 * pi * x)
y <- fx + rnorm(n, sd = 0.5)

# define x* and color for window
xstar <- 0.3
cols <- rgb(190/255, 190/255, 190/255, alpha = 0.5)

# set-up 2 x 2 subplot
par(mfrow = c(2,2))

# loop through h = c(1, 2, 3, 4) / 60
for(h in c(1:4)/60){
  
  # plot data and true function
  plot(x, y, main = paste0("h = ", h * 60, "/60"), ylim = c(-2.5, 2.5),
       cex.lab = 1.5, cex.axis = 1.25)
  lines(x, fx, col = "blue", lwd = 2)
  
  # plot 99% window
  window <- c(xstar - 3 * h, xstar + 3 * h)
  rect(window[1], -3, window[2], 3, col = cols)
  
  # define weights
  w <- dnorm((x - xstar) / h) 
  w <- w / sum(w)
  
  # plot estimate
  ystar <- sum(y * w)
  points(xstar, ystar, pch = 17, col = "red", cex = 1)
  
  # add legend
  legend("topright", legend = c("data", "truth"),
         pch = c(1, NA), lty = c(NA, 1), col = c("black", "blue"), bty = "n")
  legend("bottomright", legend = c("estimate", "99% area"),
         pch = c(17, 15), col = c("red", "gray"), bty = "n")
  
}

Kernel regression estimate of \(f(0.3)\) with different bandwidth values.

Note that the estimate (denoted by the red triangle) is calculated using a weighted average of the \(y_i\) values. In this case, all of the \(y_i\) values have non-zero weights, but many of the weights are practically zero. For the visualization, the gray box denotes 99% of the area under the Gaussian kernel for the corresponding bandwidth. The above example shows the window for the point \(x^* = 0.3\), but the same idea applies to other points. To form the function estimate at a different \(x\) value, we would just slide the window down the x-axis to be centered at the new \(x\) value, and redefine the weights accordingly.

4.4 Selecting the Bandwidth

Similar to the loess function, the ksmooth function does not offer any data-driven method for selecting the bandwidth parameter. As a result, many users of the ksmooth function (unknowingly?) use the default bandwidth value of \(h = 0.5\), which has no guarantee of producing a reasonable estimate of the mean function \(f(\cdot)\). Note that R’s ksmooth function parameterizes the problem such that “kernels are scaled so that their quartiles (viewed as probability densities) are at +/- 0.25 bandwidth”, so the bandwidth argument of the ksmooth function is not the standard deviation of the kernel function.
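
Given the quartile scaling quoted above, the implied standard deviation of ksmooth's normal kernel is \(0.25 \times \mathrm{bandwidth} / \Phi^{-1}(0.75) \approx 0.37 \times \mathrm{bandwidth}\), since the quartiles of a N(0, \(\sigma^2\)) density sit at \(\pm \Phi^{-1}(0.75) \, \sigma\). The small helper below (a hypothetical convenience function, not part of ksmooth) makes the conversion explicit.

# convert a ksmooth "bandwidth" to the implied standard deviation of the normal kernel
bw2sd <- function(bandwidth) 0.25 * bandwidth / qnorm(0.75)
bw2sd(0.5)   # the default bandwidth of 0.5 implies a kernel sd of roughly 0.185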

The function below offers a simple Gaussian kernel regression smoother in the spirit of ksmooth with kernel = "normal", except that its bandwidth parameter h is the standard deviation of the Gaussian kernel, and the GCV criterion is used to tune h. Note that this function searches for the optimal bandwidth in the range \([r / n, r]\), where \(r = \max_i(x_i) - \min_i(x_i)\) is the range of the observed sample. If the optimal \(h\) falls on a boundary point, i.e., if \(\hat{h} = r/n\) or \(\hat{h} = r\), then the search range should be expanded in the appropriate direction. Furthermore, it should be noted that the ksmooth.gcv function will be quite slow for large samples of data, given that it requires forming an \(n \times n\) kernel matrix for each candidate bandwidth.

ksmooth.gcv <- function(x, y){
  # Gaussian kernel regression with bandwidth selected by minimizing the GCV criterion
  nobs <- length(y)
  # sort the data by x so that the fitted values can be plotted with lines()
  xs <- sort(x, index.return = TRUE)
  x <- xs$x
  y <- y[xs$ix]
  # n x n matrix of pairwise differences (used to build the kernel weights)
  xdif <- outer(x, x, FUN = "-")
  # GCV criterion as a function of the bandwidth (kernel standard deviation)
  tune.ksmooth <- function(h){
    xden <- dnorm(xdif / h)
    xden <- xden / rowSums(xden)    # hat matrix of the kernel smoother
    df <- sum(diag(xden))           # effective degrees of freedom (trace of hat matrix)
    fit <- xden %*% y
    mean((fit - y)^2) / (1 - df/nobs)^2
  }
  # minimize GCV over the bandwidth, warning if the optimum is on the boundary
  xrng <- diff(range(x))
  oh <- optimize(tune.ksmooth, interval = c(xrng/nobs, xrng))$minimum
  if(any(oh == c(xrng/nobs, xrng)))
    warning("Minimum found on boundary of search range.\nYou should retune model with expanded range.")
  # refit at the optimal bandwidth
  xden <- dnorm(xdif / oh)
  xden <- xden / rowSums(xden)
  df <- sum(diag(xden))
  fit <- xden %*% y
  list(x = x, y = fit, df = df, h = oh)
}

5 Example 1: Prestige from Income

5.1 Overview of Data

The Prestige dataset (from the car package) contains the prestige of \(n = 102\) Canadian occupations from 1971, as well as the average income of the occupation. We will use the nonparametric regression methods to explore the relationship between prestige and income.

First, let’s load the data and visualize the relationship between income (\(X\)) and prestige (\(Y\)).

# load data
library(car)
## Loading required package: carData
data(Prestige)
head(Prestige)
##                     education income women prestige census type
## gov.administrators      13.11  12351 11.16     68.8   1113 prof
## general.managers        12.26  25879  4.02     69.1   1130 prof
## accountants             12.77   9271 15.70     63.4   1171 prof
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof
## chemists                14.62   8403 11.68     73.5   2111 prof
## physicists              15.64  11030  5.13     77.6   2113 prof
# plot data
plot(Prestige$income, Prestige$prestige, 
     xlab = "Income", ylab = "Prestige")

Note: the relationship looks non-linear. For occupations that earn less than $10K, there is a strong (positive) linear relationship between income and prestige. However, for occupations that earn between $10K and $25K, the relationship has a substantially different (attenuated) slope.

5.2 Analyses and Results

R code to fit local averaging, local regression, and kernel regression:

# local averaging (cv span selection)
locavg <- with(Prestige, supsmu(income, prestige))

# local regression (gcv span selection)
locreg <- with(Prestige, loess.gcv(income, prestige))
locreg$df
## [1] 3.000201
# kernel regression (gcv span selection)
kern <- with(Prestige, ksmooth.gcv(income, prestige))
kern$df
## [1] 8.00558

Note: the local regression estimator uses about 3 effective degrees of freedom, whereas the kernel regression estimator uses about 8.

R code to plot the results:

# plot data
plot(Prestige$income, Prestige$prestige, xlab = "Income", ylab = "Prestige")

# add fit lines
lines(locavg, lwd = 2)
lines(locreg, lwd = 2, lty = 2, col = "red")
lines(kern, lwd = 2, lty = 3, col = "blue")

# add legend
legend("bottomright", c("supsmu", "loess", "ksmooth"),
       lty = 1:3, lwd = 2, col = c("black", "red", "blue"), bty = "n")

From the above plot, it looks like the GCV-tuned LOESS estimate is performing best (i.e., providing the best combination of fit and smoothness), the CV-tuned local average is performing second best, and the GCV-tuned kernel regression estimate is performing worst. In particular, the GCV-tuned kernel regression estimate produces a rather rough/wiggly estimate of the relationship between income and prestige.

6 Example 2: Motorcycle Accident

6.1 Overview of Data

The mcycle dataset (from the MASS package) contains \(n = 133\) pairs of time points (in ms) and observed head accelerations (in g) that were recorded in a simulated motorcycle accident. We will use the nonparametric regression methods to explore the relationship between time and acceleration.

First, let’s load the data and visualize the relationship between time (\(X\)) and acceleration (\(Y\)).

# load data
library(MASS)
data(mcycle)
head(mcycle)
##   times accel
## 1   2.4   0.0
## 2   2.6  -1.3
## 3   3.2  -2.7
## 4   3.6   0.0
## 5   4.0  -2.7
## 6   6.2  -2.7
# plot data
plot(mcycle$times, mcycle$accel, 
     xlab = "Time (ms)", ylab = "Acceleration (g)")

Note: the relationship looks non-linear. The head acceleration is stable from 0-15 ms, drops from about 15-20 ms, rises from 20-30 ms, drops from 30-40 ms, and then begins to stabilize.

6.2 Analyses and Results

R code to fit local averaging, local regression, and kernel regression:

# local averaging (cv span selection)
locavg <- with(mcycle, supsmu(times, accel))

# local regression (gcv span selection)
locreg <- with(mcycle, loess.gcv(times, accel))
locreg$df
## [1] 10.49846
# kernel regression (gcv span selection)
kern <- with(mcycle, ksmooth.gcv(times, accel))
kern$df
## [1] 19.95457

Note: the local regression estimator uses about 10.5 effective degrees of freedom, whereas the kernel regression estimator uses about 20.

R code to plot the results:

# plot data
plot(mcycle$times, mcycle$accel, 
     xlab = "Time (ms)", ylab = "Acceleration (g)")

# add fit lines
lines(locavg, lwd = 2)
lines(locreg, lwd = 2, lty = 2, col = "red")
lines(kern, lwd = 2, lty = 3, col = "blue")

# add legend
legend("bottomright", c("supsmu", "loess", "ksmooth"),
       lty = 1:3, lwd = 2, col = c("black", "red", "blue"), bty = "n")

From the above plot, it looks like the GCV-tuned LOESS estimate is performing best (i.e., providing the best combination of fit and smoothness). The CV-tuned local average is too smooth and misses the peak around 30 ms. The GCV-tuned kernel regression estimate fits the data better than the local average, but it produces a rather rough/wiggly estimate when the data gets noisier (i.e., for 30 ms or more after the onset of the simulated accident).