Copyright © January 04, 2021 by Nathaniel E. Helwig

1 Introduction

1.1 Motivation and Goals

Nonparametric regression offers a flexible alternative to classic (parametric) regression methods. Whereas parametric methods assume that the regression relationship has a known form that depends on a finite number of unknown parameters, nonparametric regression models attempt to learn the form of the regression relationship from a sample of data.

All nonparametric regression models involve finding some balance between fitting the observed sample of data (model fit) and “smoothing” the function estimate (model parsimony). Typically, this balance is determined using some form of cross-validation, which attempts to find a function estimate that does well for predicting new data. As a result, nonparametric regression models can be useful for discovering relationships between variables, as well as for developing generalizable prediction rules.

In these notes, you will learn how to fit and tune simple nonparametric regression models. You will also learn how to use a nonparametric regression model for visualizing relationships in data and forming predictions for new data.

1.2 Simple Smoothers in R

These notes cover three classic methods for “simple” nonparametric regression: local averaging, local regression, and kernel regression. Note that by “simple”, I mean that there is a single (continuous) predictor. Smoothing splines, as well as extensions for multiple and generalized regression, will be covered in another set of notes.

Local Averaging (Friedman’s “Super Smoother”):
supsmu(x, y, wt =, span = "cv", periodic = FALSE, bass = 0, trace = FALSE)

Local Regression (Cleveland’s LOESS):
loess(formula, data, weights, subset, na.action, model = FALSE,
            span = 0.75, enp.target, degree = 2,
            parametric = FALSE, drop.square = FALSE, normalize = TRUE,
            family = c("gaussian", "symmetric"),
            method = c("loess", "model.frame"),
            control = loess.control(...), ...)

Kernel Regression (Nadaraya and Watson’s Kernel Smoother):
ksmooth(x, y, kernel = c("box", "normal"), bandwidth = 0.5,
               range.x = range(x),
               n.points = max(100L, length(x)), x.points)
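
To see how these functions are called, here is a minimal sketch (not from the original notes) that applies all three smoothers to the same simulated data; the sine curve, noise level, and the bandwidth/span settings below are illustrative assumptions, not recommended defaults.

# illustrative sketch: apply the three smoothers to simulated data
set.seed(0)
x <- seq(0, 1, length.out = 50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.3)

# Friedman's super smoother (span selected by cross-validation)
fit.ss <- supsmu(x, y)

# Cleveland's LOESS (formula interface, default span = 0.75)
fit.lo <- loess(y ~ x)

# Nadaraya-Watson kernel smoother with a Gaussian kernel
fit.ks <- ksmooth(x, y, kernel = "normal", bandwidth = 0.2)

# overlay the three fits on the data
plot(x, y)
lines(fit.ss, col = "blue", lwd = 2)
lines(x, fitted(fit.lo), col = "red", lwd = 2)
lines(fit.ks, col = "darkgreen", lwd = 2)
legend("topright", legend = c("supsmu", "loess", "ksmooth"),
       col = c("blue", "red", "darkgreen"), lty = 1, bty = "n")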

1.3 Need for NP Regression

Anscombe’s (1973) quartet was originally devised to emphasize the need for using graphics with regression models, but it also provides a nice motivation for nonparametric regression.

# look at the data
anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
# fit four regression models
lm1 <- lm(y1 ~ x1, data = anscombe)
lm2 <- lm(y2 ~ x2, data = anscombe)
lm3 <- lm(y3 ~ x3, data = anscombe)
lm4 <- lm(y4 ~ x4, data = anscombe)

# print coefficients
coef(lm1)
## (Intercept)          x1 
##   3.0000909   0.5000909
coef(lm2)
## (Intercept)          x2 
##    3.000909    0.500000
coef(lm3)
## (Intercept)          x3 
##   3.0024545   0.4997273
coef(lm4)
## (Intercept)          x4 
##   3.0017273   0.4999091
# check R-squared values
summary(lm1)$r.squared
## [1] 0.6665425
summary(lm2)$r.squared
## [1] 0.666242
summary(lm3)$r.squared
## [1] 0.666324
summary(lm4)$r.squared
## [1] 0.6667073
# plot results
mods <- list(lm1, lm2, lm3, lm4)
par(mfrow = c(2, 2), mar = c(5, 5, 4, 2) + 0.1)
for(j in 1:4){
  xseq <- seq(4, 15, length.out = 20) + ifelse(j == 4, 4, 0)
  coefs <- coef(mods[[j]])
  title <- substitute("Dataset " * jj * ":   " * hat(y) * " = 3 + 0.5" * italic(x),
                      list(jj = j))
  plot(xseq, coefs[1] + coefs[2] * xseq, type = "l", ylim = c(0, 15),
     xlab = expression(italic(x)), ylab = expression(hat(italic(y))),
     main = title, lwd = 2, col = "blue", cex.lab = 1.25, cex.main = 1.5)
  text(6 + ifelse(j == 4, 4, 0), 13, expression(R^2 * " = 0.67"))
}

# plot results with data
mods <- list(lm1, lm2, lm3, lm4)
par(mfrow = c(2, 2), mar = c(5, 5, 4, 2) + 0.1)
for(j in 1:4){
  xseq <- seq(4, 15, length.out = 20) + ifelse(j == 4, 4, 0)
  coefs <- coef(mods[[j]])
  title <- substitute("Dataset " * jj * ":   " * hat(y) * " = 3 + 0.5" * italic(x),
                      list(jj = j))
  plot(xseq, coefs[1] + coefs[2] * xseq, type = "l", ylim = c(0, 15),
     xlab = expression(italic(x)), ylab = expression(hat(italic(y))),
     main = title, lwd = 2, col = "blue", cex.lab = 1.25, cex.main = 1.5)
  points(anscombe[,c(j, j + 4)], pch = 19, col = "red")
}

Points of the story:

  1. Just because a model fits “good” doesn’t mean that it is a good model.

  2. Parametric regression models may not reveal “true” relationships in data.

1.4 Model and Assumptions

Suppose we observe \(n\) independent pairs of points \(\{ (x_i, y_i) \}_{i = 1}^n\) and assume without loss of generality (WLOG) that \(x_1 \leq x_2 \leq \cdots \leq x_n\). Furthermore, suppose that \(X\) and \(Y\) have a functional relationship of the form \[\begin{equation} y_i = f(x_i) + \epsilon_i \end{equation}\] where \(f(\cdot)\) is some unknown smooth function, and \(\epsilon_i\) is an iid error term with mean zero and variance \(\sigma^2\). Note that the assumption of homogeneous error variance is not actually needed for these methods; however, this assumption is often useful for tuning and inference. The assumption of mean zero error terms is necessary, because it implies that \(E(y_i | x_i) = f(x_i)\), i.e., that the unknown smooth function \(f(\cdot)\) describes the conditional mean of \(Y\) given \(X\). Nonparametric regression estimators (also known as “smoothers”) attempt to estimate the unknown function \(f(\cdot)\) from a sample of noisy data.
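
To make the conditional mean interpretation concrete, the following sketch (not part of the original notes) simulates many replicates of \(y\) at a single fixed \(x\) and checks that their average is close to \(f(x)\); the sine function and error standard deviation are assumptions chosen to match the example used in Section 2.3.

# sketch: the average of y at a fixed x recovers f(x)
set.seed(1)
f <- function(x) sin(2 * pi * x)       # assumed "true" smooth function
x0 <- 0.3                              # fixed value of x
y0 <- f(x0) + rnorm(10000, sd = 0.5)   # many noisy replicates at x = x0
mean(y0)   # sample mean is approximately f(x0)
f(x0)      # sin(0.6 * pi), approximately 0.951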

2 Local Averaging

2.1 Definition

To estimate the unknown function \(f(\cdot)\), Friedman (1984) proposed the simple and powerful idea of local averaging. The local averaging estimate of the function \(f(\cdot)\) evaluated at the point \(x\) is constructed by taking an average of the \(y_i\) values corresponding to \(x_i\) values that are nearby the given \(x\). More formally, the local averaging estimate can be written as \[\begin{equation} \hat{f}(x) = \frac{ \sum_{i = 1}^n y_i w_i(x) }{ \sum_{i = 1}^n w_i(x) } \end{equation}\] where the weights are defined such that \(w_i(x) = 1\) if the point \(x_i\) is nearby the point \(x\), and \(w_i(x) = 0\) otherwise. Note that the weights are a function of \(x\), i.e., there will be a different set of weights (defining the nearby points) for each \(x\).
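
As a concrete illustration (not from the original notes), the following sketch implements the local averaging estimate at a single point using 0/1 weights; it assumes a fixed window half-width h to define which \(x_i\) are nearby, whereas Friedman's span-based definition of the window is described next.

# sketch: local averaging estimate of f(x) at a single point x
local.avg <- function(x, xdat, ydat, h = 0.1){
  w <- as.numeric(abs(xdat - x) <= h)   # w_i(x) = 1 if x_i is within h of x
  sum(ydat * w) / sum(w)                # average of the nearby y_i
}

# evaluate the estimate on a grid of x values
set.seed(1)
xdat <- runif(100)
ydat <- sin(2 * pi * xdat) + rnorm(100, sd = 0.5)
xgrid <- seq(0, 1, length.out = 25)
fhat <- sapply(xgrid, local.avg, xdat = xdat, ydat = ydat)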

2.2 How Close is Nearby?

To formalize the concept of nearby, Friedman proposed using the smallest symmetric window around \(x_i\) that contains \(sn\) observations, where \(s \in (0, 1]\). The parameter \(s\) is called the span parameter, and it controls the smoothness of the function estimate. As \(s \rightarrow 0\) the function estimate becomes more jagged (i.e., less smooth), and as \(s \rightarrow 1\) the function estimate becomes smoother. The goal is to find the value of \(s\) that provides a “reasonable” estimate of the unknown smooth function \(f(\cdot)\).
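
For a quick look at how the span affects the fit, the following sketch (an addition, not from the original notes) runs supsmu() with several fixed spans on simulated sine data; the data-generating setup mirrors the example introduced in the next subsection.

# sketch: effect of the span on Friedman's super smoother
set.seed(1)
x <- seq(0, 1, length.out = 101)
y <- sin(2 * pi * x) + rnorm(101, sd = 0.5)

par(mfrow = c(2, 2))
for(s in c(0.1, 0.2, 0.3, 0.4)){
  plot(x, y, main = paste0("span = ", s))
  lines(x, sin(2 * pi * x), col = "blue", lwd = 2)      # true function
  lines(supsmu(x, y, span = s), col = "red", lwd = 2)   # supsmu estimate
}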

2.3 Visual Intuition

To get an understanding of the local averaging estimator, we will look at how the estimate is formed for the point \(x^* = 0.3\) using different values of the span parameter.

# define function and data
set.seed(1)
n <- 101
x <- seq(0, 1, length.out = n)
fx <- sin(2 * pi * x)
y <- fx + rnorm(n, sd = 0.5)

# define x* and color for window
xstar <- 0.3
cols <- rgb(190/255, 190/255, 190/255, alpha = 0.5)

# set-up 2 x 2 subplot
par(mfrow = c(2,2))

# loop through spans (0.1, 0.2, 0.3, 0.4)
for(s in c(0.1, 0.2, 0.3, 0.4)){
  
  # plot data and true function
  plot(x, y, main = paste0("span = ", s), ylim = c(-2.5, 2.5),
       cex.lab = 1.5, cex.axis = 1.25)
  lines(x, fx, col = "blue", lwd = 2)
  
  # plot window
  window <- c(xstar - s / 2, xstar + s / 2)
  rect(window[1], -3, window[2], 3, col = cols)
  
  # define weights
  w <- rep(0, n)
  w[x >= window[1] & x <= window[2]] <- 1
  
  # plot estimate
  ystar <- sum(y * w) / sum(w)
  points(xstar, ystar, pch = 17, col = "red", cex = 1)
  
  # add legend
  legend("topright", legend = c("data", "truth"),
         pch = c(1, NA), lty = c(NA, 1), col = c("black", "blue"), bty = "n")
  legend("bottomright", legend = c("estimate", "window"),
         pch = c(17, 15), col = c("red", "gray"), bty = "n")
  
}