Chapter 28 Assessing Assumptions
28.1 Residuals
Our statistical assumptions about the data are about the \(\epsilon_{ij}\)s: independence, constant variance, and normality. But we don’t get to observe the \(\epsilon_{ij}\)s. The best that we can do is try to approximate them by the residuals in our data. We will see that there are several kinds of residuals, based on how much post-processing we do. For most purposes, any of the types of residuals will suffice, but there are situations where one or another is advantageous.
When assessing assumptions, be aware that a problem of one sort (say, non-constant variance) can make a problem of another type (say, non-normality) appear to be present. Similarly, a non-normal distribution for the errors could lead one to think that there are outliers, when what is really present is general non-normality. Thus it is important to assess the model assumptions in multiple ways before jumping to conclusions about what is going on in the data.
Because we want to explore assessing assumptions in data, we will look at data sets that violate the assumptions. In earlier examples we looked at the logarithm of lifetime as a response in the resin lifetime data. Now we will examine these data on the original scale (before taking logs).
> library(cfcdae)
> data(ResinLifetimes)
> lifetimes.separate.means <- lm(Time~temp,data=ResinLifetimes)
> anova(lifetimes.separate.means)
Analysis of Variance Table
Response: Time
Df Sum Sq Mean Sq F value Pr(>F)
temp 4 28027.9 7007.0 101.81 < 2.2e-16 ***
Residuals 32 2202.4 68.8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There are several kinds of residuals and corresponding functions in R to extract them: residuals(), rstandard(), and rstudent().
- Raw residuals are just the differences between the observed and predicted values.
- Standardized residuals divide each raw residual by its estimated standard deviation. These are also called "internally Studentized."
- Studentized residuals divide each raw residual by a standard deviation estimated from all of the data except the corresponding data point. These are also called "externally Studentized."
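As a check on these definitions, here is a minimal sketch that rebuilds rstandard() and rstudent() from the raw residuals, using hatvalues() for the leverages and the standard leave-one-out variance identity.
> # Sketch: rebuild rstandard() and rstudent() by hand.
> e <- residuals(lifetimes.separate.means)      # raw residuals
> h <- hatvalues(lifetimes.separate.means)      # leverages
> s <- summary(lifetimes.separate.means)$sigma  # root mean squared error
> r.std <- e/(s*sqrt(1 - h))                    # internally Studentized
> nu <- df.residual(lifetimes.separate.means)   # error df (N - g)
> s.i <- sqrt((nu*s^2 - e^2/(1 - h))/(nu - 1))  # error SD estimated without point i
> r.stu <- e/(s.i*sqrt(1 - h))                  # externally Studentized
> all.equal(r.std, rstandard(lifetimes.separate.means))
> all.equal(r.stu, rstudent(lifetimes.separate.means))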
> round(residuals(lifetimes.separate.means),2)
1 2 3 4 5 6 7 8 9 10 11 12 13 14
23.22 -5.14 13.58 -3.25 -15.63 4.78 -10.57 -6.99 2.15 7.73 -17.26 13.98 2.15 -2.82
15 16 17 18 19 20 21 22 23 24 25 26 27 28
-8.08 2.15 9.36 10.15 -0.53 -4.10 -2.13 -5.90 -6.32 -0.53 -1.59 0.88 -0.93 -1.26
29 30 31 32 33 34 35 36 37
0.50 3.34 -0.93 6.32 -5.11 0.15 -1.40 0.43 -0.39
> round(rstandard(lifetimes.separate.means),2)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
2.99 -0.66 1.75 -0.42 -2.01 0.62 -1.36 -0.90 0.28 1.00 -2.22 1.80 0.28 -0.36 -1.04 0.28
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
1.21 1.31 -0.07 -0.53 -0.27 -0.76 -0.81 -0.07 -0.21 0.11 -0.12 -0.16 0.07 0.43 -0.12 0.84
33 34 35 36 37
-0.67 0.02 -0.19 0.06 -0.05
> round(rstudent(lifetimes.separate.means),2)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
3.47 -0.66 1.81 -0.41 -2.12 0.61 -1.38 -0.90 0.27 1.00 -2.38 1.87 0.27 -0.36 -1.04 0.27
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
1.22 1.32 -0.07 -0.52 -0.27 -0.76 -0.81 -0.07 -0.20 0.11 -0.12 -0.16 0.06 0.43 -0.12 0.83
33 34 35 36 37
-0.67 0.02 -0.18 0.06 -0.05
We will also use the cloud seeding data, where the response is rainfall and the treatment is whether or not a cloud was seeded.
> data(CloudSeeding)
> cloud.fit <- lm(rainfall~seeding,data=CloudSeeding)
> anova(cloud.fit)
Analysis of Variance Table
Response: rainfall
Df Sum Sq Mean Sq F value Pr(>F)
seeding 1 1000332 1000332 3.993 0.05114 .
Residuals 50 12526130 250523
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The Studentized residuals for the cloud seeding data include a few strikingly large values:
> round(rstudent(cloud.fit),2)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
2.19 1.37 0.42 0.37 0.32 0.16 0.00 -0.03 -0.14 -0.16 -0.17 -0.19 -0.24 -0.25 -0.26 -0.27
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
-0.27 -0.28 -0.28 -0.28 -0.29 -0.30 -0.31 -0.32 -0.32 -0.33 6.21 2.72 2.61 1.09 0.53 0.10
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
-0.02 -0.22 -0.28 -0.34 -0.34 -0.38 -0.40 -0.49 -0.49 -0.63 -0.65 -0.66 -0.66 -0.71 -0.82 -0.83
49 50 51 52
-0.83 -0.86 -0.88 -0.89
Finally, we load the oven temperature data, a sequence of measurements taken in time order that we will use again when we assess serial correlation; here we fit a model with just an overall mean.
> data(Thermocouples)
> Oven.fit <- lm(temp.diff~1,data=Thermocouples)
> summary(Oven.fit)
Call:
lm(formula = temp.diff ~ 1, data = Thermocouples)
Residuals:
Min 1Q Median 3Q Max
-0.09125 -0.01125 0.00375 0.00875 0.04875
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.141250 0.003138 1001 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0251 on 63 degrees of freedom
28.2 Nonconstant variance
Our principal tool for assessing assumptions is to plot residuals. The old standby plot, frequently called the residual plot, is to plot one point for each data point. The horizontal location is the predicted value and the vertical location is the residual (possibly standardized or Studentized). A more recent variation is the scale-location plot. Again, there is one point on the plot for each data point. The horizontal location for a point is again the predicted value, and the vertical location is the square root of the absolute standardized residual.
For the residual plot you get vertical streaks of points, with one vertical streak for each predicted value. The vertical streaks should be symmetric around 0, and the vertical extent should be consistent across the plot. The scale-location plot also has vertical streaks of points, and the typical height of these streaks should be the same across the plot.
If some streaks are longer than others in the residual plot, or some streaks are higher than others in the scale-location plot, then we have evidence of non-constant variance. Of course, the data are random, so the lengths and heights will not be exactly the same, even if our model is a perfect representation of our data.
In R, plot(modelfit,which=1) produces a residual plot, and plot(modelfit,which=3) produces the scale-location plot. plot(modelfit) produces the residual plot, the normal probability plot, the scale-location plot, and a plot that is generally of less interest in designed experiments (where balance, or near balance, ensures that the leverages are all nearly the same).
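To see exactly what goes into these plots, here is a sketch of simplified hand-made versions; plot.lm uses raw residuals in the first plot and adds smooths and labels, while for simplicity we use standardized residuals in both.
> # Sketch: simplified hand-made diagnostic plots.
> pred <- fitted(lifetimes.separate.means)
> r <- rstandard(lifetimes.separate.means)
> plot(pred, r, xlab="Fitted values", ylab="Standardized residuals")
> abline(h=0)                     # residual plot, cf. which=1
> plot(pred, sqrt(abs(r)), xlab="Fitted values",
+      ylab="sqrt(|standardized residuals|)")  # scale-location, cf. which=3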
We want to see no pattern in the residual plot and the scale-location plot. For the resin lifetime data on the original scale, we see that the variability is higher when the predicted value is higher. This type of residual plot is common and is often referred to as a right-opening megaphone.
> plot(lifetimes.separate.means,which=1)

Figure 28.1: Residuals vs fitted for original resin lifetimes
> plot(lifetimes.separate.means,which=3)

Figure 28.2: Scale-location plot for original resin lifetimes
The car package has a residual plotting command that will plot boxplots of the residuals by treatment group as well as residuals against fitted values. In this case, they just look like mirror images of each other, because the means are in order of the treatments.
> car::residualPlots(lifetimes.separate.means)
Warning in residualPlots.default(model, ...): No possible lack-of-fit tests

Figure 28.3: Residual plots from car
We can make the same two plots for the cloud seeding data.
> plot(cloud.fit,which=1)
> plot(cloud.fit,which=3)

Comparing variances is more difficult, in a statistical sense, than comparing means. This does not mean a more complex calculation; it means that the statistical properties of tests for comparing variances are not as good as we would like. For example, most of the classical tests of equality of variance are so sensitive to non-normality that they are essentially useless in practice. I usually say don't test variances, but if you have to, use Levene's test. Here we see that even with this glaringly obvious non-constant variance, Levene's test is not significant.
> car::leveneTest(lifetimes.separate.means)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 4 2.3746 0.07286 .
32
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
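With center = median (as reported in the output above), Levene's test is just the usual ANOVA F-test applied to the absolute deviations from the group medians, so we can reproduce it by hand:
> # Sketch: Levene's test as an ANOVA on |y - group median|.
> absdev <- with(ResinLifetimes, abs(Time - ave(Time, temp, FUN=median)))
> anova(lm(absdev ~ temp, data=ResinLifetimes))  # F and p match leveneTest()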
To be honest, this would all have been obvious to us if we had done something simple like making boxplots of the original data by treatment group.
> boxplot(Time~temp,data=ResinLifetimes,main="(Non-log) Resin Lifetimes")

Figure 28.4: Boxplots of original resin lifetimes
Redoing the plots with a log y-axis is analogous to working with logged data. Group variability is much more similar now.
> boxplot(Time~temp,data=ResinLifetimes,main="Log Resin Lifetimes",log='y')

Figure 28.5: Boxplots of resin lifetimes on a log scale
Levene's test for the cloud seeding data is likewise not significant.
> car::leveneTest(cloud.fit)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 2.8628 0.09687 .
50
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
28.3 Normality/outliers
28.3.1 Outliers
An outlier is a data point that does not follow the pattern established by the rest of the data. Outliers can be bad data; examples include transcription errors made when copying or typing data and procedural problems (Joe sneezed in the test tube; check those lab notebooks). Outliers can also be the most important data in your data set (such as the trace of a Higgs boson in the mountain of results at CERN).
Truly bad data should be removed from your analysis and, ideally, replaced with corrected data.
More common is that you have an outlier, but there is no reason to suspect that it is bad data. What that means is that your model is not good enough to fit the data properly. However, it is often quite challenging to establish a new and improved model that can accommodate all of the data. In such situations, it is common to analyze the data twice, once including and once excluding the outlier(s). If the conclusions agree, then lucky you: your inference does not depend on just a few values. If the conclusions disagree, then you either need to look for a new model/analysis method, or you need to report both conclusions noting how the results depend on just a few data points.
Choosing to retain or exclude data points based on which one gives you your desired answer is research fraud.
How do we locate those unusual data points? If our assumptions are correct, then the Studentized residuals follow a \(t\)-distribution with \(N-g-1\) degrees of freedom. One simple approach is to find the largest (in absolute value) Studentized residual and compute its two-sided \(t\) tail area (with \(N-g-1\) df). You can do this by hand, or there is a function in the car package that does it all at once. If you look at all of the Studentized residuals, you need to multiply the tail areas by \(N\) as a Bonferroni correction for multiple comparisons.
> max(abs(rstudent(lifetimes.separate.means)))
[1] 3.471028
> 2*pt(-max(abs(rstudent(lifetimes.separate.means))),31)*37
[1] 0.05729818
The outlierTest() function in the car package automates the process.
> car::outlierTest(lifetimes.separate.means) # easier way
No Studentized residuals with Bonferroni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferroni p
1 3.471028 0.0015486 0.057298
> car::outlierTest(cloud.fit)
rstudent unadjusted p-value Bonferroni p
27 6.212291 1.1009e-07 5.7245e-06
> use.these <- rep(TRUE,52)
> use.these[27] <- FALSE
> cloud.fit.subset1 <- lm(rainfall~seeding,data=CloudSeeding,subset=use.these)
> car::outlierTest(cloud.fit.subset1)
rstudent unadjusted p-value Bonferroni p
28 4.214633 0.00010977 0.0055984
29 4.038566 0.00019293 0.0098392
> use.these[28:29] <- FALSE
> cloud.fit.subset2 <- lm(rainfall~seeding,data=CloudSeeding,subset=use.these)
> car::outlierTest(cloud.fit.subset2)
rstudent unadjusted p-value Bonferroni p
1 5.005484 8.63e-06 0.00042287
Yet another outlier! In fact, we can keep stripping outliers for some time.
The problem with the cloud seeding data is not really outliers. But if you look for outliers first, you will find them.
You need to consider other issues such as non-constant variance and more general non-normality before pursuing outlier removal.
28.3.2 Non-Normality
A graphical way to assess normality is a normal probability plot of (some version of) the residuals. In fact, if you plot a fitted linear model, R will produce a normal probability plot along with three other plots; plot(model,which=2) gives you the normal probability plot of the standardized residuals.
For the resin lifetime data on the original scale, we see long tails. Actually, this “long tail” is mostly from combining normals with different variances; the combination looks like long tails.
> plot(lifetimes.separate.means,which=2)

Figure 28.6: NPP of original resin lifetimes
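A close analogue of this plot can be built by hand:
> # Sketch: normal probability plot of the standardized residuals.
> qqnorm(rstandard(lifetimes.separate.means))
> qqline(rstandard(lifetimes.separate.means))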
We make the analogous plot for the cloud seeding data.
> plot(cloud.fit,which=2)

28.4 Temporal dependence
When data are collected in some time order, it is not uncommon for \(\epsilon_{ij}\)s that are close in time to be more similar to each other than we might expect from independence. This is positive serial correlation or positive temporal dependence. The reverse, negative serial correlation where nearby in time errors are less similar than we expect from independence, can happen, but it is much less common than positive serial correlation.
The obvious graphical approach to detecting serial correlation is to plot the residuals in time order and look to see if there are patterns. For example, stretches of positive residuals followed by stretches of negative residuals would indicate positive serial correlation.
The oven data are in time order in the data set, so we can just plot them to look for serial correlation. The type="b" argument in plot() says to connect the dots; doing this often makes issues more clear. The abline() call puts a horizontal line at 0.
> plot(residuals(Oven.fit),type="b")
> abline(h=0)

Figure 28.7: Residuals from oven data in time order
We see that points above the line tend to cluster together, and points below the line tend to cluster together. This is positive serial correlation.
We can perform a formal test using the Durbin-Watson statistic. Here the statistic is in the "uh oh" range (below 1.5 or above 2.5), but the p-value is a marginal .06.
> car::durbinWatsonTest(Oven.fit)
lag Autocorrelation D-W Statistic p-value
1 0.2131219 1.513854 0.06
Alternative hypothesis: rho != 0
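The first-lag Durbin-Watson statistic is just the sum of squared successive differences of the residuals divided by their sum of squares, so it is easy to reproduce:
> # Sketch: Durbin-Watson statistic computed directly.
> e <- residuals(Oven.fit)
> sum(diff(e)^2)/sum(e^2)  # matches the D-W statistic above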
There are many other tests of serial correlation that fall into the category of "runs tests"; we'll just look at runs.test() in the tseries package. It takes a two-valued (positive and negative) factor variable in time order, treats consecutive groups of the same value as a run, and counts the number of runs. Too few or too many runs indicate association (positive and negative, respectively).
So we need to take our residuals in time order and turn them into a two-valued factor. Here we split them as positive or negative.
We see below that there are fewer runs than we might expect from random data, meaning that positives and negatives are more tightly clustered than one would expect (p-value .001).
> pos.not.neg <- factor(residuals(Oven.fit) > 0)
> tseries::runs.test(pos.not.neg)
Runs Test
data: pos.not.neg
Standard Normal = -3.2761, p-value = 0.001052
alternative hypothesis: two.sided
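You can count the observed runs yourself with rle(), which encodes a sequence as maximal stretches of repeated values:
> # Sketch: count the observed runs directly.
> signs <- as.vector(residuals(Oven.fit) > 0)
> length(rle(signs)$lengths)  # observed number of runs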
Remember, only assess temporal association when there actually is a time order for your data.
28.5 And Now for the Bayesians
Assessment options for Bayesians are similar to, but less extensive than, those for non-Bayesians. We can compute residuals as observed data minus the posterior expected value of the model fit. We can then do residual versus fitted value plots and normal probability plots as above. In addition, we can compute the predictive distribution for each observation, and then plot the observation in that distribution. If it is way out in the tails, then we have an outlier.
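Here is a hypothetical sketch of those computations. It assumes we have already extracted a matrix fit.draws of posterior draws of the fitted values (rows are MCMC draws, columns are observations) and the response vector y; the extraction step depends on the fitting package and is not shown.
> # Hypothetical sketch: fit.draws (draws-by-observations matrix of
> # posterior fitted values) and y (the response) are assumed, not defined.
> post.fit <- colMeans(fit.draws)    # posterior expected value of the fit
> bres <- y - post.fit               # residuals
> plot(post.fit, bres); abline(h=0)  # residual versus fitted plot
> qqnorm(bres); qqline(bres)         # normal probability plot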
Here is a Bayesian fit of the resin lifetimes.
> library(bcfcdae)
> bfit.resin <- bglmm(Time~temp,data=ResinLifetimes,quiet=TRUE,seed=10026)
> plot(bfit.resin)


The pointwise plot does not show any grand outliers, but it does reveal the nonconstant variance in another way. The model is fit with constant variance, so the width of the predictive distributions is basically the same across all of the data. However, when the mean is highest, the observed data are far out in the predictive distribution, because the fitted constant variance is smaller than the actual variance for that treatment. Conversely, when the mean is smallest, the observed data are pretty much in the center of the predictive distribution, because the fitted constant variance is larger than the actual variance for that treatment.
> plot(bfit.resin,"pointwise")
NULL
> bfit.clouds <- bglmm(rainfall~seeding,data=CloudSeeding,quiet=TRUE,
+ seed=10027)
Warning: There were 21 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
Warning: Examine the pairs() plot to diagnose sampling problems
One standard remedy for divergent transitions is to increase adapt_delta, so we refit with a larger value.
> bfit.clouds <- bglmm(rainfall~seeding,data=CloudSeeding,quiet=TRUE,
+ seed=10027,adapt_delta=.995)
Warning: There were 2 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
Warning: Examine the pairs() plot to diagnose sampling problems
A couple of divergent transitions remain, so we also change the sigma.shape argument.
> bfit.clouds <- bglmm(rainfall~seeding,data=CloudSeeding,quiet=TRUE,
+ seed=10027,adapt_delta=.995,sigma.shape=3) # default was 2
> plot(bfit.clouds)


> plot(bfit.clouds,"pointwise")
NULL