Chapter 30 Fixing Problems the Bayesian Way

30.1 Data sets

We will illustrate fixing problems using Bayesian methods with the data sets we used in the previous chapter. As before:

> library(cfcdae)
> data(ResinLifetimes)
> lifetimes.fit <- lm(Time~temp,data=ResinLifetimes)
> data(CloudSeeding)
> cloud.fit <- lm(rainfall~seeding,data=CloudSeeding)
Another data set we will explore is TissueTearing. Here weight is added to a facial tissue stretched across an open jar until the tissue tears. There are four treatments: tissues of brand A or brand B, used either dry or wet.
> data(TissueTearing)
> tearing.fit <- lm(tear.wt~treatment,data=TissueTearing)
The next data set we will explore is TissueVaporization. For these data, a laser scalpel is drawn across a rat liver at a rate of .01 inches/second. We measure the width of the vaporized tissue that results at each of four different power settings.
> data(TissueVaporization)
> vaporization.fit <- lm(width~power,data=TissueVaporization)
Another data set we will explore is PeaGermination. Each experimental unit was a group of 20 seeds. Each group was pre-soaked for either 12 or 24 hours and then placed on either potting soil or a paper towel. The response is the number of germinated seeds after 5 days.
> data(PeaGermination)
> peas.fit <- lm(germinated~treatment,data=PeaGermination)
We will also use the Copepoda data set, which gives the number of Copepoda found in wetland soils 14 days after artificial snow melt. The units are randomly assigned to receive either neutral or acidic artificial snow melt.
> data(Copepoda)
> Copepoda.fit <- lm(count~pH,data=Copepoda)

Note that the responses in the PeaGermination and Copepoda data sets are counts. Thus we might already be skeptical of models that assume normal distributions for the responses.

Our last data set is CloudBackup, which gives the time in seconds to save a 50MB file to a cloud backup service over the internet and then recover it. Each of three services is used ten times, with services assigned to the 30 runs at random. The 30 runs are done consecutively, and the data are listed in time order.
> data(CloudBackup)
> backup.fit <- lm(updowntime~service,data=CloudBackup)

Again, as discussed in the previous chapter, we will approach non-normality, non-constant variance, and dependence one by one, and we will discuss the various approaches to handling violations of assumptions separately. In practice, you will be assessing all of the assumptions for any given data set and potentially dealing with more than one problem at a time. In addition, you may well try more than one approach to dealing with violations of assumptions in a single data set.

30.2 Bayesian Alternatives

The bglmm() function has a number of options that expand its use beyond independent, constant-variance, normally distributed errors; a schematic sketch of the corresponding calls follows the list.

  • The family argument specifies a response distribution.
    • "gaussian": normally distributed data; the default.
    • "t4": data follow a t-distribution with 4 degrees of freedom. This is a long tailed alternative to normal that can accommodate outliers.
    • "poisson": data follow a Poisson distribution. This distribution uses a log link function.
    • "negbin": data follow a negative binomial distribution. This distribution uses a log link function. A negative binomial is similar to a Poisson distribution, but it has a larger variance than a Poisson for the same mean.
    • "binomial": data follow a binomial distribution. This distribution uses a logit link function.
    • "betabinom": data follow a beta binomial distribution. This distribution uses a logit link function. A beta binomial is similar to a binomial, but it has a larger variance for the same mean.
  • The varform argument allows a specification for the variance structure in the data. It is appropriate when the family is gaussian or t4.
    • "constant": the default. Variability is constant for all data.
    • "power": the standard deviation of the response is proportional to a power of the fitted value (power to be estimated). This option uses a identity link function (the default), but you can also specify a log link function.
    • ~1|groups: fit a different variance at each level of the factor groups. Interactions of factors are also allowed.
    • c("ar1","timename","groupsname"): fit an AR1 correlation structure with time given by the variable with name "timename" and any independent groups given by the factor variable with name "groupsname". For example, c("ar1","mytime","replications").
  • The boxcox argument can be set to TRUE. In that case, bglmm() will also sample across power family transformations.
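As a quick schematic of how these options are specified (placeholders only, not run; y, trt, and mydata stand in for a response, a treatment factor, and a data frame), the calls take shapes like these, each of which is shown for real later in the chapter:
> ## Schematic calls only: y, trt, and mydata are placeholders, not real objects.
> bglmm(y ~ trt, data = mydata, family = "t4")       # long-tailed errors for outliers
> bglmm(y ~ trt, data = mydata, family = "poisson")  # count response, log link
> bglmm(y ~ trt, data = mydata, varform = ~1|trt)    # separate variance for each group
> bglmm(y ~ trt, data = mydata, varform = "power")   # SD is a power of the fitted value
> bglmm(y ~ trt, data = mydata, boxcox = TRUE)       # sample across Box-Cox powers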

30.3 Resin Lifetimes via Transformation

In this example we let bglmm() sample across power-family transformations to find a good transformation power. bglmm() uses the full Box-Cox transformation (the rescaled power, not just the simple power).
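For reference, the standard rescaled Box-Cox transformation with power \(\lambda\) is \(y^{(\lambda)} = (y^{\lambda}-1)/(\lambda\,\dot{y}^{\,\lambda-1})\) for \(\lambda \neq 0\) and \(y^{(0)} = \dot{y}\log y\), where \(\dot{y}\) is the geometric mean of the responses; the rescaling keeps the transformed responses on comparable scales so that fits at different powers can be compared.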

> library(bcfcdae)
> library(rstan)
Loading required package: StanHeaders
Loading required package: ggplot2
rstan (Version 2.21.7, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
> lifetimes.bboxcox <- bglmm(Time~temp,data=ResinLifetimes,boxcox=TRUE,quiet=TRUE,
+                            seed=10028,adapt_delta = .95,
+                            max_treedepth = 12)
Need to compile the model. Compiling now.
Trying to compile a simple C file

Note that we needed to increase adapt_delta and max_treedepth to avoid divergent transitions.

The diagnostic plots look good:
> plot(lifetimes.bboxcox,"trace")

> plot(lifetimes.bboxcox,"autocor")
Here is the summary information:
> summary(lifetimes.bboxcox)[,c(1,3,4,8:10)]
                   mean     sd     2.5%   97.5% n_eff Rhat
(Intercept)      66.600 14.900  44.9000 101.000   935 1.01
temp1            36.200  3.010  30.6000  42.500  1600 1.00
temp2            11.200  2.260   6.7000  15.500  2570 1.00
temp3            -6.010  2.400 -10.7000  -1.340  2060 1.00
temp4           -17.100  2.290 -21.7000 -12.700  2110 1.00
sigma0            6.600  0.865   5.1700   8.570  2630 1.00
lambda            0.283  0.166  -0.0254   0.614  1150 1.00
sigma.Intercept  57.900 24.000  25.4000 119.000  3270 1.00
sigma.temp       34.300 16.000  15.8000  77.500  2060 1.00

The power parameter is called lambda. The posterior mean transformation power is .283 with an interval from -.03 to .61. This is similar to, but shifted a bit up from, what we obtained from the Box-Cox plot.

I actually would not use the coefficients from this model, because they represent a mixture of different powers, making the coefficients difficult to interpret. I would choose a power, make a single transformation, and then look at the model coefficients.
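As a minimal sketch of that two-step approach (illustrative only: the cube-root power 1/3 is simply a convenient value near the posterior mean and inside the interval above, the object name is made up, and we assume a transformed response can be written directly in the formula, as is done with log(updowntime) for the cloud backup data later in the chapter):
> # Illustrative sketch: fix a single power near the posterior mean and refit.
> lifetimes.bcube <- bglmm(Time^(1/3) ~ temp, data = ResinLifetimes, quiet = TRUE)
The coefficients from such a fit are on a single, fixed (cube-root) scale, so they can be interpreted and compared directly.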

30.4 Accommodating outliers in the tissue vaporization data

Recall that the tissue vaporization data had an outlier, so let’s try Bayesian fitting using a t-distribution with 4 degrees of freedom instead of a normal distribution. The t-distribution with 4 degrees of freedom has fairly long tails, so it should do a better job of accommodating an outlier.

> vaporization.bt4 <- bglmm(width~power-1,
+    data=TissueVaporization,family="t4",quiet=TRUE,seed=10029)
Need to compile the model. Compiling now.
Trying to compile a simple C file
The diagnostic plots look good, and the residual plots indicate that using the t-distribution has downweighted the outlier considerably.
> plot(vaporization.bt4)
However, looking at the summary, we see that this model yields standard errors for the treatment means that are rather larger than those we obtained earlier from either removing the outlier or using a robust estimator.
> summary(vaporization.bt4)[,c(1,3,4,8:10)]
             mean     sd   2.5% 97.5% n_eff Rhat
power1      1.130 0.1250 0.8680 1.380  2540    1
power2      1.550 0.1360 1.2900 1.830  3090    1
power3      1.910 0.1250 1.6400 2.160  2590    1
power4      2.240 0.2430 1.5600 2.560  1510    1
sigma0      0.197 0.0948 0.0745 0.432   948    1
sigma.power 2.050 0.7780 1.0600 4.070  1980    1

One explanation for the large standard errors is that this Bayesian analysis is using default priors. The default prior mean for sigma0 is the square root of the MSE from the ordinary fit of the full data. For the vaporization data, that prior mean is .37, a factor of 4 larger than the root MSE from the data with the outlier removed (.091). If we used sigma0.mean=.1 to reflect what we know happens when the outlier is downweighted (sketched below), we would get standard errors closer to what we get when removing the outlier. Note, however, that neither the default prior nor this alternative prior is really Bayesian, as both set a prior based on the data, which is like using the information in the data twice.
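A sketch of that alternative fit follows; sigma0.mean is the argument named above, while the object name and the omission of the other tuning arguments are just for illustration:
> # Sketch only: same t4 model, but the prior mean for sigma0 is set to .1,
> # roughly the residual SD we expect once the outlier is downweighted.
> vaporization.bt4b <- bglmm(width ~ power - 1, data = TissueVaporization,
+                            family = "t4", sigma0.mean = .1, quiet = TRUE)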

30.5 Models for variance using the Resin Lifetimes

We have discussed two models for the situation where variances are non-constant: separate variances by group and variance proportional to a power of the fitted value. We can do both using bglmm().

Let's fit the resin lifetimes data both ways: first with a separate variance for each treatment, and then with the standard deviation proportional to a power of the fitted values.

> lifetimes.bsep <- bglmm(Time~temp-1,data=ResinLifetimes,
+                         varform=~1|temp,quiet=TRUE,seed=10030)

Again, all of the diagnostic plots look fine.

Look at the fit summary (sigma0[1] etc. are the five different group standard deviations):
> summary(lifetimes.bsep)[,c(1,3,4,8:10)]
            mean     sd  2.5%  97.5% n_eff  Rhat
temp1      85.70  4.880 75.30  95.40  3140 1.000
temp2      43.30  3.640 36.00  50.60  3560 0.999
temp3      24.40  2.590 19.20  29.80  3890 1.000
temp4      15.70  0.978 13.80  17.70  2830 0.999
temp5      11.90  2.130  7.53  16.20  2560 1.000
sigma0[1]  13.10  3.270  8.45  21.10  3060 1.000
sigma0[2]  10.10  2.590  6.41  16.50  4020 1.000
sigma0[3]   7.25  2.050  4.39  12.10  3130 1.000
sigma0[4]   2.35  0.966  1.26   4.86  2370 1.000
sigma0[5]   4.82  1.860  2.52   9.62  2980 1.000
sigma.temp 57.40 21.500 30.00 109.00  3690 1.000

Comparing with what we obtained from gls(), we see that the Bayesian posterior distributions for the group means have standard deviations larger than the standard errors of the gls() fits, particularly for the groups with smaller variances. In part this is because the posterior standard deviation for a group mean already includes the variability due to the group standard deviation being unknown, whereas the frequentist standard error simply plugs in an estimate of the group standard deviation; the effect of not knowing that standard deviation exactly is then accounted for in confidence intervals by the t-multiplier.

A second issue here is that bglmm() uses a single prior mean standard deviation for all five groups; it is larger than the small variances and smaller than the large variances. It pulls the small variances up more than it affects the larger variances. bglmm() does not have a mechanism to put different priors on the different group variances, so there is not much we can do about this.

Now let’s try fitting with the standard deviation proportional to a power of the fitted values. We saw increasing variance, so we expect a positive power.
> lifetimes.bpower <- bglmm(Time~temp-1,data=ResinLifetimes,
+                         varform="power",quiet=TRUE,
+                         sdpower.range = c(0,2),
+                         seed=10032)
Need to compile the model. Compiling now.
Trying to compile a simple C file
> summary(lifetimes.bpower)[,c(1,3,4,8:10)]
             mean     sd   2.5%   97.5% n_eff  Rhat
temp1      85.700  4.520 77.200  94.800  3630 0.999
temp2      43.800  3.040 38.200  49.800  3050 1.000
temp3      24.800  2.220 20.700  29.400  3800 1.000
temp4      15.400  1.870 12.000  19.500  3280 1.000
temp5      12.000  1.720  8.770  15.600  3260 0.999
sigma0      1.290  0.975  0.276   4.080  1350 1.000
power       0.562  0.184  0.199   0.934  1590 1.000
sigma.temp 56.900 21.300 30.100 112.000  2490 1.000

The standard errors of the group means for this model are a bit closer to what we saw with gls(). Again, the posterior standard deviations from bglmm() take into account the variability in the power.

Using loo(), the separate variances model is preferred, but not overwhelmingly. It uses more parameters, but the improvement in fit compensates for that.

> loo(lifetimes.bsep)
Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
[1] 252.2149
> loo(lifetimes.bpower)
Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
[1] 253.7822
However, the Bayes factor favors the separate variances model much more strongly:
> bayes_factor(lifetimes.bsep,lifetimes.bpower)
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Iteration: 5
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Iteration: 5
Estimated Bayes factor in favor of bridge1 over bridge2: 10.51828

30.6 Fitting AR1 errors

We can fit AR1 errors in bglmm() to account for autocorrelation using the varform argument. For AR1, varform should be a character vector of length 3: the first element is "ar1", the second element is the name of an R variable containing the times of the observations, and the third element is the name of an R variable (a factor) containing the groups (different groups are assumed to be independent). If there is only one group, the last variable should be a factor with a single level.

Consider the cloud backup data, which have 30 data points in time order all in the same group.
> mytime <- 1:30
> mygroups <- factor(rep(1,30))
> backup.bar1 <- bglmm(log(updowntime)~service,data=CloudBackup,
+                      varform=c("ar1","mytime","mygroups"),
+                      quiet=TRUE,seed=10033)
Need to compile the model. Compiling now.
Trying to compile a simple C file
Warning in bglmm(log(updowntime) ~ service, data = CloudBackup, varform =
c("ar1", : data have been sorted into time order by group
> summary(backup.bar1)
                   mean  se_mean     sd    2.5%    25%     50%     75%   97.5%
(Intercept)      5.6000 0.001110 0.0479  5.5200  5.570  5.6000  5.6300  5.7200
service1        -0.4880 0.000425 0.0253 -0.5360 -0.504 -0.4880 -0.4700 -0.4370
service2        -0.1400 0.000470 0.0283 -0.1940 -0.158 -0.1400 -0.1210 -0.0839
sigma0           0.0837 0.000519 0.0220  0.0376  0.070  0.0837  0.0977  0.1260
rho              0.4890 0.003960 0.1690  0.1460  0.384  0.4920  0.6000  0.8120
sigma.Intercept  6.5700 0.055700 3.1500  2.5300  4.260  5.8800  8.1800 14.2000
sigma.service    0.6910 0.005720 0.2990  0.3140  0.477  0.6240  0.8290  1.4700
                n_eff  Rhat
(Intercept)      1850 1.000
service1         3560 0.999
service2         3630 1.000
sigma0           1800 1.000
rho              1820 1.000
sigma.Intercept  3190 0.999
sigma.service    2730 1.000

(The diagnostics all look fine.) The posterior mean for autocorrelation (rho) is .49, a bit smaller than estimated by gls() (.56), but the posterior interval is fairly wide.

All of the services are different:
> linear.contrast(backup.bar1,service,allpairs=TRUE)
      post. mean   post. sd      lower      upper     approx BF
1 - 2 -0.3479707 0.04730532 -0.4408279 -0.2540766  2.570635e+11
1 - 3 -1.1149508 0.04227694 -1.1975667 -1.0316041 4.885544e+150
2 - 3 -0.7669800 0.04764600 -0.8594177 -0.6718734  8.503035e+55
attr(,"class")
[1] "linear.contrast"

30.7 Other distributions

The pea germination data are success/failure data: numbers germinated and ungerminated out of 20. We might expect these to follow a binomial distribution. We can do this in bglmm() using the family="binomial" argument (the response is two columns giving successes and failures for each case).
> peas.bglm <- bglmm(cbind(germinated,nongerminated)~treatment-1,
+                 data=PeaGermination,family="binomial",
+                   quiet=TRUE,seed=10034)
Need to compile the model. Compiling now.
Trying to compile a simple C file
Diagnostics all look fine.
> summary(peas.bglm)[,c(1,3,4,8:10)]
                 mean    sd    2.5% 97.5% n_eff Rhat
treatment1       0.71 0.335  0.0682  1.36  4920    1
treatment2       0.83 0.344  0.1760  1.53  4500    1
treatment3      -1.85 0.469 -2.8700 -1.01  4230    1
treatment4       1.83 0.457  1.0100  2.81  4050    1
sigma.treatment  1.75 0.684  0.8460  3.43  3670    1

The means for treatments are .71, .83, -1.85, and 1.83, with considerable overlap of the posterior intervals for several treatments. However, these means are given on the logit scale. To get means and interval estimates on the probability scale, we need to convert the logit \(x\) to the probability \(p\) via \(p = \exp(x)/(1+\exp(x))\). We can do this via direct conversion of the summary statistics on the logit scale. However, if we want intervals for contrasts, we need to extract the MCMC samples, convert them to the probability scale, and then take means or intervals.

For direct computation on summaries:
> tmp <- summary(peas.bglm)[1:4,c(1,4,8)]
> exp(tmp)/(1+exp(tmp))
                mean       2.5%     97.5%
treatment1 0.6704012 0.51704339 0.7957597
treatment2 0.6963549 0.54388677 0.8220063
treatment3 0.1358729 0.05365665 0.2669799
treatment4 0.8617617 0.73302015 0.9432138
Working via the MCMC iterations:
> logit.mcmc <- fixef(peas.bglm,summary=FALSE)
> p.mcmc <- exp(logit.mcmc)/(1+exp(logit.mcmc))
> apply(p.mcmc,2,mean)
treatment1 treatment2 treatment3 treatment4 
 0.6663144  0.6916571  0.1443809  0.8537514 
> apply(p.mcmc,2,quantile,probs=c(.025,.975))
       
        treatment1 treatment2 treatment3 treatment4
  2.5%   0.5170453  0.5437795 0.05379851  0.7334650
  97.5%  0.7963867  0.8216908 0.26778620  0.9429778

The posterior mean on the probability scale is not quite the same as the back-transformation of the mean on the logit scale, because a mean does not carry over under a nonlinear transformation. The interval estimates, however, are the same, because quantiles do carry over under a monotone transformation (at least up to the ambiguity from the different ways that quantiles can be computed).

The Copepoda data were counts of … copepoda. A default model for such counts could be the Poisson. We can get a Poisson fit using the family="poisson" argument.
> copepoda.bp <- bglmm(count~pH,data=Copepoda,family="poisson",
+                      quiet=TRUE,seed=10035)
Need to compile the model. Compiling now.
Trying to compile a simple C file
Pearson residuals, which for a Poisson count \(y\) with fitted mean \(\mu\) are \((y-\mu)/\sqrt{\mu}\), ought to have standard deviation (about) 1, so we expect most of them to be between -2 and 2. For these data, the residuals are much larger, indicating that the variability in these data is greater than a Poisson allows.
> plot(copepoda.bp)
An alternative to the Poisson is the negative binomial, obtained via family="negbin".
> copepoda.bnb <- bglmm(count~pH,data=Copepoda,family="negbin",
+                      quiet=TRUE,seed=10036,adapt_delta = .95)
Need to compile the model. Compiling now.
Trying to compile a simple C file

The diagnostics look fine, but are not shown here.

Look at the summaries:
> summary(copepoda.bnb)[,c(1,3,4,8:10)]
                  mean     sd   2.5%   97.5% n_eff Rhat
(Intercept)      5.120  0.183  4.780   5.500  2270    1
pH1             -0.137  0.190 -0.508   0.243  2170    1
phi              7.120  3.840  1.890  16.700  2430    1
sigma.Intercept 87.600 81.700  6.850 311.000  3670    1
sigma.pH        24.000 24.400  0.885  90.200  4200    1

We see the effects (on the log scale because of the log link); the interval for the treatment effect is practically centered at 0. There is an additional parameter phi estimated to be about 7.12. A negative binomial with mean \(\mu\) has variance \(\mu + \mu^2/\phi\) (a Poisson essentially corresponds to an infinite phi). The mean is \(\mu = \exp(5.12) = 167.3\), so the variance is estimated to be \(167.3 + 167.3^2/7.12 = 4098\) for a standard deviation of 64. A Poisson with the same mean would have a standard deviation of 12.9.
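A quick check of this arithmetic in R, using the rounded values above:
> mu <- 167.3               # exp(5.12), the posterior mean count
> mu + mu^2/7.12            # negative binomial variance, about 4098
> sqrt(mu + mu^2/7.12)      # negative binomial standard deviation, about 64
> sqrt(mu)                  # Poisson standard deviation for the same mean, about 12.9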