Chapter 30 Fixing Problems the Bayesian Way
30.1 Data sets
We will illustrate fixing problems using Bayesian methods with the data sets we used in the previous chapter. As before:
> library(cfcdae)
> data(ResinLifetimes)
> lifetimes.fit <- lm(Time~temp,data=ResinLifetimes)
> data(CloudSeeding)
> cloud.fit <- lm(rainfall~seeding,data=CloudSeeding)
TissueTearing. Here weight is added to a facial tissue stretched across an open jar until the tissue tears. There are four treatments: tissues of brand A or brand B, used either dry or wet.
> data(TissueTearing)
> tearing.fit <- lm(tear.wt~treatment,data=TissueTearing)
TissueVaporization. For these data, a laser scalpel is drawn across a rat liver at a rate of .01 inches/second. We measure the width of the vaporized tissue that results as we use the laser at four different power settings.
> data(TissueVaporization)
> vaporization.fit <- lm(width~power,data=TissueVaporization)
PeaGermination. Each experimental unit was a group of 20 seeds. Each group was pre-soaked either 12 or 24 hours and then placed on either potting soil or paper towel. The response is the number of germinated seeds after 5 days.
> data(PeaGermination)
> peas.fit <- lm(germinated~treatment,data=PeaGermination)
> data(Copepoda)
> Copepoda.fit <- lm(count~pH,data=Copepoda)
Note that the responses in the PeaGermination and Copepoda data sets are counts. Thus we might already be skeptical of models that assume normal distributions for the responses.
CloudBackup gives the time in seconds to save a 50MB file to a cloud backup service over the internet and then recover it. Each of three services is used ten times, with services assigned to the 30 observations at random. The 30 runs are done consecutively, and the data are listed in time order.
> data(CloudBackup)
> backup.fit <- lm(updowntime~service,data=CloudBackup)
Again, as discussed in the previous chapter, we will approach non-normality, non-constant variance, and dependence one at a time, and we will discuss the various approaches to handling violations of assumptions separately. In practice, you will be assessing all of the assumptions for any given data set and potentially dealing with more than one problem at a time. In addition, you may well try more than one approach to dealing with violations of assumptions in a single data set.
30.2 Bayesian Alternatives
The bglmm() function has a number of options that expand its use beyond independent, constant variance, normally distributed errors. (The sketch after this list illustrates the calls.)

- The family argument specifies a response distribution.
  - "gaussian": normally distributed data; the default.
  - "t4": data follow a t-distribution with 4 degrees of freedom. This is a long-tailed alternative to the normal that can accommodate outliers.
  - "poisson": data follow a Poisson distribution. This distribution uses a log link function.
  - "negbin": data follow a negative binomial distribution. This distribution uses a log link function. A negative binomial is similar to a Poisson distribution, but it has a larger variance than a Poisson for the same mean.
  - "binomial": data follow a binomial distribution. This distribution uses a logit link function.
  - "betabinom": data follow a beta-binomial distribution. This distribution uses a logit link function. A beta-binomial is similar to a binomial, but it has a larger variance for the same mean.
- The varform argument allows a specification for the variance structure in the data. It is appropriate when the family is gaussian or t4.
  - "constant": the default. Variability is constant for all data.
  - "power": the standard deviation of the response is proportional to a power of the fitted value (power to be estimated). This option uses an identity link function (the default), but you can also specify a log link function.
  - ~1|groups: fit a different variance at each level of the factor groups. Interactions of factors are also allowed.
  - c("ar1","timename","groupsname"): fit an AR1 correlation structure with time given by the variable with name "timename" and any independent groups given by the factor variable with name "groupsname". For example, c("ar1","mytime","replications").
- The boxcox argument can be set to TRUE. In that case, bglmm() will also sample across power family transformations.
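To make the options concrete, here is a minimal sketch of calls exercising each one; the data frame mydata, response y, treatment factor trt, and the variables mytime and mygroups are placeholders, not data sets from this chapter:

fit.t4  <- bglmm(y~trt, data=mydata, family="t4")      # long-tailed errors
fit.sep <- bglmm(y~trt, data=mydata, varform=~1|trt)   # a separate variance per group
fit.pow <- bglmm(y~trt, data=mydata, varform="power")  # sd proportional to a power of the fit
fit.ar1 <- bglmm(y~trt, data=mydata,
                 varform=c("ar1","mytime","mygroups")) # AR1 correlation over time
fit.bc  <- bglmm(y~trt, data=mydata, boxcox=TRUE)      # sample over transformation powers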
30.3 Resin Lifetimes via Transformation
In this example we let bglmm() sample across power family transformations to find a good transformation power. bglmm() uses the full Box-Cox transformation (the rescaled power transformation, not just the raw power).
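As a reminder, here is a minimal sketch of that rescaled (full) Box-Cox transformation; the function name is ours, not part of bcfcdae:

# Full Box-Cox: (y^lambda - 1)/(lambda*gm^(lambda-1)), where gm is the
# geometric mean of y; the lambda = 0 limit is gm*log(y).
bc.full <- function(y, lambda) {
  gm <- exp(mean(log(y)))
  if (abs(lambda) < 1e-8) gm*log(y)
  else (y^lambda - 1)/(lambda*gm^(lambda - 1))
}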
> library(bcfcdae)
> library(rstan)
Loading required package: StanHeaders
Loading required package: ggplot2
rstan (Version 2.21.7, GitRev: 2e1f913d3ca3)
For execution on a local, multicore CPU with excess RAM we recommend calling
options(mc.cores = parallel::detectCores()).
To avoid recompilation of unchanged Stan programs, we recommend calling
rstan_options(auto_write = TRUE)
> lifetimes.bboxcox <- bglmm(Time~temp,data=ResinLifetimes,boxcox=TRUE,quiet=TRUE,
+ seed=10028,adapt_delta = .95,
+ max_treedepth = 12)
Need to compile the model. Compiling now.
Trying to compile a simple C file
Note that we needed to increase adapt_delta and max_treedepth to avoid divergent transitions.
> plot(lifetimes.bboxcox,"trace")
> plot(lifetimes.bboxcox,"autocor")
[Trace and autocorrelation plots not shown.]
> summary(lifetimes.bboxcox)[,c(1,3,4,8:10)]
mean sd 2.5% 97.5% n_eff Rhat
(Intercept) 66.600 14.900 44.9000 101.000 935 1.01
temp1 36.200 3.010 30.6000 42.500 1600 1.00
temp2 11.200 2.260 6.7000 15.500 2570 1.00
temp3 -6.010 2.400 -10.7000 -1.340 2060 1.00
temp4 -17.100 2.290 -21.7000 -12.700 2110 1.00
sigma0 6.600 0.865 5.1700 8.570 2630 1.00
lambda 0.283 0.166 -0.0254 0.614 1150 1.00
sigma.Intercept 57.900 24.000 25.4000 119.000 3270 1.00
sigma.temp 34.300 16.000 15.8000 77.500 2060 1.00
The power parameter is called lambda. The posterior mean transformation power is .283, with an interval from -.03 to .61. This is similar to, but shifted a bit up from, what we obtained from the Box-Cox plot.
I actually would not use the coefficients from this model, because they represent a mixture of different powers, making the coefficients difficult to interpret. I would choose a power, make a single transformation, and then look at the model coefficients.
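For example (a sketch; the cube root is our choice of a round power inside the posterior interval, not a value from the output):

# transform once, then fit as usual (gaussian family, no boxcox sampling)
lifetimes.bcube <- bglmm(Time^(1/3)~temp, data=ResinLifetimes,
                         quiet=TRUE, seed=10028)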
30.4 Accommodating outliers in the tissue vaporization data
Recall that the tissue vaporization data had an outlier, so let’s try Bayesian fitting using a t-distribution with 4 degrees of freedom instead of a normal distribution. The t-distribution with 4 degrees of freedom has fairly long tails, so it should do a better job of accommodating an outlier.
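As a quick side calculation (not from the fit itself) of how much heavier the t4 tails are, compare the probability of falling more than three units below center:

pnorm(-3)     # normal: about .0013
pt(-3, df=4)  # t with 4 df: about .020, roughly 15 times larger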
> vaporization.bt4 <- bglmm(width~power-1,
+ data=TissueVaporization,family="t4",quiet=TRUE,seed=10029)
Need to compile the model. Compiling now.
Trying to compile a simple C file
> plot(vaporization.bt4)
[Diagnostic plots not shown.]
> summary(vaporization.bt4)[,c(1,3,4,8:10)]
mean sd 2.5% 97.5% n_eff Rhat
power1 1.130 0.1250 0.8680 1.380 2540 1
power2 1.550 0.1360 1.2900 1.830 3090 1
power3 1.910 0.1250 1.6400 2.160 2590 1
power4 2.240 0.2430 1.5600 2.560 1510 1
sigma0 0.197 0.0948 0.0745 0.432 948 1
sigma.power 2.050 0.7780 1.0600 4.070 1980 1
One explanation for the large standard errors is that this Bayesian analysis is using default priors. The default prior mean for sigma0 is the square root of the MSE from the ordinary fit of the full data. For the vaporization data, that prior mean is .37, which is a factor of 4 larger than the root MSE from the data with the outlier removed: .091. If we used sigma0.mean=.1 to try to reflect what we know happens when the outlier is downweighted, we would get standard errors closer to what we get removing the outlier. Note, however, that neither the default prior nor this alternative prior is really Bayesian, as both set a prior based on the data, and that is like using the information in the data twice.
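A sketch of that alternative fit, using the sigma0.mean argument described above (otherwise the same call as before):

vaporization.bt4b <- bglmm(width~power-1, data=TissueVaporization,
                           family="t4", sigma0.mean=.1,  # prior mean sd near the outlier-free root MSE
                           quiet=TRUE, seed=10029)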
30.5 Models for variance using the resin lifetimes data
We have discussed two models for the situation where variances are non-constant: separate variances by group, and standard deviation proportional to a power of the fitted value. We can do both using bglmm(). Let's try fitting the resin lifetimes data both ways: first with separate variances by treatment, then with standard deviation proportional to a power of the fitted values.
> lifetimes.bsep <- bglmm(Time~temp-1,data=ResinLifetimes,
+ varform=~1|temp,quiet=TRUE,seed=10030)
Again, all of the diagnostic plots look fine. Look at the fit summary (sigma0[1] etc. are the five different group standard deviations):
> summary(lifetimes.bsep)[,c(1,3,4,8:10)]
mean sd 2.5% 97.5% n_eff Rhat
temp1 85.70 4.880 75.30 95.40 3140 1.000
temp2 43.30 3.640 36.00 50.60 3560 0.999
temp3 24.40 2.590 19.20 29.80 3890 1.000
temp4 15.70 0.978 13.80 17.70 2830 0.999
temp5 11.90 2.130 7.53 16.20 2560 1.000
sigma0[1] 13.10 3.270 8.45 21.10 3060 1.000
sigma0[2] 10.10 2.590 6.41 16.50 4020 1.000
sigma0[3] 7.25 2.050 4.39 12.10 3130 1.000
sigma0[4] 2.35 0.966 1.26 4.86 2370 1.000
sigma0[5] 4.82 1.860 2.52 9.62 2980 1.000
sigma.temp 57.40 21.500 30.00 109.00 3690 1.000
Compare back with what we obtained with gls(), and we see that the Bayesian posterior distributions for the group means have standard deviations larger than the standard errors of the gls() fits. This is particularly true for the smaller variances. In part this is because the posterior standard deviation for a group mean already includes variability due to the group standard deviations being unknown, whereas the frequentist residual standard deviation simply estimates the group standard deviation, and the effect of not knowing the standard deviation exactly is then accounted for in confidence intervals by the t-multiplier.
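To get a feel for the size of that t adjustment, a small side calculation (the degrees of freedom here are purely illustrative, not taken from this fit):

qnorm(.975)      # large-sample multiplier: about 1.96
qt(.975, df=7)   # multiplier with 7 df: about 2.36, roughly 20% wider intervals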
A second issue here is that bglmm() uses a single prior mean standard deviation for all five groups; it is larger than the small variances and smaller than the large variances, so it pulls the small variances up more than it affects the larger variances. bglmm() does not have a mechanism to put different priors on the different group variances, so there is not much we can do about this.
> lifetimes.bpower <- bglmm(Time~temp-1,data=ResinLifetimes,
+ varform="power",quiet=TRUE,
+ sdpower.range = c(0,2),seed=10032)
Need to compile the model. Compiling now.
Trying to compile a simple C file
> summary(lifetimes.bpower)[,c(1,3,4,8:10)]
mean sd 2.5% 97.5% n_eff Rhat
temp1 85.700 4.520 77.200 94.800 3630 0.999
temp2 43.800 3.040 38.200 49.800 3050 1.000
temp3 24.800 2.220 20.700 29.400 3800 1.000
temp4 15.400 1.870 12.000 19.500 3280 1.000
temp5 12.000 1.720 8.770 15.600 3260 0.999
sigma0 1.290 0.975 0.276 4.080 1350 1.000
power 0.562 0.184 0.199 0.934 1590 1.000
sigma.temp 56.900 21.300 30.100 112.000 2490 1.000
The standard errors of the group means for this model are a bit closer to what we saw with gls(). Again, the posterior standard deviations from bglmm() take into account the variability in the power. Using loo(), the separate variances model is preferred, but not overwhelmingly. It uses more parameters, but the improvement in fit compensates for that.
> loo(lifetimes.bsep)
Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
[1] 252.2149
> loo(lifetimes.bpower)
Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
[1] 253.7822
> bayes_factor(lifetimes.bsep,lifetimes.bpower)
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Iteration: 5
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Iteration: 5
Estimated Bayes factor in favor of bridge1 over bridge2: 10.51828
30.6 Fitting AR1 errors
We can fit AR1 errors in bglmm() to account for autocorrelation using the varform argument. For AR1, varform should be a character vector of length 3: the first element is "ar1", the second element is the name of an R variable containing the times of the observations, and the third element is the name of an R variable (a factor) containing the groups (different groups are assumed to be independent). If there is only one group, the last variable should be a factor with a single level.
> mytime <- 1:30
> mygroups <- factor(rep(1,30))
> backup.bar1 <- bglmm(log(updowntime)~service,data=CloudBackup,
+ varform=c("ar1","mytime","mygroups"),
+ quiet=TRUE,seed=10033)
Need to compile the model. Compiling now.
Trying to compile a simple C file
Warning in bglmm(log(updowntime) ~ service, data = CloudBackup, varform =
c("ar1", : data have been sorted into time order by group
> summary(backup.bar1)
mean se_mean sd 2.5% 25% 50% 75% 97.5%
(Intercept) 5.6000 0.001110 0.0479 5.5200 5.570 5.6000 5.6300 5.7200
service1 -0.4880 0.000425 0.0253 -0.5360 -0.504 -0.4880 -0.4700 -0.4370
service2 -0.1400 0.000470 0.0283 -0.1940 -0.158 -0.1400 -0.1210 -0.0839
sigma0 0.0837 0.000519 0.0220 0.0376 0.070 0.0837 0.0977 0.1260
rho 0.4890 0.003960 0.1690 0.1460 0.384 0.4920 0.6000 0.8120
sigma.Intercept 6.5700 0.055700 3.1500 2.5300 4.260 5.8800 8.1800 14.2000
sigma.service 0.6910 0.005720 0.2990 0.3140 0.477 0.6240 0.8290 1.4700
n_eff Rhat
(Intercept) 1850 1.000
service1 3560 0.999
service2 3630 1.000
sigma0 1800 1.000
rho 1820 1.000
sigma.Intercept 3190 0.999
sigma.service 2730 1.000
(The diagnostics all look fine.) The posterior mean for the autocorrelation (rho) is .49, a bit smaller than estimated by gls() (.56), but the posterior interval is fairly wide.
> linear.contrast(backup.bar1,service,allpairs=TRUE)
post. mean post. sd lower upper approx BF
1 - 2 -0.3479707 0.04730532 -0.4408279 -0.2540766 2.570635e+11
1 - 3 -1.1149508 0.04227694 -1.1975667 -1.0316041 4.885544e+150
2 - 3 -0.7669800 0.04764600 -0.8594177 -0.6718734 8.503035e+55
attr(,"class")
[1] "linear.contrast"
30.7 Other distributions
The pea germination data are success/failure data: numbers germinated and ungerminated out of 20. We might expect these to follow a binomial distribution. We can fit this in bglmm() using the family="binomial" argument (the response is two columns giving successes and failures for each case).
> peas.bglm <- bglmm(cbind(germinated,nongerminated)~treatment-1,
+ data=PeaGermination,family="binomial",
+ quiet=TRUE,seed=10034)
Need to compile the model. Compiling now.
Trying to compile a simple C file
> summary(peas.bglm)[,c(1,3,4,8:10)]
mean sd 2.5% 97.5% n_eff Rhat
treatment1 0.71 0.335 0.0682 1.36 4920 1
treatment2 0.83 0.344 0.1760 1.53 4500 1
treatment3 -1.85 0.469 -2.8700 -1.01 4230 1
treatment4 1.83 0.457 1.0100 2.81 4050 1
sigma.treatment 1.75 0.684 0.8460 3.43 3670 1
The means for treatments are .71, .83, -1.85, and 1.83, with considerable overlap of posterior intervals for several treatments. However, these means are given on the logit scale. To get means and confidence intervals on the probability scale we need to convert the logit \(x\) to the probability \(p\) via \(p = \exp(x)/(1+\exp(x))\). We can do this via direct conversion of the summary statistics on the logit scale. However, if we want intervals for contrasts, we need to extract the MCMC samples, convert them to the probability scale, and then take means or intervals.
For direct computation on summaries:
> tmp <- summary(peas.bglm)[1:4,c(1,4,8)]
> exp(tmp)/(1+exp(tmp))
mean 2.5% 97.5%
treatment1 0.6704012 0.51704339 0.7957597
treatment2 0.6963549 0.54388677 0.8220063
treatment3 0.1358729 0.05365665 0.2669799
treatment4 0.8617617 0.73302015 0.9432138
> logit.mcmc <- fixef(peas.bglm,summary=FALSE)
> p.mcmc <- exp(logit.mcmc)/(1+exp(logit.mcmc))
> apply(p.mcmc,2,mean)
treatment1 treatment2 treatment3 treatment4
0.6663144 0.6916571 0.1443809 0.8537514
> apply(p.mcmc,2,quantile,probs=c(.025,.975))
treatment1 treatment2 treatment3 treatment4
2.5% 0.5170453 0.5437795 0.05379851 0.7334650
97.5% 0.7963867 0.8216908 0.26778620 0.9429778
The posterior mean on the probability scale is not quite the same as the back-transformation of the mean on the logit scale, but the interval estimates are the same (at least up to the ambiguity from the ways that quantiles can be computed).
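If we want a contrast on the probability scale, we work with the converted draws directly. For example, a sketch (using the p.mcmc samples computed above) for the difference between treatments 4 and 3:

d.mcmc <- p.mcmc[,"treatment4"] - p.mcmc[,"treatment3"]  # difference of germination probabilities, draw by draw
mean(d.mcmc)                                             # posterior mean of the difference
quantile(d.mcmc, probs=c(.025,.975))                     # 95% posterior interval for the difference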
The Copepoda data were counts of … copepoda. A default model for such counts could be the Poisson. We can get a Poisson fit using the family="poisson" argument.
> copepoda.bp <- bglmm(count~pH,data=Copepoda,family="poisson",
+ quiet=TRUE,seed=10035)
Need to compile the model. Compiling now.
Trying to compile a simple C file
> plot(copepoda.bp)
[Diagnostic plots not shown.]
Count data are often more variable than a Poisson allows; the negative binomial accommodates that extra variability. We can get a negative binomial fit using the family="negbin" argument.
> copepoda.bnb <- bglmm(count~pH,data=Copepoda,family="negbin",
+ quiet=TRUE,seed=10036,adapt_delta = .95)
Need to compile the model. Compiling now.
Trying to compile a simple C file
The diagnostics look fine, but are not shown here.
Look at the summaries:
> summary(copepoda.bnb)[,c(1,3,4,8:10)]
mean sd 2.5% 97.5% n_eff Rhat
(Intercept) 5.120 0.183 4.780 5.500 2270 1
pH1 -0.137 0.190 -0.508 0.243 2170 1
phi 7.120 3.840 1.890 16.700 2430 1
sigma.Intercept 87.600 81.700 6.850 311.000 3670 1
sigma.pH 24.000 24.400 0.885 90.200 4200 1
We see the effects (on the log scale because of the log link); the interval for the treatment effect is practically centered at 0. There is an additional parameter phi, estimated to be about 7.12. A negative binomial with mean \(\mu\) has variance \(\mu + \mu^2/\phi\) (a Poisson essentially corresponds to an infinite \(\phi\)). The mean is \(\mu = \exp(5.12) = 167.3\), so the variance is estimated to be \(167.3 + 167.3^2/7.12 = 4098\), for a standard deviation of 64. A Poisson with the same mean would have a standard deviation of 12.9.
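These numbers are easy to reproduce in R:

mu <- exp(5.12)      # estimated mean count, about 167.3
phi <- 7.12          # posterior mean of phi
mu + mu^2/phi        # negative binomial variance, about 4098
sqrt(mu + mu^2/phi)  # negative binomial sd, about 64
sqrt(mu)             # Poisson sd at the same mean, about 12.9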