Chapter 6 One-Sample Bayesian Procedures
The key philosophical difference with Bayesian procedures is that we express our prior knowledge about parameters and other unknowns using prior distributions and update those distributions to posterior distributions after observing the data. The key practical difference with Bayesian procedures is that there are generally no simple computations. Instead, we get approximate samples from the posterior distributions via Monte Carlo simulation.
There are several tools for doing the Monte Carlo in R, including jags (which does Gibbs sampling), rstan (which does Hamiltonian sampling), brms, and others. We will illustrate using the bglmm function in cfcdae, which is a front end for rstan.
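To fix the idea of updating a prior to a posterior and then summarizing by simulation, here is a minimal base-R sketch with made-up values. It uses a normal prior on a mean with the data standard deviation assumed known, a case where the posterior is available in closed form, so plain Monte Carlo stands in for the MCMC that bglmm does.
set.seed(1)
y <- rnorm(30, mean=.2, sd=.7)    # fake data
mu0 <- 0; sd0 <- .5               # prior: mu ~ N(mu0, sd0^2)
sigma <- .7                       # error sd, assumed known here
# conjugate update: posterior precision is the sum of the precisions
post.prec <- length(y)/sigma^2 + 1/sd0^2
post.mean <- (sum(y)/sigma^2 + mu0/sd0^2)/post.prec
# Monte Carlo: draw from the posterior and summarize, as MCMC would
draws <- rnorm(4000, post.mean, sqrt(1/post.prec))
quantile(draws, c(.025, .5, .975))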
We will demonstrate using the data from the RunStitch example in cfcdae, which we have used before. Each worker run stitches collars using two different setups: the conventional setup and an ergonomic setup. The two runs are made in random order for each worker, and the interest is in any difference in average speed between the two setups.
Load the RunStitch data from the package and create the differences between Standard and Ergonomic.
data(RunStitch)
differences <- RunStitch[,"Standard"] - RunStitch[,"Ergonomic"]
head(differences)
[1] 1.03 -0.04 0.26 0.30 -0.97 0.04
6.1 bglmm Prior Distributions
bglmm places a normal prior with mean zero and standard deviation \(\sigma_k\) on all coefficients in term k of the model. In our example the formula has just one term: the mean. bglmm then puts a second-level prior on \(\sigma_k\), assuming that it follows a gamma distribution with expected value \(\tau_k\) and shape parameter \(\nu_k\). A large value of \(\nu_k\) means that you are very certain that \(\sigma_k\) is near \(\tau_k\), whereas a small value indicates uncertainty.
To illustrate the degree of uncertainty, consider the ratio of the .995 quantile to the .005 quantile of the distribution. This is one measure of the spread of the middle 99% of the distribution.
shape <- c(1,1.5,2:10,15,20,30,50,100,200,400,1000)
ratio <- qgamma(.995,shape)/qgamma(.005,shape)
data.frame(shape,ratio)
shape ratio
1 1.0 1057.012101
2 1.5 178.999426
3 2.0 71.792473
4 3.0 27.448349
5 4.0 16.330513
6 5.0 11.683607
7 6.0 9.206618
8 7.0 7.686343
9 8.0 6.663908
10 9.0 5.930983
11 10.0 5.380372
12 15.0 3.893019
13 20.0 3.224391
14 30.0 2.587675
15 50.0 2.081903
16 100.0 1.676711
17 200.0 1.440322
18 400.0 1.294069
19 1000.0 1.176992
With a shape of just 1, the plausible range of values extends over three orders of magnitude. By the time you get to shape 5, the range is down to about one order of magnitude. You need a huge shape parameter to really nail down a value for \(\sigma_k\).
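The ratio above does not depend on the prior expected value, only on the shape. To get actual prior quantiles for a gamma with expected value \(\tau_k\) and shape \(\nu_k\), set the rate to \(\nu_k/\tau_k\) (the mean of a gamma is shape/rate). A quick sketch with hypothetical values:
tau <- 1; nu <- 1.5    # hypothetical prior expected value and shape
qgamma(c(.005,.995), shape=nu, rate=nu/tau)
# the ratio of these two quantiles matches the shape 1.5 row above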
We also need a prior for the error standard deviation \(\sigma\). We use the same approach of giving a prior expected value for \(\sigma\) and a shape parameter indicating degree of certainty.
If you do not specify the shape parameter, bglmm uses 1.5, which is rather dispersed. If you do not specify the prior expected value for a standard deviation, bglmm defaults to a value that is probably bigger than is reasonable.
6.2 RunStitch data
We want to illustrate fitting the Bayesian model, but we also want to illustrate how the results might depend on the prior distribution you choose. We want to consider prior means of either 0 or .5 (representing two different persons’ prior beliefs). We will also vary the prior standard deviation for the mean from nearly 0 up to .8. Generally, we will assume quite vague prior knowledge about the standard deviation of the data. However, we will also give one example where we assume much stronger certainty about the standard deviation to see what does, and does not, change.
We want to fit Bayesian models where we have a prior mean \(\mu\) for the differences and some known prior standard deviation for the differences. Things are fairly straightforward if the prior mean is 0, because that is what bglmm assumes. Just set the expected standard deviation to the desired value and use a very large shape to make it effectively constant.
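As a quick check on that last point, a gamma prior with a huge shape really is nearly a point mass (a sketch using the same parameterization as above):
# prior for the sd with expected value .5 and shape 1000
qgamma(c(.005,.995), shape=1000, rate=1000/.5)
# both quantiles land within about 10% of .5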
If the prior mean we want to fit is not zero, then we need
to subtract the prior mean from the data and use a zero prior
mean for the adjusted data.
Before seeing the data we might have a vague idea of how big the standard deviation of the data might be. Perhaps we think its prior expected value is 1, but we are quite uncertain; for that, we might use a shape of 1.5 (ratio around 180). On the other hand, we might be more certain and use a shape of 6 (ratio about 10).
6.2.1 Model Fitting
Our first model is for someone who has a very strong prior belief that the mean should be near zero, evidenced by the small expected prior standard deviation for the intercept (sigma.scale.int) and large shape (gamma.shape.int). We use an expected prior error standard deviation of 1 (sigma.scale0), along with a small shape indicating vague information (gamma.shape0). The quiet argument suppresses a lot of the progress information that is printed.
set.seed(20210330)
fit.0.0.vague <- bglmm(differences~1,
    sigma.scale.int=.001, gamma.shape.int=1000,
    sigma.scale0=1, gamma.shape0=1.5, quiet=TRUE)
Need to compile the model. Compiling now.
Trying to compile a simple C file
We now fit several similar models, where we gradually increase the prior standard deviation of the mean.
fit.0.1.vague <- bglmm(differences~1,
    sigma.scale.int=.1, gamma.shape.int=1000,
    sigma.scale0=1, gamma.shape0=1.5, quiet=TRUE)
fit.0.5.vague <- bglmm(differences~1,
    sigma.scale.int=.5, gamma.shape.int=1000,
    sigma.scale0=1, gamma.shape0=1.5, quiet=TRUE)
fit.0.8.vague <- bglmm(differences~1,
    sigma.scale.int=.8, gamma.shape.int=1000,
    sigma.scale0=1, gamma.shape0=1.5, quiet=TRUE)
To illustrate the effect of being more certain about the standard deviation of the data, we also fit a model with a larger shape.
fit.0.5.notsovague <- bglmm(differences~1,
    sigma.scale.int=.5, gamma.shape.int=1000,
    sigma.scale0=1, gamma.shape0=6, quiet=TRUE)
Finally, we fit three models that use a prior mean on the differences of .5, and use prior standard deviations of .001, .1, and .5.
fit.5.0.vague <- bglmm(differences-.5~1,
    sigma.scale.int=.001, gamma.shape.int=1000,
    sigma.scale0=1, gamma.shape0=1.5, quiet=TRUE)
fit.5.1.vague <- bglmm(differences-.5~1,
    sigma.scale.int=.1, gamma.shape.int=1000,
    sigma.scale0=1, gamma.shape0=1.5, quiet=TRUE)
fit.5.5.vague <- bglmm(differences-.5~1,
    sigma.scale.int=.5, gamma.shape.int=1000,
    sigma.scale0=1, gamma.shape0=1.5, quiet=TRUE)
6.2.2 Model Assessment
One risk with Bayesian analysis via MCMC is that the Markov chains might not be behaving properly. Here are three good, but not foolproof, ways to check your fit. If these diagnostics say that the chain is behaving badly, then it is behaving badly. However, a chain can be behaving badly without these methods catching the problem.
First, bglmm runs multiple chains (four by default). If you plot these traces, the four chains should all be on top of each other, and the plot should look like a broad, blurry band without any patterns. Uninteresting is good for the trace plots. Here we just look at the intercept, but you should look at all the parameters. It looks fine.
plot(fit.0.5.vague,plottype="trace",pars="(Intercept)")
Second, successive values of the Markov chains are correlated.
Typically, close values are more highly correlated and the
correlation eventually dies off to zero. We want it to go
to zero quickly and stay near zero for
every parameter. Again, this looks fine.
plot(fit.0.5.vague,plottype="autocor",pars="(Intercept)")
Third, the summary() method for bglmm objects includes a column labeled Rhat. You want the Rhat values to be small, which means down near 1. If they're bigger than 1.1 or 1.2, you may have problems. Here, things are fine.
summary(fit.0.5.vague)[,"Rhat"]
(Intercept) sigma0 sigma.Intercept
1 1 1
6.2.3 Model Results
We're eager to see the results, and now that we have assessed the chains as working well (well, at least for fit.0.5.vague and the intercept), we can look at the results.
The simplest way to see the results of the fit is to use the summary() function.
For each parameter in the model, it produces
- The posterior mean for the parameter.
- The posterior standard error for the posterior mean.
- The posterior standard deviation of the parameter.
- The 2.5, 25, 50, 75, and 97.5 percentiles of the posterior distribution (these can be reset via optional arguments).
- The effective sample size of the chain (more is better, and closer to 4,000 is good for default chain sizes).
- The Rhat value.
Note: the posterior standard error is essentially a numerical issue. If you run the chains longer and longer, this standard error will decrease to zero. However, the posterior standard deviation, which measures the true statistical variability in the posterior, does not decrease as you increase the size of the chains.
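One rough way to see the distinction (a sketch, not a bglmm computation): the posterior standard error of a posterior mean is approximately the posterior standard deviation divided by the square root of the effective sample size, so only the former shrinks as the chains grow.
post.sd <- .115   # hypothetical posterior sd, like the intercept below
n.eff <- 4000     # hypothetical effective sample size
post.sd/sqrt(n.eff)   # approximate posterior se of the posterior mean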
Here we look at a few of the columns for the different models. Watch how the posterior intervals for the intercept and error standard deviation (sigma0) change, or don't change, across the models. Recall that the intervals for the intercept in the models where we subtracted .5 from the data need to have that .5 added back.
summary(fit.0.0.vague)[,c("mean","sd","2.5%","97.5%","Rhat")]
mean sd 2.5% 97.5% Rhat
(Intercept) 6.09e-06 0.000990 -0.001950 0.00194 1
sigma0 6.78e-01 0.091500 0.529000 0.89400 1
sigma.Intercept 1.00e-03 0.000031 0.000943 0.00106 1
summary(fit.0.1.vague)[,c("mean","sd","2.5%","97.5%","Rhat")]
mean sd 2.5% 97.5% Rhat
(Intercept) 0.0701 0.07670 -0.0793 0.221 1.000
sigma0 0.6690 0.08840 0.5260 0.882 1.000
sigma.Intercept 0.1000 0.00313 0.0940 0.106 0.999
summary(fit.0.5.vague)[,c("mean","sd","2.5%","97.5%","Rhat")]
mean sd 2.5% 97.5% Rhat
(Intercept) 0.164 0.1150 -0.0616 0.393 1
sigma0 0.664 0.0896 0.5140 0.865 1
sigma.Intercept 0.500 0.0156 0.4680 0.531 1
summary(fit.0.5.notsovague)[,c("mean","sd","2.5%","97.5%","Rhat")]
mean sd 2.5% 97.5% Rhat
(Intercept) 0.165 0.1210 -0.0763 0.397 1.000
sigma0 0.687 0.0918 0.5380 0.890 0.999
sigma.Intercept 0.500 0.0157 0.4690 0.531 1.000
summary(fit.0.8.vague)[,c("mean","sd","2.5%","97.5%","Rhat")]
mean sd 2.5% 97.5% Rhat
(Intercept) 0.173 0.1250 -0.0675 0.413 1
sigma0 0.666 0.0891 0.5180 0.864 1
sigma.Intercept 0.799 0.0249 0.7520 0.848 1
summary(fit.5.5.vague)[,c("mean","sd","2.5%","97.5%","Rhat")] #ignore intercept
mean sd 2.5% 97.5% Rhat
(Intercept) -0.307 0.1200 -0.548 -0.0732 1
sigma0 0.667 0.0909 0.516 0.8690 1
sigma.Intercept 0.500 0.0155 0.470 0.5310 1
summary(fit.5.1.vague)[,c("mean","sd","2.5%","97.5%","Rhat")] #ignore intercept
mean sd 2.5% 97.5% Rhat
(Intercept) -0.126 0.0827 -0.2860 0.0396 1.000
sigma0 0.695 0.0965 0.5360 0.9200 1.000
sigma.Intercept 0.100 0.0032 0.0941 0.1060 0.999
summary(fit.5.0.vague)[,c("mean","sd","2.5%","97.5%","Rhat")] #ignore intercept
mean sd 2.5% 97.5% Rhat
(Intercept) -0.000018 1.02e-03 -0.002010 0.00197 1
sigma0 0.735000 9.77e-02 0.578000 0.95700 1
sigma.Intercept 0.000999 3.14e-05 0.000937 0.00106 1
summary(fit.5.5.vague)["(Intercept)",c("mean","2.5%","97.5%")]+.5
mean 2.5% 97.5%
0.1930 -0.0480 0.4268
summary(fit.5.1.vague)["(Intercept)",c("mean","2.5%","97.5%")]+.5
mean 2.5% 97.5%
0.3740 0.2140 0.5396
summary(fit.5.0.vague)["(Intercept)",c("mean","2.5%","97.5%")]+.5
mean 2.5% 97.5%
0.499982 0.497990 0.501970
Let’s tabulate those intervals so that they can be more easily compared.
Model | Lower | Upper |
---|---|---|
fit.0.0.vague | 0.00 | 0.00 |
fit.0.1.vague | -0.08 | 0.22 |
fit.0.5.vague | -0.06 | 0.39 |
fit.0.5.notsovague | -0.08 | 0.40 |
fit.0.8.vague | -0.07 | 0.41 |
fit.5.5.vague | -0.05 | 0.43 |
fit.5.1.vague | 0.21 | 0.54 |
fit.5.0.vague | 0.50 | 0.50 |
Looking at the interval estimates above, we see that the precise priors (at 0 or .5) give us precise posterior estimates of the mean at the same place; this was expected. The models with .1 prior standard deviations for the intercept pull the values toward the prior mean (upper end pulled down for the 0 prior mean, lower end pulled up for the .5 prior mean). Once you get to prior standard deviations for the mean of .5 or .8, it really doesn't matter what the prior mean is; the posterior intervals are roughly the same. Changing the certainty about the data standard deviation also made essentially no change in the posterior interval.
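You can check the pull toward the prior mean with a back-of-the-envelope conjugate calculation. This sketch assumes the error standard deviation is known and plugs in a rough value; bglmm does not work this way, but the precision-weighted average captures the shrinkage:
mu0 <- 0; sd0 <- .1      # prior mean and sd for the intercept
sigma <- .67             # rough plug-in for the error sd
n <- length(differences)
w <- (n/sigma^2)/(n/sigma^2 + 1/sd0^2)   # weight on the data
w*mean(differences) + (1-w)*mu0          # near the fit.0.1.vague posterior mean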
6.2.4 Model Selection
We have looked at a lot of models. Ordinarily we wouldn’t be comparing models with lots of different prior parameters, because we would know our prior distributions and use them. Again, we are doing it here to illustrate what happens as those prior parameters change.
That said, we can still assess which models fit the data better. One way to do that is via the LOOIC, which we can get via the function loo().
Smaller values of LOOIC are better. All except fit.5.1.vague and fit.5.0.vague are roughly equivalent, but those last two are larger.
loo(fit.0.0.vague)
[1] 62.12145
loo(fit.0.1.vague)
[1] 61.33113
loo(fit.0.5.vague)
[1] 61.78026
loo(fit.0.5.notsovague)
[1] 61.66798
loo(fit.0.8.vague)
[1] 61.89743
loo(fit.5.5.vague)
[1] 61.97736
loo(fit.5.1.vague)
[1] 63.32825
loo(fit.5.0.vague)
[1] 66.68975
Bayes factors are a second way to assess models, but be aware that Bayes factors can be sensitive to priors. Let’s compare all these models to the one with a point prior at 0. According to the usual guidelines for judging Bayes factors, there is no differentiation between the point prior at 0 model and any of the others except for the last two, where the point prior is positively preferred.
bayes_factor(fit.0.0.vague,fit.0.1.vague,silent=TRUE)
Estimated Bayes factor in favor of bridge1 over bridge2: 0.82632
bayes_factor(fit.0.0.vague,fit.0.5.vague,silent=TRUE)
Estimated Bayes factor in favor of bridge1 over bridge2: 1.54156
bayes_factor(fit.0.0.vague,fit.0.5.notsovague,silent=TRUE)
Estimated Bayes factor in favor of bridge1 over bridge2: 1.04228
bayes_factor(fit.0.0.vague,fit.0.8.vague,silent=TRUE)
Estimated Bayes factor in favor of bridge1 over bridge2: 2.31803
bayes_factor(fit.0.0.vague,fit.5.5.vague,silent=TRUE)
Estimated Bayes factor in favor of bridge1 over bridge2: 1.76297
bayes_factor(fit.0.0.vague,fit.5.1.vague,silent=TRUE)
Estimated Bayes factor in favor of bridge1 over bridge2: 3.68983
bayes_factor(fit.0.0.vague,fit.5.0.vague,silent=TRUE)
Estimated Bayes factor in favor of bridge1 over bridge2: 10.47117