Chapter 10 Basic Bayesian Procedures

The key philosophical aspect of Bayesian procedures is that we express our prior knowledge about parameters and other unknowns using prior (probability) distributions and update those distributions to posterior (probability) distributions after observing the data. The key practical difference with Bayesian procedures is that there are generally no simple computations. Instead, we get approximate samples from the posterior distributions via Monte Carlo simulation, and then make statements about the posteriors based on the approximate samples.

There are several tools for doing the Monte Carlo sampling in R, including JAGS (which does Gibbs sampling), rstan (which does Hamiltonian sampling), brms, and others. We will illustrate using the bglmm() function in bcfcdae, which is a front end for rstan.

We will demonstrate bglmm() using the data from the RunStitch example in cfcdae, which we have used before. Each worker run-stitches collars using two different setups: the conventional setup and an ergonomic setup. The two runs are made in random order for each worker, and the interest is in any difference in average speed between the two setups.

Load the RunStitch data from the package and create the differences between the Standard and Ergonomic setups.

> data(RunStitch)
> differences <- RunStitch[,"Standard"] - RunStitch[,"Ergonomic"]

10.1 Prior Distributions

General Bayesian packages such as rstan and jags allow you to set up practically arbitrary prior distributions for parameters. bglmm() has a much narrower set of priors available, but they still allow us to accomplish a lot in the context of the models for which bglmm() was designed.

10.1.1 Gamma distributions

A gamma distribution is a common distribution for positive random variables; for example, a chi-squared distribution is a special case of a gamma. bglmm() parameterizes gamma distributions via the expected value \(\tau\) and shape parameter \(\nu\). A large value of the shape parameter means that you believe that the random variable is near \(\tau\), whereas a small shape value indicates uncertainty.
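
Base R's rgamma() uses a shape/rate parameterization rather than the mean/shape parameterization described above, so it can help to see the conversion. The following sketch assumes the mean/shape convention just described; since the mean of a gamma is shape/rate, a rate of \(\nu/\tau\) gives expected value \(\tau\):

```r
# Draw from a gamma with expected value tau and shape nu.
# Base R's rgamma() is parameterized by shape and rate; the mean is
# shape/rate, so rate = nu/tau gives expected value tau.
rgamma_mean_shape <- function(n, tau, nu) {
  rgamma(n, shape = nu, rate = nu / tau)
}

set.seed(42)
x <- rgamma_mean_shape(1e5, tau = 10, nu = 5)
mean(x)  # close to 10
sd(x)    # close to tau / sqrt(nu), about 4.47
```

The standard deviation of this gamma is \(\tau/\sqrt{\nu}\), which is why a large shape \(\nu\) concentrates the distribution near \(\tau\).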

To illustrate the degree of uncertainty, consider the ratio of the .995 quantile to the .005 quantile of the distribution. This is one measure of the spread of the middle 99% of the gamma distribution. Here is how this ratio changes as a function of shape (the ratio does not depend on \(\tau\), so we can leave the rate at its default):
> shape <- c(1, 1.5, 2:10, 15, 20, 30, 50, 100, 200, 400, 1000, 20000)
> ratio <- qgamma(.995, shape) / qgamma(.005, shape)
> data.frame(shape, ratio)
     shape       ratio
1      1.0 1057.012101
2      1.5  178.999426
3      2.0   71.792473
4      3.0   27.448349
5      4.0   16.330513
6      5.0   11.683607
7      6.0    9.206618
8      7.0    7.686343
9      8.0    6.663908
10     9.0    5.930983
11    10.0    5.380372
12    15.0    3.893019
13    20.0    3.224391
14    30.0    2.587675
15    50.0    2.081903
16   100.0    1.676711
17   200.0    1.440322
18   400.0    1.294069
19  1000.0    1.176992
20 20000.0    1.037100

With a shape of just 1, the plausible range of the random variable extends over three orders of magnitude. By the time you get to shape 5, the range is down to about one order of magnitude. You need a huge shape parameter to indicate that the random variable must be very near \(\tau\).

10.1.2 Priors for terms in the linear predictor

Terms in the linear predictor for a model could be group means, differences of group means, regression slopes, or other parameters depending on the complexity of the model. The prior for these parameters is normal with mean zero and standard deviation \(\sigma_i\) for parameters in the ith term of the model. Parameters from different terms are assumed to be independent. Parameters within the same term might not be independent.

Note that these priors are all centered at zero. This is sensible when the prior presumption is that the term/effect/coefficient is most likely to be small, but need not be small. For example, we might have a prior belief that the difference between treatment means might be near zero. However, this is often a dubious assumption for a parameter that describes the overall mean of the data. For parameters like that, the standard deviation of the prior distribution might need to be very large to be reasonable.
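
When choosing a prior standard deviation, it can help to translate it into a plausible range for the parameter. A quick base R check (the specific sigma values here are just illustrations):

```r
# With a N(0, sigma^2) prior, about 95% of the prior mass lies
# within +/- 1.96 * sigma of zero.
prior_range <- function(sigma, level = 0.95) {
  qnorm(c((1 - level) / 2, (1 + level) / 2), mean = 0, sd = sigma)
}

prior_range(0.1)  # roughly -0.196 to 0.196: effects believed to be small
prior_range(300)  # roughly -588 to 588: nearly uninformative
```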

10.1.3 Priors for standard deviations

Many models assume that responses are normally distributed with a certain set of means and a certain standard deviation (sometimes more than one standard deviation). Call this standard deviation \(\sigma_0\). Robust alternatives to normal distributions have an analogous scaling factor. bglmm() uses a gamma prior distribution for \(\sigma_0\), with shape parameter sigma0.shape and expected value sigma0.mean. A large value of sigma0.shape means that you are very certain that \(\sigma_0\) is near sigma0.mean, whereas a small value indicates uncertainty.

Parameters in the i-th term of the linear predictor are assumed to be normal with mean zero and some standard deviation \(\sigma_i\). We usually don’t know \(\sigma_i\), so we put a second-level prior on \(\sigma_i\).

By default, bglmm() assumes that \(\sigma_i\) follows a gamma distribution with expected value sigma.mean and shape parameter sigma.shape. You may specify sigma.mean and sigma.shape as vectors, with one value for each term in the model, or you may specify them as scalars, in which case the same mean and shape is used for all the terms. (What really happens is that any values you specify get recycled enough times to fill out the full vectors of means and shapes.)

If you do not specify the shape parameter for a gamma prior on a standard deviation, bglmm() uses 2, which is rather dispersed. If you do not specify the prior expected value for a standard deviation, bglmm() defaults to a value that is based on a frequentist fit (definitely not a very Bayesian thing to do).

An alternative is available for \(\sigma\) when usecauchy=TRUE is used as an argument to bglmm(). In that case, we assume that \(\log(\sigma_i/\sigma_0)\) follows a standard Cauchy distribution, independently across the different \(\sigma_i\) parameters. This prior encourages shrinkage to zero when there is relatively little evidence that the parameters are non-zero, and it does little shrinkage toward zero when there is evidence that the parameters are far from zero. This has proven to be fairly unstable numerically in practice.
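
One way to picture this prior is to simulate it. The sketch below assumes the construction stated above, with \(\log(\sigma_i/\sigma_0)\) standard Cauchy and \(\sigma_0\) fixed at 1 for illustration:

```r
# Simulate sigma_i given sigma0 = 1 when log(sigma_i / sigma0) is
# standard Cauchy.
set.seed(1)
sigma0 <- 1
sigma_i <- sigma0 * exp(rcauchy(1e4))

# The heavy Cauchy tails put substantial prior mass both very near
# zero (strong shrinkage) and very far above sigma0 (little shrinkage).
quantile(sigma_i, c(.05, .5, .95))
```

Those extreme tails are also a hint at why this prior can be numerically unstable.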

10.2 Other bglmm() arguments

There is only one required argument for bglmm(), and that is the formula for the model. There are (hopefully reasonable) defaults for all other arguments. As with lm(), there are data= and subset= arguments that let you tell R where to look for data and which data to use.

bglmm() uses the rstan package to do Markov chain Monte Carlo. There are several optional arguments that control the MCMC. Here are some that might be useful now; others can wait.

  • n.samples says how long each Markov chain should be.
  • n.chains says how many parallel chains to run.
  • warmup says how many of the n.samples should be discarded as warm up runs.
  • thin controls how the MCMC results are retained. For example, thin=5 retains every fifth value.
  • quiet=TRUE suppresses the printing of progress information.
  • adapt_delta and max_treedepth control how the MCMC is run. You may receive diagnostic information telling you to increase their values to improve the MCMC run.
  • seed gives an optional integer seed for the random number generator. You need this to make your results repeatable.
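
The effect of thin= is easy to mimic by hand. This is not bglmm()'s internal code, just a sketch of what thin=5 does to a chain:

```r
# Keep every fifth draw, as thin = 5 would.
chain <- 1:100                               # stand-in for 100 MCMC draws
kept  <- chain[seq(1, length(chain), by = 5)]
length(kept)  # 20 draws retained
```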

Examples

Suppose that we have the differences for paired data, and we just want to fit the mean. Using the defaults and the gamma priors gives us:

library(cfcdae)
library(bcfcdae)
library(rstan)
fit <- bglmm(differences ~ 1)

If we wanted to assume that the prior mean was .5, we could do:

fit <- bglmm(differences-.5 ~ 1)

This would tell us how much the mean deviates from .5. Because the prior on the shifted mean is centered at zero, this is equivalent to centering the prior for the mean at .5. You would need to add .5 back into the posterior values to get estimates of the mean on the original scale.
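
The re-centering works the same way it would for any location shift. A toy illustration with ordinary sample means (made-up data, no MCMC needed):

```r
# Shifting the response by .5 shifts the estimated mean by .5;
# adding .5 back recovers the estimate on the original scale.
set.seed(7)
d <- rnorm(30, mean = 0.6, sd = 0.2)  # made-up paired differences
mean(d - .5) + .5  # identical to mean(d)
```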

If we wanted to specify that the standard deviation (around 0) of the prior distribution for the mean was probably around .1, but we were not really certain, we could do:

fit <- bglmm(differences-.5 ~ 1, sigma.mean=.1, sigma.shape=20)

If we were really sure that the prior standard deviation should be .1, we could do:

fit <- bglmm(differences-.5 ~ 1, sigma.mean=.1,sigma.shape=2000)

Suppose that we had the notched boards data and wished to estimate the difference in means between the two notch shapes. Using defaults, we would do

fit <- bglmm(strength ~ shape,data=NotchedBoards)

This model fits the overall mean and the deviation from the overall mean for each shape.

If we wanted to use a large standard deviation for the intercept (overall mean) parameter and a smaller one for the difference, we could use:

fit <- bglmm(strength ~ shape,data=NotchedBoards,sigma.mean.int=300)

If we wanted to fit the means individually we could do:

fit.notch.indiv <- bglmm(strength ~ shape-1,data=NotchedBoards,
  seed=10001, # for repeatability
  quiet=TRUE, # to suppress progress information
  sigma.mean=300,sigma.shape=3)

10.3 Compilation

One drawback of using bglmm() is that R will need to compile the model specification into an executable format. It only needs to do that once for each type of model within an R session, but it will need to do it again every time you start a new R session. This has two regrettable consequences:

  1. It’s slow.
  2. You need to have the right software or packages to do the compilation. R will try to download what is needed, but student experience has been that this is not always a failure-free process.

10.4 Looking at the results

There are both numerical and graphical summaries available. Begin with the summary() function. For the output of bglmm(), summary() provides a matrix with ten columns and one row for each parameter. The parameters include the obvious parameters in the model such as \(\sigma_0\) and the parameters in the model for the mean, but also included are the prior standard deviations of the parameters in the model for the mean. For example, for the fit bglmm(strength~shape-1) the parameters are the means for the two levels of shape (called shape1 and shape2), sigma0, and sigma.shape. Here is the summary:

> summary(fit.notch.indiv)
             mean se_mean     sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
shape1      289.0   0.270  17.60 255.0 277.0 289.0 300.0   323  4280    1
shape2      296.0   0.311  17.70 261.0 285.0 296.0 307.0   331  3250    1
sigma0       63.6   0.152   9.33  48.3  56.9  62.5  69.1    85  3750    1
sigma.shape 330.0   2.300 126.00 155.0 239.0 305.0 393.0   639  3000    1

The first column of the summary is the estimated posterior mean (for all intents and purposes, this is the usual Bayesian estimate) and third column is the estimated posterior standard deviation. The second column (labeled se_mean) is a numerical assessment of how accurately we have computed the posterior mean; it is usually not of much direct interest.

Summary columns 4 through 8 give percentiles of the posterior distribution for each parameter; these are the percentiles that you use to form a posterior credible interval (analog of a confidence interval). By default, the 2.5, 25, 50, 75, and 97.5 percentiles are reported. You can change those by including a probs argument, for example, probs=c(.005,.025,.975,.995).
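
Those percentiles can also be computed directly from a vector of posterior draws with quantile(). A sketch on simulated draws (how to extract the draws from a bglmm() fit is not shown here; the numbers below just mimic the shape1 posterior):

```r
# A 99% equal-tailed credible interval is just two percentiles of the
# posterior draws.
set.seed(3)
draws <- rnorm(4000, mean = 289, sd = 17.6)  # stand-in for draws of shape1
quantile(draws, probs = c(.005, .995))
```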

You can also choose which parameters to summarize by including a pars argument, for example, pars=c("shape1","shape2","sigma0").

There are several graphics available to study the results. Simply plotting a bglmm() fit gives two plots: the Pearson residuals against fitted values, and a normal probability plot of the Pearson residuals.
> par(mfrow=c(1,2))
> plot(fit.notch.indiv)

Figure 10.1: Residual plots for notched boards

> par(mfrow=c(1,1))
You can get histograms of the posterior densities of parameters:
> plot(fit.notch.indiv,"histogram",pars=c("shape1","shape2"))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Figure 10.2: Histograms of posterior densities

You can get interval estimates of the parameters:
> plot(fit.notch.indiv,"interval",pars=c("shape1","shape2"))
ci_level: 0.8 (80% intervals)
outer_level: 0.95 (95% intervals)

Figure 10.3: Interval estimates of shape means

You can get a plot of where each data point falls in its predicted distribution:
> plot(fit.notch.indiv,"pointwise")

Figure 10.4: Predictive distributions for data


An outlier would be out in the tail of its predictive distribution.

10.5 Assessing the MCMC

An MCMC works well when there is little correlation between successive samples and when the Markov chain fully explores the posterior distribution. It would be nice if MCMC always worked well, but problems are not infrequent. Here are some good, but not foolproof, ways to check your fit. If one of these says that the chain is behaving badly, then it is behaving badly. However, a chain can be behaving badly without these methods catching the problem.

The last two columns of the summary of a bglmm() fit provide some assessment. n_eff is the effective sample size; here, bigger is better. By default, bglmm() runs four chains of 2,000 iterations each, with the first half discarded. Thus the default reported sample size is 4,000. If there is a lot of correlation in the iterations, the effective sample size will be much smaller.
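
The idea behind n_eff can be sketched with base R's acf(): the effective size of a correlated chain of length N is roughly N divided by 1 plus twice the sum of the initial positive autocorrelations. This is only a rough version of the idea; rstan's actual estimator is considerably more refined.

```r
# Rough effective sample size for a single chain:
# N / (1 + 2 * sum of the initial run of positive autocorrelations).
rough_neff <- function(chain) {
  rho <- acf(chain, plot = FALSE, lag.max = 100)$acf[-1]  # drop lag 0
  cut <- which(rho <= 0)[1]               # stop at first non-positive lag
  if (!is.na(cut)) rho <- rho[seq_len(cut - 1)]
  length(chain) / (1 + 2 * sum(rho))
}

set.seed(5)
iid <- rnorm(4000)                                      # uncorrelated draws
ar1 <- as.numeric(arima.sim(list(ar = 0.9), n = 4000))  # sticky chain
rough_neff(iid)  # close to the chain length
rough_neff(ar1)  # far smaller
```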

Rhat is a non-convergence diagnostic. (This is in some dispute, but the builders of rstan, which bglmm() uses, like it and use it.) If Rhat is big, say bigger than 1.1 or 1.2, then you know that the MCMC is not mixing well, and the estimates you get are not very reliable. Sadly, an Rhat down near 1 does not mean that the MCMC is mixing well; it just means that you have not discovered it mixing poorly.

There are graphical analogs of these, too. We can plot the autocorrelation that is present in the iterates for each variable of interest. For a well-behaving chain these autocorrelations should drop to 0 quickly. Note, you may need to do library(rstan) prior to making some of these plots.
> plot(fit.notch.indiv,"autocor",pars=c("shape1","shape2"))

Figure 10.5: Autocorrelation for shape means

Here the autocorrelations drop to 0 quickly.

We can also plot the iterates for a parameter for each chain as a trace, and then overlay the traces. Ideally, these plots should look like featureless noise, with no patterns over time and no separation between the multiple chains.
> plot(fit.notch.indiv,"trace",pars=c("shape1","shape2"))

Figure 10.6: Trace plots for shape means

These trace plots look fine.

10.6 Divergent transitions

One problem that the MCMC can encounter is “divergent transitions.” This is an ominous warning message that you sometimes receive after bglmm(). Divergent transitions do not always signify a problem, but they are not to be encouraged.

Here are three things you can do to try to reduce divergent transitions:

  1. Increase adapt_delta above its default value of .8. Anything below 1 is allowable. Thus you might try adding the argument adapt_delta=.95 or higher.
  2. Increase max_treedepth above its default value of 10. Thus you might try adding the argument max_treedepth=15.
  3. Try a tighter prior distribution. This might involve setting sigma.shape to a larger value, for example sigma.shape=5.

One other thing to note is that divergent transitions can arise when the model you're using simply doesn't fit the data very well. This could happen when you put a tight prior near 0 on a parameter that the data suggest is rather farther from 0.

10.7 Comparing Bayesian fits

Suppose that we have two Bayesian models for data and we want to compare them with the idea of choosing the better model. We have two approaches. The first computes an information criterion for each model and selects the model with the lower IC. We will use the Leave One Out Information Criterion, or loo for short. The second approach computes the Bayes factor of model 1 to model 2. Factors greater than 1 favor model 1. The Bayes factor is fairly difficult to compute, and we typically only get an approximation. loo is easier to compute, but still non-trivial.

Consider two models for the notched boards data. One model is our previous model that computes a mean for each group, and the second model computes a single mean across both groups.
> fit.single <- bglmm(strength ~ 1,data=NotchedBoards,
+                     seed=10003, # for repeatability
+                     sigma.mean=300,sigma.shape=3,quiet=TRUE)
Use the loo() function in the loo package to compute the criteria:
> loo::loo(fit.notch.indiv)
[1] 291.7324
> loo::loo(fit.single)
[1] 289.7221

fit.single has the lower value, so it is preferred, but it is only lower by 2, which gives a reasonable, but not overwhelming, preference.

How about the Bayes factor?
> bayes_factor(fit.single,fit.notch.indiv)
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Iteration: 5
Iteration: 1
Iteration: 2
Iteration: 3
Iteration: 4
Iteration: 5
Estimated Bayes factor in favor of bridge1 over bridge2: 19.88565

The Bayes factor prefers the model with a single mean. A Bayes factor in the range of 3 to 20 is considered a positive preference for fit.single, and 19.9 is very close to being a strong preference.
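
With equal prior probabilities on the two models, a Bayes factor converts directly to a posterior model probability, which gives another way to read the result. A quick check of what a Bayes factor of about 19.9 implies:

```r
# Posterior probability of model 1 when both models have prior
# probability 1/2: BF / (1 + BF).
bf <- 19.9
bf / (1 + bf)  # about 0.95 in favor of the single-mean model
```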