University of Minnesota, Twin Cities School of Statistics Charlie's Home Page

A few years ago I wrote a couple of web pages (about one long run and burn-in) that were an attempt to clarify some of the issues about so-called MCMC diagnostics. But I must admit that those pages do not address the issue directly. This page does.

- Why are you so hard on MCMC diagnostics?
- A Digression about Regression Diagnostics
- Back to MCMC Diagnostics
- Conclusions

Isn't "bogosity", despite its humor, a bit strong? We know this is "ha ha, only serious."

And aren't you being inconsistent? Where is your rant about regression diagnostics?

And why are you the only one out of step? If everyone else likes MCMC diagnostics, despite their problems, what's the matter with you?

I don't mind regression diagnostics. I've never taught a regression course, but I do recommend simple regression diagnostics (plot of residuals versus fitted values, Q-Q plot of residuals) in intro stats.

But I also tell the students about their limited usefulness.

Regression diagnostics don't even *claim* to reliably diagnose *all* problems. Whatever diagnostic you use will miss some problems. That's why leave-`k`-out diagnostics and other complicated diagnostics are interesting.

So what I tell students in intro stats is that the purpose of diagnostics
is not to find *all* problems.

The purpose of regression diagnostics is to find obvious, gross, embarrassing problems that jump out of simple plots.

Back in the dinosaur era, when there wasn't any graphics software and you made plots with FORTRAN print statements, out of lots of asterisks on ugly green-striped fanfold paper, people did lots of ridiculously bogus regressions because they couldn't see how bogus they were.

Nowadays, there is no excuse for that. The diagnostics only take seconds to do.

But here is an interesting question to ponder. How bad does heteroscedasticity have to be before you can diagnose it? This depends on the sample size, so say 100. My answer, derived from student performance on exam questions, is that you need a factor of 3 (at least!) in error standard deviation from one side of the plot to the other; 2 just isn't enough.

Let's try it (courtesy of rweb.stat.umn.edu/Rweb). Click the Submit button to see an example. (This only works if you are a student, faculty, or staff at the University of Minnesota. If you are not, then you have to cut and paste the R commands into R on your own computer.)

Not so easy to see even with `het <- 3`. Try changing `het` to lower values (just edit the text in the web form and resubmit).
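The R commands from the web form are not reproduced in this text. As a rough stand-in, here is a minimal Python sketch (standard library only) of the kind of experiment described: 100 points, with the error standard deviation growing by a factor `het` from one side of the plot to the other. The model (`y = x + error`, with `x` uniform on (0, 1) and error SD rising linearly from 1 to `het`), the function names, and the crude numeric summary are all assumptions made for illustration, not the original R code.

```python
import random
import statistics

def simulate(n=100, het=3.0, seed=42):
    """Simulate y = x + e, where the SD of e rises linearly from 1 to `het`."""
    rng = random.Random(seed)
    xs = sorted(rng.random() for _ in range(n))
    ys = [x + rng.gauss(0.0, 1.0 + (het - 1.0) * x) for x in xs]
    return xs, ys

def ols_residuals(xs, ys):
    """Residuals from a simple least-squares line."""
    xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    intercept = ybar - slope * xbar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def spread_ratio(xs, ys):
    """Residual SD in the right half of the data divided by the left half."""
    res = ols_residuals(xs, ys)   # xs are sorted, so halves are left/right
    half = len(res) // 2
    return statistics.stdev(res[half:]) / statistics.stdev(res[:half])

# Repeat the experiment 200 times for each value of het.
for het in (1.0, 2.0, 3.0):
    ratios = [spread_ratio(*simulate(het=het, seed=s)) for s in range(200)]
    print(f"het={het}: median right/left residual SD ratio = "
          f"{statistics.median(ratios):.2f}")
```

Under these assumptions the left and right halves of the plot differ in spread by noticeably less than `het` itself, which may be part of why the pattern is hard to see by eye in a single sample of 100.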

If MCMC diagnostics were similar, with similarly limited claims, there would be nothing to object to.

I don't mind seat-of-the-pants MCMC diagnostics, such as time series plots, acf (autocorrelation function) plots, or Q-Q plots of batch means.

I've always used these myself, and I recommend them to students, for example, see the package vignette from my MCMC package for R.

There's no problem with these so long as it is understood that

these diagnostics only find obvious, gross, embarrassing problems that jump out of simple plots.

They are worthless for finding subtle problems.
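As an illustration of one such seat-of-the-pants check, here is a minimal Python sketch of a sample autocorrelation function applied to a toy AR(1) series standing in for MCMC output. The AR(1) series, its coefficient `rho`, and the function names are assumptions for illustration; this is not the code from the package vignette.

```python
import random

def ar1_chain(n, rho, seed=1):
    """Toy stand-in for MCMC output: x[t] = rho * x[t-1] + noise."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def acf(xs, max_lag):
    """Sample autocorrelations at lags 0..max_lag."""
    n = len(xs)
    mean = sum(xs) / n
    dev = [x - mean for x in xs]
    c0 = sum(d * d for d in dev) / n
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / (n * c0)
            for k in range(max_lag + 1)]

chain = ar1_chain(20_000, rho=0.9)
rs = acf(chain, 30)
# Crude text plot, one row of asterisks per displayed lag.
for lag in range(0, 31, 5):
    print(f"lag {lag:2d}: {rs[lag]:+.3f}  " + "*" * max(0, round(40 * rs[lag])))
```

Autocorrelations that decay very slowly are an obvious, gross problem of the kind such a plot can reveal. Autocorrelations that decay quickly say nothing about parts of the state space the chain never visited.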

Consider MCMC as a black box (see Wikipedia and Webopedia entries). We have software that runs a Markov chain having a specified equilibrium distribution. We don't know anything other than that.

- We don't know any details of the Markov chain transition mechanism.
- We don't know any "good" starting points.
- We don't know anything about the equilibrium distribution except what we learn from running the MCMC software.

This may seem extreme. How often do you know absolutely nothing about your MCMC algorithm and its equilibrium distribution?

On the other hand, how many users of MCMC ever use theorems about MCMC convergence? The user may know "something" but not enough to mathematically prove anything.

Thus the "black box" view is not extreme. It reflects the situation most MCMC users find themselves in.

Now what MCMC theory applies to black box MCMC? None of it!

And what MCMC diagnostics are useful? None of them!

The reason is obvious. Suppose there is an event `B` having high probability under the equilibrium distribution, and also suppose there exist "bad" starting points from which it takes the MCMC sampler software a very long time to reach `B` (say longer than the age of the universe even if Moore's Law continues to hold until then). What chance do you have to "diagnose" this?

None whatsoever, unless you can somehow guarantee that you start at a "good" starting point. But this is precisely what the black box view assumes you cannot do!
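A minimal Python sketch of this scenario follows. The target is a hypothetical toy mixture, 0.05 N(0, 1) + 0.95 N(50, 1), chosen purely for illustration; the event `B` is taken to be `x > 25`, and the sampler is plain random-walk Metropolis started at 0, a "bad" starting point. All names and numbers here are assumptions, and of course the sketch only works as a demonstration because the target is known analytically.

```python
import math
import random

def log_target(x):
    """Log density (up to a constant) of 0.05*N(0,1) + 0.95*N(50,1)."""
    a = math.log(0.05) - 0.5 * x * x
    b = math.log(0.95) - 0.5 * (x - 50.0) ** 2
    m = max(a, b)                      # log-sum-exp to avoid underflow
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def metropolis(n, start, step_sd=1.0, seed=0):
    """Random-walk Metropolis with normal proposals."""
    rng = random.Random(seed)
    x, lp = start, log_target(start)
    out = []
    for _ in range(n):
        y = x + rng.gauss(0.0, step_sd)
        lq = log_target(y)
        if math.log(1.0 - rng.random()) < lq - lp:   # accept the proposal
            x, lp = y, lq
        out.append(x)
    return out

chain = metropolis(50_000, start=0.0)          # a "bad" starting point
frac_in_B = sum(x > 25.0 for x in chain) / len(chain)
print(f"chain mean      = {sum(chain) / len(chain):.3f}")
print(f"estimated Pr(B) = {frac_in_B:.3f}")
print("true Pr(B)      = 0.950 (to three decimals)")
```

The chain wanders happily around 0 and looks like a well-behaved sample from a standard normal. Nothing in its output hints that 95% of the probability lies in a region it never reached.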

In a word, no.

To go off on a somewhat unrelated rant, MCMC isn't even statistics; it is a tool. It calculates (approximates, estimates, whatever), by Monte Carlo, probabilities and expectations that you cannot do analytically (either with pencil and paper or with a computer algebra system). The problems it is applied to need have nothing to do with probability and statistics.

The empirical analysis of the Markov chain, as in the package vignette from my MCMC package for R, does involve statistics. But such analyses come with no more solid guarantee than diagnostics. If your chain "works," then the empirical analysis gives accurate Monte Carlo standard errors. If your chain doesn't work, then the empirical analysis is garbage in, garbage out.
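One common form of such empirical analysis is the method of batch means. Here is a minimal Python sketch, assuming a toy AR(1) series as a stand-in for MCMC output and an arbitrary choice of 50 batches; the function names are hypothetical and this is not the vignette's code.

```python
import math
import random
import statistics

def ar1_chain(n, rho, seed=2):
    """Toy stand-in for MCMC output: x[t] = rho * x[t-1] + noise."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def batch_means_se(xs, n_batches=50):
    """Monte Carlo standard error of the sample mean via batch means."""
    blen = len(xs) // n_batches        # any leftover draws are ignored
    means = [statistics.fmean(xs[i * blen:(i + 1) * blen])
             for i in range(n_batches)]
    return statistics.stdev(means) / math.sqrt(n_batches)

def naive_se(xs):
    """Standard error that (wrongly) treats the draws as independent."""
    return statistics.stdev(xs) / math.sqrt(len(xs))

chain = ar1_chain(100_000, rho=0.9)
print(f"sample mean    = {statistics.fmean(chain):+.4f}")
print(f"naive SE       = {naive_se(chain):.4f}")
print(f"batch means SE = {batch_means_se(chain):.4f}")
```

For a positively autocorrelated chain like this one, the batch means standard error is considerably larger than the naive one. Either number is only meaningful if the chain has actually been sampling the equilibrium distribution.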

So MCMC has something to do with statistics, sort of, but not really. Fundamentally, it has nothing to do with statistics.

You have an expectation you want to calculate. It is a well-defined number, no more random than $\int_0^1 x^3 \, dx$.

If you think like a statistician about MCMC, this expectation, call it θ, is an unknown quantity, so call it a parameter, and your MCMC answer is a statistical point estimate of this parameter.
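To make this way of thinking concrete, here is a minimal Python sketch that pretends the integral of x^3 from 0 to 1 (which is exactly 1/4) is an unknown parameter and estimates it by Monte Carlo, with a standard error. As a simplifying assumption it uses independent uniform draws rather than a Markov chain; the seed and sample size are arbitrary.

```python
import math
import random
import statistics

rng = random.Random(2006)
n = 100_000

# theta = E[U^3] for U uniform on (0, 1), which equals the integral of x^3.
draws = [rng.random() ** 3 for _ in range(n)]
theta_hat = statistics.fmean(draws)
se = statistics.stdev(draws) / math.sqrt(n)

print(f"point estimate = {theta_hat:.4f}")
print(f"standard error = {se:.4f}")
print("exact value    = 0.2500")
```

The "parameter" here was never random; only the method of calculating it is.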

And in this way of thinking MCMC is exactly like regression. If you get an "unlucky" sample, there's nothing you can do. Better luck next time!

But nobody, not even statisticians, actually thinks about MCMC this way! No one is satisfied with "better luck next time". For one thing, "better luck" may take longer than the age of the universe to happen. And for another thing, most people don't think of the expectation you want to calculate as an "unknown parameter" and the MCMC sample as a "given", so that if it is "unlucky" then there's no remedy.

Not only is the sample not "given", neither is the sampler. There are zillions of samplers with the same equilibrium distribution.

My favorite way to improve samplers is simulated tempering. But there are lots of other ways to improve samplers. If you haven't tried hard to improve your sampler, then you can't expect any sympathy about your convergence problems.
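Simulated tempering is too long to sketch here, so the following minimal Python example illustrates only the simpler point that two samplers can share one equilibrium distribution yet behave very differently. Both are Metropolis samplers for a hypothetical toy target, 0.05 N(0, 1) + 0.95 N(50, 1); the second mixes in occasional large proposals, which keeps the proposal symmetric. The large step size (25) was picked using knowledge of where the modes are, which a real user would not have, so this is a toy and nothing more. All names and numbers are assumptions for illustration.

```python
import math
import random

def log_target(x):
    """Log density (up to a constant) of 0.05*N(0,1) + 0.95*N(50,1)."""
    a = math.log(0.05) - 0.5 * x * x
    b = math.log(0.95) - 0.5 * (x - 50.0) ** 2
    m = max(a, b)                      # log-sum-exp to avoid underflow
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def frac_above_25(n, start, big_step_prob, seed=0):
    """Metropolis with a symmetric mixture proposal.

    Each step proposes a normal increment with SD 25 (with probability
    big_step_prob) or SD 1 (otherwise).  Returns the fraction of
    iterations with x > 25.
    """
    rng = random.Random(seed)
    x, lp = start, log_target(start)
    count = 0
    for _ in range(n):
        sd = 25.0 if rng.random() < big_step_prob else 1.0
        y = x + rng.gauss(0.0, sd)
        lq = log_target(y)
        if math.log(1.0 - rng.random()) < lq - lp:   # accept the proposal
            x, lp = y, lq
        count += x > 25.0
    return count / n

for name, p in (("small steps only", 0.0), ("with big steps", 0.5)):
    est = frac_above_25(50_000, 0.0, p)
    print(f"{name:17s}: estimated Pr(x > 25) = {est:.3f}")
print("true value (to three decimals)       = 0.950")
```

Both samplers have the same equilibrium distribution, but only the second one gets anywhere near the right answer in a run of this length.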

But after you have tried hard to improve your sampler, after you have the best sampler you can devise, what then?

In the black box view, all samplers with the same equilibrium distribution are exactly alike!

- We don't know anything about these samplers other than that they have this equilibrium distribution.
- We don't know anything about the equilibrium distribution except what we learn from running these samplers (now plural).

But there is one obvious consequence of the black box view:

To find out anything, you have to run the sampler! The longer the run the better!

If you don't know any "good" starting points (and the black box view assumes you don't), then restarting the sampler at many "bad" starting points is (as we used to say in the sixties) part of the problem, not part of the solution.
And this issue is not merely theoretical.
People who have done really hard problems with MCMC and have
worked really hard on validation (worrying not only about convergence
but also about code correctness) have stories where problems didn't show
up except in a really long run taking *weeks* of computer time.

It is a sad fact about scholarly literature that it is foolishly optimistic. Everything must be given highly positive spin. If it isn't, the referees will stomp all over it. Thus the literature has a file drawer problem much larger than is generally recognized, extending far beyond `P` > 0.05.

That is why horror stories about weeks of MCMC running being necessary to "diagnose" problems do not appear in the literature.

What does appear in the literature (even in my own papers) are "toy problems". Many statisticians find this term offensive. They take great pride in having real problems as examples. But these "real" data turn out to be very small with only a few variables, being only a small subset of the data originally collected, and the questions addressed by the analysis turn out to have nothing to do with the actual scientific (business, whatever) questions the data were collected to address.

I sometimes call this *Pooh-Bah data* after the character in
Gilbert and Sullivan's Mikado who has the line

Merely corroborative detail, intended to give artistic verisimilitude to an otherwise bald and unconvincing narrative.

I understand how hard it is to do justice to real data in a statistical paper or textbook. Neither students nor referees nor other readers have any patience with it. Often the only thing that can be learned from real data is that it is very complicated, too messy for any simple analysis to work.

So we use toy data instead.

But understandable as it may be, this use of toy data teaches some very bad habits.

It's hard to know what lessons to learn from toy examples.

When are toy data too simplistic? When have they been chosen (consciously or unconsciously) to avoid problematic features of the method being illustrated? Does the method (consciously or unconsciously) use features of the toy problem that are not analogous to real applications?

I coined the term "honest cheating" for statistical cheating that is done right out in the open with nothing hidden from the reader, so by the canons of scientific publication it is completely honest. The classic example is multiple testing without correction. It's bogus, but knowledgeable readers are given enough information to see exactly how bogus it is and dismiss the claims of the paper. Naive readers are fooled.

Similar "honest cheating" goes on in the MCMC diagnostics literature.

- A diagnostic is dreamed up.
- A toy equilibrium distribution is dreamed up which is completely understood analytically without any MCMC.
- A toy MCMC sampler is dreamed up that exhibits the problem the diagnostic was designed to diagnose. Analytic knowledge of the equilibrium distribution is used in designing the sampler's flaws.
- The diagnostic does indeed diagnose the failure of the toy sampler.
- Well, duh!

It's bogus, but knowledgeable readers are given enough information to see exactly how bogus it is and dismiss the claims of the paper. Naive readers are fooled.

So I'm not really saying anything so different from what the other MCMC experts say (a bit ruder perhaps).

- If you can't get any theoretical guarantees about your MCMC sampler, then diagnostics are no more help than Linus's security blanket.
- If your sampler is too complicated to do theory about (and most are), then you are in the black box situation.
- When you are in the black box situation, only long runs of the sampler, the longer the better, have any chance of telling you anything correct about the stationary distribution.
- If you are worried about your sampler, improve it! There is no substitute for getting the best sampler you can.

Last modified: October 15, 2012 (fixed broken links).

Last modified before that: January 8, 2006.