University of Minnesota, Twin Cities School of Statistics Charlie's Home Page
A few years ago I wrote a couple of web pages (about one long run and burn-in) that were an attempt to clarify some of the issues about so-called MCMC diagnostics. But I must admit that those pages do not address the issue directly. This page does.
Isn't "bogosity," despite its humor, a bit strong?
We know this is "ha ha, only serious."
And aren't you being inconsistent? Where is your rant about regression diagnostics?
And why are you the only one out of step? If everyone else likes MCMC diagnostics, despite their problems, what's the matter with you?
I don't mind regression diagnostics. I've never taught a regression course, but I do recommend simple regression diagnostics (plot of residuals versus fitted values, Q-Q plot of residuals) in intro stats.
But I also tell the students about their limited usefulness.
Regression diagnostics don't even claim to reliably diagnose all problems. Whatever diagnostic you use will miss some problems. That's why leave-k-out diagnostics and other complicated diagnostics are interesting.
So what I tell students in intro stats is that the purpose of diagnostics is not to find all problems.
The purpose of regression diagnostics is to find obvious, gross, embarrassing problems that jump out of simple plots.
Back in the dinosaur era, when there wasn't any graphics software and you made plots with FORTRAN print statements out of lots of asterisks on ugly green striped fanfold paper, people did lots of ridiculously bogus regressions because they couldn't see how bogus they were.
Nowadays, there is no excuse for that. The diagnostics only take seconds to do.
But here is an interesting question to ponder. How bad does heteroscedasticity have to be before you can diagnose it? This depends on the sample size, so say 100. My answer, derived from student performance on exam questions, is that you need a factor of 3 (at least!) in error standard deviation from one side of the plot to the other; 2 just isn't enough.
Let's try it (courtesy of rweb.stat.umn.edu/Rweb). Click the Submit button to see an example. (This only works if you are a student, faculty, or staff at the University of Minnesota. If you are not, then you have to cut and paste the R commands into R on your own computer.)
Not so easy to see even with het <- 3. Try changing het to lower values (just edit the text in the web form and resubmit).
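For readers without Rweb access, here is a minimal sketch of the same kind of experiment, written in Python rather than R (the R commands from the web form are not reproduced on this page). Everything specific in it is an assumption made for illustration: the error standard deviation is taken to grow linearly from 1 at the left edge of the plot to het at the right edge, the true regression line and the seed are arbitrary, and the function names simulate_residuals and sd_ratio are hypothetical.

```python
import random
import statistics


def simulate_residuals(n=100, het=3.0, seed=42):
    """Simulate y = 1 + x + error, where the error SD grows linearly from 1
    (left edge) to `het` (right edge), fit least squares by hand, and
    return (fitted values, residuals)."""
    rng = random.Random(seed)
    x = [i / (n - 1) for i in range(n)]            # sorted grid on [0, 1]
    sigma = [1.0 + (het - 1.0) * xi for xi in x]   # assumed SD pattern
    y = [1.0 + xi + rng.gauss(0.0, s) for xi, s in zip(x, sigma)]

    # Ordinary least squares for a single predictor.
    xbar = statistics.fmean(x)
    ybar = statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar

    fitted = [intercept + slope * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, resid


def sd_ratio(resid):
    """Residual SD in the right half of the plot divided by the SD in the
    left half (the residuals are in increasing order of x)."""
    half = len(resid) // 2
    return statistics.stdev(resid[half:]) / statistics.stdev(resid[:half])


fitted, resid = simulate_residuals(het=3.0)
print(f"right-half / left-half residual SD: {sd_ratio(resid):.2f}")
```

Plot resid against fitted with whatever graphics tool you like and judge by eye; the printed ratio is only a crude numerical stand-in for the plot. Rerun with smaller values of het (and other seeds) to see how quickly the pattern stops being visible.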
If MCMC diagnostics were similar, with similarly limited claims, there would be nothing to object to.
I don't mind seat-of-the-pants MCMC diagnostics, such as time series plots, acf (autocorrelation function) plots, or Q-Q plots of batch means.
I've always used these myself, and I recommend them to students; for example, see the package vignette from my MCMC package for R.
There's no problem with these so long as it is understood that these diagnostics only find obvious, gross, embarrassing problems that jump out of simple plots. They are worthless for finding subtle problems.
Consider MCMC as a black box (see Wikipedia and Webopedia entries). We have software that runs a Markov chain having a specified equilibrium distribution. We don't know anything other than that. In particular, we don't know any "good" starting points.
This may seem extreme. How often do you know absolutely nothing about your MCMC algorithm and its equilibrium distribution?
On the other hand, how many users of MCMC ever use theorems about MCMC convergence? The user may know something but not enough to mathematically prove anything.
Thus the "black box" view is not extreme. It reflects the situation most MCMC users find themselves in.
Now what MCMC theory applies to black box MCMC? None of it!
And what MCMC diagnostics are useful? None of them!
The reason is obvious. Suppose there is an event B having high probability under the equilibrium distribution, and also suppose there exist "bad" starting points from which it takes the MCMC sampler software a very long time to reach B (say longer than the age of the universe even if Moore's Law continues to hold until then). What chance do you have to diagnose this?
None whatsoever, unless you can somehow guarantee that you start at a "good" starting point. But this is precisely what the black box view assumes you cannot do!
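A toy version of this situation can be sketched in a few lines of Python. All the specifics are assumptions chosen for illustration: the equilibrium distribution is the mixture 0.01 N(0, 1) + 0.99 N(20, 1), the event B is {x > 10} (probability about 0.99), the sampler is a random-walk Metropolis algorithm with proposal standard deviation 1, and the "bad" starting point is 0. The names log_target and metropolis are hypothetical.

```python
import math
import random
import statistics


def log_target(x):
    """Log density, up to a constant, of 0.01 * N(0, 1) + 0.99 * N(20, 1),
    computed with log-sum-exp so that it never underflows to log(0)."""
    a = math.log(0.01) - 0.5 * x * x
    b = math.log(0.99) - 0.5 * (x - 20.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def metropolis(start, n_iter, step=1.0, seed=0):
    """Random-walk Metropolis with normal proposals; returns the whole path."""
    rng = random.Random(seed)
    x, lx = start, log_target(start)
    path = []
    for _ in range(n_iter):
        y = x + rng.gauss(0.0, step)
        ly = log_target(y)
        # Accept with probability min(1, target(y) / target(x)).
        if rng.random() < math.exp(min(0.0, ly - lx)):
            x, lx = y, ly
        path.append(x)
    return path


path = metropolis(start=0.0, n_iter=20_000)
half = len(path) // 2
frac_in_B = sum(x > 10.0 for x in path) / len(path)
print(f"mean of first half:   {statistics.fmean(path[:half]):.2f}")
print(f"mean of second half:  {statistics.fmean(path[half:]):.2f}")
print(f"fraction of run in B: {frac_in_B}")
```

The run should look perfectly stable, with both half-run means near 0, and nothing in the output hints that B was never visited. To cross from one mode to the other the chain would have to pass through a region where the density is smaller by a factor of roughly exp(-50), which will not happen in any run you can afford.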
In a word, no.
To go off on a somewhat unrelated rant, MCMC isn't even statistics; it is a tool. It calculates (approximates, estimates, whatever), by Monte Carlo, probabilities and expectations that you cannot calculate analytically (either with pencil and paper or with a computer algebra system). The problems it is applied to need have nothing to do with probability and statistics.
The empirical analysis of the Markov chain, as in the package vignette from my MCMC package for R, does involve statistics. But such analyses come with no more solid guarantee than diagnostics. If your chain works, then the empirical analysis gives accurate Monte Carlo standard errors. If your chain doesn't work, then the empirical analysis is garbage in, garbage out.
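One common form of such an empirical analysis is the method of batch means. The Python sketch below is an illustration only, not the code from the vignette: the function names batch_means and ar1_chain are hypothetical, the choice of 20 batches is arbitrary, and the AR(1) series stands in for the output of a chain that "works."

```python
import math
import random
import statistics


def batch_means(chain, n_batches=20):
    """Estimate the mean of `chain` and its Monte Carlo standard error (MCSE)
    from non-overlapping batch means. Leftover iterations at the end that
    do not fill a batch are ignored."""
    blen = len(chain) // n_batches
    if blen == 0:
        raise ValueError("chain is shorter than the number of batches")
    means = [statistics.fmean(chain[i * blen:(i + 1) * blen])
             for i in range(n_batches)]
    estimate = statistics.fmean(means)
    # If the batches are long enough to be nearly independent, the batch
    # means are nearly i.i.d., so the usual standard error formula applies.
    mcse = statistics.stdev(means) / math.sqrt(n_batches)
    return estimate, mcse


def ar1_chain(n, rho=0.9, seed=0):
    """Stationary AR(1) series with N(0, 1) marginal distribution."""
    rng = random.Random(seed)
    x = rng.gauss(0.0, 1.0)
    out = []
    for _ in range(n):
        x = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        out.append(x)
    return out


estimate, mcse = batch_means(ar1_chain(50_000))
print(f"estimate {estimate:.3f}, MCSE {mcse:.3f} (true mean is 0)")
```

Note what the calculation does not do: it uses only the iterations the chain actually produced. Fed a chain that never left a low-probability region, it will report a small standard error around the wrong answer.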
So MCMC has something to do with statistics, sort of, but not really. Fundamentally, it has nothing to do with statistics.
You have an expectation you want to calculate. It is a well-defined number, no more random than ∫₀¹ x³ dx.
If you think like a statistician about MCMC, this expectation, call it θ, is an unknown quantity, so call it a parameter, and your MCMC answer is a statistical point estimate of this parameter.
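As a small illustration of this way of thinking, the integral of x³ over (0, 1), whose exact value is 1/4, can be treated as an unknown θ and "estimated" from random draws. The sketch below uses ordinary Monte Carlo with independent uniform draws, not MCMC, and the function name mc_integral, the sample size, and the seed are all arbitrary choices for illustration.

```python
import math
import random
import statistics


def mc_integral(n=100_000, seed=0):
    """Estimate theta = E(U**3) for U uniform on (0, 1), which equals the
    integral of x**3 over (0, 1), and return (estimate, standard error)."""
    rng = random.Random(seed)
    draws = [rng.random() ** 3 for _ in range(n)]
    estimate = statistics.fmean(draws)
    std_err = statistics.stdev(draws) / math.sqrt(n)
    return estimate, std_err


estimate, std_err = mc_integral()
print(f"estimate {estimate:.4f} +/- {std_err:.4f} (exact value 0.25)")
```

The number being estimated is not random at all; only the method of calculation is.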
And in this way of thinking MCMC is exactly like regression. If you get an "unlucky" sample, there's nothing you can do. Better luck next time!
But nobody, not even statisticians, actually thinks about MCMC this way! No one is satisfied with "better luck next time." For one thing, "better luck" may take longer than the age of the universe to happen. And for another thing, most people don't think of the expectation you want to calculate as an "unknown parameter" and the MCMC sample as a "given," so that if it is "unlucky" there's no remedy.
Not only is the sample not "given," neither is the sampler. There are zillions of samplers with the same equilibrium distribution.
My favorite way to improve samplers is simulated tempering. But there are lots of other ways to improve samplers. If you haven't tried hard to improve your sampler, then you can't expect any sympathy about your convergence problems.
But after you have tried hard to improve your sampler, after you have the best sampler you can devise, what then?
In the black box view, all samplers with the same equilibrium distribution are exactly alike!
But there is one obvious consequence of the black box view.
To find out anything, you have to run the sampler! The longer the run the better!
If you don't know any "good" starting points (and the black box view assumes you don't), then restarting the sampler at many "bad" starting points is (as we used to say in the sixties) "part of the problem, not part of the solution."
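A deliberately contrived Python sketch shows the arithmetic behind this. The assumptions, all made up for illustration, are: the state space is the integers 0 to 1000, the equilibrium distribution is proportional to 1.5 to the power x (so essentially all the probability is in B = {x ≥ 500}), the sampler is Metropolis with proposals of plus or minus 1, and the "bad" starting point is 0. The names log_target and run_chain are hypothetical. The same budget of 10,000 iterations is spent two ways.

```python
import math
import random

UPPER = 1000            # state space is {0, 1, ..., UPPER}
LOG_R = math.log(1.5)   # target proportional to 1.5**x: mass piles up near UPPER


def log_target(x):
    """Log of the unnormalized target; minus infinity off the state space."""
    return x * LOG_R if 0 <= x <= UPPER else -math.inf


def run_chain(start, n_iter, rng):
    """Metropolis with +/-1 proposals; returns the largest state visited."""
    x = start
    highest = start
    for _ in range(n_iter):
        y = x + rng.choice((-1, 1))
        log_ratio = log_target(y) - log_target(x)
        # Proposals off the state space have acceptance probability zero.
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = y
        highest = max(highest, x)
    return highest


rng = random.Random(1)
budget = 10_000

# 100 restarts of 100 iterations each, every one from the bad starting point.
short_best = max(run_chain(0, 100, rng) for _ in range(budget // 100))

# One long run with the same total number of iterations.
long_best = run_chain(0, budget, rng)

print(f"highest state reached by any short run: {short_best}")
print(f"highest state reached by the long run:  {long_best}")
```

A run of 100 iterations moves at most 100 steps, so no number of restarts from 0 can ever reach B. The single long run drifts upward at about one sixth of a step per iteration and gets there. Real problems are not this transparent, which is the point: in the black box view you do not know how long is long enough, only that longer is better.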
And this issue is not merely theoretical. People who have done really hard problems with MCMC and have worked really hard on validation (worrying not only about convergence but also about code correctness) have stories where problems didn't show up except in a really long run taking weeks of computer time.
It is a sad fact about scholarly literature that it is foolishly optimistic. Everything must be given highly positive spin. If it isn't, the referees will stomp all over it. Thus the literature has a file drawer problem much larger than is generally recognized, extending far beyond P > 0.05.
That is why horror stories about weeks of MCMC running being necessary to diagnose problems do not appear in the literature.
What does appear in the literature (even in my own papers) are "toy problems." Many statisticians find this term offensive. They take great pride in having real problems as examples. But these "real" data turn out to be very small with only a few variables, being only a small subset of the data originally collected, and the questions addressed by the analysis turn out to have nothing to do with the actual scientific (business, whatever) questions the data were collected to address.
I sometimes call this Pooh-Bah data after the character in Gilbert and Sullivan's Mikado who has the line
Merely corroborative detail, intended to give artistic verisimilitude to an otherwise bald and unconvincing narrative.
I understand how hard it is to do justice to real data in a statistical paper or textbook. Neither students nor referees nor other readers have any patience with it. Often the only thing that can be learned from real data is that it is very complicated, too messy for any simple analysis to work.
So we use toy data instead.
But understandable as it may be, this use of toy data teaches some very bad habits.
It's hard to know what lessons to learn from toy examples.
When are toy data too simplistic? When have they been chosen (consciously or unconsciously) to avoid problematic features of the method being illustrated? Does the method (consciously or unconsciously) use features of the toy problem that are not analogous to real applications?
I coined the term "honest cheating" for statistical cheating that is done right out in the open with nothing hidden from the reader, and so, by the canons of scientific publication, is completely honest. The classic example is multiple testing without correction. It's bogus, but knowledgeable readers are given enough information to see exactly how bogus it is and dismiss the claims of the paper. Naive readers are fooled.
Similar "honest cheating" goes on in the MCMC diagnostics literature.
It's bogus, but knowledgeable readers are given enough information to see exactly how bogus it is and dismiss the claims of the paper. Naive readers are fooled.
So I'm not really saying anything so different from what the other MCMC experts say (a bit ruder perhaps).
Last modified: October 15, 2012 (fixed broken links).
Last modified before that: January 8, 2006.