This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/).
The version of R used to make this document is 4.0.3.
The version of the rmarkdown package used to make this document is 2.6.
Statisticians have long had the pejorative terms "data snooping" and "data dredging", for which many newer terms, such as \(P\)-hacking and HARKing, serve as partial replacements. Outside of statistics, the term "cherry picking" is often used to indicate selective use of evidence: ignoring evidence against one's pet theory and possibly mischaracterizing putative evidence for it.
Many Bayesians have long argued that "optional stopping" for Bayesians is not a problem. It does not change the likelihood and hence does not change Bayesian inference. A recent paper arguing this is
Jeffrey N. Rouder (2014)
Optional stopping: No problem for Bayesians
Psychonomic Bulletin & Review, 21, 301–308
DOI:10.3758/s13423-014-0595-4
More astute Bayesians have realized that this is a problem.
Rianne de Heide and Peter D. Grünwald (2020)
Why optional stopping can be a problem for Bayesians
Psychonomic Bulletin & Review, online
DOI:10.3758/s13423-020-01803-x
Paul R. Rosenbaum and Donald B. Rubin (1984)
Sensitivity of Bayes Inference with Data-Dependent Stopping Rules
The American Statistician, 38, 106–109
DOI:10.1080/00031305.1984.10483176
Here we look at stopping rules similar to those discussed by Rosenbaum and Rubin but involving even more "data snooping". We do Bayesian inference for independent and identically distributed normal data with known variance and a flat prior, so there is no difference between frequentist and Bayesian intervals. Thus there is nothing inherently Bayesian about the rest of our discussion. Whatever "data snooping" we allow affects frequentists and Bayesians in exactly the same way.
Without loss of generality, assume the known variance of our data is one. Our stopping rule is that we do not stop until the sample mean \(\bar{x}_n\) is greater than \(C / \sqrt{n}\), where \(n\) is the sample size and \(C > 0\) is a data-snooper-specified constant. Note that the frequentist or Bayesian interval estimate of the true unknown mean (with, as mentioned above, flat prior on this unknown mean) is \(\bar{x}_n \pm z_{\alpha / 2} / \sqrt{n}\), where \(z_{\alpha / 2}\) is the \(1 - \alpha / 2\) quantile of the standard normal distribution and \(1 - \alpha\) is the desired frequentist coverage probability or Bayesian posterior probability. Hence choosing \(C \ge z_{\alpha / 2}\) guarantees that the interval computed when we stop will exclude zero, even when zero is the true unknown mean.
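To spell out the last claim: if \(C \ge z_{\alpha / 2}\) and we stop because \(\bar{x}_n > C / \sqrt{n}\), then \[ \bar{x}_n - \frac{z_{\alpha / 2}}{\sqrt{n}} \ge \bar{x}_n - \frac{C}{\sqrt{n}} > 0, \] so the lower endpoint of the interval is strictly positive and the interval cannot contain zero.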
The law of the iterated logarithm says \[ \limsup_{n \to \infty} \frac{\bar{x}_n}{\sqrt{2 \log(\log(n)) / n}} = 1 \] with probability one. Since \(\bar{x}_n > C / \sqrt{n}\) is the same as \(\sqrt{n} \, \bar{x}_n > C\), and the law of the iterated logarithm says \(\sqrt{n} \, \bar{x}_n\) exceeds, say, \(\tfrac{1}{2} \sqrt{2 \log(\log(n))}\) infinitely often, a quantity that grows without bound, with probability one \(\bar{x}_n\) will eventually exceed \(C / \sqrt{n}\) for any \(C\).
But we may have to wait a very long time for that to happen, so we allow for minimum and maximum values of the sample size and see what happens. For efficiency, we code our stopping rule in C.
#include <R.h>
#include <Rmath.h>
#include <Rinternals.h>
#include <R_ext/Utils.h>
#include <math.h>

SEXP foo(SEXP crit_in, SEXP nmin_in, SEXP nmax_in)
{
    /* check that all arguments are length-one double vectors */
    if (! isReal(crit_in))
        error("argument crit must be storage mode double");
    if (! isReal(nmin_in))
        error("argument nmin must be storage mode double");
    if (! isReal(nmax_in))
        error("argument nmax must be storage mode double");
    if (LENGTH(crit_in) != 1)
        error("argument crit must be length one");
    if (LENGTH(nmin_in) != 1)
        error("argument nmin must be length one");
    if (LENGTH(nmax_in) != 1)
        error("argument nmax must be length one");

    const double crit = REAL(crit_in)[0];
    /* keep nmin and nmax as double so very large values
       (such as 1e11 or Inf) do not overflow a long */
    const double nmin = REAL(nmin_in)[0];
    const double nmax = REAL(nmax_in)[0];
    if (crit <= 0)
        error("argument crit must be positive");
    if (nmin <= 0)
        error("argument nmin must be positive");
    if (nmax <= 0)
        error("argument nmax must be positive");

    const double critsq = crit * crit;

    GetRNGstate();

    double sum_x = 0.0;
    double n = 0.0;
    for (;;) {
        R_CheckUserInterrupt();
        sum_x += norm_rand();
        n += 1;
        /* stop if n >= nmin and xbar_n > crit / sqrt(n),
           that is, if sum_x / sqrt(n) > crit */
        if (n >= nmin && sum_x > 0 && sum_x * sum_x > critsq * n) break;
        /* give up if the sample size hits nmax */
        if (n >= nmax) break;
    }

    PutRNGstate();

    /* return the sum of the data and the sample size */
    SEXP result = PROTECT(allocVector(REALSXP, 2));
    REAL(result)[0] = sum_x;
    REAL(result)[1] = n;
    UNPROTECT(1);
    return result;
}
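Before R can call this routine via .Call, the C code has to be compiled and loaded; when this document is built that presumably happens automatically. The following is only a sketch of one way to do it by hand, assuming the code above is saved in a (hypothetical) file named foo.c.
## compile the C code to a shared library (foo.so on Unix-alikes, foo.dll on Windows)
system("R CMD SHLIB foo.c")
## load the shared library so that .Call("foo", ...) can find the routine
dyn.load(paste0("foo", .Platform$dynlib.ext))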
We then exercise our code as follows.
set.seed(42)
nboot <- 30
crit <- qnorm(0.95)
crit
## [1] 1.644854
nmax <- 1e9
nmin <- 1e2
foo_star <- NULL
for (iboot in 1:nboot)
foo_star <- rbind(foo_star,
.Call("foo", as.double(crit), as.double(nmin), as.double(nmax)))
foo_star[ , 1] <- foo_star[ , 1] / sqrt(foo_star[ , 2])
colnames(foo_star) <- c("z", "n")
foo_star <- as.data.frame(foo_star)
print(foo_star, row.names = FALSE)
## z n
## 1.6451428 7150240
## 1.6448615 724582889
## 1.6660137 2299
## 1.6455300 144
## 1.7311039 119
## 1.6455967 17716
## 1.6464479 745835
## 1.6448807 3465383
## 0.4604622 1000000000
## 2.0905846 100
## 1.6553572 6549
## 1.6465587 99237
## -1.4574581 1000000000
## 1.6451190 332259
## 1.6449366 73604375
## 1.6491055 3304
## -0.5747132 1000000000
## 1.6474857 7176
## 1.6448879 19307419
## 0.9594091 1000000000
## -0.7115764 1000000000
## 1.7909918 100
## 0.9675176 1000000000
## -0.4218012 1000000000
## 1.6488675 16973
## 1.6505782 3399
## 1.6460052 89953
## 0.4606342 1000000000
## -1.2585646 1000000000
## 1.6501299 531
Having a smaller nmin would allow not only for shorter runs but also for a smaller proportion of runs hitting nmax, because there can be less negativity for our stopping rule to overcome. In the runs above we had \(n\) less than nmax with probability 0.7 (estimated, standard error 0.084).
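For the record, the estimate and standard error here are presumably the usual binomial proportion calculations, which can be reproduced from the simulation results as in the following sketch (using the foo_star, nmax, and nboot objects from the chunk above).
## proportion of runs that stopped before hitting nmax
p.hat <- mean(foo_star$n < nmax)
p.hat
## usual binomial standard error for a proportion estimated from nboot runs
sqrt(p.hat * (1 - p.hat) / nboot)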
If we decrease nmin to the minimum allowable, then we get a higher probability.
nmin <- 1
foo_star <- NULL
for (iboot in 1:nboot)
foo_star <- rbind(foo_star,
.Call("foo", as.double(crit), as.double(nmin), as.double(nmax)))
foo_star[ , 1] <- foo_star[ , 1] / sqrt(foo_star[ , 2])
colnames(foo_star) <- c("z", "n")
foo_star <- as.data.frame(foo_star)
print(foo_star, row.names = FALSE)
## z n
## 1.98339058 6
## 1.65123737 5481
## 1.66004563 1
## -0.01745529 1000000000
## 1.64488731 327446003
## 1.89862772 5
## 1.66972236 3799
## 1.70603012 42
## 1.67207643 1
## 2.05752042 11
## 1.75571501 1
## 1.65357076 579
## 1.73293120 33
## 1.64489872 23145336
## 0.16516065 1000000000
## 1.64568351 10290
## 1.65616908 3569
## 1.64565728 5859
## -1.05717551 1000000000
## 1.64602081 15299
## 1.64487472 270884872
## 1.64612420 51621
## 1.80804637 73
## 1.65043125 23389
## 1.65237816 3591
## 1.66255064 57
## 1.66428354 3682
## 1.88605638 1
## 1.64513525 4897600
## 1.64521104 4389553
Now we have \(n\) less than nmax with probability 0.9 (estimated, standard error 0.055). With the critical value we chose, the 90% interval is guaranteed not to contain the true unknown parameter value (zero in our simulations) whenever \(n\) is less than nmax.
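As a sanity check, one can verify this on the simulation results: in the notation of the foo_star data frame the lower endpoint of the interval is \(\bar{x}_n - z_{\alpha / 2} / \sqrt{n} = (z - \mathtt{crit}) / \sqrt{n}\), so every run that stopped before nmax should have \(z\) greater than crit. A sketch, using the objects from the chunk above:
## every run that stopped before nmax must have z > crit,
## so its interval has a strictly positive lower endpoint and excludes zero
with(foo_star, all(z[n < nmax] > crit))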
Our analysis using the law of the iterated logarithm shows we can increase this probability to 100% by setting nmax <- Inf, but that may result in huge amounts of computing time being used. We will just show what happens by bumping nmax by a factor of 100.
nmax <- nmax * 100
foo_star <- NULL
for (iboot in 1:nboot)
foo_star <- rbind(foo_star,
.Call("foo", as.double(crit), as.double(nmin), as.double(nmax)))
foo_star[ , 1] <- foo_star[ , 1] / sqrt(foo_star[ , 2])
colnames(foo_star) <- c("z", "n")
foo_star <- as.data.frame(foo_star)
print(foo_star, row.names = FALSE)
## z n
## -2.3706516 100000000000
## 1.6558742 3170
## 1.6457539 66883
## -1.6195818 100000000000
## 1.6456764 12438
## 1.6505346 7
## 1.6452212 4795523
## 1.6484579 1042
## -1.7841709 100000000000
## 1.6543119 494
## 1.7974073 37
## 1.8698472 11
## 1.6673684 80
## 1.7556894 7
## 1.6448561 8818986051
## 1.6573505 28
## 1.6752640 2141
## -0.5107717 100000000000
## 1.7337795 19
## 1.6448803 3408106
## 1.6464094 239236
## 1.6612827 787
## 1.6448648 5805155111
## 1.6450392 20831
## 1.6468050 4159
## 1.6448578 5434275625
## -0.3973488 100000000000
## 1.8058648 3
## 1.9054245 1
## -0.4673361 100000000000
Now we have \(n\) less than nmax with probability 0.8 (estimated, standard error 0.073).