Conditional Probability

Conditional probability is just like unconditional probability except it depends on the observed value(s) of some random variable(s).

If there are two random variables \(X\) and \(Y\), and you observe \(X\) before you observe \(Y\), then what you find out about \(X\) may change what you expect to find out about \(Y\) when you observe it.

If the (joint) probability mass function (PMF) for \(X\) and \(Y\) is \(f\), then the conditional PMF of \(Y\) given \(X\) is proportional to \(f(x, y)\) thought of as a function of \(y\) for fixed \(x\). So when we are conditioning on \(x\), the variable \(x\) is no longer playing the role of a random variable. It is fixed at its observed value throughout the discussion. Thus the PMF of \(Y\) given \(X\) is \[ f(y \mid x) = c \cdot f(x, y) \] where the constant \(c\) is chosen to make the left-hand side a probability distribution thought of as a function of \(y\) for fixed \(x\), that is, \[\begin{gather*} f(y \mid x) \ge 0, \qquad \text{for all $y$} \\ \sum_y f(y \mid x) = 1 \end{gather*}\]
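
To make the normalization concrete, here is a minimal sketch in R (the joint PMF is made up just for illustration). The constant \(c\) is one over \(\sum_y f(x, y)\) for the observed \(x\), so conditioning just renormalizes a slice of the joint PMF.

# hypothetical joint PMF of X and Y on {1, 2} x {1, 2, 3} (made up for illustration)
fxy <- matrix(c(0.10, 0.05, 0.20, 0.15, 0.25, 0.25), nrow = 2)
rownames(fxy) <- c("x=1", "x=2")
colnames(fxy) <- c("y=1", "y=2", "y=3")
# conditional PMF of Y given X = 1: take the x = 1 slice and renormalize
fy.given.x1 <- fxy["x=1", ] / sum(fxy["x=1", ])
fy.given.x1
sum(fy.given.x1)    # equals 1 by construction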

The PMF \(f(\,\cdot\, | x)\) depends on \(x\), but \(x\) is not an argument of the function, so it is really more like a parameter. In fact, there is really no difference between a parametric family of probability distributions \(f_\theta\) and a conditional distribution. In both cases the distribution depends on something that is not considered an argument of the function.

For this reason, Bayesian statisticians always write parametric families as conditional distributions. They write \(f(x \mid \theta)\) instead of \(f_\theta(x)\), but that is getting a bit ahead of ourselves.

The main point of this section is that conditioning changes the distribution, and the conditioning variable(s) is/are treated as fixed in the conditional distribution.

Agresti Section 1.2.5

Poisson Sampling

When we assume Poisson sampling we are assuming that the cells of the contingency table, which contain counts (nonnegative-integer-valued random variables), are independent Poisson (not necessarily with the same mean, hence not necessarily identically distributed).
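
For example, here is a minimal sketch in R of Poisson sampling for a \(2 \times 3\) table (the cell means are made up just for illustration).

set.seed(42)
mu <- matrix(c(10, 20, 30, 15, 25, 5), nrow = 2)    # made-up cell means
tab <- matrix(rpois(length(mu), mu), nrow = 2)      # independent Poisson counts
tab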

Multinomial Sampling

If we start with Poisson sampling and condition on the total we get multinomial sampling. If \(X_1\), \(\ldots,\) \(X_n\) are independent Poisson random variables and \(N = X_1 + \cdots + X_n\), then the conditional distribution of the random vector \(X = (X_1, \ldots, X_n)\) given \(N\) is multinomial.
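
Here is a minimal simulation sketch of that fact (the Poisson means \(\lambda_i\) and the conditioning value are made up just for illustration). The theory says the conditional distribution given \(N\) is multinomial with sample size \(N\) and cell probabilities \(\lambda_i / \sum_j \lambda_j\), so the conditional cell means divided by \(N\) should match those probabilities.

set.seed(42)
lambda <- c(2, 5, 3)                  # made-up Poisson means
nsim <- 1e5
x <- matrix(rpois(3 * nsim, lambda), nrow = 3)
N <- colSums(x)
keep <- N == 10                       # condition on the total N = 10
rowMeans(x[, keep]) / 10              # empirical conditional cell proportions
lambda / sum(lambda)                  # multinomial cell probabilities from the theory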

Recall that individuals being sampled form a Poisson process if no individual has any influence on any other. If you take a sample of such individuals over a fixed time interval, then you get Poisson sampling. If you take a sample of such individuals until you get a predetermined number of individuals, then you get multinomial sampling.

Conditioning on \(N\) is the same as fixing \(N\).

Product Multinomial Sampling

Here we fix more than one sum of counts. Let \(\mathcal{A}\) be a partition of the index set of the counts. This means each index \(i\) is an element of exactly one \(A \in \mathcal{A}\). Now we fix (condition on) \[ N_A = \sum_{i \in A} X_i, \qquad A \in \mathcal{A}. \]

Now let \(X_A\) denote the subvector of \(X\) that has just the components \(X_i\) for \(i \in A\). Then these subvectors are independent random vectors conditional on all of the \(N_A\). And the conditional distribution of \(X_A\) given all of the \(N_A\) is multinomial with sample size \(N_A\).

For a more concrete example, suppose we have a two-way table, in which the data form a matrix. If we condition on the row sums (those are the \(N_A\) defined above), then the rows of the table are independent multinomial random vectors. If we condition on the column sums, then the columns of the table are independent multinomial random vectors.
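
Here is a minimal sketch in R of product multinomial sampling for a \(2 \times 3\) table with the row totals fixed (the totals and cell probabilities are made up just for illustration).

set.seed(42)
row.totals <- c(50, 80)
prob <- rbind(c(0.2, 0.3, 0.5),
    c(0.4, 0.4, 0.2))
tab <- t(sapply(1:2, function(i) rmultinom(1, row.totals[i], prob[i, ])))
tab
rowSums(tab)    # equal to row.totals by construction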

If we start with Poisson sampling, and condition on \(N_A\), \(A \in \mathcal{A}\) as defined above, then we get product multinomial sampling.

If we start with multinomial sampling, and condition on \(N_A\), \(A \in \mathcal{A}\) as defined above, then we get product multinomial sampling.

Suppose \(\mathcal{B}\) is a partition of the index set that is coarser than \(\mathcal{A}\), which means that every element of \(\mathcal{A}\) is contained in some element of \(\mathcal{B}\). If we start with product multinomial sampling for partition \(\mathcal{B}\) and condition on \(N_A\), \(A \in \mathcal{A}\) as defined above, then we get product multinomial sampling for partition \(\mathcal{A}\).

In all three cases, no matter where we start, when we condition on more stuff we get the same thing as if we had decided to fix that stuff in advance.

Summary

Whether you consider this deep or trivial is up to you.

We will find that most of the time it doesn't matter what sampling scheme we "assume" because the same statistical inference results. (But not always.)

Likelihood Inference

Likelihood Function

For a parametric family of distributions with PMF \(f_\theta\), the likelihood for the model when the observed data are \(x\) is just the PMF \[ L(\theta) = f_\theta(x) \] with the roles of the parameter and data interchanged. In the PMF the data \(x\) is the variable and the parameter \(\theta\) is fixed. In the likelihood (left-hand side) the data \(x\) is fixed at its observed value and the parameter \(\theta\) is the variable.

You may think this is trivial, but everybody in statistics observes this pedantic distinction. You can really see the difference when it comes to calculus. The derivative of the likelihood function \(L\) requires us to differentiate with respect to \(\theta\) (because that is the variable in that function). The derivative of the PMF \(f_\theta\) requires us to differentiate with respect to \(x\) (because that is the variable in that function).
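
Here is a minimal sketch in R for the binomial model (the data \(x = 7\) successes in \(n = 10\) trials are made up just for illustration). R function dbinom evaluates the PMF, and evaluating it with the data held fixed and the parameter varying gives the likelihood.

x <- 7
n <- 10
likelihood <- function(theta) dbinom(x, n, theta)
theta <- seq(0, 1, length = 101)
plot(theta, likelihood(theta), type = "l", xlab = expression(theta),
    ylab = "likelihood")
abline(v = x / n, lty = 2)    # the maximizer turns out to be x / n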

Log Likelihood Function

For mathematical convenience, the log likelihood is often preferred. \[ l(\theta) = \log L(\theta) \]

Modifications

For reasons that cannot be fully understood until we are done with both frequentist likelihood inference and Bayesian inference (all of which is likelihood-based), it makes no difference whatsoever to statistical inference (frequentist or Bayesian) if

  • additive terms that do not contain the parameter(s) are dropped from the log likelihood

  • multiplicative terms that do not contain the parameter(s) are dropped from the likelihood
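
For example, for the binomial distribution the likelihood is \[ L(\pi) = \binom{n}{x} \pi^x (1 - \pi)^{n - x} \] and the binomial coefficient does not contain the parameter \(\pi\), so it can be dropped, leaving \(L(\pi) = \pi^x (1 - \pi)^{n - x}\) and hence the log likelihood \[ l(\pi) = x \log(\pi) + (n - x) \log(1 - \pi) \] (dropping the multiplicative term \(\binom{n}{x}\) from the likelihood is the same as dropping the additive term \(\log \binom{n}{x}\) from the log likelihood).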

"Principle" (in Scare Quotes) of Maximum Likelihood

In places where things are dumbed down to the point of being wrong, you might see it stated as a principle that the maximizer of the likelihood is a good point estimate of the parameter.

This is not a "principle" you should find in any statistics book.

There are statistical models such that the more data you have the worse the maximum likelihood estimator (MLE) is.

What is true is that for statistical models satisfying "suitable regularity conditions" (and neither we nor Agresti go into what those are) we have the following results.

  1. the MLE is a consistent and asymptotically normal (CAN) estimator of the parameter, and no other estimator can do better asymptotically than the MLE, except perhaps at a negligible set of parameter values.

  2. the MLE is root-\(n\)-consistent, that is, \[ \sqrt{n} (\hat{\theta}_n - \theta) \] converges in distribution to a mean-zero normal distribution (a multivariate normal distribution if \(\theta\) is a vector), where \(\hat{\theta}_n\) is the MLE for sample size \(n\).

  3. (under somewhat weaker regularity conditions than for item 1) if \(\tilde{\theta}_n\) is any root-\(n\)-consistent estimator, that is, \[ \sqrt{n} (\tilde{\theta}_n - \theta) \] converges to any distribution whatsoever, and we define \(\hat{\theta}_n\) to be the nearest local maximum of the likelihood to \(\tilde{\theta}_n\), then \(\hat{\theta}_n\) is again CAN and best possible (except perhaps for a negligible set of parameter values).

  4. The asymptotic variance in the asymptotic (normal) distribution is inverse Fisher information, either observed or expected.

    • Observed Fisher information is minus the Hessian (second derivative) matrix of the log likelihood, evaluated at the MLE.

    • Expected Fisher information is the expectation of observed Fisher information (treating the data as random).

  5. Log-linear models for categorical data analysis that are full exponential families (more on that later) always satisfy the regularity conditions for item 1.

  6. Curved submodels of those in item 5 (for example Agresti Sec. 1.5.4) always satisfy the regularity conditions for item 3 (but not necessarily for item 1).

Item 4 means that, so long as we can write down the log likelihood and calculate two derivatives, we know the asymptotic distribution of the MLE (under "suitable regularity conditions"). Item 5 says that for most models used in this class (but not all), the MLE can be defined as the global maximizer of the log likelihood. Item 6 says that for all models used in this class, the MLE can be defined as the nearest local maximum to a root-\(n\)-consistent estimator.
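
Here is a minimal sketch in R of item 4 for the binomial model (the data are made up just for illustration). Observed Fisher information is minus the second derivative of the log likelihood at the MLE, and its inverse gives the asymptotic variance.

x <- 7
n <- 10
theta.hat <- x / n
# log likelihood:      l(theta) = x log(theta) + (n - x) log(1 - theta)
# second derivative:   l''(theta) = - x / theta^2 - (n - x) / (1 - theta)^2
obs.info <- x / theta.hat^2 + (n - x) / (1 - theta.hat)^2
obs.info
1 / sqrt(obs.info)                        # asymptotic standard error of the MLE
sqrt(theta.hat * (1 - theta.hat) / n)     # agrees with the familiar formula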

So there is no "principle" that says the global maximizer is best (or even exists). (http://www.stat.umn.edu/geyer/5102/examp/like.html#mix discusses a statistical model for which the global maximizer never exists because the supremum of the log likelihood is infinity. Unfortunately, that model is not categorical data analysis.) But at least for categorical data analysis we do have the estimator of item 6, which we can call the MLE and does have desirable properties.

Except all of this is as sample size goes to infinity. Nothing says the exact sampling distribution of the MLE (for the \(n\) we are at) is well approximated by the asymptotic distribution no matter what \(n\) is.

But Geyer (2013, IMS Collections, Vol. 10, pp. 1--24), following earlier authors, shows that if the log likelihood is well approximated by a quadratic function, then the sampling distribution of the MLE is close to its asymptotic distribution (Agresti also mentions this). So that gives us some purchase on what we need to know about whether the asymptotic approximation is good.

We can also use simulation (the so-called parametric bootstrap, more on this later) to check how good the asymptotic approximation is, and also to make the approximation better if it isn't good already.
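
Here is a minimal sketch of such a parametric bootstrap check for the binomial model (the data are made up just for illustration): simulate many data sets from the fitted model, recompute the MLE for each, and compare its simulated sampling distribution to the asymptotic approximation.

set.seed(42)
x <- 7
n <- 10
theta.hat <- x / n
x.star <- rbinom(1e4, n, theta.hat)       # data sets simulated from the fitted model
theta.hat.star <- x.star / n              # MLE for each simulated data set
sd(theta.hat.star)                        # bootstrap standard error
sqrt(theta.hat * (1 - theta.hat) / n)     # asymptotic standard error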

Agresti Section 1.3.2

One quibble. It so happens that setting the first derivative of the log likelihood equal to zero and solving for the parameter seems to give the MLE \(\hat{\pi} = x / n\). But this is sloppy. The derivative does not exist at the boundary of the parameter space, and even if one uses one-sided derivatives, calculus does not say that the derivative is zero if the maximum occurs on the boundary. Thus to be careful, one needs better analysis (http://www.stat.umn.edu/geyer/5102/slides/s3.pdf slides 20, 21, 25, and 31).

Note that something is fishy about \(\pi = 0\) and \(\pi = 1\) anyway. In this case the asymptotic variance is zero (\(\sqrt{\pi (1 - \pi)} = 0\) in either case). So all the asymptotics says is \[ \sqrt{n} (\hat{\pi}_n - \pi) \stackrel{\mathcal{D}}{\longrightarrow} 0 \] (the right-hand side is the distribution concentrated at zero). It doesn't tell us much that is useful. (More on this later.)

Agresti Section 1.3.3

Likelihood-based hypothesis tests come in three kinds: Wald tests, Wilks tests (also called likelihood ratio tests), and Rao tests (also called score tests).

These tests are all asymptotically equivalent (under suitable regularity conditions). This means that for very large sample sizes they will have nearly equal values of the test statistic and \(P\)-value for the same data. (The test statistic and \(P\)-value are random quantities when the data are considered random, but for the same data all three tests will have nearly the same test statistics and \(P\)-values. The difference between the test statistics for any two of these tests will be negligible (for large \(n\)) compared to either test statistic itself.)

Thus asymptotics gives us no reason to choose one or the other.

Recommendations about which to use are based on mathematical convenience, pedagogical convenience, or simulations. Simulations, of course, must be based on one particular model (or perhaps a few models) and so cannot be general.

Under mathematical convenience we have the facts that Wald tests require fitting only the model of the alternative hypothesis (the big model), Rao (score) tests require fitting only the model of the null hypothesis (the little model), and Wilks (likelihood ratio) tests require fitting both.

In particular, R function summary computes lots of \(P\)-values for lots of tests all based on having fit only one model, the big model (and all of the \(P\)-values are for tests of little models that have one of the coefficients set to zero). Thus these must all be Wald tests.
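
Here is a minimal sketch with made-up data (the simulated regression is hypothetical, just for illustration): the tests shown by summary for a fitted logistic regression use only the one fitted model, so they are Wald tests.

set.seed(42)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.5 * x))
gout <- glm(y ~ x, family = binomial)
summary(gout)$coefficients    # estimates, standard errors, Wald z statistics and P-values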

The most famous example of score tests (and one which was invented long before general score tests) is the Pearson Chi-Square test for categorical data analysis. So whenever we use that, we are using a score test.
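
For example (made-up counts, just for illustration), R function chisq.test computes the Pearson chi-square statistic, hence a score test.

tab <- matrix(c(20, 30, 25, 25), nrow = 2)    # hypothetical 2 by 2 table
chisq.test(tab, correct = FALSE)              # Pearson chi-square test (a score test)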

Exact formulas for the test statistics for the Wald and Rao tests are given in my PhD level theory notes http://www.stat.umn.edu/geyer/8112/notes/tests.pdf, Section 1. But they are hard to apply except in simple special cases (and sometimes not even then). So mostly when we do one or the other of these tests, we will have a computer do all the hard work (as will be seen in the notes for Section 1.4 in Agresti).

TL;DR There is no one right way to do a hypothesis test.

Coverage of Confidence Intervals

Because the data (in this course) are discrete (counts), only a finite set of data values contains almost all of the probability. It follows that the actual coverage probability of a confidence interval (no matter what the recipe is) cannot be a flat function of the parameter. It cannot be exactly 0.95 for all values of the parameter (nor can it be exactly equal to any other nominal confidence level). As the parameter (say \(\theta\)) moves from inside to outside (or vice versa) the confidence interval for some data value (say \(x\)), the coverage probability must jump by \(f_\theta(x)\).

http://www.stat.umn.edu/geyer/5102/examp/coverage.html illustrates this for a variety of confidence intervals for the binomial distribution, those covered in Section 1.4 of Agresti plus some more.
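
Here is a minimal sketch of that kind of coverage calculation for the nominal 95% Wald interval with \(n = 25\) (the choice of interval and sample size are just for illustration). For each value of the true \(p\) the actual coverage is the sum of \(f_p(x)\) over the \(x\) whose interval contains \(p\), and the resulting curve is jagged, not flat at 0.95.

n <- 25
x <- 0:n
phat <- x / n
crit <- qnorm(0.975)
lower <- phat - crit * sqrt(phat * (1 - phat) / n)
upper <- phat + crit * sqrt(phat * (1 - phat) / n)
p <- seq(0.01, 0.99, by = 0.001)
coverage <- sapply(p, function(pp)
    sum(dbinom(x, n, pp)[lower <= pp & pp <= upper]))
plot(p, coverage, type = "l", ylab = "coverage probability")
abline(h = 0.95, lty = 2)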

TL;DR There is no one right way to do a confidence interval. Which is better than the others is a matter of opinion.

No confidence interval (recipe) for discrete data can be exact (actually achieve its nominal coverage for all values of the true unknown parameter).

Fuzzy confidence intervals (Geyer and Meeden, 2005, Statistical Science, 20, 358--387) are exact (actually achieve their nominal coverage for all values of the true unknown parameter) but are more complicated to use and harder to interpret and explain. We will spend a little time on them (5421 notes on binomial and 5421 notes on fuzzy) but not now.

Agresti Section 1.3.4

Frequentist statistical inference procedures come in trios: hypothesis tests, confidence intervals, and point estimates.

Every hypothesis test procedure determines a confidence interval produced by "inverting" the test. And vice versa.

If one has a confidence interval procedure (that constructs confidence intervals with any coverage probability), then the Hodges-Lehmann estimator associated with this procedure is the point to which these confidence intervals shrink as the coverage probability goes to zero.

In this course we will always use the MLE for frequentist point estimates (Bayes is different, more on that later).

But we see that we have three kinds of hypothesis tests (Wald, Wilks, Rao) and so three different confidence interval procedures. They all give back the MLE as the Hodges-Lehmann estimator.
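
Here is a minimal sketch for the Wald interval for the binomial (made-up data, just for illustration): as the confidence level goes to zero the intervals shrink to the MLE \(x / n\), which is therefore the Hodges-Lehmann estimator associated with this procedure.

x <- 7
n <- 10
phat <- x / n
se <- sqrt(phat * (1 - phat) / n)
for (conf in c(0.50, 0.10, 0.01))
    print(phat + c(-1, 1) * qnorm((1 + conf) / 2) * se)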

Agresti Section 1.4

For this section we move to some old notes, revised for this year as binom.Rmd.

Agresti Section 1.4.3

I have to disagree with Agresti here. When the MLE is on the boundary of the parameter space, it is true that the score and likelihood confidence intervals are not completely ridiculous, and the Wald interval is completely ridiculous (zero width).

However, when the true unknown parameter value is on the boundary of the parameter space, the "usual regularity conditions" for maximum likelihood are not satisfied. Thus, in this situation, none of the three tests, Wald, Wilks, or Rao, have any theoretical justification whatsoever.

Hence we recommend the intervals proposed by Geyer (2009, Electronic Journal of Statistics, 3, 259--289) and illustrated on the web page about coverage of confidence intervals. For the binomial distribution these intervals for coverage \(1 - \alpha\) are

  • when \(x = 0\), the confidence interval is \((0, 1 - \alpha^{1/n})\), and

  • when \(x = n\), the confidence interval is \((\alpha^{1/n}, 1)\).

These, at least, have a theoretical justification. For these data this interval is

alpha <- 0.05
n <- 25
c(0, 1 - alpha^(1 / n))
## [1] 0.0000000 0.1129281

Fuzzy intervals (Geyer and Meeden, 2005, Statistical Science, 20, 358--387; notes on fuzzy for this course) also do the right thing (are exact-exact) for this case.

Here is the 95% fuzzy confidence interval for the data discussed by Agresti (\(x = 0\), \(n = 25\)).

library(ump)
fci.binom(0, 25)
## 95 percent fuzzy confidence interval
## core is empty
## support is [0, 0.1322)

The interpretation is that the fuzzy interval is like part credit on a test question. Instead of points being either in or out of the interval, they are considered partially in and partially out. The fuzzy confidence interval is a function saying how much each point is partially in. This ranges from 0.95 (for 0.95 coverage) down to zero. This fuzzy confidence interval is zero for \(p\) above 0.1322 (we had to use cut-and-paste for this number, which one should never do, because of bad design of R function fci.binom, which gives no programmatic access to this number; it just prints the number and does not return anything as a value). If one wanted a conventional confidence interval corresponding to this fuzzy interval, it would be the reported support \([0, 0.1322)\), but this would be only conservative-exact rather than exact-exact. It would have coverage higher than the specified 0.95. The fuzzy interval says we don't need to consider all of these points to be fully in the interval to get the required coverage. (The fact that no point is considered fully in the interval is just a curious fact about the way these intervals work. It is related to the fact that the probability of \(x = 0\) goes to 1 as \(p \to 0\), so one would get more than the specified 0.95 coverage if points near zero had the graph of the fuzzy confidence interval greater than 0.95.)

Note that these intervals, which are theoretically justified, say that the score and likelihood intervals reported by Agresti, which are not theoretically justified (for true unknown parameter on the boundary of the parameter space), are too short, in fact way too short. This can be seen from the fact that their coverage dips way below 0.95 for \(p\) near zero and one (web page about coverage of confidence intervals).

Agresti Section 1.4.4

Use fuzzy (preceding section) instead of mid-\(P\)-value.

Agresti Section 1.5

We have our own notes for this, multi-simple.Rnw.

But there is a lot more to say on this subject. Basically, this whole course is about data that can be taken to arise from either Poisson, multinomial, or product multinomial sampling. The whole textbook (Agresti) and all of the lecture notes for this course are on this subject.

Agresti Section 1.5.6: Structural Zeros

Sometimes when a contingency table is laid out in an array, some cells have probability zero by design. They cannot occur. R function chisq.test cannot handle this case. But several other R functions usually can.

For the example in Section 1.5.6 in Agresti the data are

n <- matrix(c(30, 0, 63, 63), nrow = 2)
n
##      [,1] [,2]
## [1,]   30   63
## [2,]    0   63
And the log likelihood for the null hypothesis is \[\begin{align*} l(\pi) & = n_{1 1} \log(\pi^2) + n_{1 2} \log[\pi (1 - \pi)] + n_{2 2} \log(1 - \pi) \\ & = 2 n_{1 1} \log(\pi) + n_{1 2} \log(\pi) + n_{1 2} \log(1 - \pi) + n_{2 2} \log(1 - \pi) \end{align*}\]

(middle unnumbered displayed equation on p. 21 in Agresti). Applying calculus we get the following displayed equation in Agresti \[ l'(\pi) = \frac{2 n_{1 1}}{\pi} + \frac{n_{1 2}}{\pi} - \frac{n_{1 2}}{1 - \pi} - \frac{n_{2 2}}{1 - \pi} \] If we are shaky on the calculus, even R can do this

D(quote(n11 * log(pi^2) + n12 * log(pi * (1 - pi)) + n22 * log(1 - pi)),
    name = "pi")
## n11 * (2 * pi/pi^2) + n12 * (((1 - pi) - pi)/(pi * (1 - pi))) - 
##     n22 * (1/(1 - pi))

But it gets a somewhat messier formula. Setting our derivative equal to zero and multiplying both sides by \(\pi (1 - \pi)\) gives \[ 2 n_{1 1} (1 - \pi) + n_{1 2} (1 - \pi) - n_{1 2} \pi - n_{2 2} \pi = 0 \] or \[ 2 n_{1 1} + n_{1 2} - (2 n_{1 1} + n_{1 2} + n_{1 2} + n_{2 2}) \pi = 0 \] the unique solution of which is \[ \hat{\pi} = \frac{2 n_{1 1} + n_{1 2}}{2 n_{1 1} + n_{1 2} + n_{1 2} + n_{2 2}} \] and this agrees with the bottom unnumbered displayed equation on p 21 in Agresti.

pihat <- (2 * n[1,1] + n[1,2]) / (2 * n[1,1] + 2 * n[1,2] + n[2,2])
pihat
## [1] 0.4939759

That is the MLE for \(\pi\) for the null hypothesis. Then the MLE for the expected cell counts is the vector

e <- sum(n) * c(pihat^2, pihat * (1 - pihat), 1 - pihat)
e
## [1] 38.06590 38.99434 78.93976

And this agrees with Agresti. Hence the Pearson chi-square test statistic is

o <- as.vector(n)
o <- o[o > 0]
tstat <- sum((o - e)^2 / e)
tstat
## [1] 19.70606
pchisq(tstat, lower.tail = FALSE, df = 1)
## [1] 9.031458e-06

The null hypothesis has one parameter. The alternative has two parameters (because the probabilities for the three cells that are not the structural zero must sum to one). Hence the degrees of freedom is the difference \(2 - 1 = 1\).

We could also do a likelihood ratio (Wilks) test. The log likelihoods for the null and alternative hypotheses and the resulting test statistic are

pihat.vec.0 <- e / sum(n)
pihat.vec.1 <- o / sum(n)
logl0 <- sum(o * log(pihat.vec.0))
logl1 <- sum(o * log(pihat.vec.1))
tstat <- 2 * (logl1 - logl0)
tstat
## [1] 17.73787
pchisq(tstat, lower.tail = FALSE, df = 1)
## [1] 2.535289e-05

We will mercifully not try a Wald test.

None of this may make much sense. To do an analysis of these data one should really have already taken Stat 5101-5102, but that is not a prerequisite for this course.

TL;DR If there are structural zeros, then you either really need to know your theoretical statistics or you need help from someone who does. Standard software isn't set up for this.

Agresti Section 1.6

Skip for now; we will take up Bayes later.