Rules
See the Section about Rules for Quizzes and Homeworks on the General Info page.
Your work handed in to Moodle should be a plain text file with R commands and comments that can be run to reproduce what you did. We do not take your word for what the output is. We run it ourselves.
Note: Plain text specifically excludes Microsoft Word native format (extension .docx). If you have to use Word as your text editor, then save as and choose the format to be Plain Text (.txt) or something like that. Then upload the saved plain text file.
Note: Plain text specifically excludes PDF (Adobe Portable Document Format) (extension .pdf). If you use Sweave, knitr, or Rmarkdown, upload the source (extension .Rnw or .Rmd), not PDF or any other kind of output.
If you have questions about the quiz, ask them in the Moodle forum for this quiz. Here is the link for that https://ay16.moodle.umn.edu/mod/forum/view.php?id=1368653.
You must be in the classroom, Armory 202, while taking the quiz.
Quizzes must be uploaded by the end of class (1:10). Moodle actually allows a few minutes after that. Here is the link for uploading the quiz https://ay16.moodle.umn.edu/mod/assign/view.php?id=1368662.
Homeworks must be uploaded before midnight the day they are due. Here is the link for uploading the homework https://ay16.moodle.umn.edu/mod/assign/view.php?id=1368665.
Quiz 6
Problem 1
This is a simulation study (course notes about simulation).
We are going to study almost but not exactly the same model as in Section 5 of those course notes. The model for this problem is normal with mean θ and variance θ⁴ (not θ² as in the example in the notes).
The estimators we want to compare are the sample mean (calculated by mean(x) if the data are x) and the signed square root of the sample sd (calculated by sign(mean(x)) * sqrt(sd(x)) if the data are x). Just those two estimators.
Use sample size n = 15.
Use simulation truth parameter value θ = e (that is, theta <- exp(1)).
Use at least 10⁴ simulations.
As in the course notes, make a table comparing the MSE of these estimators (with MCSE as appropriate). You do not have to format these nicely in a table using Rmarkdown. As long as you print each number (one estimate of MSE and one estimate of MCSE of MSE for each estimator, which is four numbers) with a clear indication (maybe a comment, maybe a row or column label in a matrix) of what it is, that's enough.
The R code that makes the tables at the ends of Section 5.5 of those course notes and Section 5.6 of those course notes is hidden (not in the HTML) but can be seen in the Rmarkdown source of those course notes.
Be sure to use the principle of common random numbers.
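Putting the instructions above together, here is a minimal sketch of what such a simulation study might look like (variable names are my own, not from the course notes; check your own work against the course notes):

```r
# sketch only: simulation study comparing the two estimators under the
# model Normal(theta, theta^4), with the principle of common random
# numbers (both estimators are computed from the same simulated x)
nsim <- 1e4
n <- 15
theta <- exp(1)
mhat <- double(nsim)
shat <- double(nsim)
for (isim in 1:nsim) {
    x <- rnorm(n, mean = theta, sd = theta^2)
    mhat[isim] <- mean(x)
    shat[isim] <- sign(mean(x)) * sqrt(sd(x))
}
# MSE of each estimator and MCSE of each MSE estimate
mse <- c(mean = mean((mhat - theta)^2),
    sqrt.sd = mean((shat - theta)^2))
mcse <- c(mean = sd((mhat - theta)^2),
    sqrt.sd = sd((shat - theta)^2)) / sqrt(nsim)
rbind(mse, mcse)
```

Using one simulated x for both estimators (rather than simulating separately for each) is what the principle of common random numbers means here: the comparison between the estimators is then not contaminated by independent simulation noise.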
Problem 2
The R command
x <- read.csv(url("http://www.stat.umn.edu/geyer/5102/data/prob7-1.txt"))$x
assigns one R object x, which is the data vector for this problem.
These same data were also used in Section 5 of the course notes on models, Part II, and Section 9 of those course notes, but unlike in those notes we are going to investigate different estimators.
The estimators of location and scale we are going to use are the 25% trimmed mean (calculated by mean(x, trim = 0.25) if the data are x) and the interquartile range (calculated by IQR(x) if the data are x), respectively. Both are highly robust, tolerating up to 25% gross outliers or errors in the data.
Calculate a 95% bootstrap t confidence interval using these data and these estimators (the trimmed mean is the estimator of the parameter of interest and the IQR is the sdfun).
Use the method of the first example in Section 6.1.4 of the course notes on the bootstrap.
Do not use the method of the second example in that section, where the sdfun argument is omitted.
You could also use the method of Section 6.1.3 of those notes, but why? It is a lot more code to fuss with.
Use at least 10⁴ bootstrap samples.
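One way this might look, a sketch assuming the R function boott from the CRAN package bootstrap (the function used in Section 6.1.4 of the course notes; check help("boott") for the exact argument conventions, which I may not have exactly right):

```r
# sketch only: bootstrap t interval via boott from CRAN package "bootstrap"
library(bootstrap)
x <- read.csv(url("http://www.stat.umn.edu/geyer/5102/data/prob7-1.txt"))$x
# estimator of the parameter of interest: 25% trimmed mean
theta.hat <- function(x) mean(x, trim = 0.25)
# the sdfun: IQR (extra arguments required by the boott interface)
sd.hat <- function(x, nbootsd, theta, ...) IQR(x)
bout <- boott(x, theta.hat, sdfun = sd.hat, nboott = 1e4,
    perc = c(0.025, 0.975))
bout$confpoints  # the 95% bootstrap t confidence interval
```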
Problem 3
This problem is about two-parameter maximum likelihood, which was covered in Sections 6–9 of the course notes on models, Part II.
In particular, it is about Wald-type confidence intervals, exemplified in Section 9.3.1 of those course notes and Section 9.3.2 of those course notes.
However, in order not to use a distribution already used in the course notes, we are going to use the logistic distribution (which gives logistic regression its name).
The R function dlogis calculates the PDF of this distribution and works like all other functions for distributions R knows about. The help page help("dlogis") gives a mathematical equation for the PDF, but you shouldn't need it for this problem (you can use it if you think you need to).
This is a symmetric distribution. The population median is the center of symmetry, which is the location parameter. The interquartile range (IQR) of the standard logistic distribution is
> qlogis(0.75) - qlogis(0.25)
[1] 2.197225
times the scale parameter (the number above is for the default value of the scale parameter, which is one). We can see this from
> foo <- rexp(5)
> foo
[1] 1.3009717 0.1212703 0.2030711 0.6550454 0.9039590
> (qlogis(0.75, scale = foo) - qlogis(0.25, scale = foo)) / foo
[1] 2.197225 2.197225 2.197225 2.197225 2.197225
Hence the sample median is a good estimator of location and the IQR divided by
qlogis(0.75) - qlogis(0.25)
is a good estimator of scale.
The R command
x <- read.csv(url("http://www.stat.umn.edu/geyer/s17/3701/data/q6p3.csv"))$x
assigns one R object x, which is the data vector for this problem.
- Use these starting values to find the MLE of both parameters (location and scale) for these data, assuming they are IID logistic distributed.
- Provide Wald-type 95% confidence intervals of both parameters (location and scale) using observed Fisher information to compute standard errors for the point estimates.
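Putting the pieces together, a sketch of the whole calculation (the function and variable names here are my own, not from the course notes):

```r
# sketch only: two-parameter logistic MLE with Wald-type intervals
x <- read.csv(url("http://www.stat.umn.edu/geyer/s17/3701/data/q6p3.csv"))$x
# minus the log likelihood, theta = c(location, scale)
mlogl <- function(theta) {
    stopifnot(is.numeric(theta), length(theta) == 2)
    - sum(dlogis(x, location = theta[1], scale = theta[2], log = TRUE))
}
# starting values: sample median and rescaled IQR, as described above
theta.start <- c(median(x), IQR(x) / (qlogis(0.75) - qlogis(0.25)))
nout <- nlm(mlogl, theta.start, hessian = TRUE)
theta.hat <- nout$estimate
# observed Fisher information is the hessian of minus the log likelihood
se <- sqrt(diag(solve(nout$hessian)))
crit <- qnorm(0.975)  # for 95% confidence
cbind(lower = theta.hat - crit * se, upper = theta.hat + crit * se)
```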
Homework 6
Homework problems start with problem number 4 because, if you don't like your solutions to the problems on the quiz, you are allowed to redo them (better) for homework. If you don't submit anything for problems 1–3, then we assume you liked the answers you already submitted.
Problem 4
The R commands
foo <- read.csv(url("http://www.stat.umn.edu/geyer/s17/3701/data/q6p4.csv"))
gout1 <- glm(y ~ x1 + x2, family = binomial, data = foo)
summary(gout1)
gout2 <- glm(y ~ poly(x1, x2, degree = 2), family = binomial, data = foo)
summary(gout2)
anova(gout1, gout2, test = "Chisq")
do one test of model comparison for a logistic regression, comparing linear and quadratic models.
Not trusting the asymptotics, we want to do a parametric bootstrap calculation of the P-value.
The code
aout <- anova(gout1, gout2, test = "Chisq")
aout[2, "Pr(>Chi)"]
obtains that P-value.
The code
ystar <- simulate(gout1)$sim_1
simulates a new response vector ystar under the null hypothesis. (The R function simulate is a generic function that package authors may provide to simulate models.)
Do at least 10⁴ simulations of this model (the MLE for the null hypothesis). Fit the null (linear) and alternative (quadratic) models to each, obtain the P-value, and store it. Suppose you call the vector in which you store all these bootstrap P-values pstar. Suppose you call the P-value for the actual data phat.
A P-value is a statistic — a function of the data that is not a function of the parameters — so if we think of it as just a test statistic that we don't know the distribution of (which is why we are bootstrapping it), then we use the bootstrap distribution as its sampling distribution. Since we reject the null hypothesis when the P-value is low, the way to calculate the P-value from this simulation is
mean(c(phat, pstar) <= phat)
Note that we include the P-value for the actual data in with all the bootstrap P-values. This is valid because under the null hypothesis it has almost the same distribution. It also prevents one from getting P = 0 by just doing too few simulations.
Ideally, the R functions glm and anova should always calculate the P-value accurately, using care about computer arithmetic (like we do). Unfortunately, they do not. You may get warnings from glm.fit, the R function that glm calls to do the actual model fitting. And you may get NA or NaN for a bootstrap P-value. This would not happen if the R functions glm and anova were coded with more attention to computer arithmetic. But they aren't.
- Ignore the warnings.
- If you get NA or NaN for any bootstrap P-value, replace it with 1.0. This means we do not reject the null hypothesis at any α level. If you cannot compute a P-value, then you have to assume there is no evidence at all against the null hypothesis, so this is the correct thing to do.
Problem 5
This problem is almost the same as the preceding one. We are going to do the same calculation but think of it as a simulation study rather than as a bootstrap.
Using the same vector pstar computed in the preceding problem, check whether the procedure gives valid tests. Under the null hypothesis a test rejects the null hypothesis at level α with probability α.
Using a P-value to conduct the test, the test rejects at level α when P ≤ α. This says, if the test statistic is continuous, that the P-value is uniformly distributed on the interval (0, 1), that is,
Pr(P ≤ α) = α,   0 < α < 1.
We do not want to calculate this for an infinite number of α, and that wouldn't make sense even if we could do it because nboot is finite. Moreover, we are only interested in low values of α.
Thus, here is what the problem requires.
- Estimate Pr(P ≤ α) for α = 0.01, 0.02, …, 0.10.
- Also provide MCSE of these estimates.
So the problem requires you to calculate 20 numbers (10 estimates, 10 MCSE).
Hint: recall that MCSE of zero-or-one-valued random variables can be calculated using the formulas for standard errors of sample proportions from STAT 3011 as discussed in (Section 6.1.5 of the course notes on the bootstrap).
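A sketch of the required calculation, using a placeholder pstar so the code is self-contained (in your solution use the vector computed in the preceding problem):

```r
# sketch only: estimated rejection probabilities with MCSE
pstar <- runif(1e4)  # placeholder: use the pstar from the preceding problem
alpha <- seq(0.01, 0.10, by = 0.01)
# estimate Pr(P <= alpha) for each alpha
phat.alpha <- sapply(alpha, function(a) mean(pstar <= a))
# MCSE from the standard error of a sample proportion
mcse <- sqrt(phat.alpha * (1 - phat.alpha) / length(pstar))
cbind(alpha, estimate = phat.alpha, MCSE = mcse)
```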
Problem 6
In this problem we are going to bootstrap a likelihood-based confidence interval. In particular, we are going to bootstrap the 95% likelihood interval for the scale parameter σ calculated in Section 9.3.3 of the course notes on models, part II, and reported in the summary section following it as (2.86, 6.65).
The theory on which this test is based is in Section 7.1.1 of those notes.
That theory says, in other words than in that section of the notes, that twice the difference of log likelihoods, the first maximizing over both parameters and the second holding the parameter of interest (that's σ in this problem) fixed, has an asymptotic distribution that is chi-square on one degree of freedom when the null hypothesis is true, that is, when the value σ0 at which we fix σ is the true unknown parameter value.
If we don't believe the asymptotics and want to do a parametric bootstrap instead, we should derive a critical value from simulation of the null distribution of the test statistic.
Hence we do the following steps in each iteration of bootstrap simulation.
Use bootstrap sample size nboot = 1e4.
- Simulate data from the MLE distribution. The parameter estimates were reported to be 9.266664 for μ and 4.362070 for σ in Section 9.2 of those notes. The R function rcauchy will simulate such data. The data sample size was never shown in those notes but was n = 45.
- Calculate the maximum of the log likelihood, optimizing over both parameters, and also just maximizing over μ while holding σ fixed at its true unknown parameter value in the bootstrap world, which is the value you are using for the simulation distribution.
- Calculate the bootstrap test statistic (twice the difference in log likelihoods: big model minus little model). Since we are using minus the log likelihood for the convenience of the R function nlm, it will be little model minus big model (because minus times minus is plus). You can check that you got it the right way around because the test statistic has to be positive (the unconstrained maximum has to be larger than the constrained maximum).
- Store each value of the bootstrap test statistic.
The result of this simulation is (the bootstrap estimate of) the sampling distribution of the test statistic under the null hypothesis. The critical value for making a 95% likelihood-based-confidence interval is the 0.95 quantile of this distribution. Find that critical value.
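The steps above might be sketched as follows (the minus-log-likelihood functions are my own adaptations, in the spirit of the one in Section 9.2 of the notes; for speed this sketch uses a small nboot, but the problem asks for 10⁴):

```r
# sketch only: parametric bootstrap of the likelihood ratio test statistic
mu.hat <- 9.266664     # MLEs reported in Section 9.2 of the notes
sigma.hat <- 4.362070
n <- 45                # the data sample size
# minus log likelihood, both parameters free
mlogl2 <- function(theta, x)
    - sum(dcauchy(x, location = theta[1], scale = theta[2], log = TRUE))
# minus log likelihood, sigma fixed at the simulation truth
mlogl1 <- function(mu, x) mlogl2(c(mu, sigma.hat), x)
nboot <- 100  # use nboot = 1e4 for the assignment
tstar <- double(nboot)
for (iboot in 1:nboot) {
    xstar <- rcauchy(n, location = mu.hat, scale = sigma.hat)
    # robust starting values (IQR of a Cauchy is twice its scale)
    start2 <- c(median(xstar), IQR(xstar) / 2)
    nout2 <- suppressWarnings(nlm(mlogl2, start2, x = xstar))
    nout1 <- suppressWarnings(nlm(mlogl1, median(xstar), x = xstar))
    # little model minus big model because these are MINUS log likelihoods
    tstar[iboot] <- 2 * (nout1$minimum - nout2$minimum)
}
quantile(tstar, 0.95)  # the bootstrap critical value
```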
We won't bother actually making the confidence interval. It would be done exactly the same as in Section 9.3.3 of those notes except with crit.chisq defined as we calculate in this problem rather than
> qchisq(0.95, df = 1)
[1] 3.841459
which is derived from asymptotic theory and used in those notes. (But you do not have to actually do that.)
Hint: An R function that calculates minus the log likelihood for the two-parameter Cauchy distribution is found in Section 9.2 of those notes. But this function needs to be modified for this problem.
- In maximizing over μ and σ, one needs to
  - make the data an argument because the simulated data change in each bootstrap iteration,
  - redefine the function each time through the loop with the data name inside the function the same as outside,
  - or define the function only once with the data name inside the function the same as outside.
- In maximizing over μ holding σ fixed, one needs to make a version of mlogl that has only arguments
  - mu and x if you are making the data an argument,
  - or just mu if you are not making the data an argument.