Rules
See the Section about Rules for Quizzes and Homeworks on the General Info page.
Your work handed in to Moodle should be a plain text file with R commands and comments that can be run to reproduce what you did. We do not take your word for what the output is. We run it ourselves.
Note: Plain text specifically excludes Microsoft Word native format (extension .docx). If you have to use Word as your text editor, then save as and choose the format to be Plain Text (.txt) or something like that. Then upload the saved plain text file.
Note: Plain text specifically excludes PDF (Adobe Portable Document Format) (extension .pdf). If you use Sweave, knitr, or Rmarkdown, upload the source (extension .Rnw or .Rmd), not PDF or any other kind of output.
If you have questions about the quiz, ask them in the Moodle forum for this quiz. Here is the link for that https://ay16.moodle.umn.edu/mod/forum/view.php?id=1368653.
You must be in the classroom, Armory 202, while taking the quiz.
Quizzes must be uploaded by the end of class (1:10). Moodle actually allows a few minutes after that. Here is the link for uploading the quiz https://ay16.moodle.umn.edu/mod/assign/view.php?id=1368662.
Homeworks must be uploaded before midnight the day they are due. Here is the link for uploading the homework https://ay16.moodle.umn.edu/mod/assign/view.php?id=1368665.
Quiz 6
Problem 1
This is a simulation study (course notes about simulation).
We are going to study almost but not exactly the same model as in Section 5 of those course notes. The model for this problem is normal with mean θ and variance θ⁴ (not θ² as in the example in the notes).
The estimators we want to compare are the sample mean (calculated by mean(x) if the data are x) and the signed square root of the sample sd (calculated by sign(mean(x)) * sqrt(sd(x)) if the data are x). Just those two estimators.
Use sample size n = 15.
Use simulation truth parameter value θ = e (that is, theta <- exp(1)).
Use at least 10⁴ simulations.
As in the course notes, make a table comparing the MSE of these estimators (with MCSE as appropriate). You do not have to format these nicely in a table using Rmarkdown. As long as you print each number (one estimate of MSE and one estimate of MCSE of MSE for each estimator, which is four numbers) with a clear indication (maybe a comment, maybe a row or column label in a matrix) of what it is, that's enough.
The R code that makes the tables at the ends of Section 5.5 of those course notes and Section 5.6 of those course notes is hidden (not in the HTML) but can be seen in the Rmarkdown source of those course notes.
Be sure to use the principle of common random numbers.
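Putting the instructions above together, here is a minimal sketch of what such a simulation study might look like (variable names are my own, not from the course notes; check your own work against the course notes):

```r
# sketch only: simulation study comparing the two estimators under the
# model Normal(theta, theta^4), with the principle of common random
# numbers (both estimators are computed from the same simulated x)
nsim <- 1e4
n <- 15
theta <- exp(1)
mhat <- double(nsim)
shat <- double(nsim)
for (isim in 1:nsim) {
    x <- rnorm(n, mean = theta, sd = theta^2)
    mhat[isim] <- mean(x)
    shat[isim] <- sign(mean(x)) * sqrt(sd(x))
}
# MSE of each estimator and MCSE of each MSE estimate
mse <- c(mean = mean((mhat - theta)^2),
    sqrt.sd = mean((shat - theta)^2))
mcse <- c(mean = sd((mhat - theta)^2),
    sqrt.sd = sd((shat - theta)^2)) / sqrt(nsim)
rbind(mse, mcse)
```

Using one simulated x for both estimators (rather than simulating separately for each) is what the principle of common random numbers means here: the comparison between the estimators is then not contaminated by independent simulation noise.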
Problem 2
The R command
x <- read.csv(url("http://www.stat.umn.edu/geyer/5102/data/prob7-1.txt"))$x
assigns one R object x, which is the data vector for this problem.
These same data were also used in Section 5 of the course notes on models, Part II, and Section 9 of those course notes, but unlike in those notes we are going to investigate different estimators.
The estimators of location and scale we are going to use are the 25% trimmed mean (calculated by mean(x, trim = 0.25) if the data are x) and the interquartile range (calculated by IQR(x) if the data are x), respectively. Both are highly robust, tolerating up to 25% gross outliers or errors in the data.
Calculate a 95% bootstrap t confidence interval using these data and these estimators (the trimmed mean is the estimator of the parameter of interest and the IQR is the sdfun).
Use the method of the first example in Section 6.1.4 of the course notes on the bootstrap.
Do not use the method of the second example in that section, where the sdfun argument is omitted.
You could also use the method of Section 6.1.3 of those notes, but why? It is a lot more code to fuss with.
Use at least 10⁴ bootstrap samples.
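One way this might look, a sketch assuming the R function boott from the CRAN package bootstrap (the function used in Section 6.1.4 of the course notes; check help("boott") for the exact argument conventions, which I may not have exactly right):

```r
# sketch only: bootstrap t interval via boott from CRAN package "bootstrap"
library(bootstrap)
x <- read.csv(url("http://www.stat.umn.edu/geyer/5102/data/prob7-1.txt"))$x
# estimator of the parameter of interest: 25% trimmed mean
theta.hat <- function(x) mean(x, trim = 0.25)
# the sdfun: IQR (extra arguments required by the boott interface)
sd.hat <- function(x, nbootsd, theta, ...) IQR(x)
bout <- boott(x, theta.hat, sdfun = sd.hat, nboott = 1e4,
    perc = c(0.025, 0.975))
bout$confpoints  # the 95% bootstrap t confidence interval
```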
Problem 3
This problem is about two-parameter maximum likelihood, which was covered in Sections 6–9 of the course notes on models, Part II.
In particular, it is about Wald-type confidence intervals, exemplified in Section 9.3.1 of those course notes and Section 9.3.2 of those course notes.
However, in order not to use a distribution already used in the course notes, we are going to use the logistic distribution (which gives logistic regression its name).
The R function dlogis calculates the PDF of this distribution and works like all other functions for distributions R knows about. The help page help("dlogis") gives a mathematical equation for the PDF, but you shouldn't need it for this problem (you can use it if you think you need to).
This is a symmetric distribution. The population median is the center of symmetry, which is the location parameter. The interquartile range (IQR) of the standard logistic distribution is
> qlogis(0.75) - qlogis(0.25)
[1] 2.197225
times the scale parameter (the number above is for the default value of the scale parameter, which is one). We can see this from
> foo <- rexp(5)
> foo
[1] 1.3009717 0.1212703 0.2030711 0.6550454 0.9039590
> (qlogis(0.75, scale = foo) - qlogis(0.25, scale = foo)) / foo
[1] 2.197225 2.197225 2.197225 2.197225 2.197225
Hence the sample median is a good estimator of location and the IQR divided by
qlogis(0.75) - qlogis(0.25)
is a good estimator of scale.
The R command
x <- read.csv(url("http://www.stat.umn.edu/geyer/s17/3701/data/q6p3.csv"))$x
assigns one R object x, which is the data vector for this problem.
- Use these starting values to find the MLE of both parameters (location and scale) for these data, assuming they are IID logistic distributed.
- Provide Wald-type 95% confidence intervals of both parameters (location and scale) using observed Fisher information to compute standard errors for the point estimates.
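Putting the pieces together, a sketch of the whole calculation (the function and variable names here are my own, not from the course notes):

```r
# sketch only: two-parameter logistic MLE with Wald-type intervals
x <- read.csv(url("http://www.stat.umn.edu/geyer/s17/3701/data/q6p3.csv"))$x
# minus the log likelihood, theta = c(location, scale)
mlogl <- function(theta) {
    stopifnot(is.numeric(theta), length(theta) == 2)
    - sum(dlogis(x, location = theta[1], scale = theta[2], log = TRUE))
}
# starting values: sample median and rescaled IQR, as described above
theta.start <- c(median(x), IQR(x) / (qlogis(0.75) - qlogis(0.25)))
nout <- nlm(mlogl, theta.start, hessian = TRUE)
theta.hat <- nout$estimate
# observed Fisher information is the hessian of minus the log likelihood
se <- sqrt(diag(solve(nout$hessian)))
crit <- qnorm(0.975)  # for 95% confidence
cbind(lower = theta.hat - crit * se, upper = theta.hat + crit * se)
```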
Homework 6
Homework problems start with problem number 4 because, if you don't like your solutions to the problems on the quiz, you are allowed to redo them (better) for homework. If you don't submit anything for problems 1–3, then we assume you liked the answers you already submitted.
Problem 4
The R commands
foo <- read.csv(url("http://www.stat.umn.edu/geyer/s17/3701/data/q6p4.csv"))
gout1 <- glm(y ~ x1 + x2, family = binomial, data = foo)
summary(gout1)
gout2 <- glm(y ~ poly(x1, x2, degree = 2), family = binomial, data = foo)
summary(gout2)
anova(gout1, gout2, test = "Chisq")
do one test of model comparison for a logistic regression, comparing linear and quadratic models.
Not trusting the asymptotics, we want to do a parametric bootstrap calculation of the P-value.
The code
aout <- anova(gout1, gout2, test = "Chisq")
aout[2, "Pr(>Chi)"]
obtains that P-value.
The code
ystar <- simulate(gout1)$sim_1
simulates a new response vector ystar under the null hypothesis. (The R function simulate is a generic function that package authors may provide to simulate models.)
Do at least 10⁴ simulations of this model (the MLE for the null hypothesis). Fit the null (linear) and alternative (quadratic) models to each, obtain the P-value, and store it. Suppose you call the vector in which you store all these bootstrap P-values pstar. Suppose you call the P-value for the actual data phat.
A P-value is a statistic — a function of the data that is not a function of the parameters — so if we think of it as just a test statistic that we don't know the distribution of (which is why we are bootstrapping it), then we use the bootstrap distribution as its sampling distribution. Since we reject the null hypothesis when the P-value is low, the way to calculate the P-value from this simulation is
mean(c(phat, pstar) <= phat)
Note that we include the P-value for the actual data in with all the bootstrap P-values. This is valid because under the null hypothesis it has almost the same distribution. It also prevents one from getting P = 0 by just doing too few simulations.
Ideally, the R functions glm and anova should always calculate the P-value accurately, using care about computer arithmetic (like we do). Unfortunately, they do not. You may get warnings from glm.fit, the R function that glm calls to do the actual model fitting. And you may get NA or NaN for a bootstrap P-value. This would not happen if the R functions glm and anova were coded with more attention to computer arithmetic. But they aren't.
- Ignore the warnings.
- If you get NA or NaN for any bootstrap P-value, replace it with 1.0. This means we do not reject the null hypothesis at any α level. If you cannot compute a P-value, then you have to assume there is no evidence at all against the null hypothesis, so this is the correct thing to do.
Problem 5
This problem is almost the same as the preceding one. We are going to do the same calculation but think of it as a simulation study rather than as a bootstrap.
Using the same vector pstar computed in the preceding problem, check whether the procedure gives valid tests. Under the null hypothesis a test rejects the null hypothesis at level α with probability α.
Using a P-value to conduct the test, the test rejects at level α when P ≤ α. This says, if the test statistic is continuous, that the P-value is uniformly distributed on the interval (0, 1), that is,
Pr(P ≤ α) = α,   0 < α < 1.
We do not want to calculate this for an infinite number of α, and that wouldn't make sense even if we could do it because nboot is finite. Moreover, we are only interested in low values of α.
Thus, here is what the problem requires.
- Estimate Pr(P ≤ α) for α = 0.01, 0.02, …, 0.10.
- Also provide MCSE of these estimates.
So the problem requires you to calculate 20 numbers (10 estimates, 10 MCSE).
Hint: recall that MCSE of zero-or-one-valued random variables can be calculated using the formulas for standard errors of sample proportions from STAT 3011 as discussed in (Section 6.1.5 of the course notes on the bootstrap).
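A sketch of the required calculation, using a placeholder pstar so the code is self-contained (in your solution use the vector computed in the preceding problem):

```r
# sketch only: estimated rejection probabilities with MCSE
pstar <- runif(1e4)  # placeholder: use the pstar from the preceding problem
alpha <- seq(0.01, 0.10, by = 0.01)
# estimate Pr(P <= alpha) for each alpha
phat.alpha <- sapply(alpha, function(a) mean(pstar <= a))
# MCSE from the standard error of a sample proportion
mcse <- sqrt(phat.alpha * (1 - phat.alpha) / length(pstar))
cbind(alpha, estimate = phat.alpha, MCSE = mcse)
```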
Problem 6
In this problem we are going to bootstrap a likelihood-based confidence interval. In particular, we are going to bootstrap the 95% likelihood interval for the scale parameter σ calculated in Section 9.3.3 of the course notes on models, part II, and reported in the summary section following it as (2.86, 6.65).
The theory on which this test is based is in Section 7.1.1 of those notes.
That theory says, in other words than in that section of the notes, that twice the difference of log likelihoods, the first maximizing over both parameters and the second holding the parameter of interest (that's σ in this problem) fixed, has an asymptotic distribution that is chi-square on one degree of freedom when the null hypothesis is true, that is, when the value σ0 at which we fix σ is the true unknown parameter value.
If we don't believe the asymptotics and want to do a parametric bootstrap instead, we should derive a critical value from simulation of the null distribution of the test statistic.
Hence we do the following steps in each iteration of bootstrap simulation.
Use bootstrap sample size nboot = 1e4.
- Simulate data from the MLE distribution. The parameter estimates were reported to be 9.266664 for μ and 4.362070 for σ in Section 9.2 of those notes. The R function rcauchy will simulate such data. The data sample size was never shown in those notes but was n = 45.
- Calculate the maximum of the log likelihood, optimizing over both parameters, and also just maximizing over μ while holding σ fixed at its true unknown parameter value in the bootstrap world, which is the value you are using for the simulation distribution.
- Calculate the bootstrap test statistic (twice the difference in log likelihoods: big model minus little model). Since we are using minus the log likelihood for the convenience of the R function nlm, it will be little model minus big model (because minus times minus is plus). You can check that you got it the right way around because the test statistic has to be positive (the unconstrained maximum has to be larger than the constrained maximum).
- Store each value of the bootstrap test statistic.
The result of this simulation is (the bootstrap estimate of) the sampling distribution of the test statistic under the null hypothesis. The critical value for making a 95% likelihood-based-confidence interval is the 0.95 quantile of this distribution. Find that critical value.
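The steps above might be sketched as follows (the minus-log-likelihood functions are my own adaptations, in the spirit of the one in Section 9.2 of the notes; for speed this sketch uses a small nboot, but the problem asks for 10⁴):

```r
# sketch only: parametric bootstrap of the likelihood ratio test statistic
mu.hat <- 9.266664     # MLEs reported in Section 9.2 of the notes
sigma.hat <- 4.362070
n <- 45                # the data sample size
# minus log likelihood, both parameters free
mlogl2 <- function(theta, x)
    - sum(dcauchy(x, location = theta[1], scale = theta[2], log = TRUE))
# minus log likelihood, sigma fixed at the simulation truth
mlogl1 <- function(mu, x) mlogl2(c(mu, sigma.hat), x)
nboot <- 100  # use nboot = 1e4 for the assignment
tstar <- double(nboot)
for (iboot in 1:nboot) {
    xstar <- rcauchy(n, location = mu.hat, scale = sigma.hat)
    # robust starting values (IQR of a Cauchy is twice its scale)
    start2 <- c(median(xstar), IQR(xstar) / 2)
    nout2 <- suppressWarnings(nlm(mlogl2, start2, x = xstar))
    nout1 <- suppressWarnings(nlm(mlogl1, median(xstar), x = xstar))
    # little model minus big model because these are MINUS log likelihoods
    tstar[iboot] <- 2 * (nout1$minimum - nout2$minimum)
}
quantile(tstar, 0.95)  # the bootstrap critical value
```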
We won't bother actually making the confidence interval. It would be done exactly the same as in Section 9.3.3 of those notes except with crit.chisq defined as we calculate in this problem rather than
> qchisq(0.95, df = 1)
[1] 3.841459
which is derived from asymptotic theory and used in those notes. (But you do not have to actually do that.)
Hint: An R function that calculates minus the log likelihood for the two-parameter Cauchy distribution is found in Section 9.2 of those notes. But this function needs to be modified for this problem.
- In maximizing over μ and σ, one needs to
  - make the data an argument because the simulated data change in each bootstrap iteration,
  - redefine the function each time through the loop with the data name inside the function the same as outside,
  - or define the function only once with the data name inside the function the same as outside.
- In maximizing over μ holding σ fixed, one needs to make a version of mlogl that has only arguments
  - mu and x if you are making the data an argument,
  - or just mu if you are not making the data an argument.