Statistics 3701 (Geyer, Spring 2017) Homework 3

Rules

See the Section about Rules for Quizzes and Homeworks on the General Info page.

Your work handed into Moodle should be a plain text file with R commands and comments that can be run to produce what you did. We do not take your word for what the output is. We run it ourselves.

Note: Plain text specifically excludes Microsoft Word native format (extension .docx). If you have to Word as your text editor, then save as and choose the format to be Text (.txt) or something like that. Then upload the saved plain text file.

Note: Plain text specifically excludes PDF (Adobe Portable Document Format) (extension .pdf). If you use Sweave, knitr, or Rmarkdown, upload the source (extension .Rnw or .Rmd) not PDF or any other kind of output.

If you have questions about the quiz, ask them in the Moodle forum for this quiz. Here is the link for that https://ay16.moodle.umn.edu/mod/forum/view.php?id=1279315.

You must be in the classroom, Armory 202, while taking the quiz.

Quizzes must uploaded by the end of class (1:10). Moodle actually allows a few minutes after that. Here is the link for uploading the quiz https://ay16.moodle.umn.edu/mod/assign/view.php?id=1279330.

Homeworks must uploaded before midnight the day they are due. Here is the link for uploading the homework. https://ay16.moodle.umn.edu/mod/assign/view.php?id=1279355.

Quiz 3

Problem 1

Write an R function that, like the example in Section 8 of the course notes about computer arithmetic except that we want it to be for the geometric distribution rather than the binomial distribution.

The geometric distribution has parameter θ and data x and log likelihood

l(θ) = x θ + log(1 − exp(θ))

where, as always, exp denotes the exponential function and log the natural logarithmic function (as in R). The data x is nonnegative-integer-valued (0, 1, 2, …). The parameter θ is negative-real-valued (− ∞ < θ < 0).

Your function should have signature

    logl(theta, x)

and return a list having components

value, the value of the function,
gradient, the first derivative of the function, and
hessian, the second derivative of the function.

Like the example in the course notes, the point is to be careful about computer arithmetic, avoiding overflow and catastrophic cancellation as much as possible.

For this problem, you do not have to check for invalid argument values.

Show your function working for parameter values


thetas <- (- 10^seq(3, - 3))

and for data values both


x <- 0

and


x <- 5

Problem 2

Test your function for the preceding problem like the example in Section 8 of the course notes about computer arithmetic

To test with the function value compare with the values of the function

logl.too <- function(theta) dgeom(x, 1 - exp(theta), log = TRUE)

and to test the derivatives use numerical derivatives (you can also apply the other method using derivatives calculated by the R function D if you like, but that does not count; we are only going to grade the comparison to numerical derivatives).

Note that numerical differentiation does not give perfectly accurate derivatives. So there may be small discrepancies between the two methods of calculation.

Problem 3

(This problem is about finding errors in data, but that is too hard for a quiz, so we are just going to test some skills that are usefull for that.)

We will use the data


foo <- read.csv("http://www.stat.umn.edu/geyer/s17/3701/data/q3p3.csv", stringsAsFactors = FALSE)

(This reads in a data frame having variables w, x, y, and z). The categorical variables are w and z.

In this problem we consider the answer better if it is done without using a loop.

How many distinct values does w have?
If you wanted to fit a linear model that regresses y on w, x, and z with z being treated as categorical (like w), what would you have have to do to make that work right?
Find the largest value of y for each value of w.
Find the second largest value of y for each value of w.

Homework 3

Homework problems start with problem number 4 because, if you don't like your solutions to the problems on the quiz, you are allowed to redo them (better) for homework. If you don't submit anything for problems 1–3, then we assume you liked the answers you already submitted.

Problem 4

This problem was revised about 2:00 am Thursday.

This is a modification of practice problem 3.1. We are going to do it getting the correct answers for large negative or large positive values of the parameter.

This problem is so complicated, that it cannot be explained with the kludgy math that can be done in HTML, so I made a supplementary notes in Rmarkdown (which uses Javascript to render math): Notes on the Zero-Truncated Poisson Distribution. But you only need to look at them if you want more details. We will copy a lot of the math here.

m = exp(θ)
μ = m ⁄ (1 − exp(− m))
l(θ) = x θ − m − log(1 − exp(− m))
l'(θ) = x − μ
l''(θ) = − μ (1 − μ + m) = − μ (1 − μ exp(− m))

These are the same formulas as in practice problem 3.1 except for the last. We now have two formulas (the one in the practice problem and a new one).

Unless you were extremely clever, your solutions to practice problem 3.1 did not work for large negative θ or large positive θ. My solution did not work for these cases either.

As discussed in the supplementary notes linked above, there are three cases to look out for

large negative θ, and in this case x = 1 causes special problems,
large positive θ, and in this case when x ⋅ θ overflows causes a special problem, and
moderate sized θ, positive or negative.

For the function itself, for large negative θ use

l(θ) = (x - 1) θ − m − log(1 − m ⁄ 2 + m² ⁄ 3! − m³ ⁄ 4! + … + (−1)^{k − 1} m^{k − 1} ⁄ k! + …)

For all other θ except for very large positive θ the formula

l(θ) = x θ − m − log(1 − exp(− m))

can be used.

For θ so large that x ⋅ θ overflows (is Inf) the correct value for l(θ) is -Inf but neither of the formulas above give that, so this has to be done as a special case.

The mean μ is not part of what your function is supposed to return, but is useful in intermediate calculations. For all but large negative θ use the formula

μ = m ⁄ (1 − exp(− m))

but for very large negative θ use

μ = 1 ⁄ a

where

a = 1 − m ⁄ 2 + m² ⁄ 3! − m³ ⁄ 4! + … + (−1)^{k − 1} m^{k − 1} ⁄ k! + …

(the reason for this is that the other formula could give NaN or Inf depending on how it is calculated, but this formula works).

For the first derivative, the hard case is when θ is large negative and x = 1. In that case use the formula

l'(θ) = m b ⁄ a

where

b = − 1 ⁄ 2 + m ⁄ 3! − m² ⁄ 4! + … + (−1)^{k − 1} m^{k − 2} ⁄ k! + …

and a is the same infinite series as defined above.

In all other cases the formula

l'(θ) = x − μ

gives correct results (when μ is calculated correctly, as described above).

For the second derivative, no formula works for all θ but

l''(θ) = − μ (1 − μ + m)

works except for large positive θ (where it gives catastrophic cancellation because m and μ are nearly equal and very large), and

l''(θ) = = − μ (1 − μ exp(− m))

works except for large negative θ (where it gives catastrophic cancellation because μ exp(− m) is nearly equal to 1) and except for when μ overflows (is Inf) in which case the formula gives NaN but the correct answer is -Inf.

With those formulas and the usual care about using log1p and expm1, the log likelihood and two derivatives can be coded correctly for all θ, including infinite values.

The correct values for θ = ∞ are

l(θ) = − ∞
l'(θ) = − ∞
l''(θ) = − ∞

The correct values for θ = − ∞ and x = 1 are

l(θ) = 0
l'(θ) = 0
l''(θ) = 0

The correct values for θ = − ∞ and x > 1 are

l(θ) = − ∞
l'(θ) = x - 1
l''(θ) = 0

So the problem is to redo practice problem 3.1 getting good answers for all θ, including even infinite values (computer infinite).

Apply your function to the following values of θ


thetas <- c(1, 3, 10, 30, 100, 300, 1000, .Machine$double.xmax, Inf)
thetas <- c(- rev(thetas), 0, thetas)

and for x = 1 and for x = 5.

It is your decision how many terms of each infinite series to use and for what values of θ to use the formulas with infinite series.

These are rapidly converging series and converge for all values of m, but the smaller m is, particularly when m is much less than 1, the faster the series converges. When m is much less than one, the index k does not have to be very large before the k-th term is less than the machine epsilon.

Problem 5

This is a modification of practice problem 3.2. In order to make the solution not exactly the same as the practice problem solution, this time we are requiring that you use the formulas calculated by differentiating

l(θ) = x θ − exp(θ) − log(1 − exp(− exp(θ))))

You can either do the derivatives yourself or let R do it. Compare the function you wrote for the preceding problem to evaluating these expressions in an obvious manner (like in Section 8 of the course notes about computer arithmetic). You can also apply the other method using numerical derivatives calculated by the R package numDeriv if you like, but that does not count for credit.

Compare for the values of θ and x required in the preceding problem.

Comment on any discrepancies between your function and evaluating the formula in the obvious way.

Problem 6

This problem is about finding stuff in data frames. This problem does not require you to write a function. These data are about CRAN packages. Each row of the data frame is one package.

The R command


load(url("http://www.stat.umn.edu/geyer/s17/3701/data/cran.rda"))

loads one R object: a data frame called bar.

What are the names of the variables in this data frame?
What are the values of the variable NeedsCompllation? According to the authoritative reference (Writing R Extensions, Section 1.1.1) this field should be either "yes" or "no", the former indicating that the package contains C, C++, Fortran, Objective-C, or Objective-C++ code that is compiled and called by R functions in the package. What proportion of packages have "yes" for this field? What proportion of packages have something other than "yes", "no", or NA for this field?
What are the values of the variable ByteCompile? According to the authoritative reference (Writing R Extensions, Section 1.1.1, linked above) this field should be any of "yes", "no", "true", or "false", possibly capitalized, either "yes" or "true" indicating that the R functions in the package should be compiled (which makes them run about twice as fast). What proportion of packages have "yes" or "true" with any of the letters capitalized or not for this field?
What are the names of the packages that have the word fuzzy, capitalized or not, in either the Title variable or the Description variable. The name of the package is the Package variable.

Hint: The R function grepl may help.