Rules
See the Section about Rules for Quizzes and Homeworks on the General Info page.
Your work handed into Moodle should be a plain text file with R commands and comments that can be run to produce what you did. We do not take your word for what the output is. We run it ourselves.
Note: Plain text
specifically excludes
Microsoft Word native format (extension .docx
). If you have to
Word as your text editor, then save as and choose the format to be
Text (.txt)
or something like that. Then upload the
saved plain text file.
Note: Plain text
specifically excludes
PDF (Adobe Portable Document Format) (extension .pdf
). If you
use Sweave, knitr, or Rmarkdown, upload the source (extension .Rnw
or .Rmd
) not PDF or any other kind of output.
If you have questions about the quiz, ask them in the Moodle forum for this quiz. Here is the link for that https://ay16.moodle.umn.edu/mod/forum/view.php?id=1279315.
You must be in the classroom, Armory 202, while taking the quiz.
Quizzes must uploaded by the end of class (1:10). Moodle actually allows a few minutes after that. Here is the link for uploading the quiz https://ay16.moodle.umn.edu/mod/assign/view.php?id=1279330.
Homeworks must uploaded before midnight the day they are due. Here is the link for uploading the homework. https://ay16.moodle.umn.edu/mod/assign/view.php?id=1279355.
Quiz 3
Problem 1
Write an R function that, like the example in Section 8 of the course notes about computer arithmetic except that we want it to be for the geometric distribution rather than the binomial distribution.
The geometric distribution has parameter θ and data x and log likelihood
where, as always, exp denotes the exponential function and log the natural logarithmic function (as in R). The data x is nonnegative-integer-valued (0, 1, 2, …). The parameter θ is negative-real-valued (− ∞ < θ < 0).
Your function should have signature
logl(theta, x)and return a list having components
-
value
, the value of the function, -
gradient
, the first derivative of the function, and -
hessian
, the second derivative of the function.
Like the example in the course notes, the point is to be careful about computer arithmetic, avoiding overflow and catastrophic cancellation as much as possible.
For this problem, you do not have to check for invalid argument values.
Show your function working for parameter values
thetas <- (- 10^seq(3, - 3))and for data values both
x <- 0and
x <- 5
Problem 2
Test your function for the preceding problem like the example in Section 8 of the course notes about computer arithmetic
To test with the function value compare with the values of the function
logl.too <- function(theta) dgeom(x, 1 - exp(theta), log = TRUE)and to test the derivatives use numerical derivatives (you can also apply the other method using derivatives calculated by the R function
D
if you like, but that does not count;
we are only going to grade the comparison to numerical derivatives).
Note that numerical differentiation does not give perfectly accurate derivatives. So there may be small discrepancies between the two methods of calculation.
Problem 3
(This problem is about finding errors in data, but that is too hard for a quiz, so we are just going to test some skills that are usefull for that.)
We will use the data
foo <- read.csv("http://www.stat.umn.edu/geyer/s17/3701/data/q3p3.csv", stringsAsFactors = FALSE)(This reads in a data frame having variables
w
, x
,
y
, and z
). The categorical variables
are w
and z
.
In this problem we consider the answer better if it is done without using a loop.
- How many distinct values does
w
have? - If you wanted to fit a linear model that regresses
y
onw
,x
, andz
withz
being treated as categorical (likew
), what would you have have to do to make that work right? - Find the largest value of
y
for each value ofw
. - Find the second largest value of
y
for each value ofw
.
Homework 3
Homework problems start with problem number 4 because, if you don't like your solutions to the problems on the quiz, you are allowed to redo them (better) for homework. If you don't submit anything for problems 1–3, then we assume you liked the answers you already submitted.
Problem 4
This problem was revised about 2:00 am Thursday.
This is a modification of practice problem 3.1. We are going to do it getting the correct answers for large negative or large positive values of the parameter.
This problem is so complicated, that it cannot be explained with the kludgy math that can be done in HTML, so I made a supplementary notes in Rmarkdown (which uses Javascript to render math): Notes on the Zero-Truncated Poisson Distribution. But you only need to look at them if you want more details. We will copy a lot of the math here.
μ = m ⁄ (1 − exp(− m))
l(θ) = x θ − m − log(1 − exp(− m))
l'(θ) = x − μ
l''(θ) = − μ (1 − μ + m) = − μ (1 − μ exp(− m))
These are the same formulas as in practice problem 3.1 except for the last. We now have two formulas (the one in the practice problem and a new one).
Unless you were extremely clever, your solutions to practice problem 3.1 did not work for large negative θ or large positive θ. My solution did not work for these cases either.
As discussed in the supplementary notes linked above, there are three cases to look out for
- large negative θ, and in this case x = 1 causes special problems,
- large positive θ, and in this case when x ⋅ θ overflows causes a special problem, and
- moderate sized θ, positive or negative.
For the function itself, for large negative θ use
For all other θ except for very large positive θ the formula
can be used.
For θ so large that x ⋅ θ overflows
(is Inf
)
the correct value for l(θ) is -Inf
but neither
of the formulas above give that, so this has to be done as a special case.
The mean μ is not part of what your function is supposed to return, but is useful in intermediate calculations. For all but large negative θ use the formula
but for very large negative θ use
where
(the reason for this is that the other formula could give NaN
or Inf
depending on how it is calculated, but this formula
works).
For the first derivative, the hard case is when θ is large negative and x = 1. In that case use the formula
where
and a is the same infinite series as defined above.
In all other cases the formula
For the second derivative, no formula works for all θ but
works except for large positive θ (where it gives catastrophic cancellation because m and μ are nearly equal and very large), and
works except for large negative θ (where it gives catastrophic
cancellation because μ exp(− m) is nearly equal to 1)
and except for when μ overflows (is Inf
) in which case the
formula gives NaN
but the correct answer is -Inf
.
With those formulas and the usual care about using log1p
and expm1
, the log likelihood and two derivatives can
be coded correctly for all θ, including infinite values.
The correct values for θ = ∞ are
l'(θ) = − ∞
l''(θ) = − ∞
The correct values for θ = − ∞ and x = 1 are
l'(θ) = 0
l''(θ) = 0
The correct values for θ = − ∞ and x > 1 are
l'(θ) = x - 1
l''(θ) = 0
So the problem is to redo practice problem 3.1 getting good answers for all θ, including even infinite values (computer infinite).
Apply your function to the following values of θ
thetas <- c(1, 3, 10, 30, 100, 300, 1000, .Machine$double.xmax, Inf) thetas <- c(- rev(thetas), 0, thetas)and for x = 1 and for x = 5.
It is your decision how many terms of each infinite series to use and for what values of θ to use the formulas with infinite series.
These are rapidly converging series and converge for all values of m, but the smaller m is, particularly when m is much less than 1, the faster the series converges. When m is much less than one, the index k does not have to be very large before the k-th term is less than the machine epsilon.
Problem 5
This is a modification of practice problem 3.2. In order to make the solution not exactly the same as the practice problem solution, this time we are requiring that you use the formulas calculated by differentiating
You can either do the derivatives yourself or let R do it.
Compare the function you wrote for the preceding problem to evaluating these
expressions in an obvious manner (like in
Section 8 of the course notes about computer arithmetic).
You can also apply the other method using numerical derivatives calculated
by the R package numDeriv
if you like, but that does not count
for credit.
Compare for the values of θ and x required in the preceding problem.
Comment on any discrepancies between your function and evaluating the formula in the obvious way.
Problem 6
This problem is about finding stuff in data frames. This problem does not require you to write a function. These data are about CRAN packages. Each row of the data frame is one package.
The R command
load(url("http://www.stat.umn.edu/geyer/s17/3701/data/cran.rda"))loads one R object: a data frame called
bar
.
- What are the names of the variables in this data frame?
- What are the values of the variable
NeedsCompllation
? According to the authoritative reference (Writing R Extensions, Section 1.1.1) this field should be either"yes"
or"no"
, the former indicating that the package contains C, C++, Fortran, Objective-C, or Objective-C++ code that is compiled and called by R functions in the package. What proportion of packages have"yes"
for this field? What proportion of packages have something other than"yes"
,"no"
, orNA
for this field? - What are the values of the variable
ByteCompile
? According to the authoritative reference (Writing R Extensions, Section 1.1.1, linked above) this field should be any of"yes"
,"no"
,"true"
, or"false"
, possibly capitalized, either"yes"
or"true"
indicating that the R functions in the package should be compiled (which makes them run about twice as fast). What proportion of packages have"yes"
or"true"
with any of the letters capitalized or not for this field? - What are the names of the packages that have the word
fuzzy
, capitalized or not, in either theTitle
variable or theDescription
variable. The name of the package is thePackage
variable.
Hint: The R function grepl
may help.