Rules

See the Section about Rules for Quizzes and Homeworks on the General Info page.

Your work handed in to Moodle should be a plain text file with R commands and comments that can be run to reproduce what you did. We do not take your word for what the output is. We run it ourselves.

Note: Plain text specifically excludes Microsoft Word native format (extension .docx). If you have to use Word as your text editor, then do Save As and choose the format to be Text (.txt) or something like that. Then upload the saved plain text file.

Note: Plain text specifically excludes PDF (Adobe Portable Document Format) (extension .pdf). If you use Sweave, knitr, or Rmarkdown, upload the source (extension .Rnw or .Rmd) not PDF or any other kind of output.

If you have questions about the quiz, ask them in the Moodle forum for this quiz. Here is the link for that: https://ay16.moodle.umn.edu/mod/forum/view.php?id=1310928.

You must be in the classroom, Armory 202, while taking the quiz.

Quizzes must be uploaded by the end of class (1:10). Moodle actually allows a few minutes after that. Here is the link for uploading the quiz: https://ay16.moodle.umn.edu/mod/assign/view.php?id=1310947.

Homeworks must be uploaded before midnight on the day they are due. Here is the link for uploading the homework: https://ay16.moodle.umn.edu/mod/assign/view.php?id=1310954.

Quiz 4

Problem 1

Scrape the data from the table(s) in the following web page

following the example in Section 4 of the course notes about data

And answer the following questions. Your answers must be computed entirely using R operating on that web page. Simply reading the answers yourself gets no credit. You have to tell R how to get the answers. Print your answers so the grader can see them.

Note that the R function readHTMLTable in the CRAN package XML that was used for that example reads in all items in the table as character strings. You will have to convert them to numbers if you want to use them as numbers. The R function as.numeric will convert character strings to numbers if they are numbers.
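For example, the general pattern looks something like this (the URL below is a placeholder for the web page linked above, and the column name GP is a made-up example, not necessarily a column in the actual table):

library(XML)
u <- "URL of the quiz web page goes here"   # placeholder
tabs <- readHTMLTable(u, stringsAsFactors = FALSE)
tab <- tabs[[1]]   # the first table in the page
# every column comes in as character; convert the numeric ones
tab$GP <- as.numeric(tab$GP)   # hypothetical games-played column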

  1. Read the data in this web page, converting numeric columns in the tables to type "numeric".
  2. In the conference a win counts 3 points, a loss zero points, and a tie counts either 1 or 2 points depending on the result of a shootout. Verify that the points are calculated correctly (SOW stands for shootout wins; a sketch of this check follows the list).
  3. Outside the conference, shootout wins don't count. A win counts 2 points and a tie one point. What would the points be if they were counted for overall records? Associate team names with the numbers you calculate so we can see which is which.
  4. Since the teams did not play the same number of non-conference games, adjust the numbers calculated in part 3 by dividing by games played.
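Here is a hedged sketch of the check in part 2, assuming tab is the conference table with numeric columns and assuming (an assumption about the table, not something stated here) that the wins, ties, shootout wins, and points columns are named W, T, SOW, and Points:

# each win is worth 3, each tie 1, plus 1 more per shootout win
points <- with(tab, 3 * W + T + SOW)
all(points == tab$Points)   # TRUE if the points check out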

Problem 2

This problem uses the data read in by


foo <- read.csv("http://www.stat.umn.edu/geyer/s17/3701/data/q4p2.csv", stringsAsFactors = FALSE)

which makes foo a data frame having variables speed (quantitative), state (categorical), color (categorical), and y (zero-or-one).

Treat y as the response to be predicted by the other three variables.

Following the example in Section 3.3 of the course notes about statistical models, fit a GLM that has each of the predictor variables as main effects (no interactions).

Perform tests of statistical hypotheses about whether each of these variables can be dropped from the model without making the model fit the data worse.

Interpret the P-values for these tests. What model do they say is the most parsimonious model that fits the data?
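A minimal sketch of one way to organize this, assuming a logistic GLM as in the course notes (using drop1 is one choice; the tests could also be done with anova comparing fitted models):

# main effects only, no interactions
gout <- glm(y ~ speed + state + color, family = binomial, data = foo)
summary(gout)
# likelihood ratio test for dropping each predictor separately
drop1(gout, test = "LRT")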

Problem 3

This problem uses the data read in by


foo <- read.csv("http://www.stat.umn.edu/geyer/s17/3701/data/q4p3.csv")

which makes foo a data frame having variables x and y both quantitative.

Treat y as the response to be predicted by x.

Following the example in Section 3.4.2.3 of the course notes about statistical models, fit a GAM that assumes the conditional mean of y given x is a smooth function (with no parametric assumptions about this smooth function).

On a scatter plot of the data, add lines that are the lower and upper end points of 95% confidence intervals for the mean of y given x for each value of x. As in the example in the course notes, do not adjust these intervals to obtain simultaneous coverage.

This is the first question that asks for a plot. For this question, upload not only your R code but also the plot, as a PDF file called q4p3.pdf.

Also give numeric 95% confidence intervals for the conditional mean of y given x for the x values 0, 20, 40, 60, 80, 100.
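A minimal sketch of one way to do all of this, assuming the course notes fit the GAM with the R function gam in the CRAN package mgcv (check the notes for the actual method used there):

library(mgcv)
gout <- gam(y ~ s(x), data = foo)
# pointwise (not simultaneous) 95% confidence bands
xnew <- data.frame(x = seq(min(foo$x), max(foo$x), length = 401))
pout <- predict(gout, newdata = xnew, se.fit = TRUE)
crit <- qnorm(0.975)
pdf("q4p3.pdf")
plot(foo$x, foo$y, xlab = "x", ylab = "y")
lines(xnew$x, pout$fit + crit * pout$se.fit, lty = "dashed")
lines(xnew$x, pout$fit - crit * pout$se.fit, lty = "dashed")
dev.off()
# numeric intervals at the requested x values
xspec <- data.frame(x = seq(0, 100, by = 20))
pspec <- predict(gout, newdata = xspec, se.fit = TRUE)
cbind(x = xspec$x, lower = pspec$fit - crit * pspec$se.fit,
    upper = pspec$fit + crit * pspec$se.fit)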

Homework 4

Homework problems start with problem number 4 because, if you don't like your solutions to the problems on the quiz, you are allowed to redo them (better) for homework. If you don't submit anything for problems 1–3, then we assume you liked the answers you already submitted.

Problem 4

This is a problem about JSON, but like the examples in Section 5 of the course notes about data, we don't actually deal with JSON. The R function fromJSON in the CRAN package jsonlite returns R data structures, which are the only thing we have to deal with.

Follow the example in Section 5.2 of the course notes about data to read in some data about CRAN.


library(jsonlite)
foo <- fromJSON("http://crandb.r-pkg.org/-/latest")

(this may take a few seconds).

If you look at some of these data


head(foo)

you see that foo is a list, that each component of foo has a name that is the name of a CRAN package, and that each component of foo is itself a list whose component names correspond to the names of fields of the DESCRIPTION file.
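For example (the package in position 1 is arbitrary):

length(foo)        # number of packages on CRAN
names(foo[[1]])    # fields in the first package's DESCRIPTION
foo[[1]]$License   # one particular field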

The authoritative reference for what these data are about (contents of the DESCRIPTION file in CRAN packages) is Section 1.1.1 of Writing R Extensions. Information about the License field of this file is in Section 1.1.2 of Writing R Extensions. You probably do not need to look at these references to do this problem, but they have been provided in case you think you need to.

Using these data, answer the following questions.

  1. The number of fields in the DESCRIPTION file is different for different packages (some fields are mandatory, others are optional). Produce a vector of all field names in all packages. How many unique field names are there?
  2. Every package has a field NeedsCompilation that is "yes" if the package contains C, C++, or Fortran code that is compiled and called from R (many R functions work this way). Produce a vector of all values of the field NeedsCompilation in all packages. How many packages need compilation? What proportion of packages need compilation?
  3. Every package has a field License that specifies which licenses among the licenses on the web page https://www.r-project.org/Licenses/ the package is licensed under. Produce a vector of all the packages that are licensed under some version of the GPL (the most common license). Note that AGPL and LGPL don't count, and your answer should not include packages that have only these and not also GPL in their license field. The way to find stuff in character strings is the R function grep, and the way to find complicated matches is to have the match string be a regular expression, which is documented on the R help page ?regex. There we see that \b matches word boundaries. So \bGPL\b should match GPL but not AGPL or LGPL. And that is correct, but in R strings a backslash starts an escape sequence, so to put this regular expression in an R string (as we must do to hand it to the grep function) we have to escape the backslashes (each \\ puts a single backslash character in the string). Thus the pattern to match is "\\bGPL\\b" when put in an R character string. How many packages are licensed under the GPL? What proportion of packages are licensed under the GPL? (A sketch of parts 1 through 3 follows this list.)
  4. Some packages have a field Depends that says which version of R itself and which versions of other packages this package depends on (and won't work unless they are present). This field is optional. What is the structure of the Depends component of items of foo that have such a component?
    • Produce a list of the names of packages on which each package depends. Do not include "R" in this list. Do not include version information in this list.
    • Produce a vector of the names of packages on which any other package depends, repeating the name for each dependency (so we can count how many times any package is in the Depends field of another package).
    • Produce a table of counts of how many times each package that appears in some Depends field does so. (Hint: the R function table is useful here.)
    • Reorder your table so it is in decreasing order of the counts.
    • Notice that the packages that have the highest counts are core or recommended packages, which are in
      
      rcore <- c("base", "compiler", "datasets", "graphics", "grDevices",
          "grid", "methods", "parallel", "splines", "stats", "stats4",
          "tcltk", "tools", "translations", "utils", "boot", "class", "cluster",
          "codetools", "foreign", "KernSmooth", "lattice", "MASS", "Matrix",
          "mgcv", "nlme", "nnet", "rpart", "spatial", "survival")
      
      
      Eliminate these packages from your table, and again produce a table reordered so it is in decreasing order of the counts. I found the R function setdiff useful here.
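Here is a hedged sketch of parts 1 through 3 (one way to do it, not necessarily the intended way):

# part 1: all field names in all packages
fields <- unlist(lapply(foo, names))
length(unique(fields))
# part 2: the NeedsCompilation field of every package
needs <- sapply(foo, function(p) p$NeedsCompilation)
sum(needs == "yes")
mean(needs == "yes")
# part 3: packages whose License field matches GPL but not AGPL or LGPL
lic <- sapply(foo, function(p) p$License)
gpl <- grep("\\bGPL\\b", lic)
names(foo)[gpl]
length(gpl)
length(gpl) / length(foo)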

Problem 5

This problem is about SQL databases. It starts off following Section 6.1 of the course notes about data, but this problem is more complicated than the example in the notes and requires more SQL.

Set up a database by doing


load(url("http://www.stat.umn.edu/geyer/s17/3701/data/q4p5.rda"))
library(DBI)
mydb <- dbConnect(RSQLite::SQLite(), "")
dbWriteTable(mydb, "depends", d$depends)
dbWriteTable(mydb, "imports", d$imports)
dbWriteTable(mydb, "suggests", d$suggests)
dbWriteTable(mydb, "linking", d$linking)
rm(d)
ls()

This requires that you have the CRAN packages DBI and RSQLite both installed. The rm command removes the R object that had contained the data, as the ls command shows. There is nothing left but the database connection.


dbListTables(mydb)

shows there are four tables, and, for example,

dbGetQuery("SELECT * FROM depends LIMIT 10")

(which is sort of the equivalent of the R function head) shows what this table is about. It is the same data as in the preceding problem except that all dependencies on "R" or any of the core or recommended packages have already been removed.

All four tables in the database have the same structure (the same field names and the same kind of field entries: all names of CRAN packages). The tables only differ in which field of the DESCRIPTION file of the CRAN packages the data come from. The fields are Depends, Imports, Suggests, and LinkingTo.

We want to do more or less what we did in the last problem, except that we want to do as much as possible using SQL, and we want to use the combined data from all four tables. This requires SQL not covered in the notes. If you just cannot do it with SQL, then do it using R, but you have to start with the commands above and cannot re-load the object d from the URL used above. So you have to at least use SQL to get the data out of the database.

We want to do the following.

  1. Extract the packto columns of each table, combining them into a new table. An SQL statement that starts
    CREATE TABLE temp AS SELECT
    
    and continues with the rest of a SELECT statement creates a new table named temp that is the table that results from the SELECT statement. For more information see this web page about CREATE TABLE (the AS part is about halfway down the page).

    The SQL UNION ALL operator can be used to combine SELECT statements putting all the results in one result. UNION removes duplicates; UNION ALL does not, which is what we want here, because we want to count the duplicates. For more information see this web page about UNION and UNION ALL.

  2. It is probably easiest to just get this table and do the rest in R, but, if you want to continue with SQL, the SQL statement
    SELECT packto, COUNT(packto) FROM temp GROUP BY packto
    
    does the equivalent of the R function table, as can be seen by executing it. For more information see this web page about GROUP BY (an example using COUNT with GROUP BY is about halfway down the page).

    However, the weird name of the second column of this table befuddled me as to how I could sort on it. So I did

    SELECT packto, COUNT(packto) AS packcount FROM temp GROUP BY packto
    
    where the AS gives the count a new name. For more information see this web page about AS.
  3. Now we want to sort these results. The SQL for that uses an ORDER BY clause. For more information see this web page about ORDER BY. Note that the DESC modifier asks for a sort in descending order.
  4. But this isn't the last thing we want to do. We only want to look at the CRAN packages that have at least 100 packages depending on them. The SQL syntax for that is to add
    WHERE packcount >= 100
    
    For more information see this web page about WHERE.

    But no matter what I tried I could not get all of this to work in one query. So I created yet another table that had the counting done and then did the sort and extraction of the counts over 100 in another statement (as sketched below).
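A hedged sketch of the whole sequence (the DBI function dbExecute runs SQL statements that do not return a result; your SQL may reasonably differ):

# combine the packto columns of all four tables, keeping duplicates
dbExecute(mydb, "CREATE TABLE temp AS
    SELECT packto FROM depends
    UNION ALL SELECT packto FROM imports
    UNION ALL SELECT packto FROM suggests
    UNION ALL SELECT packto FROM linking")
# count how many times each package appears in temp
dbExecute(mydb, "CREATE TABLE counts AS
    SELECT packto, COUNT(packto) AS packcount
    FROM temp GROUP BY packto")
# pull out the counts of at least 100, biggest first
dbGetQuery(mydb, "SELECT * FROM counts
    WHERE packcount >= 100 ORDER BY packcount DESC")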

Problem 6

This problem is about smoothing.

The R command


foo <- read.csv("http://www.stat.umn.edu/geyer/s17/3701/data/q4p3.csv")

makes foo a data frame having variables x and y both quantitative. This is the same data as for Problem 3 above. As there, we treat y as the response to be predicted by x.

The difference is that we are going to try kernel smoothing (a method not described in the course notes). Use the R functions locpoly and dpill in the R package KernSmooth, which is a recommended package that comes with every R installation, to fit a smooth to these data (locpoly does the smoothing and dpill does the bandwidth selection, i.e., choosing the right amount of smoothness).

You have to read the help pages and follow the examples to do this problem.

Like Problem 3 above, this problem also needs a plot to show your solution. Do a scatter plot with the estimated regression function superimposed. For this question, upload not only your R code but also the plot, as a PDF file called q4p6.pdf.
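A minimal sketch of one way to do this, with everything left at the defaults (read the help pages; the defaults may need adjusting for these data):

library(KernSmooth)
foo <- read.csv("http://www.stat.umn.edu/geyer/s17/3701/data/q4p3.csv")
h <- dpill(foo$x, foo$y)                       # bandwidth selection
sout <- locpoly(foo$x, foo$y, bandwidth = h)   # kernel smooth
pdf("q4p6.pdf")
plot(foo$x, foo$y, xlab = "x", ylab = "y")
lines(sout$x, sout$y)
dev.off()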