About the Rweb in Stat 3011 Web Pages
For an example dataset we use the data for Example 4.2 in Devore and Peck. To use this data, type

    http://superior.stat.umn.edu/~charlie/3011/te0402.dat

in the "Dataset URL" window (as explained on the variables and data entry page). Then submit the R command

    plot(x, y)

to get the scatter plot of the data. It should look just like the left-hand panel of Figure 4.3 in the book except for the labels. R just puts the variable names x and y as axis labels. To get labels like in the book, we can use optional arguments to the plot command

    plot(x, y, xlab="Soil pH", ylab="Dieback %")

Another useful optional argument is main, which puts a title on the plot, for example

    plot(x, y, xlab="Soil pH", ylab="Dieback %", main="Northern Vermont Sugarbushes")
Least squares regression fitting is done in R using the lm (for "linear model") command. The command

    foo <- lm(y ~ x)

does the regression of y on x and stores the results of the calculation in an R dataset named foo that is mostly useful for further computations. The command

    coefficients(foo)

extracts the least squares regression coefficients (intercept and slope) from this output, and

    fitted.values(foo)
    residuals(foo)

produce the fitted values (Devore and Peck call them "predicted values") and residuals.
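As a sanity check on what these extractor functions return, note that each residual is just the observed y value minus the fitted value. A minimal sketch, assuming x and y are already loaded as in the Rweb session above:

```r
foo <- lm(y ~ x)
# each residual is the observed y minus the fitted value,
# so this comparison should come out TRUE
all.equal(residuals(foo), y - fitted.values(foo))
```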
The R command abline adds a line to a plot. Putting this all together,

    plot(x, y, xlab="Soil pH", ylab="Dieback %")
    foo <- lm(y ~ x)
    abline(coefficients(foo))

makes the scatter plot with the regression line drawn. Actually, through the magic of defaults, the simpler

    plot(x, y, xlab="Soil pH", ylab="Dieback %")
    foo <- lm(y ~ x)
    abline(foo)

also works.
For this section, we will use the data set for Problem 11.20 in Devore and Peck

    http://superior.stat.umn.edu/~charlie/3011/ex1120.dat

The R command summary applied to the output of the lm command produces most of what is needed for problems about linear regression. Type the URL above in the data URL window and submit

    foo <- lm(y ~ x)
    summary(foo)

and Rweb prints

    Call:
    lm(formula = y ~ x)

    Residuals:
         Min       1Q   Median       3Q      Max
    -22.8330  -6.5139   0.7797   7.9072  19.8375

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  32.0809    13.3185   2.409   0.0426 *
    x             0.5549     0.3101   1.790   0.1113
    ---
    Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

    Residual standard error: 14.3 on 8 degrees of freedom
    Multiple R-Squared: 0.2859,     Adjusted R-squared: 0.1966
    F-statistic: 3.203 on 1 and 8 degrees of freedom, p-value: 0.1113
The important line in the printout is the coefficients line for "x". What the label means is that the equation of the sample regression line is y = a + b x and the coefficient of the variable x is b (the slope). Reading across, the numbers in this row are the estimate of the slope (0.5549), its standard error (0.3101), the t statistic for testing whether the population slope is zero (1.790), and the two-tailed P-value of that test (0.1113).
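The t statistic and two-tailed P-value in this row can be reproduced by hand from the estimate and standard error. A sketch using the numbers from the printout above (pt is R's t distribution function):

```r
b  <- 0.5549       # estimate of the slope, from the printout
se <- 0.3101       # its standard error, from the printout
t  <- b / se       # t statistic, about 1.790
# two-tailed P-value on n - 2 = 8 degrees of freedom, about 0.1113
2 * pt(-abs(t), 8)
```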
If we wanted a one-tailed instead of a two-tailed test, we would, as usual, cut the P-value in half (because two tails have twice the probability of one tail), getting P = 0.05565. Since the estimate b is positive, this is evidence for positive values of the parameter β. Thus this one-tailed test is upper-tailed, with hypotheses H0: β = 0 versus Ha: β > 0, using the numbers from the summary command above.
The relevant critical value here is a t critical value for n - 2 degrees of freedom. (But you don't even have to remember n - 2 or what n is. Rweb prints the degrees of freedom for error in the same printout, on the "Residual standard error" line.) The degrees of freedom is also given by

    df.residual(foo)
To get a t critical value from a confidence level and degrees of freedom, we use the qt command, in this case

    - qt(0.025, 8)

gives the critical value 2.306 for a 95% confidence interval (0.025 because 95% in the middle leaves 2.5% in each tail, and 8 degrees of freedom read out of the printout above).
Thus a 95% confidence interval for the slope β of the population regression line is

    0.5549 - c(-1, 1) * qt(0.025, 8) * 0.3101

or (-0.1602, 1.2700).
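R also has a built-in function, confint, that computes this interval directly from the fitted model object, so you do not have to copy numbers out of the printout. A sketch, assuming x and y are loaded as before:

```r
foo <- lm(y ~ x)
# 95% confidence interval for the slope,
# the same interval as the hand computation above
confint(foo, "x", level = 0.95)
```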
Because of the relation between the slope and the correlation, the t test that the population correlation is zero is exactly the same as the t test that the population slope β is zero. For confidence intervals for correlations, you will have to take a more advanced course. It's only the tests that are the same.
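The equivalence of the two tests can be checked in R with the cor.test function (not used elsewhere on this page): its P-value for testing zero correlation matches the P-value for x in the summary output. A sketch, assuming x and y are loaded:

```r
# these two P-values are equal
cor.test(x, y)$p.value
summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
```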
These two philosophically quite distinct notions, a prediction interval for a newly observed y value and a confidence interval for the value of the population regression function, have very similar formulas and are done by the same Rweb function predict (for details see the on-line help).
The prediction interval having 95% probability of covering the y value for a newly observed individual having the x value 50 is given by the Rweb code

    foo <- lm(y ~ x)
    predict(foo, data.frame(x=50), interval="prediction")

The result is (24.57, 95.08).
A 95% confidence interval for the value of the population regression function at the point x = 50 (that is, for α + βx with 50 plugged in for x) is given by the Rweb code

    foo <- lm(y ~ x)
    predict(foo, data.frame(x=50), interval="confidence")

The result is (47.34, 72.31).
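One way to see the philosophical difference is that the prediction interval at a given x value is always wider than the confidence interval there, because it must also account for the variability of a single new observation, not just the uncertainty in the estimated line. A sketch, assuming x and y are loaded:

```r
foo  <- lm(y ~ x)
pred <- predict(foo, data.frame(x = 50), interval = "prediction")
conf <- predict(foo, data.frame(x = 50), interval = "confidence")
# the prediction interval is the wider of the two
pred[, "upr"] - pred[, "lwr"]
conf[, "upr"] - conf[, "lwr"]
```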
What Devore and Peck call a residual plot in Chapter 4 (page 138) is produced by the Rweb commands

    foo <- lm(y ~ x)
    plot(x, residuals(foo), xlab="Soil pH", ylab="Residuals")
    abline(h=0)

The abline(h=0) command is not necessary. It just puts the x-axis on the plot, like the examples done in class.
In Chapter 11 Devore and Peck are selling a slightly different plot in which the residuals are replaced by what they call "standardized residuals" and other people call "studentized residuals". In fact, there are two kinds of such residuals, so-called "internally studentized residuals" and "externally studentized residuals". Devore and Peck are talking about the first kind (internally studentized), and Rweb has a function rstudent that produces the other.

    foo <- lm(y ~ x)
    plot(x, rstudent(foo), xlab="Soil pH", ylab="Residuals")
    abline(h=0)
The two plots made by Rweb and the plot Devore and Peck are talking about look very much alike. Of the three, the one with the externally studentized residuals (using rstudent) does the best job of finding outliers.
A normal Q-Q plot of regression residuals is often used as a diagnostic. (The residuals should be approximately normally distributed.) The R commands

    foo <- lm(y ~ x)
    qqnorm(residuals(foo))
    qqline(residuals(foo))

produce this plot. It is interpreted the same as any other Q-Q plot. Lack of linearity indicates non-normality.
As we saw with residual plots, it is better to use studentized residuals in place of ordinary residuals (replace residuals with rstudent)

    foo <- lm(y ~ x)
    qqnorm(rstudent(foo))
    abline(0, 1)

The reason for the different command to draw a line is that for standardized residuals we supposedly know the scale. The residuals are supposedly standard normal. Hence we want the line with intercept zero and slope one.