
Simple Linear Regression in Rweb (Stat 3011)


Scatter Plots

For an example dataset we use the data for Example 4.2 in Devore and Peck. To use this data, type

     http://superior.stat.umn.edu/~charlie/3011/te0402.dat
in the "Dataset URL" window (as explained on the variables and data entry page). Then submit the R command
     plot(x, y)
to get the scatter plot of the data. It should look just like the left-hand panel of Figure 4.3 in the book except for the labels: R just uses the variable names x and y as axis labels. To get labels like those in the book, we can use optional arguments to the plot command
     plot(x, y, xlab="Soil pH", ylab="Dieback %")
Another useful optional argument is main, which puts a title on the plot, for example
     plot(x, y, xlab="Soil pH", ylab="Dieback %",
        main="Northern Vermont Sugarbushes")

Fitting the Regression Line

Least squares regression fitting is done in R using the lm (for "linear model") command

     foo <- lm(y ~ x)
does the regression of y on x and stores the results of the calculation in an R object named foo, which is mostly useful for further computations. The command
     coefficients(foo)
extracts the least squares regression coefficients (intercept and slope) from this output, and
     fitted.values(foo)
     residuals(foo)
produce the fitted values (Devore and Peck call them "predicted values") and residuals.
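These extractors fit together by the regression identities: each fitted value is a + b x, and each residual is y minus its fitted value. A quick sketch checking this (with a small made-up dataset standing in for the x and y that Rweb loads from the dataset URL):

```r
# made-up data standing in for x and y loaded from the dataset URL
x <- c(3.3, 3.4, 4.0, 4.1, 5.1)
y <- c(25, 12, 33, 10, 18)
foo <- lm(y ~ x)
ab <- coefficients(foo)   # ab[1] is the intercept a, ab[2] is the slope b
# the fitted values are a + b * x
all.equal(unname(fitted.values(foo)), unname(ab[1] + ab[2] * x))
# the residuals are y minus the fitted values
all.equal(unname(residuals(foo)), unname(y - fitted.values(foo)))
```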

Plotting the Regression Line

The R command abline adds a line to a plot. Putting this all together,

     plot(x, y, xlab="Soil pH", ylab="Dieback %")
     foo <- lm(y ~ x)
     abline(coefficients(foo))
makes the scatter plot with the regression line drawn.

Actually, through the magic of defaults, the simpler

     plot(x, y, xlab="Soil pH", ylab="Dieback %")
     foo <- lm(y ~ x)
     abline(foo)
also works.

Inference about the Slope

For this section, we will use the data set for Problem 11.20 in Devore and Peck:

     http://superior.stat.umn.edu/~charlie/3011/ex1120.dat

The R command summary applied to the output of the lm command produces most of what is needed for problems about linear regression. Type the URL above in the data URL window and submit

     foo <- lm(y ~ x)
     summary(foo)
and Rweb prints
    Call: 
    lm(formula = y ~ x) 
 
    Residuals: 
         Min       1Q   Median       3Q      Max  
    -22.8330  -6.5139   0.7797   7.9072  19.8375  
 
    Coefficients: 
                Estimate Std. Error t value Pr(>|t|)   
    (Intercept)  32.0809    13.3185   2.409   0.0426 * 
    x             0.5549     0.3101   1.790   0.1113
    --- 
    Signif. codes:  0  `***'  0.001  `**'  0.01  `*'  0.05  `.'  0.1  ` '  1  
 
    Residual standard error: 14.3 on 8 degrees of freedom
    Multiple R-Squared: 0.2859,     Adjusted R-squared: 0.1966  
    F-statistic: 3.203 on 1 and 8 degrees of freedom,       p-value: 0.1113  

The important line in the printout is the row of the coefficients table labeled "x". What the label means is that the equation of the sample regression line is y = a + b x, and the coefficient of the variable x is b (the slope).

Reading across, the numbers in this row are the estimate of the slope (b = 0.5549), its standard error (0.3101), the t statistic for testing whether the slope is zero (1.790), and the corresponding two-tailed P-value (0.1113).
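These quantities fit together arithmetically: the t statistic is the estimate divided by its standard error, and the P-value is the two-tailed probability from the t distribution with n - 2 = 8 degrees of freedom. A check in R, using the rounded numbers from the printout:

```r
b  <- 0.5549                 # estimated slope, from the printout
se <- 0.3101                 # its standard error, from the printout
tstat <- b / se              # t statistic, about 1.790
pval  <- 2 * pt(-abs(tstat), df = 8)   # two-tailed P-value, about 0.111
```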

Hypothesis Tests

To be precise about the hypotheses that go with the test having P = 0.1113 for its P-value, this is a two-tailed test with hypotheses
Null Hypothesis : beta equals 0
Alternative Hypothesis : beta not equal to 0

If we wanted a one-tailed test instead of a two-tailed test, we would, as usual, cut the P-value in half (because two tails have twice the probability of one tail), getting P = 0.05565. Since the estimate b is positive, the data are evidence for positive values of the parameter beta. Thus this one-tailed test is upper-tailed, with hypotheses

Null Hypothesis : beta equals 0
Alternative Hypothesis : beta greater than 0
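The same upper-tailed P-value can be computed directly with the pt command, using the t statistic and degrees of freedom read out of the printout:

```r
# upper-tail probability for the t statistic 1.790 on 8 degrees of freedom
pt(1.790, df = 8, lower.tail = FALSE)   # about 0.0557
```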

Confidence Intervals

We follow the general form for a confidence interval
point estimate ± (critical value) (standard error of the estimate)
Here the point estimate in question is b and its standard error is given in the printout of the summary command above.

The relevant critical value here is a t critical value for n - 2 degrees of freedom. (But you don't even have to remember n - 2 or what n is. Rweb prints the degrees of freedom for error in the same printout, on the "Residual standard error" line.) The degrees of freedom is also given by

     df.residual(foo)

To get a t critical value from a confidence level and degrees of freedom, we use the qt command, in this case

     - qt(0.025, 8)
gives the critical value 2.306 for a 95% confidence interval (0.025 because 95% in the middle leaves 2.5% in each tail; the 8 degrees of freedom are read out of the printout above).

Thus a 95% confidence interval for the slope beta of the population regression line is

     0.5549 - c(-1,1) * qt(0.025, 8) * 0.3101
or (-0.1602, 1.2700).
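For what it is worth, R can also do this whole computation in one step with the confint command applied to the lm output; with the data loaded as before, the row labeled x should reproduce the interval just computed by hand (the hand computation, though, shows what is going on):

```r
foo <- lm(y ~ x)
confint(foo, level = 0.95)   # the "x" row is the 95% interval for the slope
```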

Inference about the Correlation Coefficient

Because of the relation

     beta = rho * (sigma_y / sigma_x)
it follows that beta = 0 if and only if rho = 0. Thus a test of the hypotheses
Null Hypothesis : rho equals 0
Alternative Hypothesis : rho not equal to 0
is exactly the same as a test of the hypotheses
Null Hypothesis : beta equals 0
Alternative Hypothesis : beta not equal to 0
So for tests about correlations, see the preceding section.
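Incidentally, R also has a cor.test command that carries out this same test directly from the data; with the ex1120.dat data loaded as before, its t statistic and P-value should agree with the slope row of the summary printout:

```r
# same t statistic (1.790 on 8 df) and P-value (0.1113) as the slope test
cor.test(x, y)
```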

For confidence intervals for correlations, you will have to take a more advanced course. It's only the tests that are the same.

Prediction Intervals and Confidence Intervals for the Regression Function

These two philosophically quite distinct notions have very similar formulas and are done by the same Rweb function predict (for details see the on-line help).

The prediction interval having 95% probability of covering the y value for a newly observed individual having the x value 50 is given by the Rweb code

     foo <- lm(y ~ x)
     predict(foo, data.frame(x=50), interval="prediction")
The result is (24.57, 95.08).

A 95% confidence interval for the value of the population regression function at the point x = 50 (that is, for alpha + beta x with 50 plugged in for x) is given by the Rweb code

     foo <- lm(y ~ x)
     predict(foo, data.frame(x=50), interval="confidence")
The result is (47.34, 72.31).
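One way to see both the similarity and the difference: the two intervals are centered at the same fitted value (here about 59.8), but the prediction interval is wider, because it accounts for the variability of a single new observation and not just the uncertainty in the fitted line:

```r
foo <- lm(y ~ x)
pred <- predict(foo, data.frame(x = 50), interval = "prediction")
conf <- predict(foo, data.frame(x = 50), interval = "confidence")
pred[, "fit"] == conf[, "fit"]   # TRUE: same center, different widths
```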

Diagnostic Plots

Residuals versus Predictor

What Devore and Peck call a residual plot in Chapter 4 (page 138) is produced by the Rweb commands

     foo <- lm(y ~ x)
     plot(x, residuals(foo), xlab="Soil pH", ylab="Residuals")
     abline(h=0)
The abline(h=0) command is not necessary. It just puts the x-axis on the plot, like the examples done in class.

In Chapter 11 Devore and Peck are selling a slightly different plot in which the residuals are replaced by what they call "standardized residuals" and other people call "studentized residuals". In fact, there are two kinds of such residuals, so-called "internally studentized residuals" and "externally studentized residuals". Devore and Peck are talking about the first kind (internally studentized), and Rweb has a function rstudent that produces the other (externally studentized).

     foo <- lm(y ~ x)
     plot(x, rstudent(foo), xlab="Soil pH", ylab="Residuals")
     abline(h=0)

The two plots made by Rweb and the plot Devore and Peck are talking about look very much alike. Of the three, the one with the externally studentized residuals (using rstudent) does the best job of finding outliers.
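If you do want the internally studentized residuals that Devore and Peck actually describe, R has those too, under the name rstandard; the plot is made the same way (this is an alternative to the rstudent plot above):

```r
foo <- lm(y ~ x)
plot(x, rstandard(foo), xlab="Soil pH", ylab="Standardized residuals")
abline(h=0)
```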

Q-Q Plot of the Residuals

A normal Q-Q plot of regression residuals is often used as a diagnostic. (The residuals should be approximately normally distributed.) The R commands

     foo <- lm(y ~ x)
     qqnorm(residuals(foo))
     qqline(residuals(foo))
produce this plot. It is interpreted the same as any other Q-Q plot. Lack of linearity indicates non-normality.

As we saw with residual plots, it is better to use studentized residuals in place of ordinary residuals (replace residuals with rstudent):

     foo <- lm(y ~ x)
     qqnorm(rstudent(foo))
     abline(0, 1)
The reason for the different command to draw a line is that for studentized residuals we supposedly know the scale: the residuals are supposedly standard normal. Hence we want the line with slope one and intercept zero.