University of Minnesota
School of Statistics

MacAnova Regression Example

Statistics 5021
February 13, 2003

Here is a short example of the use of MacAnova in regression. The data are from Table 2.12 (data file ta02_012.txt) which is referenced in Ex. 2.73.

The predictor variable, rural, is a measurement of air pollution at a rural location 10 miles southwest of a city. The response variable, urban, is a similar measurement for the same day in the city. There are several values missing. Cases containing missing values are ignored by regress().

Cmd> readdata("")
# File ta02_012.txt from publisher's web site
# Data from Table 2.12, p. 176 of IPS4
# Col. 1: rural = particulate level in rural location
# Col. 2: city = particulate level in city location
Read from file "TP1:Stat5021:Data:Ch02:ta02_012.txt"
Column 1 saved as REAL vector rural with 7 missing value(s)
Column 2 saved as REAL vector city with 6 missing value(s)

Cmd> list(rural,city)
city            REAL   36   
rural           REAL   36   

Cmd> print(format:"7.0f",rural,city)
rural:
 (1) MISSING      67      42      33      46 MISSING      43
 (8)      54 MISSING MISSING MISSING MISSING      38      88
(15)     108      57      70      42      43      39 MISSING
(22)      52      48      56      44      51      21      74
(29)      48      84      51      43      45      41      47
(36)      35
city:
 (1)      39      68      42      34      48      82      45
 (8) MISSING MISSING      60      57 MISSING      39 MISSING
(15)     123      59      71      41      42      38 MISSING
(22)      57      50      58      45      69      23      72
(29)      49      86      51      42      46 MISSING      44
(36)      42

You can identify the cases with missing values using ismissing().

Cmd> ruralmissing <- ismissing(rural); citymissing <- ismissing(city)

Cmd> nonmissing <- !ruralmissing && !citymissing

Cmd> print(format:"2.0f",ruralmissing,citymissing,nonmissing)
ruralmissing:        T means missing, F not missing
 (1) T  F  F  F  F  T  F  F  T  T  T  T  F  F  F  F  F  F  F  F 
(21) T  F  F  F  F  F  F  F  F  F  F  F  F  F  F  F 
citymissing:         T means missing, F not missing
 (1) F  F  F  F  F  F  F  T  T  F  F  T  F  T  F  F  F  F  F  F 
(21) T  F  F  F  F  F  F  F  F  F  F  F  F  T  F  F 
nonmissing:         F means missing, T not missing
 (1) F  T  T  T  T  F  T  F  F  F  F  F  T  F  T  T  T  T  T  T 
(21) F  T  T  T  T  T  T  T  T  T  T  T  T  F  T  T 

Plotting commands omit cases with missing values:

Cmd> plot(rural,city,xlab:"Rural air pollution reading",\
	ylab:"City Pollution level",\
	title:"City vs rural air pollution readings")
Scatter plot of city vs rural pollution
Cmd> regress("city=rural") # linear regression, x=rural,y=city
Model used is city=rural
WARNING: cases with missing values deleted
                Coef      StdErr           t
CONSTANT     -2.5801      2.7277    -0.94589
rural         1.0935    0.050597      21.612

N: 36, Omitted: 10, MSE: 20.063, DF: 24, R^2: 0.95113
Regression F(1,24): 467.08, Durbin-Watson: 1.8007
To see the ANOVA table type 'anova()'

Cmd> COEF # intercept and slope saved by regress()
    CONSTANT       rural
     -2.5801      1.0935

Cmd> a <- COEF[1]; b <- COEF[2]

Cmd> x0 <- vector(15,110); y0 <- a + b*x0# extreme pts on line


Cmd> addlines(x0,y0,title:"City vs rural with LS regression line")
Scatter plot of city vs rural pollution with LS line
Cmd> RESIDUALS # vector of residuals computed by regress()
 (1)     MISSING     -2.6848     -1.3472     0.49441     0.27882
 (6)     MISSING     0.55934     MISSING     MISSING     MISSING
(11)     MISSING     MISSING    0.026875     MISSING      7.4814
(16)    -0.74975     -2.9653     -2.3472     -2.4407     -2.0666
(21)     MISSING      2.7178    0.091809    -0.65624    -0.53417
(26)      15.811      2.6165     -6.3394    -0.90819     -3.2744
(31)     -2.1887     -2.4407    -0.62767     MISSING     -4.8147
(36)      6.3074

RESIDUALS[i] is MISSING if either x[i] or y[i] is MISSING.

Cmd> RESIDUALS[nonmissing] # select non missing residuals
 (1)     -2.6848     -1.3472     0.49441     0.27882     0.55934
 (6)    0.026875      7.4814    -0.74975     -2.9653     -2.3472
(11)     -2.4407     -2.0666      2.7178    0.091809    -0.65624
(16)    -0.53417      15.811      2.6165     -6.3394    -0.90819
(21)     -3.2744     -2.1887     -2.4407    -0.62767     -4.8147
(26)      6.3074

In plots, you don't have to worry about MISSING values since they are ignored.

Here is a plot against case number. There are gaps where data is missing.

Cmd> plot(1,RESIDUALS,title:"Residuals vs case number",\
		xlab:"Case number", ylab:"Residuals")
Residuals vs case number

Here is a plot of residuals vs the predictor variable, rural.

Cmd> plot(rural,RESIDUALS,xlab:"Rural pollution measurement",\
  ylab:"Residuals", title:"Residuals vs predictor (rural pollution)")
Residuals vs predictor (rural pollution)

The case with the largest residual is not particularly influential, since its x-value is in the middle of the pack. However, the second largest residual with rural pollution over 100 is influential since it is an x-outlier.

Here is a normal quantile plot of the residuals. rankits() does the right thing with MISSING values, but does give a warning message.

Cmd> plot(rankits(RESIDUALS),RESIDUALS,ylab:"Residuals",\
	xlab:"Normal quantiles (Normal scores)"
	title:"Normal quantile plot of residuals")
WARNING:  MISSING values in argument to rankits()
Residuals vs normal quantiles

There a kink in the plot at around z = 1. The residuals don't appear very normal. It's possible that's because the fit included outliers, including an influential one.

Let's redo the regression omitting the cases with the two largest outliers.

Cmd> outliers <- grade(abs(RESIDUALS),down:T)[run(2)]
WARNING: missing values in argument(s) to abs()
WARNING:  MISSING values in argument to grade()

Cmd> outliers # case numbers of two largest residuals
(1)          26          15

Cmd> RESIDUALS[outliers]
(1)      15.811      7.4814

Cmd> regress("{city[-outliers]}={rural[-outliers]}") # regression w/o outliers
Model used is {city[-outliers]}={rural[-outliers]}
WARNING: cases with missing values deleted
                          Coef      StdErr           t
CONSTANT                1.6988      1.6226      1.0469
{rural[-outliers]}     0.98564    0.032106        30.7

N: 34, Omitted: 10, MSE: 4.5946, DF: 22, R^2: 0.97719
Regression F(1,22): 942.47, Durbin-Watson: 2.04
To see the ANOVA table type 'anova()'
Cmd> vector(a,b) # original coefficients
    CONSTANT       rural
     -2.5801      1.0935

Cmd> a1 <- COEF[1]; b1 <- COEF[2] # new coefficients

Now I replot all the data and add both lines. I added the arrows and circles by hand.

Cmd> plot(rural,city,show:F)

Cmd> addlines(x0,a+b*x0,show:F) # add LS line all data

Cmd> addlines(x0,a1+b1*x0,linetype:2,show:F) # LS line, no outliers

Cmd> showplot(xlab:"Rural pollution",ylab:"City pollution",\
title:"City vs urban pollution with two lines")
Scatter plot with two lines

The new line fits the relationship considerably better, except for the outliers.

Now another normal quantile plot with the new residuals.

Cmd> plot(rankits(RESIDUALS),RESIDUALS,ylab:"Residuals",\
	xlab:"Normal quantiles (Normal scores)",\
	title:"Normal quantile plot of residuals omitting outliers")
WARNING:  MISSING values in argument to rankits()
Outlier free residuals vs normal quantiles

There is still a bit of a kink, but on the whole the plot is closer to a straight line than before.

You can use regpred() to predict the values for which there were rural data but no city data.

Cmd> selector <- !ruralmissing && citymissing

Cmd> run(1,length(rural))[selector] # case numbers
(1)           8          14          34

Cmd> regpred(rural[selector])
component: estimate
(1)      54.923      88.435       42.11        Predicted values
component: SEest
(1)     0.46985      1.3365     0.50203
component: SEpred
(1)      2.1944       2.526      2.2015

Here is a more "white box" way:

Cmd> a1 + b1*rural[selector]
(1)      54.923      88.435       42.11

C. Bingham
Updated Thu Feb 13 10:18:09 CST 2003