Here is a short example of the use of MacAnova in regression. The data are from Table 2.12 (data file ta02_012.txt) which is referenced in Ex. 2.73.
The predictor variable, rural, is a measurement of air pollution at a rural location 10 miles southwest of a city. The response variable, urban, is a similar measurement for the same day in the city. There are several values missing. Cases containing missing values are ignored by regress().
Cmd> readdata("")
# File ta02_012.txt from publisher's web site
# Data from Table 2.12, p. 176 of IPS4
# Col. 1: rural = particulate level in rural location
# Col. 2: city = particulate level in city location
Read from file "TP1:Stat5021:Data:Ch02:ta02_012.txt"
Column 1 saved as REAL vector rural with 7 missing value(s)
Column 2 saved as REAL vector city with 6 missing value(s)
Cmd> list(rural,city)
city REAL 36
rural REAL 36
Cmd> print(format:"7.0f",rural,city)
rural:
(1) MISSING 67 42 33 46 MISSING 43
(8) 54 MISSING MISSING MISSING MISSING 38 88
(15) 108 57 70 42 43 39 MISSING
(22) 52 48 56 44 51 21 74
(29) 48 84 51 43 45 41 47
(36) 35
city:
(1) 39 68 42 34 48 82 45
(8) MISSING MISSING 60 57 MISSING 39 MISSING
(15) 123 59 71 41 42 38 MISSING
(22) 57 50 58 45 69 23 72
(29) 49 86 51 42 46 MISSING 44
(36) 42
You can identify the cases with missing values using ismissing().
Cmd> ruralmissing <- ismissing(rural); citymissing <- ismissing(city) Cmd> nonmissing <- !ruralmissing && !citymissing Cmd> print(format:"2.0f",ruralmissing,citymissing,nonmissing) ruralmissing: T means missing, F not missing (1) T F F F F T F F T T T T F F F F F F F F (21) T F F F F F F F F F F F F F F F citymissing: T means missing, F not missing (1) F F F F F F F T T F F T F T F F F F F F (21) T F F F F F F F F F F F F T F F nonmissing: F means missing, T not missing (1) F T T T T F T F F F F F T F T T T T T T (21) F T T T T T T T T T T T T F T T
Plotting commands omit cases with missing values:
Cmd> plot(rural,city,xlab:"Rural air pollution reading",\ ylab:"City Pollution level",\ title:"City vs rural air pollution readings")![]()
Cmd> regress("city=rural") # linear regression, x=rural,y=city Model used is city=rural WARNING: cases with missing values deleted Coef StdErr t CONSTANT -2.5801 2.7277 -0.94589 rural 1.0935 0.050597 21.612 N: 36, Omitted: 10, MSE: 20.063, DF: 24, R^2: 0.95113 Regression F(1,24): 467.08, Durbin-Watson: 1.8007 To see the ANOVA table type 'anova()' Cmd> COEF # intercept and slope saved by regress() CONSTANT rural -2.5801 1.0935 Cmd> a <- COEF[1]; b <- COEF[2] Cmd> x0 <- vector(15,110); y0 <- a + b*x0# extreme pts on line Cmd> addlines(x0,y0,title:"City vs rural with LS regression line")![]()
Cmd> RESIDUALS # vector of residuals computed by regress() (1) MISSING -2.6848 -1.3472 0.49441 0.27882 (6) MISSING 0.55934 MISSING MISSING MISSING (11) MISSING MISSING 0.026875 MISSING 7.4814 (16) -0.74975 -2.9653 -2.3472 -2.4407 -2.0666 (21) MISSING 2.7178 0.091809 -0.65624 -0.53417 (26) 15.811 2.6165 -6.3394 -0.90819 -3.2744 (31) -2.1887 -2.4407 -0.62767 MISSING -4.8147 (36) 6.3074
RESIDUALS[i] is MISSING if either x[i] or y[i] is MISSING.
Cmd> RESIDUALS[nonmissing] # select non missing residuals (1) -2.6848 -1.3472 0.49441 0.27882 0.55934 (6) 0.026875 7.4814 -0.74975 -2.9653 -2.3472 (11) -2.4407 -2.0666 2.7178 0.091809 -0.65624 (16) -0.53417 15.811 2.6165 -6.3394 -0.90819 (21) -3.2744 -2.1887 -2.4407 -0.62767 -4.8147 (26) 6.3074
In plots, you don't have to worry about MISSING values since they are ignored.
Here is a plot against case number. There are gaps where data is missing.
Cmd> plot(1,RESIDUALS,title:"Residuals vs case number",\ xlab:"Case number", ylab:"Residuals")![]()
Here is a plot of residuals vs the predictor variable, rural.
Cmd> plot(rural,RESIDUALS,xlab:"Rural pollution measurement",\ ylab:"Residuals", title:"Residuals vs predictor (rural pollution)")![]()
The case with the largest residual is not particularly influential, since its x-value is in the middle of the pack. However, the second largest residual with rural pollution over 100 is influential since it is an x-outlier.
Here is a normal quantile plot of the residuals. rankits() does the right thing with MISSING values, but does give a warning message.
Cmd> plot(rankits(RESIDUALS),RESIDUALS,ylab:"Residuals",\ xlab:"Normal quantiles (Normal scores)" title:"Normal quantile plot of residuals") WARNING: MISSING values in argument to rankits()![]()
There a kink in the plot at around z = 1. The residuals don't appear very normal. It's possible that's because the fit included outliers, including an influential one.
Let's redo the regression omitting the cases with the two largest outliers.
Cmd> outliers <- grade(abs(RESIDUALS),down:T)[run(2)]
WARNING: missing values in argument(s) to abs()
WARNING: MISSING values in argument to grade()
Cmd> outliers # case numbers of two largest residuals
(1) 26 15
Cmd> RESIDUALS[outliers]
(1) 15.811 7.4814
Cmd> regress("{city[-outliers]}={rural[-outliers]}") # regression w/o outliers
Model used is {city[-outliers]}={rural[-outliers]}
WARNING: cases with missing values deleted
Coef StdErr t
CONSTANT 1.6988 1.6226 1.0469
{rural[-outliers]} 0.98564 0.032106 30.7
N: 34, Omitted: 10, MSE: 4.5946, DF: 22, R^2: 0.97719
Regression F(1,22): 942.47, Durbin-Watson: 2.04
To see the ANOVA table type 'anova()'
Cmd> vector(a,b) # original coefficients
CONSTANT rural
-2.5801 1.0935
Cmd> a1 <- COEF[1]; b1 <- COEF[2] # new coefficients
Now I replot all the data and add both lines. I added the arrows and circles by hand.
Cmd> plot(rural,city,show:F) Cmd> addlines(x0,a+b*x0,show:F) # add LS line all data Cmd> addlines(x0,a1+b1*x0,linetype:2,show:F) # LS line, no outliers Cmd> showplot(xlab:"Rural pollution",ylab:"City pollution",\ title:"City vs urban pollution with two lines")![]()
The new line fits the relationship considerably better, except for the outliers.
Now another normal quantile plot with the new residuals.
Cmd> plot(rankits(RESIDUALS),RESIDUALS,ylab:"Residuals",\ xlab:"Normal quantiles (Normal scores)",\ title:"Normal quantile plot of residuals omitting outliers") WARNING: MISSING values in argument to rankits()![]()
There is still a bit of a kink, but on the whole the plot is closer to a straight line than before.
You can use regpred() to predict the values for which there were rural data but no city data.
Cmd> selector <- !ruralmissing && citymissing Cmd> run(1,length(rural))[selector] # case numbers (1) 8 14 34 Cmd> regpred(rural[selector]) component: estimate (1) 54.923 88.435 42.11 Predicted values component: SEest (1) 0.46985 1.3365 0.50203 component: SEpred (1) 2.1944 2.526 2.2015
Here is a more "white box" way:
Cmd> a1 + b1*rural[selector] (1) 54.923 88.435 42.11