University of Minnesota
School of Statistics

Frequently Asked Questions About Computing and MacAnova

Statistics 5303



Accessing PDF files

Q. I was able to access the class notes which you posted on the web as Acrobat PDF documents. However, they are not in a readable form. Do you have an alternative way or form to forward the class notes?

A. I don't have any alternative way or form to forward the class notes. I have been using this method for several years and students have generally had little trouble once they get the hang of it.

One thing I recommend because it gives you more control over the process is to download PDF documents explicitly. On a Windows computer, click on the link using the right button so that a menu pops up. On a Macintosh, hold down the button when you select the link until a popup menu appears. In both cases you should then select Save link as or Save target as on the popup menu. Save the file some place where you will be able to find it, perhaps on the "desktop". Then, leaving the browser, double click on the file's icon to read it and/or print it in Acrobat Reader.

Most computers come with Acrobat Reader installed. If you don't have it, you can download a free copy from http://www.adobe.com/products/acrobat/readstep2.html.

Most browsers have "plugins" that enable you to read PDF documents in your browser when you click on the link with the left button or without holding down the button. Unfortunately, when you do it this way the actual document is squirreled away in a hard-to-find cache folder, sometimes using an obscure name, and you may not be able to find it to read or print after you quit your browser. That's one reason why I recommend explicitly downloading PDF files.

If you do have trouble with this process, please let me know and I will try to help.


Printing neat ANOVA tables

Q. Once I have calculated all the variables for an ANOVA table, is there a way to construct it in MacAnova?

A. I'm not certain I understand the question. The following is based on the assumption that what you want is to print a nice neat ANOVA table in MacAnova.

Suppose you have means and an MSE from a 3 treatment experiment with 7 observations per treatment, so your dftreatment = 2 and dferror = 18.

Suppose the means are 29.12, 30.13, and 27.39 and MSE = 5.32 and you compute the components of the ANOVA table as follows:

Cmd> means <- vector(29.12, 30.13,27.39)

Cmd> grandmean <- sum(means)/3 # OK because sample sizes equal

Cmd> 7*sum((means-grandmean)^2) # SS_trt
(1)       26.881

Cmd> mse <- 5.32

Cmd> 18*mse # SS_error
(1)        95.76

Cmd> df <- vector(2, 18)

Cmd> ss <- vector(26.881, 95.76)

Cmd> ms <- ss/df

Cmd> f <- ms/ms[2]

Now you need to make a table with row and column labels containing these results.

Cmd> table <- hconcat(df,ss,ms,f) # puts them side-by-side

Cmd> table # table without labels
(1,1)            2       26.881        13.44       2.5264
(2,1)           18        95.76         5.32            1

Now use setlabels() to add row and column labels.

Cmd> setlabels(table,structure(vector("Treatment","Error"),\
  vector("DF","SS","MS","F")))

Cmd> table # table with labels
                    DF           SS           MS            F
Treatment            2       26.881        13.44       2.5264
Error               18        95.76         5.32            1
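If you want to check the arithmetic outside MacAnova, here is the treatment sum of squares computation as a small Python sketch (Python is not part of MacAnova; this is just a cross-check of the numbers above):

```python
# Cross-check of the treatment sum of squares, in plain Python.
means = [29.12, 30.13, 27.39]   # treatment means from the example
n_per_group = 7                 # observations per treatment

grandmean = sum(means) / len(means)   # OK because sample sizes are equal
ss_trt = n_per_group * sum((m - grandmean) ** 2 for m in means)

print(round(grandmean, 2))  # 28.88
print(round(ss_trt, 3))     # 26.881, matching the MacAnova output
```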

Why am I getting an error message when I use contrast()?

Q. I am trying to answer Ex. 5.2.b.

Here is some output which ends with an obscure error message. I don't know what I'm doing wrong. I'm trying to construct a contrast but don't quite understand or know how to calculate the contrast. Please assist.

Cmd> muhats <- vector(3.2892, 10.256, 8.1157, 8.1825, 7.5622); muhats
(1)       3.2892       10.256       8.1157       8.1825       7.5622

Cmd> alphahats <- muhats-sum(muhats)/5; alphahats
(1)      -4.1919       2.7749      0.63458      0.70138      0.08108

Cmd> w <- vector(vector(1,1)/2,-vector(1,1,1)/3); w
(1)          0.5          0.5     -0.33333     -0.33333     -0.33333

Cmd> vector(sum(w),sum(w*muhats), sum(w*alphahats))
(1)   1.1102e-16      -1.1809      -1.1809

Cmd> treatment <- vector(1,2,3,4,5)

Cmd> treatment <- factor(treatment)

Cmd> stuff <- contrast(treatment,w); stuff
ERROR: no active non-regression GLM model

A. The problem with what you are trying to do is that contrast() works only after you have done an ANOVA using anova(). It uses some remnants of that computation that were not necessarily printed to compute the contrast.

The error message is a little arcane, I'm afraid. A GLM (Generalized Linear Model) is what any of a number of commands including anova(), regress(), logistic(), ... sets up. If you haven't run one of these commands, no GLM model is "active."

You have already computed what would be the estimate component of the output of contrast() as sum(w*alphahats). To get the standard error (se component) and SS (ss component) you do a "white box" computation using the appropriate formulas. For example, if w is your vector of weights and n is the vector of sample sizes (or a scalar if all sample sizes are equal) and mse your MSE, then

  Cmd> se <- sqrt(sum(w^2/n)*mse)

computes the standard error.
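To make the formula concrete, here is the same computation as a Python sketch (not MacAnova), using the muhats from the question; the sample size n and the MSE below are made-up numbers for illustration only:

```python
import math

# Contrast estimate and standard error "by hand", mirroring the
# MacAnova computation. n and mse here are hypothetical values.
muhats = [3.2892, 10.256, 8.1157, 8.1825, 7.5622]
w = [0.5, 0.5, -1/3, -1/3, -1/3]   # contrast weights, sum(w) = 0

estimate = sum(wi * mi for wi, mi in zip(w, muhats))   # sum(w*muhats)

n = 6        # hypothetical common sample size
mse = 2.5    # hypothetical MSE
se = math.sqrt(mse * sum(wi ** 2 / n for wi in w))     # sqrt(mse*sum(w^2/n))

print(round(estimate, 4))  # -1.1809, matching the MacAnova output above
print(round(se, 4))
```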


Plots to diagnose non-constant standard deviation

Q. I am still unsure how to do Pr.6.1.a. I don't understand how to plot the data once it is transformed, nor do I understand what I'm seeing once the data is transformed looking at these two columns. I need some more explanation. Does the cloud seeding example have to do with understanding transformations?

Cmd> y
 (1)           97           83           85           64           52
 (6)           48           96           87           84           72
(11)           56           58           92           78           78
(16)           63           44           49           95           81
(21)           79           74           50           53

Cmd> p<- .0001; hconcat(log(y),(y^p-1)/p)
 (1,1)       4.5747       4.5758
 (2,1)       4.4188       4.4198
 (3,1)       4.4427       4.4436
 (4,1)       4.1589       4.1597
 (5,1)       3.9512        3.952
 (6,1)       3.8712        3.872
 (7,1)       4.5643       4.5654
 (8,1)       4.4659       4.4669
 (9,1)       4.4308       4.4318
(10,1)       4.2767       4.2776
(11,1)       4.0254       4.0262
(12,1)       4.0604       4.0613
(13,1)       4.5218       4.5228
(14,1)       4.3567       4.3577
(15,1)       4.3567       4.3577
(16,1)       4.1431        4.144
(17,1)       3.7842       3.7849
(18,1)       3.8918       3.8926
(19,1)       4.5539       4.5549
(20,1)       4.3944       4.3954
(21,1)       4.3694       4.3704
(22,1)       4.3041        4.305
(23,1)        3.912       3.9128
(24,1)       3.9703       3.9711

A. First, to your question about the two columns computed as I presented in class.

All you were supposed to see was confirmation that the Box-Cox transformation with p = 0 is essentially the same as the natural log.

The column on the right is Box-Cox with p very close to 0 and it is almost the same as the left column of natural logs. If you used p = .0000001 it would be even closer.
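You can verify this limit numerically in any language; here is a small Python sketch (not MacAnova) comparing (y^p - 1)/p with log(y) as p shrinks:

```python
import math

# Numerical check that the Box-Cox transform (y^p - 1)/p approaches
# log(y) as p -> 0, using one of the data values from the example.
y = 97.0
for p in (1e-4, 1e-7):
    bc = (y**p - 1) / p
    print(p, bc, math.log(y))   # bc gets closer to log(y) as p shrinks
```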

There are two kinds of plots that shed light on whether variances are the same in different groups.

One is a plot of standard deviation and/or variance vs the mean.

Cmd> stats <- tabs(y,treat,mean:T,stddev:T)

Cmd> plot(stats$mean, stats$stddev, ymin:0)

When the standard deviation is the same in every group, you should get a horizontal pattern of points. When the standard deviation increases with the mean, the plot should trend up to the right. When the standard deviation decreases as the mean increases, the plot should trend down to the right.

I added the ymin:0 to the argument list so that the zero line would be in the plot. This gives a better perspective as to the pattern.

A variant on this is to plot the log of the standard deviations against the log of the means.

Cmd> plot(log(stats$mean), log(stats$stddev))
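The group summaries that tabs() produces can be computed in any language; here is a Python sketch with made-up data illustrating the pattern where the standard deviation grows with the mean:

```python
import statistics

# Per-group means and standard deviations, as tabs(y,treat,mean:T,stddev:T)
# computes them in MacAnova. The data here are hypothetical; the point is
# the pattern: if sd grows with the mean, (mean, sd) points trend up-right.
groups = {
    1: [2.1, 2.4, 1.9, 2.2],
    2: [5.0, 6.1, 4.2, 5.5],
    3: [11.8, 14.9, 9.7, 13.0],
}
means = [statistics.mean(v) for v in groups.values()]
sds = [statistics.stdev(v) for v in groups.values()]
for m, s in zip(means, sds):
    print(round(m, 3), round(s, 3))   # sd increases along with the mean
```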

The other type of plot is a plot of residuals against the fitted values, which in this case are just the group means. You must first run anova() and then resvsyhat().

Cmd> anova("y=treat")

Cmd> resvsyhat()

After trying a transformation, say, doing y1 <- sqrt(y), you can make the same plots based on y1 (you have to do anova("y1=treat") before using resvsyhat() again).

Yes, the cloud seeding example has to do with using transformations. There's a big difference between the standard deviations of the two treatments. Taking a suitable transformation helps a lot and also makes the distributions more normal.

You can find a transformation by trial and error, trying "standard" transformations such as log, sqrt, cube root (y^(1/3)), reciprocal (1/y), 1/sqrt(y), ... or you can use boxcoxvec() (see the handout on Checking ANOVA Assumptions).

In Problem 6.1, the data are percentages. The text suggests that in some cases the so-called arcsine transformation, actually arcsine(sqrt(y)), is appropriate, with proportions = percentages/100.

You can compute this in MacAnova with y by

  Cmd> y1 <- asin(sqrt(y/100))

You divide y by 100 to convert a percent to a proportion.
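If you want to check the transformation outside MacAnova, here is the equivalent in Python (a sketch; the percentage values below are illustrative, not from the problem):

```python
import math

# The arcsine-square root transformation for percentage data,
# matching y1 <- asin(sqrt(y/100)) in MacAnova.
def arcsine_sqrt(pct):
    """Transform a percentage (0-100) via asin(sqrt(proportion))."""
    return math.asin(math.sqrt(pct / 100))

print(round(arcsine_sqrt(50), 4))   # pi/4 = 0.7854
print(round(arcsine_sqrt(100), 4))  # pi/2 = 1.5708
```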

You could also try various powers using boxcoxvec(). But also, since 100 - y is the percent of non-big blue stem and is positive, you might use boxcoxvec() to find a power to transform 100 - y.

There are actually at least three transformations that work well here. Although they are different, they have a very similar effect on y.


What does boxcox() do?

Q. I don't understand the usage of boxcox(). I have looked it up in MacAnova but the explanation is too cryptic. Can you make it more transparent?

Cmd> help(boxcox)
boxcox(var,Pow) computes the Box-Cox transformation of the data in
vector or matrix var.  When var is a matrix, the transformation is
applied to each column separately.  If GM is the geometric mean of the
values in a vector, boxcox(y,Pow) computes (y^Pow-1)/(Pow*(GM)^(Pow-1))
when Pow != 0, and GM*log(y) when Pow == 0.  Boxcox is implemented as a
macro.

A. It is fairly cryptic. The difference from the Box-Cox transformation as I described it in lecture is that, for the sake of simplicity, I did not include the factor GM^(Pow-1) in the denominator. This value is the same for every case you are transforming and hence does not affect the usefulness of the transformation for correcting a problem with ANOVA data.

GM is the geometric mean, which you can calculate by finding the natural logs of the data, calculating their mean, and then taking the natural antilog (exponential) of that mean.

Cmd> Y # data to be transformed
(1)      5.3222      4.9979      3.6147      4.5278      5.6057

Cmd> boxcox(Y,.5) # Box-Cox equivalent of sqrt(Y) = Y^(1/2)
(1)      5.7023      5.3908      3.9321      4.9208      5.9669

Cmd> GM <- exp(describe(log(Y),mean:T)); GM # geometric mean
(1)      4.7588

Cmd> (Y^.5 - 1)/(.5*(GM^(.5- 1))) # Box-Cox "by hand"
(1)      5.7023      5.3908      3.9321      4.9208      5.9669

When you actually come to use a transformation, even when the choice was guided by boxcoxvec(), you seldom if ever use boxcox(), but just the power y^p (or log(y) or log10(y) when p = 0 is indicated).

Cmd> sqrt(Y) # doesn't look much like the Box-Cox values
(1)       2.307      2.2356      1.9012      2.1279      2.3676

Cmd> cor(sqrt(Y),boxcox(Y,.5))[1,2] # correlation is 1
(1,1)           1

This last shows that the correlation of sqrt(Y) and boxcox(Y,.5) is 1 and hence they would plot exactly on a straight line. Thus they are equivalent from the point of view of correcting data that do not satisfy ANOVA assumptions.
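Here is the same computation as a Python sketch (not MacAnova), reproducing the geometric-mean scaling from the help text and confirming that the correlation with sqrt(Y) is exactly 1:

```python
import math

# Reproducing the boxcox() macro's formula from the help text:
# (y^Pow - 1)/(Pow*GM^(Pow - 1)), where GM is the geometric mean.
Y = [5.3222, 4.9979, 3.6147, 4.5278, 5.6057]
Pow = 0.5

gm = math.exp(sum(math.log(y) for y in Y) / len(Y))      # geometric mean
bc = [(y**Pow - 1) / (Pow * gm**(Pow - 1)) for y in Y]   # Box-Cox values
sq = [math.sqrt(y) for y in Y]                           # plain sqrt(Y)

def corr(a, b):
    """Pearson correlation, computed directly."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

print(round(gm, 4))             # about 4.7588, as in the MacAnova output
print(round(bc[0], 4))          # about 5.7023
print(round(corr(sq, bc), 6))   # 1.0: bc is an affine function of sqrt(Y)
```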


What are Cook's D, HII and rankits?

Q. I followed the example in the lecture for Pr. 6.1. Perhaps I'm completely misled since you didn't quite go over that. What is Cook's D and HII? What does rankits mean?

A. HII is a vector always computed by anova(), regress() and other similar commands. An element HII[j] is sometimes called the leverage of case j. It enters into the standard deviation of the jth residual as sigma*sqrt(1 - HII[j]). In the one-way ANOVA, HII[j] for a case in group i is 1/n[i], so the estimated standard deviation of a residual in group i is sqrt(mse*(1 - 1/n[i])).

Cook's D is a measure of how much a particular case influences the parameter estimates. In an ANOVA with equal sample sizes it is primarily affected by the size of the residual. With unequal sample sizes, the same size residual in a small group has a greater effect on estimates than in a large group, so D is larger for large residuals in small groups than for equivalent residuals in large groups. The Cook it's named after is Professor Dennis Cook of the School of Statistics. He developed it about 25 years ago and it has become a standard part of the output of almost every good computer program for doing regression or ANOVA.

A rankit is another name for a normal score. As computed in MacAnova, the normal score corresponding to the ith response in order of size is the normal probability point corresponding to probability (i - 3/8)/(n + 1/4), that is invnor((i-.375)/(n+.25)).

Cmd> Y # a short vector of data
(1)      5.3222      4.9979      3.6147      4.5278      5.6057

Cmd> rankits(Y)
(1)      0.4972           0     -1.1798     -0.4972      1.1798

Cmd> rank(Y)
(1)           4           3           1           2           5

Cmd> invnor((rank(Y) - .375)/(5 + .25))
(1)      0.4972           0     -1.1798     -0.4972      1.1798
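Here is the same normal-score computation as a Python sketch (not MacAnova), using the standard library's normal quantile function:

```python
import statistics

# Normal scores (rankits) via invnor((rank - 3/8)/(n + 1/4)),
# mirroring the MacAnova output above. Assumes no ties in the data.
Y = [5.3222, 4.9979, 3.6147, 4.5278, 5.6057]
n = len(Y)
ranks = [sorted(Y).index(y) + 1 for y in Y]   # rank 1 = smallest
norm = statistics.NormalDist()
scores = [norm.inv_cdf((r - 0.375) / (n + 0.25)) for r in ranks]

print(ranks)                           # [4, 3, 1, 2, 5]
print([round(s, 4) for s in scores])   # matches rankits(Y) above
```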

I used to use "rankit" a lot; now I try to use "normal score".

Here is some MacAnova output.

Cmd> readdata("",treat,y) # read data
Read from file "D:\stats 5303\pr6-1.txt"
Column 1 saved as REAL vector treat 
Column 2 saved as REAL vector y 

Cmd> treat <- factor(treat)

Cmd> list(treat,y)
treat           REAL   24    FACTOR with 6 levels
y               REAL   24   

Cmd> n <- tabs(y,treat,count:T);n # sample sizes
(1)            4            4            4            4            4
(6)            4

Cmd> 1/n
(1)         0.25         0.25         0.25         0.25         0.25
(6)         0.25

Cmd> anova("y=treat")
Model used is y=treat
                DF           SS           MS
CONSTANT         1   1.2298e+05   1.2298e+05
treat            5       6398.3       1279.7
ERROR1          18        323.5       17.972

Cmd> list(HII)# anova() has created vector HII
HII             REAL   24   

Cmd> unique(HII) # every value is 1/4, the sample size for each group
(1)         0.25

Cmd> stuff <- resid() # compute residuals and other quantities

Cmd> stuff[run(5),] # first 5 rows of resid output
          Depvar    StdResids          HII     Cook's D      t-stats
(1)           97      0.54475         0.25     0.016486      0.53382
(2)           83      0.20428         0.25    0.0023184      0.19876
(3)           85      0.95332         0.25     0.050489      0.95077
(4)           64      -1.1576         0.25     0.074446      -1.1694
(5)           52      0.40856         0.25    0.0092736      0.39891

Cmd> J <- grade(abs(stuff[,5]),down:T)

Cmd> #J contains case numbers in decreasing order of abs(t_stats)

Cmd> stuff[J[run(10)],]# rows with 10 largest abs(t_stats)
           Depvar    StdResids          HII     Cook's D      t-stats
(17)           44      -1.7704         0.25      0.17414      -1.8933
(12)           58       1.6343         0.25      0.14838        1.721
(22)           74       1.5662         0.25      0.13627       1.6377
(11)           56       1.4981         0.25      0.12468       1.5561
(16)           63        -1.43         0.25       0.1136      -1.4761
(8)            87       1.2938         0.25     0.092993       1.3202
(14)           78      -1.1576         0.25     0.074446      -1.1694
(4)            64      -1.1576         0.25     0.074446      -1.1694
(6)            48      -1.0895         0.25     0.065945      -1.0955
(10)           72       1.0214         0.25      0.05796       1.0227

There are no really large residuals and hence no large values of Cook's D.


How can I compute the power for a contrast?

Q. I am working on Stats 5303 Ex. 7.5. I am not clear on what to use for the number of groups when I am doing contrasts within the larger study. For example in part A I used g = 6; should I be using something different?

Here are some computations I made for part (a):

Cmd> g <- 6;mse <- 0.25;n <- 4;errorrate <- 0.01;D <- 1

If the high cost tapes average is 1 unit different from the low cost tapes, then I think we should use the optimistic noncentrality parameter formula. The pessimistic noncentrality parameter wouldn't make sense, because if only two of the treatment effects were 1 unit apart then the average would be less than one.

Cmd> noncen_opt <- g*(D^2)/(4*mse); noncen_opt
(1)            6

Cmd> power(noncen_opt,g,errorrate,n)
(1)      0.73125

Using the optimistic noncentrality parameter formula we get a power of 0.73, much greater than using the pessimistic noncentrality parameter formula.

Cmd> noncen_pess <- D^2/(2*mse);noncen_pess
(1)            2

Cmd> power(noncen_pess,g,errorrate,n)
(1)      0.18628

Does the pessimistic estimate really not play a role in this question?

A. What you have done makes sense in the context of an overall F-test of the hypothesis all 6 treatments are the same, but I don't think it addresses what is wanted in part (a) (which is, unfortunately, not very clearly written).

The test referred to in (a) is a test of the contrast with coefficients (1/4, 1/4, -1/2, -1/2, 1/4, 1/4). See Section 7.4. In MacAnova, you will have to use power2() instead of power(), using numerator df = 1 and denominator df = ng - g = g(n-1). In computing the noncentrality parameter you use the value 1 for sum(w*alpha), the value of the contrast, taken from the statement of the problem.

The optimistic/pessimistic/intermediate choice for alternative is meaningful when you are choosing a sample size so that the power of the overall F on g-1 and N-g d.f. is a specified number. It is not relevant for testing a contrast.


Choice of weights to define a contrast that involves more than 2 treatments

Q. In Ex. 7.5 (b), how does the fact that I'm comparing only brand A and brand B come into play?

On Page 2 of the lab handout, in the methods for "determining sample size to meet a confidence interval margin of error goal", is the line:

Cmd> n <- t_025^2*mse*((1)^2 + (-1)^2)/error_margin^2

I am confused by the 1 and -1. The sheet says they are the wi's for two means. How was it decided that there should be one 1, one -1 and the rest should be all 0?

The handout said that the confidence interval width was 0.5 so I don't understand why they would make D = 2. How do I pick these 1 and -1 terms for part B? I am wondering if I can follow the same form on the handout or if it needs to be modified in some way to fit comparing 2 types of brand A with 2 types of brand B.

A. On your confusion about -1, 1 and 0's as values for the wi's:

A pairwise contrast compares the effects of two treatments. For instance, in comparing treatments 3 and 5 the contrast would have the form alpha3 - alpha5. When g = 6, this can be written as

(0)alpha1 + (0)alpha2 + (1)alpha3 + (0)alpha4 + (-1)alpha5 + (0)alpha6

that is, sum(w*alpha) where w is vector(0,0,1,0,-1,0). In general, any pairwise contrast will have one 1, one -1 and g-2 0's.

But that's not directly relevant here, since this is not a pairwise contrast comparing just two treatments. There are 2 treatments, 1 and 2, involving brand A, and 2 treatments (3 and 4) involving brand B. The appropriate contrast would give equal weights to treatments 1 and 2, the negative of these weights to treatments 3 and 4, and weight 0 to treatments 5 and 6.

The '2' in part b is not D, but, in Oehlert's notation, W = the desired width of the confidence interval. It corresponds to a margin of error M = 1 in the notation in the handout.

Query on Exercise 6.6

Q. On Exercise 6.6, I am not sure if I should use a transformation or not, because the data look OK.

Also, the question asks me to compare different conditions. I wonder if I should do a contrast for all of them or if I can use a pairwise test.

A. On Ex. 6.6: non-constant variance is not the only thing to worry about. You should also be looking for outliers.

Fixing data is often a stepwise procedure -- you first have to fix one problem before another problem is apparent.

Even when there are apparent outliers, you should probably first try to find a transformation. It is not uncommon for what looks like an outlier to be revealed as nothing out of the ordinary when you take a transformation.

If an outlier persists and is extreme, you probably need to delete it.

Use output from resid() to identify any outliers, preferably using an objective Bonferronized t-test of the t-statistic values in the last column (externally Studentized residuals). If you find an outlier, run the ANOVA without it and again look at residual plots. If there is still an outlier, try omitting that case too and run another ANOVA.

When you think you have outlier-free data, seek a transformation to stabilize the variance if it looks like that's called for.

Here's a fairly easy way to re-run an ANOVA leaving out a case, say case 6:

  Cmd> anova("{y[-6]} = {treat[-6]}",fstat:T)

To leave out several cases, say 6, 10, and 15, do something like this:

  Cmd> J <- vector(6, 10, 15)

  Cmd> anova("{y[-J]} = {treat[-J]}",fstat:T)

Or, you could create new variables by y1 <- y[-J] and treat1 <- treat[-J] followed by anova("y1=treat1").

You need to read Ex. 6.6 carefully. There are three questions being asked:

  1. Whether the instrument performs the same (relative to the standard method) across all conditions
  2. Whether it performs the same at both coal-fired plants.
  3. Whether it performs the same when the smoke is artificially generated and when it is real (that is from plant no. 1 or plant no. 2).

What kind of a test do you need for 1? What kind of a test do you need for 2 and 3?

There isn't an absolutely unique answer, but some answers are better than others. Using pairwise() implicitly tests a lot of hypotheses. Generally, you don't want to do more tests than necessary, since that increases the Type I error rate; or, if you protect against that using the HSD or BSD, it increases the Type II error rate.


Query on Ex. 7.5

Q. I'm still confused about Ex. 7.5. I don't understand how to figure the information about contrasts into the calculation of power and sample size based on the handouts you've given us. I'm used to calculating contrasts with a given data set including "treat". So I ignored the weights discussed and figured that zeta1 was 1. Please enlighten me!

Here is some output.

Cmd> power2(4,1,.01,18)
(1)      0.22475

Cmd> samplesize(4,6,.01,.2)
(1)            3

A. You need to recall what the contrast sum(w*ybars) you compute from a data set is estimating. It is estimating the quantity sum(w*alpha), defined in terms of the unknown parameters alphai.

As Oehlert (p. 158) makes clear, the non-centrality parameter of the t^2 = F statistic with 1 and dferror degrees of freedom is (in the equal sample size case)

zeta = n*sum(w*alpha)^2/{sigma^2*sum(w^2)}

so the n = 1 non-centrality parameter is

zeta1 = sum(w*alpha)^2/{sigma^2*sum(w^2)}

This explicitly involves the contrast weights. The denominator is the square of the (true) standard error of sum(w*ybars) when n = 1.

If the alternative is sum(w*alpha) = D, then zeta = n*D^2/{sigma^2*sum(w^2)} and you compute the power by power2(n*zeta1,1,alpha,df_error). You can't compute it using power(), since that works only for the F-test of H0: all alpha's = 0.

So in a power problem involving a contrast, you need to figure out what the contrast coefficients wi are and what is the value D = sum(w*alpha) that is the alternative. In 7.5 (a), you are looking at averages so w = vector(.25,.25, -.5,-.5,.25,.25) and D = 1.

Here's one way to do it.

Cmd> g <- 6; n <- 4

Cmd> df_error <- g*(n-1) # or n*g - g

Cmd> w <- vector(.25,.25,-.5,-.5,.25,.25);sum(w)
(1)           0

Cmd> alpha <- .01; D <- 1

Cmd> sigmasq <- .25

Cmd> zeta1 <- D^2/(sigmasq*sum(w^2)); zeta1
(1)      5.3333

Cmd> power2(n*zeta1,1,alpha,df_error)
(1)     0.94558

Cmd> 1 - cumF(invF(1-alpha,1,df_error),1,df_error,n*zeta1) #using non-cen F
(1)     0.94558
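If you happen to have Python with scipy available, you can cross-check power2() against the noncentral F distribution directly; this is only a sketch assuming scipy is installed, not MacAnova code:

```python
# Power for the contrast via the noncentral F distribution, mirroring
# power2(n*zeta1, 1, alpha, df_error) above. Assumes scipy is installed.
from scipy import stats

g, n = 6, 4
df_error = g * (n - 1)                              # 18
w = [0.25, 0.25, -0.5, -0.5, 0.25, 0.25]            # contrast weights
alpha, D, sigmasq = 0.01, 1.0, 0.25

zeta1 = D**2 / (sigmasq * sum(wi**2 for wi in w))   # 5.3333
crit = stats.f.ppf(1 - alpha, 1, df_error)          # F critical value
power = stats.ncf.sf(crit, 1, df_error, n * zeta1)  # P(noncentral F > crit)

print(round(power, 5))   # about 0.94558, matching the MacAnova output
```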

On 7.5 (b), the contrast is vector(.5,.5,-.5,-.5,0,0). The goal has to do with confidence interval width or margin of error size and you can't use samplesize(). n satisfies the equation

n = t_025^2*MSE*sum(w^2)/M^2

where t_025 is computed using df = g*(n-1).

Cmd> w1 <- vector(.5,.5,-.5,-.5,0,0) # contrast coefficients

Cmd> W <- 2; M <- W/2 # M = desired margin of error 

Cmd> t_025 <- 2 # starting value

Cmd> mse <- sigmasq

Cmd> n_new <- t_025^2*mse*sum(w1^2)/M^2; n_new # 1st stab
(1)           1

n = 1 won't work since it would mean dferror = g(n-1) = 0. So see if n = 2 will give a small enough margin of error.

Cmd> n_new <- 2

Cmd> t_025 <- invstu(1 - .025,g*(n_new - 1)); t_025
(1)      2.4469

Cmd> t_025*sqrt(sum(w1^2)*mse/n_new) # margin of error for n = 2
(1)     0.86511

Since .86511 < M = 1, n = 2 is an OK sample size.
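The trial-and-error search for n can be automated; here is a Python sketch (assuming scipy for the t quantile; not MacAnova code) that starts at n = 2 and increases n until the margin of error goal is met:

```python
# Smallest n meeting the margin-of-error goal for the contrast,
# following the iteration sketched above. Assumes scipy is installed.
import math
from scipy import stats

g = 6
w1 = [0.5, 0.5, -0.5, -0.5, 0.0, 0.0]   # contrast coefficients
mse, M = 0.25, 1.0                      # M = desired margin of error
sw2 = sum(w**2 for w in w1)             # sum(w1^2) = 1

n = 2                                    # n = 1 gives df_error = 0
while True:
    t_025 = stats.t.ppf(0.975, g * (n - 1))
    margin = t_025 * math.sqrt(sw2 * mse / n)
    if margin <= M:
        break
    n += 1

print(n, round(t_025, 4), round(margin, 5))  # 2 2.4469 0.86511
```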


Finding significant outliers

Q. Is there an easy way to find the "most significant" outliers in a data set?

A. There is a way to isolate the most significant outliers, as ranked by the externally studentized residuals computed by resid(). Whether it is really easy is problematical.

Here is an example. Before the following commands were given, anova() had been run.

Cmd> residinfo <- resid()

Cmd> residinfo[run(5),] # first 5 cases
         Depvar   StdResids         HII    Cook's D     t-stats
(1)      5.9708      2.2362     0.10417    0.058145      2.2907
(2)      6.0585     0.61866     0.10417   0.0044505     0.61643
(3)      6.1875     0.85774     0.10417   0.0085549     0.85641
(4)       6.151     0.69358     0.10417   0.0055936     0.69147
(5)      5.9484    -0.45228     0.10417   0.0023786    -0.45018

Cmd> K <- grade(abs(residinfo[,5]),down:T)[run(10)]

This last command computes the row numbers of the 10 largest values of t-stats in absolute value. Without down:T it would return the row numbers of the 10 smallest values which are of no interest.

Now select these rows from residinfo.

Cmd> residinfo[K,] 
          Depvar   StdResids         HII    Cook's D     t-stats
(1)       5.9708      2.2362     0.10417    0.058145      2.2907
(91)      6.1168       2.125     0.10417    0.052508      2.1704
(89)       5.461     -2.0877     0.10417    0.050679     -2.1302
(94)      5.8101      2.0684     0.10417     0.04975      2.1095
(50)      5.6916     -2.0574     0.10417    0.049222     -2.0977
(24)      5.2835     -2.0331     0.10417    0.048065     -2.0717
(56)      5.2896     -1.9552     0.10417     0.04445     -1.9885
(83)      5.8133     -1.9111     0.10417    0.042471     -1.9417
(72)      5.7462      1.8639     0.10417    0.040397      1.8916
(75)      5.9617     -1.8452     0.10417    0.039591     -1.8719

Note that the last column is in order of decreasing absolute value. You can identify the case from the row labels on the left. Case 1 has the most extreme outlier; case 91 is next, followed by case 89. No outlier is particularly extreme as can be seen by comparing to a Student's t probability point Bonferronized by n = number of cases.

Cmd> n <- nrows(residinfo); n
(1)          96

Cmd> invstu(1 - .025/n,DF[5]-1)
(1)      3.6076
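The grade() and invstu() steps translate directly into other languages; here is a Python sketch (assuming scipy; the t-statistics and the error degrees of freedom below are made-up values, not taken from the example above):

```python
# Sorting cases by |t-statistic| (what grade(abs(...), down:T) does) and
# computing a Bonferronized t cutoff. Assumes scipy; the t_stats and df
# are hypothetical illustrations.
from scipy import stats

t_stats = [2.2907, 0.61643, 0.85641, 0.69147, -2.4501]  # hypothetical
order = sorted(range(len(t_stats)), key=lambda i: abs(t_stats[i]),
               reverse=True)
print([i + 1 for i in order])   # case numbers, largest |t| first

n_cases = 96                    # number of cases, as in the example
df = 89                         # hypothetical error df minus 1
cutoff = stats.t.ppf(1 - 0.025 / n_cases, df)
print(round(cutoff, 2))         # Bonferronized two-sided 5% cutoff
```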

Making factors in standard order

Q. How can I create factors for treatments in standard order?

A. This is not hard to do using rep(). Type help(rep) for details on its use.

Suppose you have data in standard order for a three factor experiment, with factor levels a = 2, b = 2, c = 3 and n = 4 replicates, with all the values for a replicate grouped together.

Cmd> a <- 2; b <- 2; c <- 3; n <- 4

Cmd> Afactor <- factor(rep(run(a),b*c*n))

Cmd> Bfactor <- factor(rep(rep(run(b),rep(a,b)),c*n))

Cmd> Cfactor <- factor(rep(rep(run(c),rep(a*b,c)),n))

Afactor, Bfactor and Cfactor are the factor variables in standard order. For a randomized block experiment with each replicate in a block, you would also need a factor to indicate the replicates.

Cmd> repl <- factor(rep(run(n),rep(a*b*c,n)))

Cmd> list(Afactor,Bfactor,Cfactor,repl)
Afactor         REAL   48    FACTOR with 2 levels
Bfactor         REAL   48    FACTOR with 2 levels
Cfactor         REAL   48    FACTOR with 3 levels
repl            REAL   48    FACTOR with 4 levels

Here is what the first 24 (two replicates) values of these factors look like:

Cmd> hconcat(Afactor,Bfactor,Cfactor,repl)[run(24),] # first 2 reps
 (1,1)           1           1           1           1
 (2,1)           2           1           1           1
 (3,1)           1           2           1           1
 (4,1)           2           2           1           1
 (5,1)           1           1           2           1
 (6,1)           2           1           2           1
 (7,1)           1           2           2           1
 (8,1)           2           2           2           1
 (9,1)           1           1           3           1
(10,1)           2           1           3           1
(11,1)           1           2           3           1
(12,1)           2           2           3           1
(13,1)           1           1           1           2
(14,1)           2           1           1           2
(15,1)           1           2           1           2
(16,1)           2           2           1           2
(17,1)           1           1           2           2
(18,1)           2           1           2           2
(19,1)           1           2           2           2
(20,1)           2           2           2           2
(21,1)           1           1           3           2
(22,1)           2           1           3           2
(23,1)           1           2           3           2
(24,1)           2           2           3           2
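The rep() idiom translates directly into other languages; here is a Python sketch with two small helper functions playing the roles of the two ways rep() is used above:

```python
# Building factors in standard order, mimicking the rep() calls above:
# the first factor varies fastest, the replicate slowest.
a, b, c, n = 2, 2, 3, 4

def rep_each(values, times):
    """Repeat each element 'times' times, like rep(x, rep(times, len(x)))."""
    return [v for v in values for _ in range(times)]

def rep_whole(values, times):
    """Repeat the whole vector 'times' times, like rep(x, times)."""
    return list(values) * times

Afactor = rep_whole(range(1, a + 1), b * c * n)
Bfactor = rep_whole(rep_each(range(1, b + 1), a), c * n)
Cfactor = rep_whole(rep_each(range(1, c + 1), a * b), n)
repl    = rep_each(range(1, n + 1), a * b * c)

rows = list(zip(Afactor, Bfactor, Cfactor, repl))
print(len(rows))   # 48
print(rows[:4])    # [(1, 1, 1, 1), (2, 1, 1, 1), (1, 2, 1, 1), (2, 2, 1, 1)]
```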

The views and opinions expressed in this page are strictly those of the page author. The contents of this page have not been reviewed or approved by the University of Minnesota.

C Bingham
kb@umn.edu

Updated Fri Nov 8 17:38:21 CST 2002