University of Minnesota
School of Statistics

Frequently Asked Questions About Computing and MacAnova

Statistics 5303



Accessing PDF files

Q. I was able to access the class notes which you posted on the web as Acrobat PDF documents. However, they are not in a readable form. Do you have an alternative way or form to forward the class notes?

A. I don't have any alternative way or form to forward the class notes. I have been using this method for several years and students have generally had little trouble once they get the hang of it.

One thing I recommend because it gives you more control over the process is to download PDF documents explicitly. On a Windows computer, click on the link using the right button so that a menu pops up. On a Macintosh, hold down the button when you select the link until a popup menu appears. In both cases you should then select Save link as or Save target as on the popup menu. Save the file some place where you will be able to find it, perhaps on the "desktop". Then, leaving the browser, double click on the file's icon to read it and/or print it in Acrobat Reader.

Most computers come with Acrobat Reader installed. If you don't have it, you can download a free copy from http://www.adobe.com/products/acrobat/readstep2.html.

Most browsers have "plugins" that enable you to read PDF documents in your browser when you click on the link with the left button or without holding down the button. Unfortunately, when you do it this way the actual document is squirreled away in a hard-to-find cache folder, sometimes using an obscure name, and you may not be able to find it to read or print after you quit your browser. That's one reason why I recommend explicitly downloading PDF files.

If you do have trouble with this process, please let me know and I will try to help.


Printing neat ANOVA tables

Q. Once I have calculated all the variables for an ANOVA table, is there a way to construct it in MacAnova?

A. I'm not certain I understand the question. The following is based on the assumption that what you want is to print a nice neat ANOVA table in MacAnova.

Suppose you have means and an MSE from a 3 treatment experiment with 7 observations per treatment, so your dftreatment = 2 and dferror = 18.

Suppose the means are 29.12, 30.13, and 27.39 and MSE = 5.32 and you compute the components of the ANOVA table as follows:

Cmd> means <- vector(29.12, 30.13,27.39)

Cmd> grandmean <- sum(means)/3 # OK because sample sizes equal

Cmd> 7*sum((means-grandmean)^2) # SS_trt
(1)       26.881

Cmd> mse <- 5.32

Cmd> 18*mse # SS_error
(1)        95.76

Cmd> df <- vector(2, 18)

Cmd> ss <- vector(26.881, 95.76)

Cmd> ms <- ss/df

Cmd> f <- ms/ms[2]

Now you need to make a table with row and column labels containing these results.

Cmd> table <- hconcat(df,ss,ms,f) # puts them side-by-side

Cmd> table # table without labels
(1,1)            2       26.881        13.44       2.5264
(2,1)           18        95.76         5.32            1

Now use setlabels() to add row and column labels.

Cmd> setlabels(table,structure(vector("Treatment","Error"),\
  vector("DF","SS","MS","F")))

Cmd> table # table with labels
                    DF           SS           MS            F
Treatment            2       26.881        13.44       2.5264
Error               18        95.76         5.32            1
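If you want to check the arithmetic outside MacAnova, here is the treatment sum of squares computation as a small Python sketch (Python is not part of MacAnova; this is just a cross-check of the numbers above):

```python
# Cross-check of the treatment sum of squares, in plain Python.
means = [29.12, 30.13, 27.39]   # treatment means from the example
n_per_group = 7                 # observations per treatment

grandmean = sum(means) / len(means)   # OK because sample sizes are equal
ss_trt = n_per_group * sum((m - grandmean) ** 2 for m in means)

print(round(grandmean, 2))  # 28.88
print(round(ss_trt, 3))     # 26.881, matching the MacAnova output
```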

Why am I getting an error message when I use contrast()?

Q. I am trying to answer Ex. 5.2.b.

Here is some output which ends with an obscure error message. I don't know what I'm doing wrong. I'm trying to construct a contrast but don't quite understand or know how to calculate the contrast. Please assist.

Cmd> muhats <- vector(3.2892, 10.256, 8.1157, 8.1825, 7.5622); muhats
(1)       3.2892       10.256       8.1157       8.1825       7.5622

Cmd> alphahats <- muhats-sum(muhats)/5; alphahats
(1)      -4.1919       2.7749      0.63458      0.70138      0.08108

Cmd> w <- vector(vector(1,1)/2,-vector(1,1,1)/3); w
(1)          0.5          0.5     -0.33333     -0.33333     -0.33333

Cmd> vector(sum(w),sum(w*muhats), sum(w*alphahats))
(1)   1.1102e-16      -1.1809      -1.1809

Cmd> treatment <- vector(1,2,3,4,5)

Cmd> treatment <- factor(treatment)

Cmd> stuff <- contrast(treatment,w); stuff
ERROR: no active non-regression GLM model

A. The problem with what you are trying to do is that contrast() works only after you have done an ANOVA using anova(). It uses some remnants of that computation that were not necessarily printed to compute the contrast.

The error message is a little arcane, I'm afraid. A GLM (Generalized Linear Model) is what any of a number of commands including anova(), regress(), logistic(), ... sets up. If you haven't run one of these commands, no GLM model is "active."

You have already computed what would be the estimate component of the output of contrast() as sum(w*alphahats). To get the standard error (se component) and SS (ss component) you do a "white box" computation using the appropriate formulas. For example, if w is your vector of weights and n is the vector of sample sizes (or a scalar if all sample sizes are equal) and mse your MSE, then

  Cmd> se <- sqrt(sum(w^2/n)*mse)

computes the standard error.
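To make the formula concrete, here is the same computation as a Python sketch (not MacAnova), using the muhats from the question; the sample size n and the MSE below are made-up numbers for illustration only:

```python
import math

# Contrast estimate and standard error "by hand", mirroring the
# MacAnova computation. n and mse here are hypothetical values.
muhats = [3.2892, 10.256, 8.1157, 8.1825, 7.5622]
w = [0.5, 0.5, -1/3, -1/3, -1/3]   # contrast weights, sum(w) = 0

estimate = sum(wi * mi for wi, mi in zip(w, muhats))   # sum(w*muhats)

n = 6        # hypothetical common sample size
mse = 2.5    # hypothetical MSE
se = math.sqrt(mse * sum(wi ** 2 / n for wi in w))     # sqrt(mse*sum(w^2/n))

print(round(estimate, 4))  # -1.1809, matching the MacAnova output above
print(round(se, 4))
```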


Plots to diagnose non-constant standard deviation

Q. I am still unsure how to do Pr.6.1.a. I don't understand how to plot the data once it is transformed, nor do I understand what I'm seeing once the data is transformed looking at these two columns. I need some more explanation. Does the cloud seeding example have to do with understanding transformations?

Cmd> y
 (1)           97           83           85           64           52
 (6)           48           96           87           84           72
(11)           56           58           92           78           78
(16)           63           44           49           95           81
(21)           79           74           50           53

Cmd> p<- .0001; hconcat(log(y),(y^p-1)/p)
 (1,1)       4.5747       4.5758
 (2,1)       4.4188       4.4198
 (3,1)       4.4427       4.4436
 (4,1)       4.1589       4.1597
 (5,1)       3.9512        3.952
 (6,1)       3.8712        3.872
 (7,1)       4.5643       4.5654
 (8,1)       4.4659       4.4669
 (9,1)       4.4308       4.4318
(10,1)       4.2767       4.2776
(11,1)       4.0254       4.0262
(12,1)       4.0604       4.0613
(13,1)       4.5218       4.5228
(14,1)       4.3567       4.3577
(15,1)       4.3567       4.3577
(16,1)       4.1431        4.144
(17,1)       3.7842       3.7849
(18,1)       3.8918       3.8926
(19,1)       4.5539       4.5549
(20,1)       4.3944       4.3954
(21,1)       4.3694       4.3704
(22,1)       4.3041        4.305
(23,1)        3.912       3.9128
(24,1)       3.9703       3.9711

A. First, to your question about the two columns computed as I presented in class.

All you were supposed to see was confirmation that the Box-Cox transformation with p = 0 is essentially the same as the natural log.

The column on the right is Box-Cox with p very close to 0 and it is almost the same as the left column of natural logs. If you used p = .0000001 it would be even closer.
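You can verify this limit numerically in any language; here is a small Python sketch (not MacAnova) comparing (y^p - 1)/p with log(y) as p shrinks:

```python
import math

# Numerical check that the Box-Cox transform (y^p - 1)/p approaches
# log(y) as p -> 0, using one of the data values from the example.
y = 97.0
for p in (1e-4, 1e-7):
    bc = (y**p - 1) / p
    print(p, bc, math.log(y))   # bc gets closer to log(y) as p shrinks
```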

There are two kinds of plots that shed light on whether variances are the same in different groups.

One is a plot of standard deviation and/or variance vs the mean.

Cmd> stats <- tabs(y,treat,mean:T,stddev:T)

Cmd> plot(stats$mean, stats$stddev, ymin:0)

When the standard deviation is the same in every group, you should get a horizontal pattern of points. When the standard deviation increases with the mean, the plot should trend up to the right. When the standard deviation decreases as the mean increases, the plot should trend down to the right.

I added the ymin:0 to the argument list so that the zero line would be in the plot. This gives a better perspective as to the pattern.

A variant on this is to plot the log of the standard deviations against the log of the means.

Cmd> plot(log(stats$mean), log(stats$stddev))
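The group summaries that tabs() produces can be computed in any language; here is a Python sketch with made-up data illustrating the pattern where the standard deviation grows with the mean:

```python
import statistics

# Per-group means and standard deviations, as tabs(y,treat,mean:T,stddev:T)
# computes them in MacAnova. The data here are hypothetical; the point is
# the pattern: if sd grows with the mean, (mean, sd) points trend up-right.
groups = {
    1: [2.1, 2.4, 1.9, 2.2],
    2: [5.0, 6.1, 4.2, 5.5],
    3: [11.8, 14.9, 9.7, 13.0],
}
means = [statistics.mean(v) for v in groups.values()]
sds = [statistics.stdev(v) for v in groups.values()]
for m, s in zip(means, sds):
    print(round(m, 3), round(s, 3))   # sd increases along with the mean
```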

The other type of plot is a plot of residuals against the fitted values, which in this case are just the group means. You must first run anova() and then resvsyhat().

Cmd> anova("y=treat")

Cmd> resvsyhat()

After trying a transformation, say, doing y1 <- sqrt(y), you can make the same plots based on y1 (you have to do anova("y1=treat") before using resvsyhat() again).

Yes, the cloud seeding example has to do with using transformations. There's a big difference between the standard deviations of the two treatments. Taking a suitable transformation helps a lot and also makes the distributions more normal.

You can find a transformation by trial and error, trying "standard" transformations such as log, sqrt, cube root (y^(1/3)), reciprocal (1/y), 1/sqrt(y), ... or you can use boxcoxvec() (see the handout on Checking ANOVA Assumptions).

In Problem 6.1, the data are percentages. The text suggests that in some cases the so-called arcsine transformation, actually arcsine(sqrt(y)), is appropriate, with proportions = percentages/100.

You can compute this in MacAnova with y by

  Cmd> y1 <- asin(sqrt(y/100))

You divide y by 100 to convert a percent to a proportion.
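If you want to check the transformation outside MacAnova, here is the equivalent in Python (a sketch; the percentage values below are illustrative, not from the problem):

```python
import math

# The arcsine-square root transformation for percentage data,
# matching y1 <- asin(sqrt(y/100)) in MacAnova.
def arcsine_sqrt(pct):
    """Transform a percentage (0-100) via asin(sqrt(proportion))."""
    return math.asin(math.sqrt(pct / 100))

print(round(arcsine_sqrt(50), 4))   # pi/4 = 0.7854
print(round(arcsine_sqrt(100), 4))  # pi/2 = 1.5708
```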

You could also try various powers using boxcoxvec(). But also, since 100 - y is the percent of non-big blue stem and is positive, you might use boxcoxvec() to find a power to transform 100 - y.

There are actually at least three transformations that work well here. Although they are different, they have a very similar effect on y.


What does boxcox() do?

Q. I don't understand the usage of boxcox(). I have looked it up in MacAnova but the explanation is too cryptic. Can you make it more transparent?

Cmd> help(boxcox)
boxcox(var,Pow) computes the Box-Cox transformation of the data in
vector or matrix var.  When var is a matrix, the transformation is
applied to each column separately.  If GM is the geometric mean of the
values in a vector, boxcox(y,Pow) computes (y^Pow-1)/(Pow*(GM)^(Pow-1))
when Pow != 0, and GM*log(y) when Pow == 0.  Boxcox is implemented as a
macro.

A. It is fairly cryptic. The difference from the Box-Cox transformation as I described it in lecture is that, for the sake of simplicity, I did not include the factor GM^(Pow-1) in the denominator. This value is the same for every case you are transforming and hence does not affect the usefulness of the transformation for correcting a problem with ANOVA data.

GM is the geometric mean, which you can calculate by finding the natural logs of the data, calculating their mean, and then taking the natural antilog (exponential) of that mean.

Cmd> Y # data to be transformed
(1)      5.3222      4.9979      3.6147      4.5278      5.6057

Cmd> boxcox(Y,.5) # Box-Cox equivalent of sqrt(Y) = Y^(1/2)
(1)      5.7023      5.3908      3.9321      4.9208      5.9669

Cmd> GM <- exp(describe(log(Y),mean:T)); GM # geometric mean
(1)      4.7588

Cmd> (Y^.5 - 1)/(.5*(GM^(.5- 1))) # Box-Cox "by hand"
(1)      5.7023      5.3908      3.9321      4.9208      5.9669

When you actually come to use a transformation, even when the choice was guided by boxcoxvec(), you seldom if ever use boxcox(), but just the power y^p (or log(y) or log10(y) when p = 0 is indicated).

Cmd> sqrt(Y) # doesn't look much like the Box-Cox values
(1)       2.307      2.2356      1.9012      2.1279      2.3676

Cmd> cor(sqrt(Y),boxcox(Y,.5))[1,2] # correlation is 1
(1,1)           1

This last shows that the correlation of sqrt(Y) and boxcox(Y,.5) is 1 and hence they would plot exactly on a straight line. Thus they are equivalent from the point of view of correcting data that do not satisfy ANOVA assumptions.
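Here is the same computation as a Python sketch (not MacAnova), reproducing the geometric-mean scaling from the help text and confirming that the correlation with sqrt(Y) is exactly 1:

```python
import math

# Reproducing the boxcox() macro's formula from the help text:
# (y^Pow - 1)/(Pow*GM^(Pow - 1)), where GM is the geometric mean.
Y = [5.3222, 4.9979, 3.6147, 4.5278, 5.6057]
Pow = 0.5

gm = math.exp(sum(math.log(y) for y in Y) / len(Y))      # geometric mean
bc = [(y**Pow - 1) / (Pow * gm**(Pow - 1)) for y in Y]   # Box-Cox values
sq = [math.sqrt(y) for y in Y]                           # plain sqrt(Y)

def corr(a, b):
    """Pearson correlation, computed directly."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

print(round(gm, 4))             # about 4.7588, as in the MacAnova output
print(round(bc[0], 4))          # about 5.7023
print(round(corr(sq, bc), 6))   # 1.0: bc is an affine function of sqrt(Y)
```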


What are Cook's D, HII and rankits?

Q. I followed the example in the lecture for Pr. 6.1. Perhaps I'm completely misled since you didn't quite go over that. What is Cook's D and HII? What does rankits mean?

A. HII is a vector always computed by anova(), regress() and other similar commands. An element HII[j] is sometimes called the leverage of case j. It enters into the standard deviation of the jth residual as sigma*sqrt(1 - HII[j]). In the one-way ANOVA, HII[j] for a case in group i is 1/n[i], so the estimated standard deviation of a residual in group i is sqrt(mse*(1 - 1/n[i])).

Cook's D is a measure of how much a particular case influences the parameter estimates. In an ANOVA with equal sample sizes it is primarily affected by the size of the residual. With unequal sample sizes, the same size residual in a small group has a greater effect on estimates than in a large group, so D is larger for large residuals in small groups than for equivalent residuals in large groups. The Cook it's named after is Professor Dennis Cook of the School of Statistics. He developed it about 25 years ago and it has become a standard part of the output of almost every good computer program for doing regression or ANOVA.

A rankit is another name for a normal score. As computed in MacAnova, the normal score corresponding to the ith response in order of size is the normal probability point corresponding to probability (i - 3/8)/(n + 1/4), that is invnor((i-.375)/(n+.25)).

Cmd> Y # a short vector of data
(1)      5.3222      4.9979      3.6147      4.5278      5.6057

Cmd> rankits(Y)
(1)      0.4972           0     -1.1798     -0.4972      1.1798

Cmd> rank(Y)
(1)           4           3           1           2           5

Cmd> invnor((rank(Y) - .375)/(5 + .25))
(1)      0.4972           0     -1.1798     -0.4972      1.1798
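Here is the same normal-score computation as a Python sketch (not MacAnova), using the standard library's normal quantile function:

```python
import statistics

# Normal scores (rankits) via invnor((rank - 3/8)/(n + 1/4)),
# mirroring the MacAnova output above. Assumes no ties in the data.
Y = [5.3222, 4.9979, 3.6147, 4.5278, 5.6057]
n = len(Y)
ranks = [sorted(Y).index(y) + 1 for y in Y]   # rank 1 = smallest
norm = statistics.NormalDist()
scores = [norm.inv_cdf((r - 0.375) / (n + 0.25)) for r in ranks]

print(ranks)                           # [4, 3, 1, 2, 5]
print([round(s, 4) for s in scores])   # matches rankits(Y) above
```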

I used to use "rankit" a lot; now I try to use "normal score".

Here is some MacAnova output.

Cmd> readdata("",treat,y) # read data
Read from file "D:\stats 5303\pr6-1.txt"
Column 1 saved as REAL vector treat 
Column 2 saved as REAL vector y 

Cmd> treat <- factor(treat)

Cmd> list(treat,y)
treat           REAL   24    FACTOR with 6 levels
y               REAL   24   

Cmd> n <- tabs(y,treat,count:T);n # sample sizes
(1)            4            4            4            4            4
(6)            4

Cmd> 1/n
(1)         0.25         0.25         0.25         0.25         0.25
(6)         0.25

Cmd> anova("y=treat")
Model used is y=treat
                DF           SS           MS
CONSTANT         1   1.2298e+05   1.2298e+05
treat            5       6398.3       1279.7
ERROR1          18        323.5       17.972

Cmd> list(HII)# anova() has created vector HII
HII             REAL   24   

Cmd> unique(HII) # every value is 1/4, the sample size for each group
(1)         0.25

Cmd> stuff <- resid() # compute residuals and other quantities

Cmd> stuff[run(5),] # first 5 rows of resid output
          Depvar    StdResids          HII     Cook's D      t-stats
(1)           97      0.54475         0.25     0.016486      0.53382
(2)           83      0.20428         0.25    0.0023184      0.19876
(3)           85      0.95332         0.25     0.050489      0.95077
(4)           64      -1.1576         0.25     0.074446      -1.1694
(5)           52      0.40856         0.25    0.0092736      0.39891

Cmd> J <- grade(abs(stuff[,5]),down:T)

Cmd> #J contains case numbers in decreasing order of abs(t_stats)

Cmd> stuff[J[run(10)],]# rows with 10 largest abs(t_stats)
           Depvar    StdResids          HII     Cook's D      t-stats
(17)           44      -1.7704         0.25      0.17414      -1.8933
(12)           58       1.6343         0.25      0.14838        1.721
(22)           74       1.5662         0.25      0.13627       1.6377
(11)           56       1.4981         0.25      0.12468       1.5561
(16)           63        -1.43         0.25       0.1136      -1.4761
(8)            87       1.2938         0.25     0.092993       1.3202
(14)           78      -1.1576         0.25     0.074446      -1.1694
(4)            64      -1.1576         0.25     0.074446      -1.1694
(6)            48      -1.0895         0.25     0.065945      -1.0955
(10)           72       1.0214         0.25      0.05796       1.0227

There are no really large residuals and hence no large values of Cook's D.


How can I compute the power for a contrast?

Q. I am working on Stats 5303 Ex. 7.5. I am not clear on what to use for the number of groups when I am doing contrasts within the larger study. For example in part A I used g = 6; should I be using something different?

Here are some computations I made for part (a):

Cmd> g <- 6;mse <- 0.25;n <- 4;errorrate <- 0.01;D <- 1

If the high cost tapes average is 1 unit different from the low cost tapes, then I think we should use the optimistic noncentrality parameter formula. The pessimistic noncentrality parameter wouldn't make sense, because if only two of the treatment effects were 1 unit apart then the average would be less than one.

Cmd> noncen_opt <- g*(D^2)/(4*mse); noncen_opt
(1)            6

Cmd> power(noncen_opt,g,errorrate,n)
(1)      0.73125

Using the optimistic noncentrality parameter formula we get a power of 0.73, much greater than using the pessimistic noncentrality parameter formula.

Cmd> noncen_pess <- D^2/(2*mse);noncen_pess
(1)            2

Cmd> power(noncen_pess,g,errorrate,n)
(1)      0.18628

Does the pessimistic estimate really not play a role in this question?

A. What you have done makes sense in the context of an overall F-test of the hypothesis all 6 treatments are the same, but I don't think it addresses what is wanted in part (a) (which is, unfortunately, not very clearly written).

The test referred to in (a) is a test of the contrast with coefficients (1/4, 1/4, -1/2, -1/2, 1/4, 1/4). See Section 7.4. In MacAnova, you will have to use power2() instead of power(), using numerator df = 1 and denominator df = ng - g = g(n-1). In computing the noncentrality parameter you use the value 1 for sum(w*alpha), the value of the contrast, taken from the statement of the problem.

The optimistic/pessimistic/intermediate choice for alternative is meaningful when you are choosing a sample size so that the power of the overall F on g-1 and N-g d.f. is a specified number. It is not relevant for testing a contrast.


Choice of weights to define a contrast that involves more than 2 treatments

Q. In Ex. 7.5 (b), how does the fact that I'm comparing only brand A and brand B come into play?

On Page 2 of the lab handout, in the methods for "determining sample size to meet a confidence interval margin of error goal", is the line:

Cmd> n <- t_025^2*mse*((1)^2 + (-1)^2)/error_margin^2

I am confused by the 1 and -1. The sheet says they are the wi's for two means. How was it decided that there should be one 1, one -1 and the rest should be all 0?

The handout said that the confidence interval width was 0.5 so I don't understand why they would make D = 2. How do I pick these 1 and -1 terms for part B? I am wondering if I can follow the same form on the handout or if it needs to be modified in some way to fit comparing 2 types of brand A with 2 types of brand B.

A. On your confusion about -1, 1 and 0's as values for the wi's:

A pairwise contrast compares the effects of two treatments. For instance, in comparing treatments 3 and 5 the contrast would have the form alpha3 - alpha5. When g = 6, this can be written as

(0)alpha1 + (0)alpha2 + (1)alpha3 + (0)alpha4 + (-1)alpha5 + (0)alpha6

that is, sum(w*alpha) where w is vector(0,0,1,0,-1,0). In general, any pairwise contrast will have one 1, one -1 and g-2 0's.

But that's not directly relevant here, since this is not a pairwise contrast comparing just two treatments. There are 2 treatments, 1 and 2, involving brand A, and 2 treatments (3 and 4) involving brand B. The appropriate contrast would give equal weights to treatments 1 and 2, the negative of these weights to treatments 3 and 4, and weight 0 to treatments 5 and 6.

The '2' in part b is not D, but, in Oehlert's notation, W = the desired width of the confidence interval. It corresponds to a margin of error M = 1 in the notation in the handout.

Query on Exercise 6.6

Q. On Exercise 6.6, I am not sure if I should use a transformation or not, because the data look OK.

Also, the question asks me to compare different conditions. I wonder if I should do a contrast for all of them or if I can use a pairwise test.

A. On Ex. 6.6: non-constant variance is not the only thing to worry about. You should also be looking for outliers.

Fixing data is often a stepwise procedure -- you first have to fix one problem before another problem is apparent.

Even when there are apparent outliers, you should probably first try to find a transformation. It is not uncommon for what looks like an outlier to be revealed as nothing out of the ordinary when you take a transformation.

If an outlier persists and is extreme, you probably need to delete it.

Use output from resid() to identify any outliers, preferably using an objective Bonferronized t-test of the t-statistic values in the last column (externally Studentized residuals). If you find an outlier, run the ANOVA without it and again look at residual plots. If there is still an outlier, try omitting that case too and run another ANOVA.

When you think you have outlier-free data, seek a transformation to stabilize the variance if it looks like that's called for.

Here's a fairly easy way to re-run an ANOVA leaving out a case, say case 6:

  Cmd> anova("{y[-6]} = {treat[-6]}",fstat:T)

To leave out several cases, say 6, 10, and 15, do something like this:

  Cmd> J <- vector(6, 10, 15)

  Cmd> anova("{y[-J]} = {treat[-J]}",fstat:T)

Or, you could create new variables by y1 <- y[-J] and treat1 <- treat[-J] followed by anova("y1=treat1").

You need to read Ex. 6.6 carefully. There are three questions being asked:

  1. Whether the instrument performs the same (relative to the standard method) across all conditions
  2. Whether it performs the same at both coal-fired plants.
  3. Whether it performs the same when the smoke is artificially generated and when it is real (that is from plant no. 1 or plant no. 2).

What kind of a test do you need for 1? What kind of a test do you need for 2 and 3?

There isn't an absolutely unique answer, but some answers are better than others. Using pairwise() implicitly tests a lot of hypotheses. Generally, you don't want to do more tests than necessary, since that increases the Type I error rate; or, if you protect against that using the HSD or BSD, it increases the Type II error rate.


Query on Ex. 7.5

Q. I'm still confused about Ex. 7.5. I don't understand how to figure the information about contrasts into the calculation of power and sample size based on the handouts you've given us. I'm used to calculating contrasts with a given data set including "treat". So I ignored the weights discussed and figured that zeta1 was 1. Please enlighten me!

Here is some output.

Cmd> power2(4,1,.01,18)
(1)      0.22475

Cmd> samplesize(4,6,.01,.2)
(1)            3

A. You need to recall what the contrast sum(w*ybars) you compute from a data set is estimating. It is estimating the quantity sum(w*alpha), defined in terms of the unknown parameters alphai.

As Oehlert (p. 158) makes clear, the non-centrality parameter of the t^2 = F statistic with 1 and dferror degrees of freedom is (in the equal sample size case)

zeta = n*sum(w*alpha)^2/{sigma^2*sum(w^2)}

so the n = 1 non-centrality parameter is

zeta1 = sum(w*alpha)^2/{sigma^2*sum(w^2)}

This explicitly involves the contrast weights. The denominator is the square of the (true) standard error of sum(w*ybars) when n = 1.

If the alternative is sum(w*alpha) = D, then zeta = n*D^2/{sigma^2*sum(w^2)} and you compute the power by power2(n*zeta1,1,alpha,df_error). You can't compute it using power(), since that works only for the F-test of H0: all alpha's = 0.

So in a power problem involving a contrast, you need to figure out what the contrast coefficients wi are and what is the value D = sum(w*alpha) that is the alternative. In 7.5 (a), you are looking at averages so w = vector(.25,.25, -.5,-.5,.25,.25) and D = 1.

Here's one way to do it.

Cmd> g <- 6; n <- 4

Cmd> df_error <- g*(n-1) # or n*g - g

Cmd> w <- vector(.25,.25,-.5,-.5,.25,.25);sum(w)
(1)           0

Cmd> alpha <- .01; D <- 1

Cmd> sigmasq <- .25

Cmd> zeta1 <- D^2/(sigmasq*sum(w^2)); zeta1
(1)      5.3333

Cmd> power2(n*zeta1,1,alpha,df_error)
(1)     0.94558

Cmd> 1 - cumF(invF(1-alpha,1,df_error),1,df_error,n*zeta1) #using non-cen F
(1)     0.94558
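If you happen to have Python with scipy available, you can cross-check power2() against the noncentral F distribution directly; this is only a sketch assuming scipy is installed, not MacAnova code:

```python
# Power for the contrast via the noncentral F distribution, mirroring
# power2(n*zeta1, 1, alpha, df_error) above. Assumes scipy is installed.
from scipy import stats

g, n = 6, 4
df_error = g * (n - 1)                              # 18
w = [0.25, 0.25, -0.5, -0.5, 0.25, 0.25]            # contrast weights
alpha, D, sigmasq = 0.01, 1.0, 0.25

zeta1 = D**2 / (sigmasq * sum(wi**2 for wi in w))   # 5.3333
crit = stats.f.ppf(1 - alpha, 1, df_error)          # F critical value
power = stats.ncf.sf(crit, 1, df_error, n * zeta1)  # P(noncentral F > crit)

print(round(power, 5))   # about 0.94558, matching the MacAnova output
```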

On 7.5 (b), the contrast is vector(.5,.5,-.5,-.5,0,0). The goal has to do with confidence interval width or margin of error size and you can't use samplesize(). n satisfies the equation

n = t_025^2*MSE*sum(w^2)/M^2

where t_025 is computed using df = g*(n-1).

Cmd> w1 <- vector(.5,.5,-.5,-.5,0,0) # contrast coefficients

Cmd> W <- 2; M <- W/2 # M = desired margin of error 

Cmd> t_025 <- 2 # starting value

Cmd> mse <- sigmasq

Cmd> n_new <- t_025^2*mse*sum(w1^2)/M^2; n_new # 1st stab
(1)           1

n = 1 won't work since it would mean dferror = g(n-1) = 0. So see if n = 2 will give a small enough margin of error.

Cmd> n_new <- 2

Cmd> t_025 <- invstu(1 - .025,g*(n_new - 1)); t_025
(1)      2.4469

Cmd> t_025*sqrt(sum(w1^2)*mse/n_new) # margin of error for n = 2
(1)     0.86511

Since .86511 < M = 1, n = 2 is an OK sample size.
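The trial-and-error search for n can be automated; here is a Python sketch (assuming scipy for the t quantile; not MacAnova code) that starts at n = 2 and increases n until the margin of error goal is met:

```python
# Smallest n meeting the margin-of-error goal for the contrast,
# following the iteration sketched above. Assumes scipy is installed.
import math
from scipy import stats

g = 6
w1 = [0.5, 0.5, -0.5, -0.5, 0.0, 0.0]   # contrast coefficients
mse, M = 0.25, 1.0                      # M = desired margin of error
sw2 = sum(w**2 for w in w1)             # sum(w1^2) = 1

n = 2                                    # n = 1 gives df_error = 0
while True:
    t_025 = stats.t.ppf(0.975, g * (n - 1))
    margin = t_025 * math.sqrt(sw2 * mse / n)
    if margin <= M:
        break
    n += 1

print(n, round(t_025, 4), round(margin, 5))  # 2 2.4469 0.86511
```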


Finding significant outliers

Q. Is there an easy way to find the "most significant" outliers in a data set?

A. There is a way to isolate the most significant outliers, as ranked by the externally studentized residuals computed by resid(). Whether it is really easy is problematical.

Here is an example. Before the following commands were given, anova() had been run.

Cmd> residinfo <- resid()

Cmd> residinfo[run(5),] # first 5 cases
         Depvar   StdResids         HII    Cook's D     t-stats
(1)      5.9708      2.2362     0.10417    0.058145      2.2907
(2)      6.0585     0.61866     0.10417   0.0044505     0.61643
(3)      6.1875     0.85774     0.10417   0.0085549     0.85641
(4)       6.151     0.69358     0.10417   0.0055936     0.69147
(5)      5.9484    -0.45228     0.10417   0.0023786    -0.45018

Cmd> K <- grade(abs(residinfo[,5]),down:T)[run(10)]

This last command computes the row numbers of the 10 largest values of t-stats in absolute value. Without down:T it would return the row numbers of the 10 smallest values which are of no interest.

Now select these rows from residinfo.

Cmd> residinfo[K,] 
          Depvar   StdResids         HII    Cook's D     t-stats
(1)       5.9708      2.2362     0.10417    0.058145      2.2907
(91)      6.1168       2.125     0.10417    0.052508      2.1704
(89)       5.461     -2.0877     0.10417    0.050679     -2.1302
(94)      5.8101      2.0684     0.10417     0.04975      2.1095
(50)      5.6916     -2.0574     0.10417    0.049222     -2.0977
(24)      5.2835     -2.0331     0.10417    0.048065     -2.0717
(56)      5.2896     -1.9552     0.10417     0.04445     -1.9885
(83)      5.8133     -1.9111     0.10417    0.042471     -1.9417
(72)      5.7462      1.8639     0.10417    0.040397      1.8916
(75)      5.9617     -1.8452     0.10417    0.039591     -1.8719

Note that the last column is in order of decreasing absolute value. You can identify the case from the row labels on the left. Case 1 has the most extreme outlier; case 91 is next, followed by case 89. No outlier is particularly extreme as can be seen by comparing to a Student's t probability point Bonferronized by n = number of cases.

Cmd> n <- nrows(residinfo); n
(1)          96

Cmd> invstu(1 - .025/n,DF[5]-1)
(1)      3.6076
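The grade() and invstu() steps translate directly into other languages; here is a Python sketch (assuming scipy; the t-statistics and the error degrees of freedom below are made-up values, not taken from the example above):

```python
# Sorting cases by |t-statistic| (what grade(abs(...), down:T) does) and
# computing a Bonferronized t cutoff. Assumes scipy; the t_stats and df
# are hypothetical illustrations.
from scipy import stats

t_stats = [2.2907, 0.61643, 0.85641, 0.69147, -2.4501]  # hypothetical
order = sorted(range(len(t_stats)), key=lambda i: abs(t_stats[i]),
               reverse=True)
print([i + 1 for i in order])   # case numbers, largest |t| first

n_cases = 96                    # number of cases, as in the example
df = 89                         # hypothetical error df minus 1
cutoff = stats.t.ppf(1 - 0.025 / n_cases, df)
print(round(cutoff, 2))         # Bonferronized two-sided 5% cutoff
```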

Making factors in standard order

Q. How can I create factors for treatments in standard order?

A. This is not hard to do using rep(). Type help(rep) for details on its use.

Suppose you have data in standard order for a three factor experiment, with factor levels a = 2, b = 2, c = 3 and n = 4 replicates, with all the values for a replicate grouped together.

Cmd> a <- 2; b <- 2; c <- 3; n <- 4

Cmd> Afactor <- factor(rep(run(a),b*c*n))

Cmd> Bfactor <- factor(rep(rep(run(b),rep(a,b)),c*n))

Cmd> Cfactor <- factor(rep(rep(run(c),rep(a*b,c)),n))

Afactor, Bfactor and Cfactor are the factor variables in standard order. For a randomized block experiment with each replicate in a block, you would also need a factor to indicate the replicates.

Cmd> repl <- factor(rep(run(n),rep(a*b*c,n)))

Cmd> list(Afactor,Bfactor,Cfactor,repl)
Afactor         REAL   48    FACTOR with 2 levels
Bfactor         REAL   48    FACTOR with 2 levels
Cfactor         REAL   48    FACTOR with 3 levels
repl            REAL   48    FACTOR with 4 levels

Here is what the first 24 (two replicates) values of these factors look like:

Cmd> hconcat(Afactor,Bfactor,Cfactor,repl)[run(24),] # first 2 reps
 (1,1)           1           1           1           1
 (2,1)           2           1           1           1
 (3,1)           1           2           1           1
 (4,1)           2           2           1           1
 (5,1)           1           1           2           1
 (6,1)           2           1           2           1
 (7,1)           1           2           2           1
 (8,1)           2           2           2           1
 (9,1)           1           1           3           1
(10,1)           2           1           3           1
(11,1)           1           2           3           1
(12,1)           2           2           3           1
(13,1)           1           1           1           2
(14,1)           2           1           1           2
(15,1)           1           2           1           2
(16,1)           2           2           1           2
(17,1)           1           1           2           2
(18,1)           2           1           2           2
(19,1)           1           2           2           2
(20,1)           2           2           2           2
(21,1)           1           1           3           2
(22,1)           2           1           3           2
(23,1)           1           2           3           2
(24,1)           2           2           3           2
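The rep() idiom translates directly into other languages; here is a Python sketch with two small helper functions playing the roles of the two ways rep() is used above:

```python
# Building factors in standard order, mimicking the rep() calls above:
# the first factor varies fastest, the replicate slowest.
a, b, c, n = 2, 2, 3, 4

def rep_each(values, times):
    """Repeat each element 'times' times, like rep(x, rep(times, len(x)))."""
    return [v for v in values for _ in range(times)]

def rep_whole(values, times):
    """Repeat the whole vector 'times' times, like rep(x, times)."""
    return list(values) * times

Afactor = rep_whole(range(1, a + 1), b * c * n)
Bfactor = rep_whole(rep_each(range(1, b + 1), a), c * n)
Cfactor = rep_whole(rep_each(range(1, c + 1), a * b), n)
repl    = rep_each(range(1, n + 1), a * b * c)

rows = list(zip(Afactor, Bfactor, Cfactor, repl))
print(len(rows))   # 48
print(rows[:4])    # [(1, 1, 1, 1), (2, 1, 1, 1), (1, 2, 1, 1), (2, 2, 1, 1)]
```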

The views and opinions expressed in this page are strictly those of the page author. The contents of this page have not been reviewed or approved by the University of Minnesota.

C Bingham
kb@umn.edu

Updated Fri Nov 8 17:38:21 CST 2002