This handout illustrates some of the features of MacAnova by analyzing some data in Moore and McCabe (M & M).
We first look the data for women in Ex. 1.23 on p. 28 of M & M. They are the SSHA (Survey of Study Habits and Attitudes) scores for 18 women freshmen in a college.
M A C A N O V A 4.13
An Interactive Program for Statistical Analysis and Matrix Algebra
For information on major features, type 'help(macanova)'
For information on linear models and GLM's, type 'help(glm)'
For latest information on changes, type 'help(news)'
For information on Macintosh version, type 'help(macintosh)'
Version of 01/15/03 (Power Macintosh [CW5])
Type 'help(copyright)' for copyright and warranty info
Copyright (C) 1994 - 2003 Gary W. Oehlert and Christopher Bingham
MacAnova home page: http://www.stat.umn.edu/macanova
New keyword phrases 'upper:T' on cumxxx() and invxxx() functions
Keywords 'dimensions' and 'margins' on sum() and other functions allow
easier computation of marginal totals.
Type 'help(updates_411)' and 'help(updates_412)' for a summary of
changes from release 1 of Version 4.11
Cmd> women <- vector(154,109,137,115,152,140,154,178,101,103,\
126,126,137,165,165,129,200,148) #enter the data
Cmd> is the prompt. You type any instructions (commands) for MacAnova just after the prompt. Note the space.
The symbol <- is used to assign to a named variable (women) immediately to the left of <- what is immediately to its right (vector(154, ..., 148)).
Command vector() to the right of <- gathers together the data inside (...) to create a vector, a variable containing several numbers. The backslash \ at the end of the line means the line is continued on the next line.
The result is that the line assigns to a newly created variable women the data on the right hand side.
Typing the name of a variable prints out its values:
Cmd> women # Data from Ex. 1.23 on p. 28 of Moore & McCabe (1) 154 109 137 115 152 (6) 140 154 178 101 103 (11) 126 126 137 165 165 (16) 129 200 148
Anything after "#" on a MacAnova command treated as a comment, but is otherwise ignored.
You can compute a lot of descriptive statistics at once using command describe(). (A convention I use is that the parentheses () are part of the name of the command):
These lines illustrate some other conventions that I use in handouts:Cmd> describe(women) component: n Sample size n (1) 18 component: min Minimum value (1) 101 component: q1 Sample lower quartile (1) 126 component: median Sample median (1) 138.5 component: q3 Sample upper quartile (1) 154 component: max Maximum value (1) 200 component: mean y-bar = sample mean (1) 141.06 component: var s^2 = sample variance
What is printed after the line with describe(women) is actually the value of describe(women).
describe() creates a special type of variable named a structure, with components named n, min, q1, median, q2, max, mean, and var. The reason the result was printed is that it was not assigned to a variable. Rather than throw the result away, MacAnova prints it. When you assign the result to a variable using <- you can save all these numbers.
Cmd> stats <- describe(women)# Save output from describe, a structure Cmd> stats # print the value of stats component: n (1) 18 component: min (1) 101 component: q1 (1) 126 component: median (1) 138.5 component: q3 (1) 154 component: max (1) 200 component: mean (1) 141.06 component: var (1) 698.88
You can look at individual pieces of a structure using a $ sign:
Cmd> stats$median # look at just component with name median (1) 138.5 Cmd> s <- sqrt(stats$var); s # compute and print standard deviation (1) 26.436
This last line shows you can do more than one thing on a line if you separate things using a semicolon ;. The command before ";" creates a new variable s as the square root of the variance that was computed by describe(). Then typing s after the semicolon is just like typing it after a prompt -; it prints the value.
Here's how to make a stem plot of these data. MacAnova always uses single digit leaves.
Cmd> stemleaf(women)
3 10|139
4 11|5
7 12|669
9 13|77
9 14|08
7 15|244
4 16|55
2 17|8
High 200
1|1 represents 11 Leaf digit unit = 1
This used double digit stems and the leaves represent 10s. Leaves are always truncated and not rounded. In this case there is no truncation since all the data values are exactly represented. If we add a value 144.9, it is truncated to 144 and goes on the 14 stem as a 4 and not a 5 as it would if it were rounded to 145:
Cmd> stemleaf(vector(women, 144.9))
3 10|139
4 11|5
7 12|669
9 13|77
( 3) 14|048
7 15|244
4 16|55
2 17|8
High 200 Outlier on the high side in leaf units
1|1 represents 11 Leaf digit unit = 1
As is probably obvious, vector(women,144.9) is a larger data set with the additional value 144.9.
The numbers to the left are sometimes called the depths; they are cumulative counts up from the smallest value and down from the largest value. Thus the depth on the 11 stem is 4 because there are 3 leaves on stem 10 and 1 leaf on stem 11, and the depth on the 15 stem is 7 = 1 + 1 + 2 + 3, the single High value + total number of leaves on stems 17, 16 and 15. The ( 3) marks the stem containing the median. The number is 3 because there are 3 leaves on that stem.
stemleaf() treats outliers a little differently from non-outliers. It recognizes 200 as a possible outlier on the high side and lists it separately after High.
You have some control of the number of stems that are used, and can suppress the depth.
Cmd> stemleaf(women,6,depth:F) # 6 limits the number of stems to 6 1*|0001 1t|22233 1f|44555 1s|667 High 20 1*|1 represents 110 Leaf digit unit = 10
Here we asked for no more than 6 stems and we got 4. Stem 1* combines stems 10 and 11. Stem 1t combines stems 12 and 13 (t stands for twos and threes), stem 1f combines stems 14 and 15 (f = fours and fives), and so on. Note that the stem digit now represents for 100's and the leaf digit represents 10's. A stem plot is not complete without this information.
depth:F is a what is called a keyword phrase. depth is the keyword name and F is its value. F stands for False which basically means no, so depth:F means no depth.
You can easily translate mathematical and algebraic formulas into
MacAnova commands. We can compute the two sums
and
in one line by
combining them using vector().
Cmd> vector(sum(women),sum(women^2))# compute two sums in one command (1) 2539 3.7002e+05
Here sum(women) computes
and sum(women^2) computes
.
vector() combines the two sums into a vector of length 2, that
is a variable with two values. This is printed since it was not
assigned to a new variable.
Incidentally, if you want, you can print more significant digits printed using command print() with keyword nsig.
Cmd> print(sum(women^2),nsig:8) NUMBER: (1) 370021
You can specify the number of significant digits displayed in all output by the command setoptions(nsig:m), where m is the number of digits you want.
Because you can translate formulas fairly easily into MacAnova commands, you can compute anything describe() does without using describe().
Cmd> n <- nrows(women); n # sample size (1) 18 Cmd> ybar <- sum(women)/n; ybar # sample mean (1) 141.06 Cmd> tmp <- sum((women - ybar)^2); tmp (1) 11881 Cmd> s <- sqrt(tmp/(n-1))#Cmd> s # sample standard deviation (1) 26.436
You can do this in one line, once you have computed ybar:
Cmd> sqrt(sum((women - ybar)^2)/(n-1)) (1) 26.436
but its often easier to use a an intermediate variable like tmp.
You can easily get the minimum and maximum using min() and max().
Cmd> vector(min(women), max(women)) (1) 101 200
To get the medians and quartiles you need to order the data using sort().
Cmd> sorted <- sort(women); sorted (1) 101 103 109 115 126 (6) 126 129 137 137 140 (11) 148 152 154 154 165 (16) 165 178 200
A new variable sorted has been created with the ordered (sorted) data values.
Cmd> vector(sorted[1], sorted[n]) # minimum and maximum values (1) 101 200
sorted[1] and sorted[n] are the first and last values of sorted, that is, the minimum and maximum since sorted is in increasing order. The elements inside [...] are called subscripts.
Here is another way to get these values, using a vector (colection of numbers) as a subscript.
Cmd> sorted[vector(1,n)] # get element 1 and n of sorted (1) 101 200
Since n = 18 is even, the median is the average of the two most central values, the 9th and 10th.
Cmd> (sorted[9] + sorted[10])/2 # median (1) 138.5
You could also use .5*sum(sorted[vector(9,10)]).
Graphical summaries are often at least as useful as numerical ones and a box plot is one of the easiest. You draw a box plot in MacAnova using command boxplot(). The trailing "\" indicates the command is continued on the next line:
Cmd> boxplot(women,xlab:"SSHA Score",\ title:"SSHA scores of 18 1st year college women")![]()
Here title:"SSHA scores of 18 1st year college women" puts a title for the plot and xlab:"SSHA Score" gives a label for the X-axis. You can label any graph the same way using keywords title, xlab and ylab.
If you prefer the more common box plot with vertically orented boxes, use vboxplot().
Cmd> vboxplot(women,ylab:"SSHA Score",\ title:"SSHA scores of 18 1st year college women") # label Y-axis this time![]()
If you want to check whether it is plausible that the data are normal, a normal scores or rankit plot may be helpful. This is essentially the same as normal quantile plot discussed and illustrated by M & M starting on p. 78. The MacAnova function that computes what is called the "z-score" on M&M's plots is rankits().
Cmd> plot(rankits(women),women,\ title:"Normal quantile plot of SSHA scores of 18 women",\ xlab:"Normal scores",ylab:"SSHA scores")![]()
If the data are approximately normal, a plot make this way should approximate a straight line. The plot shows no evidence the data are not normal.
You draw histograms using macro hist() (a macro is almost the same thing as a function and is used the same way). Although the sample size here is really pretty small to make a histogram, we do it just to show how its done.
Cmd> hist(women,6,xlab:"SSHA scores",freq:T)#histogram with 6 classes![]()
Keyword phrase freq:T directs hist() to make a frequency histogram for which the height of each bar is the number of values in the group. With relfreq:T , the bar heights are the relative frequencies or proportions.
Unfortunately, the boundaries of the bars chosen automatically in this simplest use of hist() are not "neat" in any way. The bar width is about 20 so you might want a histogram with bar edges at 100, 120, 140, 160, 180, 200. This is easy to do.
Cmd> hist(women,vector(100,120,140,160,180,200),\ xlab:"SSHA scores",freq:T) # histogram with 5 NEAT classes![]()
Instead of vector(100,120,140,160,180,200), you can use run(100,200,20). This computes numbers equally spaced by 20 starting with 100 and ending with 200. Even simpler is vector(120,20), where 120 is one of the potential bar boundaries and 20 is the desired bar width. vector(100,20) or vector(180,20) would do as well. vector(110,20), would give bar boundaries 90, 110, 130, ... .
Note the use of xlab:"SSHA scores" in these examples to label the X-axis.
It can be sometimes be hard to include high resolution plots in your homework. For most purpose a dumb plot will do just as well. Just include keyword phrase dumb:T as an argument to the plotting command. The graph will be in the regular output window and is created just using ordinary typewriter characters:
Cmd> hist(women,vector(120,20),dumb:T,\
xlab:"SSHA scores",freq:T,\
width:70,height:25) # histogram with 6 NEAT classes
Frequency histogram of women
++----------+----------+----------+----------+----------++
6+ ............ +
| . . |
| . . |
| . . |
5+ . . +
| . . |
F | . . |
r 4+............ ............ +
e |. . . . |
q |. . . . |
u 3+. . . ............ +
e |. . . . . |
n |. . . . . |
c 2+. . . . . +
y |. . . . . |
|. . . . . |
1+. . . . ............+
|. . . . . .|
|. . . . . .|
0+........................................................+
++----------+----------+----------+----------+----------++
100 120 140 160 180 200
SSHA scores
Here width:70 and height:25 tells MacAnova to make the plot 70 character widths wide and 25 lines tall and so it would fit in the handout. Otherwise it makes it fit the window.
Lets now use MacAnova to make the box plot of the percent Hispanic adults in each state in Example 1.17 on p. 47 of M & M. The data are in Table 1.2 on page 14.
The data set is long enough (n = 50) that it worth using one of the data file you can download from the web. It is in file ta01_002.txt in folder or directory Ch1.
Here is how you do it using command readdata():
Cmd> readdata("")
# File ta01_002.txt from publisher's web site
# Data from Table 1.2, p. 14 of IPS4
# Use readdata("",factors:F)
# Col. 1: state (CHARACTER vector of state names)
# Col. 2: hispanic = percent adult hispanics by state (2000)
Read from file "TP1:Stat5021:Stat5021S03:Data:Ch01:ta01_002.txt"
Column 1 saved as factor state
Column 2 saved as REAL vector hispanic
On a Macintosh, this brings up a dialog box similar to this one:
I had to use the pop-up menu above the list of files to get pointed at the Ch5 folder, and then I selected ex5_101.txt and clicked on Open.
This data set has two columns, a state abbreviation in column 1 and the percent in column 2 which was saved as variable hispanic. We can safely ignore state.
Cmd> hispanic[run(5)] # first 5 numbers in hispanic (1) 1.5 3.6 21.3 2.8 28.1 Cmd> stats <- describe(hispanic, n:T,mean:T, stddev:T); stats component: n (1) 50 component: mean (1) 6.796 component: stddev (1) 7.9817 Cmd> n <- stats$n; ybar <- stats$mean; s <- stats$stddev Cmd> vector(n,ybar,s) (1) 50 6.796 7.9817 Cmd> vboxplot(hispanic,lab:"Percent hispanic",\ title:"Boxplot Percent Hispanic by State")
Updated Mon Jan 27 07:46:32 CST 2003