University of Minnesota
School of Statistics

Statistics 5021

Example of the use of Macanova
January 23, 2003

This handout illustrates some of the features of MacAnova by analyzing some data in Moore and McCabe (M & M).

We first look the data for women in Ex. 1.23 on p. 28 of M & M. They are the SSHA (Survey of Study Habits and Attitudes) scores for 18 women freshmen in a college.

                             M A C A N O V A   4.13

       An Interactive Program for Statistical Analysis and Matrix Algebra
            For information on major features, type 'help(macanova)'
          For information on linear models and GLM's, type 'help(glm)'
              For latest information on changes, type 'help(news)'
          For information on Macintosh version, type 'help(macintosh)'
                   Version of 01/15/03 (Power Macintosh [CW5])
             Type 'help(copyright)' for copyright and warranty info
        Copyright (C) 1994 - 2003 Gary W. Oehlert and Christopher Bingham
              MacAnova home page: http://www.stat.umn.edu/macanova

        New keyword phrases 'upper:T' on cumxxx() and invxxx() functions
     Keywords 'dimensions' and 'margins' on sum() and other functions allow
                     easier computation of marginal totals.
        Type 'help(updates_411)' and 'help(updates_412)' for a summary of
                     changes from release 1 of Version 4.11

Cmd> women <- vector(154,109,137,115,152,140,154,178,101,103,\
         126,126,137,165,165,129,200,148) #enter the data

“Cmd> ” is the prompt. You type any instructions (commands) for MacAnova just after the prompt. Note the space.

The symbol <- is used to assign to a named variable (women) immediately to the left of <- what is immediately to its right (vector(154, ..., 148)).

Command vector() to the right of <- gathers together the data inside (...) to create a vector, a variable containing several numbers. The backslash \ at the end of the line means the line is continued on the next line.

The result is that the line assigns to a newly created variable women the data on the right hand side.

Typing the name of a variable prints out its values:

Cmd> women # Data from Ex. 1.23 on p. 28 of Moore & McCabe
 (1)         154         109         137         115         152
 (6)         140         154         178         101         103
(11)         126         126         137         165         165
(16)         129         200         148

Anything after "#" on a MacAnova command treated as a comment, but is otherwise ignored.

You can compute a lot of descriptive statistics at once using command describe(). (A convention I use is that the parentheses “()” are part of the name of the command):

Cmd> describe(women)
component: n                   	Sample size n
(1)          18
component: min                 	Minimum value
(1)         101
component: q1                  	Sample lower quartile
(1)         126
component: median              	Sample median
(1)       138.5
component: q3                  	Sample upper quartile
(1)         154
component: max                 	Maximum value
(1)         200
component: mean                	y-bar = sample mean
(1)      141.06
component: var                 	s^2 = sample variance
These lines illustrate some other conventions that I use in handouts:

What is printed after the line with describe(women) is actually the value of describe(women).

describe() creates a special type of variable named a structure, with components named n, min, q1, median, q2, max, mean, and var. The reason the result was printed is that it was not assigned to a variable. Rather than throw the result away, MacAnova prints it. When you assign the result to a variable using <- you can save all these numbers.

Cmd> stats <- describe(women)# Save output from describe, a structure

Cmd> stats # print the value of stats
component: n
(1)          18
component: min
(1)         101
component: q1
(1)         126
component: median
(1)       138.5
component: q3
(1)         154
component: max
(1)         200
component: mean
(1)      141.06
component: var
(1)      698.88

You can look at individual pieces of a structure using a $ sign:

Cmd> stats$median # look at just component with name median
(1)       138.5

Cmd> s <- sqrt(stats$var); s # compute and print standard deviation
(1)      26.436

This last line shows you can do more than one thing on a line if you separate things using a semicolon “;”. The command before ";" creates a new variable s as the square root of the variance that was computed by describe(). Then typing s after the semicolon is just like typing it after a prompt -; it prints the value.

Here's how to make a stem plot of these data. MacAnova always uses single digit leaves.

Cmd> stemleaf(women)
    3    10|139
    4    11|5
    7    12|669
    9    13|77
    9    14|08
    7    15|244
    4    16|55
    2    17|8
 High  200
          1|1 represents 11  Leaf digit unit = 1

This used double digit stems and the leaves represent 10’s. Leaves are always truncated and not rounded. In this case there is no truncation since all the data values are exactly represented. If we add a value 144.9, it is truncated to 144 and goes on the 14 stem as a 4 and not a 5 as it would if it were rounded to 145:

Cmd> stemleaf(vector(women, 144.9))
    3    10|139
    4    11|5
    7    12|669
    9    13|77
  ( 3)   14|048
    7    15|244
    4    16|55
    2    17|8
 High  200                    Outlier on the high side in leaf units
          1|1 represents 11  Leaf digit unit = 1

As is probably obvious, vector(women,144.9) is a larger data set with the additional value 144.9.

The numbers to the left are sometimes called the depths; they are cumulative counts up from the smallest value and down from the largest value. Thus the depth on the 11 stem is 4 because there are 3 leaves on stem 10 and 1 leaf on stem 11, and the depth on the 15 stem is 7 = 1 + 1 + 2 + 3, the single High value + total number of leaves on stems 17, 16 and 15. The ( 3) marks the stem containing the median. The number is 3 because there are 3 leaves on that stem.

stemleaf() treats outliers a little differently from non-outliers. It recognizes 200 as a possible outlier on the high side and lists it separately after High.

You have some control of the number of stems that are used, and can suppress the depth.

Cmd> stemleaf(women,6,depth:F) # 6 limits the number of stems to 6
   1*|0001
   1t|22233
   1f|44555
   1s|667
 High  20
   1*|1 represents 110  Leaf digit unit = 10

Here we asked for no more than 6 stems and we got 4. Stem 1* combines stems 10 and 11. Stem 1t combines stems 12 and 13 (“t” stands for “twos and threes”), stem 1f combines stems 14 and 15 (“f” = “fours and fives”), and so on. Note that the stem digit now represents for 100's and the leaf digit represents 10's. A stem plot is not complete without this information.

depth:F is a what is called a keyword phrase. depth is the keyword name and F is its value. F stands for False which basically means “no”, so depth:F means “no depth.”

You can easily translate mathematical and algebraic formulas into MacAnova commands. We can compute the two sums sum(y) and sum(y^2) in one line by combining them using vector().

Cmd> vector(sum(women),sum(women^2))# compute two sums in one command
(1)        2539  3.7002e+05

Here sum(women) computes sum(y) and sum(women^2) computes sum(y^2). vector() combines the two sums into a vector of length 2, that is a variable with two values. This is printed since it was not assigned to a new variable.

Incidentally, if you want, you can print more significant digits printed using command print() with keyword nsig.

Cmd> print(sum(women^2),nsig:8)
NUMBER:
(1)          370021

You can specify the number of significant digits displayed in all output by the command setoptions(nsig:m), where m is the number of digits you want.

Because you can translate formulas fairly easily into MacAnova commands, you can compute anything describe() does without using describe().

Cmd> n <- nrows(women); n # sample size
(1)          18

Cmd> ybar <- sum(women)/n; ybar # sample mean 
(1)      141.06

Cmd> tmp <- sum((women - ybar)^2); tmp 
(1)       11881

Cmd> s <- sqrt(tmp/(n-1))#sqrt(sum(y^2)/(n-1))

Cmd> s # sample standard deviation
(1)      26.436

You can do this in one line, once you have computed ybar:

Cmd> sqrt(sum((women - ybar)^2)/(n-1))
(1)      26.436

but it’s often easier to use a an intermediate variable like tmp.

You can easily get the minimum and maximum using min() and max().

Cmd> vector(min(women), max(women))
(1)         101         200

To get the medians and quartiles you need to order the data using sort().

Cmd> sorted <- sort(women); sorted
 (1)         101         103         109         115         126
 (6)         126         129         137         137         140
(11)         148         152         154         154         165
(16)         165         178         200

A new variable sorted has been created with the ordered (sorted) data values.

Cmd> vector(sorted[1], sorted[n]) # minimum and maximum values
(1)         101         200

sorted[1] and sorted[n] are the first and last values of sorted, that is, the minimum and maximum since sorted is in increasing order. The elements inside [...] are called subscripts.

Here is another way to get these values, using a vector (colection of numbers) as a subscript.

Cmd> sorted[vector(1,n)] # get element 1 and n of sorted
(1)         101         200

Since n = 18 is even, the median is the average of the two most central values, the 9th and 10th.

Cmd> (sorted[9] + sorted[10])/2 # median
(1)       138.5

You could also use .5*sum(sorted[vector(9,10)]).

Graphical summaries are often at least as useful as numerical ones and a box plot is one of the easiest. You draw a box plot in MacAnova using command boxplot(). The trailing "\" indicates the command is continued on the next line:

Cmd> boxplot(women,xlab:"SSHA Score",\
 		title:"SSHA scores of 18 1st year college women")
Horizontal boxplot of women's SSHA scores

Here title:"SSHA scores of 18 1st year college women" puts a title for the plot and xlab:"SSHA Score" gives a label for the X-axis. You can label any graph the same way using keywords title, xlab and ylab.

If you prefer the more common box plot with vertically orented boxes, use vboxplot().

Cmd> vboxplot(women,ylab:"SSHA Score",\
 title:"SSHA scores of 18 1st year college women") # label Y-axis this time
Vertical boxplot of women's SSHA scores

If you want to check whether it is plausible that the data are normal, a normal scores or rankit plot may be helpful. This is essentially the same as normal quantile plot discussed and illustrated by M & M starting on p. 78. The MacAnova function that computes what is called the "z-score" on M&M's plots is rankits().

Cmd> plot(rankits(women),women,\
		title:"Normal quantile plot of SSHA scores of 18 women",\
		xlab:"Normal scores",ylab:"SSHA scores")
Normal quantile plot of women's SSHA scores

If the data are approximately normal, a plot make this way should approximate a straight line. The plot shows no evidence the data are not normal.

You draw histograms using “macro” hist() (a macro is almost the same thing as a function and is used the same way). Although the sample size here is really pretty small to make a histogram, we do it just to show how it’s done.

Cmd> hist(women,6,xlab:"SSHA scores",freq:T)#histogram with 6 classes
Histogram of women's SSHA scores

Keyword phrase freq:T directs hist() to make a frequency histogram for which the height of each bar is the number of values in the group. With relfreq:T , the bar heights are the relative frequencies or proportions.

Unfortunately, the boundaries of the bars chosen automatically in this simplest use of hist() are not "neat" in any way. The bar width is about 20 so you might want a histogram with bar edges at 100, 120, 140, 160, 180, 200. This is easy to do.

 Cmd> hist(women,vector(100,120,140,160,180,200),\
		xlab:"SSHA scores",freq:T) # histogram with 5 NEAT classes
Histogram of women's SSHA scores with neat bar edges

Instead of vector(100,120,140,160,180,200), you can use run(100,200,20). This computes numbers equally spaced by 20 starting with 100 and ending with 200. Even simpler is vector(120,20), where 120 is one of the potential bar boundaries and 20 is the desired bar width. vector(100,20) or vector(180,20) would do as well. vector(110,20), would give bar boundaries 90, 110, 130, ... .

Note the use of xlab:"SSHA scores" in these examples to label the X-axis.

It can be sometimes be hard to include high resolution plots in your homework. For most purpose a “dumb” plot will do just as well. Just include keyword phrase dumb:T as an argument to the plotting command. The graph will be in the regular output window and is created just using ordinary typewriter characters:

Cmd> hist(women,vector(120,20),dumb:T,\
          xlab:"SSHA scores",freq:T,\
          width:70,height:25) # histogram with 6 NEAT classes
                         Frequency histogram of women
           ++----------+----------+----------+----------+----------++
          6+           ............                                 +
           |           .          .                                 |
           |           .          .                                 |
           |           .          .                                 |
          5+           .          .                                 +
           |           .          .                                 |
 F         |           .          .                                 |
 r        4+............          ............                      +
 e         |.          .          .          .                      |
 q         |.          .          .          .                      |
 u        3+.          .          .          ............           +
 e         |.          .          .          .          .           |
 n         |.          .          .          .          .           |
 c        2+.          .          .          .          .           +
 y         |.          .          .          .          .           |
           |.          .          .          .          .           |
          1+.          .          .          .          ............+
           |.          .          .          .          .          .|
           |.          .          .          .          .          .|
          0+........................................................+
           ++----------+----------+----------+----------+----------++
           100        120        140        160        180       200
                                  SSHA scores

Here width:70 and height:25 tells MacAnova to make the plot 70 character widths wide and 25 lines tall and so it would fit in the handout. Otherwise it makes it fit the window.

Let’s now use MacAnova to make the box plot of the percent Hispanic adults in each state in Example 1.17 on p. 47 of M & M. The data are in Table 1.2 on page 14.

The data set is long enough (n = 50) that it worth using one of the data file you can download from the web. It is in file ta01_002.txt in folder or directory Ch1.

Here is how you do it using command readdata():

Cmd> readdata("")
# File ta01_002.txt from publisher's web site
# Data from Table 1.2, p. 14 of IPS4
# Use readdata("",factors:F)
# Col. 1: state (CHARACTER vector of state names)
# Col. 2: hispanic = percent adult hispanics by state (2000)
Read from file "TP1:Stat5021:Stat5021S03:Data:Ch01:ta01_002.txt"
Column 1 saved as factor state 
Column 2 saved as REAL vector hispanic 

On a Macintosh, this brings up a dialog box similar to this one:

Macintosh File Navigation dialog box

I had to use the pop-up menu above the list of files to get pointed at the Ch5 folder, and then I selected ex5_101.txt and clicked on Open.

This data set has two columns, a state abbreviation in column 1 and the percent in column 2 which was saved as variable hispanic. We can safely ignore state.

Cmd> hispanic[run(5)] # first 5 numbers in hispanic
(1)         1.5         3.6        21.3         2.8        28.1 

Cmd> stats <- describe(hispanic, n:T,mean:T, stddev:T); stats
component: n
(1)          50
component: mean
(1)       6.796
component: stddev
(1)      7.9817

Cmd> n <- stats$n; ybar <- stats$mean; s <- stats$stddev

Cmd> vector(n,ybar,s)
(1)          50       6.796      7.9817

Cmd> vboxplot(hispanic,lab:"Percent hispanic",\
		title:"Boxplot Percent Hispanic by State")
Vertical boxplot of hispanic data

C. Bingham kb@umn.edu

Updated Mon Jan 27 07:46:32 CST 2003