University of Minnesota
School of Statistics

Statistics 5021
Solutions to First Midterm Examination

March 3, 2003

1. Here is MacAnova output that analyzes a data set consisting of the percentage of bone ash in the left tibia of 21 three-week-old chicks.

Cmd> bone_ash <- vector(42.36,34.26,44.94,40.61,37.93,44.53,36.85,\
            43.09,39.68,35.31,40.31,38.82,42.07,36.87,37.34,\
            37.33,38.06,39.99,39.22,34.06,33.26)

Cmd> sort(bone_ash)
 (1)       33.26       34.06       34.26       35.31       36.85
 (6)       36.87       37.33       37.34       37.93       38.06
(11)       38.82       39.22       39.68       39.99       40.31
(16)       40.61       42.07       42.36       43.09       44.53
(21)       44.94

Cmd> n <- length(bone_ash); n
(1)           21 

Cmd> stemleaf(bone_ash)
    1    33|2
    3    34|02
    4    35|3
    6    36|88
    9    37|339
  ( 2)   38|08
   10    39|269
    7    40|36
    5    41|
    5    42|03
    3    43|0
    2    44|59

          1|1 represents 1.1  Leaf digit unit = 0.1

Cmd> sum(bone_ash)
(1)       816.89

Cmd> sum(bone_ash^2)
(1)        31996

Cmd> sum((bone_ash - sum(bone_ash)/n)^2)
(1)       219.09

Cmd> sum(abs(bone_ash - describe(bone_ash,median:T)))
(1)        55.53

Find the following quantities without entering the data in your calculator. Show your work:

(a) (5 pts.) Sample standard deviation    3.310    

Cmd> stddev <- sqrt(219.09/(21-1)); stddev
(1)      3.3098
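As a cross-check on the arithmetic in (a), here is a minimal Python sketch that repeats the same calculation from the summary output alone. The variable names are illustrative and not part of the original solution.

```python
import math

# Summary quantities copied from the MacAnova output above.
n = 21
sum_sq_dev = 219.09  # sum((x - xbar)^2)

# The sample variance divides by n - 1, not n.
sample_var = sum_sq_dev / (n - 1)
sample_sd = math.sqrt(sample_var)

print(round(sample_sd, 4))
```

Dividing by n instead of n - 1 would give a slightly smaller value, so the divisor is worth checking when the answer does not match.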

(b) (5 pts.) Sample range   11.68    

Cmd> range <- 44.94 - 33.26; range
(1)       11.68

(c) (10 pts.) What values correspond to the locations marked with arrows?

[Figure: boxplot with arrows pointing to the maximum, the upper quartile (Q3) and the median]

(i)    44.94   
(ii)    41.34   
(iii)    38.82   

(i), (ii) and (iii) are the maximum, the upper quartile and the median, respectively.

From the sorted output, the maximum is 44.94.

The upper quartile is the median of the upper half of the data, excluding the overall median, that is, the median of the 10 largest values.

Cmd> (40.61 + 42.07)/2
(1)       41.34

Because n = 21 is odd, the median is the middle value, that is value (n+1)/2 = 11 in order of size. From the sorted output this is 38.82.
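The "median of each half" rule used above can be sketched in Python as follows. This is only one of several quartile conventions, and the variable names are illustrative. Library routines may use a different interpolation rule and return slightly different quartiles.

```python
import statistics

bone_ash = [42.36, 34.26, 44.94, 40.61, 37.93, 44.53, 36.85,
            43.09, 39.68, 35.31, 40.31, 38.82, 42.07, 36.87, 37.34,
            37.33, 38.06, 39.99, 39.22, 34.06, 33.26]

xs = sorted(bone_ash)
n = len(xs)
half = n // 2

median = statistics.median(xs)

# For odd n the middle value is left out of both halves.
lower_half = xs[:half]
upper_half = xs[half + 1:] if n % 2 else xs[half:]

q1 = statistics.median(lower_half)
q3 = statistics.median(upper_half)
sample_range = xs[-1] - xs[0]

print("max", xs[-1], "Q3", q3, "median", median, "Q1", q1)
print("range", round(sample_range, 2))
```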

2. On 20 consecutive days in 1963 in the Chignik-Black Lakes area of Alaska, researchers made aerial counts of bears. They also recorded the average morning wind speed. Here are the data:

Day  Wind Speed (mph)  Bears Seen     Day  Wind Speed (mph)  Bears Seen
  1         2.0            98          11        10.5            77
  2        16.7            60          12        18.6            56
  3        21.1            30          13        20.3            54
  4        15.9            63          14        11.9            69
  5         4.9            82          15         7.0            87
  6        11.8            76          16        20.6            47
  7        23.6            43          17        13.5            73
  8         4.0            89          18        14.0            72
  9        21.5            49          19         6.9            83
 10        24.4            36          20        27.2            23

Here is a scatter plot of these data.

[Figure: scatter plot, Number of Bears vs Morning Wind Speed]

Here are the results of some computations (x = wind speed, y = number of bears):

n = 20    sum(x) = 296.4    sum(y) = 1267
sum(x^2) = 5423.7    sum(x*y) = 15970.4    sum(y^2) = 88491
sum((x-xbar)^2) = 1031.052    sum((x-xbar)*(y-ybar)) = -2806.54    sum((y-ybar)^2) = 8226.55

(a) (10) What percent of the total variability in the number of bears can be explained linearly by variation in wind speed? Show your work.

The proportion of total variability explained is r^2, where r is Pearson's correlation: r = sum((x-xbar)*(y-ybar)) / sqrt(sum((x-xbar)^2) * sum((y-ybar)^2)).

Cmd> r <- -2806.54/sqrt(1031.052*8226.55); r
(1)    -0.96366

Cmd> r^2 # proportion explained
(1)     0.92863

Cmd> 100*r^2 # percentage explained
(1)      92.863
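The centered sums used in this calculation can be recovered from the raw sums with the usual shortcut identities, for example sum((x-xbar)^2) = sum(x^2) - sum(x)^2/n. A short Python sketch, with illustrative variable names, that rebuilds them and then recomputes r:

```python
import math

# Raw sums copied from the problem statement.
n = 20
sum_x, sum_y = 296.4, 1267
sum_xx, sum_xy, sum_yy = 5423.7, 15970.4, 88491

# Shortcut identities for the centered sums of squares and products.
sxx = sum_xx - sum_x ** 2 / n
sxy = sum_xy - sum_x * sum_y / n
syy = sum_yy - sum_y ** 2 / n

r = sxy / math.sqrt(sxx * syy)
percent_explained = 100 * r ** 2

print(round(sxx, 3), round(sxy, 2), round(syy, 2))
print(round(r, 5), round(percent_explained, 3))
```

The shortcut form can lose accuracy when the raw sums have been rounded, so the centered sums given in the problem are the safer starting point when both are available.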

(b) (15) On day 21, the wind speed was 5.1. Based on the least squares line, what is your best guess as to how many bears were seen?

Here x = 5.1, so the best guess is a + b×x, where a and b are the intercept and slope of the least squares line. The least squares estimate of b is b = sum((x-xbar)*(y-ybar)) / sum((x-xbar)^2), and the least squares estimate of a is a = ybar - b*xbar.

Cmd> xbar <- 296.4/20; ybar <- 1267/20; vector(xbar,ybar)
(1)       14.82       63.35

Cmd> b <- -2806.54/1031.052; b
(1)      -2.722

Cmd> a <- ybar - b*xbar;a
(1)      103.69

Cmd> 103.69 + (-2.722)*5.1
(1)      89.808
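The whole prediction in (b) can be sketched in Python as a slope, an intercept and a prediction at x = 5.1. The function name `predict` is illustrative; the inputs are the summary values given in the problem.

```python
# Summary values copied from the problem statement.
n = 20
sum_x, sum_y = 296.4, 1267
sxx = 1031.052   # sum((x - xbar)^2)
sxy = -2806.54   # sum((x - xbar)*(y - ybar))

xbar = sum_x / n
ybar = sum_y / n

b = sxy / sxx          # least squares slope
a = ybar - b * xbar    # least squares intercept

def predict(x):
    """Fitted number of bears at wind speed x (mph)."""
    return a + b * x

print(round(b, 3), round(a, 2), round(predict(5.1), 1))
```

Carrying the unrounded slope through the calculation gives essentially the same answer as the hand computation with b rounded to -2.722, about 90 bears.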

(c) (10) Here is a plot of standardized residuals vs wind speed from the linear regression of bear counts on wind speed.

[Figure: standardized residuals vs wind speed]

Comment in 50 words or less on what this tells us about the suitability of simple linear regression for analyzing these data.

The curved (arc-shaped) pattern suggests that a straight-line fit may not be appropriate. In addition, the point with a standardized residual of about -3 may be an outlier.
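Since the residual plot itself is not reproduced here, the following Python sketch recomputes the residuals from the data table. It assumes that "standardized residual" means the residual divided by s*sqrt(1 - leverage); the software that produced the original plot may use a slightly different definition.

```python
import math

wind = [2.0, 16.7, 21.1, 15.9, 4.9, 11.8, 23.6, 4.0, 21.5, 24.4,
        10.5, 18.6, 20.3, 11.9, 7.0, 20.6, 13.5, 14.0, 6.9, 27.2]
bears = [98, 60, 30, 63, 82, 76, 43, 89, 49, 36,
         77, 56, 54, 69, 87, 47, 73, 72, 83, 23]

n = len(wind)
xbar = sum(wind) / n
ybar = sum(bears) / n
sxx = sum((x - xbar) ** 2 for x in wind)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(wind, bears))

b = sxy / sxx
a = ybar - b * xbar

residuals = [y - (a + b * x) for x, y in zip(wind, bears)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Assumed definition: residual / (s * sqrt(1 - leverage)).
std_resid = [
    e / (s * math.sqrt(1 - (1 / n + (x - xbar) ** 2 / sxx)))
    for x, e in zip(wind, residuals)
]

# List the points in order of wind speed to make the arc visible.
for x, r, day in sorted(zip(wind, std_resid, range(1, n + 1))):
    print(f"wind {x:5.1f}  day {day:2d}  std resid {r:6.2f}")

worst_day = min(range(n), key=lambda i: std_resid[i]) + 1
```

Sorted by wind speed, the residuals are mostly negative at both ends and positive in the middle, which is the arc described above. The most negative value belongs to day 3 (wind speed 21.1, 30 bears).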

3.
[Figure: trapezoid-shaped density]

(a) (10) For a suitable value of h, the heavy line in the graph defines a density function. What is the value of h? Give reasons for your choice.

A density is never negative, which is true here. It must also have total area 1. The area can be split into two triangles, each with base 1 and height h, and one rectangle with base 1 and height h. Thus the total area is 2((1)(h)/2) + (1)(h) = 2h = 1. Conclusion: h = 1/2.

(b) (10) What proportion of the population described by the density has values of x > 3.5? If you were unable to find h in (a), give a formula involving h.

This is the area under the curve to the right of 3.5, which is halfway between 3 and 4. Thus the height of the triangle is halfway between h and 0, that is, h/2, and its base is 1/2. The area of the triangle is (1/2)(h/2)/2 = h/8 = 1/16.
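The areas in (a) and (b) can be checked exactly with rational arithmetic. The sketch below assumes the shape implied by the solutions, since the figure is not reproduced: the density rises linearly from 0 at x = 1 to h at x = 2, stays at h until x = 3, and falls linearly to 0 at x = 4. The function names are illustrative.

```python
from fractions import Fraction

h = Fraction(1, 2)

def density(x):
    """Assumed trapezoid density: rises on [1,2], flat on [2,3], falls on [3,4]."""
    x = Fraction(x)
    if 1 <= x <= 2:
        return h * (x - 1)
    if 2 < x <= 3:
        return h
    if 3 < x <= 4:
        return h * (4 - x)
    return Fraction(0)

def area(lo, hi):
    """Exact area on [lo, hi], valid when the density is linear on that interval."""
    lo, hi = Fraction(lo), Fraction(hi)
    return (density(lo) + density(hi)) / 2 * (hi - lo)

total = area(1, 2) + area(2, 3) + area(3, 4)
tail = area(Fraction(7, 2), 4)

print(total, tail)  # 1 1/16
```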

(c) (10) Find the median and the upper and lower quartiles of the distribution defined by the density.

The median has half the area on either side. Since the density is symmetric around 2.5, the median is 2.5.

The quartiles cut off area 1/4 on each end. Since the area of each triangular tail is half the area of the middle rectangle, the lower and upper quartiles are 2 and 3, respectively.
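The same answers can be obtained numerically by inverting the cumulative distribution function. The sketch below assumes h = 1/2 and a density supported on [1, 4] with the flat section on [2, 3], as implied by the solutions. The names `cdf` and `quantile` are illustrative, and bisection is just one simple way to do the inversion.

```python
H = 0.5  # height of the flat section, from part (a)

def cdf(x):
    """Area under the assumed trapezoid density to the left of x."""
    if x <= 1:
        return 0.0
    if x <= 2:
        return H * (x - 1) ** 2 / 2      # growing left triangle
    if x <= 3:
        return H / 2 + H * (x - 2)       # full left triangle plus rectangle strip
    if x < 4:
        return 1 - H * (4 - x) ** 2 / 2  # one minus the remaining right triangle
    return 1.0

def quantile(p, lo=1.0, hi=4.0, tol=1e-10):
    """Find x with cdf(x) = p by bisection; cdf is increasing on [1, 4]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

lower_q, median, upper_q = (quantile(p) for p in (0.25, 0.5, 0.75))
print(round(lower_q, 6), round(median, 6), round(upper_q, 6))
```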

4. Scores on the Wechsler Adult Intelligence Scale for the 20 to 34 age group are approximately normally distributed with mean 110 and standard deviation 25 while scores for the 60 to 64 age group are approximately normally distributed with mean 90 and standard deviation 25.

Sarah, who is 30, scored 135 on the test and her mother, who is 60, scored 120.

(a) (10) What percentiles in their age groups are Sarah and her mother?

You need to find the proportion (relative frequency) of individuals in the 20 to 34 age group who scored <= 135 and the proportion of individuals in the 60 to 64 age group who scored <= 120. Here are the z-scores:

Cmd> z1 <- (135-110)/25; z2 <- (120-90)/25

Cmd> vector(z1,z2) # Z-scores 
(1)           1         1.2

Cmd> cumnor(vector(1,1.2)) # .8413 and .8849 from table
(1)     0.84134     0.88493

Cmd> 100*vector(.8413, .8849) # percentiles
(1)       84.13       88.49

It would not be unreasonable to round these to 84 and 88.
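For readers without a normal table, the same percentiles can be sketched with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The variable names are illustrative.

```python
from statistics import NormalDist

# Age-group distributions as stated in the problem.
young = NormalDist(mu=110, sigma=25)   # ages 20 to 34
older = NormalDist(mu=90, sigma=25)    # ages 60 to 64

# z-scores: distance from the group mean in standard deviations.
z_sarah = (135 - young.mean) / young.stdev
z_mother = (120 - older.mean) / older.stdev

# Percentile = 100 * proportion of the group scoring at or below the value.
sarah_pct = 100 * young.cdf(135)
mother_pct = 100 * older.cdf(120)

print(z_sarah, z_mother)
print(round(sarah_pct, 2), round(mother_pct, 2))
```

Although Sarah's raw score is higher, her mother ranks higher within her own age group because her score is 1.2 standard deviations above that group's mean.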

(b) (10) What score does a person in the 20 to 34 age group have to exceed to be in the top 5%?

You need to find x with 95% of the scores to the left. For the standard normal, z = 1.64 and z = 1.65 correspond to areas .9495 and .9505, respectively. Since .9500 is halfway between these, z is halfway between 1.64 and 1.65.

Cmd> z <- (1.64 + 1.65)/2; z
(1)        1.645

Cmd> 110 + 1.645*25 # unstandardize
(1)       151.12
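The table interpolation can be compared with a direct inverse-normal calculation. A minimal sketch using `statistics.NormalDist.inv_cdf` from the Python standard library, with illustrative variable names:

```python
from statistics import NormalDist

# Standard normal value with 95% of the area to its left.
z = NormalDist().inv_cdf(0.95)

# Unstandardize for the 20 to 34 age group (mean 110, SD 25).
young = NormalDist(mu=110, sigma=25)
cutoff = young.inv_cdf(0.95)

print(round(z, 3), round(cutoff, 2))
```

The exact z is very slightly below 1.645, so the cutoff agrees with the interpolated answer to about two decimal places.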

C. Bingham

Updated Sun Mar 2 20:49:17 CST 2003