Applied Linear Regression, Third Edition, by Sanford Weisberg, Copyright 2005, John Wiley and Sons, ISBN 0-471-66379-4. Reprinted by permission.
Regression analysis answers questions about the dependence of a response variable on one or more predictors, including prediction of future values of a response, discovering which predictors are important, and estimating the impact of changing a predictor or a treatment on the value of the response. At the publication of the second edition of this book about twenty years ago, regression analysis using least squares was essentially the only methodology available to analysts interested in questions like these. Cheap, widely available high-speed computing has changed the rules for examining these questions. Modern competitors include nonparametric regression, neural networks, support vector machines, and tree-based methods, among others. A new field of computer science, called machine learning, adds diversity, and confusion, to the mix. With the availability of software, using a neural network or any of these other methods seems to be just as easy as using linear regression.
So, a reasonable question to ask is: Who needs a revised book on linear regression using ordinary least squares when all these other newer and presumably better methods exist? This question has several answers. First, most other modern regression modeling methods are really just elaborations or modifications of linear regression modeling. To understand, as opposed to use, neural networks or the support vector machine is nearly impossible without a good understanding of linear regression methodology. Second, linear regression methodology is relatively transparent, as will be seen throughout this book: we can draw graphs that will generally allow us to see relationships between variables and decide whether the models we use make any sense. Many of the more modern methods are much like a black box in which data are stuffed in at one end and answers pop out at the other, without much hope for the non-expert to understand what goes on inside the box. Third, if you know how to do something in linear regression, the same methodology with only minor adjustment will usually carry over to other regression-type problems for which least squares is not appropriate. For example, the methodology for comparing response curves for different values of a treatment variable when the response is continuous is studied in Chapter 6 of this book. Analogous methodology can be used when the response is a possibly censored survival time, even though the method of fitting needs to be appropriate for the censored response, and not least squares. The methodology of Chapter 6 is useful both in its own right when applied to linear regression problems, and as a set of core ideas that can be applied in other settings.
Probably the most important reason to learn about linear regression and least squares estimation is that even with all the new alternatives, most analyses of data continue to be based on this older paradigm. And why is this? The primary reason is that it works: least squares regression provides good, and useful, answers in many problems. Pick up the journals in any area where data are commonly used for prediction or estimation, and the dominant method used will be linear regression with least squares estimation.
The continuing theme of the second edition was the need for diagnostic methods, in which fitted models are analyzed for deficiencies, through analysis of residuals and influence. This emphasis was unusual when the second edition was published, and important quantities like Studentized residuals and Cook's distance were not readily available in the commercial software of the time.
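Both diagnostic quantities mentioned above follow directly from the hat matrix of a least squares fit. As a rough illustration, the sketch below computes leverages, (internally) Studentized residuals, and Cook's distances from the standard textbook formulas; the helper name `regression_diagnostics` is hypothetical, not from this book.

```python
import numpy as np

def regression_diagnostics(X, y):
    """Leverages, Studentized residuals, and Cook's distances for an
    OLS fit with intercept (hypothetical helper; textbook formulas)."""
    n, p = X.shape
    Xa = np.column_stack([np.ones(n), X])      # add intercept column
    H = Xa @ np.linalg.inv(Xa.T @ Xa) @ Xa.T   # hat matrix
    h = np.diag(H)                             # leverages h_i
    e = y - H @ y                              # residuals
    df = n - p - 1                             # n minus number of parameters
    sigma2 = e @ e / df                        # residual variance estimate
    r = e / np.sqrt(sigma2 * (1.0 - h))        # Studentized residuals
    cooks = r**2 * h / ((p + 1) * (1.0 - h))   # Cook's distances
    return h, r, cooks
```

Large values of `cooks` flag cases whose deletion would substantially change the fitted coefficients, which is the sense in which these are influence diagnostics.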
Times have changed, and so has the emphasis of this book. This edition stresses graphical methods: looking at data not only after fitting models, but also before fitting, to help find reasonable models and understand them. This is reflected immediately in the new Chapter 1, which introduces the key ideas of looking at data with scatterplots and the somewhat less universal tool of the scatterplot matrix. Most analyses and homework problems start with drawing graphs. We tailor the analysis to correspond to what we see in the graphs, and this additional step can make modeling easier, with fitted models that reflect the data more closely. Remarkably, it also lessens the need for diagnostic methods.
The emphasis on graphs leads to several additional methods and procedures that were not included in the second edition. The use of smoothers to help summarize a scatterplot is introduced early, although only a little of the theory of smoothing is presented (in Appendix A.5). Transformations of predictors and the response are stressed, and relatively unfamiliar methods based both on smoothing and on a generalization of the Box-Cox method are presented in Chapter 7.
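To give a flavor of the standard Box-Cox idea (not the book's generalized version), the sketch below computes the profile log-likelihood of the transformation parameter lambda for a positive response; the best lambda is then found by evaluating this over a grid. The function name `boxcox_loglik` is an assumption for illustration.

```python
import numpy as np

def boxcox_loglik(y, lam):
    """Profile log-likelihood (up to a constant) of the Box-Cox
    parameter lambda for a positive response y (standard formula,
    with the geometric-mean scaling that makes values comparable)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    gm = np.exp(np.mean(np.log(y)))            # geometric mean of y
    if abs(lam) < 1e-8:
        z = gm * np.log(y)                     # limit as lambda -> 0
    else:
        z = (y**lam - 1.0) / (lam * gm**(lam - 1.0))
    return -n / 2.0 * np.log(np.var(z))        # maximize over lambda

# usage sketch: grid = np.linspace(-2, 2, 81)
#               best = grid[np.argmax([boxcox_loglik(y, g) for g in grid])]
```

For a lognormally distributed response the maximizing lambda lands near zero, recovering the familiar log transformation.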
Another new topic in this edition is computationally intensive methods and simulation. The key example is the bootstrap, in Section 4.6, which can be used to make inferences about fitted models in small samples. A somewhat different computationally intensive method is used in an example in Chapter 10, which is a completely rewritten chapter on variable selection.
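The idea behind the bootstrap mentioned above can be sketched in a few lines: resample the cases with replacement, refit, and use the spread of the refitted estimates for inference. This is a minimal case-resampling sketch for a simple-regression slope, not the book's exact procedure; `bootstrap_slope` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_slope(x, y, B=2000, rng=rng):
    """Case-resampling bootstrap for the OLS slope in simple
    regression: refit on B resamples drawn with replacement."""
    n = len(x)
    slopes = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)       # resample cases with replacement
        slopes[b] = np.polyfit(x[idx], y[idx], 1)[0]  # slope of refit
    return slopes

# usage sketch: a 95% percentile interval for the slope is
#               np.percentile(bootstrap_slope(x, y), [2.5, 97.5])
```

The percentile interval requires no normality assumption, which is what makes the bootstrap attractive in small samples.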
The book concludes with two expanded chapters on nonlinear and logistic regression, both of which are generalizations of the linear regression model. I have included these chapters to provide instructors and students with enough information for basic usage of these models, and to take advantage of the intuition about them gained from in-depth study of the linear regression model. Each of these topics can be treated at book length, and appropriate references are given.
Some readers may prefer to have a book that integrates the text with the computer package more closely, and for this purpose I can recommend R. D. Cook and S. Weisberg (1999), Applied Regression Including Computing and Graphics, also published by John Wiley. This book includes a very user-friendly, free computer package called Arc that does everything that is described in that book and also nearly everything in Applied Linear Regression.