Applied Linear Regression, Third Edition, by Sanford Weisberg, Copyright 2005, John Wiley and Sons, ISBN 0-471-66379-4. Reprinted by permission.
Regression analysis answers questions about the dependence of a response variable on one or more predictors, including prediction of future values of a response, discovering which predictors are important, and estimating the impact of changing a predictor or a treatment on the value of the response. At the publication of the second edition of this book about twenty years ago, regression analysis using least squares was essentially the only methodology available to analysts interested in questions like these. Cheap, widely available high-speed computing has changed the rules for examining these questions. Modern competitors include nonparametric regression, neural networks, support vector machines, and tree-based methods, among others. A new field of computer science, called machine learning, adds diversity, and confusion, to the mix. With the availability of software, using a neural network or any of these other methods seems to be just as easy as using linear regression.
So, a reasonable question to ask is: Who needs a revised book on linear regression using ordinary least squares when all these other newer and presumably better methods exist? This question has several answers. First, most other modern regression modeling methods are really just elaborations or modifications of linear regression modeling. To understand, as opposed to use, neural networks or the support vector machine is nearly impossible without a good understanding of linear regression methodology. Second, linear regression methodology is relatively transparent, as will be seen throughout this book: we can draw graphs that will generally allow us to see relationships between variables and decide whether the models we use make any sense. Many of the more modern methods are much like a black box in which data are stuffed in at one end and answers pop out at the other, without much hope for the non-expert to understand what goes on inside the box. Third, if you know how to do something in linear regression, the same methodology with only minor adjustment will usually carry over to other regression-type problems for which least squares is not appropriate. For example, the methodology for comparing response curves for different values of a treatment variable when the response is continuous is studied in Chapter 6 of this book. Analogous methodology can be used when the response is a possibly censored survival time, even though the method of fitting needs to be appropriate for the censored response, and not least squares. The methodology of Chapter 6 is useful both in its own right when applied to linear regression problems, and as a set of core ideas that can be applied in other settings.
Probably the most important reason to learn about linear regression and least squares estimation is that even with all the new alternatives, most analyses of data continue to be based on this older paradigm. And why is this? The primary reason is that it works: least squares regression provides good, and useful, answers in many problems. Pick up the journals in any area where data are commonly used for prediction or estimation, and the dominant method used will be linear regression with least squares estimation.
The continuing theme of the second edition was the need for diagnostic methods, in which fitted models are analyzed for deficiencies, through analysis of residuals and influence. This emphasis was unusual when the second edition was published, and important quantities like Studentized residuals and Cook's distance were not readily available in the commercial software of the time.
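Both diagnostic quantities mentioned above follow directly from the hat matrix of a least squares fit. As a rough illustration, the sketch below computes leverages, (internally) Studentized residuals, and Cook's distances from the standard textbook formulas; the helper name `regression_diagnostics` is hypothetical, not from this book.

```python
import numpy as np

def regression_diagnostics(X, y):
    """Leverages, Studentized residuals, and Cook's distances for an
    OLS fit with intercept (hypothetical helper; textbook formulas)."""
    n, p = X.shape
    Xa = np.column_stack([np.ones(n), X])      # add intercept column
    H = Xa @ np.linalg.inv(Xa.T @ Xa) @ Xa.T   # hat matrix
    h = np.diag(H)                             # leverages h_i
    e = y - H @ y                              # residuals
    df = n - p - 1                             # n minus number of parameters
    sigma2 = e @ e / df                        # residual variance estimate
    r = e / np.sqrt(sigma2 * (1.0 - h))        # Studentized residuals
    cooks = r**2 * h / ((p + 1) * (1.0 - h))   # Cook's distances
    return h, r, cooks
```

Large values of `cooks` flag cases whose deletion would substantially change the fitted coefficients, which is the sense in which these are influence diagnostics.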
Times have changed, and so has the emphasis of this book. This edition stresses graphical methods: looking at data not only after fitting models, but also before fitting, to help find reasonable models and understand them. This is reflected immediately in the new Chapter 1, which introduces the key ideas of looking at data with scatterplots and the somewhat less universal tool of the scatterplot matrix. Most analyses and homework problems start with drawing graphs. We tailor the analysis to correspond to what we see in the graphs, and this additional step can make modeling easier, with fitted models that reflect the data more closely. Remarkably, it also lessens the need for diagnostic methods.
The emphasis on graphs leads to several additional methods and procedures that were not included in the second edition. The use of smoothers to help summarize a scatterplot is introduced early, although only a little of the theory of smoothing is presented (in Appendix A.5). Transformations of predictors and the response are stressed, and relatively unfamiliar methods based both on smoothing and on a generalization of the Box-Cox method are presented in Chapter 7.
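To give a flavor of the standard Box-Cox idea (not the book's generalized version), the sketch below computes the profile log-likelihood of the transformation parameter lambda for a positive response; the best lambda is then found by evaluating this over a grid. The function name `boxcox_loglik` is an assumption for illustration.

```python
import numpy as np

def boxcox_loglik(y, lam):
    """Profile log-likelihood (up to a constant) of the Box-Cox
    parameter lambda for a positive response y (standard formula,
    with the geometric-mean scaling that makes values comparable)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    gm = np.exp(np.mean(np.log(y)))            # geometric mean of y
    if abs(lam) < 1e-8:
        z = gm * np.log(y)                     # limit as lambda -> 0
    else:
        z = (y**lam - 1.0) / (lam * gm**(lam - 1.0))
    return -n / 2.0 * np.log(np.var(z))        # maximize over lambda

# usage sketch: grid = np.linspace(-2, 2, 81)
#               best = grid[np.argmax([boxcox_loglik(y, g) for g in grid])]
```

For a lognormally distributed response the maximizing lambda lands near zero, recovering the familiar log transformation.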
Another new topic in this edition is computationally intensive methods and simulation. The key example is the bootstrap, in Section 4.6, which can be used to make inferences about fitted models in small samples. A somewhat different computationally intensive method is used in an example in Chapter 10, which is a completely rewritten chapter on variable selection.
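The idea behind the bootstrap mentioned above can be sketched in a few lines: resample the cases with replacement, refit, and use the spread of the refitted estimates for inference. This is a minimal case-resampling sketch for a simple-regression slope, not the book's exact procedure; `bootstrap_slope` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_slope(x, y, B=2000, rng=rng):
    """Case-resampling bootstrap for the OLS slope in simple
    regression: refit on B resamples drawn with replacement."""
    n = len(x)
    slopes = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)       # resample cases with replacement
        slopes[b] = np.polyfit(x[idx], y[idx], 1)[0]  # slope of refit
    return slopes

# usage sketch: a 95% percentile interval for the slope is
#               np.percentile(bootstrap_slope(x, y), [2.5, 97.5])
```

The percentile interval requires no normality assumption, which is what makes the bootstrap attractive in small samples.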
The book concludes with two expanded chapters on nonlinear and logistic regression, both of which are generalizations of the linear regression model. I have included these chapters to provide instructors and students with enough information for basic usage of these models, and to take advantage of the intuition about them gained from in-depth study of the linear regression model. Each of these topics can be treated at book length, and appropriate references are given.
Some readers may prefer to have a book that integrates the text with the computer package more closely, and for this purpose I can recommend R. D. Cook and S. Weisberg (1999), Applied Regression Including Computing and Graphics, also published by John Wiley. This book includes a very user-friendly, free computer package called Arc that does everything that is described in that book and also nearly everything in Applied Linear Regression.