Part 2 Considering the Analysis

When considering an analysis, we first need to translate the scientific question into a statistical question. A statistical question has three parts: What are the variables, and what kind are they? What are the observational units, that is, what “things” were these variables measured on? And finally, what relationships between these variables are we interested in? Once we’ve identified these, we can determine what analyses might be appropriate.

If you’re reading the scientific literature, you should know that in most cases, published work does use the right analysis, or at least a reasonable one. Not to say there aren’t poorly done analyses out there! But if you’re reading a paper for understanding, usually what you want to be thinking about is “why did they choose this particular analysis,” not “was this analysis correct.”

2.1 Variables

On to variables! What are the variables that were measured that you’ll use to answer your scientific question, and what kind of variables are they? There are perhaps three main kinds, depending on how you count. You might have categorical variables; if so, think about how many categories, or levels, each variable has. If it’s just two, it’s called a “binary” variable. Also, do the levels have a natural ordering? If they do, these are called “ordinal” variables. You might have numerical variables; if so, you want to consider whether they’re roughly “normally” distributed. This means that most of the data is in the middle, tapering off evenly on both sides, with no distinct outliers. If the data is skewed, has outliers, or is non-normal in some other way, we’ll need different methods. Finally, you might have “survival” variables. These are numeric, in that they tell “how long” until something happened. But sometimes instead of knowing that it was, for example, exactly 30 days, we know only that it was at least 30 days. This happens when individuals are lost to follow-up or the study ends before the event occurs.
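To make the “exactly 30 days” versus “at least 30 days” distinction concrete, here is a minimal Python sketch of one common way to record a survival variable: a time plus a flag saying whether the event was actually observed. The class and field names are hypothetical, chosen only for illustration, and the two observations simply echo the 30-day example above.

```python
from dataclasses import dataclass


@dataclass
class SurvivalObservation:
    """One survival measurement (hypothetical structure, for illustration)."""
    days: float            # how long the unit was followed
    event_observed: bool   # True: event happened at `days`; False: follow-up stopped first

    def describe(self) -> str:
        # An observed event gives an exact time; otherwise we only have a lower bound.
        if self.event_observed:
            return f"exactly {self.days:g} days"
        return f"at least {self.days:g} days"


observations = [
    SurvivalObservation(days=30, event_observed=True),
    SurvivalObservation(days=30, event_observed=False),  # lost to follow-up or study ended
]
descriptions = [obs.describe() for obs in observations]
```

Both records carry the same number, but the flag changes what that number means, which is why survival data needs its own methods.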

2.2 Observational Units

Now, the observational units! What are the actual “things” that the data was collected on? Once you’ve identified the units, the most important question is whether or not they are “independent.” What does it mean to be independent? It means that each data point comes from a separate unit, and that those units don’t share conditions, except for the ones you’re testing. If it’s an experimental study, the units should also be randomized individually, not as groups.

Here are a couple of examples. The first is pigs in a pen. If you’re doing a feeding trial and the different feeds are applied at the pen level, that is, every pig in the pen gets the same diet, then the pigs within each pen are not independent, because there might be something about that pen that causes an effect in all the pigs together. A lab example would be taking technical subsamples from a biological sample and measuring each subsample separately. These data points aren’t independent because they came from the same biological sample, so if something is different about that sample, it could be different for all the subsamples. If your observational units are not independent, you’ll need more sophisticated analyses, called “mixed” or “multi-level” models.
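The pig example can be sketched in a few lines of Python. Note that this is not the mixed-model approach named above; it is a simpler device, averaging within each pen, used here only to show that the number of independent units is the number of pens, not the number of pigs. The records are invented for illustration, and the names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# (pen, diet, weight_gain_kg) for each pig; values are made up for illustration.
records = [
    ("pen1", "A", 10.0), ("pen1", "A", 12.0),
    ("pen2", "A", 11.0), ("pen2", "A", 13.0),
    ("pen3", "B", 14.0), ("pen3", "B", 16.0),
    ("pen4", "B", 15.0), ("pen4", "B", 17.0),
]


def pen_means(rows):
    """Collapse pig-level rows to one mean per pen, the unit the diet was applied to."""
    grouped = defaultdict(list)
    for pen, diet, gain in rows:
        grouped[(pen, diet)].append(gain)
    return {key: mean(values) for key, values in grouped.items()}


means = pen_means(records)
n_pigs = len(records)   # 8 measurements
n_pens = len(means)     # but only 4 independent units
```

Eight pigs were measured, but the diet was assigned to four pens, so an analysis that treats the eight pigs as independent would overstate how much information the data holds.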

2.3 Relationships of Interest

On to thinking about the relationships of interest. Here we’re thinking about the variables we’ve identified, and determining what relationships between them we want to investigate to answer our scientific question. Usually (but not always), there will be one variable that is the “response” variable, and one or more that are “explanatory” variables of interest. There may also be additional variables, called “covariates,” that could affect this relationship in different ways. Once we’ve identified these relationships and roles, we can see what analyses are appropriate.

2.4 Choosing an Analysis

Our choice of analysis is going to depend on the answers to the questions we’ve looked at so far. First, what kind of variables are the response and explanatory variables? This table shows a general idea of what kinds of analyses are appropriate in each case. I won’t get into how each of these works or how to specifically interpret them; the idea here is simply to give you a basic road map to see how the analysis lines up with the kinds of variables involved in the scientific question. If you have a numerical response variable that is normally distributed, you’ll likely see a t-test, an ANOVA, or a linear regression, depending on the kind of explanatory variables you have. If you have a categorical response variable, you’ll likely see a chi-squared test (or Fisher’s exact test) or possibly a log-linear model if you have categorical explanatory variables, or a logistic regression if you have numerical, or multiple, explanatory variables. Again, if you have non-independent observations, you’ll need something more sophisticated, a “mixed” or “multi-level” model. And if you have survival data, you’ll likely see Kaplan-Meier curves and either log-rank tests or Cox proportional hazards models.
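The road map just described can be written down as a small lookup function. This is only a rough sketch: the function name and the category labels are hypothetical, and the split of t-test, ANOVA, and linear regression by explanatory variable type follows the usual convention rather than anything spelled out above. It is a reading aid, not a substitute for statistical judgment.

```python
def suggest_analysis(response, explanatory="categorical", independent=True):
    """Return a rough suggestion for the analysis you are likely to see.

    response:    "numerical_normal", "categorical", "survival", or anything else
    explanatory: "binary", "categorical", "numerical", or "multiple"
    independent: whether the observational units are independent
    """
    if not independent:
        return "mixed (multi-level) model"
    if response == "survival":
        return "Kaplan-Meier curves with a log-rank test or a Cox model"
    if response == "numerical_normal":
        # Conventional split (an assumption, not stated in the text):
        # two groups, several groups, or a numerical predictor.
        options = {
            "binary": "t-test",
            "categorical": "ANOVA",
            "numerical": "linear regression",
        }
        return options.get(explanatory, "linear regression")
    if response == "categorical":
        if explanatory in ("binary", "categorical"):
            return "chi-squared or Fisher's exact test (or a log-linear model)"
        return "logistic regression"
    # Skewed or otherwise non-normal numerical responses fall outside this road map.
    return "other methods needed"


pig_trial = suggest_analysis("numerical_normal", "categorical", independent=False)
two_groups = suggest_analysis("numerical_normal", "binary")
yes_no_outcome = suggest_analysis("categorical", "numerical")
```

The independence check comes first because, as in the pen example, non-independent units call for a mixed model regardless of the variable types.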