What is meant by regression modeling?
Linear Regression is one of the most common statistical modeling techniques. It is very powerful, important, and (at first glance) easy to teach. However, because it is such a broad topic it can be a minefield for teaching and discussion. It is common for angry experts to accuse writers of carelessness, ignorance, malice and stupidity. If the type of regression the expert reader is expecting doesn’t match the one the writer is discussing then the writer is assumed to be ill-informed. The writer is especially vulnerable to experts when writing for non-experts. In such writing the expert finds nothing new (as they already know the topic) and is free to criticize any accommodation or adaptation made for the intended non-expert audience. We argue that many of the corrections are not so much evidence of wrong ideas as evidence of a lack of empathy for the informality that concise writing requires. You can only define so much in a given space, and once you write too much you confuse and intimidate a beginning audience.
Let’s start with a common definition of regression modeling from The Cambridge Dictionary of Statistics (B. S. Everitt, Cambridge 2005 printing):
A frequently applied statistical technique that serves as a basis for studying and characterizing a system of interest, by formulating a mathematical model of the relation between a response variable, y and a set of q explanatory variables x1, x2, … xq. The choice of the explicit form of the model may be based on previous knowledge of the system or on considerations such as “smoothness” and continuity of y as a function of the x variables. In very general terms all such models can be considered to be of the form.
y = f(x1,...xq) + e
where the function f reflects the true but unknown relationship between y and the explanatory variables. The random additive error e which is assumed to have mean 0 and variance sigma_e^2 reflects the dependence of y on quantities other than x1,…,xq. The goal is to formulate a function fhat(x1,x2,…,xp)[sic] that is a reasonable approximation of f. If the correct parametric form of f is known, then methods such as least squares estimation or maximum likelihood estimation can be used to estimate the set of unknown coefficients. If f is linear in the parameters, for example, then the model is that of multiple regression. If the experimenter is unwilling to assume a particular parametric form for f then nonparametric regression modeling can be used, for example kernel regression smoothing, recursive partitioning regression or multivariate adaptive regression splines.
This is a bit long for a non-expert audience. Also notice a single typo (writing p where you clearly mean q) is not evidence of a lack of knowledge, care or effort (typos happen). The definition has a lot of conditions, caveats and alternatives. For practical writing you need to take a slice of this definition (the slice closest to what you are actually going to use) and go with that.
And even this definition is not complete enough to be strictly correct to a hostile reader. An easy (and common) cheap shot would be to write the following: “The writer clearly does not understand the nature of regression as he fails to correctly define regression as estimating expectations when attempting to discuss regression modeling.” Note: this is not the case: Everitt clearly has a deep understanding of regression. But knowing what you are talking about seems not to be a sufficient protection or defense.
Here is what our hypothetical critic claimed to be missing:
A term usually reserved for the simple linear model involving a response y, that is a continuous variable and a single explanatory variable, x, related by the equation.
E(y) = a + b x
Where E denotes expected value. See also multiple regression and least squares estimation. [ ARA Chapter 1.]
Except this isn’t missing. This definition is also from Everitt. He just doesn’t have space for this aspect of regression in his discussion of regression modeling. You can only emphasize so many things at once (as you add more generalizations, caveats, conclusions and consequences you dilute core ideas).
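To see that the two Everitt definitions fit together, here is a short worked derivation. It is only a sketch under one added assumption that is not in the quoted text: that the error e has mean zero conditional on the explanatory variables (the quoted definition only says e has mean 0). The symbols a and b are those of the second definition.

```latex
\begin{align*}
y &= f(x_1,\dots,x_q) + e, \qquad E(e \mid x_1,\dots,x_q) = 0 \quad \text{(assumed)} \\
E(y \mid x_1,\dots,x_q) &= f(x_1,\dots,x_q) + E(e \mid x_1,\dots,x_q) = f(x_1,\dots,x_q) \\
\text{with } q = 1 \text{ and } f(x) = a + b x: \quad E(y \mid x) &= a + b x
\end{align*}
```

Under that assumption, the "estimating expectations" view is a consequence of the general regression modeling definition rather than something it omits.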
When writing for the non-expert you need to make sure what you are writing is correct (so you are actually usefully educating) but you need to also be concise and (at least initially) anticipate the new reader’s initially naive expectations. If you spend a lot of time on a side issue, the non-expert will assume the side issue was the actual central topic of discussion. For example you don’t repeat over and over that you must assume the variances sigma_e^2 are uniformly bounded (which is in fact important), but use the fact that a new reader’s intuition often doesn’t yet include random variables with unbounded variance (saving you discussing the precaution). You spend your initial time addressing issues that are likely causing the reader conceptual trouble (such as how can a linear function approximate a non-linear one and how can you simultaneously estimate coefficients). You only bring in stuff that the naive reader isn’t likely to worry about later and only if it is something they need to defend against. This style of writing is needed if you actually want to teach to a broad audience. But it leaves the writer vulnerable to the accusation that they don’t know what they are talking about (because you didn’t spend time on something that could theoretically invalidate your work, but that doesn’t tend to happen in the application at hand). Beginning learners do need correct definitions, but they also need succinct and situationally relevant discourse.
As an example, even pure mathematics writing is commonly informal (and requires a friendly, not a hostile, reading to make strict sense). Consider: in one combinatorics course I attended (combinatorics being a specific type of mathematics) the lecturer used the following convention. For every theorem the phrase “for all sets” was to be taken to mean either “for all sets except the empty set” or “for all sets including the empty set” depending on which specialization actually worked in the theorem in question. This “sloppiness” improved and sped up discourse greatly. Many mathematicians do this. Take problem 1C from page 6 of “A Course in Combinatorics” van Lint and Wilson (1st Ed. 1993 reprint): “Show a connected graph on n vertices is a tree if and only if it has n-1 edges.” If you worry all the way down about empty sets you have a hard time sensibly deciding if an empty graph can be a tree (having to decide if a graph can have zero vertices, and having to decide if a zero vertex graph is connected; my skim of the book definitions seems to allow the empty graph as a connected graph; leading to the reasoning failure that a 0-node graph is a connected graph that is not a tree as it fails to have the required -1 edges).
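The empty-graph corner case in problem 1C can be made concrete with a small sketch. The function names are made up for illustration, and the choice to treat the 0-vertex graph as (vacuously) connected is an assumption that mirrors the reading described above, not a definition taken from the book.

```python
def is_connected(n, edges):
    """Connectivity of an undirected graph on vertices 0..n-1 (depth-first search).

    Assumption: the 0-vertex graph is treated as vacuously connected.
    """
    if n == 0:
        return True
    adjacency = {v: [] for v in range(n)}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    seen = {0}
    stack = [0]
    while stack:
        u = stack.pop()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n


def has_tree_edge_count(n, edges):
    """The edge-count condition from problem 1C: exactly n - 1 edges."""
    return len(edges) == n - 1


path = [(0, 1), (1, 2)]              # connected, 2 edges on 3 vertices
triangle = [(0, 1), (1, 2), (0, 2)]  # connected, but contains a cycle

print(is_connected(3, path), has_tree_edge_count(3, path))          # True True
print(is_connected(3, triangle), has_tree_edge_count(3, triangle))  # True False
print(is_connected(0, []), has_tree_edge_count(0, []))              # True False
```

For n = 0 the edge-count condition asks for -1 edges, which no graph can have, so the literal statement only makes sense if the reader quietly excludes the empty graph.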
But back to statistics and regression. Where does the term regression come from, and what does it mean in this context? From Wikipedia:
The term “regression” was coined by Francis Galton in the nineteenth century to describe a biological phenomenon. The phenomenon was that the heights of descendants of tall ancestors tend to regress down towards a normal average (a phenomenon also known as regression toward the mean). For Galton, regression had only this biological meaning, but his work was later extended by Udny Yule and Karl Pearson to a more general statistical context. In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is assumed to be Gaussian. This assumption was weakened by R.A. Fisher in his works of 1922 and 1925. Fisher assumed that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this respect, Fisher’s assumption is closer to Gauss’s formulation of 1821.
Galton’s regression is the observation that repeated experiments (like heights of descendants) tend to revert to the mean (meaning the children of a tallest child are not necessarily much taller than their cousins). The term “regression” is about expectations and implies a separation of explainable variation from unexplainable variation (treated as 0-mean noise, compatible with reversion to the mean).
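A small simulation can illustrate reversion to the mean. This is only a sketch: the mean of 170 and the standard deviations of 5 are made-up numbers, and the model (a shared "inherited" component plus independent 0-mean noise for each person) is an assumption chosen for illustration, not Galton's data or method.

```python
import random
import statistics


def simulate_heights(n, seed=0):
    """Return (parent, child) height pairs that share an inherited component."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        inherited = rng.gauss(0.0, 5.0)                   # explainable, shared part
        parent = 170.0 + inherited + rng.gauss(0.0, 5.0)  # plus the parent's own noise
        child = 170.0 + inherited + rng.gauss(0.0, 5.0)   # plus the child's own noise
        pairs.append((parent, child))
    return pairs


pairs = simulate_heights(20000)
overall_mean = statistics.mean(c for _, c in pairs)

# Select the tallest 10% of parents and look at their children.
pairs_by_parent = sorted(pairs, key=lambda pc: pc[0], reverse=True)
tall = pairs_by_parent[: len(pairs) // 10]
tall_parent_mean = statistics.mean(p for p, _ in tall)
tall_child_mean = statistics.mean(c for _, c in tall)

print(f"overall child mean:           {overall_mean:.1f}")
print(f"mean of tallest parents:      {tall_parent_mean:.1f}")
print(f"mean of those parents' kids:  {tall_child_mean:.1f}")
```

In this setup the children of the tallest parents are taller than average, but on average less extreme than their parents: the part of a parent's height that came from unshared noise is not passed on.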
From “On the Theory of Correlation”, G Udny Yule, Journal of the Royal Statistical Society, 1897 vol. 60 (4) pp. 812-854 (and speaking about the typical fit curve of y as a function of x over many data points):
It is a fact attested by statistical experience that these means do not lie chaotically all over the table, but range themselves more or less closely round a smooth curve, which we will name the curve of regression of x on y.
So regression methods evolve from finding the curve of regression, which itself is the best fit for groups of observations after allowing some of the variation to be declared “unexplained” and left in a noise term. This is an advance from mere fitting or solving where you might be trying to explain all of the observed variation in n-individuals using as many as n-variables.
Regression methods have multiple formulations with different strength of assumptions and different strength of conclusions. You can use varying assumptions to trade generality for power at will. Thus a hostile reader can equally criticize a writer who carefully states a distributional assumption (as they “clearly don’t know how general the method is”) or a writer that fails to make a distributional assumption (as they “clearly don’t know what they are doing”).
A wide range of applications fall under the rubric of linear regression. Some include:
- Simple least squares fitting. Running a line through only known data to minimize the total sum of squared errors. No probability model or statistical theory is initially involved (you are not interpreting the fit as being maximum likelihood or useful for prediction), so few assumptions are actually needed. You can criticize a writer for failing to assume “the errors have expectation zero and are uncorrelated and have equal variances” (because then they can’t invoke the Gauss-Markov theorem, that least squares is a best linear unbiased estimator, and so miss the full power of the method) or you can criticize a writer for making any such assumption (because then they fail to realize that least squares by definition minimizes square error, and so miss the full generality of the method).
- Estimating coefficients (either in a frequentist or Bayesian manner). Now you certainly have to make statistical assumptions (you see the data as a noisy transformation of unknown coefficients). For a frequentist analysis you need distributional assumptions on the noise process (to turn losses into likelihoods) and for the Bayesian analysis you need to make assumptions on the prior distribution of the unknown coefficients (to turn conditional likelihoods on observations into posterior likelihoods on parameters).
- Making predictions. One thing that surprises most data scientists is that statisticians do not consider making predictions the most important use of models. Statisticians tend to emphasize finding relations as more important (hence their assumptions tend to be designed to make coefficient estimation correct, but no stronger, to preserve generality). It turns out that to make reliable predictions you may need slightly different assumptions (at the very least something like exchangeability of data). So you can always criticize good work on relations as not being theoretically sound for predictions or good work on predictions as not being theoretically sound for extracting relations (or criticize work careful enough to meet both goals as bringing in too many assumptions).
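The first item in the list above, plain least squares fitting, can be sketched without any probability model at all. The data below are made up for illustration, and the function names are hypothetical. The fit uses the standard closed-form solution for a single explanatory variable.

```python
def fit_line(xs, ys):
    """Least squares line y ~ a + b*x via the closed-form solution."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b


def sum_squared_error(a, b, xs, ys):
    """Total squared error of the line a + b*x on the given points."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
best = sum_squared_error(a, b, xs, ys)

# Nudging the intercept or slope in any direction increases the squared error.
perturbed = [
    sum_squared_error(a + da, b + db, xs, ys)
    for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]
]

print(f"a = {a:.3f}, b = {b:.3f}, SSE = {best:.4f}")
print("every perturbed line is worse:", all(sse > best for sse in perturbed))
```

Nothing here assumes anything about error distributions: the fitted line minimizes squared error on the known data by construction. Assumptions only enter once you want to interpret a and b as estimates, or use the line for prediction.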
Please read carefully the following from Professor Andrew Gelman (a statistics professor I highly respect):
In section 3.6 of my book with Jennifer we list the assumptions of the linear regression model. In decreasing order of importance, these assumptions are:
- Validity. Most importantly, the data you are analyzing should map to the research question you are trying to answer. This sounds obvious but is often overlooked or ignored because it can be inconvenient. . . .
- Additivity and linearity. The most important mathematical assumption of the regression model is that its deterministic component is a linear function of the separate predictors . . .
- Independence of errors. . . .
- Equal variance of errors. . . .
- Normality of errors. . . .
Further assumptions are necessary if a regression coefficient is to be given a causal interpretation . . .
Normality and equal variance are typically minor concerns, unless you’re using the model to make predictions for individual data points.
Notice that what to worry about depends on how you intend to use the regression result. Then look at the diversity of comments on the original article. Regression is clearly not a single method with one best set of conditions and one strongest set of consequences. There isn’t a “one true weakest set of assumptions that simultaneously gives sharpest results.”
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.