## Introduction

I want to spend some time thinking out loud about linear regression.

As a data science consultant and teacher I spend a lot of time using linear regression and teaching linear regression. I have found each of these pursuits can degenerate into mere doctrine or instructions: “do this,” “expect this,” “don’t do that,” “you should know,” and so on. What I want to do here is take a step back and think out loud about linear regression from first principles. To attempt this I am going to start with the problem linear regression solves, and try to delay getting to the things so important that “everybody should know them without question.” So let’s think about a few things in a particular order.

## What is Linear Regression?

First, what is linear regression?

*Regression* itself is the modeling of a real-valued dependent variable or outcome `y_{i}` (`i` being a subscript that tells us which instance or example in our sample we are talking about) as a function of several real-valued explanatory variables `X_{i,j}` (usually unfortunately called “independent variables”) as:

y_{i} ~ F(X_{i}) (1).

For non-real explanatory variables (such as strings) one needs an explicit or implicit re-coding of variables to real numbers, such as indicators, embeddings, or impact codes (in R or in Python).
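As a tiny sketch of one such re-coding, here is an indicator (one-hot) encoding using pandas; the frame, the column name `color`, and its levels are all made up for illustration:

```python
import pandas as pd

# hypothetical example frame with a string-valued explanatory variable
d = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# indicator (one-hot) re-coding: one 0/1 column per observed level
coded = pd.get_dummies(d, columns=["color"])
print(coded.columns.tolist())
# ['color_blue', 'color_green', 'color_red']
```

Each string level becomes its own numeric column, which the regression can then treat as an ordinary real-valued explanatory variable.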

In the frequentist formulation the `X_{i}` and `y_{i}` are traditionally a sequence of `m` samples drawn independently from a population of possible `(X, y)` pairs. Unless stated otherwise, any randomness or expectations are over re-draws of samples from the same unobserved ideal population.

In *linear regression* we assume `F()` is a linear (or affine) function of the form:

F(X_{i}) := sum_{j} B_{j} X_{i,j} + C (2).

The `B` and `C` are called the “coefficients” and “intercept” respectively, and are collectively called the “parameters.”
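A minimal numeric sketch of equation 2, with made-up parameter values (the particular numbers below are purely illustrative):

```python
import numpy as np

# made-up parameters for illustration
B = np.array([2.0, -1.0])  # coefficients B_j
C = 0.5                    # intercept

def F(X):
    # equation 2: F(X_i) = sum_j B_j X_{i,j} + C, vectorized over rows i
    return X @ B + C

X = np.array([[1.0, 0.0],
              [0.0, 1.0]])
print(F(X))  # [ 2.5 -0.5]
```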

I have deliberately delayed introducing the usual assumptions of linear regression (good ref here) so we can develop them one at a time.

## The Role of Errors

The usual derivations of regression forswear the approximation `y_{i} ~ F(X_{i})` of relation 1 and instead use the “generative form” equation:

y_{i} = F(X_{i}) + e_{i} (3).

That is, they say “assume `y_{i}` *is generated as* `F(X_{i})` plus an error term `e_{i}`.”

A great number of “modeling assumptions” are insisted on for the error term `e`. We greatly hope that the `e` is:

- Expected value zero (zero bias). That is, the expected value `E[e_{i}]` is zero. Expectation (`E[]`) is taken over how our observable sample is drawn from an ideal possible population.
- Independent of `X`. That is: knowledge of `X_{i}` is no help in estimating `e_{i}`. This is often weakened or specialized to ask that the `e_{i}` be homoscedastic, which is asking that the expected value `E[e_{i}^{2}]` be a constant with respect to `X_{i}`. Note, our frequentist sampling assumption already trivially assures that `E[e_{i}^{2}]` is independent of `i`.
- Spherically symmetric (in distribution). We want the distribution of `R e` to look a lot like the distribution of `e` when `R` is a rotation or orthonormal matrix. We want such a specific property because we want to use this exact property in a very exciting proof of how degrees of freedom work. Fortunately this odd-sounding condition is met by more detailed plausible-sounding distributions such as “normally distributed”, which is spherically symmetric. I’ll delay explaining this property until we use it. But be aware, if we are being honest we have to say we are assuming this because we want this.

Notice we have no assumptions on the distribution of `X` or `y`. Technically we need `X` to be “low variance” in the sense that a reasonable-sized sample is representative of the population, though this statement is often neglected. We will return to representativeness in a bit. Also be aware, we do not routinely check or prove the above properties prior to fitting. It is most common to hope they hold. There are some checks for homoscedasticity and normality of fit residuals, but we must remember residuals are not the same as errors.

## The Role of Residuals

Define `f()` as our estimate for the (possibly unknown or unobserved) function `F()` as:

f(X_{i}) := sum_{j} b_{j} X_{i,j} + c (4)

where `b` and `c` are our estimates of `B` and `C`. `b` and `c` are usually estimated from training data (data available to the analyst) by a process called fitting.
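As a small sketch of fitting, assume we simulate training data with made-up true parameters; `numpy.linalg.lstsq` serves here as one example of common least-squares fitting software:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate data from the generative form with made-up B = [2, -1], C = 0.5
m = 200
X = rng.normal(size=(m, 2))
y = X @ np.array([2.0, -1.0]) + 0.5 + rng.normal(scale=0.3, size=m)

# append a column of ones so the intercept c is estimated alongside b
X1 = np.column_stack([X, np.ones(m)])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
b, c = coef[:2], coef[2]
print(b, c)  # b near [2, -1], c near 0.5
```

The recovered `b` and `c` are close to, but not exactly, the true `B` and `C`, because the fit only sees the noisy sample.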

Define the residual `r_{i}` as:

r_{i} := y_{i} - f(X_{i}) (5).

A linear regression model is considered good if the “loss” `sum_{i} (y_{i} - f(X_{i}))^{2}` is small.
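For example, a direct computation of this loss for a hypothetical set of outcomes and predictions:

```python
import numpy as np

def loss(y, y_hat):
    # sum of squared differences between outcomes and predictions
    return np.sum((y - y_hat) ** 2)

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.5, 2.5])
print(loss(y, y_hat))  # 0.5
```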

Putting aside how we find `f()` for a bit, let’s re-arrange the last equation to define the residuals `r_{i}`:

y_{i} = f(X_{i}) + r_{i} (6).

Notice how similar equations 3 and 6 are. Equation 3 is how we assume the data is generated, and equation 6 is how we are modeling the data. Our goal is to estimate a function `f()` that is very similar to the unobserved function `F()`. We do observe the `y_{i}` and `X_{i}`, which gives us information about `F()` and the ability to form our estimate `f()`.

Here is a small logic step that is too often skipped in teaching. If our estimate `f()` is very similar to `F()`, then we must have `r` very similar to `e`. Residuals and errors deliberately play very similar roles. The roles are so similar that in a rush one may confuse the two as being the same.

Now let’s reverse the logic of our previous idea and say: if `r` is very similar to `e`, then we achieve our fitting goal of having our estimate `f()` be very similar to `F()`.

The fitting procedure of linear regression is picked to minimize `sum_{i} r_{i}^{2}` (or equivalently to pick an admissible `f()` minimizing `sum_{i} (y_{i} - f(X_{i}))^{2}`). This is considered easy to do, as there is common software that does this (in R, in Python). From the definition of the fitting procedure we know the following about the residuals `r`.

sum_{i} r_{i} = 0 (7).

This follows from the `c` in the definition of `f()` in equation 4. Consider the partial derivative of `sum_{i} r_{i}^{2}` with respect to `c`. Some algebra shows that this is equal to `sum_{i} -2 (y_{i} - f(X_{i}))`. This is zero exactly when `sum_{i} (y_{i} - f(X_{i})) = 0`, and sums of squares are minimized where their partial derivatives are zero. As our fitting procedure is defined as minimizing `sum_{i} r_{i}^{2}`, we must satisfy equation 7 in our fit solution. Note: we are not claiming all the residuals are zero, but that they cancel so that their sum is zero.

This gets us to a great point. The relation `sum_{i} r_{i} = 0` is called being unbiased. Our residuals `r` are mean-zero, or unbiased. So if they are to be similar to the earlier defined errors, then the errors `e` must also already be nearly mean-zero or nearly unbiased. Or to say it again: **we routinely insist that the errors in equation 3 be assumed expectation zero, as our imitation of them (the fitting residuals from equation 6) are going to be so**. There is no point checking that residuals are mean-zero, as they always will be if there is an intercept term (such as `c` in the estimation formulation).
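We can confirm this mechanically on made-up data. Note the claim holds even when `y` is deliberately unrelated to `X`, because it is a property of the fitting procedure, not of the data:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100
X = rng.normal(size=(m, 2))
y = rng.normal(size=m)  # deliberately unrelated to X

# fit with an intercept column, then form the residuals r_i
X1 = np.column_stack([X, np.ones(m)])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
r = y - X1 @ coef

# equation 7: residuals sum to (numerically) zero
print(abs(r.sum()))
```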

So we now can say why we assume errors to be expectation zero: because the usual estimate of them is expectation zero.

The earlier arguments can be extended to show for all `j` we also have `sum_{i} X_{i,j} r_{i} = 0`, and thus we *need* to assume `E[X_{i,j} e_{i}] = 0` for each `j` if we expect our residuals to be similar to the errors. This gives us that the residuals `r` are forced to be *linearly independent* of `X` by the fitting process (these sorts of balance conditions are the core of linear regression, logistic regression, and even maximum entropy modeling). So if we are to have our residuals `r` be similar to the errors `e`, then we had better hope (or assume) that the errors `e` are linearly independent of `X` (or even the stronger assumption, distributionally independent).
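These balance conditions can also be checked numerically on simulated data (the true parameters below are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 100
X = rng.normal(size=(m, 2))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=m)

# fit with an intercept column, then form the residuals r_i
X1 = np.column_stack([X, np.ones(m)])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
r = y - X1 @ coef

# sum_i X_{i,j} r_i = 0 for every explanatory column j
print(np.abs(X.T @ r).max())
```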

With some effort we have now seen why we desire the first two of our three assumptions on the error structure for linear regression. Our fit residuals will have these two properties by the nature of the fitting procedure, so they won’t correctly imitate errors that don’t have similar properties. The third property, normal distribution of errors, isn’t needed unless we want to benefit from one of the theorems that use that property. In a later note we will show one great use of that third assumption.

## Conclusion

In this note we worked through two of the three important assumptions we hope for (or impose on) data when we think the data is appropriate to model with a linear regression. I hope this hints at how much these assumptions stem from need or desire, and not from evidence. Obviously working to make these assumptions true is very important in experiment design.

In a later note I hope to cover more of the nature of excess generalization error and overfitting. I have added a video lecture on how we use the decomposition of the regression problem into expected value and error or residual here.

Categories: Tutorials