The goal of this note is to try and characterize excess generalization error: how much worse your model works in production versus how well it appeared to work during training. The clarifying point is excess generalization error (also called overfit) isn’t so much the model performing unexpectedly poorly on new data but the model performing uncharacteristically well on training data. This note is going to derive the most common bound characterizing this undesirable effect.
To get the result, I first need to take some time to set up some detailed notation in the style of our lat note here.
For weighted linear regression we are usually picking a B that minimizes the expression:
(1/2) || W (Y - X B) ||2
|| * ||2 is the square of the usual 2-norm (or euclidian norm) of vectors.
In the above,
X is an m-row by n-column (m typical greater than n) matrix of explanatory variables. Each row of
X is called an instance, and is a vector of measurements that are available for the given instance.
Y is a m-vector with one outcome number of each corresponding instance row of
W is an m-row by m-column diagonal matrix that we use to specify the instance weights, or how important each instance is. Often
W is just the m by m identity matrix with 1s on the diagonal and zeros elsewhere. We will assume that our
W is diagonal, has non-negative entries, and is such that
W X is full column rank (in particular this implies we must already have
m ≥ n and
X itself full column rank).
It is common to convert the standard linear regression notation of
Y ~ X' B + C to the more concise (and uniform)
Y ~ X B by constructing
X to be
X' with a column that is all 1s added on. That is we: we treat the linear regression intercept as just another coefficient.
One would expect such a function is minimized at a stationary point where the partial derivatives of the above form with respect to
B are all zero.
With time, paper, and a deep comfort with matrix algebra one can work out the derivative condition in question looks like the following.
(XT WT W) Y - (XT WT W X) B = 0
One would then use a matrix algebra tool to invert divide out the term obscuring the
B in the above equation to get the following solution for
B = (XT WT W X )-1 XT WT W Y
To simplify the notation let’s collect the bulk of the above equation into a new symbol “
F := (XT WT W X )-1 XT WT W
Then our earlier solution is just:
B = F Y
Defining more linear operators
Let’s define further matrices or operators
G := X F
Z := (I - G)
I is the m by m identity matrix.
What these operators do is:
Festimates or, solves for, the regression coefficients (or parameters)
B. It is our fitter.
Gshows us what estimates for
Ywe would get using the coefficient estimate produced by
F. It is our guess.
Zshows us the residuals between
Yand our estimate of
Y. It is our zeroer or annihilator.
These operators have a few really cool properties:
G X = X
G G = G
Z X = 0
Z Z = Z
Confirming the identities
These identities can be checked by plugging in the definitions, using associativity to regroup terms, and then applying some declared inverses.
= (X (XT WT W X )-1 XT WT W) X # by definition
= X (XT WT W X )-1 (XT WT W X) # associativity
= X # applying inverse
= (X (XT WT W X )-1 XT WT W) (X (XT WT W X )-1 XT WT W) # by definition
= X (XT WT W X )-1 (XT WT W X) (XT WT W X )-1 XT WT W # associativity
= X (XT WT W X )-1 XT WT W # applying inverse
= G # matching definition
= (I - G) X # by definition
= X - X # applying G X = X, and algebra
= 0 # algebra
= (I - G) (I - G) # by definition
= I - 2 G + G G # algebra
= I - 2 G + G # by G G = G
= I - G # algebra
= Z # matching definition
The form “A A = A” is exactly the definition of a linear algebra projection. So
Z are both projections.
G looks like the identity for the columns of
Z is an annihilator the columns of
Using the operators
Now that we have some operators defined, let’s use them to prove something substantial.
One of the amazing things about linear regression is that there is a standard estimate of the excess generalization error available. That is: using in-sample performance (which is itself is an optimistic biased estimate) one can estimate expected performance on new similarly distributed data.
This is done using the important observation that generalization error isn’t actually the model doing worse on new data. It is the model over-performing on its own training data. For linear regression we can estimate or bound the magnitude of this over performance. Given the above tools, the argument is very quick.
Assume our observed outcomes
Y are real numbers generated as
Y = B X + e, where
X the instance rows,
B is the unknown true linear function, and
e is the error term. We observe only
Y, X so we don’t actually know
B, e. We further assume these m instances are a uniform independent samples drawn from a large unobserved population. We want to build an estimate
B and then estimate how well our estimate will work on a new sample drawn from the same unobserved population. We further assume
e is independent of
X, expectation zero, and normally distributed (please see this note for more on these definitions). Note, we don’t quite need to assume normally distributed, we are only going to use finite variance and spherically symmetric in expectation- two properties normally distributed random variables have.
Now let’s look at a common measure of error: mean square error. If
y is our estimate of
Y on the training data (or in sample loss or error) the mean squared error is defined as:
(1/m) ||Y - y||2
In our notation we have
y = G Y and even
Y - y = Z Y. So the observed in-sample error is equal to
(1/m) ||Z Y||2. However the correct estimate of the expected mean square error of
Y modeled as a linear function of the explanatory variables is
(1/m) ||e||2 (remember
e is not directly observable). If we can estimate how much smaller
(1/m) E[||Z Y||2] is than
(1/m) E[||e||2], we would then be done. Here
E means expected value over re-draws of the training sample (implying re-runs of the fitting procedure).
To get the desired estimate, let’s look closer at
(1/m) ||Z Y||2.
(1/m) E[||Y - y||2] # Can estimate as the observable (1/m) ||Y - y||2
= (1/m) E[||Z Y||2] # by definition
= (1/m) E[||Z (X B + e)||2] # by definition
= (1/m) E[||Z X B + Z e||2] # expanding
= (1/m) E[||Z e||2] # using Z X = 0
It remains to show how the above observable quantity relates to the unknown unbiased estimate of error
Z is a projection of rank
m - n, it is equal to
R P S for some rotation matrices,
P a diagonal matrix with the first
m - n entries of the diagonal all 1, and the remaining entries all zero. Rotation matrices (such as
R) have the property that
||R v|| = ||v|| for any vector
Keep in mind: we want to know
(1/m) E[||e||2], and we are willing to accept
(1/m) ||e||2 as an estimate of this expectation. However, we only directly observe
(1/m) ||Z e||2. So we have to relate this observed quantity to the desired quantity.
(1/m) E[||Y - y||2] # can estimate, as (1/m) ||Y - y||2 is observable
= (1/m) E[||Z e||2] # by earlier derivation
= (1/m) E[||R P S e||2] # by definition
= (1/m) E[||P (S e)||2] # as R is a rotation
= (1/m) (m-n) E[((S e)1)2] # linearity of expectation, and P keeps m-n out of m coordinates
= (1/m) ((m-n)/m) E[||S e||2] # spherical symmetry
= (1/m) ((m-n)/m) E[||e||2] # as S is a rotation
We now have
(m / (m - n)) (1/m) E[||Y - y||2] = (1/m) E[||e||2].
(1/m) E[||Y - y||2] is something we can estimate from performance on training data, and
(1/m) E[||e||2] is the expected out of sample performance on new data. So we have related something we can estimate to what we want to know.
That is: scaling the observed error up by an adjustment of
(m / (m - n)) corrects for the “things are easy on training data” bias. This is what drives the so-called adjusted R-squared or “degrees of freedom” arguments.
This is a fact that is “in all the books”, but you don’t often see the derivation.
Linear regression is well characterized, as we can write it in terms of well behaved linear operators operating on the observed dependent or outcome variable. With some notational cost we can quickly derive the expected out of sample performance of such a model, without any reference to test, hold-out, or cross-validation procedures. Our conclusion was: for a linear model with
n variables (counting the intercept term as a variable), and
m examples the expected mean square error on new data can be
m / (m-n) times as large as the expected mean square error on the training data.
For a video of a VC-dimension style argument (using critical sets of examples) bounding excess generalization error for classifiers in terms of accuracy, please see here.