I am conducting another machine learning / AI bootcamp this week. Starting one of these always makes me want to get more statistical commentaries down, just in case I need one. These classes have to move fast, and also move correctly. In this case I want to write about decomposition of variance.

A famous equation in statistics is:

SSyy = SSR + SSE

This decomposes `SSyy` (the variation in a dependent variable quantity to be predicted, in this case a numeric regression target), into `SSR` (variation of a prediction about the dependent variable’s mean) plus `SSE` (variation between the quantity to be predicted and the prediction). This decomposition is fundamental in ANOVA, and also sometimes discussed in the context of OLS (ordinary least squares) linear regression.

This equation appears in Gonick, Smith *The Cartoon Guide to Statistics*, 1993, Harper Collins, page 194.

The equation is stated *inside* a box or context labeled “ANOVA.” It is true in this context, and some others (such as OLS regression). But it is important to remember we need a context for such an equation to hold.

Let’s see this equation in action using R.

First we set up some synthetic data where `x1, x2, x3` are our proposed explanatory variables, and `y` is the numeric dependent variable to be predicted.

```
n <- 100
d <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n))
d$y <- 1 * d$x1 + 2 * d$x2 + 3 * d$x3 + rnorm(n)
```

We fit our model with `lm`, and then show the balance conditions on `lm`’s predictions.

```
m_1 <- lm(y ~ x1 + x2 + x3, data = d)
y_hat_1 <- predict(m_1, newdata = d)

show_balance <- function(y, y_hat) {
  y_bar <- mean(y)
  SSR <- sum((y_hat - y_bar)^2)  # how our estimate differs from y-average
  SSE <- sum((y - y_hat)^2)      # how our estimate differs from actual y
  SSyy <- sum((y - y_bar)^2)     # how y differs from the y-average
  eq <- (SSR + SSE) - SSyy       # a famous cancellation, zero up to rounding error for OLS
  paste0(
    'SSR= ', sprintf("%0.3f", SSR),
    ', SSE= ', sprintf("%0.3f", SSE),
    ', SSyy= ', sprintf("%0.3f", SSyy),
    ', eq= ', sprintf("%0.3f", eq)
  )
}

show_balance(d$y, y_hat_1)
```

`## [1] "SSR= 1333.364, SSE= 109.922, SSyy= 1443.285, eq= 0.000"`

The equation holds in the above instance. But it is *only* guaranteed to hold when the predictor meets certain conditions. Without these conditions established the equation may not hold.
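To make the needed conditions concrete, here is a short derivation sketch. The notation is an assumption chosen to match the R code: $y_i$ for `y`, $\hat{y}_i$ for `y_hat`, and $\bar{y}$ for `y_bar`. Write $y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$, square, and sum over the rows:

$$
\begin{aligned}
SS_{yy} &= \sum_i (y_i - \bar{y})^2 \\
&= \sum_i (y_i - \hat{y}_i)^2 + \sum_i (\hat{y}_i - \bar{y})^2 + 2 \sum_i (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) \\
&= SSE + SSR + 2 \sum_i (y_i - \hat{y}_i)\,\hat{y}_i - 2\,\bar{y} \sum_i (y_i - \hat{y}_i)
\end{aligned}
$$

So the decomposition holds exactly when the two trailing sums cancel. It is sufficient that the residuals $y_i - \hat{y}_i$ sum to zero and are orthogonal to the predictions $\hat{y}_i$. Both facts follow from the OLS normal equations when the model has an intercept column.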

What is often missed, or not discussed enough, is: this equality or decomposition requires certain properties of the predictor to hold. These properties are true for OLS if the model includes an intercept term. However, these properties are often not true for other models. For example: regularization kills the relation.
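As a minimal sketch of why the intercept matters, the following compares OLS with and without an intercept on fresh synthetic data. The names `d_demo`, `ss_gap`, `m_int`, and `m_no_int`, the seed, and the offset of `5` in `y` are all illustrative assumptions, not part of the example above.

```r
set.seed(2020)  # assumption: fixed seed, only for reproducibility
n <- 100
d_demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# hypothetical offset of 5, so y has a non-zero mean and the intercept matters
d_demo$y <- 5 + 1 * d_demo$x1 + 2 * d_demo$x2 + 3 * d_demo$x3 + rnorm(n)

# (SSR + SSE) - SSyy: zero up to rounding when the decomposition holds
ss_gap <- function(y, y_hat) {
  y_bar <- mean(y)
  sum((y_hat - y_bar)^2) + sum((y - y_hat)^2) - sum((y - y_bar)^2)
}

m_int <- lm(y ~ x1 + x2 + x3, data = d_demo)         # intercept included
m_no_int <- lm(y ~ 0 + x1 + x2 + x3, data = d_demo)  # intercept suppressed

gap_int <- ss_gap(d_demo$y, fitted(m_int))
gap_no_int <- ss_gap(d_demo$y, fitted(m_no_int))
print(c(with_intercept = gap_int, no_intercept = gap_no_int))
```

With the intercept the gap should be zero up to rounding. Without it the residuals no longer have to sum to zero, so the gap is expected to be far from zero.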

Let’s see the equation fail for an L2-regularized linear regression.

```
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.0-2
```

```
m_2 <- cv.glmnet(
  as.matrix(d[, c('x1', 'x2', 'x3')]),
  d$y,
  alpha = 0)
y_hat_2 <- predict(
  m_2,
  newx = as.matrix(d[, c('x1', 'x2', 'x3')]),
  s = m_2$lambda.min,  # also try lambda.1se
  type = 'response')
show_balance(d$y, y_hat_2)
```

`## [1] "SSR= 1142.413, SSE= 117.363, SSyy= 1443.285, eq= -183.509"`

Notice the equation does not hold: `eq` did not cancel to something very near zero.

What this means is: one *can’t* claim the original identity out of context. One must establish that the conditions it requires are met before claiming the conclusion. Relations like the above hold for predictions that are optimal with respect to shift and scaling, meaning no replacement of the form `a + b * prediction` has lower squared error. OLS predictors are optimal in this sense when the model includes an intercept term. Models such as regularized models, maximum likelihood models, and others may not have this property.
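The shift and scaling point can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the `glmnet` fit from above: the shrunken prediction is a crude stand-in for a regularized model, and the names `d_demo`, `ss_gap`, `y_hat_shrunk`, `cal_frame`, and `m_cal`, the seed, and the `0.8` factor are all hypothetical.

```r
set.seed(2020)  # assumption: fixed seed, only for reproducibility
n <- 100
d_demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d_demo$y <- 1 * d_demo$x1 + 2 * d_demo$x2 + 3 * d_demo$x3 + rnorm(n)

# (SSR + SSE) - SSyy: zero up to rounding when the decomposition holds
ss_gap <- function(y, y_hat) {
  y_bar <- mean(y)
  sum((y_hat - y_bar)^2) + sum((y - y_hat)^2) - sum((y - y_bar)^2)
}

# crude stand-in for regularization: pull OLS predictions toward mean(y)
# (the 0.8 shrink factor is a hypothetical choice)
m_ols <- lm(y ~ x1 + x2 + x3, data = d_demo)
y_bar <- mean(d_demo$y)
y_hat_shrunk <- y_bar + 0.8 * (as.numeric(fitted(m_ols)) - y_bar)

# re-fit the best shift and scale by regressing y on the shrunken prediction
cal_frame <- data.frame(y = d_demo$y, pred = y_hat_shrunk)
m_cal <- lm(y ~ pred, data = cal_frame)
y_hat_cal <- as.numeric(fitted(m_cal))

gap_shrunk <- ss_gap(d_demo$y, y_hat_shrunk)
gap_cal <- ss_gap(d_demo$y, y_hat_cal)
print(c(shrunk = gap_shrunk, recalibrated = gap_cal))
```

The shrunken prediction should show a clearly non-zero gap, while the recalibrated one should balance up to rounding. The same one-variable re-fit could be tried on the predictions of any model, though in-sample recalibration says nothing about out-of-sample quality.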

In statistics it can be just as bad to attempt to apply some of the equations everywhere as to “not know the equations.”


### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.