A common misunderstanding of linear and logistic regression is that the intercept encodes the unconditional mean of the outcome, or the training data prevalence.

This is easily seen not to be the case. Consider the following example in R.

We set up our example data.

```
# build our example data
# modeling y as a function of x1 and x2 (plus intercept)
d <- wrapr::build_frame(
"x1" , "x2", "y" |
0 , 0 , 0 |
0 , 0 , 0 |
0 , 1 , 1 |
1 , 0 , 0 |
1 , 0 , 0 |
1 , 0 , 1 |
1 , 1 , 0 )
knitr::kable(d)
```

| x1 | x2 | y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

And let’s fit a logistic regression.

```
## (Intercept) x1 x2
## -1.2055937 -0.3129307 1.3620590
```
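The call that produced these coefficients is not shown above. A minimal sketch that should reproduce them, assuming an ordinary `glm()` fit of `y` on `x1` and `x2` with a binomial family; the name `model` is illustrative, and the data is rebuilt with base R `data.frame()` so the snippet does not need `wrapr`:

```r
# Same seven rows as the build_frame() call above, built with base R.
d <- data.frame(
  x1 = c(0, 0, 0, 1, 1, 1, 1),
  x2 = c(0, 0, 1, 0, 0, 0, 1),
  y  = c(0, 0, 1, 0, 0, 1, 0)
)

# Logistic regression of y on x1 and x2; the intercept is included by default.
# (Assumed model form; the name `model` is illustrative.)
model <- glm(y ~ x1 + x2, data = d, family = binomial())

# Named coefficient vector: (Intercept), x1, x2.
print(coef(model))
```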

The probability encoded in the intercept term is given as follows.

```
## 1
## 0.2304816
```

Notice the prediction 0.2304816 is neither the training outcome (`y`) prevalence (0.2857143) nor the observed `y`-rate for rows that have `x1, x2 = 0` (which is 0).
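The three quantities being compared can be computed side by side. This is a sketch under the same assumption as before (a plain `glm(y ~ x1 + x2)` fit); the variable names are illustrative:

```r
# Rebuild the example data and refit the (assumed) model.
d <- data.frame(
  x1 = c(0, 0, 0, 1, 1, 1, 1),
  x2 = c(0, 0, 1, 0, 0, 0, 1),
  y  = c(0, 0, 1, 0, 0, 1, 0)
)
model <- glm(y ~ x1 + x2, data = d, family = binomial())

# Probability implied by the intercept alone: the inverse logit of the intercept.
p_intercept <- plogis(coef(model)[["(Intercept)"]])

# The same number, obtained as the model prediction for a row with x1 = 0, x2 = 0.
p_zero_row <- unname(predict(model,
                             newdata = data.frame(x1 = 0, x2 = 0),
                             type = "response"))

# Training prevalence of y, and the observed y-rate on the all-zero rows.
prevalence <- mean(d$y)
zero_rate  <- mean(d$y[d$x1 == 0 & d$x2 == 0])

print(c(intercept_prob = p_intercept,
        zero_row_pred  = p_zero_row,
        prevalence     = prevalence,
        zero_rate      = zero_rate))
```

The first two entries agree with each other, and neither matches the prevalence or the all-zero-row rate.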

The non-intercept coefficients *do* have an interpretation: each is the expected change in log-odds implied by a unit change in the given variable (assuming all other variables are held constant, which may *not* be a property of the data!).
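This reading of the coefficients can be checked directly. The sketch below (again assuming the plain `glm(y ~ x1 + x2)` fit, with illustrative variable names) holds `x1` fixed, moves `x2` from 0 to 1, and compares the change in predicted log-odds with the `x2` coefficient:

```r
# Rebuild the example data and refit the (assumed) model.
d <- data.frame(
  x1 = c(0, 0, 0, 1, 1, 1, 1),
  x2 = c(0, 0, 1, 0, 0, 0, 1),
  y  = c(0, 0, 1, 0, 0, 1, 0)
)
model <- glm(y ~ x1 + x2, data = d, family = binomial())

# Predicted probabilities with x1 held at 1 and x2 moved from 0 to 1.
p_x2_0 <- predict(model, newdata = data.frame(x1 = 1, x2 = 0), type = "response")
p_x2_1 <- predict(model, newdata = data.frame(x1 = 1, x2 = 1), type = "response")

# qlogis() is the logit, so this is the change in log-odds.
delta_log_odds <- unname(qlogis(p_x2_1) - qlogis(p_x2_0))

# The change in log-odds matches the fitted x2 coefficient.
print(all.equal(delta_log_odds, coef(model)[["x2"]]))
```

Because the model has no interaction term, the same difference results whichever value `x1` is held at.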


### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

You might want to add that it does predict the observed y-rate for rows that have x1, x2 = 0 when you interact x1 and x2.

Thanks, yes. If we add enough interaction variables, such as `x1 * x2`, then yes, the model predictions will get the masses of each single-variable identifiable set correct.

With the interaction these sets are: all rows (the intercept term), rows with `x1=1`, rows with `x2=1`, and rows with the interaction `x1 * x2 = 1`. By inclusion/exclusion, the set of rows with `x1, x2 = 0` is all rows, minus the rows with `x1=1`, minus the rows with `x2=1`, plus the rows with `x1 * x2 = 1` (which were subtracted twice). So the model gets the average and total on the `x1, x2 = 0` set correct. So in this special case the rate for the all-zero rows is encoded in the intercept term.

If you follow the marginality principle, then the regression intercept cannot be interpreted at all when higher-order effects are in the model.

The intercept does correspond to the weighted unconditional mean, where the weights are determined by the prevalence of the factor combinations in the experimental design.
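The interaction case raised in the reply above can be tried on the example data. This is a rough sketch, not a definitive treatment: with the `x1 * x2` interaction the model is saturated, and because three of the four cells have observed rates of exactly 0 or 1, `glm()` runs into separation (it warns, and some coefficients diverge). The warnings are suppressed here for readability, and the variable names are illustrative.

```r
# Rebuild the example data.
d <- data.frame(
  x1 = c(0, 0, 0, 1, 1, 1, 1),
  x2 = c(0, 0, 1, 0, 0, 0, 1),
  y  = c(0, 0, 1, 0, 0, 1, 0)
)

# Fit with main effects plus the x1:x2 interaction.
# Separation makes glm() warn about fitted probabilities of 0 or 1.
model_int <- suppressWarnings(
  glm(y ~ x1 * x2, data = d, family = binomial())
)

# Probability implied by the intercept, i.e. the prediction for x1 = 0, x2 = 0.
p_zero_rows <- plogis(coef(model_int)[["(Intercept)"]])

# Observed y-rate on the x1 = 0, x2 = 0 rows.
zero_rate <- mean(d$y[d$x1 == 0 & d$x2 == 0])

# Prediction for the one cell with a rate strictly between 0 and 1.
p_x1_only <- unname(predict(model_int,
                            newdata = data.frame(x1 = 1, x2 = 0),
                            type = "response"))

print(c(intercept_prob = p_zero_rows, zero_rate = zero_rate, p_x1_only = p_x1_only))
```

Here the intercept probability is numerically indistinguishable from the observed rate of 0 on the all-zero rows, and the `x1 = 1, x2 = 0` prediction matches that cell's observed rate of 1/3.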

“A common mis-understanding of linear regression and logistic regression is that the intercept is thought to encode the unconditional mean or the training data prevalence.”

I’ve never heard anyone make this claim. For linear regression, the intercept is often far outside the range of the data and considered an extrapolation.

I’ve heard it more for logistic regression. It isn’t a common practitioner issue, but more of a “student’s dream” issue. I’ve been teaching a lot lately, so I run into just about everything as a question. I find it handy to have notes that can handle such digressions already available.

Reader Bob Carpenter shared: “I just wanted to point out you should check out standardization for the intercept fallacy. It’s explained here: https://stats.stackexchange.com/questions/176341/logistic-regression-intercept-representing-baseline-probability”
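A small sketch of what centering does on the example data, assuming "standardization" here means subtracting each predictor's mean before fitting (that reading, and all variable names, are assumptions rather than something taken from the linked answer). For ordinary least squares the intercept of the centered fit is exactly the mean of `y`. For logistic regression the intercept becomes the log-odds at the average predictor values, which need not match the prevalence.

```r
# Rebuild the example data.
d <- data.frame(
  x1 = c(0, 0, 0, 1, 1, 1, 1),
  x2 = c(0, 0, 1, 0, 0, 0, 1),
  y  = c(0, 0, 1, 0, 0, 1, 0)
)

# Center each predictor at its training mean.
d_centered <- transform(d, x1 = x1 - mean(x1), x2 = x2 - mean(x2))

# Linear regression on centered predictors: intercept equals mean(y).
lin_fit <- lm(y ~ x1 + x2, data = d_centered)
lm_intercept <- coef(lin_fit)[["(Intercept)"]]

# Logistic regression on centered predictors: intercept is the log-odds
# at the mean predictor values, converted here to a probability.
logit_fit <- glm(y ~ x1 + x2, data = d_centered, family = binomial())
glm_intercept_prob <- plogis(coef(logit_fit)[["(Intercept)"]])

prevalence <- mean(d$y)

print(c(lm_intercept       = lm_intercept,
        glm_intercept_prob = glm_intercept_prob,
        prevalence         = prevalence))
```

On this data the linear-regression intercept matches the prevalence, while the logistic intercept probability is close to it but not equal.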