A common misunderstanding of linear regression and logistic regression is that the intercept is thought to encode the unconditional mean or the training data prevalence.
This is easily seen to not be the case. Consider the following example in R.
We set up our example data.
```r
# build our example data
# modeling y as a function of x1 and x2 (plus intercept)
d <- wrapr::build_frame(
  "x1", "x2", "y" |
  0   , 0   , 0   |
  0   , 0   , 0   |
  0   , 1   , 1   |
  1   , 0   , 0   |
  1   , 0   , 0   |
  1   , 0   , 1   |
  1   , 1   , 0   )

knitr::kable(d)
```
And let’s fit a logistic regression.
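A minimal sketch of such a fit, assuming base R's `glm()` with a binomial family. The data is rebuilt here with `data.frame()` so the snippet runs without `wrapr`, and the model name `m` is our own choice.

```r
# same seven rows as the example data, rebuilt with base R
d <- data.frame(
  x1 = c(0, 0, 0, 1, 1, 1, 1),
  x2 = c(0, 0, 1, 0, 0, 0, 1),
  y  = c(0, 0, 1, 0, 0, 1, 0)
)

# logistic regression of y on x1 and x2; the intercept is included by default
m <- glm(y ~ x1 + x2, data = d, family = binomial())

# fitted coefficients, on the log-odds scale
print(coef(m))
```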
```
## (Intercept)          x1          x2 
##  -1.2055937  -0.3129307   1.3620590
```
The probability encoded in the intercept term is given as follows.
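One way to compute this, sketched under the assumption that the model was fit with base R's `glm()` (the names `d`, `m`, and `p0` are ours): predict on a single row with every variable set to zero, so only the intercept contributes to the linear predictor.

```r
# same seven rows as the example data, rebuilt with base R
d <- data.frame(
  x1 = c(0, 0, 0, 1, 1, 1, 1),
  x2 = c(0, 0, 1, 0, 0, 0, 1),
  y  = c(0, 0, 1, 0, 0, 1, 0)
)
m <- glm(y ~ x1 + x2, data = d, family = binomial())

# predicted probability for a row where x1 and x2 are both zero
p0 <- predict(m, newdata = data.frame(x1 = 0, x2 = 0), type = "response")
print(p0)

# equivalently, the inverse logit of the intercept alone
p0_direct <- plogis(coef(m)[["(Intercept)"]])
```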
```
##         1 
## 0.2304816
```
Notice the prediction 0.2304816 is neither the training outcome (`y`) prevalence (0.2857143) nor the observed `y`-rate for rows that have `x1, x2 = 0` (0).
The non-intercept coefficients do have an interpretation: each is the expected change in log-odds (the log of an odds ratio) implied by a unit increase in the given variable (assuming all other variables are held constant, which may not be a property of the data!).
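For the logistic model in this example, the interpretation can be written out as follows. The notation is ours and not part of the original fit: $p$ is the predicted probability that `y` is 1, and $\beta_0, \beta_1, \beta_2$ are the intercept and the coefficients for `x1` and `x2`.

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2
% raising x_1 by one with x_2 held fixed shifts the log-odds by \beta_1:
\left.\log\frac{p}{1-p}\right|_{x_1+1,\,x_2} - \left.\log\frac{p}{1-p}\right|_{x_1,\,x_2} = \beta_1
% at x_1 = x_2 = 0 only the intercept remains:
p = \frac{1}{1+e^{-\beta_0}} = \frac{1}{1+e^{1.2055937}} \approx 0.2304816
```

So the intercept is the log-odds the model assigns to an all-zero row, which is a fitted quantity rather than an observed rate.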
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
You might want to add that it does predict the observed y-rate for rows that have x1, x2 = 0 when you interact x1 and x2.
Thanks, yes. If we add enough interaction variables, such as `x1 * x2`, then yes the model predictions will get the masses of each single-variable identifiable set correct.
With the interaction these sets are: all rows (the intercept term), rows with `x1=1`, rows with `x2=1`, and rows with the interaction `x1 * x2 = 1`. By inclusion/exclusion, the set of rows with `x1, x2 = 0` is all rows, minus the rows with `x1=1`, minus the rows with `x2=1`, plus the rows with `x1 * x2 = 1` (which were subtracted twice). So the model gets the average and total on the `x1, x2 = 0` set correct. So in this special case the rate for the all-zero rows is encoded in the intercept term.
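A minimal sketch of this check, assuming base R's `glm()`; the set names and the `balance` table are our own illustrative choices. It compares observed and predicted totals of `y` on each set. Several cells in this tiny data set have rates of exactly 0 or 1, so the fitted coefficients become very large in magnitude and `glm()` may emit warnings, which are suppressed here.

```r
# same seven rows as the example data, rebuilt with base R
d <- data.frame(
  x1 = c(0, 0, 0, 1, 1, 1, 1),
  x2 = c(0, 0, 1, 0, 0, 0, 1),
  y  = c(0, 0, 1, 0, 0, 1, 0)
)

# y ~ x1 * x2 expands to x1 + x2 + x1:x2 (main effects plus interaction)
m_int <- suppressWarnings(glm(y ~ x1 * x2, data = d, family = binomial()))
d$pred <- predict(m_int, type = "response")

# indicator sets: one per model term, plus the derived all-zero set
sets <- list(
  all  = rep(TRUE, nrow(d)),
  x1   = d$x1 == 1,
  x2   = d$x2 == 1,
  x1x2 = d$x1 * d$x2 == 1,
  zero = d$x1 == 0 & d$x2 == 0
)

# observed versus predicted totals of y on each set
balance <- data.frame(
  observed  = sapply(sets, function(s) sum(d$y[s])),
  predicted = sapply(sets, function(s) sum(d$pred[s]))
)
print(round(balance, 6))

# probability encoded in the intercept of the interaction model
p0_int <- plogis(coef(m_int)[["(Intercept)"]])
```

Under these assumptions the two columns should agree up to the fit's convergence tolerance, and `p0_int` should be numerically close to the observed all-zero rate of 0.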
If you follow the marginality principle, then the regression intercept cannot be interpreted at all when higher-order effects are in the model.
The intercept does correspond to the weighted unconditional mean, where the weights are determined by the prevalence of the factor combinations in the experimental design.
“A common misunderstanding of linear regression and logistic regression is that the intercept is thought to encode the unconditional mean or the training data prevalence.”
I’ve never heard anyone make this claim. For linear regression, the intercept is often far outside the range of the data and considered an extrapolation.
I’ve heard it more for logistic regression. It isn’t a common practitioner issue, but more of a “student’s dream” issue. I’ve been teaching a lot lately, so I run into just about everything as a question. I find it handy to have notes that can handle such digressions already available.
Reader Bob Carpenter shared “I just wanted to point out you should check out standardization for the intercept fallacy. It’s explained here: