
The Intercept Fallacy

A common misunderstanding of linear and logistic regression is that the intercept encodes the unconditional mean of the outcome, or its prevalence in the training data.

It is easy to see that this is not the case. Consider the following example in R.

We set up our example data.

x1 x2 y
0 0 0
0 0 0
0 1 1
1 0 0
1 0 0
1 0 1
1 1 0

And let’s fit a logistic regression.

## (Intercept)          x1          x2 
##  -1.2055937  -0.3129307   1.3620590
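The code that produced the coefficients above is not shown in the text. A minimal sketch that should reproduce them, assuming base R only (the names `d` and `model` are illustrative assumptions, not taken from the original):

```r
# Example data, copied from the table above.
d <- data.frame(
  x1 = c(0, 0, 0, 1, 1, 1, 1),
  x2 = c(0, 0, 1, 0, 0, 0, 1),
  y  = c(0, 0, 1, 0, 0, 1, 0)
)

# Logistic regression of y on x1 and x2 (both treated as numeric 0/1 columns).
model <- glm(y ~ x1 + x2, data = d, family = binomial)

# Print the fitted coefficients: (Intercept), x1, x2.
print(coef(model))
```

Because x1 and x2 are numeric 0/1 columns here, the intercept is the fitted log-odds for a row with x1 = 0 and x2 = 0.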

The probability encoded in the intercept term is given as follows.

##         1 
## 0.2304816

Notice that the prediction 0.2304816 is neither the prevalence of the outcome y in the training data (2/7, or 0.2857143) nor the observed y-rate for the rows where x1 = 0 and x2 = 0 (which is 0).
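The three quantities being compared can be computed side by side. This is a sketch under the same assumptions as the example data above; the variable names (`p_intercept`, `prevalence`, `rate_all_zero`) are illustrative and not from the original.

```r
# Example data, copied from the table above.
d <- data.frame(
  x1 = c(0, 0, 0, 1, 1, 1, 1),
  x2 = c(0, 0, 1, 0, 0, 0, 1),
  y  = c(0, 0, 1, 0, 0, 1, 0)
)
model <- glm(y ~ x1 + x2, data = d, family = binomial)

# Probability implied by the intercept alone (inverse logit of the intercept).
p_intercept <- plogis(coef(model)[["(Intercept)"]])

# The same number, obtained as the model prediction for a row with x1 = 0, x2 = 0.
p_predict <- unname(predict(model, newdata = data.frame(x1 = 0, x2 = 0),
                            type = "response"))

# Outcome prevalence over all training rows.
prevalence <- mean(d$y)

# Observed outcome rate among the rows where both variables are zero.
rate_all_zero <- mean(d$y[d$x1 == 0 & d$x2 == 0])

print(c(p_intercept = p_intercept, p_predict = p_predict,
        prevalence = prevalence, rate_all_zero = rate_all_zero))

# What the fit does preserve: the mean fitted probability equals the prevalence.
print(mean(predict(model, type = "response")))
```

The last line illustrates that the prevalence is matched by the average of the model's predictions over the training rows, not by the intercept on its own.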

The non-intercept coefficients do have an interpretation: each is the expected change in log-odds (equivalently, the log of the odds ratio) for a one-unit change in the given variable, assuming all other variables are held constant (which may not be a property of the data!).
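As a quick check of that interpretation, the sketch below compares link-scale (log-odds) predictions for two hypothetical rows that differ only in x2. The data frame `rows` and the other variable names are illustrative assumptions.

```r
# Example data, copied from the table above.
d <- data.frame(
  x1 = c(0, 0, 0, 1, 1, 1, 1),
  x2 = c(0, 0, 1, 0, 0, 0, 1),
  y  = c(0, 0, 1, 0, 0, 1, 0)
)
model <- glm(y ~ x1 + x2, data = d, family = binomial)

# Two hypothetical rows that differ only in x2 (x1 held constant at 0).
rows <- data.frame(x1 = c(0, 0), x2 = c(0, 1))

# Predictions on the link (log-odds) scale.
log_odds <- unname(predict(model, newdata = rows, type = "link"))

# The difference in log-odds equals the x2 coefficient.
delta <- log_odds[2] - log_odds[1]
print(c(delta = delta, coef_x2 = coef(model)[["x2"]]))

# Exponentiating the coefficient gives the corresponding odds ratio.
print(exp(coef(model)[["x2"]]))
```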

Categories: Statistics Tutorials


jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

7 replies

    1. Thanks, yes. If we add enough interaction variables, such as x1 * x2, then the model predictions will get the masses of each single-variable identifiable set correct.

      With the interaction these sets are: all rows (the intercept term), rows with x1=1, rows with x2=1, and rows with the interaction x1 * x2 = 1. By inclusion/exclusion, the set of rows with x1 = 0 and x2 = 0 is all rows, minus the rows with x1=1, minus the rows with x2=1, plus the rows with x1 * x2 = 1 (which would otherwise be subtracted twice). So the model gets the average and total on the x1 = 0, x2 = 0 set correct. So in this special case the rate for the all-zero rows is encoded in the intercept term.
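A small sketch of this special case, using a linear regression rather than the logistic model as a simplifying assumption. On this data several cells have observed rates of exactly 0 or 1, so a saturated logistic fit would not have a finite maximum-likelihood solution. The names `d` and `lin` are illustrative.

```r
# Example data, copied from the post.
d <- data.frame(
  x1 = c(0, 0, 0, 1, 1, 1, 1),
  x2 = c(0, 0, 1, 0, 0, 0, 1),
  y  = c(0, 0, 1, 0, 0, 1, 0)
)

# Linear model with main effects plus the x1 * x2 interaction (saturated for 0/1 inputs).
lin <- lm(y ~ x1 * x2, data = d)

# With the interaction included, the intercept is the fitted value for x1 = 0, x2 = 0.
intercept <- coef(lin)[["(Intercept)"]]

# Observed outcome rate among the all-zero rows.
rate_all_zero <- mean(d$y[d$x1 == 0 & d$x2 == 0])

print(c(intercept = intercept, rate_all_zero = rate_all_zero))
```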


  1. If you follow the marginality principle, then the regression intercept cannot be interpreted at all when higher-order effects are in the model.


  2. The intercept does correspond to the weighted unconditional mean, where the weights are determined by the prevalence of the factor combinations in the experimental design.

    library(data.table)
    
    d <- wrapr::build_frame(
      "x1"  , "x2", "y" | 
      0   , 0   , 0   | 
      0   , 0   , 0   |
      0   , 1   , 1   | 
      1   , 0   , 0   |
      1   , 0   , 0   | 
      1   , 0   , 1   | 
      1   , 1   , 0   )
    
    mean(d$y) #about 0.285
    
    d$x1 <- as.factor(d$x1)
    d$x2 <- as.factor(d$x2)
    
    setDT(d)
    d[, mean(y), by = .(x1, x2)][, mean(V1)] #about 0.33
    
    options(contrasts = c("contr.sum", "contr.poly"))
    
    fit <- glm(y ~ x1 + x2, data = d, family = binomial)
    # intercept is -0.681
    
    plogis(-0.681)
    # about 0.33
    


  3. “A common mis-understanding of linear regression and logistic regression is that the intercept is thought to encode the unconditional mean or the training data prevalence.”

    I’ve never heard anyone make this claim. For linear regression, the intercept is often far outside the range of the data and considered an extrapolation.


    1. I’ve heard it more for logistic regression. It isn’t a common practitioner issue, but more of a “student’s dream” issue. I’ve been teaching a lot lately, so I run into just about everything as a question. I find it handy to have notes already available that can handle such digressions.

