In the linear regression section of our book *Practical Data Science in R*, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against `log10(income)`

rather than directly against income.

One obvious reason for not regressing directly against income is that (in our example) income is restricted to be non-negative, a restraint that linear regression can’t enforce. Other reasons include the wide distribution of values and the relative or multiplicative structure of errors on outcomes. A common practice in this situation is to use *Poisson regression*, or generalized linear regression with a log-link function. Like all generalized linear regressions, Poisson regression is unbiased and *calibrated*: it preserves the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved.

Regressing against the log of the outcome will not be calibrated; however it has the advantage that the resulting model will have lower relative error than a Poisson regression against income. Minimizing relative error is appropriate in situations when differences are naturally expressed in percentages rather than in absolute amounts. Again, this is common when financial data is involved: raises in salary tend to be in terms of percentage of income, not in absolute dollar increments.

Unfortunately, a full discussion of the differences between Poisson regression and regressing against log amounts was outside of the scope of our book, so we will discuss it in this note.

## The data

As we did in the book, we’ll use data from the 2016 US Census American Community Survay (ACS) Public Use Microdata Sample (PUMS) for our example. More information about the data can be found here. First, we’ll get the training and test data, and show how the expected income varies along different groupings (by sex, by employment, and by education):

```
library(zeallot)
library(wrapr)
location <- "https://github.com/WinVector/PDSwR2/raw/master/PUMS/incomedata.rds"
incomedata <- readRDS(url(location))
c(test, train) %<-% split(incomedata, incomedata$gp)
# A convenience function to calculate and display
# the conditional expected incomes
show_conditional_means <- function(d, outcome = "income") {
cols <- qc(sex, employment, education)
lapply(
cols := cols,
function(colname) {
aggregate(d[, outcome, drop = FALSE],
d[, colname, drop = FALSE],
FUN = mean)
})
}
display_tables <- function(tlist) {
for(vi in tlist) {
print(knitr::kable(vi))
}
}
```

`display_tables(show_conditional_means(train))`

sex | income |
---|---|

Male | 55755.51 |

Female | 47718.52 |

employment | income |
---|---|

Employee of a private for profit | 51620.39 |

Federal government employee | 64250.09 |

Local government employee | 54740.93 |

Private not-for-profit employee | 53106.41 |

Self employed incorporated | 66100.07 |

Self employed not incorporated | 41346.47 |

State government employee | 53977.20 |

education | income |
---|---|

no high school diploma | 31883.18 |

Regular high school diploma | 38052.13 |

GED or alternative credential | 37273.30 |

some college credit, no degree | 42991.09 |

Associate’s degree | 47759.61 |

Bachelor’s degree | 65668.51 |

Master’s degree | 79225.87 |

Professional degree | 97772.60 |

Doctorate degree | 91214.55 |

## Three models

Now we’ll model income as a function of age, sex, employment, and education three different ways:

```
# linear model for income
model_income <- lm(income ~ age+sex+employment+education,
data=train)
# linear model for log10(income)
model_logincome <- lm(log10(income) ~ age+sex+employment+education,
data=train)
# Quasipoisson model for income
model_pincome <- glm(income ~ age+sex+employment+education,
data=train,
family=quasipoisson)
```

Note that we are fitting a quasipoisson model for income; strictly speaking, a Poisson model assumes that the mean and variance of the data are the same, which is not true in general. A quasipoisson model relaxes the restriction on the variance of the data. We’ll still refer to this as a Poisson model for brevity.

Now we can use all three models to predict income for the training data.

```
train <- transform(train,
pred_lm = predict(model_income, train),
pred_lmlog = 10^predict(model_logincome, train),
pred_pois = predict(model_pincome,
train, type="response"))
knitr::kable(
summary(train[, qc(income, pred_lm, pred_pois, pred_lmlog)]))
```

income | pred_lm | pred_pois | pred_lmlog | |
---|---|---|---|---|

Min. : 1200 | Min. : -4682 | Min. : 15704 | Min. : 11977 | |

1st Qu.: 26700 | 1st Qu.: 36877 | 1st Qu.: 36480 | 1st Qu.: 30546 | |

Median : 41200 | Median : 50180 | Median : 47450 | Median : 40281 | |

Mean : 52373 | Mean : 52373 | Mean : 52373 | Mean : 44478 | |

3rd Qu.: 66000 | 3rd Qu.: 65962 | 3rd Qu.: 63669 | 3rd Qu.: 54397 | |

Max. :250000 | Max. :125969 | Max. :159583 | Max. :129216 |

Note that even though all actual incomes were positive, the linear model (`model_income`

) sometimes predicted negative income.

## Estimating aggregates

Now let’s compare how the predicted incomes roll up.

```
display_tables(
show_conditional_means(train,
qc(income, pred_lm, pred_pois, pred_lmlog))
)
```

sex | income | pred_lm | pred_pois | pred_lmlog |
---|---|---|---|---|

Male | 55755.51 | 55755.51 | 55755.51 | 47081.99 |

Female | 47718.52 | 47718.52 | 47718.52 | 40895.21 |

employment | income | pred_lm | pred_pois | pred_lmlog |
---|---|---|---|---|

Employee of a private for profit | 51620.39 | 51620.39 | 51620.39 | 43169.85 |

Federal government employee | 64250.09 | 64250.09 | 64250.09 | 58542.64 |

Local government employee | 54740.93 | 54740.93 | 54740.93 | 49988.61 |

Private not-for-profit employee | 53106.41 | 53106.41 | 53106.41 | 47475.45 |

Self employed incorporated | 66100.07 | 66100.07 | 66100.07 | 53189.40 |

Self employed not incorporated | 41346.47 | 41346.47 | 41346.47 | 31151.47 |

State government employee | 53977.20 | 53977.20 | 53977.20 | 50023.27 |

education | income | pred_lm | pred_pois | pred_lmlog |
---|---|---|---|---|

no high school diploma | 31883.18 | 31883.18 | 31883.18 | 26978.21 |

Regular high school diploma | 38052.13 | 38052.13 | 38052.13 | 32437.46 |

GED or alternative credential | 37273.30 | 37273.30 | 37273.30 | 30816.91 |

some college credit, no degree | 42991.09 | 42991.09 | 42991.09 | 36184.14 |

Associate’s degree | 47759.61 | 47759.61 | 47759.61 | 40585.89 |

Bachelor’s degree | 65668.51 | 65668.51 | 65668.51 | 55130.77 |

Master’s degree | 79225.87 | 79225.87 | 79225.87 | 69437.91 |

Professional degree | 97772.60 | 97772.60 | 97772.60 | 81612.18 |

Doctorate degree | 91214.55 | 91214.55 | 91214.55 | 80679.19 |

The rollups of the predictions for the linear and Poisson models (`model_income`

and `model_pincome`

) match the rollups of the training data. The predictions from `model_logincome`

roll up too low. In fact, one can prove that by Jensen’s inequality, a linear model fit to log-income will always have a systematic bias (underprediction) when estimating expected income. This means that if one of the intended uses of the model is to estimate aggregates (grouped sums, conditional means), then a calibrated model like a linear or Poisson model is more appropriate.

## Predictions on individuals

If the primary purpose of the model is predictions on individuals, then biased models may still be acceptable, or even preferable. When predicting income, it’s often the case that you want to express uncertainty in relative terms: that is, predict income to within 5%, rather than predict income to within $50. So let’s see how each of the models performs in terms of relative error (on the training data):

```
rel_err <- function(x, y) {
mean(abs(y-x)/y)
}
lapply(train[, qc(pred_lm, pred_lmlog, pred_pois)],
function(p) rel_err(p, train$income)) %.>%
as.data.frame(.) %.>%
knitr::kable(.)
```

pred_lm | pred_lmlog | pred_pois |
---|---|---|

0.74858 | 0.615897 | 0.7437119 |

`model_logincome`

has a lower average relative error on estimated income than either of the models fit directly to income — not a great relative error, but that’s because our set of input variables isn’t informative enough. We can also compare the models’ performances in terms of root mean squared error (an absolute difference):

```
rmse <- function(x, y) {
sqrt(mean((y-x)^2))
}
lapply(train[, qc(pred_lm, pred_lmlog, pred_pois)],
function(p) rmse(p, train$income)) %.>%
as.data.frame(.) %.>%
knitr::kable(.)
```

pred_lm | pred_lmlog | pred_pois |
---|---|---|

31625.35 | 32616.57 | 31395.38 |

The models that are fit directly to income have lower RMSE than `model_logincome`

, but not dramatically so. In other words, `model_logincome`

seems to improve relative error, at the cost of a slightly larger RMSE.

## Performance on new data

The real test of the three models for income is how they perform on data not used to train the models. First, we’ll compare the rollups.

```
test <- transform(test,
pred_lm = predict(model_income, test),
pred_lmlog = 10^predict(model_logincome, test),
pred_pois = predict(model_pincome,
test, type="response"))
rollups <- show_conditional_means(test,
qc(income, pred_lm,
pred_pois,pred_lmlog))
```

`display_tables(rollups)`

sex | income | pred_lm | pred_pois | pred_lmlog |
---|---|---|---|---|

Male | 55408.95 | 55903.57 | 55899.83 | 47173.10 |

Female | 46261.99 | 46876.96 | 47111.01 | 40361.71 |

employment | income | pred_lm | pred_pois | pred_lmlog |
---|---|---|---|---|

Employee of a private for profit | 50717.96 | 51314.69 | 51362.44 | 42947.99 |

Federal government employee | 66268.05 | 64635.60 | 64881.32 | 58993.59 |

Local government employee | 52565.89 | 53730.43 | 54119.83 | 49450.23 |

Private not-for-profit employee | 52887.52 | 52830.80 | 53259.07 | 47642.49 |

Self employed incorporated | 67744.61 | 65538.68 | 66096.20 | 53189.42 |

Self employed not incorporated | 41417.25 | 41671.41 | 41507.17 | 31265.77 |

State government employee | 51314.92 | 54106.89 | 53973.39 | 50029.83 |

education | income | pred_lm | pred_pois | pred_lmlog |
---|---|---|---|---|

no high school diploma | 29903.70 | 31738.07 | 31783.60 | 26923.95 |

Regular high school diploma | 36979.33 | 37538.76 | 37746.81 | 32162.33 |

GED or alternative credential | 39636.86 | 37336.08 | 37177.50 | 30666.80 |

some college credit, no degree | 43490.42 | 43199.50 | 43270.86 | 36421.74 |

Associate’s degree | 48384.19 | 47167.06 | 47234.56 | 40140.43 |

Bachelor’s degree | 65268.96 | 66077.47 | 66141.27 | 55535.11 |

Master’s degree | 77180.40 | 79521.83 | 79594.17 | 69750.68 |

Professional degree | 94976.75 | 98649.58 | 99009.56 | 82575.73 |

Doctorate degree | 87535.83 | 91403.52 | 91742.54 | 81524.25 |

```
# see how close the rollups get to ground truth for employment
err_mag <- function(x, y) {
delta = y-x
sqrt(sum(delta^2))
}
employment <- rollups$employment
lapply(employment[, qc(pred_lm, pred_pois, pred_lmlog)],
function(p) err_mag(p, employment$income)) %.>%
as.data.frame(.) %.>%
knitr::kable(.)
```

pred_lm | pred_pois | pred_lmlog |
---|---|---|

4135.96 | 3831.967 | 21611.7 |

None of the models reproduced the true rollups perfectly. Just looking at the employment rollup, you can see that the rollups from `model_income`

and `model_pincome`

are usually fairly close to the actual rollups, while the rollups from `model_logincome`

are off — and consistently under. The pattern holds for the rollups by sex and education as well.

Let’s compare the models on individual predictions.

```
# relative error
lapply(test[, qc(pred_lm, pred_lmlog, pred_pois)],
function(p) rel_err(p, test$income)) %.>%
as.data.frame(.) %.>%
knitr::kable(.)
```

pred_lm | pred_lmlog | pred_pois |
---|---|---|

0.7508259 | 0.6222302 | 0.7543232 |

```
# root mean square error
lapply(test[, qc(pred_lm, pred_lmlog, pred_pois)],
function(p) rmse(p, test$income)) %.>%
as.data.frame(.) %.>%
knitr::kable(.)
```

pred_lm | pred_lmlog | pred_pois |
---|---|---|

31589.5 | 32389.97 | 31341.14 |

Again, `model_logincome`

returns predictions with the lowest relative error, but a slightly higher RMSE than the other two models.

## Conclusion

In this note we have shown the consequences of different modeling decisions, in particular the trade-off between bias and relative error. Notice that transforming the outcome and using a link function have different advantages. Which procedure you use depends on what is most important to your application: correctly estimating summary statistics, minimizing relative error, or minimizing squared error.

In our next article, we will show that common tree models are also non-calibrated, which means that despite their high accuracy on individual predictions, they do not correctly estimate summary statistics in an unbiased way. Later, we will address how to mitigate this issue.

**Postscript**

Our thanks to Jelmer Ypma for pointing us to references to corrections for loglinear models; these corrections reduce the bias and RMSE of estimates of `y`

that are based on predictions from a linear model for `log(y)`

. More information can be found in chapter 6.4 of *Introductory Econometrics: A Modern Approach* by Jeffrey Woolrich (2014).

Categories: Expository Writing Practical Data Science Pragmatic Data Science Pragmatic Machine Learning Statistics Tutorials

### Nina Zumel

Data scientist with Win Vector LLC. I also dance, read ghost stories and folklore, and sometimes blog about it all.