Menu Home

Common Ensemble Models can be Biased

In our previous article , we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved.

However, when making predictions on individuals, a biased model may be preferable; biased models may be more accurate, or make predictions with lower relative error than an unbiased model. For example, tree-based ensemble models tend to be highly accurate, and are often the modeling approach of choice for many machine learning applications. In this note, we will show that tree-based models are biased, or uncalibrated. This means they may not always represent the best bias/variance trade-off.

Example: Predicting income

We’ll continue the example from the previous post: predicting income from demographic variables (sex, age, employment, education). The data is from the 2016 US Census American Community Survay (ACS) Public Use Microdata Sample (PUMS) for our example. More information about the data can be found here. First, we’ll get the training and test data, and show how the expected income varies along different groupings (by sex, by employment, and by education):

location <- ""
incomedata <- readRDS(url(location))

c(test, train) %<-% split(incomedata, incomedata$gp)

# A convenience function to calculate and display
# the conditional expected incomes
show_conditional_means <- function(d, outcome = "income") {
  cols <- qc(sex, employment, education)
    cols := cols, 
    function(colname) {
      aggregate(d[, outcome, drop = FALSE], 
                d[, colname, drop = FALSE], 
                FUN = mean)

display_tables <- function(tlist) {
  for(vi in tlist) {
sex income
Male 55755.51
Female 47718.52
employment income
Employee of a private for profit 51620.39
Federal government employee 64250.09
Local government employee 54740.93
Private not-for-profit employee 53106.41
Self employed incorporated 66100.07
Self employed not incorporated 41346.47
State government employee 53977.20
education income
no high school diploma 31883.18
Regular high school diploma 38052.13
GED or alternative credential 37273.30
some college credit, no degree 42991.09
Associate’s degree 47759.61
Bachelor’s degree 65668.51
Master’s degree 79225.87
Professional degree 97772.60
Doctorate degree 91214.55

Three models

We’ll fit three models to the data: two tree ensemble models (random forest and gradient boosting), and one (quasi)Poisson model–a calibrated model– for comparison.

# Quasipoisson model 
model_pincome <- glm(income ~ age+sex+employment+education,
# random forest model 
model_rf_1stage <- randomForest(income ~ age+sex+employment+education,

# gradient boosting model 

# build the model.matrix for the training data
train_mm <- model.matrix(income ~ age+sex+employment+education,

cvobj <- = list(objective="reg:linear"),
               label= train$income, 

evallog <- cvobj$evaluation_log
ntrees <- which.min(evallog$test_rmse_mean)

model_xgb <- xgboost(train_mm, 
                    label= train$income, 
                    verbose=0, nrounds=ntrees)

# make the predictions on training data

train <- transform(train,
                   pred_pois = predict(model_pincome, 
                                       train, type="response"),
                   pred_rf_raw = predict(model_rf_1stage, 
                                         newdata=train, type="response"),
                   pred_xgb = predict(model_xgb, train_mm))

First, we’ll compare the rollups of the predictions to the actual rollups.

outcomecols <- qc(income, pred_pois, pred_rf_raw, pred_xgb)
rollups <-show_conditional_means(train, outcomecols)
sex income pred_pois pred_rf_raw pred_xgb
Male 55755.51 55755.51 55261.04 54203.70
Female 47718.52 47718.52 48405.71 47326.59
employment income pred_pois pred_rf_raw pred_xgb
Employee of a private for profit 51620.39 51620.39 51276.95 50294.95
Federal government employee 64250.09 64250.09 61623.87 60596.12
Local government employee 54740.93 54740.93 55464.36 54121.91
Private not-for-profit employee 53106.41 53106.41 54135.75 53417.86
Self employed incorporated 66100.07 66100.07 63840.91 63391.52
Self employed not incorporated 41346.47 41346.47 46257.98 42578.69
State government employee 53977.20 53977.20 55530.86 54752.98
education income pred_pois pred_rf_raw pred_xgb
no high school diploma 31883.18 31883.18 40599.88 40287.54
Regular high school diploma 38052.13 38052.13 41864.18 36245.78
GED or alternative credential 37273.30 37273.30 42316.76 37654.63
some college credit, no degree 42991.09 42991.09 44303.14 41259.59
Associate’s degree 47759.61 47759.61 46831.13 44995.83
Bachelor’s degree 65668.51 65668.51 64131.61 64043.09
Master’s degree 79225.87 79225.87 70762.24 77177.23
Professional degree 97772.60 97772.60 77940.16 93507.90
Doctorate degree 91214.55 91214.55 76972.02 86496.11

Note that the rollups of the predictions from the two ensemble models don’t match the true rollups, even on the training data; unlike the Poisson model, the random forest and gradient boosting models are uncalibrated.

Model performance on holdout data

Let’s see the performance of the models on test data.

# build the model.matrix for the test data
test_mm <- model.matrix(income ~ age+sex+employment+education,

test <- transform(test,
                   pred_pois = predict(model_pincome, 
                                       test, type="response"),
                   pred_rf_raw = predict(model_rf_1stage, 
                                         newdata=test, type="response"),
                   pred_xgb = predict(model_xgb, test_mm))
outcomecols <- qc(income, pred_pois, pred_rf_raw, pred_xgb)
rollups <-show_conditional_means(test, outcomecols)
sex income pred_pois pred_rf_raw pred_xgb
Male 55408.95 55899.83 55210.82 54236.94
Female 46261.99 47111.01 47950.01 46705.38
employment income pred_pois pred_rf_raw pred_xgb
Employee of a private for profit 50717.96 51362.44 51040.56 49995.11
Federal government employee 66268.05 64881.32 61974.36 61574.06
Local government employee 52565.89 54119.83 54901.92 53703.97
Private not-for-profit employee 52887.52 53259.07 53987.90 53441.93
Self employed incorporated 67744.61 66096.20 63790.77 63100.00
Self employed not incorporated 41417.25 41507.17 46086.63 42296.44
State government employee 51314.92 53973.39 55262.11 54374.54
education income pred_pois pred_rf_raw pred_xgb
no high school diploma 29903.70 31783.60 40469.94 40169.21
Regular high school diploma 36979.33 37746.81 41648.80 35989.04
GED or alternative credential 39636.86 37177.50 42620.37 38180.68
some college credit, no degree 43490.42 43270.86 44449.98 41538.77
Associate’s degree 48384.19 47234.56 46309.68 44383.58
Bachelor’s degree 65268.96 66141.27 64387.44 64320.68
Master’s degree 77180.40 79594.17 70804.81 77491.04
Professional degree 94976.75 99009.56 78713.55 94974.29
Doctorate degree 87535.83 91742.54 76517.41 86141.53
# see how close the rollups get to ground truth for employment
err_mag <- function(x, y) {
  delta = y-x

employment <- rollups$employment
lapply(employment[, qc(pred_pois, pred_rf_raw, pred_xgb)],
       function(p) err_mag(p, employment$income)) %.>%  %.>% 
pred_pois pred_rf_raw pred_xgb
3831.967 8844.436 7474.311
Unnamed chunk 9 1

The calibrated Poisson model gives better estimates of the income rollups with respect to employment than either of the ensemble models, despite the fact that all the models have similar root mean squared error when making individual predictions.

# predictions on individuals 
rmse = function(x, y) {

lapply(test[, qc(pred_pois, pred_rf_raw, pred_xgb)],
       function(p) rmse(p, test$income)) %.>%  %.>% 
pred_pois pred_rf_raw pred_xgb
31341.14 31688.77 31299.37


In this example, the input variables were simply not informative enough, so the tree ensemble models performed equivalently to the Poisson model for predicting income. With more informative (and nonlinear) input variables, one can expect that ensemble models will outperform linear or generalized linear models, in terms of predictions on individuals. However, even these more accurate ensemble models can be biased, so they are not guaranteed to estimate important aggregates (grouped sums or conditional means) correctly.

In the next note, we’ll propose a polishing step on uncalibrated models that mitigates this bias, potentially enabling models that are both highly accurate on individuals, while estimating certain aggregates correctly.

Categories: Pragmatic Data Science Tutorials

Tagged as:

Nina Zumel

Data scientist with Win Vector LLC. I also dance, read ghost stories and folklore, and sometimes blog about it all.

2 replies

  1. I don’t think this is a regression vs. ensemble methods issue, but rather an issue of the effects built in each model.

    Isn’t the reason that the glm approach exactly reproduces group averages is because the model is set to estimate those as fixed effects? In other words, the group averages are constrained to match the data (in contrast, a mixed effects model relaxes this constraint to stabilize estimate for small groups, a bias-variance tradeoff).

    Similarly, we expect that if the model doesn’t include interactions, then the 2-way group averages (e.g. sex x education) would not be exactly reproduced, and the random forest may actually be more calibrated w.r.t. these 2-way averages, since it models interactions to some degree (this might be part of why the predictions are better).

    1. For linear and generalized linear models: any effect available as a variable is going to be calibrated (not on the logistic regression case here). This follows from the Karush–Kuhn–Tucker conditions of optimality, or from the residuals being orthogonal to the quantity to be predicted.

      Yes, for any variable or interaction not explicitly included in the model or model correction step: there is no reason to expect unbiasedness or matching.

      However, the tree-based ensemble models are not similarly calibrated: even for variables they have as part of their models they are not-calibrated or biased. We see this failure in the examples. Why this is not prevented is a different question. Also: bias is not automatically forbidden, it is one thing we trade off against other considerations (else many other estimation techniques such as maximum likelihood would not be allowed).

      Correcting or calibrating is a known procedure (for instance it is often known as Platt’s scaling). Though, as pointed out in the article, it makes sense to use cross-validation methods to reduce nested model bias.