It is often said that “R is its packages.”
One package of interest is ranger, a fast parallel C++ implementation of random forest machine learning. Ranger is a great package and at first glance appears to remove the “only 63 levels allowed for string/categorical variables” limit found in the Fortran randomForest package. Actually this appearance is due to the strange choice of default value respect.unordered.factors=FALSE in ranger::ranger(), which we strongly advise overriding to respect.unordered.factors=TRUE in applications.
To illustrate the issue we build a simple data set (split into training and evaluation) where the dependent (or outcome) variable y is given as the sum of how many input level codes end in an odd digit minus how many input level codes end in an even digit.
Some example data is given below:
    print(head(dTrain))
    ##          x1      x2      x3      x4 y
    ## 77  lev_008 lev_004 lev_007 lev_011 0
    ## 41  lev_016 lev_015 lev_019 lev_012 0
    ## 158 lev_007 lev_019 lev_001 lev_015 4
    ## 69  lev_010 lev_017 lev_018 lev_009 0
    ## 6   lev_003 lev_014 lev_016 lev_017 0
    ## 18  lev_004 lev_015 lev_014 lev_007 0
Given enough data this relation is easily learnable. In our example we have only 100 training rows and 20 possible levels for each input variable, so at best we get a noisy impression of how each independent (or input) variable affects y.
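For readers who want to experiment, the following is a minimal sketch of how data of this shape could be generated in base R. It is not the article's own generation code: the seed, the helper names (level_codes, level_score, make_data, sketchTrain, sketchTest), the size of the test set, and the absence of noise are all assumptions made for illustration.

```r
# Illustrative reconstruction only; seed, helper names, and test size are assumptions.
set.seed(2016)

# 20 possible level codes: lev_001 ... lev_020
level_codes <- sprintf("lev_%03d", 1:20)

# +1 if a level code ends in an odd digit, -1 if it ends in an even digit
level_score <- function(codes) {
  last_digit <- as.integer(substr(codes, nchar(codes), nchar(codes)))
  ifelse(last_digit %% 2 == 1, 1, -1)
}

# Draw four string-valued inputs and compute y as the sum of the level scores
make_data <- function(n_rows) {
  d <- data.frame(
    x1 = sample(level_codes, n_rows, replace = TRUE),
    x2 = sample(level_codes, n_rows, replace = TRUE),
    x3 = sample(level_codes, n_rows, replace = TRUE),
    x4 = sample(level_codes, n_rows, replace = TRUE),
    stringsAsFactors = FALSE
  )
  d$y <- level_score(d$x1) + level_score(d$x2) +
    level_score(d$x3) + level_score(d$x4)
  d
}

sketchTrain <- make_data(100)  # 100 training rows, as in the text
sketchTest  <- make_data(100)  # test size is an assumption
```

The rule agrees with the rows printed above: for example, lev_007, lev_019, lev_001 and lev_015 all end in odd digits, giving y = 4.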
What the default ranger training setting respect.unordered.factors=FALSE does is decide that string-valued variables (such as we have here) are to be treated as “ordered”. This allows ranger to skip the expensive re-encoding of such variables as contrasts, dummies or indicators. This is achieved in ranger by only using ordered cuts in its underlying trees and is equivalent to re-encoding the categorical variable as the numeric order codes. These variables are thus essentially treated as numeric, and ranger appears to run faster over fairly complicated variables.
The above is good if all of your categorical variables are in fact known to have ordered relations with the outcome. We must emphasize that this is very rarely the case in practice, as one of the main reasons for using categorical variables is that we may not know a priori the relation between the variable levels and the outcome, and would like the downstream machine learning to estimate that relation. The default respect.unordered.factors=FALSE in fact weakens the expressiveness of the ranger model (which is why it is faster).
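A toy calculation, separate from the experiment in this article, hints at why ordered cuts are less expressive. It assumes a single variable with 20 levels whose true effect is +1 for odd order codes and -1 for even ones, and compares the best single threshold cut on the order codes against one split that is free to group levels arbitrarily. The helper name sse_split is introduced here for illustration only.

```r
# Toy setup (an assumption for illustration): one variable, 20 levels,
# effect +1 for odd order codes and -1 for even order codes.
codes  <- 1:20
effect <- ifelse(codes %% 2 == 1, 1, -1)

# Squared error left after a split, when each side is predicted by its own mean
sse_split <- function(y, in_left) {
  sse <- function(v) sum((v - mean(v))^2)
  sse(y[in_left]) + sse(y[!in_left])
}

# Squared error before any split
total_sse <- sum((effect - mean(effect))^2)

# Ordered cuts: codes <= k versus codes > k, for every possible threshold k
ordered_sse  <- sapply(1:19, function(k) sse_split(effect, codes <= k))
best_ordered <- min(ordered_sse)

# Unordered split: any subset of levels is allowed, here the odd levels
unordered_sse <- sse_split(effect, codes %% 2 == 1)

print(c(total = total_sse, best_ordered = best_ordered, unordered = unordered_sse))
```

In this toy case no single threshold removes more than a small fraction of the squared error, while the subset split removes all of it. A deep tree can still recover the pattern with many successive ordered cuts, but each cut consumes data, which matters when there are only 100 training rows.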
The effect is easiest to see on our example data (all code/data shared, including classification and use of parallel, here).
If we build a ranger model on the data using the default settings we get the following:
    # default ranger model, treat categoricals as ordered (a very limiting treatment)
    m1 <- ranger(y~x1+x2+x3+x4, data=dTrain, write.forest=TRUE)
Keep in mind the 0.24 R-squared on test.
If we set respect.unordered.factors=TRUE, ranger takes a lot longer to run (as it is doing more work in actually respecting the individual levels of our categorical variables) but gets a much better result (test R-squared 0.54).
    m2 <- ranger(y~x1+x2+x3+x4, data=dTrain, write.forest=TRUE,
                 respect.unordered.factors=TRUE)
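The test R-squared figures quoted here can be computed with a small helper such as the one sketched below. The function name rsq and the exact definition (one minus residual sum of squares over total sum of squares on the test set) are assumptions; the article does not show which definition it used.

```r
# Assumed definition: 1 - (residual sum of squares) / (total sum of squares)
rsq <- function(pred, y) {
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

# Sanity checks on made-up numbers
rsq(c(1, 2, 3), c(1, 2, 3))   # perfect predictions
rsq(c(2, 2, 2), c(1, 2, 3))   # predicting the mean everywhere

# Usage with the models above (assumes m1, m2 and a held-out dTest exist):
# rsq(predict(m1, data = dTest)$predictions, dTest$y)
# rsq(predict(m2, data = dTest)$predictions, dTest$y)
```

A perfect prediction scores 1 and predicting the test mean everywhere scores 0, so the gap between 0.24 and 0.54 is a large share of the explainable variation.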
The loss of modeling power seen with the default respect.unordered.factors=FALSE is similar to the undesirable loss of modeling power seen if one hash-encodes categorical levels. The default behavior of ranger is essentially equivalent to calling as.numeric(as.factor()) on the categorical columns. Everyone claims they would never do such a thing (hash or call as.numeric()), but we strongly suggest inspecting your team’s work for these bad but tempting shortcuts.
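To make the shortcut concrete, here is a short base R illustration of what as.numeric(as.factor()) does to a string column. The vector x is made up for the example.

```r
# Made-up string column for illustration
x <- c("lev_010", "lev_002", "lev_007", "lev_002")

# as.factor() sorts the distinct strings; as.numeric() returns each value's
# position in that sorted list of levels
f <- as.factor(x)
order_codes <- as.numeric(f)

print(levels(f))
print(order_codes)
```

The resulting numbers reflect only the sort order of the level names. A model that sees them as numeric is told that lev_007 lies “between” lev_002 and lev_010, which need not say anything about the outcome.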
With respect.unordered.factors=TRUE, if even one of the variables had 64 or more levels, ranger would throw an exception and not complete training (as the randomForest library also does).
The correct way to feed large categoricals to a random forest model remains to explicitly introduce the dummy/indicators yourself or re-encode them as impact/effect sub models. Both of these are services supplied by the vtreat package so we demonstrate the technique here.
    # vtreat re-encoded model
    # build the treatment plan and a cross-validated treated training frame
    ct <- vtreat::mkCrossFrameNExperiment(dTrain, c('x1','x2','x3','x4'), 'y')
    # keep the impact-coded (catN) variables that pass the significance filter
    newvars <- ct$treatments$scoreFrame$varName[(ct$treatments$scoreFrame$code=='catN') &
                                                (ct$treatments$scoreFrame$sig<1)]
    m3 <- ranger(paste('y',paste(newvars,collapse=' + '),sep=' ~ '),
                 data=ct$crossFrame, write.forest=TRUE)
    # apply the same treatment plan to the test data before predicting
    dTestTreated <- vtreat::prepare(ct$treatments,dTest,
                                    pruneSig=c(),varRestriction=newvars)
    dTest$rangerNestedPred <- predict(m3,data=dTestTreated)$predictions
    WVPlots::ScatterHist(dTest,'rangerNestedPred','y',
                         'ranger vtreat nested prediction on test',
                         smoothmethod='identity',annot_size=3)
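The other option mentioned, explicit dummy/indicator columns, can be sketched in base R with model.matrix(). This is an illustration on a made-up three-level column, not the encoding vtreat produces. It also ignores practical issues such as levels that appear only in the test data.

```r
# Made-up categorical column for illustration
d <- data.frame(x = factor(c("a", "b", "a", "c")))

# One 0/1 indicator column per level; "- 1" drops the intercept so that
# no level is absorbed into a baseline
indicators <- model.matrix(~ x - 1, data = d)

print(colnames(indicators))
print(unname(rowSums(indicators)))
```

Each row has exactly one indicator set, so a tree can split on any single level regardless of how the level names happen to sort.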
The point is a test R-squared of 0.6 or 0.54 is a lot better than an R-squared of 0.24. You do not want to achieve 0.24 if 0.6 is within easy reach. So at the very least, when using ranger, set respect.unordered.factors=TRUE; for unordered factors (the most common kind) the default is making things easy for ranger at the expense of model quality.
Instructions explaining the use of vtreat can be found here.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.