It is often said that “R is its packages.”
One package of interest is ranger, a fast parallel C++ implementation of random forest machine learning. Ranger is a great package and at first glance appears to remove the “only 63 levels allowed for string/categorical variables” limit found in the Fortran randomForest package. Actually this appearance is due to the strange choice of default value respect.unordered.factors=FALSE in ranger::ranger(), which we strongly advise overriding to respect.unordered.factors=TRUE in applications.
To illustrate the issue we build a simple data set (split into training and evaluation) where the dependent (or outcome) variable y is given as the number of input level codes that end in an odd digit minus the number of input level codes that end in an even digit.
Some example data is given below:
    print(head(dTrain))
    ##          x1      x2      x3      x4 y
    ## 77  lev_008 lev_004 lev_007 lev_011 0
    ## 41  lev_016 lev_015 lev_019 lev_012 0
    ## 158 lev_007 lev_019 lev_001 lev_015 4
    ## 69  lev_010 lev_017 lev_018 lev_009 0
    ## 6   lev_003 lev_014 lev_016 lev_017 0
    ## 18  lev_004 lev_015 lev_014 lev_007 0
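The code that built this data is not shown in the post itself, so the following is only a minimal sketch of one way to generate data of this shape. The seed, the total of 200 rows, and the names levelScore, d, dTrainSketch, and dTestSketch are all assumptions made for illustration.

```r
set.seed(2016)   # arbitrary seed (assumption)
nLevels <- 20    # 20 possible levels per input variable, as in the text
nRows <- 200     # total row count is an assumption
levs <- sprintf("lev_%03d", seq_len(nLevels))

# draw each input column independently and uniformly from the level codes
d <- data.frame(
  x1 = sample(levs, nRows, replace = TRUE),
  x2 = sample(levs, nRows, replace = TRUE),
  x3 = sample(levs, nRows, replace = TRUE),
  x4 = sample(levs, nRows, replace = TRUE),
  stringsAsFactors = FALSE)

# +1 if the level code ends in an odd digit, -1 if it ends in an even digit
levelScore <- function(v) {
  lastDigit <- as.integer(substr(v, nchar(v), nchar(v)))
  ifelse(lastDigit %% 2 == 1, 1, -1)
}

# outcome: (number of odd-ending codes) minus (number of even-ending codes)
d$y <- levelScore(d$x1) + levelScore(d$x2) + levelScore(d$x3) + levelScore(d$x4)

# 100 training rows, the remainder held out for evaluation
isTrain <- seq_len(nRows) %in% sample.int(nRows, 100)
dTrainSketch <- d[isTrain, ]
dTestSketch <- d[!isTrain, ]
```

With four inputs each contributing +1 or -1, y built this way can only take the values -4, -2, 0, 2, and 4.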
Given enough data this relation is easily learnable. In our example we have only 100 training rows and 20 possible levels for each input variable, so at best we get a noisy impression of how each independent (or input) variable affects y.
What the default ranger training setting respect.unordered.factors=FALSE does is decide that string-valued variables (such as we have here) are to be treated as “ordered”. This allows ranger to skip any of the expensive re-encoding of such variables as contrasts, dummies or indicators. This is achieved in ranger by only using ordered cuts in its underlying trees and is equivalent to re-encoding the categorical variable as the numeric order codes. These variables are thus essentially treated as numeric, and ranger appears to run faster over fairly complicated variables.
The above is good if all of your categorical variables are in fact known to have ordered relations with the outcome. We must emphasize that this is very rarely the case in practice, as one of the main reasons for using categorical variables is that we may not know a priori the relation between the variable levels and the outcome and would like the downstream machine learning to estimate that relation. The default respect.unordered.factors=FALSE in fact weakens the expressiveness of the ranger model (which is why it is faster).
This is simpler to see with an example. Consider fitting a ranger model on our example data (all code/data shared including classification and use of parallel here).
If we try to build a ranger model on the data using the default settings we get the following:
    # default ranger model, treat categoricals as ordered (a very limiting treatment)
    m1 <- ranger(y~x1+x2+x3+x4, data=dTrain, write.forest=TRUE)
Keep in mind the 0.24 R-squared on test.
If we set respect.unordered.factors=TRUE, ranger takes a lot longer to run (as it is doing more work in actually respecting the individual levels of our categorical variables) but gets a much better result (test R-squared 0.54).
m2 <- ranger(y~x1+x2+x3+x4, data=dTrain, write.forest=TRUE, respect.unordered.factors=TRUE)
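The post quotes test R-squared values without showing how they are computed. A minimal sketch, assuming R-squared here means one minus the ratio of residual to total sum of squares on the held-out data (the shared code may define it differently; the name rsq and the toy values are illustrative only):

```r
# R-squared as 1 - SSE/SST on held-out data (assumed definition)
rsq <- function(pred, y) {
  1 - sum((y - pred)^2) / sum((y - mean(y))^2)
}

# toy outcome values, for illustration only
yTest <- c(-2, 0, 2, 4)
perfect <- rsq(yTest, yTest)                   # predictions equal to the truth
baseline <- rsq(rep(mean(yTest), 4), yTest)    # always predicting the mean
```

Under this definition a perfect model scores 1 and a model that always predicts the mean scores 0. For the models above it could be applied as rsq(predict(m1, data=dTest)$predictions, dTest$y).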
The loss of modeling power seen with the default respect.unordered.factors=FALSE is similar to the undesirable loss of modeling power seen if one hash-encodes categorical levels. The default behavior of ranger is essentially equivalent to calling as.numeric(as.factor()) on the categorical columns. Everyone claims they would never do such a thing (hash or call as.numeric()), but we strongly suggest inspecting your team’s work for these bad but tempting shortcuts.
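To make the as.numeric(as.factor()) comparison concrete, here is a small base-R sketch (not ranger's actual split search) with six hypothetical levels. It compares the squared error left after a single split when only threshold cuts on the order codes are allowed against a split that groups levels freely. The helper name splitSSE and the six-level setup are assumptions for illustration.

```r
levs <- sprintf("lev_%03d", 1:6)
codes <- as.numeric(as.factor(levs))   # the numeric order codes 1..6
y <- ifelse(codes %% 2 == 1, 1, -1)    # odd-ending levels score +1, even -1

# sum of squared errors remaining after one split into a left and right group
splitSSE <- function(isLeft) {
  sum((y[isLeft] - mean(y[isLeft]))^2) +
    sum((y[!isLeft] - mean(y[!isLeft]))^2)
}

# every cut available when the levels are treated as ordered: code <= k
orderedSSE <- sapply(1:5, function(k) splitSSE(codes <= k))

# one of the splits available when individual levels are respected
partitionSSE <- splitSSE(levs %in% c("lev_001", "lev_003", "lev_005"))
```

Here the free grouping separates odd from even levels in one step, while no single threshold on the order codes can. A deeper tree of ordered cuts can eventually isolate individual levels, but it needs more splits and more data to do so.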
With respect.unordered.factors=TRUE, if even one of the variables had 64 or more levels, ranger would throw an exception and not complete training (as the randomForest library also does).
The correct way to feed large categoricals to a random forest model remains to explicitly introduce the dummy/indicators yourself or re-encode them as impact/effect sub models. Both of these are services supplied by the vtreat package so we demonstrate the technique here.
    # vtreat re-encoded model
    ct <- vtreat::mkCrossFrameNExperiment(dTrain, c('x1','x2','x3','x4'), 'y')
    newvars <- ct$treatments$scoreFrame$varName[(ct$treatments$scoreFrame$code=='catN') &
                                                (ct$treatments$scoreFrame$sig<1)]
    m3 <- ranger(paste('y',paste(newvars,collapse=' + '),sep=' ~ '), data=ct$crossFrame,
                 write.forest=TRUE)
    dTestTreated <- vtreat::prepare(ct$treatments,dTest,
                                    pruneSig=c(),varRestriction=newvars)
    dTest$rangerNestedPred <- predict(m3,data=dTestTreated)$predictions
    WVPlots::ScatterHist(dTest,'rangerNestedPred','y',
                         'ranger vtreat nested prediction on test',
                         smoothmethod='identity',annot_size=3)
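For readers unfamiliar with impact/effect coding, the basic idea can be sketched in a few lines of base R. This is a deliberately naive illustration and not vtreat's implementation: it omits the cross-frame (out-of-sample) estimation that mkCrossFrameNExperiment supplies, and the function name impactCode is hypothetical.

```r
# naive impact code: per-level mean of the outcome minus the grand mean
impactCode <- function(xTrain, yTrain, xNew) {
  grandMean <- mean(yTrain)
  levelMeans <- tapply(yTrain, xTrain, mean)      # one mean per training level
  codes <- levelMeans[as.character(xNew)] - grandMean
  codes[is.na(codes)] <- 0                        # unseen levels get no effect
  as.numeric(codes)
}

# toy usage: level "a" averages 2, level "b" averages 5, grand mean is 3
toyCodes <- impactCode(c("a", "a", "b"), c(1, 3, 5), c("a", "b", "z"))
```

Each categorical column becomes a single numeric column, so the number of levels no longer matters to the downstream model. Fitting the codes and the model on the same rows, as this naive version would, invites over-fitting, which is the reason the code above uses vtreat's cross-frame instead.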
The point is a test R-squared of 0.6 or 0.54 is a lot better than an R-squared of 0.24. You do not want to achieve 0.24 if 0.6 is within easy reach. So at the very least, when using ranger set respect.unordered.factors=TRUE; for unordered factors (the most common kind) the default is making things easy for ranger at the expense of model quality.
Instructions explaining the use of vtreat can be found here.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Thanks for this illustrative example that demonstrates how powerful “vtreat” can be and clarifies a lot the meaning of “respect.unordered.factors” in the “ranger” package.
In this example your outcome “y” can be considered a categorical variable (multilevel), but you treat it as “numeric”.
I am trying to fully understand whether “designTreatmentsC” is just for *binary* categorical outcomes. For multilevel classification problems, should the outcome be converted to numeric?
As far as I can see, it can be coded without any kind of limitation (I mention this because xgboost, for instance, requires values higher than 0).
Carlos, thanks for the interesting discussion!
As background we have the following.
vtreat::mkCrossFrameNExperiment is for numeric or regression problems.
vtreat::mkCrossFrameCExperiment is for binary categorization problems (in this case y can be one of many types: character, factor, or numeric, and we treat it as a binary outcome using the user-supplied outcometarget, which says what value of y is considered “TRUE”, with all other values considered “FALSE”). Currently vtreat does not directly support multi-class problems (though one could try to emulate such as a series of binary classification problems).
For the problem at hand (the ranger example) we are treating y as a numeric outcome to be regressed against. As you noticed this is not exploiting the domain fact that in this example y can only take on the values -4, -2, 0, 2, and 4. With this many values regression isn’t a bad approximation (and it does get the domain advantage of being able to immediately exploit the order relations in y). The most powerful way to encode this problem would be some hybrid that exploits both the moderate number of possible values (multi-class classification) and the order relations.
So really I see at least four types of predictive problems: regression, binary classification, unordered multinomial classification, and ordered multinomial classification.
vtreat directly supports the first two. It would be nice to also directly support unordered multinomial classification (the math is easy, it would just require some code changes). And for this example problem a system that supported ordered multinomial outcomes would likely be the most powerful (though it is uncommon to see this implemented in general packages). Ordered multinomial could also be done, but it would take a bit more engineering.
I wanted to apply “vtreat” to a three class problem, by considering it as a regression problem.
The dataset has some columns with high cardinality and some others with NAs, so *vtreat* was a good choice to handle all these things together.
So far, I could handle the high cardinality with the hash-trick feature approach, but seeing your package, I saw a possibility to treat everything in one shot. The algorithm I am using (after trying several and assessing their performance) is a random forest through “ranger”, which manages multiclass classification without any problem.
Although it is not a very recommendable way to proceed I am going to try to model it as a regression to take advantage of “vtreat” as it is now.
Since it is a three class problem (not too many outcome classes) I would suggest also trying to build 3 “one versus rest” binary classifiers ( https://en.wikipedia.org/wiki/Multiclass_classification#One-vs.-rest ). You run ranger 3 times to get three probabilities (pA versus pNotA, pB versus pNotB, and pC versus pNotC) and then re-normalize them to sum to 1 as your “multiclass” classifier (pA/(pA+pB+pC), pB/(pA+pB+pC), pC/(pA+pB+pC); it is an abuse, since if pA, pB, pC were really disjoint and complete probabilities on the same event they would already add up to 1). It is a bit slower (and theoretically a bit less powerful) than an all-in-one multiclass classifier, but a good method all the same.
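The re-normalization step can be sketched in base R as follows. The three probability vectors are made-up stand-ins for the outputs of three separate one-versus-rest models (one value per row to be scored), and the names raw, normalized, and predictedClass are illustrative only.

```r
# hypothetical one-versus-rest scores for three rows (not real model output)
pA <- c(0.7, 0.2, 0.1)
pB <- c(0.2, 0.5, 0.3)
pC <- c(0.3, 0.4, 0.9)

raw <- cbind(A = pA, B = pB, C = pC)

# divide each row by its own total so the three scores sum to 1
normalized <- raw / rowSums(raw)

# predicted class: the column with the largest normalized score in each row
predictedClass <- colnames(normalized)[max.col(normalized, ties.method = "first")]
```

The division works row by row because R recycles the length-3 vector of row sums down the columns of the 3-row matrix.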
I used “vtreat” with the dataset I referred to you yesterday.
One of the doubts I got, following your example, was the reason to use just “catN” as a filter for “newvars” variable.
In my case, I tried first just “catN” that got a good scoring, but also including the “clean” variables the scoring got better. Besides high cardinality, my dataset also has many “NAs” and “vtreat” run smoothly all over it.
I want to try also with no filter and use all the new variables created by “vtreat”. In my case, that will provide many more new variables, although very sparse.
And also, I used “ranger” with “classification = TRUE” when I saw all of this.
I will let you know how this behaves.
I am a little confused about what is behind “catN”, “catD”, etc.
And in the different vignettes I could not find many details. If you could please point me to the appropriate place…
Sorry, I just found that you have a particular vignette with a description of the meaning of “catN”, “catD”, etc.. (different “VariableTypes”).
Not a problem, hope vtreat is working well for you. For anyone else interested in what the variable types are, here is a link to an explanation: http://winvector.github.io/vtreathtml/vtreatVariableTypes.html . Normally you just take all the new variables that turn out to be significant (especially all the “clean” pass-throughs). Dealing with NA is one of vtreat’s core services; Nina Zumel has written a bit on this here: http://winvector.github.io/DataPrep/EN-CNTNT-Whitepaper-Data-Prep-Using-R.pdf .
Thank you for this post! Inspired by this, I’ve implemented the approach described by Hastie et al. in their book “The Elements of Statistical Learning”, chapter 9.2.4 (see also https://github.com/imbs-hl/ranger/issues/36#issuecomment-203967512).
Since ranger v0.4.5 (available at https://github.com/imbs-hl/ranger) this method is used by default. I tried your example and it was as fast as the model with respect.unordered.factors=FALSE but as good as the model with respect.unordered.factors=TRUE.
Please note that there are now 3 options for respect.unordered.factors, see the ranger R help for details.
Hi Marvin, thanks for your note! Really neat to hear from a ranger developer! The “split by sorting and scanning” idea shown in 9.2.4 is pretty common in combinatorial optimization (I was a bit surprised to see the book says it is hard to prove theorems about it, but I guess the sorting step makes things hard to reason about).
It would be great if more tree based methods didn’t require pre-encoding categorical variables to work well. As you have found there are some great ideas out there.
(edit: Wow! Version 0.4.5 came out 2016-05-31, so you are not kidding about taking some inspiration! Really neat, great stuff.)