Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. These are variables that essentially take on string-values (also called levels or factors) and vary through many such levels. Typical examples include zip-codes, vendor IDs, and product codes.
In a sort of “burying the lede” way I feel we may not have sufficiently emphasized that you really do need to perform such re-encodings. Below is a graph (generated in R, code available here) of the kind of disaster you see if you throw such variables into a model without any pre-processing or post-controls.
In the above graph each dot represents the performance of a model fit on synthetic data. The x-axis is model performance (in this case pseudo R-squared, 1 being perfect and below zero worse than using an average). The training pane represents performance on the training data (perfect, but over-fit) and the test pane represents performance on held-out test data (an attempt to simulate future application data). Notice the test performance implies these models are dangerously worse than useless.
Please read on for how to fix this.
First: remember the disasters you see are better than the ones you don’t. In the synthetic data we see a failure to model a relation (even though one exists, by design). But it could easily be that some column lurking in a complex model is quietly degrading performance and never gets detected, because it never fully ruins the model.
The reason Nina and I have written so much on the possible side-effects of re-encoding high cardinality categorical variables is that you don’t want to introduce new problems while trying to fix things. Also, once you intervene by supplying advice or a solution, you feel that everything that goes wrong afterwards will be your fault. That being said, here is our advice:
Re-encode high cardinality categorical variables using impact or effects based ideas, as we describe and implement in the vtreat R library.
Get your data science, predictive analytics, or machine learning house in order by fixing how you are treating incoming features and data. This is where the largest opportunities for improvement are available in real-world applications. In particular:
- Do not ignore large cardinality categorical variables.
- Do not blindly add large cardinality categorical variables to your model.
- Do not hash-encode large cardinality categorical variables.
- Consider using large cardinality categorical variables as join keys to pull in columns from external data sets.
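For instance, here is a minimal sketch of that last point (the sales and zip_demographics tables are made up for illustration):

```r
# A high cardinality code used as a join key instead of a direct model input.
sales <- data.frame(
  zip = c("94107", "10001", "94107"),
  amount = c(20, 35, 15),
  stringsAsFactors = FALSE
)
zip_demographics <- data.frame(
  zip = c("94107", "10001"),
  median_income = c(110000, 85000),
  stringsAsFactors = FALSE
)
# Pull in informative numeric columns keyed by the categorical variable.
enriched <- merge(sales, zip_demographics, by = "zip", all.x = TRUE)
```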
Our advice: use vtreat. Going forward you will more and more often be competing against models built with this library or similar concepts.
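As a concrete starting point, here is a minimal sketch of impact-coding a high cardinality categorical variable with vtreat (this is not the code behind the graph above; the data frame and column names are made up for illustration):

```r
library(vtreat)

set.seed(2016)
# Made-up data: a 200-level vendor code and a numeric outcome.
d <- data.frame(
  vendorID = sample(sprintf("v%03d", 1:200), 1000, replace = TRUE),
  stringsAsFactors = FALSE
)
d$y <- ifelse(d$vendorID %in% sprintf("v%03d", 1:20), 1, 0) + rnorm(1000)

# Build a treatment plan plus a simulated out-of-sample ("cross") frame;
# train on the cross frame to avoid nested model bias.
cfe <- mkCrossFrameNExperiment(d, varlist = "vendorID", outcomename = "y")
trainFrame <- cfe$crossFrame
head(trainFrame)   # includes an impact-coded (catN) re-encoding of vendorID

# Apply the stored treatment plan to genuinely new data:
# newFrame <- prepare(cfe$treatments, newData)
```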
Once you have gotten to this level of operation then worry (as we do) about the statistical details of which processing steps are justifiable, safe, useful, and best. That is the topic we have been studying and writing on in depth (we call the potential bad issues over-fitting and nested model bias). Please read more from:
- More on preparing data (a great article on the concepts).
- Model evaluation and nested models (two recent talks we presented on these topics).
- Chapters 3, 4, 5, and 6 of Practical Data Science with R (Zumel, Mount; Manning 2014) (where we work through examining data, fixing data problems, evaluating models, and reasoning about data columns as single-variable models in disguise).
- Laplace noising versus simulated out of sample methods (cross frames) (where this example came from).
Or invite us in to re-present one of our talks or work with your predictive analytics or data science team to adapt these techniques to your workflow, software, and problem domain. We have gotten very good results with the general methods in our vtreat library, but knowing a specific domain or problem structure can often let you do much more (for example: Nina’s work on y-aware scaling for geometric problems such as nearest neighbor classification and clustering).
jmount: Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
There’s a simpler reason to avoid native factors, and I suspect a common mistake that occurs when using randomForest in R. When you avoid the formula interface (as the documentation suggests) and pass randomForest a data.frame, it uses the native factor levels as numeric variables!! Who knows where that will lead. But if you create a model.matrix, then the selection of columns treats each level as an independent variable (i.e. the model matrix is only used as a dumb matrix without terms). This also over-selects the levels of the categories.
Thanks for your comment, that matches up with some “random creepy randomForest” experiences I have heard about. That is a lot to think about, and something to start checking for.
I am not fully sure the exact “factors get turned into integers” bug exists in the current version of randomForest. Or at least there are some code paths that do not trigger it. For example, in the example below both calls error out with the “can’t deal with more than 53 levels” message, which makes me think the variable is being handled properly as a factor in both cases. However, that is not to say you haven’t seen that bug. I had always heard the reason to use the non-formula interface was “it was faster” (in some way beyond the expense of the formula call itself), and our group has seen different results between the two interfaces (beyond the difficulty of successfully declaring classification versus regression).
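A minimal sketch of the kind of check described (the data set and level count here are made up for illustration):

```r
library(randomForest)

set.seed(2016)
# A factor with 100 levels, well past randomForest's 53-level limit.
d <- data.frame(
  x = factor(sample(paste0("lev_", 1:100), 500, replace = TRUE)),
  y = rnorm(500)
)
# Formula interface:
try(randomForest(y ~ x, data = d))
# Non-formula (x/y) interface:
try(randomForest(x = d[, "x", drop = FALSE], y = d$y))
# Per the observation above, both calls stop with the "more than 53
# categories" complaint, consistent with the column being treated as a
# factor in both code paths.
```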
Lately I’ve been using the formula interface or ranger (I gave up on trusting the randomForest documentation). For factors with a small number of levels, randomForest’s deep trees can mostly undo the mis-encoding of factors as ordinals by cutting the numeric variable into a bunch of single-point intervals. But you lose a bit of predictive power hoping for that undo. More reasons to give the ranger package a try.
It does bring up the related problem of making sure that things that superficially look like numbers, but are in fact factors, get treated as factors throughout. In a database you have the schema definitions to guide you, but not all data is so neat. It is probably a good practice to force such values to start with a letter to ensure you get the behaviors you want.
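A small sketch of that practice (the ZIP code values and column handling are made up for illustration):

```r
# ZIP codes that have been (mis)read as numbers.
zips <- c(94107, 10001, 02139)             # the leading zero is already silently lost
# Prefix a letter (and restore the fixed width) so the value can never again
# be mistaken for a number.
zips_fixed <- factor(sprintf("z%05d", as.integer(zips)))
levels(zips_fixed)                         # "z02139" "z10001" "z94107"
# When reading from files, declaring the column as character up front, e.g.
# read.csv(..., colClasses = c(zip = "character")), avoids the conversion entirely.
```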
The ranger package had a similar bug (hooked up to a control) that has since been fixed in a fairly clever manner (please see http://www.win-vector.com/blog/2016/05/on-ranger-respect-unordered-factors/ for an account of me running into it).
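A short sketch of using that control (the data frame is made up, and the option names have varied across ranger versions, so treat this as an illustration rather than a definitive recipe):

```r
library(ranger)

set.seed(2016)
# Made-up data: an unordered factor and a numeric outcome.
d <- data.frame(
  cat = factor(sample(letters[1:15], 300, replace = TRUE)),
  y = rnorm(300)
)
# "order" asks ranger to re-encode the unordered factor's levels by outcome
# before splitting, rather than treating the raw level codes as ordered.
m <- ranger(y ~ cat, data = d, respect.unordered.factors = "order")
```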
Also, directly calling model.matrix to expand factors (or using the extra indicator columns that vtreat supplies in addition to the impact codes) can interfere with randomForest’s variable partitioning strategy. randomForest tries to increase model diversity by considering only a random subset of variables at each split. If a variable is explicitly split into indicators before being handed to randomForest you are going to see different variable selection behavior than when you let randomForest control the show.
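The contrast looks roughly like this (made-up data; the point is only that the two calls give randomForest different sets of candidate variables to sample from):

```r
library(randomForest)

set.seed(2016)
d <- data.frame(
  x1 = rnorm(200),
  x2 = rnorm(200),
  cat = factor(sample(letters[1:10], 200, replace = TRUE))
)
d$y <- d$x1 + ifelse(d$cat %in% c("a", "b"), 1, 0) + rnorm(200)

# Factor passed as-is: the whole factor counts as a single candidate variable
# when randomForest samples variables for each split.
m1 <- randomForest(y ~ x1 + x2 + cat, data = d)

# Explicit indicator expansion: each level becomes its own column, so the
# per-split variable sampling now runs over the individual indicators.
X <- model.matrix(~ x1 + x2 + cat, data = d)[, -1]  # drop the intercept
m2 <- randomForest(x = X, y = d$y)
```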