Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. These are variables that essentially take on string-values (also called levels or factors) and vary through many such levels. Typical examples include zip-codes, vendor IDs, and product codes.
I feel we may have buried the lede: we have not sufficiently emphasized that you really do need to perform such re-encodings. Below is a graph (generated in R, code available here) of the kind of disaster you see if you throw such variables into a model without any pre-processing or post-controls.
In the above graph each dot represents the performance of a model fit on synthetic data. The x-axis is model performance (in this case pseudo R-squared, with 1 being perfect and anything below zero being worse than just predicting the average). The training pane shows performance on the training data (perfect, but over-fit) and the test pane shows performance on held-out test data (an attempt to simulate future application data). Notice the test performance implies these models are dangerously worse than useless.
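To make the "perfect on training, worse than useless on test" pattern concrete, here is a minimal Python sketch. It is not the R experiment behind the graph. It assumes a numeric outcome scored with plain R-squared (not the pseudo R-squared used above), a made-up set of 500 ID-like levels that each appear once in training and once in test, and an outcome that is pure noise. The "model" simply memorizes the training outcome per level, which is roughly what an unregularized model with one indicator per level ends up doing.

```python
import random
from collections import defaultdict


def r_squared(y_true, y_pred):
    """1 - SSE/SST: 1 is perfect, 0 ties predicting the mean, negative is worse."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    sst = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - sse / sst


def fit_level_means(levels, y):
    """Naive model: memorize the mean training outcome of every level."""
    total, count = defaultdict(float), defaultdict(int)
    for lvl, yi in zip(levels, y):
        total[lvl] += yi
        count[lvl] += 1
    return {lvl: total[lvl] / count[lvl] for lvl in total}


rng = random.Random(2016)  # fixed seed so the run is repeatable
levels = [f"id_{i}" for i in range(500)]  # hypothetical vendor-ID-like levels

# Assumption: the outcome is pure noise, unrelated to the level.
y_train = [rng.gauss(0.0, 1.0) for _ in levels]
y_test = [rng.gauss(0.0, 1.0) for _ in levels]

model = fit_level_means(levels, y_train)
train_r2 = r_squared(y_train, [model[lvl] for lvl in levels])
test_r2 = r_squared(y_test, [model[lvl] for lvl in levels])

print(f"train R-squared: {train_r2:.3f}")
print(f"test R-squared:  {test_r2:.3f}")
```

With one training row per level the memorized predictions reproduce the training data exactly, so the training score is 1. On the test rows the same predictions are just old noise, which roughly doubles the squared error compared to predicting the average, so the test score lands well below zero.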
Please read on for how to fix this.
First: remember the disasters you see are better than those you don’t. In the synthetic data we see failure to model a relation (even though there is one, by design). But it could easily be that some column lurking in a complex model is quietly degrading model performance, without ruining the model so completely that the problem gets noticed.
The reason Nina and I have written so much on the possible side-effects of re-encoding high cardinality categorical variables is that you don’t want to introduce more problems as you attempt to fix things. Also once you intervene, by supplying advice or a solution, you feel everything will be your fault. That being said, here is our advice:
Re-encode high cardinality categorical variables using impact or effects based ideas, as we describe and implement in the vtreat R library.
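As a rough illustration of the idea (and only that: this is a Python sketch, not vtreat's API or its exact algorithm), an impact code replaces each level with an estimate of how far the outcome for that level sits from the overall average. The function names, the `smoothing` pseudo-count, and the simple index-based fold assignment below are all assumptions made for this example. The `cross_frame` helper shows the simulated out-of-sample variant: each row is encoded using only rows from other folds, so a row's own outcome never leaks into its own encoding.

```python
from collections import defaultdict


def fit_impact_codes(levels, y, smoothing=5.0):
    """Map each level to (shrunken level mean - grand mean).

    `smoothing` is a hypothetical pseudo-count pulling rare levels toward 0.
    """
    grand_mean = sum(y) / len(y)
    total, count = defaultdict(float), defaultdict(int)
    for lvl, yi in zip(levels, y):
        total[lvl] += yi
        count[lvl] += 1
    return {
        lvl: (total[lvl] + smoothing * grand_mean) / (count[lvl] + smoothing)
        - grand_mean
        for lvl in total
    }


def apply_impact_codes(codes, levels):
    """Encode levels; levels never seen in fitting get 0 (no known effect)."""
    return [codes.get(lvl, 0.0) for lvl in levels]


def cross_frame(levels, y, n_folds=3, smoothing=5.0):
    """Encode each training row using codes fit on the other folds only."""
    n = len(y)
    encoded = [0.0] * n
    for fold in range(n_folds):
        rest = [i for i in range(n) if i % n_folds != fold]
        codes = fit_impact_codes(
            [levels[i] for i in rest], [y[i] for i in rest], smoothing
        )
        for i in range(fold, n, n_folds):  # rows held out in this fold
            encoded[i] = codes.get(levels[i], 0.0)
    return encoded


# Toy data, made up for illustration.
levels = ["a", "a", "a", "a", "b", "b", "c"]
y = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0]

codes = fit_impact_codes(levels, y, smoothing=1.0)
new_rows = apply_impact_codes(codes, ["a", "b", "z"])  # "z" is unseen
train_encoded = cross_frame(levels, y, n_folds=3, smoothing=1.0)
print(codes)
print(new_rows)
print(train_encoded)
```

In this sketch the codes fit on all the data would be used to encode future rows, while the cross-frame values would be used to fit the downstream model. Note how the singleton level "c" gets an encoding of 0 in the cross frame: with its only row held out, nothing is known about it, which is exactly the honesty the naive approach lacks.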
Get your data science, predictive analytics, or machine learning house in order by fixing how you are treating incoming features and data. This is where the largest opportunities for improvement are available in real-world applications. In particular:
- Do not ignore large cardinality categorical variables.
- Do not blindly add large cardinality categorical variables to your model.
- Do not hash-encode large cardinality categorical variables.
- Consider using large cardinality categorical variables as join keys to pull in columns from external data sets.
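The last point in the list can be sketched as follows. This is a small stdlib-only Python example with made-up table contents and hypothetical names (`join_features`, `zip_table`, `median_income`, `pop_density`). The idea is that the zip-code itself never enters the model: it is only a key used to look up lower-cardinality, more meaningful columns, and a match indicator records which rows had no external data.

```python
def join_features(rows, key, lookup, feature_names):
    """Left-join external columns onto rows by `key`.

    Rows whose key is missing from `lookup` get None for each feature,
    plus a False match flag so the gap stays visible to later steps.
    """
    joined = []
    for row in rows:
        extra = lookup.get(row[key])
        out = dict(row)
        for name in feature_names:
            out[name] = extra[name] if extra is not None else None
        out[key + "_matched"] = extra is not None
        joined.append(out)
    return joined


# Hypothetical external table keyed by zip-code (values are placeholders).
zip_table = {
    "00001": {"median_income": 50.0, "pop_density": 1200.0},
    "00002": {"median_income": 72.0, "pop_density": 300.0},
}

# Hypothetical modeling rows carrying the high cardinality key.
rows = [
    {"order_id": 1, "zip": "00001"},
    {"order_id": 2, "zip": "00002"},
    {"order_id": 3, "zip": "99999"},  # not present in the external table
]

features = join_features(rows, "zip", zip_table, ["median_income", "pop_density"])
for r in features:
    print(r)
```

In a real workflow a data-frame join would be the more natural tool. The plain-dictionary version is used here only to keep the example free of dependencies.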
Our advice: use vtreat. Going forward, you will increasingly be competing with models that use this library or similar concepts.
Once you have reached this level of operation, then worry (as we do) about the statistical details of which processing steps are justifiable, safe, useful, and best. That is the topic we have been studying and writing on in depth (we call the potential bad issues over-fitting and nested model bias). Please read more from:
- More on preparing data (a great article on the concepts).
- Model evaluation and nested models (two recent talks we presented on these topics).
- Chapters 3, 4, 5, and 6 of Practical Data Science with R (Zumel, Mount; Manning 2014) (where we work through examining data, fixing data problems, evaluating models, and reasoning about data columns as single variable models in disguise).
- Laplace noising versus simulated out of sample methods (cross frames) (where this example came from).
Or invite us in to re-present one of our talks or work with your predictive analytics or data science team to adapt these techniques to your workflow, software, and problem domain. We have gotten very good results with the general methods in our vtreat library, but knowing a specific domain or problem structure can often let you do much more (for example: Nina’s work on y-aware scaling for geometric problems such as nearest neighbor classification and clustering).
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.