Many R users have found that vtreat rapidly becomes an indispensable step in their supervised machine learning workflows.
This tool accepts real-world data that may have issues such as missing values, categorical variables with many levels, or even novel levels appearing during model application. The tool then faithfully and reliably converts this data into a machine-learning-ready data frame that is entirely numeric and has no missing values. By faithful we mean: most of the relevant modeling information is preserved. And by reliable we mean: a number of subtle over-fitting (or nested model bias) traps are avoided.
Once you get used to having this capability, it is hard to give up.
In our talk we will lay out the typical problems and how vtreat now also solves these problems for Python users.
We won’t have time to get deeply into it, but the Python version of vtreat is designed to “be Pythonic” (or at least to follow the patterns of other Python tools), so the calling conventions should be very familiar to scikit-learn users.
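To make the kind of preparation described above concrete, here is a minimal hand-rolled sketch in plain Python. It is not vtreat's API or its algorithm: the class name TinyTreatment, its parameters, and the output column names are all hypothetical, chosen only to show a fit/transform-style preparer that fills in missing numeric values, adds missingness indicators, and maps novel categorical levels to all-zero indicator columns. It makes no attempt at the over-fitting (nested model bias) protections mentioned above.

```python
# Hand-rolled illustration only: this is NOT vtreat's API or its algorithm.
from statistics import mean


class TinyTreatment:
    """Hypothetical fit/transform-style preparer for rows given as dicts."""

    def __init__(self, numeric_cols, categorical_cols):
        self.numeric_cols = list(numeric_cols)
        self.categorical_cols = list(categorical_cols)
        self.means_ = {}   # per-column mean learned from training data
        self.levels_ = {}  # per-column categorical levels seen in training

    def fit(self, rows):
        for col in self.numeric_cols:
            seen = [r[col] for r in rows if r.get(col) is not None]
            self.means_[col] = mean(seen) if seen else 0.0
        for col in self.categorical_cols:
            self.levels_[col] = sorted(
                {r[col] for r in rows if r.get(col) is not None}
            )
        return self

    def transform(self, rows):
        out = []
        for r in rows:
            new = {}
            for col in self.numeric_cols:
                missing = r.get(col) is None
                # Replace a missing value with the training mean and flag it.
                new[col] = self.means_[col] if missing else float(r[col])
                new[col + "_is_missing"] = 1.0 if missing else 0.0
            for col in self.categorical_cols:
                # Only levels seen during fit get a column, so a novel or
                # missing level becomes all zeros instead of an error.
                for level in self.levels_[col]:
                    new[f"{col}_is_{level}"] = 1.0 if r.get(col) == level else 0.0
            out.append(new)
        return out


train = [
    {"x": 1.0, "c": "a"},
    {"x": None, "c": "b"},
    {"x": 3.0, "c": None},
]
new_data = [{"x": None, "c": "zzz"}]  # missing number, never-seen level

treatment = TinyTreatment(numeric_cols=["x"], categorical_cols=["c"]).fit(train)
prepared_train = treatment.transform(train)
prepared_new = treatment.transform(new_data)
```

Every value in prepared_train and prepared_new is a float and none is missing, and the new row has exactly the same columns as the training rows even though it contains a level never seen during fitting.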
Data scientist and trainer at Win-Vector LLC. One of the authors of Practical Data Science with R.