vtreat is a system for preparing messy real-world data for predictive modeling tasks (classification, regression, and so on). In particular, it is very good at re-coding high-cardinality string-valued (or categorical) variables for later use.
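To make "re-coding a high-cardinality categorical variable" concrete, here is a minimal sketch of one common re-coding idea, often called impact or effect coding: each level is replaced by how far the average outcome for that level sits from the overall average. This is an illustration only, not vtreat's implementation. The function names `fit_impact_code` and `apply_impact_code` and the toy data are hypothetical, and the sketch deliberately omits the safeguards a production tool needs (such as handling rare levels and estimating the codes out-of-sample to limit overfitting).

```python
from collections import defaultdict


def fit_impact_code(levels, outcomes):
    """Learn a map: level -> (mean outcome for that level - overall mean outcome).

    Hypothetical helper for illustration; not part of vtreat.
    """
    if len(levels) != len(outcomes) or len(levels) == 0:
        raise ValueError("levels and outcomes must be non-empty and the same length")
    grand_mean = sum(outcomes) / len(outcomes)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, outcomes):
        sums[level] += y
        counts[level] += 1
    return {level: sums[level] / counts[level] - grand_mean for level in sums}


def apply_impact_code(code, levels):
    """Replace each level with its learned effect.

    Levels never seen during fitting are mapped to 0.0 (no estimated effect).
    """
    return [code.get(level, 0.0) for level in levels]


# Toy training data: a string-valued variable and a numeric outcome.
zip_codes = ["94110", "94110", "10001", "10001", "60601", "60601"]
sale_price = [10.0, 12.0, 20.0, 22.0, 15.0, 17.0]

code = fit_impact_code(zip_codes, sale_price)

# New data, including a level ("99999") that was not present during fitting.
encoded = apply_impact_code(code, ["94110", "10001", "99999"])
print(encoded)  # [-5.0, 5.0, 0.0]
```

The point of the sketch is that a string column with many distinct values becomes a single numeric column a downstream model can use directly, and that unseen levels still get a well-defined value.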
A nice introductory video lecture on vtreat can be found here, and the latest copy of the lecture slides here. Or you can check out chapter 8, “Advanced data preparation”, of Zumel and Mount, Practical Data Science with R, 2nd Edition (Manning, 2019), which covers the use of vtreat.
The vtreat documentation is organized by task (regression, classification, multinomial classification, and unsupervised), language (R or Python), and interface style (design/prepare or fit/prepare). In particular, the R code now supports both interface styles, allowing users to choose what works best with their coding style: either design/prepare, which is very fluid when combined with wrapr::unpack notation, or fit/prepare, which uses mutable state to organize steps.
- Regression:
R regression example, fit/prepare interface,
R regression example, design/prepare/experiment interface.
- Classification:
R classification example, fit/prepare interface,
R classification example, design/prepare/experiment interface.
- Unsupervised tasks:
R unsupervised example, fit/prepare interface,
R unsupervised example, design/prepare/experiment interface.
- Multinomial classification:
Python multinomial classification example,
R multinomial classification example, fit/prepare interface,
R multinomial classification example, design/prepare/experiment interface.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.