The Win-Vector LLC vtreat library is a tool we supply (under a GPL license) for automating the simple, domain-independent part of variable cleaning and preparation.
The idea is that you supply (in R) an example general data.frame to vtreat’s designTreatmentsC method (for single-class categorical targets) or designTreatmentsN method (for numeric targets), and vtreat returns a data structure that can be used to prepare data frames for training and scoring. A vtreat-prepared data frame is nice in the following sense:
- All result columns are numeric.
- No odd type columns (dates, lists, matrices, and so on) are present.
- No columns have
- Categorical variables are expanded into multiple indicator columns with all levels present, which is a good encoding if you are using any sort of regularization in your modeling technique.
- No rare indicators are encoded (limiting the number of indicators on the translated
- Categorical variables are also impact coded, so even categorical variables with very many levels (like zip-codes) can be safely used in models.
- Novel levels (levels not seen during design/train phase) do not cause
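To make the two categorical encodings in the list above concrete, here is a minimal base-R sketch of indicator coding and impact coding for a numeric outcome. It is a deliberately simplified illustration, not vtreat's implementation: the frame `d`, the column names, and the unsmoothed "level mean minus grand mean" formula are all our assumptions for the example.

```r
# Simplified sketch of two encodings; this is NOT vtreat's implementation.
d <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', NA),
  y = c(1, 2, 3, 10, 12, 5),
  stringsAsFactors = FALSE
)

# Treat missing values as their own level so no row is dropped.
xLevel <- ifelse(is.na(d$x), 'NA', d$x)

# Indicator coding: one 0/1 column per level, with every level present.
lvls <- sort(unique(xLevel))
indicators <- sapply(lvls, function(l) as.numeric(xLevel == l))
colnames(indicators) <- paste0('x_lev_', lvls)

# Impact coding: replace each level by
# (mean of y within that level) - (grand mean of y).
grandMean <- mean(d$y)
levelMeans <- tapply(d$y, xLevel, mean)
impact <- as.numeric(levelMeans[xLevel] - grandMean)

# Every resulting column is numeric.
encoded <- data.frame(indicators, x_impact = impact)
print(encoded)
```

The impact column is a single numeric column no matter how many levels the variable has, which is why this style of encoding stays usable for variables like zip codes where one indicator per level would be impractical.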
The idea is that vtreat automates a number of standard inspection and preparation steps that are common to all predictive analytic projects. This leaves the data scientist more time to work on important domain-specific steps. vtreat also leaves as much of variable selection as possible to the down-stream modeling software. The goal of vtreat is to reliably (and repeatably) generate a data.frame that is safe to work with.
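One way to think about "safe to work with" is as a check you could run on any frame before handing it to a modeling function. The sketch below is a hypothetical helper of our own (`checkFrame` is not part of vtreat), and its definition of safe (every column a plain numeric vector with no NA, NaN, or infinite values) is an assumption made for illustration.

```r
# Hypothetical helper (not part of vtreat): list reasons a frame is not model-ready.
checkFrame <- function(d) {
  problems <- character(0)
  for (v in colnames(d)) {
    col <- d[[v]]
    if (!is.numeric(col) || !is.null(dim(col))) {
      # Characters, factors, dates, lists, and matrix columns all land here.
      problems <- c(problems, paste0(v, ': not a plain numeric column (', class(col)[1], ')'))
    } else if (any(!is.finite(col))) {
      # is.finite() is FALSE for NA, NaN, Inf, and -Inf.
      problems <- c(problems, paste0(v, ': contains NA, NaN, or infinite values'))
    }
  }
  problems
}

# A raw frame with a character column and a missing numeric value.
raw <- data.frame(x = c('a', 'b', NA), z = c(1, NA, 3), stringsAsFactors = FALSE)
# A frame shaped like the output we would hope for after treatment
# (the column names here are invented for the example).
clean <- data.frame(x_lev_a = c(1, 0, 0), z_clean = c(1, 2, 3))

print(checkFrame(raw))    # two problems reported
print(checkFrame(clean))  # no problems reported
```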
This note explains a few things that are new in the vtreat library.
The typical use of vtreat is to defend down-stream modeling code from all kinds of typical incoming data problems. Such issues include:
- Categoricals with very large numbers of levels.
- Odd types (dates, matrix, and more).
- Novel levels (levels not seen during design/train phase).
- Outlier values.
- Variables that don’t move.
These are all things that “shouldn’t happen” but do happen often enough that you want systematic notifications, treatments, and defenses against them. Uncaught, these issues can cause your model to error out or skip examples during scoring (novel levels often cause this), or lurk subtly, causing a (large or small) unnoticed loss in model quality.
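The novel-level failure is easy to reproduce without vtreat. The following base-R sketch (the frames `train` and `test` are invented for the example) fits a linear model on a factor with levels 'a' and 'b', then tries to score a row with the unseen level 'c'. We wrap the call in tryCatch so the script keeps running and we can inspect the failure.

```r
# Training data only ever sees levels 'a' and 'b'.
train <- data.frame(
  x = c('a', 'a', 'a', 'b', 'b', 'b'),
  y = c(1, 2, 3, 10, 11, 12),
  stringsAsFactors = TRUE
)
model <- lm(y ~ x, data = train)

# Application data arrives with a level the model has never seen.
test <- data.frame(x = c('a', 'b', 'c'), stringsAsFactors = TRUE)

# predict() stops with an error instead of scoring the rows it could handle;
# capture the error message rather than letting the script die.
result <- tryCatch(
  predict(model, newdata = test),
  error = function(e) conditionMessage(e)
)
print(result)
```

Note that the single bad row takes down the whole scoring call, including the two rows that could have been scored. This is the kind of failure a systematic treatment step is meant to prevent.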
A typical use looks like the following:
    library('vtreat')

    # our design and training data frame
    dTrainC <- data.frame(x=c('a','a','a','b','b',NA),
                          z=c(1,2,3,4,NA,6),
                          y=c(FALSE,FALSE,TRUE,FALSE,TRUE,TRUE))
    print(dTrainC)

    # build the treatment plan on the training frame
    treatmentsC <- designTreatmentsC(dTrainC,colnames(dTrainC),'y',TRUE)

    # treat the training frame and use this treated frame to build models
    dTrainCTreated <- prepare(treatmentsC,dTrainC,pruneLevel=c())
    print(dTrainCTreated)

    # later, new test or application data arrives
    dTestC <- data.frame(x=c('a','b','c',NA),z=c(10,20,30,NA))
    print(dTestC)

    # use the treatment plan to prepare this frame
    dTestCTreated <- prepare(treatmentsC,dTestC,pruneLevel=c())
    print(dTestCTreated)
vtreat was designed to package and automate some of the more common steps from section 4.1 of Practical Data Science with R. This is not a replacement for actually looking at the data. The automation is just to leave the data scientist more time to work on important domain-specific adaptations and transformations. Similarly, vtreat does a little variable scoring, but leaves the bulk of variable selection to the modeling technique the data scientist chooses to use after treatment. We want vtreat to be very light-weight and easy to combine with other libraries.
A few things have been added since we introduced the Win-Vector LLC basic variable preparation library. In particular:
- You can now install directly from GitHub using Hadley Wickham’s devtools package! The R-code is as follows:
Previously you had to download and install the tar file by hand.
- A bit more documentation. Example:
We strongly encourage all data scientists to incorporate vtreat (or something like it) into their workflow.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
And some more:
Variable scoring is now (optionally) parallelized and tries to work out of sample in more circumstances.