
R Tip: Use the vtreat Package For Data Preparation

If you are working with predictive modeling or machine learning in R, this is the R tip that is going to save you the most time and deliver the biggest improvement in your results.

R Tip: Use the vtreat package for data preparation in predictive analytics and machine learning projects.


When attempting predictive modeling with real-world data, you quickly run into difficulties beyond what is typically emphasized in machine learning coursework:

  • Missing, invalid, or out-of-range values.
  • Categorical variables with large sets of possible levels.
  • Novel categorical levels discovered during test, cross-validation, or model application/deployment.
  • Large numbers of columns to consider as potential modeling variables (both statistically hazardous and time-consuming).
  • Nested model bias that poisons results in non-trivial data processing pipelines.

Any one of these issues can add to project time and decrease the predictive power and reliability of a machine learning project. Many real-world projects encounter all of these issues, and they are often ignored, leading to degraded performance in production.

vtreat systematically and correctly deals with all of the above issues in a documented, automated, parallel, and statistically sound manner.
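As a rough illustration of the first few issues in the list above, here is a minimal sketch using `designTreatmentsN()` and `prepare()` from vtreat. The data frames, column names (`x_num`, `x_cat`, `y`), and the novel level `"zzz"` are all made up for this example; it is not taken from the vtreat documentation.

```r
library(vtreat)

set.seed(2018)
n <- 100

# Hypothetical training data: one numeric and one categorical variable.
d_train <- data.frame(
  x_num = rnorm(n),
  x_cat = sample(c("a", "b", "c"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
d_train$y <- d_train$x_num + ifelse(d_train$x_cat == "a", 1, 0) + rnorm(n, sd = 0.5)

# Knock out some values to simulate missing data.
d_train$x_num[sample.int(n, 10)] <- NA

# Design a treatment plan for a numeric outcome.
plan <- designTreatmentsN(
  d_train,
  varlist = c("x_num", "x_cat"),
  outcomename = "y",
  verbose = FALSE
)

# The score frame lists the derived variables and an estimated significance for each.
print(plan$scoreFrame[, c("varName", "sig")])

# Hypothetical application data with a missing value and a level never seen in training.
d_new <- data.frame(
  x_num = c(0.3, NA),
  x_cat = c("zzz", "a"),
  stringsAsFactors = FALSE
)

# Apply the plan; the result should be all numeric with no missing values.
d_new_treated <- prepare(plan, d_new, pruneSig = NULL)
print(d_new_treated)
```

The exact names of the derived columns vary between vtreat versions, so downstream code is safer reading them from `plan$scoreFrame$varName` than hard-coding them.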

vtreat can fix or mitigate these domain-independent issues much more reliably and much faster than by-hand ad-hoc methods.
This leaves the data scientist or analyst more time to research and apply critical domain-dependent (or knowledge-based) steps and checks.
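For the nested model bias issue in particular, one possible approach is vtreat's cross-frame interface. The sketch below assumes a numeric outcome and uses `mkCrossFrameNExperiment()`; the data, the 40-level variable `x_cat`, and the choice of `lm()` as the downstream model are illustrative assumptions, not the authors' recommended pipeline.

```r
library(vtreat)

set.seed(2018)
n <- 200

# Hypothetical data: a high-cardinality categorical variable that is pure noise,
# plus one numeric variable that actually drives the outcome.
d <- data.frame(
  x_cat = sample(paste0("lev", 1:40), n, replace = TRUE),
  x_num = rnorm(n),
  stringsAsFactors = FALSE
)
d$y <- d$x_num + rnorm(n)

# Build the treatment plan and a cross-validated training frame in one step.
# Rows of the cross frame are treated by plans that did not see those rows,
# which is meant to limit nested model bias from the impact-coded variables.
cfe <- mkCrossFrameNExperiment(
  d,
  varlist = c("x_cat", "x_num"),
  outcomename = "y",
  verbose = FALSE
)
plan <- cfe$treatments
d_train_treated <- cfe$crossFrame

# Fit the downstream model on the cross frame, not on prepare(plan, d).
vars <- setdiff(colnames(d_train_treated), "y")
model <- lm(reformulate(vars, response = "y"), data = d_train_treated)

# New data is treated with the ordinary plan.
d_new <- data.frame(
  x_cat = c("lev1", "never_seen"),
  x_num = c(0.5, NA),
  stringsAsFactors = FALSE
)
d_new_treated <- prepare(plan, d_new, pruneSig = NULL)
preds <- suppressWarnings(predict(model, newdata = d_new_treated))
print(preds)
```

The design choice here is that the training rows only ever go through the cross frame, while `prepare()` is reserved for held-out or future data.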

If you are attempting high-value predictive modeling in R, you should try out vtreat and consider adding it to your workflow.

Both the software and the write-up have citable DOIs to make them easier to include in your methods sections and other write-ups.

vtreat 1.0.3 is now available for R users through CRAN.
This vtreat release adds some parallel performance improvements and new methods to track and characterize novel variable levels.

Win-Vector LLC offers semi-custom on-site training in the vtreat methodology (and support). Please reach out to us if your group is interested in such training.

Please cite as:

Mount, Zumel, (2018). The vtreat R package: a statistically sound data processor for predictive modeling. Journal of Open Source Software, 3(23), 584, https://doi.org/10.21105/joss.00584 .


Categories: Pragmatic Data Science Tutorials


jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

9 replies

    1. Sorry you had trouble.

      It should work now. It is a pure R package (no C/C++), and there is sometimes a gap for a little while after release during which packages do not install without compiling (CRAN looks like it has a built Windows version of the current version now).

      1. Sorry you are having trouble. Thanks for including the image, it did include a lot of important details.

        From the message it appears the CRAN mirror you are using (sorengard) does not have the binary version of the 1.0.3 version of the package and possibly tries to build the package. Also the reported version of R 2.13.1 (July 2011) is very much behind the current version R 3.4.3, so there may be a behavioral difference between that version of R and the version the package was built on at CRAN (in fact given the major number change I assume there are API differences). Best I could do on my end is add an R version declaration to the package so that older versions of R do not attempt the install. That will take about 1 month as we just released and CRAN expects updates to come no sooner than about 1 month apart on average.

      2. Hi. Thanks. Actually, I am on 3.4.3. It’s just that I have installed over the 2.13.x directory for many versions now. Some time ago I was assured that this, for the most part, was a perfectly okay thing to do.

      3. It may be that it is okay until it is not. My only advice would be to please try a mirror that has the latest built version matching the latest source version.

  1. JOSS has just accepted vtreat! Here are citable links (also added to the article).

    Mount, Zumel, (2018). The vtreat R package: a statistically sound data processor for predictive modeling. Journal of Open Source Software, 3(23), 584, https://doi.org/10.21105/joss.00584 .


    Note: the content has been appearing and disappearing from JOSS. Hopefully the JOSS notice will be stable soon.

  2. Very nice tips! As I work with R for many hours a day for business and personal use, saving time is of the essence!
