Machine learning “in the database” (including systems such as Spark) is an increasingly popular topic. And where there is machine learning, there is a need for data preparation. Many machine learning algorithms expect all data to be numeric without missing values. vtreat is a package (available for Python or for […]
Estimated reading time: 8 minutes
I would like to re-share the vtreat (R version, Python version) data preparation documentation for machine learning tasks. vtreat is a system for preparing messy real-world data for predictive modeling tasks (classification, regression, and so on). In particular, it is very good at re-coding high-cardinality string-valued (or categorical) variables […]
Estimated reading time: 2 minutes
As we have announced before, we have ported the R version of vtreat to a new Python version of vtreat. Our latest news is: we are speaking about the Python version at PyData LA 2019 (Thursday 10:50 AM–11:35 AM in Track 2 Room).
Estimated reading time: 1 minute
vtreat is a powerful R package for preparing messy real-world data for machine learning. We have further extended the package with a number of features including rquery/rqdatatable integration (allowing vtreat application at scale on Apache Spark or data.table!). In addition, vtreat can now effectively prepare data for multi-class classification […]
Estimated reading time: 7 minutes
Data preparation and cleaning are some of the most important steps of predictive analytics and data science tasks. They are laborious, are where most of the errors are made, are your last line of defense against wild data, and hold the biggest opportunities for outcome improvement. No matter how much time […]
Estimated reading time: 3 minutes