I am pleased to announce that vtreat version 0.6.0 is now available to R users on CRAN.
vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytics projects. If you are an R user, we strongly suggest you incorporate vtreat into your projects.
vtreat handles, in a statistically sound fashion:
- Missing values.
- Encoding of categorical values for regularized inference and machine learning techniques.
- Categorical variables with very many values.
- Novel categorical values (that is, values not seen during training).
- Variable pruning.
- y-aware scaling.
- Structured cross-validation.
- Mitigating nested model bias.
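To make a few of the items above concrete, here is a minimal sketch of the general idea behind treating missing values, rare categorical levels, and novel categorical levels. It is written in plain Python purely for illustration; it is not vtreat's code or API, and every name in it (design_numeric, prepare_numeric, design_categorical, prepare_categorical, min_count) is hypothetical. The key point it tries to show is the split between designing a treatment plan on training data and then applying that same plan to new data.

```python
import math
from collections import Counter


def _is_missing(v):
    """Treat None and NaN as missing."""
    return v is None or (isinstance(v, float) and math.isnan(v))


def design_numeric(train_values):
    """Learn a fill value (the training mean) for one numeric column."""
    good = [float(v) for v in train_values if not _is_missing(v)]
    return {"fill": sum(good) / len(good) if good else 0.0}


def prepare_numeric(plan, values):
    """Replace missing values with the learned fill and flag them."""
    rows = []
    for v in values:
        bad = _is_missing(v)
        rows.append({
            "clean": plan["fill"] if bad else float(v),
            "is_bad": 1.0 if bad else 0.0,  # keeps the "was missing" signal
        })
    return rows


def design_categorical(train_values, min_count=2):
    """Learn which levels get their own indicator column.

    Levels seen fewer than min_count times (an illustrative threshold)
    get no column of their own, which limits very wide encodings.
    """
    counts = Counter(v for v in train_values if v is not None)
    levels = sorted(lv for lv, n in counts.items() if n >= min_count)
    return {"levels": levels}


def prepare_categorical(plan, values):
    """Indicator-encode values using only levels learned in training.

    A novel level (never seen in training) becomes all zeros instead
    of raising an error; a missing value gets its own indicator.
    """
    rows = []
    for v in values:
        row = {"lev_" + lv: 1.0 if v == lv else 0.0 for lv in plan["levels"]}
        row["lev_NA"] = 1.0 if v is None else 0.0
        rows.append(row)
    return rows


# Design on training data, then apply the same plan to new data.
num_plan = design_numeric([1.0, None, 3.0])
cat_plan = design_categorical(["a", "a", "b", "b", None, "c"])
new_num = prepare_numeric(num_plan, [None, 5.0])
new_cat = prepare_categorical(cat_plan, ["a", "z", None])
```

The sketch leaves out the y-aware and cross-validated parts of the list entirely; those depend on the outcome variable and are where a careful implementation matters most.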
In our (biased) opinion vtreat has the best methodology and documentation for these important data cleaning and preparation steps. vtreat's current public open-source implementation is for in-memory R analysis (we are considering ports and certifying ports of the package some time in the future, possibly for: data.table, Spark, Python/Pandas, and SQL).
vtreat brings a lot of power, sophistication, and convenience to your analyses, without a lot of trouble.
A new feature of vtreat version 0.6.0 is called “custom coders.” Win-Vector LLC's Dr. Nina Zumel is going to start a short article series to show how this new interface can be used to extend vtreat methodology to include the very powerful method of partial pooled inference (a term she will spend some time clearly defining and explaining). Time permitting, we may continue with articles on other applications of custom coding including: ordinal/faithful coders, monotone coders, unimodal coders, and set-valued coders.
Please help us share and promote this article series, which should start in a couple of days. This should be a fun chance to share very powerful methods with your colleagues.
Edit 9-25-2017: part 1 is now here!
jmount
Data Scientist and trainer at Win-Vector LLC. One of the authors of Practical Data Science with R.
That sounds great! Looking forward to the articles.
How did you manage to get a visualization on ranger() for random forests in R?
Not really so much visualization of ranger, but looking at the code and asking some questions of the developers.