I would like to re-share vtreat (R version, Python version) a data preparation documentation for machine learning tasks. vtreat is a system for preparing messy real world data for predictive modeling tasks (classification, regression, and so on). In particular it is very good at re-coding high-cardinality string-valued (or categorical) variables […]

Estimated reading time: 2 minutes

(or: how to correctly use xgboost from R) R has "one-hot" encoding hidden in most of its modeling paths. Asking an R user where one-hot encoding is used is like asking a fish where there is water; they can’t point to it as it is everywhere. For example we can […]

Estimated reading time: 16 minutes

Model level fit summaries can be tricky in R. A quick read of model fit summary data for factor levels can be misleading. We describe the issue and demonstrate techniques for dealing with them.

Estimated reading time: 12 minutes