We’ve been experimenting with this for a while, and the next release of the R vtreat package will include a back-port of the Python vtreat package's sklearn-style pipe step interface (in addition to the standard R interface).
This means the user can easily express modeling intent by choosing between coder$fit_transform(train_data), coder$fit(train_data_cal)$transform(train_data_model), and coder$transform(application_data).
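To make the difference between the three call patterns concrete, here is a minimal sketch in Python, where the sklearn convention comes from. It does not use vtreat at all: ToyMeanCoder, its arguments, and the toy data are hypothetical names invented for this illustration, and the fold assignment (row index modulo n_folds) is a simplifying assumption, not vtreat's cross-validation scheme. The idea it tries to show is that fit_transform() on training data returns cross-validated codes, while fit() followed by transform() on the same rows would leak the outcome into the coded variable.

```python
from statistics import mean


class ToyMeanCoder:
    """Hypothetical stand-in for a vtreat-style coder (NOT the vtreat API).

    Codes a categorical variable as the mean outcome seen for each level.
    Assumes at least two rows and n_folds >= 2.
    """

    def __init__(self, var, outcome, n_folds=3):
        self.var = var
        self.outcome = outcome
        self.n_folds = n_folds
        self.codes_ = None    # level -> mean outcome, learned by fit()
        self.default_ = None  # grand mean, used for unseen levels

    def _learn(self, rows):
        grand = mean(r[self.outcome] for r in rows)
        by_level = {}
        for r in rows:
            by_level.setdefault(r[self.var], []).append(r[self.outcome])
        return {k: mean(v) for k, v in by_level.items()}, grand

    def fit(self, rows):
        self.codes_, self.default_ = self._learn(rows)
        return self  # returning self allows fit(...).transform(...) chaining

    def transform(self, rows):
        if self.codes_ is None:
            raise RuntimeError("call fit() or fit_transform() first")
        return [self.codes_.get(r[self.var], self.default_) for r in rows]

    def fit_transform(self, rows):
        # Keep a full-data fit for later transform() calls on new data.
        self.fit(rows)
        # Code each training row using only the rows outside its own fold.
        out = [None] * len(rows)
        for fold in range(self.n_folds):
            others = [r for i, r in enumerate(rows) if i % self.n_folds != fold]
            codes, grand = self._learn(others)
            for i, r in enumerate(rows):
                if i % self.n_folds == fold:
                    out[i] = codes.get(r[self.var], grand)
        return out


# Toy data: every level appears exactly once, so it carries no real signal.
train = [{"x": f"id{i}", "y": float(i % 2)} for i in range(6)]

# Pattern to avoid: fit and transform on the same rows copies y into the code.
naive = ToyMeanCoder("x", "y").fit(train).transform(train)

# Intent 1: one training set used for both coding and modeling.
coder = ToyMeanCoder("x", "y")
cross = coder.fit_transform(train)

# Intent 2: separate calibration and model-training sets.
split = ToyMeanCoder("x", "y").fit(train[:3]).transform(train[3:])

# Intent 3: apply the fitted coder to new application data.
applied = coder.transform([{"x": "id1"}, {"x": "never_seen"}])

print(naive)  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0], identical to y
print(cross)  # [0.5, 0.5, 0.5, 0.5, 0.5, 0.5], no leak
```

In this toy, the naive codes reproduce the outcome exactly even though the variable is pure noise, which is the nested model bias the three-way choice is meant to help avoid. The cross-validated and split-data codes fall back to a grand mean and so correctly look uninformative.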
We have also regenerated the current task-oriented vtreat documentation to demonstrate the new nested bias warning feature:
- Regression: R regression example, Python regression example.
- Classification: R classification example, Python classification example.
- Unsupervised data preparation: R unsupervised example, Python unsupervised example.
- Multinomial classification: R multinomial classification example, Python multinomial classification example.
And we now have new versions of these documents showing the sklearn $fit_transform() style notation in R.
- Regression: R $fit_transform() regression example.
- Classification: R $fit_transform() classification example.
- Unsupervised data preparation: R $fit_transform() unsupervised example.
- Multinomial classification: R $fit_transform() multinomial classification example.
The original R interface is going to remain the standard interface for vtreat. It is more idiomatic R, and is taught in chapter 8 of Zumel, Mount; Practical Data Science with R, 2nd Edition, Manning 2019.
In contrast, the $fit_transform() notation will always just be an adaptor over the primary R interface. However, there is a lot to be learned from sklearn’s organization and ideas, so we felt we would make their naming convention available as a way of showing appreciation and giving credit. Some more of my notes on the grace of the sklearn interface as a good way to manage state and generative effects (see Brendan Fong, David I. Spivak; An Invitation to Applied Category Theory, Cambridge University Press, 2019) can be found here.
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.