We have a new Win Vector data science article to share:
Cross-Methods are a Leak/Variance Trade-OffJohn Mount (Win Vector LLC), Nina Zumel (Win Vector LLC) March 10, 2020
We work some exciting examples of when cross-methods (cross validation, and also cross-frames) work, and when they do not work.
Cross-methods such as cross-validation, and cross-prediction are effective tools for many machine learning, statisitics, and data science related applications. They are useful for parameter selection, model selection, impact/target encoding of high cardinality variables, stacking models, and super learning. They are more statistically efficient than partitioning training data into calibration/training/holdout sets, but do not satisfy the full exchangeability conditions that full hold-out methods have. This introduces some additional statistical trade-offs when using cross-methods, beyond the obvious increases in computational cost.
Specifically, cross-methods can introduce an information leak into the modeling process. This information leak will be the subject of this post.
The entire article is a JupyterLab notebook, and can be found here. Please check it out, and share it with your favorite statisticians, machine learning researchers, and data scientists.
Categories: Pragmatic Data Science Tutorials
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.