Menu Home

Use the Same Cross-Plan Between Steps

Students have asked me if it is better to use the same cross-validation plan in each step of an analysis or to use different ones. Our answer is: unless you are coordinating the many plans in some way (such as 2-way independence or some sort of combinatorial design) it is generally better to use one plan. That way minor information leaks at each stage explore less of the output variations, and don’t combine into worse leaks.

I am now sharing a note that works all of the above as specific examples: “Multiple Split Cross-Validation Data Leak” (a follow-up to our larger article “Cross-Methods are a Leak/Variance Trade-Off”).

Categories: data science Expository Writing Practical Data Science Pragmatic Data Science Pragmatic Machine Learning Statistics Tutorials

Tagged as:


Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

%d bloggers like this: