Students have asked me whether it is better to use the same cross-validation plan in each step of an analysis or a different plan at each step. My answer is: unless you are coordinating the many plans in some way (such as 2-way independence or some sort of combinatorial design), it is generally better to use one plan. That way the minor information leaks at each stage explore fewer of the possible output variations, and don’t combine into worse leaks.
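To make "use one plan" concrete, here is a minimal sketch of building a single fold assignment once and passing it to two stages of an analysis. Everything in it is an assumption made for illustration: the helper names (`make_plan`, `cross_fit_encoding`, `cross_val_mse`), the synthetic data, and the choice of stages (an out-of-fold target encoding followed by an out-of-fold error estimate). It is not the code from the note mentioned below, and it does not by itself demonstrate the size of any leak.

```python
import random
from statistics import mean


def make_plan(n_rows, n_folds, seed):
    """Hypothetical helper: list of (train_idx, test_idx) pairs partitioning the rows."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    plan = []
    for i, test in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        plan.append((sorted(train), sorted(test)))
    return plan


def cross_fit_encoding(levels, y, plan):
    """Stage 1: encode each row's level by the mean of y over that fold's training rows."""
    encoded = [None] * len(y)
    for train, test in plan:
        grand_mean = mean(y[j] for j in train)
        by_level = {}
        for j in train:
            by_level.setdefault(levels[j], []).append(y[j])
        for j in test:
            seen = by_level.get(levels[j])
            # Fall back to the training grand mean for levels unseen in training.
            encoded[j] = mean(seen) if seen else grand_mean
    return encoded


def cross_val_mse(x, y, plan):
    """Stage 2: out-of-fold mean squared error of a one-variable least-squares fit."""
    sq_err = []
    for train, test in plan:
        mx = mean(x[j] for j in train)
        my = mean(y[j] for j in train)
        sxx = sum((x[j] - mx) ** 2 for j in train)
        sxy = sum((x[j] - mx) * (y[j] - my) for j in train)
        slope = sxy / sxx if sxx > 0 else 0.0
        for j in test:
            pred = my + slope * (x[j] - mx)
            sq_err.append((y[j] - pred) ** 2)
    return mean(sq_err)


# Synthetic data (an assumption): a categorical column unrelated to a noise outcome.
data_rng = random.Random(2024)
n = 60
levels = [data_rng.choice("abcdef") for _ in range(n)]
y = [data_rng.gauss(0.0, 1.0) for _ in range(n)]

# One plan, built once and reused by both stages.
plan = make_plan(n, n_folds=5, seed=1)
encoded = cross_fit_encoding(levels, y, plan)
mse_shared = cross_val_mse(encoded, y, plan)

# The alternative being asked about: an independently drawn plan for the second stage.
other_plan = make_plan(n, n_folds=5, seed=2)
mse_separate = cross_val_mse(encoded, y, other_plan)
```

The design point is only that the plan is a value created once and handed to every stage, rather than something each stage re-draws for itself. Comparing `mse_shared` and `mse_separate` over many random seeds would be one way to explore the question on your own data.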
I am now sharing a note that works through all of the above with specific examples: “Multiple Split Cross-Validation Data Leak” (a follow-up to our larger article “Cross-Methods are a Leak/Variance Trade-Off”).
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.