Menu Home

What is a Good Test Set Size?

Introduction Teaching basic data science, machine learning, and statistics is great due to the questions. Students ask brilliant questions, as they see what holes are present in your presentation and scaffolding. The students are not yet conditioned to ask only what you feel is easy to answer or present. They […]

Fluid use of data

Nina Zumel and I recently wrote a few article and series on best practices in testing models and data: Random Test/Train Split is not Always Enough How Do You Know if Your Data Has Signal? How do you know if your model is going to work? A Simpler Explanation of […]

A deeper theory of testing

In some of my recent public talks (for example: here and here) I have mentioned a desire for “a deeper theory of fitting and testing.” I thought I would expand on what I meant by this. In this note I am going to cover a lot of different topics to […]

Random Test/Train Split is not Always Enough

Most data science projects are well served by a random test/train split. In our book Practical Data Science with R we strongly advise preparing data and including enough variables so that data is exchangeable, and scoring classifiers using a random test/train split. With enough data and a big enough arsenal […]