Nina and I are proud to share our lecture: “Prepping Data for Analysis using R” from ODSC West 2015.
It runs about 90 minutes and covers a lot of the theory behind the vtreat data preparation library.
We also have a GitHub repository including all the lecture materials here.
Nina’s preview still (shown below) is one of my favorite slides. I think it really lays out how to think about novel levels (string values encountered during scoring that were not seen during training) in a nice, problem-driven way before getting into messy math (such as unknown frequency estimation).
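To make the novel-level problem concrete, here is a minimal sketch in plain Python. It is an illustration only, not vtreat's implementation or API: the helper names `fit_levels` and `encode` and the color data are hypothetical. It assumes a simple indicator (one-hot) encoding in which a level never seen during training maps to all zeros, which is one common way to keep scoring from failing on unseen strings.

```python
def fit_levels(train_values):
    """Record the distinct levels seen during training, in sorted order."""
    return sorted(set(train_values))


def encode(value, levels):
    """Return one indicator per training level.

    A novel level (one not present in `levels`) produces all zeros
    instead of raising an error.
    """
    return [1 if value == level else 0 for level in levels]


# Hypothetical training data for a single categorical variable.
train = ["red", "green", "red", "blue"]
levels = fit_levels(train)

# A level seen during training sets exactly one indicator.
seen = encode("green", levels)

# "purple" never appeared in training, so it is a novel level.
novel = encode("purple", levels)
is_novel = "purple" not in levels
```

The all-zeros choice is only the simplest option; the lecture goes on to the harder question of estimating how often such unseen values should be expected.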
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.