Introduction A common question in analytics, statistics, and data science projects is: how much data do you need? This question actually has very specific and clear answers! A first good answer is “it is good to have a lot.” Let’s dig deeper and get some additional more detailed quantitative answers. […]

Estimated reading time: 10 minutes

Introduction The goal of this note is to try and characterize excess generalization error: how much worse your model works in production versus how well it appeared to work during training. The clarifying point is excess generalization error (also called overfit) isn’t so much the model performing unexpectedly poorly on […]

Estimated reading time: 13 minutes

Introduction I want to spend some time thinking out loud about linear regression. As a data science consultant and teacher I spend a lot of time using linear regression and teaching linear regression. I have found each of these pursuits can degenerate into mere doctrine or instructions. “do this,” “expect […]

Estimated reading time: 12 minutes

I recently shared a bit of the history of The Science of Data Analysis. I thought I would follow that up with a quick chalk talk titled “What is Statistics?” (link)

Estimated reading time: 21 seconds

Video of our PyData Los Angeles 2019 talk Preparing Messy Real World Data for Supervised Machine Learning is now available. In this talk describe how to use vtreat, a package available in R and in Python, to correctly re-code real world data for supervised machine learning tasks. Please check it […]

Estimated reading time: 32 seconds

I was working with our copy editor on Appendix A of Practical Data Science with R, 2nd Edition; Zumel, Mount; Manning 2019, and ran into this little point (unfortunately) buried in the back of the book. In our opinion the R ecosystem is the fastest path to substantial data science, […]

Estimated reading time: 34 seconds

vtreat is a DataFrame processor/conditioner that prepares real-world data for supervised machine learning or predictive modeling in a statistically sound manner. vtreat takes an input DataFrame that has a specified column called “the outcome variable” (or “y”) that is the quantity to be predicted (and must not have missing values). […]

Estimated reading time: 2 minutes

We will be speaking at the Tuesday, September 3, 2019 BARUG. If you are in the Bay Area, please come see us. Nina Zumel & John Mount Practical Data Science with R Practical Data Science with R (Zumel and Mount) was one of the first, and most widely-read books on […]

Estimated reading time: 57 seconds

We are sharing a chalk talk rehearsal on applied probability. We use basic notions of probability theory to work through the estimation of sample size needed to reliably estimate event rates. This expands basic calculations, and then moves to the ideas of: Sample size and power for rare events. Please […]

Estimated reading time: 33 seconds