Introduction: A common question in analytics, statistics, and data science projects is: how much data do you need? This question actually has very specific and clear answers! A first good answer is “it is good to have a lot.” Let’s dig deeper and get some more detailed, quantitative answers. […]

Estimated reading time: 10 minutes

Introduction: The goal of this note is to characterize excess generalization error: how much worse your model performs in production than it appeared to perform during training. The clarifying point is that excess generalization error (also called overfit) isn’t so much the model performing unexpectedly poorly on […]

Estimated reading time: 13 minutes

Introduction: I want to spend some time thinking out loud about linear regression. As a data science consultant and teacher I spend a lot of time using linear regression and teaching linear regression. I have found each of these pursuits can degenerate into mere doctrine or instructions. “do this,” “expect […]

Estimated reading time: 12 minutes

I’d like to write a bit about measuring effect sizes and Cohen’s d. Introduction: For our note let’s settle on a single simple example problem. We have two samples of real numbers a_1, …, a_n and b_1, …, b_n. All the a_i are mutually exchangeable or generated by an independent […]

Estimated reading time: 10 minutes

I recently shared a bit of the history of The Science of Data Analysis. I thought I would follow that up with a quick chalk talk titled “What is Statistics?” (link)

Estimated reading time: 21 seconds

The core of our “statistics to English translation” series is Nina Zumel’s sequence of articles: “I don’t think that means what you think it means;”
Statistics to English Translation, Part 1: Accuracy Measures
Statistics to English Translation, Part 2a: ’Significant’ Doesn’t Always Mean ’Important’
Statistics to English Translation, Part 2b: […]

Estimated reading time: 55 seconds

I am conducting another machine learning / AI bootcamp this week. Starting one of these always makes me want to get more statistical commentaries down, just in case I need one. These classes have to move fast, and also move correctly. In this case I want to write about decomposition […]

Estimated reading time: 5 minutes

When studying regression models, one of the first diagnostic plots most students learn is to plot residuals versus the model’s predictions (that is, with the predictions on the x-axis). Here’s a basic example. # build an “ideal” linear process. set.seed(34524) N = 100 x1 = runif(N) x2 = runif(N) noise […]

Estimated reading time: 9 minutes

This note is about attempting to remove the bias brought in by using sample standard deviation estimates to estimate an unknown true standard deviation of a population. We establish there is a bias, concentrate on why it is not important to remove it for reasonably sized samples, and (despite that) […]

Estimated reading time: 11 minutes

In statistical work in the age of big data we often get hung up on differences that are statistically significant (reliable enough to show up again and again in repeated measurements), but clinically insignificant (visible in aggregation, but too small to make any real difference to individuals). An example would […]

Estimated reading time: 2 minutes