A common question in analytics, statistics, and data science projects is: how much data do you need? This question actually has very specific and clear answers! A good first answer is “it is good to have a lot.” Let’s dig deeper and get some more detailed quantitative answers. […]

Estimated reading time: 10 minutes

The goal of this note is to characterize excess generalization error: how much worse your model works in production than it appeared to work during training. The clarifying point is that excess generalization error (also called overfit) isn’t so much the model performing unexpectedly poorly on […]

Estimated reading time: 13 minutes
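
As a rough illustration of the quantity described in the excerpt above, excess generalization error can be sketched as held-out error minus training error. The example below is a toy assumption, not the post's own method: the data is made-up coin-flip labels, and the "model" simply memorizes its training set, so all names here (`points`, `predict`, `error_rate`) are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: labels are pure coin flips, so there is no real signal to learn.
points = [(i, random.randint(0, 1)) for i in range(400)]
train, holdout = points[:200], points[200:]

# A deliberately overfit "model": memorize every training label,
# and fall back to the majority training label for unseen inputs.
memory = dict(train)
majority = int(sum(y for _, y in train) * 2 >= len(train))


def predict(x):
    """Return the memorized label if x was seen in training, else the majority label."""
    return memory.get(x, majority)


def error_rate(rows):
    """Fraction of rows where the prediction disagrees with the label."""
    return sum(predict(x) != y for x, y in rows) / len(rows)


train_error = error_rate(train)      # memorization makes this exactly zero
holdout_error = error_rate(holdout)  # no signal, so this should be near one half
excess_error = holdout_error - train_error

print(f"train={train_error:.3f} holdout={holdout_error:.3f} excess={excess_error:.3f}")
```

The training error looks perfect only because the model has seen every training answer; the gap to the held-out error is the overfit being discussed.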

I want to spend some time thinking out loud about linear regression. As a data science consultant and teacher I spend a lot of time using linear regression and teaching linear regression. I have found each of these pursuits can degenerate into mere doctrine or instructions. “do this,” “expect […]

Estimated reading time: 12 minutes

I am sharing a new short data science video: Parameterized Jupyter Notebooks. It is an example from the wvpy package showing how to programmatically re-run the same notebook with many different inputs. If you are doing data science in Python, this may help you with your projects.

Estimated reading time: 24 seconds

I am sharing yet another data transform tutorial here! It is about coordinatized data, the larger theory encompassing pivot and un-pivot. The example is in Python, but we also supply a similar package for R users.

Estimated reading time: 18 seconds
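
For readers who have not met pivot and un-pivot, here is a minimal sketch of the two transforms mentioned in the excerpt above, written with plain Python dictionaries. It does not use the package the tutorial covers; the function names `unpivot` and `pivot`, the column names, and the sample records are all assumptions made for illustration.

```python
def unpivot(rows, key, value_columns, name_col="measurement", value_col="value"):
    """Wide -> long: emit one output row per (record, value column) pair."""
    return [
        {key: row[key], name_col: col, value_col: row[col]}
        for row in rows
        for col in value_columns
    ]


def pivot(rows, key, name_col="measurement", value_col="value"):
    """Long -> wide: gather all rows sharing a key back into a single record."""
    wide = {}
    for row in rows:
        record = wide.setdefault(row[key], {key: row[key]})
        record[row[name_col]] = row[value_col]
    return list(wide.values())


# Hypothetical wide table: one row per id, one column per measurement.
wide_rows = [
    {"id": 1, "height": 1.5, "weight": 60.0},
    {"id": 2, "height": 1.75, "weight": 70.0},
]

long_rows = unpivot(wide_rows, key="id", value_columns=["height", "weight"])
round_trip = pivot(long_rows, key="id")
```

In this sketch the two operations are inverses: un-pivoting and then pivoting returns the original records.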

The data algebra is a system for composing data manipulation tasks in Python. In the data algebra, operator pipelines (or even directed acyclic graphs) are the primary objects. Applying operations composes small data pipelines into larger ones. This allows the fluid specification, inspection, and sharing of data processing and data […]

Estimated reading time: 1 minute

I’ve been seeing a lot of hot takes on whether one should do data science in R or in Python. I’ll comment generally on the topic, and then add my own myopic gear-head micro benchmark. I’ll jump in: If learning the language is the big step: then you are a […]

Estimated reading time: 5 minutes

I’ve just started experimenting with the Polars data frame library in Python. I really like the programmable API it exposes. In fact, I am starting an experimental adapter from the data algebra to Polars. When this is complete, one can use the data algebra to run the same data transform […]

Estimated reading time: 46 seconds

I am excited to share my guest lecture for the Department of Statistics at the University of Illinois STAT 447: Data Science Programming Methods. And thank you to Dirk Eddelbuettel for inviting me! The talk was titled “Data Science: Street Fighting Statistics” and demonstrates two simple supervised modeling tasks in R. […]

Estimated reading time: 35 seconds

A central data science engineering problem is how to organize general data into columns for analysis. I often refer to this as denormalization, or the deliberate arranging of data so all entries of a record are in a single row in a single table. In this note I will write […]

Estimated reading time: 15 minutes
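
To make the idea of denormalization in the excerpt above concrete, here is a small sketch under assumed data: two hypothetical tables, `customers` and `orders`, are combined so that every fact about an order sits in a single row of a single table. The table and column names are invented for illustration and are not taken from the post.

```python
# Hypothetical normalized tables: customer details live apart from orders.
customers = {
    101: {"customer_name": "Ann", "region": "West"},
    102: {"customer_name": "Bob", "region": "East"},
}
orders = [
    {"order_id": 1, "customer_id": 101, "amount": 25.0},
    {"order_id": 2, "customer_id": 102, "amount": 40.0},
    {"order_id": 3, "customer_id": 101, "amount": 15.0},
]

# Denormalize: copy the matching customer columns onto each order row,
# so each record is complete without a further lookup.
denormalized = [
    {**order, **customers[order["customer_id"]]}
    for order in orders
]

for row in denormalized:
    print(row)
```

The customer details are now repeated on every order row; that redundancy is the deliberate trade made so that an analysis can read one table.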