## How Much Data Do You Need?

Introduction A common question in analytics, statistics, and data science projects is: how much data do you need? This question actually has very specific and clear answers! A first good answer is “it is good to have a lot.” Let’s dig deeper and get some additional more detailed quantitative answers. […]

## Bounding Excess Generalization Error for Linear Regression Models

Introduction The goal of this note is to try and characterize excess generalization error: how much worse your model works in production versus how well it appeared to work during training. The clarifying point is excess generalization error (also called overfit) isn’t so much the model performing unexpectedly poorly on […]

Introduction I want to spend some time thinking out loud about linear regression. As a data science consultant and teacher I spend a lot of time using linear regression and teaching linear regression. I have found each of these pursuits can degenerate into mere doctrine or instructions. “do this,” “expect […]

## Linear and Logistic Regression in Practical Data Science with R 2nd Edition

One of the chapters that we are especially proud of in Practical Data Science with R is Chapter 7, “Linear and Logistic Regression.” We worked really hard to explain the fundamental principles behind both methods in a clear and easy-to-understand form, and to document diagnostics returned by the R implementations […]

## Link Functions versus Data Transforms

In the linear regression section of our book Practical Data Science in R, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against log10(income) rather than directly against income. One obvious reason for […]

## An easy way to accidentally inflate reported R-squared in linear regression models

Here is an absolutely horrible way to confuse yourself and get an inflated reported R-squared on a simple linear regression model in R. We have written about this before, but we found a new twist on the problem (interactions with categorical variable encoding) which we would like to call out […]

## Variables can synergize, even in a linear model

Introduction Suppose we have the task of predicting an outcome y given a number of variables v1,..,vk. We often want to “prune variables” or build models with fewer than all the variables. This can be to speed up modeling, decrease the cost of producing future data, improve robustness, improve explain-ability, […]

## Can we try to make an adjustment?

In most of our data science teaching (including our book Practical Data Science with R) we emphasize the deliberately easy problem of “exchangeable prediction.” We define exchangeable prediction as: given a series of observations with two distinguished classes of variables/observations denoted “x”s (denoting control variables, independent variables, experimental variables, or […]