## The Purpose of our Data Science Chalk Talk Series

I’d like to share an introduction to my data science chalk talk series (video link, series link).

## The Science of Data Analysis

I am re-reading the great statistician John W. Tukey’s paper: Tukey, John W. “The Future of Data Analysis.” Ann. Math. Statist. 33 (1962), no. 1, pp. 1–67. doi:10.1214/aoms/1177704711, https://projecteuclid.org/euclid.aoms/1177704711. I’ve taken the liberty of pulling out some quotes that are very relevant to the usual “data science is not […]

## New Free Video Lecture: Estimating the Odds with Bayes’ Law

I am excited to share my new free video lecture: Estimating the Odds with Bayes’ Law. (link)
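As a quick taste of the lecture’s topic, here is a minimal sketch of Bayes’ law in odds form: posterior odds = prior odds × likelihood ratio. The test characteristics and prevalence below are hypothetical numbers chosen purely for illustration, not from the lecture itself.

```python
# Odds form of Bayes' law: posterior odds = prior odds * likelihood ratio.
# Hypothetical example: a test with 90% sensitivity and 95% specificity,
# applied in a population with 1% prevalence (assumed numbers).
prior_odds = 0.01 / 0.99
likelihood_ratio = 0.90 / (1 - 0.95)  # P(test+ | disease) / P(test+ | healthy)
posterior_odds = prior_odds * likelihood_ratio
posterior_prob = posterior_odds / (1 + posterior_odds)
# Even with a good test, a positive result here implies only about a 15% chance
# of disease, because the prior odds are so low.
```

Note the odds form turns the update into a single multiplication, which is much easier to do in one’s head than the usual ratio-of-sums form of Bayes’ theorem.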

## A Single Parameter Family Characterizing Probability Model Performance

We’ve been writing on the distribution density shapes expected for probability models in ROC (receiver operating characteristic) plots, double density plots, and normal/logit-normal density frameworks. I thought I would re-approach the issue with a specific family of examples.

## The Double Density Plot Contains a Lot of Useful Information

The double density plot contains a lot of useful information. This is a plot that shows the distribution of a continuous model score, conditioned on the binary categorical outcome to be predicted. As with most density plots, the y-axis is an abstract quantity called density, chosen such that the area […]
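The construction can be sketched in a few lines: estimate one density of the score for the positive examples and one for the negatives, each normalized to unit area. This is a minimal histogram-based version on synthetic scores (the score distributions below are assumed for illustration); a real double density plot would then draw both curves on shared axes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic scores: positives tend to score higher than negatives (assumed data).
pos = rng.normal(0.7, 0.15, 500)  # scores for examples with outcome = 1
neg = rng.normal(0.3, 0.15, 500)  # scores for examples with outcome = 0

bins = np.linspace(-0.2, 1.2, 41)
centers = (bins[:-1] + bins[1:]) / 2
# density=True normalizes each histogram so its area over the bins is 1:
# these are the two conditional densities of the double density plot.
pos_d, _ = np.histogram(pos, bins=bins, density=True)
neg_d, _ = np.histogram(neg, bins=bins, density=True)
width = bins[1] - bins[0]
```

Because each conditional density separately integrates to one, the plot shows the *shapes* of the two score distributions, not the class prevalences; the region where the two curves overlap is where the model has trouble separating the classes.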

## Your Lopsided Model is Out to Get You

For classification problems, I argue that one of the biggest steps you can take to improve the quality and utility of your models is to prefer models that return scores or probabilities instead of classification rules. Doing this also opens a second large opportunity for improvement: working with your domain […]

## Squeezing the Most Utility from Your Models

In a previous article we discussed why it’s a good idea to prefer probability models to “hard” classification models, and why you should delay setting “hard” classification rules as long as possible. But decisions have to be made, and eventually you will have to set that threshold. How do you […]
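One common way to set that threshold is to sweep over candidate thresholds and pick the one that maximizes expected utility under a payoff assignment for the four outcomes. Here is a minimal sketch of that idea on synthetic scores; the payoff values and score distributions are assumed for illustration only.

```python
import numpy as np

# Hypothetical payoffs (assumed for illustration): each true positive is worth
# 10 units, each false positive costs 4; true/false negatives taken as 0.
value_tp, cost_fp = 10.0, -4.0

rng = np.random.default_rng(1)
# Synthetic data: 300 positives scoring high-ish, 700 negatives scoring low-ish.
scores = np.concatenate([rng.normal(0.65, 0.15, 300),
                         rng.normal(0.35, 0.15, 700)])
y = np.concatenate([np.ones(300), np.zeros(700)])

# Sweep thresholds; total utility at each threshold is
# value_tp * (# true positives) + cost_fp * (# false positives).
thresholds = np.linspace(0.0, 1.0, 101)
utils = []
for t in thresholds:
    pred = scores >= t
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    utils.append(value_tp * tp + cost_fp * fp)
best_t = thresholds[int(np.argmax(utils))]
```

The point of delaying the thresholding step is visible here: the best threshold depends on the payoffs, so the same probability model can serve different business situations just by re-running this cheap sweep.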

## Why Working With AUC is More Powerful Than One Might Think

I am finishing up a work-note that has some really neat implications as to why working with AUC is more powerful than one might think. I think I am far enough along to share the consequences here. This started as some now-reappraised thoughts on the fallacy of thinking that knowing […]

## Why not Square Error for Classification?

Win Vector LLC has been developing and delivering a lot of “statistics, machine learning, and data science for engineers” intensives in the past few years. These are bootcamps, or workshops, designed to help software engineers become more comfortable with machine learning and artificial intelligence tools. The current thinking is: not […]

## Unrolling the ROC

In our data science teaching, we present the ROC plot (and the area under the curve of the plot, or AUC) as a useful tool for evaluating score-based classifier models, as well as for comparing multiple such models. The ROC is informative and useful, but it’s also perhaps overly concise […]
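The “unrolling” we do in teaching starts from the construction itself: sort examples by descending score, sweep the threshold, and trace out (false positive rate, true positive rate) pairs. A minimal sketch on synthetic scores (the score distributions are assumed for illustration) also exhibits the well-known fact that the resulting AUC equals the probability that a random positive outscores a random negative:

```python
import numpy as np

def roc_curve(scores, y):
    """Trace the ROC: sort by descending score, accumulate TPR and FPR."""
    order = np.argsort(-scores)
    y = y[order]
    tpr = np.concatenate([[0.0], np.cumsum(y) / y.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(1 - y) / (1 - y).sum()])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC polyline by trapezoidal integration."""
    return float(np.trapz(tpr, fpr))

rng = np.random.default_rng(2)
scores = np.concatenate([rng.normal(0.7, 0.2, 400),
                         rng.normal(0.3, 0.2, 600)])
y = np.concatenate([np.ones(400), np.zeros(600)])

fpr, tpr = roc_curve(scores, y)
a = auc(fpr, tpr)
# Rank interpretation: fraction of (positive, negative) pairs where the
# positive example gets the higher score. With no score ties this equals AUC.
pairwise = np.mean(scores[y == 1][:, None] > scores[y == 0][None, :])
```

The agreement between the geometric area and the pairwise-ranking probability is one reason the AUC summary is more informative than it first appears, though the full work-note goes further than this identity.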