Against Accuracy
Why a merely accurate classification rule may not meet your business needs, and why you should insist on a model that returns numeric scores for classification problems. (link)
I am sharing a new free video where I work through a great common argument that bounds expected excess generalization error as a ratio of model complexity (in rows) over training set size (again in rows), independent of problem dimension. (link) For more of my notes on support vector machines […]
Our book, Practical Data Science with R, just had its first anniversary! The book is doing great. If you are working with R and data, I recommend you check it out. (link)
I’d like to share an introduction to my data science chalk talk series (video link, series link).
I am re-reading the great statistician John W. Tukey’s paper: Tukey, John W. “The Future of Data Analysis.” Ann. Math. Statist. 33 (1962), no. 1, pp. 1–67. doi:10.1214/aoms/1177704711. https://projecteuclid.org/euclid.aoms/1177704711 I’ve taken the liberty of pulling out some quotes that are very relevant to the usual “data science is not […]
I am excited to share my new free video lecture: Estimating the Odds with Bayes’ Law. (link)
Introduction
We’ve been writing on the distribution density shapes expected for probability models in ROC (receiver operating characteristic) plots, double density plots, and normal/logit-normal density frameworks. I thought I would re-approach the issue with a specific family of examples.
The double density plot contains a lot of useful information. This is a plot that shows the distribution of a continuous model score, conditioned on the binary categorical outcome to be predicted. As with most density plots, the y-axis is an abstract quantity called density, picked such that the area […]
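To make the idea of a score distribution "conditioned on the outcome" concrete, here is a minimal Python sketch that computes the two curves of a double density plot as histograms. It is an illustration only, not the code from the post: the helper names (`histogram_density`, `double_density`), the synthetic beta-distributed scores, and the choice of 10 bins are all assumptions made for this example. It relies on the standard property that a density is scaled so the total area under each curve is 1.

```python
import random


def histogram_density(values, lo=0.0, hi=1.0, bins=10):
    """Return per-bin density heights, scaled so sum(height * bin_width) is 1."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp the index so that v == hi falls in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(values)
    return [c / (n * width) for c in counts]


def double_density(scores, outcomes, bins=10):
    """Estimate the score density separately for each outcome class."""
    by_class = {}
    for s, y in zip(scores, outcomes):
        by_class.setdefault(y, []).append(s)
    return {y: histogram_density(vals, bins=bins) for y, vals in by_class.items()}


# Hypothetical data: positives tend to score high, negatives tend to score low.
rng = random.Random(2024)
outcomes = [rng.random() < 0.3 for _ in range(2000)]
scores = [rng.betavariate(4, 2) if y else rng.betavariate(2, 4) for y in outcomes]

densities = double_density(scores, outcomes)
bin_width = 1.0 / 10
areas = {}
for y, heights in sorted(densities.items()):
    # Each class's curve has area 1, regardless of how common the class is.
    areas[y] = sum(h * bin_width for h in heights)
    print(y, [round(h, 2) for h in heights], "area =", round(areas[y], 6))
```

Because each curve is normalized on its own, the plot shows how well the score separates the classes but hides class prevalence: the rarer class gets a curve with the same area as the common one.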
For classification problems I argue one of the biggest steps you can take to improve the quality and utility of your models is to prefer models that return scores or return probabilities instead of classification rules. Doing this also opens a second large opportunity for improvement: working with your domain […]
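A small Python sketch can illustrate why a score is more useful than a fixed classification rule: with scores in hand, the decision threshold can be re-chosen whenever the business costs change, without refitting the model. Everything here is hypothetical and chosen for illustration: the ten scores and outcomes, the cost values, and the helper names `expected_cost` and `best_threshold` are assumptions, not taken from the post.

```python
def expected_cost(scores, outcomes, threshold, cost_fp, cost_fn):
    """Average cost per example when predicting positive for score >= threshold."""
    total = 0.0
    for s, y in zip(scores, outcomes):
        predicted = s >= threshold
        if predicted and not y:
            total += cost_fp  # false positive
        elif y and not predicted:
            total += cost_fn  # false negative
    return total / len(scores)


def best_threshold(scores, outcomes, cost_fp, cost_fn):
    """Pick the candidate threshold with the lowest average cost on this data."""
    # Every observed score is a candidate; infinity means "never predict positive".
    candidates = sorted(set(scores)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: expected_cost(scores, outcomes, t, cost_fp, cost_fn),
    )


# Hypothetical model scores and true outcomes.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.75, 0.80, 0.95]
outcomes = [False, False, False, True, False, False, True, True, False, True]

# The same scores support different decision rules under different costs.
costly_false_alarms = best_threshold(scores, outcomes, cost_fp=5.0, cost_fn=1.0)
balanced = best_threshold(scores, outcomes, cost_fp=1.0, cost_fn=1.0)
costly_misses = best_threshold(scores, outcomes, cost_fp=1.0, cost_fn=5.0)

print("false alarms expensive:", costly_false_alarms)
print("balanced costs:", balanced)
print("misses expensive:", costly_misses)
```

A model that only returned a hard yes/no label would have baked in one of these thresholds at training time. In practice the threshold should be chosen on held-out data rather than on the data used to fit the model.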