I recently came across the thoughtful article “On Moving from Statistics to Machine Learning, the Final Stage of Grief”. It makes some good points, and is worth the read. However, it also reminded me of the unexamined claim “data science is statistics done wrong.” Frankly this is not the case, though it might be better for statisticians if it were. I’d like to touch on just one of the examples here.
“On Moving from Statistics to Machine Learning, the Final Stage of Grief” includes the following apocalyptic (Marvel Comics actually) meme about Ridge Regression / Tikhonov regularization:
It is clever and funny, but it promotes the false story that statisticians use only unbiased methods and estimators, and that this unique discipline is a core part of their value.
Like everybody else, statisticians do in fact from time to time touch the ground and use biased techniques and estimators:
- The standard deviation. Right after chewing you out for not applying a useless Bessel correction to estimate the variance of a variable on a large data set, they, like the rest of us, take the square root of the estimated variance to estimate the standard deviation. Guess what: that standard deviation estimate isn’t unbiased, even if the variance estimate was (notes here). The square root is concave, so by Jensen’s inequality the square root of an unbiased variance estimate underestimates the standard deviation on average. Definitely a case of something always worth criticizing others on, and never worth examining oneself about.
- Model selection and variable selection. Selecting models or variables, even from unbiased estimates, is in fact a biased procedure (see Freedman’s paradox and the usual multiple comparisons issues). Yet this is routinely done, and often without any sort of Bonferroni correction.
- Bayesian methods with proper priors. Bayesian estimates with proper priors tend to be biased towards those priors. So if the priors are good, these results are biased towards the right answer. Not a bad thing.
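The standard deviation item above is easy to check by simulation. The following is a minimal sketch, assuming normally distributed data and a deliberately small sample size so the effect is visible; the variable names, seed, and trial count are illustrative choices, not anything from the original article. It averages the Bessel-corrected variance estimate and its square root over many samples and compares the latter to the known theoretical factor for normal data (about 0.94 when n = 5).

```python
import math
import random

random.seed(2019)  # arbitrary seed, only for repeatability

n = 5              # small sample size, so the bias is easy to see
n_trials = 50_000  # number of simulated samples (illustrative choice)
true_sd = 1.0      # the data are drawn with this standard deviation

var_estimates = []
sd_estimates = []
for _ in range(n_trials):
    sample = [random.gauss(0.0, true_sd) for _ in range(n)]
    mean = sum(sample) / n
    # Bessel-corrected (unbiased) variance estimate: divide by n - 1.
    var_hat = sum((v - mean) ** 2 for v in sample) / (n - 1)
    var_estimates.append(var_hat)
    # The usual standard deviation estimate: square root of the above.
    sd_estimates.append(math.sqrt(var_hat))

mean_var_hat = sum(var_estimates) / n_trials
mean_sd_hat = sum(sd_estimates) / n_trials

# Theoretical ratio E[sd_hat] / true_sd for normal data (often called c4).
c4 = math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

print("average variance estimate:", mean_var_hat, "(target", true_sd ** 2, ")")
print("average sd estimate:      ", mean_sd_hat, "(target", true_sd, ")")
print("theoretical E[sd_hat]:    ", c4 * true_sd)
```

The averaged variance estimate should land close to the true variance, while the averaged standard deviation estimate should sit visibly below the true standard deviation. The gap shrinks as n grows, which is the same reason the Bessel correction itself stops mattering on large data sets.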
So back to the meme.
“On Moving from Statistics to Machine Learning, the Final Stage of Grief” made the very good point that the concerns of estimating “ŷ” are different from those of estimating “β̂”. So one can always make hay by applying the alternate criteria as an external criticism.
But let’s take the view of estimating β̂.
The portrayed Ridge Regression / Tikhonov regularization isn’t wrong. In fact, under Gaussian noise and a zero-mean Gaussian prior on the coefficients, it is exactly a point estimate (the mode, and in this case also the mean) of the posterior distribution of the unknown parameter. Treat the estimate distributionally and you have a full Bayesian posterior estimate. This gives you interpretability and even a model of uncertainty. When the prior distribution assumptions are met, it is in fact the exact posterior distribution: if the parameter had the assumed prior distribution then, conditioned on the observations, we have a description of the actual posterior distribution. So the inference is, from the point of view of priors/posteriors, fully correct.
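To make the ridge-as-posterior reading concrete, here is a minimal sketch in Python. It assumes the simplest possible setting: one predictor, no intercept, Gaussian noise with known standard deviation `sigma`, and a zero-mean Gaussian prior with standard deviation `tau` on the coefficient. All names and numeric values are hypothetical choices for illustration. The sketch computes the closed-form ridge estimate with penalty `lam = sigma**2 / tau**2`, then builds the posterior by brute force on a grid and compares the two.

```python
import math
import random

random.seed(7)  # arbitrary seed, only for repeatability

# Hypothetical setup: y = true_beta * x + Gaussian noise.
true_beta = 0.5
sigma = 1.0   # noise standard deviation (assumed known)
tau = 0.5     # prior standard deviation: beta ~ Normal(0, tau^2)
n = 20

x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [true_beta * xi + random.gauss(0.0, sigma) for xi in x]

sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))

# Ridge: minimize sum((y - beta*x)^2) + lam * beta^2.
lam = sigma ** 2 / tau ** 2
beta_ols = sxy / sxx            # unpenalized least squares, for comparison
beta_ridge = sxy / (sxx + lam)  # closed-form ridge solution


def log_posterior(beta):
    """Unnormalized log posterior: Gaussian likelihood plus Gaussian prior."""
    log_lik = -sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y)) / (2 * sigma ** 2)
    log_prior = -(beta ** 2) / (2 * tau ** 2)
    return log_lik + log_prior


# Brute-force posterior on a grid over [-3, 3], with no closed form used.
grid = [-3.0 + 0.001 * i for i in range(6001)]
log_post = [log_posterior(b) for b in grid]
max_lp = max(log_post)
weights = [math.exp(lp - max_lp) for lp in log_post]
total = sum(weights)

post_mean = sum(b * w for b, w in zip(grid, weights)) / total
post_mode = grid[weights.index(max(weights))]
post_sd = math.sqrt(sum((b - post_mean) ** 2 * w for b, w in zip(grid, weights)) / total)

# Closed-form posterior standard deviation for this Gaussian model.
closed_form_sd = sigma / math.sqrt(sxx + lam)

print("OLS estimate:        ", beta_ols)
print("ridge estimate:      ", beta_ridge)
print("grid posterior mean: ", post_mean)
print("grid posterior mode: ", post_mode)
print("grid posterior sd:   ", post_sd, "closed form:", closed_form_sd)
```

The ridge estimate should agree with both the grid posterior mean and the grid posterior mode up to the grid resolution, and it is pulled toward the prior mean of zero relative to the least squares estimate. The posterior standard deviation is the "model of uncertainty" that comes along for free once the penalty is read as a prior.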
And that lets us get back to our article title: data science is not “statistics done wrong.” Data science has a different set of concerns and trade-offs, and a different operating regime:
- Effective on large data sets, in preference to correct on small data sets.
- Prediction, instead of inference.
- Many models, instead of few models.
Each of these implies different issues, abilities, and concerns.
If anything, data science is operations research done wrong! Or in some cases, done correctly.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.