I recently came across the thoughtful article “On Moving from Statistics to Machine Learning, the Final Stage of Grief”. It makes some good points and is worth the read. However, it also reminded me of the unexamined claim “data science is statistics done wrong.” Frankly, this is not the case, […]
In the previous article in this series, we showed that common ensemble models like random forest and gradient boosting are uncalibrated: they are not guaranteed to estimate aggregates or rollups of the data in an unbiased way. However, they can be preferable to calibrated models such as linear or generalized […]
While reading Dr. Nina Zumel’s excellent note on bias in common ensemble methods, I ran the examples to see the effects she described (and I think it is very important that she is establishing the issue, prior to discussing mitigation). In doing that I ran into one more avoidable but strange […]
In our previous article, we showed that generalized linear models are unbiased, or calibrated: they preserve the conditional expectations and rollups of the training data. A calibrated model is important in many applications, particularly when financial data is involved. However, when making predictions on individuals, a biased model may […]
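To make "preserves rollups" concrete, here is a minimal sketch of the simplest case: a one-variable least-squares line with an intercept, fit in plain Python. The data values and the helper name `fit_line` are made up for illustration and are not taken from the article. Because the fit includes an intercept, the residuals sum to zero, so the predictions add up to the same total as the training outcomes.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ intercept + slope * x (hypothetical helper)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up training data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.5, 6.9]

intercept, slope = fit_line(xs, ys)
preds = [intercept + slope * x for x in xs]

# The rollup (total) of the predictions matches the rollup of the outcomes,
# up to floating-point error.
total_actual = sum(ys)
total_predicted = sum(preds)
print(total_actual, total_predicted)
```

This only checks the grand total on the training data; checking rollups within groups would need indicator variables for those groups in the model.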
This note is just a quick follow-up to our last note on correcting the bias in estimated standard deviations for binomial experiments.
This note is about attempting to remove the bias introduced by using sample standard deviation estimates to estimate an unknown true standard deviation of a population. We establish there is a bias, concentrate on why it is not important to remove it for reasonably sized samples, and (despite that) […]
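As a rough illustration of the size of this bias, the sketch below computes the exact expected value of the sample standard deviation (with the usual n-1 denominator) for n draws of a 0/1 variable, by summing over every possible count of ones. The function names and the choices of n and p are assumptions made for this example, not the note's own code.

```python
from math import comb, sqrt


def expected_sample_sd(n, p):
    """Exact E[s] for n Bernoulli(p) draws, where s uses the n-1 denominator."""
    total = 0.0
    for k in range(n + 1):
        # Probability of seeing exactly k ones in n draws.
        prob = comb(n, k) * p**k * (1 - p) ** (n - k)
        # Sample standard deviation of a 0/1 sample containing k ones.
        s = sqrt(k * (n - k) / (n * (n - 1)))
        total += prob * s
    return total


def true_sd(p):
    """Population standard deviation of a Bernoulli(p) variable."""
    return sqrt(p * (1 - p))


# Illustrative values only: the shortfall E[s] - sd is negative and shrinks with n.
for n in (5, 20, 100):
    print(n, expected_sample_sd(n, 0.5) - true_sd(0.5))
```

The expected value falls below the true standard deviation even though the variance estimate itself is unbiased, since the square root is concave; the gap closes quickly as n grows.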
Continuing our series of reading out loud from a single page of a statistics book, we look at page 224 of the 1972 Dover edition of Leonard J. Savage’s “The Foundations of Statistics.” On this page we are treated to an example attributed to Leo A. Goodman in 1953 that […]
Elon Musk’s writing about a Tesla battery fire reminded me of some of the math related to trying to estimate the rate of a rare event from a single occurrence of the event (plus many non-event occurrences). In this article we work through some of the ideas.
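To give a flavor of the calculation, here is a small sketch of a few standard ways to turn "one event in n trials" into a rate estimate. The trial count of 1000 is a hypothetical number chosen for illustration, not data from the article, and the function names are ours. It shows the raw frequency, two Beta-prior posterior means, and an exact one-sided 95% upper bound found by bisection on the binomial CDF.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def rate_estimates(n_trials, n_events=1):
    """Point estimates of the event rate under three common conventions."""
    return {
        # Observed frequency (maximum likelihood estimate).
        "mle": n_events / n_trials,
        # Posterior mean under a uniform Beta(1, 1) prior.
        "uniform_prior": (n_events + 1) / (n_trials + 2),
        # Posterior mean under the Jeffreys Beta(0.5, 0.5) prior.
        "jeffreys_prior": (n_events + 0.5) / (n_trials + 1),
    }


def upper_bound(n_trials, n_events=1, alpha=0.05):
    """Largest rate p for which seeing <= n_events events has probability >= alpha."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        # The CDF decreases as p grows, so move right while it is still above alpha.
        if binom_cdf(n_events, n_trials, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


n = 1000  # hypothetical number of trials, one of which was an event
print(rate_estimates(n))
print(upper_bound(n))
```

With a single observed event, the point estimates differ noticeably from each other and the upper bound sits several times above the raw frequency, which is the main reason one occurrence says so little about the underlying rate.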