Nina Zumel just completed an excellent short sequence of articles on picking optimal utility thresholds to convert a continuous model score for a classification problem into a deployable classification rule: "Squeezing the Most Utility from Your Models" and "Estimating Uncertainty of Utility Curves". This is very compatible with our advice to […]

Estimated reading time: 1 minute
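
The threshold-picking idea can be sketched in a few lines: score every candidate threshold by the total utility of the resulting classification rule, then keep the best. This is a minimal brute-force illustration, not Zumel's method, and the utility values below are made-up numbers.

```python
def utility_at_threshold(scores, labels, threshold,
                         true_pos_value=100.0, false_pos_cost=-5.0,
                         false_neg_cost=-50.0, true_neg_value=0.0):
    """Total utility of the rule 'classify positive when score >= threshold'.
    The four utility values are illustrative, not from the articles."""
    total = 0.0
    for s, y in zip(scores, labels):
        if s >= threshold:
            total += true_pos_value if y else false_pos_cost
        else:
            total += false_neg_cost if y else true_neg_value
    return total

def best_threshold(scores, labels):
    # brute force: the observed scores themselves are the only
    # thresholds that change the classification rule
    return max(sorted(set(scores)),
               key=lambda t: utility_at_threshold(scores, labels, t))
```

With asymmetric costs like these, the best cutoff is generally not 0.5, which is the whole point of tuning the threshold to the utility.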

What we’ve got here is failure to communicate

Suppose I were to say: “any natural number can be written uniquely, up to order, as a, possibly empty, finite product of prime number(s).” This seems possibly correct, and possibly even careful. Though, one may have to look up the terms (such […]

Estimated reading time: 13 minutes
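
The care in that statement can be checked mechanically. Here is a small factoring sketch (mine, not from the post): 1 gets the empty product the statement allows for, while 0 shows why "natural number" is exactly the sort of term one has to look up.

```python
def prime_factorization(n):
    """Return the sorted list of primes whose product is n.
    n = 1 yields the empty list (the 'possibly empty product').
    0 cannot be written as a product of primes, so whether the
    statement is true depends on whether 'natural' includes 0."""
    if n < 1:
        raise ValueError("unique factorization applies to positive integers only")
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors
```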

Here is an incredibly clear, but unfortunately gruesome, example of a variation of Bayes’ Law. A good teachable point. Consider the recent CDC article “Community and Close Contact Exposures Associated with COVID-19 Among Symptomatic Adults ≥18 Years in 11 Outpatient Health Care Facilities.” It states: Adults with positive SARS-CoV-2 test […]

Estimated reading time: 10 minutes
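
For readers who want the mechanics, the variation of Bayes' Law in question expands the denominator by the law of total probability. A one-function sketch (the numbers used to exercise it are invented, not from the CDC article):

```python
def bayes_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """P(A | B) = P(B | A) P(A) / P(B), where
    P(B) = P(B | A) P(A) + P(B | not A) (1 - P(A))."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)
    return p_b_given_a * p_a / p_b
```

Note how a low prior `p_a` can keep the posterior small even when `p_b_given_a` is large; that reversal of intuition is what makes examples like this teachable.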

I am working on a promising new series of notes: common data science fallacies and pitfalls. (Probably still looking for a good name for the series!) I thought I would share a few thoughts on it, and hopefully not jinx it too badly.

Estimated reading time: 4 minutes

In our data science teaching, we present the ROC plot (and the area under the curve of the plot, or AUC) as a useful tool for evaluating score-based classifier models, as well as for comparing multiple such models. The ROC is informative and useful, but it’s also perhaps overly concise […]

Estimated reading time: 12 minutes
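
One concrete reading of the AUC: it is the probability that a randomly chosen positive example outscores a randomly chosen negative one, with ties counting one half. A direct, if quadratic, sketch of that definition (an illustration of the statistic, not the post's code):

```python
def auc(scores, labels):
    """AUC as the win rate of positives over negatives (ties score 0.5).
    Equivalent to the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

This pairwise view makes clear what the single AUC number hides: it says nothing about which thresholds achieve which trade-offs, which is one reason the ROC plot itself carries more information.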

One of my favorite mathematical anecdotes is the following story that Gian-Carlo Rota told about Solomon Lefschetz: He [Solomon Lefschetz] liked to repeat, as an example of mathematical pedantry, the story of one of E. H. Moore’s visits to Princeton, when Moore started a lecture by saying, “Let a be […]

Estimated reading time: 10 minutes

Students have asked me if it is better to use the same cross-validation plan in each step of an analysis or to use different ones. Our answer is: unless you are coordinating the many plans in some way (such as 2-way independence or some sort of combinatorial design) it is […]

Estimated reading time: 54 seconds
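
A sketch of what "different plans" can mean in practice: give each analysis step its own fold assignment drawn from its own random state (function and variable names here are mine, for illustration only).

```python
import random

def cv_plan(n_rows, n_folds, rng):
    """Assign each row a fold label; shuffling makes the plan random."""
    folds = [i % n_folds for i in range(n_rows)]
    rng.shuffle(folds)
    return folds

# independent plans for two analysis steps: separate random states
plan_step1 = cv_plan(10, 5, random.Random(1))
plan_step2 = cv_plan(10, 5, random.Random(2))
```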

Regularization is a way of avoiding overfit by restricting the magnitude of model coefficients (or in deep learning, node weights). A simple example of regularization is the use of ridge or lasso regression to fit linear models in the presence of collinear variables or (quasi-)separation. The intuition is that smaller […]

Estimated reading time: 14 minutes
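
The shrinkage effect is easy to see in the one-variable, no-intercept case, where ridge regression has a closed form. A toy sketch of that special case (not a replacement for a real solver):

```python
def ridge_slope(xs, ys, lam):
    """One-variable ridge regression with no intercept:
    minimizes sum((y - b*x)^2) + lam * b^2,
    whose closed form is b = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)
```

Increasing `lam` pulls the coefficient toward zero, which is exactly the "restricting the magnitude of model coefficients" described above; `lam = 0` recovers ordinary least squares.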

In 1876 A. Légé & Co., 20 Cross Street, Hatton Gardens, London completed the first “tide calculating machine” for William Thomson (later Lord Kelvin) (ref).

[Figure: Thomson’s (Lord Kelvin) First Tide Predicting Machine, 1876]

The results were plotted on the paper cylinders, and one literally “turned the crank” to perform the […]

Estimated reading time: 6 minutes

In the linear regression section of our book Practical Data Science in R, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against log10(income) rather than directly against income. One obvious reason for […]

Estimated reading time: 13 minutes
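
A toy version of the log-transform step (the education predictor and income values below are invented for illustration; the book's example uses real demographic data): fit ordinary least squares against log10(income), then back-transform predictions with `10 ** x`, so model errors are multiplicative rather than additive.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# made-up data: income doubles every two years of education,
# so it is exactly linear on the log10 scale
educ = [10, 12, 14, 16]
income = [20_000, 40_000, 80_000, 160_000]

a, b = fit_line(educ, [math.log10(v) for v in income])
predicted_income = 10 ** (a + b * 18)  # back-transform to dollars
```

On the raw income scale this relationship is strongly curved and the high incomes dominate the fit; on the log scale it is a straight line, which is one intuition for why the book regresses against log10(income).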