## When Cross-Validation is More Powerful than Regularization

Regularization is a way of avoiding overfit by restricting the magnitude of model coefficients (or in deep learning, node weights). A simple example of regularization is the use of ridge or lasso regression to fit linear models in the presence of collinear variables or (quasi-)separation. The intuition is that smaller […]

## Lord Kelvin, Data Scientist

In 1876 A. Légé & Co., 20 Cross Street, Hatton Gardens, London completed the first “tide calculating machine” for William Thomson (later Lord Kelvin) (ref). Thomson’s (Lord Kelvin) First Tide Predicting Machine, 1876 The results were plotted on the paper cylinders, and one literally “turned the crank” to perform the […]

## Link Functions versus Data Transforms

In the linear regression section of our book Practical Data Science in R, we use the example of predicting income from a number of demographic variables (age, sex, education and employment type). In the text, we choose to regress against log10(income) rather than directly against income. One obvious reason for […]

## On indexing operators and composition

In this article I will discuss array indexing, operators, and composition in depth. If you work through this article you should end up with a very deep understanding of array indexing and the deep interpretation available when we realize indexing is an instance of function composition (or an example of […]

## Teaching pivot / un-pivot

Authors: John Mount and Nina Zumel Introduction In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivot. One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” […]

## Visualizing relational joins

I want to discuss a nice series of figures used to teach relational join semantics in R for Data Science by Garrett Grolemund and Hadley Wickham, O’Reilly 2016. Below is an example from their book illustrating an inner join: Please read on for my discussion of this diagram and teaching […]

## Coordinatized Data: A Fluid Data Specification

Authors: John Mount and Nina Zumel. Introduction It has been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion to and from row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting). Boris Artzybasheff illustration Real trust and […]

## Laplace noising versus simulated out of sample methods (cross frames)

Nina Zumel recently mentioned the use of Laplace noise in “count codes” by Misha Bilenko (see here and here) as a known method to break the overfit bias that comes from using the same data to design impact codes and fit a next level model. It is a fascinating method […]

## On calculating AUC

Recently Microsoft Data Scientist Bob Horton wrote a very nice article on ROC plots. We expand on this a bit and discuss some of the issues in computing “area under the curve” (AUC).

## Relative error distributions, without the heavy tail theatrics

Nina Zumel prepared an excellent article on the consequences of working with relative error distributed quantities (such as wealth, income, sales, and many more) called “Living in A Lognormal World.” The article emphasizes that if you are dealing with such quantities you are already seeing effects of relative error distributions […]