# Author Archives

### Nina Zumel

Data scientist with Win Vector LLC. I also dance, read ghost stories and folklore, and sometimes blog about it all.

Many people who work with data are familiar with Principal Components Analysis (PCA): it’s a linear transformation technique that’s commonly used for dimension reduction, as well as for the orthogonalization of data prior to downstream modeling or analysis. In this article, we’ll talk about another PCA-style transformation: the sphering or […]

Estimated reading time: 11 minutes

Recently, we’ve been reading about a new correlation coefficient, \(\xi\) (“xi”), which was introduced by Professor Sourav Chatterjee in his paper, “A New Coefficient of Correlation”. The \(\xi\) coefficient has the following properties: If \(y\) is a function of \(x\), then \(\xi\) goes to 1 asymptotically as \(n\) […]

Estimated reading time: 11 minutes

Nina Zumel and John Mount will be speaking at the online University of San Francisco Seminar Series in Data Science! “How and why to use probability models to outperform decision rules,” Friday, April 30, 2021, 12:30pm – 2pm Pacific Time. See here for full details and to RSVP. In this […]

Estimated reading time: 58 seconds

Recently, we showed how to use utility estimates to pick good classifier thresholds. In that article, we used model performance on an evaluation set, combined with estimates of rewards and penalties for correct and incorrect classifications, to find a threshold that optimized model utility. In this article, we will show […]

Estimated reading time: 10 minutes

In a previous article we discussed why it’s a good idea to prefer probability models to “hard” classification models, and why you should delay setting “hard” classification rules as long as possible. But decisions have to be made, and eventually you will have to set that threshold. How do you […]

Estimated reading time: 13 minutes

In our data science teaching, we present the ROC plot (and the area under the curve of the plot, or AUC) as a useful tool for evaluating score-based classifier models, as well as for comparing multiple such models. The ROC is informative and useful, but it’s also perhaps overly concise […]

Estimated reading time: 12 minutes

One of the chapters that we are especially proud of in Practical Data Science with R is Chapter 7, “Linear and Logistic Regression.” We worked really hard to explain the fundamental principles behind both methods in a clear and easy-to-understand form, and to document diagnostics returned by the R implementations […]

Estimated reading time: 52 seconds

A client recently came to us with a question: what’s a good way to monitor data or model output for changes? That is, how can you tell if new data is distributed differently from previous data, or if the distribution of scores returned by a model has changed? This client, […]

Estimated reading time: 17 minutes

Regularization is a way of avoiding overfit by restricting the magnitude of model coefficients (or in deep learning, node weights). A simple example of regularization is the use of ridge or lasso regression to fit linear models in the presence of collinear variables or (quasi-)separation. The intuition is that smaller […]

Estimated reading time: 14 minutes

When studying regression models, one of the first diagnostic plots most students learn is to plot residuals versus the model’s predictions (that is, with the predictions on the x-axis). Here’s a basic example. # build an “ideal” linear process. set.seed(34524) N = 100 x1 = runif(N) x2 = runif(N) noise […]

Estimated reading time: 9 minutes