Writing a book is a sacrifice. It takes a lot of time, represents a lot of missed opportunities, and does not (directly) pay very well. If you do a good job it may pay back in good-will, but producing a serious book is a great challenge. Nina Zumel and I […]

Estimated reading time: 1 minute

Short form: Win-Vector LLC’s Dr. Nina Zumel has a three part series on Principal Components Regression that we think is well worth your time. Part 1: the proper preparation of data (including scaling) and use of principal components analysis (particularly for supervised learning or regression). Part 2: the introduction of […]

Estimated reading time: 9 minutes

It is often said that “R is its packages.” One package of interest is ranger a fast parallel C++ implementation of random forest machine learning. Ranger is great package and at first glance appears to remove the “only 63 levels allowed for string/categorical variables” limit found in the Fortran randomForest […]

Estimated reading time: 7 minutes

In this note, we discuss principal components regression and some of the issues with it: The need for scaling. The need for pruning. The lack of “y-awareness” of the standard dimensionality reduction step.

Estimated reading time: 34 minutes

In my recent article on optimizing set diversity I mentioned the primary abstraction was of “diminishing returns” and is formalized by the theory of monotone submodular functions (though I did call out some of my own work which used a different abstraction). A proof that appears again and again in […]

Estimated reading time: 19 minutes

One of the trickier tasks in clustering is determining the appropriate number of clusters. Domain-specific knowledge is always best, when you have it, but there are a number of heuristics for getting at the likely number of clusters in your data. We cover a few of them in Chapter 8 […]

Estimated reading time: 10 minutes

The mathematical concept of set diversity is a somewhat neglected topic in current applied decision sciences and optimization. We take this opportunity to discuss the issue. The problem Consider the following problem: for a number of items U = {x_1, … x_n} pick a small set of them X = […]

Estimated reading time: 28 minutes

The combination of R plus SQL offers an attractive way to work with what we call medium-scale data: data that’s perhaps too large to gracefully work with in its entirety within your favorite desktop analysis tool (whether that be R or Excel), but too small to justify the overhead of […]

Estimated reading time: 10 minutes

Let’s take a break from statistics and data science to think a bit about programming language theory, and how the theory relates to the programming language used in the R analysis platform (the language is technically called “S”, but we are going to just call the whole analysis system “R”). […]

Estimated reading time: 19 minutes

There remains a bit of a two-way snobbery that Frequentist statistics is what we teach (as so-called objective statistics remain the same no matter who works with them) and Bayesian statistics is what we do (as it tends to directly estimate posterior probabilities we are actually interested in). Nina Zumel […]

Estimated reading time: 24 minutes