Using correlation to track model performance is “a mistake that nobody would ever make” combined with a vague “what would be wrong if I did do that” feeling. I hope that after reading this you feel at least a small urge to double-check your work and presentations to make sure you have not reported correlation where R-squared, likelihood, or root mean square error (RMSE) would have been more appropriate.

It is tempting (but wrong) to use correlation to track the performance of model predictions. The temptation arises because we often (correctly) use correlation to evaluate possible model inputs. And the correlation function is often a convenient built-in function.

In fact, on the data used to train a model one can prove that correlation squared is equal to R-squared under very mild assumptions. See Nina Zumel’s “Correlation and R-squared” article for a very good explanation of why this is the case. Correlation squared is also nearly equal to R-squared for any data set that is exchangeable with the training data, so we should expect correlation squared to be nearly equal to R-squared on properly prepared test data. And this is the core of the paradox: correlation is a perfectly good measure on training and test data, but possibly not on data later encountered in a production environment. We don’t know that data later seen in production is in fact exchangeable with training and test data; that is something we hope for and want to track. We don’t want symmetries in the correlation function hiding a divergence such as an unexpected change of units or scale in production data.
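To make the production concern concrete, here is a minimal R sketch. It assumes a hypothetical drift in which the input x starts arriving in units 100 times larger than those used in training. The simulated data, the seed, and the names train, test, prod and rmse are illustrative assumptions, not part of the original analysis.

```r
# Sketch: a hypothetical unit change in production that correlation cannot see.
set.seed(2015)  # arbitrary seed, only for reproducibility

rsq  <- function(y, f) { 1 - sum((y - f)^2) / sum((y - mean(y))^2) }
rmse <- function(y, f) { sqrt(mean((y - f)^2)) }

# Simulated training data and a linear model fit to it
train <- data.frame(x = runif(100))
train$y <- 2 * train$x + rnorm(100, sd = 0.1)
model <- lm(y ~ x, data = train)

# Test data drawn the same way, so exchangeable with the training data
test <- data.frame(x = runif(100))
test$y <- 2 * test$x + rnorm(100, sd = 0.1)
test_pred <- predict(model, newdata = test)

# Hypothetical production data: same process, but x is reported in new units
prod <- test
prod$x <- 100 * test$x
prod_pred <- predict(model, newdata = prod)

# On test data correlation squared and R-squared roughly agree
print(c(cor2 = cor(test$y, test_pred)^2,
        rsq  = rsq(test$y, test_pred),
        rmse = rmse(test$y, test_pred)))

# On the drifted data correlation squared is unchanged,
# while R-squared and RMSE both show the predictions are now far off
print(c(cor2 = cor(prod$y, prod_pred)^2,
        rsq  = rsq(prod$y, prod_pred),
        rmse = rmse(prod$y, prod_pred)))
```

The drifted predictions are a positive re-scaling and shift of the test predictions, so correlation reports exactly the same quality in both cases. Only R-squared and RMSE register that something has gone wrong.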

The correlation function (which we will call cor(,)) has a huge number of obscuring symmetries: it is unchanged under positive scaling of either argument, shifts of either argument, and the swap of its two arguments. This means it is in fact scoring whether some ideal shift plus re-scaling of your model predictions performs well, instead of scoring the predictions you are actually using. This is not what you want: models in a production environment are supposed to make actual good predictions. Measurements in production are supposed to tell you if the model or data have drifted (not merely assume they have not).

Here is some R-code showing symmetries in cor(,):

    > y = runif(10)
    > x = y + 0.5*runif(10)
    > cor(x,y)
    [1] 0.8893743
    > cor(y,x)
    [1] 0.8893743
    > cor(10*x,y)
    [1] 0.8893743
    > cor(x+10,y)
    [1] 0.8893743
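These invariances follow directly from the usual definition of Pearson correlation. A short derivation, where the symbols a, b, cov and sd are notation introduced here for illustration:

```latex
\begin{align*}
\operatorname{cor}(x, y) &= \frac{\operatorname{cov}(x, y)}{\operatorname{sd}(x)\,\operatorname{sd}(y)} \\
\operatorname{cov}(a x + b,\, y) &= a \operatorname{cov}(x, y),
  \qquad \operatorname{sd}(a x + b) = |a| \operatorname{sd}(x) \\
\operatorname{cor}(a x + b,\, y) &= \frac{a \operatorname{cov}(x, y)}{|a| \operatorname{sd}(x)\,\operatorname{sd}(y)}
  = \operatorname{cor}(x, y) \quad \text{for } a > 0
\end{align*}
```

The swap symmetry holds because both the covariance and the product of standard deviations are symmetric in x and y. For negative a the correlation only flips sign, so correlation squared is unchanged even then.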

R-squared (written as a function as rsq(,)) has none of these symmetries: it changes under simple alterations of its arguments and can become arbitrarily negative.

Here is some R-code showing the lack of symmetries in rsq(,):

    > rsq = function(y,f) { 1 - sum((y-f)^2)/sum((y-mean(y))^2) }
    > rsq(x,y)
    [1] -0.4966555
    > rsq(y,x)
    [1] 0.09424879
    > rsq(10*x,y)
    [1] -9.197255
    > rsq(x+10,y)
    [1] -2250.407

And here is some R-code to remind you that correlation squared and R-squared do agree on training data:

    > model = lm(y~x)
    > rsq(y,predict(model))
    [1] 0.7909866
    > cor(y,predict(model))^2
    [1] 0.7909866
    > model

    Call:
    lm(formula = y ~ x)

    Coefficients:
    (Intercept)            x
        -0.3309       1.1432

If you look at this with an open or learning mind it should seem very strange that a function like cor(,), with a huge number of symmetries, is closely associated with a function like rsq(,), with many fewer symmetries. At this point we re-recommend Nina Zumel’s “Correlation and R-squared” article to remind ourselves why correlation squared and R-squared are the same on training data. But the point we want to leave you with is that the correlation function uses its many symmetries to evaluate whether some simple function of a value is a good prediction (hence correlation is a great way to vet possible model inputs); it does not score whether the unaltered predictions at hand are in fact good.
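One way to see this in code: correlation squared between outcomes and predictions matches the R-squared of the best linear re-fit of those predictions, not the R-squared of the predictions themselves. The sketch below uses simulated data; the seed and the names pred, bad_pred and refit are hypothetical and chosen only for illustration.

```r
# Sketch: cor(,)^2 scores the best shift plus re-scaling of the predictions.
set.seed(42)  # arbitrary seed, only for reproducibility

rsq <- function(y, f) { 1 - sum((y - f)^2) / sum((y - mean(y))^2) }

y <- runif(50)
pred <- y + 0.2 * rnorm(50)   # hypothetical reasonable predictions of y
bad_pred <- 10 * pred + 3     # the same predictions after a scale and shift error

# The "ideal shift plus re-scaling" is a linear regression of y on the predictions
refit <- lm(y ~ bad_pred)

print(cor(y, bad_pred)^2)        # what correlation reports
print(rsq(y, predict(refit)))    # R-squared of the re-fit predictions: same value
print(rsq(y, bad_pred))          # R-squared of the predictions at hand: very negative
```

The first two printed values agree, which is the training-data identity applied to the re-fit model. The third shows how poor the unaltered predictions actually are.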


### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

Great post!

Isn’t this the reason the PRESS statistic was developed?

http://en.wikipedia.org/wiki/PRESS_statistic

Cheers,

Andrej