# Be careful evaluating model predictions

One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score.

This is because correlation tells you whether some re-scaling of your result would be useful, but what you want to know is whether the result in your hand is in fact useful. For example: the Mars Climate Orbiter software issued thrust commands in pound-seconds units to an engine expecting the commands to be in newton-seconds units. The two quantities are related by a constant ratio of about 4.448222, and therefore anything measured in pound-seconds units will have a correlation of 1.0 with the same measurement in newton-seconds units. However, one is not the other, and the difference is why the Mars Climate Orbiter “encountered Mars at a lower than anticipated altitude and disintegrated due to atmospheric stresses.”
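The unit mix-up is a perfect-correlation failure: correlation is invariant to any positive re-scaling, so the two unit systems agree perfectly by that score. A quick numerical sketch (in Python rather than R, and with made-up impulse readings; only the pound-seconds to newton-seconds conversion factor is a real fact):

```python
import math

# Hypothetical impulse readings in pound-seconds (values are illustrative).
impulse_lbf_s = [1.0, 2.5, 3.2, 0.7]
# Same readings expressed in newton-seconds (1 lbf·s ≈ 4.448222 N·s).
impulse_N_s = [4.448222 * v for v in impulse_lbf_s]

def pearson(xs, ys):
    # Pearson correlation straight from the definition.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson(impulse_lbf_s, impulse_N_s))  # 1.0 (up to floating point)
```

Correlation cannot see a constant re-scaling, so it cannot see the bug that destroyed the spacecraft.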

The need for a convenient direct F-test that does not accidentally trigger the implicit re-scaling associated with calculating a correlation is one of the reasons we supply the `sigr` R library. However, even then things can become confusing.

Consider the following “harmless data frame.”

```
d <- data.frame(prediction=c(0,0,-1,-2,0,0,-1,-2),
                actual=c(2,3,1,2,2,3,1,2))
```

The recommended test for checking the quality of “`prediction`” relative to “`actual`” is an F-test (this is the test `stats::lm` uses). We can run this test directly with `sigr` (assuming we have installed the package) as follows:

```
sigr::formatFTest(d,'prediction','actual',format='html')$formatStr
```

F Test summary: (R2=-16, F(1,6)=-5.6, p=n.s.).

`sigr` reports an R-squared of -16. This may be confusing, but it correctly communicates that we have no model: “`prediction`” is in fact worse than just using the average (a very traditional null model).
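The arithmetic behind that negative R-squared is simple: R-squared compares the prediction's sum of squared errors to that of the grand-average null model. A small re-check of the formula (in Python; this is not `sigr`'s actual code, just the textbook definition):

```python
prediction = [0, 0, -1, -2, 0, 0, -1, -2]
actual = [2, 3, 1, 2, 2, 3, 1, 2]

mean_actual = sum(actual) / len(actual)  # 2.0, the null-model prediction
ss_pred = sum((a - p) ** 2 for a, p in zip(actual, prediction))  # 66
ss_null = sum((a - mean_actual) ** 2 for a in actual)            # 4
r_squared = 1 - ss_pred / ss_null
print(r_squared)  # -15.5, displayed as -16 at two significant digits
```

The prediction's squared error (66) dwarfs the null model's (4), so the ratio exceeds 1 and R-squared goes negative.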

However, `cor.test` appears to think “`prediction`” is a usable model:

```
cor.test(d$prediction, d$actual)

	Pearson's product-moment correlation

data:  d$prediction and d$actual
t = 1.1547, df = 6, p-value = 0.2921
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3977998  0.8697404
sample estimates:
      cor 
0.4264014 
```

This is all for a prediction where `sum((d$actual-d$prediction)^2)==66`, which is much larger than `sum((d$actual-mean(d$actual))^2)==4`. We concentrate on effect measures (such as R-squared) because we can drive p-values wherever we want just by adding more data rows. Our point is: you are worse off using this model than using the mean of the actuals (2) as your constant predictor. To my mind that is not a good prediction. And `lm` seems similarly excited about “`prediction`.”

```
summary(lm(actual~prediction,data=d))

Call:
lm(formula = actual ~ prediction, data = d)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.90909 -0.43182  0.09091  0.52273  0.72727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.2727     0.3521   6.455 0.000655 ***
prediction    0.3636     0.3149   1.155 0.292121    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7385 on 6 degrees of freedom
Multiple R-squared:  0.1818,	Adjusted R-squared:  0.04545 
F-statistic: 1.333 on 1 and 6 DF,  p-value: 0.2921
```

One reason not to trust the `lm` result is that it didn’t score the quality of “`prediction`”; it scored the quality of “`0.3636*prediction + 2.2727`.” It can be the case that “`0.3636*prediction + 2.2727`” is in fact a good predictor, but that doesn’t help us if it is “`prediction`” we are showing to our boss or putting into production. We can try to mitigate this by insisting `lm` stay closer to the original, turning off the intercept (or offset) with the “`0+`” notation. That looks like the following.

```
summary(lm(actual~0+prediction,data=d))

Call:
lm(formula = actual ~ 0 + prediction, data = d)

Residuals:
   Min     1Q Median     3Q    Max 
  0.00   0.00   1.00   2.25   3.00 

Coefficients:
           Estimate Std. Error t value Pr(>|t|)
prediction  -1.0000     0.6094  -1.641    0.145

Residual standard error: 1.927 on 7 degrees of freedom
Multiple R-squared:  0.2778,	Adjusted R-squared:  0.1746 
F-statistic: 2.692 on 1 and 7 DF,  p-value: 0.1448
```
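For a single predictor with no intercept, least squares has a closed form: the coefficient is `sum(x*y)/sum(x^2)`. Recomputing it by hand (a quick Python cross-check, not anything `lm` itself does) confirms the -1.0 coefficient reported above:

```python
x = [0, 0, -1, -2, 0, 0, -1, -2]  # prediction
y = [2, 3, 1, 2, 2, 3, 1, 2]      # actual

# No-intercept least squares: minimize sum((y - beta*x)^2) over beta.
beta = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
print(beta)  # -1.0: the fit flips the sign of the prediction
```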

Even the `lm(0+)`’s adjusted prediction is bad, as we see below:

```
d$lmPred <- predict(lm(actual~0+prediction,data=d))
sum((d$actual-d$lmPred)^2)
# [1] 26
```

Yes, `lm(0+)` found a way to improve the prediction; but the improved prediction is still very bad (worse than using a well-chosen constant). And it is hard to argue that “`-prediction`” is the same model as “`prediction`.”

Now `sigr` is fairly new code, so it is a bit bold to say it is right when it disagrees with the standard methods. However, `sigr` is right in this case. The standard methods are not so much wrong as different, for two reasons:

1. They are answering different questions. The F-test is designed to check whether the predictions in hand are good or not; “`cor.test`” and “`lm %>% summary`” are designed to check whether some re-scaling of the prediction would be good. These are different questions. Using “`cor.test`” or “`lm %>% summary`” to test the utility of a potential variable is a good idea, because the re-processing hidden in these tests is consistent with the later use of that variable in a model. Using them to score model results that are supposed to be used directly is wrong.
2. From the standard R code’s point of view it isn’t obvious what the right “null model” is. Remember our original point: the quality measures on `lm(0+)` are designed to report how well `lm(0+)` itself is working. This means `lm(0+)` scores the quality of its output (not its inputs), so it gets credit for flipping the sign on the prediction. It also considers the natural null model to be one it could itself form with no variable-driven effects. Since there is no intercept or “dc-term” in these models (due to the “`0+`” notation), the grand average is not considered a plausible null model: it isn’t in the concept space of the modeling situation `lm` was presented with. Or, from `help(summary.lm)`:

> R^2, the ‘fraction of variance explained by the model’,
> R^2 = 1 - Sum(R[i]^2) / Sum((y[i]- y*)^2),
> where y* is the mean of y[i] if there is an intercept and zero otherwise.
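Re-doing that calculation for the “`0+`” fit shows exactly which null model `lm` used. With the fitted coefficient of -1 the residuals are known, and scoring them against the zero null model reproduces the reported 0.2778, while the grand-average null model would have given a negative score (a quick Python sketch of both formulas):

```python
x = [0, 0, -1, -2, 0, 0, -1, -2]  # prediction
y = [2, 3, 1, 2, 2, 3, 1, 2]      # actual

fitted = [-1.0 * xi for xi in x]  # lm(actual ~ 0 + prediction) coefficient is -1
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))        # 26

# Null model = zero (what lm uses when there is no intercept):
r2_zero_null = 1 - ss_res / sum(yi ** 2 for yi in y)             # 1 - 26/36 ≈ 0.2778
# Null model = grand average (what an F-test on the raw prediction uses):
mean_y = sum(y) / len(y)
r2_mean_null = 1 - ss_res / sum((yi - mean_y) ** 2 for yi in y)  # 1 - 26/4 = -5.5

print(round(r2_zero_null, 4), r2_mean_null)
```

The same residuals score as a modest positive R-squared against zero and a strongly negative one against the mean: the choice of null model is doing all the work.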

I admit, this is very confusing. But it corresponds to the documentation and makes sense from a modeling perspective; it is correct. The silent switching of the null model from the average to zero makes sense in the context in which it is defined. It does not make sense for testing our prediction, but that is just one more reason to use the proper F-test directly instead of trying to hack “`cor.test`” or “`lm(0+) %>% summary`” into calculating it for you.

And that is what `sigr` is about: standard tests (using `R`-supplied implementations) with a slightly different calling convention, to better document intent (which in our case is almost always measuring the quality of a model separately from model construction). It is a new library, so it doesn’t yet have all the documentation needed to achieve that goal, but we will eventually get there.


### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

### 9 replies

1. Rhenan Bartels says:

Nice post!
But I have a question. Shouldn’t R-squared be a real number between 0 and 1?

How could it be -16?

2. Ruben says:

I still don’t understand where the negative sign comes from. The R-squared formulas square both the numerator and the denominator…

1. R-squared is R^2 = 1 - Sum(R[i]^2) / Sum((y[i]- y*)^2) (from “help(summary.lm)”). To go negative we just need Sum(R[i]^2) / Sum((y[i]- y*)^2) > 1, or equivalently Sum(R[i]^2) > Sum((y[i]- y*)^2). That is, R-squared is negative when the claimed prediction vector is worse than using the constant y* as your estimate.

3. Pedro says:

Thanks for the package and the discussion, but I don’t think I could grasp what sigr is about… after building a model, is the idea to test the predictions of this model against the actual data, or against a null hypothesis of a simple average model?

1. The idea is to score any prediction (from your model, or from somebody else’s model) on any data (training data, new test data). The score (say R-squared) is of the prediction at hand. For that score you also want a significance; the significance is in terms of how different your score is from the behavior of a well-chosen constant (say the grand average), given the data set size.

That may not seem like a big need (for instance, R’s summary.lm() has all of this information). But then you find that some R code neglects this. For example, R’s summary.glm() is missing a lot of the standard report on fit quality and significance. sigr::formatChiSqTest.glm makes up for this deficiency (see https://github.com/WinVector/sigr/blob/master/R/ChiSqTest.R ).

4. I think some of the confusion comes from the belief that R-squared is always equal to the square of the correlation. That is only true when there is no shift (constant adjustment) or scaling that would improve the correspondence (hopefully the case on training data when there is an intercept term in the model). That is deliberately not the case here; there is some great discussion of this issue here: http://www.win-vector.com/blog/2011/11/correlation-and-r-squared/ . So we are not saying “negative R-squared, therefore imaginary correlation”; we are saying the convenient relation between R-squared and the square of the correlation breaks down because we are no longer meeting the required pre-conditions.

Also, the standard R functions work heroically to never show an R-squared in these hard cases (probably to avoid discussing these issues).

For example:

```
s <- summary(lm(actual~0+offset(prediction),data=d))
s$r.squared
# [1] 0
```

This value must just be patched in (and it is not reported in `print(s)`). Also, this value `0` is not equal to either of `1-sum(s$residuals^2)/sum((d$actual)^2)` or `1-sum(s$residuals^2)/sum((d$actual-mean(d$actual))^2)`, violating the calculation promised in `help(summary.lm)`.
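It is easy to verify that neither version of the documented formula gives 0 for the offset model. The offset pins the coefficient on `prediction` at 1, so the residuals are exactly `actual - prediction` (a quick Python re-check of the claim):

```python
prediction = [0, 0, -1, -2, 0, 0, -1, -2]
actual = [2, 3, 1, 2, 2, 3, 1, 2]

# offset(prediction) leaves no free parameters: residuals are actual - prediction.
ss_res = sum((a - p) ** 2 for a, p in zip(actual, prediction))      # 66

r2_zero_null = 1 - ss_res / sum(a ** 2 for a in actual)             # 1 - 66/36 ≈ -0.833
mean_a = sum(actual) / len(actual)
r2_mean_null = 1 - ss_res / sum((a - mean_a) ** 2 for a in actual)  # 1 - 66/4 = -15.5

print(r2_zero_null, r2_mean_null)  # neither candidate formula yields the reported 0
```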

Basically `summary.lm` is lying to us to keep things “neater and easier to discuss.” From what I can see, it can never express that a model fit is worse than using zero or some other constant. This is a consequence of `summary.lm` being designed to critique the quality of its own contribution to the fit. That is part of why we say to use `sigr::formatFTest`: it is designed for the task of scoring a fit that comes from somewhere else.

```
d <- data.frame(prediction=c(0,0,-1,-2,0,0,-1,-2),
                actual=c(2,3,1,2,2,3,1,2))
d <- d[rep(seq_len(nrow(d)),100),]

cor.test(d$actual,d$prediction)
# 
# 	Pearson's product-moment correlation
# 
# data:  d$actual and d$prediction
# t = 13.317, df = 798, p-value < 2.2e-16
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
#  0.3679627 0.4814850
# sample estimates:
#       cor 
# 0.4264014 

summary(lm(actual~prediction,data=d))
# Call:
# lm(formula = actual ~ prediction, data = d)
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -0.90909 -0.43182  0.09091  0.52273  0.72727 
# 
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)    
# (Intercept)  2.27273    0.03053   74.44   <2e-16 ***
# prediction   0.36364    0.02731   13.32   <2e-16 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.6404 on 798 degrees of freedom
# Multiple R-squared:  0.1818, Adjusted R-squared:  0.1808 
# F-statistic: 177.3 on 1 and 798 DF,  p-value: < 2.2e-16
```
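The replication trick works because the sample correlation is unchanged while the `cor.test` t statistic, t = r·sqrt(n-2)/sqrt(1-r²), grows like sqrt(n). A quick Python check reproduces both reported t values:

```python
import math

r = 0.4264014  # sample correlation; replicating rows does not change it

def cor_t_stat(r, n):
    # t statistic used by cor.test for the null hypothesis of zero correlation.
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

print(cor_t_stat(r, 8))    # ≈ 1.1547, matching cor.test on the 8-row frame
print(cor_t_stat(r, 800))  # ≈ 13.317, matching cor.test after 100x replication
```

Same effect size, same correlation, wildly different p-values: which is exactly why we score models by effect measures rather than by significance alone.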