Menu Home

Be careful evaluating model predictions

One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score.

This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For example: the Mars Climate Orbiter software issued thrust commands in pound-seconds units to an engine expecting the commands to be in newton-seconds units. The two quantities are related by a constant ratio of 1.4881639, and therefore anything measured in pound-seconds units will have a correlation of 1.0 with the same measurement in newton-seconds units. However, one is not the other and the difference is why the Mars Climate Orbiter “encountered Mars at a lower than anticipated altitude and disintegrated due to atmospheric stresses.”

The need for a convenient direct F-test without accidentally triggering the implicit re-scaling that is associated with calculating a correlation is one of the reasons we supply the sigr R library. However, even then things can become confusing.


Please read on for a nasty little example.

Consider the following “harmless data frame.”

d <- data.frame(prediction=c(0,0,-1,-2,0,0,-1,-2),

The recommended test for checking the quality of “prediction” related to “actual” is an F-test (this is the test stats::lm uses). We can directly run this test with sigr (assuming we have installed the package) as follows:


F Test summary: (R2=-16, F(1,6)=-5.6, p=n.s.).

sigr reports an R-squared of -16 (please see here for some discussion of R-squared). This may be confusing, but it correctly communicates we have no model and in fact “prediction” is worse than just using the average (a very traditional null-model).

However, cor.test appears to think “prediction” is a usable model:


	Pearson's product-moment correlation

data:  d$prediction and d$actual
t = 1.1547, df = 6, p-value = 0.2921
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3977998  0.8697404
sample estimates:

This is all for a prediction where sum((d$actual-d$prediction)^2)==66 which is larger than sum((d$actual-mean(d$actual))^2)==4. We concentrate on effects measures (such as R-squared) as we can drive the p-values wherever we want just by adding more data rows. Our point is: you are worse off using this model than using the mean-value of the actual (2) as your constant predictor. To my mind that is not a good prediction. And lm seems similarly excited about “prediction.”


lm(formula = actual ~ prediction, data = d)

     Min       1Q   Median       3Q      Max 
-0.90909 -0.43182  0.09091  0.52273  0.72727 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.2727     0.3521   6.455 0.000655 ***
prediction    0.3636     0.3149   1.155 0.292121    
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7385 on 6 degrees of freedom
Multiple R-squared:  0.1818,	Adjusted R-squared:  0.04545 
F-statistic: 1.333 on 1 and 6 DF,  p-value: 0.2921

One reason to not trust the lm result is it didn’t score the quality of “prediction“. It scored the quality of “0.3636*prediction + 2.2727.” It can be the case that “0.3636*prediction + 2.2727” is in fact a good predictor. But that doesn’t help us if it is “prediction” we are showing to our boss or putting into production. We can try to mitigate this by insisting lm try to stay closer to the original by turning off the intercept or offset with the “0+” notation. That looks like the following.


lm(formula = actual ~ 0 + prediction, data = d)

   Min     1Q Median     3Q    Max 
  0.00   0.00   1.00   2.25   3.00 

           Estimate Std. Error t value Pr(>|t|)
prediction  -1.0000     0.6094  -1.641    0.145

Residual standard error: 1.927 on 7 degrees of freedom
Multiple R-squared:  0.2778,	Adjusted R-squared:  0.1746 
F-statistic: 2.692 on 1 and 7 DF,  p-value: 0.1448

Even the lm(0+)‘s adjusted prediction is bad as we see below:

d$lmPred <- predict(lm(actual~0+prediction,data=d))
[1] 26

Yes, the lm(0+) found a way to improve the prediction; but the improved prediction is still very bad (worse than using a well chosen constant). And it is hard to argue that “-prediction” is the same model as “prediction.”

Now sigr is fairly new code, so it is a bit bold saying it is right when it disagrees with the standard methods. However sigr is right in this case. The standard methods are not so much wrong as different, for two reasons:

  1. They are answering different questions. The F-test is designed to check if the predictions in-hand are good or not; “cor.test” and “lm %>% summary” are designed to check if any re-scaling of the prediction is in fact good. These are different questions. Using “cor.test” or “lm %>% summary” to test the utility of a potential variable is a good idea. The reprocessing hidden in these tests is consistent with the later use of a variable in a model. Using them to score model results that are supposed be directly used is wrong.
  2. From the standard R code’s point of view it isn’t obvious what the right “null model” is. Remember our original point: the quality measures on lm(0+) are designed to see how well lm(0+) is working. This means the lm(0+) scores the quality of its output (not its inputs) so it gets credit for flipping the sign on the prediction. Also it considers the natural null-model to be one it can form where there are no variable driven effects. Since there is no intercept or “dc-term” in these models (caused by the “0+” notation) the grand average is not considered a plausible null-model as it isn’t in the concept space of the modeling situation the lm was presented with. Or from help(summary.lm):

    R^2, the ‘fraction of variance explained by the model’,
    R^2 = 1 – Sum(R[i]^2) / Sum((y[i]- y*)^2),
    where y* is the mean of y[i] if there is an intercept and zero otherwise.

I admit, this is very confusing. But it corresponds to documentation, and makes sense from a modeling perspective. It is correct. The silent switching of null model from average to zero make sense in the context it is defined in. It does not make sense for testing our prediction, but that is just one more reason to use the proper F-test directly instead of trying to hack “cor.test” or “lm(0+) %>% summary” to calculate it for you.

And that is what sigr is about: standard tests (using R supplied implementations) with a slightly different calling convention to better document intent (which in our case is almost always measuring the quality of a model separate from model construction). It is a new library, so it doesn’t yet have the documentation needed to achieve its goal, but we will eventually get there.

Categories: Pragmatic Data Science Tutorials

Tagged as:


Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

9 replies

  1. Nice post!
    But I have a question. The R-squared should be a real number between 0 and 1, shouldn’t?

    How it could be -16?

    Thanks in advance

  2. I still don’t understand where the negative sign comes from. The R-squared formulas square both the numerator and the denominator…

    1. R-squared is R^2 = 1 – Sum(R[i]^2) / Sum((y[i]- y*)^2) (from “help(summary.lm)”). To go negative we just need Sum(R[i]^2) / Sum((y[i]- y*)^2)>1 or equivalently Sum(R[i]^2) > Sum((y[i]- y*)^2). That is R-squared is negative when the claimed prediction vector is worse than using the constant y* as your estimate.

  3. Thanks for the package and the discussion, but I think I couldn’t grasp what sigr is about…after building a model, the idea is to test the predictions of this model against the actual data or against a null hypothesis of a simple average model?

    1. The idea is to score any prediction (from your model, or from somebody else’s model) on any data (training data, new test data). The score (say R-squared) is of the prediction at hand. For that score you also want a significance, the significance is in terms of different your score is from the behavior of a well chosen constant (say grand average) and data set size.

      That may not seem like a big need (for instance R’s summary.lm() has all of this information). But then you find that some R code neglects this. For example R’s summary.glm() is missing a lot of the standard report on fit quality and significance. sigr::formatChiSqTest.glm makes up this deficiency (see ).

  4. I think some of the confusion is the belief that R-squared is always equal to the square of correlation. That is only true when there is no shift (constant adjustment) and scaling which would improve the correspondence (hopefully the case on training data when there is an intercept term in the model). That is deliberately not the case here: here is some great discussion on this issue: . So we are not saying “negative R-squared therefore imaginary correlation”, we are saying the convenient relation between R-squared and the square of correlation breaks down because we are no longer meeting the required pre-conditions.

    Also the standard R-functions work heroically to never show an R-squared in these hard cases (probably to avoid discussing these issues).

    For example:

    s <-  summary(lm(actual~0+offset(prediction),data=d))
     # [1] 0

    Which must be just be patched in (and is not reported in print(s)). Also this value 0 is not equal to either of 1-sum(s$residuals^2)/sum((d$actual)^2) or 1-sum(s$residuals^2)/sum((d$actual-mean(d$actual))^2), violating the calculation promised in help(summary.lm).

    Basically summary.lm is lying to us to keep things “neater and easier to discuss.” From what I can see it never can express that a model-fit can be worse than using zero or some other constant. This is a consequence of summary.lm being designed to critique the quality of its own contribution to fit. Which is part why we say use sigr::formatFTest, it is designed for the task of scoring a fit that comes from somewhere else.

  5. Although I agree correlation is not a good model performance metric, I’m not sure I understand why you interpret cor.test and lm as suggesting the model is good. cor.test shows there is no significant correlation between your predictions and the actual values. That is enough that I would reject my model. Likewise, your first use of lm, with an intercept, shows that not only does your predictor not match the actual values well, neither does any linear transformation of your values (based on the adjusted R^2 and t-test values). This is also enough that I would conclude that I had a poor model. I think you make a good point about lm and cor.test looking at transformations of your predictor and that this is bad if you’re planning on using the predictor directly. I’m just not sure your example makes that point if I’m interpreting your R output correctly.

    1. It is of course up to you what examples you feel are convincing. This example is unattractive in that it was small and artificial. However, we ran into the issue due to a client query on a large real-world data set. They wondered why a horrible prediction was getting a moderately okay (not good, not great) score. The answer was that they needed to directly run the F-test (which also thought the horrible prediction was in fact horrible) instead of hoping to trick cor.test or lm to run it for them. Scoring predictions using cor.test or lm is mere convenience that isn’t very wrong on good fits. On bad fits that convenience fails, and you are better of investing the time to talk directly to R’s F-test.

      As for as evaluation: you always have to look at both at least effect strength and significance. The direct F-test reports both of these correctly, and has always been the suggested test. So I say use that.

      For effect reporting both lm and cor.test are a mess. The prediction column is worse than using a the constant 2 as your prediction. To my mind that means there is no effect. You are better off using a constant. Both the tests report a positive effect (due to obscure details of what they choose to calculate in different circumstances). It doesn’t matter that the reported positive effect is small, the problem is it is wrong to say it is above zero (because neither of these tests were appropriate in this circumstance).

      For significances, you are really just asking how many rows were in your data set relative to the reported effect size. In the big data world you often see “good significances” on fairly weak effects. Even for this example I can drive the significance up by replicating the data. Our example could just as easily been:

      d <- data.frame(prediction=c(0,0,-1,-2,0,0,-1,-2),
      d <- d[rep(seq_len(nrow(d)),100),]
       # cor.test(d$actual,d$prediction)
       # 	Pearson's product-moment correlation
       # data:  d$actual and d$prediction
       # t = 13.317, df = 798, p-value < 2.2e-16
       # alternative hypothesis: true correlation is not equal to 0
       # 95 percent confidence interval:
       #  0.3679627 0.4814850
       # sample estimates:
       #       cor 
       # 0.4264014 
       # Call:
       # lm(formula = actual ~ prediction, data = d)
       # Residuals:
       #      Min       1Q   Median       3Q      Max 
       # -0.90909 -0.43182  0.09091  0.52273  0.72727 
       # Coefficients:
       #             Estimate Std. Error t value Pr(>|t|)    
       # (Intercept)  2.27273    0.03053   74.44   <2e-16 ***
       # prediction   0.36364    0.02731   13.32   <2e-16 ***
       # ---
       # Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
       # Residual standard error: 0.6404 on 798 degrees of freedom
       # Multiple R-squared:  0.1818,	Adjusted R-squared:  0.1808 
       # F-statistic: 177.3 on 1 and 798 DF,  p-value: < 2.2e-16

      Notice the p-values are as low as one could want.