Menu Home

CRU graph yet again (with R)

IowaHawk has a excellent article attempting to reproduce the infamous CRU climate graph using OpenOffice: Fables of the Reconstruction. We thought we would show how to produced similarly bad results using R.

If the re-constructed technique is close to what was originally done then so many bad moves were taken that you can’t learn much of anything from the original “result.” This points out some of the pratfalls of not performing hold-out tests, not examining the modeling diagnostics and not remembering that linear regression models fail to low-variance models (i.e. when they fail they do a good job predicting the mean and vastly under-estimate variance).

Our article not an article on global warming, but an article on analysis technique. Human driven global warming is either happening or not happening independent of any bad analysis. Finding the physical truth is a bigger harder job than eliminating some bad reports (the opposite of a bad report is not necessarily the truth). Bad analyses can have many different sources (mistakes, trying to jump ahead of your colleagues on something you believe is true, trying to fake something you believe is false or be figments of overly harsh critics) and we have not heard enough to make any accusations.

First: load the data (I re-formatted it at bit so R can read it: jonesmannrogfig2c.txt, data1400.dat_.txt ) , perform the principle components reduction and fit a first

> library(lattice)
> d1400 <- read.table('data1400.dat.txt',sep='t',header=FALSE)
> d1400r <- as.matrix(d1400[,2:23])
> pcomp <- prcomp(na.omit(d1400r))
> plot(pcomp)
> vars <- data.frame(cbind(Year=d1400[,1],d1400r %*% pcomp$rotation),row.names=d1400[,1])
> jones <- read.table('jonesmannrogfig2c.txt',sep='t',header=TRUE)
> datUnion <- merge(vars,jones,all=TRUE)
> datUnion$avgTemp <- with(datUnion,(NH+CET+Central.Europe+Fennoscandia)/4.0)
> model <- lm(avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 ,dat=datUnion)
> summary(model)

lm(formula = avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5, data = datUnion)

       Min         1Q     Median         3Q        Max 
-0.8811679 -0.2658117  0.0008174  0.2933058  1.0450044 

              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.0065252  0.5750696   0.011   0.9910  
PC1         -0.0001683  0.0003912  -0.430   0.6679  
PC2         -0.0003678  0.0010114  -0.364   0.7168  
PC3          0.0003177  0.0014821   0.214   0.8307  
PC4          0.0044084  0.0019351   2.278   0.0246 *
PC5          0.0188520  0.0205137   0.919   0.3601  
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.4505 on 113 degrees of freedom
  (484 observations deleted due to missingness)
Multiple R-squared: 0.05223,	Adjusted R-squared: 0.01029 
F-statistic: 1.245 on 5 and 113 DF,  p-value: 0.2927 

We used only 5 principle components as modeling variables, because as is typical of principle component analysis- beyond the first few components the components become vanishingly small and unsuitable to use in modeling (see graph pcomp below).

However, this gave a model with far smaller R-squared than people are reporting, so lets add in a lot of components like everybody else does (bad!).

> model <- lm(avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 +PC10 +PC11 +PC12 + PC13 ,dat=datUnion)
> summary(model)

lm(formula = avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + 
    PC8 + PC9 + PC10 + PC11 + PC12 + PC13, data = datUnion)

     Min       1Q   Median       3Q      Max 
-0.87249 -0.25951  0.03996  0.25055  0.99039 

              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.431e-01  1.424e+00   0.522   0.6028    
PC1         -1.796e-04  3.665e-04  -0.490   0.6253    
PC2         -4.179e-04  9.759e-04  -0.428   0.6694    
PC3          3.306e-05  1.430e-03   0.023   0.9816    
PC4          3.416e-03  1.803e-03   1.894   0.0609 .  
PC5          4.032e-02  1.978e-02   2.039   0.0440 *  
PC6         -3.260e-03  2.660e-02  -0.123   0.9027    
PC7         -7.134e-02  3.620e-02  -1.971   0.0514 .  
PC8         -1.339e-01  7.895e-02  -1.696   0.0928 .  
PC9          7.577e-02  5.734e-02   1.321   0.1892    
PC10         2.700e-01  5.878e-02   4.594 1.22e-05 ***
PC11         8.562e-02  6.741e-02   1.270   0.2068    
PC12        -8.057e-02  1.053e-01  -0.765   0.4461    
PC13        -4.099e-02  1.064e-01  -0.385   0.7008    
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.4141 on 105 degrees of freedom
  (484 observations deleted due to missingness)
Multiple R-squared: 0.2558,	Adjusted R-squared: 0.1637 
F-statistic: 2.777 on 13 and 105 DF,  p-value: 0.001961 

This is a degenerate model that essentially didn’t fit (thought the significance on PC10 component fools the fitter, but PC10 can’t be usable- it is essentially noise). Graphically we can see the fit is not very useful (despite having a little bit of R-squared) by looking at the graph of the fit plotted in the region of fitting. Notice how the fit variance is much smaller than the true data variance even in the region of training data, this is typical of bad regression fits.

> dRange <- datUnion[datUnion$Year>=1856 & datUnion$Year<=1980,]
> xyplot(avgTemp + prediction ~Year,dat=dRange,type='l',auto.key=TRUE)

Now the statement they wanted to make is that the present looks nothing like the past. The past is only available through the fit model so what you would hope is that the model looks like the present and then the model itself separates the past and present. Instead as you see in the graphs above and below this fails two ways: the model looks nothing like the present and the model’s past looks a lot like the model’s present.

> datUnion$prediction <- predict(model,newdata=datUnion)
> xyplot(avgTemp + prediction ~Year,dat=datUnion,type=c('p','smooth'),auto.key=TRUE)

What we could do to falsely drive the conclusion (which itself may or may not be true, it just is not supported by this technique, model or data) is create the infamous graph where we switch from modeled data in the past to actual data in the present and then act surprised that the two did not line up (which they did at no step during the fitting). I don’t have the heart to unify the colors or remove the legend, but here is the graph below:

> datUnion$dinked <- datUnion$prediction
> datUnion$dinked[!$avg)] <- NA
> xyplot(avgTemp + dinked ~Year,dat=datUnion,type=c('p','smooth'),auto.key=TRUE)

The reason the blue points look different than the others is they came from the average temperature data instead of the model (where everything else came from). Switching the series is essentially assuming the conclusion that recent past looks very different than the far past.

Essentially this methodology was so poor it could not have illustrated or contradicted recent global warming. There are plenty of warning signs that the model fitting are problematic and the conclusion illustrated in the last graph can not actually be proved or disproved from this data (the proxy variables are too weak to be useful, that is not to say there are not other better proxy variables or modeling techniques). The problems of the presentation are, of course, not essential problems in detecting global warming (which likely is occurring and likely will be a drain on future quality of life) but problems found in a single bad analysis.

Categories: Uncategorized

Tagged as:


Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

3 replies

  1. @jmount : It isn’t discussed explicitly in the article above (or in its follow-up article either), but the significance of a regression coefficient is taken against the null hypothesis that the coefficient is zero. In other words, PC5 and PC10 are the only coefficients that are really bounded away from zero in this analysis (well, PC4, PC7, and PC8, too, if you feel generous).

    That means that most coefficients — especially the ones that actually matter, the first 3 — have uncertainty that is of the same magnitude as their estimated value. It’s just not a very good model.

  2. A related blog post you might be interested:

    There was an interesting article in Science a while back that discussed why the three sigma variance in climate models will probably always be large. I’ll have to find it. The argument, if I correctly understood it, was essentially that the propagation of uncertainties in all the component variables (and there are a lot of variables in these models, and they interact in complex, nonlinear ways) is such that the predicted uncertainties in temperature will ALWAYS be of the same magnitude as the mean change in temperature we’re trying to model. It wasn’t very hopeful that climate model accuracy will ever improve.

%d bloggers like this: