IowaHawk has a excellent article attempting to reproduce the infamous CRU climate graph using OpenOffice: Fables of the Reconstruction. We thought we would show how to produced similarly bad results using R.
If the re-constructed technique is close to what was originally done then so many bad moves were taken that you can’t learn much of anything from the original “result.” This points out some of the pratfalls of not performing hold-out tests, not examining the modeling diagnostics and not remembering that linear regression models fail to low-variance models (i.e. when they fail they do a good job predicting the mean and vastly under-estimate variance).
Our article not an article on global warming, but an article on analysis technique. Human driven global warming is either happening or not happening independent of any bad analysis. Finding the physical truth is a bigger harder job than eliminating some bad reports (the opposite of a bad report is not necessarily the truth). Bad analyses can have many different sources (mistakes, trying to jump ahead of your colleagues on something you believe is true, trying to fake something you believe is false or be figments of overly harsh critics) and we have not heard enough to make any accusations.
First: load the data (I re-formatted it at bit so R can read it: jonesmannrogfig2c.txt, data1400.dat_.txt ) , perform the principle components reduction and fit a first
> library(lattice) > d1400 <- read.table('data1400.dat.txt',sep='t',header=FALSE) > d1400r <- as.matrix(d1400[,2:23]) > pcomp <- prcomp(na.omit(d1400r)) > plot(pcomp) > vars <- data.frame(cbind(Year=d1400[,1],d1400r %*% pcomp$rotation),row.names=d1400[,1]) > jones <- read.table('jonesmannrogfig2c.txt',sep='t',header=TRUE) > datUnion <- merge(vars,jones,all=TRUE) > datUnion$avgTemp <- with(datUnion,(NH+CET+Central.Europe+Fennoscandia)/4.0) > model <- lm(avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 ,dat=datUnion) > summary(model) Call: lm(formula = avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5, data = datUnion) Residuals: Min 1Q Median 3Q Max -0.8811679 -0.2658117 0.0008174 0.2933058 1.0450044 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.0065252 0.5750696 0.011 0.9910 PC1 -0.0001683 0.0003912 -0.430 0.6679 PC2 -0.0003678 0.0010114 -0.364 0.7168 PC3 0.0003177 0.0014821 0.214 0.8307 PC4 0.0044084 0.0019351 2.278 0.0246 * PC5 0.0188520 0.0205137 0.919 0.3601 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.4505 on 113 degrees of freedom (484 observations deleted due to missingness) Multiple R-squared: 0.05223, Adjusted R-squared: 0.01029 F-statistic: 1.245 on 5 and 113 DF, p-value: 0.2927
We used only 5 principle components as modeling variables, because as is typical of principle component analysis- beyond the first few components the components become vanishingly small and unsuitable to use in modeling (see graph pcomp below).
However, this gave a model with far smaller R-squared than people are reporting, so lets add in a lot of components like everybody else does (bad!).
> model <- lm(avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 +PC10 +PC11 +PC12 + PC13 ,dat=datUnion) > summary(model) Call: lm(formula = avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + PC10 + PC11 + PC12 + PC13, data = datUnion) Residuals: Min 1Q Median 3Q Max -0.87249 -0.25951 0.03996 0.25055 0.99039 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 7.431e-01 1.424e+00 0.522 0.6028 PC1 -1.796e-04 3.665e-04 -0.490 0.6253 PC2 -4.179e-04 9.759e-04 -0.428 0.6694 PC3 3.306e-05 1.430e-03 0.023 0.9816 PC4 3.416e-03 1.803e-03 1.894 0.0609 . PC5 4.032e-02 1.978e-02 2.039 0.0440 * PC6 -3.260e-03 2.660e-02 -0.123 0.9027 PC7 -7.134e-02 3.620e-02 -1.971 0.0514 . PC8 -1.339e-01 7.895e-02 -1.696 0.0928 . PC9 7.577e-02 5.734e-02 1.321 0.1892 PC10 2.700e-01 5.878e-02 4.594 1.22e-05 *** PC11 8.562e-02 6.741e-02 1.270 0.2068 PC12 -8.057e-02 1.053e-01 -0.765 0.4461 PC13 -4.099e-02 1.064e-01 -0.385 0.7008 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.4141 on 105 degrees of freedom (484 observations deleted due to missingness) Multiple R-squared: 0.2558, Adjusted R-squared: 0.1637 F-statistic: 2.777 on 13 and 105 DF, p-value: 0.001961
This is a degenerate model that essentially didn’t fit (thought the significance on PC10 component fools the fitter, but PC10 can’t be usable- it is essentially noise). Graphically we can see the fit is not very useful (despite having a little bit of R-squared) by looking at the graph of the fit plotted in the region of fitting. Notice how the fit variance is much smaller than the true data variance even in the region of training data, this is typical of bad regression fits.
> dRange <- datUnion[datUnion$Year>=1856 & datUnion$Year<=1980,] > xyplot(avgTemp + prediction ~Year,dat=dRange,type='l',auto.key=TRUE)
Now the statement they wanted to make is that the present looks nothing like the past. The past is only available through the fit model so what you would hope is that the model looks like the present and then the model itself separates the past and present. Instead as you see in the graphs above and below this fails two ways: the model looks nothing like the present and the model’s past looks a lot like the model’s present.
> datUnion$prediction <- predict(model,newdata=datUnion) > xyplot(avgTemp + prediction ~Year,dat=datUnion,type=c('p','smooth'),auto.key=TRUE)
What we could do to falsely drive the conclusion (which itself may or may not be true, it just is not supported by this technique, model or data) is create the infamous graph where we switch from modeled data in the past to actual data in the present and then act surprised that the two did not line up (which they did at no step during the fitting). I don’t have the heart to unify the colors or remove the legend, but here is the graph below:
> datUnion$dinked <- datUnion$prediction > datUnion$dinked[!is.na(datUnion$avg)] <- NA > xyplot(avgTemp + dinked ~Year,dat=datUnion,type=c('p','smooth'),auto.key=TRUE)
The reason the blue points look different than the others is they came from the average temperature data instead of the model (where everything else came from). Switching the series is essentially assuming the conclusion that recent past looks very different than the far past.
Essentially this methodology was so poor it could not have illustrated or contradicted recent global warming. There are plenty of warning signs that the model fitting are problematic and the conclusion illustrated in the last graph can not actually be proved or disproved from this data (the proxy variables are too weak to be useful, that is not to say there are not other better proxy variables or modeling techniques). The problems of the presentation are, of course, not essential problems in detecting global warming (which likely is occurring and likely will be a drain on future quality of life) but problems found in a single bad analysis.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
By the way- the tiny principle components elements tricking the modeling system into thinking they are significant (when they are in fact unimportant noise, just not independent noise as expected by significance calculations) is great example of some of the issues discussed in: http://www.win-vector.com/blog/2009/12/statistics-to-english-translation-part-2a-’significant’-doesn’t-always-mean-’important’/
@jmount : It isn’t discussed explicitly in the article above (or in its follow-up article either), but the significance of a regression coefficient is taken against the null hypothesis that the coefficient is zero. In other words, PC5 and PC10 are the only coefficients that are really bounded away from zero in this analysis (well, PC4, PC7, and PC8, too, if you feel generous).
That means that most coefficients — especially the ones that actually matter, the first 3 — have uncertainty that is of the same magnitude as their estimated value. It’s just not a very good model.
A related blog post you might be interested:
There was an interesting article in Science a while back that discussed why the three sigma variance in climate models will probably always be large. I’ll have to find it. The argument, if I correctly understood it, was essentially that the propagation of uncertainties in all the component variables (and there are a lot of variables in these models, and they interact in complex, nonlinear ways) is such that the predicted uncertainties in temperature will ALWAYS be of the same magnitude as the mean change in temperature we’re trying to model. It wasn’t very hopeful that climate model accuracy will ever improve.