IowaHawk has an excellent article attempting to reproduce the infamous CRU climate graph using OpenOffice: Fables of the Reconstruction. We thought we would show how to produce similarly bad results using R.
If the reconstructed technique is close to what was originally done, then so many bad moves were taken that you can’t learn much of anything from the original “result.” This points out some of the pitfalls of not performing hold-out tests, not examining the modeling diagnostics, and not remembering that linear regression models fail toward low-variance models (i.e. when they fail they do a good job predicting the mean and vastly under-estimate the variance).
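As a minimal illustration of the hold-out idea (on hypothetical synthetic data, not the reconstruction data), one can fit a linear model on pure-noise predictors and then score it on held-out rows. The failing model collapses toward predicting the mean, which shows up as predicted variance far below actual variance:

```r
# Hypothetical sketch on synthetic data (not the reconstruction data):
# fit a linear model on pure-noise predictors, then score a hold-out set.
set.seed(2009)
d <- data.frame(y = rnorm(200), x1 = rnorm(200), x2 = rnorm(200))
train <- d[1:100, ]
test  <- d[101:200, ]
m <- lm(y ~ x1 + x2, data = train)
p <- predict(m, newdata = test)
# A failing fit collapses toward the mean: tiny prediction variance
print(var(p) / var(test$y))
# Hold-out squared error is about var(test$y): the model has no skill
print(mean((p - test$y)^2))
```

The in-sample summary can look mildly encouraging; it is the hold-out score that exposes the lack of signal.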
Ours is not an article on global warming, but an article on analysis technique. Human-driven global warming is either happening or not happening, independent of any bad analysis. Finding the physical truth is a bigger, harder job than eliminating some bad reports (the opposite of a bad report is not necessarily the truth). Bad analyses can have many different sources (mistakes, trying to jump ahead of your colleagues on something you believe is true, trying to fake something you believe is false, or they can be figments of overly harsh critics), and we have not heard enough to make any accusations.
> library(lattice)
> d1400 <- read.table('data1400.dat.txt',sep='\t',header=FALSE)
> d1400r <- as.matrix(d1400[,2:23])
> pcomp <- prcomp(na.omit(d1400r))
> plot(pcomp)
> vars <- data.frame(cbind(Year=d1400[,1],d1400r %*% pcomp$rotation),
+    row.names=d1400[,1])
> jones <- read.table('jonesmannrogfig2c.txt',sep='\t',header=TRUE)
> datUnion <- merge(vars,jones,all=TRUE)
> datUnion$avgTemp <- with(datUnion,(NH+CET+Central.Europe+Fennoscandia)/4.0)
> model <- lm(avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5,data=datUnion)
> summary(model)

Call:
lm(formula = avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5, data = datUnion)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.8811679 -0.2658117  0.0008174  0.2933058  1.0450044 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  0.0065252  0.5750696   0.011   0.9910  
PC1         -0.0001683  0.0003912  -0.430   0.6679  
PC2         -0.0003678  0.0010114  -0.364   0.7168  
PC3          0.0003177  0.0014821   0.214   0.8307  
PC4          0.0044084  0.0019351   2.278   0.0246 *
PC5          0.0188520  0.0205137   0.919   0.3601  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.4505 on 113 degrees of freedom
  (484 observations deleted due to missingness)
Multiple R-squared: 0.05223, Adjusted R-squared: 0.01029 
F-statistic: 1.245 on 5 and 113 DF, p-value: 0.2927 
We used only 5 principal components as modeling variables because, as is typical of principal component analysis, beyond the first few components the component variances become vanishingly small and the components are unsuitable for modeling (see the pcomp graph below).
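A small sketch on synthetic data (a hypothetical stand-in for the 22-column proxy matrix, not the actual tree-ring data) shows the typical pattern: the variance captured by successive principal components falls off rapidly, leaving the late components with almost nothing to contribute:

```r
# Synthetic stand-in for a 22-column data matrix with decaying column scales
set.seed(1)
m <- matrix(rnorm(100 * 22), ncol = 22) %*% diag((22:1) / 22)
pc <- prcomp(m)
# Share of total variance carried by each principal component:
# the first few dominate, the rest are vanishingly small
print(round(pc$sdev^2 / sum(pc$sdev^2), 3))
```

This is the same information the `plot(pcomp)` scree plot displays graphically.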
However, this gave a model with a far smaller R-squared than people are reporting, so let’s add in a lot more components like everybody else does (bad!).
> model <- lm(avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8
+    + PC9 + PC10 + PC11 + PC12 + PC13,data=datUnion)
> summary(model)

Call:
lm(formula = avgTemp ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + 
    PC8 + PC9 + PC10 + PC11 + PC12 + PC13, data = datUnion)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.87249 -0.25951  0.03996  0.25055  0.99039 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.431e-01  1.424e+00   0.522   0.6028    
PC1         -1.796e-04  3.665e-04  -0.490   0.6253    
PC2         -4.179e-04  9.759e-04  -0.428   0.6694    
PC3          3.306e-05  1.430e-03   0.023   0.9816    
PC4          3.416e-03  1.803e-03   1.894   0.0609 .  
PC5          4.032e-02  1.978e-02   2.039   0.0440 *  
PC6         -3.260e-03  2.660e-02  -0.123   0.9027    
PC7         -7.134e-02  3.620e-02  -1.971   0.0514 .  
PC8         -1.339e-01  7.895e-02  -1.696   0.0928 .  
PC9          7.577e-02  5.734e-02   1.321   0.1892    
PC10         2.700e-01  5.878e-02   4.594 1.22e-05 ***
PC11         8.562e-02  6.741e-02   1.270   0.2068    
PC12        -8.057e-02  1.053e-01  -0.765   0.4461    
PC13        -4.099e-02  1.064e-01  -0.385   0.7008    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.4141 on 105 degrees of freedom
  (484 observations deleted due to missingness)
Multiple R-squared: 0.2558, Adjusted R-squared: 0.1637 
F-statistic: 2.777 on 13 and 105 DF, p-value: 0.001961 
This is a degenerate model that essentially didn’t fit (though the apparent significance of the PC10 coefficient fools the fitter, PC10 can’t be usable: it is essentially noise). Graphically we can see the fit is not very useful (despite having a little bit of R-squared) by plotting the fit over the region it was trained on. Notice how the fit’s variance is much smaller than the true data’s variance even in the region of the training data; this is typical of bad regression fits.
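There is a quick way to see why the fit variance must be small here: for an ordinary least squares fit with an intercept, the ratio var(fitted)/var(actual) over the training data equals R-squared, so a model with an R-squared around 0.26 reproduces only about a quarter of the real variance. A sketch on hypothetical synthetic data (not the reconstruction data) demonstrates the identity:

```r
# Synthetic example: var(fitted)/var(actual) equals R-squared for an
# ordinary least squares fit with an intercept
set.seed(42)
d <- data.frame(x = rnorm(120))
d$y <- 0.5 * d$x + rnorm(120)
m <- lm(y ~ x, data = d)
# The two numbers agree: a low-R-squared fit necessarily under-states variance
print(c(varRatio = var(fitted(m)) / var(d$y),
        R2 = summary(m)$r.squared))
```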
> datUnion$prediction <- predict(model,newdata=datUnion)
> dRange <- datUnion[datUnion$Year>=1856 & datUnion$Year<=1980,]
> xyplot(avgTemp + prediction ~ Year,data=dRange,type='l',auto.key=TRUE)
Now the statement they wanted to make is that the present looks nothing like the past. The past is available only through the fit model, so what you would hope is that the model looks like the present and that the model itself then separates the past from the present. Instead, as you can see in the graphs above and below, this fails in two ways: the model looks nothing like the present, and the model’s past looks a lot like the model’s present.
> datUnion$prediction <- predict(model,newdata=datUnion)
> xyplot(avgTemp + prediction ~ Year,data=datUnion,type=c('p','smooth'),auto.key=TRUE)
What we could do to falsely drive the conclusion (which itself may or may not be true; it just is not supported by this technique, model, or data) is create the infamous graph where we switch from modeled data in the past to actual data in the present, and then act surprised that the two do not line up (which they did at no step during the fitting). I don’t have the heart to unify the colors or remove the legend, but here is the graph below:
> datUnion$dinked <- datUnion$prediction
> datUnion$dinked[!is.na(datUnion$avgTemp)] <- NA
> xyplot(avgTemp + dinked ~ Year,data=datUnion,type=c('p','smooth'),auto.key=TRUE)
The reason the blue points look different from the others is that they came from the average temperature data instead of from the model (where everything else came from). Switching the series is essentially assuming the conclusion that the recent past looks very different from the far past.
Essentially this methodology was so poor that it could not have illustrated or contradicted recent global warming. There are plenty of warning signs that the model fit is problematic, and the conclusion illustrated in the last graph cannot actually be proved or disproved from this data (the proxy variables are too weak to be useful; that is not to say there are not other, better proxy variables or modeling techniques). The problems of the presentation are, of course, not essential problems in detecting global warming (which likely is occurring and likely will be a drain on future quality of life) but problems found in a single bad analysis.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.