Recently saw a really fun article making the rounds: “The prevalence of statistical reporting errors in psychology (1985–2013)”, Nuijten, M.B., Hartgerink, C.H.J., van Assen, M.A.L.M. et al., Behav Res (2015), doi:10.3758/s13428-015-0664-2. The authors built an R package to check psychology papers for statistical errors. Please read on for how that is possible, some tools, and commentary.

Early automated analysis:

Trial model of a part of the Analytical Engine, built by Babbage, as displayed at the Science Museum (London) (Wikipedia).

From the abstract of the Nuijten et al. paper we have:

This study documents reporting errors in a sample of over 250,000 p-values reported in eight major psychology journals from 1985 until 2013, using the new R package “statcheck.” statcheck retrieved null-hypothesis significance testing (NHST) results from over half of the articles from this period. In line with earlier research, we found that half of all published psychology papers that use NHST contained at least one p-value that was inconsistent with its test statistic and degrees of freedom. One in eight papers contained a grossly inconsistent p-value that may have affected the statistical conclusion.

How did they do that? Has science been so systematized it is finally mechanically reproducible? Did they get access to one of the new open information extraction systems (please see *Open Information Extraction: the Second Generation* by Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland, and Mausam for some discussion)?

No, they used the fact that the American Psychological Association defines a formal style for reporting statistical significance, just as it defines a formal style for citations. Roughly, statcheck looks for text like the following:

The results of the regression indicated the two predictors explained 35.8% of the variance (R2=.38, F(2,55)=5.56, p < .01).

(From a derived style guide found at the University of Connecticut.)

The software looks for fragments like: “(R2=.38, F(2,55)=5.56, p < .01)”. So really we are looking at statistics in psychology papers because they have standards clear enough to facilitate inspection.
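To make the idea concrete, here is a minimal base-R sketch of pulling such fragments out of running text. It is not statcheck's actual code: the helper name `extractFReports`, the column names, and the regular expression are our own assumptions, and it only handles F-tests with integer degrees of freedom.

```r
# Hypothetical helper (not statcheck's implementation): find APA-style
# F-test reports such as "F(2,55)=5.56, p < .01" in a single string.
extractFReports <- function(txt) {
  ws  <- "[[:space:]]*"          # optional whitespace
  int <- "([0-9]+)"              # integer degrees of freedom (an assumption)
  num <- "([0-9]*\\.?[0-9]+)"    # numbers like 5.56 or .01
  pattern <- paste0("F", ws, "\\(", ws, int, ws, ",", ws, int, ws, "\\)",
                    ws, "=", ws, num, ws, ",", ws,
                    "p", ws, "([<>=])", ws, num)
  hits <- regmatches(txt, gregexpr(pattern, txt))[[1]]
  if (length(hits) == 0) {
    return(data.frame())
  }
  # regexec returns the full match followed by the five capture groups
  parts <- regmatches(hits, regexec(pattern, hits))
  data.frame(
    raw         = hits,
    df1         = as.integer(sapply(parts, `[`, 2)),
    df2         = as.integer(sapply(parts, `[`, 3)),
    Fvalue      = as.numeric(sapply(parts, `[`, 4)),
    pComparison = sapply(parts, `[`, 5),
    pReported   = as.numeric(sapply(parts, `[`, 6)),
    stringsAsFactors = FALSE
  )
}

apaText <- paste("The results of the regression indicated the two predictors",
                 "explained 35.8% of the variance",
                 "(R2=.38, F(2,55)=5.56, p < .01).")
reports <- extractFReports(apaText)
print(reports)
```

A real extractor has to cope with many more test types and formatting variants; the point is only that a fixed reporting style makes the text machine-readable with ordinary pattern matching.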

These statistical summaries are often put into research papers by cutting and pasting from multiple sources, as not all stat packages report all these pieces in one contiguous string. So there are many chances for human error, and a very high chance the pieces eventually get out of sync. Think of a researcher using Microsoft Word, Microsoft Excel, and some complicated graphical-interface-driven software again and again as data and treatment change throughout a study. We can try to check for inconsistency, as both the reported p-value and R-squared are derivable from the `F(numdf,dendf)=Fvalue` portion.
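As a sketch of that derivation (the function name `recomputeFromF` is ours, purely for illustration): if we assume the reported F is the overall F-test of an ordinary least squares regression with an intercept, with `numdf` the number of predictors and `dendf` the residual degrees of freedom, then R-squared is `F*numdf/(F*numdf + dendf)` and the p-value is the upper tail of the F distribution.

```r
# Hypothetical helper: recompute R-squared and the p-value from an
# overall regression F-statistic and its degrees of freedom.
# Assumes an OLS model with intercept, so that
#   F = (R2/numdf) / ((1 - R2)/dendf).
recomputeFromF <- function(numdf, dendf, FValue) {
  list(
    R2 = (FValue * numdf) / (FValue * numdf + dendf),
    p  = pf(FValue, numdf, dendf, lower.tail = FALSE)
  )
}

chk <- recomputeFromF(numdf = 2, dendf = 55, FValue = 5.56)
print(chk)
```

Comparing these recomputed values to the reported R-squared and p-value is the whole consistency check.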

In fact the cited example has errors. The “explained 35.8% of the variance” should likely be 38% (to match the R2 / coefficient of determination) *and* the “`F(2,55)=5.56`” bit would entail an R-squared closer to the following: **F Test** summary: (*R*^{2}=0.17, *F*(2,55)=5.56, *p*≤0.00632) (we chose to show the actual p-value, but cutting off at a sensible limit is part of the guidelines). Likely this is a *notional* example itself built by copying and pasting to show the format (so we have no intent of mocking it). We derived this result by writing our own R function that takes the F-summaries and re-calculates the R-squared and p-value. In our case we performed the calculation by pasting the following into R: “`formatFTest(numdf=2,dendf=55,FValue=5.56)`”, which performs the calculation and formats the result close to APA style.

Really this helps point out *why* scientists should strongly prefer workflows that support reproducible research (a topic we teach using R, RStudio, knitr, Sweave, and optionally LaTeX). It would be better to have correct conclusions automatically transcribed into reports, instead of hoping to catch some fraction of wrong ones later. This is one reason Charles Babbage specified a printer on both his Difference Engine 2 and Analytical Engine (circa 1847): to avoid errors!

The workflow we recommend is to include data-driven results automatically. With R-based tools, after we build a linear regression model (say called `model`) we can include an additional bit of R code in a master document that looks like the following:

```
formatFTest(model,pSmallCutoff=1.0e-12)
```

The tools then copy the summary statistics, in APA format, into our report automatically:

**F Test** summary: (*R*^{2}=0.86, *F*(1,18)=110.7, *p*=4.06e-09).

These are methods we teach using tools such as R, knitr, and Sweave.
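For readers who want to see the moving parts, below is a minimal sketch of what such a formatter could look like for an `lm` model. It is not our actual `formatFTest` code: the name `formatFSummarySketch`, the rounding choices, the default cutoff, and the synthetic data are all assumptions made for illustration. The only base-R facts it relies on are that `summary(model)$fstatistic` holds `value`, `numdf`, and `dendf`, and that `summary(model)$r.squared` holds the R-squared.

```r
# Hypothetical sketch (not the article's formatFTest): build an
# APA-like F-test summary string from a fitted lm model.
formatFSummarySketch <- function(model, pSmallCutoff = 1.0e-5) {
  s <- summary(model)
  f <- s$fstatistic  # named numeric vector: value, numdf, dendf
  p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
  # report "p<cutoff" instead of implausibly tiny exact values
  pString <- if (p < pSmallCutoff) {
    paste0("p<", format(pSmallCutoff, digits = 3))
  } else {
    paste0("p=", format(p, digits = 3))
  }
  paste0("F Test summary: (R^2=", format(s$r.squared, digits = 2),
         ", F(", f[["numdf"]], ",", f[["dendf"]], ")=",
         format(f[["value"]], digits = 4), ", ", pString, ").")
}

# Synthetic example data, for illustration only
set.seed(2016)
d <- data.frame(x = 1:20)
d$y <- 2 * d$x + rnorm(20)
model <- lm(y ~ x, data = d)
summaryText <- formatFSummarySketch(model)
cat(summaryText, "\n")
```

In a knitr document a call like this would typically sit in an inline R expression, so the numbers in the prose are regenerated every time the data or model changes.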

That being said we recommend reading the original paper. The ability to detect errors gives the ability to collect statistics on errors over time, so there are a number of interesting observations to be made. For more work in this spirit we suggest “An empirical study of FORTRAN programs” Knuth, Donald E., Software: Practice and Experience, Vol. 1, No. 2, 1971, doi: 10.1002/spe.4380010203.

We can even try running statcheck on the guide; it confirms the relation between the F-value and p-value and doesn’t seem to check the R-squared (probably not part of the intended check):

| Field | Value |
|---|---|
| Source | 1 |
| Statistic | F |
| df1 | 2 |
| df2 | 55 |
| Test.Comparison | = |
| Value | 5.56 |
| Reported.Comparison | < |
| Reported.P.Value | 0.01 |
| Computed | 0.006321509 |
| Raw | F(2,55)=5.56, p < .01 |
| Error | FALSE |
| DecisionError | FALSE |
| OneTail | FALSE |
| OneTailedInTxt | FALSE |
| APAfactor | 1 |
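The `Error` entry above boils down to comparing the recomputed p-value with what was reported. A simplified sketch of such a comparison follows; it is not statcheck's actual rule set (which is more careful about rounding and one-tailed tests), and the function name `isConsistentSketch` and its `digits` argument are our own assumptions.

```r
# Hypothetical, simplified consistency check (not statcheck's logic):
# does a recomputed p-value agree with the reported comparison?
isConsistentSketch <- function(pComputed, comparison, pReported, digits = 2) {
  switch(comparison,
    "<" = pComputed < pReported,
    ">" = pComputed > pReported,
    # for "=", compare after rounding both to the reported precision
    "=" = abs(round(pComputed, digits) - round(pReported, digits)) <
            10^(-(digits + 2)),
    stop("unknown comparison: ", comparison)
  )
}

pComputed <- pf(5.56, 2, 55, lower.tail = FALSE)
consistent <- isConsistentSketch(pComputed, "<", 0.01)  # reported "p < .01"
print(consistent)
```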

Our R code demonstrating how to automatically produce ready to go APA style F-summaries can be found here.


### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

We are starting to wrap this formatting functionality into a package https://github.com/WinVector/sigr (see http://www.win-vector.com/blog/2016/10/adding-polished-significance-summaries-to-papers-using-r/ ) that works with `knitr`. Of particular interest is the Chi-square summary from logistic regression, which doesn’t even have an overall model fit quality significance reported by base R!

M.B. Nuijten generously contributed a nice correction (the use of “=” instead of “<=”, and using “<” when we go below a small bound) which we have incorporated in this article and the sigr package. Obviously if we are taking on doing this, we need to do it right.

Wow, this is now a discussion on Hacker News https://news.ycombinator.com/item?id=12643978 . To be clear: the cited study is good work and a benefit to the community. I will go further and say that “subjecting old papers to criticism” is why papers are published: to make that possible. The whole point of science is that there is a ground truth lurking which, if we are careful, we may catch a glimpse of.