
Checking claims in published statistics papers

When finishing Worry about correctness and repeatability, not p-values I got to thinking a bit more about what you can actually check when reading a paper, especially when you don’t have access to the raw data. Some of the fellow scientists I admire most have a knack for back-of-the-envelope and dimensional-analysis style calculations. They can always read a few facts off a presentation that the presenter may not have meant to share. There is a joy in calculation and figuring, so I decided it would be a fun challenge to see if you could check any of the claims of “Association between muscular strength and mortality in men: prospective cohort study,” Ruiz et al. BMJ 2008;337:a439 from just the summary tables supplied in the paper itself.

The main summary you can extract from the paper is the distribution of the categorical variables. By combining the numbers from table 1 and table 2 you can compile a small overall table like the following:


For each of the ten conditions listed as row headers we can find in the paper: the total number of deceased subjects who had the condition, the total number of subjects who had the condition, and how subjects with the condition are distributed among the muscular strength groups. This is in fact enough information to recover the complete linear regression design matrix for mortality models involving the three muscular strength groups and at most one of the remaining seven categorical variables. From this you can fit coefficients, find p-values, perform an ANOVA and so on.
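To see why cell counts alone are enough, here is a minimal Python sketch (all counts are made up purely for illustration; the article’s actual fits were done in R). Each cell of the (strength group × condition × outcome) table acts as that many identical rows, so a weighted least squares solve on 12 cells gives the same coefficients as an un-weighted fit on the fully expanded dataset:

```python
import numpy as np

# Hypothetical cell counts: three muscular strength groups crossed with
# one binary condition and the binary outcome (deceased).  Keys are
# (group, condition, deceased); values are invented counts.
toy = {
    ("lower", 0, 0): 900,  ("lower", 0, 1): 60,
    ("lower", 1, 0): 400,  ("lower", 1, 1): 50,
    ("middle", 0, 0): 1000, ("middle", 0, 1): 45,
    ("middle", 1, 0): 350,  ("middle", 1, 1): 30,
    ("upper", 0, 0): 1100, ("upper", 0, 1): 40,
    ("upper", 1, 0): 300,  ("upper", 1, 1): 25,
}
groups = ["lower", "middle", "upper"]

X, y, w = [], [], []
for (g, cond, dead), n in toy.items():
    # one design-matrix row per cell: group indicators plus the condition
    X.append([1.0 if g == gg else 0.0 for gg in groups] + [float(cond)])
    y.append(float(dead))
    w.append(float(n))
X, y, w = np.array(X), np.array(y), np.array(w)

# Weighted least squares: solve (X' W X) beta = X' W y, where the weight
# of each cell is its count.  This equals lm() on the expanded data.
WX = X * w[:, None]
beta = np.linalg.solve(X.T @ WX, WX.T @ y)
print(dict(zip(groups + ["condition"], beta)))
```

The point of the sketch: nothing beyond the marginal cell counts is needed to reproduce a no-intercept linear fit of mortality on strength group plus one condition.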

We wanted to see if we could do a bit more. Could we build a synthetic dataset that obeyed all of the roll-ups (or margins) shown in the above table? If we could build a complete synthetic data set that claimed to have 8762 individuals in it and matched all of the known summaries, then we could try to reproduce some of the paper’s results without having to monkey with the details of fitting (we could use the standard fitters already found in R).

To do this we decided to think about how many possible types of individual could be distinguished in this kind of study. Since we have the three exclusive (and complete) muscular strength groups, seven more binary conditions and a single binary outcome, there are exactly 3*(2^7)*2 = 768 possible types of individual. Our idea was to assign each of these 768 possible individual signatures a weight representing how many individuals with the given signature are in our synthetic dataset. These weights would completely determine our data. All we would have to do is ensure:

  1. The weights are all non-negative.
  2. The weights sum up to 8762.
  3. The 768 possible signatures, when summed in proportion to their weights, match all of the summaries in our table.

These are just linear constraints on the weights, subject to non-negativity: a linear program, solvable by any linear programming package. The weights are not uniquely determined (because the pairwise and higher-order correlations between the non-muscular factors are not known), but we can search in various directions to get different synthetic data sets.
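As a scaled-down illustration of the linear program (a sketch, not the code we used), here is a Python version with scipy.optimize.linprog. Three strength groups, two binary conditions and the binary outcome give 3*2^2*2 = 24 signatures; all margin numbers are invented. Randomizing the objective vector is the "search in various directions" trick: different objectives pick out different weight vectors satisfying the same margins.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# Signatures: (strength group in {0,1,2}, condition1, condition2, deceased).
sigs = list(itertools.product(range(3), [0, 1], [0, 1], [0, 1]))

# Made-up margins: per-group totals (summing to the grand total of 1000),
# totals for each condition, and total deceased.
A_eq, b_eq = [], []
for g, tot in enumerate([350, 330, 320]):             # group totals
    A_eq.append([1.0 if s[0] == g else 0.0 for s in sigs]); b_eq.append(tot)
A_eq.append([float(s[1]) for s in sigs]); b_eq.append(200)  # condition-1 total
A_eq.append([float(s[2]) for s in sigs]); b_eq.append(150)  # condition-2 total
A_eq.append([float(s[3]) for s in sigs]); b_eq.append(60)   # total deceased

# A random objective direction yields one of the many feasible weightings.
rng = np.random.default_rng(0)
res = linprog(c=rng.uniform(size=len(sigs)),
              A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, None)] * len(sigs))
print("nonzero signature types:", int(np.sum(res.x > 1e-9)))
```

The full-size problem is the same shape, just with 768 weight variables and the margins from the paper’s tables.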

Below is a typical synthetic sample. Notice it uses only 34 of the individual types (out of 768 possible) and fills out to 8762 individuals by repetitions proportional to the “wt” column. Notice also that some weights are fractional, so if we want an un-weighted data set with 8762 rows we are going to have to round the weights to integers (which will perturb all of the counts slightly).
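One way to do the rounding while keeping the total at exactly 8762 is the largest-remainder method. A minimal sketch (the helper name is ours, not from ExperimentInspector):

```python
import numpy as np

def round_preserving_total(w):
    """Round non-negative weights to integers while keeping the total
    fixed, via the largest-remainder method.  Each cell count moves by
    at most 1, which only slightly perturbs the margins."""
    w = np.asarray(w, dtype=float)
    floors = np.floor(w).astype(int)
    shortfall = int(round(w.sum())) - floors.sum()
    # hand the leftover units to the cells with the largest remainders
    order = np.argsort(-(w - floors))
    floors[order[:shortfall]] += 1
    return floors

w = np.array([3.6, 2.2, 4.2, 5.0])   # fractional weights, total 15
print(round_preserving_total(w))      # prints [4 2 4 5]
```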


The above synthetic data set exhibits the same association between low muscular strength and death as the original study. But we in fact generated 10 of these synthetic samples (by randomly favoring different signatures), and a few of them did not reproduce the results. For example, the synthetic data set we call group-2 shows the claimed relation when the muscle strength levels are the only variables, but loses the relation when the other conditions are added to the model.

> print(summary(lm(deceased~0+MuscularStrength.lower+MuscularStrength.middle+MuscularStrength.upper,data=dg)))

Call:
lm(formula = deceased ~ 0 + MuscularStrength.lower + MuscularStrength.middle + 
    MuscularStrength.upper, data = dg)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.07329 -0.07329 -0.04995 -0.04896  0.95104 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
MuscularStrength.lower  0.073288   0.004300   17.04   <2e-16 ***
MuscularStrength.middle 0.048956   0.004299   11.39   <2e-16 ***
MuscularStrength.upper  0.049949   0.004298   11.62   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2324 on 8761 degrees of freedom
Multiple R-squared:  0.0596,	Adjusted R-squared:  0.05927 
F-statistic: 185.1 on 3 and 8761 DF,  p-value: < 2.2e-16

That is, for this synthetic data set the conclusions of the paper do not hold (see below).

> print(summary(lm(deceased~0+MuscularStrength.lower+MuscularStrength.middle+MuscularStrength.upper+sedentary+current.smoker+five.drinks.weekly+diabetes.millitus+hypertension+hyercholesterolaemia+family.cardiovascular,data=dg)))

Call:
lm(formula = deceased ~ 0 + MuscularStrength.lower + MuscularStrength.middle + 
    MuscularStrength.upper + sedentary + current.smoker + five.drinks.weekly + 
    diabetes.millitus + hypertension + hyercholesterolaemia + 
    family.cardiovascular, data = dg)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.22271 -0.06749 -0.03845 -0.02006  0.98879 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
MuscularStrength.lower   0.012554   0.006748   1.860  0.06287 .  
MuscularStrength.middle -0.005834   0.006327  -0.922  0.35650    
MuscularStrength.upper  -0.004142   0.006069  -0.682  0.49499    
sedentary               -0.029898   0.009790  -3.054  0.00226 ** 
current.smoker           0.002652   0.009193   0.289  0.77296    
five.drinks.weekly       0.025898   0.005587   4.636 3.61e-06 ***
diabetes.millitus        0.131324   0.016718   7.855 4.46e-15 ***
hypertension             0.068976   0.006191  11.141  < 2e-16 ***
hyercholesterolaemia     0.042745   0.006680   6.399 1.64e-10 ***
family.cardiovascular    0.069631   0.008404   8.285  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2289 on 8754 degrees of freedom
Multiple R-squared:  0.08821,	Adjusted R-squared:  0.08717 
F-statistic: 84.69 on 10 and 8754 DF,  p-value: < 2.2e-16

Now of course the paper is claiming the conclusion holds on a single real dataset, not that it would hold on all datasets with the same summary statistics. But it does mean the tables given in the paper are not specific enough to entail the claimed result. To confirm the paper’s result you would need more detailed access to their data. In the future refereeing may improve to the point where you can’t expect to publish claims without releasing the confirming data and procedures (via ideas like IPython notebooks and so on). But, unfortunately, that isn’t the standard today.

We did all of this in fun. We think the muscle strength paper is a good paper. But we also want to be ready and able to kick the tires on papers. Along those lines we are sharing code that reads a summary table (formatted like our first table) and produces synthetic datasets: ExperimentInspector. All code and data used to write this article are shared there.

If you have a serious need for a more production hardened tool of this nature, please get in touch. We would love to help with things like clinical oversight, experiment planning, regulatory compliance and fraud detection.

Categories: Programming Statistics



Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

1 reply

  1. What I really would like to do (with some spare time) is build a uniform sampler for integer solutions to the first table. This would let us build an actual test of claims (under the uniform distribution over possible data sets) as described in “Testing for Independence in a Two-Way Table: New Interpretations of the Chi-Square Statistic,” Persi Diaconis and Bradley Efron, The Annals of Statistics, Volume 13, Number 3 (1985), 845-874. It’s a little more complicated here, as we have more constraints in table 1 than just the 2-way contingency table margins, but either a Markov chain generation technique (like in my thesis) or a counting idea (like in Barvinok’s work) would get it done.

    Or to get out of the weeds: build a library of many exact tests, make it available in iPython, and set up an organization that publishes a “results criticism workbook” for every medical paper. If we show how little of what medical papers claim is actually implied by the few data summaries they release, then writers will be under much greater pressure to release actual data and procedures. Makes me think of organizations like NIST and so on.

