We explore some of the ideas from the seminal paper “The Data-Enrichment Method” (Henry R Lewis, Operations Research (1957) vol. 5 (4) pp. 1-5). The paper explains a technique for improving the quality of statistical inference by increasing the effective size of the data-set. This is called “Data-Enrichment.”

Now more than ever we must be familiar with the consequences of these important techniques, especially since we may already be victims of them without knowing it.

“The Data-Enrichment Method” is an absolutely wonderful 1957 tongue-in-cheek parody of a very tempting method of accidental data manipulation. The method presented is spookily plausible and actually anticipates some very important (and correct) methods later used in the EM, Jackknife, Bootstrap and other resampling techniques (for example see: “Bootstrap Methods: Another Look at the Jackknife”, Bradley Efron. Ann. Statist. (1979) vol. 7 (1) pp. 1-26).

The idea is innocently presented with an accompanying data-set: perception of a sound presented at different decibel levels (loudnesses):

Source.DB | Detections | Failures |
---|---|---|
62 | 5 | 40 |
65 | 10 | 30 |
68 | 15 | 20 |
71 | 20 | 10 |
74 | 25 | 5 |
77 | 30 | 3 |

From this table it is obvious that the number of detections is increasing (and the number of failures is decreasing) as the sound is presented louder and louder. This makes sense and puts a quantitative rate to our prior expectation that detection gets easier as loudness increases. For this data the trend is quite obvious and we can easily plot a regression line that accurately models the effect of Source.DB on detection rate:

But we want more. Can we increase our model precision and confidence by incorporating our domain knowledge? If we are only trying to accurately estimate the rate at which loudness increases detection, and we are willing to assume that it really does increase, then could we not pre-prepare the data to use our domain knowledge?

The method suggested is to add in some counter-factuals that we feel confident about. For example we could (using our assumption that loudness increases detection, just to an unknown degree) notice that the 30 failures at 65 DB certainly would not have been heard if they had been run at 62 DB (even quieter). By the same reasoning we can assume that the 5 detections at 62 DB would have been heard had they been run at 65 DB, 68 DB, 71 DB, 74 DB or 77 DB. In this way we have used our starting “seed data” and our domain knowledge to boost into a much larger data set that shows the expected relation much more strongly.

The above paragraph is, of course, nonsense. I am doing the original paper an injustice by summarizing, because in the original paper the procedure seems perfectly plausible (and useful). It is not until the author works a second example that has a poor initial relation (that actually needs the enrichment) that the joke is revealed.

The second example is coin flipping. The author applies an inductive bias that “clearly standing higher up on a staircase increases the chances of a coin flip coming up heads” and then uses the data enrichment method to enhance the data set. The original data set is indeed too noisy to show the effect and the enhancement is in fact quite dramatic. The original data:

Stair.Step | Heads | Tails |
---|---|---|
1 | 4 | 6 |
2 | 5 | 5 |
3 | 7 | 3 |
4 | 4 | 6 |
5 | 6 | 4 |
6 | 5 | 5 |
7 | 6 | 4 |
8 | 6 | 4 |
9 | 3 | 7 |
10 | 4 | 6 |

The enhanced data is much more interesting:

Stair.Step | Virtual.Heads | Virtual.Tails |
---|---|---|
1 | 4 | 50 |
2 | 9 | 44 |
3 | 16 | 39 |
4 | 20 | 36 |
5 | 26 | 30 |
6 | 31 | 26 |
7 | 37 | 21 |
8 | 43 | 17 |
9 | 46 | 13 |
10 | 50 | 6 |
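
The enrichment step amounts to two running sums: a head on a lower step is counted again on every higher step, and a tail on a higher step is counted again on every lower step. Here is a minimal sketch in Python, assuming the rows are ordered from lowest to highest step. The function name `enrich` and the variable names are ours for illustration and do not come from Lewis's paper.

```python
from itertools import accumulate


def enrich(successes, failures):
    """Apply the (bogus) enrichment to rows ordered by increasing level.

    A success at a lower level is re-counted at every higher level, and a
    failure at a higher level is re-counted at every lower level.
    """
    virtual_successes = list(accumulate(successes))
    # Running sum from the top level down, then flipped back into row order.
    virtual_failures = list(accumulate(reversed(failures)))[::-1]
    return virtual_successes, virtual_failures


# Raw coin-flip counts from the original table, stair steps 1 through 10.
heads = [4, 5, 7, 4, 6, 5, 6, 6, 3, 4]
tails = [6, 5, 3, 6, 4, 5, 4, 4, 7, 6]

virtual_heads, virtual_tails = enrich(heads, tails)
print(virtual_heads)  # [4, 9, 16, 20, 26, 31, 37, 43, 46, 50]
print(virtual_tails)  # [50, 44, 39, 36, 30, 26, 21, 17, 13, 6]
```

The output matches the enhanced table above: the 100 real flips become 564 virtual ones, none of which was ever actually flipped.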

It is easier to see what is going on in the following plots (which show measured success rates as a function of number of stairs up the staircase and show a smoothed fit of the relationship). The original data is a noisy mess:

And the enriched data is more trend-like:

In fact the regression line fit onto the raw data even has the wrong sign (points down instead of up):
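
The sign flip can be checked directly from the two tables. The sketch below fits an unweighted least-squares line to the per-step heads rate. That is an assumption on our part: the fit behind the original plots may have been weighted or of a different form. The helper `ols_slope` is a hypothetical name.

```python
def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


steps = list(range(1, 11))

# Raw and enriched counts copied from the two coin-flip tables.
heads = [4, 5, 7, 4, 6, 5, 6, 6, 3, 4]
tails = [6, 5, 3, 6, 4, 5, 4, 4, 7, 6]
virtual_heads = [4, 9, 16, 20, 26, 31, 37, 43, 46, 50]
virtual_tails = [50, 44, 39, 36, 30, 26, 21, 17, 13, 6]

raw_rates = [h / (h + t) for h, t in zip(heads, tails)]
enriched_rates = [h / (h + t) for h, t in zip(virtual_heads, virtual_tails)]

raw_slope = ols_slope(steps, raw_rates)
enriched_slope = ols_slope(steps, enriched_rates)
print(f"raw slope:      {raw_slope:+.4f}")       # slightly negative
print(f"enriched slope: {enriched_slope:+.4f}")  # strongly positive
```

Under these assumptions the raw slope is about -0.0085 per step, while the enriched slope is positive and far larger in magnitude.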

Now, obviously this is a joke. The enhancement procedure did not so much enhance the data as obliterate it. The procedure makes no sense and it is treating the procedure with undue respect to point out any one feature as being “what is wrong with it.” But the original desire is legitimate: can we use informed assumptions to gain a useful inductive bias? If we do know something, should we not need less data?

The answer is yes, but we have to be careful. We must read up on the differences between Bayesian, frequentist and empirical methods and decide which set of methods is best for us. Up until now we have been fitting “by standard methods” which is really just minimizing how far the data is from the model (by moving the model around). That isn’t the only way to fit (see: “Controversies In The Foundation Of Statistics” Bradley Efron, American Mathematical Monthly (1978) vol. 85 (4) pp. 231-246).

For example a Bayesian might say that the goal of model fitting is not to pick a model that is closest to the data (maximizes the data’s plausibility with respect to the model) but to pick a model that maximizes the product of the data’s plausibility with respect to the model and the model’s acceptability. For example we could say all models for coin-flips with negative slopes are unacceptable and pick the best model with a non-negative slope. However, assigning degrees of acceptability (or priors) to every possible model is laborious and may require more knowledge than we have from our “reasonable prior domain knowledge.”
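
The crudest version of this idea (zero acceptability for negative slopes, equal acceptability for everything else) has a closed form for a straight-line fit. The sketch below assumes squared error as the measure of plausibility, which is our choice for illustration only. The function name `fit_line_nonnegative_slope` is hypothetical.

```python
def fit_line_nonnegative_slope(xs, ys):
    """Least-squares line subject to the constraint slope >= 0.

    The squared-error objective is convex, so when the unconstrained slope
    is negative the constrained optimum lies on the boundary slope == 0,
    where the best intercept is simply the mean of ys.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = max(0.0, sxy / sxx)  # clip unacceptable (negative) slopes to zero
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Per-step heads rate from the raw coin-flip table.
steps = list(range(1, 11))
heads = [4, 5, 7, 4, 6, 5, 6, 6, 3, 4]
tails = [6, 5, 3, 6, 4, 5, 4, 4, 7, 6]
rates = [h / (h + t) for h, t in zip(heads, tails)]

intercept, slope = fit_line_nonnegative_slope(steps, rates)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

On the coin data the unconstrained slope is negative, so the constrained fit collapses to a flat line at the overall heads rate. The constraint prevents the wrong-signed answer without manufacturing a trend.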

Another method is to use more sophisticated notions. One such method is Quantile Regression (Roger Koenker, Cambridge University Press 2005). This methodology treats regression as a constrained optimization problem, so it is a simple matter to add in more constraints (like the slope must be positive) without having to assign arbitrary plausibilities to every possible model. Another (huge) advantage is that Quantile Regression is much more stable and even without any entered constraints recognizes that the coin-flip data is likely trend free. Here we plot the Quantile Regression analysis of the coin-data (without having added any prior constraints):

To be honest: the method got lucky, and the fit is better than should be expected. But Quantile Regression is the perfect framework for adding in domain-constraints.
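
To make the constrained-optimization view concrete, here is a brute-force sketch of median (0.5 quantile) regression on the per-step heads rates. It is not the analysis behind the plot, nor the linear-programming algorithms Koenker describes. It relies on one property of the underlying linear program: some optimal line passes through two data points, or, once a slope constraint is active, is a horizontal line through one data point. The function name and the `require_nonnegative_slope` flag are hypothetical, and the sketch assumes at least two distinct x values.

```python
from itertools import combinations


def median_regression(xs, ys, require_nonnegative_slope=False):
    """Brute-force least-absolute-deviations line fit (0.5 quantile)."""

    def total_abs_error(intercept, slope):
        return sum(abs(y - (intercept + slope * x)) for x, y in zip(xs, ys))

    candidates = []
    # Candidate lines through every pair of points with distinct x values.
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue
        slope = (y2 - y1) / (x2 - x1)
        if require_nonnegative_slope and slope < 0:
            continue  # the domain constraint rules this line out
        candidates.append((y1 - slope * x1, slope))
    if require_nonnegative_slope:
        # Boundary of the constraint: horizontal lines through single points.
        candidates.extend((y, 0.0) for y in ys)
    return min(candidates, key=lambda c: total_abs_error(*c))


# Per-step heads rate from the raw coin-flip table.
steps = list(range(1, 11))
heads = [4, 5, 7, 4, 6, 5, 6, 6, 3, 4]
tails = [6, 5, 3, 6, 4, 5, 4, 4, 7, 6]
rates = [h / (h + t) for h, t in zip(heads, tails)]

intercept, slope = median_regression(steps, rates)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

On these rates the unconstrained median fit is already the flat line at 0.5, so adding the slope constraint changes nothing here. On other data the flag simply removes the unacceptable candidates, with no need to score every possible model.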

So: while The Data Enrichment Method is a fraud, there are ways to enhance analysis to incorporate domain knowledge into results. Instead of saying “any bias (even useful bias) ruins fitting” one should have a cookbook of methods ready to be applied. These cookbooks hide under names like “Econometric Society Monographs” (in my opinion the econometricians really own the interface between theoretical statistics and hard-nosed applications).


### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

Great application of the data enrichment method (censoring out “non age related deaths”; hey, you don’t get any older than dead): http://junkfoodscience.blogspot.com/2009/07/calorie-restrictive-eating-for-longer.html

A special “hats-off” to the University of East Anglia and their Climatic Research Unit, who seem to have improved on the data enrichment method by adding the step of throwing out the original data after they apply their desired corrections ( http://www.timesonline.co.uk/tol/news/environment/article6936328.ece ). Likely anthropogenic climate change is real and a major threat, but these cheaters “getting to the best results first” have really muddied the waters.