Menu Home

A deeper theory of testing

In some of my recent public talks (for example: here and here) I have mentioned a desire for “a deeper theory of fitting and testing.” I thought I would expand on what I meant by this.

In this note I am going to cover a lot of different topics to try and suggest some perspective. I won’t have my usual luxury of fully defining my terms or working concrete examples. Hopefully a number of these ideas (which are related, but don’t seem to easily synthesize together) will be subjects of their own later articles.


The focus of this article is: the true goal of predictive analytics is always: to build a model that works well in production. Training and testing procedures are designed to simulate this unknown future model performance, but can be expensive and can also fail.

What we want is a good measure of future model performance, and to apply that measure in picking a model without running deep into Goodhart’s law (“When a measure becomes a target, it ceases to be a good measure.”).

Most common training and testing procedures are destructive in the sense they use up data (data used for one step may not be safely used for another step in an unbiased fashion, example: excess generalization error). In this note I thought I would expand on the ideas for extending statistical efficiency or getting more out of your training while avoiding overfitting.

6816780226 4ff0d8324a o
Destructive testing.

I will outline a few variations of model construction and testing techniques that one should keep in mind.

The problem

When applying a machine learning algorithm you are faced with the issue that any data used in the feature selection, feature encoding, parameter selection, model construction, or model selection becomes non-exchangeable with future application data. This usually evinces as an undesirable upward bias in observed model performance when estimated on this contaminated data. The model appears to perform well on the data used to construct it, and then fails in production.

These are standard statistical problems, and there are standard ways to defend against it (regularization and bagging during training to try to bias towards simpler and more robust models, followed by evaluation on additional held out data). This is certainly a necessary level of care, which we try to teach here and here. We have also written about some of the issues and limitations of hold out procedures (here and here).

The statistical issues I am particularly interested in include:

  • Holding out a large amount of data may be impractical (you may not have enough data to do so). This is the notion of “statistical efficiency” (getting the most from your data). And even when you have a lot of data you may not want to depend on holding a large quantity of it (for example this can interfere with trying to get good estimates of the distribution of rare features- a key problem in text classification).

  • The procedure may not be strong enough (especially when dealing with time based data, or with data where records are grouped in any meaningful way). In finance structured back-testing (models are only tested on data that is exclusively in the future of any data used during construction) is mandatory, and in customer analysis you don’t let transactions from any customer cross a test/trains split (despite possibly being separate rows in the dataset).

    Many beginning data scientists seem to confuse test/train split (a retrospective procedure) with controlled experiments (such as A/B testing). Test/train split is easy to implement, but does not in fact have the power of actually controlling the experimental design ahead of time.

  • It is easy to trick most common machine learning algorithms into over-fitting (including those that include a cross-validation component, often on non-duplicated exchangeable data).

  • At some point in a statistics, data science, or machine learning project you want to use the output of one model as the input to another. It turns out this is not always well-founded or safe. For even an unbiased model a down-stream ensemble model naively built using the same training data sees biased performance and is likely to use the sub-model in an inappropriate way (such as put too large a coefficient on it). This is a well-known regression to the mean effect and we have written about it in the context of variable preparation and treatment.

The standard practices

The common procedures to mitigate these bad effects include:

  • Ignoring the whole issue and using all data for training (not advised, but we see people doing it).
  • Using all data for training and using restriction methods (such as model complexity controls, regularization and bagging) to bound excess generalization error. Or using statistical adjustments (such as Bonferroni corrections) to attempt to get good out of sample performance estimates.
  • Using a standard random test/train split or cross validation methods (good procedures especially if you adapt them to your domain- and the topic of this article series).

Other possibilities

What can be tried beyond a single test/train split and beyond simple Bonferroni corrections (which quickly say you have tried so many candidate models you should no longer expect any bound on excess generalization error, a correct but unusable determination).

Structured back testing

This is in fact a standard practice in quantitative finance and time-series, but it applies many more places (and is often ignored by beginning data scientists).

Aegeus consults with early Delphic oracle Themis.

The idea is: in some cases it is naive to think a simple random split of data into test and training will successfully simulate the arrival of future data. A non-structured split may leak information across groups and time. This kills the utility of testing when you have:

  • Issues of time. In structured back-testing all data used to evaluate the model must be in the future of any data used to construct the model. This helps simulate how models are used (they are usually built before they are applied). In finance data from the future makes prediction seem easier than it will be in production (where it will not be available).
  • Issues of grouping. In e-commerce it is important that even when predicting at the transaction level that all transactions from a single customer fall wholly in test or wholly in training. Such grouping is one of the tenants of A/B test design, but also must be remembered during retrospective model evaluation.
  • Issues of replication.

    Duplicated records completely kill test/train split and cross validation. The idea is a training set drawn from a set with many duplicated (or near-duplicate) records can reliably simulate a held-out test or cross-validation set by re-sampling. This means the held-out set likely holds few surprises and good performance on test is inevitable, and not the hoped for evidence of exchangeability with future application data. Many statistical tests fail (such as p-values) as the number of degrees of freedom can no longer be estimated from the number of data rows.

    Duplication can be an essential feature of the problem (serial correlation is one example) or it can be a side effect of errors in data handling (under constrained joins being an example).

The fix is to use domain knowledge to control test/train split and cross validation. Common fixes are splitting on time, and also respecting groups labels during split. Duplication and near-duplication are harder to deal with, and may require using a re-processing of variables (possibly cryptographic or deliberately unlearnable) to control splitting.

Uniform bounds

Playtime, Jacques Tati 1967.

The standard statistical adjustments are largely variations of Bonferroni corrections to get around the multiple comparison problem. At issue is: what is the correction for models selected by optimization? Is it the number of models considered (no), the number of models possibly considered over every possible restart (very large, possibly infinite), or the number of models in the entire space (often uncountable)?

To think on this consider the following unfair betting scheme. We flip a fair coin: heads I win, tails we flip one more time (with me winning on heads, you winning on tails). This is an unfair bet that favors me, I have a 75% chance of winning.

Here is the puzzle: if we flip the coin once and it happens to come up heads– how did my ability to ask for a re-flip affect the odds of me winning? I didn’t ask for a re-flip, so one would naively think that ability didn’t enter into the picture (since it didn’t apply the observed game). However, the fact the only way the game ends at one flip is when I win means the ability to ask for a re-flip is very relevant to single-flip games. I win 100% of single flip games. Knowing the game ended at one flip informs you that I won (else I would have asked for the re-flip and the game would have gone on longer).

The issue is: probabilities treated as frequencies of repeated experiments depend on averaging over all possible outcomes. This possibly includes outcomes not seen in any fixed set of experimental runs. And this is why standard frequentist statistical corrections have the disturbing property that they depend on statistics from counterfactual situations (that is: depend on things that did not happen!).

This is why Cortes and Vapnik used a theory of uniform estimation to prove generalization ability (performance of a classifier on new exchangeable data) instead of a Bonferroni correction. The idea is: if your optimization procedure is picking from a space of classifiers such that you have proven a uniform bound that none of these classifiers is more than 10% over-fit on training data (i.e. none of them would see more than a 10% loss in performance when moving to new exchangeable data) then the classifier you end up picking from this set will also see no more than a 10% loss in performance when moving to new exchangeable data. So you can examine as many of them as you want any apply the uniform bound as your performance adjustment instead of the Bonferroni correction. The issue is: useful uniform bounds (bounds that apply equally to all elements of a set) are typically hard to prove.

The uniform bound itself is typically established by using a model complexity measure (such as VC dimension) or regularization control (such as margin).

Mixing loss and complexity

When dealing with model complexity penalties and bounds I feel there is some care to be taken in the following issue. For a classifier C we define L(C) as the training loss (large is bad) and M(C) as the margin or non-complexity bonus (so large is good). Margin is typically expressed as things like the smallest gap between examples with different labels (high frequency changes thought of as “complex”), or the ratio of data variance over coefficient norm (large coefficients thought of as “complex”). Complexity is often the number of bits needed to write down the model, or the -log() of how rare the model is under some prior distribution.

Under this setup there are some subtle differences in using the following optimization formulation for model training:

  1. Pick C minimizing L(C) subject to M(C) ≥ B

    This objective is of the style used in establishing uniform bounds. We pick the classifier with the best performance on training from a set of classifiers uniformly limited in such a way to not do much worse on new data.

  2. Pick C maximizing M(C) subject to L(C) ≤ B

    This is the objective used in hard margin support vector machines: maximize the margin over all reachable models that have perfect (separated) training performance.

  3. Pick C minimizing L(C) - c M(C)

    This is the objective actually used by soft-margin support vector machines (the support vector machines actually used in practice). This turns out to work, but you must check the solution actually has substantial margin (otherwise you have no reason to believe the model will “generalize” or perform well on new data). This is part of why soft margin is not the same hard margin (for more on the steps needed to chain together all of the arguments used to establish soft margin support vector machines see here).

    This sort of objective is essentially optimizing over an “adjusted loss”, which I have not always found to be reliable (especially when choosing among different model types). Sometimes this gets dressed up in information theory terminology (such as MDL “Occam’s Razor” theories). But one must be wary: this form of objective is assuming a very detailed (and possibly unjustified) ability to continuously price loss versus complexity. This form is attempting to exploit a lot more of the structure of L() and M(), meaning you have to supply more such structure when defining L() and M().

    That being said, most statistical adjustments and regularized models are of this general form.

A bit more on MDL style ideas

Ideas like MDL (minimum description length), Occam’s Razor and maximum entropy are often described as preventing over-fit. They in fact do not. And it is this “folk mis-perception” that I mean to criticize when I say “I don’t like MDL.”

The problem is these approaches all ignore the multiple experiment problem (how many simple hypotheses you looked at to get your “simple model” and the possible Bonferroni corrections. And the original authors of these techniques essentially show that principles like Rissanan’s MDL or Jaynes’s maximum entropy principle are in fact equivalent to assuming certain “minimax codebook” priors on the parameters (see “A Universal Prior for Integers and Estimation by Minimum Description Length”, Jorma Rissanen, Ann. Statist. Volume 11, Number 2 (1983), 416-431). These are reasonable uninformative priors, but a consequence of the Bernstein–von Mises theorem is: with enough data your chosen model tends to be independent of reasonable priors).



In “The Ladder: A Reliable Leaderboard for Machine Learning Competitions”, Avrim Blum, Moritz Hardt the authors fix a common data-leak in leaderboard scoring through the simple expedient of censoring results! Their innovation centers around:

  • Forcing the researcher to delegate model evaluation.
  • Censoring which model evaluations the researcher is allowed to see.

See here for a great example and explanation.

One observation is the ladder method immediately justifies greedy forward stepwise regression, as long as you never add a variable with negligible model improvement.

Differential Privacy

The difference engine.

In “Preserving Statistical Validity in Adaptive Data Analysis” Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth we actually see a framework of modern testing laid out (it looks like this may be a pre-print of “The reusable holdout: Preserving validity in adaptive data analysis” Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth). This allows the authors to prove hardness/impossibility results (very difficult to actually do) and also use ideas such as differential privacy to develop less biased scoring algorithms.

In particular the authors show how to simultaneously measure a large number of expectations (square in the size of the data for efficient algorithms, exponential in the size of the data for theoretical methods). This would not be possible to establish using Bonferroni correction methods. Part of the contribution is calling the difference in simultaneously estimating many summary statistics (which this paper shows possible) and simultaneously estimating many unobserved parameters (which the Cramer-Rao bound shows is impossible). The difference in difficulty is probably similar to the fact a random projection p() tends to preserve first order quantities such as distances ((b-a).(b-a) should be proportional to (p(b)-p(a)).(p(b)-p(a))), but not higher order quantities such as angles or arbitrary dot products (it is hard to bound (p(b)-p(a)).(p(c)-p(a)) in terms of (b-a).(c-a)).

What remains to expand the techniques beyond evaluation score fuzzing. Perhaps one could re-implement adaptive (or sequential decision based) machine learning algorithms (such decision trees, boosting, and so-on) in terms of these primitives. It also remains to re-explain the effective technique used (adding noise) in terms of standard statistical methods (likely priors, shrinkage, or even a non-origin James-Stein estimator).

We know Misha Bilenko has applied these techniques to improve the reliability of “learning from counts” (see also “effects codes” (see Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, Cohen et. al., or impact coding) through Laplace smearing to establish differential privacy.



This is a personal favorite vague possibility, though I haven’t adapted it into usable procedures. Roughly as we use data (to select variables, to adjust variables, to build sub-models, to build models) the data get contaminated. However the data isn’t completely useless (though using it in a naive manner can be disastrous).

This method I am interested in came from the following apparent paradox:

  • It is well know that in many situations to correctly run a Markov chain for n-steps you need access to c n independent random bits (for some constant c). Fewer bits and you provably do not generate the correct distribution of the stopping point.
  • It is also a theorem of information theory to generate from a distribution with k-bits of entropy that k bits are enough.

This seemed to imply a contradiction. There are many Markov chain Monte Carlo methods where n is large and k is small. Why then do we need so many bits to get the chain into a proper stopping distribution?

Russell Impagliazzo and David Zukerman resolved this in “How to Recycle Random Bits.” Proceeding SFCS ’89 Proceedings of the 30th Annual Symposium on Foundations of Computer Science Pages 248-253 The idea is: take c n random bits. Use pseudo-random generator ideas to split this into two new streams of bits: c n pseudo-random used to run the Makrov chain and a smaller sequence of c n - k reprocessed random bits that look conditionally independent of the Markov chain’s stopping state. So essentially you need c n random bits to run the Markov chain, but you can give back c n - k of them and end up net-consuming only k of them.

There likely are a family of similar techniques and theorems for data: some sort of re-processing and re-sampling may allow a lot more principled use of data than one might first imagine. It may be the only practical such techniques are those already seen in the censoring and differential privacy solutions, or there may be other interesting ways to cryptographically re-use data.


We need a statistical toolkit broader than:

  • Bonferroni correction
  • Complexity controls, regularization
  • Bagging
  • Hold out tests
  • Cross-validation

A larger list (allowing the possibility of better results in more situations) might include at least the following:

  • Bonferroni correction
  • Complexity controls, regularization
  • Bagging
  • Hold out tests
  • Cross-validation
  • Structred back-testing
  • Uniform bounds
  • Censored/lock-box procedures
  • Differential privacy techniques
  • Recycling

Concrete follow-up articles

Categories: Opinion Pragmatic Data Science

Tagged as:


Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

2 replies

  1. I just stumbled upon your blog.

    Excellent writeup. Definitely the testing of ML needs to be beefed up if we have any hope in industrializing the technology.

    Great work!

  2. Here is a neat paper answering the question: what if the purpose of training is to do well with future examples that are altered by an adversary?” The function to minimize for the logistic-regression style learning is particularly clever (and the plus epsilon, minus epsilon fuzzing is interesting).

%d bloggers like this: