Nina Zumel recently mentioned the use of Laplace noise in “count codes” by Misha Bilenko (see here and here) as a known method to break the overfit bias that comes from using the same data to design impact codes and fit a next level model. It is a fascinating method inspired by differential privacy methods, that Nina and I respect but don’t actually use in production.

Nested dolls, Wikimedia Commons

Please read on for my discussion of some of the limitations of the technique, and how we solve the problem for impact coding (also called “effects codes”), and a worked example in R.We define a nested model as any model where the results of a sub-model are used as inputs for a later model. Common examples include variable preparation, ensemble methods, super-learning, and stacking.

Nested models are very common in machine learning. They occur when y-aware methods (that is methods that look at the outcome to predict) are used in data preparation, or when models are combined (as in stacking or super learning). They deliver more modeling power and are an important technique. The downside is: nested models can introduce an undesirable bias which we call “nested model bias” which can lead to very strong over-fit and bad excess generalization error. Nina shares a good discussion of this and a number of examples here.

One possible mitigation technique is adding Laplace noise to try and break the undesirable dependence between modeling stages. It is a clever technique inspired by the ideas of differential privacy and usually works about as well as the techniques we recommend in practice (cross-frames or simulated out of sample data). The Laplace noising technique is different than classic Laplace smoothing (and formally more powerful as Nina points out in her talk). So it is an interesting alternative that we enjoy discussing. However, we have never seen a published precise theorem that links the performance guarantees given by differential privacy to the nested modeling situation. And I now think such a theorem would actually have fairly unsatisfying statement as a one possible “bad real world data” situation violates the usual “no re-use” requirements of differential privacy; duplicated or related columns or variables break the Laplace noising technique. It may seem an odd worry, but in practice anything you don’t actually work to prevent can occur in messy real world data.

Let’s work an example. For our nested model problem we will train a model predicting if an outcome `y`

is true or false based on 5 weakly correlated independent variables. Each of these variables has 40 possible string values (so they are categorical variables) and we have only 500 rows of training data. So the variables are fairly complex: they have a lot of degrees of freedom relative to how much training data we have. For evaluation we assume we have 10000 more rows of evaluation data generated the same way the training data was produced. For this classification problem we will use a simple logistic regression model.

We will prepare the data one of two ways: using Misha Bilenko’s count encoding (defined in his references) or using vtreat impact coding. In each case the re-coding of variables reduces the apparent model complexity and deals with special cases such as novel levels occurring during test (that is variable string values seen in the test set that didn’t happen to occur during training). This preparation is essential as standard contrast or one-hot coding produces quasi-separated data unsuitable for logistic regression (and as always, do not use hash-encoding with linear methods). Each of these two encodings has the potential of introducing the previously mentioned undesirable nested modeling bias, so we are going to compare a number of mitigation strategies.

The nested modeling techniques we will compare include:

- “vtreat impact coding” using simulated out of sample data methods (the
`mkCrossFrameCExperiment`

technique). This is the technique we recommend using. It includes simulating out of sample data through cross-validation techniques, minor smoothing/regularization, and variable and level significance pruning. - “Jackknifed count coding”, count coding using one-way hold out for nested model mitigation. An efficient deterministic direct approach that works well on this problem (though doesn’t work as well as vtreat when a problem requires structured back-testing methods).
- “Split count coding”, count coding with count models build on half the training data and the logistic regression fit on the compliment. This is a correct technique that improves as we have more data (and if we send a larger fraction to the coding step).
- “Naive count coding”, count coding with no nested model mitigation (for comparison).
- “Laplace noised count coding” the method discussed by Misha Bilenko.
- “Laplace noised count coding (abs)” a variation of above that restricts to non-negative pseudo-counts.

Here are typical results on data (all examples can be found here):

We have plotted the pseudo R-squared (fraction of deviance explained) for 10 re-runs of the experiment for each modeling technique (all techniques seeing the exact same data). What we want are large pseudo R-squared values on the test data. Notice the dominant methods are vtreat and jackknifing. But also notice that Laplace noised count coding is, as it often is, in the “best on test” pack and certainly worth considering (as it is very easy to implement, especially for online learning stores). Also note that the common worry about jackknife coding (that it reverses scores on some rare counts) seems not to be hurting performance (also note Laplace noising can also swap such score).

We have given the Laplace noising methods an extra benefit in allowing them to examine test performance to pick their Laplace noise rate. In practice this could introduce an upward bias on observed test performance (the Laplace method model may look better on test data than it will on future application data), but we are going to allow this to give the Laplace noising methods a beyond reasonable chance. Also as we don’t tend to use Laplace noising in production we have only seen it work in simple situations with a small number of unrelated variables (you can probably prove some strong theorems for Laplace smoothing for a single variable; it is reasoning about joint distribution of many variables that is likely to the problem).

Let’s now try a a variation of the problem where each data column or independent variable is duplicated or entered into the data schema four more times. This is of course a ridiculous artificial situation which we are exaggerating to clearly show the effect. But library code needs to work in the limit (as you don’t know ahead of time what users will throw at it) and there are a lot of mechanisms that do produce duplicate, near-duplicate, and related columns in data sources used for data science (one of the difference between data science and classical statistics is data science tends to apply machine learning techniques on very under-curated data sets).

Differential privacy defenses are vulnerable to repetition as adversaries can average the repeated experiments to strip off defensive noise. This is an issue that is deliberately under-discussed in differential privacy applications (as correctly and practically defining and defending against related queries is quite hard). In our case repeated queries to a given column are safe (as we see the exact same noise applied during data collection each time), but queries between related columns are dangerous (as they have different noise which then can be averaged out).

The results on our artificial “each column five times” data set are below:

Notice that the Laplace noising technique test performances are significantly degraded (performance on held-out test usually being a better simulation of future model performance than performance on the training set). In addition to over-fit some of the loss of performance is coming from the prepared data driving bad behavior in the logistic regression (quasi-separation, rank deficiencies, and convergence issues), which is why we see the naive method’s performance also changing.

And that is our acid test of Laplace noising. We knew Laplace noising worked on example problems, so we always felt obligated to discuss it as an alternative to the simulated out of sample (also called cross-frame or “level 1”) methods we use and recommend. We don’t use Laplace noising in production, so we didn’t have a lot of experience with it at scale (especially with many variables). We suspect there is an easy proof the technique works for one variable, and now suspect there is not a straightforward pleasing formulation of such a result in the presence of many variables (as such a statement is going to have to constrain both joint distribution variables, and the downstream modeling procedures).

Categories: Exciting Techniques Expository Writing Opinion Statistics

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

One thing we did not show was the pushing the row data to the

`glm`

and letting R’s native indicator/dummy-variable/contrast encoding deal with the problem. This is because the results are so awful they change the scale of the graph. Of course the problem is avoidable (either through feature preparation as mentioned in the above article, or through better modeling discipline such as step-wise methods, regularization, lasso regression, and so on). Also we found every once in a while the training set did not see all the possible levels and thus could not even be run on the entirety of the test set (glm throughs an exception if it sees such novel value during scoring, this happens in about 4 in 10000 examples- but it is always a worry with high cardinality categorical variables).For fun we have produced a run that skipped the novel rows and ran the direct method (as shown in the graph below). Notice it always horribly over-fits and gets negative pseudo-R2 on held out data (which is just saying using the model is worse than using an unconditional average).

But what this is reminding us is: the loss of a sub-optimal encoding is often less than the loss in using no re-encoding (though obviously we were able to show an example where Laplace nosing was also returning negative pseudo-R2s on test). Though obviously you don’t want either loss: which is why you work hard to get a well founded encoding scheme. We often talk about the technical details of these techniques (as they are important), but the big lesson is you often need to use re-coding techniques and they are very much worth the (mitigable) problems they cause you to worry about (such as nested model bias, please see here for a presentation on the topic).