Menu Home

A Single Parameter Family Characterizing Probability Model Performance

Introduction

We’ve been writing on the distribution density shapes expected for probability models in ROC (receiver operator characteristic) plots, double density plots, and normal/logit-normal densities frameworks. I thought I would re-approach the issue with a specific family of examples.

Definitions

Let’s define a “probability model” as a model that returns predicted probabilities. Essentially this means the model returns numbers between zero and one, and the model is a good probability model if these predictions obey additional conditions (such as being large on positive examples, small on negative examples, obeying the probability axioms, and even balance conditions).

We will call a probability model fully calibrated if we have E[outcome | prediction] = prediction for all observed values of prediction. We note, not all calibrated models are fully calibrated.

Our Example

Any fully calibrated probability model’s performance can be characterized as follows.

We assume an unconditioned distribution of the model predictions: density(p). Then the distribution of predictions on positive truth-value examples must be a shape of the form c p density(p) for some constant c, and the distributions of the predictions conditioned to come from the negative truth-value examples must be of the form d (1 - p) density(p) for some constant d. Our point is: being fully calibrated is a very strong condition. In this situation knowing the unconditioned distribution of predictions and the positive outcome prevalence determines the two conditional prediction distributions.

Let’s work a very interesting specific example of the above. Let’s take our unconditional distribution of model predictions to be Beta distributed with shape parameters a and b. Then some algebra allows us to derive the fully balanced condition implies that the distribution of the predictions on known positive outcomes must be Beta distributed with parameters (a + 1, b), and the distribution of the predictions on the known negative outcomes must be Beta distributed with parameters (a, b + 1).

By the methods of an earlier note we know the the prevalence must obey the equation:

   prevalence E[prediction | positive] +
      (1 - prevalence) E[prediction | negative] = prevalence

Substituting in the known expected values for the conditional Beta distributions we have:

   prevalence (a + 1) / (a + b + 1) +
      (1 - prevalence) a / (a + b + 1) = prevalence

Which implies prevalence = a / (a + b). This prevalence is also the mean of the unconditional distribution and the unique point where the two conditional densities cross (this is confirmed by some algebra on the Beta densities).

Examining an Instance

Let’s see an instance of this example in R.

We pick an example prevalence.

This leaves us one degree of freedom remaining, let’s use this to set a to 1.4.

Confirm a / (a + b) matches the specified prevalence.

## [1] 0.2

Demonstrating the Relations

Basic Summaries and Densities

Let’s look at all three of the distributions of interest: the unconditional distribution of predictions, the distribution of predictions on known positive examples, and the distribution of examples on known negative examples.

The conditional densities weighted by what fraction of the system they are should add up to the unconditional density. That is easy to confirm quantitatively.

## [1] 2.220446e-15

With a bit more work we can depict this graphically.

Unnamed chunk 8 1

Note the conditioned curves are the densities re-scaled by to integrate to their fraction of contribution, this moves the crossing point away from the prevalence.

Full Calibration

We can also confirm the the depicted model score is fully calibrated. We are looking for the following line to match the line y = x.

Unnamed chunk 9 1

Conditional Summaries

The per-class means and prevalences can be confirmed as follows.

mean class_prevalence class
0.3000000 0.2 positive_class
0.1750018 0.8 negative_class
## [1] 0.2000014

The conditional curves cross where beta(a, b+1)(x) = beta(a+1, b)(x). This is at x = 1/(1 + beta(a, b+1)/beta(a+1, b)), which is also a / (a + b).

## [1] 2.775558e-17
## [1] -5.329071e-15

A Graphical Summary

All of the above relations can be summarized in the following annotated graph.

Unnamed chunk 12 1

As an ROC Plot

And we can, of course, represent the model performance as an ROC plot.

Unnamed chunk 13 1

Some Limiting Cases

For a given prevalence situation the prediction densities of all fully calibrated models that have an unconditional beta distribution are given by shape parameters:

  • (c a, c b) for the unconditional distribution of predictions.
  • (c a + 1, c b) for the distribution of the prediction on positive examples.
  • (c a, c b + 1) for the distribution of the prediction on negative examples.

Where (a, b) are positive numbers such that prevalence = a / (a + b) and c is an arbitrary positive constant.

For this family of model performances the limit as we take c to positive infinity is the constant model that always predicts the prevalence. This is an oblivious or uninformative model, but it is fully calibrated on its single limiting prediction.

If we take the limit of c to zero from above, then the limiting model is an impulse at zero of perfect negative predictions and an impulse at one of perfect positive predictions.

For intermediate values of c we get models that are fully calibrated, match the observed prevalence, and have varying degrees of quality for their predictions.

If a = b, then the previously discussed uniform distribution example is part of the model family.

Conclusion

For a given problem prevalence, the set of fully calibrated model performance densities that have an unconditional Beta distribution form a single parameter family. When model performance is characterized by a single parameter of this nature, then using a single derived parameter such as AUC (area under the curve) becomes a proper model comparison procedure in that dominant model quality is ordered the same way as the score. (Please see ref, though this is not the model characterization family we anticipated there; also a different characterization than normal or logit normal scoring; some notes on the asymmetries in this formulation can be found here.)

Categories: data science Practical Data Science Pragmatic Data Science Pragmatic Machine Learning Statistics

Tagged as:

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

1 reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: