# The Double Density Plot Contains a Lot of Useful Information

The double density plot contains a lot of useful information.

This is a plot that shows the distribution of a continuous model score, conditioned on the binary categorical outcome to be predicted. As with most density plots, the y-axis is an abstract quantity called density, chosen so that the area under each curve integrates to 1.
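A minimal sketch of the data behind such a plot (the score distributions here are hypothetical, chosen only for illustration): each class-conditional histogram is normalized to density scale, so each curve's area integrates to 1 regardless of how many examples fall in each class.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Hypothetical scores: higher on average for the positive class.
neg_scores = rng.beta(2, 5, size=10_000)  # scores where outcome == FALSE
pos_scores = rng.beta(5, 2, size=10_000)  # scores where outcome == TRUE

bins = np.linspace(0, 1, 51)
width = bins[1] - bins[0]

# density=True normalizes each histogram separately, so each
# class-conditional curve integrates to 1 on its own.
neg_density, _ = np.histogram(neg_scores, bins=bins, density=True)
pos_density, _ = np.histogram(pos_scores, bins=bins, density=True)

print(neg_density.sum() * width)  # area under the negative curve: 1.0
print(pos_density.sum() * width)  # area under the positive curve: 1.0
```

Plotting both density arrays against the bin centers gives the double density plot.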

An example is given here. The really cool observation I wanted to share is: if we know this classifier is well calibrated, then we can recover the positive category prevalence from the graph.

A well calibrated probability score is one such that `E[outcome == TRUE] = E[prediction]`. For such a classifier we can solve for the unknown positive outcome prevalence `p`, because the following relation holds in this case:

```
      p * E[prediction | on positive curve]
+ (1 - p) * E[prediction | on negative curve]
= p
```

This follows because `p` and `1 - p` are the relative sizes of the positive and negative classes, prior to each curve being re-scaled to integrate to one as part of the density. The conditional expectations `E[prediction | on positive curve]` and `E[prediction | on negative curve]` are depicted on the double density plot, so from them we can recover the prevalence: solving the relation for `p` gives `p = E[prediction | on negative curve] / (1 - E[prediction | on positive curve] + E[prediction | on negative curve])`.
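A quick simulation sketch of the recovery (the score distribution is an arbitrary choice for illustration): we draw a score `s`, make the outcome TRUE with probability exactly `s` so the score is well calibrated by construction, then recover the prevalence from the two conditional means alone.

```python
import numpy as np

rng = np.random.default_rng(7)

# A well-calibrated classifier by construction: the outcome is TRUE
# with probability exactly equal to the predicted score.
scores = rng.beta(2, 4, size=200_000)
outcomes = rng.random(scores.shape) < scores

true_prevalence = outcomes.mean()

# The two conditional means depicted on the double density plot.
e_pos = scores[outcomes].mean()   # E[prediction | on positive curve]
e_neg = scores[~outcomes].mean()  # E[prediction | on negative curve]

# Solve p * e_pos + (1 - p) * e_neg = p for p.
recovered = e_neg / (1.0 - e_pos + e_neg)

print(round(true_prevalence, 3), round(recovered, 3))
```

The recovered value matches the empirical prevalence up to sampling noise, using only quantities readable off the plot.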

The recovery of the prevalence from the two conditional means is shown in the earlier figure.

We have some additional results coming out for what I am currently calling “fully calibrated probability scores.” These are scores such that `E[outcome == TRUE | prediction = s] = s` for all `s` in the interval `[0, 1]`. This includes a very interesting special case where it is easy to show that the prevalence is the probability value where the density curves cross.
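One way to sketch why the curves cross at the prevalence (my notation, not from the original post: `f_+` and `f_-` are the score densities conditioned on the positive and negative outcomes). By Bayes' theorem, full calibration says:

```latex
E[\text{outcome} == \text{TRUE} \mid \text{prediction} = s]
  = \frac{p \, f_{+}(s)}{p \, f_{+}(s) + (1 - p) \, f_{-}(s)} = s .
```

At a crossing point `s*` where `f_+(s*) = f_-(s*) > 0`, the densities cancel and the middle expression reduces to `p`, forcing `s* = p`: the curves cross exactly at the prevalence.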

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.