I am finishing up a work-note that has some really neat implications as to why working with AUC is more powerful than one might think.
I think I am far enough along to share the consequences here. This started as some (now reappraised) thoughts on the fallacy of thinking that knowing the AUC (area under the curve) means you know the shape of the ROC plot (receiver operating characteristic plot). I now think that for many practical applications the AUC number carries a lot more information about the ROC shape than one might expect.
For the experienced practitioner the following ROC shape is very familiar.
Example from vtreat for Python documentation, KDD2009 credit churn problem.
What my new note is working to establish is: under fairly mild assumptions this curve is going to be very close to
(1 - sensitivity)^q + (1 - specificity)^q = 1
where q is a constant to be determined.
This is a big deal, as the ROC curve encodes every possible trade-off of sensitivity versus specificity. If you know the population prevalence of what you are trying to predict and have some approximation of your false-positive versus false-negative costs, then picking the right point on the ROC curve tells you the best classification rule for your utility that can be derived from the given model score.
In addition, curves of this form are nested: the better ones are simultaneously better at all sensitivity/specificity trade-offs. For curves of this family, a better AUC does mean a better (dominant or containing) ROC plot.
For instance, the reported AUC of 0.75 corresponds to a q of about 0.61. The curve (1 - sensitivity)^0.61 + (1 - specificity)^0.61 = 1 looks like the following.
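One way to check the AUC-to-q correspondence is sketched below. It assumes a closed form that is not quoted from the work-note: substituting u = x^q in the area integral for this family gives AUC = 1 - Gamma(1/q + 1)^2 / Gamma(2/q + 1). The function names auc_from_q and q_from_auc are made up for this illustration, and the inversion is plain bisection.

```python
import math

def auc_from_q(q):
    """AUC of the curve (1 - sensitivity)^q + (1 - specificity)^q = 1.

    Assumed closed form (derived for this sketch, not taken from the note):
    area above the curve = Gamma(1/q + 1)^2 / Gamma(2/q + 1).
    """
    # Work in log space so small q (large Gamma arguments) does not overflow.
    log_area_above = 2.0 * math.lgamma(1.0 / q + 1.0) - math.lgamma(2.0 / q + 1.0)
    return 1.0 - math.exp(log_area_above)

def q_from_auc(auc, lo=0.01, hi=1.0, tol=1e-10):
    """Invert auc_from_q by bisection.

    AUC decreases as q grows: q = 1 is the diagonal (AUC 0.5),
    and smaller q gives a better curve.
    """
    if not 0.5 <= auc < auc_from_q(lo):
        raise ValueError("auc is outside the range covered by the search interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if auc_from_q(mid) > auc:
            lo = mid  # curve at mid is better than the target, so q must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(q_from_auc(0.75))
```

Because the AUC is monotone in q under this formula, each AUC in the covered range maps to exactly one curve of the family, which is the sense in which the curves are nested.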
Our claim is: modulo presentation scaling, these two plots are very similar. The (1 - sensitivity)^0.61 + (1 - specificity)^0.61 = 1 curve is an idealization of the observed data. We have an example of using the ideal curve to pick optimal trade-offs here.
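A minimal sketch of how a trade-off could be read off the ideal curve follows. It is an illustration only, not the linked example. It assumes expected cost = prevalence * cost_fn * (1 - sensitivity) + (1 - prevalence) * cost_fp * (1 - specificity), and the name best_tradeoff and the cost values used are hypothetical.

```python
def best_tradeoff(q, prevalence, cost_fp, cost_fn, n_grid=10001):
    """Grid-search the curve (1 - sensitivity)^q + (1 - specificity)^q = 1
    for the point with the smallest expected misclassification cost.

    Returns (sensitivity, specificity). All cost inputs are illustrative
    assumptions supplied by the caller.
    """
    best = None
    for i in range(n_grid):
        fpr = i / (n_grid - 1)                  # 1 - specificity
        fnr = (1.0 - fpr ** q) ** (1.0 / q)     # 1 - sensitivity, from the curve equation
        cost = prevalence * cost_fn * fnr + (1.0 - prevalence) * cost_fp * fpr
        if best is None or cost < best[0]:
            best = (cost, 1.0 - fnr, 1.0 - fpr)
    return best[1], best[2]

# Balanced prevalence and equal costs: the optimum sits where sensitivity equals specificity.
print(best_tradeoff(0.61, prevalence=0.5, cost_fp=1.0, cost_fn=1.0))
# Making false negatives five times as costly pushes the optimum toward higher sensitivity.
print(best_tradeoff(0.61, prevalence=0.5, cost_fp=1.0, cost_fn=5.0))
```

A grid search is used rather than a derivative condition so the sketch stays correct even for curves where a closed-form optimum is awkward to write down.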
We think confirming your ROC curve looks like the ideal shape is worthwhile (and more variables, more data, and more detailed models can help ensure this). Once you have a near-ideal ROC curve, the curve is completely determined by the AUC and you can even look at high/low sensitivity/specificity situations.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Found a neat reference on ROC plots.
Douglas Mossman, Hongying Peng, “Using Dual Beta Distributions to Create “Proper” ROC Curves Based on Rating Category Data”, Medical Decision Making, Vol 36, Issue 3, 2016; https://journals.sagepub.com/doi/abs/10.1177/0272989X15582210 .
The authors have the nice idea of using the idealizing approximation:
Sensitivity = 1 – I(t; a, 1)
Specificity = I(t; 1, b)
Where I(t; a, b) is the normalized incomplete beta function.
The nice thing is, when one of these parameters is 1 the function reduces to a simple power form, I(t; a, 1) = t^a and I(t; 1, b) = 1 – (1 – t)^b, and we get the relation:
(1 – Specificity)^(1/b) + (1 – Sensitivity)^(1/a) = 1
And the curve is convex, which is nice (though misses some cases). I think one of the advantages of the convex curve is: it allows finding the optimal Sensitivity/Specificity just by looking at the normals/derivative.
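The relation between sensitivity and specificity given above can be checked numerically. The sketch below assumes the closed forms I(t; a, 1) = t^a and I(t; 1, b) = 1 - (1 - t)^b, and reads specificity as I(t; 1, b), which is the reading under which the relation holds. The function names and the values a = 2, b = 3 are hypothetical.

```python
def dual_beta_point(t, a, b):
    """Sensitivity and specificity at threshold t in [0, 1] for the
    one-parameter-each beta idealization (assumed closed forms)."""
    sensitivity = 1.0 - t ** a             # 1 - I(t; a, 1)
    specificity = 1.0 - (1.0 - t) ** b     # I(t; 1, b)
    return sensitivity, specificity

def relation_residual(t, a, b):
    """Left side minus right side of
    (1 - Specificity)^(1/b) + (1 - Sensitivity)^(1/a) = 1."""
    sens, spec = dual_beta_point(t, a, b)
    return (1.0 - spec) ** (1.0 / b) + (1.0 - sens) ** (1.0 / a) - 1.0

# The residual should be zero up to floating-point rounding at every threshold.
for t in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(t, dual_beta_point(t, a=2.0, b=3.0), relation_residual(t, a=2.0, b=3.0))
```

Setting a = b = 1/q recovers the single-parameter curve (1 - sensitivity)^q + (1 - specificity)^q = 1 from the post.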
The authors' claim is: fitting the parametric trade-off curve is more reliable in making correct trade-offs. Likely the parametric method is more statistically efficient as long as its distributional assumptions are not too far off.