I think I am far enough along to share the consequences here. This started as some, now reappraised, thoughts on the fallacy of thinking that knowing the AUC (area under the curve) means you know the shape of the ROC plot (receiver operating characteristic plot). I now think that, for many practical applications, the AUC number carries a lot more information about the ROC shape than one might expect.
For the experienced practitioner the following ROC shape is very familiar.
What my new note is working to establish is: under fairly mild assumptions this curve is going to be very close to
(1 - sensitivity)^q + (1 - specificity)^q = 1
where q is a constant to be determined.
This is a big deal, as the ROC curve encodes every possible trade-off of sensitivity versus specificity. If you know the population prevalence of what you are trying to predict and have some approximation of your false-positive versus false-negative costs, then picking the right point on the ROC curve tells you the best classification rule for your utility that can be derived from the given model score.
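To make the trade-off idea concrete, here is a minimal sketch that grid-searches the idealized curve for the operating point with the lowest expected cost. The function names, the prevalence of 0.1, and the cost values are hypothetical choices made for illustration, and the expected-cost formula (prevalence times false-negative cost times miss rate, plus the analogous false-positive term) is one common way to score a threshold, not something taken from the note itself.

```python
def ideal_sensitivity(fpr, q):
    """Sensitivity on the curve (1 - sens)^q + (1 - spec)^q = 1,
    written in terms of the false positive rate fpr = 1 - specificity."""
    return 1.0 - (1.0 - fpr ** q) ** (1.0 / q)


def best_operating_point(q, prevalence, cost_fp, cost_fn, n_grid=10001):
    """Grid-search the false positive rate that minimizes expected cost.

    prevalence, cost_fp, and cost_fn are illustrative inputs (assumptions).
    Returns a tuple (fpr, sensitivity, expected_cost_per_case).
    """
    best = None
    for i in range(n_grid):
        fpr = i / (n_grid - 1)
        sens = ideal_sensitivity(fpr, q)
        # Expected cost per case: missed positives plus false alarms.
        cost = (prevalence * cost_fn * (1.0 - sens)
                + (1.0 - prevalence) * cost_fp * fpr)
        if best is None or cost < best[2]:
            best = (fpr, sens, cost)
    return best


# Hypothetical scenario: 10% prevalence, a miss costs 5x a false alarm.
fpr, sens, cost = best_operating_point(q=0.61, prevalence=0.1,
                                       cost_fp=1.0, cost_fn=5.0)
print(f"specificity={1.0 - fpr:.3f} sensitivity={sens:.3f} cost={cost:.4f}")
```

With equal costs and a prevalence of 0.5 the search should settle on the symmetric point of the curve, where the false positive rate equals the false negative rate, which is a quick sanity check on the sketch.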
In addition, curves of this form are nested: the better ones are simultaneously better at all sensitivity/specificity trade-offs. For curves of this family, a better AUC does mean a better (dominant or containing) ROC plot.
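The nesting claim can be spot-checked numerically. The sketch below compares two members of the family on a grid of false positive rates; the helper names are hypothetical, and a grid check like this is only evidence, not a proof.

```python
def ideal_sensitivity(fpr, q):
    """Sensitivity on the curve (1 - sens)^q + (1 - spec)^q = 1,
    written in terms of the false positive rate fpr = 1 - specificity."""
    return 1.0 - (1.0 - fpr ** q) ** (1.0 / q)


def dominates(q_better, q_worse, n_grid=1001, tol=1e-12):
    """True if the q_better curve is at or above the q_worse curve
    at every false positive rate on the grid (tol absorbs rounding)."""
    for i in range(n_grid):
        fpr = i / (n_grid - 1)
        if ideal_sensitivity(fpr, q_better) < ideal_sensitivity(fpr, q_worse) - tol:
            return False
    return True


# Smaller q bows the curve further toward the top-left corner.
print(dominates(0.5, 0.61), dominates(0.61, 0.5))
```

Within this family a smaller q gives the higher curve everywhere, and q = 1 is the diagonal of a useless score.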
For instance, the reported AUC of 0.75 corresponds to a q of about 0.61. The curve (1 - sensitivity)^0.61 + (1 - specificity)^0.61 = 1 looks like the following.
Our claim is that, modulo presentation scaling, these two plots are very similar. The (1 - sensitivity)^0.61 + (1 - specificity)^0.61 = 1 curve is an idealization of the observed data. We have an example of using the ideal curve to pick optimal trade-offs here.
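The pairing of an AUC of 0.75 with a q of about 0.61 can be checked with a short calculation. The sketch below relies on a closed form for the area under the idealized curve, AUC = 1 - Gamma(1 + 1/q)^2 / Gamma(1 + 2/q). That formula is my own working from a Beta-integral substitution, not something quoted from the note, so treat it as an assumption to verify. The function names and the bisection bounds are likewise illustrative.

```python
import math


def auc_from_q(q):
    """Area under the curve (1 - sens)^q + (1 - spec)^q = 1.

    Assumed closed form: 1 - Gamma(1 + 1/q)^2 / Gamma(1 + 2/q).
    Log-gamma is used so that small q does not overflow.
    """
    log_ratio = 2.0 * math.lgamma(1.0 + 1.0 / q) - math.lgamma(1.0 + 2.0 / q)
    return 1.0 - math.exp(log_ratio)


def q_from_auc(auc, lo=0.01, hi=1.0, iterations=100):
    """Invert auc_from_q by bisection; AUC decreases as q grows toward 1."""
    if not 0.5 <= auc < 1.0:
        raise ValueError("auc must be in [0.5, 1)")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if auc_from_q(mid) > auc:
            lo = mid  # curve at mid is too good, so move toward larger q
        else:
            hi = mid
    return 0.5 * (lo + hi)


print(round(q_from_auc(0.75), 3))
```

Two easy cases help confirm the formula: q = 1 is the diagonal with area 1/2, and q = 1/2 gives an area of 5/6. For an AUC of 0.75 the solver should land near the q of about 0.61 used above.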
We think confirming that your ROC curve looks like the ideal shape is worthwhile (and more variables, more data, and more detailed models can help ensure this). Once you have a near-ideal ROC curve, the curve is completely determined by the AUC, and you can even use it to look at high-sensitivity/low-specificity and low-sensitivity/high-specificity situations.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.