Recently Microsoft Data Scientist Bob Horton wrote a very nice article on ROC plots. We expand on this a bit and discuss some of the issues in computing “area under the curve” (AUC).

R has a number of ROC/AUC packages; for example ROCR, pROC, and plotROC. But it is instructive to see how ROC plots are produced and how AUC can be calculated. Bob Horton’s article showed how elegantly the points on the ROC plot are expressed in terms of sorting and cumulative summation.

The next step is computing AUC. Obviously, computing an area is a solved problem. The issues are how you interpolate between points and what conventions you use for data with identical scores. An elegant interpretation of the usual tie-breaking rules is: every point on the ROC curve must correspond to taking either all of the data above a given score threshold or none of it. This is the issue alluded to when the original article states:

This brings up another limitation of this simple approach; by assuming that the rank order of the outcomes embodies predictive information from the model, it does not properly handle sequences of cases that all have the same score.

This problem is quite easy to explain with an example. Consider the following data.

```
d <- data.frame(pred=c(1,1,2,2), y=c(FALSE,FALSE,TRUE,FALSE))
print(d)
##   pred     y
## 1    1 FALSE
## 2    1 FALSE
## 3    2  TRUE
## 4    2 FALSE
```

Using code adapted from the original article we can quickly get an interesting summary.

```
ord <- order(d$pred, decreasing=TRUE) # sort by prediction, reversed
labels <- d$y[ord]
data.frame(TPR=cumsum(labels)/sum(labels),
           FPR=cumsum(!labels)/sum(!labels),
           labels=labels,
           pred=d$pred[ord])
##   TPR       FPR labels pred
## 1   1 0.0000000   TRUE    2
## 2   1 0.3333333  FALSE    2
## 3   1 0.6666667  FALSE    1
## 4   1 1.0000000  FALSE    1
```

The problem is that we need to take all of the points with the same prediction score as an atomic unit (we take all of them or none of them). Notice also that `TPR` is always 1 in this summary (an undesirable effect).

We do not really want rows 1 and 3 in our plot or area calculations. In fact, the values in rows 1 and 3 are not fully determined, as they can vary depending on the details of tie breaking in the sorting (the values recorded in rows 2 and 4 cannot so vary). Also (especially after deleting rows) we may need to add in the ideal points `(FPR,TPR)=(0,0)` and `(FPR,TPR)=(1,1)` to complete our plot and area calculations.

What we want is a plot where ties are handled. Such plots look like the following:

```
# devtools::install_github('WinVector/WVPlots')
library('WVPlots') # see: https://github.com/WinVector/WVPlots
WVPlots::ROCPlot(d,'pred','y',TRUE,'example plot')
```

There is a fairly elegant way to get the necessary adjusted plotting frame: use differencing (the opposite of cumulative summation) to find where the `pred` column changes, and limit to those rows.
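
As a quick sketch of that differencing idea, here is one way (using `diff()` on the small example frame `d` from above) to keep only the last row of each run of tied `pred` values:

```
d <- data.frame(pred=c(1,1,2,2), y=c(FALSE,FALSE,TRUE,FALSE))
ord <- order(d$pred, decreasing=TRUE)
pred <- d$pred[ord]
labels <- d$y[ord]
TPR <- cumsum(labels)/sum(labels)
FPR <- cumsum(!labels)/sum(!labels)
# a row is the last of its tie group exactly where the sorted
# score changes (plus the final row, which is always kept)
keep <- c(diff(pred)!=0, TRUE)
data.frame(TPR=TPR[keep], FPR=FPR[keep])
##   TPR       FPR
## 1   1 0.3333333
## 2   1 1.0000000
```

These are exactly rows 2 and 4 of the earlier summary: the rows whose values do not depend on tie-breaking details.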

The code is as follows (also found in our `sigr` library here):

```
calcAUC <- function(modelPredictions, yValues) {
  ord <- order(modelPredictions, decreasing=TRUE)
  yValues <- yValues[ord]
  modelPredictions <- modelPredictions[ord]
  x <- cumsum(!yValues)/max(1, sum(!yValues)) # FPR = x-axis
  y <- cumsum(yValues)/max(1, sum(yValues))   # TPR = y-axis
  # each plotted point should represent all of the data above a decision
  # level or none of it; rows that share a score with the following row
  # are interior to a tie group, so mark them as duplicates and drop them
  dup <- c(modelPredictions[-1]>=modelPredictions[-length(modelPredictions)],
           FALSE)
  # and add in the ideal endpoints just in case (redundancy here is not a problem)
  x <- c(0, x[!dup], 1)
  y <- c(0, y[!dup], 1)
  # sum areas of segments (triangle-topped vertical rectangles: the trapezoid rule)
  n <- length(y)
  area <- sum( ((y[-1]+y[-n])/2) * (x[-1]-x[-n]) )
  area
}
```

This correctly calculates the AUC.

```
# devtools::install_github('WinVector/sigr')
library('sigr') # see: https://github.com/WinVector/sigr
calcAUC(d$pred, d$y)
## [1] 0.8333333
```
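
The tie handling also behaves sensibly in the fully degenerate case where every score is identical: the de-duplication collapses the curve to the single point `(1,1)`, and with the ideal endpoints added the trapezoid rule yields 0.5, i.e. coin-flip performance. A quick check (repeating the `calcAUC` definition from above so the snippet stands alone):

```
# calcAUC as defined above
calcAUC <- function(modelPredictions, yValues) {
  ord <- order(modelPredictions, decreasing=TRUE)
  yValues <- yValues[ord]
  modelPredictions <- modelPredictions[ord]
  x <- cumsum(!yValues)/max(1, sum(!yValues))
  y <- cumsum(yValues)/max(1, sum(yValues))
  dup <- c(modelPredictions[-1]>=modelPredictions[-length(modelPredictions)],
           FALSE)
  x <- c(0, x[!dup], 1)
  y <- c(0, y[!dup], 1)
  n <- length(y)
  sum(((y[-1]+y[-n])/2) * (x[-1]-x[-n]))
}
# constant scores carry no discriminative information
calcAUC(c(1,1,1,1), c(TRUE,FALSE,TRUE,FALSE))
## [1] 0.5
```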

I think this extension maintains the spirit of the original. We have also shown how complexity increases as you move from code known to work on a particular data set at hand, to library code that may be exposed to data with unanticipated structures or degeneracies (this is why Quicksort, which has an elegant description, often has monstrous implementations; please see here for a rant on that topic).


### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

I like the statistical definition of the AUC which is given in (Tom Fawcett, 2006).

You can find [there](https://goo.gl/aRH11N) how I use it to compute an estimate of the AUC and a 95% confidence interval.

I have not yet decided how to handle equalities between cases and controls. For now, if for example my scores are constant, I will predict an AUC of 0% because my scores for cases need to be strictly greater than my scores for controls. Yet I wonder if counting an equality as 0.5 in the `sum` would be better, as it would reflect deciding by “tossing a coin”. In that case, constant scores would give an AUC of 50%.

Do you have an opinion on this?

Also, I wonder how computations based on the ROC curve handle these equalities?

Another good AUC reference is “Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation,” S. J. Mason and N. E. Graham, Q. J. R. Meteorol. Soc. (2002) 128, pp. 2145–2166.

I think you do need to at least treat ties as 1/2 in most formulations. In sigr we get differences in AUC scores by comparing one population to a pooled or mixed population (using a null hypothesis that the two populations are the same, and therefore poolable).
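
To make the two conventions concrete, here is a small sketch on the example data from the post, using the pairwise (Fawcett-style) definition: AUC is the average, over all case/control pairs, of win = 1, tie = 1/2, loss = 0. Counting ties as 1/2 reproduces the geometric (trapezoid) AUC, while the strict rule does not:

```
d <- data.frame(pred=c(1,1,2,2), y=c(FALSE,FALSE,TRUE,FALSE))
pos <- d$pred[d$y]   # scores of cases (positives)
neg <- d$pred[!d$y]  # scores of controls (negatives)
# score each case/control pair: win = 1, tie = 1/2, loss = 0
aucTies <- mean(outer(pos, neg, function(p, n) (p > n) + 0.5*(p == n)))
aucTies    # matches the trapezoid-rule calculation in the post
## [1] 0.8333333
# strict comparison (ties count as losses)
aucStrict <- mean(outer(pos, neg, ">"))
aucStrict
## [1] 0.6666667
```

Under ties-as-1/2, constant scores also come out at 50% (every pair is a tie), agreeing with the geometric calculation.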